What is Data Science?
Data Science is a blend of tools, algorithms, and machine learning principles whose goal is to discover hidden patterns in raw data.
Data Science is primarily used to make decisions and predictions using predictive causal analytics, prescriptive analytics (predictive analytics plus decision science), and machine learning.
Data Science is a forward-looking, exploratory approach: it analyzes past and current data to predict future outcomes, with the aim of making informed decisions. It answers open-ended questions about “what” events occur and “how” they occur.
Why Data Science?
Nowadays, much of the data we get is unstructured or semi-structured. It is generated from many different sources, such as financial logs, text files, multimedia forms, sensors, and instruments, and simple BI tools are not capable of processing this volume and variety of data. Data Science gives us more complex and advanced analytical tools and algorithms for processing the data, analyzing it, and drawing insights from it.
Example: To make a car self-driving, we need to collect live data from sensors, including radars, cameras, and lasers, to create a map of the surroundings. Data Science then uses advanced machine learning algorithms on this data to make decisions such as when to speed up or slow down, when to overtake, and where to take a turn.
Data Science Case Study – Diabetes Prevention
What if we could predict the occurrence of diabetes and take appropriate measures beforehand to prevent it?
Step 1: First, we will collect the data based on the medical history of the patient.
We have the various attributes listed below.
Attributes:
- npreg – Number of times pregnant
- Glucose – Plasma glucose concentration
- bp – Blood pressure
- Skin – Triceps skinfold thickness
- bmi – Body mass index
- ped – Diabetes pedigree function
- age – Age
- Income – Income
Step 2:
- Now, once we have the data, we need to clean and prepare it for analysis.
- This data has a lot of inconsistencies, such as missing values, blank columns, abrupt values, and incorrect data formats, which need to be cleaned.
- Here, we have organized the data into a single table under different attributes, which makes it look more structured.
- Just have a look at the sample data below.
The sample data shows several inconsistencies:
- In the column npreg, “one” is written in words, whereas it should be in numeric form, like 1.
- In the column bp, one of the values is 6600, which is impossible (at least for humans), as blood pressure cannot reach such a huge value.
- The Income column is blank, so it contributes nothing to predicting diabetes here. It is therefore redundant and should be removed from the table.
- So, we will clean and preprocess this data by removing the outliers, filling in the null values, and normalizing the data types. This is our second phase, data preprocessing.
- Finally, we get clean data that can be used for analysis.
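The cleaning described in Step 2 can be sketched in a few lines of plain Python. This is only a minimal illustration, not the exact procedure used for the case study: the sample rows are invented, the lowercase column names, the word-to-number lookup, and the plausible blood-pressure range are all assumptions, and replacing an outlier with the column median is just one of several reasonable choices.

```python
from statistics import median

# Hypothetical sample rows; the values are invented for illustration only.
raw_rows = [
    {"npreg": "6", "glucose": "148", "bp": "72", "bmi": "33.6", "age": "50", "income": ""},
    {"npreg": "one", "glucose": "85", "bp": "6600", "bmi": "26.6", "age": "31", "income": ""},
    {"npreg": "8", "glucose": "183", "bp": "64", "bmi": "", "age": "32", "income": ""},
]

# Assumed lookup for numbers written as words.
WORD_TO_NUMBER = {"zero": 0, "one": 1, "two": 2, "three": 3}

# Assumed plausibility bounds for blood pressure (illustrative, not clinical guidance).
BP_PLAUSIBLE_RANGE = (30, 250)


def to_number(value):
    """Convert a raw cell to float; return None when it is blank or unparseable."""
    text = str(value).strip().lower()
    if text in WORD_TO_NUMBER:
        return float(WORD_TO_NUMBER[text])
    try:
        return float(text)
    except ValueError:
        return None


def clean(rows, drop_columns=("income",)):
    """Drop unused columns, normalize types, blank out outliers, fill gaps."""
    # 1. Drop redundant columns and convert every remaining cell to a number or None.
    cleaned = [
        {key: to_number(value) for key, value in row.items() if key not in drop_columns}
        for row in rows
    ]
    # 2. Treat implausible blood-pressure readings as missing values.
    low, high = BP_PLAUSIBLE_RANGE
    for row in cleaned:
        if row["bp"] is not None and not (low <= row["bp"] <= high):
            row["bp"] = None
    # 3. Fill missing values with the median of the observed values in that column.
    for column in list(cleaned[0]):
        observed = [row[column] for row in cleaned if row[column] is not None]
        if not observed:
            continue  # nothing to fill from
        fill_value = median(observed)
        for row in cleaned:
            if row[column] is None:
                row[column] = fill_value
    return cleaned


clean_rows = clean(raw_rows)
for row in clean_rows:
    print(row)
```

In this sketch, “one” becomes 1.0, the reading of 6600 is replaced by the median of the remaining bp values, the blank bmi is filled in the same way, and the empty income column is dropped.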
Step 3:
Now let’s do some analysis.
- First, we will load the data into the analytical sandbox and apply various statistical functions to it. For example, in R the describe function (available in packages such as Hmisc) reports the number of missing and unique values, and the summary function gives statistical information such as the mean, median, and range (minimum and maximum values).
- Then, we use visualization techniques such as histograms, line graphs, and box plots to get a fair idea of the distribution of the data.
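To make the analysis step concrete, here is a rough Python counterpart to the describe and summary functions mentioned above, together with a very simple text histogram. The glucose values are hypothetical, and the helper names summarize and text_histogram are our own, not part of any library.

```python
from statistics import mean, median


def summarize(values):
    """Return basic descriptive statistics for one column; None marks a missing value."""
    observed = [v for v in values if v is not None]
    return {
        "count": len(observed),
        "missing": len(values) - len(observed),
        "unique": len(set(observed)),
        "min": min(observed),
        "max": max(observed),
        "mean": mean(observed),
        "median": median(observed),
    }


def text_histogram(values, bin_width):
    """Count observed values per bin of width bin_width, keyed by the bin's lower edge."""
    counts = {}
    for v in values:
        if v is None:
            continue
        edge = (v // bin_width) * bin_width
        counts[edge] = counts.get(edge, 0) + 1
    return dict(sorted(counts.items()))


# Hypothetical glucose readings, with one missing value.
glucose = [148, 85, 183, None, 89, 137, 116, 85]

stats = summarize(glucose)
for name, value in stats.items():
    print(f"{name}: {value}")

hist = text_histogram(glucose, bin_width=50)
for edge, n in hist.items():
    print(f"{edge:>4}-{edge + 49:<4} {'#' * n}")
```

A plotting library would normally be used for the histograms and box plots; the text version above only shows what a histogram counts.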
Step 4:
Now, based on the insights derived from the previous step, a decision tree is a good fit for this kind of problem.
- Since we already have the major attributes for analysis, such as npreg and bmi, along with the known outcome for each patient, we will use a supervised learning technique to build the model.
- Further, we have specifically chosen a decision tree because it takes all attributes into consideration in one go, both those with a linear relationship and those with a non-linear relationship. In our case, there is a linear relationship between npreg and age, and a non-linear relationship between npreg and ped.
- Decision tree models are also flexible: we can use different combinations of attributes to build several trees and then implement the one that performs best.
Let’s have a look at our decision tree.
Here, the most important attribute is the glucose level, so it is our root node. The current node and its value then determine the next most important attribute to be examined. This goes on until we get a result of pos or neg. Pos means the tendency to have diabetes is positive, and neg means it is negative.
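How a tree ends up with glucose as its root node can be illustrated with a small sketch. The case study does not say which tree-building algorithm was used, so the code below assumes a common choice: at each node, pick the attribute and threshold that give the lowest weighted Gini impurity. The patient rows and labels are invented for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def best_split(rows, labels, attributes):
    """Return (weighted_gini, attribute, threshold) of the best binary split, or None."""
    best = None
    for attr in attributes:
        values = sorted({row[attr] for row in rows})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [lab for row, lab in zip(rows, labels) if row[attr] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[attr] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, attr, threshold)
    return best


def build_tree(rows, labels, attributes, max_depth=2):
    """Recursively build a small tree as nested dicts; leaves are class labels."""
    majority = Counter(labels).most_common(1)[0][0]
    if max_depth == 0 or gini(labels) == 0.0:
        return majority
    split = best_split(rows, labels, attributes)
    if split is None:
        return majority
    _, attr, threshold = split
    left = [(r, lab) for r, lab in zip(rows, labels) if r[attr] <= threshold]
    right = [(r, lab) for r, lab in zip(rows, labels) if r[attr] > threshold]
    return {
        "attribute": attr,
        "threshold": threshold,
        "left": build_tree([r for r, _ in left], [lab for _, lab in left], attributes, max_depth - 1),
        "right": build_tree([r for r, _ in right], [lab for _, lab in right], attributes, max_depth - 1),
    }


def predict(tree, row):
    """Walk from the root to a leaf and return its label."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["attribute"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree


# Hypothetical patients and outcomes, invented for illustration only.
patients = [
    {"glucose": 85, "bmi": 26.6, "age": 31},
    {"glucose": 89, "bmi": 28.1, "age": 21},
    {"glucose": 116, "bmi": 25.6, "age": 30},
    {"glucose": 137, "bmi": 43.1, "age": 33},
    {"glucose": 148, "bmi": 33.6, "age": 50},
    {"glucose": 183, "bmi": 23.3, "age": 26},
]
outcomes = ["neg", "neg", "neg", "pos", "pos", "pos"]

tree = build_tree(patients, outcomes, ["glucose", "bmi", "age"])
print(tree)
print(predict(tree, {"glucose": 160, "bmi": 30.0, "age": 40}))
```

On this toy data, glucose is the only attribute that separates pos from neg with a single threshold, so it becomes the root node. On real data the tree would be deeper, and a tested library implementation would normally be used instead of hand-written code.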
The skills of Data Scientists
A Data Scientist needs various hard and soft skills, and has to be good at statistics and mathematics to analyze and visualize data. Machine Learning forms the heart of Data Science. A Data Scientist must be capable of implementing various algorithms, which requires good coding skills.
Why Techcovery for Data Science Certification
We work backwards from our customer. Our joy is in seeing our customers win. Our ethos is to keep it simple, work with tailor-made consulting and training on niche technologies, work with excellent instructors, provide best-in-industry response times to clients, and work at the most competitive rates.
All in all, we are out to provide organizations with the best and most convenient solution for Data Science certification.