Data mining and machine learning explained in jargon-free, lucid language

q-Maxim on Data mining and machine learning
Some intuition about data mining / machine learning in jargon-free, lucid language
By
Jagadish C.A. (Rao) , Founder of q-Maxim
V 1.4a 13-8-2013

BY READING THIS PRESENTATION, ONE CAN GET SOME INTUITION ABOUT WHAT DATA MINING IS ALL ABOUT AND HOW TO APPLY IT IN ONE'S OWN WORK
THIS PRESENTATION GIVES AN OVERVIEW OF DATA MINING & MACHINE LEARNING, THEN GOES ON TO DESCRIBE SOME OF THE ASPECTS IN SOME DETAIL

•Overview - what is data mining & machine learning – why, where used
•Types of data mining
•Data mining Steps - overview
•Data mining Steps in detail
•Caution notice, Data mining software, references
•About q-Maxim & Jagadish C A

What is data mining?
•Many interpretations about the term
•“Data mining is the process of discovering new patterns from large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems” – Wikipedia
•In other words, data mining is the process of knowledge discovery in large databases

What is data mining?
•Process of analyzing data to identify patterns or relationships.
•Data mining involves developing predictive capacity or descriptive capacity for the dataset of interest.
•Unlike querying, reporting, or even OLAP, it can yield information without asking specific questions.
•Usually involves complex algorithms and advanced statistical techniques.
The next slide shows an example of predictive data mining and its terminology. Data is generally in the form shown there.

What is data mining? Example of prediction - predicting house prices
Our task is to predict the market price of flats in Bangalore. We have a dataset of 1000 flats with their past market prices (sample below). Knowing various aspects of a flat, such as area, number of rooms, age, etc., we would like to predict its market value.
Row no | Area [sq. ft.] | Number of rooms | Age of flat [years] | Gym [Y/N] | Swimming pool [Y/N] | Other features (not shown) | Market price [x 100000 Rupees]
1      | 1800           | 5               | 1.1                 | yes       | yes                 | ...                        | 68.6
2      | 900            | 3               | 4                   | no        | no                  | ...                        | 34.5
3      | 1720           | 5               | 8                   | yes       | no                  | ...                        | 47.7
4      | 560            | 2               | 0.7                 | no        | no                  | ...                        | 25.4
.....
1000   | 2400           | 6               | 3                   | yes       | yes                 | ...                        | 91.8
Market price column: called target, outcome or output
Feature columns (area, number of rooms, etc.): called predictors, inputs or features
Each line of the table: called a record or row
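As a rough illustration of the terminology, the sample rows can be held in Python and separated into predictors and target. This is only a sketch: the dictionary keys (area_sqft, price_lakh, etc.) are names invented here, and only four sample rows are included.

```python
# Sample rows from the flat-price table; the dictionary keys are
# illustrative names chosen for this sketch, not names used in the slides.
flats = [
    {"area_sqft": 1800, "rooms": 5, "age_years": 1.1, "gym": True,  "pool": True,  "price_lakh": 68.6},
    {"area_sqft": 900,  "rooms": 3, "age_years": 4.0, "gym": False, "pool": False, "price_lakh": 34.5},
    {"area_sqft": 1720, "rooms": 5, "age_years": 8.0, "gym": True,  "pool": False, "price_lakh": 47.7},
    {"area_sqft": 560,  "rooms": 2, "age_years": 0.7, "gym": False, "pool": False, "price_lakh": 25.4},
]

TARGET = "price_lakh"  # market price in units of 100000 rupees

# Predictors (inputs, features): every column except the target.
X = [[value for key, value in row.items() if key != TARGET] for row in flats]

# Target (outcome, output): the column we want to predict.
y = [row[TARGET] for row in flats]
```

Each inner list of X is one record's predictors and the matching entry of y is its target value.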

What is data mining? What it is & what it is not – some intuition
Example 1:
My company has extensive sales data for various locations & time periods. We would like to answer the following business questions.
“What were unit sales in New England last March? What is the trend like? Drill down to Boston”.
This is not a data mining problem.
“What’s likely to be Boston unit sales next month? Why?”
This is a data mining problem.
Example 2:
I apply for a credit card. The bank checks my income, age, past credit record and assets against the credit card repayment records of thousands of other card holders with a background similar to mine to decide whether I am creditworthy or not.
This is a data mining problem.

Machine learning?
•One of the most important applications of data mining is in “machine learning”
•Definition: “A computer is able to learn from experience without being explicitly programmed, and improves its performance as it learns”
•Based on the field of artificial intelligence
•Examples:
–Mining large datasets of website click-through data to improve purchase conversion rate
–Autonomous self-flying helicopter (Stanford University)
–Voice recognition (Siri in iPhone)
–Classifying e-mail as spam or not spam (Outlook filtering spam)
–Handwriting recognition (tablets)
–Computer vision (reading car number plates & giving speeding tickets)
–Self-driving cars (Google self-driving car)
–Recommender systems (Amazon recommending books)

Why data mining?
•Data deluge: exponential growth of data (40% yearly growth, per a McKinsey Global Institute study; in 2012, 2.5 quintillion bytes of data were created every day, per other sources), but too little information
Note: quintillion = 1 followed by 18 zeros
•There is a great need to extract useful information from the data and to interpret the data to develop useful knowledge.

Why data mining? applications
Wide ranging applications:
–Biology –e.g. genome research
–Health care – e.g. Deciding on treatment for emergency room patients
–Pharma – e.g. drug discovery
–Artificial intelligence applications, e.g. self-driving cars, machine vision
–Manufacturing
–Engineering
–Social media analysis
–Banking, finance
–Advanced data analysis in Six Sigma


Types of data mining
1. Classification: the predicted target is a discrete class, such as true/false. Examples:
whether an email is spam or not, whether a financial transaction is fraudulent or not, whether a tumor is malignant or not.
The number of classes could be 2 or more.
Note: This is predictive type data mining

2. Regression: the predicted target is a continuous value. Example: knowing the area (m2), number of rooms (1-5), etc., we predict the market price (US$) of the house
[Figure: market price (US$) versus area of house (m2), with two fitted predictive curves]

3. Clustering: a method of automatically assigning a set of objects into groups based on similarities. Example:
creating a customer segmentation based on income, age, race, location, etc.
Note: This is descriptive type data mining
[Figure: example in which three clusters are found]

4. Anomaly detection: detecting patterns that do not conform to an established normal behavior. Examples:
financial fraud detection, network intrusion attempts, aircraft engine failure prediction based on vibration, monitoring machines in a data center to detect failures before they occur
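One simple way to make "established normal behavior" concrete is to measure how far a new reading lies from a baseline of known-good readings. The sketch below is only an illustration under stated assumptions: the vibration readings are invented, and the 3-standard-deviation threshold is a common convention rather than something prescribed in this presentation.

```python
from statistics import mean, stdev

# Hypothetical vibration readings recorded while the machine was known to be healthy.
normal_readings = [0.51, 0.49, 0.50, 0.52, 0.48, 0.50, 0.51, 0.49]

baseline_mean = mean(normal_readings)
baseline_std = stdev(normal_readings)

def is_anomaly(reading, threshold=3.0):
    """Flag a reading more than `threshold` standard deviations from the baseline mean."""
    return abs(reading - baseline_mean) / baseline_std > threshold

# New readings to check against the established normal behavior.
new_readings = [0.50, 0.53, 0.90]
flags = [is_anomaly(r) for r in new_readings]  # only the last reading is flagged
```

Real anomaly detection systems usually combine many features and more robust statistics; a single-feature z-score is just the easiest case to follow.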

5. Association rules: discovering interesting rules between variables. An association algorithm creates rules that describe how often events have occurred together. Example:
“A supermarket chain found that people who buy hotdog sausages also buy tomato ketchup in 99% of cases” = a strong rule. “People who buy hotdog buns buy hangers in 0.005% of cases” = a weak rule. (Strictly speaking, these conditional percentages are the rules' confidence; support is the share of all transactions that contain both items.) Conclusion: keep hotdog sausages & tomato ketchup in adjacent racks, thus increasing the probability of purchase.
Note: This presentation covers types #1 & #2 only
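The two usual measures behind such rules can be computed directly from transaction data. The sketch below assumes a tiny, invented set of shopping baskets; support is taken as the share of all baskets containing the items, and confidence as the share of baskets with the first item that also contain the second.

```python
# Hypothetical shopping baskets; each set is one customer's transaction.
baskets = [
    {"sausages", "ketchup", "buns"},
    {"sausages", "ketchup"},
    {"sausages", "ketchup", "milk"},
    {"buns", "milk"},
    {"milk"},
]

def support(items):
    """Share of all baskets that contain every item in `items`."""
    return sum(1 for b in baskets if items <= b) / len(baskets)

def confidence(antecedent, consequent):
    """Among baskets containing `antecedent`, the share that also contain `consequent`."""
    return support(antecedent | consequent) / support(antecedent)

rule_support = support({"sausages", "ketchup"})          # 3 of 5 baskets
rule_confidence = confidence({"sausages"}, {"ketchup"})  # 3 of the 3 sausage baskets
```

A rule is usually considered interesting only when both numbers are high enough; the cut-off values are a business decision.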


Data mining Steps – overview: predictive data mining phases
Has two major phases:
1. Learning phase
Expose the dataset consisting of past data to a learning algorithm (more on this later) so that it builds a predictive model (or learns). Tune the model until the error between predicted and actual values of the target variable is as low as possible and within acceptable limits.
2. Scoring phase
Use the model to make predictions (or score) in real time, i.e. put the model into production.
See the schematic in the next slide; details of each step follow in subsequent slides.

Data mining – overview
Example - predicting the market price of a house using a simple linear learning algorithm
[Schematic of the two phases]
Learning phase: past data of the housing market (having features & target) → sampled training dataset → learning algorithm → predictive hypothesis h(x)
Scoring phase: known features of a house → h(x) → prediction of the market price of the house
Known inputs (called features or predictors):
1. Area of house
2. Number of rooms
3. Age of house
4. Location
5. Gym [y/n]
6. ..... etc.
Predicted output (called target or outcome): market price of house
h(x) is a linear equation of the type:
h_θ(x) = θ_0 + θ_1 x_1 + θ_2 x_2 + ... + θ_n x_n
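The scoring phase amounts to evaluating the linear hypothesis for a new record once the parameters are known. A minimal sketch follows, assuming three numeric features; the parameter values in theta are invented for illustration and are not learned from any real data.

```python
def h(theta, x):
    """Linear hypothesis h_θ(x) = θ_0 + θ_1*x_1 + ... + θ_n*x_n."""
    return theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))

# Hypothetical parameters: intercept, then weights for area (sq. ft.), rooms, age (years).
theta = [5.0, 0.03, 2.0, -1.0]

# Scoring phase: predict the price of a flat of 1000 sq. ft. with 3 rooms, 2 years old.
predicted_price = h(theta, [1000, 3, 2])
```

In the learning phase the values in theta would be estimated from past data rather than typed in by hand.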


Data mining Steps in detail
[Schematic of the data mining process; the step numbers are referred to in the following slides]
Step 1: Business objectives – identify and define business opportunity
Step 2: Data mining project – data from many sources → selection → target data
Step 3: Pre-processing, cleaning, exploring → pre-processed data
Step 4: Transformation → transformed data
Step 5: Data mining – train model
Step 6: Interpret / evaluate → knowledge; export in PMML & deploy; model in daily use; evaluate performance

Data mining Steps in detail – predictive data mining steps
Each of the steps shown in the schematic diagram in the previous slide is explained in some detail in the following slides.
Steps are numbered (for example, step 6) as per the marking in the schematic diagram in the previous slide.

Data mining Steps in detail – selection (steps 1 & 2)
1. Identify and define the business opportunity
2. Select data mining project(s)
3. Identify data sources, which could be:
–Spread across many databases
–External – social media (Facebook, Twitter, news items, blogs)
–Internal – ERP, CRM, data warehouse, relational technologies, XML databases, MS Office files, etc.
The dataset might consist of thousands (even millions) of records and hundreds, sometimes thousands, of features. For example, a data mining project on census records of the US population would have more than 300 million records, as the population of the US is about 300 million.

Data mining Steps in detail – selection (steps 1 & 2, cont.)
•Extract data of interest
Several techniques may have to be used to extract useful information, such as:
•SQL
•Roll-up
•Drill-down
•Slice and dice
•Pivot

Data mining Steps in detail – pre-processing: scrub, explore (step 3)
•Scrub data
•Clean data – errors, inconsistent units, etc. E.g. the area of a flat might be in m2 in some records and in ft2 in others
•Fill in missing data, e.g. some fields might be empty
•Hide identity if necessary, e.g. patient medical records
•Remove duplicate fields
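A few of these scrubbing operations can be sketched in plain Python. Everything in the example is an assumption made for illustration: the records are invented, duplicates are detected by a made-up id field (the sketch removes duplicate records rather than duplicate fields), and missing ages are filled with the mean, which is only one of many possible strategies.

```python
SQFT_PER_M2 = 10.7639  # 1 square metre is about 10.7639 square feet

# Hypothetical raw records: mixed area units, one missing age, one duplicate.
raw = [
    {"id": 1, "area": 1800.0, "unit": "ft2", "age": 1.1},
    {"id": 2, "area": 100.0,  "unit": "m2",  "age": None},
    {"id": 2, "area": 100.0,  "unit": "m2",  "age": None},
    {"id": 3, "area": 560.0,  "unit": "ft2", "age": 0.7},
]

# Remove duplicate records, keeping the first occurrence of each id.
seen, records = set(), []
for row in raw:
    if row["id"] not in seen:
        seen.add(row["id"])
        records.append(dict(row))

# Convert every area to square feet so that the units are consistent.
for row in records:
    if row["unit"] == "m2":
        row["area"] *= SQFT_PER_M2
        row["unit"] = "ft2"

# Fill missing ages with the mean of the known ages.
known = [r["age"] for r in records if r["age"] is not None]
mean_age = sum(known) / len(known)
for row in records:
    if row["age"] is None:
        row["age"] = mean_age
```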

Data mining Steps in detail – pre-processing: scrub, explore (step 3, cont.)
•Explore data by visualisation
•Visualise the data to get a quick overview. Use some of these graph types:
–Scatter plots, box plots, bar charts, histograms, density plots
–Advanced graphs: heat maps, cluster dendrograms
See the next slide for pictures of the graph types.

Data mining Steps in detail – pre-processing: visualisation graph types (step 3)
[Figure: examples of graph types – histograms, box plots, bar plots, density plots, scatter plots, heat maps, clustering dendrogram]

Data mining Steps in detail – transformation (step 4)
Sometimes it is necessary to convert features or target variables to a different format. One or more of these may be used:
–Feature scaling
•Make sure features are on a similar scale, e.g. convert every feature to a scale between -1 and +1. This makes some data mining programs run faster.
–Mean normalization
•Replace each feature value by (value - mean of that feature over the dataset) so that every feature has zero mean.
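Both transformations are short to express in code. The sketch below uses the areas from the sample flat-price rows and divides by the range (max minus min), which is one common way of bringing values into the -1 to +1 band; dividing by the standard deviation is another frequently used choice.

```python
areas = [1800.0, 900.0, 1720.0, 560.0, 2400.0]  # areas from the sample rows (sq. ft.)

mean_area = sum(areas) / len(areas)
value_range = max(areas) - min(areas)

# Mean normalization: subtract the mean so that the feature has zero mean.
centered = [a - mean_area for a in areas]

# Feature scaling: divide by the range so that every value lies between -1 and +1.
scaled = [c / value_range for c in centered]
```

The same mean and range must be reused later when scoring new records, otherwise the model would see inputs on a different scale.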

Data mining Steps in detail – transformation (step 4, cont.)
–Combine several features into a single feature (e.g. convert the dimensions of the house to area)
–Date conversion for doing date arithmetic
–Generally, if the target variable data is skewed, apply one of these functions:
•Log, square root, square, polynomial ...
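The effect of a log transformation on a skewed target can be checked numerically. This is a sketch with invented prices; skewness is computed here with the simple moment-based formula, which is an assumption of this example rather than a method named in the slides.

```python
import math

def skewness(values):
    """Moment-based skewness; positive values indicate a long right tail."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed target values: most flats are cheap, one is very expensive.
prices = [20.0, 25.0, 30.0, 35.0, 40.0, 400.0]

# The log transformation compresses the large values; for this data the skew drops.
log_prices = [math.log(p) for p in prices]

# Predictions made on the log scale are mapped back with the inverse function.
recovered = [math.exp(v) for v in log_prices]
```

Comparing skewness(prices) with skewness(log_prices) shows whether the transformation helped for a given dataset.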

Data mining Steps in detail – train model (step 5)
Reaching this stage typically takes as much as 60% of the data mining effort.
This step has several sub-steps & is explained in some detail.
A schematic picture of this step is in the next slide, with additional explanation in subsequent slides.

Data mining Steps in detail – train model (step 5)
[Schematic of the model training step]
Sample dataset (pre-processed, transformed) → sampled data
Split data into 1. Training 2. Validation 3. Test datasets, typically in the ratio 70:15:15
Build predictive model on the training and validation datasets using one or more learning algorithms (listed below) → predictive model
Measure the prediction performance of the model on the validation dataset using error rate. Tune the model as necessary → tuned predictive model
Typical learning algorithms:
1. Linear regression
2. Polynomial regression
3. Logistic regression
4. Neural network
5. Support vector machine
6. Random forest

Data mining Steps in detail – train model: sample & split dataset (step 5)
–Cleaned & transformed data is sampled, as the original dataset may be very large
–Sampled data is split into three subsets, typically in a 70:15:15 ratio:
–Training dataset
–Validation dataset
–Test dataset

Data mining Steps in detail – train model: sample & split dataset (step 5, cont.)
–Only the training and validation datasets are used to build the model. The model is built on the training dataset & its predictive performance is repeatedly tested on the validation dataset.
–Goodness of the model so built is evaluated on the test dataset
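A 70:15:15 split can be sketched with the standard library alone. The function name split_dataset and the fixed random seed are choices made for this illustration; the rows are shuffled first so that each subset is a random sample.

```python
import random

def split_dataset(rows, train=0.70, validation=0.15, seed=0):
    """Shuffle rows and split them into training, validation and test subsets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

rows = list(range(1000))  # stand-in for 1000 pre-processed records
train_set, validation_set, test_set = split_dataset(rows)
```

The test subset is set aside and used only once, after tuning, so that the final error estimate is not influenced by the tuning itself.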

Data mining Steps in detail – train model: build model (step 5)
–Depending on the application, one or more learning algorithms are used to build predictive models.
–Each learning algorithm is based on different principles
–The most common algorithms are:
1. Linear regression
2. Polynomial regression
3. Logistic regression
4. Neural networks
5. Support vector machine (SVM)
6. Random forest

–Each learning algorithm has its own parameters for improving its performance, called tuning parameters
–Most data mining programs have libraries for doing this
A brief explanation of the learning algorithms follows in the next few slides.

Learning algorithms (step 5):
1. Linear regression
Simplest of the lot; assumes a linear relationship between features and target. The hypothesis of a model with n features would look like this: h_θ(x) = θ_0 + θ_1 x_1 + θ_2 x_2 + ... + θ_n x_n
2. Polynomial regression
Assumes a polynomial relationship between features and target. A typical hypothesis of a polynomial model would look like this: h_θ(x) = θ_0 + θ_1 x^2 + θ_2 x^3 + θ_3 x^4
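For a single feature, the parameters of the linear hypothesis can be estimated with the closed-form least-squares formulas. The sketch below is an illustration only: it uses just the five sample rows of the flat-price table and only the area feature, so the fitted numbers mean little in practice.

```python
# Sample rows from the flat-price table: area (sq. ft.) and price (x 100000 rupees).
areas = [1800.0, 900.0, 1720.0, 560.0, 2400.0]
prices = [68.6, 34.5, 47.7, 25.4, 91.8]

n = len(areas)
mean_x = sum(areas) / n
mean_y = sum(prices) / n

# Closed-form least-squares estimates for h(x) = θ_0 + θ_1 * x.
theta_1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(areas, prices))
           / sum((x - mean_x) ** 2 for x in areas))
theta_0 = mean_y - theta_1 * mean_x

def h(x):
    """Predicted price for a flat of area x, using the fitted parameters."""
    return theta_0 + theta_1 * x
```

With several features the same idea applies, but the parameters are normally found by a library routine rather than by hand-written formulas.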

Learning algorithms (step 5):
3. Logistic regression
Used when the target is of classification type, e.g. e-mail spam or not spam, tumour malignant or benign. A typical hypothesis for a model with 4 features would look like this, where g is the sigmoid (logistic) function:
h_θ(x) = g(θ_0 + θ_1 x_1 + θ_2 x_2 + θ_3 x_3 + θ_4 x_4)
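In logistic regression, g is the sigmoid function, which squeezes the linear combination into a value between 0 and 1 that can be read as a probability. A minimal sketch follows; the parameter values, the feature values and the 0.5 cut-off are all assumptions chosen for illustration.

```python
import math

def g(z):
    """Sigmoid (logistic) function: maps any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def h(theta, x):
    """Logistic hypothesis h_θ(x) = g(θ_0 + θ_1*x_1 + ... + θ_n*x_n)."""
    z = theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))
    return g(z)

# Hypothetical parameters and feature values for a 4-feature spam classifier.
theta = [-1.0, 2.0, 0.5, -0.5, 1.0]
email_features = [1.0, 0.0, 2.0, 0.5]

probability = h(theta, email_features)  # estimated probability of the positive class
is_spam = probability >= 0.5            # 0.5 is a common, but not mandatory, cut-off
```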

Learning algorithms (advanced, step 5):
4. Neural networks
Can handle classification & regression target types. A machine learning type algorithm that can handle non-linear & complicated hypotheses. Loosely resembles the functioning of neurons in the human brain. Though its working is not easy to understand, it can produce very good predictions.
5. Support vector machine (SVM)
Can handle classification & regression target types. A machine learning type algorithm that can handle non-linear & complicated hypotheses.

Learning algorithms (advanced, step 5):
6. Random forest (decision trees)
Can handle classification & regression target types. An ensemble learning method that operates by constructing a multitude of decision trees (see the next slide for an example). Each tree is built by recursive partitioning of the data. Can handle non-linear & complicated hypotheses. Can also list the relative importance of features.

Data mining Steps in detail – train model: build model – decision tree example (step 5)
[Figure: decision tree of the possibility of a person surviving the Titanic sinking]
A tree showing survival of passengers on the Titanic ("sibsp" is the number of spouses or siblings aboard). The figures under the leaves show the probability of survival and the percentage of observations in the leaf. Source: Wikipedia

Data mining Steps in detail – interpret / evaluate performance (step 6)
–Predictive ability of the model so built is evaluated by applying it to unseen data, i.e. the test dataset (also called scoring)
–Predictive ability is measured by error measures
–Error measures are different for regression and classification problems

Data mining Steps in detail – interpret / evaluate performance (step 6, cont.)
–Common error measures for regression
•Adjusted R², AIC, BIC
•Root-mean-square error (RMSE) and mean squared error (MSE): ways to quantify the difference between the values predicted by the model and the true values of the quantity being predicted.
–Common error measures for classification
•Precision, recall, F1 score, accuracy
•Lift, area under the ROC (receiver operating characteristic) curve
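Some of these measures are simple enough to compute by hand. The sketch below uses invented actual and predicted values; it assumes binary labels with 1 as the positive class and does not guard against division by zero when there are no positive predictions or no positive cases.

```python
import math

def mse(actual, predicted):
    """Mean squared error: average of the squared prediction errors."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root-mean-square error: square root of the MSE, in the target's own units."""
    return math.sqrt(mse(actual, predicted))

def precision_recall_f1(actual, predicted):
    """Precision, recall and F1 score for binary labels (1 = positive class)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical regression results (prices in units of 100000 rupees).
price_error = rmse([68.6, 34.5, 47.7], [66.6, 36.5, 46.7])

# Hypothetical classification results (1 = spam, 0 = not spam).
actual_labels = [1, 1, 1, 0, 0, 0, 1, 0]
predicted_labels = [1, 1, 0, 0, 1, 0, 1, 0]
precision, recall, f1 = precision_recall_f1(actual_labels, predicted_labels)
```

RMSE is often preferred for reporting because it is in the same units as the target, while precision and recall show different kinds of classification mistakes that a single accuracy figure can hide.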

–These error measures are used as a basis for
–Confirming the performance of the model
–Comparing the performance of different algorithms
–Sometimes the model fits the training data very well but is unable to generalise to new samples; this is overfitting (called high variance). The opposite problem, underfitting (called high bias), is when the model fits even the training data poorly.

•The performance of the model is not always at the desired level. One or more of the following measures could be tried to improve it:
–Increase the number of training samples
–Increase the number of features
–Decrease the number of features
–Add polynomial features (e.g. h_θ(x) = θ_0 + θ_1 x^2 + θ_2 x^3 + θ_3 x^4)
–Improve the model by tuning the learning algorithm. Each algorithm has tuning parameters (e.g. for the SVM learning algorithm these are cost, gamma and epsilon)

Data mining Steps in detail – deploying model (step 6)
–Models are deployed for routine use & data can be scored in real time
–Before deployment, the model is often exported to the open-standard PMML format
–PMML (Predictive Model Markup Language) provides a standard way to represent data mining models. It allows for the interchange of models among different tools and environments
–Companies like Zementis provide PMML-based scoring engines for many platforms


Data mining – caution notice
–One must carefully distinguish between correlation and causation
–The fact that a data mining study indicates a high level of model performance does not necessarily imply causation.
–It is possible to get a good correlation by fitting the model to noise rather than signal
–Healthy scepticism is desirable. Before concluding anything about causation, the facts have to be verified.

Data mining software
Several data mining packages exist which make the data mining task relatively painless. Some of the prominent open source ones are:
1. R
2. Rattle – R with a graphical interface
3. Octave

Data mining software (cont.)
Some of the prominent commercial ones are:
1. Revolution Analytics – enhanced R
2. Minitab (some data mining aspects)
3. Microsoft Office data mining add-ins
4. IBM SPSS, SAS, Statistica
5. ...... & many more

Data mining references
1. Data mining – Wikipedia
2. Big data: The next frontier for innovation, competition, and productivity – McKinsey Global Institute publication
3. Machine learning – Wikipedia
4. A. Guazzelli, M. Zeller, W. Chen, and G. Williams. PMML: An Open Standard for Sharing Models. The R Journal, Volume 1/1, May 2009.
5. Data analysis and machine learning online courses on Coursera
6. R: A programming language and software environment for statistical computing, data mining, and graphics. Numerous other R resources on the web
7. G. Williams. Rattle: A Data Mining GUI for R. The R Journal
8. Support vector machine (SVM) – Wikipedia
9. Neural network software – Wikipedia
10. Random forest – Wikipedia
11. Publications / websites of the commercial data mining software companies listed in the previous slide
12. Jagadish’s notes based on his past experience


CONTACT US FOR DETAILS (DETAILS NEXT SLIDE)
QUESTIONS?
DOUBTS?
WHAT NEXT?
WOULD YOU LIKE TO DISCUSS FURTHER TO EXPLORE DATA MINING/MACHINE LEARNING ?

About us:
Q-Maxim is a niche consultancy focussed on advanced problem solving, quality, optimization and Japanese quality methodologies.
Contact: q-Maxim, Jagadish C.A. (Rao), Founder, President
jagadish.chandra@qmaxim.com
+91 9538328704, +91 80 2693 1804
LinkedIn: http://in.linkedin.com/in/jagdishca/
Blog: qmaxim.wordpress.com
Note: Contents of this presentation (concepts, data, style) are proprietary in nature and are subject to intellectual property restrictions.
