Data Mining & Predictive Analytics - Lesson 14 - Concepts Recapitulation and Conclusions - The Penultimate Lesson - Mr Rudy Ridwen
- 1. C3249C - Data Mining and
Predictive Analytics
Specialist Diploma in Business Analytics (SDBA)
Lesson 14 – Concepts Recapitulation and
Conclusions: The Penultimate Lesson
6th June 2019
Rudy Ridwen
School of Infocomm
Republic Polytechnic
- 5. ©2020 Republic Polytechnic
Data Never Sleeps
5
Source: Domo, Inc.
“By 2025, it’s
estimated that 463
exabytes of data will be
created each day
globally – that’s the
equivalent of
212,765,957 DVDs per
day!”
World Economic Forum,
2019
NB:
• a Gigabyte (GB) is 1,000
Megabytes (MB);
• a Terabyte (TB) is 1,000
Gigabytes;
• a Petabyte (PB) is 1,000
Terabytes;
• an Exabyte (EB) is 1,000
Petabytes
- 6. ©2020 Republic Polytechnic
Data-Driven Innovation
6
“Data is a resource, much like water or
energy, and like any resource, data does
nothing on its own.
Rather, it is world-changing in how it is
employed in human decision making.”
Justin Hienz
Owner of Cogent Writing, LLC
- 7. ©2020 Republic Polytechnic
Data is the New Oil
7
“Data is the new oil.”
Coined in 2006 by the British
mathematician Clive Humby.
This now famous phrase was
embraced by the World
Economic Forum in a 2011
report.
- 8. ©2020 Republic Polytechnic
Data: the Basis of Everything
Cloud
Applications
(e.g. social media)
• Pervasive digitization has caused an explosion in the amount of data being created and
collected.
• This provides the opportunity to use the data to gain insights
and to make better decisions.
Social
Needs
Environment
Studies
Public
Services
Company
Operations
8
- 10. ©2020 Republic Polytechnic
Decision Making with Analytics
10
Analytics can overcome human limitations to
improve the speed, accuracy, consistency, and
transparency of decisions.
- 11. ©2020 Republic Polytechnic
11
Analytics 1.0 – the era of “business intelligence”
• This was the era of the enterprise data warehouse,
used to capture information, and of business
intelligence software, used to query and report it.
Analytics 2.0 – the era of big data
• Analytics 2.0 employed next-generation quantitative
analysts, who were called data scientists and
possessed both computational and analytical skills.
Analytics 3.0 – the era of data-enriched offerings
• Analytics 3.0 creates products and services from
analyses of data. Since every digital activity leaves a
trail, it provides the ability to embed analytics and
optimization into every business decision made at the
operational front lines.
The Evolution of Analytics
- 12. ©2020 Republic Polytechnic
12
Business Analytics:
• Mathematical and statistical process of
transforming data into insight for making
better decisions.
• The data-driven analytics insights are used as
a complement to the decision maker’s
experience and “gut-feel”.
Business Analytics - Defined
- 16. ©2020 Republic Polytechnic
Achieving Success with Business Analytics
16
Levels of analytics, from lowest to highest Competitive Advantage:
1. Basic Reporting: What happened?
2. Ad Hoc Reporting: How many, how often, where?
3. Dynamic Reporting: Where exactly are the problems?
4. Reporting with Early Warning: What actions are needed?
5. Basic Statistical Analysis: Why is this happening?
6. Forecasting: What if these trends continue?
7. Predictive Modeling: What will happen next?
8. Decision Optimization: What is the best decision?
Figure bands: Reporting, Basic Analytics, Advanced Analytics
Figure axes: Data, Information, Intelligence; Decision Support, Decision Guidance
- 17. ©2020 Republic Polytechnic
17
Planning from the Top Down
1. Define mission-critical
business questions that
must be answered.
2. Determine suitable analytics or
modeling that can answer
the business questions.
3. Identify the data you have
that can help to build
the model.
- 18. ©2020 Republic Polytechnic
18
Planning from the Bottom Up
1. Identify the data that you
have.
2. Determine suitable analytics
or modeling that can be done
using the available data.
3. Suggest business problems
that can be solved using
analytics.
- 19. ©2020 Republic Polytechnic
19
Data Mining:
• Finding patterns or relationships among
elements of the data.
[unsupervised and supervised learning]
Predictive Analytics:
• Finding a pattern (from historical data) so that
an opportunity or outcome can be identified
before it occurs.
[supervised learning]
Business Analytics
- 20. ©2020 Republic Polytechnic
Analytics Expertise Required for Success
Domain
Knowledge
Intimate knowledge
of related industry
critical to analytics
project success.
Data
Availability
Data always imposes
constraints on
analytics
Analytical
Methods
and
Principles
Data Analytics
Skills
20
- 21. ©2020 Republic Polytechnic
21
Deployment and use of BA:
• Financial analytics
• Human resource (HR) analytics
• Marketing analytics
• Health care analytics
• Supply chain analytics
• Analytics for government and non-profits
• Sports analytics
• Web and Social Media analytics
Business Analytics in Practice
- 23. ©2020 Republic Polytechnic
23
Several data mining methodologies have been
developed, each with its own perspective.
The popular methodologies are:
• SEMMA (SAS)
• SAS Enterprise Miner
• Fayyad et al. (Computer science)
• WEKA
• CRISP-DM (IBM)
• SPSS Modeler
Methodologies for Data
Mining
- 24. ©2020 Republic Polytechnic
SEMMA Methodology
24
Supported by the SAS Enterprise Miner environment
SAMPLE
Input data,
Sampling,
Data partition
EXPLORE
Distribution explorer,
Multiplot,
Insight,
Association,
Variable selection
MODIFY
Transform variable,
Filter outliers,
Clustering,
SOM / Kohonen
MODEL
Regression,
Tree,
Neural Network,
Ensemble
ASSESS
Assessment,
Score,
Report
- 25. ©2020 Republic Polytechnic
Fayyad’s KDD Methodology
25
KDD: knowledge discovery in databases
Process flow (the step in brackets produces the next stage):
Data
[Selection] → Target data
[Preprocessing & cleaning] → Processed data
[Transformation & feature selection] → Transformed data
[Data Mining] → Patterns
[Interpretation / Evaluation] → Knowledge
Reproduced from: maastrichtuniversity.nl lecture notes
- 26. ©2020 Republic Polytechnic
CRISP-DM Methodology
26
CRISP-DM: Cross-industry standard process for data mining
Business understanding
• Business objective
• Assess situation
• Data mining goals
• Project plan
Data understanding
• Collect data
• Describe data
• Explore data
• Verify data quality
Data Preparation
• Select data
• Clean data
• Construct data
• Integrate data
• Format data
Modeling
• Select modeling
techniques
• Design the test
• Build model
• Assess model
Evaluation
• Evaluate results
• Review process
• Determine next steps
Deployment
• Plan deployment
• Plan monitoring and
maintenance
• Final report
• Review project
- 27. ©2020 Republic Polytechnic
CRISP-DM
27
What is CRISP-DM?
• Cross Industry Standard Process for Data
Mining (CRISP-DM) is a methodology
that describes the approach used in
tackling data mining problems.
[http://www.crisp-dm.org/]
• CRISP-DM allows data analytics
practitioners to follow a systematic
process in generating an analytics
solution that is:
1. Well-understood
2. Well-planned
3. Well-executed
4. Well-documented
- 28. ©2020 Republic Polytechnic
General Data-Mining Process
Data-mining process comprises the following steps:
Data Preparation
• Data Sampling:
Extract a sample
of data that is
relevant to the
business problem
under
consideration.
• Data Preparation:
Manipulate the
data to put it in a
form suitable for
formal modeling.
Model Construction
• Apply the
appropriate data-
mining technique
(e.g. k-means,
classification trees)
to accomplish the
desired data-
mining task
(prediction,
classification,
clustering, etc.).
Model Assessment
• Evaluate models
by comparing
performance on
appropriate data
sets.
• Decide on the
champion model.
28
- 29. ©2020 Republic Polytechnic
29
Analytics Framework in a
Nutshell
1. Frame a sharp question to be answered (i.e. the
business question)
2. Identify the data and prepare it
3. Create models to answer the question
4. Interpret and rationalise the results
5. Consolidate findings and tell a story (i.e. present
findings)
- 31. ©2020 Republic Polytechnic
31
Data Understanding &
Quality
Before any analytics adventure, the analyst must
have a clear understanding of the data:
• What each field/variable means
• Where the data came from
• When the data was saved (i.e. data frequency and
latency)
• How the data was created or collected
Quality of Data is Critical
• No quality data, no quality results
e.g. duplicate data may cause incorrect or
misleading statistics
- 32. ©2020 Republic Polytechnic
Data Preparation
32
Major Tasks in Data Preparation:
1. Data cleaning
2. Data integration
3. Data transformation
4. Data reduction
Expansion of tasks:
• Sampling: select a representative subset
from a large population of data
• Outlier data: investigate and apply
appropriate treatment to the data
• Missing data: investigate and have
strategies to handle this issue
• Normalisation or standardisation of data
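Two of the tasks above can be sketched in a few lines of Python. This is only an illustration: the record layout (a list of dictionaries with an `income` field) and the choice of the median as the fill value are assumptions, not part of the lesson.

```python
from statistics import median


def remove_duplicates(rows):
    """Data cleaning: keep only the first occurrence of each identical record."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the record
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def fill_missing(rows, field):
    """Missing data: replace None in `field` with the median of the known values.

    Median imputation is one possible strategy, chosen here for illustration.
    """
    known = [row[field] for row in rows if row[field] is not None]
    fill_value = median(known)
    return [
        {**row, field: fill_value if row[field] is None else row[field]}
        for row in rows
    ]


# Hypothetical records: one exact duplicate and one missing income.
records = [
    {"id": 1, "income": 40},
    {"id": 1, "income": 40},
    {"id": 2, "income": None},
    {"id": 3, "income": 60},
]
prepared = fill_missing(remove_duplicates(records), "income")
```

Whether a missing value should be filled, dropped, or flagged depends on why it is missing, which is why the slide asks the analyst to investigate first.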
- 33. ©2020 Republic Polytechnic
33
Data Preparation
Preparing data for analytics work is very
time-consuming.
At least 70% of the time in an analytics project
is spent on data understanding, cleaning and
preparation.
Image Source: https://pixabay.com/en/pie-chart-pacman-portion-shape-27359/
70%
- 35. ©2020 Republic Polytechnic
Supervised Learning
35
Predictive Analytics (PA):
• Finding a pattern (from historical data) so
that an opportunity or outcome can be
identified before it occurs.
• PA is a form of supervised learning, where a
target (i.e. the data we want to predict) is
required.
• A supervised learning algorithm analyses
the historical (i.e. training) data and
produces an inferred function, which can
be used for mapping new examples (i.e.
predictions).
- 36. ©2020 Republic Polytechnic
36
Two Prediction Types
Decision Predictions:
A predictive model uses input measurements
to make the best decision for each case
(e.g. primary, secondary, tertiary).
Estimate Predictions:
A predictive model uses input measurements
to optimally estimate the target value
(e.g. 0.65, 0.33, 0.75, 0.28, 0.54).
- 37. ©2020 Republic Polytechnic
37
Predictive Modeling Overview
• Data is split into Training Data and Testing Data.
• Training data creates the models (Model A, Model B, Model C, Model D).
• Test data tests the models.
• In this example, Model D is the champion model.
- 38. ©2020 Republic Polytechnic
38
Data Partitioning
• This data partitioning distribution is a rule of thumb.
• Generally, the Training dataset is bigger than the Validation
dataset, and the Test dataset is smaller than the modeling dataset.
Full Dataset: 70% / 15% / 15%
• Dataset for Modeling (Training and Validation)
• Dataset to Assess Model (Test)
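The rule-of-thumb split can be sketched as follows, assuming the 70% / 15% / 15% figures refer to the training, validation and test datasets respectively. The function name `partition` and the fixed seed are illustrative choices.

```python
import random


def partition(records, train=0.70, valid=0.15, seed=42):
    """Shuffle the records, then cut them into training, validation and test sets.

    Whatever remains after the training and validation cuts becomes the test set.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * valid)
    training = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return training, validation, test


training, validation, test = partition(range(100))
```

Shuffling before cutting matters when the source data is ordered (for example by date or by customer), otherwise the three datasets would not be comparable.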
- 41. ©2020 Republic Polytechnic
41
Model Performance Assessment
and Selection
[Figure: validation assessment plotted against model complexity for models 1 to 5. Each model is built on Training Data (inputs, target) and assessed on Validation Data (inputs, target).]
Select the simplest model
with the highest validation
assessment.
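The selection rule on this slide can be sketched as below. The model names, complexity values and validation scores are hypothetical and only illustrate the tie-break in favour of the simpler model.

```python
def pick_champion(candidates):
    """Return the simplest model among those with the highest validation score.

    Each candidate is a tuple: (name, complexity, validation_score).
    """
    best_score = max(score for _, _, score in candidates)
    tied = [c for c in candidates if c[2] == best_score]
    return min(tied, key=lambda c: c[1])  # lowest complexity wins the tie


# Hypothetical candidates: complexity 1 (simplest) to 5 (most complex).
candidates = [
    ("model_1", 1, 0.71),
    ("model_2", 2, 0.78),
    ("model_3", 3, 0.84),
    ("model_4", 4, 0.84),
    ("model_5", 5, 0.80),
]
champion = pick_champion(candidates)
```

Here `model_3` and `model_4` share the best validation score, so the simpler `model_3` is chosen.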
- 42. ©2020 Republic Polytechnic
42
Accuracy:
Overall, how often is the classifier correct?
(TP+TN)/(TP+TN+FP+FN)
Misclassification Rate or Error Rate:
Overall, how often is the classifier wrong?
(FP+FN)/(TP+TN+FP+FN) {or equivalent to 1 minus Accuracy}
Sensitivity, Recall, or True Positive Rate:
When it's actually YES, how often does it predict YES?
TP/(TP+FN)
Specificity:
When it's actually NO, how often does it predict NO?
TN/(TN+FP)
Precision:
When it predicts YES, how often is it correct?
TP/(TP+FP)
Prevalence:
How often does the YES condition actually occur in our sample?
(TP+FN)/(TP+TN+FP+FN)
Confusion
Matrix Rates
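The rates above can be computed directly from the four confusion matrix counts. The sketch below assumes a binary YES/NO classifier; the counts in the usage example are made up for illustration.

```python
def confusion_rates(tp, tn, fp, fn):
    """Compute the confusion matrix rates from the four cell counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,  # equivalent to 1 - accuracy
        "sensitivity": tp / (tp + fn),    # recall / true positive rate
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "prevalence": (tp + fn) / total,
    }


# Hypothetical counts for 100 scored cases.
rates = confusion_rates(tp=40, tn=45, fp=5, fn=10)
```

Note that a denominator can be zero (for example, precision when the model never predicts YES), so production code would need to guard those cases.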
- 43. ©2020 Republic Polytechnic
Supervised Learning
43
Determining the target’s datatype is
important, as it will affect the choice of
algorithms.
Target can be:
• Classification
• Binary
• Multiclass
• Regression
Model assessment is dependent on the type
of target on hand.
Assessment can be:
• Classification
• Binary – Confusion Matrix
• Multiclass – F1 score [1]
• Regression – RMSE [2]
[1] F1 Score is not covered in SDBA programme
[2] Root mean square error (RMSE) metric is
not covered in SDBA programme
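For reference only, the two metrics named in the footnotes have standard definitions that fit in a few lines. The function names are illustrative.

```python
import math


def f1_score(precision, recall):
    """F1 score: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def rmse(actual, predicted):
    """Root mean square error between actual and predicted target values."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

For a multiclass target, F1 is usually computed per class and then averaged; the averaging scheme is a separate choice not shown here.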
- 45. ©2020 Republic Polytechnic
Supervised Learning
45
Decision Trees Algorithm
• Decision Trees can be used to predict a
categorical or a continuous target (called
regression trees in the latter case)
• Unlike logistic regression and neural
networks, no equations are estimated in
decision trees
• A tree structure of rules over the input
variables is used to classify or predict
cases according to the target variable
• The rules are of an IF-THEN form – for
example:
If Risk = Low, then predict on-time payment of a loan
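A fitted decision tree can be read as nested IF-THEN rules. In the sketch below, only the first rule comes from the slide; the second split on `income` and its 50000 threshold are hypothetical, added to show how a tree continues below the root.

```python
def predict_payment(applicant):
    """Walk the IF-THEN rules from the root of the tree down to a leaf."""
    # Rule from the slide: If Risk = Low, then predict on-time payment.
    if applicant["risk"] == "Low":
        return "on-time"
    # Hypothetical second split (not from the lesson) for the remaining cases.
    if applicant["income"] >= 50000:
        return "on-time"
    return "late"
```

No equation is estimated here: the prediction is simply the label of the leaf that a case falls into.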
- 46. ©2020 Republic Polytechnic
Supervised Learning
46
Algorithm: Regression (Logistic Regression)
• Regression is the attempt to explain the
variation in a dependent variable using the
variation in independent variables.
• If the independent variables sufficiently explain
the variation in the dependent variable, the
model can be used for prediction.
• There are many important research topics for
which the dependent variable is "limited."
• For example: whether or not a person smokes, or a
fraud is committed. For these the outcome is not
continuous or distributed normally.
• Logistic regression is a type of regression
analysis where the dependent variable is a
dummy variable: coded 0 (did not smoke) or
1 (did smoke).
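Once a logistic regression has been fitted, prediction passes a linear combination of the inputs through the logistic (sigmoid) function to obtain a probability between 0 and 1. The sketch below assumes a single input and takes the fitted coefficients as parameters; the names `intercept` and `slope` and the 0.5 threshold are illustrative.

```python
import math


def predict_probability(x, intercept, slope):
    """Probability that the dummy target equals 1 for input value x."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * x)))


def classify(x, intercept, slope, threshold=0.5):
    """Turn the estimated probability into a 0/1 decision."""
    return 1 if predict_probability(x, intercept, slope) >= threshold else 0
```

The threshold does not have to be 0.5; it can be moved to trade sensitivity against specificity.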
- 47. ©2020 Republic Polytechnic
Supervised Learning
47
Algorithm: Neural Networks
• Neural networks are exceptionally good at
pattern recognition tasks that are
very difficult to program using
conventional techniques.
• Programs that employ neural nets are
also capable of learning on their own and
adapting to changing conditions.
• Pattern recognition with neural networks can
be achieved by using the backpropagation
algorithm. The algorithm searches for
weight values that minimize the total
error of the network over the set of
training examples (i.e. the training set).
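The idea of searching for weights that minimize the total error can be sketched with the smallest possible network: a single sigmoid neuron trained by gradient descent. This is a simplification, since full backpropagation passes the same error gradient back through hidden layers as well. The logical OR training set, the learning rate and the epoch count are all illustrative assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def output(weights, bias, inputs):
    """Forward pass of one sigmoid neuron."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def total_error(weights, bias, data):
    """Half the sum of squared errors over the training set."""
    return 0.5 * sum((t - output(weights, bias, x)) ** 2 for x, t in data)


def train(data, epochs=5000, rate=1.0):
    """Adjust the weights step by step in the direction that reduces the error."""
    weights = [0.0] * len(data[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in data:
            out = output(weights, bias, inputs)
            # Gradient of the squared error with respect to the neuron's net input.
            delta = (out - target) * out * (1.0 - out)
            weights = [w - rate * delta * x for w, x in zip(weights, inputs)]
            bias -= rate * delta
    return weights, bias


# Illustrative training set: the logical OR of two inputs.
OR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias = train(OR_DATA)
```

After training, the total error on `OR_DATA` is lower than it was with the initial zero weights, and rounding each output reproduces the targets.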
- 48. ©2020 Republic Polytechnic
48
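Min-Max normalization
Min/Max normalization to [0,1]:
the minimum value maps to 0 and the maximum value maps to 1
(scale shown: 0, 0.25, 0.5, 0.75, 1)
Min/Max normalization to [-1,1]
(where 0 is the central point):
the minimum value maps to -1 and the maximum value maps to 1
(scale shown: -1, -0.5, 0, 0.5, 1)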
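Both variants of min-max normalization follow the same formula, differing only in the target range. A minimal sketch, with the function name `min_max` and the sample values chosen for illustration:

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot normalise a constant column")
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


unit_range = min_max([1, 4, 7])             # normalise to [0, 1]
centred_range = min_max([1, 4, 7], -1, 1)   # normalise to [-1, 1]
```

Because the minimum and maximum come from the data, a single extreme outlier will compress all other values into a narrow band, which is one reason outliers are treated before normalisation.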
- 49. ©2020 Republic Polytechnic
49
Choosing
Champion Model
• Models created using various
algorithms will invariably produce
different results.
• Model assessment is required to
determine which of the many
models created is the champion
model.
• An ROC chart can be used to
determine the champion. Other
model assessment measures
can also be used (e.g. Confusion
Matrix, RMSE).
- 50. ©2020 Republic Polytechnic
50
• Training data includes both the input (i.e.
independent variables) and the desired results (i.e.
dependent variable or target).
• Predictive models are constructed using the training
data.
• Testing data includes both the input and known
target.
• A model’s results from the test data will ascertain its
predictive prowess.
• A good model will be able to generalise. It will give
correct results when new input data are given
without knowing the target.
Recap: Supervised Learning
- 54. ©2020 Republic Polytechnic
54
• The model is not provided with the correct results (i.e.
target) during the training. In other words, there is no
target to aim for.
• The aim is to explore the data to find some intrinsic
structures in them.
• The model is the product of statistical or mathematical
computation only.
• Interpretation of the results from unsupervised
learning is still done by humans.
• Unlike supervised learning, unsupervised learning has
no correct answers (i.e. no target to compare
against). Algorithms are left to their own devices to
discover and present the interesting structure in the
data for humans to interpret.
Unsupervised Learning
- 55. ©2020 Republic Polytechnic
Unsupervised Learning
55
Algorithm: Association Analysis
• Association Rule:
Given a set of transactions, find rules that
will predict the occurrence of an item
based on the occurrences of other items
in the transaction. Collectively, the
items in a rule are called an itemset.
• Rule Evaluation Metrics:
Support and Confidence indicate how
frequent an itemset is and how reliable
a rule is.
• A commonly used algorithm for association
analysis is Apriori.
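Support and confidence can be computed directly from a list of transactions. The sketch below uses the standard definitions; the basket contents are invented for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions if itemset <= set(basket))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """For the rule antecedent -> consequent: support of both divided by support of the antecedent."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
bread_milk_support = support(baskets, {"bread", "milk"})
bread_to_milk_confidence = confidence(baskets, {"bread"}, {"milk"})
```

Apriori avoids scoring every possible itemset this way by discarding any candidate that contains an infrequent subset.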
- 56. ©2020 Republic Polytechnic
Unsupervised Learning
56
Algorithm: Cluster Analysis
• Cluster analysis is used to segment (i.e.
group) data objects without any
instructions or target.
• Data objects within a group are similar (or
related) to one another and different
from (or unrelated to) the data objects in
other groups.
• Cluster analysis constructs a partition of a
set of n records into a set of k clusters
• Each record belongs to exactly one
cluster
• The number of clusters k is given in
advance
• A commonly used algorithm for clustering
is k-means.
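The k-means procedure alternates between assigning each record to its nearest centroid and moving each centroid to the mean of its records. The sketch below takes the starting centroids as an argument to keep the result repeatable; in practice they are usually chosen at random. The points and starting centroids are illustrative.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, centroids, max_iter=100):
    """Partition points into len(centroids) clusters; each point joins exactly one."""
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: squared_distance(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        updated = [
            mean_point(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if updated == centroids:
            break  # assignments are stable, so the algorithm has converged
        centroids = updated
    return centroids, clusters


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, centroids=[(1, 1), (8, 8)])
```

The number of clusters k is fixed by the number of starting centroids, matching the requirement that k is given in advance.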
- 59. ©2020 Republic Polytechnic
59
Machine Learning
• Data Mining/Predictive Analytics is a subset of
Machine Learning.
• Machine learning is a field of computer science
that gives computers the ability to learn without
being explicitly programmed.[1]
[1] Samuel, Arthur (1959). "Some Studies in Machine Learning Using the Game of Checkers"
- 60. ©2020 Republic Polytechnic
60
Data Mining
• Data Mining is about automating the process of
searching for patterns in the data.
• Two types of Machine Learning:
• Supervised
• Unsupervised
• In supervised learning, a good model will be able
to generalise. It will give correct results when
new input data are given without knowing the
target.
• In unsupervised learning, interpretation of the
results is still done by humans.
- 61. ©2020 Republic Polytechnic
61
Proof is in the Pudding
• A model is only as good as its test results
(i.e. from model assessment)
• A model must predict better than
the population’s baseline probability to be useful.
• The best model is one that stands the test
after deployment to the real world.
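One way to read "better than the population's probability" is to compare the model's accuracy with the accuracy of always predicting the most common class. The sketch below takes that interpretation, which is an assumption; the function names and the 90/10 class mix are illustrative.

```python
from collections import Counter


def baseline_accuracy(targets):
    """Accuracy obtained by always predicting the most common target value."""
    most_common_count = Counter(targets).most_common(1)[0][1]
    return most_common_count / len(targets)


def is_useful(model_accuracy, targets):
    """A model adds value only if it beats the baseline on the same data."""
    return model_accuracy > baseline_accuracy(targets)


# Hypothetical test set in which 90% of cases are class 0.
test_targets = [0] * 90 + [1] * 10
```

With this class mix, a model scoring 85% accuracy is worse than the trivial baseline of 90%, even though 85% sounds high in isolation.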
- 69. ©2020 Republic Polytechnic
69
Why smart statistics are the key to fighting crime
by Anne Milgram at TED@BCG
https://www.youtube.com/watch?v=ZJNESMhIxQ0
What is the Cambridge Analytica scandal?
by The Guardian
https://www.youtube.com/watch?v=Q91nvbJSmS4
Real-World Predictive Analytics
in Action