C3249C - Data Mining and
Predictive Analytics
Specialist Diploma in Business Analytics (SDBA)
Lesson 14 – Concepts Recapitulation and
Conclusions: The Penultimate Lesson
6th June 2019
Rudy Ridwen
School of Infocomm
Republic Polytechnic
©2020 Republic Polytechnic
Data Mining
Methodologies
A process guide for analytics
projects
Analytics
Why “Analytics”?
Image source: https://www.freepik.com/free-vector/mechanical-brain_769574.htm#term=sketch&page=5&position=2
Data Never Sleeps
Source: Domo, Inc.
“By 2025, it’s estimated that 463 exabytes of data will be created each day globally – that’s the equivalent of 212,765,957 DVDs per day!”
World Economic Forum, 2019
NB:
• a Gigabyte (GB) is 1,000 Megabytes (MB);
• a Terabyte (TB) is 1,000 Gigabytes;
• a Petabyte (PB) is 1,000 Terabytes;
• an Exabyte (EB) is 1,000 Petabytes
Data-Driven Innovation
“Data is a resource, much like water or energy, and like any resource, data does nothing on its own. Rather, it is world-changing in how it is employed in human decision making.”
Justin Hienz, Owner of Cogent Writing, LLC
Data is the New Oil
“Data is the new oil.”
Coined in 2006 by British mathematician Clive Humby. This now-famous phrase was embraced by the World Economic Forum in a 2011 report.
Data: the Basis of Everything
• Pervasive digitization has exploded the amount of data being created and collected.
• This provides the opportunity to use the data to gain insights and to make better decisions.
Diagram labels: Cloud Applications (e.g. social media), Social Needs, Environment Studies, Public Services, Company Operations
From Data to Wisdom
Decision Making with Analytics
Analytics can overcome human limitations to improve the speed, accuracy, consistency, and transparency of decisions.
The Evolution of Analytics
Analytics 1.0 – the era of “business intelligence”
• This was the era of the enterprise data warehouse, used to capture information, and of business intelligence software, used to query and report it.
Analytics 2.0 – the era of big data
• Analytics 2.0 employed next-generation quantitative analysts, called data scientists, who possessed both computational and analytical skills.
Analytics 3.0 – the era of data-enriched offerings
• Analytics 3.0 creates products and services from analyses of data. Since every digital activity leaves a trail, it provides the ability to embed analytics and optimization into every business decision made at the operational front lines.
Business Analytics - Defined
Business Analytics:
• The mathematical and statistical process of transforming data into insight for making better decisions.
• Data-driven analytics insights are used to complement the decision maker’s experience and “gut feel”.
2018 Gartner Magic Quadrant for BI and Analytics
2019 Gartner Magic Quadrant for BI and Analytics
Spectrum of Business Analytics
Achieving Success with Business Analytics
Level of analytics and the question it answers:
• Basic Reporting: What happened?
• Ad Hoc Reporting: How many, how often, where?
• Dynamic Reporting: Where exactly are the problems?
• Reporting with Early Warning: What actions are needed?
• Basic Statistical Analysis: Why is this happening?
• Forecasting: What if these trends continue?
• Predictive Modeling: What will happen next?
• Decision Optimization: What is the best decision?
Chart labels: Competitive Advantage; Data, Information, Intelligence; Reporting, Basic Analytics, Advanced Analytics; Decision Support, Decision Guidance
Planning from the Top Down
1. Define mission-critical business questions that must be answered.
2. Analyze suitable analytics or modeling that can answer the business questions.
3. Identify the data you have that can help to build the model.
Planning from the Bottom Up
1. Identify the data that you have.
2. Determine suitable analytics or modeling that can be done using the available data.
3. Suggest a business problem that can be solved using analytics.
Business Analytics
Data Mining:
• Finding patterns or relationships among elements of the data.
[unsupervised and supervised learning]
Predictive Analytics:
• Finding a pattern (from historical data) so that an opportunity or outcome can be identified before it occurs.
[supervised learning]
Analytics Expertise Required for Success
• Domain Knowledge: Intimate knowledge of the related industry is critical to analytics project success.
• Data Availability: Data always imposes constraints on analytics.
• Analytical Methods and Principles
• Data Analytics Skills
Business Analytics in Practice
Deployment and use of BA:
• Financial analytics
• Human resource (HR) analytics
• Marketing analytics
• Health care analytics
• Supply chain analytics
• Analytics for government and non-profits
• Sports analytics
• Web and Social Media analytics
Analytics
Frameworks
The Process of an Analytics Project
Image source: https://www.freepik.com/free-vector/vintage-aircraft-illustration_3043533.htm#term=sketch&page=11&position=12
Methodologies for Data Mining
Several data mining methodologies have been developed, each with its own perspective.
The popular methodologies are:
• SEMMA (SAS)
  • SAS Enterprise Miner
• Fayyad et al. (Computer science)
  • WEKA
• CRISP-DM (IBM)
  • SPSS Modeler
SEMMA Methodology
Supported by the SAS Enterprise Miner environment
SAMPLE: Input data, Sampling, Data partition
EXPLORE: Distribution explorer, Multiplot, Insight, Association, Variable selection
MODIFY: Transform variable, Filter outliers, Clustering, SOM / Kohonen
MODEL: Regression, Tree, Neural Network, Ensemble
ASSESS: Assessment, Score, Report
Fayyad’s KDD Methodology
KDD: knowledge discovery in databases
1. Selection: Data → Target data
2. Preprocessing & cleaning: Target data → Processed data
3. Transformation & feature selection: Processed data → Transformed data
4. Data Mining: Transformed data → Patterns
5. Interpretation / Evaluation: Patterns → Knowledge
Reproduced from: maastrichtuniversity.nl lecture notes
CRISP-DM Methodology
CRISP-DM: Cross-industry standard process for data mining
Business understanding
• Business objective
• Assess situation
• Data mining goals
• Project plan
Data understanding
• Collect data
• Describe data
• Explore data
• Verify data quality
Data Preparation
• Select data
• Clean data
• Construct data
• Integrate data
• Format data
Modeling
• Select modeling
techniques
• Design the test
• Build model
• Assess model
Evaluation
• Evaluate results
• Review process
• Determine next steps
Deployment
• Plan deployment
• Plan monitoring and
maintenance
• Final report
• Review project
CRISP-DM
What is CRISP-DM?
• Cross Industry Standard Process for Data Mining (CRISP-DM) is a methodology that describes the approach used in tackling data mining problems. [http://www.crisp-dm.org/]
• CRISP-DM allows data analytics practitioners to follow a systematic process in generating an analytics solution that is:
1. Well-understood
2. Well-planned
3. Well-executed
4. Well-documented
General Data-Mining Process
The data-mining process comprises the following steps:
Data Preparation
• Data Sampling: Extract a sample of data that is relevant to the business problem under consideration.
• Data Preparation: Manipulate the data to put it in a form suitable for formal modeling.
Model Construction
• Apply the appropriate data-mining technique (e.g. k-means, classification trees) to accomplish the desired data-mining task (prediction, classification, clustering, etc.).
Model Assessment
• Evaluate models by comparing performance on appropriate data sets.
• Decide on the champion model.
Analytics Framework in a Nutshell
1. Frame a sharp question to be answered (i.e. the business question)
2. Identify the data and prepare it
3. Create models to answer the question
4. Interpret and rationalise the results
5. Consolidate findings and tell a story (i.e. present findings)
Data
Data Data Everywhere
Image source: https://www.freepik.com/free-vector/sketchy-robot_794262.htm
Data Understanding & Quality
Before any analytics adventure, the analyst must have a clear understanding of the data:
• What each field/variable means
• Where the data came from
• When the data was saved (i.e. data frequency and latency)
• How the data was created or collected
Quality of Data is Critical
• No quality data, no quality results, e.g. duplicate data may cause incorrect or misleading statistics
Data Preparation
Major Tasks in Data Preparation:
1. Data cleaning
2. Data integration
3. Data transformation
4. Data reduction
Expansion of tasks:
• Sampling: select a representative subset from a large population of data
• Outlier data: investigate the outliers and treat them appropriately
• Missing data: investigate and have strategies to handle this issue
• Normalisation or standardisation of data
Data Preparation
Preparing data for analytics work is very time consuming.
At least 70% of the time in an analytics project will be spent on data understanding, cleaning and preparation.
Image Source: https://pixabay.com/en/pie-chart-pacman-portion-shape-27359/
Supervised
Learning
Make a Prediction
Image source: https://www.freepik.com/index.php?goto=74&idfoto=3043535
Supervised Learning
Predictive Analytics (PA):
• Finding a pattern (from historical data) so that an opportunity or outcome can be identified before it occurs.
• PA is a form of supervised learning, where a target (i.e. the data we want to predict) is required.
• A supervised learning algorithm analyses the historical (i.e. training) data and produces an inferred function, which can be used for mapping new examples (i.e. making predictions).
Two Prediction Types
Decision Predictions
• A predictive model uses input measurements to make the best decision for each case.
• Example predictions: primary, secondary, secondary, primary, tertiary
Estimate Predictions
• A predictive model uses input measurements to optimally estimate the target value.
• Example predictions: 0.65, 0.33, 0.75, 0.28, 0.54
Predictive Modeling Overview
• The data is split into training data and testing data.
• Training data creates the models (Model A, Model B, Model C, Model D).
• Test data tests the models.
• In this example, Model D is the champion model.
Data Partitioning
Full Dataset: 70% Training, 15% Validation, 15% Test
• Training and Validation: dataset for modeling
• Test: dataset to assess the model
• This data partitioning distribution is a rule of thumb.
• Generally, the Training dataset is bigger than the Validation dataset, and the Test dataset is smaller than the modeling dataset.
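The 70% / 15% / 15% rule of thumb can be sketched in a few lines of Python. This is a minimal illustration, not part of the module: the function name `partition`, the fixed seed and the 100-record example are assumptions made for the sketch.

```python
import random

def partition(records, train=0.70, validation=0.15, seed=42):
    """Shuffle the records, then cut them into training, validation and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * validation)
    training = shuffled[:n_train]
    validation_set = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains, about 15%
    return training, validation_set, test

# Hypothetical full dataset of 100 record IDs.
training, validation_set, test = partition(range(100))
print(len(training), len(validation_set), len(test))
```

Shuffling before cutting keeps each partition representative of the full dataset.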
The Curse of Dimensionality
Figure panels: 1–D, 2–D, 3–D
Model Complexity
Figure panels: Too flexible, Just right
Model Performance Assessment and Selection
Figure: models 1 to 5, ordered by model complexity, built on the training data (inputs, target) and given a validation assessment on the validation data (inputs, target).
Select the simplest model with the highest validation assessment.
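The selection rule "simplest model with the highest validation assessment" can be written as a short Python sketch. The function name `select_champion` and the five (complexity, score) pairs are hypothetical values chosen for illustration; a higher score is assumed to be better.

```python
def select_champion(candidates):
    """Return the least complex model among those with the best validation score.

    candidates: list of (complexity, validation_score) tuples.
    """
    best_score = max(score for _, score in candidates)
    tied = [c for c in candidates if c[1] == best_score]
    return min(tied, key=lambda c: c[0])  # break ties by lowest complexity

# Hypothetical results: (model complexity, validation assessment)
results = [(1, 0.71), (2, 0.78), (3, 0.84), (4, 0.84), (5, 0.80)]
champion = select_champion(results)
print(champion)  # (3, 0.84)
```

Models 3 and 4 tie on validation assessment, so the simpler model 3 is chosen.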
Confusion Matrix Rates
(TP = true positives, TN = true negatives, FP = false positives, FN = false negatives)
Accuracy:
Overall, how often is the classifier correct?
(TP+TN)/(TP+TN+FP+FN)
Misclassification Rate or Error Rate:
Overall, how often is the classifier wrong?
(FP+FN)/(TP+TN+FP+FN) (equivalent to 1 minus Accuracy)
Sensitivity, Recall, or True Positive Rate:
When it's actually YES, how often does it predict YES?
TP/(TP+FN)
Specificity:
When it's actually NO, how often does it predict NO?
TN/(TN+FP)
Precision:
When it predicts YES, how often is it correct?
TP/(TP+FP)
Prevalence:
How often does the YES condition actually occur in our sample?
(TP+FN)/(TP+TN+FP+FN)
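The six rates can be computed directly from the four confusion matrix counts. The sketch below is only an illustration; the function name `confusion_rates` and the example counts are assumptions.

```python
def confusion_rates(tp, tn, fp, fn):
    """Compute the confusion matrix rates from the four cell counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "sensitivity": tp / (tp + fn),   # also called recall or true positive rate
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "prevalence": (tp + fn) / total,
    }

# Hypothetical counts for a binary (YES/NO) classifier.
rates = confusion_rates(tp=100, tn=50, fp=10, fn=5)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

Accuracy and error rate always sum to 1, which is a quick sanity check on the counts.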
Supervised Learning
Determining the target’s data type is important, as it will affect the choice of algorithms.
Target can be:
• Classification
  • Binary
  • Multiclass
• Regression
Model assessment is dependent on the type of target on hand.
Assessment can be:
• Classification
  • Binary – Confusion Matrix
  • Multiclass – F1 score [1]
• Regression – RMSE [2]
[1] F1 score is not covered in the SDBA programme
[2] Root mean square error (RMSE) metric is not covered in the SDBA programme
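Although F1 score and RMSE are outside the SDBA programme, both are short formulas. The sketch below uses the standard definitions; the function names and example numbers are assumptions. The F1 shown is for a single class, and for a multiclass target it is usually averaged across the classes.

```python
import math

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall for one class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def rmse(actual, predicted):
    """Root mean square error between actual and predicted values."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical values for illustration only.
print(f1_score(tp=8, fp=2, fn=2))
print(rmse([3.0, 5.0], [1.0, 5.0]))
```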
Algorithms
Models are created from…
algorithms
Image source: https://www.freepik.com/index.php?goto=74&idfoto=2782996
Supervised Learning
Decision Trees Algorithm
• Decision trees can be used to predict a categorical or a continuous target (called regression trees in the latter case)
• Unlike logistic regression and neural networks, no equations are estimated in decision trees
• A tree structure of rules over the input variables is used to classify or predict the cases according to the target variable
• The rules are of an IF-THEN form, for example:
If Risk = Low, then predict on-time payment of a loan
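A fitted decision tree can be read as nested IF-THEN rules. The sketch below hand-writes a tiny two-level tree to show the shape of such rules. Only the first rule (Risk = Low) comes from the slide; the `income` field, its threshold and the function name are hypothetical.

```python
def predict_payment(applicant):
    """Apply hand-written IF-THEN rules, as a small decision tree would."""
    if applicant["risk"] == "Low":
        return "on-time"        # rule from the slide
    if applicant["income"] >= 50000:
        return "on-time"        # hypothetical second-level rule
    return "late"               # hypothetical default leaf

print(predict_payment({"risk": "Low", "income": 20000}))
print(predict_payment({"risk": "High", "income": 20000}))
```

In practice the tree algorithm learns the rules and thresholds from the training data instead of having them typed in.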
Supervised Learning
Algorithm: Regression (Logistic Regression)
• Regression is the attempt to explain the variation in a dependent variable using the variation in independent variables.
• If the independent variables sufficiently explain the variation in the dependent variable, the model can be used for prediction.
• There are many important research topics for which the dependent variable is "limited".
  • For example: whether or not a person smokes, or whether a fraud is committed. For these, the outcome is not continuous or normally distributed.
• Logistic regression is a type of regression analysis where the dependent variable is a dummy variable: coded 0 (did not smoke) or 1 (did smoke).
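Logistic regression turns a linear combination of the inputs into a probability between 0 and 1 using the logistic (sigmoid) function. A minimal sketch for one input follows; the intercept, coefficient and 0.5 cut-off are hypothetical values chosen for illustration, not fitted results.

```python
import math

def predict_probability(x, intercept, coefficient):
    """P(target = 1 | x) for a one-input logistic regression model."""
    z = intercept + coefficient * x
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical fitted values, for illustration only.
p = predict_probability(x=2.0, intercept=-1.0, coefficient=0.5)
label = 1 if p >= 0.5 else 0   # 1 = did smoke, 0 = did not smoke
print(p, label)
```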
Supervised Learning
Algorithm: Neural Networks
• Neural networks are exceptionally good at performing pattern recognition tasks that are very difficult to program using conventional techniques.
• Programs that employ neural nets are also capable of learning on their own and adapting to changing conditions.
• Neural network pattern recognition can be achieved by using the backpropagation algorithm. The algorithm searches for weight values that minimize the total error of the network over the set of training examples (i.e. training set).
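The weight search can be illustrated with the simplest possible case: a single sigmoid neuron trained by gradient descent. This is only a sketch under stated assumptions. A real network has hidden layers and backpropagation repeats the same chain-rule step layer by layer; the logical OR training set, learning rate and epoch count are hypothetical choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, inputs):
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def total_error(weights, bias, examples):
    # Half the sum of squared errors over the training set.
    return 0.5 * sum((predict(weights, bias, x) - t) ** 2 for x, t in examples)

def train(examples, epochs=2000, rate=0.5):
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for inputs, target in examples:
            output = predict(weights, bias, inputs)
            # Error gradient with respect to the neuron's weighted sum (chain rule).
            delta = (output - target) * output * (1.0 - output)
            weights = [w - rate * delta * x for w, x in zip(weights, inputs)]
            bias -= rate * delta
    return weights, bias

# Logical OR as a tiny training set: (inputs, target)
examples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
error_before = total_error([0.0, 0.0], 0.0, examples)
weights, bias = train(examples)
error_after = total_error(weights, bias, examples)
print(error_before, error_after)
```

The total error after training is lower than before, which is the behaviour the slide describes.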
Min-Max normalization
Min/Max normalization to [0,1]
• Example ranges 40 to 200 and 1 to 7 are each mapped onto 0 to 1 (scale: 0, 0.25, 0.5, 0.75, 1)
Min/Max normalization to [-1,1] (where 0 is the central point)
• Example range 1 to 7 is mapped onto -1 to 1 (scale: -1, -0.5, 0, 0.5, 1)
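Min-max normalization rescales a value linearly from its original range to a new range. A minimal sketch, assuming the function name `min_max` and example values taken from the 1 to 7 and 40 to 200 ranges on the slide:

```python
def min_max(value, old_min, old_max, new_min=0.0, new_max=1.0):
    """Linearly rescale value from [old_min, old_max] to [new_min, new_max]."""
    scaled = (value - old_min) / (old_max - old_min)   # position within 0..1
    return new_min + scaled * (new_max - new_min)

print(min_max(4, 1, 7))            # 0.5
print(min_max(120, 40, 200))       # 0.5
print(min_max(4, 1, 7, -1, 1))     # 0.0
```

The midpoint of the original range lands on 0.5 in [0,1] and on the central point 0 in [-1,1].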
Choosing Champion Model
• Models created using various algorithms will invariably produce different results.
• Model assessment is required to determine which of the many models created is the champion model.
• An ROC chart can be used to determine the champion. Other model assessment measurements can also be used (e.g. Confusion Matrix, RMSE).
Recap: Supervised Learning
• Training data includes both the inputs (i.e. independent variables) and the desired results (i.e. dependent variable or target).
• Predictive models are constructed using the training data.
• Testing data includes both the inputs and the known target.
• A model’s results on the test data will ascertain its predictive prowess.
• A good model will be able to generalise. It will give correct results when new input data are given without knowing the target.
Machine Learning Algorithms
Source: https://s3.amazonaws.com/MLMastery/MachineLearningAlgorithms.png?__s=yxwb9fsmnfj72ypjei1f
Unsupervised
Learning
Something is telling us…
Unsupervised
Learning
“Tell me what you see”
Image source: https://www.freepik.com/index.php?goto=74&idfoto=945899
Unsupervised Learning
• The model is not provided with the correct results (i.e. target) during training. In other words, there is no target to aim for.
• The aim is to explore the data to find some intrinsic structure in it.
• The model is the outcome of statistical or mathematical calculations only.
• Interpretation of the results from unsupervised learning is still done by humans.
• Unlike supervised learning, unsupervised learning has no correct answers (i.e. no target to compare against). Algorithms are left to their own devices to discover and present the interesting structure in the data for humans to interpret.
Unsupervised Learning
Algorithm: Association Analysis
• Association Rule: Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction. A collection of such items is called an itemset.
• Rule Evaluation Metrics: Support and Confidence calculations indicate how frequent an itemset is and how reliable a rule is.
• A commonly used algorithm for association analysis is Apriori.
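Support and confidence can be computed by counting transactions. The sketch below uses the standard definitions: support is the fraction of transactions containing the itemset, and confidence of A → B is support(A and B) divided by support(A). The four grocery transactions are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """How often the consequent appears in transactions containing the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
print(support(transactions, {"bread", "milk"}))       # 0.5
print(confidence(transactions, {"bread"}, {"milk"}))
```

Here the rule bread → milk holds in 2 of the 3 transactions that contain bread.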
Unsupervised Learning
Algorithm: Cluster Analysis
• Cluster analysis is used to segment (i.e. group) data objects without any instructions or target.
• Data objects within a group are similar (or related) to one another and different from (or unrelated to) the data objects in other groups.
• Cluster analysis constructs a partition of a set of n records into a set of k clusters
  • Each record belongs to exactly one cluster
  • The number of clusters k is given in advance
• A commonly used algorithm for clustering is k-means.
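The k-means idea (assign each record to its nearest centroid, then move each centroid to the mean of its cluster, and repeat) can be sketched on one-dimensional data. The points, the starting centroids and the fixed iteration count are assumptions made for this illustration; k is the number of starting centroids supplied.

```python
def k_means(points, centroids, iterations=10):
    """Plain k-means on 1-D points, starting from the given centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Each record is assigned to exactly one cluster: the nearest centroid.
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to its cluster mean (unchanged if the cluster is empty).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = k_means(points, centroids=[1.0, 12.0])
print(centroids)  # [2.0, 11.0]
print(clusters)   # [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]
```

No target is supplied at any point; the two groups emerge from the distances alone.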
Data Mining Tools
Beyond the module demonstrations
Summary and
Conclusion
Machine Learning
• Data Mining/Predictive Analytics is a subset of Machine Learning.
• Machine learning is a field of computer science that gives computers the ability to learn without being explicitly programmed.[1]
[1] Samuel, Arthur (1959). "Some Studies in Machine Learning Using the Game of Checkers"
Data Mining
• Data Mining is about automating the process of searching for patterns in the data.
• Two types of Machine Learning:
  • Supervised
  • Unsupervised
• In supervised learning, a good model will be able to generalise. It will give correct results when new input data are given without knowing the target.
• In unsupervised learning, interpretation of the results is still done by humans.
Proof is in the Pudding
• A model is only as good as its test results (i.e. from model assessment).
• To be useful, a model must give better predictions than the population’s probability.
• The best model is one that stands the test after deployment to the real world.
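One way to check that a model beats the population's probability is to compare its accuracy with a baseline that always predicts the most common class. This reading of "population's probability" as the majority-class rate is an assumption, as are the function name and the ten example records.

```python
def compare_with_baseline(actual, predicted):
    """Return (model accuracy, baseline accuracy, whether the model is better)."""
    model_accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
    most_common = max(set(actual), key=actual.count)
    # Accuracy obtained by always predicting the most common class.
    baseline_accuracy = actual.count(most_common) / len(actual)
    return model_accuracy, baseline_accuracy, model_accuracy > baseline_accuracy

# Hypothetical test-set targets and model predictions.
actual    = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0]
predicted = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
result = compare_with_baseline(actual, predicted)
print(result)  # (0.9, 0.7, True)
```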
The Analytics
Landscape
The Big Picture View
Image source: https://www.freepik.com/index.php?goto=74&idfoto=2783060
Analytics Use within 3 Years
Source: Operationalizing and Embedding Analytics for Action by Fern Halper. TDWI Research.
Transform with Predictive Insights
Source: SAP (www.sap.com/predictive)
An Analytics Architecture
The Analytics Challenges
Source: Operationalizing and Embedding Analytics for Action by Fern Halper. TDWI Research.
Conclusion and
Reflection
What is the future of
data analytics?
Real-World Predictive Analytics in Action
Why smart statistics are the key to fighting crime
by Anne Milgram at TED@BCG
https://www.youtube.com/watch?v=ZJNESMhIxQ0
What is the Cambridge Analytica scandal?
by The Guardian
https://www.youtube.com/watch?v=Q91nvbJSmS4
The Analytics Challenges
Source: https://mashable.com/2017/04/27/man-tweets-pie-charts/
Thank You
rudy_ridwen@rp.edu.sg
@rudyridwen
@rudyridwen
@rudy.ridwen
This PowerPoint helps students to consider the concept of infinity.
 
PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docx
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptx
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
Gardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch LetterGardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch Letter
 
An Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdfAn Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdf
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
fourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingfourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writing
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 

Data Mining & Predictive Analytics - Lesson 14 - Concepts Recapitulation and Conclusions - The Penultimate Lesson - Mr Rudy Ridwen

  • 1. C3249C - Data Mining and Predictive Analytics. Specialist Diploma in Business Analytics (SDBA). Lesson 14 – Concepts Recapitulation and Conclusions: The Penultimate Lesson. 6th June 2019. Rudy Ridwen, school•of•inforcomm, republic•polytechnic
  • 2. ©2020 Republic Polytechnic. Data Mining Methodologies: A process guide for analytics projects
  • 3. Analytics: Why “Analytics”? Image: https://www.freepik.com/free-vector/mechanical-brain_769574.htm#term=sketch&page=5&position=2
  • 4. Data Never Sleeps. Source: Domo, Inc.
  • 5. Data Never Sleeps. Source: Domo, Inc. “By 2025, it’s estimated that 463 exabytes of data will be created each day globally – that’s the equivalent of 212,765,957 DVDs per day!” (World Economic Forum, 2019) NB: • a Gigabyte (GB) is 1,000 Megabytes (MB); • a Terabyte (TB) is 1,000 Gigabytes; • a Petabyte (PB) is 1,000 Terabytes; • an Exabyte (EB) is 1,000 Petabytes
  • 6. Data-Driven Innovation: “Data is a resource, much like water or energy, and like any resource, data does nothing on its own. Rather, it is world-changing in how it is employed in human decision making.” Justin Hienz, Owner of Cogent Writing, LLC
  • 7. Data is the New Oil: “Data is the new oil.” Coined in 2006 by British mathematician Clive Humby. This now famous phrase was embraced by the World Economic Forum in a 2011 report.
  • 8. Data: the Basis of Everything • Pervasive digitization exploded the amount of data being created and collected. • This provides the opportunity to use the data to gain insights and make better decisions. Areas shown in the diagram: Cloud Applications (e.g. social media), Social Needs, Environment Studies, Public Services, Company Operations
  • 9. From Data to Wisdom (©2018 Republic Polytechnic)
  • 10. Decision Making with Analytics: Analytics can overcome human limitations to improve the speed, accuracy, consistency, and transparency of decisions.
  • 11. The Evolution of Analytics • Analytics 1.0, the era of “business intelligence”: This was the era of the enterprise data warehouse, used to capture information, and of business intelligence software, used to query and report it. • Analytics 2.0, the era of big data: The next-generation quantitative analysts of this era were called data scientists, and they possessed both computational and analytical skills. • Analytics 3.0, the era of data-enriched offerings: Analytics 3.0 creates products and services from analyses of data. Since every digital activity leaves a trail, it provides the ability to embed analytics and optimization into every business decision made at the operational front lines.
  • 12. Business Analytics - Defined • Business analytics is the mathematical and statistical process of transforming data into insight for making better decisions. • Data-driven insights are used as a complement to the decision maker’s experience and “gut feel”.
  • 13. 2018 Gartner Magic Quadrant for BI and Analytics
  • 14. 2019 Gartner Magic Quadrant for BI and Analytics
  • 15. Spectrum of Business Analytics
  • 16. Achieving Success with Business Analytics. Stages, in order of increasing competitive advantage: • Basic Reporting: What happened? • Ad Hoc Reporting: How many, how often, where? • Dynamic Reporting: Where exactly are the problems? • Reporting with Early Warning: What actions are needed? • Basic Statistical Analysis: Why is this happening? • Forecasting: What if these trends continue? • Predictive Modeling: What will happen next? • Decision Optimization: What is the best decision? Band labels on the chart: Data, Information, Intelligence; Reporting, Basic Analytics, Advanced Analytics; Decision Support, Decision Guidance.
  • 17. Planning from the Top Down: 1. Define the mission-critical business questions that must be answered. 2. Analyze which analytics or modeling can answer the business questions. 3. Identify the data you have that can help to build the model.
  • 18. Planning from the Bottom Up: 1. Identify the data that you have. 2. Determine which analytics or modeling can be done using the available data. 3. Suggest a business problem that can be solved using analytics.
  • 19. Business Analytics • Data Mining: Finding patterns or relationships among elements of the data (unsupervised and supervised learning). • Predictive Analytics: Finding a pattern in historical data so that an opportunity or outcome can be identified before it occurs (supervised learning).
  • 20. Analytics Expertise Required for Success • Domain Knowledge: Intimate knowledge of the related industry is critical to analytics project success. • Data Availability: The available data always imposes constraints on the analytics. • Data Analytics Skills: Analytical methods and principles.
  • 21. Business Analytics in Practice. Deployment and use of BA: • Financial analytics • Human resource (HR) analytics • Marketing analytics • Health care analytics • Supply chain analytics • Analytics for government and non-profits • Sports analytics • Web and social media analytics
  • 22. Analytics Frameworks: The Process of an Analytics Project. Image: https://www.freepik.com/free-vector/vintage-aircraft-illustration_3043533.htm#term=sketch&page=11&position=12
  • 23. Methodologies for Data Mining: Several data mining methodologies have been developed, each with its own perspective. The popular methodologies, with an associated tool, are: • SEMMA (SAS): SAS Enterprise Miner • Fayyad et al. (computer science): WEKA • CRISP-DM (IBM): SPSS Modeler
  • 24. SEMMA Methodology. Supported by the SAS Enterprise Miner environment. • SAMPLE: Input data, Sampling, Data partition • EXPLORE: Distribution explorer, Multiplot, Insight, Association, Variable selection • MODIFY: Transform variable, Filter outliers, Clustering, SOM / Kohonen • MODEL: Regression, Tree, Neural Network, Ensemble • ASSESS: Assessment, Score, Report
  • 25. Fayyad’s KDD Methodology. KDD: knowledge discovery in databases. Data → (Selection) → Target data → (Preprocessing & cleaning) → Processed data → (Transformation & feature selection) → Transformed data → (Data Mining) → Patterns → (Interpretation / Evaluation) → Knowledge. Reproduced from: maastrichtuniversity.nl lecture notes
  • 26. CRISP-DM Methodology. CRISP-DM: Cross-industry standard process for data mining. • Business understanding: business objective, assess situation, data mining goals, project plan • Data understanding: collect data, describe data, explore data, verify data quality • Data preparation: select data, clean data, construct data, integrate data, format data • Modeling: select modeling techniques, design the test, build model, assess model • Evaluation: evaluate results, review process, determine next steps • Deployment: plan deployment, plan monitoring and maintenance, final report, review project
  • 27. CRISP-DM. What is CRISP-DM? • Cross Industry Standard Process for Data Mining (CRISP-DM) is a methodology that describes the approach used in tackling data mining problems. [http://www.crisp-dm.org/] • CRISP-DM allows data analytics practitioners to follow a systematic process in generating an analytics solution that is: 1. Well-understood 2. Well-planned 3. Well-executed 4. Well-documented
  • 28. General Data-Mining Process. The data-mining process comprises the following steps: • Data Preparation: Data Sampling (extract a sample of data that is relevant to the business problem under consideration) and Data Preparation (manipulate the data to put it in a form suitable for formal modeling). • Model Construction: Apply the appropriate data-mining technique (e.g. k-means, classification trees) to accomplish the desired data-mining task (prediction, classification, clustering, etc.). • Model Assessment: Evaluate models by comparing performance on appropriate data sets, then decide on the champion model.
  • 29. Analytics Framework in a Nutshell: 1. Frame a sharp question to be answered (i.e. the business question) 2. Identify the data and prepare it 3. Create models to answer the question 4. Interpret and rationalise the results 5. Consolidate findings and tell a story (i.e. present findings)
  • 30. Data Data Data Everywhere. Image: https://www.freepik.com/free-vector/sketchy-robot_794262.htm
  • 31. Data Understanding & Quality. Before any analytics adventure, the analyst must have a clear understanding of the data: • What each field/variable means • Where the data came from • When the data was saved (i.e. data frequency and latency) • How the data was created or collected. Quality of data is critical: no quality data, no quality results (e.g. duplicate data may cause incorrect or misleading statistics).
  • 32. Data Preparation. Major tasks in data preparation: 1. Data cleaning 2. Data integration 3. Data transformation 4. Data reduction. Expansion of tasks: • Sampling: select a representative subset from a large population of data • Outlier data: investigate and treat the data appropriately • Missing data: investigate and have strategies to handle this issue • Normalisation or standardisation of data
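The missing-data and outlier tasks listed on the Data Preparation slide can be sketched in a few lines. This is only an illustration: the slide does not prescribe a strategy, so median imputation and the 1.5 x IQR rule are assumptions, and the `incomes` values are made up.

```python
import statistics

def impute_missing(values):
    """Replace None with the median of the known values (one possible strategy)."""
    known = [v for v in values if v is not None]
    median = statistics.median(known)
    return [median if v is None else v for v in values]

def flag_outliers(values):
    """Return values lying more than 1.5 x IQR outside the quartiles (a rule of thumb)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in values if v < low or v > high]

incomes = [30, 32, None, 35, 31, 33, 400]  # hypothetical data, in thousands
filled = impute_missing(incomes)
outliers = flag_outliers(filled)
print(filled)
print(outliers)
```

Flagged values still need investigation before any treatment, as the slide notes: an extreme value may be a data-entry error or a genuine but rare case.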
  • 33. Data Preparation: Preparing data for analytics work is very time consuming. At least 70% of the time in an analytics project will be spent on data understanding, cleaning and preparation. Image source: https://pixabay.com/en/pie-chart-pacman-portion-shape-27359/
  • 34. Supervised Learning: Make a Prediction. Image: https://www.freepik.com/index.php?goto=74&idfoto=3043535
  • 35. Supervised Learning. Predictive Analytics (PA): • Finding a pattern in historical data so that an opportunity or outcome can be identified before it occurs. • PA is a form of supervised learning, where a target (i.e. the data we want to predict) is required. • A supervised learning algorithm analyses the historical (i.e. training) data and produces an inferred function, which can be used for mapping new examples (i.e. making predictions).
  • 36. Two Prediction Types • Decision predictions: a predictive model uses input measurements to make the best decision for each case (e.g. primary, secondary, tertiary). • Estimate predictions: a predictive model uses input measurements to optimally estimate the target value (e.g. 0.65, 0.33, 0.75, 0.28, 0.54).
  • 37. Predictive Modeling Overview: The data is split into training data and testing data. Training data creates the models (Model A, Model B, Model C, Model D); test data tests them. In the example, Model D is the champion model.
  • 38. Data Partitioning. The full dataset is split 70% / 15% / 15% into the dataset for modeling and the dataset to assess the model. • This partitioning distribution is a rule of thumb. • Generally, the training dataset is bigger than the validation dataset, and the test dataset is smaller than the modeling dataset.
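A minimal sketch of the 70/15/15 rule of thumb from the Data Partitioning slide. It assumes the 70% share is training, with 15% each for validation and test; the function name `partition` and the fixed seed are illustrative choices.

```python
import random

def partition(records, train=0.70, validate=0.15, seed=42):
    """Shuffle the records, then cut them into training, validation and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * validate)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])   # whatever remains is the test set

train_set, valid_set, test_set = partition(range(100))
print(len(train_set), len(valid_set), len(test_set))
```

Shuffling before cutting avoids a split that simply follows the order in which the records were stored.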
  • 39. The Curse of Dimensionality. Figure panels: 1-D, 2-D, 3-D
  • 40. Model Complexity. Figure panels: Too flexible, Just right
  • 41. Model Performance Assessment and Selection. Chart labels: Training Data, Validation Data, Model Complexity, Validation Assessment. Select the simplest model with the highest validation assessment.
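The selection rule on the slide above ("the simplest model with the highest validation assessment") can be sketched as follows. The model names, complexity values and validation scores are hypothetical.

```python
def pick_champion(candidates):
    """candidates: (name, complexity, validation_score) tuples.
    Keep the best validation score, then break ties with the lowest complexity."""
    best_score = max(score for _, _, score in candidates)
    tied = [c for c in candidates if c[2] == best_score]
    return min(tied, key=lambda c: c[1])

candidates = [  # hypothetical validation results
    ("tree_depth_2", 2, 0.81),
    ("tree_depth_4", 4, 0.86),
    ("tree_depth_6", 6, 0.86),
    ("tree_depth_8", 8, 0.84),
]
champion = pick_champion(candidates)
print(champion[0])
```

In practice two scores are rarely exactly equal, so a tolerance could be used when deciding which models count as tied.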
  • 42. Confusion Matrix Rates • Accuracy: Overall, how often is the classifier correct? (TP+TN)/(TP+TN+FP+FN) • Misclassification Rate or Error Rate: Overall, how often is the classifier wrong? (FP+FN)/(TP+TN+FP+FN), equivalent to 1 minus Accuracy • Sensitivity, Recall, or True Positive Rate: When it's actually YES, how often does it predict YES? TP/(TP+FN) • Specificity: When it's actually NO, how often does it predict NO? TN/(TN+FP) • Precision: When it predicts YES, how often is it correct? TP/(TP+FP) • Prevalence: How often does the YES condition actually occur in our sample? (TP+FN)/(TP+TN+FP+FN)
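The confusion matrix rates above translate directly into code. This sketch uses made-up counts; `confusion_rates` is an illustrative name, not part of any library.

```python
def confusion_rates(tp, tn, fp, fn):
    """Compute the rates from the slide from the four confusion matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,   # same as 1 - accuracy
        "sensitivity": tp / (tp + fn),     # recall / true positive rate
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "prevalence": (tp + fn) / total,
    }

rates = confusion_rates(tp=100, tn=50, fp=10, fn=5)  # hypothetical counts
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```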
  • 43. Supervised Learning. Determining the target’s datatype is important, as it will affect the choice of algorithms. The target can be: • Classification: binary or multiclass • Regression. Model assessment is dependent on the type of target on hand. Assessment can be: • Classification: binary (Confusion Matrix) or multiclass (F1 score [1]) • Regression: RMSE [2]. [1] F1 score is not covered in the SDBA programme. [2] The root mean square error (RMSE) metric is not covered in the SDBA programme.
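Since the F1 score and RMSE are named on the slide but not covered in the programme, a short sketch of their standard definitions may help. The sample numbers are made up; for a multiclass target, one common approach (an assumption here, not stated on the slide) is to compute F1 per class and average the results.

```python
import math

def rmse(actual, predicted):
    """Root mean square error between actual and predicted numeric targets."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall for one class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(rmse([3.0, 5.0], [1.0, 5.0]))    # hypothetical regression predictions
print(f1_score(tp=100, fp=10, fn=5))   # hypothetical counts for one class
```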
  • 44. Algorithms: Models are created from… algorithms. Image: https://www.freepik.com/index.php?goto=74&idfoto=2782996
  • 45. Supervised Learning. Decision Trees Algorithm • Decision trees can be used to predict a categorical or a continuous target (called regression trees in the latter case). • Unlike logistic regression and neural networks, no equations are estimated in decision trees. • A tree structure of rules over the input variables is used to classify or predict the cases according to the target variable. • The rules are of an IF-THEN form, for example: If Risk = Low, then predict on-time payment of a loan.
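A fitted decision tree is, in effect, a nested set of IF-THEN rules. The sketch below hand-codes such rules to show the form. Only the first rule comes from the slide; the income rule, the threshold and the field names are hypothetical.

```python
def predict_payment(applicant):
    """Walk a tiny hand-written rule tree and return the predicted class."""
    if applicant["risk"] == "Low":          # rule from the slide
        return "on-time"
    if applicant["income"] >= 50_000:       # hypothetical second split
        return "on-time"
    return "late"

print(predict_payment({"risk": "Low", "income": 20_000}))
print(predict_payment({"risk": "High", "income": 30_000}))
```

A tree algorithm learns the split variables and thresholds from the training data rather than having them written by hand.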
  • 46. Supervised Learning. Algorithm: Regression (Logistic Regression) • Regression is the attempt to explain the variation in a dependent variable using the variation in independent variables. • If the independent variables sufficiently explain the variation in the dependent variable, the model can be used for prediction. • There are many important research topics for which the dependent variable is "limited", for example whether or not a person smokes, or whether a fraud is committed. For these, the outcome is not continuous or normally distributed. • Logistic regression is a type of regression analysis where the dependent variable is a dummy variable: coded 0 (did not smoke) or 1 (did smoke).
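To make the 0/1 dependent variable concrete, this sketch shows how a fitted logistic regression turns inputs into a probability through the logistic (sigmoid) function. The coefficients and inputs are invented for illustration, not estimated from data.

```python
import math

def predict_probability(inputs, weights, intercept):
    """Logistic regression prediction: sigmoid of a weighted sum of the inputs."""
    z = intercept + sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for two inputs (e.g. parent smokes, peer smokes).
probability = predict_probability(inputs=[1, 1], weights=[0.8, 1.5], intercept=-2.0)
label = 1 if probability >= 0.5 else 0   # 1 = did smoke, 0 = did not smoke
print(round(probability, 3), label)
```

The 0.5 cut-off is a common default; a different threshold can be chosen when the two kinds of error have different costs.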
  • 47. Supervised Learning. Algorithm: Neural Networks • Neural networks are exceptionally good at pattern recognition tasks that are very difficult to program using conventional techniques. • Programs that employ neural nets are also capable of learning on their own and adapting to changing conditions. • Neural network pattern recognition can be achieved by using the backpropagation algorithm. The algorithm searches for weight values that minimize the total error of the network over the set of training examples (i.e. the training set).
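The idea of searching for weights that minimize the error over the training set can be shown with a deliberately small sketch: a single sigmoid neuron trained by gradient descent on the logical OR function. This is a simplification, since full backpropagation applies the same kind of update layer by layer through a multi-layer network; the learning rate and epoch count are arbitrary choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_neuron(samples, learning_rate=0.5, epochs=5000):
    """Adjust weights and bias to reduce the squared error on each training example."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            output = sigmoid(bias + sum(w * x for w, x in zip(weights, inputs)))
            # Gradient of 0.5 * (output - target)^2 with respect to the weighted sum.
            delta = (output - target) * output * (1.0 - output)
            weights = [w - learning_rate * delta * x for w, x in zip(weights, inputs)]
            bias -= learning_rate * delta
    return weights, bias

or_gate = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]  # training set
weights, bias = train_neuron(or_gate)
predictions = [
    round(sigmoid(bias + sum(w * x for w, x in zip(weights, inputs))))
    for inputs, _ in or_gate
]
print(predictions)
```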
  • 48. Min-Max Normalization • Min/Max normalization to [0,1] • Min/Max normalization to [-1,1] (where 0 is the central point). [Figure: example values mapped onto each range]
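Min-max normalization rescales each value linearly so that the smallest maps to the lower bound and the largest to the upper bound. A minimal sketch covering both ranges named on the slide, with made-up `ages` values:

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so min(values) -> new_min and max(values) -> new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot rescale: all values are identical")
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]

ages = [20, 30, 40, 60]            # hypothetical input column
print(min_max(ages))               # rescaled to [0, 1]
print(min_max(ages, -1.0, 1.0))    # rescaled to [-1, 1]
```

Because the minimum and maximum drive the scaling, a single extreme outlier can squeeze all other values into a narrow band, which is one reason outliers are investigated first.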
  • 49. Choosing Champion Model • Models created using various algorithms will invariably produce different results. • Model assessment is required to determine which of the many models created is the champion model. • A ROC chart can be used to determine the champion. Other model assessment measurements can also be used (e.g. Confusion Matrix, RMSE).
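One common way to compare models on a ROC chart is the area under the curve (AUC), which equals the probability that a randomly chosen positive case is scored higher than a randomly chosen negative one. The sketch below computes AUC that way; the labels and the two models' scores are invented, and using AUC as the deciding measure is an assumption rather than something the slide specifies.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

labels = [1, 1, 0, 1, 0, 0]                       # known targets in the test data
model_scores = {                                  # hypothetical predicted probabilities
    "model_a": [0.9, 0.8, 0.7, 0.6, 0.3, 0.2],
    "model_b": [0.6, 0.4, 0.7, 0.5, 0.3, 0.8],
}
results = {name: auc(labels, scores) for name, scores in model_scores.items()}
champion = max(results, key=results.get)
print(results, champion)
```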
  • 50. Recap: Supervised Learning • Training data includes both the input (i.e. independent variables) and the desired results (i.e. dependent variable or target). • Predictive models are constructed using the training data. • Testing data includes both the input and the known target. • A model’s results on the test data will ascertain its predictive prowess. • A good model will be able to generalise: it will give correct results when given new input data for which the target is unknown.
  • 51. Machine Learning Algorithms. Source: https://s3.amazonaws.com/MLMastery/MachineLearningAlgorithms.png?__s=yxwb9fsmnfj72ypjei1f
  • 53. Unsupervised Learning: “Tell me what you see”. Image: https://www.freepik.com/index.php?goto=74&idfoto=945899
  • 54. Unsupervised Learning • The model is not provided with the correct results (i.e. target) during training. In other words, there is no target to aim for. • The aim is to explore the data to find some intrinsic structure in it. • The model is the outcome of statistical or mathematical calculations only. • Interpretation of the results from unsupervised learning is still done by humans. • Unlike supervised learning, there are no correct answers (i.e. no target to compare against). Algorithms are left to their own devices to discover and present the interesting structure in the data for humans to interpret.
  • 55. Unsupervised Learning. Algorithm: Association Analysis • Association Rule: Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction. A collection of items that occur together is called an itemset. • Rule Evaluation Metrics: Support and Confidence calculations give an indication of the strength of a rule for an itemset. • A commonly used algorithm for association analysis is Apriori.
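Support and confidence can be computed directly from a list of transactions. A minimal sketch with an invented set of market baskets: support is the share of transactions containing an itemset, and confidence of a rule A → B is support(A and B together) divided by support(A).

```python
transactions = [  # hypothetical market baskets
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= basket for basket in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """How often the consequent appears in transactions that contain the antecedent."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"diapers", "beer"}))        # support of the itemset
print(confidence({"diapers"}, {"beer"}))   # confidence of the rule diapers -> beer
```

Apriori speeds this up on large datasets by only counting itemsets whose subsets already meet a minimum support.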
  • 56. Unsupervised Learning. Algorithm: Cluster Analysis • Cluster analysis is used to segment (i.e. group) data objects without any instructions or target. • Data objects within a group are similar (or related) to one another and different from (or unrelated to) the data objects in other groups. • Cluster analysis constructs a partition of a set of n records into a set of k clusters. • Each record belongs to exactly one cluster. • The number of clusters k is given in advance. • A commonly used algorithm for clustering is k-means.
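A bare-bones k-means sketch shows the two alternating steps: assign each record to its nearest centroid, then move each centroid to the mean of its cluster. The points, the fixed iteration count and the random starting centroids are illustrative assumptions; library implementations add convergence checks and smarter initialisation.

```python
import math
import random

def k_means(points, k, iterations=20, seed=0):
    """Partition points into k clusters; each point belongs to exactly one cluster."""
    centroids = random.Random(seed).sample(points, k)  # k given in advance
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster
            else centroids[i]                       # keep the centroid of an empty cluster
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]  # hypothetical 2-D records
centroids, clusters = k_means(points, k=2)
print(clusters)
```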
  • 57. Data Mining Tools: Beyond the module demonstrations
  • 59. Machine Learning • Data Mining/Predictive Analytics is a subset of Machine Learning. • Machine learning is a field of computer science that gives computers the ability to learn without being explicitly programmed.[1] [1] Samuel, Arthur (1959). "Some Studies in Machine Learning Using the Game of Checkers"
  • 60. Data Mining • Data mining is about automating the process of searching for patterns in the data. • Two types of machine learning: supervised and unsupervised. • In supervised learning, a good model will be able to generalise: it will give correct results when given new input data for which the target is unknown. • In unsupervised learning, interpretation of the results is still done by humans.
  • 61. Proof is in the Pudding • A model is only as good as its test results (i.e. from model assessment). • A model must give better predictions than the population’s probability to be useful. • The best model is one that stands the test after deployment to the real world.
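One way to read "better than the population's probability" is to compare a model against a baseline that always predicts the most common class. The sketch below takes that interpretation, which is an assumption, and uses invented numbers.

```python
from collections import Counter

def baseline_accuracy(targets):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = Counter(targets)
    return max(counts.values()) / len(targets)

targets = ["no"] * 90 + ["yes"] * 10   # hypothetical test targets: 10% respond "yes"
model_accuracy = 0.88                  # hypothetical model result
baseline = baseline_accuracy(targets)
is_useful = model_accuracy > baseline
print(baseline, is_useful)
```

Here a model with 88% accuracy looks strong in isolation but does worse than simply predicting "no" for everyone, so it adds no value by this measure.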
  • 62. The Analytics Landscape: The Big Picture View. Image: https://www.freepik.com/index.php?goto=74&idfoto=2783060
  • 63. Analytics Use within 3 Years. Source: Operationalizing and Embedding Analytics for Action by Fern Halper. TDWI Research.
  • 64. Transform with Predictive Insights. Source: SAP (www.sap.com/predictive)
  • 65. An Analytics Architecture
  • 66. An Analytics Architecture
  • 67. The Analytics Challenges. Source: Operationalizing and Embedding Analytics for Action by Fern Halper. TDWI Research.
  • 68. Conclusion and Reflection: What is the future of data analytics?
  • 69. Real-World Predictive Analytics in Action • Why smart statistics are the key to fighting crime, by Anne Milgram at TED@BCG: https://www.youtube.com/watch?v=ZJNESMhIxQ0 • What is the Cambridge Analytica scandal?, by The Guardian: https://www.youtube.com/watch?v=Q91nvbJSmS4
  • 70. The Analytics Challenges. Source: https://mashable.com/2017/04/27/man-tweets-pie-charts/