CRISP-DM
Agile Approach to Data Mining Projects
Michał Łopuszyński
Warsaw Data Science Meetup, 2016.06.07
About me
• I work at ICM UW
• Our group = Applied Data Analysis Lab
• Supercomputing centre, weather forecast, virtual library, open science platform, visualization solutions, ...
• Involved in modelling and data analysis projects from cosmology, medicine, bioinformatics, quantum chemistry, biophysics, fluid dynamics, materials science, social network analysis, ...
• Automatic information extraction from PDFs
• Text-mining in scientific literature
• Variety of application projects (analysis of court judgments, aviation, deploying solutions on the big data stack Spark/Hadoop, trainings)
• adalab.icm.edu.pl
What is CRISP-DM?
• Cross Industry Standard Process for Data Mining
• Developed in 1996 by big players in data analysis: SPSS, Teradata, Daimler, OHRA, NCR
• I follow "CRISP-DM 1.0 Step-by-step data mining guide"
[Diagram: the CRISP-DM cycle around DATA, with the phases Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation, Deployment]
• Most popular methodology for data-centric projects
• See KDNuggets Polls (2007, 2014)
• Runner-up: SEMMA
• I find it agile
• Introduces almost no overhead
• Emphasizes adaptive transitions between project phases
Business Understanding
• Determine business objectives
• Assess situation: resources (data!), risks, costs & benefits
• Determine data mining goals: ideally with quantitative success criteria
• Develop project plan: estimate timeline, budget, but also tools and techniques
Business Understanding
• Difficult!
• Often, you have to enter a new field
• You have to explain data science limitations to non-experts
Source: http://xkcd.com/1425
• No, performance will not be 100%
• We need much more data to train an accurate model
• For tomorrow, it is impossible
Business Understanding – my DOs and DON'Ts
• Have a lot of patience for vaguely defined problems
• Do not waste your time on ill-defined, unrealistic projects
• Learn to concretize or even reduce the scope of the initial idea
• Data sample
• Real-life use cases
• Quantitative success metrics
Data Understanding
• Collect initial data
• Describe data: persist results
• Explore data: persist results
• Verify data quality: carefully document problems and issues found!
Data Understanding – Validate Everything
<judgement id="...">
<date>3013-12-04 00:00:00.0 CET</date>
<publicationDate>2014-07-23 02:52:17.0 CEST</publicationDate>
<courtId>15250000</courtId>
<departmentId>503</departmentId>
<chairman>Małgorzata ...</chairman>
<judges>
<judge>Małgorzata ...</judge>
</judges>
...
</judgement>
<judgement id="...">
<date>2012-10-01 00:00:00.0 CEST</date>
<publicationDate>2014-12-31 18:15:05.0 CET</publicationDate>
<courtId>15450500</courtId>
<departmentId>6027</departmentId>
<judges>
<judge>Piotr ...</judge>
<judge>wskazał</judge>
<judge>czego wymaga art. 17a ust. 2 ustawy</judge>
...
</judges>
</judgement>
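The two problems visible in the records above (a judgment dated in the year 3013, and text fragments stored as judge names) can be caught by simple automated checks. The following is a minimal sketch, not the author's tooling: the sample document, the wrapping <judgements> element, the name heuristic, and the function names are all assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET
from datetime import date

# Hypothetical sample modelled on the records above (names are invented).
SAMPLE = """
<judgements>
  <judgement id="a">
    <date>3013-12-04 00:00:00.0 CET</date>
    <publicationDate>2014-07-23 02:52:17.0 CEST</publicationDate>
    <judges><judge>Anna Kowalska</judge></judges>
  </judgement>
  <judgement id="b">
    <date>2012-10-01 00:00:00.0 CEST</date>
    <publicationDate>2014-12-31 18:15:05.0 CET</publicationDate>
    <judges><judge>Piotr Nowak</judge><judge>wskazał</judge></judges>
  </judgement>
</judgements>
"""

# Assumed heuristic: a name consists of two or more words made of letters only.
NAME_RE = re.compile(r"^[^\W\d_]+(?:[ -][^\W\d_]+)+$")


def parse_day(text):
    """Read the leading YYYY-MM-DD part; time and zone are ignored."""
    return date.fromisoformat(text.strip()[:10])


def validate(judgement):
    """Return a list of human-readable problems found in one record."""
    problems = []
    issued = parse_day(judgement.findtext("date"))
    published = parse_day(judgement.findtext("publicationDate"))
    if issued > published:
        problems.append(f"date {issued} is after publication {published}")
    for judge in judgement.iter("judge"):
        name = (judge.text or "").strip()
        # Each word must also start with a capital letter.
        if not (NAME_RE.match(name) and name.istitle()):
            problems.append(f"suspicious judge name: {name!r}")
    return problems


def validate_all(xml_text):
    """Map each judgement id to the problems found in that record."""
    root = ET.fromstring(xml_text.strip())
    return {j.get("id"): validate(j) for j in root.iter("judgement")}
```

Calling validate_all(SAMPLE) flags the impossible date in record "a" and the stray word in record "b". Real data would need more rules, and each rule should be reviewed with someone who knows the source system.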
Data Understanding – Spot Anomalies
[Figure: histogram of a certain smooth quantity measured using "precise equipment"]
Explanation: the effect of a human interface between the precise equipment and the database
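The slide does not say which human effect shaped the histogram. One common case, assumed here purely for illustration, is digit preference: people typing readings by hand tend to round to 0 or 5. A minimal sketch of a check for it follows; the function names, the precision, and the threshold factor are hypothetical.

```python
from collections import Counter


def last_digit_shares(values, decimals=1):
    """Share of each final recorded digit at the given decimal precision."""
    digits = [round(abs(v) * 10 ** decimals) % 10 for v in values]
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(10)}


def heaped_digits(values, decimals=1, factor=2.0):
    """Digits seen at least `factor` times more often than the uniform 10%."""
    shares = last_digit_shares(values, decimals)
    return sorted(d for d, share in shares.items() if share >= factor * 0.1)


# Invented readings, recorded to one decimal place.
readings = [36.6, 36.5, 37.0, 36.0, 36.5, 37.5, 38.0, 36.5, 36.4, 37.0]
suspicious = heaped_digits(readings)
```

For a truly smooth quantity, the final digits should be spread roughly evenly, so a strong surplus of some digits is a hint to ask how the values reached the database.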
Data Understanding – Spot Anomalies
Secondary school examination (Matura) score distribution from Polish
Exploratory data analysis can reveal imperfections of the conducted experiment
Source: CKE Materials, Matura 2012
Data Understanding – my DOs and DON'Ts
• Do not trust data quality estimates provided by your customer
• Verify, as far as you can, that your data is correct, complete, coherent, deduplicated, representative, independent, up-to-date, stationary
• Understand anomalies and outliers
• Do not economize on this phase
• The earlier you discover issues with your data, the better (yes, your data will have issues!)
• Data understanding leads to domain understanding; it will pay off in the modelling phase
• Investigate what sort of processing was applied to the raw data
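Two of the properties listed above, completeness and deduplication, are easy to measure mechanically. The sketch below is one possible starting point under stated assumptions: records are plain dictionaries, and the field names and sample rows are invented for illustration.

```python
from collections import Counter


def quality_report(rows, key_fields):
    """Count missing values per field and duplicated keys in a list of dicts."""
    fields = sorted({field for row in rows for field in row})
    missing = {
        field: sum(1 for row in rows if row.get(field) in (None, ""))
        for field in fields
    }
    keys = Counter(tuple(row.get(f) for f in key_fields) for row in rows)
    duplicates = {key: n for key, n in keys.items() if n > 1}
    return {"rows": len(rows), "missing": missing, "duplicates": duplicates}


# Invented rows: one empty court, one missing date, one repeated id.
rows = [
    {"id": 1, "court": "A", "date": "2012-10-01"},
    {"id": 2, "court": "", "date": "2013-01-15"},
    {"id": 2, "court": "B", "date": None},
]
report = quality_report(rows, key_fields=["id"])
```

Properties such as representativeness or stationarity cannot be checked this simply and usually need domain knowledge.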
Data Preparation
• Select data
• Clean data
• Construct data: generate derived attributes
• Integrate data: merge information from different sources
• Format data: convert to a format convenient for modelling
Data Preparation
• Tedious!
• Use workflow tools to document, automate & parallelize data prep.
• Make, Drake
• Oozie, Azkaban, Luigi, Airflow, ...
[Figure: workflow dependency graph with the targets classification-jsonl, data-aux/class-riffle, data-clean/joind-jsonl, data-aux/metad-riffle, data-aux/priis-json, data-aux/prinf-json, stat/basic, stat/basic-fp7, stat/collab, metadata-jsonl, projects-from-iis-jsonl, projects-from-infspace-jsonl, metadata-extracted-jsonl]
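The core idea shared by the workflow tools named above is to declare which step depends on which, and let the tool derive a valid execution order. The sketch below shows only that idea, using Python's standard graphlib module (Python 3.9+). The step names are hypothetical, and unlike Make or Airflow it neither skips up-to-date steps nor runs anything in parallel.

```python
from graphlib import TopologicalSorter

# Hypothetical steps: each key depends on the steps listed in its set.
DEPENDENCIES = {
    "raw": set(),
    "clean": {"raw"},
    "metadata": {"raw"},
    "joined": {"clean", "metadata"},
    "stats": {"joined"},
}


def run_pipeline(dependencies, actions):
    """Run each step after all of its dependencies; return the order used."""
    order = list(TopologicalSorter(dependencies).static_order())
    for step in order:
        actions[step]()
    return order


# Stand-in actions that only record that they ran.
log = []
actions = {name: (lambda n=name: log.append(n)) for name in DEPENDENCIES}
order = run_pipeline(DEPENDENCIES, actions)
```

Writing the dependencies down as data also documents the preparation process, which is one of the benefits the slide mentions.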
Data Preparation
• Data understanding and preparation will usually consume half or more of your project time!
[Figure: two bar charts of the share of project time spent on data preparation. Poll question: "What % of time in your data mining project(s) is spent on data cleaning and preparation?" Axes: percentage of time, percentage of responses]
Source: M. A. Munson, A Study on the Importance of and Time Spent on Different Modeling Steps, ACM SIGKDD Explorations Newsletter 13, 65-71 (2011)
Source: KDNuggets Poll 2003
Data Preparation – my DOs and DON'Ts
• Prepare your customer for the fact that data understanding and preparation take a considerable amount of time
• Automate this phase as far as possible
• Use workflow tools to help you with the above
• When merging multiple sources, track provenance of your data
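Tracking provenance can be as simple as storing, next to every merged value, the name of the source it came from. The following is a minimal sketch under stated assumptions: records are dictionaries sharing a key field, earlier sources win on conflict, and the source names and fields are invented for illustration.

```python
def merge_with_provenance(sources, key):
    """Merge records from named sources, noting where each field came from.

    `sources` maps a source name to a list of dicts. Sources are processed
    in insertion order, and the first value seen for a field is kept.
    """
    merged = {}
    for source_name, records in sources.items():
        for record in records:
            entry = merged.setdefault(
                record[key], {"values": {}, "provenance": {}}
            )
            for field, value in record.items():
                if field not in entry["values"]:
                    entry["values"][field] = value
                    entry["provenance"][field] = source_name
    return merged


# Invented example with two hypothetical sources.
sources = {
    "source_a": [{"id": "p1", "title": "Alpha"}],
    "source_b": [
        {"id": "p1", "title": "ALPHA", "budget": 100},
        {"id": "p2", "title": "Beta"},
    ],
}
merged = merge_with_provenance(sources, key="id")
```

With the provenance kept, a suspicious value found later can be traced back to the system that produced it.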
Modelling
• Select modelling technique: assumptions, measure of accuracy
• Generate test design
• Build model: feature eng., optimize model parameters
• Assess model: iterate the above
Modelling – Tooling Selection
• Where will your model be deployed?
• Do you need to distribute your computations? (avoid!)
• Should I use a general purpose language? Breadth = performance, lots of general purpose libraries and tooling, easy creation of web services
• Should I use a data analysis language? Depth = easy data manipulation, latest models and statistical techniques available
• Can I afford a prototype?
[Figure: the languages C++, Java, C#, R, Matlab, Mathematica, Python, Scala, Clojure, F# placed on two axes: Breadth (quality of general purpose tooling) and Depth (quality of data analysis tooling)]
Modelling – my DOs and DON'Ts
• Develop your model with deployment conditions in mind
• Allocate time for hyperparameter optimization
• Whenever possible, peek inside your model and consult it with a domain expert
• Assess feature importance
• Run your model on simulated data
• Be creative with your features (feature engineering)
• Esp. from textual data or time series you can generate a lot of standard features
• Make a conscious decision about missing data (NAs) and outliers (regression!)
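Running a model on simulated data means generating data from a process whose parameters you know, and checking that the fitted model recovers them. The sketch below illustrates this with a deliberately simple model, a straight line fitted by ordinary least squares. The model choice, function names, and parameter values are assumptions for illustration, not taken from the talk.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def simulate(n, slope, intercept, noise, seed=0):
    """Draw n points from a known line with Gaussian noise added to y."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [slope * x + intercept + rng.gauss(0, noise) for x in xs]
    return xs, ys


# Known truth: slope 2.0, intercept -1.0. The fit should land close to it.
xs, ys = simulate(n=2000, slope=2.0, intercept=-1.0, noise=0.5)
fitted_slope, fitted_intercept = fit_line(xs, ys)
```

If the pipeline cannot recover parameters it was handed on clean simulated data, there is little reason to trust it on real data.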
Evaluation
• Evaluate results: business success criteria fulfilled?
• Review process
• Determine next steps: to deploy or not to deploy?
Evaluation – my DOs and DON'Ts
• Work with the performance criteria dictated by your customer's business model
• Assess not only performance, but also practical aspects related to deployment, for example:
  • Training and prediction speed
  • Robustness and maintainability (tooling, dependence on other subsystems, library vs. homegrown code)
• Watch out for data leakage, for example:
  • Time series – mixing past and future
  • Meaningful identifiers
  • Other nasty ways of artificially introducing extra information not available in production
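For the time-series case, one way to avoid mixing past and future is to split by time instead of at random, so that the model is always evaluated on data later than anything it was trained on. The sketch below assumes records are dictionaries with a sortable timestamp field; the function name and sample data are hypothetical.

```python
import random


def time_ordered_split(records, time_key, test_fraction=0.2):
    """Split so that no training record is later than any test record."""
    ordered = sorted(records, key=lambda r: r[time_key])
    n_test = round(len(ordered) * test_fraction)
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]


# Invented records with ISO dates (which sort correctly as strings),
# shuffled to show that the input order does not matter.
records = [{"day": f"2016-01-{d:02d}", "value": d} for d in range(1, 11)]
random.Random(0).shuffle(records)
train, test = time_ordered_split(records, time_key="day")
```

A random shuffle-and-split of the same records would let the model see days that come after some of its test points, which production data never allows.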
Deployment
• Plan deployment
• Plan monitoring and maintenance
• Produce final report
• Review project: collect lessons learned!
Deployment – my DOs and DON'Ts
Read this paper, for excellent insights!
Thank you!
Questions?
@lopusz

CRISP-DM Agile Approach to Data Mining Projects

  • 1. CRISP-DM Agile Approach to Data Mining Projects Michał Łopuszyński Warsaw Data Science Meetup, 2016.06.07
  • 2. About me I work at ICM UW• Our group = Applied Data Analysis Lab• Supercomputing centre, weather forecast , virtual library, open science platform, visualization solutions, ... • Involved in modelling and data analysis projects from cosmology, medicine, bioinformatics, quantum chemistry, biophysics, fluid dynamics, materials science, social network analysis ... • Automatic information extraction from PDFs• Text-mining in scientific literature• Variety of application projects (analysis of court judgments, aviation, deploying solutions on the big data stack Spark/Hadoop, trainings) • About me adalab.icm.edu.pl
  • 3. What is CRISP-DM? Cross Industry Standard Process for Data Mining • SPSS, Teradata, Daimler, OCHRA, NCR Developed in 1996 by big players in data analysis • • I follow "CRISP-DM 1.0 Step-by-step data mining guide"• 01001110010101 011100100111000110 100101110101 100010011101001 10000000111000001 10000110110110 110000110010010001 DATA Business Understanding Data Understanding Data Preparation Modelling Evaluation Deployment Most popular methodology for data-centric projects See KDNuggets Polls• Runner-up SEMMA• I find it agile• Introduces almost no overhead• Emphasizes adaptive transitions between project phases • 2007, 2014
  • 4. Business Understanding Determine business objectives• Resources (data!), risks, costs & benefits Assess situation• Ideally with quantitative success criteria Determine data mining goals• Estimate time line, budget, but also tools and techniques Develop project plan• 01001110010101 011100100111000110 100101110101 100010011101001 10000000111000001 10000110110110 110000110010010001 DATA Business Understanding Data Understanding Data Preparation Modelling Evaluation Deployment
  • 5. Business Understanding Difficult!• Often, you have to enter a new field• You have to explain data science limitations to non-experts • Source: http://xkcd.com/1425 No, performance will not be 100%• We need much more data to train an accurate model • For tomorrow, it is impossible•
  • 6. Business Understanding – my DOs and DON'Ts: Have a lot of patience for vaguely defined problems. Do not waste your time on ill-defined, unrealistic projects. Learn to concretize or even reduce the scope of the initial idea: data sample, real-life use cases, quantitative success metrics.
  • 7. Data Understanding: Collect initial data. Describe data: persist results. Explore data: persist results. Verify data quality: carefully document problems and issues found! [Figure: CRISP-DM cycle, as on slide 3]
  • 8. Data Understanding – Validate Everything <judgement id="..."> <date>3013-12-04 00:00:00.0 CET</date> <publicationDate>2014-07-23 02:52:17.0 CEST</publicationDate> <courtId>15250000</courtId> <departmentId>503</departmentId> <chairman>Małgorzata ...</chairman> <judges> <judge>Małgorzata ...</judge> </judges> ... </judgement> <judgement id="..."> <date>2012-10-01 00:00:00.0 CEST</date> <publicationDate>2014-12-31 18:15:05.0 CET</publicationDate> <courtId>15450500</courtId> <departmentId>6027</departmentId> <judges> <judge>Piotr ...</judge> <judge>wskazał</judge> <judge>czego wymaga art. 17a ust. 2 ustawy</judge> ... </judges> </judgement>
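The two records on slide 8 show why every field deserves a sanity check: one carries a judgment date in the year 3013, the other lists sentence fragments as judges. A minimal sketch of such checks follows, assuming the records have already been parsed into dictionaries. The field names mirror the XML above; the function name `validate_judgment`, the plausible year range, the name heuristic and the judge names in the usage note are illustrative assumptions, not part of the talk.

```python
import re
from datetime import datetime

# Hypothetical heuristic: a judge name is 2-4 words made only of letters
# (hyphens allowed) and starts with a capital letter.
NAME_RE = re.compile(r"^[^\W\d_]+(?:[ -][^\W\d_]+){1,3}$")


def validate_judgment(record, min_year=1990, max_year=2016):
    """Return a list of human-readable problems found in one record."""
    problems = []
    # Dates look like "2012-10-01 00:00:00.0 CEST"; only the date part is checked.
    date = datetime.strptime(record["date"][:10], "%Y-%m-%d")
    if not min_year <= date.year <= max_year:
        problems.append(f"implausible judgment year: {date.year}")
    published = datetime.strptime(record["publicationDate"][:10], "%Y-%m-%d")
    if published < date:
        problems.append("published before the judgment date")
    for judge in record.get("judges", []):
        if not NAME_RE.match(judge) or not judge[0].isupper():
            problems.append(f"suspicious judge name: {judge!r}")
    return problems
```

Called on a record with `"date": "3013-12-04 00:00:00.0 CET"`, the function reports both the implausible year and the publication that precedes the judgment. Called on a record whose judges include `"wskazał"`, it flags that entry as a suspicious name. Real data would need a more forgiving name rule (titles, initials), so treat the regular expression as a starting point.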
  • 9. Data Understanding – Spot Anomalies: Histogram of a certain smooth quantity measured using "precise equipment". Explanation: the effect of the human interface between the precise equipment and the database.
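If the anomaly comes from people rounding values while typing them into the database (one plausible reading of this slide), a simple check is to count the last recorded digit of each measurement. For a smooth quantity the ten digits should be roughly equally frequent, so an excess of 0s and 5s hints at manual rounding. The sketch below is illustrative only: the function name `last_digit_share`, the 0.3 threshold and the sample readings are hypothetical.

```python
from collections import Counter


def last_digit_share(values, decimals=1):
    """Share of each final recorded digit, for values stored with `decimals` places."""
    digits = [int(round(abs(v) * 10 ** decimals)) % 10 for v in values]
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(10)}


# Hypothetical readings: most were rounded by hand to .0 or .5
readings = [36.5, 37.0, 36.5, 38.0, 36.6, 37.5, 36.0, 37.0, 36.5, 37.2]
share = last_digit_share(readings)
# Flag digits that occur far more often than the expected 10%.
suspicious = {d: s for d, s in share.items() if s > 0.3}
```

On real data a chi-squared test against the uniform distribution would be a more principled threshold than a fixed cut-off.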
  • 10. Data Understanding – Spot Anomalies: Secondary school examination (Matura) score distribution from Polish. Exploratory data analysis can reveal imperfections of the conducted experiment. Source: CKE Materials, Matura 2012
  • 11. Data Understanding – my DOs and DON'Ts: Do not trust data quality estimates provided by your customer. Verify, as far as you can, whether your data is correct, complete, coherent, deduplicated, representative, independent, up-to-date, and stationary. Understand anomalies and outliers. Do not economize on this phase. The earlier you discover issues with your data, the better (yes, your data will have issues!). Data understanding leads to domain understanding; it will pay off in the modelling phase. Investigate what sort of processing was applied to the raw data.
  • 12. Data Preparation: Select data. Clean data. Construct data: generate derived attributes. Integrate data: merge information from different sources. Format data: convert to a format convenient for modelling. [Figure: CRISP-DM cycle, as on slide 3]
  • 13. Data Preparation: Tedious! Use workflow tools to document, automate & parallelize data prep.: Make, Drake, Oozie, Azkaban, Luigi, Airflow, ... [Figure: example workflow graph with nodes classification-jsonl, data-aux/class-riffle, data-clean/joind-jsonl, data-aux/metad-riffle, data-aux/priis-json, data-aux/prinf-json, stat/basic, stat/basic-fp7, stat/collab, metadata-jsonl, projects-from-iis-jsonl, projects-from-infspace-jsonl, metadata-extracted-jsonl]
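The core idea shared by the workflow tools named on this slide is that each preparation step declares what it depends on, and the tool derives a valid execution order. A minimal sketch of that idea using Python's standard `graphlib` module is shown below. The step names and their dependencies are hypothetical and are not taken from the workflow graph on the slide; real tools add caching, retries and parallel execution on top of this.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical pipeline: each step maps to the set of steps it depends on.
steps = {
    "clean": {"raw"},
    "metadata": {"raw"},
    "joined": {"clean", "metadata"},
    "stats": {"joined"},
}


def run_order(dependencies):
    """Return the steps in an order that respects all dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


order = run_order(steps)
```

Steps that do not depend on each other (here `clean` and `metadata`) may appear in either order, which is exactly where a workflow tool can run them in parallel.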
  • 14. Data Preparation: Data understanding and preparation will usually consume half or more of your project time! [Figure: survey results; axes: percentage of time, percentage of responses; poll question: "What % of time in your data mining project(s) is spent on data cleaning and preparation?"] Sources: M. A. Munson, A Study on the Importance of and Time Spent on Different Modeling Steps, ACM SIGKDD Explorations Newsletter 13, 65-71 (2011); KDnuggets Poll 2003
  • 15. Data Preparation – my DOs and DON'Ts: Automate this phase as far as possible. Use workflow tools to help you with the above. When merging multiple sources, track the provenance of your data. Prepare your customer for the fact that data understanding and preparation take a considerable amount of time.
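Tracking provenance when merging can be as simple as storing, next to every field, the name of the source that supplied it. The sketch below shows one possible approach, assuming each source is a list of dictionaries sharing an `id` key. The function name `merge_with_provenance`, the `_provenance` field and the source names in the usage note are hypothetical.

```python
def merge_with_provenance(sources):
    """Merge records from several sources by id, remembering each field's origin.

    `sources` maps a source name to a list of dict records with an "id" key.
    The first source to supply a field wins; disagreements are reported
    as (id, field, first_source, conflicting_source) tuples.
    """
    merged, conflicts = {}, []
    for source_name, records in sources.items():
        for record in records:
            target = merged.setdefault(
                record["id"], {"id": record["id"], "_provenance": {}}
            )
            for field, value in record.items():
                if field == "id":
                    continue
                if field not in target:
                    target[field] = value
                    target["_provenance"][field] = source_name
                elif target[field] != value:
                    first_source = target["_provenance"][field]
                    conflicts.append((record["id"], field, first_source, source_name))
    return merged, conflicts
```

For example, merging `{"source_a": [{"id": 1, "title": "A"}], "source_b": [{"id": 1, "title": "B", "year": 2014}]}` keeps the title from `source_a`, takes the year from `source_b`, and reports the title disagreement so it can be reviewed rather than silently overwritten.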
  • 16. Modelling: Select modelling technique: assumptions, measure of accuracy. Generate test design. Build model: feature eng., optimize model parameters. Assess model: iterate the above. [Figure: CRISP-DM cycle, as on slide 3]
  • 17. Modelling – Tooling Selection: Where will your model be deployed? Do you need to distribute your computations? (Avoid!) Should I use a general purpose language? Breadth = performance, lots of general purpose libraries and tooling, easy creation of web services. Should I use a data analysis language? Depth = easy data manipulation, latest models and statistical techniques available. Can I afford a prototype? [Figure: languages C++, Java, C#, R, Matlab, Mathematica, Python, Scala, Clojure, F# placed on two axes: breadth (quality of general purpose tooling) and depth (quality of data analysis tooling)]
  • 18. Modelling – my DOs and DON'Ts: Develop your model with deployment conditions in mind. Allocate time for hyperparameter optimization. Whenever possible, peek inside your model and consult a domain expert about it. Assess feature importance. Run your model on simulated data. Be creative with your features (feature engineering); especially from textual data or time series you can generate a lot of standard features. Make a conscious decision about missing data (NAs) and outliers (regression!).
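Running a model on simulated data means generating data from a process whose parameters you know and checking that the fitting procedure recovers them. The sketch below does this for the simplest possible case, a straight line fitted by ordinary least squares. It is an illustration under assumptions, not the speaker's workflow: the function names, the true parameters (slope 2, intercept 1) and the noise level are all hypothetical choices.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def simulate(n, slope, intercept, noise, seed=0):
    """Generate data from a known linear relationship plus Gaussian noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [slope * x + intercept + rng.gauss(0, noise) for x in xs]
    return xs, ys


# Known ground truth: slope 2.0, intercept 1.0
xs, ys = simulate(n=200, slope=2.0, intercept=1.0, noise=0.1)
est_slope, est_intercept = fit_line(xs, ys)
```

If the estimates land far from the known truth, the problem lies in the modelling code or its assumptions rather than in the customer's data. That distinction is hard to make when only real data is available.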
  • 19. Evaluation: Evaluate results: business success criteria fulfilled? Review process. Determine next steps: to deploy or not to deploy? [Figure: CRISP-DM cycle, as on slide 3]
  • 20. Evaluation – my DOs and DON'Ts: Work with the performance criteria dictated by your customer's business model. Assess not only performance but also practical aspects related to deployment, for example: training and prediction speed; robustness and maintainability (tooling, dependence on other subsystems, library vs. homegrown code). Watch out for data leakage, for example: time series (mixing past and future); meaningful identifiers; other nasty ways of artificially introducing extra information that is not available in production.
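For the time series case, the usual guard against mixing past and future is to split by time instead of at random, so the model is always evaluated on observations newer than anything it was trained on. The sketch below shows one way to do this; the function name `time_ordered_split`, the `day` field and the 20% test fraction are hypothetical.

```python
def time_ordered_split(rows, time_key, test_fraction=0.2):
    """Split rows so that no test row is older than any training row."""
    ordered = sorted(rows, key=lambda row: row[time_key])
    cut = len(ordered) - round(len(ordered) * test_fraction)
    return ordered[:cut], ordered[cut:]


# Hypothetical observations, deliberately out of chronological order
rows = [{"day": d, "value": d * 1.5} for d in (7, 2, 9, 4, 1, 10, 3, 8, 6, 5)]
train, test = time_ordered_split(rows, time_key="day")
```

A random shuffle of the same rows would put some late days in the training set and some early days in the test set. That inflates the measured performance compared with what the model can achieve in production.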
  • 21. Deployment: Plan deployment. Plan monitoring and maintenance. Produce final report. Review project: collect lessons learned! [Figure: CRISP-DM cycle, as on slide 3]
  • 22. Deployment – my DOs and DON'Ts Read this paper, for excellent insights!