From Raw Data to Deployed Product. Fast & Agile with CRISP-DM
Michał Łopuszyński
AnalyticsConf, Gdańsk, 2016.11.15

  1. From Raw Data to Deployed Product. Fast & Agile with CRISP-DM
     Michał Łopuszyński
     AnalyticsConf, Gdańsk, 2016.11.15
  2. About me
     • I work at ICM UW
     • Supercomputing centre, weather forecast, virtual library, open science platform, visualization solutions, ...
     • Involved in modelling and data analysis projects from cosmology, medicine, bioinformatics, quantum chemistry, biophysics, fluid dynamics, materials science, social network analysis, ...
     • Our group = Applied Data Analysis Lab
     • Automatic information extraction from PDFs
     • Text-mining in scientific literature
     • Variety of application projects (analysis of court judgments, aviation, deploying solutions on the big data stack Spark/Hadoop, trainings)
     adalab.icm.edu.pl
  3. Introduction
  4. Data Product – Components
     [Diagram labels: Problem, Data, Modelling, Process, Metrics, End User, Exposition, Data Product]
  6. What is CRISP-DM?
     • Cross Industry Standard Process for Data Mining
     • Developed in 1996 by big players in data analysis
         • SPSS, Teradata, Daimler, OHRA, NCR
     • I follow "CRISP-DM 1.0 Step-by-step data mining guide"
     • Most popular methodology for data-centric projects
         • See KDNuggets Polls (2007, 2014)
         • Runner-up: SEMMA
     • I find it agile
         • Introduces almost no overhead
         • Emphasizes adaptive transitions between project phases
     [Figure: CRISP-DM cycle around DATA: Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation, Deployment]
  7. Business Understanding
  8. Business Understanding
     • Determine business objectives
     • Assess situation
         • Resources (data!), risks, costs & benefits
     • Determine data mining goals
         • Ideally with quantitative success criteria
     • Develop project plan
         • Estimate timeline, budget, but also tools and techniques
     [Figure: CRISP-DM cycle diagram]
  9. Business Understanding
     • Difficult!
     • Often, you have to enter a new field
     • You have to explain data science limitations to non-experts
         • No, performance will not be 100%
         • We need much more data to train an accurate model
         • For tomorrow, it is impossible
     [Figure: xkcd comic. Source: http://xkcd.com/1425]
  10. Business Understanding – my DOs and DON'Ts
      • Have a lot of patience for vaguely defined problems
      • Do not waste your time on ill-defined, unrealistic projects
      • Learn to concretize or even reduce the scope of the initial idea
      • Data sample
      • Real-life use cases
      • Quantitative success metrics
      • Try to talk as much as possible with domain experts
  11. Data Understanding
  12. Data Understanding
      • Collect initial data
      • Describe data
          • Persist results
      • Explore data
          • Persist results
      • Verify data quality
          • Carefully document problems and issues found!
      [Figure: CRISP-DM cycle diagram]
  13. Data Understanding – Validate Everything
      <judgement id="...">
        <date>3013-12-04 00:00:00.0 CET</date>
        <publicationDate>2014-07-23 02:52:17.0 CEST</publicationDate>
        <courtId>15250000</courtId>
        <departmentId>503</departmentId>
        <chairman>Małgorzata ...</chairman>
        <judges>
          <judge>Małgorzata ...</judge>
        </judges>
        ...
      </judgement>
      <judgement id="...">
        <date>2012-10-01 00:00:00.0 CEST</date>
        <publicationDate>2014-12-31 18:15:05.0 CET</publicationDate>
        <courtId>15450500</courtId>
        <departmentId>6027</departmentId>
        <judges>
          <judge>Piotr ...</judge>
          <judge>wskazał</judge>
          <judge>czego wymaga art. 17a ust. 2 ustawy</judge>
          ...
        </judges>
      </judgement>
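The two records above contain a judgment dated in the year 3013 and sentence fragments stored as judge names. A minimal sketch of automated checks that would flag both follows. The sample data, the placeholder names, and the name heuristic are assumptions made for illustration; they are not the validation rules used in the original project.

```python
import re
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical sample modelled on the slide; ids and full names are placeholders.
SAMPLE = """
<judgements>
  <judgement id="a">
    <date>3013-12-04 00:00:00.0 CET</date>
    <publicationDate>2014-07-23 02:52:17.0 CEST</publicationDate>
    <judges><judge>Anna Kowalska</judge></judges>
  </judgement>
  <judgement id="b">
    <date>2012-10-01 00:00:00.0 CEST</date>
    <publicationDate>2014-12-31 18:15:05.0 CET</publicationDate>
    <judges>
      <judge>Piotr Nowak</judge>
      <judge>wskazał</judge>
      <judge>czego wymaga art. 17a ust. 2 ustawy</judge>
    </judges>
  </judgement>
</judgements>
"""

# Assumed heuristic: a name is two or more words made of letters only.
NAME_RE = re.compile(r"^[^\W\d_]+(?:[ -][^\W\d_]+)+$")


def parse_day(text):
    """Parse only the leading YYYY-MM-DD part; the time and zone are ignored."""
    return datetime.strptime(text[:10], "%Y-%m-%d").date()


def looks_like_name(text):
    """Letters-only words, at least two of them, each starting with a capital."""
    if not NAME_RE.match(text):
        return False
    return all(word[0].isupper() for word in re.split(r"[ -]", text))


def validate(judgement):
    """Return a list of human-readable problems found in one record."""
    problems = []
    day = parse_day(judgement.findtext("date"))
    published = parse_day(judgement.findtext("publicationDate"))
    if day > published:
        problems.append(f"judgment date {day} is after publication date {published}")
    for judge in judgement.iter("judge"):
        name = (judge.text or "").strip()
        if not looks_like_name(name):
            problems.append(f"suspicious judge entry: {name!r}")
    return problems


root = ET.fromstring(SAMPLE)
report = {j.get("id"): validate(j) for j in root.iter("judgement")}
for judgement_id, problems in report.items():
    print(judgement_id, problems)
```

Checks like these are crude, but they are cheap to run over the whole data set and turn isolated surprises into counts that can be discussed with the data owner.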
  14. Data Understanding – Spot Anomalies
      [Figure: histogram of a certain smooth quantity measured using "precise equipment"]
      Explanation: effect of the human interface between precise equipment & db
  15. Data Understanding – Spot Anomalies
      [Figure: secondary school examination (Matura) score distribution from Polish. Source: CKE Materials, Matura 2012]
      Exploratory data analysis can reveal imperfections of the conducted experiment
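One anomaly that can arise when humans sit between an instrument and a database is heaping: readings get rounded to convenient values. The sketch below assumes this is the kind of effect meant, which the slides do not state explicitly. It simulates readings with one decimal place and uses a chi-square statistic on the last digit to flag the rounded series; all data is simulated and the 40% rounding rate is arbitrary.

```python
import random
from collections import Counter


def last_digit_chi2(values):
    """Chi-square statistic of last-digit counts against a uniform distribution.

    Assumes values were recorded with exactly one decimal place.
    """
    counts = Counter(int(round(v * 10)) % 10 for v in values)
    expected = len(values) / 10
    return sum((counts.get(d, 0) - expected) ** 2 / expected for d in range(10))


rng = random.Random(0)

# Simulated "instrument" readings with one decimal place.
clean = [round(rng.gauss(50.0, 5.0), 1) for _ in range(5000)]

# Simulated human transcription: about 40% of readings are rounded to a whole number.
heaped = [float(round(v)) if rng.random() < 0.4 else v for v in clean]

# 21.67 is the 1% critical value of the chi-square distribution with 9 degrees of freedom.
CRITICAL = 21.67

chi2_clean = last_digit_chi2(clean)
chi2_heaped = last_digit_chi2(heaped)
print(f"clean:  {chi2_clean:.1f}")
print(f"heaped: {chi2_heaped:.1f} (flagged: {chi2_heaped > CRITICAL})")
```

A plain histogram with fine bins shows the same effect visually; the statistic is only useful when many columns have to be screened automatically.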
  16. Data Understanding – my DOs and DON'Ts
      • Do not trust data quality estimates provided by your customer
      • Verify, as far as you can, whether your data is correct, complete, coherent, deduplicated, representative, independent, up-to-date, stationary
      • Understand anomalies, outliers, missing data
      • Do not economize on this phase
      • The earlier you discover issues with your data the better (yes, your data will have issues!)
      • Data understanding leads to domain understanding; it will pay off in the modelling phase
      • Investigate what sort of processing was applied to the raw data
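A few items from the checklist above (complete, deduplicated, up-to-date) can be measured mechanically. The following is a minimal sketch; the records, the field names, and the one-year staleness threshold are hypothetical.

```python
from collections import Counter
from datetime import date

# Hypothetical records; field names are illustrative only.
records = [
    {"id": 1, "value": 3.2, "updated": date(2016, 11, 1)},
    {"id": 2, "value": None, "updated": date(2016, 10, 30)},
    {"id": 2, "value": None, "updated": date(2016, 10, 30)},
    {"id": 3, "value": 2.9, "updated": date(2012, 1, 5)},
]


def quality_report(rows, key, today, max_age_days):
    """Summarise missing values, duplicated keys and stale rows."""
    n = len(rows)
    missing = Counter(f for r in rows for f, v in r.items() if v is None)
    key_counts = Counter(r[key] for r in rows)
    return {
        "rows": n,
        "missing_share": {f: c / n for f, c in missing.items()},
        "duplicated_keys": sorted(k for k, c in key_counts.items() if c > 1),
        "stale_keys": sorted(
            {r[key] for r in rows if (today - r["updated"]).days > max_age_days}
        ),
    }


report = quality_report(records, key="id", today=date(2016, 11, 15), max_age_days=365)
print(report)
```

Properties such as representativeness, independence, and stationarity need domain knowledge and cannot be reduced to a report like this.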
  17. Data Preparation
  18. Data Preparation
      • Select data
      • Clean data
      • Construct data
          • Generate derived attributes
      • Integrate data
          • Merge information from different sources
      • Format data
          • Convert to a format convenient for modelling
      [Figure: CRISP-DM cycle diagram]
  19. Data Preparation
      • Tedious!
      • Use workflow tools to document, automate & parallelize data prep.
          • Make, Drake
          • Oozie, Azkaban, Luigi, Airflow, ...
      [Figure: workflow graph with nodes classification-jsonl, data-aux/class-riffle, data-clean/joind-jsonl, data-aux/metad-riffle, data-aux/priis-json, data-aux/prinf-json, stat/basic, stat/basic-fp7, stat/collab, metadata-jsonl, projects-from-iis-jsonl, projects-from-infspace-jsonl, metadata-extracted-jsonl]
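The core idea shared by the workflow tools named above is a dependency graph: each target declares what it is built from, and the tool rebuilds only what is affected by a change, in a valid order. The sketch below illustrates that idea with Python's standard `graphlib` module (Python 3.9+). The pipeline stages are hypothetical and are not the ones shown in the slide's graph, and this is not the API of Make, Drake, or any other listed tool.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: target -> set of targets it is built from.
DEPENDENCIES = {
    "raw": set(),
    "clean": {"raw"},
    "features": {"clean"},
    "stats": {"clean"},
    "model_input": {"features", "stats"},
}


def plan(dependencies, changed):
    """Return the targets to rebuild (including the changed ones), in a valid order."""
    order = list(TopologicalSorter(dependencies).static_order())
    dirty = set(changed)
    for target in order:
        # A target is dirty if anything it is built from is dirty.
        if dependencies[target] & dirty:
            dirty.add(target)
    return [t for t in order if t in dirty]


print(plan(DEPENDENCIES, changed={"features"}))
print(plan(DEPENDENCIES, changed={"raw"}))
```

Real workflow tools add what this sketch leaves out: timestamps or checksums to detect changes, parallel execution of independent targets, and logging.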
  20. Data Preparation
      • Data understanding and preparation will usually consume half or more of your project time!
      [Figure: two charts; recoverable labels: "What % of time in your data mining project(s) is spent on data cleaning and preparation?", "Percentage of responses", "Percentage of time"]
      Source: KDNuggets Poll 2003
      Source: M. A. Munson, A Study on the Importance of and Time Spent on Different Modeling Steps, ACM SIGKDD Explorations Newsletter 13, 65-71 (2011)
  21. Data Preparation – my DOs and DON'Ts
      • Use workflow tools to help you with the above
      • Prepare your customer for the fact that data understanding and preparation take a considerable amount of time
      • Automate this phase as far as possible
      • When merging multiple sources, track the provenance of your data
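Tracking provenance while merging can be as simple as storing, next to every merged field, the name of the source that supplied it. The sketch below shows one possible way to do this; the source names, the record fields, and the "first source wins" precedence rule are assumptions for illustration.

```python
def merge_with_provenance(sources):
    """Merge records by key and remember which source supplied each field.

    `sources` is an ordered list of (source_name, {key: record}) pairs.
    Earlier sources take precedence; missing (None) values never win.
    """
    merged = {}
    for source_name, records in sources:
        for key, record in records.items():
            target = merged.setdefault(key, {"values": {}, "provenance": {}})
            for field, value in record.items():
                if field not in target["values"] and value is not None:
                    target["values"][field] = value
                    target["provenance"][field] = source_name
    return merged


# Hypothetical sources and fields, for illustration only.
registry = {"p1": {"title": "Project One", "budget": None}}
crawl = {
    "p1": {"title": "project one", "budget": 120000},
    "p2": {"title": "Project Two"},
}

merged = merge_with_provenance([("registry", registry), ("crawl", crawl)])
print(merged["p1"])
print(merged["p2"])
```

With the provenance stored, a suspicious value found later in modelling can be traced back to the source that produced it.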
  22. Modelling
  23. Modelling
      • Select modelling technique
          • Assumptions, measure of accuracy
      • Generate test design
      • Build model
          • Feature eng., optimize model parameters
      • Assess model
          • Iterate the above
      [Figure: CRISP-DM cycle diagram]
  24. Modelling – Tooling Selection
      • Where will your model be deployed?
      • Do you need to distribute your computations? (avoid!)
      • Should I use a general purpose language?
          • Breadth = performance, lots of general purpose libraries and tooling, easy creation of web services
      • Should I use a data analysis language?
          • Depth = easy data manipulation, latest models and statistical techniques available
      • Can I afford a prototype?
      [Figure: C++, Java, C#, R, Matlab, Mathematica, Python, Scala, Clojure, F# placed on two axes: Breadth (quality of general purpose tooling) and Depth (quality of data analysis tooling)]
  25. Modelling – Resist the Hype
      • "We have to use X for this project! X is the best software/method/technology ever!"
          • X = Hadoop, Spark, Deep Learning, NoSQL, HBase, XGBoost, Cloud
      • Be adventurous, but also critical when it comes to technology/method choice! None is a silver bullet for everything!
      • ROI, not the hype, should drive your choices!
  26. Modelling – my DOs and DON'Ts
      • Develop your model with deployment conditions in mind
      • Allocate time for hyperparameter optimization
      • Whenever possible, peek inside your model and consult a domain expert about it
      • Assess feature importance
      • Run your model on simulated data
      • Be creative with your features (feature engineering)
      • Esp. from textual data or time series you can generate a lot of standard features
      • Make a conscious decision about missing data (NAs) and outliers (regression!)
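Running a model on simulated data means generating data from a known process and checking that the fitted model recovers it. The sketch below does this for a straight-line fit with ordinary least squares, then adds a handful of gross outliers to show why the outlier decision matters for regression. The true coefficients, noise level, and outlier size are arbitrary choices for illustration.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


rng = random.Random(42)
TRUE_SLOPE, TRUE_INTERCEPT = 2.5, -1.0

# Simulated data from a known linear process with Gaussian noise.
xs = [rng.uniform(0, 10) for _ in range(2000)]
ys = [TRUE_SLOPE * x + TRUE_INTERCEPT + rng.gauss(0, 1) for x in xs]

slope, intercept = fit_line(xs, ys)
print(f"clean data:    slope={slope:.2f} intercept={intercept:.2f}")

# Same data with 20 gross outliers (1% of the rows): the fitted line moves.
ys_outliers = ys[:]
for i in range(20):
    ys_outliers[i] += 500
slope_out, intercept_out = fit_line(xs, ys_outliers)
print(f"with outliers: slope={slope_out:.2f} intercept={intercept_out:.2f}")
```

If a model cannot recover parameters it was given by construction, there is little reason to trust what it reports on real data.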
  27. Evaluation
  28. Evaluation
      • Evaluate results
          • Business success criteria fulfilled?
      • Review process
      • Determine next steps
          • To deploy or not to deploy?
      [Figure: CRISP-DM cycle diagram]
  29. Evaluation – watch out for overfitting & leakage
      • Overfitting & leakage are lethal dangers for every model
      • Overfitting = learning too much from data
          • Well-known danger, a lot of techniques to avoid it (cross-validation, regularization, early stopping, ...)
      • Data leakage = artificially injecting parts of the solution into the input data
          • Hard to define precisely, best understood by example
              • Using parts of training set in test set
              • Time series: mixing past and future
              • Meaningful identifiers
          • Much lower awareness, not many techniques to avoid it
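The "mixing past and future" example of leakage can be made concrete with a small sketch: a random train/test split of time-stamped data lets the model train on observations newer than those it is tested on, while a chronological split does not. The data below is synthetic and the helper names are hypothetical.

```python
import random
from datetime import date, timedelta

# Hypothetical daily observations: (timestamp, feature, target).
start = date(2016, 1, 1)
rows = [(start + timedelta(days=i), float(i % 7), float(i)) for i in range(100)]


def chronological_split(rows, train_share=0.8):
    """Train on the oldest observations, test on the newest ones."""
    ordered = sorted(rows, key=lambda r: r[0])
    cut = int(len(ordered) * train_share)
    return ordered[:cut], ordered[cut:]


def random_split(rows, train_share=0.8, seed=0):
    """Shuffle before splitting, ignoring the time order."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_share)
    return shuffled[:cut], shuffled[cut:]


def mixes_past_and_future(train, test):
    """True if some training row is newer than the oldest test row."""
    return max(r[0] for r in train) > min(r[0] for r in test)


print("chronological split leaks:", mixes_past_and_future(*chronological_split(rows)))
print("random split leaks:       ", mixes_past_and_future(*random_split(rows)))
```

A check like `mixes_past_and_future` is cheap enough to keep as an assertion in the evaluation code itself.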
  30. Evaluation – watch out for overfitting & leakage
      A good overview of the leakage problem is presented in this paper.
  31. Evaluation – my DOs and DON'Ts
      • Work with the performance criteria dictated by your customer's business model
      • Assess not only performance, but also practical aspects related to deployment, for example:
          • Training and prediction speed
          • Robustness and maintainability (tooling, dependence on other subsystems, library vs. homegrown code)
      • Keep in mind the dreadful modelling dangers: leakage & overfitting
      • Consider pre-deployment (à la paper trading) as a part of the evaluation strategy
      • Remember the "too good to be true" principle (useful, but crude filter)
  32. Deployment
  33. Deployment
      • Plan deployment
      • Plan monitoring and maintenance
      • Produce final report
      • Review project
          • Collect lessons learned!
      [Figure: CRISP-DM cycle diagram]
  34. Deployment – my DOs and DON'Ts
      Read this paper for excellent insights!
  35. Summary
      [Figure: CRISP-DM cycle around DATA: Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation, Deployment]
  36. Thank you! Questions?
      @lopusz