From Lab to Factory
Lessons learned developing Data Products
Keynote PyCon Colombia
Feb 2017
peadarcoyle@googlemail.com
All opinions my own
Who am I?
Data Scientist at Elevate (Recruitment Startup) and Author
@springcoil
About 5 years in Machine Learning/Data Science
OSS contributor (mostly PyMC3)
Built Ad-hoc analysis, Data Products and Data Pipelines at
Why talk about this?
Data Moats
Roadmap
What is Data Science?
Type of Data Scientists
Deploying Data Science (tech/culture)
Success stories and recommendations
Bigger idea (Change)
What IS a Data Scientist?
There's a Data Science Spectrum
HT: Sean J. Taylor (Facebook Research Scientist)
Data Science to Value
Data doesn't do anything!
People need to make decisions (strategic/ tactical)
Or the product needs to alter someone's decision-making (Spotify
Discover Weekly, Amazon Recommendations, etc.)
(Taken from Josh Wills talk - Lab to Factory 2014)
or https://hbr.org/2013/04/two-departments-for-data-succe
Expectations and reality
Data is the new oil...
But. Data isn't a product. It needs to be turned into one!
Data Science is hard
Data and technical problems
Skill set (emotional and tech)
Machine Learning Ops
Data Science projects are risky!
Many stakeholders think that data science is just an engineering problem,
but research is high risk and high reward
De-risking the project - how? Send me examples :)
http://www.martingoodson.com/ten-ways-your-data-project-is-going-to-fail/
https://github.com/ianozsvald/data_science_delivered
Success with Data Science
1. People
2. Ideas
3. Things
Source: USAF
R and D needs cultural support
Your culture needs to avoid the HiPPO (highest paid
person's opinion) to be data-informed or data-driven.
Good cultures
Data democratization
Clear objectives and metrics against objectives
Financial metrics don't always win - sometimes brand wins
See http://bit.ly/cultureanddata by Martin Goodson
Some data scientists
do experiments and
build prototypes
Production
(HT: The Yhat people - www.yhathq.com)
Deploying Data Products is hard
Monitoring and Alerting
Deployment
Real world data is tricky (training/ testing)
Interpretability (explaining fraud detection/ad tech)
Feature engineering
Models decay over time
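One way to picture the monitoring and model-decay points above is a rolling accuracy check on labelled production predictions. This is only a minimal sketch: the class name, window size and alert threshold are hypothetical, and a real system would also have to cope with labels that arrive late.

```python
from collections import deque


class RollingAccuracyMonitor:
    """Track accuracy over the most recent `window` labelled predictions."""

    def __init__(self, window: int, alert_below: float) -> None:
        # Each entry is True where the prediction matched the actual label.
        self.outcomes = deque(maxlen=window)
        self.alert_below = alert_below

    def record(self, predicted, actual) -> None:
        self.outcomes.append(predicted == actual)

    def accuracy(self) -> float:
        if not self.outcomes:
            return float("nan")
        return sum(self.outcomes) / len(self.outcomes)

    def should_alert(self) -> bool:
        # Only alert once the window is full, to avoid noisy early alarms.
        full = len(self.outcomes) == self.outcomes.maxlen
        return full and self.accuracy() < self.alert_below


# Made-up stream of (predicted, actual) pairs for illustration.
monitor = RollingAccuracyMonitor(window=4, alert_below=0.75)
for predicted, actual in [(1, 1), (0, 0), (1, 1), (1, 1)]:
    monitor.record(predicted, actual)
healthy = monitor.should_alert()

# Two later mistakes push the windowed accuracy below the threshold.
for predicted, actual in [(1, 0), (0, 1)]:
    monitor.record(predicted, actual)
decayed = monitor.should_alert()
```

A windowed metric is used here rather than an all-time average so that recent decay is not hidden by a long history of good predictions.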
What projects work?
Explain existing data (visualization!)
Automate repetitive/ slow processes
Augment data to make new data (Search engines, ML models)
Predict the future (do something more accurately than gut feel)
Simulate using statistics :) (Sports analytics models)
"You need data first" - Peadar Coyle
Copying and pasting PDF/PNG data
Getting data in some areas is hard!!
Some tools for web data extraction
Messy APIs without documentation :(
Visualise ALL THE THINGS!!
(Relay foods dataset - HT Greg Reda)
Consumer behaviour at a Fast Food Restaurant per year in the USA
Augmenting data and using APIs
Machine Learning
Production data/ Real world
HT: Andrew Ng
Everyone ETLs
Only 1% of your time will be spent modelling
Data pipelines and your infrastructure matter
Tools such as Spark, Luigi, Airflow or Dask are important
Monitoring of pipelines. Scalability. Machine provision
Eoin Brazil Talk
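The core idea shared by pipeline tools such as Luigi or Airflow is running tasks in dependency order. The sketch below illustrates only that idea with the Python standard library (`graphlib`, Python 3.9+). It is not the API of any of the tools named above, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical task names; each task maps to the set of tasks it depends on.
dependencies = {
    "extract_raw": set(),
    "clean": {"extract_raw"},
    "build_features": {"clean"},
    "train_model": {"build_features"},
    "report": {"clean"},
}


def run_pipeline(deps, run_task):
    """Run every task after all of its dependencies; return the order used."""
    order = list(TopologicalSorter(deps).static_order())
    for name in order:
        run_task(name)
    return order


# Here "running" a task just records its name; a real task would do work.
log = []
order = run_pipeline(dependencies, log.append)
```

Real pipeline tools add the parts this sketch leaves out, such as retries, scheduling, monitoring and skipping tasks whose outputs already exist.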
Success Story (1)
1. Ad Tech project (Media company) -
original process took 10 hours per week of analyst time.
2. Broke down process and produced simple predictive model
3. Data type challenges (unicode in Python 2.7) and business
expectations
4. Now takes less than 30 minutes each week.
Success Story (2)
Elevate is a Total Talent Management platform (tools for streamlining
recruitment)
Several models in production - microservices (recommenders, job type
prediction)
Data pipeline (Luigi) necessary for producing features
Lots of challenges about Docker, time, updated data.
Works: MyPy (Python 3.5), code review, testing - http://hypothesis.works/
Next: Moving to PySpark for ETL, more pair programming
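As a rough illustration of type hints plus property-style testing, here is a small annotated function with a randomized check. The function `normalise` is hypothetical, and the check uses the standard library's `random` module as a stand-in so the sketch runs without dependencies. Hypothesis, linked above, generates and shrinks such inputs automatically, and a checker such as mypy can verify the annotations.

```python
import random
from typing import List


def normalise(scores: List[float]) -> List[float]:
    """Scale non-negative scores so they sum to 1 (all zeros stay zeros)."""
    total = sum(scores)
    if total == 0:
        return [0.0 for _ in scores]
    return [s / total for s in scores]


def check_normalise(trials: int = 200, seed: int = 0) -> bool:
    """Randomized property check: output keeps its length and sums to ~1."""
    rng = random.Random(seed)
    for _ in range(trials):
        scores = [rng.uniform(0.0, 10.0) for _ in range(rng.randint(1, 20))]
        result = normalise(scores)
        if len(result) != len(scores):
            return False
        if sum(scores) > 0 and abs(sum(result) - 1.0) > 1e-9:
            return False
    return True


ok = check_normalise()
```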
Failure
1. Lack of support from tech teams
2. Lack of clearly defined customer
3. No culture of DevOps
4. Management had no clear vision
Lessons learned from Lab to Factory
1. Monitoring - data products need evaluation in production
2. Lack of a shared language between software engineers and data
scientists. Pair programming/ Code review are some solutions.
3. To help data scientists and analysts succeed your business needs to be
prepared to invest in tooling. (Pipelines, DevOps, Reproducible)
4. Often you're working with other teams who use different languages -
so micro services can be a good idea (example C#.net app and Python
microservice)
How to deploy a model?
Palladium (Otto Group)
Azure
Flask Microservice (DIY) and Docker
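A DIY model microservice can be very small. The sketch below uses the standard library's WSGI support rather than Flask, so it runs without extra dependencies. Flask applications are also WSGI applications, so the shape is similar. The weights, feature names and port are all hypothetical placeholders for a trained model and a real deployment.

```python
import io
import json
from wsgiref.simple_server import make_server

# Hypothetical "model": fixed weights standing in for a trained estimator.
WEIGHTS = {"years_experience": 0.5, "skills_matched": 2.0}


def predict(features: dict) -> float:
    """Score a feature dict with the placeholder linear model."""
    return sum(WEIGHTS[name] * float(features.get(name, 0.0)) for name in WEIGHTS)


def app(environ, start_response):
    """WSGI app: POST a JSON object of features, receive a JSON score."""
    try:
        length = int(environ.get("CONTENT_LENGTH") or 0)
        features = json.loads(environ["wsgi.input"].read(length) or b"{}")
        body = json.dumps({"score": predict(features)}).encode("utf-8")
        status = "200 OK"
    except (ValueError, TypeError, AttributeError):
        body = json.dumps({"error": "expected a JSON object of numbers"}).encode("utf-8")
        status = "400 Bad Request"
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(body)))])
    return [body]


def serve(host: str = "127.0.0.1", port: int = 8000) -> None:
    """Serve the app until interrupted (port chosen arbitrarily)."""
    with make_server(host, port, app) as server:
        server.serve_forever()


def call_app(payload):
    """Exercise the app in-process, without opening a socket."""
    raw = json.dumps(payload).encode("utf-8")
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    environ = {"CONTENT_LENGTH": str(len(raw)), "wsgi.input": io.BytesIO(raw)}
    chunks = app(environ, start_response)
    return captured["status"], json.loads(b"".join(chunks))


status, data = call_app({"years_experience": 4, "skills_matched": 3})
```

Keeping the model behind a plain HTTP and JSON boundary is what lets an application in another language call it, and the same app can be packaged in a Docker image for deployment.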
Dev Ops
We need each other!
Use small data where possible!!
Small problems with clean data are more important - (Ian Ozsvald)
Amazon machine with many Xeons and 244GB of RAM is less than 3
euros per hour. - (Ian Ozsvald)
Blaze, Xray, Dask, Ibis, etc etc -
"The mean size of a cluster will remain 1" - Matt Rocklin
PyData Bikeshed
Focus on the real problem
Closing remarks
Dirty data stops projects
Change happens, we need to help businesses adapt or be agile
There are some good projects like Icy, Luigi, etc for transforming data
and improving data extraction
Google paper on rules for building ML systems -
http://martin.zinkevich.org/rules_of_ml/rules_of_ml.pdf
Closing Remarks (2)
Stakeholder management is a challenge too
Send me your dirty data and data deployment stories :)
www.peadarcoyle.com
https://leanpub.com/interviewswithdatascientists
A New World
Simulate: Rugby games with MCMC
(PyMC3)
What is the Data Science process?
Obtain
Scrub
Explore
Model
Interpret
Communicate (or Deploy)
Some NLP on the Interviews!
What do Data Scientists talk about?
Based on my Dataconomy Interview series!
A famous 'data product' - Recommendation engines
Credits
http://cdn.yourarticlelibrary.com/wp-content/uploads/2013/12/080.jpg
https://www.flickr.com/photos/82066314@N06/7624218468/sizes/z/
From Lab to Factory: Creating value with data

Mais conteúdo relacionado

Mais procurados

H2O for Medicine and Intro to H2O in Python
H2O for Medicine and Intro to H2O in PythonH2O for Medicine and Intro to H2O in Python
H2O for Medicine and Intro to H2O in PythonSri Ambati
 
Fortune Teller API - Doing Data Science with Apache Spark
Fortune Teller API - Doing Data Science with Apache SparkFortune Teller API - Doing Data Science with Apache Spark
Fortune Teller API - Doing Data Science with Apache SparkBas Geerdink
 
Data science services YLS
Data science services YLSData science services YLS
Data science services YLSDima Semchuk
 
Ilkay Altintas: Kepler
Ilkay Altintas: KeplerIlkay Altintas: Kepler
Ilkay Altintas: KeplerDavid LeBauer
 
Agile Data Science
Agile Data ScienceAgile Data Science
Agile Data ScienceDhiana Deva
 
Experience Big Data Analytics use cases ranging from cancer research to IoT a...
Experience Big Data Analytics use cases ranging from cancer research to IoT a...Experience Big Data Analytics use cases ranging from cancer research to IoT a...
Experience Big Data Analytics use cases ranging from cancer research to IoT a...Fujitsu Middle East
 
Machine Learning Infrastructure
Machine Learning InfrastructureMachine Learning Infrastructure
Machine Learning InfrastructureSigOpt
 
Seeing at the Speed of Thought: Empowering Others Through Data Exploration
Seeing at the Speed of Thought: Empowering Others Through Data ExplorationSeeing at the Speed of Thought: Empowering Others Through Data Exploration
Seeing at the Speed of Thought: Empowering Others Through Data ExplorationGreg Goltsov
 
cyREST: Cytoscape as a Service
cyREST: Cytoscape as a ServicecyREST: Cytoscape as a Service
cyREST: Cytoscape as a ServiceKeiichiro Ono
 
Towards the Cytoscape Cyberinfrastructure
Towards the Cytoscape CyberinfrastructureTowards the Cytoscape Cyberinfrastructure
Towards the Cytoscape CyberinfrastructureKeiichiro Ono
 
Leveraging Open Source Automated Data Science Tools
Leveraging Open Source Automated Data Science ToolsLeveraging Open Source Automated Data Science Tools
Leveraging Open Source Automated Data Science ToolsDomino Data Lab
 
Data council sf amundsen presentation
Data council sf    amundsen presentationData council sf    amundsen presentation
Data council sf amundsen presentationTao Feng
 
Agile development of data science projects | Part 1
Agile development of data science projects | Part 1 Agile development of data science projects | Part 1
Agile development of data science projects | Part 1 Anubhav Dhiman
 
Agile data science
Agile data scienceAgile data science
Agile data scienceJoel Horwitz
 
Data Science Provenance: From Drug Discovery to Fake Fans
Data Science Provenance: From Drug Discovery to Fake FansData Science Provenance: From Drug Discovery to Fake Fans
Data Science Provenance: From Drug Discovery to Fake FansJameel Syed
 
Crowdsourced Data Processing: Industry and Academic Perspectives
Crowdsourced Data Processing: Industry and Academic PerspectivesCrowdsourced Data Processing: Industry and Academic Perspectives
Crowdsourced Data Processing: Industry and Academic PerspectivesAditya Parameswaran
 
H2O Machine Learning and Kalman Filters for Machine Prognostics
H2O Machine Learning and Kalman Filters for Machine PrognosticsH2O Machine Learning and Kalman Filters for Machine Prognostics
H2O Machine Learning and Kalman Filters for Machine PrognosticsSri Ambati
 
Intro to machine learning with scikit learn
Intro to machine learning with scikit learnIntro to machine learning with scikit learn
Intro to machine learning with scikit learnYoss Cohen
 

Mais procurados (20)

H2O for Medicine and Intro to H2O in Python
H2O for Medicine and Intro to H2O in PythonH2O for Medicine and Intro to H2O in Python
H2O for Medicine and Intro to H2O in Python
 
Fortune Teller API - Doing Data Science with Apache Spark
Fortune Teller API - Doing Data Science with Apache SparkFortune Teller API - Doing Data Science with Apache Spark
Fortune Teller API - Doing Data Science with Apache Spark
 
Data science services YLS
Data science services YLSData science services YLS
Data science services YLS
 
Ilkay Altintas: Kepler
Ilkay Altintas: KeplerIlkay Altintas: Kepler
Ilkay Altintas: Kepler
 
Agile Data Science
Agile Data ScienceAgile Data Science
Agile Data Science
 
Experience Big Data Analytics use cases ranging from cancer research to IoT a...
Experience Big Data Analytics use cases ranging from cancer research to IoT a...Experience Big Data Analytics use cases ranging from cancer research to IoT a...
Experience Big Data Analytics use cases ranging from cancer research to IoT a...
 
Machine Learning Infrastructure
Machine Learning InfrastructureMachine Learning Infrastructure
Machine Learning Infrastructure
 
Seeing at the Speed of Thought: Empowering Others Through Data Exploration
Seeing at the Speed of Thought: Empowering Others Through Data ExplorationSeeing at the Speed of Thought: Empowering Others Through Data Exploration
Seeing at the Speed of Thought: Empowering Others Through Data Exploration
 
cyREST: Cytoscape as a Service
cyREST: Cytoscape as a ServicecyREST: Cytoscape as a Service
cyREST: Cytoscape as a Service
 
Hadoop
HadoopHadoop
Hadoop
 
Agile Data Science
Agile Data ScienceAgile Data Science
Agile Data Science
 
Towards the Cytoscape Cyberinfrastructure
Towards the Cytoscape CyberinfrastructureTowards the Cytoscape Cyberinfrastructure
Towards the Cytoscape Cyberinfrastructure
 
Leveraging Open Source Automated Data Science Tools
Leveraging Open Source Automated Data Science ToolsLeveraging Open Source Automated Data Science Tools
Leveraging Open Source Automated Data Science Tools
 
Data council sf amundsen presentation
Data council sf    amundsen presentationData council sf    amundsen presentation
Data council sf amundsen presentation
 
Agile development of data science projects | Part 1
Agile development of data science projects | Part 1 Agile development of data science projects | Part 1
Agile development of data science projects | Part 1
 
Agile data science
Agile data scienceAgile data science
Agile data science
 
Data Science Provenance: From Drug Discovery to Fake Fans
Data Science Provenance: From Drug Discovery to Fake FansData Science Provenance: From Drug Discovery to Fake Fans
Data Science Provenance: From Drug Discovery to Fake Fans
 
Crowdsourced Data Processing: Industry and Academic Perspectives
Crowdsourced Data Processing: Industry and Academic PerspectivesCrowdsourced Data Processing: Industry and Academic Perspectives
Crowdsourced Data Processing: Industry and Academic Perspectives
 
H2O Machine Learning and Kalman Filters for Machine Prognostics
H2O Machine Learning and Kalman Filters for Machine PrognosticsH2O Machine Learning and Kalman Filters for Machine Prognostics
H2O Machine Learning and Kalman Filters for Machine Prognostics
 
Intro to machine learning with scikit learn
Intro to machine learning with scikit learnIntro to machine learning with scikit learn
Intro to machine learning with scikit learn
 

Destaque

Strux engels lezen week 26
Strux engels lezen week 26Strux engels lezen week 26
Strux engels lezen week 26Jos Begeman
 
MySQL 5.7 InnoDB 日本語全文検索
MySQL 5.7 InnoDB 日本語全文検索MySQL 5.7 InnoDB 日本語全文検索
MySQL 5.7 InnoDB 日本語全文検索yoyamasaki
 
[INFOGRAPHIC] Women in Leadership: Why It Matters
[INFOGRAPHIC] Women in Leadership: Why It Matters[INFOGRAPHIC] Women in Leadership: Why It Matters
[INFOGRAPHIC] Women in Leadership: Why It MattersThe Rockefeller Foundation
 
Tips to Handling Dental Emergencies - #PPT
Tips to Handling Dental Emergencies - #PPTTips to Handling Dental Emergencies - #PPT
Tips to Handling Dental Emergencies - #PPTTanya Altmann
 
Strategies to Support Open Educational Resources for Student Success: Case Ex...
Strategies to Support Open Educational Resources for Student Success: Case Ex...Strategies to Support Open Educational Resources for Student Success: Case Ex...
Strategies to Support Open Educational Resources for Student Success: Case Ex...Robin M. Ashford, MSLIS
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature EngineeringHJ van Veen
 
Qualité, bonnes pratiques et CMS - WordCamp Bordeaux - 18 mars 2017
Qualité, bonnes pratiques et CMS - WordCamp Bordeaux - 18 mars 2017Qualité, bonnes pratiques et CMS - WordCamp Bordeaux - 18 mars 2017
Qualité, bonnes pratiques et CMS - WordCamp Bordeaux - 18 mars 2017Elie Sloïm
 
Anomaly detection in real-time data streams using Heron
Anomaly detection in real-time data streams using HeronAnomaly detection in real-time data streams using Heron
Anomaly detection in real-time data streams using HeronArun Kejariwal
 
CSW2017 Henry li how to find the vulnerability to bypass the control flow gua...
CSW2017 Henry li how to find the vulnerability to bypass the control flow gua...CSW2017 Henry li how to find the vulnerability to bypass the control flow gua...
CSW2017 Henry li how to find the vulnerability to bypass the control flow gua...CanSecWest
 
The Ultimate Security Checklist While Launching Your Android App
The Ultimate Security Checklist While Launching Your Android AppThe Ultimate Security Checklist While Launching Your Android App
The Ultimate Security Checklist While Launching Your Android AppAppknox
 
10 Good Reasons - NetApp AltaVault
10 Good Reasons - NetApp AltaVault10 Good Reasons - NetApp AltaVault
10 Good Reasons - NetApp AltaVaultNetAppUK
 
Secure development environment @ Meet Magento Croatia 2017
Secure development environment @ Meet Magento Croatia 2017Secure development environment @ Meet Magento Croatia 2017
Secure development environment @ Meet Magento Croatia 2017Anna Völkl
 
スタートアップを陰ながら支えるときに心がけるべき5ヶ条
スタートアップを陰ながら支えるときに心がけるべき5ヶ条スタートアップを陰ながら支えるときに心がけるべき5ヶ条
スタートアップを陰ながら支えるときに心がけるべき5ヶ条Atsumi Kawashima
 
Diagnóstico SEO Técnico con Herramientas #TheInbounder
Diagnóstico SEO Técnico con Herramientas #TheInbounderDiagnóstico SEO Técnico con Herramientas #TheInbounder
Diagnóstico SEO Técnico con Herramientas #TheInbounderMJ Cachón Yáñez
 
Global Education and Skills Forum 2017 - Educating Global Citizens
Global Education and Skills Forum  2017 -  Educating Global CitizensGlobal Education and Skills Forum  2017 -  Educating Global Citizens
Global Education and Skills Forum 2017 - Educating Global CitizensEduSkills OECD
 
Can social media change health behavior?
Can social media change health behavior?Can social media change health behavior?
Can social media change health behavior?Iris Thiele Isip-Tan
 
一般的なチートの手法と対策について
一般的なチートの手法と対策について一般的なチートの手法と対策について
一般的なチートの手法と対策について優介 黒河
 
Cancer Care in a Post Truth World
Cancer Care in a Post Truth World Cancer Care in a Post Truth World
Cancer Care in a Post Truth World Matthew Katz
 

Destaque (20)

Strux engels lezen week 26
Strux engels lezen week 26Strux engels lezen week 26
Strux engels lezen week 26
 
MySQL 5.7 InnoDB 日本語全文検索
MySQL 5.7 InnoDB 日本語全文検索MySQL 5.7 InnoDB 日本語全文検索
MySQL 5.7 InnoDB 日本語全文検索
 
[INFOGRAPHIC] Women in Leadership: Why It Matters
[INFOGRAPHIC] Women in Leadership: Why It Matters[INFOGRAPHIC] Women in Leadership: Why It Matters
[INFOGRAPHIC] Women in Leadership: Why It Matters
 
Tips to Handling Dental Emergencies - #PPT
Tips to Handling Dental Emergencies - #PPTTips to Handling Dental Emergencies - #PPT
Tips to Handling Dental Emergencies - #PPT
 
Strategies to Support Open Educational Resources for Student Success: Case Ex...
Strategies to Support Open Educational Resources for Student Success: Case Ex...Strategies to Support Open Educational Resources for Student Success: Case Ex...
Strategies to Support Open Educational Resources for Student Success: Case Ex...
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature Engineering
 
Weld Strata talk
Weld Strata talkWeld Strata talk
Weld Strata talk
 
Qualité, bonnes pratiques et CMS - WordCamp Bordeaux - 18 mars 2017
Qualité, bonnes pratiques et CMS - WordCamp Bordeaux - 18 mars 2017Qualité, bonnes pratiques et CMS - WordCamp Bordeaux - 18 mars 2017
Qualité, bonnes pratiques et CMS - WordCamp Bordeaux - 18 mars 2017
 
Anomaly detection in real-time data streams using Heron
Anomaly detection in real-time data streams using HeronAnomaly detection in real-time data streams using Heron
Anomaly detection in real-time data streams using Heron
 
Let’s grow
Let’s growLet’s grow
Let’s grow
 
CSW2017 Henry li how to find the vulnerability to bypass the control flow gua...
CSW2017 Henry li how to find the vulnerability to bypass the control flow gua...CSW2017 Henry li how to find the vulnerability to bypass the control flow gua...
CSW2017 Henry li how to find the vulnerability to bypass the control flow gua...
 
The Ultimate Security Checklist While Launching Your Android App
The Ultimate Security Checklist While Launching Your Android AppThe Ultimate Security Checklist While Launching Your Android App
The Ultimate Security Checklist While Launching Your Android App
 
10 Good Reasons - NetApp AltaVault
10 Good Reasons - NetApp AltaVault10 Good Reasons - NetApp AltaVault
10 Good Reasons - NetApp AltaVault
 
Secure development environment @ Meet Magento Croatia 2017
Secure development environment @ Meet Magento Croatia 2017Secure development environment @ Meet Magento Croatia 2017
Secure development environment @ Meet Magento Croatia 2017
 
スタートアップを陰ながら支えるときに心がけるべき5ヶ条
スタートアップを陰ながら支えるときに心がけるべき5ヶ条スタートアップを陰ながら支えるときに心がけるべき5ヶ条
スタートアップを陰ながら支えるときに心がけるべき5ヶ条
 
Diagnóstico SEO Técnico con Herramientas #TheInbounder
Diagnóstico SEO Técnico con Herramientas #TheInbounderDiagnóstico SEO Técnico con Herramientas #TheInbounder
Diagnóstico SEO Técnico con Herramientas #TheInbounder
 
Global Education and Skills Forum 2017 - Educating Global Citizens
Global Education and Skills Forum  2017 -  Educating Global CitizensGlobal Education and Skills Forum  2017 -  Educating Global Citizens
Global Education and Skills Forum 2017 - Educating Global Citizens
 
Can social media change health behavior?
Can social media change health behavior?Can social media change health behavior?
Can social media change health behavior?
 
一般的なチートの手法と対策について
一般的なチートの手法と対策について一般的なチートの手法と対策について
一般的なチートの手法と対策について
 
Cancer Care in a Post Truth World
Cancer Care in a Post Truth World Cancer Care in a Post Truth World
Cancer Care in a Post Truth World
 

Semelhante a From Lab to Factory: Creating value with data

From Lab to Factory: Or how to turn data into value
From Lab to Factory: Or how to turn data into valueFrom Lab to Factory: Or how to turn data into value
From Lab to Factory: Or how to turn data into valuePeadar Coyle
 
The Python ecosystem for data science - Landscape Overview
The Python ecosystem for data science - Landscape OverviewThe Python ecosystem for data science - Landscape Overview
The Python ecosystem for data science - Landscape OverviewDr. Ananth Krishnamoorthy
 
Adi Wijaya - Scrum in Data Science, What Works and What Doesn’t
Adi Wijaya - Scrum in Data Science, What Works and What Doesn’tAdi Wijaya - Scrum in Data Science, What Works and What Doesn’t
Adi Wijaya - Scrum in Data Science, What Works and What Doesn’tAgile Impact Conference
 
Adi Wijaya - Scrum in Data Science, What Works and What Doesn’t
Adi Wijaya - Scrum in Data Science, What Works and What Doesn’tAdi Wijaya - Scrum in Data Science, What Works and What Doesn’t
Adi Wijaya - Scrum in Data Science, What Works and What Doesn’tAgile Impact
 
The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products Dataiku
 
Building successful data science teams
Building successful data science teamsBuilding successful data science teams
Building successful data science teamsVenkatesh Umaashankar
 
The book of elephant tattoo
The book of elephant tattooThe book of elephant tattoo
The book of elephant tattooMohamed Magdy
 
Data Workflows for Machine Learning - SF Bay Area ML
Data Workflows for Machine Learning - SF Bay Area MLData Workflows for Machine Learning - SF Bay Area ML
Data Workflows for Machine Learning - SF Bay Area MLPaco Nathan
 
How Data Virtualization Puts Machine Learning into Production (APAC)
How Data Virtualization Puts Machine Learning into Production (APAC)How Data Virtualization Puts Machine Learning into Production (APAC)
How Data Virtualization Puts Machine Learning into Production (APAC)Denodo
 
Big Data LDN 2018: HOW AUTOMATION CAN ACCELERATE THE DELIVERY OF MACHINE LEAR...
Big Data LDN 2018: HOW AUTOMATION CAN ACCELERATE THE DELIVERY OF MACHINE LEAR...Big Data LDN 2018: HOW AUTOMATION CAN ACCELERATE THE DELIVERY OF MACHINE LEAR...
Big Data LDN 2018: HOW AUTOMATION CAN ACCELERATE THE DELIVERY OF MACHINE LEAR...Matt Stubbs
 
The Right Data Warehouse: Automation Now, Business Value Thereafter
The Right Data Warehouse: Automation Now, Business Value ThereafterThe Right Data Warehouse: Automation Now, Business Value Thereafter
The Right Data Warehouse: Automation Now, Business Value ThereafterInside Analysis
 
Data science tools of the trade
Data science tools of the tradeData science tools of the trade
Data science tools of the tradeFangda Wang
 
Neurodb Engr245 2021 Lessons Learned
Neurodb Engr245 2021 Lessons LearnedNeurodb Engr245 2021 Lessons Learned
Neurodb Engr245 2021 Lessons LearnedStanford University
 
Data science presentation
Data science presentationData science presentation
Data science presentationMSDEVMTL
 
Data_and_Analytics_Industry_IESE_v3.pdf
Data_and_Analytics_Industry_IESE_v3.pdfData_and_Analytics_Industry_IESE_v3.pdf
Data_and_Analytics_Industry_IESE_v3.pdfprevota
 
Introdution to Dataops and AIOps (or MLOps)
Introdution to Dataops and AIOps (or MLOps)Introdution to Dataops and AIOps (or MLOps)
Introdution to Dataops and AIOps (or MLOps)Adrien Blind
 

Semelhante a From Lab to Factory: Creating value with data (20)

From Lab to Factory: Or how to turn data into value
From Lab to Factory: Or how to turn data into valueFrom Lab to Factory: Or how to turn data into value
From Lab to Factory: Or how to turn data into value
 
Proposed Talk Outline for Pycon2017
Proposed Talk Outline for Pycon2017 Proposed Talk Outline for Pycon2017
Proposed Talk Outline for Pycon2017
 
The Python ecosystem for data science - Landscape Overview
The Python ecosystem for data science - Landscape OverviewThe Python ecosystem for data science - Landscape Overview
The Python ecosystem for data science - Landscape Overview
 
Adi Wijaya - Scrum in Data Science, What Works and What Doesn’t
Adi Wijaya - Scrum in Data Science, What Works and What Doesn’tAdi Wijaya - Scrum in Data Science, What Works and What Doesn’t
Adi Wijaya - Scrum in Data Science, What Works and What Doesn’t
 
Adi Wijaya - Scrum in Data Science, What Works and What Doesn’t
Adi Wijaya - Scrum in Data Science, What Works and What Doesn’tAdi Wijaya - Scrum in Data Science, What Works and What Doesn’t
Adi Wijaya - Scrum in Data Science, What Works and What Doesn’t
 
The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products
 
HadoopWorkshopJuly2014
HadoopWorkshopJuly2014HadoopWorkshopJuly2014
HadoopWorkshopJuly2014
 
Building successful data science teams
Building successful data science teamsBuilding successful data science teams
Building successful data science teams
 
On Big Data
On Big DataOn Big Data
On Big Data
 
The book of elephant tattoo
The book of elephant tattooThe book of elephant tattoo
The book of elephant tattoo
 
Data Workflows for Machine Learning - SF Bay Area ML
Data Workflows for Machine Learning - SF Bay Area MLData Workflows for Machine Learning - SF Bay Area ML
Data Workflows for Machine Learning - SF Bay Area ML
 
Ornl IT
Ornl ITOrnl IT
Ornl IT
 
How Data Virtualization Puts Machine Learning into Production (APAC)
How Data Virtualization Puts Machine Learning into Production (APAC)How Data Virtualization Puts Machine Learning into Production (APAC)
How Data Virtualization Puts Machine Learning into Production (APAC)
 
Big Data LDN 2018: HOW AUTOMATION CAN ACCELERATE THE DELIVERY OF MACHINE LEAR...
Big Data LDN 2018: HOW AUTOMATION CAN ACCELERATE THE DELIVERY OF MACHINE LEAR...Big Data LDN 2018: HOW AUTOMATION CAN ACCELERATE THE DELIVERY OF MACHINE LEAR...
Big Data LDN 2018: HOW AUTOMATION CAN ACCELERATE THE DELIVERY OF MACHINE LEAR...
 
The Right Data Warehouse: Automation Now, Business Value Thereafter
The Right Data Warehouse: Automation Now, Business Value ThereafterThe Right Data Warehouse: Automation Now, Business Value Thereafter
The Right Data Warehouse: Automation Now, Business Value Thereafter
 
Data science tools of the trade
Data science tools of the tradeData science tools of the trade
Data science tools of the trade
 
Neurodb Engr245 2021 Lessons Learned
Neurodb Engr245 2021 Lessons LearnedNeurodb Engr245 2021 Lessons Learned
Neurodb Engr245 2021 Lessons Learned
 
Data science presentation
Data science presentationData science presentation
Data science presentation
 
Data_and_Analytics_Industry_IESE_v3.pdf
Data_and_Analytics_Industry_IESE_v3.pdfData_and_Analytics_Industry_IESE_v3.pdf
Data_and_Analytics_Industry_IESE_v3.pdf
 
Introdution to Dataops and AIOps (or MLOps)
Introdution to Dataops and AIOps (or MLOps)Introdution to Dataops and AIOps (or MLOps)
Introdution to Dataops and AIOps (or MLOps)
 

Mais de Peadar Coyle

Introduction to Bayesian Analysis in Python
Introduction to Bayesian Analysis in PythonIntroduction to Bayesian Analysis in Python
Introduction to Bayesian Analysis in PythonPeadar Coyle
 
Variational Inference in Python
Variational Inference in PythonVariational Inference in Python
Variational Inference in PythonPeadar Coyle
 
Consulting Skills for Data Scientists
Consulting Skills for Data ScientistsConsulting Skills for Data Scientists
Consulting Skills for Data ScientistsPeadar Coyle
 
A Map of the PyData Stack
A Map of the PyData StackA Map of the PyData Stack
A Map of the PyData StackPeadar Coyle
 
Big Data and Internet of Things for Managers
Big Data and Internet of Things for ManagersBig Data and Internet of Things for Managers
Big Data and Internet of Things for ManagersPeadar Coyle
 
Introduction to Spark: Or how I learned to love 'big data' after all.
Introduction to Spark: Or how I learned to love 'big data' after all.Introduction to Spark: Or how I learned to love 'big data' after all.
Introduction to Spark: Or how I learned to love 'big data' after all.Peadar Coyle
 
Probabilistic Programming in Python
Probabilistic Programming in PythonProbabilistic Programming in Python
Probabilistic Programming in PythonPeadar Coyle
 
How can Data Science benefit your business?
How can Data Science benefit your business?How can Data Science benefit your business?
How can Data Science benefit your business?Peadar Coyle
 

Mais de Peadar Coyle (8)

Introduction to Bayesian Analysis in Python
Introduction to Bayesian Analysis in PythonIntroduction to Bayesian Analysis in Python
Introduction to Bayesian Analysis in Python
 
Variational Inference in Python
Variational Inference in PythonVariational Inference in Python
Variational Inference in Python
 
Consulting Skills for Data Scientists
Consulting Skills for Data ScientistsConsulting Skills for Data Scientists
Consulting Skills for Data Scientists
 
A Map of the PyData Stack
A Map of the PyData StackA Map of the PyData Stack
A Map of the PyData Stack
 
Big Data and Internet of Things for Managers
Big Data and Internet of Things for ManagersBig Data and Internet of Things for Managers
Big Data and Internet of Things for Managers
 
Introduction to Spark: Or how I learned to love 'big data' after all.
Introduction to Spark: Or how I learned to love 'big data' after all.Introduction to Spark: Or how I learned to love 'big data' after all.
Introduction to Spark: Or how I learned to love 'big data' after all.
 
Probabilistic Programming in Python
Probabilistic Programming in PythonProbabilistic Programming in Python
Probabilistic Programming in Python
 
How can Data Science benefit your business?
How can Data Science benefit your business?How can Data Science benefit your business?
How can Data Science benefit your business?
 

From Lab to Factory: Creating value with data

  • 1. From Lab to Factory Lessons learned developing Data Products Keynote PyCon Colombia Feb 2017 peadarcoyle@googlemail.com All opinions my own
  • 2. Who am I? Data Scientist at Elevate (Recruitment Startup) and Author @springcoil About 5 years in Machine Learning/Data Science OSS contributor (mostly PyMC3) Built Ad-hoc analysis, Data Products and Data Pipelines at
  • 6. Roadmap What is Data Science? Type of Data Scientists Deploying Data Science (tech/culture) Success stories and recommendations Bigger idea (Change)
  • 7. What IS a Data Scientist?
  • 8. There's a Data Science Spectrum HT: Sean J. Taylor (Facebook Research Scientist)
  • 9. Data Science to Value Data doesn't do anything! People need to make decisions (strategic/tactical) Or the product needs to alter someone's decision making (Spotify Discover Weekly, Amazon Recommendations, etc)
  • 10. (Taken from Josh Wills talk - Lab to Factory 2014) or https://hbr.org/2013/04/two-departments-for-data-succe
  • 13. Data is the new oil... But. Data isn't a product. It needs to be turned into one!
  • 14. Data Science is hard Data and technical problems Skill set (emotional and tech) Machine Learning Ops
  • 15. Data Science projects are risky! Many stakeholders think that data science is just an engineering problem, but research is high risk and high reward De-risking the project - how? Send me examples :) http://www.martingoodson.com/ten-ways-your-data-project-is-going-to-fail/ https://github.com/ianozsvald/data_science_delivered
  • 16. Success with Data Science 1. People 2. Ideas 3. Things Source: USAF
  • 18. R and D needs cultural support
  • 19. Your culture needs to avoid the HiPPO (highest paid person's opinion) to be data-informed or data-driven.
  • 20. Good cultures Data democratization Clear objectives and metrics against objectives Financial metrics don't always win - sometimes brand wins See Martin Goodson: http://bit.ly/cultureanddata
  • 21. Some data scientists do experiments and build prototypes
  • 23. (HT: The Yhat people - www.yhathq.com)
  • 24. Deploying Data Products is hard Monitoring and Alerting Deployment Real world data is tricky (training/ testing) Interpretability (explaining fraud detection/ad tech) Feature engineering Models decay over time
  • 25. What projects work? Explain existing data (visualization!) Automate repetitive/slow processes Augment data to make new data (Search engines, ML models) Predict the future (do something more accurately than gut feel) Simulate using statistics :) (Sports analytics models)
  • 26. "You need data first" - Peadar Coyle Copying and pasting PDF/PNG data Getting data in some areas is hard!! Some tools for web data extraction Messy APIs without documentation :(
  • 27. Visualise ALL THE THINGS!! (Relay foods dataset - HT Greg Reda) Consumer behaviour at a Fast Food Restaurant per year in the USA
  • 29. Augmenting data and using APIs
  • 31. Production data/ Real world HT: Andrew Ng
  • 32. Everyone ETLs Only 1% of your time will be spent modelling Data pipelines and your infrastructure matter - Tools such as Spark, Luigi, Airflow or Dask are important Monitoring of pipelines. Scalability. Machine provision Eoin Brazil Talk
  • 33. Success Story (1) 1. Ad Tech project (Media company) - original process took 10 hours per week of analyst time. 2. Broke down process and produced simple predictive model 3. Data type challenges (unicode in Python 2.7) and business expectations 4. Now takes less than 30 minutes each week.
  • 34. Success Story (2) Elevate is a Total Talent Management platform (tools for streamlining recruitment) Several models in production - microservices (recommenders, job type prediction) Data pipeline (Luigi) necessary for producing features Lots of challenges about Docker, time, updated data. Works: MyPy (Python 3.5), code review, testing - Next: Moving to PySpark for ETL, more pair programming http://hypothesis.works/
  • 35. Failure 1. Lack of support from tech teams 2. Lack of clearly defined customer 3. No culture of DevOps 4. Management had no clear vision
  • 37. Lessons learned from Lab to Factory 1. Monitoring - data products need evaluation in production 2. Lack of a shared language between software engineers and data scientists. Pair programming and code review are some solutions. 3. To help data scientists and analysts succeed, your business needs to be prepared to invest in tooling. (Pipelines, DevOps, reproducibility) 4. Often you're working with other teams who use different languages - so microservices can be a good idea (example: a C# .NET app and a Python microservice)
  • 38. How to deploy a model? Palladium (Otto Group) Azure Flask Microservice (DIY) and Docker
  • 41. We need each other!
  • 42. Use small data where possible!! Small problems with clean data are more important - (Ian Ozsvald) Amazon machine with many Xeons and 244GB of RAM is less than 3 euros per hour. - (Ian Ozsvald) Blaze, Xray, Dask, Ibis, etc etc - "The mean size of a cluster will remain 1" - Matt Rocklin PyData Bikeshed
  • 43. Focus on the real problem
  • 44. Closing remarks Dirty data stops projects Change happens, we need to help businesses adapt or be agile There are some good projects like Icy, Luigi, etc for transforming data and improving data extraction - Google paper on rules for building ML systems http://martin.zinkevich.org/rules_of_ml/rules_of_ml.pdf
  • 45. Closing Remarks (2) Stakeholder management is a challenge too Send me your dirty data and data deployment stories :) www.peadarcoyle.com https://leanpub.com/interviewswithdatascientists
  • 48. Simulate: Rugby games with MCMC (PyMC3)
  • 49. What is the Data Science process? Obtain Scrub Explore Model Interpret Communicate (or Deploy)
  • 50. Some NLP on the Interviews!
  • 51. What do Data Scientists talk about? Based on my Interview series! Dataconomy
  • 52. A famous 'data product' - Recommendation engines
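
The slides stress that "models decay over time" and that "data products need evaluation in production". As a rough illustration only, and not something taken from the talk, here is a minimal sketch of a rolling accuracy check in Python. The class name, the window size and the tolerance are all hypothetical, and it assumes that true labels eventually arrive for each prediction.

```python
from collections import deque
from typing import Deque


class RollingAccuracyMonitor:
    """Track accuracy over the most recent predictions and flag decay.

    Hypothetical helper for illustration; not from the original talk.
    """

    def __init__(self, baseline: float, window: int = 100,
                 tolerance: float = 0.05) -> None:
        self.baseline = baseline    # accuracy measured offline (assumed known)
        self.window = window        # number of recent predictions to keep
        self.tolerance = tolerance  # allowed drop before we flag decay
        self.outcomes: Deque[bool] = deque(maxlen=window)

    def record(self, predicted: int, actual: int) -> None:
        """Store whether one production prediction was correct."""
        self.outcomes.append(predicted == actual)

    def accuracy(self) -> float:
        """Accuracy over the current window (NaN if nothing recorded)."""
        if not self.outcomes:
            return float("nan")
        return sum(self.outcomes) / len(self.outcomes)

    def has_decayed(self) -> bool:
        """True once a full window falls below baseline minus tolerance."""
        if len(self.outcomes) < self.window:
            return False  # wait for a full window to avoid noisy early alerts
        return self.accuracy() < self.baseline - self.tolerance


if __name__ == "__main__":
    monitor = RollingAccuracyMonitor(baseline=0.9, window=4)
    for predicted, actual in [(1, 1), (0, 0), (1, 0), (1, 0)]:
        monitor.record(predicted, actual)
    print(monitor.accuracy(), monitor.has_decayed())
```

In practice a check like this would feed whatever alerting the team already runs. The point is only that the evaluation keeps happening after deployment instead of stopping at the test set.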