© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
© Pariveda Solutions. Confidential & Proprietary.
August 2017
Managing Data Science
| Lessons from the Field
Mac Steele
Director of Product | Domino Data Lab
mac@dominodatalab.com
@macsteele
What You’ll Learn Today
GOALS
What is the bar for data science teams
PITFALLS
What are common data science struggles
DIAGNOSES
Why so many of our efforts fail to deliver value
RECOMMENDATIONS
How to address these struggles with best practices
Lots of Legitimate
Promises
Saved $40M
In claims with predictive analytics
[Chart: Companies Mentioning ‘Artificial Intelligence’ on Earnings Calls, by quarter from Q1-08 to Q4-16; y-axis from 0 to 200]
Lots of Hype
35% of Sales
Come from product recommendations
Saved $450M
By detecting fraudulent tax returns
Lots of Risk of
Disappointment
This Sounds
Eerily Familiar
[Figure: hype cycle of expectations over time, with machine learning marked on the curve; phases: Innovation Trigger, Peak of Inflated Expectations, Trough of Disillusionment, Slope of Enlightenment, Plateau of Productivity]
[Chart: relative importance within the enterprise over time (1997, 2010, 2030) for Software Developers and Data Scientists]
What is the Goal?
Measurable
Your “quality” indicator.
Reliable
Your “hit rate.”
Scalable
Your “throughput.”
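As a rough illustration of the "hit rate" and "throughput" indicators above, here is a minimal Python sketch. The slide does not define either metric, so the definitions below (share of delivered projects that met their target, and delivered projects per week) are assumptions, as are the `Project` fields and function names.

```python
from dataclasses import dataclass


@dataclass
class Project:
    name: str
    delivered: bool   # hypothetical flag: the work reached its stakeholders
    met_target: bool  # hypothetical flag: it hit the KPI it was scoped against


def hit_rate(projects):
    """Share of delivered projects that met their target (assumed definition)."""
    delivered = [p for p in projects if p.delivered]
    if not delivered:
        return 0.0
    return sum(p.met_target for p in delivered) / len(delivered)


def throughput(projects, period_weeks):
    """Delivered projects per week over the reporting period (assumed definition)."""
    return sum(p.delivered for p in projects) / period_weeks


# Made-up portfolio for illustration only.
portfolio = [
    Project("a", delivered=True, met_target=True),
    Project("b", delivered=True, met_target=False),
    Project("c", delivered=False, met_target=False),
]
```

A team would substitute its own definitions; the point is only that each goal maps to a number that can be tracked over time.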
DATA SCIENCE PITFALLS
I SOLVED THE PROBLEM BUT…
Oops, already
solved by
someone else
It was the
wrong problem
Solved the
wrong way
Have the wrong
tools for this problem
Too slow for it
to matter
World changes
while solving
problem
Problems multiply,
can’t tackle all
at once
Results used
the wrong way
DIAGNOSES
Data Science is Different from Software Development
• Research versus development focus
• No answer is a valid answer
• Traditional testing is insufficient given
non-deterministic nature
• No generally accepted process metrics (e.g.
story points)
• Data must be tracked
Forget About Other Stakeholders in the Process
For Data Scientists
• Access powerful infrastructure & preferred tools
For IT Leaders
• Ensure stability & security
• Leverage existing infrastructure
• Minimize operational burden
For Business Leaders
• Understand real-world impact
• Reliable, predictable insights
• Minimize change to existing workflows
For Data Science Managers
• Accelerate project delivery through reuse, knowledge management
• Mitigate key-man risk / accelerate onboarding
• Hire & retain top talent
Fixation on Tools at the Expense of People and
Process
Moonshot vs.
Laps Around the Track
• Perfection as enemy of shipped
• Muddle “pure research” and
“applied templates”
Disconnected from the
Business
• Little familiarity with practical
business constraints
• Limited ability to drive
adoption
Missing Some Key
Personnel Muscles
• The full stack data scientist is
a myth
• Gap in “soft” skills training
Artisan Thinking vs.
Modular System Thinking
• Limited culture of re-use and
compounding
• Not planning for future iterations
(e.g., no reproducibility /
documentation)
Bad Incentive Structures
• Key responsibilities fall between
gaps
• Significant information loss in
project transitions
RECOMMENDATIONS
Best Practices Take Many Forms
Process
Both a single project and portfolio of projects
People
Types of capabilities and org design
Technology
Flexible infrastructure and tooling without the
wild west
Data science system at many levels
[Diagram: three levels. Single Step (e.g., Data Exploration); Single Project, with stages Ideation, Data Exploration, R & D, Validation & Review, Deployment & Publishing, and Monitoring & Feedback; Portfolio of Projects]
Managing the lifecycle
• Expect and embrace iteration
• Enable compounding collaboration
• Ensure auditability and
reproducibility, even if you’re not
regulated (yet)
Ideation
• Problem first, not data first
• Practice and master order of
magnitude ROI math
• Maintain repo of past work
• Create and enforce templates for
MRDs
• Maintain a stakeholder-driven
backlog
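The "order of magnitude ROI math" bullet above can be sketched as a back-of-the-envelope calculation. This is illustrative only: the formula (annual benefit, discounted by an assumed probability of success, divided by estimated cost), the function name, and the backlog figures are all assumptions, not taken from the talk.

```python
import math


def roi_order_of_magnitude(annual_benefit, success_probability, total_cost):
    """Return (expected ROI multiple, its power of ten) for a project idea.

    All inputs are rough guesses; the exponent matters more than the decimals.
    """
    if total_cost <= 0:
        raise ValueError("total_cost must be positive")
    if not 0 <= success_probability <= 1:
        raise ValueError("success_probability must be between 0 and 1")
    expected_multiple = annual_benefit * success_probability / total_cost
    if expected_multiple <= 0:
        return expected_multiple, None  # nothing to rank by
    return expected_multiple, math.floor(math.log10(expected_multiple))


# Hypothetical backlog entries: (name, annual benefit in $, P(success), cost in $)
backlog = [
    ("dashboard refresh", 50_000, 0.9, 60_000),
    ("churn model", 2_000_000, 0.5, 100_000),
]

# Rank ideas by expected multiple, highest first.
ranked = sorted(backlog, key=lambda p: roi_order_of_magnitude(*p[1:])[0], reverse=True)
```

Comparing exponents rather than exact figures keeps the discussion honest about how uncertain the inputs are at the ideation stage.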
Artifact Selection
• Leverage rapid prototyping and
design sprint methodology
• Create multiple mock-ups of
different deliverable types
• Consider creating synthetic data
with baseline models
Research & Development
• Establish standard software
configurations, but give flexibility
to experiment
• Abstract away compute
provisioning
• Build simple models first
• Set a cadence for delivering
insights
• Ensure business KPI tracked
consistently over time
Validation
• More than just code review, get
stakeholder and IT sign-off
• Ensure reproducibility and clear
lineage
• Use automated validation checks
to support human inspection
• Preserve results (even nulls) to
central repo
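Two of the validation bullets above, automated checks and preserving results even when they are null, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the metric names (`auc`, `n_rows`), the baseline threshold, and the `results.jsonl` file standing in for a "central repo" are all hypothetical.

```python
import json
import math
from pathlib import Path


def run_checks(metrics, baseline_auc=0.5):
    """Return human-readable warnings; an empty list means every check passed."""
    warnings = []
    auc = metrics.get("auc")
    if auc is None or math.isnan(auc):
        warnings.append("auc is missing")
    elif not 0.0 <= auc <= 1.0:
        warnings.append(f"auc {auc} is outside [0, 1]")
    elif auc <= baseline_auc:
        warnings.append(f"auc {auc} does not beat baseline {baseline_auc}")
    if metrics.get("n_rows", 0) == 0:
        warnings.append("no rows were scored")
    return warnings


def preserve_result(repo_dir, run_id, metrics, warnings):
    """Append the run to a JSON Lines log, including null or failed results."""
    path = Path(repo_dir) / "results.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    record = {"run_id": run_id, "metrics": metrics, "warnings": warnings}
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return path
```

The checks only flag issues for a reviewer; they support human inspection rather than replace it, and every run is logged whether or not it passed.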
WHAT INFLUENCES A RESULT?
A chain of dependencies, each item depending on the next:
• Results
• The statistical analyses selected
• The R scripts that implemented the analyses
• The R libraries that implement the statistical functions
• The C libraries that perform the mathematical computations
• The operating system running the computational framework
• Reduced data
• Scripts that reduce the data
• Raw data
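One way to make the dependency layers above auditable is to record a manifest alongside each result. The slide's example is an R stack; the sketch below swaps in Python for illustration, and the function names and manifest layout are assumptions rather than anything prescribed in the talk.

```python
import hashlib
import platform
import sys
from importlib import metadata
from pathlib import Path


def sha256_of(path):
    """Hash a file's bytes so any change to data or scripts is detectable."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_files, script_files, packages):
    """Record one entry per layer that a result depends on."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed; recorded rather than hidden
    return {
        "data": {str(p): sha256_of(p) for p in data_files},
        "scripts": {str(p): sha256_of(p) for p in script_files},
        "packages": versions,
        "python": sys.version,
        "os": platform.platform(),
    }
```

Storing this manifest next to the result lets a later reviewer see which layer changed when a rerun produces a different answer.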
Delivery
• Support for many deliverable
artifacts (reports, dashboards,
apps, batch APIs, real-time APIs)
• Define a promote-to-production
workflow
• Flag upstream and downstream
dependencies
Monitoring
• Build ROI testing into all major
deliverables
• Require monitoring plans before
considering “done”
• Integrate with tools where people
spend most of their time (e.g.,
email / Slack)
• Anticipate risk and change
management burdens
Keeping all the balls
in the air
• Measure everything, including
yourself
• Focus on reducing time to iterate
• Socialize aggregate portfolio
impact
The many hats of data science
PRIORITIES PITTFALLS WITHOUT THEM
Creating engaging visual and narrative journeys
for analytical solutionsData Storyteller
Articulating the business problem, translating to
day-to-day work, ensuring ongoing engagement.
Data Product
Manager
Vetting the priortization and ROI, providing ongoing
feedback
Business
Stakeholder
ROLE
Low engagement and
adoption from
end users
Projects miss the mark, don’t
translate into tangible business
value
ROI decisions aren’t made
sensibly, not knowing when to pull
the plug
Generating and communicating insights,
understanding the strengths and risksData Scientist
Naïve or low power insights
Building scalable pipelines and infrastructure that
make it possible to do the higher levels of needs.
Data
Infrastructure
Engineer
Insight generation is slow,
because DS is spending their
time doing infrastructure work
Organizational Design Dilemmas
• False centralization /
decentralization dichotomy
• Most evolve as they scale
and as business demands
shift
• Technology can help
bridge the gap
DECENTRALIZATION
Pros
• Deeper understanding of business processes and priorities
• Easier change management
Cons
• Less technical knowledge compounding
• Harder to codify best practices
• Risk of shadow IT
CENTRALIZATION
Pros
• Community and mentorship
• Easier transparency for managers and IT
• More passive technical knowledge sharing
Cons
• Isolation on data science island
• Loss of credibility with business
• Frustrated data scientists
What We Covered Today
GOALS
What is the bar for data science teams
PITFALLS
What are common data science struggles
DIAGNOSES
Why so many of our efforts fail to deliver value
RECOMMENDATIONS
How to address these struggles with best practices
QUESTIONS?
Check out dominodatalab.com or find us
in the AWS Marketplace
Summertime Analytics: Predicting E. coli and West Nile Virus
 
GeoViz: A Canvas for Data Science
GeoViz: A Canvas for Data ScienceGeoViz: A Canvas for Data Science
GeoViz: A Canvas for Data Science
 
Doing your first Kaggle (Python for Big Data sets)
Doing your first Kaggle (Python for Big Data sets)Doing your first Kaggle (Python for Big Data sets)
Doing your first Kaggle (Python for Big Data sets)
 
How I Learned to Stop Worrying and Love Linked Data
How I Learned to Stop Worrying and Love Linked DataHow I Learned to Stop Worrying and Love Linked Data
How I Learned to Stop Worrying and Love Linked Data
 
Software Engineering for Data Scientists
Software Engineering for Data ScientistsSoftware Engineering for Data Scientists
Software Engineering for Data Scientists
 
Making Big Data Smart
Making Big Data SmartMaking Big Data Smart
Making Big Data Smart
 
Building Data Analytics pipelines in the cloud using serverless technology
Building Data Analytics pipelines in the cloud using serverless technologyBuilding Data Analytics pipelines in the cloud using serverless technology
Building Data Analytics pipelines in the cloud using serverless technology
 
Leveraging Open Source Automated Data Science Tools
Leveraging Open Source Automated Data Science ToolsLeveraging Open Source Automated Data Science Tools
Leveraging Open Source Automated Data Science Tools
 
The Role and Importance of Curiosity in Data Science
The Role and Importance of Curiosity in Data ScienceThe Role and Importance of Curiosity in Data Science
The Role and Importance of Curiosity in Data Science
 
Fuzzy Matching to the Rescue
Fuzzy Matching to the RescueFuzzy Matching to the Rescue
Fuzzy Matching to the Rescue
 
How to Effectively Combine Numerical Features and Categorical Features
How to Effectively Combine Numerical Features and Categorical FeaturesHow to Effectively Combine Numerical Features and Categorical Features
How to Effectively Combine Numerical Features and Categorical Features
 
Building Up Local Models of Customers
Building Up Local Models of CustomersBuilding Up Local Models of Customers
Building Up Local Models of Customers
 
Making Investing A Science
Making Investing A ScienceMaking Investing A Science
Making Investing A Science
 
How to Use Data Science to Affect Company Change
How to Use Data Science to Affect Company ChangeHow to Use Data Science to Affect Company Change
How to Use Data Science to Affect Company Change
 
Making Media with Jupyter
Making Media with JupyterMaking Media with Jupyter
Making Media with Jupyter
 
Lean Data Science
Lean Data ScienceLean Data Science
Lean Data Science
 

Último

Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
ELKO dropshipping via API with DroFx.pptx
ELKO dropshipping via API with DroFx.pptxELKO dropshipping via API with DroFx.pptx
ELKO dropshipping via API with DroFx.pptxolyaivanovalion
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...amitlee9823
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfadriantubila
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceDelhi Call girls
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...amitlee9823
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...amitlee9823
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 

Último (20)

Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
ELKO dropshipping via API with DroFx.pptx
ELKO dropshipping via API with DroFx.pptxELKO dropshipping via API with DroFx.pptx
ELKO dropshipping via API with DroFx.pptx
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 

Managing Data Science | Lessons from the Field

  • 1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. 1 © Pariveda Solutions. Confidential & Proprietary. August 2017 Managing Data Science | Lessons from the Field Mac Steele Director of Product | Domino Data Lab mac@dominodatalab.com @macsteele
  • 2. What You’ll Learn Today GOALS What is the bar for data science teams PITFALLS What are common data science struggles DIAGNOSES Why so many of our efforts fail to deliver value RECOMMENDATIONS How to address these struggles with best practices
  • 3. Lots of Legitimate Promises: Saved $40M in claims with predictive analytics; 35% of sales come from product recommendations; saved $450M by detecting fraudulent tax returns. Lots of Hype: [Chart: companies mentioning ‘Artificial Intelligence’ on earnings calls, Q1-08 through Q4-16]
  • 4. Lots of Risk of Disappointment. This Sounds Eerily Familiar. [Chart: hype cycle for machine learning, expectations over time: Innovation Trigger, Peak of Inflated Expectations, Trough of Disillusionment, Slope of Enlightenment, Plateau of Productivity] [Chart: relative importance within the enterprise over time (1997, 2010, 2030) for Software Developers and Data Scientists]
  • 5. What is the Goal? Measurable Your “quality” indicator. Reliable Your “hit rate.” Scalable Your “throughput.”
  • 7. I SOLVED THE PROBLEM BUT… Oops, already solved by someone else • It was the wrong problem • Solved the wrong way • Have the wrong tools for this problem • Too slow for it to matter • World changes while solving problem • Problems multiply, can’t tackle all at once • Results used the wrong way
  • 9. Data Science is Different from Software Development • Research versus development focus • No answer is a valid answer • Traditional testing is insufficient given non-deterministic nature • No generally accepted process metrics (e.g. story points) • Data must be tracked
  • 10. Forget About Other Stakeholders in the Process Access powerful infrastructure & preferred tools For Data Scientists For IT Leaders •Ensure stability & security •Leverage existing infrastructure •Minimize operational burden For Business Leaders •Understand real-world impact •Reliable, predictable insights •Minimize change to existing workflows For Data Science Managers • Accelerate project delivery through reuse, knowledge management • Mitigate key-man risk / accelerate onboarding • Hire & retain top talent
  • 11. Fixation on Tools at the Expense of People and Process
  • 12. Moonshot vs. Laps Around the Track • Perfection as enemy of shipped • Muddle “pure research” and “applied templates”
  • 13. Disconnected from the Business • Little familiarity with practical business constraints • Limited ability to drive adoption
  • 14. Missing Some Key Personnel Muscles • The full stack data scientist is a myth • Gap in “soft” skills training
  • 15. Artisan Thinking vs. Modular System Thinking • Limited culture of re-use and compounding • Not planning for future iterations (e.g., no reproducibility / documentation)
  • 16. Bad Incentive Structures • Key responsibilities fall between gaps • Significant information loss in project transitions
  • 17. RECOMMENDATIONS
  • 18. Best Practices Take Many Forms Process Both a single project and portfolio of projects People Types of capabilities and org design Technology Flexible infrastructure and tooling without the wild west
  • 19. Data science system at many levels Single Step Data Exploration Single Project Ideation Validation & Review Deployment & Publishing Monitoring & Feedback Data Exploration R & D
  • 21. Managing the lifecycle • Expect and embrace iteration • Enable compounding collaboration • Ensure auditability and reproducibility, even if you’re not regulated (yet)
  • 22. Ideation • Problem first, not data first • Practice and master order of magnitude ROI math • Maintain repo of past work • Create and enforce templates for MRDs • Maintain a stakeholder-driven backlog
  • 23. Artifact Selection • Leverage rapid prototyping and design sprint methodology • Create multiple mock-ups of different deliverable types • Consider creating synthetic data with baseline models
  • 24. Research & Development • Establish standard software configurations, but give flexibility to experiment • Abstract away compute provisioning • Build simple models first • Set a cadence for delivering insights • Ensure business KPI tracked consistently over time
  • 25. Validation • More than just code review, get stakeholder and IT sign-off • Ensure reproducibility and clear lineage • Use automated validation checks to support human inspection • Preserve results (even nulls) to central repo
      WHAT INFLUENCES A RESULT? [Diagram: a chain of "depend on" links, listed in slide order]
        Results
        The statistical analyses selected
        The R scripts that implemented the analyses
        The R libraries that implement the statistical functions
        The C libraries that perform the mathematical computations
        The operating system running the computational framework
        Reduced data
        Scripts that reduce the data
        Raw data
  • 26. Delivery • Support for many deliverable artifacts (reports, dashboards, apps, batch APIs, real-time APIs) • Define a promote-to-production workflow • Flag upstream and downstream dependencies
  • 27. Monitoring • Build ROI testing into all major deliverables • Require monitoring plans before considering “done” • Integrate with tools where people spend most of their time (e.g., email / Slack) • Anticipate risk and change management burdens
  • 28. Keeping all the balls in the air • Measure everything, including yourself • Focus on reducing time to iterate • Socialize aggregate portfolio impact
  • 29. The many hats of data science (ROLE: PRIORITIES / PITFALLS WITHOUT THEM)
      Data Storyteller: Creating engaging visual and narrative journeys for analytical solutions / Low engagement and adoption from end users
      Data Product Manager: Articulating the business problem, translating to day-to-day work, ensuring ongoing engagement / Projects miss the mark, don’t translate into tangible business value
      Business Stakeholder: Vetting the prioritization and ROI, providing ongoing feedback / ROI decisions aren’t made sensibly, not knowing when to pull the plug
      Data Scientist: Generating and communicating insights, understanding the strengths and risks / Naïve or low-power insights
      Data Infrastructure Engineer: Building scalable pipelines and infrastructure that make it possible to do the higher levels of needs / Insight generation is slow, because DS is spending their time doing infrastructure work
  • 30. Organizational Design Dilemmas • False centralization / decentralization dichotomy • Most evolve as they scale and as business demands shift • Technology can help bridge the gap
      DECENTRALIZATION
        Pros: • Deeper understanding of business processes and priorities • Easier change management
        Cons: • Less technical knowledge compounding • Harder to codify best practices • Risk of shadow IT
      CENTRALIZATION
        Pros: • Community and mentorship • Easier transparency for managers and IT • More passive technical knowledge sharing
        Cons: • Isolation on data science island • Loss of credibility with business • Frustrated data scientists
  • 31. What We Covered Today GOALS What is the bar for data science teams PITFALLS What are common data science struggles DIAGNOSES Why so many of our efforts fail to deliver value RECOMMENDATIONS How to address these struggles with best practices
  • 32. QUESTIONS? Check out dominodatalab.com or find us in the AWS Marketplace

Notas do Editor

  1. Who am I? I work at Domino Data Lab. We build a data science platform that helps organizations build a more mature data science practice. In my role, I get to work with large enterprises and small start-ups to understand how data science is changing their business. What I’m going to talk about today is largely just a synthesis of what we’ve heard over the past few years from companies that have failed hard and those that have had great success.
     What you’ll learn today:
       • What are common data science struggles
       • Why so many of our efforts fail to deliver value
       • How to address these struggles with best practices
       • Who is doing this well today and what are their principles
       • Where to focus your efforts tomorrow
  2. Let’s start by saying something really obvious. Everyone is really excited about data science. There is lots of legitimate promise, with companies like Google, Facebook, and Amazon building defensible businesses around the breadth and quality of their models. At the same time, the pervasive hype has created risk of disappointment and disillusionment if not proactively addressed.
  3. We believe data science is in the throes of a transition from a niche capability leveraged by a few pioneers to a core capability across every enterprise. What was once a “nice to have” has become a survival imperative. As with the evolution of software development, the tooling has advanced dramatically in recent years. But also like software development, tooling alone is not enough. The hardening of new roles (people), processes, and technology will be key to cementing data science’s position as a core function.
  4. The goal of any data science organization should be measurable, reliable, and scalable impact on the business decisions and metrics that they are charged with improving. Were business decisions positively changed in an observable and ideally, quantifiable, way? If I take on five projects, I want 3-4 to deliver business value. If my reliability is 80% with five projects and seven people, can I expand that to 50 projects and 40 people?
  5. Opening talk track: Everyone talks about data being the problem. “Data Scientists spend 80% of their time cleaning data.” Don’t get me wrong, it’s a big pain point. But I think it’s a convenient scapegoat that distracts from some of the industry’s real problems that are more easily solvable.
  6. Wrong problem: Over-zealous data science teams often dive straight into the data looking for “something interesting.” We’ve seen large organizations hire 30+ PhDs with no clear mandate. They then emerge from a six-week hole only to realize they had misunderstood the target variable, rendering the analysis irrelevant.
     Solved by someone else: We consistently hear data scientists complain about re-inventing the wheel. Anecdotal estimates put it at 30-40% of their time in large organizations with significant amounts of prior art. In the fortunate situation where a past project can be discovered, reproducing it is often impossible given inconsistent preservation of relevant artifacts like data, packages, documentation, and intermediate results.
     Wrong tools: Given the explosion of data and tooling functionality, data scientists are still often dramatically ill-equipped to explore the full range of possible domains and solutions. Analysis is still often confined to individual laptops that are easily overwhelmed. We’ve heard of organizations where it can take 6+ months to approve a widely-utilized open source Python package for research purposes, prompting employees to bring their personal laptops and work under their desks.
     Right problem/Too slow: The data scientist who spends an extra two weeks to eke out a bit more AUC on a targeting model, only to realize the marketing team’s deadline has passed.
     Wrong way: For example, the team that builds a powerful predictive model for underwriters, wraps it in a standalone scoring front end, and realizes the underwriters never actually click to a new tab from their existing workflow. One large insurer described it as, “We don’t fail because of the math… we fail because we don’t understand how people will use the math.”
     Used inappropriately: Google describes this as the undeclared consumer problem. Results can be thrown “over the fence,” and data science teams have little control over, or even visibility into, how those results are being used. For example, someone builds a model for predicting the value of California residential mortgages, but then an over-zealous banker uses it to predict the value of Florida commercial mortgages even though the original model creator knew that would be a bad idea.
     World changes: Models are by definition an approximation of the real world. If you don’t keep track of how the world is changing and monitor your models’ ongoing performance, you imperil the business and likely leave value on the table. My favorite story in this space was a large financial institution that issued credit cards. They had a probability of default model that expected a credit score. The credit bureau changed how they report “not present in the DB” from a null to a 999. Their model didn’t account for this and they just thought a bunch of risky people had perfect credit scores. It took weeks and millions of dollars in bad loans before they caught it.
     Can’t solve 100 at once: Many teams have had early wins from their low-hanging fruit. Working in a tight-knit team on a single business initiative is great. However, they start to experience negative returns to scale as their existing processes can’t cope with a swollen backlog, an influx of new hires, and heightened expectations from the business.
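     The credit score story in the note above comes down to an input value that no one checked against its expected range. A minimal sketch of such a check follows, assuming a FICO-style valid range of 300 to 850 and a 1% tolerance; the range, the threshold, and the function names are illustrative assumptions, not details from the talk.

```python
from typing import Iterable, Optional

# Assumption for illustration: FICO-style scores fall between 300 and 850.
MIN_SCORE, MAX_SCORE = 300, 850


def out_of_range_rate(scores: Iterable[Optional[float]]) -> float:
    """Share of non-missing scores that fall outside the valid range."""
    present = [s for s in scores if s is not None]
    if not present:
        return 0.0
    bad = sum(1 for s in present if not (MIN_SCORE <= s <= MAX_SCORE))
    return bad / len(present)


def batch_is_healthy(scores: Iterable[Optional[float]],
                     max_bad_rate: float = 0.01) -> bool:
    """True when the out-of-range share stays within the tolerated rate.

    max_bad_rate is a hypothetical threshold; a real team would tune it.
    """
    return out_of_range_rate(scores) <= max_bad_rate
```

     Run against each incoming batch before scoring, a check like this would flag a sentinel such as 999 as out of range instead of letting it pass as a perfect score.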
  8. First, let me say that there could be a whole series of talks on this topic alone.
  9. Data science is bigger than just data scientists. Obviously data scientists are a critical component, but there are a whole host of other stakeholders who must come along on the journey for there to be reliable wins at scale. And those stakeholders have very different backgrounds and priorities from data scientists.
     Data science managers often act as the bridge to the business and are focused on the quality and pace of output. They worry about things like key-man risk and the pace of onboarding.
     Business leaders don’t care as much about how the sausage is made, but they need to know they can count on data science output to make better decisions without having to drastically change how they and their teams work.
     IT leaders care about stability and serving their internal customers. They have KPIs like uptime and 20-minute SLAs, plus initiatives like cloud migration and enterprise standardization. They also want to ensure that new tools fit within existing infrastructure.
     The typical data science process neglects most of these stakeholders, letting the legitimately great promise of data science go unrealized.
  10. Reddit blogs on the optimal data science organizational structure don’t get the same traction as throwdowns about Python and R. Data scientists wear their tool wrangling as a badge of honor, and it is wrapped up in their identity.
  11. Many organizations have not built a culture of delivery and iteration. This could be a result of many data scientists’ extensive academic backgrounds, though it likely also stems from confusion about which type of work is really happening: “pure research” or “applying templates to novel business situations.”
  12. Teams are often hired into disconnected Innovation Labs without real business accountability to hone their process. Data science becomes “those people over there in the corner.” This also means they don’t have a deep understanding of the target KPIs and the nuances of how a team works today which is critical to ensuring adoption of their results.
  13. The full stack data scientist is dead, if she ever existed at all. The move towards specialization isn’t just data engineers, it’s a whole host of other roles that cover the concepts of change management, feasibility assessments, rapid prototyping, ROI estimation, training, and stakeholder education. Data science training often focuses on the technical skills, which are necessary but insufficient for driving impact. Increasingly, the role is being partitioned into many roles, as happened with software development over the last two decades.  
  14. Data scientists often think of their work as bespoke and highly specialized. While their skillset may be, many of the artifacts they create can and should be re-used, whether those are software packages, data viz suites, feature stores, or anything else. Moreover, many data scientists barely document their development process, much less modularize it.
  15. Many data scientists have told me “I get paid for what I build this year, not maintaining what I built last year.” That leads to huge gaps in monitoring live production models, as the responsibility falls to IT, which focuses on system performance but doesn’t understand whether the model is “still right” or being used appropriately.
  16. Who am I?
       • Work with large enterprises and small start-ups to understand how data science is changing their business
       • Worked at Bridgewater Associates, hyper-focused on research
       • Managed (often poorly) data scientists and data engineers
      What you’ll learn today:
       • What are common data science struggles
       • Why so many of our efforts fail to deliver value
       • How to address these struggles with best practices
       • Who is doing this well today and what are their principles
       • Where to focus your efforts tomorrow
  17. I’m here to tell you that there’s a better way. You can overcome this by getting the right people and processes working through a centralized system of record. Domino offers a detailed inventory of existing projects and models, supporting knowledge management efforts. We track all research and development work to ensure auditability and facilitate best practices like subject matter expert involvement and code review. We support publication to human consumers via interactive web applications and automated reports, along with publication to other software systems via enterprise-grade APIs. Finally, we track how different projects relate and who is working on them, providing meta-analytics to better direct your future investment. As more and more of your business decisions are augmented or replaced entirely by data science products, you can’t afford an ad hoc system.
  19. Some of the most important work in the overall lifecycle happens before a line of code is written. If done well, the ideation stage dramatically de-risks a project by driving alignment across stakeholders. This is where the business objective is identified, success criteria are laid out, prior art is reviewed, and initial ROI calculations are performed. Ideation is when feasibility is assessed, both in terms of “Does the data even exist?” and “Can we realistically change how the business process works?” It is also where prioritization happens relative to other potential projects. Below are some best practices we’ve observed that get to the root of many of the problems discussed earlier. Expect and embrace iteration - Data science is never linear. All the flow charts in the world won’t stop a project from frequently needing to go back to find new data or re-validate a solution. That’s OK. Enable high-impact collaboration - Collaboration means being able to find, discuss, understand, and build on past work. It shouldn’t matter if the original author has left the company, or if the project was 4 versions of Pandas ago. Ensure auditability and reproducibility - For regulated industries, understanding all the steps in a model’s lifecycle is mandated by law. Even for those that aren’t, as more and more of your models affect critical parts of citizens’ lives (what they read, what disease they’re diagnosed with, etc.), it’s critical to lay the groundwork for seamless auditability.
  20. Some of the most important work in the overall lifecycle happens before a line of code is written. If done well, the ideation stage dramatically de-risks a project by driving alignment across stakeholders. This is where the business objective is identified, success criteria are laid out, prior art is reviewed, and initial ROI calculations are performed. Ideation is when feasibility is assessed, both in terms of “Does the data even exist?” and “Can we realistically change how the business process works?” It is also where prioritization happens relative to other potential projects. Below are some best practices we’ve observed that get to the root of many of the problems discussed earlier. Problem first, not data first - Many organizations start with the data and look for something “interesting” rather than building a deep understanding of the existing business process and then pinpointing the decision point that can be augmented or automated. Leading organizations go so far as to literally map existing business processes in tools like Visio, PPT, or LucidChart and then circle on that map the exact points where data science could potentially focus. Practice and master order of magnitude ROI math - The ability to estimate the potential business impact of a change in a statistical measure is one of the best predictors of success for a data science team. For example, if we reduce fraudulent insurance claims by 1%, how much would we save? What is a conservative estimate of how much improvement we can expect from the data scientists’ efforts? Settle on a number based on past experience, erring on the conservative side. Maintain repo of past work with business domain and technical experts - As teams grow, no person can be an expert in everything. It’s critical to have a way to search to see who is most familiar with the latest version of TensorFlow or who has done the most work in the marketing attribution space.
Code search is helpful, but ideally includes relevant discussion, environments, and data. Create and enforce templates for model requirements documents - Documentation up front saves time 10:1 down the road. Create a template for 80% of cases, knowing there will always be exceptions. Maintain a stakeholder-driven backlog - Your stakeholders should always be able to see what’s in flight and what’s been put in the backlog. As in any product org, they don’t necessarily get to change it, but you should have recurring check-ins with them to ensure priorities haven’t shifted.
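As a rough illustration of the order-of-magnitude ROI math described above, here is a minimal Python sketch. The loss figure, project cost, and the helper names `estimate_annual_savings` and `simple_roi` are made-up assumptions for illustration, not numbers or tools from any real engagement.

```python
def estimate_annual_savings(annual_losses, relative_reduction, adoption_rate=1.0):
    """Order-of-magnitude savings: losses x expected reduction x share of cases the model touches."""
    return annual_losses * relative_reduction * adoption_rate


def simple_roi(annual_savings, project_cost):
    """Return net benefit per dollar spent (3.0 means $3 net for every $1 invested)."""
    return (annual_savings - project_cost) / project_cost


# All figures below are hypothetical placeholders for illustration only.
annual_fraud_losses = 200_000_000   # assumed yearly cost of fraudulent claims
conservative_reduction = 0.01       # the 1% reduction from the example above
project_cost = 500_000              # assumed fully loaded cost of the project

savings = estimate_annual_savings(annual_fraud_losses, conservative_reduction)
print(f"Estimated savings: ${savings:,.0f}")
print(f"Simple ROI: {simple_roi(savings, project_cost):.1f}x")
```

The point is not precision. If the conservative estimate does not clear the project cost by a wide margin, that is a signal to reprioritize before any data is pulled.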
  21. This is where the shape of the final deliverable is agreed upon. It’s always possible to amend the agreed-upon deliverable or to have multiple, but visualizing the ultimate consumption medium and working backwards is key. Are you building a one-off answer to support a strategic decision, a standalone lightweight app for stakeholders to use, or a real-time data product that integrates into other systems? The best organizations start simple, get the result into the business, then learn and measure before updating the model with a more sophisticated approach (more features, more complex algorithm, deeper integration). Create multiple mock-ups of different deliverable types - A leading e-commerce company creates 3-5 mocks for every data science project they take on, even bringing in a designer to make it feel real. For example, they discovered that exposing their model as a HipChat bot was the most user-friendly way to leverage the model. By iterating on design possibilities before they get data, they ensure they’ve surfaced any previously undiscovered requirements and maximize their odds of adoption. Bring IT and engineering stakeholders in early - A model may work spectacularly in the lab, but not have any hope of ever working in production the way envisioned by the business. IT and engineering stakeholders need a seat at the table this early in order to identify constraints like “We only backfill that data monthly from the vendor, so we can’t do a real-time scoring engine.” Consider creating synthetic data with baseline models - Some organizations even create synthetic data and naive baseline models to show how the model would impact existing business processes. A leading agriculture company devotes an entire team to creating synthetic “perfect” data (e.g., no nulls, full history, realistic distribution) to establish potential value with the business before they contract with expensive satellite data providers to get “real” data.
  22. Establish standard software configurations, but give flexibility to experiment - Data scientists can often spend the first 8 weeks on the job configuring their workstation rather than exploring existing work and understanding their stakeholders’ priorities. Having a few standard environments gets people onboarded faster. Yet, it’s important they retain flexibility to try new tools and techniques. The tool acquisition process can be so arduous that some data scientists covertly bring their personal machines to work so they don’t have to wait 8 months for a Python package to be approved. Technologies like Docker can eliminate much of this headache. Abstract away compute provisioning - Data scientists can wait weeks or even months to get the hardware necessary to accelerate their workflows. At Monsanto, they were able to take a research task that took 24 hours to run and complete it in 30 minutes by running it in parallel across dozens of EC2 machines. Build simple models first - Resist the temptation to use 500 features. One company we know did this, spent weeks engineering the features and tuning the hyperparameters, only to learn that many of them were either a) not collected in real-time so couldn’t be used in the target use case or b) not allowed for compliance reasons. They ended up using a simple five-feature model and then working with their IT team to capture other data in real-time. Set a cadence for delivering insights - The most common failure mode is that data science delivers results that are either too late or don’t fit into how the business works today, so the results gather dust. Share insights early and often. One leading organization has their data scientists share an insight every 3-4 days. If they can’t publish a short post on incremental findings in business-friendly language, then chances are they are down a rabbit hole.
This lets the manager coach more junior or academically-oriented team members, plus gives an easily consumable timeline of the progress for stakeholders. Ensure business KPI tracked consistently over time - Too often, data scientists lose sight of the business KPI they are trying to affect and instead focus on a narrow statistical measure. Leading teams ensure that the relevant KPI is never far from their experiments, whether it’s the Sharpe ratio of a hedge fund’s backtest or the Customer Acquisition Cost for an e-commerce company.
  23. More than just code review, get stakeholder and IT sign-off - This helps prevent delays in the process of delivery and avoids user adoption hurdles down the road. Ensure reproducibility and clear lineage of project - Quality validation entails dissecting a model and checking assumptions and sensitivities. This is nearly impossible if a validator spends 90% of their time just gathering documentation. Use automated validation checks to support human inspection - While data science’s non-deterministic nature means that unit testing does not directly apply, there are often repeated steps in a validation process that can be automated. That may be a set of summary statistics and charts, a portfolio backtest, or any other step that could be turned into an automated diagnostic. Preserve null results - Even if a project yields no material uplift and doesn’t get deployed into production, it’s critical to document it and preserve it in the same knowledge repo. Too often, we hear that data scientists are redoing work someone already explored, without knowledge of the previous inquiries.
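To make the idea of an automated diagnostic concrete, here is a minimal sketch using only the Python standard library. It computes a fixed set of summary statistics for a column and flags statistics that moved more than a tolerance relative to a baseline. The function names, the 10% tolerance, and the sample values are illustrative assumptions; a real validation suite would be tailored to the model under review.

```python
import math
import statistics


def summary_diagnostics(values):
    """Compute a small, repeatable set of summary statistics for one numeric column."""
    present = [v for v in values if v is not None and not math.isnan(v)]
    if not present:
        raise ValueError("no non-missing values to summarize")
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "mean": statistics.fmean(present),
        "stdev": statistics.stdev(present) if len(present) > 1 else 0.0,
        "min": min(present),
        "max": max(present),
    }


def flag_shifts(baseline, candidate, tolerance=0.10):
    """Return the names of statistics whose relative change exceeds the tolerance."""
    flagged = []
    for name in ("mean", "stdev"):
        base = baseline[name]
        if base == 0:
            changed = candidate[name] != 0
        else:
            changed = abs(candidate[name] - base) / abs(base) > tolerance
        if changed:
            flagged.append(name)
    return flagged


# Hypothetical data: the training-time column versus a later refresh.
baseline = summary_diagnostics([1.0, 2.0, 3.0, 4.0])
candidate = summary_diagnostics([2.0, 3.0, 4.0, None, 5.0])
print(flag_shifts(baseline, candidate))
```

A check like this does not replace the human validator. It hands them a consistent starting report so their time goes to assumptions and sensitivities rather than gathering numbers.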
  24. Support for many deliverable artifacts (reports, dashboards, apps, batch APIs, real-time APIs) - While real-time scoring gets all the glory, the vast majority of models will at one time or another be prototype apps, dashboards, or batch scoring engines. It’s important to keep a link between all those deliverables because it saves time and avoids the risk that key feedback is lost. Have a promote-to-production workflow - Too often data science teams throw a result over the wall. If you establish the workflow ahead of time, you lower the burden for iterating on a new version of the model. Know what environments and packages are acceptable in production. Know who can make those decisions and what the escalation path is. Flag upstream and downstream dependencies - A model is at its riskiest when it finally makes it to production. Ensure that you know the upstream dependencies: what training data was used, what transformations were done with what tools, what modeling packages were used, etc. Also make sure you know the downstream dependencies (e.g., this nightly batch model is stacked on another model).
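One lightweight way to flag these dependencies is to record a small manifest per model and query it before changing anything. The sketch below is an assumption about how such a record might look; the `ModelManifest` fields, the model names, and the file paths are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ModelManifest:
    """Hypothetical record of what a production model depends on."""
    name: str
    training_data: list          # datasets used for training
    transformations: list        # preprocessing steps, in order
    packages: dict               # package name -> pinned version
    upstream_models: list = field(default_factory=list)  # models whose output this one consumes


def downstream_of(model_name, manifests):
    """Return names of models that directly or transitively consume model_name."""
    affected = []
    frontier = [model_name]
    while frontier:
        current = frontier.pop()
        for manifest in manifests:
            if current in manifest.upstream_models and manifest.name not in affected:
                affected.append(manifest.name)
                frontier.append(manifest.name)
    return sorted(affected)


# Hypothetical portfolio: a nightly batch model stacked on a base model.
manifests = [
    ModelManifest("customer_segments", ["crm_2017q2.csv"], ["impute", "scale"], {"scikit-learn": "0.18.2"}),
    ModelManifest("churn_batch", ["events.parquet"], ["aggregate"], {"xgboost": "0.6"}, ["customer_segments"]),
    ModelManifest("retention_offer", ["offers.csv"], [], {"scikit-learn": "0.18.2"}, ["churn_batch"]),
]
print(downstream_of("customer_segments", manifests))
```

With records like these, retraining or retiring a base model starts with a list of everything stacked on top of it rather than a surprise in production.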
  25. Monitoring is often forgotten because getting to delivery is so hard. It’s important not to monitor just the system performance (uptime, latency), but also the usage (more or less than expected) and the statistical performance (is the model degrading?). IT often owns this process but is ill-equipped, with traditional tools like New Relic and little context from the data scientists who handed off the model. Build testing into all major deliverables - One leading organization established a global holdout group from all of their customer segmentation and price elasticity models. After a year, they compared the average revenue from the holdout group to that of the customers whose experience was guided by the predictive models. The overall lift was more than $1 billion, which gave them the credibility to dramatically expand the team and push models into more steps of the customer journey. Require monitoring plans for proactive alerting, acceptable uses, and notification thresholds - The data scientist who created the model is the one best positioned to know what risks are inherent in their approach. Rather than wait for the business to notice something is wrong or a metric to drift, codify that knowledge into your monitoring system. Do you expect certain input types and ranges? If an input is outside of those, what should you do? Rollback? Stop serving predictions? What if someone in a totally different department starts consuming the model in a way that may be risky or outright wrong? Working collaboratively with IT or engineering, data scientists can put the appropriate guardrails on their creations. Integrate with tools where people spend most of their time (e.g., email / Slack) - High performing teams realize that monitoring is only good if someone acknowledges, inspects, and changes behavior if necessary.
We’ve seen organizations build alerts into chatbots or email systems to ensure they can keep up with the alerts as their number of production models scales. Anticipate risk and change management burdens - At one large insurer, they have a team called Business Analytics Engineers who proactively assess and address change management problems when they deploy a data science product like a new claims pricing app. They cover things like training, provide pre-determined feedback channels, and measure usage and engagement to ensure success.
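The monitoring-plan questions above (expected input types and ranges, and what to do when an input falls outside them) can be codified in a few lines. This is a minimal sketch under assumed names: `RangeRule`, `check_inputs`, the feature names, the bounds, and the action labels are all hypothetical, and the actual response (alert, rollback, stop serving) would be wired up with IT or engineering.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeRule:
    """Expected numeric range for one model input, plus the agreed response."""
    feature: str
    minimum: float
    maximum: float
    action: str  # hypothetical labels such as "alert" or "stop_serving"


def check_inputs(record, rules):
    """Return (feature, problem, action) tuples for every rule the record violates."""
    violations = []
    for rule in rules:
        value = record.get(rule.feature)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            violations.append((rule.feature, "wrong_type", rule.action))
        elif not (rule.minimum <= value <= rule.maximum):
            violations.append((rule.feature, "out_of_range", rule.action))
    return violations


# Hypothetical plan written by the data scientist who built the model.
rules = [
    RangeRule("age", 18, 100, "alert"),
    RangeRule("claim_amount", 0, 1_000_000, "stop_serving"),
]

print(check_inputs({"age": 35, "claim_amount": 5000}, rules))
print(check_inputs({"age": 17, "claim_amount": "5000"}, rules))
```

The list of violations can then be routed to wherever the team already works, such as an email digest or a chat channel, so that someone actually acknowledges it.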
  26. Measure everything, including yourself - Ironically, data scientists live in the world of measurement yet rarely turn that lens on themselves. Tracking patterns in aggregate workflows helps create modular templates, disseminate best practices from high-performing teams, and guide investment in internal tooling and people to alleviate bottlenecks. Monsanto, a large AWS customer, tracks more than 350 simultaneous projects across 10 business units with more than 200 data scientists. By looking at the aggregate portfolio, they can see that they may need to devote more resources to hiring data engineers. Another large tech company similarly estimated that across their entire body of work there were fundamentally only 15 “unique” types of problems and set about formulating templates to streamline their process. Focus on reducing time to iterate - This “feature” is ultimately the best predictor of data science organizational success we’ve seen. Minimal obstacles (without sacrificing rigorous review and checks) to testing real results is another great predictor of data science success. Big tech companies deploy new models in minutes, whereas large financial services companies can take 18 months. Socialize aggregate portfolio impact - Even if it’s not precise, it’s critical to socialize the impact of the whole portfolio of data science projects. Doing so addresses data scientists’ concerns about impact and helps address executive level concerns about investing further in data science. Importantly, don’t claim the credit for yourselves; present it as a collective achievement of all the stakeholders.
  27. As mentioned earlier, the full stack data scientist no longer exists and the roles are increasingly specialized. This is a natural evolution that we expect will continue as data science becomes ingrained into the fabric of how organizations function. The most consistent feedback we’ve heard is the increasing demand for a “product manager” type role as most organizations move from delivering mathematical results to stakeholder-facing apps. In large tech organizations, data science sits as a peer to product management to drive strategic priorities and ongoing optimization of engagement and impact.
  28. Most evolve as they scale and as business demands shift - We see many organizations start with a centralized “Center of Excellence” for data science to build their core technical infrastructure before evolving to a hybrid structure. In this structure, the central team focuses on building templates (documentation, software environments, project stage flows) and codifying best practices while embedded groups sit next to each major business line to address the “bookend” problems of identifying the right data science problem and maximizing adoption of solutions. Sometimes a full data science guild (to borrow Spotify’s term) exists and meets regardless of its members’ day-to-day functional departments. Technology is now much better positioned to help address the pains of decentralization. A data science platform can facilitate technical knowledge sharing, encourage or enforce best practices, and provide transparency while still allowing data scientists to be closer to the business.