Managing Data Science | Lessons from the Field

The document discusses best practices for managing data science teams based on lessons learned. It outlines common pitfalls such as solving the wrong problem, having the wrong tools, or results being used incorrectly. Issues include data science being different from software development and forgetting other stakeholders. Recommendations include establishing processes for the full lifecycle from ideation to monitoring, using modular systems thinking, and defining roles like data scientists, managers, and product owners to address organizational challenges. The goal is to deliver measurable, reliable, and scalable insights.

Data & Analytics

© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
© Pariveda Solutions. Confidential & Proprietary.
August 2017
Managing Data Science | Lessons from the Field
Mac Steele
Director of Product | Domino Data Lab
mac@dominodatalab.com
@macsteele

What You’ll Learn Today
GOALS
What is the bar for data science teams
PITFALLS
What are common data science struggles
DIAGNOSES
Why so many of our efforts fail to deliver value
RECOMMENDATIONS
How to address these struggles with best practices

Lots of Legitimate Promises
Saved $40M in claims with predictive analytics
35% of Sales come from product recommendations
Saved $450M by detecting fraudulent tax returns

Lots of Hype
[Chart: Companies mentioning ‘Artificial Intelligence’ on earnings calls, by quarter, Q1-08 to Q4-16; y-axis 0 to 200]

Lots of Risk of Disappointment
This Sounds Eerily Familiar
[Figure: hype cycle for machine learning, expectations over time: Innovation Trigger, Peak of Inflated Expectations, Trough of Disillusionment, Slope of Enlightenment, Plateau of Productivity]
[Figure: relative importance within the enterprise over time (axis marks 1997, 2010, 2030) for Software Developers and Data Scientists]

What is the Goal?
Measurable
Your “quality” indicator.
Reliable
Your “hit rate.”
Scalable
Your “throughput.”

DATA SCIENCE PITFALLS

I SOLVED THE PROBLEM BUT…
Oops, already solved by someone else
It was the wrong problem
Solved the wrong way
Have the wrong tools for this problem
Too slow for it to matter
World changes while solving problem
Problems multiply, can’t tackle all at once
Results used the wrong way

DIAGNOSES

Data Science is Different from Software Development
• Research versus development focus
• No answer is a valid answer
• Traditional testing is insufficient given
non-deterministic nature
• No generally accepted process metrics (e.g.
story points)
• Data must be tracked

Forget About Other Stakeholders in the Process
For Data Scientists
• Access powerful infrastructure & preferred tools
For IT Leaders
• Ensure stability & security
• Leverage existing infrastructure
• Minimize operational burden
For Business Leaders
• Understand real-world impact
• Reliable, predictable insights
• Minimize change to existing workflows
For Data Science Managers
• Accelerate project delivery through reuse, knowledge management
• Mitigate key-man risk / accelerate onboarding
• Hire & retain top talent

Fixation on Tools at the Expense of People and
Process

Moonshot vs.
Laps Around the Track
• Perfection as enemy of shipped
• Muddle “pure research” and
“applied templates”

Disconnected from the
Business
• Little familiarity with practical
business constraints
• Limited ability to drive
adoption

Missing Some Key
Personnel Muscles
• The full stack data scientist is
a myth
• Gap in “soft” skills training

Artisan Thinking vs.
Modular System Thinking
• Limited culture of re-use and
compounding
• Not planning for future iterations
(e.g., no reproducibility /
documentation)

Bad Incentive Structures
• Key responsibilities fall between
gaps
• Significant information loss in
project transitions

RECOMMENDATIONS

Best Practices Take Many Forms
Process
Both a single project and portfolio of projects
People
Types of capabilities and org design
Technology
Flexible infrastructure and tooling without the
wild west

Data science system at many levels
Single Step: Data Exploration
Single Project: Ideation, Data Exploration, R & D, Validation & Review, Deployment & Publishing, Monitoring & Feedback
Portfolio of Projects

Managing the lifecycle
• Expect and embrace iteration
• Enable compounding collaboration
• Ensure auditability and
reproducibility, even if you’re not
regulated (yet)

Ideation
• Problem first, not data first
• Practice and master order of
magnitude ROI math
• Maintain repo of past work
• Create and enforce templates for
MRDs
• Maintain a stakeholder-driven
backlog
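
The “order of magnitude ROI math” bullet above can be made concrete with a back-of-the-envelope calculation. This is a minimal sketch, assuming a simple model in which expected annual value is volume × value per unit × expected lift × probability of success, compared against project cost. The function name, its parameters, and every number below are hypothetical illustrations, not figures from the talk.

```python
import math


def roi_order_of_magnitude(units_per_year, value_per_unit, expected_lift,
                           probability_of_success, project_cost):
    """Return (expected_value, roi_ratio, order_of_magnitude).

    All inputs are rough guesses; the point is the exponent, not the digits.
    """
    expected_value = (units_per_year * value_per_unit
                      * expected_lift * probability_of_success)
    roi_ratio = expected_value / project_cost
    # floor(log10) gives the power of ten: 0 means ~1x, 1 means ~10x, etc.
    order = math.floor(math.log10(roi_ratio)) if roi_ratio > 0 else None
    return expected_value, roi_ratio, order


# Hypothetical example: 1M transactions/year worth $50 each, a 2% lift,
# a 50% chance the project works, and a $100k project cost.
value, ratio, order = roi_order_of_magnitude(
    units_per_year=1_000_000,
    value_per_unit=50.0,
    expected_lift=0.02,
    probability_of_success=0.5,
    project_cost=100_000.0,
)
print(f"expected value ${value:,.0f}, ROI {ratio:.1f}x, ~10^{order}")
```

Comparing backlog items by the exponent alone is usually enough to rank them; refining the inputs only matters when two candidates land in the same power of ten.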

Artifact Selection
• Leverage rapid prototyping and
design sprint methodology
• Create multiple mock-ups of
different deliverable types
• Consider creating synthetic data
with baseline models

Research & Development
• Establish standard software
configurations, but give flexibility
to experiment
• Abstract away compute
provisioning
• Build simple models first
• Set a cadence for delivering
insights
• Ensure business KPI tracked
consistently over time

Validation
• More than just code review, get
stakeholder and IT sign-off
• Ensure reproducibility and clear
lineage
• Use automated validation checks
to support human inspection
• Preserve results (even nulls) to
central repo
WHAT INFLUENCES A RESULT?
Results depend on:
• The statistical analyses selected
• The R scripts that implemented the analyses
• The R libraries that implement the statistical functions
• The C libraries that perform the mathematical computations
• The operating system running the computational framework
• Reduced data
• Scripts that reduce the data
• Raw data
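
One way to act on the “what influences a result” dependencies above is to record each layer next to every result. The following is a minimal sketch using only the Python standard library. The manifest layout and the `build_manifest` and `sha256_of` helpers are assumptions made for illustration (the slide refers to R; Python is used here), not a format the talk prescribes.

```python
import hashlib
import json
import platform
import sys
import tempfile
from importlib import metadata
from pathlib import Path


def sha256_of(path):
    """Hash a file's bytes so data and scripts can be pinned exactly."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def build_manifest(data_files, script_files, packages):
    """Collect the layers a result depends on into one JSON-friendly dict."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed; record that fact too
    return {
        "data": {str(p): sha256_of(p) for p in data_files},
        "scripts": {str(p): sha256_of(p) for p in script_files},
        "packages": versions,
        "python": sys.version.split()[0],
        "os": platform.platform(),
    }


# Usage with a throwaway file standing in for raw data.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw.csv"
    raw.write_text("id,score\n1,700\n")
    manifest = build_manifest([raw], [], ["pip"])
    print(json.dumps(manifest, indent=2))
```

Storing a manifest like this with each preserved result, including null results, gives a reviewer something to diff when a rerun disagrees with the original.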

Delivery
• Support for many deliverable
artifacts (reports, dashboards,
apps, batch APIs, real-time APIs)
• Define a promote-to-production
workflow
• Flag upstream and downstream
dependencies
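
Flagging downstream dependencies can start as a small registry that answers “who is affected if this changes?” The sketch below assumes each deliverable declares what it reads from. The artifact names and the `UPSTREAM` registry are hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical registry: each deliverable lists what it reads from.
UPSTREAM = {
    "churn_model": ["customer_table", "usage_events"],
    "retention_dashboard": ["churn_model"],
    "email_campaign_api": ["churn_model", "customer_table"],
}


def downstream_of(artifact, upstream=UPSTREAM):
    """Return every deliverable that directly or indirectly reads artifact."""
    # Invert the registry: parent -> set of direct consumers.
    consumers = defaultdict(set)
    for child, parents in upstream.items():
        for parent in parents:
            consumers[parent].add(child)
    # Breadth-first walk over consumers to collect indirect ones too.
    seen, queue = set(), deque([artifact])
    while queue:
        for child in consumers[queue.popleft()]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


print(downstream_of("customer_table"))
```

A registry only helps if consumers are declared, which is why a promote-to-production workflow is a natural place to require the declaration.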

Monitoring
• Build ROI testing into all major
deliverables
• Require monitoring plans before
considering “done”
• Integrate with tools where people
spend most of their time (e.g.,
email / Slack)
• Anticipate risk and change
management burdens
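
A monitoring plan needs at least one concrete check. One commonly used option is the Population Stability Index (PSI), which compares the distribution of a model input or score at training time with what is seen in production. The talk does not name a specific metric, so PSI, the bin edges, and the 0.25 alert threshold (a common rule of thumb) are all assumptions in this sketch.

```python
import math


def psi(expected, actual, edges, eps=1e-6):
    """Population Stability Index between two samples over shared bins.

    edges are ascending interior cut points; values below the first edge
    fall in bin 0 and values at or above the last edge fall in the last bin.
    """
    def proportions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            idx = sum(v >= e for e in edges)  # number of edges passed
            counts[idx] += 1
        total = len(values)
        # eps avoids log(0) and division by zero for empty bins
        return [max(c / total, eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))


# Hypothetical scores: training-time sample vs. a shifted production sample.
training = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
production = [0.6, 0.7, 0.7, 0.8, 0.8, 0.9, 0.9, 0.9]
edges = [0.25, 0.5, 0.75]

score = psi(training, production, edges)
needs_review = score > 0.25  # rule-of-thumb threshold, tune per use case
print(f"PSI = {score:.3f}, needs review: {needs_review}")
```

The result of a check like this could be posted to email or Slack, in line with the bullet about integrating with the tools where people already spend their time.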

Keeping all the balls
in the air
• Measure everything, including
yourself
• Focus on reducing time to iterate
• Socialize aggregate portfolio
impact

The many hats of data science
ROLE: Data Storyteller
PRIORITIES: Creating engaging visual and narrative journeys for analytical solutions
PITFALLS WITHOUT THEM: Low engagement and adoption from end users

ROLE: Data Product Manager
PRIORITIES: Articulating the business problem, translating to day-to-day work, ensuring ongoing engagement
PITFALLS WITHOUT THEM: Projects miss the mark, don’t translate into tangible business value

ROLE: Business Stakeholder
PRIORITIES: Vetting the prioritization and ROI, providing ongoing feedback
PITFALLS WITHOUT THEM: ROI decisions aren’t made sensibly, not knowing when to pull the plug

ROLE: Data Scientist
PRIORITIES: Generating and communicating insights, understanding the strengths and risks
PITFALLS WITHOUT THEM: Naïve or low power insights

ROLE: Data Infrastructure Engineer
PRIORITIES: Building scalable pipelines and infrastructure that make it possible to do the higher levels of needs
PITFALLS WITHOUT THEM: Insight generation is slow, because data scientists are spending their time doing infrastructure work

Organizational Design Dilemmas
• False centralization / decentralization dichotomy
• Most evolve as they scale and as business demands shift
• Technology can help bridge the gap

CENTRALIZATION
Pros
• Community and mentorship
• Easier transparency for managers and IT
• More passive technical knowledge sharing
Cons
• Isolation on data science island
• Loss of credibility with business
• Frustrated data scientists

DECENTRALIZATION
Pros
• Deeper understanding of business processes and priorities
• Easier change management
Cons
• Less technical knowledge compounding
• Harder to codify best practices
• Risk of shadow IT

What We Covered Today
GOALS
What is the bar for data science teams
PITFALLS
What are common data science struggles
DIAGNOSES
Why so many of our efforts fail to deliver value
RECOMMENDATIONS
How to address these struggles with best practices

QUESTIONS?
Check out dominodatalab.com or find us
in the AWS Marketplace

Editor’s Notes

Who am I? I work at Domino Data Lab. We build a data science platform that helps organizations build a more mature data science practice. In my role, I get to work with large enterprises and small start-ups to understand how data science is changing their business. What I’m going to talk about today is largely a synthesis of what we’ve heard over the past few years from companies that have failed hard and from those that have had great success.
What you’ll learn today:
• What are common data science struggles
• Why so many of our efforts fail to deliver value
• How to address these struggles with best practices
• Who is doing this well today and what are their principles
• Where to focus your efforts tomorrow
Let’s start by saying something really obvious. Everyone is really excited about data science. There is lots of legitimate promise, with companies like Google, Facebook, and Amazon building defensible businesses around the breadth and quality of their models. At the same time, the pervasive hype has created risk of disappointment and disillusionment if not proactively addressed.
We believe data science is in the throes of a transition from a niche capability leveraged by a few pioneers to a core capability across every enterprise. What was once a “nice to have” has become a survival imperative. As with the evolution of software development, the tooling has advanced dramatically in recent years. But also like software development, tooling alone is not enough. The hardening of new roles (people), processes, and technology will be key to cementing data science’s position as a core function.
The goal of any data science organization should be measurable, reliable, and scalable impact on the business decisions and metrics that they are charged with improving. Measurable: were business decisions positively changed in an observable and, ideally, quantifiable way? Reliable: if I take on five projects, I want 3-4 to deliver business value. Scalable: if my reliability is 80% with five projects and seven people, can I expand that to 50 projects and 40 people?
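
The reliability and scalability questions in the note above reduce to two ratios. This is a minimal sketch, assuming “hit rate” means projects that delivered value divided by projects taken on, and using projects per person as a rough throughput proxy. The `Portfolio` class is hypothetical, and the 40 delivered projects in the target are an assumption (80% of 50), not a figure from the talk.

```python
from dataclasses import dataclass


@dataclass
class Portfolio:
    projects: int          # projects taken on
    delivered_value: int   # projects that delivered business value
    people: int            # team size

    @property
    def hit_rate(self):
        """Reliability: share of projects that delivered value."""
        return self.delivered_value / self.projects

    @property
    def projects_per_person(self):
        """Throughput proxy: projects taken on per team member."""
        return self.projects / self.people


today = Portfolio(projects=5, delivered_value=4, people=7)
# Target from the notes: 50 projects and 40 people at the same 80% hit rate.
target = Portfolio(projects=50, delivered_value=40, people=40)

print(f"today:  hit rate {today.hit_rate:.0%}, "
      f"{today.projects_per_person:.2f} projects/person")
print(f"target: hit rate {target.hit_rate:.0%}, "
      f"{target.projects_per_person:.2f} projects/person")
```

Under these assumptions the target implies 1.75 times as many projects per person with no drop in hit rate, which is why the talk treats scalability as a separate goal from reliability.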
Opening talk track: Everyone talks about data being the problem. “Data Scientists spend 80% of their time cleaning data.” Don’t get me wrong, it’s a big pain point. But I think it’s a convenient scapegoat that distracts from some of the industry’s real problems that are more easily solvable.
Wrong problem: Overzealous data science teams often dive straight into the data looking for “something interesting.” We’ve seen large organizations hire 30+ PhDs with no clear mandate. They then emerge from a six-week hole only to realize they had misunderstood the target variable, rendering the analysis irrelevant.
Solved by someone else: We consistently hear data scientists complain about reinventing the wheel. Anecdotal estimates put it at 30-40% of their time in large organizations with significant amounts of prior art. Even in the fortunate situation where a past project can be discovered, reproducing it is often impossible given inconsistent preservation of relevant artifacts like data, packages, documentation, and intermediate results.
Wrong tools: Given the explosion of data and tooling functionality, data scientists are still often dramatically ill-equipped to explore the full range of possible domains and solutions. Analysis is still often confined to individual laptops that are easily overwhelmed. We’ve heard of organizations where it can take 6+ months to approve a widely used open source Python package for research purposes, prompting employees to bring their personal laptops and work under their desks.
Right problem, too slow: Consider the data scientist who spends an extra two weeks to eke out a bit more AUC on a targeting model, only to realize the marketing team’s deadline has passed.
Wrong way: For example, a team builds a powerful predictive model for underwriters, wraps it in a standalone scoring front end, and then realizes the underwriters never actually click over to a new tab from their existing workflow. One large insurer described it as, “We don’t fail because of the math… we fail because we don’t understand how people will use the math.”
Used inappropriately: Google describes this as the undeclared consumer problem. Results can be thrown “over the fence,” and data science teams have little control over, or even visibility into, how those results are being used. For example, someone builds a model for predicting the value of California residential mortgages, but then an overzealous banker uses it to predict the value of Florida commercial mortgages, even though the original model creator knew that would be a bad idea.
World changes: Models are by definition an approximation of the real world. If you don’t keep track of how the world is changing and monitor your models’ ongoing performance, you imperil the business and likely leave value on the table. My favorite story in this space was a large financial institution that issued credit cards. They had a probability of default model that expected a credit score. The credit bureau changed how they report “not present in the DB” from a null to a 999. Their model didn’t account for this, and they just thought a bunch of risky people had perfect credit scores. It took weeks and millions of dollars in bad loans before they caught it.
Can’t solve 100 at once: Many teams have had early wins from their low-hanging fruit. Working in a tight-knit team on a single business initiative is great. However, they start to experience negative returns to scale as their existing processes can’t cope with a swollen backlog, an influx of new hires, and heightened expectations from the business.
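
The credit score story in the notes above is the kind of failure an automated input check can catch before scoring. This is a minimal sketch, assuming valid scores fall in the 300 to 850 range (a FICO-style range; the talk does not state the bureau or the range) and that 999 is the “not present” code described. The function name and thresholds are hypothetical.

```python
VALID_RANGE = (300, 850)   # assumed valid score range (FICO-style)
KNOWN_SENTINELS = {999}    # "not present in the DB" code from the anecdote


def validate_credit_scores(scores, max_invalid_share=0.01):
    """Split scores into usable and rejected, and flag suspicious batches.

    None and sentinel codes are treated as missing, never as real scores.
    """
    usable, rejected = [], []
    for s in scores:
        if s is None or s in KNOWN_SENTINELS:
            rejected.append(s)
        elif VALID_RANGE[0] <= s <= VALID_RANGE[1]:
            usable.append(s)
        else:
            rejected.append(s)  # out of range: unknown encoding change?
    invalid_share = len(rejected) / len(scores) if scores else 0.0
    return {
        "usable": usable,
        "rejected": rejected,
        "invalid_share": invalid_share,
        "alert": invalid_share > max_invalid_share,
    }


batch = [720, 650, 999, None, 810, 999, 590, 999]
report = validate_credit_scores(batch)
print(report["invalid_share"], report["alert"])
```

The range check matters more than the sentinel list: it would flag a new encoding such as 999 even if nobody had been told about the change.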
First, let me say that there could be a whole series of talks on this topic alone.
Data science is bigger than just data scientists. Obviously data scientists are a critical component, but there are a whole host of other stakeholders who must come along on the journey for there to be reliable wins at scale. And those stakeholders have very different backgrounds and priorities from data scientists.
Data science managers often act as the bridge to the business and are focused on the quality and pace of output. They worry about things like key man risk and the pace of onboarding.
Business leaders don’t care as much about how the sausage is made, but they need to know they can count on data science output to make better decisions without having to drastically change how they and their teams work.
IT leaders care about stability and serving their internal customers. They have KPIs like uptime and 20-minute SLAs, plus initiatives like cloud migration and enterprise standardization. They also want to ensure that new tools fit within existing infrastructure.
The typical data science process neglects most of these stakeholders, letting the legitimately great promise of data science go unrealized.
Reddit blogs on the optimal data science organizational structure don’t get the same traction as throwdowns about Python and R. Data scientists wear their tool wrangling as a badge of honor, and it is wrapped up in their identity.
Many organizations have not built a culture of delivery and iteration. This could be a result of many data scientists’ extensive academic backgrounds, though it likely also stems from confusion about which type of work is really happening: “pure research” or “applying templates to novel business situations.”
Teams are often hired into disconnected Innovation Labs without real business accountability to hone their process. Data science becomes “those people over there in the corner.” This also means they don’t have a deep understanding of the target KPIs or the nuances of how a team works today, which is critical to ensuring adoption of their results.
The full stack data scientist is dead, if she ever existed at all. The move towards specialization isn’t just data engineers, it’s a whole host of other roles that cover the concepts of change management, feasibility assessments, rapid prototyping, ROI estimation, training, and stakeholder education. Data science training often focuses on the technical skills, which are necessary but insufficient for driving impact. Increasingly, the role is being partitioned into many roles, as happened with software development over the last two decades.
Data scientists often think of their work as bespoke and highly specialized. While their skillset may be, they create many artifacts that can and should be re-used, whether those are software packages, data viz suites, feature stores, or anything else. Moreover, many data scientists barely document their development process, much less modularize it.
Many data scientists have told me, “I get paid for what I build this year, not maintaining what I built last year.” That leads to huge gaps in monitoring live production models, as the responsibility falls to IT, which focuses on system performance but doesn’t understand whether the model is “still right” or being used appropriately.
Who am I?
Work with large enterprises and small start-ups to understand how data science is changing their business
Worked at Bridgewater Associates, hyper-focused on research
Managed (often poorly) data scientists and data engineers
What you’ll learn today
What are common data science struggles
Why so many of our efforts fail to deliver value
How to address these struggles with best practices
Who is doing this well today and what are their principles
Where to focus your efforts tomorrow
I’m here to tell you that there’s a better way. You can overcome this by getting the right people and processes working through a centralized system of record. Domino offers a detailed inventory of existing projects and models, supporting knowledge management efforts. We track all research and development work to ensure auditability and facilitate best practices like subject matter expert involvement and code review. We support publication to human consumers via interactive web applications and automated reports, along with publication to other software systems via enterprise-grade APIs. Finally, we track how different projects relate and who is working on them, providing meta-analytics to better direct your future investment. As more and more of your business decisions are augmented or replaced entirely by data science products, you can’t afford an ad hoc system.
Some of the most important work in the overall lifecycle happens before a line of code is written. If done well, the ideation stage dramatically de-risks a project by driving alignment across stakeholders. This is where the business objective is identified, success criteria are laid out, prior art is reviewed, and initial ROI calculations are performed. Ideation is when feasibility is assessed, both in terms of “Does the data even exist?” and “Can we realistically change how the business process works?” It is also where prioritization happens relative to other potential projects. Below are some best practices we’ve observed that get to the root of many of the problems discussed earlier.
Expect and embrace iteration: Data science is never linear. All the flow charts in the world won’t stop a project from frequently needing to go back to find new data or re-validate a solution. That’s OK.
Enable high-impact collaboration: Collaboration means being able to find, discuss, understand, and build on past work. It shouldn’t matter if the original author has left the company, or if the project was 4 versions of Pandas ago.
Ensure auditability and reproducibility: For regulated industries, understanding all the steps in a model’s lifecycle is mandated by law. Even for those that aren’t, as more and more of your models affect critical parts of citizens’ lives (what they read, what disease they’re diagnosed with, etc.), it’s critical to lay the groundwork for seamless auditability.
Problem first, not data first: Many organizations start with the data and look for something “interesting” rather than building a deep understanding of the existing business process and then pinpointing the decision point that can be augmented or automated. Leading organizations go so far as to literally map existing business processes in tools like Visio, PPT, or LucidChart and then circle on that map the exact points where data science could potentially focus.
Practice and master order of magnitude ROI math: The ability to estimate the potential business impact of a change in a statistical measure is one of the best predictors of success for a data science team. For example, if we reduce fraudulent insurance claims by 1%, how much would we save? What is a conservative estimate of how much improvement we can expect from the data scientists’ efforts? Settle on a number based on past experience, erring on the conservative side.
Maintain repo of past work with business domain and technical experts: As teams grow, no person can be an expert in everything. It’s critical to have a way to search to see who is most familiar with the latest version of TensorFlow or who has done the most work in the marketing attribution space. Code search is helpful, but ideally includes relevant discussion, environments, and data.
Create and enforce templates for model requirements documents: Documentation up front saves time 10:1 down the road. Create a template for 80% of cases, knowing there will always be exceptions.
Maintain a stakeholder-driven backlog: Your stakeholders should always be able to see what’s in flight and what’s been put in the backlog. As in any product org, they don’t necessarily get to change it, but you should have recurring check-ins with them to ensure priorities haven’t shifted.
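The order of magnitude ROI math described above can be written down in a few lines. This is a minimal Python sketch under stated assumptions: the function name, the haircut factor, and every dollar figure in the usage note are hypothetical and are not taken from any organization mentioned in this talk.

```python
def conservative_annual_savings(annual_fraud_losses: float,
                                expected_reduction: float,
                                haircut: float = 0.5) -> float:
    """Rough annual savings from reducing fraudulent claims.

    annual_fraud_losses: current yearly losses to fraud, in dollars.
    expected_reduction:  fraction of those losses the model might remove
                         (0.01 means a 1% reduction).
    haircut:             assumed discount applied to stay conservative,
                         for example 0.5 to count only half the estimate.
    """
    if not 0.0 <= expected_reduction <= 1.0:
        raise ValueError("expected_reduction must be between 0 and 1")
    if not 0.0 <= haircut <= 1.0:
        raise ValueError("haircut must be between 0 and 1")
    return annual_fraud_losses * expected_reduction * haircut


def worth_pursuing(savings: float, project_cost: float,
                   required_multiple: float = 3.0) -> bool:
    """Simple go/no-go: savings should clear the cost by an assumed multiple."""
    return savings >= project_cost * required_multiple
```

For instance, with an assumed $200M in annual fraud losses, a 1% reduction, and a 50% haircut, the estimate is about $1M per year, which can then be compared against the assumed cost of the project. The point is the order of magnitude, not the precision.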
This is where the shape of the final deliverable is agreed upon. It’s always possible to amend the agreed-upon deliverable or to have multiple, but visualizing the ultimate consumption medium and working backwards is key. Are you building a one-off answer to support a strategic decision, a standalone lightweight app for stakeholders to use, or a real-time data product that integrates into other systems? The best organizations start simple, get the result into the business, then learn and measure before updating the model with a more sophisticated approach (more features, more complex algorithm, deeper integration).
Create multiple mock-ups of different deliverable types - A leading e-commerce company creates 3-5 mocks for every data science project they take on, even bringing in a designer to make it feel real. For example, they discovered exposing their model as a HipChat bot was the most user-friendly way to leverage the model. By iterating on design possibilities before they get data, they ensure they’ve surfaced any previously undiscovered requirements and maximize their odds of adoption.
Bring IT and engineering stakeholders in early - A model may work spectacularly in the lab, but not have any hope of ever working in production the way envisioned by the business. IT and engineering stakeholders need a seat at the table this early in order to identify constraints like “We only backfill that data monthly from the vendor, so we can’t do a real-time scoring engine.”
Consider creating synthetic data with baseline models - Some organizations even create synthetic data and naive baseline models to show how the model would impact existing business processes. A leading agriculture company devotes an entire team to creating synthetic “perfect” data (e.g., no nulls, full history, realistic distribution) to establish potential value with the business before they go contract with expensive satellite data providers to get “real” data.
Establish standard software configurations, but give flexibility to experiment: Data scientists can often spend the first 8 weeks on the job configuring their workstation rather than exploring existing work and understanding their stakeholders’ priorities. Having a few standard environments gets people onboarded faster. Yet it’s important they retain flexibility to try new tools and techniques. The tool acquisition process can be so arduous that some data scientists covertly bring their personal machines to work so they don’t have to wait 8 months for a Python package to be approved. Technologies like Docker can eliminate much of this headache.
Abstract away compute provisioning: Data scientists can wait weeks or even months to get the hardware necessary to accelerate their workflows. At Monsanto, they were able to take a research task that took 24 hours to run and complete it in 30 minutes by running it in parallel across dozens of EC2 machines.
Build simple models first: Resist the temptation to use 500 features. One company we know did this, spent weeks engineering the features and tuning the hyperparameters, only to learn that many of them were either a) not collected in real time, so they couldn’t be used in the target use case, or b) not allowed for compliance reasons. They ended up using a simple five-feature model and then working with their IT team to capture other data in real time.
Set a cadence for delivering insights: The most common failure mode is that data science delivers results that are either too late or don’t fit into how the business works today, so the results gather dust. Share insights early and often. One leading organization has their data scientists share an insight every 3-4 days. If they can’t publish a short post on incremental findings in business-friendly language, then chances are they are down a rabbit hole. This lets the manager coach more junior or academically oriented team members, plus gives stakeholders an easily consumable timeline of progress.
Ensure business KPI tracked consistently over time: Too often, data scientists lose sight of the business KPI they are trying to affect and instead focus on a narrow statistical measure. Leading teams ensure that the relevant KPI is never far from their experiments, whether it’s the Sharpe ratio of a hedge fund’s backtest or the Customer Acquisition Cost for an e-commerce company.
More than just code review, get stakeholder and IT sign-off: This helps prevent delays in the process of delivery and avoids user adoption hurdles down the road.
Ensure reproducibility and clear lineage of project: Quality validation entails dissecting a model and checking assumptions and sensitivities. This is nearly impossible if a validator spends 90% of their time just gathering documentation.
Use automated validation checks to support human inspection: While data science’s non-deterministic nature means that unit testing does not directly apply, there are often repeated steps in a validation process that can be automated. That may be a set of summary statistics and charts, a portfolio backtest, or any other step that could be turned into an automated diagnostic.
Preserve null results: Even if a project yields no material uplift and doesn’t get deployed into production, it’s critical to document it and preserve it in the same knowledge repo. Too often, we hear of data scientists redoing work that someone else already explored, with no knowledge of the previous inquiries.
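The automated validation checks mentioned above can start very small. The following Python sketch, using only the standard library, computes a fixed set of summary statistics for two sets of model outputs and flags the ones that moved by more than a tolerance. The function names and the 10% tolerance are assumptions for illustration; a real validation suite would add charts, backtests, and domain-specific checks.

```python
import statistics
from typing import Dict, List, Sequence


def summary_diagnostics(values: Sequence[float]) -> Dict[str, float]:
    """Compute a fixed set of summary statistics for a non-empty sample."""
    return {
        "n": float(len(values)),
        "mean": statistics.fmean(values),
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
        "min": float(min(values)),
        "max": float(max(values)),
    }


def flag_shifts(reference: Dict[str, float],
                candidate: Dict[str, float],
                tolerance: float = 0.10) -> List[str]:
    """Return the statistics whose relative change exceeds the tolerance."""
    flagged = []
    for name in ("mean", "stdev"):
        base = reference[name]
        if base == 0:
            # Relative change is undefined; flag any move away from zero.
            if candidate[name] != 0:
                flagged.append(name)
            continue
        if abs(candidate[name] - base) / abs(base) > tolerance:
            flagged.append(name)
    return flagged
```

The output is meant to direct a human validator's attention, not to replace the inspection itself.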
Support for many deliverable artifacts (reports, dashboards, apps, batch APIs, real-time APIs): While real-time scoring gets all the glory, the vast majority of models will at one time or another be prototype apps, dashboards, or batch scoring engines. It’s important to keep a link between all those deliverables because it saves time and avoids the risk that key feedback is lost.
Have a promote-to-production workflow: Too often data science teams throw a result over the wall. If you establish the workflow ahead of time, you lower the burden for iterating on a new version of the model. Know what environments and packages are acceptable in production. Know who can make those decisions and what the escalation path is.
Flag upstream and downstream dependencies: A model is at its riskiest when it finally makes it to production. Ensure that you know the upstream dependencies: what training data was used, what transformations were done with what tools, what modeling packages were used, etc. Also make sure you know the downstream dependencies (e.g., this nightly batch model is stacked on another model).
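One lightweight way to flag upstream and downstream dependencies is to keep a small manifest next to each deployed model. The Python sketch below is an illustration only: the `ModelManifest` fields and the `downstream_of` helper are hypothetical names, and a data science platform or model registry would normally capture this information automatically.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ModelManifest:
    """Record of what a deployed model was built from (hypothetical schema)."""
    name: str
    training_data: str                                        # dataset snapshot used
    transformations: List[str] = field(default_factory=list)  # ordered prep steps
    packages: Dict[str, str] = field(default_factory=dict)    # package -> version
    upstream_models: List[str] = field(default_factory=list)  # models this one consumes


def downstream_of(target: str, manifests: List[ModelManifest]) -> List[str]:
    """Names of models that consume the target model's output."""
    return [m.name for m in manifests if target in m.upstream_models]
```

Before retraining or retiring a model, a call such as `downstream_of("churn_score", manifests)` would list the models stacked on top of it, so their owners can be told first.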
Monitoring is often forgotten because getting to delivery is so hard. It’s important to monitor not just the system performance (uptime, latency), but also the usage (more or less than expected) and the statistical performance (is the model degrading?). IT often owns this process but is ill-equipped, with traditional tools like New Relic and little context from the data scientists who handed off the model.
Build testing into all major deliverables - One leading organization established a global holdout group from all of their customer segmentation and price elasticity models. After a year, they compared the average revenue from the holdout group to that of the customers whose experience was guided by the predictive models. The overall lift was more than $1 billion, which gave them the credibility to dramatically expand the team and push models into more steps of the customer journey.
Require monitoring plans for proactive alerting, acceptable uses, and notification thresholds - The data scientist who created the model is the one best positioned to know what risks are inherent in their approach. Rather than wait for the business to notice something is wrong or a metric to drift, codify that knowledge into your monitoring system. Do you expect certain input types and ranges? If an input is outside of those, what should you do? Roll back? Stop serving predictions? What if someone in a totally different department starts consuming the model in a way that may be risky or outright wrong? Working collaboratively with IT or engineering, data scientists can put the appropriate guardrails on their creations.
Integrate with tools where people spend most of their time (e.g., email / Slack) - High-performing teams realize that monitoring is only good if someone acknowledges, inspects, and changes behavior if necessary. We’ve seen organizations build alerts into chatbots or email systems to ensure they can keep up with the alerts as their number of production models scales.
Anticipate risk and change management burdens - One large insurer has a team called Business Analytics Engineers who proactively assess and address change management problems when they deploy a data science product like a new claims pricing app. They cover things like training, provide pre-determined feedback channels, and measure usage and engagement to ensure success.
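The monitoring plan described above (expected input ranges, notification thresholds, and what to do when they are crossed) can be codified directly. This is a minimal Python sketch under stated assumptions: the `MonitoringPlan` fields, the action names, and the threshold values in the usage note are all hypothetical, and the returned action would still need to be wired into whatever alerting channel the team uses.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class MonitoringPlan:
    """Expectations the model's author writes down at hand-off (hypothetical)."""
    feature: str
    min_value: float
    max_value: float
    warn_rate: float  # share of out-of-range inputs that triggers a notification
    stop_rate: float  # share of out-of-range inputs that halts predictions


def evaluate(plan: MonitoringPlan, values: Sequence[float]) -> str:
    """Return 'ok', 'notify', or 'stop_serving' for a batch of inputs."""
    if not values:
        return "ok"
    outside = sum(1 for v in values
                  if not plan.min_value <= v <= plan.max_value)
    rate = outside / len(values)
    if rate >= plan.stop_rate:
        return "stop_serving"
    if rate >= plan.warn_rate:
        return "notify"
    return "ok"
```

With a plan such as `MonitoringPlan("applicant_age", 18, 100, warn_rate=0.05, stop_rate=0.25)`, a batch where one input in ten is out of range returns "notify", and a batch where half are out of range returns "stop_serving". Writing the thresholds down at hand-off is the point; the specific numbers are a judgment call for the model's author.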
Measure everything, including yourself: Ironically, data scientists live in the world of measurement yet rarely turn that lens on themselves. Tracking patterns in aggregate workflows helps create modular templates, disseminate best practices from high-performing teams, and guide investment in internal tooling and people to alleviate bottlenecks. Monsanto, a large AWS customer, tracks more than 350 simultaneous projects across 10 business units with more than 200 data scientists. By looking at the aggregate portfolio, they can see that they may need to devote more resources to hiring data engineers. Another large tech company similarly estimated that across their entire body of work there were fundamentally only 15 “unique” types of problems and set about formulating templates to streamline their process.
Focus on reducing time to iterate: This “feature” is ultimately the best predictor of data science organizational success we’ve seen. Minimal obstacles (without sacrificing rigorous review and checks) to testing real results is another great predictor of data science success. Big tech companies deploy new models in minutes, whereas large financial services companies can take 18 months.
Socialize aggregate portfolio impact: Even if it’s not precise, it’s critical to socialize the impact of the whole portfolio of data science projects. Doing so addresses data scientists’ concerns about impact and helps address executive-level concerns about investing further in data science. Importantly, don’t claim the credit for yourselves; present it as a collective achievement of all the stakeholders.
As mentioned earlier, the full stack data scientist no longer exists and the roles are increasingly specialized. This is a natural evolution that we expect will continue as data science becomes ingrained into the fabric of how organizations function. The most consistent feedback we’ve heard is the increasing demand for a “product manager” type role as most organizations move from delivering mathematical results to stakeholder-facing apps. In large tech organizations, data science sits as a peer to product management to drive strategic priorities and ongoing optimization of engagement and impact.
Most evolve as they scale and as business demands shift: We see many organizations start with a centralized “Center of Excellence” for data science to build their core technical infrastructure before evolving to a hybrid structure. In this structure, the central team focuses on building templates (documentation, software environments, project stage flows) and codifying best practices, while embedded groups sit next to each major business line to address the “bookend” problems of identifying the right data science problem and maximizing adoption of solutions. Sometimes a full data science guild (to borrow Spotify’s term) exists and meets regardless of its members’ day-to-day functional departments. Technology is now much better positioned to help address the pains of decentralization. A data science platform can facilitate technical knowledge sharing, encourage or enforce best practices, and provide transparency while still allowing data scientists to be closer to the business.

Editor’s Notes