Production machine learning_infrastructure

joshwills

Slides from Josh Wills' talk on building machine learning infrastructure at Data Day Texas 2014.

From The Lab to the Factory
Building A Production Machine Learning Infrastructure
Josh Wills, Senior Director of Data Science
Cloudera



1. From The Lab to the Factory: Building A Production Machine Learning Infrastructure. Josh Wills, Senior Director of Data Science, Cloudera

2. What is a Data Scientist?

3. One Definition…

4. …versus Another

5. The Two Kinds of Data Scientists
   • The Lab
     • Statisticians who got really good at programming
     • Neuroscientists, geneticists, etc.
   • The Factory
     • Software engineers who were in the wrong place at the wrong time

6. Data Science In The Lab

7. Data Science as Statistics

8. Investigative Analytics

9. Tools for Investigative Analytics

10. Inputs and Outputs

11. On Actionable Insights

12. Data Science In The Factory

13. Building Data Products

14. A Shift In Perspective
   Analytics in the Lab
   • Question-driven
   • Interactive
   • Ad-hoc, post-hoc
   • Fixed data
   • Focus on speed and flexibility
   • Output is embedded into a report or in-database scoring engine
   Analytics in the Factory
   • Metric-driven
   • Automated
   • Systematic
   • Fluid data
   • Focus on transparency and reliability
   • Output is a production system that makes customer-facing decisions

15. Data Science as Decision Engineering

16. All* Products Become Data Products

17. Sounds Great. So Who Is Doing This?

18. From The Lab To The Factory

19. The Art of Machine Learning

20. A New Kind of Statistics

21. DevOps for Data Science

22. The Model: Information Retrieval

23. From the Lab to the Factory: First Steps

24. Step 1: Choose a Good Problem

25. Step 2: DTSTCPWTM

26. Step 3: Log Everything

27. Step 4: Hire (More) Data Scientists

28. Things We’re Working On

29. The Data Science Workflow

30. Identifying the Bottlenecks

31. Myrrix

32. Oryx: Simple and Scalable ML

33. Generational Thinking

34. Working on the Gaps

35. Space Exploration

36. The Limits of Our Models

37. Gertrude: Experimenting with ML
   • Multivariate Testing
     • Define and explore a space of parameters
   • Overlapping Experiments
     • Tang et al. (2010)
     • Runs multiple independent experiments on every request

38. Simple Conditional Logic
   • Declare experiment flags in compiled code
     • Settings that can vary per request
   • Create a config file that contains simple rules for calculating flag values and rules for experiment diversion

39. Separate Data Push from Code Push
   • Validate config files and push updates to servers
     • Zookeeper via Curator
     • File-based
   • Servers pick up new configs, load them, and update experiment space and flag value calculations

40. Computational Hypothesis Testing

41. The Experiments Dashboard

42. A Few Links I Love
   • http://research.google.com/pubs/pub36500.html
     • The original paper on the overlapping experiments infrastructure at Google
   • http://www.exp-platform.com/
     • Collection of all of Microsoft’s papers and presentations on their experimentation platform
   • http://www.deaneckles.com/blog/596_lossy-betterthan-lossless-in-online-bootstrapping/
     • Dean Eckles on his paper about bootstrapped confidence intervals with multiple dependencies

43. One More Thing

44. A Day In The Life of a Data Scientist

45. On Functional Programming

46. On Lineage

47. Thank you! Josh Wills, Director of Data Science, Cloudera. @josh_wills

Editor's Notes

A popular definition. Also, an example of how correlation != causation.
A vastly superior definition. ;-) See also: http://www.quora.com/Data-Science/What-is-the-difference-between-a-data-scientist-and-a-statistician/answer/Josh-Wills
How I hate this definition.
Question-driven. Interactive. Ad-hoc, post-hoc. Fixed data.
Tools focus on speed and flexibility.
The source of data is the data warehouse, the ultimate source of truth in the enterprise. The outputs are reports, charts, maybe a dashboard or two.
The output that most people seem to want is insights, specifically “actionable insights.” An actionable insight is one that allows us to make a clear decision: a useful correlation between a short-term behavior and a long-term outcome. They are pretty rare. You can basically build an entire business on a handful of actionable insights.
Data scientists love Venn diagrams. Harlan Harris recently created this one to explain data products, and he commented on his definition in this blog post: http://datacommunitydc.org/blog/2013/09/the-data-products-venn-diagram/ Data products combine software, domain expertise, and statistical modeling in order to solve a problem. We can compare data products to the combination of any two of these three aspects:
• One-off analyses done by an analyst or a statistician to help inform a decision are good, but building them into software as repeatable and scalable processes is better.
• BI and stats tools are general purpose; they aren’t optimized for solving a specific problem in your business.
• Rules engines allow you to create maintainable software in the face of frequent policy changes, but they can be made smarter and more robust by bringing modeling and analysis to bear on the decisions they encode.
Curt Monash makes a distinction between investigative analytics (which he defines here: http://www.dbms2.com/2011/03/03/investigative-analytics/ ) and operational analytics that I like, and I expanded it into my own set of differences that I want to walk through here. Investigative analytics is what we think of when we think of traditional BI: there’s an analyst or an executive who is searching for previously unknown patterns in a data set, either by looking at a series of visualizations mediated by database queries, or by applying some statistical models to a prepared data set to tease out some deeper explanations. This is where the vast majority of the BI market is focused right now. Operational analytics, on the other hand, is a nascent market, and I don’t believe the existing BI tools have done a good job of supporting companies that want to start leveraging their modeling and analytical prowess in order to make better decisions in real-time. I’d like to shift some of the conversation and the focus in the market from the lab to the factory.
Every customer interaction results in hundreds of decisions, both by us and by the customer. As interactions with customers move primarily to the digital realm, we have the opportunity to use data and modeling to optimize the very large number of small transactions we engage in with our customers. The number of decisions embedded in this page that would be amenable to statistical modeling and designed experiments is simply enormous: not just the price, but the wording, the images, the use of a timer, the selection of which upsell opportunity is right for the current customer, etc., etc.
* Slightly longer: All products of any consequence will become data products.
Basically nobody. Most models that get deployed to production get there in one of two ways:
• In-database scoring, like for a marketing campaign. This isn’t really “production”: there’s not usually an SLA here or an ops person involved beyond the DBA.
• By taking an existing model definition in SAS or R and converting it (often by hand) into C or Java code for use in a production server. This becomes THE MODEL, which is THE MODEL for the next six months to a year. Because this process is tedious and awful, we don’t do it very often, and it’s not a very glamorous software engineering assignment.
Of course, there are a handful of companies that have been building and deploying models continuously for a while now, but that’s usually because their business depends on it (Google, FB, Twitter, LinkedIn, Amazon, etc.).
Machine learning is not an engineering discipline. Not even close. There are aspects of it that are familiar to software engineers, like pipeline building, but lots of things are lacking.
I suspect that we teach advanced statistics in a way that tends to scare off computer scientists by relying too heavily on parametric models that involve lots of integrals and multivariate calculus, instead of focusing on the non-parametric models that are primarily computational. I would like to create a course that taught advanced statistics (including bootstrapping) without requiring any calculus.
Data science needs devops. If we can’t deploy new code quickly, deploying new models and running experiments quickly isn’t going to happen.
Search is, for me, very much a data product. Daniel Tunkelang, one of the best data scientists in the world, is the head of search quality at LinkedIn. Ranking results is an information retrieval problem. Information retrieval is the model of what I would like to see happen with machine learning: IR made the leap from academic research area to a true engineering discipline that can be tackled by any reasonably clever engineer with Lucene/Solr/ElasticSearch.
A good problem is one that allows you to get fast feedback and take advantage of that feedback to improve your solution. http://uxmag.com/articles/you-are-solving-the-wrong-problem
Do The Simplest Thing That Could Possibly Work. Don’t start with the super-advanced machine learning model until you know that the problem you’re solving is important enough to justify the work involved. A good rule of thumb: choose something that seems laughably simple. You’ll often be surprised at how effective it is, and it will be great material for me to use at other presentations.
Log files are the bread and butter of data science. They are the river Nile: they give life to data science teams. Three reasons:
• Raw and unfiltered: they reflect the reality of an event (usually an action that was taken by a user or a process) as it happened at the time, not mediated by anything else.
• Real-time: Apache Flume can pick log files up and transport them to our Hadoop cluster in a matter of minutes; I don’t need to wait a day for an ETL process to copy operational data into the EDW system before I can start answering questions.
• One of the most important places to log things is where decisions get made: either user decisions that we wish to understand better, or the decision points in our own internal workflows and processes that drive meaningful outcomes. In many businesses, these decision points involve business rules, either directly embedded in a business rules engine, or in code that is acting much like a business rules engine.
The logs will be the primary input to our machine learning models, because they reflect what information was available to the system at the time a decision was made. This is one of the more obvious aspects of doing production machine learning, but it also seems to trip up most people at the get-go: a model that is trained on data that isn’t available to the system at the time a decision is made is at best a useful curiosity and at worst actively harmful.
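To make the point about logging at decision points concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the JSON-lines format, the field names, and the functions `log_decision` and `approve_discount` are all hypothetical, and an in-memory buffer stands in for a real log file. The idea it tries to show is that each record captures the inputs exactly as the system saw them when it decided, so a model trained on the log never sees information that was unavailable at decision time.

```python
import io
import json
import time


def log_decision(stream, decision_point, features, decision, now=None):
    """Append one JSON line describing a decision and the inputs behind it."""
    record = {
        "ts": time.time() if now is None else now,
        "decision_point": decision_point,
        # Only what the system knew when it decided; nothing joined in later.
        "features": features,
        "decision": decision,
    }
    stream.write(json.dumps(record, sort_keys=True) + "\n")
    return record


def approve_discount(features):
    """Hypothetical business rule standing in for a rules engine."""
    return features["cart_total"] >= 50.0 and features["visits_last_30d"] < 3


log = io.StringIO()  # stands in for a log file that a collector would ship
features = {"cart_total": 72.5, "visits_last_30d": 1}
decision = approve_discount(features)
log_decision(log, "discount_offer", features, decision, now=1390000000.0)

# Later, in the lab: the log lines become training rows as-is.
training_rows = [json.loads(line) for line in log.getvalue().splitlines()]
```

Because the features are written at the decision point rather than reconstructed afterwards, the lab and the factory end up looking at the same data.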
If you have meaningful problems to work on and an environment that lets your people iterate on them quickly and try new ideas, you won’t need to try to hire data scientists. They’ll be beating down your door.
Most tools are focused on collapsing the interface between feature extraction and model fitting. We’d like to focus on collapsing the interface between model building and model serving.
Feature creation and model fitting. Lots of folks are focused on this space, because it’s so visible; it’s what data scientists spend most of their time doing, so finding ways to help them do it faster is an obviously good thing to do. But I think that there are other bottlenecks that are less obvious, because they are so narrow we don’t even bother to enter them in the first place, and I think that one of those bottlenecks is between building a model and putting it into production. There are lots of reasons for this, primarily because it’s hard. Companies like Google/FB/LI/etc.
What attracted me to Myrrix wasn’t just the algorithms, because algorithms are commodities, but that they were thinking about these problems in the right way.
Oryx builds models and serves models; that’s it. No visualization, no data munging, none of that stuff: there are plenty of great tools to choose from to help data scientists solve those problems. http://github.com/cloudera/oryx
The idea that feedback will be coming to the system in real-time is built into the computation and serving layers.
There are inevitably rules, tuning parameters, and additional logic that need to get deployed around any model that rolls into production. Because we can’t be completely sure of how all of those parameters and settings will interact with each other and with our customers, we end up running lots of experiments to understand how changes impact user behavior, especially in cases where we can’t necessarily re-create the conditions that would make backtesting of the changes possible (examples of this.)
There is an inevitable gap between the lab environment and the factory, even after we ensure that everyone is operating on the same data sources by logging everything. The gap is that what the model fits is not the same thing as what the business is trying to optimize. (A couple of examples of this.)
Gertrude Cox studied math and statistics at Iowa State University, earning the first master’s degree in statistics ever granted by the university. When they asked her why she decided to study math, she said, “Because it was easy.” #badass
Really simple if-then logic. Easy enough for a data scientist (or even a product manager) to understand.
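As a rough illustration of the kind of rules the Gertrude and Simple Conditional Logic slides describe, here is a minimal Python sketch. The config layout, the flag names (`ranker_weight`, `show_timer`), and the functions `bucket` and `flags_for_request` are all hypothetical; this is not Gertrude's actual config format or API. It assumes each request carries a stable identifier to divert on, and that each layer owns a disjoint set of flags, which is what lets experiments in different layers run on the same request in the style of Tang et al. (2010).

```python
import hashlib

# Hypothetical config: defaults plus one list of experiments per layer.
CONFIG = {
    "flags": {"ranker_weight": 1.0, "show_timer": False},
    "layers": {
        "ranking": [
            {"name": "heavier_weight", "buckets": range(0, 10),
             "set": {"ranker_weight": 1.5}},
        ],
        "ui": [
            {"name": "timer_on", "buckets": range(0, 50),
             "set": {"show_timer": True}},
        ],
    },
}


def bucket(unit_id, layer, num_buckets=100):
    """Hash the unit together with the layer name so layers divert independently."""
    digest = hashlib.sha256(f"{layer}:{unit_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def flags_for_request(unit_id, config=CONFIG):
    """Return (flag values, names of active experiments) for one request."""
    flags = dict(config["flags"])  # start from the defaults
    active = []
    for layer, experiments in config["layers"].items():
        b = bucket(unit_id, layer)
        for exp in experiments:
            if b in exp["buckets"]:  # the simple if-then diversion rule
                flags.update(exp["set"])
                active.append(exp["name"])
                break  # at most one experiment per layer
    return flags, active
```

Calling `flags_for_request("user-42")` twice gives the same answer, so a user stays in the same experiments across requests, and the returned experiment names are exactly what you would want to write into the decision log.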
This is the part of the talk where the ops people freak out a little bit.
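For the file-based variant of separating data push from code push, a minimal Python sketch might look like the following. Everything here is an assumption made for illustration: the JSON layout, the `version` field, and the names `validate`, `push`, and `ConfigWatcher` are hypothetical, and the ZooKeeper-via-Curator option mentioned on the slide is not shown. The sketch validates before pushing, replaces the file atomically, and has the server keep its last good config if a bad one somehow arrives.

```python
import json
import os
import tempfile


def validate(config):
    """Reject configs that would break flag calculation on the servers."""
    if not isinstance(config, dict) or not isinstance(config.get("flags"), dict):
        raise ValueError("config needs a 'flags' mapping")
    if not isinstance(config.get("version"), int):
        raise ValueError("config needs an integer 'version'")
    for exp in config.get("experiments", []):
        unknown = set(exp["set"]) - set(config["flags"])
        if unknown:
            raise ValueError(f"experiment sets undeclared flags: {sorted(unknown)}")
    return config


def push(config, path):
    """Validate, then swap the file atomically so readers never see half a config."""
    validate(config)
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(config, f)
    os.replace(tmp, path)


class ConfigWatcher:
    """Server side: poll the file and load a new version when one appears."""

    def __init__(self, path):
        self.path = path
        self.config = None

    def poll(self):
        """Return True if a new, valid config was loaded."""
        try:
            with open(self.path) as f:
                candidate = validate(json.load(f))
        except (OSError, ValueError, KeyError):
            return False  # keep serving the last good config
        if self.config is not None and candidate["version"] == self.config["version"]:
            return False
        self.config = candidate
        return True
```

The point of the split is that a config change goes through validation and a file swap, not through a build and a deploy, which is what makes it cheap enough to run many experiments.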
Another technique every data scientist should know: http://en.wikipedia.org/wiki/Bootstrapping_(statistics)
Automate metric collection and confidence interval calculation. Make it stupid easy to not just run experiments, but evaluate their performance.
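As an illustration of what automated confidence interval calculation could look like, here is a minimal percentile-bootstrap sketch in Python. The function name `diff_in_means_ci` and its parameters are hypothetical, and the sketch assumes observations are independent within each arm; it does not handle the dependent-data case that the Dean Eckles link discusses.

```python
import random


def diff_in_means_ci(control, treatment, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(treatment) - mean(control)."""
    rng = random.Random(seed)  # seeded so a dashboard shows stable numbers

    def resampled_mean(xs):
        # Draw len(xs) observations with replacement and average them.
        return sum(xs[rng.randrange(len(xs))] for _ in xs) / len(xs)

    diffs = sorted(
        resampled_mean(treatment) - resampled_mean(control)
        for _ in range(n_resamples)
    )
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

If the interval for an experiment excludes zero, that is a reasonable first signal that the change moved the metric. Note that nothing here needs calculus, only resampling and sorting.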
Most of what data scientists do (whether they’re in the lab or the factory) involves cleaning and transforming datasets. But for as much as we talk about this, we know relatively little about the process of what data scientists do and what techniques are most effective on different data sets. And this seems unfortunate to me.
I’ve been spending a lot of time with the Twitter guys, and it’s starting to get to me. Seriously, monads are pretty useful. In particular, the Writer Monad: http://learnyouahaskell.com/for-a-few-monads-more
Playing around with lineage tracking for data transformations in R: https://github.com/jwills/lineage By building logging into our data analysis tools, we can start to analyze the process of analysis. It’s a little meta, I know.
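To show how the Writer idea connects to lineage, here is a small Python sketch. It is not the API of the jwills/lineage package (which is written in R); the names `Traced`, `bind`, and `logged` are hypothetical. Each transformation returns its result together with a log entry, and chaining steps concatenates the logs, so the final value carries a record of how it was produced.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass(frozen=True)
class Traced:
    """A value paired with the log of steps that produced it (Writer-style)."""
    value: Any
    log: List[str] = field(default_factory=list)

    def bind(self, step: Callable[[Any], "Traced"]) -> "Traced":
        """Run the next step on the value and append its log to ours."""
        result = step(self.value)
        return Traced(result.value, self.log + result.log)


def logged(description: str, fn: Callable[[Any], Any]) -> Callable[[Any], Traced]:
    """Lift a plain list-to-list function into a step that records what it did."""
    def step(value):
        out = fn(value)
        return Traced(out, [f"{description}: {len(value)} -> {len(out)} rows"])
    return step


rows = Traced([3, None, 7, None, 12])
cleaned = (
    rows.bind(logged("drop missing", lambda xs: [x for x in xs if x is not None]))
        .bind(logged("keep > 5", lambda xs: [x for x in xs if x > 5]))
)
# cleaned.value is [7, 12]
# cleaned.log is ["drop missing: 5 -> 3 rows", "keep > 5: 3 -> 2 rows"]
```

Collecting logs like `cleaned.log` across many analyses is one way to start analyzing the process of analysis itself.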
