Rental Cars and Industrialized Learning to Rank with Sean Downes


Data can be viewed as the exhaust of online activity. With the rise of cloud-based data platforms, barriers to data storage and transfer have crumbled. The demand for creative applications and learning from those datasets has accelerated. Rapid acceleration can quickly accrue disorder, and disorderly data design can turn the deepest data lake into an impenetrable swamp. In this talk, I will discuss the evolution of the data science workflow at Expedia with a special emphasis on Learning to Rank problems. From the heroic early days of ad-hoc Spark exploration to our first production sort model on the cloud, we will explore the process of industrializing the workflow. Layered over our story, I will share some best practices and suggestions on how to keep your data productive, or even pull your organization out of the data swamp.


Industrializing Data Science Workflows
… and associated Idiosyncratic Operating Principles.
Sean Downes
Sr. Data Scientist @ Expedia, Inc.

The Problem
So you’ve been asked to bring the infrastructure into the cloud.
So your Data Lake is actually a Data Swamp.

The Problem
Login
Impressions
Clicks
Purchases
Every Line of Business has its own Structure
Every MicroService has a log
And you want to A/B Test

The Problem
Can you please turn [image] into [image] using [image]?

Preview
Context / Disclaimer
Lightning Review of Data Platforms
(idiosyncratic) Organizing Principles

Context / Disclaimer
Academic
Theoretical Physicist
We’ve got some work to do.
So…
I’m implicitly assuming this talk will be one in an ensemble of opinions

Lightning Review of Data Platforms
Supercomputers…
… the Commercial Data Center…
… Virtualized Everything
1. Assign Tasks their own virtual hardware
2. Expand / Contract Resources by demand
3. Real-time HotSwapping
4. Software Updates Built In
5. Etc Etc Etc.

idiosyncratic Organizing Principles
iOP1) Clarity
iOP2) Engineers are not Data Scientists
iOP3) PMs are not Data Scientists
iOP4) Data Scientists are not Engineers
iOP5) Close the Data Loop

iOP1: Data Clarity
PUBLISH THIS INTERNALLY!
“big data, big noise”
Where is what data?
Who owns what field?
What is this field?
Where did this field go?
Why is this field NULL?
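
The questions on this slide can be read as the columns of a published data dictionary. The following is a minimal sketch of that idea, not the catalog used at Expedia: the `FieldDoc` class, the `undocumented` helper, the bucket path, and the team names are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class FieldDoc:
    """One published entry per field; each attribute answers a slide question."""
    name: str
    location: str                      # Where is what data?
    owner: str                         # Who owns what field?
    description: str                   # What is this field?
    moved_to: Optional[str] = None     # Where did this field go?
    null_reason: Optional[str] = None  # Why is this field NULL?


def undocumented(columns: Iterable[str], catalog: Dict[str, FieldDoc]) -> List[str]:
    """Return the columns that have no published entry, in sorted order."""
    return sorted(c for c in columns if c not in catalog)


# Hypothetical entries, for illustration only.
CATALOG = {
    doc.name: doc
    for doc in [
        FieldDoc("search_id", "s3://example-bucket/impressions/", "search-team",
                 "Identifier shared by all results of one search"),
        FieldDoc("price", "s3://example-bucket/impressions/", "pricing-team",
                 "Displayed price", null_reason="Supplier returned no quote"),
    ]
}

# Columns observed in a dataset that nobody has documented yet.
missing = undocumented(["search_id", "price", "rank"], CATALOG)
```

A check like `undocumented` could run whenever a new dataset lands, so that a field without an owner or description is flagged before anyone builds on it.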

iOP1: Data Clarity
Minibatch streaming into nested JSON?
O(10kB)?
GZip?
O(50-500 MB)
Parquet.
Snappy.
“Expect Data Science”
Spark
Big Thanks to Jason Pohl @ DB!
And Charles Pritchard!
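
One reading of this slide is that many tiny gzipped JSON files, on the order of 10 kB each, should be compacted into Snappy-compressed Parquet files of roughly 50 to 500 MB. The sketch below only covers the sizing arithmetic. The `plan_output_files` helper, the 256 MB default target, and the example volume are assumptions, and real Parquet output sizes will differ from raw input sizes once columnar encoding and compression are applied.

```python
import math

MB = 1024 * 1024


def plan_output_files(total_bytes: int, target_file_bytes: int = 256 * MB) -> int:
    """Number of output files so that each lands near target_file_bytes (at least 1)."""
    if total_bytes < 0 or target_file_bytes <= 0:
        raise ValueError("total_bytes must be >= 0 and target_file_bytes > 0")
    return max(1, math.ceil(total_bytes / target_file_bytes))


# Hypothetical day of logs: one million minibatch files of about 10 kB each.
small_files = 1_000_000
total = small_files * 10 * 1024
n_files = plan_output_files(total)
avg_file_mb = total / n_files / MB
```

Where PySpark is available, a count like `n_files` could be passed to `DataFrame.repartition` before writing with `df.write.parquet(path, compression="snappy")`.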

iOP2: Engineers are not Data Scientists
“why would you need to do that?”
Scratch Space
Cluster Bootstrap Permissions
Access S3 Buckets
Sandbox Clusters
Share Notebooks Across Accounts
We DO NOT SPEAK IAM Role/Anything

iOP2: Engineers are not Data Scientists
“why would you need to do that?”
if possible:
Write your own Pipelines.
else:
Explain Data Science.

iOP3: PMs are not Data Scientists
“you don’t need that!”
Once upon a time in the Flight DataLake…
Only 10% of a Search Impression was recorded
Worse: It was only the Cheapest 10%
Many of the bookings were not included in this list!
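
The damage done by this kind of truncated logging can be measured directly: count how many bookings point at an item that was actually logged for its search. The sketch below uses made-up searches and bookings, and the helper names `log_cheapest` and `booking_coverage` are hypothetical, not part of any system described in the talk.

```python
import math


def log_cheapest(results, fraction=0.10):
    """Keep only the cheapest `fraction` of (item_id, price) results, at least one."""
    keep = max(1, math.ceil(len(results) * fraction))
    cheapest_first = sorted(results, key=lambda r: r[1])
    return {item for item, _ in cheapest_first[:keep]}


def booking_coverage(logged, bookings):
    """Fraction of bookings whose item appears in the logged results of its search."""
    if not bookings:
        return 0.0
    hits = sum(1 for search_id, item in bookings
               if item in logged.get(search_id, set()))
    return hits / len(bookings)


# Two made-up searches with ten results each, priced 100, 110, ..., 190.
searches = {
    "s1": [(f"a{i}", 100 + 10 * i) for i in range(10)],
    "s2": [(f"b{i}", 100 + 10 * i) for i in range(10)],
}
logged = {sid: log_cheapest(res) for sid, res in searches.items()}

# One traveler booked the cheapest option, the other booked a mid-priced one.
bookings = [("s1", "a0"), ("s2", "b4")]
coverage = booking_coverage(logged, bookings)
```

A coverage well below 1.0 means the positive examples for a ranking model are missing from the training data, which is the problem the slide describes.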

iOP4: Data Scientists are Not Engineers
“we need to support models in @#&%? format”
Pick a Robust Standard and Stick to It.
If you’re big enough to worry about this, you can commit code
jPMML
Everybody Use Git. Now. Yes You.
Production Code Matters. Format. Document.
Pipelines Count as Production Code.

iOP5: Close that Data Loop
What is your data doing?
New Data? Consider Bandits!
Big Data? Set up a learning problem.
Empower by Design.
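
For the "Consider Bandits!" suggestion, an epsilon-greedy policy is one of the simplest ways to let fresh data steer what gets shown next. The talk does not say which bandit algorithm, if any, was used, so the sketch below is only an illustration. The `EpsilonGreedy` class, the click-through rates in `TRUE_CTR`, and the simulated environment are all assumptions.

```python
import random


class EpsilonGreedy:
    """Explore a random arm with probability epsilon, otherwise exploit the best mean."""

    def __init__(self, n_arms, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.counts = [0] * n_arms     # pulls per arm
        self.values = [0.0] * n_arms   # running mean reward per arm
        self._rng = random.Random(seed)

    def select(self):
        if self._rng.random() < self.epsilon:
            return self._rng.randrange(len(self.counts))
        return max(range(len(self.values)), key=self.values.__getitem__)

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental mean: new_mean = old_mean + (reward - old_mean) / n
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


# Hypothetical click-through rates for three candidate sort orders.
TRUE_CTR = [0.02, 0.05, 0.03]
env = random.Random(1)
bandit = EpsilonGreedy(n_arms=len(TRUE_CTR), epsilon=0.1, seed=0)

for _ in range(5000):
    arm = bandit.select()
    reward = 1.0 if env.random() < TRUE_CTR[arm] else 0.0
    bandit.update(arm, reward)
```

After the loop, `bandit.counts` shows how traffic was split across the arms and `bandit.values` holds the estimated reward of each, so every impression served also feeds the next decision.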

Thank You.
