1
ZILLOW | TRULIA | STREETEASY | HOTPADS | NAKED APARTMENTS
Vincent Yates, Director of Analytics Engineering
@VincentYates8
FOUNTAIN OF YOUTH OR POLLUTED SWAMP:
IS YOUR DATA LAKE REVITALIZING YOUR BUSINESS OR ERODING THE FOUNDATION?
2
One of these is worth $42,000 more
Finished sq-ft  2,602    2,602
Lot Size        4,400    5,342
Bathrooms       3        3
Bedrooms        4        4
Year Built      2004     2005
Sale Price      861,000  819,000
3
One of these is worth $164,000 more
Finished sq-ft  1,620    1,620
Lot Size        1,620    1,620
Bathrooms       2.5      3
Bedrooms        3        3
Year Built      2007     2007
Sale Price      499,000  663,000
4
One of these is worth >$10M annually
http://www.exp-platform.com/Pages/SevenRulesofThumbforWebSiteExperimenters.aspx
5
DATA SCIENCE’S DIRTY LITTLE SECRET
6
$3.1 TRILLION
IBM Big Data Hub
7
Unknowns ≠ Seasonality
[Figure: chart with repeated "Seasonality" labels]
8
Seriously
DATA SCIENCE IS HARD
9
Product & Communication
Programming
Statistics
10
24% of data scientists
UNSURE OF HOW MUCH OF THEIR DATA ARE INACCURATE
IBM Big Data Hub
11
Errors Propagate in Dynamic Ways
12
13
66% of data scientists
CLEANING DATA IS THE MOST TIME CONSUMING TASK
CrowdFlower 2015 Data Science Report
14
My data is pretty good.
DOES IT REALLY MATTER?
15
52.3% of data scientists
POOR DATA QUALITY IS THEIR BIGGEST HURDLE
CrowdFlower 2015 Data Science Report
16
The cost of poor data quality
15-25% OF OPERATING PROFIT
Olson, Data Quality: The Accuracy Dimension (Morgan Kaufmann)
17
Someone would have noticed and fixed it
HOW DID WE GET HERE?
18
Cracks start to show under pressure
Data Quality: The Accuracy Dimension
The Morgan Kaufmann Series in Data Management Systems
Operational Integration Replication
19
Complexity/Agility is the scapegoat
Transaction applications, APIs, Third-party data producers
Transaction databases
Data Marts
Data Lake
20
Complexity/Agility is the scapegoat
Transaction applications, APIs, Third-party data producers
Transaction databases
Data Marts
Data Lake
21
Complexity/Agility is the scapegoat
Transaction applications, APIs, Third-party data producers
Transaction databases
Data Marts
Data Lake
22
Complexity/Agility is the scapegoat
23
Complexity/Agility is the scapegoat
24
Complexity/Agility is the scapegoat
Data Marts
Data Lake
25
Moral Hazard is the culprit
26
HOW DO WE GET OUT?
A few simple tricks to head in the right direction
27
PROACTIVE NOT REACTIVE
Data scientists are not great under duress
28
Get Back to Raw Data
29
Centralize Definitions
30
Model Where Possible
31
MODELING IS HARD
Build tools to make reactive work easier
32
33
34
Data Problems are as Old as Data
35
Many mistakes are required for catastrophe
• Climate caused more icebergs
– Ignored Forecasts
• Tides sent icebergs southward
– Poor/Wrong Measurement
• The ship was going too fast
– Business needs over best data
• Iceberg warnings went unheeded
– Data was Disregarded for Intuition
• The binoculars were locked up
– Tools were behind lock and key
• The steersman took a wrong turn
– Reactive action under stress led to wrong decisions
• The iron rivets were too weak
– Cost savings over best data
• There were too few lifeboats
– Marketing owned the message
http://cosmiclog.nbcnews.com/_news/2012/04/01/10970732-10-causes-of-the-titanic-tragedy
36
VincentYa@zillowgroup.com
@VincentYates8
THANK YOU!


Editor's Notes

  1. http://www.ibmbigdatahub.com/infographic/four-vs-big-data
  2. Election Fake News NeXT Market Data
  3. http://www.ibmbigdatahub.com/infographic/four-vs-big-data
  4. https://whatsthebigdata.com/2015/02/11/being-a-data-scientist-in-2015-infographic/
  5. https://whatsthebigdata.com/2015/02/11/being-a-data-scientist-in-2015-infographic/
  6. https://books.google.com/books?id=x8ahL57VOtcC&lpg=PP1&dq=Olson%2C%20J.E.%2C%202003%2C%20Data%20quality%3A%20The%20accuracy%20dimension%2C%20Morgan%20Kaufmann%20publishers%2C%20Burlington&lr&pg=PA8#v=onepage&q&f=false
  7. Restate numbers to public investors. Manually fixing my one problem is much cheaper than fixing the root issue.