by Vincent Yates
Director of Analytics Engineering at Zillow Group
Fountain of Youth or Polluted Swamp: Is your data lake revitalizing your business or eroding the foundation?
We’ve all been promised the Shangri-La that is the data lake: more data means more insights. Synergy! But has it really panned out? The trouble is that data lakes look more like the early days of the internet than a panacea of pristine, useful information. Anyone can publish data, and even with the best of intentions, priorities shift, people leave, and ultimately the priceless data become worthless. Those data may have been reliable when they were first published but are now wrong. Yet, as with many stale webpages, there is no way to tell, and the business continues to rely on those wrong data to make decisions. We at Zillow faced the same problem and decided to change it. I will describe the tools we’ve built and the tenets behind our team to help you ensure your lake rejuvenates your organization. Einstein said it best: “Whoever is careless with the truth in small matters cannot be trusted with important matters.”
1. 1
ZILLOW | TRULIA | STREETEASY | HOTPADS | NAKED APARTMENTS
Vincent Yates, Director of Analytics Engineering
@VincentYates8
FOUNTAIN OF YOUTH OR POLLUTED SWAMP:
IS YOUR DATA LAKE REVITALIZING YOUR BUSINESS OR ERODING THE FOUNDATION?
2. 2
One of these is worth $42,000 more
Finished sq-ft   2,602     2,602
Lot Size         4,400     5,342
Bathrooms        3         3
Bedrooms         4         4
Year Built       2004      2005
Sale Price       861,000   819,000
3. 3
One of these is worth $164,000 more
Finished sq-ft   1,620     1,620
Lot Size         1,620     1,620
Bathrooms        2.5       3
Bedrooms         3         3
Year Built       2007      2007
Sale Price       499,000   663,000
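The arithmetic behind slides 2 and 3 can be checked with a minimal sketch. The values are copied from the two tables; the labels "A" and "B" for the left and right columns, the field names, and the compare helper are assumptions made for illustration, not anything from the talk.

```python
# Listing values copied from slides 2 and 3.
# "A"/"B" are assumed labels for the left and right columns of each table.
SLIDE_2 = {
    "A": {"finished_sqft": 2602, "lot_size": 4400, "bathrooms": 3,
          "bedrooms": 4, "year_built": 2004, "sale_price": 861_000},
    "B": {"finished_sqft": 2602, "lot_size": 5342, "bathrooms": 3,
          "bedrooms": 4, "year_built": 2005, "sale_price": 819_000},
}
SLIDE_3 = {
    "A": {"finished_sqft": 1620, "lot_size": 1620, "bathrooms": 2.5,
          "bedrooms": 3, "year_built": 2007, "sale_price": 499_000},
    "B": {"finished_sqft": 1620, "lot_size": 1620, "bathrooms": 3,
          "bedrooms": 3, "year_built": 2007, "sale_price": 663_000},
}


def compare(pair):
    """Return (attributes that differ, absolute sale-price gap) for two listings."""
    a, b = pair["A"], pair["B"]
    differing = {
        field: (a[field], b[field])
        for field in a
        if field != "sale_price" and a[field] != b[field]
    }
    gap = abs(a["sale_price"] - b["sale_price"])
    return differing, gap


for name, pair in (("slide 2", SLIDE_2), ("slide 3", SLIDE_3)):
    differing, gap = compare(pair)
    print(f"{name}: differing attributes {differing}, price gap ${gap:,}")
```

On slide 2 only lot size and year built differ, and on slide 3 only the bathroom count differs, so a single wrong attribute value in either pair is enough to blur a five- or six-figure price gap.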
4. 4
One of these is worth >$10M annually
http://www.exp-platform.com/Pages/SevenRulesofThumbforWebSiteExperimenters.aspx
18. 18
Cracks start to show under pressure
Data Quality: The Accuracy Dimension
The Morgan Kaufmann Series in Data Management Systems
Operational Integration Replication
19. 19
Complexity/Agility is the scapegoat
Transaction applications, APIs, third-party data producers
Transaction databases
Data Marts
Data Lake
35. 35
Many mistakes are required for catastrophe
• Climate caused more icebergs
– Ignored Forecasts
• Tides sent icebergs southward
– Poor/Wrong Measurement
• The ship was going too fast
– Business needs over best data
• Iceberg warnings went unheeded
– Data was Disregarded for Intuition
• The binoculars were locked up
– Tools were behind lock and key
• The steersman took a wrong turn
– Reactive action under stress led to wrong decisions
• The iron rivets were too weak
– Cost savings over best data
• There were too few lifeboats
– Marketing owned the message
http://cosmiclog.nbcnews.com/_news/2012/04/01/10970732-10-causes-of-the-titanic-tragedy