Tackling Data Curation in
Three Generations
Mike Stonebraker
Silos everywhere….
The Current State of Affairs
By the Numbers
Number of data stores in a typical enterprise: 5,000
Number of data stores in a LARGE telco company: 10,000
• Enterprises are divided into business units, which are typically
independent
• With independent data stores
• One large money center bank had hundreds
• The last time I looked
Why so many data stores?
• Enterprises buy other enterprises
• With great regularity
• Such acquired silos are difficult to remove
• Customer contracts
• Different mechanisms for treating employees, retirees ….
Why so many data stores?
• CFO’s budget is on a spreadsheet on his PC
• Lots of Excel data
• And there is public data from the web with business value
• Weather, population, census tracts, ZIP codes …
• Data.gov
Not to Mention . . .
• Business units are independent
• Different customer ids, product ids, …
• Enterprises have tried to construct such models in the past…..
• Multi-year project
• Out-of-date on day 1 of the project, let alone on the proposed
completion date
• Standards are difficult
• Remember how difficult it is to stamp out multiple DBMSs in an
enterprise
• Let alone Macs…
And there is NO Global Data Model
• The sins of your predecessors
• Your CEO is not in IT
• May not have the COBOL source code
• Politics
• Data is power
Lots of Silos is a Fact of Life
• Cross selling
• Combining procurement orders
• To get better pricing
• Social networking
• People working on the same thing
• Rollups/better information
• How many employees do we have?
• Etc….
Why Integrate Silos?
• Biggest problem facing many
enterprises
Data Integration is a VERY Big Deal
• Ingest
• The data source
• Validate
• Have to get rid of (or correct) garbage
• Transform
• E.g., Euros to dollars; airport code to city name
• Match Schemas
• Your salary is my wages
• Consolidate (dedup)(entity resolution)
• E.g., Mike Stonebraker and Michael Stonebraker
Requirement: Data Curation
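The schema-matching and consolidation steps above can be sketched in a few lines of Python. This is a minimal illustration, not how any product does it: the synonym table, the nickname table, and the function names are all hypothetical, and real entity resolution needs far more than a lookup.

```python
# Hypothetical synonym table: source attribute name -> global attribute name.
ATTRIBUTE_SYNONYMS = {"wages": "salary", "salary": "salary",
                      "employee": "name", "name": "name"}

# Hypothetical nickname table used to normalize first names before comparing.
NICKNAMES = {"mike": "michael"}


def match_schema(record):
    """Rename each attribute to its global name; unknown attributes pass through."""
    return {ATTRIBUTE_SYNONYMS.get(key.lower(), key): value
            for key, value in record.items()}


def entity_key(name):
    """Build a crude comparison key: lowercase tokens with nicknames expanded."""
    return " ".join(NICKNAMES.get(token, token) for token in name.lower().split())


def consolidate(records):
    """Keep the first record seen for each entity key (a naive dedup rule)."""
    merged = {}
    for record in records:
        merged.setdefault(entity_key(record["name"]), record)
    return list(merged.values())


sources = [
    {"name": "Mike Stonebraker", "salary": 100},
    {"employee": "Michael Stonebraker", "wages": 100},
]
curated = consolidate([match_schema(r) for r in sources])
```

Both input records collapse into one because "your salary is my wages" is resolved by the synonym table and "Mike" is expanded to "Michael" before comparison.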
• Gen 1 (1990s): Traditional ETL
• Gen 2 (2000s): ETL on steroids
• Gen 3 (appearing now): Scalable Data Curation
Three Generations of Data Curation Products
• Retail sector started integrating sales data into a data warehouse in the
mid-1990s
• To make better stock decisions
• Pet rocks are out, Barbie dolls are in
• Tie up the Barbie doll factory with a big order
• Send the pet rocks back or discount them up front
• Warehouse paid for itself within 6 months with smarter buying
decisions!
Gen 1 (Early Data Warehouses)
• Essentially all enterprises followed suit and built warehouses of
customer-facing data
• Serviced by so-called Extract-Transform-and-Load (ETL) tools
The Pile-On
• Average system was 2-3X over budget
• and 2-3X late
• Because of data integration headaches
The Dark Side . . .
• Bought $100K of widgets from IBM, Inc.
• Bought 800K Euros of m-widgets from IBM, SA
• Bought -9999 of *wids* from 500 Madison Ave., NY, NY 10022
• Insufficient/incomplete meta-data: May not know that 800K is in Euros
• Missing data: -9999 is a code for “I don’t know”
• Dirty data: *wids* means what?
Why is Data Integration Hard?
• Bought $100K of widgets from IBM, Inc.
• Bought 800K Euros of m-widgets from IBM, SA
• Bought -9999 of *wids* from 500 Madison Ave., NY, NY 10022
• Disparate fields: Have to translate currencies to a common form
• Entity resolution: Is IBM, SA the same as IBM, Inc.?
• Entity resolution: Are m-widgets the same as widgets?
Why is Data Integration Hard?
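A small Python sketch of the three purchase records makes the problems concrete. The exchange rate, field names, and function name are assumptions made for illustration; only the records and the -9999 code come from the slides.

```python
MISSING_CODE = -9999   # the "I don't know" sentinel from the slide
EUR_TO_USD = 1.25      # assumed exchange rate, for illustration only


def amount_in_usd(amount, currency):
    """Return the amount in dollars, or None when it cannot be determined."""
    if amount == MISSING_CODE:
        return None            # missing data: never treat the code as a number
    if currency == "USD":
        return float(amount)
    if currency == "EUR":
        return amount * EUR_TO_USD
    return None                # incomplete metadata: currency unknown


purchases = [
    {"supplier": "IBM, Inc.", "item": "widgets",
     "amount": 100_000, "currency": "USD"},
    {"supplier": "IBM, SA", "item": "m-widgets",
     "amount": 800_000, "currency": "EUR"},
    {"supplier": "500 Madison Ave., NY, NY 10022", "item": "*wids*",
     "amount": -9999, "currency": None},
]
usd_amounts = [amount_in_usd(p["amount"], p["currency"]) for p in purchases]
```

The conversion only works because the sketch records the currency explicitly; if that field were absent, 800K would silently be summed as dollars. Whether the two IBM suppliers or the two widget items are the same entity is not something this code can decide.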
[Diagram: local data source(s), each with a local schema, feed through ETL into a data warehouse with a global schema]
ETL Architecture
• Human defines a global schema
• Up front
• Assign a programmer to each data source to
• Understand it
• Write local to global mapping (in a scripting language)
• Write cleaning routine
• Run the ETL
• Scales to (maybe) 25 data sources
• Twist my arm, and I will give you 50
Traditional ETL Wisdom
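As a rough sketch of why this approach stops scaling, the Python below gives each source its own hand-written mapping. The source names, field names, and functions are hypothetical; the point is only that the registry needs a new programmer-written entry for every source.

```python
def map_hr_system(row):
    """Hand-written mapping for one hypothetical source (an HR database)."""
    return {"employee_name": row["name"], "salary_usd": row["salary"]}


def map_payroll_export(row):
    """Hand-written mapping for a second hypothetical source (a payroll export)."""
    return {"employee_name": row["emp"], "salary_usd": row["wages"]}


# One entry per data source: this registry grows linearly with the source count.
LOCAL_TO_GLOBAL = {"hr": map_hr_system, "payroll": map_payroll_export}


def run_etl(batches):
    """Apply each source's mapping and load the rows into one list (the 'warehouse')."""
    warehouse = []
    for source, rows in batches.items():
        mapper = LOCAL_TO_GLOBAL[source]  # KeyError for a source nobody has coded yet
        warehouse.extend(mapper(row) for row in rows)
    return warehouse


warehouse = run_etl({
    "hr": [{"name": "Ann", "salary": 90}],
    "payroll": [{"emp": "Bob", "wages": 80}],
})
```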
• Bigger global schema upfront is really hard
• Too much manual heavy lifting
• By a trained programmer
• No automation
Why?
Gen 2 – Curation Tools Added to ETL
• Deduplication systems
– For addresses, names, …
• Outlier detection for data cleaning
• Standard domains for data cleaning
• …
• Augments the generation 1 architecture
– Still only scales to 25 data sources!
• Enterprises want to integrate more and more data sources
– Milwaukee beer example
• Weather data
• Business analysts have an insatiable demand for “MORE”
Current Situation
• Enterprises want to integrate more and more data sources
– Big Pharma example
• Has a traditional data warehouse of bio assay data
• Has ~3,000 scientists doing “wet” biology and chemistry across multiple
types of experiments
• And writing results in an electronic lab notebook (think 27,000
spreadsheets)
• No standard vocabulary (Is an ICU-50 the same as an ICE-50?) – both are
biophysical parameters of drugs
• No standard units and units may not even be recorded
• No standard language (e.g., English)
• Variable encoding (some results are numeric, some are text, some are
numbers stored as text with text comments!)
Current Situation
• Enterprises want to integrate more and more data sources
– Web aggregator example
• Currently integrating 80,000 web URLs
• With “event” and “things to do” data
• All the standard headaches
– At scale 80,000
Current Situation
• Traditional ETL won’t scale to these kinds of numbers
– Too much manual effort
– I.e., traditional ETL is way too heavyweight!!!
• Also a personnel mismatch
– Are widgets and m-widgets the same thing?
– Only a business expert knows the answer
– The ETL programmer certainly does not!!!!
Current Situation
Gen 3: Scalability
• Must pick the low-hanging fruit automatically
– Machine learning
– Statistics
• Rarely an upfront global schema
– Must build it “bottom up”
• Must involve human (non-programmer) experts to help with the
cleaning
Tamr is an example of this 3rd generation!
[Diagram: Tamr components: ingest, schema integration, crowd sourcing, de-duplication, vis/xform, and cleaning, with a Tamr console and an RDBMS]
Tamr Architecture
• Starts integrating data sources
– Using synonyms, templates, and authoritative tables for help
– 1st couple of sources may require help from the human experts
– System learns over time and gets better and better
Tamr – Schema Integration
Tamr – Schema Integration
• Inner loop is a collection of “experts” (programs)
• T-test on the data
• Cosine similarity on attribute names
• Cosine similarity on the data
• Scores combined heuristically
• After modest training, gets 90+% of the matching attributes
automatically
• In several domains
• Cuts human cost dramatically!!!
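One of the "experts" above, cosine similarity on attribute names, can be sketched as follows. The slides do not say how names are turned into vectors or how scores are combined, so the character-trigram counts and the weighted average here are assumptions, and all function names are hypothetical.

```python
import math
from collections import Counter


def trigrams(text):
    """Count character trigrams of a lowercased, space-padded attribute name."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0


def name_score(left, right):
    """Similarity of two attribute names, between 0.0 and 1.0."""
    return cosine_similarity(trigrams(left), trigrams(right))


def combined_score(name_sim, data_sim, name_weight=0.5):
    """A weighted average, standing in for the heuristic combination of experts."""
    return name_weight * name_sim + (1 - name_weight) * data_sim
```

With this sketch, `name_score("cust_name", "customer_name")` is well above `name_score("cust_name", "zip_code")`, which shares no trigrams and scores 0.0. A second expert comparing the data values would supply `data_sim`.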
• Hierarchy of experts
• With specializations
• With algorithms to adjust the “expertness” of experts
• And a marketplace to perform load balancing
• Working well at scale!!!
• Biggest problem: getting the experts to participate.
Tamr – Expert Sourcing
• Can adjust the threshold for automatic acceptance
• Cost-accuracy tradeoff
• Even if a human checks everything (threshold is certainty), you still
save money -- Tamr organizes the information and makes humans
more productive
Tamr – Entity Consolidation
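The acceptance threshold can be pictured as a simple triage step. This is a sketch under assumptions: the scores are made up, the function name is hypothetical, and a real system would get its scores from a trained model.

```python
def triage(candidate_pairs, threshold):
    """Split scored candidate matches into auto-accepted pairs and pairs for an expert."""
    accepted, for_review = [], []
    for pair, score in candidate_pairs:
        (accepted if score >= threshold else for_review).append(pair)
    return accepted, for_review


# Hypothetical match scores in [0, 1].
scored = [
    (("IBM, Inc.", "IBM, SA"), 0.92),
    (("widgets", "m-widgets"), 0.70),
    (("IBM, Inc.", "500 Madison Ave."), 0.40),
]

auto, manual = triage(scored, threshold=0.9)
```

Lowering the threshold moves pairs from `manual` to `auto`, trading accuracy for lower human cost. Setting it above the maximum score sends every pair to a human, which is the "human checks everything" case.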
• A major consolidator of financial data
• Entity consolidation and expert sourcing on a collection of internal
and external sources
• ROI relative to existing homebrew system
• A major manufacturing conglomerate
• Combine disparate ERP systems
• ROI is better procurement
Tamr Customer Success Stories
• A major bio-pharm company
• Combining inputs from 2000 medical-diagnostic pieces of
equipment by equipment type
• Decision support – how is stuff used?
• ROI is order-of-magnitude faster integration
• A major car company
• Customer data from multiple countries in Europe
• ROI is better marketing across a continent
• ROI is more effective sales engagement
Tamr Customer Success Stories
• Text sources
• Relationships
• More adaptors for different data sources and sinks
• Better algorithms
• User-defined operations
• For popular cleaning tools like Google Refine
• Web transformation tool
• Syntactic transformations (e.g., dates)
• Semantic transformations (e.g., airport codes)
Tamr Future
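The difference between syntactic and semantic transformations can be shown with a short Python sketch. The accepted date formats, the three-entry airport table, and the function names are illustrative assumptions, not part of any product.

```python
from datetime import datetime

DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")  # assumed input formats

# Tiny illustrative lookup; a real table would come from an authoritative source.
AIRPORT_CITY = {"BOS": "Boston", "SFO": "San Francisco", "JFK": "New York"}


def normalize_date(text):
    """Syntactic transformation: rewrite a date string as ISO 8601, or return None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def airport_to_city(code):
    """Semantic transformation: map an airport code to a city name, or return None."""
    return AIRPORT_CITY.get(code.strip().upper())
```

The date rewrite needs only knowledge of formats, while the airport lookup needs outside knowledge in the form of a reference table, which is why the two are listed separately.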
www.tamr.com
Thank you!

Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024
 
Extra-120324-Visite-Entreprise-icare.pdf
Extra-120324-Visite-Entreprise-icare.pdfExtra-120324-Visite-Entreprise-icare.pdf
Extra-120324-Visite-Entreprise-icare.pdf
 
UiPath Studio Web workshop series - Day 2
UiPath Studio Web workshop series - Day 2UiPath Studio Web workshop series - Day 2
UiPath Studio Web workshop series - Day 2
 
Technical SEO for Improved Accessibility WTS FEST
Technical SEO for Improved Accessibility  WTS FESTTechnical SEO for Improved Accessibility  WTS FEST
Technical SEO for Improved Accessibility WTS FEST
 
GraphSummit Copenhagen 2024 - Neo4j Vision and Roadmap.pptx
GraphSummit Copenhagen 2024 - Neo4j Vision and Roadmap.pptxGraphSummit Copenhagen 2024 - Neo4j Vision and Roadmap.pptx
GraphSummit Copenhagen 2024 - Neo4j Vision and Roadmap.pptx
 
The Importance of Indoor Air Quality (English)
The Importance of Indoor Air Quality (English)The Importance of Indoor Air Quality (English)
The Importance of Indoor Air Quality (English)
 
Planetek Italia Srl - Corporate Profile Brochure
Planetek Italia Srl - Corporate Profile BrochurePlanetek Italia Srl - Corporate Profile Brochure
Planetek Italia Srl - Corporate Profile Brochure
 
Webinar: The Art of Prioritizing Your Product Roadmap by AWS Sr PM - Tech
Webinar: The Art of Prioritizing Your Product Roadmap by AWS Sr PM - TechWebinar: The Art of Prioritizing Your Product Roadmap by AWS Sr PM - Tech
Webinar: The Art of Prioritizing Your Product Roadmap by AWS Sr PM - Tech
 
EMEA What is ThousandEyes? Webinar
EMEA What is ThousandEyes? WebinarEMEA What is ThousandEyes? Webinar
EMEA What is ThousandEyes? Webinar
 

Tamr | Strata Hadoop 2014 | Michael Stonebraker

  • 1. Tackling Data Curation in Three Generations Mike Stonebraker
  • 3. By the Numbers • Number of data stores in a typical enterprise: 5,000 • Number of data stores in a LARGE telco company: 10,000
  • 4. Why so many data stores? • Enterprises are divided into business units, which are typically independent • With independent data stores • One large money center bank had hundreds • The last time I looked
  • 5. Why so many data stores? • Enterprises buy other enterprises • With great regularity • Such acquired silos are difficult to remove • Customer contracts • Different mechanisms for treating employees, retirees ….
  • 6. Not to Mention . . . • CFO’s budget is on a spreadsheet on his PC • Lots of Excel data • And there is public data from the web with business value • Weather, population, census tracts, ZIP codes … • Data.gov
  • 7. And there is NO Global Data Model • Business units are independent • Different customer ids, product ids, … • Enterprises have tried to construct such models in the past….. • Multi-year project • Out-of-date on day 1 of the project, let alone on the proposed completion date • Standards are difficult • Remember how difficult it is to stamp out multiple DBMSs in an enterprise • Let alone Macs…
  • 8. Lots of Silos is a Fact of Life • The sins of your predecessors • Your CEO is not in IT • May not have the COBOL source code • Politics • Data is power
  • 9. Why Integrate Silos? • Cross selling • Combining procurement orders • To get better pricing • Social networking • People working on the same thing • Rollups/better information • How many employees do we have? • Etc….
  • 10. Data Integration is a VERY Big Deal • Biggest problem facing many enterprises
  • 11. Requirement: Data Curation • Ingest • The data source • Validate • Have to get rid of (or correct) garbage • Transform • E.g., Euros to dollars; airport code to city name • Match Schemas • Your salary is my wages • Consolidate (dedup)(entity resolution) • E.g., Mike Stonebraker and Michael Stonebraker
  • 12. Three Generations of Data Curation Products • Gen 1 (1990s): Traditional ETL • Gen 2 (2000s): ETL on steroids • Gen 3 (appearing now): Scalable Data Curation
  • 13. Gen 1 (Early Data Warehouses) • Retail sector started integrating sales data into a data warehouse in the mid-1990s • To make better stock decisions • Pet rocks are out, Barbie dolls are in • Tie up the Barbie doll factory with a big order • Send the pet rocks back or discount them up front • Warehouse paid for itself within 6 months with smarter buying decisions!
  • 14. The Pile-On • Essentially all enterprises followed suit and built warehouses of customer-facing data • Serviced by so-called Extract-Transform-and-Load (ETL) tools
  • 15. The Dark Side . . . • Average system was 2-3X over budget • and 2-3X late • Because of data integration headaches
  • 16. Why is Data Integration Hard? • Bought $100K of widgets from IBM, Inc. • Bought 800K Euros of m-widgets from IBM, SA • Bought -9999 of *wids* from 500 Madison Ave., NY, NY 10022 • Insufficient/incomplete meta-data: May not know that 800K is in Euros • Missing data: -9999 is a code for “I don’t know” • Dirty data: *wids* means what?
  • 17. Why is Data Integration Hard? • Bought $100K of widgets from IBM, Inc. • Bought 800K Euros of m-widgets from IBM, SA • Bought -9999 of *wids* from 500 Madison Ave., NY, NY 10022 • Disparate fields: Have to translate currencies to a common form • Entity resolution: Is IBM, SA the same as IBM, Inc.? • Entity resolution: Are m-widgets the same as widgets?
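The three purchase records on slides 16 and 17 can be turned into a small cleaning sketch. This is only an illustration under stated assumptions: the `Purchase` record, the `normalize` helper, and the exchange rate in `USD_PER_UNIT` are hypothetical and do not come from the talk. It shows two of the headaches named above, a sentinel standing in for missing data and a currency that has to be translated to a common form.

```python
from dataclasses import dataclass
from typing import Optional

MISSING_SENTINEL = -9999  # slide 16: -9999 is a code for "I don't know"
# Hypothetical exchange rates, for illustration only.
USD_PER_UNIT = {"USD": 1.0, "EUR": 1.10}


@dataclass
class Purchase:
    amount: Optional[float]
    currency: Optional[str]  # None when the metadata does not record it
    item: str
    supplier: str


def normalize(p: Purchase) -> Purchase:
    """Return a copy with the sentinel removed and the amount in USD where possible."""
    if p.amount is None or p.amount == MISSING_SENTINEL:
        # Missing data: keep the row, but do not pretend -9999 is a quantity.
        return Purchase(None, None, p.item, p.supplier)
    rate = USD_PER_UNIT.get(p.currency) if p.currency else None
    if rate is None:
        # Incomplete metadata: the amount cannot be converted safely.
        return Purchase(p.amount, None, p.item, p.supplier)
    return Purchase(round(p.amount * rate, 2), "USD", p.item, p.supplier)


records = [
    Purchase(100_000, "USD", "widgets", "IBM, Inc."),
    Purchase(800_000, "EUR", "m-widgets", "IBM, SA"),
    Purchase(-9999, None, "*wids*", "500 Madison Ave., NY, NY 10022"),
]
cleaned = [normalize(r) for r in records]
```

The sketch deliberately leaves the entity-resolution questions (IBM, SA versus IBM, Inc.; m-widgets versus widgets) untouched, since the slides argue those need a business expert rather than a rule.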
  • 18. ETL Architecture • [Diagram: local data source(s), each with a local schema, loaded via ETL into a data warehouse with a global schema]
  • 19. Traditional ETL Wisdom • Human defines a global schema • Up front • Assign a programmer to each data source to • Understand it • Write local to global mapping (in a scripting language) • Write cleaning routine • Run the ETL • Scales to (maybe) 25 data sources • Twist my arm, and I will give you 50
  • 20. Why? • Bigger global schema upfront is really hard • Too much manual heavy lifting • By a trained programmer • No automation
  • 21. Gen 2 – Curation Tools Added to ETL • Deduplication systems – For addresses, names, … • Outlier detection for data cleaning • Standard domains for data cleaning • … • Augments the generation 1 architecture – Still only scales to 25 data sources!
  • 22. Current Situation • Enterprises want to integrate more and more data sources – Milwaukee beer example • Weather data • Business analysts have an insatiable demand for “MORE”
  • 23. Current Situation • Enterprises want to integrate more and more data sources – Big Pharma example • Has a traditional data warehouse of bio assay data • Has ~3,000 scientists doing “wet” biology and chemistry across multiple types of experiments • And writing results in an electronic lab notebook (think 27,000 spreadsheets) • No standard vocabulary (Is an ICU-50 the same as an ICE-50?) – both are biophysical parameters of drugs • No standard units and units may not even be recorded • No standard language (e.g., English) • Variable encoding (some results are numeric, some are text, some are numbers stored as text with text comments!)
  • 24. Current Situation • Enterprises want to integrate more and more data sources – Web aggregator example • Currently integrating 80,000 web URLs • With “event” and “things to do” data • All the standard headaches – At scale
  • 25. Current Situation • Traditional ETL won’t scale to these kinds of numbers – Too much manual effort – I.e., traditional ETL way too heavy-weight!!! • Also a personnel mismatch – Are widgets and m-widgets the same thing? – Only a business expert knows the answer – The ETL programmer certainly does not!!!!
  • 26. Gen 3: Scalability • Must pick the low-hanging fruit automatically – Machine learning – Statistics • Rarely an upfront global schema – Must build it “bottom up” • Must involve human (non-programmer) experts to help with the cleaning • Tamr is an example of this 3rd generation!
  • 28. Tamr – Schema Integration • Starts integrating data sources – Using synonyms, templates, and authoritative tables for help – 1st couple of sources may require help from the human experts – System learns over time and gets better and better
  • 29. Tamr – Schema Integration • Inner loop is a collection of “experts” (programs) • T-test on the data • Cosine similarity on attribute names • Cosine similarity on the data • Scores combined heuristically • After modest training, gets 90+% of the matching attributes automatically • In several domains • Cuts human cost dramatically!!!
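The "collection of experts" idea on slide 29 can be sketched in a few lines. This is not Tamr's implementation, which the slides do not show. It assumes character trigrams for the attribute-name cosine, a bag of raw values for the data cosine, a Welch t-statistic mapped into a score with 1 / (1 + |t|), and a plain average as the "heuristic" combination. All function names are hypothetical.

```python
import math
import statistics
from collections import Counter


def trigrams(text):
    """Character trigram counts of a lower-cased, space-padded string."""
    s = f"  {text.lower()} "
    return Counter(s[i:i + 3] for i in range(len(s) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def name_expert(name_a, name_b):
    """Cosine similarity on attribute names."""
    return cosine(trigrams(name_a), trigrams(name_b))


def value_expert(xs, ys):
    """Cosine similarity on the data, treating each value as a token."""
    return cosine(Counter(str(v).lower() for v in xs), Counter(str(v).lower() for v in ys))


def ttest_expert(xs, ys):
    """Score in [0, 1] from a Welch t-statistic; 1.0 means indistinguishable means."""
    if len(xs) < 2 or len(ys) < 2:
        return 0.0
    se = math.sqrt(statistics.variance(xs) / len(xs) + statistics.variance(ys) / len(ys))
    diff = abs(statistics.mean(xs) - statistics.mean(ys))
    if se == 0:
        return 1.0 if diff == 0 else 0.0
    return 1.0 / (1.0 + diff / se)


def match_score(col_a, col_b):
    """Combine the experts with an unweighted average (an assumed heuristic)."""
    (name_a, xs), (name_b, ys) = col_a, col_b
    scores = [name_expert(name_a, name_b), value_expert(xs, ys), ttest_expert(xs, ys)]
    return sum(scores) / len(scores)


salary = ("salary", [50000, 62000, 58000])
wages = ("wages", [51000, 60000, 57000])
zip_code = ("zip_code", [10022, 2139, 94105])
```

With these toy columns, `match_score(salary, wages)` comes out higher than `match_score(salary, zip_code)` even though the names share no trigrams, because the numeric distributions are close. That is the "your salary is my wages" case from slide 11.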
  • 30. Tamr – Expert Sourcing • Hierarchy of experts • With specializations • With algorithms to adjust the “expertness” of experts • And a marketplace to perform load balancing • Working well at scale!!! • Biggest problem: getting the experts to participate.
  • 31. Tamr – Entity Consolidation • Can adjust the threshold for automatic acceptance • Cost-accuracy tradeoff • Even if a human checks everything (threshold is certainty), you still save money -- Tamr organizes the information and makes humans more productive
  • 32. Tamr Customer Success Stories • A major consolidator of financial data • Entity consolidation and expert sourcing on a collection of internal and external sources • ROI relative to existing homebrew system • A major manufacturing conglomerate • Combine disparate ERP systems • ROI is better procurement
  • 33. Tamr Customer Success Stories • A major bio-pharm company • Combining inputs from 2000 medical-diagnostic pieces of equipment by equipment type • Decision support – how is stuff used? • ROI is order-of-magnitude faster integration • A major car company • Customer data from multiple countries in Europe • ROI is better marketing across a continent • ROI is more effective sales engagement
  • 34. Tamr Future • Text sources • Relationships • More adaptors for different data sources and sinks • Better algorithms • User-defined operations • For popular cleaning tools like Google Refine • Web transformation tool • Syntactic transformations (e.g., dates) • Semantic transformations (e.g., airport codes)
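The acceptance threshold on slide 31 can be illustrated with a minimal triage sketch. It assumes a plain string-similarity score from Python's `difflib` as a stand-in for whatever learned model a real system would use. The `triage` function and the candidate pairs are hypothetical and reuse names from earlier slides.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in match score in [0, 1]; a real system would use a learned model."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def triage(pairs, threshold):
    """Split candidate pairs into auto-accepted matches and ones routed to a human expert."""
    auto, review = [], []
    for a, b in pairs:
        score = similarity(a, b)
        (auto if score >= threshold else review).append((a, b, score))
    return auto, review


pairs = [
    ("Mike Stonebraker", "Michael Stonebraker"),
    ("IBM, Inc.", "IBM, SA"),
    ("widgets", "m-widgets"),
]

# Lower threshold: cheaper, but more risk of wrong merges.
auto, review = triage(pairs, threshold=0.7)
# Threshold above the maximum score: a human checks everything.
_, everything = triage(pairs, threshold=1.01)
```

Raising the threshold moves pairs from `auto` to `review`, which is the cost-accuracy tradeoff on the slide. Note that a purely textual score rates "widgets" and "m-widgets" as very similar, which is one reason the talk wants business experts in the loop.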