SlideShare uma empresa Scribd logo
1 de 22
Using Hadoop to build a Data Quality
Service for both real-time and batch data
Griffin – https://github.com/ebay/griffin
About Us:
• Alex Lv (lzhixing@ebay.com)
Senior Staff Software Engineer – Data Products
Platform & Engineering at eBay
• Amber Vaidya (amvaidya@ebay.com)
Lead Product Manager - Data Products
Platform & Engineering at eBay
Agenda
• Background
• Introduction to Griffin
• Demo
Background
eBay Marketplace at a Glance
Q2 2016 data
$19.8B
GMV in Q2 2016
10M
New listings added via mobile per
week
300M
Searches each day
65%
Transactions that ship for free
(in US, UK, DE)
80%
Items sold as new
1B
Live listings
One of the world’s largest and most vibrant marketplaces
Velocity Stats
US
3 car parts or accessories are sold every
A smartphone is sold every
A dress is sold every
1 sec
4 sec
6 sec
UK
A necklace is sold every
A make-up product is sold every
A Lego product is sold every
10 sec
3 sec
19 sec
GERMANY
A truck or car is sold every
A pair of women’s jeans is sold every
A video game is sold every
5 min
4 sec
11 sec
AUSTRALIA
A pair of men’s sunglasses is sold every
A home décor item is sold every
A car or truck part is sold every
1 min
12 sec
4 sec
Mobile Velocity Stats
US
A woman’s handbag is sold every
A car or truck is sold every
An action figure is sold every
10 sec
5 min
10 sec
UK
A tablet is sold every
A cookware item is sold every
A car is sold every
1 min
6 sec
2 min
GERMANY
A pair of women’s shoes is sold every
A watch is sold every
A tire or car part is sold every
20 sec
48 sec
35 sec
AUSTRALIA
A piece of jewelry is sold every
A baby clothing item is sold every
A motorcycle part is sold every
12 sec
46 sec
51 sec
Big Data @
We manage one of the
largest data platforms in
the world
We utilize one of the
largest data platforms in
the world
Challenging to ensure data quality for such scale!
Challenges at eBay:
• No unified view of data quality across multiple systems and teams
• No shared platform to manage data quality
• No system to measure near real-time data quality
What is Data Quality?
Definition
• How well it meets the expectations of
data consumers
• How well it represents the objects,
events, and concepts it is created to
represent
Dimensions
Completeness
Uniqueness
Timeliness
Validity
Accuracy
Consistency
Core Dimensions
Virtuous Cycle of Data Quality
Define Measure
AnalyzeImprove
• Define the scope, dimensions, goals,
thresholds, etc.
• Measure data quality values
• Analyze data quality results
• Improve data quality
Our Goal
A solution with all the below capabilities
Capability Commercial
DQ software
Open source
DQ software
Support eBay’s scale x x
Data Quality measurement √ x
Support real-time data x x
Support unstructured data x x
Service based API √ x
Data Profiling √ √
Pluggable measurement types x x
Griffin
What is Griffin?
• Data Quality Platform built on Hadoop
and Spark
 Batch data
 Real-time data
 Unstructured data
• A unified process to detect DQ issues
 Incomplete
 Inaccurate
 Invalid
 ……
• An open source solution
https://github.com/ebay/griffin
Data Quality Framework in Griffin
DefineMeasureAnalyze
• Define Data Quality Dimensions
• Define Metrics, Goals, Thresholds
Calculators running on
Source
RDBMS
Accuracy
Completeness
Uniqueness
Timeliness
Validity
Consistency
Metrics
Metrics
Repository
Scorecards
• Scorecard Reports generated and displayed
• Measurement values and quality scores calculated
and stored
• Data quality trending graphs generated
Measure
Repository
Technical Highlights
Component Design
Measure Calculator Example – Accuracy of Viewitem
~300M customer view events per day
Accuracy Calculator
Metric
Target
Source
Item_view
• User Id
• Page Id
• Site Id
• Title
• Date
• ……
𝐴𝑐𝑐𝑢𝑟𝑎𝑐𝑦 𝑅𝑎𝑡𝑒(%) =
𝐶𝑜𝑢𝑛𝑡(𝑠𝑜𝑢𝑟𝑐𝑒.𝑓𝑖𝑒𝑙𝑑1 == 𝑡𝑎𝑟𝑔𝑒𝑡.𝑓𝑖𝑒𝑙𝑑1 && … )
𝐶𝑜𝑢𝑛𝑡(𝑠𝑜𝑢𝑟𝑐𝑒)
X 100%
Use Cases
• Griffin has been deployed in production at eBay and provided the centralized data
quality service for several eBay systems.
~1.2PB 800+M 100+
Data Daily Records Metrics
Now life is easier……
Demo
We are open source
and welcome contributions
Github: https://github.com/eBay/griffin
Blog: http://www.ebaytechblog.com/?p=5877/
Contact: ebay-griffin-dev@googlegroups.com

Mais conteúdo relacionado

Mais procurados

Modularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache SparkModularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache Spark
Databricks
 

Mais procurados (20)

Advanced Design Patterns for Amazon DynamoDB - Workshop (DAT404-R1) - AWS re:...
Advanced Design Patterns for Amazon DynamoDB - Workshop (DAT404-R1) - AWS re:...Advanced Design Patterns for Amazon DynamoDB - Workshop (DAT404-R1) - AWS re:...
Advanced Design Patterns for Amazon DynamoDB - Workshop (DAT404-R1) - AWS re:...
 
Data platform architecture
Data platform architectureData platform architecture
Data platform architecture
 
Modularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache SparkModularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache Spark
 
From flat files to deconstructed database
From flat files to deconstructed databaseFrom flat files to deconstructed database
From flat files to deconstructed database
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Tableau Software - Business Analytics and Data Visualization
Tableau Software - Business Analytics and Data VisualizationTableau Software - Business Analytics and Data Visualization
Tableau Software - Business Analytics and Data Visualization
 
Structuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and StreamingStructuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and Streaming
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
 
Introduction to Azure Data Lake
Introduction to Azure Data LakeIntroduction to Azure Data Lake
Introduction to Azure Data Lake
 
Moving to Databricks & Delta
Moving to Databricks & DeltaMoving to Databricks & Delta
Moving to Databricks & Delta
 
How Expedia’s Entity Graph Powers Global Travel
How Expedia’s Entity Graph Powers Global TravelHow Expedia’s Entity Graph Powers Global Travel
How Expedia’s Entity Graph Powers Global Travel
 
Real-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFiReal-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFi
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things Right
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep dive
 
Inside the InfluxDB storage engine
Inside the InfluxDB storage engineInside the InfluxDB storage engine
Inside the InfluxDB storage engine
 
Vertica
VerticaVertica
Vertica
 
Building robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumBuilding robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and Debezium
 
Your first ClickHouse data warehouse
Your first ClickHouse data warehouseYour first ClickHouse data warehouse
Your first ClickHouse data warehouse
 
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
 
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 

Semelhante a Using Hadoop to build a Data Quality Service for both real-time and batch data

Five Things I Wish I Knew the First Day I Used Tableau
Five Things I Wish I Knew the First Day I Used TableauFive Things I Wish I Knew the First Day I Used Tableau
Five Things I Wish I Knew the First Day I Used Tableau
Ryan Sleeper
 

Semelhante a Using Hadoop to build a Data Quality Service for both real-time and batch data (20)

GOTO Night: Decision Making Based on Machine Learning
GOTO Night: Decision Making Based on Machine LearningGOTO Night: Decision Making Based on Machine Learning
GOTO Night: Decision Making Based on Machine Learning
 
Decision Making based on Machine Learning at Outfittery (W-JAX 2017)
Decision Making based on Machine Learning at Outfittery (W-JAX 2017)Decision Making based on Machine Learning at Outfittery (W-JAX 2017)
Decision Making based on Machine Learning at Outfittery (W-JAX 2017)
 
How we integrate Machine Learning Algorithms into our IT Platform at Outfitte...
How we integrate Machine Learning Algorithms into our IT Platform at Outfitte...How we integrate Machine Learning Algorithms into our IT Platform at Outfitte...
How we integrate Machine Learning Algorithms into our IT Platform at Outfitte...
 
Customer Feedback Analytics for Starbucks
Customer Feedback Analytics for Starbucks Customer Feedback Analytics for Starbucks
Customer Feedback Analytics for Starbucks
 
Keys to Continuous Delivery Success - Mark Warren, Product Director, Perforc...
Keys to Continuous  Delivery Success - Mark Warren, Product Director, Perforc...Keys to Continuous  Delivery Success - Mark Warren, Product Director, Perforc...
Keys to Continuous Delivery Success - Mark Warren, Product Director, Perforc...
 
Five Things I Wish I Knew the First Day I Used Tableau
Five Things I Wish I Knew the First Day I Used TableauFive Things I Wish I Knew the First Day I Used Tableau
Five Things I Wish I Knew the First Day I Used Tableau
 
seven steps to dataops @ dataops.rocks conference Oct 2019
seven steps to dataops @ dataops.rocks conference Oct 2019seven steps to dataops @ dataops.rocks conference Oct 2019
seven steps to dataops @ dataops.rocks conference Oct 2019
 
Washington DC DataOps Meetup -- Nov 2019
Washington DC DataOps Meetup   -- Nov 2019Washington DC DataOps Meetup   -- Nov 2019
Washington DC DataOps Meetup -- Nov 2019
 
Extreme Analytics @ eBay
Extreme Analytics @ eBayExtreme Analytics @ eBay
Extreme Analytics @ eBay
 
Extreme Analytics @ eBay
Extreme Analytics @ eBayExtreme Analytics @ eBay
Extreme Analytics @ eBay
 
Graphs in Action: In-depth look at Neo4j in Production
Graphs in Action: In-depth look at Neo4j in ProductionGraphs in Action: In-depth look at Neo4j in Production
Graphs in Action: In-depth look at Neo4j in Production
 
How To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaHow To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With Kibana
 
Advanced Analytics Implementations at EA scale
Advanced Analytics Implementations at EA scaleAdvanced Analytics Implementations at EA scale
Advanced Analytics Implementations at EA scale
 
Graph processing at scale using spark & graph frames
Graph processing at scale using spark & graph framesGraph processing at scale using spark & graph frames
Graph processing at scale using spark & graph frames
 
"Taming Advanced Analytics Implementations at EA Scale" - Electronic Arts, Di...
"Taming Advanced Analytics Implementations at EA Scale" - Electronic Arts, Di..."Taming Advanced Analytics Implementations at EA Scale" - Electronic Arts, Di...
"Taming Advanced Analytics Implementations at EA Scale" - Electronic Arts, Di...
 
Netflix Recommender System : Big Data Case Study
Netflix Recommender System : Big Data Case StudyNetflix Recommender System : Big Data Case Study
Netflix Recommender System : Big Data Case Study
 
Data science tools - A.Marchev and K.Haralampiev
Data science tools - A.Marchev and K.HaralampievData science tools - A.Marchev and K.Haralampiev
Data science tools - A.Marchev and K.Haralampiev
 
Creating a Data Driven Culture with Amazon QuickSight - Technical 201
Creating a Data Driven Culture with Amazon QuickSight - Technical 201Creating a Data Driven Culture with Amazon QuickSight - Technical 201
Creating a Data Driven Culture with Amazon QuickSight - Technical 201
 
Power to the People: A Stack to Empower Every User to Make Data-Driven Decisions
Power to the People: A Stack to Empower Every User to Make Data-Driven DecisionsPower to the People: A Stack to Empower Every User to Make Data-Driven Decisions
Power to the People: A Stack to Empower Every User to Make Data-Driven Decisions
 
From Labelling Open data images to building a private recommender system
From Labelling Open data images to building a private recommender systemFrom Labelling Open data images to building a private recommender system
From Labelling Open data images to building a private recommender system
 

Mais de DataWorks Summit/Hadoop Summit

How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient
DataWorks Summit/Hadoop Summit
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
DataWorks Summit/Hadoop Summit
 

Mais de DataWorks Summit/Hadoop Summit (20)

Running Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in ProductionRunning Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in Production
 
State of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache ZeppelinState of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache Zeppelin
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science Platform
 
Revolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and ZeppelinRevolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and Zeppelin
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSense
 
Hadoop Crash Course
Hadoop Crash CourseHadoop Crash Course
Hadoop Crash Course
 
Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
 
Dataflow with Apache NiFi
Dataflow with Apache NiFiDataflow with Apache NiFi
Dataflow with Apache NiFi
 
Schema Registry - Set you Data Free
Schema Registry - Set you Data FreeSchema Registry - Set you Data Free
Schema Registry - Set you Data Free
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and ML
 
How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 

Último

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 

Último (20)

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 

Using Hadoop to build a Data Quality Service for both real-time and batch data

  • 1. Using Hadoop to build a Data Quality Service for both real-time and batch data Griffin – https://github.com/ebay/griffin
  • 2. About Us: • Alex Lv (lzhixing@ebay.com) Senior Staff Software Engineer – Data Products Platform & Engineering at eBay • Amber Vaidya (amvaidya@ebay.com) Lead Product Manager - Data Products Platform & Engineering at eBay
  • 5. eBay Marketplace at a Glance Q2 2016 data $19.8B GMV in Q2 2016 10M New listings added via mobile per week 300M Searches each day 65% Transactions that ship for free (in US, UK, DE) 80% Items sold as new 1B Live listings One of the world’s largest and most vibrant marketplaces
  • 6. Velocity Stats US 3 car parts or accessories are sold every A smartphone is sold every A dress is sold every 1 sec 4 sec 6 sec UK A necklace is sold every A make-up product is sold every A Lego product is sold every 10 sec 3 sec 19 sec GERMANY A truck or car is sold every A pair of women’s jeans is sold every A video game is sold every 5 min 4 sec 11 sec AUSTRALIA A pair of men’s sunglasses is sold every A home décor item is sold every A car or truck part is sold every 1 min 12 sec 4 sec
  • 7. Mobile Velocity Stats US A woman’s handbag is sold every A car or truck is sold every An action figure is sold every 10 sec 5 min 10 sec UK A tablet is sold every A cookware item is sold every A car is sold every 1 min 6 sec 2 min GERMANY A pair of women’s shoes is sold every A watch is sold every A tire or car part is sold every 20 sec 48 sec 35 sec AUSTRALIA A piece of jewelry is sold every A baby clothing item is sold every A motorcycle part is sold every 12 sec 46 sec 51 sec
  • 8. Big Data @ We manage one of the largest data platforms in the world We utilize one of the largest data platforms in the world
  • 9. Challenging to ensure data quality for such scale! Challenges at eBay: • No unified view of data quality across multiple systems and teams • No shared platform to manage data quality • No system to measure near real-time data quality
  • 10. What is Data Quality? Definition • How well it meets the expectations of data consumers • How well it represents the objects, events, and concepts it is created to represent Dimensions Completeness Uniqueness Timeliness Validity Accuracy Consistency Core Dimensions
  • 11. Virtuous Cycle of Data Quality Define Measure AnalyzeImprove • Define the scope, dimensions, goals, thresholds, etc. • Measure data quality values • Analyze data quality results • Improve data quality
  • 12. Our Goal A solution with all the below capabilities Capability Commercial DQ software Open source DQ software Support eBay’s scale x x Data Quality measurement √ x Support real-time data x x Support unstructured data x x Service based API √ x Data Profiling √ √ Pluggable measurement types x x
  • 14. What is Griffin? • Data Quality Platform built on Hadoop and Spark  Batch data  Real-time data  Unstructured data • A unified process to detect DQ issues  Incomplete  Inaccurate  Invalid  …… • An open source solution https://github.com/ebay/griffin
  • 15. Data Quality Framework in Griffin DefineMeasureAnalyze • Define Data Quality Dimensions • Define Metrics, Goals, Thresholds Calculators running on Source RDBMS Accuracy Completeness Uniqueness Timeliness Validity Consistency Metrics Metrics Repository Scorecards • Scorecard Reports generated and displayed • Measurement values and quality scores calculated and stored • Data quality trending graphs generated Measure Repository
  • 18. Measure Calculator Example – Accuracy of Viewitem ~300M customer view events per day Accuracy Calculator Metric Target Source Item_view • User Id • Page Id • Site Id • Title • Date • …… 𝐴𝑐𝑐𝑢𝑟𝑎𝑐𝑦 𝑅𝑎𝑡𝑒(%) = 𝐶𝑜𝑢𝑛𝑡(𝑠𝑜𝑢𝑟𝑐𝑒.𝑓𝑖𝑒𝑙𝑑1 == 𝑡𝑎𝑟𝑔𝑒𝑡.𝑓𝑖𝑒𝑙𝑑1 && … ) 𝐶𝑜𝑢𝑛𝑡(𝑠𝑜𝑢𝑟𝑐𝑒) X 100%
  • 19. Use Cases • Griffin has been deployed in production at eBay and provided the centralized data quality service for several eBay systems. ~1.2PB 800+M 100+ Data Daily Records Metrics
  • 20. Now life is easier……
  • 21. Demo
  • 22. We are open source and welcome contributions Github: https://github.com/eBay/griffin Blog: http://www.ebaytechblog.com/?p=5877/ Contact: ebay-griffin-dev@googlegroups.com

Notas do Editor

  1. Business challenges here Previously, we maintain a lot of scripts to measure data quality, they are most run daily to generate the metrics. It’s often too late to recognize data quality issues.
  2. Completeness •Are all data sets and data items recorded? Uniqueness •Is there a single view of the data set? Timeliness •Does the data represent reality from the required point in time? Validity •Does the data match the rules? Accuracy •Does the data reflect the data set? Consistency •Can we match the data set across data stores? In Griffin, we’ve implemented “Accuracy”. Need more volunteers to support the other dimensions.
  3. Griffin covers the stages of Define, Measure, Analyze. Define: select dimension, define measurement, threshold Measure: calculate the metrics, generate reports, scorecards Analyze: download sample data to do RCA of DQ issues
  4. What do you mean by “Data Quality measurement” here?
  5. Now let’s take a look at the flow to do data quality assessment in griffin, the first step is to define the measurement. You need to decide the data quality dimensions you’d like to choose, the metrics, goals and thresholds. After that, it comes to the “Measure” stage, the key component is measure calculators running on spark. Each measurement type need a dedicated calculator. The calculators pick up the datasets from data sources, do calculations and produce metrics values. Finally, with the metrics values, we can generate scorecards and graphs, so that the end users can analyze the data qualities of their data.
  6. Real-time : the measure calculators are based on Kafka streaming, so they can handle near real-time data Fast : we have optimized algorithms for data quality domain, to support various data quality dimensions Massive : We support large scale data hosted on Spark Cluster Pluggable: It’s easy to plug a new measurement type, you just need to deploy your customized jar package in griffin to support new dimensions.
  7. Here’s the diagram of the component design in griffin. All the measurement definitions are stored in measure repository, we have a scheduler to pickup measure jobs according to the definition of the measurement. Then, a measure launcher will be started to process the measure job, it sends out the job to measure calculators in asynchronized way. When the calculations are completed, the launcher will be notified and the metrics values will be persisted. With the metrics values, we can monitor the dq, send out alerts, or generate reports/scorecards. (the reports also need the information from measure definition)
  8. As mentioned previously, the calculators are key components in griffin, so I’d like to give you an example of how calculator works. This’s an example of accuracy calculator for viewitem. Viewitem is event data from ebay sites, it represent users’ view activities on these sites, we have some internal processes to persist such kind of event data. To access the accuracy of the persisted data, we need to compare the data with their source, which are streaming data. The accuracy calculator can pick up both streaming and batch data, when the viewitem data is ready, the calculator runs the algorithm on spark and produce metrics values.
  9. Totally it deals with about 1.2 PB data, and the data volume is still increasing as long as we support more systems. there are more than 800 million daily records, totally we’ve on boarded more than 100 metrics