Griffin is a data quality platform built by eBay on Hadoop and Spark to provide a unified process for detecting data quality issues in both real-time and batch data across multiple systems. It defines common data quality dimensions and metrics, calculates measurement values and quality scores, stores the results, and generates trending reports. Griffin provides a centralized data quality service at eBay and has been deployed in production, processing about 1.2PB of data and 800M+ daily records across 100+ metrics. It is open source, and contributions are welcome.
1. Using Hadoop to build a Data Quality Service for both real-time and batch data
Griffin – https://github.com/ebay/griffin
2. About Us:
• Alex Lv (lzhixing@ebay.com) – Senior Staff Software Engineer, Data Products Platform & Engineering at eBay
• Amber Vaidya (amvaidya@ebay.com) – Lead Product Manager, Data Products Platform & Engineering at eBay
5. eBay Marketplace at a Glance (Q2 2016 data)
• $19.8B – GMV in Q2 2016
• 10M – New listings added via mobile per week
• 300M – Searches each day
• 65% – Transactions that ship for free (in US, UK, DE)
• 80% – Items sold as new
• 1B – Live listings
One of the world’s largest and most vibrant marketplaces
6. Velocity Stats
US
• 3 car parts or accessories are sold every 1 sec
• A smartphone is sold every 4 sec
• A dress is sold every 6 sec
UK
• A necklace is sold every 10 sec
• A make-up product is sold every 3 sec
• A Lego product is sold every 19 sec
GERMANY
• A truck or car is sold every 5 min
• A pair of women’s jeans is sold every 4 sec
• A video game is sold every 11 sec
AUSTRALIA
• A pair of men’s sunglasses is sold every 1 min
• A home décor item is sold every 12 sec
• A car or truck part is sold every 4 sec
7. Mobile Velocity Stats
US
• A woman’s handbag is sold every 10 sec
• A car or truck is sold every 5 min
• An action figure is sold every 10 sec
UK
• A tablet is sold every 1 min
• A cookware item is sold every 6 sec
• A car is sold every 2 min
GERMANY
• A pair of women’s shoes is sold every 20 sec
• A watch is sold every 48 sec
• A tire or car part is sold every 35 sec
AUSTRALIA
• A piece of jewelry is sold every 12 sec
• A baby clothing item is sold every 46 sec
• A motorcycle part is sold every 51 sec
8. Big Data @ eBay
We manage and utilize one of the largest data platforms in the world
9. Ensuring data quality at such scale is challenging!
Challenges at eBay:
• No unified view of data quality across multiple systems and teams
• No shared platform to manage data quality
• No system to measure near real-time data quality
10. What is Data Quality?
Definition
• How well it meets the expectations of data consumers
• How well it represents the objects, events, and concepts it is created to represent
Core Dimensions
• Completeness
• Uniqueness
• Timeliness
• Validity
• Accuracy
• Consistency
11. Virtuous Cycle of Data Quality
Define → Measure → Analyze → Improve → (repeat)
• Define the scope, dimensions, goals, thresholds, etc.
• Measure data quality values
• Analyze data quality results
• Improve data quality
12. Our Goal
A solution with all the capabilities below:

Capability                  | Commercial DQ software | Open source DQ software
Support eBay’s scale        | x                      | x
Data Quality measurement    | √                      | x
Support real-time data      | x                      | x
Support unstructured data   | x                      | x
Service based API           | √                      | x
Data Profiling              | √                      | √
Pluggable measurement types | x                      | x
14. What is Griffin?
• A Data Quality Platform built on Hadoop and Spark
  – Batch data
  – Real-time data
  – Unstructured data
• A unified process to detect DQ issues
  – Incomplete
  – Inaccurate
  – Invalid
  – ……
• An open source solution: https://github.com/ebay/griffin
15. Data Quality Framework in Griffin
Define → Measure → Analyze
Define:
• Define data quality dimensions: Accuracy, Completeness, Uniqueness, Timeliness, Validity, Consistency
• Define metrics, goals, thresholds (stored in the measure repository)
Measure:
• Measure calculators running on Spark pick up datasets from sources (RDBMS, …)
• Measurement values and quality scores calculated and stored in the metrics repository
Analyze:
• Scorecard reports generated and displayed
• Data quality trending graphs generated
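To make the “Define” stage concrete, here is a minimal Scala sketch of what a measure definition might capture; the type and field names (MeasureDefinition, goal, threshold, etc.) are illustrative assumptions, not Griffin’s actual schema:

```scala
// Minimal sketch of what a measure definition in the "Define" stage
// might capture; names are illustrative, not Griffin's actual schema.
object MeasureDefinitionExample {

  // The six core dimensions named on the slide.
  sealed trait Dimension
  case object Accuracy     extends Dimension
  case object Completeness extends Dimension
  case object Uniqueness   extends Dimension
  case object Timeliness   extends Dimension
  case object Validity     extends Dimension
  case object Consistency  extends Dimension

  // A measure ties a dimension to a data source, a goal, and an alert threshold.
  case class MeasureDefinition(
    name: String,
    dimension: Dimension,
    source: String,          // e.g. a Hive table or Kafka topic
    target: Option[String],  // accuracy compares a source against a target
    goal: Double,            // desired score, e.g. 99.9 (%)
    threshold: Double        // alert if the score falls below this value
  )

  def main(args: Array[String]): Unit = {
    val viewitemAccuracy = MeasureDefinition(
      name      = "viewitem_accuracy",
      dimension = Accuracy,
      source    = "kafka://item_view_events",      // hypothetical source URI
      target    = Some("hive://persisted_item_view"),
      goal      = 99.9,
      threshold = 99.0
    )
    println(viewitemAccuracy)
  }
}
```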
18. Measure Calculator Example – Accuracy of Viewitem
~300M customer view events per day
Source: Item_view events (User Id, Page Id, Site Id, Title, Date, ……)
The Accuracy Calculator compares the Source against the Target and produces the Metric:

Accuracy Rate (%) = Count(source.field1 == target.field1 && …) / Count(source) × 100%
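As an illustration of this formula, here is a minimal Spark (Scala) batch sketch that computes an accuracy rate by joining a source dataset against a target on every field; the column names and toy data are assumptions for the example, not the actual viewitem schema:

```scala
// Minimal sketch of an accuracy calculation on Spark, assuming both the
// source (streamed view events) and the target (persisted records) have
// been loaded as DataFrames. Column names are illustrative.
import org.apache.spark.sql.SparkSession

object AccuracyCalculatorSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("viewitem-accuracy-sketch")
      .master("local[*]")   // local mode for the sketch; a real job runs on the cluster
      .getOrCreate()
    import spark.implicits._

    // Stand-in data; in practice these would come from Kafka / Hive.
    val source = Seq((1L, 100L, "t1"), (2L, 100L, "t2"), (3L, 101L, "t3"))
      .toDF("user_id", "page_id", "title")
    val target = Seq((1L, 100L, "t1"), (2L, 100L, "t2"), (3L, 101L, "tX"))
      .toDF("user_id", "page_id", "title")

    // A source row is "accurate" if a target row matches on every field.
    val matched = source.join(target, Seq("user_id", "page_id", "title")).count()
    val total   = source.count()

    val accuracyRate = matched.toDouble / total * 100
    println(f"Accuracy Rate: $accuracyRate%.2f%%")

    spark.stop()
  }
}
```

For the toy data above, 2 of 3 source rows find an exact match in the target, so the sketch prints an accuracy rate of 66.67%.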
19. Use Cases
• Griffin has been deployed in production at eBay and provides a centralized data quality service for several eBay systems.
• ~1.2PB of data, 800+M daily records, 100+ metrics
22. We are open source and welcome contributions
GitHub: https://github.com/eBay/griffin
Blog: http://www.ebaytechblog.com/?p=5877/
Contact: ebay-griffin-dev@googlegroups.com
Editor’s Notes
Business challenges here
Previously, we maintained a lot of scripts to measure data quality; most of them ran daily to generate the metrics, which was often too late to recognize data quality issues.
Completeness
• Are all data sets and data items recorded?
Uniqueness
• Is there a single view of the data set?
Timeliness
• Does the data represent reality from the required point in time?
Validity
• Does the data match the rules?
Accuracy
• Does the data correctly reflect the real-world objects it describes?
Consistency
• Can we match the data set across data stores?
In Griffin, we’ve implemented “Accuracy”. Need more volunteers to support the other dimensions.
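As a hypothetical illustration of what checks for two of the not-yet-implemented dimensions might look like, here is a small Spark (Scala) sketch computing completeness (required field recorded) and uniqueness (distinct rows) rates; the data and column names are invented for the example and this is not Griffin code:

```scala
// Hypothetical sketch of completeness and uniqueness checks on Spark,
// illustrating the dimension definitions above; not Griffin's implementation.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object DimensionChecksSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dimension-checks-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Toy data: one missing item_id, one duplicate row.
    val df = Seq(
      (Some(1L), "a"), (Some(2L), "b"), (None, "c"), (Some(2L), "b")
    ).toDF("item_id", "payload")

    val total = df.count()

    // Completeness: fraction of rows where a required field is recorded.
    val complete = df.filter(col("item_id").isNotNull).count()
    println(f"Completeness: ${complete.toDouble / total * 100}%.1f%%")

    // Uniqueness: fraction of rows that are distinct (a single view of the data).
    val distinct = df.distinct().count()
    println(f"Uniqueness: ${distinct.toDouble / total * 100}%.1f%%")

    spark.stop()
  }
}
```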
Griffin covers the stages of Define, Measure, Analyze.
Define: select dimension, define measurement, threshold
Measure: calculate the metrics, generate reports, scorecards
Analyze: download sample data to do RCA of DQ issues
Now let’s take a look at the flow of data quality assessment in Griffin. The first step is to define the measurement: you need to decide which data quality dimensions to use, along with the metrics, goals, and thresholds.
After that comes the “Measure” stage, whose key components are the measure calculators running on Spark. Each measurement type needs a dedicated calculator. The calculators pick up the datasets from the data sources, do the calculations, and produce metric values.
Finally, with the metric values we can generate scorecards and graphs so that end users can analyze the quality of their data.
Real-time: the measure calculators are based on Kafka streaming, so they can handle near real-time data (see the sketch after this list).
Fast: we have optimized algorithms for the data quality domain to support various data quality dimensions.
Massive: we support large-scale data hosted on the Spark cluster.
Pluggable: it’s easy to plug in a new measurement type; you just deploy your customized jar package in Griffin to support new dimensions.
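Below is a minimal sketch of that near-real-time input path using the standard Spark Streaming Kafka (0.10) integration; the broker address, topic name, and consumer group are hypothetical, and a real calculator would do far more than count records:

```scala
// Minimal sketch of a near-real-time calculator input path: consuming view
// events from Kafka with Spark Streaming. Broker, topic, and group names
// are hypothetical; Griffin's actual wiring differs.
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object StreamingIngestSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("dq-streaming-sketch").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(60))  // one micro-batch per minute

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092",           // hypothetical broker
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "dq-accuracy-calculator"
    )

    val events = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("item_view_events"), kafkaParams)
    )

    // Per micro-batch, count the records that arrived; a real calculator
    // would join against the persisted target and emit an accuracy metric.
    events.map(_.value()).foreachRDD { rdd =>
      println(s"received ${rdd.count()} view events in this batch")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```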
Here’s the diagram of the component design in Griffin.
All the measurement definitions are stored in the measure repository, and a scheduler picks up measure jobs according to the measurement definitions. A measure launcher is then started to process each measure job; it sends the job to the measure calculators asynchronously. When the calculations are completed, the launcher is notified and the metric values are persisted.
With the metric values, we can monitor data quality, send out alerts, or generate reports and scorecards. (The reports also need information from the measure definition.)
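The following is a hypothetical sketch of that scheduler → launcher → calculator flow using plain Scala Futures, with stand-ins for the calculator and the metrics repository; it illustrates the asynchronous dispatch described above, not Griffin’s actual code:

```scala
// Hypothetical sketch of the scheduler -> launcher -> calculator flow,
// using plain Scala Futures; not Griffin's actual implementation.
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import ExecutionContext.Implicits.global

object LauncherFlowSketch {
  case class MeasureJob(name: String)
  case class MetricValue(job: String, score: Double)

  // Stand-in for a measure calculator running remotely on Spark.
  def runCalculator(job: MeasureJob): MetricValue =
    MetricValue(job.name, 99.2)

  // Stand-in for writing to the metrics repository.
  def persist(metric: MetricValue): Unit =
    println(s"persisted ${metric.job} -> ${metric.score}%")

  def main(args: Array[String]): Unit = {
    // The scheduler picks up jobs according to the measure definitions...
    val jobs = Seq(MeasureJob("viewitem_accuracy"), MeasureJob("orders_completeness"))

    // ...and the launcher dispatches them asynchronously; when a calculation
    // completes, the launcher is notified and persists the metric value.
    val done = Future.traverse(jobs) { job =>
      Future(runCalculator(job)).map(persist)
    }
    Await.result(done, 1.minute)
  }
}
```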
As mentioned previously, the calculators are key components in Griffin, so I’d like to give you an example of how a calculator works.
This is an example of the accuracy calculator for viewitem.
Viewitem is event data from eBay sites; it represents users’ view activities on those sites, and we have internal processes to persist this kind of event data.
To assess the accuracy of the persisted data, we need to compare it with its source, which is streaming data. The accuracy calculator can pick up both streaming and batch data: when the viewitem data is ready, the calculator runs the algorithm on Spark and produces metric values.
In total it deals with about 1.2 PB of data, and the data volume keeps increasing as we support more systems. There are more than 800 million daily records, and we’ve onboarded more than 100 metrics in total.