Griffin is a data quality platform built by eBay on Hadoop and Spark to provide a unified process for detecting data quality issues in both real-time and batch data across multiple systems. It defines common data quality dimensions and metrics, calculates measurement values and quality scores, stores the results, and generates trending reports. Griffin provides a centralized data quality service at eBay and has been deployed in production, processing about 1.2PB of data and 800M+ daily records across 100+ metrics. It is open source, and contributions are welcome.
1. Using Hadoop to build a Data Quality Service for both real-time and batch data
Griffin – https://github.com/ebay/griffin
2. About Us:
• Alex Lv (lzhixing@ebay.com) – Senior Staff Software Engineer, Data Products Platform & Engineering at eBay
• Amber Vaidya (amvaidya@ebay.com) – Lead Product Manager, Data Products Platform & Engineering at eBay
5. eBay Marketplace at a Glance (Q2 2016 data)
• $19.8B – GMV in Q2 2016
• 10M – New listings added via mobile per week
• 300M – Searches each day
• 65% – Transactions that ship for free (in US, UK, DE)
• 80% – Items sold as new
• 1B – Live listings
One of the world’s largest and most vibrant marketplaces
6. Velocity Stats
US
• 3 car parts or accessories are sold every 1 sec
• A smartphone is sold every 4 sec
• A dress is sold every 6 sec
UK
• A necklace is sold every 10 sec
• A make-up product is sold every 3 sec
• A Lego product is sold every 19 sec
GERMANY
• A truck or car is sold every 5 min
• A pair of women’s jeans is sold every 4 sec
• A video game is sold every 11 sec
AUSTRALIA
• A pair of men’s sunglasses is sold every 1 min
• A home décor item is sold every 12 sec
• A car or truck part is sold every 4 sec
7. Mobile Velocity Stats
US
• A woman’s handbag is sold every 10 sec
• A car or truck is sold every 5 min
• An action figure is sold every 10 sec
UK
• A tablet is sold every 1 min
• A cookware item is sold every 6 sec
• A car is sold every 2 min
GERMANY
• A pair of women’s shoes is sold every 20 sec
• A watch is sold every 48 sec
• A tire or car part is sold every 35 sec
AUSTRALIA
• A piece of jewelry is sold every 12 sec
• A baby clothing item is sold every 46 sec
• A motorcycle part is sold every 51 sec
8. Big Data @ eBay
We manage and utilize one of the largest data platforms in the world
9. Ensuring data quality at such scale is challenging!
Challenges at eBay:
• No unified view of data quality across multiple systems and teams
• No shared platform to manage data quality
• No system to measure near real-time data quality
10. What is Data Quality?
Definition
• How well it meets the expectations of data consumers
• How well it represents the objects, events, and concepts it is created to represent
Core Dimensions
• Completeness
• Uniqueness
• Timeliness
• Validity
• Accuracy
• Consistency
11. Virtuous Cycle of Data Quality
Define → Measure → Analyze → Improve → (repeat)
• Define the scope, dimensions, goals, thresholds, etc.
• Measure data quality values
• Analyze data quality results
• Improve data quality
12. Our Goal
A solution with all the capabilities below:

Capability                  | Commercial DQ software | Open source DQ software
Support eBay’s scale        | x                      | x
Data Quality measurement    | √                      | x
Support real-time data      | x                      | x
Support unstructured data   | x                      | x
Service based API           | √                      | x
Data Profiling              | √                      | √
Pluggable measurement types | x                      | x
14. What is Griffin?
• A Data Quality Platform built on Hadoop and Spark
  – Batch data
  – Real-time data
  – Unstructured data
• A unified process to detect DQ issues
  – Incomplete
  – Inaccurate
  – Invalid
  – ……
• An open source solution: https://github.com/ebay/griffin
15. Data Quality Framework in Griffin
Define → Measure → Analyze
Define:
• Define data quality dimensions: Accuracy, Completeness, Uniqueness, Timeliness, Validity, Consistency
• Define metrics, goals, thresholds (stored in the measure repository)
Measure:
• Measure calculators running on Spark pick up datasets from sources (RDBMS, …)
• Measurement values and quality scores calculated and stored in the metrics repository
Analyze:
• Scorecard reports generated and displayed
• Data quality trending graphs generated
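To make the “Define” stage concrete, here is a minimal Scala sketch of what a measure definition might capture; the type and field names (MeasureDefinition, goal, threshold, etc.) are illustrative assumptions, not Griffin’s actual schema:

```scala
// Minimal sketch of what a measure definition in the "Define" stage
// might capture; names are illustrative, not Griffin's actual schema.
object MeasureDefinitionExample {

  // The six core dimensions named on the slide.
  sealed trait Dimension
  case object Accuracy     extends Dimension
  case object Completeness extends Dimension
  case object Uniqueness   extends Dimension
  case object Timeliness   extends Dimension
  case object Validity     extends Dimension
  case object Consistency  extends Dimension

  // A measure ties a dimension to a data source, a goal, and an alert threshold.
  case class MeasureDefinition(
    name: String,
    dimension: Dimension,
    source: String,          // e.g. a Hive table or Kafka topic
    target: Option[String],  // accuracy compares a source against a target
    goal: Double,            // desired score, e.g. 99.9 (%)
    threshold: Double        // alert if the score falls below this value
  )

  def main(args: Array[String]): Unit = {
    val viewitemAccuracy = MeasureDefinition(
      name      = "viewitem_accuracy",
      dimension = Accuracy,
      source    = "kafka://item_view_events",      // hypothetical source URI
      target    = Some("hive://persisted_item_view"),
      goal      = 99.9,
      threshold = 99.0
    )
    println(viewitemAccuracy)
  }
}
```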
18. Measure Calculator Example – Accuracy of Viewitem
~300M customer view events per day
Source: Item_view events (User Id, Page Id, Site Id, Title, Date, ……)
The Accuracy Calculator compares the Source against the Target and produces the Metric:

Accuracy Rate (%) = Count(source.field1 == target.field1 && …) / Count(source) × 100%
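As an illustration of this formula, here is a minimal Spark (Scala) batch sketch that computes an accuracy rate by joining a source dataset against a target on every field; the column names and toy data are assumptions for the example, not the actual viewitem schema:

```scala
// Minimal sketch of an accuracy calculation on Spark, assuming both the
// source (streamed view events) and the target (persisted records) have
// been loaded as DataFrames. Column names are illustrative.
import org.apache.spark.sql.SparkSession

object AccuracyCalculatorSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("viewitem-accuracy-sketch")
      .master("local[*]")   // local mode for the sketch; a real job runs on the cluster
      .getOrCreate()
    import spark.implicits._

    // Stand-in data; in practice these would come from Kafka / Hive.
    val source = Seq((1L, 100L, "t1"), (2L, 100L, "t2"), (3L, 101L, "t3"))
      .toDF("user_id", "page_id", "title")
    val target = Seq((1L, 100L, "t1"), (2L, 100L, "t2"), (3L, 101L, "tX"))
      .toDF("user_id", "page_id", "title")

    // A source row is "accurate" if a target row matches on every field.
    val matched = source.join(target, Seq("user_id", "page_id", "title")).count()
    val total   = source.count()

    val accuracyRate = matched.toDouble / total * 100
    println(f"Accuracy Rate: $accuracyRate%.2f%%")

    spark.stop()
  }
}
```

For the toy data above, 2 of 3 source rows find an exact match in the target, so the sketch prints an accuracy rate of 66.67%.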
19. Use Cases
• Griffin has been deployed in production at eBay and provides a centralized data quality service for several eBay systems.
• ~1.2PB of data, 800+M daily records, 100+ metrics
22. We are open source and welcome contributions
GitHub: https://github.com/eBay/griffin
Blog: http://www.ebaytechblog.com/?p=5877/
Contact: ebay-griffin-dev@googlegroups.com
Editor’s Notes
Business challenges here
Previously, we maintained a lot of scripts to measure data quality; most of them ran daily to generate the metrics, which was often too late to recognize data quality issues.
Completeness
• Are all data sets and data items recorded?
Uniqueness
• Is there a single view of the data set?
Timeliness
• Does the data represent reality from the required point in time?
Validity
• Does the data match the rules?
Accuracy
• Does the data correctly reflect the real-world objects it describes?
Consistency
• Can we match the data set across data stores?
In Griffin, we’ve implemented “Accuracy”. Need more volunteers to support the other dimensions.
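As a hypothetical illustration of what checks for two of the not-yet-implemented dimensions might look like, here is a small Spark (Scala) sketch computing completeness (required field recorded) and uniqueness (distinct rows) rates; the data and column names are invented for the example and this is not Griffin code:

```scala
// Hypothetical sketch of completeness and uniqueness checks on Spark,
// illustrating the dimension definitions above; not Griffin's implementation.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object DimensionChecksSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dimension-checks-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Toy data: one missing item_id, one duplicate row.
    val df = Seq(
      (Some(1L), "a"), (Some(2L), "b"), (None, "c"), (Some(2L), "b")
    ).toDF("item_id", "payload")

    val total = df.count()

    // Completeness: fraction of rows where a required field is recorded.
    val complete = df.filter(col("item_id").isNotNull).count()
    println(f"Completeness: ${complete.toDouble / total * 100}%.1f%%")

    // Uniqueness: fraction of rows that are distinct (a single view of the data).
    val distinct = df.distinct().count()
    println(f"Uniqueness: ${distinct.toDouble / total * 100}%.1f%%")

    spark.stop()
  }
}
```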
Griffin covers the stages of Define, Measure, Analyze.
Define: select dimension, define measurement, threshold
Measure: calculate the metrics, generate reports, scorecards
Analyze: download sample data to do RCA of DQ issues
Now let’s take a look at the flow of data quality assessment in Griffin. The first step is to define the measurement: you need to decide which data quality dimensions to use, along with the metrics, goals, and thresholds.
After that comes the “Measure” stage, whose key components are the measure calculators running on Spark. Each measurement type needs a dedicated calculator. The calculators pick up the datasets from the data sources, do the calculations, and produce metric values.
Finally, with the metric values we can generate scorecards and graphs so that end users can analyze the quality of their data.
Real-time: the measure calculators are based on Kafka streaming, so they can handle near real-time data (see the sketch after this list).
Fast: we have optimized algorithms for the data quality domain to support various data quality dimensions.
Massive: we support large-scale data hosted on the Spark cluster.
Pluggable: it’s easy to plug in a new measurement type; you just deploy your customized jar package in Griffin to support new dimensions.
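Below is a minimal sketch of that near-real-time input path using the standard Spark Streaming Kafka (0.10) integration; the broker address, topic name, and consumer group are hypothetical, and a real calculator would do far more than count records:

```scala
// Minimal sketch of a near-real-time calculator input path: consuming view
// events from Kafka with Spark Streaming. Broker, topic, and group names
// are hypothetical; Griffin's actual wiring differs.
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object StreamingIngestSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("dq-streaming-sketch").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(60))  // one micro-batch per minute

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092",           // hypothetical broker
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "dq-accuracy-calculator"
    )

    val events = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("item_view_events"), kafkaParams)
    )

    // Per micro-batch, count the records that arrived; a real calculator
    // would join against the persisted target and emit an accuracy metric.
    events.map(_.value()).foreachRDD { rdd =>
      println(s"received ${rdd.count()} view events in this batch")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```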
Here’s the diagram of the component design in Griffin.
All the measurement definitions are stored in the measure repository, and a scheduler picks up measure jobs according to the measurement definitions. A measure launcher is then started to process each measure job; it sends the job to the measure calculators asynchronously. When the calculations are completed, the launcher is notified and the metric values are persisted.
With the metric values, we can monitor data quality, send out alerts, or generate reports and scorecards. (The reports also need information from the measure definition.)
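The following is a hypothetical sketch of that scheduler → launcher → calculator flow using plain Scala Futures, with stand-ins for the calculator and the metrics repository; it illustrates the asynchronous dispatch described above, not Griffin’s actual code:

```scala
// Hypothetical sketch of the scheduler -> launcher -> calculator flow,
// using plain Scala Futures; not Griffin's actual implementation.
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import ExecutionContext.Implicits.global

object LauncherFlowSketch {
  case class MeasureJob(name: String)
  case class MetricValue(job: String, score: Double)

  // Stand-in for a measure calculator running remotely on Spark.
  def runCalculator(job: MeasureJob): MetricValue =
    MetricValue(job.name, 99.2)

  // Stand-in for writing to the metrics repository.
  def persist(metric: MetricValue): Unit =
    println(s"persisted ${metric.job} -> ${metric.score}%")

  def main(args: Array[String]): Unit = {
    // The scheduler picks up jobs according to the measure definitions...
    val jobs = Seq(MeasureJob("viewitem_accuracy"), MeasureJob("orders_completeness"))

    // ...and the launcher dispatches them asynchronously; when a calculation
    // completes, the launcher is notified and persists the metric value.
    val done = Future.traverse(jobs) { job =>
      Future(runCalculator(job)).map(persist)
    }
    Await.result(done, 1.minute)
  }
}
```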
As mentioned previously, the calculators are key components in Griffin, so I’d like to give you an example of how a calculator works.
This is an example of the accuracy calculator for viewitem.
Viewitem is event data from eBay sites; it represents users’ view activities on those sites, and we have internal processes to persist this kind of event data.
To assess the accuracy of the persisted data, we need to compare it with its source, which is streaming data. The accuracy calculator can pick up both streaming and batch data: when the viewitem data is ready, the calculator runs the algorithm on Spark and produces metric values.
In total it deals with about 1.2 PB of data, and the data volume keeps increasing as we support more systems. There are more than 800 million daily records, and we’ve onboarded more than 100 metrics in total.