Abhishek Doshi of Facebook gave an interesting talk on 31.10.2013 about how Facebook de-duplicates its object graph with Hive on Hadoop, e.g. to identify whether a movie (object) from IMDB and one from Netflix refer to the same actual movie. The solution has to scale to many millions of objects and petabytes of data.
4. Product Uses – Collections
Empower users to connect to the books they read, movies they watch, TV shows they like, etc., whether on Facebook or on other services.
5. Product Uses – Composer
Empower users to create structured posts about the things they do.
6. Product Uses – Graph Search
Allow users to find things their friends have done, irrespective of the service they used to do it (subject to privacy checks, of course!).
7. Pipeline Overview
Import from data providers and massage into a unified format in Hive
Use Hive data to create pages on FB as data containers accessible by the web tier
Scrape all existing pages / objects back to Hive and run the de-duplication pipeline against the entire dataset daily
8. Imports and Page Creation
Before (data in XML files)

<artists>
  <artist name="Katy Perry" id="109">
    <album id="kp1"/>
    <album id="kp2"/>
  </artist>
</artists>

<albums>
  <album id="kp1" name="Teenage Dream">
    <song id="s1" title="California Gurls"/>
    <song id="s2" title="Firework"/>
  </album>
  <album id="kp2" name="Prism">
    ...
  </album>
</albums>

After (massaged data in Hive table, pages created)

Id | Title            | Album | Artist
s1 | California Gurls | kp1   | 109
s2 | Firework         | kp1   | 109
s3 | Roar             | kp2   | 109

Pages created: Katy Perry (artist), Teenage Dream (album), Prism (album), California Gurls (song), Firework (song), Roar (song)
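As a concrete illustration of the massage step, here is a minimal sketch that flattens the album XML above into the unified rows of the table. This is my own Python, not the pipeline's code; at FB scale this would be a Hive import job, and the Roar song element is inferred from the table rows:

import xml.etree.ElementTree as ET

# The <albums> feed from the slide; the artist id comes from the
# matching <artists> feed (Katy Perry = 109).
ALBUMS_XML = """
<albums>
  <album id="kp1" name="Teenage Dream">
    <song id="s1" title="California Gurls"/>
    <song id="s2" title="Firework"/>
  </album>
  <album id="kp2" name="Prism">
    <song id="s3" title="Roar"/>
  </album>
</albums>
"""

def massage(albums_xml, artist_id):
    """Flatten provider XML into (id, title, album, artist) rows."""
    rows = []
    for album in ET.fromstring(albums_xml):
        for song in album.findall("song"):
            rows.append((song.get("id"), song.get("title"),
                         album.get("id"), artist_id))
    return rows

for row in massage(ALBUMS_XML, "109"):
    print(row)
# ('s1', 'California Gurls', 'kp1', '109')
# ('s2', 'Firework', 'kp1', '109')
# ('s3', 'Roar', 'kp2', '109')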
9. Data De-duplication - Example
(Diagram: Before, Cluster 1 groups Nodes A and B, with Node C standing alone; After, Match Z links Node C into Cluster 1.)

Node A: "Ender's Game" by Orson Scott Card, ISBN: 0-306-40615-2 (Authentic Page)
Node B: "El Juego de Ender" by Orson Scott Card, ISBN: 978-0-306-40615-7 (OG Object)
Node C: "The Ender's Game" by O. S. Card, ISBN: null (Imported Page)
Cluster 1: The set of nodes we know refer to one canonical entity (Nodes A and B are grouped together by ISBN (10- vs. 13-digit) and loose title/author matching)
Match Z: Title and author normalization and matching logic determined that Node C refers to the same entity
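Cluster 1 hinges on recognizing that a 10-digit and a 13-digit ISBN can encode the same book. A minimal sketch of that normalization, using the standard ISBN-10 to ISBN-13 conversion (my illustration; the talk doesn't show its actual code):

def isbn10_to_isbn13(isbn10):
    """Normalize an ISBN-10 to its ISBN-13 form: prefix '978' to the
    first nine digits and recompute the check digit."""
    digits = isbn10.replace("-", "")
    assert len(digits) == 10
    core = "978" + digits[:9]
    # ISBN-13 check digit: alternating weights 1 and 3 over 12 digits.
    total = sum(int(d) * (1 if i % 2 == 0 else 3)
                for i, d in enumerate(core))
    return core + str((10 - total % 10) % 10)

# Node A and Node B from the example map to the same key:
assert isbn10_to_isbn13("0-306-40615-2") == "9780306406157"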
10. Data De-duplication - Strategy
▪ Analysis
  ▪ What metadata do you have?
  ▪ How large is your data set?
  ▪ How accurate is your data?
▪ Techniques
  ▪ High accuracy + disambiguation information
    ▪ First normalize and figure out what metadata acts as good disambiguation information
    ▪ Look for exact matches
  ▪ Low accuracy
    ▪ Approximate first pass on entire data set
    ▪ Rigorous check on candidate pairs
11. High Accuracy + Disambiguation Information
▪ De-duplication (simple w/ good helper functions!)
  ▪ Step 1: normalize - regex to strip things, UDF for more complicated changes (e.g. casing, removing punctuation, trimming, replacing abbreviations, etc.)
  ▪ Step 2: look for exact matches on the normalized keys - a self join, OR a GROUP BY (see the sketch after this list)
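A minimal sketch of both steps in Python (the talk does this in Hive: regex and a UDF or TRANSFORM for Step 1, a self join or GROUP BY for Step 2; the specific normalization rules below are invented examples):

import re
from collections import defaultdict

def normalize(title):
    """Step 1: regex stripping plus the more complicated UDF-style
    changes: casing, punctuation removal, trimming, abbreviations.
    (These particular rules are illustrative, not Facebook's.)"""
    t = title.lower().strip()
    t = re.sub(r"[^\w\s]", "", t)         # remove punctuation
    t = re.sub(r"\bvol\b", "volume", t)   # expand an abbreviation
    return re.sub(r"\s+", " ", t)         # collapse whitespace

def exact_match_groups(records):
    """Step 2: group records sharing a normalized key (a GROUP BY in
    Hive); any group with >1 member is a duplicate candidate set."""
    groups = defaultdict(list)
    for rec_id, title in records:
        groups[normalize(title)].append(rec_id)
    return {k: v for k, v in groups.items() if len(v) > 1}

print(exact_match_groups([("a", "The Hobbit, Vol. 1"),
                          ("b", "the hobbit vol 1 "),
                          ("c", "Prism")]))
# {'the hobbit volume 1': ['a', 'b']}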
12. Low Accuracy
▪ Approximation
  ▪ Split title string into 2-shingle chunks ('Lord of the Rings' => 'Lord of', 'of the', 'the Rings')
  ▪ Compute overlap of sets
▪ Candidate Generation
  ▪ N-grams -> hashes: ['lord of', 'of the', 'the rings'] -> [[25bf9b6f, c1bbdfc6, b866805d], [306d3a2c, 61a61682, a16dc249], ...] (one hash per hash function per shingle; see the sketch after this list)
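A minimal sketch of the shingling and min-hash candidate generation. This is my own Python illustration; the hash function, seeding scheme, and signature size are assumptions, and in practice this runs as Hive UDFs over the full dataset:

import hashlib

def shingles(title, n=2):
    """Split a title into n-word shingles: 'Lord of the Rings' ->
    {'lord of', 'of the', 'the rings'}."""
    words = title.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def h(text, seed):
    """One of several seeded hash functions (illustrative choice)."""
    return hashlib.md5(f"{seed}:{text}".encode()).hexdigest()[:8]

def minhash_signature(shingle_set, num_hashes=3):
    """For each hash function, keep the minimum hash over all shingles.
    Titles sharing a value at any signature position become candidate
    pairs for the rigorous second pass."""
    return tuple(min(h(s, seed) for s in shingle_set)
                 for seed in range(num_hashes))

sig = minhash_signature(shingles("Lord of the Rings"))
print(sig)  # three hex values; group titles by each position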
13. Now What?
▪ Automated Dupe Table
▪ Human Judgement
▪ PHP logic acts on automated results and user submissions to create duplicate clusters (see the sketch below)
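The slides don't spell out the PHP clustering logic; a common way to turn pairwise matches plus human overrides into clusters is union-find, sketched here purely as an assumption:

def build_clusters(auto_pairs, human_non_dupes):
    """Union-find over automated match edges, skipping any pair a
    trusted human has marked as NOT a duplicate. (Union-find is an
    assumption; the slides don't specify the actual PHP logic.)"""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in auto_pairs:
        if (a, b) in human_non_dupes or (b, a) in human_non_dupes:
            continue  # humans > machines: overrides win
        parent[find(a)] = find(b)

    clusters = {}
    for node in parent:
        clusters.setdefault(find(node), set()).add(node)
    return list(clusters.values())

print(build_clusters([("A", "B"), ("B", "C")], {("B", "C")}))
# [{'A', 'B'}]  (C stays out: a human marked B/C as non-duplicates)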
14. Other De-duplication related Hive Jobs
▪ Marking Known Non-Duplicates
▪ Gathering De-duplication Statistics
▪ De-duplication of other verticals based on existing work
15. Learnings
▪ Soft Merge vs. Hard Merge
  ▪ Logic will make mistakes or evolve, requiring 'undo' functionality
  ▪ Data agreements change over time
▪ De-dupe Entire Dataset vs. Incremental De-dupe
  ▪ Debugging significantly easier when all information is contained in one partition
▪ Start Conservative
  ▪ Easier to mark additional dupes than clean up incorrect existing ones
▪ Always Verify Data Quality
  ▪ Data providers tend to over-promise about their data sets
▪ Humans > Machines
  ▪ Make it easy for trusted people to override automated logic
16. Statistics are Fun
▪ Data warehouse is > 300 PB in size
▪ Tens of thousands of queries are run daily, crunching more than 10 PB of data
▪ 600 TB of new data is ingested into the warehouse every day
▪ The data warehouse has grown nearly 4,000x in the last four years, way ahead of FB user growth
17. (c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved.
Editor's Notes
- Brief intro of myself and my time at FB, teams worked on, current stint in London
First, talk about the Collections product itself and why it's important to FB (get high-quality structured information about users' tastes so we can create more engaging product experiences and more relevant advertising opportunities). Collections need high-quality data to a) index the things users care about with enough information that they're easy to find and compelling, and b) not show them 10 copies of the same thing, albeit from different providers.
Structured status updates about the things users are doing
Be able to surface this data (subject to privacy) in search in a way that aggregates actions taken on various objects across the entire ecosystem. Ex: "My friends who watched Star Trek" or "My friends who watched action movies with Tom Cruise".
Walk through the basic infrastructure; mention that we'll focus on the de-duplication logic in Hive. It seems a bit circuitous, but we want these objects to exist on the web tier as well for actual product usage.
Simple example to set the stage for what we’ll be talking about
Talk about string normalization w/ regex, more complicated stuff as a UDF, or even a TRANSFORM if needed. Describe the more naïve approach with a self join, and then the more interesting GROUP BY approach.
Jaccard = |A intersect B| / |A union B|. Min-Hash = take a set of strings and a set of hash functions. Calculate the hash of each string with the first hash function and keep the minimum value. Repeat with each other function so you have a set of minimum values. Do further inspection of the sets of things grouped by the min hashes, after filtering out common hash values.
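For concreteness, a sketch of that rigorous check: compute the exact Jaccard similarity for each candidate pair surfaced by the min-hash pass (the sets below reuse the 2-shingle example from slide 12; any acceptance threshold would be a tuning choice, not something the talk specifies):

def jaccard(a, b):
    """J(A, B) = |A intersect B| / |A union B| over shingle sets."""
    return len(a & b) / len(a | b)

lotr  = {"lord of", "of the", "the rings"}            # "Lord of the Rings"
lotr2 = {"the lord", "lord of", "of the", "the rings"}  # "The Lord of the Rings"
print(jaccard(lotr, lotr2))  # 0.75: 3 shared shingles out of 4 total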