Abhishek Doshi of Facebook gave an interesting talk on 31.10.2013 about how Facebook de-duplicates its object graph with Hive on Hadoop, e.g. to identify whether a movie (object) from IMDB and one from Netflix refer to the same actual movie. The solution has to scale to many millions of objects and petabytes of data.
4. Product Uses – Collections
Empower users to connect to the books they read, movies they watch, TV shows they like, etc., whether on Facebook or on other services.
5. Product Uses – Composer
Empower users to create structured posts about the things they do.
6. Product Uses – Graph Search
Allow users to find things their friends have done, irrespective of the service they used to do it (subject to privacy checks, of course!).
7. Pipeline Overview
Import from data providers and massage into a unified format in Hive
Use Hive data to create pages on FB as data containers accessible by the web tier
Scrape all existing pages / objects back to Hive and run the de-duplication pipeline against the entire dataset daily
8. Imports and Page Creation
Before (data in XML files)

<artists>
  <artist name="Katy Perry" id="109">
    <album id="kp1"/>
    <album id="kp2"/>
  </artist>
</artists>

<albums>
  <album id="kp1" name="Teenage Dream">
    <song id="s1" title="California Gurls"/>
    <song id="s2" title="Firework"/>
  </album>
  <album id="kp2" name="Prism">
    ...
  </album>
</albums>

After (massaged data in Hive table, pages created)

Id | Title            | Album | Artist
s1 | California Gurls | kp1   | 109
s2 | Firework         | kp1   | 109
s3 | Roar             | kp2   | 109

Pages created: Katy Perry (artist), Teenage Dream (album), Prism (album), California Gurls (song), Firework (song), Roar (song)
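As a concrete illustration of the massage step, here is a minimal sketch that flattens the album XML above into the unified rows of the table. This is my own Python, not the pipeline's code; at FB scale this would be a Hive import job, and the Roar song element is inferred from the table rows:

import xml.etree.ElementTree as ET

# The <albums> feed from the slide; the artist id comes from the
# matching <artists> feed (Katy Perry = 109).
ALBUMS_XML = """
<albums>
  <album id="kp1" name="Teenage Dream">
    <song id="s1" title="California Gurls"/>
    <song id="s2" title="Firework"/>
  </album>
  <album id="kp2" name="Prism">
    <song id="s3" title="Roar"/>
  </album>
</albums>
"""

def massage(albums_xml, artist_id):
    """Flatten provider XML into (id, title, album, artist) rows."""
    rows = []
    for album in ET.fromstring(albums_xml):
        for song in album.findall("song"):
            rows.append((song.get("id"), song.get("title"),
                         album.get("id"), artist_id))
    return rows

for row in massage(ALBUMS_XML, "109"):
    print(row)
# ('s1', 'California Gurls', 'kp1', '109')
# ('s2', 'Firework', 'kp1', '109')
# ('s3', 'Roar', 'kp2', '109')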
9. Data De-duplication - Example
(Diagram: Before, Cluster 1 groups Nodes A and B, with Node C standing alone; After, Match Z links Node C into Cluster 1.)

Node A: "Ender's Game" by Orson Scott Card, ISBN: 0-306-40615-2 (Authentic Page)
Node B: "El Juego de Ender" by Orson Scott Card, ISBN: 978-0-306-40615-7 (OG Object)
Node C: "The Ender's Game" by O. S. Card, ISBN: null (Imported Page)
Cluster 1: The set of nodes we know refer to one canonical entity (Nodes A and B are grouped together by ISBN (10- vs. 13-digit) and loose title/author matching)
Match Z: Title and author normalization and matching logic determined that Node C refers to the same entity
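Cluster 1 hinges on recognizing that a 10-digit and a 13-digit ISBN can encode the same book. A minimal sketch of that normalization, using the standard ISBN-10 to ISBN-13 conversion (my illustration; the talk doesn't show its actual code):

def isbn10_to_isbn13(isbn10):
    """Normalize an ISBN-10 to its ISBN-13 form: prefix '978' to the
    first nine digits and recompute the check digit."""
    digits = isbn10.replace("-", "")
    assert len(digits) == 10
    core = "978" + digits[:9]
    # ISBN-13 check digit: alternating weights 1 and 3 over 12 digits.
    total = sum(int(d) * (1 if i % 2 == 0 else 3)
                for i, d in enumerate(core))
    return core + str((10 - total % 10) % 10)

# Node A and Node B from the example map to the same key:
assert isbn10_to_isbn13("0-306-40615-2") == "9780306406157"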
10. Data De-duplication - Strategy
▪ Analysis
  ▪ What metadata do you have?
  ▪ How large is your data set?
  ▪ How accurate is your data?
▪ Techniques
  ▪ High accuracy + disambiguation information
    ▪ First normalize and figure out what metadata acts as good disambiguation information
    ▪ Look for exact matches
  ▪ Low accuracy
    ▪ Approximate first pass on entire data set
    ▪ Rigorous check on candidate pairs
11. High Accuracy + Disambiguation Information
▪ De-duplication (simple w/ good helper functions!)
  ▪ Step 1: normalize - regex to strip things, UDF for more complicated changes (e.g. casing, removing punctuation, trimming, replacing abbreviations, etc.)
  ▪ Step 2: look for exact matches on the normalized keys - a self join, OR a GROUP BY (see the sketch after this list)
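A minimal sketch of both steps in Python (the talk does this in Hive: regex and a UDF or TRANSFORM for Step 1, a self join or GROUP BY for Step 2; the specific normalization rules below are invented examples):

import re
from collections import defaultdict

def normalize(title):
    """Step 1: regex stripping plus the more complicated UDF-style
    changes: casing, punctuation removal, trimming, abbreviations.
    (These particular rules are illustrative, not Facebook's.)"""
    t = title.lower().strip()
    t = re.sub(r"[^\w\s]", "", t)         # remove punctuation
    t = re.sub(r"\bvol\b", "volume", t)   # expand an abbreviation
    return re.sub(r"\s+", " ", t)         # collapse whitespace

def exact_match_groups(records):
    """Step 2: group records sharing a normalized key (a GROUP BY in
    Hive); any group with >1 member is a duplicate candidate set."""
    groups = defaultdict(list)
    for rec_id, title in records:
        groups[normalize(title)].append(rec_id)
    return {k: v for k, v in groups.items() if len(v) > 1}

print(exact_match_groups([("a", "The Hobbit, Vol. 1"),
                          ("b", "the hobbit vol 1 "),
                          ("c", "Prism")]))
# {'the hobbit volume 1': ['a', 'b']}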
12. Low Accuracy
▪ Approximation
  ▪ Split title string into 2-shingle chunks ('Lord of the Rings' => 'Lord of', 'of the', 'the Rings')
  ▪ Compute overlap of sets
▪ Candidate Generation
  ▪ N-grams -> hashes: ['lord of', 'of the', 'the rings'] -> [[25bf9b6f, c1bbdfc6, b866805d], [306d3a2c, 61a61682, a16dc249], ...] (one hash per hash function per shingle; see the sketch after this list)
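A minimal sketch of the shingling and min-hash candidate generation. This is my own Python illustration; the hash function, seeding scheme, and signature size are assumptions, and in practice this runs as Hive UDFs over the full dataset:

import hashlib

def shingles(title, n=2):
    """Split a title into n-word shingles: 'Lord of the Rings' ->
    {'lord of', 'of the', 'the rings'}."""
    words = title.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def h(text, seed):
    """One of several seeded hash functions (illustrative choice)."""
    return hashlib.md5(f"{seed}:{text}".encode()).hexdigest()[:8]

def minhash_signature(shingle_set, num_hashes=3):
    """For each hash function, keep the minimum hash over all shingles.
    Titles sharing a value at any signature position become candidate
    pairs for the rigorous second pass."""
    return tuple(min(h(s, seed) for s in shingle_set)
                 for seed in range(num_hashes))

sig = minhash_signature(shingles("Lord of the Rings"))
print(sig)  # three hex values; group titles by each position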
13. Now What?
▪ Automated Dupe Table
▪ Human Judgement
▪ PHP logic acts on automated results and user submissions to create duplicate clusters (see the sketch below)
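The slides don't spell out the PHP clustering logic; a common way to turn pairwise matches plus human overrides into clusters is union-find, sketched here purely as an assumption:

def build_clusters(auto_pairs, human_non_dupes):
    """Union-find over automated match edges, skipping any pair a
    trusted human has marked as NOT a duplicate. (Union-find is an
    assumption; the slides don't specify the actual PHP logic.)"""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in auto_pairs:
        if (a, b) in human_non_dupes or (b, a) in human_non_dupes:
            continue  # humans > machines: overrides win
        parent[find(a)] = find(b)

    clusters = {}
    for node in parent:
        clusters.setdefault(find(node), set()).add(node)
    return list(clusters.values())

print(build_clusters([("A", "B"), ("B", "C")], {("B", "C")}))
# [{'A', 'B'}]  (C stays out: a human marked B/C as non-duplicates)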
14. Other De-duplication related Hive Jobs
▪ Marking Known Non-Duplicates
▪ Gathering De-duplication Statistics
▪ De-duplication of other verticals based on existing work
15. Learnings
▪ Soft Merge vs. Hard Merge
  ▪ Logic will make mistakes or evolve, requiring 'undo' functionality
  ▪ Data agreements change over time
▪ De-dupe Entire Dataset vs. Incremental De-dupe
  ▪ Debugging significantly easier when all information is contained in one partition
▪ Start Conservative
  ▪ Easier to mark additional dupes than clean up incorrect existing ones
▪ Always Verify Data Quality
  ▪ Data providers tend to over-promise about their data sets
▪ Humans > Machines
  ▪ Make it easy for trusted people to override automated logic
16. Statistics are Fun
▪ Data warehouse is > 300 PB in size
▪ Tens of thousands of queries are run daily, crunching more than 10 PB of data
▪ 600 TB of new data is ingested into the warehouse every day
▪ The data warehouse has grown nearly 4,000x in the last four years, way ahead of FB user growth
17. (c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved.
Editor's Notes
- Brief intro of myself and my time at FB, teams worked on, current stint in London
First, talk about the Collections product itself and why it's important to FB (get high-quality structured information about users' tastes so we can create more engaging product experiences and more relevant advertising opportunities). Collections need high-quality data to a) index the things users care about with enough information that they're easy to find and compelling, and b) not show them 10 copies of the same thing, albeit from different providers.
Structured status updates about the things users are doing
Be able to surface this data (subject to privacy) in search in a way that aggregates actions taken on various objects across the entire ecosystem. Ex: "My friends who watched Star Trek" or "My friends who watched action movies with Tom Cruise".
Walk through the basic infrastructure; mention that we'll focus on the de-duplication logic in Hive. It seems a bit circuitous, but we want these objects to exist on the web tier as well for actual product usage.
Simple example to set the stage for what we’ll be talking about
Talk about string normalization w/ regex, more complicated stuff as a UDF, or even a TRANSFORM if needed. Describe the more naïve approach with a self join, and then the more interesting GROUP BY approach.
Jaccard = |A intersect B| / |A union B|. Min-Hash = take a set of strings and a set of hash functions. Calculate the hash of each string with the first hash function and keep the minimum value. Repeat with each other function so you have a set of minimum values. Do further inspection of the sets of things grouped by the min hashes, after filtering out common hash values.
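For concreteness, a sketch of that rigorous check: compute the exact Jaccard similarity for each candidate pair surfaced by the min-hash pass (the sets below reuse the 2-shingle example from slide 12; any acceptance threshold would be a tuning choice, not something the talk specifies):

def jaccard(a, b):
    """J(A, B) = |A intersect B| / |A union B| over shingle sets."""
    return len(a & b) / len(a | b)

lotr  = {"lord of", "of the", "the rings"}            # "Lord of the Rings"
lotr2 = {"the lord", "lord of", "of the", "the rings"}  # "The Lord of the Rings"
print(jaccard(lotr, lotr2))  # 0.75: 3 shared shingles out of 4 total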