Entities De-duplication in Hive

Abhishek Doshi
Facebook.com/abhi
adoshi@fb.com
Agenda

1. Motivation
2. Pipeline Overview
3. De-duplication Specifics
4. Related Hive Usage
5. Learnings + Statistics

(c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc.. All rights reserved. 1.0
What is the Entity Graph?
Product Uses – Collections
Empower users to connect to the books they read, movies they watch, TV shows they
like, etc. whether on Facebook or on other services.
Product Uses – Composer
Empower users to create structured posts about the things they do
Product Uses – Graph Search
Allow users to find things their friends have done irrespective of the service they used
to do it (subject to privacy checks of course!)
Pipeline Overview
Import from data providers and massage into a unified format in Hive

Use Hive data to create pages on FB as data containers accessible by the web tier

Scrape all existing pages / objects back to Hive and run the de-duplication pipeline against the entire dataset daily
Imports and Page Creation
Before (data in XML file)
<artists>
  <artist name="Katy Perry" id="109">
    <album id="kp1"/>
    <album id="kp2"/>
  </artist>
</artists>

<albums>
  <album id="kp1" name="Teenage Dream">
    <song id="s1" title="California Gurls"/>
    <song id="s2" title="Firework"/>
  </album>
  <album id="kp2" name="Prism">
    ...
  </album>
</albums>

After (massaged data in Hive table, pages created)
Id  | Title            | Album | Artist
s1  | California Gurls | kp1   | 109
s2  | Firework         | kp1   | 109
s3  | Roar             | kp2   | 109

Pages created: Teenage Dream (album), Prism (album), California Gurls (song), Firework (song), Roar (song), Katy Perry (artist)
Data De-duplication - Example
[Figure: Before and After panels. Each shows Cluster 1 (Node A, Node B) and Node C; the After panel adds Match Z.]

Node A: "Ender's Game" by Orson Scott Card, ISBN: 0-306-40615-2 (Authentic Page)
Node B: "El Juego de Ender" by Orson Scott Card, ISBN: 978-0-306-40615-7 (OG Object)
Node C: "The Ender's Game" by O. S. Card, ISBN: null (Imported Page)
Cluster 1: The set of nodes that we know refer to one canonical entity (Nodes A and B are grouped together by ISBN (10- vs. 13-digit) and loose title/author matching)
Match Z: Title and author normalization and matching logic determined that Node C
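
Nodes A and B above carry the same ISBN in its 10-digit and 13-digit forms, so they only match exactly once both are brought to one canonical form. A minimal Python sketch of that normalization follows; the function names are hypothetical and this is not the pipeline's actual code, only the standard ISBN-10 to ISBN-13 conversion (prefix 978, recomputed check digit).

```python
import re
from typing import Optional


def isbn10_to_isbn13(isbn10: str) -> str:
    """Convert a 10-character ISBN to its 13-digit form."""
    digits = re.sub(r"[^0-9Xx]", "", isbn10)
    if len(digits) != 10:
        raise ValueError(f"not a 10-character ISBN: {isbn10!r}")
    core = "978" + digits[:9]  # drop the old ISBN-10 check digit
    # ISBN-13 check digit: weights alternate 1, 3 over the first 12 digits.
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(core))
    return core + str((10 - total % 10) % 10)


def canonical_isbn(raw: Optional[str]) -> Optional[str]:
    """Return a hyphen-free ISBN-13, or None when no usable ISBN is present."""
    if not raw:
        return None
    digits = re.sub(r"[^0-9Xx]", "", raw)
    if len(digits) == 13:
        return digits
    if len(digits) == 10:
        return isbn10_to_isbn13(digits)
    return None
```

With this, the ISBNs of Node A and Node B map to the same key, while Node C (ISBN: null) yields None and has to be matched on title and author instead.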
Data De-duplication - Strategy
▪ Analysis
  ▪ What metadata do you have?
  ▪ How large is your data set?
  ▪ How accurate is your data?
▪ Techniques
  ▪ High accuracy + disambiguation information
    ▪ First normalize and figure out what metadata acts as good disambiguation information
    ▪ Look for exact matches
  ▪ Low accuracy
    ▪ Approximate first pass on entire data set
    ▪ Rigorous check on candidate pairs
High Accuracy + Disambiguation Information
▪ De-duplication (Simple w/ good helper functions!)
  ▪ Step 1:
    - Regex to strip things
    - UDF for more complicated changes (e.g. casing, removing punctuation, trimming, replacing abbreviations)
  ▪ Step 2: [slide shows two alternative queries separated by OR; not captured in this text]
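
The two steps on this slide (normalize, then look for exact matches) can be sketched outside Hive as well. The Python below is an illustration only: the abbreviation table, function names and row layout are assumptions, and the grouping mirrors a GROUP BY on the normalized key rather than reproducing the queries from the slide.

```python
import re
from collections import defaultdict

# Hypothetical abbreviation table; a real one would be curated per vertical.
ABBREVIATIONS = {"vol": "volume", "pt": "part", "feat": "featuring"}


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, expand abbreviations, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # remove punctuation
    words = [ABBREVIATIONS.get(w, w) for w in text.split()]
    return " ".join(words)


def exact_match_groups(rows):
    """Group row ids by normalized (title, author); keep groups with > 1 member.

    `rows` is an iterable of (row_id, title, author) tuples.
    """
    groups = defaultdict(list)
    for row_id, title, author in rows:
        groups[(normalize(title), normalize(author))].append(row_id)
    return [ids for ids in groups.values() if len(ids) > 1]
```

Grouping on the normalized key touches each row once, whereas a self join compares rows pairwise, which is one reason the group-by form tends to scale better on large tables.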
Low Accuracy
▪ Approximation
  ▪ Split title string into 2-shingle chunks ('Lord of the Rings' => 'Lord of', 'of the', 'the Rings')
  ▪ Compute overlap of sets
▪ Candidate Generation
  N-grams: ['lord of', 'of the', 'the rings']
  Hashes:  [[25bf9b6f, c1bbdfc6, b866805d], [306d3a2c, 61a61682, a16dc249], ...]
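
The shingling and set-overlap steps above can be sketched in a few lines of Python. The function names are hypothetical, and Jaccard similarity is used as the overlap measure because the speaker notes name it; the slide itself does not show an implementation.

```python
def shingles(title: str, k: int = 2) -> set:
    """Split a title into overlapping k-word chunks (lowercased)."""
    words = title.lower().split()
    if len(words) < k:
        # Titles shorter than k words become a single shingle (or none).
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |A intersect B| / |A union B|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

For example, 'Lord of the Rings' and 'The Lord of the Rings' share three of four distinct shingles, giving a similarity of 0.75.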
Now What?
▪ Automated Dupe Table
▪ Human Judgement

PHP logic acts on automated results and user submissions to create duplicate clusters
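
One common way to turn duplicate pairs into clusters is a union-find (disjoint-set) structure. The sketch below is in Python rather than PHP and is only an assumption about how such clustering could work; `build_clusters` and its input format are hypothetical, and the source does not describe the actual logic.

```python
from collections import defaultdict


def build_clusters(pairs):
    """Merge (a, b) duplicate pairs into clusters using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    clusters = defaultdict(set)
    for node in list(parent):
        clusters[find(node)].add(node)
    # Sort members and clusters so the output is deterministic.
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])
```

Automated matches and user submissions can simply be concatenated into `pairs`; because merging is transitive, a single wrong pair joins two whole clusters, which is one reason to keep the matching logic conservative.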
Other De-duplication related Hive Jobs
▪ Marking Known Non-Duplicates
▪ Gathering De-duplication Statistics
▪ De-duplication of other verticals based on existing work
Learnings
▪ Soft Merge vs Hard Merge
  ▪ Logic will make mistakes or evolve, requiring 'undo' functionality
  ▪ Data agreements change over time
▪ De-dupe Entire Dataset vs. Incremental De-dupe
  ▪ Debugging significantly easier when all information is contained in one partition
▪ Start Conservative
  ▪ Easier to mark additional dupes than clean up incorrect existing ones
▪ Always Verify Data Quality
  ▪ Data providers tend to over-promise about their data sets
▪ Humans > Machines
  ▪ Make it easy for trusted people to override automated logic
Statistics are Fun
▪ Data warehouse is > 300 PB in size
▪ Tens of thousands of queries are run daily, crunching more than 10 PB of data
▪ 600 TB of new data is ingested into the warehouse every day
▪ The data warehouse has grown nearly 4,000x in the last four years, way ahead of FB user growth

Hive London Meetup: Facebook Object Graph Entity Deduplication


Editor's Notes

  1. Brief intro of myself and my time at FB, teams worked on, current stint in London
  2. First, talk about the Collections product itself and why it's important to FB (get high-quality structured information about users' tastes so we can create more engaging product experiences and more relevant advertising opportunities). Collections need high-quality data to a) index the things users care about with enough information that they're easy to find and compelling, and b) not show them 10 copies of the same thing, albeit from different providers
  3. Structured status updates about the things users are doing
  4. Be able to surface this data (subject to privacy) in search in a way that aggregates actions taken on various objects across the entire ecosystem. Ex: "My friends who watched Star Trek" or "My friends who watched action movies with Tom Cruise"
  5. Walk through basic infrastructure; mention that we'll focus on de-duplication logic in Hive. It seems a bit circuitous, but we want these objects to exist on the web tier as well for actual product usage.
  6. Simple example to set the stage for what we'll be talking about
  7. Talk about string normalization w/ regex, more complicated stuff as a UDF, even make a transform if needed. Describe the more naïve approach with a self join and then the more interesting GROUP BY approach
  8. Jaccard = |A intersect B| / |A union B|. Min-Hash = take a set of strings and a set of hash functions. Calculate the hash of each string with the first hash function and keep the minimum value. Repeat with each other function so you have a set of minimum values. Do further inspection of the sets of things grouped by the min hashes after filtering out common hash values.
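
The Min-Hash description in note 8 can be sketched as follows. This is a rough Python illustration under stated assumptions: seeded BLAKE2b digests stand in for the "set of hash functions", and the names `minhash_signature`, `candidate_pairs` and the `max_bucket` cutoff (a stand-in for "filtering out common hash values") are hypothetical, not taken from the pipeline.

```python
import hashlib
from collections import defaultdict


def _hash(seed: int, value: str) -> int:
    """One member of a seeded hash family, truncated to 32 bits."""
    data = f"{seed}:{value}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=4).digest(), "big")


def minhash_signature(items, num_hashes: int = 3):
    """For each hash function, keep the minimum hash over all items."""
    return tuple(min(_hash(seed, it) for it in items) for seed in range(num_hashes))


def candidate_pairs(shingle_sets: dict, num_hashes: int = 3, max_bucket: int = 100):
    """Pair up keys that share at least one min-hash value.

    `shingle_sets` maps an entity key to its set of shingles. Buckets with
    more than `max_bucket` members are skipped as too common to be useful.
    """
    buckets = defaultdict(set)
    for key, items in shingle_sets.items():
        if not items:
            continue
        for seed, value in enumerate(minhash_signature(items, num_hashes)):
            buckets[(seed, value)].add(key)

    pairs = set()
    for keys in buckets.values():
        if len(keys) > max_bucket:
            continue
        ordered = sorted(keys)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                pairs.add((a, b))
    return pairs
```

The pairs returned here are only candidates; each would still go through the rigorous check (for example a full Jaccard comparison) before being written to the dupe table.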