SlideShare uma empresa Scribd logo
1 de 24
SEARCH, NETWORK AND ANALYTICS
Using Set Cover to Optimize a Large-
Scale Low Latency Distributed Graph
LinkedIn Presentation Template
©2013 LinkedIn Corporation. All Rights Reserved. 1
Rui Wang
Staff Software Engineer
SEARCH, NETWORK AND ANALYTICS
Outline
 LinkedIn’s Distributed Graph Infrastructure
 The Scaling Problem
 Set Cover by Example
 Evaluation
 Related Work
 Conclusions and Future Work
©2013 LinkedIn Corporation. All Rights Reserved. 2
SEARCH, NETWORK AND ANALYTICS
Outline
 LinkedIn’s Distributed Graph Infrastructure
 The Scaling Problem
 Set Cover by Example
 Evaluation
 Related Work
 Conclusions and Future Work
©2013 LinkedIn Corporation. All Rights Reserved. 3
SEARCH, NETWORK AND ANALYTICS
LinkedIn Products Backed By Social Graph
©2013 LinkedIn Corporation. All Rights Reserved. 4
SEARCH, NETWORK AND ANALYTICS
LinkedIn’s Distributed Graph APIs
 Graph APIs
– Get Connections
 “Who does member X know?”
– Get Shared Connections
 “Who do I know in common with member Y”?
– Get Graph Distances
 “What’s the graph distances for each member returned from a search
query?”
 “Is member Z within my second degree network so that I can view her
profile?”
 Over 50% queries is to get graph distances
©2013 LinkedIn Corporation. All Rights Reserved. 5
SEARCH, NETWORK AND ANALYTICS
Distributed Graph Architecture Diagram
©2013 LinkedIn Corporation. All Rights Reserved. 6
SEARCH, NETWORK AND ANALYTICS
LinkedIn’s Distributed Graph Infrastructure
 Graph Database (Graph DB)
– Partitioned and replicated graph database
– Distributed adjacency list
 Distributed Network Cache (NCS)
– LRU cache stores second degree network for active members
– Graph traversals are converted to set intersections
 Graph APIs
– Get Connections
– Get Shared Connections
– Get Graph Distances
©2013 LinkedIn Corporation. All Rights Reserved. 7
SEARCH, NETWORK AND ANALYTICS
Outline
 LinkedIn’s Distributed Graph Infrastructure
 The Scaling Problem
 Set Cover by Example
 Evaluation
 Related Work
 Conclusions and Future Work
©2013 LinkedIn Corporation. All Rights Reserved. 8
SEARCH, NETWORK AND ANALYTICS
Graph Distance Queries and Second Degree Creation
©2013 LinkedIn Corporation. All Rights Reserved. 9
get graph distances
cache lookup
[exists=false] retrieve
1st degree connections
[exists=true] set
intersections
Partition IDs
partial merges
K-Way
merges
return
return
SEARCH, NETWORK AND ANALYTICS
The Scaling Problem Illustrated
©2013 LinkedIn Corporation. All Rights Reserved. 10
Graph DBNCS
• Increased Query Distribution
• Merging and De-duping shift to single NCS node
SEARCH, NETWORK AND ANALYTICS
Outline
 LinkedIn’s Distributed Graph Infrastructure
 The Scaling Problem
 Set Cover and Example
 Evaluation
 Related Work
 Conclusions and Future Work
©2013 LinkedIn Corporation. All Rights Reserved. 11
SEARCH, NETWORK AND ANALYTICS
Set Cover Problem
 Definition
– Given a set of elements U = {1, 2, …, m } (called the universe) and a
family S of n sets whose union equals the universe, the set cover
problem is to identify the smallest subset of S the union of which
contains all elements in the universe.
 Greedy Algorithm
– Rule to select sets
 At each stage, choose the set that contains the largest number of
uncovered elements.
– Upper bound
 O(log s), where s is the size of the largest set from S
©2013 LinkedIn Corporation. All Rights Reserved. 12
SEARCH, NETWORK AND ANALYTICS
Modified Set Cover Algorithm Example Cont.
©2013 LinkedIn Corporation. All Rights Reserved. 13
 Build a map of partition ID -> Graph DB nodes for the input graph
SEARCH, NETWORK AND ANALYTICS
Modified Set Cover Algorithm Example Cont.
©2013 LinkedIn Corporation. All Rights Reserved. 14
 Randomly pick an element
from U = { 1,2,5,6 }
– e = 5
 Retrieve from map
– nodes = { R21, R13 }
 Intersect
– R21 = {1,5 } with U,
– R13 = { 5,6 } with U
 Select R21
– U = {2,6}, C = { R21}
SEARCH, NETWORK AND ANALYTICS
Modified Set Cover Algorithm Example Cont.
©2013 LinkedIn Corporation. All Rights Reserved. 15
 Randomly pick an element
from U = { 2,6 }
– e = 2
 Retrieve from map
– nodes = { R11, R22 }
 Intersect
– R11 = {1,2 } with U,
– R22 = { 2,4 } with U
 Select R22
– U = {6}, C = { R21, R22 }
SEARCH, NETWORK AND ANALYTICS
Modified Set Cover Algorithm Example Cont.
©2013 LinkedIn Corporation. All Rights Reserved. 16
 Pick the final element from
U = { 6 }
– e = 6
 Retrieve from map
– nodes = { R23, R13 }
 Intersect
– R23 = { 3,6 } with U,
– R13 = { 5,6 } with U
 Select R23
– U = {}, C = {R21, R22, R23 }
SEARCH, NETWORK AND ANALYTICS
Modified Set Cover Algorithm Example Solution
 Solution Compared to Optimal Result for U , where U = {1,2,5,6}
©2013 LinkedIn Corporation. All Rights Reserved. 17
SEARCH, NETWORK AND ANALYTICS
Outline
 LinkedIn’s Distributed Graph Infrastructure
 The Scaling Problem
 Set Cover by Example
 Evaluation
 Related Work
 Conclusions and Future Work
©2013 LinkedIn Corporation. All Rights Reserved. 18
SEARCH, NETWORK AND ANALYTICS
Evaluation
 Production Results
– Second degree cache creation drops 38% on 99th percentile
– Graph distance calculation drops 25% on 99th percentile
– Outbound traffic drops 40%; Inbound traffic drops 10%
©2013 LinkedIn Corporation. All Rights Reserved. 19
95th quantile 99th quantile
0
1
2
3
control setcover control setcover
latency(normalized)
network cache building
95th quantile 99th quantile
0
1
2
3
control setcover control setcover
latency(normalized)
getdistances
SEARCH, NETWORK AND ANALYTICS
Outline
 LinkedIn’s Distributed Graph Infrastructure
 The Scaling Problem
 Set Cover by Example
 Evaluation
 Related Work
 Conclusions and Future Work
©2013 LinkedIn Corporation. All Rights Reserved. 20
SEARCH, NETWORK AND ANALYTICS
Related Work
 Scaling through Replications
– Collocating neighbors [Pujol2010]
– Replication based on read/write frequencies [Mondal2012]
– Replication based on locality [Carrasco2011]
 Multi-Core Implementations
– Parallel graph exploration [Hong2011]
 Offline Graph Systems
– Google’s Pregel [Malewicz2010]
– Distributed GraphLab [Low2012]
©2013 LinkedIn Corporation. All Rights Reserved. 21
SEARCH, NETWORK AND ANALYTICS
Outline
 LinkedIn’s Distributed Graph Infrastructure
 The Scaling Problem
 Set Cover by Example
 Evaluation
 Related Work
 Conclusions and Future Work
©2013 LinkedIn Corporation. All Rights Reserved. 22
SEARCH, NETWORK AND ANALYTICS
Conclusions and Future Work
 Future Work
– Incremental cache updates
– Replication on GraphDB nodes through LRU caching
– New graph traversal algorithms
 Conclusions
– Key challenges tackled
 Work distribution balancing
 Communication Bandwidth
– Set cover optimized latency for distributed query by
 Identifying a much smaller set of GraphDB nodes serving queries
 Shifting workload to GraphDB to utilize parallel powers
 Alleviating the K-way merging costs on NCS by reducing K
– Available at
https://github.com/linkedin-
sna/norbert/blob/master/network/src/main/scala/com/linkedin/norbert/network/partitioned
/loadbalancer/DefaultPartitionedLoadBalancerFactory.scala
©2013 LinkedIn Corporation. All Rights Reserved. 23
SEARCH, NETWORK AND ANALYTICS©2013 LinkedIn Corporation. All Rights Reserved. 24

Mais conteúdo relacionado

Semelhante a Using Set Cover to Optimize a Large-Scale Low Latency Distributed Graph

Optimizing Your Supply Chain with Neo4j
Optimizing Your Supply Chain with Neo4jOptimizing Your Supply Chain with Neo4j
Optimizing Your Supply Chain with Neo4jNeo4j
 
Introduction to Mahout
Introduction to MahoutIntroduction to Mahout
Introduction to MahoutTed Dunning
 
Introduction to Mahout given at Twin Cities HUG
Introduction to Mahout given at Twin Cities HUGIntroduction to Mahout given at Twin Cities HUG
Introduction to Mahout given at Twin Cities HUGMapR Technologies
 
K anonymity for crowdsourcing database
K anonymity for crowdsourcing databaseK anonymity for crowdsourcing database
K anonymity for crowdsourcing databaseLeMeniz Infotech
 
Using Linkurious in your Enterprise Architecture projects
Using Linkurious in your Enterprise Architecture projectsUsing Linkurious in your Enterprise Architecture projects
Using Linkurious in your Enterprise Architecture projectsLinkurious
 
Microtask Crowdsourcing Applications for Linked Data
Microtask Crowdsourcing Applications for Linked DataMicrotask Crowdsourcing Applications for Linked Data
Microtask Crowdsourcing Applications for Linked DataEUCLID project
 
GraphTour 2020 - Graphs & AI: A Path for Data Science
GraphTour 2020 - Graphs & AI: A Path for Data ScienceGraphTour 2020 - Graphs & AI: A Path for Data Science
GraphTour 2020 - Graphs & AI: A Path for Data ScienceNeo4j
 
IRJET- A Review on K-Means++ Clustering Algorithm and Cloud Computing wit...
IRJET-  	  A Review on K-Means++ Clustering Algorithm and Cloud Computing wit...IRJET-  	  A Review on K-Means++ Clustering Algorithm and Cloud Computing wit...
IRJET- A Review on K-Means++ Clustering Algorithm and Cloud Computing wit...IRJET Journal
 
Using Connected Data and Graph Technology to Enhance Machine Learning and Art...
Using Connected Data and Graph Technology to Enhance Machine Learning and Art...Using Connected Data and Graph Technology to Enhance Machine Learning and Art...
Using Connected Data and Graph Technology to Enhance Machine Learning and Art...Neo4j
 
The Neo4j Data Platform for Today & Tomorrow.pdf
The Neo4j Data Platform for Today & Tomorrow.pdfThe Neo4j Data Platform for Today & Tomorrow.pdf
The Neo4j Data Platform for Today & Tomorrow.pdfNeo4j
 
El camino hacia el éxito con las bases de datos de grafos, la ciencia de dato...
El camino hacia el éxito con las bases de datos de grafos, la ciencia de dato...El camino hacia el éxito con las bases de datos de grafos, la ciencia de dato...
El camino hacia el éxito con las bases de datos de grafos, la ciencia de dato...Neo4j
 
A simulation-based approach for straggler tasks detection in Hadoop MapReduce
A simulation-based approach for straggler tasks detection in Hadoop MapReduceA simulation-based approach for straggler tasks detection in Hadoop MapReduce
A simulation-based approach for straggler tasks detection in Hadoop MapReduceIRJET Journal
 
Workshop: Introduction to Cytoscape at UT-KBRIN Bioinformatics Summit 2014 (4...
Workshop: Introduction to Cytoscape at UT-KBRIN Bioinformatics Summit 2014 (4...Workshop: Introduction to Cytoscape at UT-KBRIN Bioinformatics Summit 2014 (4...
Workshop: Introduction to Cytoscape at UT-KBRIN Bioinformatics Summit 2014 (4...Keiichiro Ono
 
Metric Recovery from Unweighted k-NN Graphs
Metric Recovery from Unweighted k-NN GraphsMetric Recovery from Unweighted k-NN Graphs
Metric Recovery from Unweighted k-NN Graphsjoisino
 
A Literature Survey on Image Linguistic Visual Question Answering
A Literature Survey on Image Linguistic Visual Question AnsweringA Literature Survey on Image Linguistic Visual Question Answering
A Literature Survey on Image Linguistic Visual Question AnsweringIRJET Journal
 
Neo4j GraphTalk Helsinki - Next-Gerneation Telecommunication Solutions with N...
Neo4j GraphTalk Helsinki - Next-Gerneation Telecommunication Solutions with N...Neo4j GraphTalk Helsinki - Next-Gerneation Telecommunication Solutions with N...
Neo4j GraphTalk Helsinki - Next-Gerneation Telecommunication Solutions with N...Neo4j
 
Partial Object Detection in Inclined Weather Conditions
Partial Object Detection in Inclined Weather ConditionsPartial Object Detection in Inclined Weather Conditions
Partial Object Detection in Inclined Weather ConditionsIRJET Journal
 
Optimizing Your Supply Chain with the Neo4j Graph
Optimizing Your Supply Chain with the Neo4j GraphOptimizing Your Supply Chain with the Neo4j Graph
Optimizing Your Supply Chain with the Neo4j GraphNeo4j
 
High-Performance Graph Analysis and Modeling
High-Performance Graph Analysis and ModelingHigh-Performance Graph Analysis and Modeling
High-Performance Graph Analysis and ModelingNesreen K. Ahmed
 

Semelhante a Using Set Cover to Optimize a Large-Scale Low Latency Distributed Graph (20)

Optimizing Your Supply Chain with Neo4j
Optimizing Your Supply Chain with Neo4jOptimizing Your Supply Chain with Neo4j
Optimizing Your Supply Chain with Neo4j
 
Unit 3- Software Design.pptx
Unit 3- Software Design.pptxUnit 3- Software Design.pptx
Unit 3- Software Design.pptx
 
Introduction to Mahout
Introduction to MahoutIntroduction to Mahout
Introduction to Mahout
 
Introduction to Mahout given at Twin Cities HUG
Introduction to Mahout given at Twin Cities HUGIntroduction to Mahout given at Twin Cities HUG
Introduction to Mahout given at Twin Cities HUG
 
K anonymity for crowdsourcing database
K anonymity for crowdsourcing databaseK anonymity for crowdsourcing database
K anonymity for crowdsourcing database
 
Using Linkurious in your Enterprise Architecture projects
Using Linkurious in your Enterprise Architecture projectsUsing Linkurious in your Enterprise Architecture projects
Using Linkurious in your Enterprise Architecture projects
 
Microtask Crowdsourcing Applications for Linked Data
Microtask Crowdsourcing Applications for Linked DataMicrotask Crowdsourcing Applications for Linked Data
Microtask Crowdsourcing Applications for Linked Data
 
GraphTour 2020 - Graphs & AI: A Path for Data Science
GraphTour 2020 - Graphs & AI: A Path for Data ScienceGraphTour 2020 - Graphs & AI: A Path for Data Science
GraphTour 2020 - Graphs & AI: A Path for Data Science
 
IRJET- A Review on K-Means++ Clustering Algorithm and Cloud Computing wit...
IRJET-  	  A Review on K-Means++ Clustering Algorithm and Cloud Computing wit...IRJET-  	  A Review on K-Means++ Clustering Algorithm and Cloud Computing wit...
IRJET- A Review on K-Means++ Clustering Algorithm and Cloud Computing wit...
 
Using Connected Data and Graph Technology to Enhance Machine Learning and Art...
Using Connected Data and Graph Technology to Enhance Machine Learning and Art...Using Connected Data and Graph Technology to Enhance Machine Learning and Art...
Using Connected Data and Graph Technology to Enhance Machine Learning and Art...
 
The Neo4j Data Platform for Today & Tomorrow.pdf
The Neo4j Data Platform for Today & Tomorrow.pdfThe Neo4j Data Platform for Today & Tomorrow.pdf
The Neo4j Data Platform for Today & Tomorrow.pdf
 
El camino hacia el éxito con las bases de datos de grafos, la ciencia de dato...
El camino hacia el éxito con las bases de datos de grafos, la ciencia de dato...El camino hacia el éxito con las bases de datos de grafos, la ciencia de dato...
El camino hacia el éxito con las bases de datos de grafos, la ciencia de dato...
 
A simulation-based approach for straggler tasks detection in Hadoop MapReduce
A simulation-based approach for straggler tasks detection in Hadoop MapReduceA simulation-based approach for straggler tasks detection in Hadoop MapReduce
A simulation-based approach for straggler tasks detection in Hadoop MapReduce
 
Workshop: Introduction to Cytoscape at UT-KBRIN Bioinformatics Summit 2014 (4...
Workshop: Introduction to Cytoscape at UT-KBRIN Bioinformatics Summit 2014 (4...Workshop: Introduction to Cytoscape at UT-KBRIN Bioinformatics Summit 2014 (4...
Workshop: Introduction to Cytoscape at UT-KBRIN Bioinformatics Summit 2014 (4...
 
Metric Recovery from Unweighted k-NN Graphs
Metric Recovery from Unweighted k-NN GraphsMetric Recovery from Unweighted k-NN Graphs
Metric Recovery from Unweighted k-NN Graphs
 
A Literature Survey on Image Linguistic Visual Question Answering
A Literature Survey on Image Linguistic Visual Question AnsweringA Literature Survey on Image Linguistic Visual Question Answering
A Literature Survey on Image Linguistic Visual Question Answering
 
Neo4j GraphTalk Helsinki - Next-Gerneation Telecommunication Solutions with N...
Neo4j GraphTalk Helsinki - Next-Gerneation Telecommunication Solutions with N...Neo4j GraphTalk Helsinki - Next-Gerneation Telecommunication Solutions with N...
Neo4j GraphTalk Helsinki - Next-Gerneation Telecommunication Solutions with N...
 
Partial Object Detection in Inclined Weather Conditions
Partial Object Detection in Inclined Weather ConditionsPartial Object Detection in Inclined Weather Conditions
Partial Object Detection in Inclined Weather Conditions
 
Optimizing Your Supply Chain with the Neo4j Graph
Optimizing Your Supply Chain with the Neo4j GraphOptimizing Your Supply Chain with the Neo4j Graph
Optimizing Your Supply Chain with the Neo4j Graph
 
High-Performance Graph Analysis and Modeling
High-Performance Graph Analysis and ModelingHigh-Performance Graph Analysis and Modeling
High-Performance Graph Analysis and Modeling
 

Último

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 

Último (20)

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 

Using Set Cover to Optimize a Large-Scale Low Latency Distributed Graph

  • 1. SEARCH, NETWORK AND ANALYTICS Using Set Cover to Optimize a Large- Scale Low Latency Distributed Graph LinkedIn Presentation Template ©2013 LinkedIn Corporation. All Rights Reserved. 1 Rui Wang Staff Software Engineer
  • 2. SEARCH, NETWORK AND ANALYTICS Outline  LinkedIn’s Distributed Graph Infrastructure  The Scaling Problem  Set Cover by Example  Evaluation  Related Work  Conclusions and Future Work ©2013 LinkedIn Corporation. All Rights Reserved. 2
  • 3. SEARCH, NETWORK AND ANALYTICS Outline  LinkedIn’s Distributed Graph Infrastructure  The Scaling Problem  Set Cover by Example  Evaluation  Related Work  Conclusions and Future Work ©2013 LinkedIn Corporation. All Rights Reserved. 3
  • 4. SEARCH, NETWORK AND ANALYTICS LinkedIn Products Backed By Social Graph ©2013 LinkedIn Corporation. All Rights Reserved. 4
  • 5. SEARCH, NETWORK AND ANALYTICS LinkedIn’s Distributed Graph APIs  Graph APIs – Get Connections  “Who does member X know?” – Get Shared Connections  “Who do I know in common with member Y”? – Get Graph Distances  “What’s the graph distances for each member returned from a search query?”  “Is member Z within my second degree network so that I can view her profile?”  Over 50% queries is to get graph distances ©2013 LinkedIn Corporation. All Rights Reserved. 5
  • 6. SEARCH, NETWORK AND ANALYTICS Distributed Graph Architecture Diagram ©2013 LinkedIn Corporation. All Rights Reserved. 6
  • 7. SEARCH, NETWORK AND ANALYTICS LinkedIn’s Distributed Graph Infrastructure  Graph Database (Graph DB) – Partitioned and replicated graph database – Distributed adjacency list  Distributed Network Cache (NCS) – LRU cache stores second degree network for active members – Graph traversals are converted to set intersections  Graph APIs – Get Connections – Get Shared Connections – Get Graph Distances ©2013 LinkedIn Corporation. All Rights Reserved. 7
  • 8. SEARCH, NETWORK AND ANALYTICS Outline  LinkedIn’s Distributed Graph Infrastructure  The Scaling Problem  Set Cover by Example  Evaluation  Related Work  Conclusions and Future Work ©2013 LinkedIn Corporation. All Rights Reserved. 8
  • 9. SEARCH, NETWORK AND ANALYTICS Graph Distance Queries and Second Degree Creation ©2013 LinkedIn Corporation. All Rights Reserved. 9 get graph distances cache lookup [exists=false] retrieve 1st degree connections [exists=true] set intersections Partition IDs partial merges K-Way merges return return
  • 10. SEARCH, NETWORK AND ANALYTICS The Scaling Problem Illustrated ©2013 LinkedIn Corporation. All Rights Reserved. 10 Graph DBNCS • Increased Query Distribution • Merging and De-duping shift to single NCS node
  • 11. SEARCH, NETWORK AND ANALYTICS Outline  LinkedIn’s Distributed Graph Infrastructure  The Scaling Problem  Set Cover and Example  Evaluation  Related Work  Conclusions and Future Work ©2013 LinkedIn Corporation. All Rights Reserved. 11
  • 12. SEARCH, NETWORK AND ANALYTICS Set Cover Problem  Definition – Given a set of elements U = {1, 2, …, m } (called the universe) and a family S of n sets whose union equals the universe, the set cover problem is to identify the smallest subset of S the union of which contains all elements in the universe.  Greedy Algorithm – Rule to select sets  At each stage, choose the set that contains the largest number of uncovered elements. – Upper bound  O(log s), where s is the size of the largest set from S ©2013 LinkedIn Corporation. All Rights Reserved. 12
  • 13. SEARCH, NETWORK AND ANALYTICS Modified Set Cover Algorithm Example Cont. ©2013 LinkedIn Corporation. All Rights Reserved. 13  Build a map of partition ID -> Graph DB nodes for the input graph
  • 14. SEARCH, NETWORK AND ANALYTICS Modified Set Cover Algorithm Example Cont. ©2013 LinkedIn Corporation. All Rights Reserved. 14  Randomly pick an element from U = { 1,2,5,6 } – e = 5  Retrieve from map – nodes = { R21, R13 }  Intersect – R21 = {1,5 } with U, – R13 = { 5,6 } with U  Select R21 – U = {2,6}, C = { R21}
  • 15. SEARCH, NETWORK AND ANALYTICS Modified Set Cover Algorithm Example Cont. ©2013 LinkedIn Corporation. All Rights Reserved. 15  Randomly pick an element from U = { 2,6 } – e = 2  Retrieve from map – nodes = { R11, R22 }  Intersect – R11 = {1,2 } with U, – R22 = { 2,4 } with U  Select R22 – U = {6}, C = { R21, R22 }
  • 16. SEARCH, NETWORK AND ANALYTICS Modified Set Cover Algorithm Example Cont. ©2013 LinkedIn Corporation. All Rights Reserved. 16  Pick the final element from U = { 6 } – e = 6  Retrieve from map – nodes = { R23, R13 }  Intersect – R23 = { 3,6 } with U, – R13 = { 5,6 } with U  Select R23 – U = {}, C = {R21, R22, R23 }
  • 17. SEARCH, NETWORK AND ANALYTICS Modified Set Cover Algorithm Example Solution  Solution Compared to Optimal Result for U , where U = {1,2,5,6} ©2013 LinkedIn Corporation. All Rights Reserved. 17
  • 18. SEARCH, NETWORK AND ANALYTICS Outline  LinkedIn’s Distributed Graph Infrastructure  The Scaling Problem  Set Cover by Example  Evaluation  Related Work  Conclusions and Future Work ©2013 LinkedIn Corporation. All Rights Reserved. 18
  • 19. SEARCH, NETWORK AND ANALYTICS Evaluation  Production Results – Second degree cache creation drops 38% on 99th percentile – Graph distance calculation drops 25% on 99th percentile – Outbound traffic drops 40%; Inbound traffic drops 10% ©2013 LinkedIn Corporation. All Rights Reserved. 19 95th quantile 99th quantile 0 1 2 3 control setcover control setcover latency(normalized) network cache building 95th quantile 99th quantile 0 1 2 3 control setcover control setcover latency(normalized) getdistances
  • 20. SEARCH, NETWORK AND ANALYTICS Outline  LinkedIn’s Distributed Graph Infrastructure  The Scaling Problem  Set Cover by Example  Evaluation  Related Work  Conclusions and Future Work ©2013 LinkedIn Corporation. All Rights Reserved. 20
  • 21. SEARCH, NETWORK AND ANALYTICS Related Work  Scaling through Replications – Collocating neighbors [Pujol2010] – Replication based on read/write frequencies [Mondal2012] – Replication based on locality [Carrasco2011]  Multi-Core Implementations – Parallel graph exploration [Hong2011]  Offline Graph Systems – Google’s Pregel [Malewicz2010] – Distributed GraphLab [Low2012] ©2013 LinkedIn Corporation. All Rights Reserved. 21
  • 22. SEARCH, NETWORK AND ANALYTICS Outline  LinkedIn’s Distributed Graph Infrastructure  The Scaling Problem  Set Cover by Example  Evaluation  Related Work  Conclusions and Future Work ©2013 LinkedIn Corporation. All Rights Reserved. 22
  • 23. SEARCH, NETWORK AND ANALYTICS Conclusions and Future Work  Future Work – Incremental cache updates – Replication on GraphDB nodes through LRU caching – New graph traversal algorithms  Conclusions – Key challenges tackled  Work distribution balancing  Communication Bandwidth – Set cover optimized latency for distributed query by  Identifying a much smaller set of GraphDB nodes serving queries  Shifting workload to GraphDB to utilize parallel powers  Alleviating the K-way merging costs on NCS by reducing K – Available at https://github.com/linkedin- sna/norbert/blob/master/network/src/main/scala/com/linkedin/norbert/network/partitioned /loadbalancer/DefaultPartitionedLoadBalancerFactory.scala ©2013 LinkedIn Corporation. All Rights Reserved. 23
  • 24. SEARCH, NETWORK AND ANALYTICS©2013 LinkedIn Corporation. All Rights Reserved. 24