How Lucene Powers LinkedIn
Segmentation & Targeting Platform
Lucene/SOLR Revolution EU, November 2013
Hien Luu, Raj Rangaswamy
©2013 LinkedIn Corporation. All Rights Reserved.
About Us
Hien Luu
Rajasekaran Rangaswamy

Agenda
§  A little bit about LinkedIn
§  Segmentation & Targeting Platform Overview
§  How Lucene powers the Segmentation & Targeting Platform
§  Q&A

Our Mission
Connect the world’s professionals to make them
more productive and successful.

Our Vision
Create economic opportunity for every
professional in the world.

Members First!
The world’s largest professional network
Over 65% of members are now international
•  >30M
•  >90% of Fortune 100 companies use LinkedIn Talent Solutions to hire
•  >3M Company Pages
•  19 languages
•  >5.7B professional searches in 2012

Other Company Facts
•  Headquartered in Mountain View, Calif., with offices around the world
•  LinkedIn has ~4200 full-time employees located around the world
Source: http://press.linkedin.com/about

Segmentation & Targeting

Bhaskar Ghosh

Attribute types
Segmentation & Targeting
1. Create attributes
§  Name
§  Email
§  State
§  Occupation
§  Etc.

2. Attributes Added to Table
Name        | Email            | State      | Occupation
John Smith  | jsmith@blah.com  | California | Engineer
Jane Smith  | smithj@mail.com  | Nevada     | HR Manager
Jane Doe    | jdoe@email.com   | California | Engineer
…

3. Create Target Segment: California, Engineer
Name        | Email            | State      | Occupation
John Smith  | jsmith@blah.com  | California | Engineer
Jane Doe    | jdoe@email.com   | California | Engineer

4. Export List & Send to Vendor
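The four steps above amount to a conjunctive filter over a member attribute table. A minimal Python sketch using the slide's sample rows; the helper name `build_segment` and the dict-based table are illustrative assumptions, not the platform's API:

```python
# Sample rows taken from the slide's attribute table.
MEMBERS = [
    {"name": "John Smith", "email": "jsmith@blah.com", "state": "California", "occupation": "Engineer"},
    {"name": "Jane Smith", "email": "smithj@mail.com", "state": "Nevada", "occupation": "HR Manager"},
    {"name": "Jane Doe", "email": "jdoe@email.com", "state": "California", "occupation": "Engineer"},
]


def build_segment(rows, **criteria):
    """Return the rows whose attributes equal every given criterion (hypothetical helper)."""
    return [row for row in rows if all(row.get(k) == v for k, v in criteria.items())]


# Step 3: target segment "California, Engineer".
segment = build_segment(MEMBERS, state="California", occupation="Engineer")
# Step 4: the exported list is just the selected contact details.
export_list = [(row["name"], row["email"]) for row in segment]
```

A scan like this is fine for three rows; the rest of the deck is about doing the same selection over a much larger attribute table with an inverted index.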
  
Segmentation & Targeting

§  Business definition
–  Business would like to launch new campaigns often
–  Business would like to specify targeting criteria using an arbitrary set of attributes
–  Attributes need to be computed to fulfill the targeting criteria
–  The attribute data resides on Hadoop or TD
–  Business is most comfortable with a SQL-like language

Segmentation & Targeting
Attribute Computation Engine
Attribute Serving Engine

Segmentation & Targeting
Attribute Computation Engine
•  Attribute consolidation
•  Self-service
•  Support various data sources
•  Attribute availability

Segmentation & Targeting
Attribute computation
[Diagram labels: PB, TB, TB, ~238M, ~440]

Segmentation & Targeting
Attribute Serving Engine
•  Build segments
•  Build lists
•  Self-service
•  Attribute predicate expression

Segmentation & Targeting
Serving Engine
•  count
•  filter
•  sum
•  complex expressions
LinkedIn Member Attribute table (~238M, ~440)

LinkedIn Segmentation & Targeting Platform
Who are the job seekers?
Who are the LinkedIn Talent Solution prospects in Europe?
Who are the North American recruiters that don’t work for a competitor?

LinkedIn Segmentation & Targeting Platform

Complex tree-like attribute predicate expressions
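
One way to picture a tree-like predicate expression is as nested AND/OR/NOT nodes whose leaves are attribute tests, evaluated against an inverted index of posting sets. The sketch below assumes a hypothetical tuple-based node format and made-up attribute names; it illustrates the idea rather than the platform's actual expression language:

```python
from collections import defaultdict


def build_inverted_index(rows):
    """Map (attribute, value) -> set of row ids that carry that value."""
    index = defaultdict(set)
    for doc_id, row in enumerate(rows):
        for attr, value in row.items():
            index[(attr, value)].add(doc_id)
    return index


def evaluate(node, index, all_ids):
    """Recursively evaluate a predicate tree into a set of matching row ids."""
    op = node[0]
    if op == "term":                      # ("term", attribute, value)
        return set(index.get((node[1], node[2]), set()))
    if op == "not":                       # ("not", child)
        return all_ids - evaluate(node[1], index, all_ids)
    children = [evaluate(child, index, all_ids) for child in node[1:]]
    if op == "and":
        return set.intersection(*children)
    if op == "or":
        return set.union(*children)
    raise ValueError(f"unknown operator: {op}")


# Illustrative rows; the attribute names are assumptions.
rows = [
    {"region": "North America", "role": "Recruiter", "employer": "Acme"},
    {"region": "North America", "role": "Recruiter", "employer": "Competitor"},
    {"region": "Europe", "role": "Engineer", "employer": "Acme"},
]
index = build_inverted_index(rows)
all_ids = set(range(len(rows)))

# "North American recruiters that don't work for a competitor"
query = ("and",
         ("term", "region", "North America"),
         ("term", "role", "Recruiter"),
         ("not", ("term", "employer", "Competitor")))
matches = evaluate(query, index, all_ids)
```

Each leaf becomes a posting-set lookup and each inner node a set operation, which is why arbitrarily deep trees stay cheap to evaluate.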

Agenda
§  Architecture
–  Indexer Architecture
–  Serving Architecture

§  Load Balanced Model
§  Next Steps - Distributed Model
§  DocValues
§  Lessons Learnt
§  Why not use an existing solution?

Architecture
Attribute Serving Engine
Attribute Computation Engine
Data Storage Layer
Attribute Indexing
Attribute Creation Engine
Attribute Serving Engine
Attribute Materialization Engine
Attribute Metastore

Indexer
mysql attribute store; Avro data in HDFS; Attribute Definitions
Hadoop Indexer MR
Mapper: K=> AvroKey<GenericRecord>, V=> AvroValue<NullWritable>
Reducer: K=> NullWritable, V=> LuceneDocumentWrapper
LuceneOutputFormat, RecordWriter, LuceneDocumentWrapper, Document
HDFS: shard 1, shard 2, … shard n
Index Merger
Web Servers: Index
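
The flow on this slide (map records, reduce them into per-shard indexes, then merge) can be sketched in plain Python. This is an illustration only: the modulo partitioning, the `member_id` key, and the dict-of-sets index are assumptions, and no Hadoop or Lucene API is used:

```python
from collections import defaultdict

NUM_SHARDS = 3  # assumed shard count for the illustration


def map_phase(records, num_shards=NUM_SHARDS):
    """Assign each record to a shard; stands in for the mapper's partitioning."""
    for record in records:
        yield record["member_id"] % num_shards, record


def reduce_phase(mapped):
    """Build one small inverted index per shard; stands in for the reducer."""
    shards = defaultdict(lambda: defaultdict(set))
    for shard_id, record in mapped:
        for attr, value in record.items():
            if attr != "member_id":
                shards[shard_id][(attr, value)].add(record["member_id"])
    return shards


def merge_shards(shards):
    """Union the posting sets of several shard indexes into one index."""
    merged = defaultdict(set)
    for shard_index in shards.values():
        for term, postings in shard_index.items():
            merged[term] |= postings
    return merged


records = [
    {"member_id": 1, "state": "California"},
    {"member_id": 2, "state": "Nevada"},
    {"member_id": 3, "state": "California"},
    {"member_id": 4, "state": "California"},
]
shards = reduce_phase(map_phase(records))
merged = merge_shards(shards)
```

Building many small shard indexes in parallel and merging them afterwards matches the "create many partitions and merge them in a different process" advice later in the deck.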
Serving
JSON Predicate Expression
JSON Lucene Query Parser
Inverted Index (three shown)
Segment & List
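
To make the JSON-to-query step concrete, the sketch below renders a JSON predicate as a string in Lucene's classic query syntax (`field:"value"`, `AND`, `OR`, `NOT`). The JSON shape (`term`, `and`, `or`, `not` keys) and the function `to_lucene` are hypothetical, since the slides do not show the real format, and the actual parser may well build query objects directly instead of strings:

```python
import json


def to_lucene(node):
    """Render a hypothetical JSON predicate node as Lucene classic query syntax."""
    if "term" in node:                    # {"term": {"field": ..., "value": ...}}
        field, value = node["term"]["field"], node["term"]["value"]
        escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
        return f'{field}:"{escaped}"'
    if "not" in node:                     # {"not": <node>}
        # Lucene needs a positive clause next to a NOT, so keep it inside an AND.
        return f"NOT {to_lucene(node['not'])}"
    for op in ("and", "or"):              # {"and": [<node>, ...]}
        if op in node:
            joined = f" {op.upper()} ".join(to_lucene(child) for child in node[op])
            return f"({joined})"
    raise ValueError(f"unsupported node: {node}")


predicate = json.loads("""
{"and": [
  {"term": {"field": "state", "value": "California"}},
  {"term": {"field": "occupation", "value": "Engineer"}},
  {"not": {"term": {"field": "employer", "value": "Competitor"}}}
]}
""")
query_string = to_lucene(predicate)
```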
Serving – Load Balanced Model
HTTP Request
Load Balancer
Web Server 1: Shard 1
Web Server 2: Shard 2
Web Server n: Shard n
Shared Drive
Serving – Load Balanced Model

But wait…
•  Is load balancing alone good enough?
•  What about distribution and failover?

Next Steps - Distributed Model

Apache Helix
•  A generic cluster management framework
•  Used to manage partitioned and replicated resources in distributed systems
•  Built on top of ZooKeeper; hides the complexity of ZK primitives
•  Provides distributed features such as leader election, two-phase commit, etc. via a state machine model
http://helix.incubator.apache.org/

Next Steps - Distributed Model
HTTP Request
Load Balancer
Scatter Gather
Web Server 1: Shard 1 (active), Shard 2 (standby)
Web Server 2: Shard 2 (active), Shard 3 (standby)
Web Server 3: Shard 3 (active), Shard 1 (standby)

Next Steps - Distributed Model
HTTP Request
Load Balancer
Scatter Gather
Web Server 1: Shard 1 (active), Shard 2 (standby)
Web Server 2: Shard 2 (active), Shard 3 (active)
Web Server 3: Shard 3 (failure), Shard 1 (failure)
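
The scatter-gather and failover behaviour in the two diagrams can be sketched as follows. Apache Helix is not modeled here: failover is reduced to trying the standby replica when the active one raises an error, and the `Replica` class, server names, and member ids are all illustrative assumptions:

```python
class Replica:
    """One copy of a shard hosted on a web server (illustrative stand-in)."""

    def __init__(self, server, member_ids, healthy=True):
        self.server = server
        self.member_ids = set(member_ids)
        self.healthy = healthy

    def count(self, predicate):
        if not self.healthy:
            raise ConnectionError(f"{self.server} is down")
        return sum(1 for member_id in self.member_ids if predicate(member_id))


def scatter_gather_count(shards, predicate):
    """Query one live replica per shard and add up the partial counts."""
    total = 0
    for shard_id, replicas in shards.items():
        for replica in replicas:          # active replica first, then standby
            try:
                total += replica.count(predicate)
                break
            except ConnectionError:
                continue                  # fall over to the next replica
        else:
            raise RuntimeError(f"no live replica for shard {shard_id}")
    return total


# Layout after Web Server 3 fails, as in the second diagram.
shards = {
    1: [Replica("web1", [1, 4, 7]), Replica("web3", [1, 4, 7], healthy=False)],
    2: [Replica("web2", [2, 5, 8]), Replica("web1", [2, 5, 8])],
    3: [Replica("web3", [3, 6, 9], healthy=False), Replica("web2", [3, 6, 9])],
}
even_members = scatter_gather_count(shards, lambda member_id: member_id % 2 == 0)
```

Because every shard is answered by exactly one replica, the gathered total stays correct even though one server is down.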
DocValues – Use Case
•  Once segments are built, users want to forecast, that is, see a target revenue projection for the campaigns they want to run
•  Campaigns can be run on various revenue models
•  This involves adding per-member propensity scores and dollar amounts

DocValues – Why not Stored Fields?
•  Stored fields have one indirection per document, resulting in two disk seeks per document
    Document ID → .fdx (fetch file pointer to field data) → .fdt (scan by id until field is found)
•  Performance cost quickly adds up when fetching millions of documents

DocValues – Why not Field Cache?
•  Is memory resident
•  Works fine when there is enough memory
•  But keeping millions of un-inverted values in memory is impossible
•  Additional cost to parse values (from String and to String)

DocValues
•  Dense column-based storage (1 value per document and 1 column per field and segment)
•  Accepts primitives
•  No conversion from/to String needed
•  Loads 80x-100x faster than building a FieldCache
•  All the work is done during indexing
•  DocValue fields can be indexed and stored too
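
The difference between row-oriented stored fields and dense per-field columns can be illustrated without Lucene. In this sketch the field names, the sample values, and the revenue formula (propensity times dollar amount, summed over the segment) are assumptions chosen for the example, not figures or logic from the talk:

```python
from array import array

# Row-oriented "stored fields": one record per document, fetched one at a time.
stored_fields = [
    {"member_id": 1, "propensity": 0.5, "dollars": 100.0},
    {"member_id": 2, "propensity": 0.25, "dollars": 200.0},
    {"member_id": 3, "propensity": 1.0, "dollars": 40.0},
    {"member_id": 4, "propensity": 0.0, "dollars": 500.0},
]

# Column-oriented layout: one dense array of primitives per field,
# indexed by document id, with no String parsing at read time.
propensity_column = array("d", (doc["propensity"] for doc in stored_fields))
dollars_column = array("d", (doc["dollars"] for doc in stored_fields))


def projected_revenue(doc_ids):
    """Sum propensity * dollar amount over the documents in a segment."""
    return sum(propensity_column[i] * dollars_column[i] for i in doc_ids)


segment_doc_ids = [0, 1, 2]          # e.g. the hits of a segment query
revenue = projected_revenue(segment_doc_ids)
```

Reading a value is a single array lookup by document id, which is the property that makes per-document aggregation over millions of hits practical.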
Lessons Learnt
Indexing
•  Reuse index writers, field and document instances
•  Create many partitions and merge them in a different process
•  Rebuild (bootstrap) the entire index if possible
•  Use partial updates with caution
•  Analyze the index
Serving
•  Reuse a single instance of IndexSearcher
•  Limit usage of stored fields and term vectors
•  Plan for load balancing and failover
•  Cache term frequencies
•  Use different machines for serving and indexing

Why not use an existing solution?
•  Doesn’t allow dynamic schema
•  Difficult to bootstrap indexes built in Hadoop
•  Indexing elevates query latency

•  Doesn’t allow dynamic schema
•  Difficult to bootstrap indexes built in Hadoop
•  Larger memory overhead
•  Comparatively slow

Questions?
More info: data.linkedin.com


Mais conteúdo relacionado

Mais procurados

Hadoop vs. RDBMS for Advanced Analytics
Hadoop vs. RDBMS for Advanced AnalyticsHadoop vs. RDBMS for Advanced Analytics
Hadoop vs. RDBMS for Advanced Analytics
joshwills
 
Hitchhiker’s Guide to SharePoint BI
Hitchhiker’s Guide to SharePoint BIHitchhiker’s Guide to SharePoint BI
Hitchhiker’s Guide to SharePoint BI
Andrew Brust
 

Mais procurados (20)

Hadoop Security
Hadoop SecurityHadoop Security
Hadoop Security
 
Securing your Big Data Environments in the Cloud
Securing your Big Data Environments in the CloudSecuring your Big Data Environments in the Cloud
Securing your Big Data Environments in the Cloud
 
Hadoop data access layer v4.0
Hadoop data access layer v4.0Hadoop data access layer v4.0
Hadoop data access layer v4.0
 
Data Discovery & Lineage in Enterprise Hadoop
Data Discovery & Lineage in Enterprise HadoopData Discovery & Lineage in Enterprise Hadoop
Data Discovery & Lineage in Enterprise Hadoop
 
Relational databases vs Non-relational databases
Relational databases vs Non-relational databasesRelational databases vs Non-relational databases
Relational databases vs Non-relational databases
 
Apache atlas sydney 2017-v4
Apache atlas   sydney 2017-v4Apache atlas   sydney 2017-v4
Apache atlas sydney 2017-v4
 
Foreign Data Wrappers and You with Postgres
Foreign Data Wrappers and You with PostgresForeign Data Wrappers and You with Postgres
Foreign Data Wrappers and You with Postgres
 
Graph Databases
Graph DatabasesGraph Databases
Graph Databases
 
Open-BDA - Big Data Hadoop Developer Training 10th & 11th June
Open-BDA - Big Data Hadoop Developer Training 10th & 11th JuneOpen-BDA - Big Data Hadoop Developer Training 10th & 11th June
Open-BDA - Big Data Hadoop Developer Training 10th & 11th June
 
Big data course
Big data  courseBig data  course
Big data course
 
Comparison among rdbms, hadoop and spark
Comparison among rdbms, hadoop and sparkComparison among rdbms, hadoop and spark
Comparison among rdbms, hadoop and spark
 
Introduction to Graph Databases
Introduction to Graph DatabasesIntroduction to Graph Databases
Introduction to Graph Databases
 
OpenLink Virtuoso - Management & Decision Makers Overview
OpenLink Virtuoso - Management & Decision Makers OverviewOpenLink Virtuoso - Management & Decision Makers Overview
OpenLink Virtuoso - Management & Decision Makers Overview
 
Virtuoso Universal Server Overview
Virtuoso Universal Server OverviewVirtuoso Universal Server Overview
Virtuoso Universal Server Overview
 
Worldpay - Delivering Multi-Tenancy Applications in A Secure Operational Plat...
Worldpay - Delivering Multi-Tenancy Applications in A Secure Operational Plat...Worldpay - Delivering Multi-Tenancy Applications in A Secure Operational Plat...
Worldpay - Delivering Multi-Tenancy Applications in A Secure Operational Plat...
 
Hadoop vs. RDBMS for Advanced Analytics
Hadoop vs. RDBMS for Advanced AnalyticsHadoop vs. RDBMS for Advanced Analytics
Hadoop vs. RDBMS for Advanced Analytics
 
Hitchhiker’s Guide to SharePoint BI
Hitchhiker’s Guide to SharePoint BIHitchhiker’s Guide to SharePoint BI
Hitchhiker’s Guide to SharePoint BI
 
Big Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI ProsBig Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI Pros
 
Under The Hood of Pluggable Databases by Alex Gorbachev, Pythian, Oracle OpeW...
Under The Hood of Pluggable Databases by Alex Gorbachev, Pythian, Oracle OpeW...Under The Hood of Pluggable Databases by Alex Gorbachev, Pythian, Oracle OpeW...
Under The Hood of Pluggable Databases by Alex Gorbachev, Pythian, Oracle OpeW...
 
Fifth Elephant Apache Atlas Talk
Fifth Elephant Apache Atlas TalkFifth Elephant Apache Atlas Talk
Fifth Elephant Apache Atlas Talk
 

Semelhante a How Lucene Powers the LinkedIn Segmentation and Targeting Platform

How Lucene Powers the LinkedIn Segmentation and Targeting Platform
How Lucene Powers the LinkedIn Segmentation and Targeting PlatformHow Lucene Powers the LinkedIn Segmentation and Targeting Platform
How Lucene Powers the LinkedIn Segmentation and Targeting Platform
Hien Luu
 
The Key to Big Data Modeling: Collaboration
The Key to Big Data Modeling: CollaborationThe Key to Big Data Modeling: Collaboration
The Key to Big Data Modeling: Collaboration
Embarcadero Technologies
 
LinkedIn Segmentation & Targeting Platform
LinkedIn Segmentation & Targeting PlatformLinkedIn Segmentation & Targeting Platform
LinkedIn Segmentation & Targeting Platform
Hien Luu
 

Semelhante a How Lucene Powers the LinkedIn Segmentation and Targeting Platform (20)

How Lucene Powers the LinkedIn Segmentation and Targeting Platform
How Lucene Powers the LinkedIn Segmentation and Targeting PlatformHow Lucene Powers the LinkedIn Segmentation and Targeting Platform
How Lucene Powers the LinkedIn Segmentation and Targeting Platform
 
Migrating from Oracle to Postgres
Migrating from Oracle to PostgresMigrating from Oracle to Postgres
Migrating from Oracle to Postgres
 
The Real Scoop on Migrating from Oracle Databases
The Real Scoop on Migrating from Oracle DatabasesThe Real Scoop on Migrating from Oracle Databases
The Real Scoop on Migrating from Oracle Databases
 
Key Methodologies for Migrating from Oracle to Postgres
Key Methodologies for Migrating from Oracle to PostgresKey Methodologies for Migrating from Oracle to Postgres
Key Methodologies for Migrating from Oracle to Postgres
 
LinkedIn Segmentation & Targeting Platform: A Big Data Application
LinkedIn Segmentation & Targeting Platform: A Big Data ApplicationLinkedIn Segmentation & Targeting Platform: A Big Data Application
LinkedIn Segmentation & Targeting Platform: A Big Data Application
 
The Changing Role of a DBA in an Autonomous World
The Changing Role of a DBA in an Autonomous WorldThe Changing Role of a DBA in an Autonomous World
The Changing Role of a DBA in an Autonomous World
 
Atlassian Executive Business Forum - LinkedIn HQ
Atlassian Executive Business Forum - LinkedIn HQAtlassian Executive Business Forum - LinkedIn HQ
Atlassian Executive Business Forum - LinkedIn HQ
 
Oracle databáze – Konsolidovaná Data Management Platforma
Oracle databáze – Konsolidovaná Data Management PlatformaOracle databáze – Konsolidovaná Data Management Platforma
Oracle databáze – Konsolidovaná Data Management Platforma
 
The Key to Big Data Modeling: Collaboration
The Key to Big Data Modeling: CollaborationThe Key to Big Data Modeling: Collaboration
The Key to Big Data Modeling: Collaboration
 
SharePoint Databases: What you need to know (201509)
SharePoint Databases: What you need to know (201509)SharePoint Databases: What you need to know (201509)
SharePoint Databases: What you need to know (201509)
 
SharePoint Migrations Pitfalls from the Crypt
SharePoint Migrations Pitfalls from the CryptSharePoint Migrations Pitfalls from the Crypt
SharePoint Migrations Pitfalls from the Crypt
 
LinkedIn Segmentation & Targeting Platform
LinkedIn Segmentation & Targeting PlatformLinkedIn Segmentation & Targeting Platform
LinkedIn Segmentation & Targeting Platform
 
LinkedIn's Segmentation & Targeting Platform (Hadoop Summit 2013)
LinkedIn's Segmentation & Targeting Platform (Hadoop Summit 2013)LinkedIn's Segmentation & Targeting Platform (Hadoop Summit 2013)
LinkedIn's Segmentation & Targeting Platform (Hadoop Summit 2013)
 
LinkedIn Member Segmentation Platform: A Big Data Application
LinkedIn Member Segmentation Platform: A Big Data ApplicationLinkedIn Member Segmentation Platform: A Big Data Application
LinkedIn Member Segmentation Platform: A Big Data Application
 
Introduction to Active Directory
Introduction to Active DirectoryIntroduction to Active Directory
Introduction to Active Directory
 
(Oracle) DBA and Other Skills Needed in 2020
(Oracle) DBA and Other Skills Needed in 2020(Oracle) DBA and Other Skills Needed in 2020
(Oracle) DBA and Other Skills Needed in 2020
 
SPSChicagoBurbs 2019 - What is CDM and CDS?
SPSChicagoBurbs 2019 - What is CDM and CDS?SPSChicagoBurbs 2019 - What is CDM and CDS?
SPSChicagoBurbs 2019 - What is CDM and CDS?
 
#dbhouseparty - Should I be building Microservices?
#dbhouseparty - Should I be building Microservices?#dbhouseparty - Should I be building Microservices?
#dbhouseparty - Should I be building Microservices?
 
Oracle ADF Architecture TV - Planning & Getting Started - Team, Skills and D...
Oracle ADF Architecture TV -  Planning & Getting Started - Team, Skills and D...Oracle ADF Architecture TV -  Planning & Getting Started - Team, Skills and D...
Oracle ADF Architecture TV - Planning & Getting Started - Team, Skills and D...
 
Muruga logeswaran CV-Senior .Net Developer
Muruga logeswaran CV-Senior .Net DeveloperMuruga logeswaran CV-Senior .Net Developer
Muruga logeswaran CV-Senior .Net Developer
 

Mais de lucenerevolution

Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic search
lucenerevolution
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - final
lucenerevolution
 

Mais de lucenerevolution (20)

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucene
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here!
 
Search at Twitter
Search at TwitterSearch at Twitter
Search at Twitter
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solr
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applications
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloud
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusters
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic search
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Storm
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST API
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucene
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucene
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenal
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside down
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - final
 

Último

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Último (20)

Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 

How Lucene Powers the LinkedIn Segmentation and Targeting Platform

  • 1. How Lucene Powers LinkedIn Segmentation & Targeting Platform Lucene/SOLR Revolution EU, November 2013 Hien Luu, Raj Rangaswamy ©2013 LinkedIn Corporation. All Rights Reserved.
  • 2. About Us * Hien  Luu   Rajasekaran   Rangaswamy  
  • 3. Agenda §  Little bit about LinkedIn §  Segmentation & Targeting Platform Overview §  How Lucene powers Segmentation & Targeting Platform §  Q&A ©2013 LinkedIn Corporation. All Rights Reserved.
  • 4. Our Mission Connect the world’s professionals to make them more productive and successful. Our Vision Create economic opportunity for every professional in the world. Members First!
  • 5. The world’s largest professional network Over 65% of members are now international   >30M   >90% Fortune  100  Companies     use  LinkedIn  Talent  Soln  to  hire   >3M   Company  Pages       19 Languages     >5.7B   Professional  searches  in  2012     ©2013 LinkedIn Corporation. All Rights Reserved.
  • 6. Other Company Facts •  Headquartered  in  Mountain  View,  Calif.,  with  offices  around  the  world! •  LinkedIn  has  ~4200  full-­‐Kme  employees  located  around  the  world   *   Source : http://press.linkedin.com/about
  • 7. SegmentaKon  &  TargeKng   ©2013 LinkedIn Corporation. All Rights Reserved.
  • 9. Segmentation & Targeting Bhaskar Ghosh Attribute types
  • 10. Segmentation & Targeting 1. Create attributes §  §  §  §  §  Name Email State Occupation Etc. 2. Attributes Added to Table Name   Email   State   OccupaEon   John  Smith   jsmith@blah.com   California   Engineer   Jane  Smith   smithj@mail.com   Nevada   HR  Manager   Jane  Doe   jdoe@email.com   California   …   Engineer   3. Create Target Segment: California, Engineer Name   Email   State   OccupaEon   John  Smith   jsmith@blah.com   California   Engineer   Jane  Doe   jdoe@email.com   California   4. Export List & Send Vendor Engineer   LinkedIn Confidential ©2013 All Rights Reserved 10  
  • 11. Segmentation & Targeting §  Business definition –  Business would like to launch new campaign often –  Business would like to specify targeting criteria using arbitrary set of attributes –  Attributes need to be computed to fulfill the targeting criteria –  The attribute data resides on Hadoop or TD –  Business is most comfortable with SQL-like language ©2013 LinkedIn Corporation. All Rights Reserved.
  • 12. Segmentation & Targeting Attribute Computation Engine ©2013 LinkedIn Corporation. All Rights Reserved. Attribute Serving Engine
  • 13. Segmentation & Targeting Attribute consolidation Self-service Attribute Computation Engine Support various data sources ©2013 LinkedIn Corporation. All Rights Reserved. Attribute availability
  • 14. Segmentation & Targeting PB Attribute computation ~238M TB TB ~440 ©2013 LinkedIn Corporation. All Rights Reserved.
  • 15. Segmentation & Targeting Build segments Self-service Attribute Serving Engine Attribute predicate expression ©2013 LinkedIn Corporation. All Rights Reserved. Build lists
  • 16. Segmentation & Targeting count filter $ 1234 complex sum expressions Σ Serving Engine ~238M ~440 LinkedIn Member Attribute table ©2013 LinkedIn Corporation. All Rights Reserved.
  • 17. LinkedIn Segmentation & Targeting Platform Who are the job seekers? Who are the LinkedIn Talent Solution prospects in Europe? Who are north American recruiters that don’t work for a competitor? ©2013 LinkedIn Corporation. All Rights Reserved.
  • 18. LinkedIn Segmentation & Targeting Platform Complex tree-like attribute predicate expressions ©2013 LinkedIn Corporation. All Rights Reserved.
  • 19. Agenda: Architecture (Indexer Architecture, Serving Architecture); Load Balanced Model; Next Steps - Distributed Model; DocValues; Lessons Learnt; Why not use an existing solution?
  • 20. Architecture [diagram; labels: Attribute Serving Engine; Attribute Computation Engine; Data Storage Layer; Attribute Indexing; Attribute Creation Engine; Attribute Materialization Engine; Attribute Metastore]
  • 21. Indexer [diagram]
      –  Inputs: Attribute Definitions (MySQL attribute store); Avro data in HDFS
      –  Hadoop Indexer MR
           Mapper: K => AvroKey<GenericRecord>, V => AvroValue<NullWritable>
           Reducer: K => NullWritable, V => LuceneDocumentWrapper
           LuceneOutputFormat / RecordWriter: LuceneDocumentWrapper, Document
      –  Output in HDFS: shard 1, shard 2, ..., shard n
      –  Index Merger
      –  Index on Web Servers
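The indexing flow on this slide (partition the records, build one index per shard, then merge) can be mimicked in plain Python. This is only a sketch of the idea: it does not use Hadoop, Avro or Lucene, and the function names, the modulo partitioning rule and the `(field, value)` term layout are assumptions made for the example.

```python
from collections import defaultdict


def partition(records, num_shards):
    """Map-phase stand-in: route each record to a shard by member id."""
    shards = defaultdict(list)
    for record in records:
        shards[record["member_id"] % num_shards].append(record)
    return shards


def build_shard_index(records):
    """Reduce-phase stand-in: inverted index of (field, value) -> member ids."""
    index = defaultdict(list)
    for record in sorted(records, key=lambda r: r["member_id"]):
        for field, value in record.items():
            if field != "member_id":
                index[(field, value)].append(record["member_id"])
    return dict(index)


def merge_indexes(shard_indexes):
    """Index Merger stand-in: union the posting lists of several shard indexes."""
    merged = defaultdict(list)
    for index in shard_indexes:
        for term, postings in index.items():
            merged[term].extend(postings)
    return {term: sorted(postings) for term, postings in merged.items()}


records = [
    {"member_id": 1, "state": "California", "occupation": "Engineer"},
    {"member_id": 2, "state": "Nevada", "occupation": "HR Manager"},
    {"member_id": 3, "state": "California", "occupation": "Engineer"},
    {"member_id": 4, "state": "Nevada", "occupation": "Engineer"},
]

shards = partition(records, num_shards=2)
shard_indexes = [build_shard_index(rows) for rows in shards.values()]
merged = merge_indexes(shard_indexes)
print(merged[("state", "California")])
```

Building shards independently is what lets the work spread across many reducers; the merge step can then run as a separate process.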
  • 22. Serving [diagram]: JSON Predicate Expression → JSON Lucene Query Parser → Inverted Indexes → Segment & List
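To make the parsing step concrete, the sketch below turns a JSON predicate expression into a query string in Lucene's classic query syntax (`field:"value"`, `AND`, `OR`, `NOT`). The JSON layout and the function name `to_lucene_query` are assumptions; the slides do not show the platform's JSON schema, and the actual parser may well build Lucene query objects directly rather than strings.

```python
import json


def quote(value):
    """Quote a value as a phrase, escaping backslashes and double quotes."""
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def to_lucene_query(node):
    """Translate a (hypothetical) JSON predicate tree into classic query syntax."""
    if "and" in node:
        return "(" + " AND ".join(to_lucene_query(c) for c in node["and"]) + ")"
    if "or" in node:
        return "(" + " OR ".join(to_lucene_query(c) for c in node["or"]) + ")"
    if "not" in node:
        # Classic syntax needs NOT to sit next to at least one positive clause,
        # so this sketch expects "not" nodes to appear inside an "and" node.
        return "NOT " + to_lucene_query(node["not"])
    return node["field"] + ":" + quote(node["value"])


predicate_json = """
{"and": [
    {"field": "state", "value": "California"},
    {"or": [
        {"field": "occupation", "value": "Engineer"},
        {"field": "occupation", "value": "Recruiter"}
    ]},
    {"not": {"field": "company", "value": "CompetitorCo"}}
]}
"""

query = to_lucene_query(json.loads(predicate_json))
print(query)
```

Field names are emitted as-is here; a production translator would also need to validate them against the attribute definitions.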
  • 23. Agenda: Architecture (Indexer Architecture, Serving Architecture); Load Balanced Model; Next Steps - Distributed Model; DocValues; Lessons Learnt; Why not use an existing solution?
  • 24. Serving – Load Balanced Model [diagram]: HTTP Request → Load Balancer → Web Server 1 (Shard 1), Web Server 2 (Shard 2), ..., Web Server n (Shard n); Shared Drive
  • 25. Serving – Load Balanced Model: But wait...
      •  Is load balancing alone good enough?
      •  What about distribution and failover?
  • 26. Agenda: Architecture (Indexer Architecture, Serving Architecture); Load Balanced Model; Next Steps - Distributed Model; DocValues; Lessons Learnt; Why not use an existing solution?
  • 27. Next Steps - Distributed Model: Apache Helix (http://helix.incubator.apache.org/)
      •  A generic cluster management framework
      •  Used to manage partitioned and replicated resources in distributed systems
      •  Built on top of ZooKeeper; hides the complexity of ZK primitives
      •  Provides distributed features such as leader election, two-phase commit, etc. via a state machine model
  • 28. Next Steps - Distributed Model [diagram]: HTTP Request → Load Balancer → Scatter Gather
      –  Web Server 1: Shard 1 active, Shard 2 standby
      –  Web Server 2: Shard 2 active, Shard 3 standby
      –  Web Server 3: Shard 3 active, Shard 1 standby
  • 29. Next Steps - Distributed Model [diagram, Web Server 3 fails]: HTTP Request → Load Balancer → Scatter Gather
      –  Web Server 1: Shard 1 active, Shard 2 standby
      –  Web Server 2: Shard 2 active, Shard 3 active (promoted from standby)
      –  Web Server 3: Shard 3 failure, Shard 1 failure
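The failover shown on slides 28 and 29 can be sketched as a small state transition: when a server fails, every shard whose active replica lived there is served by its standby instead. The code below is a plain-Python stand-in, not the Helix API; `fail_server`, `scatter_gather`, `fake_search`, the server names and the per-shard counts are all hypothetical.

```python
# Replica placement from slide 28: each shard has one active and one standby.
PLACEMENT = {
    "shard1": {"active": "web1", "standby": "web3"},
    "shard2": {"active": "web2", "standby": "web1"},
    "shard3": {"active": "web3", "standby": "web2"},
}


def fail_server(placement, failed):
    """Return a new placement after `failed` goes down, promoting standbys."""
    updated = {}
    for shard, roles in placement.items():
        active, standby = roles["active"], roles["standby"]
        if active == failed:
            active, standby = standby, None  # promote the standby replica
        elif standby == failed:
            standby = None                   # the standby replica is lost
        updated[shard] = {"active": active, "standby": standby}
    return updated


def scatter_gather(placement, search, query):
    """Send the query to every shard's active replica and sum the hit counts."""
    total = 0
    for shard, roles in placement.items():
        if roles["active"] is None:
            raise RuntimeError("no live replica for " + shard)
        total += search(roles["active"], shard, query)
    return total


def fake_search(server, shard, query):
    """Hypothetical per-shard search returning a fixed hit count."""
    return {"shard1": 3, "shard2": 5, "shard3": 7}[shard]


after_failure = fail_server(PLACEMENT, "web3")
print(after_failure["shard3"])
print(scatter_gather(after_failure, fake_search, "state:California"))
```

After the failure every shard still has an active replica, so the scatter-gather total is unchanged; what is lost is redundancy for shards 1 and 3 until a new standby is assigned.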
  • 30. Agenda: Architecture (Indexer Architecture, Serving Architecture); Load Balanced Model; Next Steps - Distributed Model; DocValues; Lessons Learnt; Why not use an existing solution?
  • 31. DocValues – Use Case
      •  Once segments are built, users want to forecast, that is, to see a target revenue projection for the campaigns they want to run
      •  Campaigns can be run on various revenue models
      •  This involves adding per-member propensity scores and dollar amounts
  • 32. DocValues – Why not Stored Fields?
      •  Stored fields have one indirection per document, resulting in two disk seeks per document:
           Document ID → .fdx (fetch file pointer to field data) → .fdt (scan by id until field is found)
      •  The performance cost quickly adds up when fetching millions of documents
  • 33. DocValues – Why not Field Cache?
      •  It is memory resident
      •  Works fine when there is enough memory
      •  But keeping millions of un-inverted values in memory is impossible
      •  Additional cost to parse values (from String and to String)
  • 34. DocValues
      •  Dense column-based storage (1 value per document and 1 column per field and segment)
      •  Accepts primitives
      •  No conversion from/to String needed
      •  Loads 80x-100x faster than building a FieldCache
      •  All the work is done during indexing
      •  DocValue fields can be indexed and stored too
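The column layout described above can be pictured as one typed array per field, indexed by document id, so that a forecast over a segment is a direct array lookup per matching document. The sketch below uses Python's `array` module as a stand-in for DocValues. The column contents and the revenue formula (propensity multiplied by dollar amount, summed over the segment) are assumptions; the slides only say that propensity scores and dollar amounts are added per member.

```python
from array import array

# One dense column per field: position i holds the value for document id i.
# The numbers are made up for illustration.
propensity = array("d", [0.5, 0.25, 0.75, 0.125])
dollars = array("d", [100.0, 40.0, 200.0, 8.0])


def forecast(doc_ids):
    """Assumed revenue model: sum of propensity * dollar amount per document."""
    return sum(propensity[doc] * dollars[doc] for doc in doc_ids)


# Document ids matched by some segment query.
segment_docs = [0, 2, 3]
print(forecast(segment_docs))
```

Values are stored as primitives, so no String parsing happens at query time, which is the contrast the slides draw with the field cache.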
  • 35. Agenda: Architecture (Indexer Architecture, Serving Architecture); Load Balanced Model; Next Steps - Distributed Model; DocValues; Lessons Learnt; Why not use an existing solution?
  • 36. Lessons Learnt
      Indexing
      •  Reuse index writers, field and document instances
      •  Create many partitions and merge them in a different process
      •  Rebuild (bootstrap) the entire index if possible
      •  Use partial updates with caution
      •  Analyze the index
      Serving
      •  Reuse a single instance of IndexSearcher
      •  Limit usage of stored fields and term vectors
      •  Plan for load balancing and failover
      •  Cache term frequencies
      •  Use different machines for serving and indexing
  • 37. Agenda: Architecture (Indexer Architecture, Serving Architecture); Load Balanced Model; Next Steps - Distributed Model; DocValues; Lessons Learnt; Why not use an existing solution?
  • 38. Why not use an existing solution?
      First solution [name not captured in the transcript]
      •  Doesn’t allow dynamic schema
      •  Difficult to bootstrap indexes built in Hadoop
      •  Indexing elevates query latency
      Second solution [name not captured in the transcript]
      •  Doesn’t allow dynamic schema
      •  Difficult to bootstrap indexes built in Hadoop
      •  Larger memory overhead
      •  Comparatively slow
  • 39. Questions? More info: data.linkedin.com