1. How Lucene Powers LinkedIn
Segmentation & Targeting Platform
Lucene/SOLR Revolution EU, November 2013
Hien Luu, Raj Rangaswamy
©2013 LinkedIn Corporation. All Rights Reserved.
3. Agenda
§ A little bit about LinkedIn
§ Segmentation & Targeting Platform Overview
§ How Lucene powers Segmentation & Targeting Platform
§ Q&A
4. Our Mission
Connect the world’s professionals to make them
more productive and successful.
Our Vision
Create economic opportunity for every
professional in the world.
Members First!
5. The world’s largest professional network
• Over 65% of members are now international
• >30M
• >90% of Fortune 100 companies use LinkedIn Talent Solutions to hire
• >3M Company Pages
• 19 languages
• >5.7B professional searches in 2012
6. Other Company Facts
• Headquartered in Mountain View, Calif., with offices around the world
• LinkedIn has ~4200 full-time employees located around the world*
* Source: http://press.linkedin.com/about
10. Segmentation & Targeting
1. Create attributes
§ Name
§ Email
§ State
§ Occupation
§ Etc.
2. Attributes Added to Table
Name         Email             State        Occupation
John Smith   jsmith@blah.com   California   Engineer
Jane Smith   smithj@mail.com   Nevada       HR Manager
Jane Doe     jdoe@email.com    California   Engineer
…
3. Create Target Segment: California, Engineer
Name         Email             State        Occupation
John Smith   jsmith@blah.com   California   Engineer
Jane Doe     jdoe@email.com    California   Engineer
4. Export List & Send to Vendor
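The four steps on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not the platform's code: the `Member` class and the `build_segment` and `export_emails` helpers are hypothetical names, and the rows are the sample rows from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    # Step 1: the attributes (hypothetical field names mirroring the slide).
    name: str
    email: str
    state: str
    occupation: str


# Step 2: attributes added to a table (sample rows from the slide).
MEMBERS = [
    Member("John Smith", "jsmith@blah.com", "California", "Engineer"),
    Member("Jane Smith", "smithj@mail.com", "Nevada", "HR Manager"),
    Member("Jane Doe", "jdoe@email.com", "California", "Engineer"),
]


def build_segment(members, **criteria):
    """Step 3: keep the members whose attributes equal every criterion."""
    return [
        m for m in members
        if all(getattr(m, attr) == value for attr, value in criteria.items())
    ]


def export_emails(segment):
    """Step 4: flatten the segment into the list that would be exported."""
    return [m.email for m in segment]


segment = build_segment(MEMBERS, state="California", occupation="Engineer")
```

Here `segment` holds the two California engineers, and `export_emails(segment)` yields the list that would be handed to a vendor.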
11. Segmentation & Targeting
§ Business definition
– Business would like to launch new campaigns often
– Business would like to specify targeting criteria using an arbitrary set of attributes
– Attributes need to be computed to fulfill the targeting criteria
– The attribute data resides on Hadoop or TD
– Business is most comfortable with a SQL-like language
17. LinkedIn Segmentation & Targeting Platform
Who are the job seekers?
Who are the LinkedIn Talent Solution prospects in Europe?
Who are the North American recruiters who don’t work for a competitor?
18. LinkedIn Segmentation & Targeting Platform
Complex tree-like attribute predicate expressions
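One way to picture a tree-like attribute predicate is as nested AND / OR / NOT nodes over attribute comparisons. The sketch below is an assumption about the general shape only, not the platform's actual expression model: the node classes and the attribute names (`occupation`, `country`, `employer_is_competitor`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple

Member = Dict[str, Any]


@dataclass(frozen=True)
class Equals:
    """Leaf node: the attribute must equal the given value."""
    attribute: str
    value: Any

    def matches(self, member: Member) -> bool:
        return member.get(self.attribute) == self.value


@dataclass(frozen=True)
class Not:
    child: Any

    def matches(self, member: Member) -> bool:
        return not self.child.matches(member)


@dataclass(frozen=True)
class And:
    children: Tuple[Any, ...]

    def matches(self, member: Member) -> bool:
        return all(child.matches(member) for child in self.children)


@dataclass(frozen=True)
class Or:
    children: Tuple[Any, ...]

    def matches(self, member: Member) -> bool:
        return any(child.matches(member) for child in self.children)


# "North American recruiters who don't work for a competitor" as a tree.
recruiters = And((
    Equals("occupation", "Recruiter"),
    Or((Equals("country", "United States"), Equals("country", "Canada"))),
    Not(Equals("employer_is_competitor", True)),
))

members = [
    {"name": "A", "occupation": "Recruiter", "country": "Canada", "employer_is_competitor": False},
    {"name": "B", "occupation": "Recruiter", "country": "Canada", "employer_is_competitor": True},
    {"name": "C", "occupation": "Engineer", "country": "United States", "employer_is_competitor": False},
    {"name": "D", "occupation": "Recruiter", "country": "Germany", "employer_is_competitor": False},
]
matched = [m["name"] for m in members if recruiters.matches(m)]
```

In a Lucene-backed system a tree like this would presumably be translated into a boolean query rather than evaluated row by row; the row-by-row evaluation here is only to keep the sketch self-contained.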
19. Agenda
§ Architecture
– Indexer Architecture
– Serving Architecture
§ Load Balanced Model
§ Next Steps - Distributed Model
§ DocValues
§ Lessons Learnt
§ Why not use an existing solution?
24. Serving – Load Balanced Model
HTTP Request
Load Balancer
Web Server 1: Shard 1
Web Server 2: Shard 2
Web Server n: Shard n
Shared Drive
25. Serving – Load Balanced Model
But wait…
• Is load balancing alone good enough?
• What about distribution and failover?
27. Next Steps - Distributed Model
• Apache Helix: a generic cluster management framework
• Used to manage partitioned and replicated resources in distributed systems
• Built on top of ZooKeeper; hides the complexity of ZK primitives
• Provides distributed features such as leader election, two-phase commit, etc. via a state machine model
http://helix.incubator.apache.org/
28. Next Steps - Distributed Model
HTTP Request
Load Balancer
Scatter Gather
Web Server 1: Shard 1 (active), Shard 2 (standby)
Web Server 2: Shard 2 (active), Shard 3 (standby)
Web Server 3: Shard 3 (active), Shard 1 (standby)
29. Next Steps - Distributed Model
HTTP Request
Load Balancer
Scatter Gather
Web Server 1: Shard 1 (active), Shard 2 (standby)
Web Server 2: Shard 2 (active), Shard 3 (active)
Web Server 3: Shard 3 (failure), Shard 1 (failure)
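The two diagrams can be read as a small state change: when a web server fails, the standby copy of any shard it was serving becomes active, and scatter-gather keeps querying one active copy per shard. The sketch below models only that idea in plain Python. It does not use the Helix API; the `Replica` and `Cluster` classes, the server names, and the pairing of shards to servers (inferred from the order of the diagram labels) are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

ACTIVE, STANDBY, FAILURE = "active", "standby", "failure"


@dataclass
class Replica:
    server: str
    shard: int
    state: str


class Cluster:
    def __init__(self, replicas: List[Replica], shard_docs: Dict[int, List[int]]):
        self.replicas = replicas
        self.shard_docs = shard_docs  # shard id -> document ids held by that shard

    def active_replica(self, shard: int) -> Replica:
        for replica in self.replicas:
            if replica.shard == shard and replica.state == ACTIVE:
                return replica
        raise RuntimeError(f"no active replica for shard {shard}")

    def fail_server(self, server: str) -> None:
        """Mark every replica on the server failed, then promote standbys."""
        lost_shards = []
        for replica in self.replicas:
            if replica.server == server:
                if replica.state == ACTIVE:
                    lost_shards.append(replica.shard)
                replica.state = FAILURE
        for shard in lost_shards:
            for replica in self.replicas:
                if replica.shard == shard and replica.state == STANDBY:
                    replica.state = ACTIVE
                    break

    def search(self, predicate: Callable[[int], bool]) -> List[int]:
        """Scatter to one active replica per shard, then gather the hits."""
        hits: List[int] = []
        for shard in sorted(self.shard_docs):
            self.active_replica(shard)  # raises if the shard is unavailable
            hits.extend(d for d in self.shard_docs[shard] if predicate(d))
        return sorted(hits)


def three_server_layout() -> Cluster:
    """Layout assumed from the first diagram: each server holds two shards."""
    return Cluster(
        [
            Replica("web1", 1, ACTIVE), Replica("web1", 2, STANDBY),
            Replica("web2", 2, ACTIVE), Replica("web2", 3, STANDBY),
            Replica("web3", 3, ACTIVE), Replica("web3", 1, STANDBY),
        ],
        {1: [1, 4], 2: [2, 5], 3: [3, 6]},
    )


cluster = three_server_layout()
before = cluster.search(lambda doc_id: doc_id % 2 == 0)
cluster.fail_server("web3")
after = cluster.search(lambda doc_id: doc_id % 2 == 0)
```

After `fail_server("web3")`, shard 3 is served from `web2` and the same query returns the same hits, which is the behaviour the second diagram depicts.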
31. DocValues – Use Case
• Once segments are built, users want to forecast a target revenue projection for the campaigns they want to run
• Campaigns can be run on various revenue models
• This involves adding per-member propensity scores and dollar amounts
32. DocValues – Why not Stored Fields?
Why not use Stored Fields?
• Stored fields have one indirection per document, resulting in two disk seeks per document
– Document ID → .fdx: fetch file pointer to field data
– .fdx → .fdt: scan by id until field is found
• Performance cost quickly adds up when fetching millions of documents
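The indirection described above can be imitated with two in-memory streams. This is a deliberately simplified model and not Lucene's real `.fdx`/`.fdt` file format: the record layout, the `write_stored_fields` and `read_field` helpers, and the seek counter are all invented for illustration.

```python
import io
import struct


def write_stored_fields(docs):
    """Build in-memory stand-ins for .fdx (pointers) and .fdt (field data)."""
    fdx, fdt = io.BytesIO(), io.BytesIO()
    for fields in docs:
        fdx.write(struct.pack(">Q", fdt.tell()))   # one fixed-width pointer per document
        fdt.write(struct.pack(">H", len(fields)))  # number of fields in this document
        for name, value in fields.items():
            for text in (name, value):
                raw = text.encode("utf-8")
                fdt.write(struct.pack(">H", len(raw)) + raw)
    return fdx, fdt


def _read_text(stream):
    (length,) = struct.unpack(">H", stream.read(2))
    return stream.read(length).decode("utf-8")


def read_field(fdx, fdt, doc_id, wanted):
    """Return (value, seeks): one seek per file, then a scan over the fields."""
    seeks = 0
    fdx.seek(doc_id * 8)      # seek 1: locate this document's pointer
    seeks += 1
    (pointer,) = struct.unpack(">Q", fdx.read(8))
    fdt.seek(pointer)         # seek 2: jump to this document's field data
    seeks += 1
    (count,) = struct.unpack(">H", fdt.read(2))
    for _ in range(count):    # scan until the requested field is found
        name, value = _read_text(fdt), _read_text(fdt)
        if name == wanted:
            return value, seeks
    return None, seeks


docs = [
    {"name": "John Smith", "score": "0.25"},
    {"name": "Jane Doe", "score": "0.75"},
]
fdx, fdt = write_stored_fields(docs)
value, seeks = read_field(fdx, fdt, 1, "score")
```

Every lookup pays both seeks and returns a string that still has to be parsed, so the cost scales with the number of documents fetched.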
33. DocValues – Why not Field Cache?
Why not use Field Cache?
• Is memory resident
• Works fine when there is enough memory
• But keeping millions of un-inverted values in memory is impossible
• Additional cost to parse values (from String and to String)
34. DocValues
• Dense column-based storage (1 value per document and 1 column per field and segment)
• Accepts primitives
• No conversion from/to String needed
• Loads 80x-100x faster than building a FieldCache
• All the work is done during Indexing
• DocValue fields can be indexed and stored too
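The column-per-field idea can be sketched with Python's `array` module: each field is one dense, typed array indexed by document id, so reading a value is an index lookup with no string parsing. This is a conceptual model, not Lucene's DocValues API. The `ColumnStore` class and the field names are hypothetical, and the forecast formula (propensity score times dollar amount, summed over the segment) is an assumption about how the two per-member values might be combined.

```python
from array import array


class ColumnStore:
    """One dense, typed column per field; the document id is the array index."""

    def __init__(self, num_docs):
        self.num_docs = num_docs
        self.columns = {}

    def add_column(self, field, typecode, values):
        column = array(typecode, values)  # primitives are stored as primitives
        if len(column) != self.num_docs:
            raise ValueError("a dense column needs exactly one value per document")
        self.columns[field] = column

    def value(self, field, doc_id):
        return self.columns[field][doc_id]


def forecast_revenue(store, doc_ids):
    """Assumed forecast: sum of propensity score times dollar amount."""
    scores = store.columns["propensity_score"]
    dollars = store.columns["dollar_amount"]
    return sum(scores[d] * dollars[d] for d in doc_ids)


store = ColumnStore(num_docs=3)
store.add_column("propensity_score", "d", [0.5, 0.25, 1.0])
store.add_column("dollar_amount", "d", [100.0, 200.0, 50.0])

# Document ids 0 and 2 stand in for the members matched by a segment query.
projected = forecast_revenue(store, [0, 2])
```

All the layout work happens when the columns are built, which mirrors the point above that the work is done during indexing.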
36. Lessons Learnt
Indexing
• Reuse index writers, field and document instances
• Create many partitions and merge them in a different process
• Rebuild (bootstrap) entire index if possible
• Use partial updates with caution
• Analyze the index
Serving
• Reuse a single instance of IndexSearcher
• Limit usage of stored fields and term vectors
• Plan for load balancing and failover
• Cache term frequencies
• Use different machines for serving and indexing
38. Why not use an existing solution?
• Doesn’t allow dynamic schema
• Difficult to bootstrap indexes built in Hadoop
• Indexing elevates query latency
• Doesn’t allow dynamic schema
• Difficult to bootstrap indexes built in Hadoop
• Larger memory overhead
• Comparatively slow