3. Where to find information
Code - https://github.com/linkedin/instantsearch-tutorial
Wiki - https://github.com/linkedin/instantsearch-tutorial/wiki
Slack - https://instantsearchtutorial.slack.com/
Slides - will be posted on SlideShare; we will update the wiki and tweet the link
Twitter - #instantsearchtutorial (twitter.com/search)
4. The Plot
● At the end of this tutorial, attendees should:
○ Understand the challenges/constraints of instant search (latency, tolerance to user errors, etc.)
○ Get a broad overview of the theoretical foundations behind:
■ Indexing
■ Query Processing
■ Ranking and Blending (including personalization)
○ Understand open source options available to put together an ‘end-to-end’ instant search
solution
○ Put together an end-to-end solution on their own (with some helper code)
5. What would graduation look like?
● Instant result solution built over stackoverflow data
● Built with open source tools (Elasticsearch, typeahead.js)
● Ability to experiment further by modifying ranking/query construction
7. Agenda
● Terminology and Background
● Indexing & Retrieval
○ Instant Results
○ Query Autocomplete
● Ranking
● Hands on tutorial with data from stackoverflow
○ Index and search posts from stackoverflow
○ Play around with ranking
14. When to display instant results vs query completion
● LinkedIn product decision
○ When the confidence level is high enough for a particular result, show the result
● What counts as ‘high enough’ can be application-specific and is not merely a function of score
15. Completing query vs instant results
● “lin” => first-degree connection with many common connections, same company, etc.
● “link” => better off completing the query (even with possible suggestions for verticals)
16. Terminology - Blending
● Bringing together results from different search verticals (news, web, answers, etc.) into a single ranked list
18. Why Instant Search and why now?
● Natural evolution of search
● Users have gotten used to getting immediate feedback
● Mobile devices => need to type less
19. Agenda
● Terminology and Background
● Indexing & Retrieval
○ Instant Results
○ Query Autocomplete
● Ranking
● Hands on tutorial with data from stackoverflow
○ Index and search xx posts from stackoverflow
○ Play around with ranking
20. Instant Search at Scale
● Constraints (example: LinkedIn people search)
○ Scale - ability to store and retrieve hundreds of millions or billions of documents via prefix
○ Fast - ability to return results quicker than typing speed
○ Resilience to user errors
○ Personalized
21. Instant Search via Inverted Index
● Scalable
● Ability to form complex boolean queries
● Open source availability (Lucene/Elasticsearch)
● Easy to add metadata (payloads, forward index)
22. The Search Index
Inverted Index: Mapping from (search) terms to the list of documents they appear in
Forward Index: Mapping from documents to metadata about them
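As a minimal illustration (a toy Python sketch, not the tutorial code), here is how both structures might be built over a tiny corpus:

```python
from collections import defaultdict

# Toy corpus: doc id -> text, plus some metadata for the forward index.
docs = {
    1: "spark streaming tutorial",
    2: "elasticsearch indexing basics",
    3: "spark sql joins",
}
metadata = {1: {"score": 42}, 2: {"score": 17}, 3: {"score": 8}}

# Inverted index: term -> sorted list of doc ids containing it.
inverted = defaultdict(list)
for doc_id, text in sorted(docs.items()):
    for term in set(text.split()):
        inverted[term].append(doc_id)

# Forward index: doc id -> metadata, consulted at scoring time.
forward = metadata

print(inverted["spark"])  # [1, 3]
print(forward[1])         # {'score': 42}
```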
25. Prefix indexing
● In instant search, the query is not just ‘abraham’
● Queries = [‘a’, ‘ab’, … , ‘abraham’]
● Need to index each prefix
● Elasticsearch refers to this form of tokenization as ‘edge n-gram’
● Issues
○ Bigger index
○ Big posting lists for short prefixes => many more documents retrieved
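A sketch of what an edge n-gram setup could look like in Elasticsearch (index and field names, gram sizes, and the elasticsearch-py client call are illustrative assumptions, not the tutorial's exact configuration):

```python
from elasticsearch import Elasticsearch  # assumes the elasticsearch-py client

# Tokenize "abraham" into a, ab, abr, ..., at index time, so plain term
# queries match any prefix the user has typed so far.
settings = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "edge_tokenizer": {
                    "type": "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 10,
                    "token_chars": ["letter", "digit"],
                }
            },
            "analyzer": {
                "prefix_analyzer": {
                    "tokenizer": "edge_tokenizer",
                    "filter": ["lowercase"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            # Index with prefixes, but analyze the user's raw input
            # normally at query time.
            "title": {
                "type": "text",
                "analyzer": "prefix_analyzer",
                "search_analyzer": "standard",
            }
        }
    },
}

es = Elasticsearch("http://localhost:9200")
es.indices.create(index="posts", body=settings)  # 7.x-style client call
```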
26. Early Termination
● We cannot ‘afford’ to retrieve and score all documents that match the query
● We terminate posting list traversal once a certain number of documents has been retrieved
● We may lose some recall
27. Static Rank
● Order the posting lists so that documents with a high (query-independent) prior probability of relevance appear first
● Use application-specific logic to rewrite the query
● Once the query has achieved a certain number of matches in the posting list, we stop. This number of matches is referred to as the “early termination limit”
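A minimal sketch (illustrative, not LinkedIn's implementation) of early termination over a posting list that has been pre-ordered by static rank:

```python
def retrieve(posting_list, limit):
    """posting_list is ordered by descending static rank, so the first
    `limit` matches are also the highest-prior-relevance matches."""
    matches = []
    for doc_id in posting_list:
        matches.append(doc_id)
        if len(matches) >= limit:  # early termination limit reached
            break
    return matches

# Posting list for some prefix, already ordered by static rank.
posting = [42, 7, 19, 3, 88, 5]
print(retrieve(posting, limit=3))  # [42, 7, 19]; the tail is never scored
```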
28. Static Rank Example - People Search at LinkedIn
● Some factors that go into static rank computation
○ Member popularity, measured by profile views both within and outside the network
○ Spam in person’s name
○ Security and Spam. Downgrade profiles flagged by
LinkedIn’s internal security team
○ Celebrities and Influencers
29. Static Rank Case Study - People Search at LinkedIn
[Figure: recall as a function of the early termination limit]
30. Resilience to Spelling errors
● We focus on names as they are often hard to get right (ex: “marissa mayer” or “marissa meyer”?)
● Names vs. traditional spelling errors:
○ “program manager” vs. “program manger” - only one of these is right
○ “Mayer” vs. “Meyer” - no clear source of truth
● Edit-distance-based approaches can be wrong both ways:
○ “Mohamad” and “Muhammed” are 3 edits apart and yet plausible variants
○ “Jeff” and “Joff” are 1 edit apart, but highly unlikely to be plausible variants of the same name
31. LinkedIn Approach - Name clusters
The solution touches indexing, query reformulation, and ranking
32. Name Clusters - Two step clustering
● Coarse-level clustering
○ Uses double metaphone + some known heuristics
○ Focuses on recall
● Fine-level clustering
○ Similarity function that takes Jaro-Winkler distance into account
○ User session data
33. Overall approach for Name Clusters
● Indexing
○ Store the name’s cluster ID in a separate field (say ‘NAMECLUSTERID’)
○ ‘Cris’ and ‘chris’ are in the same name cluster CHRISID
○ NAME:cris NAMECLUSTERID:chris
● Query processing
○ user query = ‘chris’
○ Rewritten query = ?NAME:chris ?NAMECLUSTERID:chris
● Ranking
○ Different weights for ‘perfect match’ vs. ‘name cluster match’
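As a sketch, the rewritten query (the ‘?’ in the slide denotes an optional clause) could be expressed in Elasticsearch's query DSL with should clauses; the boost values here are illustrative assumptions:

```python
# Both clauses are optional (SHOULD), but an exact name match is
# weighted above a name-cluster match, mirroring the ranking bullet.
rewritten_query = {
    "query": {
        "bool": {
            "should": [
                {"term": {"NAME": {"value": "chris", "boost": 2.0}}},
                {"term": {"NAMECLUSTERID": {"value": "chris", "boost": 1.0}}},
            ],
            "minimum_should_match": 1,
        }
    }
}
```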
34. Instant Results via Inverted Index - Some Takeaways
● Used for documents at very high scale
● Use early termination
● Approach the problem as a combination of indexing/query processing/ranking
35. Agenda
● Terminology and Background
● Indexing & Retrieval
○ Instant Results
○ Query Autocomplete
● Ranking
● Hands on tutorial with data from stackoverflow
○ Index and search xx posts from stackoverflow
○ Play around with ranking
36. Query Autocomplete - Problem Statement
● Let q = w₁, w₂, …, wₖ* represent the query with k words, where the k-th token is a prefix, as denoted by the asterisk
● Goal: Find one or more relevant completions for the query
37. Trie
● Used to store an associative array where keys are strings
● Only certain keys and leaves are of interest
● The structure allows sharing of prefixes only
● The representation is not memory efficient
A trie of the words {space, spark, moth}
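A minimal Python sketch of such a trie, supporting insertion and prefix lookup:

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.is_key = False  # True if a complete key ends here

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_key = True

def completions(root, prefix):
    """Return all stored keys that start with `prefix`."""
    node = root
    for ch in prefix:
        if ch not in node.children:
            return []
        node = node.children[ch]
    out = []
    def walk(n, path):
        if n.is_key:
            out.append(prefix + path)
        for ch, child in n.children.items():
            walk(child, path + ch)
    walk(node, "")
    return out

root = TrieNode()
for w in ["space", "spark", "moth"]:
    insert(root, w)
print(completions(root, "sp"))  # ['space', 'spark'] - shared prefix stored once
```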
38. Finite State Transducers (FST)
● Allows efficient retrieval of
completions at runtime
● Can fit entirely into RAM
● Useful when keys share commonalities, allowing better compression
● Lucene has support for FSTs*
FST for words: software, scala,
scalding, spark
*Lucene FST implementation based on “Direct Construction of Minimal Acyclic Subsequential Transducers (2001)” by Stoyan Mihov, Denis Maurel
39. Query Autocomplete vs. Instant Results
● For query autocomplete, the corpus of terms remains relatively constant; instant results documents can be continuously added/removed
● Query autocomplete focuses only on prefix-based retrieval, whereas instant results utilize complex query construction for retrieval
● Query autocomplete retrieval is based on a dictionary, hence the index can be refreshed periodically instead of in real time
40. Query Tagging
● Segment query based on
recognized entities
● Annotate query with:
○ Named Entity Tags
○ Standardized Identifiers
○ Related Entities
○ Additional Entity Specific Metadata
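A sketch of what an annotated query might look like (the tags, IDs, and field names below are hypothetical, for illustration only):

```python
# Hypothetical annotation for the query "linkedin software engineer".
tagged_query = {
    "raw": "linkedin software engineer",
    "segments": [
        {
            "text": "linkedin",
            "tag": "COMPANY",           # named entity tag
            "standardized_id": 1337,    # made-up standardized identifier
            "related": ["linkedin corp"],  # related entities
        },
        {
            "text": "software engineer",
            "tag": "TITLE",
            "standardized_id": 9,
        },
    ],
}
```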
41. Data Processing
● Break queries into recognized entities and individual tokens
● Past query logs are parsed for recognized entities and tokens, which are fed into an FST for retrieval of candidate suggestions
42. Retrieval
● Candidate completions are retrieved over increasingly longer suffixes of the query to capture enough context
● Given a query like “linkedin sof*”, we look up completions for:
○ sof*, linkedin sof*
● Candidates are then provided to the scoring phase
43. Retrieval
● From the above FST, for the query “linkedin sof*” we retrieve the
candidates:
○ sof: [software developer, software engineer]
○ linkedin sof: []
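A dictionary-based Python sketch of this retrieval step (the dictionary stands in for the FST; the completion lists mirror the example above):

```python
# Stand-in for the FST: prefix -> completions mined from query logs.
completion_index = {
    "sof": ["software developer", "software engineer"],
    # no logged queries start with "linkedin sof"
}

def candidate_completions(query):
    """Look up every suffix of the query that ends in the prefix-in-progress."""
    tokens = query.split()
    candidates = {}
    for i in range(len(tokens) - 1, -1, -1):  # shortest suffix first
        suffix = " ".join(tokens[i:])
        candidates[suffix] = completion_index.get(suffix, [])
    return candidates

print(candidate_completions("linkedin sof"))
# {'sof': ['software developer', 'software engineer'], 'linkedin sof': []}
```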
44. Payloads
● Each query autocomplete result
can have a payload associated
with it.
● A payload holds serialized data
useful in scoring the autocomplete
result
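A small sketch of attaching a serialized payload to a completion (the payload fields are made-up examples):

```python
import json

# Completion -> serialized payload, deserialized at scoring time.
payloads = {
    "software engineer": json.dumps({"frequency": 90210, "last_seen": "2015-07-01"}),
}

payload = json.loads(payloads["software engineer"])
score = payload["frequency"]  # e.g., weight the suggestion by query-log frequency
```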
46. Fuzzy Matching
● Use a Levenshtein automaton constructed from a word and a maximum edit distance
● Based on the automaton and the letters fed into it, we decide whether or not to continue down a branch
● Ex: searching for “dpark” (s/d being close on the keyboard) with edit distance 1 = [spark]
An index of {space, spark, moth} represented as a trie
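A Python sketch of the same idea using a per-node Levenshtein DP row to prune trie branches (a common practical stand-in for a true Levenshtein automaton; it reuses the TrieNode trie built in the earlier sketch):

```python
def fuzzy_search(root, word, max_dist):
    """Walk the trie, carrying one Levenshtein DP row per node; prune any
    branch whose entire row already exceeds max_dist."""
    results = []
    first_row = list(range(len(word) + 1))

    def walk(node, ch, prev_row, prefix):
        row = [prev_row[0] + 1]
        for i in range(1, len(word) + 1):
            insert_cost = row[i - 1] + 1
            delete_cost = prev_row[i] + 1
            replace_cost = prev_row[i - 1] + (word[i - 1] != ch)
            row.append(min(insert_cost, delete_cost, replace_cost))
        if node.is_key and row[-1] <= max_dist:
            results.append(prefix)
        if min(row) <= max_dist:  # otherwise no extension can recover
            for c, child in node.children.items():
                walk(child, c, row, prefix + c)

    for c, child in root.children.items():
        walk(child, c, first_row, c)
    return results

# With the {space, spark, moth} trie from the earlier sketch:
print(fuzzy_search(root, "dpark", 1))  # ['spark']
```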
50. Agenda
● Terminology and Background
● Indexing & Retrieval
● Ranking
○ Ranking instant results
○ Ranking query suggestions
○ Blending
● Hands on tutorial with data from stackoverflow
51. Ranking Challenge
● Short query prefixes
● Context beyond query
○ Personalized context
○ Global context
■ Global popularity
■ Trending
52. Hand-Tuned vs. Machine-Learned Ranking
● Hard to manually tune with very large number of features
● Challenging to personalize
● Learning to rank (LTR) allows leveraging a large volume of click data in an automated way
53. Agenda
● Terminology and Background
● Indexing & Retrieval
● Ranking
○ Ranking instant results
○ Ranking query suggestions
○ Blending
● Hands on tutorial with data from stackoverflow
56. Features
● Social Affinity (personalized features)
○ Network distance between searcher and result
○ Connection Strength
■ Within the same company
■ Common connections
■ From the same school
73. Blending Challenges
● Different verticals rely on different signals
○ People: network distance
○ Groups: time of the last edit
○ Query suggestion: edit distance
● Even common features may not be equally predictive
across verticals
○ Popularity
○ Text similarity
● Scores might not be comparable across verticals
75. Approaches
● Separate binary classifiers
○ Pros
■ Handle vertical-specific features
■ Handle common features with different predictive powers
○ Cons
■ Need to calibrate output scores of multiple classifiers
76. Approaches
● Learning-to-rank - Equal correlation assumption
○ Union the feature schemas, padding zeros into non-applicable features
○ Equal correlation assumption: a shared feature is assumed to be equally predictive in every vertical
[Figure: People features (f1, f2, f3) and Jobs features (f1, f2, f4) are mapped onto the union schema (f1, f2, f3, f4); People pads f4 = 0, Jobs pads f3 = 0, and both feed a single model]
77. Approaches
● Learning-to-rank - Equal correlation assumption
○ Pros
■ Handle vertical-specific features
■ Comparable output scores across verticals
○ Cons
■ Assumes common features are equally predictive of vertical relevance
78. Approaches
● Learning-to-rank - Without equal correlation assumption
[Figure: People vertical features (f1, f2, f3) and Jobs vertical features (f4, f5, f6) occupy disjoint slots in the union schema; People maps to (f1, f2, f3, 0, 0, 0), Jobs maps to (0, 0, 0, f4, f5, f6), and both feed a single model]
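A small Python sketch of the zero-padding in both variants (feature names and values mirror the diagrams and are purely illustrative):

```python
def pad_features(vertical_features, union_schema):
    """Place a vertical's features into the union schema; every
    non-applicable feature is padded with zero."""
    return [vertical_features.get(f, 0.0) for f in union_schema]

# Equal correlation assumption: f1/f2 share slots across verticals.
union = ["f1", "f2", "f3", "f4"]
people = {"f1": 0.9, "f2": 0.3, "f3": 0.7}   # f4 padded to 0
jobs = {"f1": 0.5, "f2": 0.8, "f4": 0.6}     # f3 padded to 0
print(pad_features(people, union))  # [0.9, 0.3, 0.7, 0.0]
print(pad_features(jobs, union))    # [0.5, 0.8, 0.0, 0.6]

# Without the assumption: fully disjoint slots per vertical.
disjoint = ["people_f1", "people_f2", "people_f3", "jobs_f4", "jobs_f5", "jobs_f6"]
people2 = {"people_f1": 0.9, "people_f2": 0.3, "people_f3": 0.7}
jobs2 = {"jobs_f4": 0.5, "jobs_f5": 0.8, "jobs_f6": 0.6}
print(pad_features(people2, disjoint))  # [0.9, 0.3, 0.7, 0.0, 0.0, 0.0]
print(pad_features(jobs2, disjoint))    # [0.0, 0.0, 0.0, 0.5, 0.8, 0.6]
```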
79. Approaches
● Learning-to-rank - Without equal correlation assumption
○ Pros
■ Handle vertical-specific features
■ No equal correlation assumption -> automatically learns the evidence-vertical association
■ Comparable output scores across verticals
○ Cons
■ The number of features is huge
● Overfitting
● Requires a huge amount of training data
80. Evaluation
● “If you can’t measure it, you can’t improve it”
● Metrics
○ Successful search rate
○ Number of keystrokes per search: query length + clicked result rank (e.g., typing 3 characters and clicking the result at position 2 costs 5 keystrokes)
81. Take-Aways
● Speed
○ Instant results: Early termination
○ Autocompletion: FST
● Tolerance to spelling errors
● Relevance: go beyond query prefix
○ Personalized context
○ Global context
82. Agenda
● Terminology and Background
● Indexing & Retrieval
● Ranking
○ Ranking instant results
○ Ranking query suggestions
○ Blending
● Hands on tutorial with data from stackoverflow
83. Dataset
● Posts and Tags from stackoverflow.com
● Posts are questions posted by users and contain the following attributes:
○ Title
○ Score
● Tags identify a suitable category for the post and contain the following attributes:
○ Tag Name
○ Count
● Each post can have a maximum of five tags
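As an illustration, a single indexed post might look like this (values are made up; field names follow the attributes above):

```python
# Hypothetical example of one post with its tags (at most five per post).
post = {
    "title": "How do I iterate over a dictionary in Python?",
    "score": 128,
    "tags": [
        {"tag_name": "python", "count": 1200000},
        {"tag_name": "dictionary", "count": 45000},
    ],
}
```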
89. Assignments
● Assignments available on Github
● Each assignment builds on a component of the end product
● Tests are provided at the end of each assignment for validation
● Finished files available for reference (if needed)
● Raise your hand if you need help or have a question
92. Take-Aways
● Index should be used primarily for retrieval
● Data sources should be kept separate from the index
● Building an index is not instantaneous, so keep replicas in production
● Real-world indexes can seldom be stored in a single shard
97. Summary
● Theoretical understanding of indexing, retrieval and ranking for instant search
results and query autocomplete
● Insights and lessons from linkedin.com case studies
● Working end-to-end implementation of query autocomplete and instant results
with stackoverflow.com dataset