3. LinkedIn's Vision
"Create economic opportunity for every member of the global workforce"
• Find work
• Realize your dream job
• Be great at what you do
7. Comparison of different Search Engines
Netflix: 100K
AirBnB: 800K
Ebay: 500M
Bing: 100's of Billions
Google: 100's of Billions
Facebook: Trillions
8. Comparison of different Search Engines
Netflix: 100K (Lucene)
AirBnB: 800K (Lucene)
Ebay: 500M (Custom C++)
Bing: 100's of Billions (Custom C++)
Google: 100's of Billions (Custom C++)
Facebook: Trillions (Custom C++)
LinkedIn: 100's of Millions (Lucene, Galene (Lucene based), Galene (Custom))
9. Important Galene Features
• Offline index building
• Live updates at a fine granularity
• Static rank and early termination
• Faceting
• Data distribution
• Relevance framework
11. A little about LinkedIn data
• Most datasets at LinkedIn are available in 2 ways
  • A real-time change notification stream
  • A complete dataset, ETL'd to Hadoop
• We often rely on derived datasets
• Many derived datasets can't be crunched in real time
12. Anatomy of a Galene index
• Base Index
  • Generated by Hadoop periodically
  • Single-segment Lucene index
  • On disk. Immutable. MMAPed and MLOCKed
  • Contains complex / rich features that we can only afford to compute offline
• Live Index
  • Inverted index with our own format
  • In-memory data structure
  • Contains incremental updates to documents
• Snapshot Index
  • On-disk snapshot of the Live index when necessary
  • Initially empty
  • Single-segment Lucene index. Live index is folded in regularly
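The three tiers can be pictured with a small toy model. This is only a sketch, not Galene code: the `TieredIndex` class, its method names, and the field-by-field override rule (live over snapshot over base) are assumptions made for illustration, and plain dictionaries stand in for the Lucene and in-memory structures.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class TieredIndex:
    """Toy model of one index made of base, snapshot and live tiers."""
    base: Dict[int, dict]                                     # built offline, never mutated
    snapshot: Dict[int, dict] = field(default_factory=dict)   # on disk, initially empty
    live: Dict[int, dict] = field(default_factory=dict)       # in-memory incremental updates

    def update(self, uid: int, fields: dict) -> None:
        # Incremental updates only ever touch the live tier.
        self.live.setdefault(uid, {}).update(fields)

    def fold_live_into_snapshot(self) -> None:
        # Persist the live updates into the snapshot, then start a fresh live tier.
        for uid, fields in self.live.items():
            self.snapshot.setdefault(uid, {}).update(fields)
        self.live = {}

    def get(self, uid: int) -> Optional[dict]:
        # Assumed merge rule: newer tiers override older ones field by field.
        merged: dict = {}
        found = False
        for tier in (self.base, self.snapshot, self.live):
            if uid in tier:
                merged.update(tier[uid])
                found = True
        return merged if found else None
```

For example, after building `TieredIndex(base={1: {"title": "Engineer"}})`, calling `update(1, {"title": "CEO"})` changes what `get(1)` returns while the base entry itself stays untouched.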
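13. [Figure: two example documents, 1 ("BLAH BLAH BLAH Jeff BLAH BLAH LinkedIn BLAH BLAH BLAH BLAH") and 2 ("BLAH BLAH Reid BLAH LinkedIn BLAH BLAH BLAH BLAH BLAH BLAH BLAH"), indexed two ways: an Inverted Index (with Posting Lists) over the terms Jeff, Reid and LinkedIn, and a Forward Index keyed by document 1 and 2]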
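A minimal sketch of the two structures in the figure, using the same two example documents. The function name `build_indexes` and the lowercase whitespace tokenizer are assumptions made for illustration; a real index would also store positions and other per-posting data.

```python
from collections import defaultdict

documents = {
    1: "blah blah blah Jeff blah blah LinkedIn blah",
    2: "blah blah Reid blah LinkedIn blah blah",
}


def build_indexes(docs):
    inverted = defaultdict(list)   # term -> posting list of document ids
    forward = {}                   # document id -> terms of that document
    for doc_id in sorted(docs):
        terms = docs[doc_id].lower().split()
        forward[doc_id] = terms
        for term in sorted(set(terms)):
            inverted[term].append(doc_id)
    return dict(inverted), forward


inverted, forward = build_indexes(documents)
```

Here `inverted["linkedin"]` holds both document ids, while `inverted["jeff"]` and `inverted["reid"]` hold one each. The forward index answers the opposite question: given a document id, which terms does it contain.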
14. [Figure: document IDs (1-9, 1-5, 10, 11, 12, ...) laid out across the Base Index, Live Update, Snapshot, and In-Memory Live Updates]
15. Inverted Index: Three Segments
Three independent segments with non-overlapping UIDs:
• B1S1L1 (Base/snapshot/live) segment
  • Base has all UIDs.
  • Neither Snapshot nor Live introduces new UIDs.
• S2L2 (Snapshot/live) segment
  • None of its UIDs exist in BSL.
  • Snapshot has all UIDs.
  • Live does not introduce any new UIDs.
• L3 (live) segment
  • None of its UIDs exist in BSL or SL.
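One way to read the segment rules is as a routing decision per UID. The sketch below is illustrative only: the `SegmentedIndex` class and its methods are hypothetical, and the idea that L3 UIDs move into S2L2 when a snapshot is taken is an assumption, not something the slides state.

```python
class SegmentedIndex:
    """Tracks which of the three segments owns each UID."""

    def __init__(self, base_uids, snapshot_only_uids=()):
        self.bsl = set(base_uids)            # B1S1L1: UIDs known to the base
        self.sl = set(snapshot_only_uids)    # S2L2: UIDs first seen in a snapshot
        self.l = set()                       # L3: UIDs seen only in live updates
        assert not (self.bsl & self.sl), "segments must not share UIDs"

    def segment_for(self, uid):
        """Return the segment that owns uid, registering unknown UIDs in L3."""
        if uid in self.bsl:
            return "BSL"
        if uid in self.sl:
            return "SL"
        self.l.add(uid)
        return "L"

    def take_snapshot(self):
        # Assumed behaviour: L3 UIDs join S2L2 once live is folded into a snapshot.
        self.sl |= self.l
        self.l = set()
```

An update to a UID that the base already has is routed to BSL, so the segments stay disjoint no matter how many updates arrive.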
18. Search: Static Rank (SR)
• A global score of a document
  • Each document must have one and only one SR
  • It could be anything that can globally represent the importance of a UID, for example, the number of 1st degree connections
  • Different documents might have the same SR
• B1S1L1 segment
  • Base knows SRs of all UIDs of the segment
• S2L2 segment
  • Snapshot knows SRs of all UIDs of the segment
• L3 segment
  • We assign artificial SRs in either of two ways:
    • Ascending order starting from the max SR of all UIDs in all 3 segments
    • Descending order starting from the min SR of all UIDs in all 3 segments
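The two assignment schemes for L3 can be sketched as a counter. The function name `make_sr_assigner` is hypothetical, and starting one step past the current max (or min) rather than at it is an assumption; the slides only say "starting from".

```python
from itertools import count


def make_sr_assigner(existing_srs, ascending=True):
    """Return a function that hands out artificial SRs for new L3 documents."""
    if ascending:
        counter = count(max(existing_srs) + 1, 1)    # above every known SR
    else:
        counter = count(min(existing_srs) - 1, -1)   # below every known SR
    return lambda: next(counter)
```

With existing SRs 5, 9 and 2, the ascending assigner yields 10, 11, ... and the descending one yields 1, 0, ..., so the L3 documents sort entirely after or entirely before the other two segments.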
19. Search: Early Termination (ET)
• Segment Level ET
  • Depending on the ordering of the static rank assignment in the L segment, which affects the ordering of all segments, we can search:
    • BSL -> SL -> L (if it is descending)
    • L -> SL -> BSL (if it is ascending)
• Posting List Level ET
  • Since all postings are first sorted by SR, early termination on a posting list guarantees that the documents with the highest SRs are always retrieved first (however, this does not guarantee that their final scores are also the highest).
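Both levels of early termination can be sketched with one loop over SR-sorted posting lists. This is a simplified illustration: the function `search_with_early_termination`, the `(uid, static_rank)` posting layout, and the `matches` predicate are assumptions, and no final scoring is done here.

```python
def search_with_early_termination(segments, matches, limit):
    """Scan segments in order and stop once `limit` hits are collected.

    segments: posting lists in search order (for example BSL, SL, L), each a
              list of (uid, static_rank) sorted by static_rank descending.
    matches:  predicate deciding whether a uid satisfies the query.
    """
    if limit <= 0:
        return []
    hits = []
    for postings in segments:
        for uid, static_rank in postings:
            if matches(uid):
                hits.append((uid, static_rank))
                if len(hits) == limit:
                    return hits   # early termination: remaining postings are skipped
    return hits
```

Because the postings are ordered by SR, the hits returned are the matching documents with the highest SRs, which is not necessarily the set with the highest final relevance scores.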
20. Going Forward
• Very efficient custom index in C++
• Base index build can be run in a distributed manner
• BSL supported at a more fundamental level
22. Faceting
• Types of facets supported:
  • discoverable (e.g. current company)
  • static values (e.g. network)
  • supplied values (e.g. my groups)
• Legacy stack had no early termination, allowing for exact facet counting (at a cost)
• Current Galene stack applies heuristics to determine counts in an approximate manner
• Going forward, custom posting list format will encode facet details for more efficient facet count estimation
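The slides do not say which heuristics Galene uses for approximate counts. As one possible illustration only, the sketch below scales the facet counts seen before early termination up to the estimated number of candidates. The function `estimate_facet_counts` and the linear extrapolation are assumptions, not the actual Galene method.

```python
from collections import Counter


def estimate_facet_counts(scanned_facet_values, total_candidates):
    """Extrapolate facet counts from the documents scanned before termination.

    scanned_facet_values: facet value of each document that was scanned.
    total_candidates:     estimated number of matching documents overall.
    """
    scanned = len(scanned_facet_values)
    if scanned == 0:
        return {}
    scale = total_candidates / scanned
    seen = Counter(scanned_facet_values)
    return {value: round(count * scale) for value, count in seen.items()}
```

Scanning 4 documents out of an estimated 400 matches, with 3 of them at one company, would report roughly 300 for that company. The estimate is biased if facet values correlate with static rank, which is the cost of not counting exactly.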
24. Relevance Framework
• Infrastructure to support common scoring needs
• Provides framework to evaluate relevance changes
• Enables rapid iterations over relevance experiments
• Allows relevance engineers to focus on building features
25. Life of a Query - Within A Rewriter
[Figure: Query -> Rewriter Modules (each with a DATA MODEL, sharing Rewriter State) -> Rewritten Query]
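The rewriter flow can be sketched as a chain of modules that share one state object. Everything named here is hypothetical: `RewriterState`, the two example modules, and their dictionary "data models" are invented for illustration and are not the framework's API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class RewriterState:
    """State shared by the modules while one query is being rewritten."""
    query: str
    notes: Dict[str, str] = field(default_factory=dict)


def make_synonym_module(model: Dict[str, str]) -> Callable[[RewriterState], None]:
    """Hypothetical module whose data model maps a term to its replacement."""
    def module(state: RewriterState) -> None:
        terms = [model.get(term, term) for term in state.query.split()]
        state.query = " ".join(terms)
    return module


def make_tagging_module(model: Dict[str, str]) -> Callable[[RewriterState], None]:
    """Hypothetical module that records the entity type of known terms."""
    def module(state: RewriterState) -> None:
        for term in state.query.split():
            if term in model:
                state.notes[term] = model[term]
    return module


def rewrite(query: str, modules: List[Callable[[RewriterState], None]]) -> RewriterState:
    state = RewriterState(query)
    for module in modules:   # each module sees what earlier modules produced
        module(state)
    return state
```

Running `rewrite("linkedin swe", [...])` with a synonym model `{"swe": "engineer"}` followed by a tagging model `{"engineer": "title"}` yields the query "linkedin engineer" with the note that "engineer" is a title.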
26. Life of a Query - Within A Search Shard
[Figure: Rewritten Query -> INDEX -> Retrieve a Document -> Score the Document -> Top Results -> Top Results From Shard]
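The retrieve-then-score loop inside a shard can be sketched with a bounded heap. The function `search_shard` and its parameters are assumptions for illustration; the real shard retrieves candidates from the index and scores them with the relevance framework.

```python
import heapq


def search_shard(candidates, score, k):
    """Score each retrieved document and keep the k best.

    candidates: iterable of document ids retrieved from the index.
    score:      function mapping a document id to a relevance score.
    """
    if k <= 0:
        return []
    top = []   # min-heap of (score, doc_id); the weakest kept hit is top[0]
    for doc_id in candidates:
        entry = (score(doc_id), doc_id)
        if len(top) < k:
            heapq.heappush(top, entry)
        elif entry > top[0]:
            heapq.heapreplace(top, entry)
    return [doc_id for _, doc_id in sorted(top, reverse=True)]
```

Keeping only k entries means memory stays constant per query regardless of how many documents are retrieved, and each shard returns just its own top results for merging.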
28. Case Study: Instant Member Search
• The index contains connections as document terms
  (term:diego AND prefix:buth AND (connection:35176 OR connection:418001 OR connection:1520032))
• Static Rank of documents reflects popularity
• Documents are augmented offline with spell correction data
  • "shreeram sa": (term:shreeram OR cluster:5678) AND (prefix:sa) AND (connection:1234)
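The query shapes above can be generated from the typed text and the searcher's connections. This sketch is an assumption about how such a query might be assembled: `build_instant_query` and the `clusters` map from a term to its spelling-cluster id are hypothetical, and the function expects non-empty text and at least one connection id.

```python
def build_instant_query(text, connection_ids, clusters=None):
    """Build a boolean query string in the style shown on the slide.

    The last word is treated as a prefix still being typed; earlier words are
    complete terms. `clusters` is a hypothetical term -> cluster id map.
    """
    clusters = clusters or {}
    *complete, partial = text.lower().split()
    clauses = []
    for term in complete:
        if term in clusters:
            clauses.append(f"(term:{term} OR cluster:{clusters[term]})")
        else:
            clauses.append(f"term:{term}")
    clauses.append(f"prefix:{partial}")
    connections = " OR ".join(f"connection:{cid}" for cid in connection_ids)
    clauses.append(f"({connections})")
    return "(" + " AND ".join(clauses) + ")"
```

Typing "diego buth" with three connections reproduces the first query on the slide. Passing `clusters={"shreeram": 5678}` for "shreeram sa" gives the spell-corrected form, with slightly different parenthesization than the slide uses.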