[In]formation Retrieval: Search at LinkedIn

Shakti Sinha, Head, Search Relevance
Daniel Tunkelang, Head, Query Understanding

Recruiting Solutions

Why do 200M+ people use LinkedIn?

People use LinkedIn because of other people.

Search helps members find and be found.

Rich collection of professional content.

Every search is personalized.

Let’s talk a bit about how it all works.

§  Query Understanding

§  Search Spam

§  Unified Search

More at http://data.linkedin.com/search.

Query Understanding

People are semi-structured objects.

for i in [1..n]
    B[i] ← {}
    s ← w_1 w_2 … w_i
    if Pc(s) > 0
        a ← new Segment()
        a.segs ← {s}
        a.prob ← Pc(s)
        B[i] ← {a}
    for j in [1..i-1]
        for b in B[j]
            s ← w_{j+1} w_{j+2} … w_i
            if Pc(s) > 0
                a ← new Segment()
                a.segs ← b.segs ∪ {s}
                a.prob ← b.prob * Pc(s)
                B[i] ← B[i] ∪ {a}
    sort B[i] by prob, descending
    truncate B[i] to size k
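
The segmentation pseudocode above can be sketched in runnable Python. This is a minimal illustration, not LinkedIn's implementation: the names `Segmentation`, `top_k_segmentations`, `TOY_SCORES`, and `toy_pc` are hypothetical, and the toy score table stands in for whatever model supplies Pc(s). An empty prefix with probability 1.0 replaces the special case for the whole-prefix segment.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Segmentation:
    segs: Tuple[str, ...]  # segments in query order
    prob: float            # product of the per-segment scores


def top_k_segmentations(words: List[str],
                        pc: Callable[[str], float],
                        k: int = 3) -> List[Segmentation]:
    """Return up to k highest-scoring segmentations of the query words."""
    n = len(words)
    # best[i] holds up to k segmentations of the first i words.
    best: List[List[Segmentation]] = [[] for _ in range(n + 1)]
    best[0] = [Segmentation((), 1.0)]  # empty prefix, probability 1
    for i in range(1, n + 1):
        candidates = []
        for j in range(i):
            # Candidate last segment: words j+1..i in 1-based terms.
            s = " ".join(words[j:i])
            p = pc(s)
            if p <= 0:
                continue
            for b in best[j]:
                candidates.append(Segmentation(b.segs + (s,), b.prob * p))
        candidates.sort(key=lambda c: c.prob, reverse=True)
        best[i] = candidates[:k]  # keep only the k best per prefix
    return best[n]


# Hypothetical toy scores; unknown phrases score 0 and are never used.
TOY_SCORES = {
    "software": 0.1,
    "engineer": 0.1,
    "software engineer": 0.05,
    "google": 0.2,
}


def toy_pc(segment: str) -> float:
    return TOY_SCORES.get(segment, 0.0)


results = top_k_segmentations("software engineer google".split(), toy_pc, k=2)
for r in results:
    print(r.segs, r.prob)
```

With these toy scores, the two-segment reading ("software engineer", "google") outranks the three-word split because the phrase score exceeds the product of its word scores.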

Word sense is contextual.

Understand queries as early as possible.

Query structure has many applications.

§  Boost results that match query interpretation.
§  Bucket search log analysis by query classes.
§  Query rewriting specific to query classes.
§  …

Query understanding focuses on set-level metrics.

Not just about best answer,
but getting to best question.

Let’s look at a search spammer.

Summary is verbose but legitimate.

But then comes the keyword stuffing.

How we train our search spam classifier.

§  Find the queries targeted by spammers.
–  10,000 most common non-name queries.

§  Look at top results for a generic user.
–  i.e., show unpersonalized search results.

§  Remove private profiles.
–  Members first! Can’t sacrifice privacy to fight spammers.

§  Label data by crowdsourcing.
–  Relevance is subjective, but spam is relatively objective.

ROC curve for spam thresholding.

[Figure: ROC curve with two spam score thresholds, a and b, marked on the curve, where 0 < a < b < 1. Both axes run from 0 to 1.]
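
Each spam score threshold corresponds to one point on a ROC curve. The sketch below computes that point from labeled examples; it assumes the standard ROC axes (false positive rate and true positive rate), and the function name `roc_point` and the sample data are hypothetical.

```python
from typing import List, Tuple


def roc_point(scores: List[float],
              is_spam: List[bool],
              threshold: float) -> Tuple[float, float]:
    """Return (false positive rate, true positive rate) when every
    profile with score >= threshold is flagged as spam."""
    tp = sum(1 for s, y in zip(scores, is_spam) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, is_spam) if s >= threshold and not y)
    positives = sum(1 for y in is_spam if y)
    negatives = len(is_spam) - positives
    tpr = tp / positives if positives else 0.0
    fpr = fp / negatives if negatives else 0.0
    return fpr, tpr


# Hypothetical labeled sample: classifier scores and crowdsourced labels.
sample_scores = [0.95, 0.8, 0.6, 0.4, 0.2, 0.1]
sample_labels = [True, True, False, True, False, False]

low = roc_point(sample_scores, sample_labels, threshold=0.5)
high = roc_point(sample_scores, sample_labels, threshold=0.9)
print(low, high)
```

Raising the threshold flags fewer legitimate profiles but also catches fewer spammers, which is the trade-off the curve makes visible.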

Integrate spamminess into relevance score.

§  Spam model yields a probability between 0 and 1.

§  Use spam score as piecewise linear factor:
if score < spammin:
    # not a spammer
    relevance *= 1.0
elif score > spammax:
    # spammer
    relevance *= 0.0
else:
    # linear function of spamminess
    relevance *= (spammax - score) / (spammax - spammin)
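
The piecewise linear factor can be wrapped in a small function for experimentation. This is a sketch under stated assumptions: `spam_min` and `spam_max` correspond to the slide's spammin and spammax, and the default values 0.3 and 0.7 are hypothetical placeholders, not LinkedIn's thresholds.

```python
def spam_factor(score: float, spam_min: float, spam_max: float) -> float:
    """Multiplier in [0, 1] applied to relevance for a given spam score."""
    if not 0.0 <= spam_min < spam_max <= 1.0:
        raise ValueError("require 0 <= spam_min < spam_max <= 1")
    if score < spam_min:
        return 1.0  # treated as not a spammer
    if score > spam_max:
        return 0.0  # treated as a spammer
    # Linear ramp from 1.0 at spam_min down to 0.0 at spam_max.
    return (spam_max - score) / (spam_max - spam_min)


def adjusted_relevance(relevance: float, score: float,
                       spam_min: float = 0.3,   # hypothetical threshold
                       spam_max: float = 0.7) -> float:  # hypothetical threshold
    return relevance * spam_factor(score, spam_min, spam_max)


print(adjusted_relevance(2.0, 0.1))  # below spam_min: relevance unchanged
print(adjusted_relevance(2.0, 0.5))  # between thresholds: partially demoted
print(adjusted_relevance(2.0, 0.9))  # above spam_max: relevance zeroed
```

The ramp avoids a hard cutoff, so a profile whose score sits just above spam_min is demoted slightly rather than removed.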

Spam is an arms race.

§  We can’t reveal precisely which features we use for spam
detection, or spammers will work around them.

§  Spammers will try to reverse-engineer us anyway.

§  Personalization benefits us and our legitimate users – it’s
hard to spam your way to a high personalized ranking.

§  Fighting spam is all about making the investment less
profitable for the spammer.

Un-Unified Search

Introducing LinkedIn Unified Search!

Goal: make all of our content more discoverable.

Three new features:
§  Query Auto-Complete
§  Content Type Suggestions
§  Unified Search Result Page

Query Auto-Complete

Best completion not always the most popular.

§  In a heavy-tailed distribution, even the most popular
queries account for a small fraction of the distribution.

§  We don’t want to suggest generic queries that would
produce useless results.
–  e.g., c -> company, j -> jobs

§  The goal is not only to infer the user’s intent but also to suggest a
search that yields relevant results across content types.

Content Type Suggestions

How we compute content type suggestions.

§  Rank content types by likelihood of a successful search.
–  Consider click-through behavior as well as downstream actions.

§  Bootstrap using what we know from pre-unified search
behavior.
–  Tricky part is compensating for findability bias.

§  Continuously evaluate and collect feedback through user
behavior.
–  E.g., members using the left rail to select a particular vertical.
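
Ranking content types by likelihood of success could look roughly like the sketch below. It is an illustration only: the event format, the function name `rank_content_types`, and the smoothing parameters `prior_rate` and `prior_weight` are assumptions, a "success" is assumed to be already derived from clicks and downstream actions, and no correction for findability bias is attempted.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def rank_content_types(events: Iterable[Tuple[str, bool]],
                       prior_rate: float = 0.1,      # hypothetical prior
                       prior_weight: float = 10.0    # hypothetical prior strength
                       ) -> List[str]:
    """Rank content types by a smoothed success rate.

    events: (content_type, succeeded) pairs, one per logged search.
    """
    searches = defaultdict(int)
    successes = defaultdict(int)
    for content_type, succeeded in events:
        searches[content_type] += 1
        if succeeded:
            successes[content_type] += 1

    def smoothed_rate(content_type: str) -> float:
        # Pull sparse content types toward the prior instead of trusting
        # a handful of observations.
        return ((successes[content_type] + prior_rate * prior_weight)
                / (searches[content_type] + prior_weight))

    return sorted(searches, key=smoothed_rate, reverse=True)


# Hypothetical log: people searches mostly succeed, jobs less often,
# and groups has a single (successful) observation.
log = ([("people", True)] * 8 + [("people", False)] * 2
       + [("jobs", True)] * 3 + [("jobs", False)] * 7
       + [("groups", True)])
ranking = rank_content_types(log)
print(ranking)
```

The smoothing keeps a content type with one lucky observation from outranking types with substantial evidence, which matters when bootstrapping from sparse pre-unified search data.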

Unified Search Result Page

Intent Detection and Page Construction

§  Relevance is now a two-part computation:

P(Content Type | User, Query) × P(Document | User, Query, Content Type)

§  Intent detection comes first: inefficient to send all queries
to all verticals.

§  Secondary components introduce diversity.
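
The two-part computation, with intent detection running first, can be sketched as follows. The names `rank_documents`, `fetch_vertical`, and `min_type_prob`, and all probabilities in the example, are hypothetical; the sketch does not model the secondary components that introduce diversity.

```python
from typing import Callable, Dict, List, Tuple


def rank_documents(p_type: Dict[str, float],
                   fetch_vertical: Callable[[str], Dict[str, float]],
                   min_type_prob: float = 0.1  # hypothetical cutoff
                   ) -> List[Tuple[float, str, str]]:
    """Score documents as P(type | user, query) * P(doc | user, query, type).

    p_type: content type -> P(type | user, query), from intent detection.
    fetch_vertical: queries one vertical and returns
        document -> P(doc | user, query, type).
    """
    scored = []
    for content_type, pt in p_type.items():
        if pt < min_type_prob:
            continue  # intent detection says skip: do not query this vertical
        for doc, pd in fetch_vertical(content_type).items():
            scored.append((pt * pd, content_type, doc))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored


# Hypothetical intent probabilities and per-vertical results.
intent = {"people": 0.7, "jobs": 0.25, "groups": 0.05}
verticals = {
    "people": {"alice": 0.6, "bob": 0.3},
    "jobs": {"swe role": 0.8},
    "groups": {"ir group": 0.9},
}
queried = []


def fetch(content_type: str) -> Dict[str, float]:
    queried.append(content_type)  # record which verticals were contacted
    return verticals[content_type]


page = rank_documents(intent, fetch)
print([doc for _, _, doc in page], queried)
```

Because the groups vertical falls below the cutoff, it is never queried, which illustrates why intent detection comes before retrieval.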

Summary

§  Personalize every search and leverage structure.
§  Understand queries as early as possible.
§  Fight the spammers that be.
§  Unify and simplify the search experience.

Goal: help LinkedIn’s 200M+
members find and be found.

Want to learn more?

§  Check out http://data.linkedin.com/search.

§  Contact us:
–  Shakti: ssinha@linkedin.com
http://linkedin.com/in/sdsinha

–  Daniel: dtunkelang@linkedin.com
http://linkedin.com/in/dtunkelang

§  Did we mention that we’re hiring?
