Data By The People, For The People
Daniel Tunkelang
Director, Data Science at LinkedIn
Invited Talk at the 21st ACM International Conference on Information and Knowledge Management (CIKM 2012)
LinkedIn has a unique data collection: the 175M+ members who use LinkedIn are also the content those same members access using our information retrieval products. LinkedIn members performed over 4 billion professionally-oriented searches in 2011, most of those to find and discover other people. Every LinkedIn search and recommendation is deeply personalized, reflecting the user's current employment, career history, and professional network. In this talk, I will describe some of the challenges and opportunities that arise from working with this unique corpus. I will discuss work we are doing in the areas of relevance, recommendation, and reputation, as well as the ecosystem we have developed to incent people to provide the high-quality semi-structured profiles that make LinkedIn so useful.
Bio:
Daniel Tunkelang leads the data science team at LinkedIn, which analyzes terabytes of data to produce products and insights that serve LinkedIn's members. Prior to LinkedIn, Daniel led a local search quality team at Google. Daniel was a founding employee of faceted search pioneer Endeca (recently acquired by Oracle), where he spent ten years as Chief Scientist. He has authored fourteen patents, written a textbook on faceted search, created the annual workshop on human-computer interaction and information retrieval (HCIR), and participated in the premier research conferences on information retrieval, knowledge management, databases, and data mining (SIGIR, CIKM, SIGMOD, SIAM Data Mining). Daniel holds a PhD in Computer Science from CMU, as well as BS and MS degrees from MIT.
9. But not all relevance factors are personal.
[Figure: examples labeled "Good" and "Bad"]
10. People are semi-structured objects.
for i in [1..n]
    B[i] ← {}
    s ← w_1 w_2 … w_i
    if Pc(s) > 0
        a ← new Segment()
        a.segs ← {s}
        a.prob ← Pc(s)
        B[i] ← {a}
    for j in [1..i-1]
        for b in B[j]
            s ← w_{j+1} w_{j+2} … w_i
            if Pc(s) > 0
                a ← new Segment()
                a.segs ← b.segs ∪ {s}
                a.prob ← b.prob * Pc(s)
                B[i] ← B[i] ∪ {a}
    sort B[i] by prob
    truncate B[i] to size k
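
The pseudocode on this slide reads as a beam search over segmentations of a word sequence: B[i] keeps the k most probable ways to split the first i words. A minimal Python sketch of that reading follows. It assumes each new segment starts right after the prefix already segmented, and it seeds the search with an empty segmentation so the two cases in the pseudocode collapse into one loop. The names `Segmentation`, `segment`, and `pc`, and the probabilities in the usage example, are hypothetical and not taken from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segmentation:
    segs: tuple   # segments chosen so far, in order
    prob: float   # product of the segment probabilities


def segment(words, pc, k=3):
    """Return up to k highest-probability segmentations of `words`.

    `pc` is an assumed callable mapping a candidate segment (a string)
    to its probability; it returns 0 for strings that are not segments.
    """
    n = len(words)
    # beams[i] holds the best segmentations of the first i words.
    beams = [[] for _ in range(n + 1)]
    beams[0] = [Segmentation((), 1.0)]  # empty prefix, probability 1
    for i in range(1, n + 1):
        candidates = []
        for j in range(i):
            s = " ".join(words[j:i])    # candidate segment: words j+1..i
            p = pc(s)
            if p > 0:
                for b in beams[j]:
                    candidates.append(Segmentation(b.segs + (s,), b.prob * p))
        # Keep only the k most probable segmentations of this prefix.
        candidates.sort(key=lambda a: a.prob, reverse=True)
        beams[i] = candidates[:k]
    return beams[n]


# Usage with made-up segment probabilities (illustrative only).
PC = {"linkedin": 0.5, "data": 0.4, "scientist": 0.3,
      "data scientist": 0.6, "linkedin data": 0.01}
best = segment(["linkedin", "data", "scientist"], lambda s: PC.get(s, 0.0), k=2)
print(best[0].segs)
```

Truncating each beam to k entries bounds the work per prefix, at the cost of possibly discarding a prefix that would have led to the best full segmentation.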
20. How LinkedIn matches people to jobs
[Diagram: job-matching pipeline]
Job: title, industry, geo, description, company, functional area, …
User base, filtered to candidates:
    General: expertise, specialties, education, headline, geo, experience, …
    Current position: title, summary, tenure length, industry, functional area, …
Corpus stats (derived): transition probabilities, connectivity, years of experience to reach title, education needed for this title, …
Matching:
    Binary: exact matches (geo, industry, …)
    Soft: transition probabilities, similarity, …
    Text
Example feature values:
    Similarity (candidate expertise, job description): 0.56
    Similarity (candidate specialties, job description): 0.2
    Transition probability (candidate industry, job industry): 0.43
    Title similarity: 0.8
    Similarity (headline, title): 0.7
    …
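
To make the idea of matching features concrete, here is a rough Python sketch that builds a small feature vector for one candidate and one job: a binary exact match, a looked-up transition probability, and two text similarities. Everything in it is an assumption for illustration. The slide does not say how similarity is computed or how features are combined, so bag-of-words cosine similarity and a weighted sum stand in here, and the field names, function names, and weights are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words vectors (an assumed measure)."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def match_features(candidate, job, transition_prob):
    """Build a feature vector for one (candidate, job) pair."""
    return {
        # Binary feature: exact match on geography.
        "geo_exact": float(candidate["geo"] == job["geo"]),
        # Soft feature: probability of moving between industries,
        # looked up in a table assumed to be derived from corpus stats.
        "industry_transition": transition_prob.get(
            (candidate["industry"], job["industry"]), 0.0),
        # Text features: similarity between profile and job fields.
        "expertise_vs_description": cosine_similarity(
            candidate["expertise"], job["description"]),
        "headline_vs_title": cosine_similarity(
            candidate["headline"], job["title"]),
    }


def score(features, weights):
    """Combine features with a weighted sum (an assumption, not from the talk)."""
    return sum(weights[name] * value for name, value in features.items())


candidate = {"geo": "SF", "industry": "Internet",
             "expertise": "data scientist", "headline": "data scientist"}
job = {"geo": "SF", "industry": "Software",
       "description": "senior data scientist", "title": "data scientist"}
features = match_features(candidate, job, {("Internet", "Software"): 0.43})
print(score(features, {name: 0.25 for name in features}))
```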
24. Recommendations: Summary
Content is king.
Connections provide social dimension.
Context determines where and when a recommendation is appropriate.
27. Closing the triangles
[Diagram: a triangle of Alice, Bob, and Carol with one edge marked "?"]
• Triads suggest and affect relationships [Simmel, 1908], [Granovetter, 1973].
• Triangle closing is a Big Data problem [Shah, 2011].
• Use machine learning to rank candidates.
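
As a small illustration of triangle closing, the Python sketch below finds every pair of people who are not connected but share at least one connection, then orders the pairs by how many connections they share. The common-connection count is only a stand-in for the learned ranking the slide mentions, and the in-memory approach ignores the scale concerns raised above. The function names and the example graph are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def open_triangles(edges):
    """Map each unconnected pair to the set of connections they share."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    common = defaultdict(set)
    for middle, nbrs in neighbors.items():
        # Any two neighbors of `middle` that are not yet connected
        # form an open triangle that `middle` could close.
        for u, v in combinations(sorted(nbrs), 2):
            if v not in neighbors[u]:
                common[(u, v)].add(middle)
    return common


def rank_candidates(edges):
    """Order candidate pairs by shared-connection count, most first."""
    common = open_triangles(edges)
    return sorted(common, key=lambda pair: (-len(common[pair]), pair))


edges = [("Alice", "Bob"), ("Alice", "Carol"), ("Dave", "Bob"),
         ("Dave", "Carol"), ("Alice", "Eve")]
for pair in rank_candidates(edges):
    print(pair)
```

In a learned ranker, the shared-connection count would be one feature among many rather than the final score.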
31. Networking: Summary
Close triangles to suggest connections.
Connections as social proof.
Unleash the power of weak ties.
32. Conclusion
• People use LinkedIn because of other people.
• Primary use cases:
    – Find and be found.
    – Discover and share knowledge.
• People are at the heart of LinkedIn’s products:
    – Search
    – Recommendations
    – Networking
33. Thank You!
175M+ members
2/sec
62% non-U.S.
25th most visited website worldwide (Comscore 6-12)
>2M company pages
85% of Fortune 500 companies use LinkedIn to hire
We're hiring!
[Chart: LinkedIn Members (Millions), 2004 to 2011; labeled values: 2, 4, 8, 17, 32, 55, 90]
Learn more at http://data.linkedin.com/