Aardvark shalini

Aardvark
The Anatomy of a Large-Scale Social Search Engine
by
Damon Horowitz and Sepandar D. Kamvar

Presented by
Shalini Sahoo
11/21/2011

Introduction
• Library vs village paradigm
• Traditional IR approaches follow the library paradigm
• In a village, information is passed from person to person
• The retrieval task consists of finding the right person (expert
in that field)
• Example queries: (Pg 1) “Do you have any good baby-sitter
recommendations in Palo-Alto for my 6-year-old twins? I’m
looking for somebody that won’t let them watch TV.”
“Is it safe for me to take a cab alone at 3 am from SFO airport
to my home in Berkeley?”

2

Differences

Library                                        | Village
Keywords are used to search                    | Natural language used to ask questions
Short queries (~2.93 words)                    | Verbose, highly contextualized and subjective queries (~18.6 words)
Knowledge base created by content publishers   | Community forms the knowledge base
Trust is based on authority                    | Trust is based on intimacy
Retrieval involves finding the right document  | Retrieval involves finding the right person

3

Aardvark
• Is* a social search engine based on the village paradigm
• It connects users live with friends or friends-of-friends who
are able to answer their questions
• Users submit questions via Aardvark’s website, email,
instant messenger or app on mobile devices
• It identifies and facilitates a live chat or email conversation
with one or more topic experts in the user's extended social
network
• It was mainly used for asking subjective questions for
which human judgement or recommendation was desired

* was - Aardvark was shut down on September 30th 2011
4

Aardvark
• It was originally developed by The
Mechanical Zoo, a San-Francisco-
based startup founded in 2007 by
Max Ventilla, Damon Horowitz

• Early 2008: A prototype version was launched
• March 2009: It was released to public
• February 2010: Google acquired it for $50 million
• September 2011: Google shut down Aardvark*

* A fall spring-clean: http://googleblog.blogspot.com/2011/09/fall-spring-clean.html
5

Outline
• Overview

• Anatomy

• Examples

• Analysis

• Evaluation

• Discussion

6

Outline
➡ Overview
‣ Main Components
‣ The Initiation of a User
‣ The Life of a Query
• Anatomy

• Examples

• Analysis

• Evaluation

• Discussion

7

Main Components
• Crawler and Indexer: To find and label resources that
contain information

• Query Analyzer: To understand the user’s information
need

• Ranking Function: To select the best resources to provide
the information

• User Interface: To present the information to the user in
an accessible and interactive form

8

The Initiation of a User
• The first step involves forming the “Social Graph”
• Users can import contacts from:
- social networking sites like Facebook or LinkedIn
- webmail programs like Gmail or Yahoo Mail
• Users can also invite friends to join

• Users in a common group or community (e.g. studied at UT
Austin, Google summer interns 2011) are added to the social
graph
• Each user's topical expertise information is indexed:
- Users can indicate the topics in which they have expertise
- A user's friends can select topics for which they trust the user's opinion
- Users can indicate their personal webpages or blogs
- Users' status updates from Facebook or Twitter (if available) are used

9

The Initiation of a User
• Forward Index: stores the userId, a scored list of topics,
further scores about user behavior
• From this forward index, an inverted index is constructed
• Inverted Index: stores each topicId and a scored list of
userIds (with expertise in that topic)
• Inverted index also stores scored list of userIds for features
like answer quality and response time
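
The relationship between the two indexes can be sketched as follows. This is a minimal illustration, not Aardvark's actual implementation: the dictionary layout, the function name `build_inverted_index`, and the sample users and scores are all assumptions made for the example.

```python
from collections import defaultdict


def build_inverted_index(forward_index):
    """Invert {user_id: {topic_id: score}} into {topic_id: [(user_id, score), ...]}."""
    inverted = defaultdict(list)
    for user_id, topic_scores in forward_index.items():
        for topic_id, score in topic_scores.items():
            inverted[topic_id].append((user_id, score))
    # Sort each posting list by descending score so the strongest
    # candidates for a topic come first.
    for postings in inverted.values():
        postings.sort(key=lambda pair: pair[1], reverse=True)
    return dict(inverted)


# Hypothetical forward index: userId -> scored list of topics.
forward_index = {
    "alice": {"restaurants": 0.7, "travel": 0.3},
    "bob": {"restaurants": 0.2, "programming": 0.8},
}
inverted_index = build_inverted_index(forward_index)
print(inverted_index["restaurants"])
```

The same inversion could be applied to non-topic features such as answer quality or response time, with the feature name taking the place of the topicId.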

10

The Life of a Query
[Figure: Aardvark architecture. Components shown: Gateways (IM, Email, Twitter, iPhone, Web, SMS); Transport Layer; Conversation Manager; Question Analyzer (classifiers); Routing Engine; Index Database (Users, Graph, Topics, Content); Importers (Web Content, Facebook, LinkedIn, etc.). Edge labels: Msgs, RSR, {RS}*]

11

Outline
✓ Overview

➡ Anatomy
‣ The Model
‣ Social Crawling
‣ Indexing People
‣ Analyzing Questions
‣ Ranking Algorithm
‣ User Interface

• Examples

• Analysis

• Evaluation

• Discussion

12

The Model
• A network variant of an aspect model is used
- It associates an unobserved class variable t ∊ T with each
observation (i.e., the successful answer of question q by
user ui)
p(ui|q) = ∑_{t ∊ T} p(ui|t) p(t|q)    (1)
- It defines a query-independent probability of success for
each potential asker/answerer pair (ui, uj), based upon their
degree of social connectedness and profile similarity
• The scoring function s(ui, uj, q) is defined as a composition
of the two probabilities
s(ui, uj, q) = p(ui|uj) · p(ui|q) = p(ui|uj) ∑_{t ∊ T} p(ui|t) p(t|q)    (2)

13

The Model
• The goal of the ranking algorithm is: given a question q
from user uj, return a ranked list of users ui ∊ U that
maximizes the scoring function s(ui, uj, q)
• The scoring function allows real-time routing because much
of the computation is done offline
• The only term which needs to be computed at query-time is
p(t|q)
• The distribution p(ui|t) assigns users to topics, and the
distribution p(ui|uj) defines the Aardvark social graph; both
of these are computed by the Indexer at signup time
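
A small sketch of how the ranking could be computed from these three distributions. It assumes p(ui|t) and p(ui|uj) are available as precomputed lookup tables and that only p(t|q) arrives at query time; the table layouts, function names, and all numbers are illustrative assumptions, not values from the paper.

```python
def expertise(user, p_topic_given_q, p_user_given_topic):
    """p(ui|q) = sum over topics t of p(ui|t) * p(t|q)."""
    return sum(
        p_tq * p_user_given_topic.get(topic, {}).get(user, 0.0)
        for topic, p_tq in p_topic_given_q.items()
    )


def rank_answerers(asker, candidates, p_topic_given_q,
                   p_user_given_topic, p_connected):
    """Return (user, s(ui, uj, q)) pairs sorted by descending score."""
    scored = []
    for user in candidates:
        if user == asker:
            continue  # never route a question back to its asker
        connectedness = p_connected.get((user, asker), 0.0)  # p(ui|uj)
        score = connectedness * expertise(user, p_topic_given_q,
                                          p_user_given_topic)
        scored.append((user, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored


# Offline tables (hypothetical values).
p_user_given_topic = {
    "restaurants": {"nick": 0.6, "fred": 0.4},
    "san francisco": {"nick": 0.5, "fred": 0.5},
}
p_connected = {("nick", "mark"): 0.5, ("fred", "mark"): 0.25}

# Query-time term p(t|q), as produced by a question analyzer.
p_topic_given_q = {"restaurants": 0.75, "san francisco": 0.25}

ranked = rank_answerers("mark", ["nick", "fred", "mark"], p_topic_given_q,
                        p_user_given_topic, p_connected)
print(ranked)
```

Because the two tables are fixed before the question arrives, the per-query work is one pass over the question's topics for each candidate.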

14

Social Crawling
• In Aardvark, people form the knowledge base rather than
documents
• The more active users there are, the more potential
answerers
• So it is important for Aardvark to create a good experience
for users so that they remain active and inclined to invite
their friends
• The breadth of the Aardvark knowledge base depends upon
designing interfaces and algorithms to update the topic lists
for each user over time

15

Indexing People
• The distribution p(t|ui) of topics known by user ui is
computed from the following sources:
- Users can indicate the topics in which they have expertise
- A user's friends can select topics for which they trust the user's opinion
- Users can indicate their personal webpages or blogs
- Users' status updates from Facebook or Twitter (if available)

• Over time, Aardvark learns which topics not to send to a
particular user by keeping track of when the user:
- explicitly “mutes” a topic
- declines to answer questions about a topic when given the chance
- receives negative feedback on an answer from the asker

16

Indexing People
• Periodically, a topic strengthening algorithm is used
• For a user ui and his group of friends U, and for some topic
t, if p(t|ui) ≠ 0, then
s(t|ui) = p(t|ui) + γ ∑_{u ∊ U} p(t|u)
where γ is a small constant; the values are then renormalized
to form probabilities
• Other smoothing techniques are used to record the
possibility that a user might be able to answer questions about
additional topics not explicitly mentioned in their profile
• p(ui|t) is computed using Bayes' law
p(ui|t) = p(t|ui) p(ui) / p(t)    (3)
using a uniform distribution for p(ui)

17
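
The strengthening step and the Bayes inversion can be sketched together. This is an illustration under stated assumptions: the value of the small constant (`gamma=0.1`), the sample users, and the choice to obtain p(t) by marginalizing p(t|u) over a uniform p(u) are all made up for the example and are not taken from the paper.

```python
def strengthen_topics(p_topic_given_user, user, friends, gamma=0.1):
    """Boost a user's existing topics by friends' knowledge, then renormalize."""
    scores = {}
    for topic, p in p_topic_given_user[user].items():
        if p == 0:
            scores[topic] = 0.0  # only topics the user already has are boosted
            continue
        friend_mass = sum(
            p_topic_given_user.get(f, {}).get(topic, 0.0) for f in friends
        )
        scores[topic] = p + gamma * friend_mass
    total = sum(scores.values())
    return {topic: s / total for topic, s in scores.items()}


def bayes_user_given_topic(p_topic_given_user, topic):
    """p(u|t) = p(t|u) p(u) / p(t), with uniform p(u).

    Assumption: p(t) is computed here by marginalizing over users.
    """
    p_user = 1.0 / len(p_topic_given_user)
    p_topic = sum(d.get(topic, 0.0) * p_user
                  for d in p_topic_given_user.values())
    if p_topic == 0:
        return {}
    return {u: d.get(topic, 0.0) * p_user / p_topic
            for u, d in p_topic_given_user.items()}


# Hypothetical topic distributions p(t|u).
p_topic_given_user = {
    "ann": {"cooking": 0.5, "travel": 0.5},
    "ben": {"cooking": 1.0},
    "cat": {"cooking": 0.5, "music": 0.5},
}

strengthened = strengthen_topics(p_topic_given_user, "ann", ["ben", "cat"])
posterior = bayes_user_given_topic(p_topic_given_user, "cooking")
print(strengthened, posterior)
```

In this toy data, ann's "cooking" score rises relative to "travel" because both of her friends also know about cooking.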

Connectedness
• Connectedness between users p(ui|uj) is computed using a
weighted cosine similarity over the following feature set:
- Social connection
- Demographic similarity
- Profile similarity
- Vocabulary match
- Chattiness match
- Verbosity match
- Politeness match
- Speed match

• p(ui|uj) is stored in the social graph
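
A minimal sketch of a weighted cosine similarity over this feature set. The formula is one common definition of weighted cosine; the feature encodings, the weights, and the sample users are assumptions for illustration, and the slides do not say how the similarity is converted into the probability p(ui|uj).

```python
import math

# Feature names follow the list above; the weights are hypothetical.
WEIGHTS = {
    "social_connection": 2.0,
    "demographic_similarity": 1.0,
    "profile_similarity": 1.0,
    "vocabulary_match": 1.0,
    "chattiness_match": 1.0,
    "verbosity_match": 1.0,
    "politeness_match": 1.0,
    "speed_match": 1.0,
}


def weighted_cosine(a, b, weights):
    """sum(w*a*b) / (sqrt(sum(w*a^2)) * sqrt(sum(w*b^2))) over all features."""
    dot = sum(w * a.get(f, 0.0) * b.get(f, 0.0) for f, w in weights.items())
    norm_a = math.sqrt(sum(w * a.get(f, 0.0) ** 2 for f, w in weights.items()))
    norm_b = math.sqrt(sum(w * b.get(f, 0.0) ** 2 for f, w in weights.items()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no information about one of the users
    return dot / (norm_a * norm_b)


# Hypothetical feature vectors for two users.
user_i = {"social_connection": 1.0, "profile_similarity": 0.5}
user_j = {"social_connection": 1.0, "profile_similarity": 0.5}
user_k = {"verbosity_match": 1.0}

print(weighted_cosine(user_i, user_j, WEIGHTS))
print(weighted_cosine(user_i, user_k, WEIGHTS))
```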

18

Analyzing Questions
• The main goal of the Question Analyzer is: given a question
q, determine a scored list of topics p(t|q)
• The following classifiers are run on a question:
- NonQuestion Classifier
- InappropriateQuestion Classifier
- TrivialQuestion Classifier
- LocationSensitive Classifier

• Next, the list of relevant topics is produced by merging
outputs from several TopicMapper algorithms
- KeywordMatchTopicMapper
- TaxonomyTopicMapper
- SalientTermTopicMapper
- UserTagTopicMapper
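
The merge step could look like the sketch below. The slides do not say how the TopicMapper outputs are combined, so the weighted sum followed by normalization is an assumption, as are the function name `merge_topic_scores` and the sample mapper outputs.

```python
def merge_topic_scores(mapper_outputs, mapper_weights=None):
    """Combine several {topic: score} maps into one normalized p(t|q)."""
    merged = {}
    for i, output in enumerate(mapper_outputs):
        weight = 1.0 if mapper_weights is None else mapper_weights[i]
        for topic, score in output.items():
            merged[topic] = merged.get(topic, 0.0) + weight * score
    total = sum(merged.values())
    if total == 0:
        return {}  # no mapper found a topic
    return {topic: score / total for topic, score in merged.items()}


# Hypothetical outputs from two mappers for one restaurant question.
keyword_output = {"restaurants": 2.0, "san francisco": 1.0}
taxonomy_output = {"restaurants": 1.0}

p_topic_given_q = merge_topic_scores([keyword_output, taxonomy_output])
print(p_topic_given_q)
```

Normalizing the merged scores makes the result usable directly as the p(t|q) term in the scoring function.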
19

Analyzing Questions
• The TopicMapper algorithms are continuously evaluated
• Given a question, all the returned topics used to select an
answerer, as well as a much larger list of relevant topics, are
assigned scores by two human judges
• This yields 89% precision and 84% recall of relevant topics

20

Ranking Algorithm
• The topic list generated by the Question Analyzer is sent to
the Routing Engine which then determines the top
answerers for the given question
• The main factors that determine the ranking of users are:
- Topic expertise p(ui|q)
- Connectedness p(ui|uj)
- Availability

• From this ordered list of users, the Routing Engine then
filters out users who should not be contacted
- based on preferred time of contact
- based on how frequently they have been contacted in the
recent past
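
The filtering pass can be sketched as below. The `Candidate` fields, the representation of preferred contact time as a range of hours, and the `max_recent_contacts` threshold are all hypothetical; the slides only state that these two criteria exist.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    user_id: str
    score: float             # s(ui, uj, q) from the ranking step
    preferred_hours: range   # hours of the day when contact is acceptable
    recent_contacts: int     # times contacted in the recent past


def filter_candidates(ranked, current_hour, max_recent_contacts=3):
    """Drop users who should not be contacted, preserving ranking order."""
    return [
        c for c in ranked
        if current_hour in c.preferred_hours
        and c.recent_contacts < max_recent_contacts
    ]


# Hypothetical ranked list, best candidate first.
ranked = [
    Candidate("nick", 0.29, range(9, 22), 1),
    Candidate("fred", 0.11, range(9, 22), 5),   # contacted too often recently
    Candidate("paul", 0.08, range(18, 23), 0),  # evenings only
]

contactable_at_noon = filter_candidates(ranked, current_hour=12)
contactable_at_19 = filter_candidates(ranked, current_hour=19)
print([c.user_id for c in contactable_at_noon])
```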

21

User Interface
• The various user interfaces of Aardvark are built on top of
real-time communication channels such as IM, email,
SMS, iPhone, Twitter and Web-based messaging

22

Outline
✓ Overview

✓ Anatomy
➡ Examples

• Analysis

• Evaluation

• Discussion

25

Examples
EXAMPLE 1
(Question from Mark C./M/LosAltos,CA)
I am looking for a restaurant in San Francisco that is open for lunch. Must be very high-end and fancy (this is for a small, formal, post-wedding gathering of about 8 people).
(+4 minutes -- Answer from Nick T./28/M/SanFrancisco,CA -- a friend of your friend Fritz Schwartz)
fringale (fringalesf.com) in soma is a good bet; small, fancy, french (the french actually hang out there too). Lunch: Tuesday - Friday: 11:30am - 2:30pm
(Reply from Mark to Nick)
Thanks Nick, you are the best PM ever!
(Reply from Nick to Mark)
you're very welcome. hope the days they're open for lunch work...

EXAMPLE 2
(Question from James R./M/TwinPeaksWest,SF)
What is the best new restaurant in San Francisco for a Monday business dinner? Fish & Farm? Gitane? Quince (a little older)?
(+7 minutes -- Answer from Paul D./M/SanFrancisco,CA -- A friend of your friend Sebastian V.)
For business dinner I enjoyed Kokkari Estiatorio at 200 Jackson. If you prefer a place in SOMA i recommend Ozumo (a great sushi restaurant).
(Reply from James to Paul)
thx I like them both a lot but I am ready to try something new
(+1 hour -- Answer from Fred M./29/M/Marina,SF)
Quince is a little fancy... La Mar is pretty fantastic for cevice - like the Slanted Door of peruvian food...

EXAMPLE 3
(Question from Brian T./22/M/Castro,SF)
What is a good place to take a spunky, off-the-cuff, social, and pretty girl for a nontraditional, fun, memorable dinner date in San Francisco?
(+4 minutes -- Answer from Dan G./M/SanFrancisco,CA)
Start with drinks at NocNoc (cheap, beer/wine only) and then dinner at RNM (expensive, across the street).
(Reply from Brian to Dan)
Thanks!
(+6 minutes -- Answer from Anthony D./M/Sunnyvale,CA -- you are both in the Google group)
Take her to the ROTL production of Tommy, in the Mission. Best show i've seen all year!
(Reply from Brian to Anthony)
Tommy as in the Who's rock opera? COOL!
(+10 minutes -- Answer from Bob F./M/Mission,SF -- you are connected through Mathias' friend Samantha S.)
Cool question. Spork is usually my top choice for a first date, because in addition to having great food and good really friendly service, it has an atmosphere that's perfectly in between casual and romantic. It's a quirky place, interesting funny menu, but not exactly non-traditional in the sense that you're not eating while suspended from the ceiling or anything
26

Outline
✓ Overview

✓ Anatomy
✓ Examples
➡ Analysis

• Evaluation

• Discussion

29

Analysis
• As of October 2009, Aardvark had 90,361 registered users
• The average query volume was 3,167.2 questions per day in
this period

30

Analysis
• Mobile users were particularly active
- It is easier to reply to questions in the form of IM or SMS on a phone
- People are comfortable using natural language in an IM setting
rather than in a web search setting

• Questions are highly contextualized
- Average query length is 18.6 words

• Questions often have a subjective element
[Figure 8: Categories of questions sent to Aardvark. Categories shown: websites & internet apps; business research; music, movies, TV, books; sports & recreation; home & cooking; finance & investing; technology & programming; miscellaneous; Aardvark; local services; travel; product reviews & help; restaurants & bars]
31

Analysis
• Questions get answered quickly
[Figure 9: Distribution of questions and answering times. Vertical axis: Questions Answered (×10^4). Horizontal axis bins: 0−3 min, 3−6 min, 6−12 min, 12−30 min, 30 min−1 hr, 1−4 hr, 4+ hr]
• Answers are of high quality
- Answers are comprehensive and concise
- Median answer length was 22.2 words
- 70.4% of inline feedback rated answers as ‘good’, 14.1% rated
as ‘OK’ and 15.5% of answers were rated as ‘bad’

32

Analysis
• There is a broad range of answerers
• Social proximity matters
• People are indexable

33

Outline
✓ Overview

✓ Anatomy
✓ Examples
✓ Analysis
➡ Evaluation

• Discussion

34

Evaluation
• Compared to Google!
• “Do you want to help Aardvark run an experiment?” was
inserted into a random sample of active questions
• Users were asked to reformulate their question as a query
and search on Google
• Users timed how long it took to find a satisfactory result and
also rated the quality of the answers
• 71.5% of the queries were answered successfully on Aardvark, with a mean rating of 3.93
• 70.5% of the queries were answered successfully on Google, with a mean rating of 3.07

35

Outline
✓ Overview

✓ Anatomy
✓ Examples
✓ Analysis
✓ Evaluation
➡ Discussion

36

Discussion
• Participation Fatigue: (Pg 9) “86.7% users have been
contacted by Aardvark with a request to answer a question,
and of those, 70% have looked at the question and 38%
could answer a question. 20% of the users accounted for
85% of answers” What happens when this thin slice of
users gets overwhelmed and starts dropping out?
• Availability: There can be cases when the topic expert(s) in
your social graph might not be online. Do you think having
an “offline” mode would be helpful?
• Evaluation: Would we get a better understanding of how well
Aardvark worked if it had been compared to another social
search engine that works on the same paradigm? How
can that be achieved?

37
