Learning by example: training users through high-quality query suggestions

Learning by Example: training users through high-quality query suggestions (SIGIR’15)
A collaboration with Morgan Harvey & David Elsweiler.
Claudia Hauff
Web Information Systems

[Figure: DuckDuckGo traffic over time, Sep '12 to Jan '16; y-axis from 0 to 350,000,000.]
Data available at https://duckduckgo.com/traffic.html
NSA collecting phone records of millions of Verizon
customers daily. The Guardian. June 6, 2013.
Not everyone
stays around.

I do care about privacy …
until the moment my
searches fail me.
@flickr:eviloars
Can we teach searchers to use an arbitrary search
engine as best as possible?

@flickr:practicalowl
Advanced retrieval algorithms; queries as a given.
Assisting users in creating better queries.
(query suggestions, related searches, query autocompletion)
Personalised & context-driven search.
Educate users to become better searchers.
complementary to technical solutions
system specific

• Altering the size [Franzen & Karlgren, 2000] and wording [Belkin et al., 2003] of the
search box influences the length of submitted queries
• Exchanging a complex multi-field catalogue interface for a simple search
box radically alters user behaviour [McKay & Buchanan, 2013]
• Training users how to construct boolean logic queries can change search
behaviour [Lucas & Topi, 2004]
• Allowing users to compare their search behaviour to expert searchers
enables them to reflect and change their habits [Bateman et al., 2012]
Behaviour change support systems
“… information systems designed to form, alter, or reinforce
attitudes or behaviours or both without using coercion or
deception” [Oinas-Kukkonen & Harjumaa, 2008]

Our questions
Are users able to notice differences between good queries
and their own? Can they abstract these differences to
change their own behaviour?
How effectively can users learn and abstract from good
queries? Do users who are “trained” perform better than
users who did not receive training?
@flickr:eviloars

Our hypotheses
@flickr:carbonnyc
H1: Users can adapt their querying behaviour to pose good queries to an unfamiliar search system.
H2: Users are able to identify salient characteristics of good queries.
H3: A small number of “training queries” is sufficient.
H4: A user who receives training with queries he can relate to learns better/faster than a user who receives training with less-relatable queries.

A collection of user studies
• Generating training queries
• User perception of high-quality queries
• Piloting zing
• Main study: zing
• Training size study
All studies are based on AQUAINT and the TREC 2005 Robust track topics.

Generating high-quality queries I
• Query quality is measured in Average Precision
• The queries should intuitively make sense to humans (instead of relying on quirks in documents)
• The queries should not be overly verbose or specific
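Since query quality is measured in Average Precision (AP) throughout, a minimal sketch of the standard AP computation for one query may help. The document ids and the relevant set below are made up for illustration; the studies themselves use the TREC relevance judgements.

```python
from typing import Sequence, Set


def average_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """AP of one ranked result list against a set of relevant document ids."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at the rank of each relevant document that was retrieved.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


# Hypothetical ranking and judgements, for illustration only.
ranking = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d3", "d1", "d8"}
ap = average_precision(ranking, relevant)
print(round(ap, 4))
```

Relevant documents are found at ranks 1 and 3, one relevant document is missed, so AP = (1/1 + 2/3) / 3.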

Generating high-quality queries II
For each TREC topic:
• Relevant documents (AQUAINT) → 100 single-term queries. Hand-crafted filtering rules avoid unintuitive term selection.
• AP-based query ranking → top two-term queries.
• 3x: top 100 queries up to length 4.
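The generation steps (single-term candidates from relevant documents, AP-based ranking, repeated expansion up to length 4) can be sketched roughly as below. This is an illustration under assumptions, not the pipeline used in the studies: the toy corpus, the `search` function, the length-based term filter (a stand-in for the hand-crafted rules) and the way expansions are pooled are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Query = Tuple[str, ...]


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


def search(query: Query, docs: Dict[str, str]) -> List[str]:
    """Toy retrieval (assumption): rank documents by query-term occurrences."""
    scores = {}
    for doc_id, text in docs.items():
        tokens = text.split()
        score = sum(tokens.count(term) for term in query)
        if score:
            scores[doc_id] = score
    return sorted(scores, key=lambda d: (-scores[d], d))


def candidate_terms(docs: Dict[str, str], relevant: Set[str], n_terms: int) -> List[str]:
    """Most frequent terms in the relevant documents; the length filter is a
    hypothetical stand-in for the hand-crafted filtering rules."""
    counts = Counter()
    for doc_id in sorted(relevant):
        counts.update(t for t in docs[doc_id].split() if len(t) > 2)
    return [term for term, _ in counts.most_common(n_terms)]


def generate_queries(docs, relevant, n_terms=100, keep=100, max_len=4):
    terms = candidate_terms(docs, relevant, n_terms)
    scored = {(t,): average_precision(search((t,), docs), relevant) for t in terms}
    frontier = list(scored)
    for _ in range(max_len - 1):  # three expansion rounds when max_len is 4
        best = sorted(frontier, key=lambda q: (-scored[q], q))[:keep]
        # Extend each of the best queries by one further candidate term.
        frontier = sorted({tuple(sorted(q + (t,))) for q in best for t in terms if t not in q})
        for q in frontier:
            scored[q] = average_precision(search(q, docs), relevant)
    ranked = sorted(scored.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:keep]


# Hypothetical toy corpus; d1 and d2 play the role of the relevant documents.
DOCS = {
    "d1": "hubble telescope images faint galaxies universe",
    "d2": "hubble astronomer observes infrared stars",
    "d3": "dam construction river project",
    "d4": "telescope sale shop discount",
    "d5": "river cruise holiday universe tour",
}
RELEVANT = {"d1", "d2"}
top_queries = generate_queries(DOCS, RELEVANT)
print(top_queries[:3])
```

Keeping only the best queries at each round keeps the number of AP evaluations manageable compared with scoring every combination of candidate terms.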

Generating high-quality queries III
Identify positive accomplishments of the Hubble telescope since it was launched in 1991. (303)
* universe astronomer faint hubble
* infrared galaxies universe hubble
* infrared stars universe hubble
Identify drugs used in the treatment of mental illness. (383)
* antidepressant risk zoloft prozac
* zoloft studies prozac
* antidepressant effective zoloft
What is the status of The Three Gorges Project? (416)
* cofferdams damming generating 2009
* dam corporation phase 2009
* 2009 river construction
Median AP across the 100 generated queries: 0.38

User perception I
Task: “You are given an information need and a query suggestion that has been derived for this information need. Rate the suggestion along four dimensions: knowledge, surprise, usage and relevance.”
Information need: Identify positive accomplishments of the Hubble telescope since it was launched in 1991.
Query suggestion: universe astronomer faint hubble
• Top 15 queries per topic.
• HIT: 10 tasks, 12 cents.
• 3 workers per task.
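With three workers rating each suggestion along four dimensions, the ratings need to be aggregated per suggestion. A minimal sketch, assuming a simple mean per dimension on the 1 to 5 scale; the rating values and worker ids below are invented for illustration, and the aggregation actually used in the study may differ.

```python
from collections import defaultdict
from statistics import mean

DIMENSIONS = ("knowledge", "surprise", "usage", "relevance")

# Hypothetical ratings: (topic id, suggestion, worker id, {dimension: rating 1-5}).
ratings = [
    (303, "universe astronomer faint hubble", "w1",
     {"knowledge": 3, "surprise": 2, "usage": 4, "relevance": 4}),
    (303, "universe astronomer faint hubble", "w2",
     {"knowledge": 2, "surprise": 3, "usage": 4, "relevance": 5}),
    (303, "universe astronomer faint hubble", "w3",
     {"knowledge": 4, "surprise": 4, "usage": 1, "relevance": 3}),
]


def mean_ratings(rows):
    """Average each dimension over all workers who rated the same suggestion."""
    collected = defaultdict(lambda: defaultdict(list))
    for topic, suggestion, _worker, scores in rows:
        for dim in DIMENSIONS:
            collected[(topic, suggestion)][dim].append(scores[dim])
    return {
        key: {dim: mean(values) for dim, values in dims.items()}
        for key, dims in collected.items()
    }


summary = mean_ratings(ratings)
print(summary[(303, "universe astronomer faint hubble")])
```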

User perception II
[Figure: three histograms of the number of ratings per rating value (1 to 5).]
• How surprised were you? (Not to Very): Indicates that our query generation approach is valid.
• Would you use the suggestion? (No to Yes): Many of our suggestions are not very convincing.
• What will the quality of the search results be? (Low to High): Expected search result quality is mostly average.

User perception III
• Familiar topics tend to be of broad interest
• Topics covering specific themes attract low knowledge ratings
What factors contributed to the growth of consumer on-line shopping? (639) 3.0/5
Identify drugs used in the treatment of mental illness. (383) 2.89/5
What is the status of The Three Gorges Project? (416) 1.58/5

A closer look at zing
• How well am I doing?
• Suggestions (higher AP than user queries) after 2 initial queries.
• Relevant documents are marked by the system.

Piloting
• N=22 undergraduates
• 10 medium difficulty topics
• Randomized topic order
• Reflection prompts
When does fatigue set in? By topic 7, median AP≈0.
Query characteristics (81 reflections encoded):
C1: Specific query terms
C2: More general query terms
C3: Queries not in topic description
C4: Unexpected or surprising vocab.
C5: Surprising non-use of vocab.
C6: Terms the user was surprised at the usefulness of
C7: Thinking creatively
C8: Advanced vocabulary (rare)
C9: Specialist vocabulary (rare)
C10: Good combination of search terms
C11: Synonyms and related concepts
C12: Query requires specialist knowledge
Users are able to identify salient characteristics of good queries.

Main study
• Between-group design, N=91
• 6 medium difficulty topics
• Randomized topic order
• Training & test phase
Group Gexp_high: Trained on high-quality suggestions that were also perceived as high quality.
Group Gexp_low: Trained on high-quality suggestions that were perceived as low quality.
Group Gcontrol: No training at any stage.
[Figure: the experimental groups work on 4 topics with suggestions followed by 2 topics without; the control group works on 6 topics without suggestions.]
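The between-group design with a training and a test phase can be sketched as a session plan per participant. Everything specific here is an assumption made for illustration: the round-robin assignment to balanced groups, shuffling topic order within each phase, and the placeholder topic names T1 to T6.

```python
import random

GROUPS = ["Gexp_high", "Gexp_low", "Gcontrol"]


def build_sessions(participants, training_topics, test_topics, seed=0):
    """Assign participants to groups and give each a randomized topic order."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    sessions = {}
    for index, participant in enumerate(shuffled):
        group = GROUPS[index % len(GROUPS)]  # assumption: balanced groups
        train = list(training_topics)
        test = list(test_topics)
        rng.shuffle(train)  # assumption: order randomized within each phase
        rng.shuffle(test)
        # Only the experimental groups see suggestions, and only in training.
        with_suggestions = group != "Gcontrol"
        steps = [(topic, with_suggestions) for topic in train]
        steps += [(topic, False) for topic in test]
        sessions[participant] = {"group": group, "steps": steps}
    return sessions


# Hypothetical participant ids and placeholder topic names.
participants = [f"p{i}" for i in range(91)]
sessions = build_sessions(participants, ["T1", "T2", "T3", "T4"], ["T5", "T6"])
print(sessions["p0"])
```

The control group works through the same six topics, so differences on the last two topics can be attributed to the suggestions seen earlier rather than to the topics themselves.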

Main study: query effectiveness
[Figure: query effectiveness per group; panels: Training topics, Test topics.]
Users who receive high-quality training suggestions perform better on average & achieve considerably higher max. AP scores.

Main study: query sequence effectiveness
[Figure: Average Precision (0 to 0.4) by query sequence position (1 to 10) for Control, Exp_High and Exp_Low.]
Average precision over sequences of queries on test topics.
Each point represents the mean AP of all queries submitted as nth query.
Gexp_high & Gexp_low significantly outperform Gcontrol.
No significant differences observed between Gexp_high & Gexp_low.
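Each point in the plot is the mean AP of all queries submitted as nth query, which amounts to a group-by over (group, position). A minimal sketch; the log format and the AP values below are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical query log: (group, participant, topic, position n, AP of the query).
log = [
    ("Control", "p1", "t5", 1, 0.10),
    ("Control", "p2", "t5", 1, 0.30),
    ("Control", "p1", "t5", 2, 0.20),
    ("Exp_High", "p3", "t5", 1, 0.40),
    ("Exp_High", "p4", "t6", 1, 0.20),
    ("Exp_High", "p3", "t5", 2, 0.50),
]


def mean_ap_by_position(rows):
    """Mean AP of all queries submitted as nth query, per group."""
    buckets = defaultdict(list)
    for group, _participant, _topic, position, ap in rows:
        buckets[(group, position)].append(ap)
    return {key: mean(values) for key, values in buckets.items()}


curve = mean_ap_by_position(log)
for (group, position), value in sorted(curve.items()):
    print(group, position, round(value, 3))
```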

Training size study
• Between-group design, N=57
• Analogous setup to Main study
[Figure: two panels of Average Precision (0 to 0.4) by query sequence position (1 to 10) for Control, Exp_High and Exp_Low. Main study: 4 training & 2 test topics. Now: 2 training & 4 test topics.]
Less training yields fewer (but still statistically significant) improvements.
Similarity between Gexp_high & Gexp_low remains stable.

Looking back at our hypotheses
@flickr:carbonnyc
H1: Users can adapt their querying behaviour to pose good queries to an unfamiliar search system.
H2: Users are able to identify salient characteristics of good queries.
H3: A small number of “training queries” is sufficient.
H4: A user who receives training with queries he can relate to learns better/faster than a user who receives training with less-relatable queries.

Looking ahead
• Learning is limited to a single session
• Does the learning effect hold across sessions and over time?
• How to translate this approach (requiring qrels) into settings where users are unwilling to train?
• Are implicit relevance indicators sufficient?
• What is the most efficient manner of presenting such “learning queries” to users?
@flickr:

Ideas, comments & suggestions
are more than welcome!
Thank you.
c.hauff@tudelft.nl
