Measuring Learning During Search - ACM SIGIR CHIIR 2019

NILAVRA BHATTACHARYA, JACEK GWIZDKA
School of Information, The University of Texas at Austin
ACM SIGIR CHIIR 2019 • GLASGOW, SCOTLAND, UK? EU?
MEASURING LEARNING DURING SEARCH
Differences in Interactions, Eye-Gaze, and Semantic Similarity to Expert Knowledge

“Why is the sky blue?” → “The sky is blue because …”
“learning” or change in knowledge
Big Idea: to measure this knowledge-change, and (eventually) infer when it is happening
Benefits: can be extended to a wide variety of fields, independent of topic and content
e.g. online learning environments will become more popular

Overview
1. Introduction & Background
2. Method
3. Measures
4. Results
5. Summary

1.1 What is Learning?
change in verbal knowledge,
from before to after a search session
Image: http://thepeakperformancecenter.com/educational-learning/thinking/blooms-taxonomy/blooms-taxonomy-revised
Revised Bloom’s Taxonomy

1.2 Measuring Learning
Existing Methods of Assessing Knowledge-Change:
• asking explicit fact-checking questions
– can be disruptive for web-searching
• SVT: Sentence Verification Technique
– requires creating specific questions for each document consumed (Freund et al., 2016)
• (Automated) Essay Scoring
– requires a training set of carefully hand-scored essays (Yang et al., 2002)
• concept-maps and mind-mapping
– difficult to score for non-experts
• common drawbacks in the context of online information search:
– topic-specific
– time-consuming to measure
– difficult to scale up

1.3 Prior Work
• Goal: implicit measurement of learning or knowledge-gain
Implicit Measures:
• Cole et al. (2013): eye gaze patterns can assess differences in users’ domain knowledge level (for text search).
– behavioural features are topic-independent predictive cues of domain knowledge
• Collins-Thompson et al. (2016): diversity in search queries is an indicator of increased knowledge gain.
• Vakkari (2016): suggested a set of predictors for knowledge-change during search.

1.3.1 Prior Work
Gadiraju et al. (CHIIR 2018):
• topic-specific pre- and post-tests involving True/False questionnaires
– may not be generalizable for all topics
– exposes users to search-topic and possible answers
– correct answer for multiple-choice questions can be selected by guesswork
Image: Gadiraju, U., Yu, R., Dietze, S., & Holtz, P. (2018). Analyzing knowledge gain of users in informational search sessions on the web. CHIIR ’18

1.3.2 Prior Work
Image: Ghosh, S., Rath, M., & Shah, C. (2018). Searching as Learning: Exploring Search Behavior and Learning Outcomes in Learning-related Tasks. CHIIR ’18
Ghosh et al. (CHIIR 2018):
• users were asked to self-rate their perceived change in knowledge
– subjective
– may not reflect true change in knowledge

1.4 Research Aims
• explore knowledge-change measures that
– do not require domain-specific comprehension tests
– do not expose users to the search-topic before the actual search begins
– attempt to measure a searcher’s knowledge-change, minimizing guessing and subjective differences
• investigate differences in search behaviour and gaze-patterns of users showing low versus high knowledge-change

2.1 Experimental Design
• Eye-tracking user study (n=30; 16 females)
• Within-subjects design
• Participants searched for health-related information on the web
• Participants were pre-screened for
– non-expert topic familiarity
– uncorrected eye-sight
– proficiency in online searching

2.2 Task Description
• Two search tasks on health-related topics, following the simulated work-task approach (Borlund, 2003)
– tried to trigger a realistic information-need in participants (e.g., helping a cousin, and a friend)
• Topics:
– Vitamin A
– Hypotension
• Each task had 4 questions from multiple facets
– e.g. for Vitamin A, participants had to find:
• recommended dosage
• health benefits
• consequences of excess and deficiency
• food sources
Section 3.2 of our paper contains the full texts of the task prompts.
Borlund, P. (2003). The IIR evaluation model: a framework for evaluation of interactive information retrieval systems. Information Research. 8(3).

2.3 Procedure
Memory-span test via Working Memory Capacity (WMC)
→ Online health literacy test via eHealth Literacy Scale (eHEALS)
→ Training task to familiarize with interface
→ Search task (in counter-balanced order)

2.3 Procedure
[Figure: screens of a search task, labelled a to e: Pre-task Knowledge (free-text), Customized Google SERP, CONTENT pages, Bookmarking, Note-taking, Post-task Knowledge (free-text)]

2.3 Custom Google SERP
• Custom Google SERP:
– results retrieved in the background from Google
– 7 results per page
• increases font-size and visual angle for proper eye-tracking
– no ads

2.3 Bookmarking & Note-taking
Bookmarking Note-taking

2.4 Procedure
Memory-span test via Working Memory Capacity (WMC)
→ Online health literacy test via eHealth Literacy Scale (eHEALS)
→ Training task to familiarize with interface
→ Search task (in counter-balanced order)
→ Perceived workload test after each task via NASA-TLX

3 Measures
• Pre- and post-tasks
– Pre-task prompt: “Think of what you already know on the topic of this search and list as many phrases or words as you can that come to your mind.”
– Post-task prompt: “Now that you have completed this search task, think of the information that you found and list as many words or phrases as you can on the topic of the search task.”
– difference between the two responses: change in knowledge
Aim: to measure this change, using implicit-feedback measures
Challenge: user input is open-ended text, via free-recall from memory (no time-limit)
(Key difference of our study from prior works (Gadiraju et al., 2018; Yu et al., 2018; Ghosh et al., 2018))

3 Measures
• Knowledge Change (KC)
– simple
– sophisticated (using semantic similarity)
• Eye-tracking (ET)
• Search Interactions (SI)
• Unit of analysis: <user, task> pair

3.1.1 KC Measures - Simple
$KC\_Simple = \dfrac{items_{post} - items_{pre}}{items_{post}}$
items = words and phrases entered by users before and after each task, separated by ENTER key presses (“\n”)
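A minimal sketch of the KC_Simple computation, assuming each free-recall response arrives as one string with items separated by newlines. The function names, the handling of blank lines, and the error on an empty post-task response are illustrative assumptions, not details from the study.

```python
def count_items(free_text: str) -> int:
    """Count the non-empty, newline-separated entries in a free-recall response."""
    # Assumption: blank lines (e.g. a trailing ENTER) are not counted as items.
    return sum(1 for line in free_text.split("\n") if line.strip())


def kc_simple(pre_text: str, post_text: str) -> float:
    """Return (items_post - items_pre) / items_post."""
    items_pre = count_items(pre_text)
    items_post = count_items(post_text)
    if items_post == 0:
        # Assumption: the slides do not say how an empty post-task response is handled.
        raise ValueError("post-task response contains no items")
    return (items_post - items_pre) / items_post


# Made-up responses, for illustration only.
pre = "vitamin\ncarrots"
post = "retinol\nnight blindness\ncarrots\nliver\n"
print(kc_simple(pre, post))  # (4 - 2) / 4 = 0.5
```

Note that the measure only counts items; it does not check whether the entered phrases are correct, which is what the semantic-similarity measures address.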

3.1.2 KC Measures - Sophisticated
[Diagram: the user’s pre-task answers and post-task answers are each compared with expert knowledge (the “correct” answers); the difference between the two is the knowledge change]

Step 1: Curating expert knowledge vocabulary:
– crowdsourced answers to each question from the search task (MTurk)
. . .
– answers were cleaned and verified by a medical doctor (expert)
– final vocabulary size:
• 115 phrases / words for Task 1
• 105 phrases / words for Task 2

Step 2: Measuring semantic similarity between texts
Step 2(a): Turn natural text into numbers (Sentence Embedding):
"user's pre-task answers" → [0.3, 5.6, 0.7, …]
"user's post-task answers" → [0.7, 1.2, 0.1, …]
"answers from expert" → [0.9, 3.6, 0.5, …]
• encoder of greater-than-word length text: phrases, sentences, short paragraphs
• trained on a variety of large text corpora: Google News, entire English Wikipedia, etc.
Image: https://tfhub.dev/google/universal-sentence-encoder/2

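Step 2: Measuring semantic similarity between texts
Step 2(b): Measure distance between vectors:
$\vec{u}$ = user’s answer vector (pre- or post-task)
$\vec{v}$ = expert’s knowledge vector, e.g. [0.9, 3.6, 0.5, …]
$sim(\vec{u}, \vec{v}) = 1 - \dfrac{\arccos\left(\dfrac{\vec{u} \cdot \vec{v}}{\|\vec{u}\| \, \|\vec{v}\|}\right)}{\pi}$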
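The similarity in Step 2(b) can be sketched in plain Python as follows. This is an illustration only: the function name, the clamping of the cosine against rounding error, and the toy two-dimensional vectors are assumptions; real sentence embeddings have many more dimensions.

```python
import math
from typing import Sequence


def angular_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Return 1 - arccos(cosine(u, v)) / pi for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("similarity is undefined for a zero vector")
    # Clamp to [-1, 1] so floating-point rounding cannot push acos out of its domain.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return 1.0 - math.acos(cosine) / math.pi


# Toy vectors: same direction, orthogonal, and opposite.
print(angular_similarity([1.0, 0.0], [2.0, 0.0]))   # same direction: 1.0
print(angular_similarity([1.0, 0.0], [0.0, 3.0]))   # orthogonal: about 0.5
print(angular_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite: about 0.0
```

The result lies between 0 and 1, with 1 meaning the two vectors point in the same direction.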

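[Diagram: similarity of the user’s pre-task answers to expert knowledge (the “correct” answers) = initial knowledge state; similarity of the user’s post-task answers to expert knowledge = final knowledge state; the change between the two = knowledge change]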

• How to measure change between two numbers?
$KC\_Sem\_Diff = sim(post\_task, expert) - sim(pre\_task, expert)$
$KC\_Sem\_Ratio = \dfrac{sim(post\_task, expert)}{sim(pre\_task, expert)}$
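A short sketch of the two semantic knowledge-change measures, taking the two similarity scores as inputs. The function names and the numeric values are made up for illustration; how a zero pre-task similarity should be treated is not stated in the slides, so the sketch simply raises an error.

```python
def kc_sem_diff(sim_pre: float, sim_post: float) -> float:
    """Difference: sim(post_task, expert) - sim(pre_task, expert)."""
    return sim_post - sim_pre


def kc_sem_ratio(sim_pre: float, sim_post: float) -> float:
    """Ratio: sim(post_task, expert) / sim(pre_task, expert)."""
    if sim_pre == 0.0:
        # Assumption: undefined when the pre-task similarity is zero.
        raise ValueError("pre-task similarity is zero; ratio is undefined")
    return sim_post / sim_pre


# Hypothetical similarity scores for one <user, task> pair.
sim_pre, sim_post = 0.5, 0.75
print(kc_sem_diff(sim_pre, sim_post))   # 0.25
print(kc_sem_ratio(sim_pre, sim_post))  # 1.5
```

A positive difference, or a ratio above 1, means the post-task answers are closer to the expert vocabulary than the pre-task answers were.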

3.2 Eye-tracking (ET)
• we used “reading” eye-fixations only (Gwizdka, 2014)
– fixation count (fix_n) and duration (fix_dur_sum, fix_dur_avg)
– length of reading sequences (rseq_len)
– regression count (regr_n) and length (regr_len)
[Figure: Reading (on relevant content) vs. Scanning (on irrelevant content)]
Gwizdka, J. (2014). Characterizing Relevance with Eye-tracking Measures. IIiX ’14

3.3 Search Interactions (SI)
• Webpage based:
– count of pages visited (pg_n)
• Search-query based:
– count of queries (query_n)
– count of new queries in query-reformulations (qr_new_n)
– how “specialized” were the words used in queries (q_word_freq)
• "cure for low blood pressure" (less specialized)
• "mayoclinic hypotension treatment" (more specialized)
• Table 1 in our paper describes how to compute all the measures.
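One possible reading of the “specialized words” idea, sketched under loud assumptions: the slides do not give the formula (Table 1 of the paper does), so the sketch below simply averages a per-word corpus frequency over the query, with a lower average taken to mean more specialized wording. The frequency table is entirely made up, and the function name and the treatment of unknown words are hypothetical.

```python
from statistics import mean
from typing import Mapping

# Made-up frequencies (occurrences per million words), for illustration only.
TOY_WORD_FREQ = {
    "cure": 40.0, "for": 9000.0, "low": 600.0, "blood": 300.0,
    "pressure": 250.0, "mayoclinic": 1.0, "hypotension": 2.0,
    "treatment": 200.0,
}


def mean_query_word_freq(query: str, freq: Mapping[str, float]) -> float:
    """Average corpus frequency of the words in a query (illustrative only)."""
    words = query.lower().split()
    if not words:
        raise ValueError("empty query")
    # Assumption: words missing from the table are treated as frequency 0 (very rare).
    return mean(freq.get(word, 0.0) for word in words)


common = mean_query_word_freq("cure for low blood pressure", TOY_WORD_FREQ)
special = mean_query_word_freq("mayoclinic hypotension treatment", TOY_WORD_FREQ)
print(common > special)  # the first query uses more common words
```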

4.1 Data Analysis
Do LO and HI groups differ significantly in terms of their Eye-tracking (ET) and Search Interaction (SI) measures?
[Diagram: each KC measure (KC_Simple, KC_Sem_Ratio, KC_Sem_Diff) yields a LO group and a HI group; ET and SI measures are compared between the two groups]
• Quasi-independent Vars:
– Knowledge Change (KC) groups (LO and HI)
• Dependent Vars:
– Eye-tracking (ET)
– Search Interactions (SI)
• Statistical Test:
– Mann-Whitney U
• Group-membership was fairly consistent:
– 2 / 49 mismatches between _Ratio and _Diff
– 9 / 49 mismatches between _Simple and _Sem
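The group comparison can be sketched as follows. Two assumptions are made loudly: the slides do not say how the LO and HI groups were formed, so a median split on the KC score is used here purely for illustration, and the records are made-up numbers. Only the Mann-Whitney U statistic is computed; a p-value would come from tables or a library routine such as `scipy.stats.mannwhitneyu`.

```python
from statistics import median
from typing import Dict, List, Sequence, Tuple

Record = Dict[str, float]


def median_split(records: List[Record], key: str) -> Tuple[List[Record], List[Record]]:
    """Split records into LO (<= median) and HI (> median) groups on `key`."""
    cutoff = median(r[key] for r in records)
    lo = [r for r in records if r[key] <= cutoff]
    hi = [r for r in records if r[key] > cutoff]
    return lo, hi


def mann_whitney_u(xs: Sequence[float], ys: Sequence[float]) -> float:
    """U statistic for xs: count of (x, y) pairs with x > y; ties count 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in xs for y in ys)


# Made-up <user, task> records: a KC score and a fixation count.
records = [
    {"kc": 0.10, "fix_n": 130.0}, {"kc": 0.20, "fix_n": 125.0},
    {"kc": 0.30, "fix_n": 140.0}, {"kc": 0.60, "fix_n": 90.0},
    {"kc": 0.70, "fix_n": 100.0}, {"kc": 0.80, "fix_n": 95.0},
]
lo, hi = median_split(records, "kc")
u = mann_whitney_u([r["fix_n"] for r in lo], [r["fix_n"] for r in hi])
print(len(lo), len(hi), u)  # 3 3 9.0 (every LO value exceeds every HI value)
```

U ranges from 0 to len(xs) * len(ys); values near either extreme indicate that one group tends to have larger values than the other.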

4.2.1 ET Measures - Fixations
• LO group had higher (!) eye-tracking fixation-measures than HI group:
– fixated more on CONTENT pages (fix_n_content_avg; .05 ≤ p ≤ .1)
– fixated longer in total (fix_dur_content_sum; p < .01) and on average (fix_n_content_avg)
• Yu et al. (SIGIR 2018) similarly found:
– total, average, and max time spent on webpages have the highest predictive power for knowledge-gain prediction

4.2.1 ET Measures - Movement
• Again, LO group differed significantly by having:
– longer reading sequences (rseq_n); higher probability of reading (pRR_serp)
– longer (regr_len) and more frequent (regr_n) backward regressions
• Eye-tracking measures show the LO group put more effort into reading, yet our Knowledge-Change measures reflect that they learnt less

4.2.2 SI Measures
• LO and HI users entered a similar number of search queries
– LO group entered fewer new queries in reformulations (qr_new_n)
– LO group used more common (or less specialized) words in queries (q_word_freq)
• Yu et al. (SIGIR 2018) similarly observed:
– count of unique terms in queries was the only query-related feature that showed predictive power

4.3 Other Measures
• HI group reported higher mental workload (NASA_TLX)
• LO and HI groups did not have any significant differences in
– eHealth literacy (knowledge, comfort, and skills at finding, evaluating, and applying electronic health information)
– working-memory capacity
– number of webpages visited
• Yu et al. (SIGIR 2018) similarly illustrate:
– counts of webpages visited are very weak predictors of knowledge-gain (Fig 1 of Yu et al. (2018): feature importance of random forest model).

4.4 Final Knowledge State (FKS)
final knowledge state (FKS): $post\_exp\_sim = sim(post\_task, expert)$
[Diagram: post-task answers are compared with expert knowledge; the LO-FKS and HI-FKS groups are compared on ET and SI measures]
• LO-FKS group:
– spent longer time on reading SERPs (pRR_serp)
– opened fewer CONTENT pages (pg_content_n); thus found fewer relevant CONTENT pages (pg_content_rel_n)
• similar phenomenon observed by Gwizdka (CHIIR 2017) and Collins-Thompson et al. (2016)
– reported lower mental workload after task (NASA_TLX)

5.1 Takeaways
GROUPS:
LO: Low Knowledge-Change (KC)
LO-FKS: Low Final-Knowledge-State (FKS)
• LO group read more, yet they learnt less
– possibly due to difficulty in acquiring information
• LO-FKS group spent more time reading SERPs
– yet they opened fewer relevant search results
• LO group used less specialized words in their queries
• LO group reported lower mental workload after each task
• No significant differences in
– total number of pages visited
– eHealth Literacy Score
– Working Memory Capacity

5.2 In terms of Research Aims
• explore knowledge-change measures that
– do not require domain-specific comprehension tests
– do not expose users to the search-topic before the actual search begins
• we introduce a topic-independent, free-recall based method of knowledge assessment
– expert vocabulary can be curated from online knowledge bases (e.g. Wikipedia)
– attempt to measure a searcher’s knowledge-change, while minimizing guessing and subjective differences
• we used semantic similarity of user-responses to expert-knowledge to measure knowledge-change
– advances in measuring semantic-similarity will help in this direction
• investigate differences in search behaviour and gaze-patterns of users showing low versus high knowledge-change
– results show Eye-tracking (ET) and Search-Interaction (SI) measures differ significantly with varying levels of knowledge-change => ET & SI: good candidate measures of verbal-learning

5.3 Limitations & Future Work
• Limitations:
– only 2 search-tasks, of similar nature (health information search)
– data-analysis at task-level (not participant level)
– relatively uniform group of participants (young-adult college students)
– short time-frame
• Future Directions:
– wider range of search tasks
– more diverse participants
– additional individual-difference tests
– multiple-session study (to assess knowledge-change over longer period of time)

5.4 Summary
[Diagram: Verbal Knowledge Change shown in relation to: specialized words in queries; NASA-TLX mental workload; eye-tracking measures; search interactions (webpage counts, durations); Working Memory Capacity; eHealth Literacy Score]

THANK YOU
Questions?
Student Travel Grant, Career Award
Acknowledgements:
• expert-knowledge curation: Dr. Andrzej Kahl
• crowdsourcing and data collection: Yinglong Zhang

References
• Collins-Thompson, K., Rieh, S. Y., Haynes, C. C., & Syed, R. (2016). Assessing learning outcomes in web search: A comparison of tasks and query strategies. CHIIR ’16
• Gadiraju, U., Yu, R., Dietze, S., & Holtz, P. (2018). Analyzing knowledge gain of users in informational search sessions on the web. CHIIR ’18
• Yu, R., Gadiraju, U., Holtz, P., Rokicki, M., Kemkes, P., & Dietze, S. (2018). Predicting user knowledge gain in informational search sessions. SIGIR ’18
• Ghosh, S., Rath, M., & Shah, C. (2018). Searching as Learning: Exploring Search Behavior and Learning Outcomes in Learning-related Tasks. CHIIR ’18
• Gwizdka, J. (2014). Characterizing Relevance with Eye-tracking Measures. IIiX ’14
• Cole, M. J., Gwizdka, J., Liu, C., Belkin, N. J., & Zhang, X. (2013). Inferring user knowledge level from eye movement patterns. Information Processing & Management, 49(5), 1075-1091.
• Gwizdka, J. (2017, March). I can and so I search more: effects of memory span on search behavior. CHIIR ’17

References
• Borlund, P. (2003). The IIR evaluation model: a framework for evaluation of interactive information retrieval systems. Information Research. 8(3).
• Wildemuth, B. M. (2004). The effects of domain knowledge on search tactic formulation. Journal of the American Society for Information Science and Technology, 55(3), 246-258.
• Vakkari, P. (2016). Searching as learning: A systematization based on literature. Journal of Information Science, 42(1), 7-18.
• Cer, D., Yang, Y., Kong, S. Y., Hua, N., Limtiaco, N., John, R. S., ... & Sung, Y. H. (2018). Universal sentence encoder. arXiv preprint arXiv:1803.11175.
• Franz, A., & Brants, T. (2006). All our n-gram are belong to you. Google Machine Translation Team, 20.
• Freund, L., Kopak, R., & O’Brien, H. (2016). The effects of textual environment on reading comprehension: Implications for searching as learning. Journal of Information Science, 42(1), 79-93.
• Yang, Y., Buckendahl, C. W., Juszkiewicz, P. J., & Bhola, D. S. (2002). A review of strategies for validating computer-automated scoring. Applied Measurement in Education, 15(4), 391-412.
• Francis, G., MacKewn, A., & Goldthwaite, D. (2004). CogLab on a CD. Wadsworth Publishing Company.
