Identifying Information Needs by Modelling Collective Query Patterns
1. Identifying Information Needs by Modelling Collective Query Patterns
K. Elbedweihy, S. Mazumdar, A.E. Cano, S.N. Wrigley, F. Ciravegna
OAK Research Group, Department of Computer Science, University of Sheffield
2. Information Needs
Information needs: “the set of concepts and properties users refer to while using SPARQL queries.”
4. Motivation
Saracevic [1997]: “The success or failure of any interactive system and technology is contingent on the extent to which user issues, the human factors, are addressed right from the beginning to the very end…”
Peter Mika [2009]: “Considering the information needs of end users is critical to the success of Semantic Search”
5. Motivation
• Understand how to use logs of queries
• Identify information needs
• Consume such analysis
• Better understanding and insight into the data usage
7. Introduction
[Figure: Linked Open Data cloud diagram, as of September 2011: 295 datasets, 31 billion RDF triples. Legend categories: Media, Geographic, Publications, User-generated content, Government, Cross-domain, Life sciences.]
9. Related Work
Analysis for the Web of Documents
• Studying the search behavior of Web users [Silverstein et al. (1999), Jansen and Spink (2005), Jansen et al. (2005) and Spink et al. (2002)].
• Improving the search experience of Web users:
  - Query Recommendations [Baeza-Yates et al. (2004) and Wen et al. (2001)]
  - Query Expansion [Cui et al. (2002a)]
10. Related Work (Cont’d)
Analysis for the Web of Data
• Moller et al. [10] identified patterns of Linked Data usage with respect to different types of agents.
• Arias et al. [1] analyzed the structure of SPARQL queries to identify the most frequent language elements.
• Kirchberg et al. [8] introduced a new notion of ‘relevance of a LD resource’ as the ‘relationship between traffic and the resource and whether it changes over time windows’.
11. Related Work (Cont’d)
How our work is different:
Our focus is on identifying information needs by modelling query patterns of Linked Data users.
• An approach to formalize semantic query log analysis
• A set of methods for extracting patterns in the query logs
• Visualization of information needs
13. Formalizing Query Logs
• Proposed ontology ‘Qlog’ used to represent the main concepts and relations extracted from a query log entry.
• A log entry follows the Combined Log Format (CLF):
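To make the log structure concrete, the following is a minimal sketch of how one entry might be split into fields and its SPARQL query recovered. It assumes the Apache-style Combined Log Format layout (host, identity, user, timestamp, request, status, size, referer, user agent) and assumes the query travels in a `query` URL parameter. The function name `parse_log_entry` and the sample line are hypothetical and are not taken from the slides or from the Qlog ontology.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Assumed layout (Apache-style Combined Log Format):
# host ident user [time] "request" status size "referer" "user-agent"
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_entry(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # The request field looks like: METHOD /path?params PROTOCOL
    parts = entry["request"].split()
    entry["query"] = None
    if len(parts) >= 2:
        params = parse_qs(urlsplit(parts[1]).query)
        if "query" in params:
            # parse_qs already URL-decodes the value
            entry["query"] = params["query"][0]
    return entry

# Hypothetical log line, for illustration only.
sample = (
    '203.0.113.7 - - [01/Jan/2011:10:15:32 +0000] '
    '"GET /sparql?query=SELECT+%3Fs+WHERE+%7B+%3Fs+a+'
    '%3Chttp%3A%2F%2Fexample.org%2FCity%3E+%7D HTTP/1.1" '
    '200 1024 "-" "ExampleAgent/1.0"'
)
entry = parse_log_entry(sample)
print(entry["host"], entry["status"])
print(entry["query"])
```

The decoded query string is the part a later step would break down into triple patterns. The remaining fields (host, timestamp, user agent) describe who issued the query and when.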
20. Dataset
• The data used in this study is made available by the USEWOD2011 data challenge.
• The logs contained around 5 million queries issued to DBpedia over a time period of almost 4 months.

Number of analyzed queries          4951803
Number of unique triple patterns    2641098
Number of unique subjects           1168945
Number of unique predicates         2003
Number of unique objects            196221
Number of unique vocabularies       323
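As a rough illustration of how counts of this kind could be derived, the sketch below summarizes a list of triple patterns that have already been extracted from the queries. It rests on several assumptions not stated in the slides: each pattern is a (subject, predicate, object) tuple of strings, variables (terms starting with `?`) are not counted as subjects, predicates, or objects, and a vocabulary is taken to be the namespace of a predicate URI (everything up to the last `#` or `/`). The helper names and the sample patterns are hypothetical.

```python
def namespace(uri):
    """Assumed notion of vocabulary: the URI up to and including the last '#' or '/'."""
    cut = max(uri.rfind("#"), uri.rfind("/"))
    return uri[:cut + 1]

def is_variable(term):
    return term.startswith("?")

def summarize(patterns):
    """Count unique patterns and unique non-variable terms per position."""
    unique_patterns = set(patterns)
    subjects = {s for s, _, _ in unique_patterns if not is_variable(s)}
    predicates = {p for _, p, _ in unique_patterns if not is_variable(p)}
    objects = {o for _, _, o in unique_patterns if not is_variable(o)}
    vocabularies = {namespace(p) for p in predicates}
    return {
        "unique triple patterns": len(unique_patterns),
        "unique subjects": len(subjects),
        "unique predicates": len(predicates),
        "unique objects": len(objects),
        "unique vocabularies": len(vocabularies),
    }

# Illustrative patterns only; the last one repeats the first.
patterns = [
    ("?s", "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
     "http://dbpedia.org/ontology/City"),
    ("http://dbpedia.org/resource/Sheffield",
     "http://dbpedia.org/ontology/populationTotal", "?p"),
    ("http://dbpedia.org/resource/Sheffield",
     "http://www.w3.org/2000/01/rdf-schema#label", "?l"),
    ("?s", "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
     "http://dbpedia.org/ontology/City"),
]
stats = summarize(patterns)
for name, count in stats.items():
    print(name, count)
```

Whether the figures reported above treat variables or vocabularies in this way is not stated in the slides, so the sketch shows only the general counting approach.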