Bibliometric-enhanced Retrieval Models for Big Scholarly Information Systems
1. Bibliometric-enhanced Retrieval Models for Big Scholarly Information Systems
philipp.mayr@gesis.org
Workshop on Scholarly Big Data: Challenges and Ideas. IEEE BigData 2013
3. Intro
• What are bibliometric-enhanced IR models?
– a set of methods to quantitatively analyze scientific and technological literature
– e.g. citation analysis (h-index)
– CiteSeer was a pioneering bibliometric-enhanced IR system
4. Background
• DFG-funded (2009-2013): projects IRM I and IRM II
– IRM = Information Retrieval Mehrwertdienste (value-added IR services)
• Goal: implementation and evaluation of value-added IR services for digital library systems
• Main idea: applying scholarly (science) models for IR
– Co-occurrence analysis of controlled vocabularies (thesauri)
– Bibliometric analysis of core journals (Bradford's law)
– Centrality in author networks (betweenness)
• In IRM I we concentrated on the basic evaluation
• In IRM II we concentrate on the implementation of reusable (web) services
http://www.gesis.org/en/research/external-funding-projects/archive/irm/
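The author-centrality idea from the list above can be sketched in plain Python. This is a minimal, unoptimized Brandes-style betweenness computation on an invented toy co-author graph (the author names and graph are illustrative, not project data):

```python
from collections import deque

def betweenness(graph):
    """Unnormalized betweenness centrality (Brandes' algorithm)
    for an undirected graph given as {node: [neighbours]}."""
    bc = {v: 0.0 for v in graph}
    for s in graph:
        stack, queue = [], deque([s])
        pred = {v: [] for v in graph}                  # shortest-path predecessors
        sigma = dict.fromkeys(graph, 0); sigma[s] = 1  # shortest-path counts
        dist = dict.fromkeys(graph, -1); dist[s] = 0
        while queue:                                   # BFS from source s
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    pred[w].append(v)
        delta = dict.fromkeys(graph, 0.0)              # dependency accumulation
        while stack:
            w = stack.pop()
            for v in pred[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return {v: c / 2 for v, c in bc.items()}           # undirected: halve

# toy co-author network: Carol bridges two groups
coauthors = {
    "Alice": ["Bob"],
    "Bob": ["Alice", "Carol"],
    "Carol": ["Bob", "Dan", "Eve"],
    "Dan": ["Carol"],
    "Eve": ["Carol"],
}
ranking = sorted(coauthors, key=betweenness(coauthors).get, reverse=True)
```

Ranking authors by this score puts the "bridging" author (Carol) first, which is the structural signal the centrality service exploits.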
5. Search Term Recommender (Petras 2006)
• Search Term Service: recommending strongly associated terms from a controlled vocabulary
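One way to read this service: vocabulary terms that frequently co-occur with the query term in document indexings get recommended. A minimal sketch follows; the toy corpus and the Jaccard association measure are my assumptions for illustration, not Petras' exact model:

```python
from collections import defaultdict
from itertools import combinations

# toy corpus: each document indexed with controlled-vocabulary terms
docs = [
    {"unemployment", "labour market", "social policy"},
    {"unemployment", "labour market"},
    {"unemployment", "social policy"},
    {"migration", "social policy"},
]

df = defaultdict(int)   # document frequency per term
co = defaultdict(int)   # co-occurrence count per unordered term pair
for terms in docs:
    for t in terms:
        df[t] += 1
    for a, b in combinations(sorted(terms), 2):
        co[(a, b)] += 1

def recommend(term, k=3):
    """Rank vocabulary terms by Jaccard association with `term`."""
    scores = {}
    for (a, b), n in co.items():
        if term in (a, b):
            other = b if a == term else a
            scores[other] = n / (df[term] + df[other] - n)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

For the query term "unemployment" the sketch ranks "labour market" above "social policy", because the former co-occurs in a larger share of the documents involving either term.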
6. Bradfordizing (White 1981, Mayr 2009)
• Bradford's Law of Scattering (Bradford 1948), idealized example for 450 articles:
– Nucleus/core: 150 papers in 3 journals
– Zone 2: 150 papers in 9 journals
– Zone 3: 150 papers in 27 journals
• Ranking by Bradfordizing: sorting the core journal papers / core books to the top
• Applied to monographs (analogous to a bradfordized list of journals in informetrics): the publisher serves as the sorting criterion
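The re-ranking step itself fits in a few lines: count how often each journal occurs in the result set, then move papers from the most productive (core) journals to the top, keeping the original rank as tie-breaker. A sketch with invented toy data (field names and titles are illustrative):

```python
from collections import Counter

def bradfordize(result_set):
    """Re-rank a result set: papers from the most productive journals
    (the Bradford core of this result set) move to the top; the
    original rank breaks ties within a journal-frequency class."""
    freq = Counter(doc["journal"] for doc in result_set)
    return sorted(result_set,
                  key=lambda d: (-freq[d["journal"]], d["rank"]))

hits = [
    {"rank": 1, "title": "a", "journal": "J. Rare Studies"},
    {"rank": 2, "title": "b", "journal": "Scientometrics"},
    {"rank": 3, "title": "c", "journal": "Scientometrics"},
    {"rank": 4, "title": "d", "journal": "J. Other"},
]
reranked = bradfordize(hits)
```

For monographs the same sketch applies with a `publisher` field in place of `journal`, per the sorting criterion named above.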
8. Scenarios for combined ranking services
• Iterative use: Result Set → Core Journal Papers → Central Author Papers → Relevant Papers
• Simultaneous use: Result Set → Central Author Papers + Core Journal Papers → Relevant Papers
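One plausible reading of the two combination scenarios in code: simultaneous use fuses both service scores into one ranking, iterative use ranks by one service and uses the other as tie-breaker. The score dictionaries and the weighting scheme are my assumptions, not the project's implementation:

```python
def simultaneous(core_score, author_score, w=0.5):
    """Fuse both services into one ranking via a weighted score sum."""
    docs = core_score.keys() | author_score.keys()
    fused = {d: w * core_score.get(d, 0.0)
                + (1 - w) * author_score.get(d, 0.0) for d in docs}
    return sorted(fused, key=fused.get, reverse=True)

def iterative(core_score, author_score):
    """Rank by the first service; the second service breaks ties."""
    docs = core_score.keys() | author_score.keys()
    return sorted(docs,
                  key=lambda d: (core_score.get(d, 0.0),
                                 author_score.get(d, 0.0)),
                  reverse=True)

# hypothetical per-document scores from the two services
core = {"p1": 1.0, "p2": 1.0, "p3": 0.0}
author = {"p1": 0.0, "p2": 0.5, "p3": 1.0}
```

Both orderings agree that p2 (strong in both services) comes first, but they disagree on how much weight a single strong signal like p3's author score carries.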
11. Main Research Issue: contribution to retrieval quality and usability
• Precision:
– Do central authors (core journals) provide more relevant hits?
– Do highly associated cowords have any positive effects?
• Value-adding effects:
– Do central authors (core journals) provide OTHER relevant hits?
– Do coword-relationships provide OTHER relevant search terms?
• Mashup effects:
– Do combinations of the services enhance the effects?
13. Evaluation of Bradfordizing on CLEF Data (Mayr 2013)
Precision between Bradford zones (core, zone 2, zone 3):

                   core   z2     z3
2003 articles      0.29   0.22   0.16
2004 articles      0.23   0.18   0.13
2005 articles      0.31   0.24   0.17
2006 articles      0.29   0.27   0.24
2007 articles      0.28   0.26   0.22
2005 monographs    0.21   0.16   0.19
2006 monographs    0.28   0.28   0.24
2007 monographs    0.24   0.21   0.23

• Journal articles: significant improvement of precision from zone 3 to the core
• Monographs: only a slight improvement in the precision distribution between the three zones
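The per-zone precision values in the table above are straightforward to compute once each judged document carries a zone assignment. A sketch with invented toy judgments (not the CLEF data):

```python
from collections import defaultdict

def zone_precision(judged_docs):
    """Fraction of relevant documents per Bradford zone."""
    relevant, total = defaultdict(int), defaultdict(int)
    for doc in judged_docs:
        total[doc["zone"]] += 1
        relevant[doc["zone"]] += int(doc["relevant"])
    return {z: relevant[z] / total[z] for z in total}

# toy relevance judgments with zone labels
judged = [
    {"zone": "core", "relevant": True},
    {"zone": "core", "relevant": True},
    {"zone": "core", "relevant": False},
    {"zone": "z2",   "relevant": True},
    {"zone": "z2",   "relevant": False},
    {"zone": "z3",   "relevant": False},
]
```

Here the toy core zone reaches a precision of 2/3 while zone 3 yields 0, mirroring the core-over-zone-3 pattern the slide reports for journal articles.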
14. Evaluation of Author Centrality on CLEF Data
• Moderate positive relationship between rate of networking and precision
(scatter plot: precision vs. giant component size; correlation precision@10 – giant size: 0.25)
• Precision of TF-IDF rankings (0.60) significantly higher than author-centrality-based rankings (0.31) – BUT:
• Very little overlap of documents at the top of the ranking lists: 90% of relevant hits provided by author centrality did not appear at the top of TF-IDF rankings
→ added precision of 28%
• Author centrality seems to favor OTHER relevant documents than traditional rankings
• Value-adding effect: another view of the information space

avg number of docs        517
avg number of authors     664
avg number of co-authors  302
avg giant size             24
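The value-adding overlap measure reported above can be sketched as: among the relevant hits in the alternative ranking's top-k, what share is absent from the baseline's top-k? The document IDs, lists, and k are illustrative, not the evaluation data:

```python
def added_share(topk_baseline, topk_alt, relevant):
    """Share of relevant docs in the alternative top-k that do NOT
    appear in the baseline top-k (the value-adding effect)."""
    rel_alt = set(topk_alt) & relevant
    novel = rel_alt - set(topk_baseline)
    return len(novel) / len(rel_alt) if rel_alt else 0.0

# toy top-3 lists from a TF-IDF baseline and a centrality ranking
tfidf_top = ["d1", "d2", "d3"]
centrality_top = ["d4", "d5", "d1"]
relevant = {"d1", "d4", "d5"}
share = added_share(tfidf_top, centrality_top, relevant)
```

In the toy example two of the three relevant centrality hits are new relative to the baseline, the same kind of "other relevant documents" effect the slide quantifies at 90%.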
21. IRM & Modeling Science
• Measuring the contribution of bibliometric-enhanced services to retrieval quality
• → deeper insights into the structure & functioning of science
• Bibliometric-enhanced services (structural attributes of the science system)
• → a way towards a formal model of science
22. References
• Mutschke, P., Mayr, P., Schaer, P., & Sure, Y. (2011). Science models as value-added services for scholarly information systems. Scientometrics, 89(1), 349–364. doi:10.1007/s11192-011-0430-x
• Lüke, T., Schaer, P., & Mayr, P. (2013). A framework for specific term recommendation systems. In Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval - SIGIR '13 (pp. 1093–1094). New York, NY, USA: ACM Press. doi:10.1145/2484028.2484207
• Mayr, P. (2013). Relevance distributions across Bradford Zones: Can Bradfordizing improve search? In J. Gorraiz, E. Schiebel, C. Gumpenberger, M. Hörlesberger, & H. Moed (Eds.), 14th International Society of Scientometrics and Informetrics Conference (pp. 1493–1505). Vienna, Austria. Retrieved from http://arxiv.org/abs/1305.0357
• Hienert, D., Schaer, P., Schaible, J., & Mayr, P. (2011). A novel combined term suggestion service for domain-specific digital libraries. In S. Gradmann, F. Borri, C. Meghini, & H. Schuldt (Eds.), International Conference on Theory and Practice of Digital Libraries (TPDL) (pp. 192–203). Berlin: Springer. doi:10.1007/978-3-642-24469-8_21