1. The MESUR project: an overview and update
Johan Bollen
Indiana University
School of Informatics and Computing
Center for Complex Networks and System Research
p y
jbollen@indiana.edu
Acknowledgements:
Herbert Van de Sompel (LANL), Marko A. Rodriguez (LANL), Ryan Chute (LANL),
Lyudmila L. Balakireva (LANL), Aric Hagberg (LANL), Luis Bettencourt (LANL)
Research supported by the NSF and Andrew W. Mellon Foundation.
School of Informatics and Computing
Indiana University
November 5th, 2009
2. When the obvious is staring you in the face
School of Informatics and Computing
Indiana University
November 5th, 2009
3. The scientific process: the importance of early indicators
(Egghe & Rousseau, 2000; Wouters, 1997)
(Brody, Harnad, & Carr 2006),
Usage data
Citation: final products
• Scale, cf. Elsevier downloads
• Publication delays
(+1B) vs. Wos citations (650M)
vs
• Focus on publications
• Immediate, early stages
• Focus on authors
• Variety of resources and actors
School of Informatics and Computing
Indiana University
November 5th, 2009
4. What is MESUR?
MESUR IS: Scientific project to study Science itself
from REAL-TIME indicators.
Foundations:
• Very large-scale, representative usage data (10^9)
• Network science and social network analysis
• Complex systems, complex networks
systems
Outcomes
• Surveys of novel impact metrics and their properties
• Network models of Science
Unrelated picture of my daughter (who
wants to be a scientist)
MESUR IS NOT:
Commercial endeavor, end-user service development, promoter of
particular impact metrics, enemy nor friend of particular
metrics, advocacy group, …
School of Informatics and Computing
Indiana University
November 5th, 2009
5. Timeline and development
• 2006-2008:
o Andrew W. Mellon Foundation
o Digital Library Research and Prototyping team Los Alamos National
team,
Laboratory
o Collection of large-scale usage data from some of world’s most
significant publishers, aggregators and institutional consortia
o Feasibility: Usage data, usage-based network models of science,
usage-based impact metrics
• 2009 – infinity and beyond:
o NSF funding (SciSIP, 2009 2012)
f di (S iSIP 2009-2012)
o Indiana University, School of Informatics and Computing
• 2010: Andrew W. Mellon foundation
o Continuation of MESUR data collection and scientific work
o Investigate evolving to sustainable, open, community-supported
infrastructure
School of Informatics and Computing
Indiana University
November 5th, 2009
6. Presentation structure
1. MESUR’s Usage reference data set
g
2. Mapping scientific activity
3. Metrics survey
4. Future research
5. Discussion
School of Informatics and Computing
Indiana University
November 5th, 2009
7. Creating the MESUR usage reference data set
1B
2006-2008: Collaborating publishers, aggregators and institutional consortia:
• BMC, Blackwell, UC, CSU (23), EBSCO, ELSEVIER, EMERALD, INGENTA, JSTOR,
LANL, MIMAS/ZETOC, THOMSON, UPENN (9), UTEXAS
• Scale:
o > 1,000,000,000 usage events, and growing…
o +50M articles, +-100,000 serials
50M ti l 100 000 i l
• Period: 2002-2007, but mostly 2006
School of Informatics and Computing
Indiana University
November 5th, 2009
8. Data normalization and ingestion
Minimal requirements for all usage data
• Unique usage events (article level)
• Fields: unique session ID, date/time, unique document ID and/or metadata,
request type
• Note difference with usage statistics
2007 9 1 0 0 1 CFA cffoe A172080.N1.Vanderbilt.Edu unknown AST A 1996SPIE.2828..64S http://foe.edu/abs/1996SPIE.2828..64S http://www.google.com
2007 9 1 0 0 1 CFA cffoe 210.94.41.89 unknown PHY A 2007ApPhL.90a2120C http://foe.edu/abs/2007ApPhL.90a2120C http://www.google.co.kr
2007 9 1 0 0 1 CFA cffoe 24-196-228-125.dhcp.gwnt.ga.charter.com unknown AST A 2000ASPC.213.333S http://foe.edu/abs/2000bioa.conf.333S http://scholar.go
2007 9 1 0 0 4 CFA cffoe 163.152.35.114 4700387eae PHY A 1993WRR..29.133S http://foe.edu/abs/1993WRR..29.133S http://scholar.google.com
2007 9 1 0 0 6 CFA cffoe pd9e980fc dip0 t ipconnect de 45f0c69881
pd9e980fc.dip0.t-ipconnect.de AST X 2007AN 328 841H http://arXiv org/abs/0708 1863 http://foe edu
2007AN..328.841H http://arXiv.org/abs/0708.1863 http://foe.edu
2007 9 1 0 0 1 CFA cffoe A172080.N1.Vanderbilt.Edu unknown AST A 1996SPIE.2828..64S http://foeabs.edu/abs/1996SPIE.2828..64S http://www.google.com
2007 9 1 0 0 1 CFA cffoe 210.94.41.89 unknown PHY A 2007ApPhL.90a2120C http://foeabs.edu/abs/2007ApPhL.90a2120C http://www.google.co.kr
2007 9 1 0 0 1 CFA cffoe 24-196-228-125.dhcp.gwnt.ga.charter.com unknown AST A 2000ASPC.213.333S http://foeabs.edu/abs/2000bioa.conf.333S http://schola
2007 9 1 0 0 4 CFA cffoe 163.152.35.114 4700387eae PHY A 1993WRR..29.133S http://foeabs.edu/abs/1993WRR..29.133S http://scholar.google.com
2007 9 1 0 0 6 CFA cffoe pd9e980fc.dip0.t-ipconnect.de 45f0c69881 AST X 2007AN..328.841H http://arXiv.org/abs/0708.1863 http://foeabs.edu
2007 9 1 0 0 6 CFA cffoe foel25144.4u.com.gh 47002f8eda PHY A 2002AGUFM.S21A0965M http://foeabs.edu/abs/2002AGUFM.S21A0965M http://www.goo
2007 9 1 0 0 6 CFA cffoe 66-215-171-214.dhcp.ccmn.ca.charter.com 4681d22a6f AST A 2001P&SS..49.657R http://foeabs.edu/cgi-bin/bib_query?bibcode=2001P%
2007 9 1 0 0 7 CFA cffoe nat ptouser3 uspto gov unknown PHY A
nat-ptouser3.uspto.gov 2005ApPhL 86g2106M http://foeabs edu/abs/2005ApPhL 86g2106M
2005ApPhL.86g2106M http://foeabs.edu/abs/2005ApPhL.86g2106M http://www google com
http://www.google.com
2007 9 1 0 0 7 CFA cffoe cpe-71-65-25-115.ma.res.rr.com unknown PHY A 1980SPIE.205.153S http://foeabs.edu/abs/1980SPIE.205.153S http://www.google.com
2007 9 1 0 0 7 CFA cffoe customer3491.pool1.unallocated-106-0.orangehomedsl.co.uk unknown PHY A 1983ElL..19.883V http://foeabs.edu/abs/1983ElL..19.883V
2007 9 1 0 0 8 CFA cffoe Uranus.seas.ucla.edu 46672d96b2 PHY A 1966Phy..32.385K http://foeabs.edu/abs/1966Phy..32.385K http://www.google.com
2007 9 1 0 0 9 CFA cffoe 75-121-173-37.dyn.centurytel.net 46cf1fd8a6 AST D 1984ApJS..56.257J http://vizier.cfa.edu/viz-bin/VizieR?-source=III/92/ http://foe
2007 9 1 0 0 13 CFA cffoe foel17-18.kln.forthnet.gr unknown AST A 1987cosm.book...C http://foeabs.edu/abs/1987cosm.book...C http://www.google.gr
2007 9 1 0 0 15 CFA cffoe hades.astro.uiuc.edu 46f707564d PRE A 2007arXiv0707.3146N http://foeabs.edu/abs/2007arXiv0707.3146N http://foeabs.edu
2007 9 1 0 0 17 CFA cffoe ool-43554752.dyn.optonline.net unknown PHY A 2000PhTea.38.132K http://foeabs.edu/abs/2000PhTea.38.132K http://www.google.com
2007 9 1 0 0 17 CFA cffoe c 68 33 176 222 hsd1 md comcast net unknown GEN A
c-68-33-176-222.hsd1.md.comcast.net 1994RSPSB.256.177M http://foeabs.edu/abs/1994RSPSB.256.177M
1994RSPSB 256 177M http://foeabs edu/abs/1994RSPSB 256 177M http://w
2007 9 1 0 0 19 CFA cffoe 74-36-139-46.dr02.brvl.mn.frontiernet.net unknown AST T 2002SPIE.4767.114W http://foeabs.edu/cgi-bin/nph-abs_connect?bibcode=20
2007 9 1 0 0 19 CFA cffoe c-76-16-53-120.hsd1.il.comcast.net 46f667b71b AST F 1916PA...24.613L http://articles.foeabs.edu/cgi-bin/nph-iarticle_query?1916PA
2007 9 1 0 0 20 CFA cffoe 74-39-37-62.nas03.roch.ny.frontiernet.net unknown PHY E 2007JSTEd.tmp..29B http://dx.doi.org/10.1007/s10972-007-9067-2 http://fo
2007 9 1 0 0 22 ANU bio-mirror uatu-virtual1.anu.edu.au 46f9e8f87f AST A 2006ApJ..647.128E http://foe.grangenet.net/abs/2006ApJ..647.128E http://foe
2007 9 1 0 0 22 CFA cffoe fw.hia.nrc.ca 46f1531d59 AST A 2002P&SS..50.745H http://foeabs.edu/abs/2002P%26SS..50.745H http://foeabs.edu
2007 9 1 0 0 22 CFA cffoe 24-117-0-220.cpe.cableone.net unknown AST A 1984BITA..15.268S http://foeabs.edu/abs/1984BITA..15.268S http://www.google.com
2
School of Informatics and Computing
Indiana University
November 5th, 2009
9. Presentation structure
1. MESUR’s Usage reference data set
g
2. Mapping scientific activity
3. Metrics survey
4. Future research
5. Discussion
School of Informatics and Computing
Indiana University
November 5th, 2009
10. Data set: subset of MESUR
• Common time period:
p
o March 1st 2006 - February 1st 2007
o Thomson Scientific (Web of Science),
Elsevier (Scopus), JSTOR, Ingenta,
University of Texas ( campuses, 6
y (9 p ,
health institutions), and California State
University (23 campuses)
• 346,312,045
346 312 045 usage events
• 97,532 serials (many of which not
journals)
School of Informatics and Computing
Indiana University
November 5th, 2009
11. How to generate a usage network.
Same session ~ documents relatedness
• Same session, same user: common interest
• FFrequency of co-occurrence = d
f degree of
f
relationship
• Normalized: conditional probability
Usage data is on article level:
• Works for journals and articles
• Anything for which usage was recorded
Note: not something we invented: association rule
learning in data mining.
Beer and diapers!
School of Informatics and Computing
Indiana University
November 5th, 2009
12. Johan Bollen, Herbert Van de Sompel, Aric Hagberg,Luis
Bettencourt, Ryan Chute, Marko A. Rodriguez, Lyudmila
Balakireva. Clickstream data yields high-resolution maps
of science. PLoS One, February 2009.
School of Informatics and Computing
Indiana University
November 5th, 2009
13. Network science for impact metrics.
PageRank
PR(vi): PageRank of node vi
O(vj): out-degree of journal vj
N: number of nodes in network
L: dampening factor
Betweenness centrality
: Number of geodesics between vi and vj
School of Informatics and Computing
Indiana University
November 5th, 2009
14. Presentation structure
1. MESUR’s Usage reference data set
g
2. Mapping scientific activity
3. Metrics survey
4. Future research
5. Discussion
School of Informatics and Computing
Indiana University
November 5th, 2009
15. A variety of impact metrics
Note:
• Metrics can be calculated
both on citation and
usage data
• “Frequentist”
o Citation and usage
rates
• “Structural”
o Citation graph, e.g.
2005 JCR
o Usage graph, e.g.
created by MESUR
• H-index, G-index, SJR,
What d th
Wh t do they MEAN? etc
t
What facets of impact do they represent?
Which are best suited?
School of Informatics and Computing
Indiana University
November 5th, 2009
16. Set of metrics calculated on MESUR data set
School of Informatics and Computing
Indiana University
November 5th, 2009
17. The MESUR Metrics Map
BETWEENNESS
PAGERANK(S)
USAGE METRICS
TOTAL CITES
Johan Bollen, Herbert Van de Sompel, Aric
Hagberg and Ryan Chute. A Principal
g g y p
Component Analysis of 39 Scientific Impact
Measures. PLoS ONE, June 2009. URL:
http://dx.plos.org/10.1371/journal.pone.0006
022.
RATE METRICS
School of Informatics and Computing
Indiana University
November 5th, 2009
18. Presentation structure
1. MESUR’s Usage reference data set
g
2. Mapping scientific activity
3. Metrics survey
4. Future research
5. Discussion
School of Informatics and Computing
Indiana University
November 5th, 2009
19. Samples of future work (can be skipped)
• Longitudinal studies:
o Network changes over time: collaboration with Carl Bergstrom (UW)
o Prediction f innovation using random walk models
P di ti of i ti i d lk d l
• Logistics:
o Expand existing data set: focus on standardization, repeatability
o Establish continued funding, good home for project
o “Center” model: rather than data->scientists, scientists->data
School of Informatics and Computing
Indiana University
November 5th, 2009
20. Animated maps: tracing bursts of scientific activity
p g y
School of Informatics and Computing
Indiana University
November 5th, 2009
21. Coordinated bursts
3
2
1
School of Informatics and Computing
Indiana University
November 5th, 2009
22. MESUR Mapping and ranking services
School of Informatics and Computing
Indiana University
November 5th, 2009
23. MESUR Mapping and ranking services
School of Informatics and Computing
Indiana University
November 5th, 2009
24. MESUR Mapping and ranking services
School of Informatics and Computing
Indiana University
November 5th, 2009
25. MESUR: the good ...
After 3 years of MESUR:
• Scientific exploration of metrics for scholarly evaluation
• Creation of large-scale reference data set
• Mapping science from the viewpoint of users: there is structure!
pp g p
• Variety of metrics that cover various aspects of scholarly impact and prestige
• MESUR dataset contains many more pearls for future research
• Foundation for future continued research program:
p g
• Longitudinal studies
• Models of collective behavior of scientists
School of Informatics and Computing
Indiana University
November 5th, 2009
26. MESUR: the bad and the ugly …
Scalability of the approach:
• Lengthy negotiations to obtain log data
• No infrastructure standards (yet): Recording, aggregating,
normalization, ingestion, de-duplication,…
• No generally accepted policies: privacy, property, …
• No census data: when is a sample large and representative enough?
Quality control:
• Bots, Crawlers (detectable but never perfect)
• Cheating manipulation (easier with usage statistics than network
Cheating,
metrics)
Acceptance:
p
• Network-based usage metrics require session information. This is
overlooked! As a result, will we end up with usage-based statistics only?
• “As simple as possible, but not more simple!”
School of Informatics and Computing
Indiana University
November 5th, 2009
27. Moving towards community involvement
“
Registration is now open for "Scholarly Evaluation Metrics: Opportunities and
Challenges", a one-day NSF-funded workshop that will take place in the Renaissance
Washington
W hi t DC H t l on W d
Hotel Wednesday, December 16th 2009. P ti i ti i thi workshop
d D b 2009 Participation in this k h
is limited to 50 people. Registration is free
at http://informatics.indiana.edu/scholmet09/registration.html.
The topic of the workshop is the future of scholarly assessment approaches, including
p p y pp , g
organizational, infrastructural, and community issues. The overall goal is to identify
requirements for novel assessment approaches, several of which have been proposed in
recent years, to become acceptable to community stakeholders including scholars, academic
and research institutions, and funding agencies. The impressive group of speakers and
panelists for the workshop includes representatives from each of these constituencies.
Further details are available at http://informatics.indiana.edu/scholmet09/announcement.html
Workshop organizers:
Johan Bollen (jbollen@indiana.edu),
Herbert Van de Sompel (hvdsomp@gmail.com) and
Ying Ding (dingying@indiana.edu)
“
School of Informatics and Computing
Indiana University
November 5th, 2009
28. Moving towards community involvement
Planning process underway to establish sustainable, open, community supported infrastructure.
New support from Andrew W. Mellon foundation to figure it all out.
Logistics: Science:
Data aggregation Metrics
Normalization Analysis
Data-related services Prediction
Data management
Services =More than sum of parts:
Ranking • Each component supports the other
Assessment • Various business and funding models
Mapping • Generate added value on all levels
Can fundamentally change scholarly
communication
School of Informatics and Computing
Indiana University
November 5th, 2009
29. Some relevant publications.
Johan Bollen, Herbert Van de Sompel, Aric Hagberg, Luis Bettencourt, Ryan Chute,
Marko A. Rodriguez, Lyudmila Balakireva. Clickstream data yields high-
resolution maps of science. PLoS One, March 2009 (In Press)
Johan Bollen, Herbert Van de Sompel, Aric HagBerg, Ryan Chute. A principal
component analysis of 39 scientific impact measures.
arXiv.org/abs/0902.2183
Johan Bollen, Marko A. Rodriguez, and Herbert Van de Sompel. Journal status. Scientometrics, 69(3),
December 2006 (arxiv.org:cs.DL/0601030)
( g )
Johan Bollen, Herbert Van de Sompel, and Marko A. Rodriguez. Towards usage-based impact metrics: first
results from the MESUR project. In Proceedings of the Joint Conference on Digital Libraries, Pittsburgh,
June 2008
Marko A. Rodriguez, Johan Bollen and Herbert Van de Sompel. A Practical Ontology for the Large-Scale
Modeling of Scholarly Artifacts and their Usage, In Proceedings of the Joint Conference on Digital
Libraries, Vancouver,
Libraries Vancouver June 2007
Johan Bollen and Herbert Van de Sompel. Usage Impact Factor: the effects of sample characteristics on
usage-based impact metrics. (cs.DL/0610154)
Johan Bollen and Herbert Van de Sompel. An architecture for the aggregation and analysis of scholarly
usage data. In Joint Conference on Digital Libraries (JCDL2006), pages 298-307, June 2006.
Johan Bollen and Herbert Van de Sompel. Mapping the structure of science through usage. Scientometrics,
69(2), 2006.
Johan Bollen, Herbert Van de Sompel, Joan Smith, and Rick Luce. Toward alternative metrics of journal
impact: a comparison of download and citation data. Information Processing and Management,
41(6):1419-1440, 2005.
School of Informatics and Computing
Indiana University
November 5th, 2009
30. Presentation structure
1. MESUR’s Usage reference data set
g
2. Mapping scientific activity
3. Metrics survey
4. Future research
5. Discussion
School of Informatics and Computing
Indiana University
November 5th, 2009