Advanced citation matching and large-scale cited reference extraction
Nees Jan van Eck
Centre for Science and Technology Studies (CWTS), Leiden University
EXCITE Workshop 2017: “Challenges in Extracting and Managing References”
Cologne, Germany, March 30, 2017
Outline
• Citation matching
– Comparison of the accuracy of the Web of Science, CWTS, and
iFQ citation matching algorithms
• Cited reference extraction
– Assessment of the accuracy of cited references in Web of Science
based on Elsevier ScienceDirect data
Accuracy of the WoS, CWTS, and iFQ citation matching algorithms
Citation matching problem
[Diagram: entries in a reference list are matched to records in a bibliographic database (WoS, Scopus)]
• Reference list
– [1] Hirsch, JE (2005) PNAS, 102, p.16569
– [2] Egghe, L (2006) Scientist, 20, p.15
• Bibliographic database records
– An index to quantify an individual's scientific research output. Hirsch, JE. PNAS, 102(46), p.16569-72. UT: 000233462900010
– How to improve the h-index. Egghe, L. The Scientist, 20(3), p.15. UT: 000235634200013
Why is citation matching difficult?
• ‘Big data’ problem
– No. of publications: 50 million
– No. of cited references: 1 billion
• Little data available on cited references in WoS
– First author (last name and initials)
– Source title (abbreviated)
– Publication year
– Volume number
– First page number
– (DOI)
• Errors in data
– Citation extraction errors
• OCR errors
• Interpretation errors due to different citation styles
– Typos and other human errors
/A Olensky, M
/Y 2015
/W J ASS INFORM SCI TEC
/V 67
/P 2550
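The example record above can be read into a small data structure. The following is a minimal sketch, assuming that the slash-prefixed tags stand for first author (/A), publication year (/Y), abbreviated source title (/W), volume (/V), and first page (/P). The class `CitedReference`, the mapping `TAGS`, and the function `parse_tagged_reference` are hypothetical names introduced for illustration only; they are not part of WoS or of any algorithm discussed in this talk.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Assumed meaning of the tags in the example record (not documented in the slides).
TAGS = {"A": "first_author", "Y": "year", "W": "source", "V": "volume", "P": "first_page"}


@dataclass
class CitedReference:
    """The limited set of fields available for a cited reference."""
    first_author: Optional[str] = None
    year: Optional[int] = None
    source: Optional[str] = None
    volume: Optional[str] = None
    first_page: Optional[str] = None
    doi: Optional[str] = None  # only sometimes available


def parse_tagged_reference(lines: Iterable[str]) -> CitedReference:
    """Build a CitedReference from lines such as '/Y 2015'."""
    ref = CitedReference()
    for line in lines:
        line = line.strip()
        if not line.startswith("/") or " " not in line:
            continue  # skip anything that is not '/TAG value'
        tag, value = line[1:].split(" ", 1)
        field = TAGS.get(tag)
        if field is None:
            continue  # unknown tag
        value = value.strip()
        if field == "year":
            ref.year = int(value) if value.isdigit() else None
        else:
            setattr(ref, field, value)
    return ref


example = parse_tagged_reference(
    ["/A Olensky, M", "/Y 2015", "/W J ASS INFORM SCI TEC", "/V 67", "/P 2550"]
)
print(example)
```

Volume and page are kept as strings in this sketch because cited reference data may contain non-numeric values.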
Citation matching algorithm of WoS
• Little is known about the citation matching
algorithm used in WoS
• Larsen et al. (2007) concluded from their
investigation of missed matches in WoS that the
algorithm is quite conservative and does not allow
for any variations
Citation matching algorithm of CWTS
• The aim is to overcome the problem of missed
citation matches in WoS
• Iterative, rule-based algorithm:
1. Preprocessing
2. Start with the most restrictive matching rules
3. Continue with less restrictive matching rules
• Less restrictive matching rules allow for various
types of inaccuracies in the cited reference data
Example matching rules
• Most restrictive matching rule:
– Exact match on
• first author
• publication year
• publication name
• volume number
• starting page number
• DOI
• Less restrictive matching rule:
– Match on
• Soundex encoding of the last name of the first author
• publication year plus or minus one
• volume number
• starting page number
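The two example rules can be sketched in code as follows. This is an illustration under stated assumptions, not the CWTS implementation: the `Record` class, the rule functions, and the decision to accept a rule only when it yields exactly one candidate are all hypothetical, and the preprocessing step is omitted. The `soundex` function follows the standard American Soundex scheme. The sample data reuse the Hirsch (2005) example; the misspelled name and the shifted year are invented to trigger the less restrictive rule.

```python
from dataclasses import dataclass
from typing import List, Optional

# Standard American Soundex letter codes.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Return the four-character Soundex code of a name."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with equal codes
        code = _CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        prev = code  # vowels reset prev, so repeated codes are kept
    return (encoded + "000")[:4]


@dataclass(frozen=True)
class Record:
    first_author: str  # "Last name, initials"
    year: int
    source: str
    volume: str
    first_page: str
    doi: str = ""


def last_name(author: str) -> str:
    return author.split(",")[0].strip()


def strict_rule(ref: Record, pub: Record) -> bool:
    """Most restrictive rule: exact match on all fields, including DOI."""
    return ref == pub


def relaxed_rule(ref: Record, pub: Record) -> bool:
    """Less restrictive rule: Soundex of last name, year +/- 1, volume, page."""
    return (soundex(last_name(ref.first_author)) == soundex(last_name(pub.first_author))
            and abs(ref.year - pub.year) <= 1
            and ref.volume == pub.volume
            and ref.first_page == pub.first_page)


RULES = [strict_rule, relaxed_rule]  # applied from most to least restrictive


def match(ref: Record, publications: List[Record]) -> Optional[Record]:
    """Return the publication found by the first rule with a unique hit."""
    for rule in RULES:
        hits = [pub for pub in publications if rule(ref, pub)]
        if len(hits) == 1:  # assumption: ambiguous hits are not accepted
            return hits[0]
    return None


pub = Record("Hirsch, JE", 2005, "PNAS", "102", "16569")
ref_with_errors = Record("Hirsh, JE", 2006, "PNAS", "102", "16569")
print(match(ref_with_errors, [pub]))
```

In this sketch the strict rule fails for `ref_with_errors` because the name and year differ, while the relaxed rule still finds the publication: "Hirsh" and "Hirsch" share the same Soundex code and the years differ by one.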
Citation matching algorithm of iFQ
• Iterative, rule-based algorithm
• Allows non-unique matches of a single cited
reference with several target articles
Data collection (1)
• Builds on data collected by Olensky (2015)
• Sample of 300 publications (cited pubs)
– 2 science domains
– 6 disciplines
– 2 languages
– 2 publication years
• 3975 corresponding cited references in WoS
– Times cited used to find cited references that are linked in WoS
– Cited reference search used to find cited references that are not
linked in WoS
Data collection (2)
[Diagram: 300 cited pubs and 3975 citing pubs, divided into "linked in WoS" and "not linked in WoS"]
Results

All matches (precision)
                     WoS             CWTS            iFQ
                     #       %       #       %       #       %
Correct matches      3664    99.2    3855    98.8    3856    98.5
Incorrect matches    29      0.8     45      1.2     57      1.5

All citations (recall)
                     WoS             CWTS            iFQ
                     #       %       #       %       #       %
Correct matches      3664    93.8    3855    98.6    3856    98.7
Missed matches       244     6.2     53      1.4     52      1.3

                     WoS             CWTS            iFQ
F1 score             96.4            98.7            98.6
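The percentages in these tables follow from the counts: precision is the share of correct matches among all matches made, recall is the share of correct matches among all citations that should have been matched, and the F1 score is the harmonic mean of the two. The sketch below recomputes them from the counts reported above; the function name `precision_recall_f1` is chosen here for illustration.

```python
def precision_recall_f1(correct: int, incorrect: int, missed: int):
    """Compute precision, recall, and F1 from match counts."""
    precision = correct / (correct + incorrect)
    recall = correct / (correct + missed)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# (correct, incorrect, missed) per algorithm, taken from the results tables.
counts = {
    "WoS": (3664, 29, 244),
    "CWTS": (3855, 45, 53),
    "iFQ": (3856, 57, 52),
}

for name, (correct, incorrect, missed) in counts.items():
    p, r, f1 = precision_recall_f1(correct, incorrect, missed)
    print(f"{name}: precision={100 * p:.1f} recall={100 * r:.1f} F1={100 * f1:.1f}")
```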
Qualitative analysis of missed matches
(WoS: 244; CWTS: 53; iFQ: 52)
Changes in CWTS citation matching
algorithm
• Introduction of matching rules in which:
1. Volume and issue number are interchanged
2. Volume and first page number are interchanged
• Small change in the order in which the matching
rules are applied
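A rule with interchanged fields can be sketched as follows. This is a hypothetical illustration, not the revised CWTS rule itself: the `Ref` class, the comparison key, and the helper names are assumptions. The idea is to generate variants of the cited reference in which two fields are swapped and to test each variant against the target publication. The sample values reuse the Hirsch record (volume 102, issue 46, first page 16569); the swapped reference is invented.

```python
from dataclasses import dataclass, replace
from typing import Iterator


@dataclass(frozen=True)
class Ref:
    last_name: str
    year: int
    volume: str
    issue: str
    first_page: str


def key(r: Ref):
    """Fields compared in this sketch (the issue number is not compared)."""
    return (r.last_name.casefold(), r.year, r.volume, r.first_page)


def swapped_variants(ref: Ref) -> Iterator[Ref]:
    # 1. Volume and issue number are interchanged.
    yield replace(ref, volume=ref.issue, issue=ref.volume)
    # 2. Volume and first page number are interchanged.
    yield replace(ref, volume=ref.first_page, first_page=ref.volume)


def matches_with_swaps(ref: Ref, pub: Ref) -> bool:
    """True if any swapped variant of ref agrees with pub on the key fields."""
    return any(key(variant) == key(pub) for variant in swapped_variants(ref))


pub = Ref("Hirsch", 2005, volume="102", issue="46", first_page="16569")
issue_as_volume = Ref("Hirsch", 2005, volume="46", issue="102", first_page="16569")
page_as_volume = Ref("Hirsch", 2005, volume="16569", issue="46", first_page="102")

print(matches_with_swaps(issue_as_volume, pub))
print(matches_with_swaps(page_as_volume, pub))
```

Because such rules accept more variation, they would normally be tried only after the more restrictive rules have failed.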
Improved results

All matches (precision)
                     CWTS (current)    CWTS (revised)
                     #       %         #       %
Correct matches      3888    99.6      3906    99.8
Incorrect matches    16      0.4       9       0.2

All citations (recall)
                     CWTS (current)    CWTS (revised)
                     #       %         #       %
Correct matches      3888    98.7      3906    99.1
Missed matches       53      1.3       35      0.9

                     CWTS (current)    CWTS (revised)
F1 score             99.1              99.4
Conclusions
• A significant number of citation matches are
missing in WoS
• Substantial improvement in recall is possible, but at
the cost of a small decrease in precision
• Citation matching algorithm of CWTS performs
quite well
• During the analysis, various problems were
detected in WoS cited reference extraction
Accuracy of WoS cited reference extraction
Introduction
• Aim: To determine the accuracy of WoS cited
references data
• Approach: Comparison of the cited references
extracted from the full text of Elsevier publications
with the cited references available in WoS
Data
• Elsevier full text data
– ScienceDirect API
– Subscription-based journal publications in the period 1987-2016
• WoS metadata
– Document types ‘article’ and ‘review’
• Matching of Elsevier full-text data and WoS metadata
at the level of individual publications
Validation of missing cited references in WoS
• Publication year: 2015
• Number of missing cited references: 73,536
• Sample size: 60
– Missing cited reference: 33 (55.0%)
– Incorrect cited reference: 10 (16.7%)
– Error in metadata of cited reference (e.g., incorrect
publication year or incorrect volume number): 16 (26.7%)
– Correct cited reference: 1 (1.7%)
Missing cited references in WoS (1) to (9)
[Example slides; images not reproduced]
Incorrect cited references in WoS (1) to (5)
[Example slides; images not reproduced]
Incorrect cited references in WoS (6)
• WoS cited reference: WANG J, 2006, CHINESE CHEM LETT, V17, P49
– Original cited reference in publication: J. Wang, J.K. Carson, M.F. North, D.J. Cleland, Int. J. Heat Mass Transfer 49 (17) (2006) 3075–3083.
• WoS cited reference: KANBER B, 2013, CEREBROVASC DIS S2, V35, P21
– Original cited reference in publication: Kanber B, Hartshorne TC, Horsfield MA, Naylor AR, Robinson TG, Ramnarine KV. Dynamic variations in the ultrasound gray-scale median of carotid artery plaques. Cardiovasc Ultrasound 2013a;11:21.
• WoS cited reference: EVANS P, 2010, TLS-TIMES LIT S 0326, P30
– Original cited reference in publication: Evans PD, Chowdhury MJA. Photoprotection of wood using polyester-type UVabsorbers derived from the reaction of 2 hydroxy-4(2,3-epoxypropoxy)-benzophenone with dicarboxylic acid anhydrides. J Wood Chem Technol 2010;30:186e204.
Incorrect cited references in WoS (7)
• WoS cited reference: CAO X, 2010, IEEE GLOBECOMM 2010, V2010, P1
– Original cited reference in publication: Cao, X., Zong, Z., Ju, X., Sun, Y., Dai, C., Liu, Q., Jiang, J., 2010. Molecular cloning, characterization and function analysis of the gene encoding HMG-CoA reductase from Euphorbia Pekinensis Rupr. Mol. Biol. Rep. 37, 1559e1567.
• WoS cited reference: LI XY, 2013, NANJING NONGYE DAXUE, V36, P36
– Original cited reference in publication: X. Li, S. Wang, Y. Chen, G. Liu, X. Yang, Overexpression of CD40 in sacral chordomas and its correlation with low tumor recurrence, Onkologie 36 (10) (2013) 567–571
• WoS cited reference: ZHANG K, 2014, IEEE T PATTERN ANAL, V1, P1
– Original cited reference in publication: K. Zhang, H. Chen, G. Wu, K. Chen, H. Yang, High expression of SPHK1 in sacral chordoma and association with patients’ poor prognosis, Med. Oncol. 31 (11) (2014) 247.
More cited references in WoS (1) to (6)
[Example slides; images not reproduced]
Conclusions
• About 0.3% of cited references are missing in WoS
• About 0.2% of cited references in WoS have minor
errors (e.g., incorrect publication year or volume
number)
• About 0.1% of cited references in WoS have major
errors (i.e., reference to completely incorrect target
document)
• WoS does a good job in handling references
pointing to multiple target documents
• These results are based on Elsevier publications
only; publications from other publishers may yield
different outcomes
Thank you for your attention!
Isotopic evidence of long-lived volcanism on Io
 
The Philosophy of Science
The Philosophy of ScienceThe Philosophy of Science
The Philosophy of Science
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based Nanomaterials
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
 
Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questions
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 

Advanced citation matching and large-scale cited reference extraction

  • 1. Advanced citation matching and large-scale cited reference extraction Nees Jan van Eck Centre for Science and Technology Studies (CWTS), Leiden University EXCITE Workshop 2017: “Challenges in Extracting and Managing References” Cologne, Germany, March 30, 2017
  • 2. Outline • Citation matching – Comparison of the accuracy of the Web of Science, CWTS, and iFQ citation matching algorithms • Cited reference extraction – Assessment of the accuracy of cited references in Web of Science based on Elsevier ScienceDirect data 1
  • 3. Accuracy of the WoS, CWTS, and iFQ citation matching algorithms 2
  • 4. 3
  • 5. Citation matching problem 4 … References [1] Hirsch, JE (2005) PNAS, 102, p.16569 [2] Egghe, L (2006) Scientist, 20, p.15 … An index to quantify an individual's scientific research output Hirsch, JE PNAS, 102(46), p.16569-72 UT: 000233462900010 Abstract … How to improve the h- index Egghe, L The Scientist, 20(3), p.15 UT: 000235634200013 Abstract … Bibliographic database WoS, Scopus A B C
  • 6. Why is citation matching difficult? • ‘Big data’ problem – No. of publications: 50 million – No. of cited references: 1 billion • Little data available on cited references in WoS – First author (last name and initials) – Source title (abbreviated) – Publication year – Volume number – First page number – (DOI) • Errors in data – Citation extraction errors • OCR errors • Interpretation errors due to different citation styles – Typos and other human errors 5 /A Olensky, M /Y 2015 /W J ASS INFORM SCI TEC /V 67 /P 2550
  • 7. Citation matching algorithms of WoS • Little is known about the citation matching algorithm used in WoS • Larsen et al. (2007) concluded from their investigation of missed matches in WoS that the algorithm is quite conservative and does not allow for any variations 6
  • 8. Citation matching algorithm of CWTS • The aim is to overcome the problem of missed citation matches in WoS • Iterative, rule-based algorithm: 1. Preprocessing 2. Start with the most restrictive matching rules 3. Continue with less restrictive matching rules • Less restrictive matching rules allow for various types of inaccuracies in the cited reference data 7
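The iterative procedure on slide 8 can be sketched in Python as a cascade of rules tried in order. This is only an illustration, not the CWTS implementation: the field names (`author`, `year`, `source`, `volume`, `page`), the two rules, and the policy of accepting a rule only when it yields exactly one candidate are all assumptions made for the example.

```python
from typing import Callable, Optional

Record = dict  # a cited reference or a publication as a plain dict of fields


def preprocess(rec: Record) -> Record:
    """Step 1: normalise string fields (trim whitespace, upper-case)."""
    return {k: v.strip().upper() if isinstance(v, str) else v
            for k, v in rec.items()}


def rule_exact(ref: Record, pub: Record) -> bool:
    """Hypothetical most restrictive rule: all fields must agree."""
    keys = ("author", "year", "source", "volume", "page")
    return all(ref.get(k) == pub.get(k) for k in keys)


def rule_without_source(ref: Record, pub: Record) -> bool:
    """Hypothetical less restrictive rule: ignore the source title."""
    keys = ("author", "year", "volume", "page")
    return all(ref.get(k) == pub.get(k) for k in keys)


# Steps 2 and 3: rules ordered from most to least restrictive.
RULES: list[Callable[[Record, Record], bool]] = [rule_exact, rule_without_source]


def match(ref: Record, pubs: list[Record], rules=RULES) -> Optional[Record]:
    """Return the publication matched by the first rule that gives a
    unique candidate, or None if no rule does (assumed policy)."""
    ref = preprocess(ref)
    for rule in rules:
        candidates = [p for p in pubs if rule(ref, preprocess(p))]
        if len(candidates) == 1:
            return candidates[0]
    return None
```

Because restrictive rules run first, a looser rule is only consulted for references that the stricter ones could not resolve, which limits the number of incorrect matches the looser rules can introduce.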
  • 9. Example matching rules • Most restrictive matching rule: – Exact match on • first author • publication year • publication name • volume number • starting page number • DOI • Less restrictive matching rule: – Match on • Soundex encoding of the last name of the first author • publication year plus or minus one • volume number • starting page number 8
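The two example rules on slide 9 might look as follows in Python. This is a hedged sketch: the record layout, the `last_name` helper, and the "Last, Initials" author format are assumptions, and the Soundex function follows the common American Soundex convention, which may differ in detail from the variant used at CWTS.

```python
# Letter-to-digit table of American Soundex.
_CODES = {}
for digit, letters in (("1", "BFPV"), ("2", "CGJKQSXZ"), ("3", "DT"),
                       ("4", "L"), ("5", "MN"), ("6", "R")):
    for ch in letters:
        _CODES[ch] = digit


def soundex(name: str) -> str:
    """Four-character Soundex code of a name ('' for an empty name)."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with equal codes
        code = _CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        prev = code  # a vowel resets prev, so repeated codes count again
    return (encoded + "000")[:4]


def last_name(author: str) -> str:
    """Assumes authors are written as 'Last, Initials'."""
    return author.split(",")[0]


def most_restrictive_rule(ref: dict, pub: dict) -> bool:
    """Exact match on every field listed on the slide."""
    fields = ("author", "year", "source", "volume", "page", "doi")
    return all(ref.get(f) == pub.get(f) for f in fields)


def less_restrictive_rule(ref: dict, pub: dict) -> bool:
    """Soundex of first author's last name, year within one, same volume and page."""
    return (soundex(last_name(ref["author"])) == soundex(last_name(pub["author"]))
            and abs(ref["year"] - pub["year"]) <= 1
            and ref["volume"] == pub["volume"]
            and ref["page"] == pub["page"])
```

With these definitions, a reference that misspells the author as "Hirsh" and gives the year as 2006 fails the exact rule but still matches a 2005 publication by "Hirsch" with the same volume and starting page under the less restrictive rule.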
  • 10. Citation matching algorithms of iFQ • Iterative, rule-based algorithm • Allows non-unique matches of a single cited reference with several target articles 9
  • 11. Data collection (1) • Builds on data collected by Olensky (2015) • Sample of 300 publications (cited pubs) – 2 science domains – 6 disciplines – 2 languages – 2 publication years • 3975 corresponding cited references in WoS – Times cited used to find cited references that are linked in WoS – Cited reference search used to find cited references that are not linked in WoS 10
  • 12. Data collection (2) [Figure: 300 cited pubs; 3975 citing pubs, either linked in WoS or not linked in WoS]
  • 13. Results
    All matches (precision)
                           WoS             CWTS            iFQ
                           #       %       #       %       #       %
      Correct matches      3664    99.2    3855    98.8    3856    98.5
      Incorrect matches    29      0.8     45      1.2     57      1.5
    All citations (recall)
                           WoS             CWTS            iFQ
                           #       %       #       %       #       %
      Correct matches      3664    93.8    3855    98.6    3856    98.7
      Missed matches       244     6.2     53      1.4     52      1.3
    F1 score
                           WoS             CWTS            iFQ
                           96.4            98.7            98.6
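The percentages on the results slide follow from the raw counts through the usual definitions of precision, recall, and F1. The short Python sketch below recomputes them; the function and variable names are chosen for illustration only, and the counts are copied from the slide.

```python
def scores(correct: int, incorrect: int, missed: int) -> tuple[float, float, float]:
    """Precision, recall and F1 (as fractions) from match counts."""
    precision = correct / (correct + incorrect)  # share of all matches that are correct
    recall = correct / (correct + missed)        # share of all citations that were matched
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# (correct matches, incorrect matches, missed matches) per algorithm
counts = {
    "WoS": (3664, 29, 244),
    "CWTS": (3855, 45, 53),
    "iFQ": (3856, 57, 52),
}

for name, (correct, incorrect, missed) in counts.items():
    p, r, f1 = scores(correct, incorrect, missed)
    print(f"{name}: precision {100 * p:.1f}, recall {100 * r:.1f}, F1 {100 * f1:.1f}")
```

Rounded to one decimal, the computed F1 values agree with the slide: 96.4 for WoS, 98.7 for CWTS, and 98.6 for iFQ.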
  • 14. Qualitative analysis of missed matches (WoS: 244; CWTS: 53; iFQ: 52)
  • 15. Changes in CWTS citation matching algorithm • Introduction of matching rules in which: 1. Volume and issue number are interchanged 2. Volume and first page number are interchanged • Small change in the order in which the matching rules are applied
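One possible reading of the two new rules on slide 15, sketched in Python: the cited reference carries the target's issue number in its volume field, or its volume and first page are swapped. The field names and the requirement that author and year still agree are assumptions for illustration; the slide does not spell out the remaining conditions.

```python
def _same_author_and_year(ref: dict, pub: dict) -> bool:
    """Assumed side condition shared by both rules."""
    return ref["author"] == pub["author"] and ref["year"] == pub["year"]


def volume_issue_interchanged(ref: dict, pub: dict) -> bool:
    """The reference's volume field holds the publication's issue number."""
    return (_same_author_and_year(ref, pub)
            and ref["volume"] == pub["issue"]
            and ref["page"] == pub["page"])


def volume_page_interchanged(ref: dict, pub: dict) -> bool:
    """The reference's volume and first page fields are swapped."""
    return (_same_author_and_year(ref, pub)
            and ref["volume"] == pub["page"]
            and ref["page"] == pub["volume"])


# Target publication taken from the citation matching example earlier
# in the deck: Hirsch, JE, PNAS, 102(46), p.16569.
pub = {"author": "HIRSCH, JE", "year": 2005,
       "volume": "102", "issue": "46", "page": "16569"}

ref_issue_as_volume = {"author": "HIRSCH, JE", "year": 2005,
                       "volume": "46", "page": "16569"}
ref_swapped = {"author": "HIRSCH, JE", "year": 2005,
               "volume": "16569", "page": "102"}

print(volume_issue_interchanged(ref_issue_as_volume, pub))
print(volume_page_interchanged(ref_swapped, pub))
```

Rules of this kind are deliberately narrow: each tolerates one specific field mix-up while every other field must still agree.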
  • 16. Improved results
    All matches (precision)
                           CWTS (current)    CWTS (revised)
                           #       %         #       %
      Correct matches      3888    99.6      3906    99.8
      Incorrect matches    16      0.4       9       0.2
    All citations (recall)
                           CWTS (current)    CWTS (revised)
                           #       %         #       %
      Correct matches      3888    98.7      3906    99.1
      Missed matches       53      1.3       35      0.9
    F1 score
                           CWTS (current)    CWTS (revised)
                           99.1              99.4
  • 17. Conclusions • A significant number of citation matches are missing in WoS • Substantial improvement in recall is possible, but at the cost of a small decrease in precision • Citation matching algorithm of CWTS performs quite well • During the analysis, various problems were detected in WoS cited reference extraction 16
  • 18. Accuracy of WoS cited reference extraction 17
  • 19. Introduction • Aim: To determine the accuracy of WoS cited references data • Approach: Comparison of the cited references extracted from the full text of Elsevier publications with the cited references available in WoS 18
  • 20. Data • Elsevier full text data – ScienceDirect API – Subscription-based journal publications in the period 1987-2016 • WoS meta data – Document types ‘article’ and ‘review’ • Matching of Elsevier full-text data and WoS meta data at the level of individual publications 19
  • 21. 20
  • 22. 21
  • 23. 22
  • 24. Validation of missing cited references in WoS • Publication year: 2015 • Number of missing cited references: 73,536 • Sample size: 60 – Missing cited reference: 33 (55.0%) – Incorrect cited reference: 10 (16.7%) – Error in meta data of cited reference (e.g., incorrect publication year or incorrect volume number): 16 (26.7%) – Correct cited reference: 1 (1.7%)
  • 25. Missing cited references in WoS (1) 24
  • 26. Missing cited references in WoS (2) 25
  • 27. Missing cited references in WoS (3) 26
  • 28. Missing cited references in WoS (4) 27
  • 29. Missing cited references in WoS (5) 28
  • 30. Missing cited references in WoS (6) 29
  • 31. Missing cited references in WoS (7) 30
  • 32. Missing cited references in WoS (8) 31
  • 33. Missing cited references in WoS (9) 32
  • 34. Incorrect cited references in WoS (1) 33
  • 35. Incorrect cited references in WoS (2) 34
  • 36. Incorrect cited references in WoS (3) 35
  • 37. Incorrect cited references in WoS (4) 36
  • 38. Incorrect cited references in WoS (5) 37
  • 39. Incorrect cited references in WoS (6)
    – WoS cited reference: WANG J, 2006, CHINESE CHEM LETT, V17, P49
      Original cited reference in publication: J. Wang, J.K. Carson, M.F. North, D.J. Cleland, Int. J. Heat Mass Transfer 49 (17) (2006) 3075–3083.
    – WoS cited reference: KANBER B, 2013, CEREBROVASC DIS S2, V35, P21
      Original cited reference in publication: Kanber B, Hartshorne TC, Horsfield MA, Naylor AR, Robinson TG, Ramnarine KV. Dynamic variations in the ultrasound gray-scale median of carotid artery plaques. Cardiovasc Ultrasound 2013a;11:21.
    – WoS cited reference: EVANS P, 2010, TLS-TIMES LIT S 0326, P30
      Original cited reference in publication: Evans PD, Chowdhury MJA. Photoprotection of wood using polyester-type UVabsorbers derived from the reaction of 2 hydroxy-4(2,3-epoxypropoxy)-benzophenone with dicarboxylic acid anhydrides. J Wood Chem Technol 2010;30:186e204.
  • 40. Incorrect cited references in WoS (7)
    – WoS cited reference: CAO X, 2010, IEEE GLOBECOMM 2010, V2010, P1
      Original cited reference in publication: Cao, X., Zong, Z., Ju, X., Sun, Y., Dai, C., Liu, Q., Jiang, J., 2010. Molecular cloning, characterization and function analysis of the gene encoding HMG-CoA reductase from Euphorbia Pekinensis Rupr. Mol. Biol. Rep. 37, 1559e1567.
    – WoS cited reference: LI XY, 2013, NANJING NONGYE DAXUE, V36, P36
      Original cited reference in publication: X. Li, S. Wang, Y. Chen, G. Liu, X. Yang, Overexpression of CD40 in sacral chordomas and its correlation with low tumor recurrence, Onkologie 36 (10) (2013) 567–571
    – WoS cited reference: ZHANG K, 2014, IEEE T PATTERN ANAL, V1, P1
      Original cited reference in publication: K. Zhang, H. Chen, G. Wu, K. Chen, H. Yang, High expression of SPHK1 in sacral chordoma and association with patients’ poor prognosis, Med. Oncol. 31 (11) (2014) 247.
  • 41. More cited references in WoS (1) 40
  • 42. More cited references in WoS (2) 41
  • 43. More cited references in WoS (3) 42
  • 44. More cited references in WoS (4) 43
  • 45. More cited references in WoS (5) 44
  • 46. More cited references in WoS (6) 45
  • 47. Conclusions • About 0.3% of cited references are missing in WoS • About 0.2% of cited references in WoS have minor errors (e.g., incorrect publication year or volume number) • About 0.1% of cited references in WoS have major errors (i.e., reference to completely incorrect target document) • WoS does a good job in handling references pointing to multiple target documents • These results are based on Elsevier publications only; publications from other publishers may yield different outcomes 46
  • 48. Thank you for your attention! 47