An Analysis of the Microsoft Academic Graph
Drahomira Herrmannova (@robodasha) & Petr Knoth (@petrknoth)
KMi, The Open University
2/32
Introduction
• Goal: to understand the strengths and limitations of
the Microsoft Academic Graph (MAG) when it is
applied to scholarly communication tasks
• We study the characteristics of the dataset
and perform a correlation analysis with other
similar datasets
3/32
Questions
• How complete/sparse are the data?
• How many of the graph entities have all
associated metadata fields populated, and how
reliable are those fields?
• How well are the data
conflated/disambiguated?
4/32
Dataset
• Heterogeneous graph comprised of more than
120 million publications and the related
authors, venues, organizations, and fields of
study
• The largest publicly available dataset of
scholarly publications
• The largest dataset of open citation data
5/32
Dataset size
Papers 126,909,021
Authors 114,698,044
Institutions 19,843
Journals 23,404
Conferences 1,283
Conference instances 50,202
Fields of study 50,266
6/32
External datasets used
• CORE (Connecting Repositories)
• Mendeley
• Webometrics Ranking of World Universities
• Scimago Journal and Country Rank
7/32
Publication age
8/32
Publication age
• Publication dates from MAG compared with
CORE and Mendeley data
• Intersection found using DOI
Unique DOIs in the MAG 35,569,305
Unique DOIs in CORE 2,673,592
Intersection MAG/CORE 1,690,668
Intersection MAG/CORE/Mendeley 1,314,854
Intersection without missing data 1,258,611
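A minimal sketch of a DOI-based intersection like the one tabulated above. The slides do not say how DOIs were normalised before matching, so the lowercasing and prefix stripping in `normalize_doi` are assumptions, and both function names and all sample records are hypothetical.

```python
def normalize_doi(raw):
    """Return a lowercased DOI without a resolver prefix, or None if empty.

    Assumption: the normalisation rules here are illustrative only.
    """
    if raw is None:
        return None
    doi = raw.strip().lower()
    prefixes = (
        "https://doi.org/",
        "http://doi.org/",
        "https://dx.doi.org/",
        "http://dx.doi.org/",
        "doi:",
    )
    for prefix in prefixes:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi or None


def doi_intersection(*datasets):
    """Intersect several iterables of raw DOI strings after normalisation."""
    doi_sets = [
        {d for d in map(normalize_doi, records) if d is not None}
        for records in datasets
    ]
    return set.intersection(*doi_sets)


# Toy records, not taken from MAG, CORE, or Mendeley.
mag = ["10.1000/a", "10.1000/B", "10.1000/c"]
core = ["https://doi.org/10.1000/b", "doi:10.1000/c", None]
mendeley = ["10.1000/C", "10.1000/d"]

mag_core = doi_intersection(mag, core)
all_three = doi_intersection(mag, core, mendeley)
print(len(mag_core), len(all_three))
```

Records with a missing DOI drop out before the intersection is taken, so they cannot produce false matches.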
9/32
Publication age
• Compared using two methods
– Spearman's rho correlation coefficient
– Cumulative distribution function of the difference
between the publication years in the different
datasets
10/32
Publication age
Spearman’s rho MAG CORE Mendeley
MAG - 0.9555 0.9656
CORE 0.9555 - 0.9743
Mendeley 0.9656 0.9743 -
11/32
Publication age
12/32
Authors and affiliations
• Publications linked to author and affiliation
entities
• All publications are linked to one or more authors;
however, 105,980,107 (~83%) publications are not
linked to any affiliation
13/32
Authors and affiliations
Mean number of authors per paper 2.66
Max authors per paper 6,530
Mean number of papers per author 2.94
Max number of papers per author 153,915
Mean number of collaborators 116.93
Max number of collaborators 3,661,912
Number of papers with affiliation 20,928,914
Mean number of affiliations per paper 0.23
Max number of affiliations per paper 181
14/32
Authors and affiliations
• Paper with most authors: "Sunday, 26 August
2012"
• Author with most papers: "united vertical
media gmbh"
15/32
Journals and conferences
• Papers linked to publication venues
• Of all papers in MAG (over 126 million), more
than 51 million (~40%) are linked to a journal
and 1.7 million to a conference entity
16/32
Fields of study
• FoS in MAG organised hierarchically into four
levels (0-3)
– 47,989 at level 3
– 1,966 at level 2
– 293 at level 1
– 18 at level 0
• Over 41 million papers are linked to one or
more fields of study (~33%)
17/32
Fields of study
18/32
Fields of study – Mendeley
19/32
Citation network
• We study the network by
– looking at the citation distribution, to see whether
it is consistent with previous studies
– comparing the citations received by two types of
entities in the graph with citations from external
datasets
• Why?
– To understand the quality of the citation data (not
to rank universities or journals)
20/32
Citation network
• 528,682,289 internal citations
• Significant portion of papers disconnected
from the graph
Total number of papers 126,909,021
Papers with zero references 96,850,699
Papers with zero citations 89,647,949
Papers with zero references and citations 80,166,717
Mean citations per paper 4.17
Mean citations per "connected" paper 11.31
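The two means can be reproduced from the counts on this slide. The sketch assumes that a "connected" paper is one with at least one reference or citation, that is, every paper except those with zero references and zero citations. The slide does not define the term, but the arithmetic agrees with the reported values under that reading.

```python
# Counts copied from the slide.
total_papers = 126_909_021
internal_citations = 528_682_289
zero_refs_and_citations = 80_166_717

# Assumption: "connected" = has at least one reference or one citation.
connected_papers = total_papers - zero_refs_and_citations

mean_per_paper = internal_citations / total_papers
mean_per_connected = internal_citations / connected_papers

print(connected_papers)
print(round(mean_per_paper, 2), round(mean_per_connected, 2))
```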
21/32
Citation network
• Comparison of university and journal citation
data found in MAG with the Ranking Web of
Universities (RWoU) and the Scimago Journal
& Country Rank (SJCR) citation data
• Two comparison methods
– Size of overlap of the top university/journal lists
– Pearson’s and Spearman’s correlation (calculated
on matching items)
22/32
Citation network
• Matched 1,255 universities between MAG and
RWoU (2,105 in total), and 13,050 journals
between MAG and SJCR (22,878 in total)
• 4 journals in common among the top 10
• 54 among the top 100
• 677 among the top 1,000 and 1,407 among the
top 2,000
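A minimal sketch of how such a top-k overlap could be counted. It assumes both lists are first restricted to the matched items and then sorted by descending citation count. The slides do not spell out the exact procedure, and the function name and journal data are invented for illustration.

```python
def top_k_overlap(citations_a, citations_b, k):
    """Count items that appear in the top k of both lists.

    Each argument maps an item name to its citation count. Only items present
    in both mappings are ranked (assumption: overlap is computed on matches).
    """
    common = citations_a.keys() & citations_b.keys()

    def top(cites):
        # Sort by descending citations; break ties by name for determinism.
        ranked = sorted(common, key=lambda item: (-cites[item], item))
        return set(ranked[:k])

    return len(top(citations_a) & top(citations_b))


# Invented citation counts for a handful of journals.
mag_cites = {"J1": 900, "J2": 850, "J3": 400, "J4": 300, "J5": 50}
ext_cites = {"J1": 700, "J2": 200, "J3": 650, "J4": 100, "J6": 999}

overlap_top2 = top_k_overlap(mag_cites, ext_cites, 2)
overlap_top3 = top_k_overlap(mag_cites, ext_cites, 3)
print(overlap_top2, overlap_top3)
```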
23/32
Citation network – top 10 universities
24/32
Citation network – top 10 journals
25/32
Citation network
• To quantify how much the lists differ, we
created histograms of the differences between
the ranks in the MAG and in the external lists
• To produce the histograms
– Sorted the data by number of citations found in
the external dataset
– For top 100/1000 universities/journals created a
histogram of absolute difference between rank in
MAG and in external dataset
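The histogram procedure above can be sketched as follows. This is one possible reading rather than the authors' implementation: ranks are computed over matched items only, ties are broken by name, and the bin width is a free parameter. All names and citation counts are hypothetical.

```python
from collections import Counter


def rank_by_citations(citations):
    """Assign rank 1 to the most cited item; ties are broken by name."""
    ordered = sorted(citations, key=lambda item: (-citations[item], item))
    return {item: position for position, item in enumerate(ordered, start=1)}


def rank_difference_histogram(mag_cites, ext_cites, top_n, bin_width):
    """Histogram of |rank in MAG - rank in external list| for the external top_n.

    Returns a dict mapping the lower edge of each bin to the number of items in it.
    """
    matched = mag_cites.keys() & ext_cites.keys()
    mag_rank = rank_by_citations({k: mag_cites[k] for k in matched})
    ext_rank = rank_by_citations({k: ext_cites[k] for k in matched})
    # Sort by the external dataset, as described on the slide, then keep the top_n.
    top_items = sorted(matched, key=lambda item: ext_rank[item])[:top_n]
    bins = Counter()
    for item in top_items:
        diff = abs(mag_rank[item] - ext_rank[item])
        bins[(diff // bin_width) * bin_width] += 1
    return dict(sorted(bins.items()))


# Invented citation counts for five universities.
mag = {"U1": 500, "U2": 400, "U3": 300, "U4": 200, "U5": 100}
ext = {"U1": 90, "U2": 10, "U3": 80, "U4": 70, "U5": 60}

hist_top3 = rank_difference_histogram(mag, ext, top_n=3, bin_width=1)
hist_top5 = rank_difference_histogram(mag, ext, top_n=5, bin_width=2)
print(hist_top3, hist_top5)
```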
26/32
Rank difference – top 100 universities
27/32
Rank difference – top 100 universities
• University citation rank in the MAG differs by
more than 200 positions for about 20% of
universities in the top 100 of the Ranking Web
of Universities list
• The university citation rank differs by fewer than
25 positions for less than 40% of universities
across these two datasets
28/32
Rank difference – top 1000 universities
29/32
Rank difference – top 100 journals
30/32
Rank difference – top 1000 journals
31/32
Citation network
• Ranks of top universities differ on average by
163, with a standard deviation of 185
• Ranks of top journals differ on average by
1,203, with a standard deviation of 1,211
• Correlations calculated on matching items
                 Universities        Journals
Pearson’s r      0.8773, p -> 0.0    0.8246, p -> 0.0
Spearman’s rho   0.8266, p -> 0.0    0.8973, p -> 0.0
32/32
Conclusions
• MAG data correlate well with external datasets
• We have identified certain limitations as to the
completeness of links from publications to other
entities
• Existing university and journal rankings
(proprietary data) produce substantially different
results
– MAG is open and transparent at the level of individual
citations, so it is possible to verify and better interpret
the citation data
• Currently the most comprehensive publicly
available dataset of its kind

Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 

An Analysis of the Microsoft Academic Graph

  • 1. 1/32 An Analysis of the Microsoft Academic Graph Drahomira Herrmannova (@robodasha) & Petr Knoth (@petrknoth) KMi, The Open University
  • 2. 2/32 Introduction • To understand the strengths and limitations of the Microsoft Academic Graph (MAG) for applying it to scholarly communication tasks • We study the characteristics of the dataset and perform a correlation analysis with other similar datasets
  • 3. 3/32 Questions • How complete/sparse are the data? • How many of the graph entities have all associated metadata fields populated, and how reliable are they? • How well are the data conflated/disambiguated?
  • 4. 4/32 Dataset • Heterogeneous graph comprised of more than 120 million publications and the related authors, venues, organizations, and fields of study • The largest publicly available dataset of scholarly publications • The largest dataset of open citation data
  • 5. 5/32 Dataset size • Papers: 126,909,021 • Authors: 114,698,044 • Institutions: 19,843 • Journals: 23,404 • Conferences: 1,283 • Conference instances: 50,202 • Fields of study: 50,266
  • 6. 6/32 External datasets used • CORE (Connecting Repositories) • Mendeley • Webometrics Ranking of World Universities • Scimago Journal and Country Rank
  • 8. 8/32 Publication age • Publication dates from MAG compared with CORE and Mendeley data • Intersection found using DOI – Unique DOIs in the MAG: 35,569,305 – Unique DOIs in CORE: 2,673,592 – Intersection MAG/CORE: 1,690,668 – Intersection MAG/CORE/Mendeley: 1,314,854 – Intersection without missing data: 1,258,611
  • 9. 9/32 Publication age • Compared using two methods – Spearman's rho correlation coefficient – Cumulative distribution function of the difference between the publication years in the different datasets
  • 10. 10/32 Publication age • Spearman’s rho between publication years – MAG and CORE: 0.9555 – MAG and Mendeley: 0.9656 – CORE and Mendeley: 0.9743
  • 12. 12/32 Authors and affiliations • Publications linked to author and affiliation entities • All publications are linked to one or more authors; however, 105,980,107 (~83%) publications are not linked to any affiliation
  • 13. 13/32 Authors and affiliations • Mean number of authors per paper: 2.66 • Max authors per paper: 6,530 • Mean number of papers per author: 2.94 • Max number of papers per author: 153,915 • Mean number of collaborators: 116.93 • Max number of collaborators: 3,661,912 • Number of papers with affiliation: 20,928,914 • Mean number of affiliations per paper: 0.23 • Max number of affiliations per paper: 181
  • 14. 14/32 Authors and affiliations • Paper with most authors: "Sunday, 26 August 2012" • Author with most papers: "united vertical media gmbh"
  • 15. 15/32 Journals and conferences • Papers linked to publication venues • Of all papers in MAG (over 126 million), more than 51 million (~40%) are linked to a journal and 1.7 million to a conference entity
  • 16. 16/32 Fields of study • FoS in MAG organised hierarchically into four levels (0-3) – 47,989 at level 3 – 1,966 at level 2 – 293 at level 1 – 18 at level 0 • Over 41 million papers are linked to one or more fields of study (~33%)
  • 18. 18/32 Fields of study – Mendeley
  • 19. 19/32 Citation network • We study the network by – looking at the citation distribution, to see whether it is consistent with previous studies – comparing the citations received by two types of entities in the graph with citations from external datasets • Why? – To understand the quality of the citation data (not to rank universities or journals)
  • 20. 20/32 Citation network • 528,682,289 internal citations • Significant portion of papers disconnected from the graph – Total number of papers: 126,909,021 – Papers with zero references: 96,850,699 – Papers with zero citations: 89,647,949 – Papers with zero references and citations: 80,166,717 – Mean citation per paper: 4.17 – Mean citation per "connected" paper: 11.31
  • 21. 21/32 Citation network • Comparison of university and journal citation data found in MAG with the Ranking Web of Universities (RWoU) and the Scimago Journal & Country Rank (SJCR) citation data • Two comparison methods – Size of overlap of the top university/journal lists – Pearson’s and Spearman’s correlation (calculated on matching items)
  • 22. 22/32 Citation network • Matched 1,255 universities between MAG and RWoU (2,105 in total), and 13,050 journals between MAG and SJCR (22,878 in total) • 4 common journals among the top 10 • 54 among the top 100 • 677 among the top 1,000 and 1,407 among the top 2,000
  • 23. 23/32 Citation network – top 10 universities
  • 24. 24/32 Citation network – top 10 journals
  • 25. 25/32 Citation network • To quantify how much the lists differ, we created histograms of the differences between the ranks in the MAG and in the external lists • To produce the histograms – Sorted the data by number of citations found in the external dataset – For the top 100/1,000 universities/journals, created a histogram of the absolute difference between the rank in MAG and in the external dataset
  • 26. 26/32 Rank difference – top 100 universities
  • 27. 27/32 Rank difference – top 100 universities • University citation rank in the MAG differs by more than 200 positions for about 20% of universities in the top 100 of the Ranking Web of Universities list • The university citation rank differs by less than 25 positions for less than 40% of universities across these two datasets
  • 28. 28/32 Rank difference – top 1000 universities
  • 29. 29/32 Rank difference – top 100 journals
  • 30. 30/32 Rank difference – top 1000 journals
  • 31. 31/32 Citation network • Ranks of top universities differ on average by 163, with a standard deviation of 185 • Ranks of top journals differ on average by 1,203, with a standard deviation of 1,211 • Correlations calculated on matching items – Universities: Pearson’s r 0.8773 (p -> 0.0), Spearman’s rho 0.8266 (p -> 0.0) – Journals: Pearson’s r 0.8246 (p -> 0.0), Spearman’s rho 0.8973 (p -> 0.0)
  • 32. 32/32 Conclusions • MAG data correlate well with external datasets • We have identified certain limitations as to the completeness of links from publications to other entities • Existing university and journal rankings (proprietary data) produce substantially different results – MAG is open and transparent at the level of individual citations, so it is possible to verify and better interpret the citation data • Currently the most comprehensive publicly available dataset of its kind
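
The two comparison methods used for the citation network (size of overlap of the top lists, and correlation calculated on matching items), together with the absolute rank differences behind the histograms, can be sketched roughly as follows. This is a minimal illustration in Python, not the authors' code: the function names and the toy citation counts are hypothetical, ties are assumed to share an average rank, and ranks are assumed to be recomputed over the matched items only.

```python
from statistics import mean


def ranks_desc(values):
    """Rank values in descending order (1 = highest); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next value is tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # mean of the 1-based positions in the tie group
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson's r for two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson(ranks_desc(x), ranks_desc(y))


def top_k_overlap(a, b, k):
    """Number of entities that appear in the top k of both citation lists."""
    top_a = set(sorted(a, key=a.get, reverse=True)[:k])
    top_b = set(sorted(b, key=b.get, reverse=True)[:k])
    return len(top_a & top_b)


def rank_differences(mag, external, k):
    """Absolute rank difference for the top-k matched entities,
    sorted by citations in the external dataset."""
    common = sorted(set(mag) & set(external), key=external.get, reverse=True)
    mag_rank = dict(zip(common, ranks_desc([mag[n] for n in common])))
    ext_rank = dict(zip(common, ranks_desc([external[n] for n in common])))
    return [abs(mag_rank[n] - ext_rank[n]) for n in common[:k]]


# Hypothetical citation counts, for illustration only (not taken from the slides).
mag = {"A": 900, "B": 500, "C": 700, "D": 100, "E": 50}
external = {"A": 1000, "B": 800, "C": 600, "D": 400, "F": 300}

matched = sorted(set(mag) & set(external))
rho = spearman([mag[n] for n in matched], [external[n] for n in matched])
overlap = top_k_overlap(mag, external, 2)
diffs = rank_differences(mag, external, 3)
```

The list returned by `rank_differences` is what would be binned to draw a histogram like those on the rank difference slides; the mean and standard deviation of the same list would correspond to the figures reported on slide 31, under the assumptions stated above.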