3. Europe PubMed Central
26 million abstracts
2.3 million full text articles
Citation networks
Database links
Text-mining
Timeline: 2006, 2011, 2012, 2016?
4. How many open access articles in UKPMC?
PubMed (995K)
UKPMC (18%, 182K)
OA (9.6%, 96K)
Chart (x-axis: Publication Date)
Total: 489,000 OA articles
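The percentages on this slide can be checked with a few lines of arithmetic. This is a minimal sketch that assumes both percentages are taken relative to the PubMed figure of 995K; the variable and function names are illustrative only.

```python
# Figures as shown on the slide (K = thousand).
pubmed = 995_000
ukpmc = 182_000
open_access = 96_000


def share(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100 * part / whole


# Assumption: both shares are computed against the PubMed total.
ukpmc_share = share(ukpmc, pubmed)
oa_share = share(open_access, pubmed)

print(f"UKPMC share of PubMed: {ukpmc_share:.1f}%")
print(f"OA share of PubMed: {oa_share:.1f}%")
```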
5.
• Big data
• Thematic data
• Public data
• Archived data
• Two petabytes of data
• Scales to 7 PB raw disk
• Majority is DNA
Figure 2. Growth of key resources, 2001 to 2010 (x-axis: Year). Panels: European Nucleotide Archive (nucleotides, millions); Ensembl and Ensembl Genomes (genomes); UniProt (entries); InterPro (entries); ArrayExpress (hybridisations); PDBe (structures).
7. PMC336623 Extended to several other biological data types
8. Literature citation from data
• Proteins
• Nucleotides
• OMIM
• Chemicals
• Structure
• Clinical reviews
• Protein families
• Protein-protein interactions
• Gene expression experiments
Chart values: 800 K, 370 K, 110 K
9. Data referral from literature: text mining
Semantic Type    Unique Terms    Articles     Annotations
Accession No.    233,017         66,356       387,787
Chemical         76,712          1,694,385    83,923,066
Disease          171,692         1,768,214    57,821,871
Gene/Protein     227,318         1,310,382    77,189,022
GO Terms         32,664          1,832,294    65,061,579
Organism         180,637         1,713,280    70,832,222
2.3 million articles
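To compare annotation density across semantic types, the table can be reduced to annotations per annotated article. This is a rough sketch; it assumes the Annotations column is the total number of annotations found in the articles counted in the Articles column, which the slide does not state explicitly.

```python
# Rows copied from the table: (unique terms, articles, annotations).
rows = {
    "Accession No.": (233_017, 66_356, 387_787),
    "Chemical": (76_712, 1_694_385, 83_923_066),
    "Disease": (171_692, 1_768_214, 57_821_871),
    "Gene/Protein": (227_318, 1_310_382, 77_189_022),
    "GO Terms": (32_664, 1_832_294, 65_061_579),
    "Organism": (180_637, 1_713_280, 70_832_222),
}

# Assumption: annotations / articles approximates the mean number of
# annotations per article that contains at least one term of that type.
per_article = {
    name: annotations / articles
    for name, (_terms, articles, annotations) in rows.items()
}

for name, value in sorted(per_article.items(), key=lambda kv: kv[1]):
    print(f"{name:<15}{value:6.1f} annotations per article")
```

Under that assumption, accession numbers are far sparser than the other types, both in the number of articles that carry them and in how often they occur within an article.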
11. Why is this important? Implications
Scientific:
Linking articles that cite the same data
Citation:
Data Citation as measure of impact (Thomson: Data citation index)
Context of data citation: submission, reuse, analysis
Operational:
Services for publishers to improve Accession number tagging
Editorial policies and adherence
Extension of NLM DTD
Lessons learned for considering unstructured data
That we can perform this analysis at all highlights a benefit of Open Access
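The scientific point above, linking articles that cite the same data, can be sketched as an inverted index from accession numbers to articles. This is an illustration only, not the pipeline used by Europe PubMed Central; the article IDs, accession strings, and the function name `link_articles` are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical input: accession numbers mined from each article.
article_accessions = {
    "ART1": {"ACC_A", "ACC_B"},
    "ART2": {"ACC_B"},
    "ART3": {"ACC_C"},
    "ART4": {"ACC_A", "ACC_B"},
}


def link_articles(article_accessions):
    """Map each pair of articles to the accessions they both cite."""
    # Invert the mapping: accession -> articles citing it.
    by_accession = defaultdict(set)
    for article, accessions in article_accessions.items():
        for accession in accessions:
            by_accession[accession].add(article)

    # Every pair of articles sharing an accession becomes a link.
    links = defaultdict(set)
    for accession, articles in by_accession.items():
        for a, b in combinations(sorted(articles), 2):
            links[(a, b)].add(accession)
    return dict(links)


links = link_articles(article_accessions)
for pair, shared in sorted(links.items()):
    print(pair, sorted(shared))
```

Articles that share no accession with any other article (ART3 here) simply produce no link.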
16. Europe PubMed Central and Institutional Repositories:
content matching
Number of article IDs
OpenAIRE Plus
Coming soon: RESTful interface for data linked to articles
17. People
• Paula Buttery
• Andrew Caines
• Norman Cobley
• Yuci Gou
• Senay Kafkas
• Jyothi Katuri
• Oliver Kilian
• Jee-Hyub Kim
• Nikos Marinos
• Jo McEntyre
• Xingjun Pi
• Philip Rossiter
• Rebholz Group
• Peter Stoehr
• University of Manchester
• British Library
• OpenAIRE/OpenAIRE Plus
• NCBI, NLM