4. Data Landscape and Definitions
[Diagram labels: Research articles; Unstructured Data; Big Data: Deposition, Primary; Big Data: Curated, Annotation; Funder mandates; Journal requirements; Metadata; Standards; *reuse]
6. [Figure: growth by year, 2001 to 2010, in six panels: European Nucleotide Archive (nucleotides, millions); Ensembl and Ensembl Genomes (genomes); UniProt (entries); InterPro (entries); ArrayExpress (hybridisations); PDBe (structures)]
• Big data
• Thematic data
• Public data
• Archived data
• Two petabytes of data
• Scales to 7 PB raw disk
• Majority is DNA
7. Two core literature databases
CiteXplore
• 26 million abstracts (PubMed, Patents, Agricola)
• Website and web services
  • Citation networks
  • Database links
  • Whatizit text mining
• Over 1.1 million new records per year
UKPMC
• 2.2 million full-text articles (217K articles with supplementary data)
• Website
  • Supplemented by CiteXplore
  • Additional text mining
• Over 150K new articles per year
8. UK PubMed Central Overview
• Built in collaboration with PubMed Central USA (+ PMC Canada) since 2006
• Led by the European Bioinformatics Institute since 2011, with the British Library and the University of Manchester
• Supported by 16 UK and 2 European funders, led by the Wellcome Trust. Research spend: ~2 billion GBP
• A life-science web-based repository
• Manuscript submission service (self archiving by grant holders)
• Database of grant information – with details of about 18000 PIs
• Grant reporting and funder analysis tool
• 250K requests, 40K IPs, 7K direct interactive searches per day
11. Links
• by the author - on submission, as metadata (primary databases)
• by database curators - information and links from the literature
• expensive, slow, but high quality
Text mining
• by algorithms that use terminologies (can be subject to lag)
• post publication – can find new associations
• variable quality, but high throughput
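The terminology-driven approach in the text mining bullets above can be sketched as a simple dictionary lookup. This is a minimal illustration only, not the Whatizit implementation: the `TERMINOLOGY` entries, the `EXAMPLE:` identifiers, and the function names are all hypothetical.

```python
import re

# Hypothetical mini-terminology: surface form -> (semantic type, identifier).
# The identifiers are placeholders, not real database accessions.
TERMINOLOGY = {
    "insulin": ("Gene/Protein", "EXAMPLE:0001"),
    "diabetes mellitus": ("Disease", "EXAMPLE:0002"),
    "glucose": ("Chemical", "EXAMPLE:0003"),
}


def build_pattern(terms):
    """Compile one case-insensitive regex matching any term as a whole word."""
    # Longest terms first, so multi-word terms win over shorter alternatives.
    ordered = sorted(terms, key=len, reverse=True)
    alternatives = "|".join(re.escape(term) for term in ordered)
    return re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)


def annotate(text, terminology=TERMINOLOGY):
    """Return one annotation per terminology match found in the text."""
    pattern = build_pattern(terminology)
    annotations = []
    for match in pattern.finditer(text):
        sem_type, identifier = terminology[match.group(1).lower()]
        annotations.append({
            "start": match.start(),
            "end": match.end(),
            "text": match.group(1),
            "type": sem_type,
            "id": identifier,
        })
    return annotations


if __name__ == "__main__":
    sentence = "Insulin lowers blood glucose in diabetes mellitus."
    for annotation in annotate(sentence):
        print(annotation)
```

A term that is missing from the terminology is simply not found, which is one way the lag mentioned above shows up; re-running the same text after the terminology is updated can surface new associations post publication.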
12. Links from Literature to Databases
• Proteins
• Nucleotides
• OMIM
• Chemicals
• Structure
• Clinical reviews
• Protein families
• Protein-protein interactions
• Gene expression experiments …
(counts shown on slide: 800 K, 370 K, 110 K)
13. Text Mining in UKPMC (2.2 million articles)
Semantic Type    Unique Terms    Articles     Annotations
Gene/Protein          225,905   1,288,809      15,021,502
GO Terms               32,486   1,806,539      15,016,957
Organism              178,847   1,689,251      12,322,782
Disease               170,592   1,743,212      16,201,198
Accession No.         232,950      65,640         331,329
Chemical               76,350   1,669,500      22,438,980
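The counts in the table above can be turned into two derived figures: mean annotations per annotated article, and the share of the 2.2 million articles that carry at least one annotation of each type. The sketch below is illustrative only; the function and key names are assumptions, and the article total is the rounded figure from the slide title.

```python
# Counts copied from the table: (unique terms, articles, annotations).
COUNTS = {
    "Gene/Protein": (225_905, 1_288_809, 15_021_502),
    "GO Terms": (32_486, 1_806_539, 15_016_957),
    "Organism": (178_847, 1_689_251, 12_322_782),
    "Disease": (170_592, 1_743_212, 16_201_198),
    "Accession No.": (232_950, 65_640, 331_329),
    "Chemical": (76_350, 1_669_500, 22_438_980),
}

# "2.2 million articles" from the slide title; a rounded figure.
TOTAL_ARTICLES = 2_200_000


def summarise(counts, total_articles):
    """Derive per-type ratios from the raw counts."""
    summary = {}
    for sem_type, (_terms, articles, annotations) in counts.items():
        summary[sem_type] = {
            # Mean annotations in articles that have at least one.
            "annotations_per_article": annotations / articles,
            # Fraction of all articles with at least one annotation.
            "article_coverage": articles / total_articles,
        }
    return summary


if __name__ == "__main__":
    for sem_type, stats in summarise(COUNTS, TOTAL_ARTICLES).items():
        print(
            f"{sem_type:<14}"
            f"{stats['annotations_per_article']:8.1f}"
            f"{stats['article_coverage']:8.1%}"
        )
```

Because the total is rounded, the coverage values are approximate.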
25. Data-driven science
• Data re-use: biology is post publication
• Linking: citing papers and data (provenance and integration)
• Metrics and attribution
• Hard decisions about value of keeping complete data sets
26. Data landscape - possibilities
[Diagram labels: Research articles; Unstructured Data; Structured links; analysis; Big Data: Deposition, Primary; Big Data: Curated, Annotation; reuse?]