Linking literature to data in the life sciences
OpenAIREplus workshop, Copenhagen, 11 June 2012
Overview
• What literature? What data?
• How we make literature-data connections
• Case study
• Challenges and future directions
What literature? What data?
Data Landscape and Definitions
[Diagram labels: Research articles; Unstructured Data; Big Data: Deposition Primary; Big Data: Curated Annotation; Funder mandates; Journal requirements; Metadata; Standards; *reuse]
PMC336623

Extended to several other biological data types
[Figure: six panels plotted by year, 2001 to 2010. Panel titles: European Nucleotide Archive; Ensembl and Ensembl Genomes; UniProt; InterPro; ArrayExpress; PDBe. Y-axis labels: Nucleotides (millions); Genomes; Entries; Structures; Hybridisations.]
• Big data
• Thematic data
• Public data
• Archived data
• Two petabytes of data
• Scales to 7 petabytes of raw disk
• Majority is DNA
Two core literature databases

• 26 million abstracts (PubMed, Patents, Agricola)
• Website and web services
  • Citation networks
  • Database links
  • Whatizit text mining
• Over 1.1 million new records per year

• 2.2 million full-text articles (217K articles with supplementary data)
• Website
  • Supplemented by CiteXplore
  • Additional text mining
• Over 150K new articles per year
UK PubMed Central Overview
• Built in collaboration with PubMed Central USA (+ PMC Canada) since 2006
• Led by the European Bioinformatics Institute since 2011, with the British Library and the University of Manchester
• Supported by 16 UK and 2 European funders, led by the Wellcome Trust. Research spend: ~2 billion GBP

• A life-science web-based repository
• Manuscript submission service (self archiving by grant holders)
• Database of grant information – with details of about 18000 PIs
• Grant reporting and funder analysis tool
• 250K requests, 40K IPs, 7K direct interactive searches per day
How many articles?

Overall: 20% OA (~ 450K OA articles out of 2.2 million total)
How we make literature-data connections
Links
• by the author - on submission, as metadata (primary databases)
• by database curators - information and links from the literature

• expensive, slow, but high quality

Text mining
• by algorithms that use terminologies (can be subject to lag)
• post publication – can find new associations
• variable quality, but high throughput
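
The terminology-driven text mining described above can be sketched as a simple dictionary lookup. This is only an illustration under stated assumptions: the three-entry `TERMINOLOGY` dictionary and the `annotate` function are hypothetical and are not Whatizit or any UKPMC pipeline, which would use far larger vocabularies and disambiguation steps.

```python
import re

# Hypothetical toy terminology: surface term -> semantic type.
# The semantic type names follow the categories used in this talk.
TERMINOLOGY = {
    "dna repair": "GO Terms",
    "colon cancer": "Disease",
    "e. coli": "Organism",
}


def annotate(text, terminology=TERMINOLOGY):
    """Return (start, end, matched_text, semantic_type) for each dictionary hit."""
    annotations = []
    for term, semantic_type in terminology.items():
        # Word boundaries stop a term from matching inside a longer word.
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            annotations.append(
                (match.start(), match.end(), match.group(0), semantic_type)
            )
    # Sort by character offset so annotations follow reading order.
    return sorted(annotations)


if __name__ == "__main__":
    # Made-up example sentence, used only to exercise the function.
    sentence = "DNA repair in E. coli and colon cancer"
    for start, end, surface, semantic_type in annotate(sentence):
        print(start, end, surface, semantic_type)
```

A lookup like this is fast, which fits the "high throughput" point, but it only finds terms already in the dictionary. That is one way the lag mentioned above can arise: a new term is missed until the terminology is updated.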
Links from Literature to Databases
• Proteins
• Nucleotides
• OMIM
• Chemicals
• Structure
• Clinical reviews
• Protein families
• Protein-protein interactions
• Gene expression experiments …
[Chart values: 800 K, 370 K, 110 K]
Text Mining in UKPMC (2.2 million articles)

Semantic Type    Unique Terms    Articles     Annotations
Gene/Protein     225,905         1,288,809    15,021,502
GO Terms         32,486          1,806,539    15,016,957
Organism         178,847         1,689,251    12,322,782
Disease          170,592         1,743,212    16,201,198
Accession No.    232,950         65,640       331,329
Chemical         76,350          1,669,500    22,438,980
Case study
3.9 billion years ago
E. coli meets humans

Human colon cancer

DNA repair
Protein structure in PDBe
Link to the literature from the PDBe record
Algorithms that find similar structures
Text mine full text for 1ewq
Towards understanding DNA repair mechanisms
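
The step "Text mine full text for 1ewq" can be illustrated with a pattern search for accession-like strings. This is a minimal sketch, not the method used in UKPMC. It assumes a PDB identifier is four characters, a digit from 1 to 9 followed by three alphanumerics, and adds a heuristic (at least one letter) to skip plain numbers such as years. `PDB_ID` and `find_pdb_ids` are hypothetical names.

```python
import re

# Assumption: a PDB identifier is a digit 1-9 followed by three alphanumerics.
# The lookahead requires at least one letter among those three characters,
# a heuristic added here so that years such as 2010 are not reported.
PDB_ID = re.compile(r"\b[1-9](?=[A-Za-z0-9]{0,2}[A-Za-z])[A-Za-z0-9]{3}\b")


def find_pdb_ids(text):
    """Return candidate PDB identifiers, lowercased, in order of first appearance."""
    seen = []
    for match in PDB_ID.finditer(text):
        candidate = match.group(0).lower()
        if candidate not in seen:
            seen.append(candidate)
    return seen


if __name__ == "__main__":
    # Made-up sentence, used only to exercise the function.
    print(find_pdb_ids("Structure 1ewq (also written 1EWQ) was deposited in 2000."))
```

A pattern like this over-matches (a string such as "10mM" has the same shape), so each candidate would still need to be checked against the database before a link is recorded.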
Challenges and future directions
Data-driven science
Data re-use: biology is post publication
Linking: citing papers and data (provenance and integration)
Metrics and attribution
Hard decisions about value of keeping complete data sets
Data landscape - possibilities
[Diagram labels: analysis; Research articles; Unstructured Data; Structured links; Big Data: Deposition Primary; Big Data: Curated Annotation; reuse?]
[Chart labels: PDF, HTML, GIF, JPG, TIF, MOV, DOC, XSL]
Analysis supplied by Mimas, University of Manchester
Solutions that make sense to scientists
http://ukpmc.ac.uk

Literature-data integration in the life sciences – Jo McEntyre, EMBL-EBI
