2. Contents
1 Europe PMC and Linking Literature
2 Publishing Text-Mined Data on RDF
3 Text-Mining RDF Service
4 Discussion
3. Europe PMC
• Europe PMC is a literature database [1].
• Abstracts: 30 million PubMed, Agricola and patent records, updated
daily
• Full text articles: over 3 million full text articles, of which over 900,000
are free to read and reuse, updated daily
• Powerful and easy search
• Search all article content through one simple search interface,
supported by deep search options for advanced users.
4. Linking Literature
• Europe PMC links literature to several kinds of resources.
• External links: to any resource (e.g., databases, Wikipedia, press releases)
• Citations: to literature
• BioEntities (produced by Europe PMC text-mining pipeline)
• Biological entities: to concept
• Accession numbers: to data
• Example: http://europepmc.org/abstract/MED/21926972
5. Europe PMC Text-Mining Pipeline
• A pipeline of dictionary- and machine learning-based named entity
taggers [3].
• 6 semantic types
• Genes/proteins
• Chemicals
• Organisms
• GO terms
• Disease terms
• EFO terms
• 20 types of accession numbers [2]:
• ENA, RefSNP, PDB, UniProt, OMIM, PFam, ArrayExpress, RefSeq,
Data DOI, Ensembl, InterPro
• NCT, Bioproject, Biosample, Eudract, EMDB, PXD, GO, EGA,
TreeFam
• Programmatic access available.
6. Publishing Text-Mined Data
• Beyond BioEntities Tab
• Goals
• More connectivity
• More context for each link
• Links to share
• Challenge: dealing with nearly a billion automatically generated
annotations
• Approach: use the Web Annotation Data Model.
7. Web Annotation Data Model
• Built on top of RDF
• Annotations as resources
• To provide a standard description mechanism for sharing annotations
between systems
• Intended for general-purpose use
• Not only for text mining
• For example, YouTube video comments (by people) and image
annotation
• W3C Working Draft:
http://www.w3.org/TR/2014/WD-annotation-model-20141211/
8. Core Annotation Framework
• Typically an Annotation has a single Body, which is the comment or
other descriptive resource, and a single Target that the Body is
somehow "about".
• The Body provides the information which is annotating the Target.
• This "aboutness" may be further clarified or extended to notions such
as classifying or identifying.
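The Body/Target structure described above can be sketched as three RDF triples per annotation. This is only an illustration: it assumes the `oa:` namespace `http://www.w3.org/ns/oa#`, the `Annotation` class and its field names are hypothetical, and the annotation and body URIs are made up for the example. It is not the Europe PMC implementation.

```python
from dataclasses import dataclass

OA = "http://www.w3.org/ns/oa#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


@dataclass(frozen=True)
class Annotation:
    """One annotation: a Body that is somehow 'about' a Target."""
    uri: str     # identifier of the annotation resource itself
    body: str    # e.g. an ontology concept or database entry
    target: str  # e.g. the annotated article

    def to_ntriples(self) -> str:
        """Serialise the annotation as three N-Triples statements."""
        triples = [
            (self.uri, RDF_TYPE, OA + "Annotation"),
            (self.uri, OA + "hasBody", self.body),
            (self.uri, OA + "hasTarget", self.target),
        ]
        return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


# Hypothetical annotation and body URIs, chosen only for illustration.
example = Annotation(
    uri="http://example.org/annotation/1",
    body="http://example.org/concept/TEM-1",
    target="http://europepmc.org/articles/PMC4298172",
)
print(example.to_ntriples())
```

Treating the annotation as a resource with its own URI is what allows further statements (provenance, context, motivation) to be attached to it later.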
9. One Scenario: Text Comment On Web Page
• A textual comment on a selection of text within a web page
• How to select a text fragment?
• Text Position Selector: oa:start, oa:end
• Text Quote Selector: oa:exact, oa:prefix, oa:suffix
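The two selector types can be illustrated with a short sketch that derives both from the same character offsets. The function names, the dictionary layout, the `context` parameter and the sample sentence are assumptions made for this example; the property names follow the W3C model, where the trailing context of a quote selector is called `oa:suffix`.

```python
def _check_offsets(text: str, start: int, end: int) -> None:
    """Reject offsets that do not delimit a non-empty span of text."""
    if not 0 <= start < end <= len(text):
        raise ValueError("offsets must satisfy 0 <= start < end <= len(text)")


def text_position_selector(text: str, start: int, end: int) -> dict:
    """Select a fragment purely by character offsets."""
    _check_offsets(text, start, end)
    return {"@type": "oa:TextPositionSelector", "oa:start": start, "oa:end": end}


def text_quote_selector(text: str, start: int, end: int, context: int = 10) -> dict:
    """Select a fragment by its exact text plus surrounding context."""
    _check_offsets(text, start, end)
    return {
        "@type": "oa:TextQuoteSelector",
        "oa:exact": text[start:end],
        "oa:prefix": text[max(0, start - context):start],
        "oa:suffix": text[end:end + context],
    }


# Hypothetical sentence, used only to exercise the two functions.
sentence = "Mutations in TEM-1 confer resistance."
start = sentence.index("TEM-1")
end = start + len("TEM-1")

position = text_position_selector(sentence, start, end)
quote = text_quote_selector(sentence, start, end)
print(position)
print(quote)
```

A position selector is compact but breaks if the text is edited, while a quote selector can still be re-located by matching the quoted text and its context.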
12. Service Description
• Running on EBI RDF Platform
• Stores 1,563,241,810 triples text-mined from 400,746 Open Access
articles in Europe PubMed Central.
• Provides
• for each article, all the annotations linking to ontologies/databases
• with contexts:
• sentences
• section information
13. Use Case for Database Curation
• Given a database identifier, provides sentence-level information for
database curation.
1 Show all the articles where the PDB accession number 3NSS is
mentioned.
2 Show all the annotations, each with its label, in PMC3382907.
3 Show all the articles where inflammatory bowel disease (C0021390) is
mentioned.
• http://wwwdev.ebi.ac.uk/rdf/services/textmining/sparql
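As a rough sketch of how the second query above might be phrased, the snippet below builds a SPARQL query string and a GET request URL for the endpoint. The graph layout is an assumption: the slides do not give the actual schema, so the use of `oa:hasBody`, `oa:hasTarget`, `oa:hasSource` and `rdfs:label`, the article URI pattern, and both function names are hypothetical. No request is sent.

```python
from urllib.parse import urlencode

ENDPOINT = "http://wwwdev.ebi.ac.uk/rdf/services/textmining/sparql"


def annotations_in_article_query(pmcid: str, limit: int = 100) -> str:
    """Build a query listing annotations (and labels) for one article."""
    if not (pmcid.startswith("PMC") and pmcid[3:].isdigit()):
        raise ValueError(f"not a PMC identifier: {pmcid!r}")
    # Assumed URI pattern for articles; the real service may differ.
    article = f"http://europepmc.org/articles/{pmcid}"
    return f"""\
PREFIX oa: <http://www.w3.org/ns/oa#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?annotation ?body ?label WHERE {{
  ?annotation a oa:Annotation ;
              oa:hasBody ?body ;
              oa:hasTarget ?target .
  ?target oa:hasSource <{article}> .
  OPTIONAL {{ ?body rdfs:label ?label }}
}}
LIMIT {limit}"""


def request_url(query: str) -> str:
    """Encode the query as a SPARQL protocol GET request URL."""
    return ENDPOINT + "?" + urlencode({"query": query})


query = annotations_in_article_query("PMC3382907")
print(query)
print(request_url(query))
```

The other two queries would follow the same pattern, fixing the body (an accession number or disease concept) and asking for the source articles instead.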
14. Discussion
• Can we deal with a large number of triples from 3 million full text
articles?
• A better URI scheme: e.g.,
http://europepmc.org/articles/PMC4298172/methods/genes/TEM-1/23
• Interoperability with other formats used in the text-mining community
• e.g., BioC, UIMA
• Questions?
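To make the URI scheme proposed in this discussion concrete, the sketch below splits the example URI into its path components. The interpretation of each component is an assumption read off the single example (article, section, semantic type, entity label, and a trailing number whose exact meaning the slides do not state); the `AnnotationURI` type and the field names are hypothetical.

```python
from typing import NamedTuple
from urllib.parse import urlparse


class AnnotationURI(NamedTuple):
    pmcid: str
    section: str
    semantic_type: str
    label: str
    index: int  # trailing number; its meaning is an assumption


def parse_annotation_uri(uri: str) -> AnnotationURI:
    """Split a URI of the proposed form into its named components."""
    parts = urlparse(uri).path.strip("/").split("/")
    if len(parts) != 6 or parts[0] != "articles":
        raise ValueError(f"unexpected URI layout: {uri}")
    _, pmcid, section, semantic_type, label, index = parts
    return AnnotationURI(pmcid, section, semantic_type, label, int(index))


parsed = parse_annotation_uri(
    "http://europepmc.org/articles/PMC4298172/methods/genes/TEM-1/23"
)
print(parsed)
```

A scheme like this keeps the context (article, section, type) readable in the identifier itself, at the cost of needing escaping rules for labels that contain slashes.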
15. References
[1] The Europe PMC Consortium.
Europe PMC: a full-text literature database for the life sciences and
platform for innovation.
Nucleic Acids Research, 2014.
[2] Senay Kafkas, Jee-Hyub Kim, and Johanna R. McEntyre.
Database citation in full text biomedical articles.
PLoS ONE, 8(5):e63184, May 2013.
[3] Dietrich Rebholz-Schuhmann, Miguel Arregui, Sylvain Gaudan, Harald
Kirsch, and Antonio J. Yepes.
Text processing through web services: Calling Whatizit.
Bioinformatics, pages btm557+, November 2007.