SlideShare uma empresa Scribd logo
1 de 29
The Right to Read is the Right to Mine
http://contentmine.org
Background
• Contentmine aims to make large areas of scientific fact OPEN (100
million facts/year)
• We’re working with WellcomeTrust, Europe PubMedCentral, etc.
• A politically “hot” area (Hargreaves legislation, EU activity)
• 2015 WellcomeTrust workshop on TDM and Neuroscience; “rough
consensus” on what was needed.
• Day workshop at Cochrane, UK (Amy Price, Anna Noel Storr, Ben
Goldacre)
• 2-day workshop at Edinburgh on Systematic Reviews of Animal Test
publications
• In the last few months we’ve prototyped a unique Open starting
point, continuously released.
• Now actively building communities (plants, clinical, animals,
psychology, crystallography, HEPhys)
http://www.nytimes.com/2015/04/08/opinion/yes-we-were-warned-about-
ebola.html
We were stunned recently when we stumbled across an article by European
researchers in Annals of Virology [1982]: “The results seem to indicate that
Liberia has to be included in the Ebola virus endemic zone.” In the future,
the authors asserted, “medical personnel in Liberian health centers should be
aware of the possibility that they may come across active cases and thus be
prepared to avoid nosocomial epidemics,” referring to hospital-acquired
infection.
Adage in public health: “The road to inaction is paved with research
papers.”
Bernice Dahn (chief medical officer of Liberia’s Ministry of Health)
Vera Mussah (director of county health services)
Cameron Nutt (Ebola response adviser to Partners in Health)
A System Failure of Scholarly Publishing
catalogue
getpapers
query
Daily
Crawl
EuPMC, arXiv
CORE , HAL,
(UNIV repos)
ToC
services
PDF HTML
DOC ePUB
TeX XML
PNG
EPS CSV
XLSURLs
DOIs
crawl
quickscrape
norma
Normalizer
Structurer
Semantic
Tagger
Text
Data
Figures
ami
UNIV
Repos
search
Lookup
CONTENT
MINING
Chem
Phylo
Trials
Crystal
Plants
COMMUNITY
plugins
Visualization
and Analysis
PloSONE, BMC,
peerJ… Nature, IEEE,
Elsevier…
Publisher Sites
scrapers
queries
taggers
abstract
methods
references
Captioned
Figures
Fig. 1
HTML tables
30, 000 pages/day
Semantic ScholarlyHTML
Facts
What is “Content”?
http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.01113
03&representation=PDF CC-BY
SECTIONS
MAPS
TABLES
CHEMISTRY
TEXT
MATH
contentmine.org tackles these
What is “Content”?
How “data” are published in the 21st C
ContentMine Workshops and
Hackdays
Open Science Brazil, 2014-08
Easily distributed software
Get started in 30 mins
Build application
in a morning
Start simple: bagOfWords, Stemming, Regex, templates
Oxford 2013
Berlin 2014
Delhi 2014
Jenny Molloy with mascot AMI
Workshops
(1-hour -> full day or more)
2014-May->Nov
• Budapest/Shuttleworth
• Leicester Univ
• Electronic Theses and Dissertations
• Austrian Science Fund AT
• OKFest DE
• Eur. Bioinformatics Institute
• Open Science Rio de Janeiro BR
• Sci DataCon , Delhi IN
• Univ of Chicago US
• OpenCon 2014, Wash DC. US
• JISC , London
Upcoming
• LIBER
• Cochrane
• BL
• Wellcome Trust (April)
• WHO
Collaborators
• Wikimedia/Wikidata
• Mozilla
• Open Knowledge
• LIBER (European Research Libraries)
• British Library
• Wellcome Trust
• EBI (Eur. Bioinf. Inst.)
• JISC
• Open Access Button
• SPARC
• Creative Commons
• CORE
• EuropePubmedCentral
Facts Marked by “non-scientists” in ContentMine workshops
With Wikipedia everyone can be a scientist
Linked Open Data – the world’s knowledge
very little physical science and THESES?? 
http://upload.wikimedia.org/wikipedia/commons/3/34/LOD_Cloud_Diagram_as_of_September_2011.png
DBPedia
BIO
Comp
Lib
PDB
Ontologies
GOV
GOV.uk
Music,
Art
Literature
Social
Knowledge
bases
RDF
triples
https://en.wikipedia.org/wiki/Irrigation#mediaviewer/File:Pump-
enabled_Riverside_Irrigation_in_Comilla,_Bangladesh,_25_April_2014.jpg CC BY-SA 3.0
Daily Stream of 100,000 Open Facts
Twitter?Indexed by CAT
PLoSONE BMC
1
BMC
2
Closed1 Closed2Hybrid
CATalog
Enhanced annotated
articles
FACTSFACTS
Daily Crawl
Crawl … Scrape … Normalize … Mine
Linked OpenData
Semantic
Scientific Objects
2000-5000
Articles
quickscrape
Crawl
Feed
Norma Index &
Transform
TXT
XML
URL
DOI
Scientific
literature
Repositories DOC
CSV
sHTML
Plugins
Regex
SequencesSpecies
Bespoke
Scrapers
XPathPer-Journal
Taggers
Per- Journal
MetadataChemistry
Phylogenetics Farming
AMI
BadHTML
OCR
Diagrams
Open NORMA-lized Scientific
Literature + Facts
CANARY pipeline
CAT-alogue index
PDF
AMI-plugins
• BagOfWords, Stemming and Regular Expressions
• Species
• Biological Sequences
• Chemical compounds & reactions
• Crystallography * (Saulius Grazulis, COD)
• Clinical Trials * (Amy Price)
• Phylogenetics * (Ross Mounce)
• Phytochemistry * (Chris Steinbeck, PMR)
• Psychology * (Chris Hartgerink)
• HighEnergyPhysics (Durham)
* subcommunities
Questions we can tackle
• How to we find (mentions of) clinical/animal trials?
• Is a document a trial?
• What is the subject of the trial?
• What is the methodology used?
• Does the design and practice conform to
CONSORT/ARRIVE?
• What are the outcomes?
• Can we extract specific re-usable information?
• Who are involved? (researchers, sponsors, patients?)
• Has a proposed trial been completed and reported?
Text-based plugins
• Bag of words
(https://en.wikipedia.org/wiki/Bag-of-
words_model)
• https://en.wikipedia.org/wiki/Tf%E2%80%93idf
(Term-frequency, inverse document frequency)
• Templates and regexes (regular expressions).
“Bag of Words”
Three fulltext articles from trialsjournal.com
Regular Expressions for Systematic Reviews of Animal Tests
Preceding Text
Following Text
Extracted term
http://chemicaltagger.ch.cam.ac.uk/
• Typical
Typical chemical synthesis
Open Content Mining of FACTs
Machines can interpret chemical reactions
We have done 500,000 patents. There are >
3,000,000 reactions/year. Added value > 1B Eur.
Ln Bacterial load per fly
11.5
11.0
10.5
10.0
9.5
9.0
6.5
6.0
Days post—infection
0 1 2 3 4 5
Bitmap Image and Tesseract OCR
AMI https://bitbucket.org/petermr/xhtml2stm/wiki/Home
Example reaction scheme, taken from MDPI Metabolites 2012, 2, 100-133; page 8, CC-BY:
AMI reads the complete diagram,
recognizes the paths and
generates the molecules. Then
she creates a stop-fram animation
showing how the 12 reactions
lead into each other
CLICK HERE FOR ANIMATION
(may be browser dependent)
https://blogs.ch.cam.ac.uk/pmr/2014/06/25/content-mining-we-can-now-
mine-images-of-phylogenetic-trees-and-more/ for story of extraction
Thinning Topology
Serialization
Newick
Phytochemistry extraction
O. dayi
“volatile composition of “
A.sibeiri
A. judaica
Displayed by CAT (CottageLabs)
What we can do
• Recognize and promote autonomous sub-
communities
• Engage Early Career Researchers, including
undergraduates and let THEM BUILD the
systems.
• COMMUNALLY build tools for data checking
• Insist on semantic data input, even if it costs
submissions
contentmine.org team

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

Automatic Extraction of Knowledge from the Literature
Automatic Extraction of Knowledge from the LiteratureAutomatic Extraction of Knowledge from the Literature
Automatic Extraction of Knowledge from the Literature
 
Open software and knowledge for MIOSS
Open software and knowledge for MIOSS Open software and knowledge for MIOSS
Open software and knowledge for MIOSS
 
Amanuens.is HUmans and machines annotating scholarly literature
Amanuens.is HUmans and machines annotating scholarly literatureAmanuens.is HUmans and machines annotating scholarly literature
Amanuens.is HUmans and machines annotating scholarly literature
 
Amanuens.is HUmans and machines annotating scholarly literature
Amanuens.is HUmans and machines annotating scholarly literature Amanuens.is HUmans and machines annotating scholarly literature
Amanuens.is HUmans and machines annotating scholarly literature
 
Open software and knowledge for MIOSS
Open software and knowledge for MIOSSOpen software and knowledge for MIOSS
Open software and knowledge for MIOSS
 
High throughput mining of the scholarly literature
High throughput mining of the scholarly literatureHigh throughput mining of the scholarly literature
High throughput mining of the scholarly literature
 
Automatic Extraction of Knowledge from the Literature
Automatic Extraction of Knowledge from the LiteratureAutomatic Extraction of Knowledge from the Literature
Automatic Extraction of Knowledge from the Literature
 
Open Science Principles and Practice
Open Science Principles and PracticeOpen Science Principles and Practice
Open Science Principles and Practice
 
Sharing re-usable phylogenetic data: we're not there yet
Sharing re-usable phylogenetic data: we're not there yetSharing re-usable phylogenetic data: we're not there yet
Sharing re-usable phylogenetic data: we're not there yet
 
2018 09-03-ses open-fair_practices_in_evolutionary_genomics
2018 09-03-ses open-fair_practices_in_evolutionary_genomics2018 09-03-ses open-fair_practices_in_evolutionary_genomics
2018 09-03-ses open-fair_practices_in_evolutionary_genomics
 
Our dire need to mandate data standards and expectations for scientific publi...
Our dire need to mandate data standards and expectations for scientific publi...Our dire need to mandate data standards and expectations for scientific publi...
Our dire need to mandate data standards and expectations for scientific publi...
 
High throughput mining of the scholarly literature
High throughput mining of the scholarly literature High throughput mining of the scholarly literature
High throughput mining of the scholarly literature
 
Liberating facts from the scientific literature - Jisc Digifest 2016
Liberating facts from the scientific literature - Jisc Digifest 2016 Liberating facts from the scientific literature - Jisc Digifest 2016
Liberating facts from the scientific literature - Jisc Digifest 2016
 
The culture of researchData
The culture of researchData The culture of researchData
The culture of researchData
 
Open scholarship [a FOSTER open science talk]
Open scholarship [a FOSTER open science talk]Open scholarship [a FOSTER open science talk]
Open scholarship [a FOSTER open science talk]
 
Laurie Goodman at #SSPBoston: Article+Data+Tools Reproducibility, Reuse, & Ra...
Laurie Goodman at #SSPBoston: Article+Data+ToolsReproducibility, Reuse, & Ra...Laurie Goodman at #SSPBoston: Article+Data+ToolsReproducibility, Reuse, & Ra...
Laurie Goodman at #SSPBoston: Article+Data+Tools Reproducibility, Reuse, & Ra...
 
Cochrane workshop2016
Cochrane workshop2016Cochrane workshop2016
Cochrane workshop2016
 
OpenVirus at OpenPublishingFest
OpenVirus at OpenPublishingFestOpenVirus at OpenPublishingFest
OpenVirus at OpenPublishingFest
 
Digital Scholarship: Enlightenment or Devastated Landscape?
Digital Scholarship: Enlightenment or Devastated Landscape? Digital Scholarship: Enlightenment or Devastated Landscape?
Digital Scholarship: Enlightenment or Devastated Landscape?
 
The Content Mine (presented at UKSG)
The Content Mine (presented at UKSG)The Content Mine (presented at UKSG)
The Content Mine (presented at UKSG)
 

Semelhante a Why ContentMining is useful

Semelhante a Why ContentMining is useful (20)

ContentMining for France and Europe; Lessons from 2 years in UK
ContentMining for France and Europe; Lessons from 2 years in UKContentMining for France and Europe; Lessons from 2 years in UK
ContentMining for France and Europe; Lessons from 2 years in UK
 
ContentMining in Neuroscience
ContentMining in NeuroscienceContentMining in Neuroscience
ContentMining in Neuroscience
 
ContentMining in Neuroscience
ContentMining in NeuroscienceContentMining in Neuroscience
ContentMining in Neuroscience
 
Automatic Extraction of Science and Medicine from the scholarly literature
Automatic Extraction of Science and  Medicine from the scholarly literatureAutomatic Extraction of Science and  Medicine from the scholarly literature
Automatic Extraction of Science and Medicine from the scholarly literature
 
Automatic Extraction of Science and Medicine from the scholarly literature
Automatic Extraction of Science and Medicine from the scholarly literatureAutomatic Extraction of Science and Medicine from the scholarly literature
Automatic Extraction of Science and Medicine from the scholarly literature
 
Big Data and ContentMining for Libraries
Big Data and ContentMining for LibrariesBig Data and ContentMining for Libraries
Big Data and ContentMining for Libraries
 
Early Career Reseachers and Open Healthcare
Early Career Reseachers and Open HealthcareEarly Career Reseachers and Open Healthcare
Early Career Reseachers and Open Healthcare
 
The culture of researchData
The culture of researchDataThe culture of researchData
The culture of researchData
 
Early Career Reseachers in Science. Start Early, Be Open , Be Brave
Early Career Reseachers in Science. Start Early, Be Open , Be BraveEarly Career Reseachers in Science. Start Early, Be Open , Be Brave
Early Career Reseachers in Science. Start Early, Be Open , Be Brave
 
Open sciencerefresher2019
Open sciencerefresher2019Open sciencerefresher2019
Open sciencerefresher2019
 
Scientific search for everyone
Scientific search for everyoneScientific search for everyone
Scientific search for everyone
 
Scott Edmunds, HKU Open Access Week: Experiences from the front-line of Open ...
Scott Edmunds, HKU Open Access Week: Experiences from the front-line of Open ...Scott Edmunds, HKU Open Access Week: Experiences from the front-line of Open ...
Scott Edmunds, HKU Open Access Week: Experiences from the front-line of Open ...
 
Plosslides
PlosslidesPlosslides
Plosslides
 
PLOS slides
PLOS slidesPLOS slides
PLOS slides
 
Automatic Extraction of Knowledge from Biomedical literature
Automatic Extraction of Knowledge from Biomedical literature Automatic Extraction of Knowledge from Biomedical literature
Automatic Extraction of Knowledge from Biomedical literature
 
Rapid biomedical search
Rapid biomedical search Rapid biomedical search
Rapid biomedical search
 
Disrupting the Publisher-Academic Complex
Disrupting the Publisher-Academic ComplexDisrupting the Publisher-Academic Complex
Disrupting the Publisher-Academic Complex
 
The Culture of Research Data, by Peter Murray-Rust
The Culture of Research Data, by Peter Murray-RustThe Culture of Research Data, by Peter Murray-Rust
The Culture of Research Data, by Peter Murray-Rust
 
Towards Responsible Content Mining: A Cambridge perspective
Towards Responsible Content Mining: A Cambridge perspectiveTowards Responsible Content Mining: A Cambridge perspective
Towards Responsible Content Mining: A Cambridge perspective
 
WikiFactMine: Science for Everyone
WikiFactMine: Science for EveryoneWikiFactMine: Science for Everyone
WikiFactMine: Science for Everyone
 

Mais de petermurrayrust

Mais de petermurrayrust (20)

Omdi2021 Ontologies for (Materials) Science in the Digital Age
Omdi2021 Ontologies for (Materials) Science in the Digital AgeOmdi2021 Ontologies for (Materials) Science in the Digital Age
Omdi2021 Ontologies for (Materials) Science in the Digital Age
 
Open Virus Indian Presentation
Open Virus Indian PresentationOpen Virus Indian Presentation
Open Virus Indian Presentation
 
Can machines understand the scientific literature?
Can machines understand the scientific literature?Can machines understand the scientific literature?
Can machines understand the scientific literature?
 
Open Virus Indian Presentation
Open Virus Indian PresentationOpen Virus Indian Presentation
Open Virus Indian Presentation
 
Automatic mining of data from materials science literature
Automatic mining of data from materials science literatureAutomatic mining of data from materials science literature
Automatic mining of data from materials science literature
 
Climate Change and Human Migration
Climate Change and Human MigrationClimate Change and Human Migration
Climate Change and Human Migration
 
openVirus - tools for discovering literature on viruses
openVirus - tools for discovering literature on virusesopenVirus - tools for discovering literature on viruses
openVirus - tools for discovering literature on viruses
 
XML for science; its huge potential; but are pubiishers preventing it?
XML for science; its huge potential; but are pubiishers preventing it?XML for science; its huge potential; but are pubiishers preventing it?
XML for science; its huge potential; but are pubiishers preventing it?
 
Openplant2018 Poster; Semantic searching
Openplant2018 Poster; Semantic searchingOpenplant2018 Poster; Semantic searching
Openplant2018 Poster; Semantic searching
 
Extracting science from the archive
Extracting science from the archiveExtracting science from the archive
Extracting science from the archive
 
WikiFactMine: Ontology for Everybody and Everything
WikiFactMine: Ontology for Everybody and EverythingWikiFactMine: Ontology for Everybody and Everything
WikiFactMine: Ontology for Everybody and Everything
 
Paradise Lost and The Right to Read is the Right to Mine
Paradise Lost and The Right to Read is the Right to MineParadise Lost and The Right to Read is the Right to Mine
Paradise Lost and The Right to Read is the Right to Mine
 
Young people in an Age of Knowledge Neocolonialism
Young people in an Age of Knowledge NeocolonialismYoung people in an Age of Knowledge Neocolonialism
Young people in an Age of Knowledge Neocolonialism
 
ContentMining and Copyright at CopyCamp2017
ContentMining and Copyright at CopyCamp2017ContentMining and Copyright at CopyCamp2017
ContentMining and Copyright at CopyCamp2017
 
The mining "Revolution"; are Libraries supporting Researchers or Publishers"?
The mining "Revolution"; are Libraries supporting Researchers or Publishers"?The mining "Revolution"; are Libraries supporting Researchers or Publishers"?
The mining "Revolution"; are Libraries supporting Researchers or Publishers"?
 
WikiFactMine for Plant Chemistry
WikiFactMine for Plant ChemistryWikiFactMine for Plant Chemistry
WikiFactMine for Plant Chemistry
 
Biovision2017 Accessing the scientific literature
Biovision2017 Accessing the scientific literatureBiovision2017 Accessing the scientific literature
Biovision2017 Accessing the scientific literature
 
Can machines understand the scientific literature
Can machines understand the scientific literatureCan machines understand the scientific literature
Can machines understand the scientific literature
 
Asking the scientific literature to tell us about metabolism
Asking the scientific literature to tell us about metabolismAsking the scientific literature to tell us about metabolism
Asking the scientific literature to tell us about metabolism
 
Asking the scientific literature to tell us about metabolism
Asking the scientific literature to tell us about metabolismAsking the scientific literature to tell us about metabolism
Asking the scientific literature to tell us about metabolism
 

Último

Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Sérgio Sacani
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Sérgio Sacani
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
PirithiRaju
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
PirithiRaju
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
PirithiRaju
 

Último (20)

GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C P
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdf
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based Nanomaterials
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdf
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
 
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 

Why ContentMining is useful

  • 1. The Right to Read is the Right to Mine http://contentmine.org
  • 2. Background • Contentmine aims to make large areas of scientific fact OPEN (100 million facts/year) • We’re working with WellcomeTrust, Europe PubMedCentral, etc. • A politically “hot” area (Hargreaves legislation, EU activity) • 2015 WellcomeTrust workshop on TDM and Neuroscience; “rough consensus” on what was needed. • Day workshop at Cochrane, UK (Amy Price, Anna Noel Storr, Ben Goldacre) • 2-day workshop at Edinburgh on Systematic Reviews of Animal Test publications • In the last few months we’ve prototyped a unique Open starting point, continuously released. • Now actively building communities (plants, clinical, animals, psychology, crystallography, HEPhys)
  • 3. http://www.nytimes.com/2015/04/08/opinion/yes-we-were-warned-about- ebola.html We were stunned recently when we stumbled across an article by European researchers in Annals of Virology [1982]: “The results seem to indicate that Liberia has to be included in the Ebola virus endemic zone.” In the future, the authors asserted, “medical personnel in Liberian health centers should be aware of the possibility that they may come across active cases and thus be prepared to avoid nosocomial epidemics,” referring to hospital-acquired infection. Adage in public health: “The road to inaction is paved with research papers.” Bernice Dahn (chief medical officer of Liberia’s Ministry of Health) Vera Mussah (director of county health services) Cameron Nutt (Ebola response adviser to Partners in Health) A System Failure of Scholarly Publishing
  • 4. catalogue getpapers query Daily Crawl EuPMC, arXiv CORE , HAL, (UNIV repos) ToC services PDF HTML DOC ePUB TeX XML PNG EPS CSV XLSURLs DOIs crawl quickscrape norma Normalizer Structurer Semantic Tagger Text Data Figures ami UNIV Repos search Lookup CONTENT MINING Chem Phylo Trials Crystal Plants COMMUNITY plugins Visualization and Analysis PloSONE, BMC, peerJ… Nature, IEEE, Elsevier… Publisher Sites scrapers queries taggers abstract methods references Captioned Figures Fig. 1 HTML tables 30, 000 pages/day Semantic ScholarlyHTML Facts
  • 7. How “data” are published in the 21st C
  • 8. ContentMine Workshops and Hackdays Open Science Brazil, 2014-08 Easily distributed software Get started in 30 mins Build application in a morning Start simple: bagOfWords, Stemming, Regex, templates
  • 9. Oxford 2013 Berlin 2014 Delhi 2014 Jenny Molloy with mascot AMI
  • 10. Workshops (1-hour -> full day or more) 2014-May->Nov • Budapest/Shuttleworth • Leicester Univ • Electronic Theses and Dissertations • Austrian Science Fund AT • OKFest DE • Eur. Bioinformatics Institute • Open Science Rio de Janeiro BR • Sci DataCon , Delhi IN • Univ of Chicago US • OpenCon 2014, Wash DC. US • JISC , London Upcoming • LIBER • Cochrane • BL • Wellcome Trust (April) • WHO Collaborators • Wikimedia/Wikidata • Mozilla • Open Knowledge • LIBER (European Research Libraries) • British Library • Wellcome Trust • EBI (Eur. Bioinf. Inst.) • JISC • Open Access Button • SPARC • Creative Commons • CORE • EuropePubmedCentral
  • 11. Facts Marked by “non-scientists” in ContentMine workshops With Wikipedia everyone can be a scientist
  • 12. Linked Open Data – the world’s knowledge very little physical science and THESES??  http://upload.wikimedia.org/wikipedia/commons/3/34/LOD_Cloud_Diagram_as_of_September_2011.png DBPedia BIO Comp Lib PDB Ontologies GOV GOV.uk Music, Art Literature Social Knowledge bases RDF triples
  • 14. PLoSONE BMC 1 BMC 2 Closed1 Closed2Hybrid CATalog Enhanced annotated articles FACTSFACTS Daily Crawl Crawl … Scrape … Normalize … Mine Linked OpenData Semantic Scientific Objects 2000-5000 Articles
  • 15. quickscrape Crawl Feed Norma Index & Transform TXT XML URL DOI Scientific literature Repositories DOC CSV sHTML Plugins Regex SequencesSpecies Bespoke Scrapers XPathPer-Journal Taggers Per- Journal MetadataChemistry Phylogenetics Farming AMI BadHTML OCR Diagrams Open NORMA-lized Scientific Literature + Facts CANARY pipeline CAT-alogue index PDF
  • 16. AMI-plugins • BagOfWords, Stemming and Regular Expressions • Species • Biological Sequences • Chemical compounds & reactions • Crystallography * (Saulius Grazulis, COD) • Clinical Trials * (Amy Price) • Phylogenetics * (Ross Mounce) • Phytochemistry * (Chris Steinbeck, PMR) • Psychology * (Chris Hartgerink) • HighEnergyPhysics (Durham) * subcommunities
  • 17. Questions we can tackle • How to we find (mentions of) clinical/animal trials? • Is a document a trial? • What is the subject of the trial? • What is the methodology used? • Does the design and practice conform to CONSORT/ARRIVE? • What are the outcomes? • Can we extract specific re-usable information? • Who are involved? (researchers, sponsors, patients?) • Has a proposed trial been completed and reported?
  • 18. Text-based plugins • Bag of words (https://en.wikipedia.org/wiki/Bag-of- words_model) • https://en.wikipedia.org/wiki/Tf%E2%80%93idf (Term-frequency, inverse document frequency) • Templates and regexes (regular expressions).
  • 19. “Bag of Words” Three fulltext articles from trialsjournal.com
  • 20. Regular Expressions for Systematic Reviews of Animal Tests Preceding Text Following Text Extracted term
  • 22. Open Content Mining of FACTs Machines can interpret chemical reactions We have done 500,000 patents. There are > 3,000,000 reactions/year. Added value > 1B Eur.
  • 23.
  • 24. Ln Bacterial load per fly 11.5 11.0 10.5 10.0 9.5 9.0 6.5 6.0 Days post—infection 0 1 2 3 4 5 Bitmap Image and Tesseract OCR
  • 25. AMI https://bitbucket.org/petermr/xhtml2stm/wiki/Home Example reaction scheme, taken from MDPI Metabolites 2012, 2, 100-133; page 8, CC-BY: AMI reads the complete diagram, recognizes the paths and generates the molecules. Then she creates a stop-fram animation showing how the 12 reactions lead into each other CLICK HERE FOR ANIMATION (may be browser dependent)
  • 27. Phytochemistry extraction O. dayi “volatile composition of “ A.sibeiri A. judaica Displayed by CAT (CottageLabs)
  • 28. What we can do • Recognize and promote autonomous sub- communities • Engage Early Career Researchers, including undergraduates and let THEM BUILD the systems. • COMMUNALLY build tools for data checking • Insist on semantic data input, even if it costs submissions