Metadata Analyser: measuring
metadata quality
Bruno Inácio, João D. Ferreira, and Francisco M. Couto
LaSIGE, Faculdade de Ciências da Universidade de Lisboa, Portugal
PACBB, June 21-23, 2017
Porto, Portugal
Figure 1. Two pages (scan) from Galilei's Sidereus Nuncius (“The
Starry Messenger” or “The Herald of the Stars”), Venice, 1610.
Goodman A, Pepe A, Blocker AW, Borgman CL, et al. (2014) Ten Simple Rules for the Care and
Feeding of Scientific Data. PLoS Comput Biol 10(4): e1003542. doi:10.1371/journal.pcbi.1003542
http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1003542
Galileo integrated
• the direct results of his observations of Jupiter
• with careful and clear descriptions of how they were performed
From “Big” Data to Knowledge
<?xml version="1.0"?>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://en.wikipedia.org/wiki/Sintra_Collar">
    <dc:description>
      Gold collar. It was made from three circular sectioned and tapering gold bars
      that are fused at the ends forming a penannular neck-ring.
    </dc:description>
    <dc:date>1250BC-800BC (circa)</dc:date>
    <dc:location>
      Sintra, Portugal
      http://yboss.yahooapis.com/geo/placefinder?woeid=748874
    </dc:location>
    <dc:type>
      Gold
      http://purl.obolibrary.org/obo/CHEBI_30050
    </dc:type>
  </rdf:Description>
</rdf:RDF>
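A record like this can be inspected programmatically. The following is a minimal sketch, not the Metadata Analyser's own parser: it uses Python's standard xml.etree.ElementTree to list each Dublin Core element of the record together with any URI found in its text. The function name extract_annotations and the URI regular expression are assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

RDF_XML = """<?xml version="1.0"?>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://en.wikipedia.org/wiki/Sintra_Collar">
    <dc:description>
      Gold collar. It was made from three circular sectioned and tapering gold bars
      that are fused at the ends forming a penannular neck-ring.
    </dc:description>
    <dc:date>1250BC-800BC (circa)</dc:date>
    <dc:location>
      Sintra, Portugal
      http://yboss.yahooapis.com/geo/placefinder?woeid=748874
    </dc:location>
    <dc:type>
      Gold
      http://purl.obolibrary.org/obo/CHEBI_30050
    </dc:type>
  </rdf:Description>
</rdf:RDF>
"""

# Namespace prefixes in the "{uri}localname" form that ElementTree uses for tags.
RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
DC = "{http://purl.org/dc/elements/1.1/}"

# Assumption: a URI is any whitespace-delimited token starting with http(s)://.
URI_PATTERN = re.compile(r"https?://\S+")


def extract_annotations(rdf_xml):
    """Return (element name, list of URIs in its text) for every dc:* element."""
    root = ET.fromstring(rdf_xml)
    annotations = []
    for description in root.iter(RDF + "Description"):
        for element in description:
            if not element.tag.startswith(DC):
                continue
            # Collapse the line breaks and indentation inside the element text.
            text = " ".join((element.text or "").split())
            annotations.append((element.tag[len(DC):], URI_PATTERN.findall(text)))
    return annotations


for name, uris in extract_annotations(RDF_XML):
    print(name, uris)
```

In this record only the location and type elements carry a URI, and only the type URI points into an ontology (ChEBI).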
[Figure: ontology fragment relating Metal, Precious, Coinage, Gold, Silver, Platinum, Palladium, and Copper through is-a links and mappings]
Conventional Solution
Proper data-sharing rules
• So let’s create some data-sharing policies and some compliance and enforcement activities
Esperanto
• Created in 1887 as an easy-to-learn and politically neutral language
• But English provides a greater incentive
– Website languages, March 2014
Data-sharing policies
“Adherence to data-sharing policies is as inconsistent as the policies themselves”
“351 papers covered by some data-sharing policy, only 143 fully adhered to that policy” (~40%)
“is time-consuming to do properly, the reward systems aren't there and neither is the stick”
“Of all the data that are made available, what fraction is actually used by someone else?”
Steven Wiley in Nature, 2011
http://www.nature.com/news/2011/110914/full/news.2011.536.html
Human Factor
• “More often than scientists would like to admit, they cannot even recover the data associated with their own published works”
http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003542
Goals
1. to propose two measures of metadata quality
2. to implement a tool that can evaluate these measures in a public repository
3. to show that these measures are valid and significant in a real-world scientific repository
Measures of metadata quality
1. Term coverage: the proportion of annotations in the metadata file that link to an ontology concept
2. Semantic specificity: the average specificity of those ontology concepts
Term Coverage
• It is the ratio between
– the number of annotations that refer to ontology concepts
– and the total number of annotations in the metadata file
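The ratio can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool's implementation: each annotation is represented only by its URI (or None when it has none), and an annotation counts as referring to an ontology concept when its URI starts with the OBO PURL prefix. The names term_coverage, refers_to_ontology_concept, and OBO_PREFIX are hypothetical.

```python
# Assumption: ontology concepts are recognised by the OBO PURL prefix.
OBO_PREFIX = "http://purl.obolibrary.org/obo/"


def refers_to_ontology_concept(uri):
    """Hypothetical test for 'this annotation refers to an ontology concept'."""
    return uri is not None and uri.startswith(OBO_PREFIX)


def term_coverage(annotation_uris, is_concept=refers_to_ontology_concept):
    """Annotations that refer to ontology concepts / all annotations in the file."""
    if not annotation_uris:
        return 0.0  # assumption: a file without annotations has zero coverage
    linked = sum(1 for uri in annotation_uris if is_concept(uri))
    return linked / len(annotation_uris)


# A record with four annotations: two plain literals, one non-ontology URL,
# and one ChEBI concept.
record = [
    None,
    None,
    "http://yboss.yahooapis.com/geo/placefinder?woeid=748874",
    "http://purl.obolibrary.org/obo/CHEBI_30050",
]
print(term_coverage(record))
```

Passing a different is_concept predicate changes what counts as an ontology link without touching the ratio itself.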
Semantic specificity
• A(t) is the number of ancestor concepts above t
• and D(t) is the average distance between t and all its leaf descendants
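The two quantities can be computed on a small is-a hierarchy as follows. This is a sketch under assumptions: the hierarchy below is a hypothetical toy example, distance is taken as the shortest number of is-a links, and D(t) is set to 0 for a leaf. How A(t) and D(t) are combined into a specificity score is not reproduced in the slide text, so the sketch stops at the two quantities.

```python
from collections import deque

# Hypothetical toy hierarchy, given as concept -> list of direct parents (is-a).
PARENTS = {
    "Metal": [],
    "Precious": ["Metal"],
    "Coinage": ["Metal"],
    "Gold": ["Precious", "Coinage"],
    "Silver": ["Precious", "Coinage"],
    "Platinum": ["Precious"],
    "Palladium": ["Precious"],
    "Copper": ["Coinage"],
}


def children_index(parents):
    """Invert the child -> parents map into parent -> children."""
    children = {concept: [] for concept in parents}
    for child, direct_parents in parents.items():
        for parent in direct_parents:
            children[parent].append(child)
    return children


def ancestor_count(t, parents):
    """A(t): number of distinct concepts reachable by following is-a upward from t."""
    seen = set()
    stack = list(parents[t])
    while stack:
        concept = stack.pop()
        if concept not in seen:
            seen.add(concept)
            stack.extend(parents[concept])
    return len(seen)


def mean_leaf_distance(t, parents):
    """D(t): average shortest distance from t to each of its leaf descendants."""
    children = children_index(parents)
    distance = {t: 0}
    queue = deque([t])
    leaf_distances = []
    while queue:  # breadth-first search downward from t
        concept = queue.popleft()
        if not children[concept] and concept != t:
            leaf_distances.append(distance[concept])
        for child in children[concept]:
            if child not in distance:
                distance[child] = distance[concept] + 1
                queue.append(child)
    # Assumption: a leaf has no leaf descendants, so D(t) is taken as 0.
    return sum(leaf_distances) / len(leaf_distances) if leaf_distances else 0.0


for concept in ("Metal", "Precious", "Gold"):
    print(concept, ancestor_count(concept, PARENTS), mean_leaf_distance(concept, PARENTS))
```

In this toy hierarchy a leaf such as Gold has many ancestors and no descendants, while the root Metal has no ancestors and is far from every leaf, which is the contrast the two quantities are meant to capture.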
Metadata Analyser Architecture
1. An interface layer that interacts with the user by
requesting a metadata file, informing the user on the
analysis progress, and outputting the result
2. An application layer that analyses the metadata file
and evaluates the annotations found therein.
3. A data layer that holds the ontologies in local
databases
4. A web API layer that connects the interface layer to
the application layer, coded in commonly used web
technologies
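The separation of the four layers can be illustrated with a small Python sketch. Every class, function, and message format here is hypothetical and chosen only to show how the layers hand work to each other; the actual tool's code, storage, and web technologies are not described in the slides.

```python
import json


class DataLayer:
    """Holds ontology concepts locally (here, an in-memory set of URIs)."""

    def __init__(self, concept_uris):
        self._concepts = set(concept_uris)

    def has_concept(self, uri):
        return uri in self._concepts


class ApplicationLayer:
    """Analyses the annotations of a metadata file against the data layer."""

    def __init__(self, data_layer):
        self._data = data_layer

    def analyse(self, annotations):
        recognised = [a for a in annotations if self._data.has_concept(a)]
        return {"total": len(annotations), "recognised": recognised}


class WebApiLayer:
    """Connects the interface to the application layer through JSON messages."""

    def __init__(self, application):
        self._app = application

    def handle(self, request_body):
        annotations = json.loads(request_body)["annotations"]
        return json.dumps(self._app.analyse(annotations))


def interface(api, annotations, report=print):
    """Interface layer: submits the metadata, reports progress, returns the result."""
    report("Analysing metadata...")
    response = json.loads(api.handle(json.dumps({"annotations": annotations})))
    report(f"Done: {len(response['recognised'])} of {response['total']} annotations recognised")
    return response


api = WebApiLayer(ApplicationLayer(DataLayer(["http://purl.obolibrary.org/obo/CHEBI_30050"])))
result = interface(api, ["http://purl.obolibrary.org/obo/CHEBI_30050", "Sintra, Portugal"])
```

Keeping the ontology lookup behind the data layer means the application layer does not need to know whether the ontologies sit in memory or in a local database.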
Case Study: MetaboLights
• a database of metabolomics experiments
• developed by the EBI since 2012
• Evaluation
– the measures computed on all the resources
– manual evaluation of a selection of resources
– metadata quality before and after a curation step by experts
Manual Evaluation
Lower coverage: not all ontologies used to annotate
the resources were included in the local database
Pre- and post-curation analysis
Human Factor
1. may not know the ontologies that contain the
concepts they need
2. do not fully know the structure of the ontologies
in order to perform annotation with the
appropriate specific terms
3. lack the proper skills to carry out the annotation process because of the technical difficulties associated with this task
4. do not consider data sharing to be relevant
5. consider that the cost of ensuring proper
semantic integration outweighs the benefits
Conclusions
• apparent correlation between specificity and coverage
• weak term coverage (average of 0.25)
• the two proposed measures can effectively measure the effort put into the semantic annotation of digital resources
• Metadata Analyser
– a means to measure the quality of their metadata
– 10,000 times faster than the previous work
Acknowledgments
• The EBI team in charge of the development and maintenance of MetaboLights, for their support in this study.
Software:
https://github.com/lasigeBioTM/MetadataAnalyser

Metadata Analyser: measuring metadata quality

  • 1. Metadata Analyser: measuring metadata quality. Bruno Inácio, João D. Ferreira, and Francisco M. Couto. LaSIGE, Faculdade de Ciências da Universidade de Lisboa, Portugal. PACBB, June 21-23, 2017, Porto, Portugal.
  • 2. Figure 1. Two pages (scan) from Galilei's Sidereus Nuncius (“The Starry Messenger” or “The Herald of the Stars”), Venice, 1610. Source: Goodman A, Pepe A, Blocker AW, Borgman CL, et al. (2014) Ten Simple Rules for the Care and Feeding of Scientific Data. PLoS Comput Biol 10(4): e1003542. doi:10.1371/journal.pcbi.1003542 http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1003542 Galileo integrated the direct results of his observations of Jupiter with careful and clear descriptions of how they were performed. From “Big” Data to Knowledge.
  • 3. <?xml version="1.0"?>
       <rdf:RDF
           xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
         <rdf:Description rdf:about="http://en.wikipedia.org/wiki/Sintra_Collar">
           <dc:description>
             Gold collar. It was made from three circular sectioned and tapering gold bars
             that are fused at the ends forming a penannular neck-ring.
           </dc:description>
           <dc:date>1250BC-800BC (circa)</dc:date>
           <dc:location>
             Sintra, Portugal
             http://yboss.yahooapis.com/geo/placefinder?woeid=748874
           </dc:location>
           <dc:type>
             Gold
             http://purl.obolibrary.org/obo/CHEBI_30050
           </dc:type>
         </rdf:Description>
       </rdf:RDF>
  • 5. Conventional Solution: proper data sharing rules. So let’s create some data-sharing policies and some compliance and enforcement activities.
  • 6. Esperanto: created in 1887 as an easy-to-learn and politically neutral language. But English provides a greater incentive (figure: website languages, March 2014).
  • 7. Data-sharing policies: “Adherence to data-sharing policies is as inconsistent as the policies themselves”; “351 papers covered by some data-sharing policy, only 143 fully adhered to that policy” (~40%); “is time-consuming to do properly, the reward systems aren't there and neither is the stick”; “Of all the data that are made available, what fraction is actually used by someone else?” Steven Wiley in Nature, 2011 http://www.nature.com/news/2011/110914/full/news.2011.536.html
  • 8. Human Factor: “More often than scientists would like to admit, they cannot even recover the data associated with their own published works.” Source: http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003542
  • 9. Goals: 1. to propose two measures of metadata quality; 2. to implement a tool that is able to evaluate these measures in a public repository; 3. to show that these measures are valid and significant in a real-world scientific repository.
  • 10. Measures of metadata quality: 1. Term coverage: the proportion of annotations in the metadata file that link to an ontology concept; 2. Semantic specificity: the average specificity of those ontology concepts.
  • 11. Term Coverage: the ratio between the number of annotations that refer to ontology concepts and the total number of annotations in the metadata file.
  • 12. Semantic specificity: defined for a concept t in terms of two quantities, where A(t) is the number of ancestor concepts of t and D(t) is the average distance between t and all its leaf descendants.
  • 13. Metadata Analyser Architecture: 1. An interface layer that interacts with the user by requesting a metadata file, informing the user of the analysis progress, and outputting the result. 2. An application layer that analyses the metadata file and evaluates the annotations found therein. 3. A data layer that holds the ontologies in local databases. 4. A web API layer that connects the interface layer to the application layer, coded in commonly used web technologies.
  • 14. Case Study: MetaboLights, a database of metabolomics experiments developed by the EBI since 2012. Evaluation: 1. the measures computed on all the resources; 2. a manual evaluation on a selection of resources; 3. metadata quality before and after a curation step by experts.
  • 18. Manual Evaluation: lower coverage, because not all ontologies used to annotate the resources were included in the local database.
  • 20. Human Factor: scientists 1. may not know the ontologies that contain the concepts they need; 2. do not fully know the structure of the ontologies well enough to annotate with the appropriately specific terms; 3. lack the skills to carry out the annotation process because of the technical difficulties associated with this task; 4. do not consider data sharing to be relevant; 5. consider that the cost of ensuring proper semantic integration outweighs the benefits.
  • 21. Conclusions: 1. an apparent correlation between specificity and coverage; 2. weak term coverage (average of 0.25); 3. the two proposed measures can effectively measure the effort put into the semantic annotation of digital resources; 4. Metadata Analyser gives users a means to measure the quality of their metadata and is 10,000 times faster than the previous work.
  • 22. Acknowledgments: we thank the EBI team in charge of the development and maintenance of MetaboLights for their support in this study. Software: https://github.com/lasigeBioTM/MetadataAnalyser
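
The quantities defined on slides 11 and 12 can be illustrated with a small sketch. This is not the Metadata Analyser implementation: the toy metal ontology, the function names (`term_coverage`, `num_ancestors`, `avg_leaf_distance`) and the choices to count an annotation as linked when it exactly matches a known concept URI, to measure distance as the shortest is-a path, and to give a leaf D(t) = 0 are all assumptions made for illustration. The transcript does not reproduce the formula that combines A(t) and D(t) into a specificity score, so the sketch stops at the two quantities.

```python
from collections import deque

# Hypothetical toy ontology (child -> list of is-a parents), for illustration only.
PARENTS = {
    "Metal": [],
    "Precious": ["Metal"],
    "Coinage": ["Metal"],
    "Palladium": ["Precious"],
    "Platinum": ["Precious"],
    "Gold": ["Precious", "Coinage"],
    "Silver": ["Precious", "Coinage"],
    "Copper": ["Coinage"],
}


def invert(parents):
    """Build a parent -> children map from a child -> parents map."""
    children = {term: [] for term in parents}
    for child, term_parents in parents.items():
        for parent in term_parents:
            children[parent].append(child)
    return children


CHILDREN = invert(PARENTS)


def term_coverage(annotations, known_concepts):
    """Ratio of annotations that refer to an ontology concept (slide 11)."""
    if not annotations:
        return 0.0  # assumption: a file with no annotations has zero coverage
    linked = sum(1 for a in annotations if a in known_concepts)
    return linked / len(annotations)


def num_ancestors(term, parents):
    """A(t): number of distinct ancestor concepts of a term (slide 12)."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return len(seen)


def avg_leaf_distance(term, children):
    """D(t): average shortest distance from a term to its leaf descendants."""
    dist = {term: 0}
    queue = deque([term])
    while queue:  # breadth-first search gives shortest path lengths
        current = queue.popleft()
        for child in children[current]:
            if child not in dist:
                dist[child] = dist[current] + 1
                queue.append(child)
    leaf_dists = [d for t, d in dist.items() if t != term and not children[t]]
    # assumption: a leaf has no leaf descendants, so its D(t) is taken as 0
    return sum(leaf_dists) / len(leaf_dists) if leaf_dists else 0.0


# Four annotations, one of which is a known concept URI.
annotations = [
    "http://purl.obolibrary.org/obo/CHEBI_30050",
    "Sintra, Portugal",
    "1250BC-800BC (circa)",
    "Gold collar",
]
known_concepts = {"http://purl.obolibrary.org/obo/CHEBI_30050"}

coverage = term_coverage(annotations, known_concepts)
print(coverage)                                # 0.25
print(num_ancestors("Gold", PARENTS))          # 3
print(avg_leaf_distance("Metal", CHILDREN))    # 2.0
```

In this toy hierarchy a specific term such as Gold has many ancestors and no descendants, while a generic term such as Metal has no ancestors and sits far from the leaves, which is the intuition the two quantities are meant to capture.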