Scientific research is increasingly dependent on publicly available information
and data sharing. So far, the best practice to ensure that data is accessible
and shareable has been to deposit it in public repositories. However, these
repositories often fail to implement mechanisms that measure data quality;
such mechanisms could improve the discoverability of existing data and
contribute to its future integration.
In light of this, we present Metadata Analyser, a tool that measures
metadata quality. It assesses the quality of metadata by considering the
proportion of terms actually linked to ontology concepts, as well as the
specificity of the terms used in the metadata. We applied Metadata Analyser
to MetaboLights, a real-world repository of metabolomics data; the results
show that the tool successfully implements the proposed measures, that the
annotation task indeed receives too little effort, and that our tool can be
used to improve this situation. Metadata Analyser's frontend is available at
http://masterweb-metadataanalyser.rhcloud.com.
1. Metadata Analyser: measuring
metadata quality
Bruno Inácio, João D. Ferreira, and Francisco M. Couto
LaSIGE, Faculdade de Ciências da Universidade de Lisboa, Portugal
PACBB, June 21-23, 2017
Porto, Portugal
2. Figure 1. Two pages (scan) from Galilei's Sidereus Nuncius (“The
Starry Messenger” or “The Herald of the Stars”), Venice, 1610.
Goodman A, Pepe A, Blocker AW, Borgman CL, et al. (2014) Ten Simple Rules for the Care and
Feeding of Scientific Data. PLoS Comput Biol 10(4): e1003542. doi:10.1371/journal.pcbi.1003542
http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1003542
Galileo integrated
• the direct results of his observations of Jupiter
• with careful and clear descriptions of how they were performed
From “Big” Data to Knowledge
3. <?xml version="1.0"?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description rdf:about="http://en.wikipedia.org/wiki/Sintra_Collar">
<dc:description>
Gold collar. It was made from three circular sectioned and tapering gold bars
that are fused at the ends forming a penannular neck-ring.
</dc:description>
<dc:date>1250BC-800BC (circa)</dc:date>
<dc:location>
Sintra, Portugal
http://yboss.yahooapis.com/geo/placefinder?woeid=748874
</dc:location>
<dc:type>
Gold
http://purl.obolibrary.org/obo/CHEBI_30050
</dc:type>
</rdf:Description>
</rdf:RDF>
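The record above mixes free text with links to external resources (a geographic service, a ChEBI concept). As a minimal sketch of how such a record can be inspected, the following parses a slightly abridged copy of the Sintra Collar record with the standard library and counts which Dublin Core fields carry a URI; the parsing approach is our illustration, not the tool's actual code:

```python
import re
import xml.etree.ElementTree as ET

# Abridged copy of the Sintra Collar record from the slide, embedded so the
# demo is self-contained.
RECORD = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://en.wikipedia.org/wiki/Sintra_Collar">
    <dc:description>Gold collar made from three tapering gold bars.</dc:description>
    <dc:date>1250BC-800BC (circa)</dc:date>
    <dc:location>Sintra, Portugal
      http://yboss.yahooapis.com/geo/placefinder?woeid=748874</dc:location>
    <dc:type>Gold
      http://purl.obolibrary.org/obo/CHEBI_30050</dc:type>
  </rdf:Description>
</rdf:RDF>"""

DC = "{http://purl.org/dc/elements/1.1/}"

def linked_annotations(xml_text):
    """Return (linked, total): Dublin Core fields whose text contains a URI."""
    root = ET.fromstring(xml_text)
    fields = [el for el in root.iter() if el.tag.startswith(DC)]
    linked = [el for el in fields if re.search(r"https?://\S+", el.text or "")]
    return len(linked), len(fields)

print(linked_annotations(RECORD))  # (2, 4): location and type carry a link
```

This kind of count is the raw material for the term-coverage measure introduced later in the talk.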
5. Conventional Solution
proper data-sharing rules
• So let's create some Data-sharing Policies
• and some Compliance and Enforcement activities
6. Esperanto
• Created in 1887 as an easy-to-learn and politically neutral language
• But English provides a greater incentive
– e.g. the languages used by websites, March 2014
7. Data-sharing policies
“Adherence to data-sharing policies is as
inconsistent as the policies themselves”
“351 papers covered by some data-sharing policy,
only 143 fully adhered to that policy” (~40%)
“is time-consuming to do properly, the reward
systems aren't there and neither is the stick”
“Of all the data that are made available, what
fraction is actually used by someone else?”
Steven Wiley in Nature, 2011
http://www.nature.com/news/2011/110914/full/news.2011.536.html
8. Human Factor
• “More often than scientists would like to
admit, they cannot even recover the data
associated with their own published works”
http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003542
9. Goals
1. to propose two measures of metadata quality
2. to implement a tool that evaluates these
measures on a public repository
3. to show that these measures are valid and
significant in a real-world scientific repository
10. Measures of metadata quality
1. Term coverage
the proportion of annotations in the metadata file
that link to an ontology concept
2. Semantic specificity
the average specificity of those ontology
concepts
11. Term Coverage
• It is the ratio between
– the number of annotations that refer to ontology
concepts
– and the total number of annotations in the
metadata file
12. Semantic specificity
• A(t) is the number of ancestor concepts
above t
• and D(t) is the average distance between t and
all its leaf descendants
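A(t) and D(t) can be computed over a toy is-a hierarchy as sketched below. The hierarchy is invented, and the way the two quantities are combined into a single score (A(t) / (A(t) + D(t)), so that specific terms, with many ancestors and leaf-like positions, score near 1) is our assumption; the exact formula is the one defined in the paper:

```python
# Toy is-a hierarchy (child -> parents); purely illustrative, not a real ontology.
PARENTS = {
    "entity": [],
    "metabolite": ["entity"],
    "amino_acid": ["metabolite"],
    "glycine": ["amino_acid"],
    "alanine": ["amino_acid"],
}
CHILDREN = {}
for child, parents in PARENTS.items():
    for p in parents:
        CHILDREN.setdefault(p, []).append(child)

def ancestors(t):
    """A(t): the set of concepts reachable upwards from t."""
    seen, stack = set(), list(PARENTS[t])
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(PARENTS[p])
    return seen

def leaf_distances(t, depth=0):
    """Distances from t to each of its leaf descendants (t itself if a leaf)."""
    kids = CHILDREN.get(t, [])
    if not kids:
        return [depth]
    out = []
    for k in kids:
        out.extend(leaf_distances(k, depth + 1))
    return out

def specificity(t):
    # Assumed combination of A(t) and D(t): A grows and D shrinks as t gets
    # more specific, so A(t) / (A(t) + D(t)) rises towards 1 for leaves.
    a = len(ancestors(t))
    dists = leaf_distances(t)
    d = sum(dists) / len(dists)
    return a / (a + d)

print(specificity("glycine"))     # leaf concept: D(t) = 0, score 1.0
print(specificity("metabolite"))  # broader concept, lower score
```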
13. Metadata Analyser Architecture
1. An interface layer that interacts with the user by
requesting a metadata file, informing the user on the
analysis progress, and outputting the result
2. An application layer that analyses the metadata file
and evaluates the annotations found therein
3. A data layer that holds the ontologies in local
databases
4. A web API layer that connects the interface layer to
the application layer, coded in commonly used web
technologies
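The four layers above can be sketched as plain Python classes; the class and method names below are hypothetical stand-ins (the real tool is a web application, available at https://github.com/lasigeBioTM/MetadataAnalyser):

```python
class DataLayer:
    """Holds the ontologies locally (here: an in-memory set of concept URIs)."""
    def __init__(self, concepts):
        self.concepts = set(concepts)
    def has_concept(self, uri):
        return uri in self.concepts

class ApplicationLayer:
    """Analyses a metadata file and evaluates the annotations found therein."""
    def __init__(self, data):
        self.data = data
    def analyse(self, annotations):
        linked = [a for a in annotations if self.data.has_concept(a)]
        coverage = len(linked) / len(annotations) if annotations else 0.0
        return {"coverage": coverage, "linked": len(linked)}

class WebAPILayer:
    """Connects the interface layer to the application layer (a plain
    method call stands in for the HTTP endpoint of the real tool)."""
    def __init__(self, app):
        self.app = app
    def post_metadata(self, annotations):
        return self.app.analyse(annotations)

# Interface layer: submit a metadata file's annotations, output the result.
api = WebAPILayer(ApplicationLayer(DataLayer(
    {"http://purl.obolibrary.org/obo/CHEBI_30050"})))
result = api.post_metadata(
    ["http://purl.obolibrary.org/obo/CHEBI_30050", "Gold"])
print(result)  # {'coverage': 0.5, 'linked': 1}
```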
14. Case Study: MetaboLights
• a database of metabolomics experiments
• developed by the EBI since 2012
• Evaluation
– computed the measures on all the resources
– validated them manually on a selection of resources
– compared metadata quality before and after a
curation step by experts
20. Human Factor
1. may not know the ontologies that contain the
concepts they need
2. do not know the structure of the ontologies
well enough to annotate with appropriately
specific terms
3. lack the skills to carry out the annotation
process, given the technical difficulties
associated with this task
4. do not consider data sharing to be relevant
5. consider that the cost of ensuring proper
semantic integration outweighs the benefits
21. Conclusions
• an apparent correlation between specificity and
coverage
• weak term coverage (0.25 on average)
• the two proposed measures can effectively
measure the effort put into the semantic
annotation of digital resources
• Metadata Analyser
– a means for repositories to measure the quality
of their metadata
– 10,000 times faster than the previous work
22. Acknowledgments
• We thank the EBI team in charge of the
development and maintenance of MetaboLights
for their support in this study.
Software:
https://github.com/lasigeBioTM/MetadataAnalyser