Scientific research is increasingly dependent on publicly available information
and data sharing. So far, the best practice to ensure that data is accessible
and shareable has been to deposit it in public repositories. However, these
repositories often fail to implement mechanisms that measure data quality;
such mechanisms could improve the discoverability of existing data and
contribute to its future integration.
In light of this, we present Metadata Analyser, a tool that measures
metadata quality. It assesses the quality of metadata by considering the
proportion of terms actually linked to ontology concepts, as well as the
specificity of the terms used in the metadata. We applied Metadata Analyser
to MetaboLights, a real-world repository of metabolomics data; the results
show that the tool successfully implements the proposed measures, that the
annotation task indeed receives too little effort, and that our tool can be
used to improve this situation. Metadata Analyser's frontend is available at
http://masterweb-metadataanalyser.rhcloud.com.
1. Metadata Analyser: measuring
metadata quality
Bruno Inácio, João D. Ferreira, and Francisco M. Couto
LaSIGE, Faculdade de Ciências da Universidade de Lisboa, Portugal
PACBB, June 21-23, 2017
Porto, Portugal
2. Figure 1. Two pages (scan) from Galilei's Sidereus Nuncius (“The
Starry Messenger” or “The Herald of the Stars”), Venice, 1610.
Goodman A, Pepe A, Blocker AW, Borgman CL, et al. (2014) Ten Simple Rules for the Care and
Feeding of Scientific Data. PLoS Comput Biol 10(4): e1003542. doi:10.1371/journal.pcbi.1003542
http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1003542
Galileo integrated
• the direct results of his observations of Jupiter
• with careful and clear descriptions of how they were performed
From “Big” Data to Knowledge
3. <?xml version="1.0"?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description rdf:about="http://en.wikipedia.org/wiki/Sintra_Collar">
<dc:description>
Gold collar. It was made from three circular sectioned and tapering gold bars
that are fused at the ends forming a penannular neck-ring.
</dc:description>
<dc:date>1250BC-800BC (circa)</dc:date>
<dc:location>
Sintra, Portugal
http://yboss.yahooapis.com/geo/placefinder?woeid=748874
</dc:location>
<dc:type>
Gold
http://purl.obolibrary.org/obo/CHEBI_30050
</dc:type>
</rdf:Description>
</rdf:RDF>
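The record above mixes free text with links to external resources (a geographic service, a ChEBI concept). As a minimal sketch of how such a record can be inspected, the following parses a slightly abridged copy of the Sintra Collar record with the standard library and counts which Dublin Core fields carry a URI; the parsing approach is our illustration, not the tool's actual code:

```python
import re
import xml.etree.ElementTree as ET

# Abridged copy of the Sintra Collar record from the slide, embedded so the
# demo is self-contained.
RECORD = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://en.wikipedia.org/wiki/Sintra_Collar">
    <dc:description>Gold collar made from three tapering gold bars.</dc:description>
    <dc:date>1250BC-800BC (circa)</dc:date>
    <dc:location>Sintra, Portugal
      http://yboss.yahooapis.com/geo/placefinder?woeid=748874</dc:location>
    <dc:type>Gold
      http://purl.obolibrary.org/obo/CHEBI_30050</dc:type>
  </rdf:Description>
</rdf:RDF>"""

DC = "{http://purl.org/dc/elements/1.1/}"

def linked_annotations(xml_text):
    """Return (linked, total): Dublin Core fields whose text contains a URI."""
    root = ET.fromstring(xml_text)
    fields = [el for el in root.iter() if el.tag.startswith(DC)]
    linked = [el for el in fields if re.search(r"https?://\S+", el.text or "")]
    return len(linked), len(fields)

print(linked_annotations(RECORD))  # (2, 4): location and type carry a link
```

This kind of count is the raw material for the term-coverage measure introduced later in the talk.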
5. Conventional Solution
proper data-sharing rules
• So let's create some Data-sharing Policies
• and some Compliance and Enforcement activities
6. Esperanto
• Created in 1887 as an easy-to-learn and politically neutral language
• But English provides a greater incentive
– e.g. the languages used by websites, March 2014
7. Data-sharing policies
“Adherence to data-sharing policies is as
inconsistent as the policies themselves”
“351 papers covered by some data-sharing policy,
only 143 fully adhered to that policy” (~40%)
“is time-consuming to do properly, the reward
systems aren't there and neither is the stick”
“Of all the data that are made available, what
fraction is actually used by someone else?”
Steven Wiley in Nature, 2011
http://www.nature.com/news/2011/110914/full/news.2011.536.html
8. Human Factor
• “More often than scientists would like to
admit, they cannot even recover the data
associated with their own published works”
http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003542
9. Goals
1. to propose two measures of metadata quality
2. to implement a tool that evaluates these
measures on a public repository
3. to show that these measures are valid and
significant in a real-world scientific repository
10. Measures of metadata quality
1. Term coverage
the proportion of annotations in the metadata file
that link to an ontology concept
2. Semantic specificity
the average specificity of those ontology
concepts
11. Term Coverage
• It is the ratio between
– the number of annotations that refer to ontology
concepts
– and the total number of annotations in the
metadata file
12. Semantic specificity
• A(t) is the number of ancestor concepts
above t
• and D(t) is the average distance between t and
all its leaf descendants
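A(t) and D(t) can be computed over a toy is-a hierarchy as sketched below. The hierarchy is invented, and the way the two quantities are combined into a single score (A(t) / (A(t) + D(t)), so that specific terms, with many ancestors and leaf-like positions, score near 1) is our assumption; the exact formula is the one defined in the paper:

```python
# Toy is-a hierarchy (child -> parents); purely illustrative, not a real ontology.
PARENTS = {
    "entity": [],
    "metabolite": ["entity"],
    "amino_acid": ["metabolite"],
    "glycine": ["amino_acid"],
    "alanine": ["amino_acid"],
}
CHILDREN = {}
for child, parents in PARENTS.items():
    for p in parents:
        CHILDREN.setdefault(p, []).append(child)

def ancestors(t):
    """A(t): the set of concepts reachable upwards from t."""
    seen, stack = set(), list(PARENTS[t])
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(PARENTS[p])
    return seen

def leaf_distances(t, depth=0):
    """Distances from t to each of its leaf descendants (t itself if a leaf)."""
    kids = CHILDREN.get(t, [])
    if not kids:
        return [depth]
    out = []
    for k in kids:
        out.extend(leaf_distances(k, depth + 1))
    return out

def specificity(t):
    # Assumed combination of A(t) and D(t): A grows and D shrinks as t gets
    # more specific, so A(t) / (A(t) + D(t)) rises towards 1 for leaves.
    a = len(ancestors(t))
    dists = leaf_distances(t)
    d = sum(dists) / len(dists)
    return a / (a + d)

print(specificity("glycine"))     # leaf concept: D(t) = 0, score 1.0
print(specificity("metabolite"))  # broader concept, lower score
```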
13. Metadata Analyser Architecture
1. An interface layer that interacts with the user by
requesting a metadata file, informing the user on the
analysis progress, and outputting the result
2. An application layer that analyses the metadata file
and evaluates the annotations found therein
3. A data layer that holds the ontologies in local
databases
4. A web API layer that connects the interface layer to
the application layer, coded in commonly used web
technologies
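The four layers above can be sketched as plain Python classes; the class and method names below are hypothetical stand-ins (the real tool is a web application, available at https://github.com/lasigeBioTM/MetadataAnalyser):

```python
class DataLayer:
    """Holds the ontologies locally (here: an in-memory set of concept URIs)."""
    def __init__(self, concepts):
        self.concepts = set(concepts)
    def has_concept(self, uri):
        return uri in self.concepts

class ApplicationLayer:
    """Analyses a metadata file and evaluates the annotations found therein."""
    def __init__(self, data):
        self.data = data
    def analyse(self, annotations):
        linked = [a for a in annotations if self.data.has_concept(a)]
        coverage = len(linked) / len(annotations) if annotations else 0.0
        return {"coverage": coverage, "linked": len(linked)}

class WebAPILayer:
    """Connects the interface layer to the application layer (a plain
    method call stands in for the HTTP endpoint of the real tool)."""
    def __init__(self, app):
        self.app = app
    def post_metadata(self, annotations):
        return self.app.analyse(annotations)

# Interface layer: submit a metadata file's annotations, output the result.
api = WebAPILayer(ApplicationLayer(DataLayer(
    {"http://purl.obolibrary.org/obo/CHEBI_30050"})))
result = api.post_metadata(
    ["http://purl.obolibrary.org/obo/CHEBI_30050", "Gold"])
print(result)  # {'coverage': 0.5, 'linked': 1}
```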
14. Case Study: MetaboLights
• a database of metabolomics experiments
• developed by the EBI since 2012
• Evaluation
– computed the measures on all the resources
– validated them manually on a selection of resources
– compared metadata quality before and after a
curation step by experts
20. Human Factor
1. may not know the ontologies that contain the
concepts they need
2. do not know the structure of the ontologies
well enough to annotate with appropriately
specific terms
3. lack the skills to carry out the annotation
process, given the technical difficulties
associated with this task
4. do not consider data sharing to be relevant
5. consider that the cost of ensuring proper
semantic integration outweighs the benefits
21. Conclusions
• an apparent correlation between specificity and
coverage
• weak term coverage (0.25 on average)
• the two proposed measures can effectively
measure the effort put into the semantic
annotation of digital resources
• Metadata Analyser
– a means for repositories to measure the quality
of their metadata
– 10,000 times faster than the previous work
22. Acknowledgments
• We thank the EBI team in charge of the
development and maintenance of MetaboLights
for their support in this study.
Software:
https://github.com/lasigeBioTM/MetadataAnalyser