5. multilinguality problem
★ Mona Lisa → 456 results
★ La Gioconda → 365 results
★ La Joconde → 71 results
http://www.europeana.eu/portal/en/record/90402/RP_F_00_351.html
http://bit.ly/qa-cas2018
6. strange values
from a template?
more examples in Report and Recommendations from the Task Force on Metadata Quality (2015)
7. consequence of metadata quality issues
main purpose of metadata: to access content
bad metadata vs. no metadata
no access to data → no data usage
more explanation: Data on the Web Best Practices, W3C Working Draft, https://www.w3.org/TR/dwbp/
8. purpose of assessment
we feel that there are “good” and “bad” metadata records
we would like to achieve metrics like this:
[figure: functional requirements, rated good / acceptable / bad]
9. metadata quality metrics in literature
★ completeness: number of metadata elements filled out
★ accuracy: data correspond to the resource that is being described
★ consistency: values comply with what the metadata scheme defines
★ objectiveness: values describe the resource in an unbiased way
★ appropriateness: values facilitate the deployment of search
★ correctness: syntactically and grammatically correct language
★ ...
Bruce and Hillmann (2004); Ochoa and Duval (2009); Palavitsinis (2014); Zaveri et al. (2015)
https://www.zotero.org/groups/488224/metadata_assessment
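The completeness metric above can be illustrated with a minimal sketch. It assumes a record is a plain dictionary and that completeness is the share of expected fields holding at least one non-empty value; the field names and the sample record are hypothetical and do not come from any particular metadata scheme.

```python
def completeness(record, fields):
    """Share of the expected fields that hold at least one non-empty value."""
    def filled(value):
        if value is None:
            return False
        if isinstance(value, str):
            return value.strip() != ""
        if isinstance(value, (list, tuple, set)):
            return any(filled(v) for v in value)
        return True

    if not fields:
        raise ValueError("fields must not be empty")
    return sum(1 for f in fields if filled(record.get(f))) / len(fields)


# Hypothetical field list and record, for illustration only.
FIELDS = ["title", "creator", "description", "subject", "rights"]
record = {
    "title": "Mona Lisa",
    "creator": ["Leonardo da Vinci"],
    "description": "  ",   # whitespace only, counted as empty
    "subject": [],          # empty list, counted as empty
}

score = completeness(record, FIELDS)
print(score)
```

Treating whitespace-only strings and empty lists as missing is a design choice of this sketch; a stricter or looser rule would change the score.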
13. technical proposal
“Metadata Quality Assessment Framework”
a generic tool for measuring metadata quality
★ adaptable to different metadata schemes
★ scalable (to Big Data)
★ understandable reports for data curators
★ open source
15. What to measure?
★ Structural and semantic features
Completeness, cardinality, uniqueness, length, dictionary entry, data type conformance, multilinguality (generic metrics)
★ Functional requirement analysis / Discovery scenarios
Requirements of the most important functions
★ Problem catalog
Known metadata problems
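Three of the generic metrics listed above (cardinality, multilinguality, uniqueness) can be sketched in a few lines. This is an illustration under stated assumptions, not the framework's implementation: records are hypothetical dictionaries whose field values are `(text, language_tag)` pairs, and uniqueness is simplified to one divided by the number of records sharing the same first value.

```python
from collections import Counter

# Hypothetical records; each title is a (text, language tag) pair.
records = [
    {"id": "r1", "title": [("Mona Lisa", "en"), ("La Gioconda", "it"), ("La Joconde", "fr")]},
    {"id": "r2", "title": [("Portrait", "en")]},
    {"id": "r3", "title": [("Portrait", None)]},   # no language tag
    {"id": "r4", "title": []},                      # field missing in practice
]


def cardinality(record, field):
    """Number of values (instances) the field has in this record."""
    return len(record.get(field, []))


def language_count(record, field):
    """Number of distinct language tags among the field's values."""
    return len({lang for _, lang in record.get(field, []) if lang})


def uniqueness(records, field):
    """Simplified score: 1 / (records sharing the first value), 0.0 if empty."""
    counts = Counter(r[field][0][0] for r in records if r.get(field))
    return {
        r["id"]: (1 / counts[r[field][0][0]] if r.get(field) else 0.0)
        for r in records
    }


for r in records:
    print(r["id"], cardinality(r, "title"), language_count(r, "title"))
print(uniqueness(records, "title"))
```

A title such as "Portrait" that repeats across records scores lower on uniqueness, which is the kind of signal a problem catalog could flag as a generic title.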
29. K-means clustering
Spark (Scala)
increasing the number of clusters decreases the distance from the centroids
after a point this gain is not so big (“elbow effect”), in theory
big number of low quality records
small clusters with ‘in between’ quality records
the acceptable average
clusters with good quality records
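The elbow idea can be tried out with a small, plain Python stand-in. This is not the Spark (Scala) job mentioned on the slide: it is a minimal Lloyd's-algorithm sketch, and the two-dimensional "quality vectors" are made-up values chosen only to form three loose groups.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=50, seed=0):
    """Lloyd's algorithm; returns (centroids, total squared distance to nearest centroid)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        updated = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    cost = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, cost


# Hypothetical quality vectors, e.g. (completeness, multilinguality).
points = [(0.10, 0.00), (0.15, 0.05), (0.20, 0.00),
          (0.50, 0.30), (0.55, 0.35),
          (0.90, 0.80), (0.95, 0.85), (0.85, 0.90)]

# Cost for an increasing number of clusters; the "elbow" is where the drop flattens.
costs = {k: kmeans(points, k)[1] for k in range(1, 6)}
for k, cost in costs.items():
    print(k, round(cost, 4))
```

Because the starting centroids are sampled at random, a given run can land in a local optimum, which is one reason the elbow is cleaner in theory than on real records.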
30. more information
quality dashboard: http://144.76.218.178/europeana-qa/
https://pro.europeana.eu/project/data-quality-committee
https://github.com/pkiraly (GPL-3.0, binaries, scripts)
http://pkiraly.github.io, https://twitter.com/kiru
Would you like to cooperate? (I do!) peter.kiraly@gwdg.de