2. Towards metadata measurement. Glossary
★ Metadata here: cultural heritage metadata (descriptions of books etc.)
★ Europeana: aggregates metadata from 3500+ cultural heritage institutions, http://europeana.eu
★ Big Data here: 10-100 million metadata records, 100 GB - 1.5 TB
★ EDM: Europeana Data Model, Europeana’s metadata schema
★ MARC: MAchine-Readable Cataloging, a library metadata standard
4. Towards metadata measurement. Hypothesis
By measuring structural elements we can approximate metadata record quality
≃ metadata smell
5. Measuring metadata quality. Proposal II. Tool
“Metadata Quality Assurance Framework”
a generic tool for measuring metadata quality
★ adaptable to different metadata schemes
★ scalable (to Big Data)
★ understandable reports for data curators
★ open source
7. Towards metadata measurement. What to measure?
★ Structural and semantic features
Cardinality, uniqueness, length, dictionary entry, data type conformance, multilinguality (schema-independent measurements)
★ Functional requirement analysis / Discovery scenarios
Requirements of the most important functions
★ Problem catalog
Known metadata problems
8. Towards metadata measurement. Dimensions and metrics
★ Completeness: degree to which all required information is present
CM1: schema completeness = no. of classes and properties represented / total no. of classes and properties
CM2: property completeness
CM3: population completeness
CM4: interlinking completeness
★ Availability: the extent to which data is present and ready for use
★ Licensing: granting of permission to re-use under defined conditions
...
Ngomo et al., Introduction to Linked Data and Its Lifecycle on the Web (2014)
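CM1 is a plain ratio, so it can be sketched in a few lines. The Python below is only an illustration under simplifying assumptions: a record is a dict mapping property names to lists of values, the schema is a flat list of property names (classes are ignored), and the helper `schema_completeness` as well as the sample schema and record are hypothetical.

```python
def schema_completeness(record, schema_fields):
    """Share of schema fields with at least one non-empty value in the record."""
    if not schema_fields:
        raise ValueError("schema_fields must not be empty")
    present = sum(
        1
        for name in schema_fields
        if any(value not in (None, "") for value in record.get(name, []))
    )
    return present / len(schema_fields)


# Hypothetical schema and record, for illustration only.
schema = ["dc:title", "dc:creator", "dc:subject", "dc:rights"]
record = {
    "dc:title": ["Swan Lake"],
    "dc:creator": [],  # field exists but carries no value
    "dc:subject": ["Ballet", "Opera"],
}

score = schema_completeness(record, schema)  # 2 of 4 fields are filled
```

An empty list counts as missing here; whether an empty field should count as "represented" is a design choice the slide does not settle.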
9. Towards metadata measurement. Requirements // element—function map
Europeana: sub-dimensions
MARC: Summary of Mapping to User Tasks, http://www.loc.gov/marc/marc-functional-analysis/source/analysis.pdf
10. Measuring completeness. Completeness score calculation
Completeness score, calculated in two ways (Method I, Method II): weighted cardinality, weighted functionality
Pearson’s correlation coefficient is 0.52
weight: 2.5 × score
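For readers who want to reproduce this kind of comparison, the sketch below shows how a Pearson correlation coefficient between two per-record score series can be computed. It is a generic illustration, not the framework's code: the `pearson` helper is hypothetical and the two score lists are made-up toy values, not the data behind the 0.52 figure.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(spread_x * spread_y)


# Toy per-record scores (assumed values, one entry per record).
cardinality_scores = [0.20, 0.35, 0.30, 0.50]
functionality_scores = [0.40, 0.45, 0.70, 0.65]

r = pearson(cardinality_scores, functionality_scores)
```

A coefficient near 1 would mean the two methods rank records almost identically; a moderate value such as the reported 0.52 suggests they capture partly different aspects of completeness.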
bit.ly/mq-dh2017
11. Measuring completeness. Completeness score distribution
Distribution of completeness scores in one dataset.
functionality-based method
★ higher scores
★ more varied
cardinality-based method
★ lower scores
★ less varied
combined method
★ closer to the functionality-based method
12. Towards metadata measurement. Field frequency per collections
Figure annotations: no record has alternative title; every record has alternative title; filters
13. Towards metadata measurement. Record level
# subjects as plain literals (no language tag)
<#record> a ore:Proxy ;
  dc:subject "Ballet", "Opera" .

# subjects as links to concepts with language-tagged labels
<#record> a ore:Proxy ; edm:europeanaProxy true ;
  dc:subject <http://data.europeana.eu/concept/base/264>
    , <http://data.europeana.eu/concept/base/247> .
<http://data.europeana.eu/concept/base/264> a skos:Concept ;
  skos:prefLabel "Ballett"@no, "बैले"@hi, "Ballett"@de, "Балет"@be, "Балет"@ru
    , "Balé"@pt, "Балет"@bg, "Baletas"@lt, "Balet"@hr, "Balets"@lv .
<http://data.europeana.eu/concept/base/247>
  skos:prefLabel "Opera"@no, "ओपेरा (गीतिनाटक)"@hi, "Oper"@de, "Ooppera"@fi
    , "Опера"@be, "Опера"@ru, "Ópera"@pt, "Опера"@bg, "Opera"@lt .
Plain literals: distinct languages 0, tagged literals 0
After dereferencing: distinct languages 11, tagged literals 19, literals per language 1.7
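The three multilinguality figures can be recomputed from the labels in the record example. The sketch below assumes literals are given as (value, language tag) pairs with `None` for untagged strings; the `multilinguality` helper is hypothetical, and returning 0.0 for "literals per language" when no tagged literal exists is an assumption, since the slide gives no value for that case.

```python
def multilinguality(literals):
    """Count language-tagged literals and distinct languages in a list of
    (value, language_tag_or_None) pairs."""
    tagged = [(value, lang) for value, lang in literals if lang]
    languages = {lang for _, lang in tagged}
    # Assumption: report 0.0 when there is no tagged literal at all.
    per_language = len(tagged) / len(languages) if languages else 0.0
    return {
        "distinct_languages": len(languages),
        "tagged_literals": len(tagged),
        "literals_per_language": round(per_language, 1),
    }


# First record: plain string subjects without language tags.
plain = [("Ballet", None), ("Opera", None)]

# Second record after dereferencing the two concept URIs.
ballet = [("Ballett", "no"), ("बैले", "hi"), ("Ballett", "de"), ("Балет", "be"),
          ("Балет", "ru"), ("Balé", "pt"), ("Балет", "bg"), ("Baletas", "lt"),
          ("Balet", "hr"), ("Balets", "lv")]
opera = [("Opera", "no"), ("ओपेरा (गीतिनाटक)", "hi"), ("Oper", "de"),
         ("Ooppera", "fi"), ("Опера", "be"), ("Опера", "ru"), ("Ópera", "pt"),
         ("Опера", "bg"), ("Opera", "lt")]

before = multilinguality(plain)
after = multilinguality(ballet + opera)
```

With these inputs `after` holds 11 distinct languages, 19 tagged literals and 1.7 literals per language, matching the numbers on the slide.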
17. Towards metadata measurement. Flexible measurement
API
★ Addressing and iterating over schema elements
○ schema.getFields()
○ field.getPath(), field.getSubdimensions(), ...
★ Abstracting the metrics
○ metric1.measure(), metric2.measure()
○ metric1.getResult(), metric2.getResult()
★ Making the process configurable (turn on-off metrics)
○ configuration.enableMetricX()
○ configuration.disableMetricY()
★ Unified reporting data structure
Unified statistical analysis
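The four ideas above (iterable schema, abstract metrics, switchable configuration, one report structure) can be illustrated with a small sketch. It is written in Python for brevity and is not the framework's actual API: every class, method and field name below is a hypothetical stand-in that only mirrors the calls listed on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Field:
    path: str
    subdimensions: list = field(default_factory=list)


@dataclass
class Schema:
    fields: list

    def get_fields(self):
        return list(self.fields)


class CompletenessMetric:
    """Share of schema fields that have a value in the record."""
    name = "completeness"

    def __init__(self, schema):
        self.schema = schema
        self._result = None

    def measure(self, record):
        fields = self.schema.get_fields()
        present = sum(1 for f in fields if record.get(f.path))
        self._result = present / len(fields)

    def get_result(self):
        return self._result


class ValueCountMetric:
    """Total number of values in the record."""
    name = "value_count"

    def __init__(self):
        self._result = None

    def measure(self, record):
        self._result = sum(len(values) for values in record.values())

    def get_result(self):
        return self._result


class Configuration:
    """Registry that lets individual metrics be switched on and off."""

    def __init__(self):
        self._metrics = {}
        self._enabled = set()

    def register(self, metric):
        self._metrics[metric.name] = metric
        self._enabled.add(metric.name)

    def enable(self, name):
        self._enabled.add(name)

    def disable(self, name):
        self._enabled.discard(name)

    def enabled_metrics(self):
        return [m for name, m in self._metrics.items() if name in self._enabled]


def run(record, configuration):
    """Run every enabled metric and collect results in one flat report."""
    report = {}
    for metric in configuration.enabled_metrics():
        metric.measure(record)
        report[metric.name] = metric.get_result()
    return report


schema = Schema([Field("dc:title", ["searchability"]), Field("dc:subject")])
record = {"dc:title": ["Swan Lake", "Лебединое озеро"]}

config = Configuration()
config.register(CompletenessMetric(schema))
config.register(ValueCountMetric())
full_report = run(record, config)

config.disable("value_count")
reduced_report = run(record, config)
```

Because every metric exposes the same `measure` and `get_result` pair, the runner and the report structure stay unchanged when a metric is added or switched off.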
19. Towards metadata measurement. Batch API
client → Metadata QA (request → response)
measurement
/batch/measuring/start → sessionID
/batch/[recordId] → csv (for each record)
/batch/measuring/stop → “success” | “failure”
analysis
/batch/analyzing/start → “success” | “failure”
/batch/analyzing/status → “in progress” | “ready” (polled periodically)
/batch/analyzing/retrieve → compressed package
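The call sequence of the batch API can be sketched as a small client loop. The slide does not specify HTTP methods, how the sessionID is passed along, or payload formats, so the sketch below injects the transport as a plain callable and tests the flow against an in-memory stand-in; `run_batch`, `FakeService` and all return values other than the quoted status strings are hypothetical.

```python
import time


def run_batch(call, records, poll_interval=0.0, max_polls=100):
    """Drive one measurement + analysis cycle.

    call(path, payload=None) performs the request and returns the response;
    how it reaches the service is left to the caller (assumption).
    """
    session_id = call("/batch/measuring/start")
    for record_id, record in records.items():
        call(f"/batch/{record_id}", record)  # one request per record
    if call("/batch/measuring/stop") != "success":
        raise RuntimeError("measuring did not stop cleanly")
    if call("/batch/analyzing/start") != "success":
        raise RuntimeError("analysis could not be started")
    for _ in range(max_polls):  # poll until the analysis is ready
        if call("/batch/analyzing/status") == "ready":
            return session_id, call("/batch/analyzing/retrieve")
        time.sleep(poll_interval)
    raise TimeoutError("analysis not ready after polling")


class FakeService:
    """In-memory stand-in for the Metadata QA service, for illustration."""

    def __init__(self):
        self.calls = []
        self._status = iter(["in progress", "ready"])

    def __call__(self, path, payload=None):
        self.calls.append(path)
        if path == "/batch/measuring/start":
            return "session-1"
        if path == "/batch/analyzing/status":
            return next(self._status)
        if path == "/batch/analyzing/retrieve":
            return b"compressed package"
        if path in ("/batch/measuring/stop", "/batch/analyzing/start"):
            return "success"
        return "csv"  # per-record measurement result


service = FakeService()
session_id, package = run_batch(service, {"r1": "<record 1>", "r2": "<record 2>"})
```

Injecting the transport keeps the sequencing logic testable without a running service; a real client would replace `FakeService` with a function that issues the actual requests.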
20. Towards metadata measurement. Community bibliography
zotero.org/groups/metadata_assessment
dlfmetadataassessment.github.io
21. Towards metadata measurement. Further steps
★ Translate the results into documentation, recommendations
★ Communication with data providers
★ Human evaluation of metadata quality
★ Cooperation with other projects
★ Incorporating into ingestion process
★ Shapes Constraint Language (SHACL) for defining patterns
★ Process usage statistics
★ Measuring changes of scores
★ Machine learning based classification & clustering
Slide labels: human, analysis, technical
22. Towards metadata measurement. Links
★ Europeana Data Quality Committee // http://pro.europeana.eu/europeana-tech/data-quality-committee
★ site // http://144.76.218.178/europeana-qa/
★ source code (GPL v3.0) // http://pkiraly.github.io/about/#source-codes
★ Europeana data (CC0) // http://hdl.handle.net/21.11101/0000-0001-781F-7
★ Library of Congress data (OA) // http://www.loc.gov/cds/products/marcDist.php
★ contact: peter.kiraly@gwdg.de, @kiru