Towards an extensible measurement of metadata quality (DATeCH 2017)
1. Towards an extensible measurement
of metadata quality
Péter Király
Way to DATeCH, 2017-05-22
2. Measuring metadata quality. The problem
there are “good” and “bad” metadata records
but we don’t have clear metrics like this:
functional requirements
good
acceptable
bad
3. Measuring metadata quality. Non-informative values
non-informative dc:title:
“photograph, framed”,
“group photograph”
“photograph”
informative dc:title:
“Photograph of Sir Dugald Clerk”,
“Photograph of "Puffing Billy"”
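A minimal sketch of how such titles might be flagged automatically. The word list `GENERIC_TERMS` and the function name are illustrative assumptions, not part of the presented tool; a real list would have to be curated per collection.

```python
import re

# Hypothetical list of generic object-type words (assumption for illustration).
GENERIC_TERMS = {"photograph", "group", "framed", "untitled"}


def is_non_informative(title: str) -> bool:
    """Return True when the title consists only of generic terms (or is empty)."""
    words = re.findall(r"[a-z]+", title.lower())
    return not words or all(word in GENERIC_TERMS for word in words)


if __name__ == "__main__":
    for title in ["photograph, framed", "Photograph of Sir Dugald Clerk"]:
        print(title, "->", is_non_informative(title))
```

A title passes as informative as soon as it contains a single word outside the list, so the check is deliberately lenient.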
4. Measuring metadata quality. Copy & paste cataloging
from a template?
more examples in Report and Recommendations from the Task Force on Metadata Quality (2015)
5. Measuring metadata quality. Why is data quality important?
“Fitness for purpose” (QA principle)
no metadata → no access to data → no data usage
more explanation:
Data on the Web Best Practices
W3C Working Draft, https://www.w3.org/TR/dwbp/
6. Measuring metadata quality. Hypothesis
by measuring structural elements we
can predict metadata record quality
≃ metadata smell
7. Measuring metadata quality. Purposes
▪ improve the metadata
▪ services: good data → reliable functions
▪ better metadata schema & documentation
▪ propagate “good practice”
8. Measuring metadata quality. What to measure?
▪ Structural and semantic features
Cardinality, uniqueness, length, dictionary entry, data type conformance,
multilinguality (schema-independent measurements)
▪ Functional requirement analysis / Discovery scenarios
Requirements of the most important functions
▪ Problem catalog
Known metadata problems
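Two of the schema-independent features listed above can be sketched as follows. The record layout (a dict mapping field names to lists of values) and the exact definitions of completeness and uniqueness are assumptions chosen for illustration; the measured tool may define them differently.

```python
from collections import Counter


def completeness(record: dict, fields: list) -> float:
    """Share of the given fields that have at least one value in the record."""
    present = sum(1 for field in fields if record.get(field))
    return present / len(fields)


def uniqueness(records: list, field: str) -> float:
    """Share of value occurrences in `field` whose value appears only once."""
    counts = Counter(value for record in records for value in record.get(field, []))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(1 for count in counts.values() if count == 1) / total


if __name__ == "__main__":
    records = [
        {"dc:title": ["photograph"]},
        {"dc:title": ["photograph"], "dc:subject": ["portraits"]},
        {"dc:title": ["Photograph of Sir Dugald Clerk"]},
    ]
    print(completeness(records[1], ["dc:title", "dc:subject", "dc:creator"]))
    print(uniqueness(records, "dc:title"))
```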
9. Measuring metadata quality. Metadata requirements // User scenario
As a user I want to be able to filter by whether a person is the subject
of a book, or its author, engraver, printer etc.
Metadata analysis
In each case the underlying requirement is that the relevant EDM
fields for objects be populated with URIs rather than free text. These
URIs need to be related, at a minimum, to a label for each of the
supported languages.
Measurement rules
▪ the relevant field values should be resolvable URIs
▪ each URI should be associated with labels in multiple languages
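The two measurement rules could be checked roughly as below. This sketch only tests whether a value is syntactically an HTTP(S) URI; actual resolvability would need a network request, which is left out. The `labels` lookup (URI to a dict of language code to label) and the threshold `min_languages` are assumptions for illustration.

```python
from urllib.parse import urlparse


def looks_like_uri(value: str) -> bool:
    """Syntactic check only: an http(s) scheme and a host are present."""
    parts = urlparse(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def check_field(values: list, labels: dict, min_languages: int = 2) -> list:
    """Return (value, problem) pairs for values that break one of the rules."""
    problems = []
    for value in values:
        if not looks_like_uri(value):
            problems.append((value, "not a URI"))
        elif len(labels.get(value, {})) < min_languages:
            problems.append((value, "too few language labels"))
    return problems


if __name__ == "__main__":
    uri = "http://www.geonames.org/2921044/federal-republic-of-germany"
    labels = {uri: {"en": "Germany", "de": "Deutschland"}}
    print(check_field([uri, "Germany"], labels))
```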
10. Measuring metadata quality. Metadata requirements // Supported functions
#1 Resource Discovery
★ Search: search for a resource corresponding to stated criteria (i.e., to search either a single entity or a set of entities using an attribute or relationship of the entity as the search criteria)
★ Identify: confirm that the entity described or located corresponds to the entity sought
★ Select: choose an entity that meets the user’s requirements
★ Obtain: access a resource either physically or electronically
#2 Resource Use
★ Restrict
★ Manage
★ Operate
★ Interpret
#3 Data Management
★ Identify
★ Process
★ Sort
★ Display
Functional Analysis of the MARC 21 Bibliographic and Holdings Formats
http://www.loc.gov/marc/marc-functional-analysis/source/analysis.pdf
11. Measuring metadata quality. Metadata requirements // element—function map
Europeana sub-dimensions MARC Summary of Mapping to User Tasks
12. Measuring metadata quality. The data workflow (in Europeana)
Dublin Core, LIDO, EAD, MARC, EDM, custom, ... → data transformations → Europeana Data Model (EDM)
13. Measuring metadata quality. Measurement
overall view collection view record view
Completeness
Field cardinality
Uniqueness
Multilinguality
Language specification
Problem catalog
etc.
links
measurements
aggregated statistics
metrics
14. Measuring metadata quality. Field frequency per collections
no record has alternative title
every record has alternative title
filters
15. Measuring metadata quality. Details of field cardinality
128 subjects in one record
median is 0, mean is close to 1
link to interesting records
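A small sketch of how such per-field cardinality statistics might be computed, including the "interesting" records with the highest count. The record layout and the function name are assumptions for illustration, and the sample data is invented, not taken from Europeana.

```python
import statistics


def cardinality_summary(records: dict, field: str) -> dict:
    """Summarize how many values of `field` each record has.

    `records` maps a record id to a dict of field name -> list of values.
    """
    counts = {rec_id: len(rec.get(field, [])) for rec_id, rec in records.items()}
    highest = max(counts.values())
    return {
        "median": statistics.median(counts.values()),
        "mean": statistics.mean(counts.values()),
        "max": highest,
        # Records worth a closer look: those reaching the maximum count.
        "interesting": [rec_id for rec_id, n in counts.items() if n == highest],
    }


if __name__ == "__main__":
    sample = {
        "r1": {},
        "r2": {},
        "r3": {"dc:subject": []},
        "r4": {"dc:subject": ["a"]},
        "r5": {"dc:subject": ["a", "b", "c", "d"]},
    }
    print(cardinality_summary(sample, "dc:subject"))
```

A median of 0 with a mean near 1 signals a skewed distribution: most records lack the field while a few carry many values.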
16. Measuring metadata quality. Multilinguality
@resource is a URI
@ = language notation in RDF
no language specification
19. Measuring metadata quality. Multilinguality
★ Mona Lisa → 456 results
★ La Gioconda → 365 results
★ La Joconde → 71 results
http://www.europeana.eu/portal/en/record/90402/RP_F_00_351.html
20. Measuring metadata quality. What Could be Measured?
★ Number of (distinct) languages in the metadata
★ Number of tagged literals
★ Tagged literals per language
Requirement: language annotations / tags!
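The three measures above can be sketched in a few lines. The representation of a literal as a `(value, language)` pair, with `None` for an untagged literal, is an assumption made for illustration.

```python
from collections import Counter


def language_metrics(literals: list) -> dict:
    """Compute language-related counts for a list of (value, language) pairs.

    An untagged literal carries None (or an empty string) as its language.
    """
    tagged = [lang for _value, lang in literals if lang]
    per_language = Counter(tagged)
    return {
        "distinct_languages": len(per_language),
        "tagged_literals": len(tagged),
        "tagged_literals_per_language": dict(per_language),
    }


if __name__ == "__main__":
    subject = [("Germany", "en"), ("Deutschland", "de"), ("Germany", None)]
    print(language_metrics(subject))
```

Untagged literals contribute nothing to any of the counts, which is why language annotations are a precondition for the measurement.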
21. Measuring metadata quality. Distinct Languages
Text w/o language annotation (dc.subject: Germany) → 0
Text with language annotation (dc.subject: Germany@en) → 1
Text with several language annotations (dc.subject: Germany@en, Deutschland@de) → 2
Link to (multilingual) vocabulary (http://www.geonames.org/2921044/federal-republic-of-germany) → n
25. Measuring metadata quality. Good example
Descriptive fields
dc:title
"Die Mauer muß weg!"@de
"Die Mauer muß weg! (The Wall must go!)"@en
dc:description
"Kommentiertes Fotorama mit Bildern von 1989-1990 in Berlin"@de
"Annotated images from 1989-1990 in Berlin"@en
Subject headings
Place/skos:prefLabel
"Brandenburger Tor"@de
"Brandenburg Gate"@en
"Grenzübergang Potsdamer Platz"@de
"Potsdamer Platz border crossing"@en
"Reichstag"@de
"Reichstag building"@en
32. Measuring metadata quality. Layers
source field link value ① ② ③ ④
:provider dc:subject literal "special relativity"@en ① ② ③ ④
dc:creator standard "Einstein, Albert"@de ① ② ③ ④
dc:type non-std "Books in general"@en ② ④
:enhancement dc:subject standard "Physics"@en ③ ④
① data provider's proxy and dereferenceable enrichments
② data provider's proxy and all enrichments
③ all proxies and dereferenceable enrichments
④ all proxies and all enrichments
credit: Antoine Isaac
36. Measuring metadata quality. Batch API
client Metadata QA
/batch/measuring/start
sessionID
/batch/[recordId]
csv
for each record
/batch/measuring/stop
“success” | “failure”
/batch/analyzing/start
“success” | “failure”
/batch/analyzing/status
“in progress” | “ready”
/batch/analyzing/retrieve
compressed package
periodically
measurement
analysis
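The call sequence on this slide might be driven by a client along the following lines. Only the endpoint paths and the response values come from the slide; the `send` callable (a stand-in for whatever HTTP transport is used), the HTTP methods, and the way the session ID is passed along are not specified there and are assumptions of this sketch.

```python
import time


def run_batch(send, records, poll_interval=0.0, max_polls=100):
    """Run the measuring phase, then the analysis phase, and fetch the result.

    `send(path, body=None)` is a hypothetical transport function that performs
    the request and returns the response payload.
    `records` is an iterable of (record_id, record_body) pairs.
    """
    session_id = send("/batch/measuring/start")

    # Measurement: one call per record, each returning a CSV line.
    csv_lines = [send(f"/batch/{record_id}", body) for record_id, body in records]

    if send("/batch/measuring/stop") != "success":
        raise RuntimeError("measuring could not be stopped")
    if send("/batch/analyzing/start") != "success":
        raise RuntimeError("analysis could not be started")

    # Analysis: poll periodically until the service reports "ready".
    for _ in range(max_polls):
        if send("/batch/analyzing/status") == "ready":
            break
        time.sleep(poll_interval)
    else:
        raise TimeoutError("analysis did not finish in time")

    package = send("/batch/analyzing/retrieve")
    return session_id, csv_lines, package
```

Passing the transport in as a parameter keeps the sequence testable without a running service.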
37. Measuring metadata quality. Formal issue definition I. RDFUnit
pattern
SELECT ?s WHERE {
  ?s %%P1%% ?v1 .
  ?s %%P2%% ?v2 .
  FILTER ( ?v1 %%OP%% ?v2 )
}
parameters
P1 => dbo:birthDate
P2 => dbo:deathDate
OP => >
SPARQL
SELECT ?s WHERE {
  ?s dbo:birthDate ?v1 .
  ?s dbo:deathDate ?v2 .
  FILTER ( ?v1 > ?v2 )
}
Kontokostas et al. (2014), Test-driven Evaluation of Linked Data Quality
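The step from pattern to concrete query is a placeholder substitution. The sketch below illustrates the idea with plain string replacement; it does not use RDFUnit's own API, and the function name `instantiate` is an assumption.

```python
PATTERN = """SELECT ?s WHERE {
  ?s %%P1%% ?v1 .
  ?s %%P2%% ?v2 .
  FILTER ( ?v1 %%OP%% ?v2 )
}"""


def instantiate(pattern: str, bindings: dict) -> str:
    """Replace each %%NAME%% placeholder with its bound value."""
    query = pattern
    for name, value in bindings.items():
        query = query.replace(f"%%{name}%%", value)
    return query


if __name__ == "__main__":
    # Selects resources whose birth date is later than their death date.
    print(instantiate(PATTERN, {"P1": "dbo:birthDate", "P2": "dbo:deathDate", "OP": ">"}))
```

Every subject returned by the instantiated query is a violation, so an empty result means the test passes.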
38. Measuring metadata quality. Formal issue definition II. SHACL
shape
<IssueShape> sh:property [
  sh:predicate ex:submittedBy ;
  sh:minLength 20
] .
RDF triples
<issue1> ex:submittedBy <http://a.example/bob> .
<issue2> ex:submittedBy "Bob" .
result
<IssueShape> <issue1> pass
<IssueShape> <issue2> fail ex:submittedBy expected to be >= 20 characters, 3 characters found.
SHACL Core Abstract Syntax and Semantics
W3C First Public Working Draft 25 August 2016
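What the minimum-length constraint checks can be illustrated without a SHACL engine. In this sketch, triples are plain tuples and the length is taken from the string form of the value (for an IRI, the IRI itself); the function name and the data layout are assumptions, and this is not a SHACL implementation.

```python
def check_min_length(triples: list, predicate: str, min_length: int) -> dict:
    """Report pass/fail per subject for values of `predicate`.

    `triples` is a list of (subject, predicate, value) tuples, with every
    value given as a string.
    """
    results = {}
    for subject, pred, value in triples:
        if pred != predicate:
            continue
        length = len(value)
        if length >= min_length:
            results[subject] = "pass"
        else:
            results[subject] = (
                f"fail: expected >= {min_length} characters, {length} found"
            )
    return results


if __name__ == "__main__":
    data = [
        ("issue1", "ex:submittedBy", "http://a.example/bob"),
        ("issue2", "ex:submittedBy", "Bob"),
    ]
    print(check_min_length(data, "ex:submittedBy", 20))
```

The IRI `http://a.example/bob` is exactly 20 characters long, so it just meets the threshold, while the literal "Bob" falls short with 3.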
39. Measuring metadata quality. Community bibliography
zotero.org/groups/metadata_assessment
dlfmetadataassessment.github.io
40. Measuring metadata quality. Cooperations and project proposals
★Europeana Network’s Data Quality Committee
★Digital Library Federation Metadata Assessment Group
★Deutsche Digitale Bibliothek
41. Measuring metadata quality. Further steps
▪ Translate the results into documentation, recommendations
▪ Communication with data providers
▪ Human evaluation of metadata quality
▪ Cooperation with other projects
▪ Incorporating into ingestion process
▪ Shapes Constraint Language (SHACL) for defining patterns
▪ Process usage statistics
▪ Measuring changes of scores
▪ Machine learning based classification & clustering
human analysis technical
42. Measuring metadata quality. Links
★Europeana Data Quality Committee // http://pro.europeana.eu/europeana-tech/data-quality-committee
★site // http://144.76.218.178/europeana-qa/
★source codes (GPL v3.0) // http://pkiraly.github.io/about/#source-codes
★Europeana data (CC0) // http://hdl.handle.net/21.11101/0000-0001-781F-7
★Library of Congress data (OA) // http://www.loc.gov/cds/products/marcDist.php