Metadata Quality Assurance
Péter Király
peter.kiraly@gwdg.de
Heyne Haus, Göttingen, 18/12/2015
Oberseminar Datenmanagement, Cloud und e-Infrastructure
Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen
Metadata Quality Assurance Framework
What is metadata?
- Data about data
- Specifically: descriptive data about
  - digitized (or physical) objects, such as paintings, books, photos
  - larger datasets, such as research data
- Provides access points to the underlying data
Why is data quality important?
"Fitness for purpose"
no metadata → no access to data → no data usage
Further explanation:
Data on the Web Best Practices
W3C Working Draft 17 December 2015
http://www.w3.org/TR/2015/WD-dwbp-20151217/
Symptoms of poor-quality metadata
- Hard to identify ("What is it?")
- Hard to distinguish from other records
- Misleading descriptions
- Uninterpretable descriptions
- Missing fields
- Not reusable (original context lost)
- Hard to find
Some typical issues
- Title is not informative
Mixing different data types
- Numeric
- RDF resource
Field overuse
- What is the meaning of the field?
  - identifier
  - relation
  - source
TextGrid OAI-PMH response
Copy & paste cataloguing
- Keeping placeholders / templates
Same entity, differently recorded
- lucas cranach der ältere
- Cranach, Lucas (der Ältere) [Herstellung]
- Cranach, Lucas (I) (naar tekening van)
- Cranach, Lucas vanem (autor)
Result of entity detection:
- http://dbpedia.org/resource/Lucas_Cranach_the_Elder
- http://viaf.org/viaf/49268177/
- none
Same entity recorded differently
Different displays and content:
- http://dbpedia.org/resource/Lucas_Cranach_the_Elder
- http://viaf.org/viaf/49268177/
- none
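The name variants of Lucas Cranach the Elder listed above show why plain string comparison fails. As a minimal sketch that is not part of the original slides, the Python snippet below applies a simple normalization: it drops bracketed qualifiers, reorders "Last, First" and strips diacritics. The function name `name_key` and its rules are assumptions made for illustration only.

```python
import re
import unicodedata


def name_key(raw: str) -> str:
    """Return a crude matching key for a personal name (illustrative heuristic)."""
    # Drop qualifiers in parentheses or square brackets, e.g. "(autor)".
    text = re.sub(r"\([^)]*\)|\[[^\]]*\]", " ", raw)
    # Turn "Last, First" into "First Last".
    if "," in text:
        last, first = text.split(",", 1)
        text = f"{first} {last}"
    # Strip diacritics, lowercase and collapse whitespace.
    decomposed = unicodedata.normalize("NFKD", text)
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(plain.lower().split())


variants = [
    "lucas cranach der ältere",
    "Cranach, Lucas (der Ältere) [Herstellung]",
    "Cranach, Lucas (I) (naar tekening van)",
    "Cranach, Lucas vanem (autor)",
]
keys = [name_key(v) for v in variants]
for variant, key in zip(variants, keys):
    print(f"{key!r:30} <- {variant}")
```

With these rules only the second and third variants collapse to the same key. The first and fourth keep their language-specific epithets ("der ältere", "vanem"), so string normalization alone cannot replace a link to an authority record such as the VIAF URI above.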
What to measure?
[Figure: grid of records (doc1 to doc4, rows) by fields (field1 to field4, columns), shown on four slides, one per level of measurement]
- Record: an overall value for a record
- Record set: an overall value for a record set (e.g. a collection from the same source)
- Field: an overall value for a field; how do users utilize the field?
- Field group: a group of fields that together supports a given functionality, e.g. display, search, identification, re-use, multilinguality
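The levels of measurement described above can be illustrated with a simple completeness ratio. The following is a minimal Python sketch under stated assumptions: the toy records, the rule that `None` and the empty string count as missing, and the function names are all hypothetical and not taken from the framework itself.

```python
from statistics import mean

# Hypothetical toy data: each record maps field names to values.
FIELDS = ["field1", "field2", "field3", "field4"]
records = {
    "doc1": {"field1": "a", "field2": "b", "field3": None, "field4": "d"},
    "doc2": {"field1": "a", "field2": "", "field3": None, "field4": "d"},
    "doc3": {"field1": "a", "field2": "b", "field3": "c", "field4": "d"},
    "doc4": {"field1": None, "field2": "b", "field3": None, "field4": "d"},
}


def is_present(value) -> bool:
    """A field counts as filled when it is neither None nor an empty string."""
    return value is not None and value != ""


def record_score(record: dict) -> float:
    """Record level: share of the known fields that are filled in one record."""
    return mean(1.0 if is_present(record.get(f)) else 0.0 for f in FIELDS)


def collection_score(recs: dict) -> float:
    """Record-set level: average of the record scores."""
    return mean(record_score(r) for r in recs.values())


def field_score(recs: dict, field: str) -> float:
    """Field level: share of the records in which the field is filled."""
    return mean(1.0 if is_present(r.get(field)) else 0.0 for r in recs.values())


print(record_score(records["doc1"]))    # one row of the grid
print(collection_score(records))        # the whole grid
print(field_score(records, "field3"))   # one column of the grid
```

The same record-level values feed all three numbers, so the record-level measurement only has to be run once.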
Grouping fields by functionalities
Functionalities (table columns): Mandatory, Descriptiveness, Searchability, Contextualisation, Identification, Browsing, Viewing, Re-Usability, Multilinguality
Functionalities marked (×) per field (table rows):
- dc:title: 5
- dcterms:alternative: 4
- dc:description: 6
- dc:creator: 4
- dc:publisher: 2
- dc:contributor: 1
Created by Valentine Charles, Europeana Research and Development team
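A grouping of fields by functionality can be turned into one score per functionality. The sketch below is only an illustration: the field sets in `FIELD_GROUPS` are placeholders and do not reproduce the actual assignment made by the Europeana team, and `group_scores` is a hypothetical helper name.

```python
# Hypothetical mapping of functionalities to the fields that support them.
# These sets are placeholders, not the assignment from the Europeana table.
FIELD_GROUPS = {
    "Identification": {"dc:title", "dc:description", "dc:creator"},
    "Searchability": {"dc:title", "dcterms:alternative", "dc:description"},
    "Multilinguality": {"dc:title", "dc:description"},
}


def group_scores(record: dict) -> dict:
    """For each functionality, the share of its fields that are filled in the record."""
    scores = {}
    for group, fields in FIELD_GROUPS.items():
        filled = sum(1 for f in fields if record.get(f))
        scores[group] = filled / len(fields)
    return scores


# Toy record with two filled fields.
record = {"dc:title": "Example title", "dc:creator": "Example creator"}
scores = group_scores(record)
print(scores)
```

A record can therefore score well for one functionality and poorly for another, which a single overall completeness value would hide.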
Metrics
The foundational metrics were set by Bruce and Hillmann, Stvilia, Ochoa and Duval, and Gavrilis et al.
- Completeness
- Accuracy
- Conformance to expectations
- Logical consistency and coherence
- Accessibility
- Timeliness
- Provenance
Data sources
- Europeana, the European digital library, museum and archive: 48M+ metadata records in the EDM (Europeana Data Model) schema
- TextGrid repository: Dublin Core metadata and TEI (Text Encoding Initiative) records
- Research data from the Göttingen Campus
- Library catalogue records in the MARC (Machine-Readable Cataloging) schema
- Other open data
Method: collection – measuring – sharing
- Data collection (ingestion) via REST API, OAI-PMH harvesting, file download etc.
- Issues:
  - GWDG cloud: 160 GB, Europeana: 300 GB
  - low I/O performance
  - Europeana OAI-PMH is in a "beta" state
  - OAI-PMH requires 10M+ HTTP requests
  - REST API requires 50M+ HTTP requests
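OAI-PMH delivers a record list in pages linked by resumption tokens, which is one reason a full harvest needs so many HTTP requests. The loop can be sketched in Python as follows. This is not the code of the harvesters listed at the end of the slides; the `fetch` callable, the `harvest` function and the canned pages are assumptions made so that the example runs without a network connection.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, Iterator

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def harvest(fetch: Callable[[Dict[str, str]], str],
            metadata_prefix: str = "oai_dc") -> Iterator[ET.Element]:
    """Yield every <record> of a ListRecords harvest, following resumption tokens.

    `fetch` is a caller-supplied function (an assumption of this sketch) that
    sends the query parameters to the endpoint and returns the XML body.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        root = ET.fromstring(fetch(params))
        yield from root.iter(f"{OAI_NS}record")
        token = root.find(f".//{OAI_NS}resumptionToken")
        if token is None or not (token.text or "").strip():
            break  # no token, or an empty one: the list is complete
        # A resumption request carries only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}


# Two canned pages stand in for the HTTP responses of a real endpoint.
PAGES = {
    None: '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>'
          '<record><header><identifier>id-1</identifier></header></record>'
          '<resumptionToken>page2</resumptionToken></ListRecords></OAI-PMH>',
    "page2": '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>'
             '<record><header><identifier>id-2</identifier></header></record>'
             '<resumptionToken/></ListRecords></OAI-PMH>',
}


def fake_fetch(params: Dict[str, str]) -> str:
    """Return the canned page that matches the requested resumption token."""
    return PAGES[params.get("resumptionToken")]


identifiers = [rec.findtext(f"{OAI_NS}header/{OAI_NS}identifier")
               for rec in harvest(fake_fetch)]
print(identifiers)
```

A real `fetch` could, for instance, encode the parameters with `urllib.parse.urlencode` and read the response with `urllib.request.urlopen`; retrying and error handling are left out of the sketch.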
Method: collection – measuring – sharing
Measuring records
- Big data, so the solution should be scalable
- Apache Hadoop: MapReduce and friends
- Pluggable architecture: "meters"
- UI: set parameters for meters
- Input: records, schema, meters, config files
- Output:
  - identifier, projected metadata fields
  - metric1, metric2, metric3 ... metricN
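The pluggable "meters" idea can be pictured as a registry of small functions, each turning a record into one number. The plain-Python sketch below leaves out Hadoop and only illustrates the shape of the output row (identifier followed by metric values). The decorator, the two example meters and the record layout are assumptions, not the framework's actual API.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Meter = Callable[[Record], float]

# Registry of meters; the names and the decorator are hypothetical.
METERS: Dict[str, Meter] = {}


def meter(name: str) -> Callable[[Meter], Meter]:
    """Register a function as a named meter."""
    def register(func: Meter) -> Meter:
        METERS[name] = func
        return func
    return register


@meter("completeness")
def completeness(record: Record) -> float:
    """Share of non-empty values among all fields except the identifier."""
    values = [v for k, v in record.items() if k != "id"]
    return sum(1 for v in values if v) / len(values) if values else 0.0


@meter("title_length")
def title_length(record: Record) -> float:
    """Length of the title in characters, a crude proxy for informativeness."""
    return float(len(str(record.get("title") or "")))


def measure(record: Record, enabled: List[str]) -> List[object]:
    """Return one output row: identifier followed by the selected metric values."""
    return [record["id"]] + [METERS[name](record) for name in enabled]


row = measure({"id": "rec-1", "title": "Untitled", "creator": ""},
              ["completeness", "title_length"])
print(row)
```

Because each meter only sees one record, a function like `measure` could serve as the map step of a MapReduce job, with the list of enabled meters coming from the configuration.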
Method: collection – measuring – sharing
Statistical analysis
- Calculating descriptive statistics with R, Julia or another tool
- Deriving numbers that represent collections and fields from the record-level measurements
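Deriving collection-level numbers from record-level measurements can be sketched with Python's standard `statistics` module, used here in place of R or Julia. The rows, the collection names and the choice of statistics are hypothetical.

```python
import statistics
from collections import defaultdict

# Hypothetical record-level output: (collection, record identifier, completeness).
rows = [
    ("collection-a", "rec-1", 0.50),
    ("collection-a", "rec-2", 0.75),
    ("collection-a", "rec-3", 1.00),
    ("collection-b", "rec-4", 0.25),
    ("collection-b", "rec-5", 0.25),
]

# Group the record scores by collection.
by_collection = defaultdict(list)
for collection, _identifier, score in rows:
    by_collection[collection].append(score)

# One set of descriptive statistics per collection.
summary = {
    collection: {
        "n": len(scores),
        "min": min(scores),
        "max": max(scores),
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "pstdev": statistics.pstdev(scores),  # population standard deviation
    }
    for collection, scores in by_collection.items()
}

for collection, stats in summary.items():
    print(collection, stats)
```

A standard deviation near zero suggests that the records of a collection are very similar to each other, while a large one points to a heterogeneous collection.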
Method: collection – measuring – sharing
[Figure: completeness of 3 collections with 2 response types; annotations: best in collection, worst in collection, similar records, heterogeneous records, different manifestations]
Method: collection – measuring – sharing
Outputs
- Display results in an interactive dashboard
- REST API to share the raw data
Images: i) European Data Portal Metadata Quality Dashboard ii) Kibana promotional video
Method: collection – measuring – sharing
Data Quality Vocabulary (W3C Working Draft)
http://w3c.github.io/dwbp/vocab-dqg.html
:myDatasetDistribution
    dqv:hasQualityMeasure :measure1, :measure2 .

:measure1
    a dqv:QualityMeasure ;
    dqv:computedOn :myDatasetDistribution ;
    dqv:hasMetric :csvAvailabilityMetric ;
    dqv:value "1.0"^^xsd:double .

:measure2
    a dqv:QualityMeasure ;
    dqv:computedOn :myDatasetDistribution ;
    dqv:hasMetric :csvConsistencyMetric ;
    dqv:value "0.5"^^xsd:double .
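A measured value could be written out in the same pattern as the Turtle example above. The helper below is a hypothetical sketch: `dqv_measure` and `completenessMetric` are invented names, the property names simply mirror the working draft quoted on the slide (other versions of the vocabulary may name them differently), and no escaping or validation is done. An RDF library would be the safer choice in practice.

```python
def dqv_measure(measure_id: str, dataset: str, metric: str, value: float) -> str:
    """Serialize one measurement in the pattern of the Turtle example on the slide."""
    return (
        f":{measure_id}\n"
        f"    a dqv:QualityMeasure ;\n"
        f"    dqv:computedOn :{dataset} ;\n"
        f"    dqv:hasMetric :{metric} ;\n"
        f'    dqv:value "{value}"^^xsd:double .\n'
    )


# "completenessMetric" is a hypothetical metric name used for illustration.
turtle = dqv_measure("measure1", "myDatasetDistribution", "completenessMetric", 0.75)
print(turtle)
```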
What is it good for?
- Improve the metadata
- Improve the metadata schema and its documentation
- Propagate "good practice"
- Improve services: "good" data is ranked higher in the search result list
Specifically for GWDG:
- Could be built into current and planned data management / data archiving tools
Further steps
- Define meters with a domain-specific language
- Pattern discovery, machine learning, clustering
- Connectors for data sources
- "Jenkins for data publication"
[Diagram labels: problem catalogue, data source, schema, metadata QA report]
Follow me
- Project plan and blog: http://pkiraly.github.io
- Software development:
  - https://github.com/pkiraly/europeana-oai-pmh-client: harvester for the Europeana OAI-PMH service
  - https://github.com/pkiraly/oai-pmh-lib: OAI-PMH client library
  - https://github.com/pkiraly/europeana-api-php-client: PHP client for Europeana’s REST API
  - https://github.com/pkiraly/europeana-qa: Europeana Metadata Quality Assurance Toolkit
- @kiru, https://www.linkedin.com/in/peterkiraly

Mais conteúdo relacionado

Mais procurados

FAIR Data Knowledge Graphs–from Theory to Practice
FAIR Data Knowledge Graphs–from Theory to PracticeFAIR Data Knowledge Graphs–from Theory to Practice
FAIR Data Knowledge Graphs–from Theory to PracticeTom Plasterer
 
Stakeholder-centred Identification of Data Quality Issues: Knowledge that Can...
Stakeholder-centred Identification of Data Quality Issues: Knowledge that Can...Stakeholder-centred Identification of Data Quality Issues: Knowledge that Can...
Stakeholder-centred Identification of Data Quality Issues: Knowledge that Can...Anastasija Nikiforova
 
Making Data FAIR (Findable, Accessible, Interoperable, Reusable)
Making Data FAIR (Findable, Accessible, Interoperable, Reusable)Making Data FAIR (Findable, Accessible, Interoperable, Reusable)
Making Data FAIR (Findable, Accessible, Interoperable, Reusable)Tom Plasterer
 
OpenTox - an open community and framework supporting predictive toxicology an...
OpenTox - an open community and framework supporting predictive toxicology an...OpenTox - an open community and framework supporting predictive toxicology an...
OpenTox - an open community and framework supporting predictive toxicology an...Barry Hardy
 
Edge Informatics and FAIR (Findable, Accessible, Interoperable and Reusable) ...
Edge Informatics and FAIR (Findable, Accessible, Interoperable and Reusable) ...Edge Informatics and FAIR (Findable, Accessible, Interoperable and Reusable) ...
Edge Informatics and FAIR (Findable, Accessible, Interoperable and Reusable) ...Tom Plasterer
 
AN EXTENDED DATA OBJECT-DRIVEN APPROACH TO DATA QUALITY EVALUATION: CONTEXTUA...
AN EXTENDED DATA OBJECT-DRIVEN APPROACH TO DATA QUALITY EVALUATION: CONTEXTUA...AN EXTENDED DATA OBJECT-DRIVEN APPROACH TO DATA QUALITY EVALUATION: CONTEXTUA...
AN EXTENDED DATA OBJECT-DRIVEN APPROACH TO DATA QUALITY EVALUATION: CONTEXTUA...Anastasija Nikiforova
 
Harnessing Edge Informatics to Accelerate Collaboration in BioPharma (Bio-IT ...
Harnessing Edge Informatics to Accelerate Collaboration in BioPharma (Bio-IT ...Harnessing Edge Informatics to Accelerate Collaboration in BioPharma (Bio-IT ...
Harnessing Edge Informatics to Accelerate Collaboration in BioPharma (Bio-IT ...Tom Plasterer
 
Crowdsourcing Linked Data Quality Assessment
Crowdsourcing Linked Data Quality AssessmentCrowdsourcing Linked Data Quality Assessment
Crowdsourcing Linked Data Quality AssessmentAmrapali Zaveri, PhD
 
MIE2014: A Framework for Evaluating and Utilizing Medical Terminology Mappings
MIE2014: A Framework for Evaluating and Utilizing Medical Terminology Mappings MIE2014: A Framework for Evaluating and Utilizing Medical Terminology Mappings
MIE2014: A Framework for Evaluating and Utilizing Medical Terminology Mappings Kerstin Forsberg
 
Metadata mapping
Metadata mappingMetadata mapping
Metadata mappingVlad Vega
 
Some Frameworks for Improving Analytic Operations at Your Company
Some Frameworks for Improving Analytic Operations at Your CompanySome Frameworks for Improving Analytic Operations at Your Company
Some Frameworks for Improving Analytic Operations at Your CompanyRobert Grossman
 
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed DeployedCrossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed DeployedRobert Grossman
 
Engaging Information Professionals in the Process of Authoritative Interlinki...
Engaging Information Professionals in the Process of Authoritative Interlinki...Engaging Information Professionals in the Process of Authoritative Interlinki...
Engaging Information Professionals in the Process of Authoritative Interlinki...Lucy McKenna
 
Building an Internet of Genomics
Building an Internet of GenomicsBuilding an Internet of Genomics
Building an Internet of GenomicsMarc Fiume
 
BioCADDIE: Descriptive Metadata for Datasets WG3 - ELIXIR All Hands
BioCADDIE: Descriptive Metadata for Datasets WG3 - ELIXIR All HandsBioCADDIE: Descriptive Metadata for Datasets WG3 - ELIXIR All Hands
BioCADDIE: Descriptive Metadata for Datasets WG3 - ELIXIR All HandsSusanna-Assunta Sansone
 
Knowledge graphs dedicated to the memory of amrapali zaveri 3388748
Knowledge graphs dedicated to the memory of amrapali zaveri 3388748Knowledge graphs dedicated to the memory of amrapali zaveri 3388748
Knowledge graphs dedicated to the memory of amrapali zaveri 3388748Jyotindra Zaveri
 

Mais procurados (20)

FAIR Data Knowledge Graphs–from Theory to Practice
FAIR Data Knowledge Graphs–from Theory to PracticeFAIR Data Knowledge Graphs–from Theory to Practice
FAIR Data Knowledge Graphs–from Theory to Practice
 
Stakeholder-centred Identification of Data Quality Issues: Knowledge that Can...
Stakeholder-centred Identification of Data Quality Issues: Knowledge that Can...Stakeholder-centred Identification of Data Quality Issues: Knowledge that Can...
Stakeholder-centred Identification of Data Quality Issues: Knowledge that Can...
 
Making Data FAIR (Findable, Accessible, Interoperable, Reusable)
Making Data FAIR (Findable, Accessible, Interoperable, Reusable)Making Data FAIR (Findable, Accessible, Interoperable, Reusable)
Making Data FAIR (Findable, Accessible, Interoperable, Reusable)
 
OpenTox - an open community and framework supporting predictive toxicology an...
OpenTox - an open community and framework supporting predictive toxicology an...OpenTox - an open community and framework supporting predictive toxicology an...
OpenTox - an open community and framework supporting predictive toxicology an...
 
Edge Informatics and FAIR (Findable, Accessible, Interoperable and Reusable) ...
Edge Informatics and FAIR (Findable, Accessible, Interoperable and Reusable) ...Edge Informatics and FAIR (Findable, Accessible, Interoperable and Reusable) ...
Edge Informatics and FAIR (Findable, Accessible, Interoperable and Reusable) ...
 
AN EXTENDED DATA OBJECT-DRIVEN APPROACH TO DATA QUALITY EVALUATION: CONTEXTUA...
AN EXTENDED DATA OBJECT-DRIVEN APPROACH TO DATA QUALITY EVALUATION: CONTEXTUA...AN EXTENDED DATA OBJECT-DRIVEN APPROACH TO DATA QUALITY EVALUATION: CONTEXTUA...
AN EXTENDED DATA OBJECT-DRIVEN APPROACH TO DATA QUALITY EVALUATION: CONTEXTUA...
 
Harnessing Edge Informatics to Accelerate Collaboration in BioPharma (Bio-IT ...
Harnessing Edge Informatics to Accelerate Collaboration in BioPharma (Bio-IT ...Harnessing Edge Informatics to Accelerate Collaboration in BioPharma (Bio-IT ...
Harnessing Edge Informatics to Accelerate Collaboration in BioPharma (Bio-IT ...
 
Crowdsourcing Linked Data Quality Assessment
Crowdsourcing Linked Data Quality AssessmentCrowdsourcing Linked Data Quality Assessment
Crowdsourcing Linked Data Quality Assessment
 
MIE2014: A Framework for Evaluating and Utilizing Medical Terminology Mappings
MIE2014: A Framework for Evaluating and Utilizing Medical Terminology Mappings MIE2014: A Framework for Evaluating and Utilizing Medical Terminology Mappings
MIE2014: A Framework for Evaluating and Utilizing Medical Terminology Mappings
 
A Case for linked Data for Medical Devices in the IVD Market
A Case for linked Data for Medical Devices in the IVD MarketA Case for linked Data for Medical Devices in the IVD Market
A Case for linked Data for Medical Devices in the IVD Market
 
Metadata mapping
Metadata mappingMetadata mapping
Metadata mapping
 
Metadata crosswalks
Metadata crosswalksMetadata crosswalks
Metadata crosswalks
 
Some Frameworks for Improving Analytic Operations at Your Company
Some Frameworks for Improving Analytic Operations at Your CompanySome Frameworks for Improving Analytic Operations at Your Company
Some Frameworks for Improving Analytic Operations at Your Company
 
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed DeployedCrossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
 
Engaging Information Professionals in the Process of Authoritative Interlinki...
Engaging Information Professionals in the Process of Authoritative Interlinki...Engaging Information Professionals in the Process of Authoritative Interlinki...
Engaging Information Professionals in the Process of Authoritative Interlinki...
 
Building an Internet of Genomics
Building an Internet of GenomicsBuilding an Internet of Genomics
Building an Internet of Genomics
 
Evaluating FAIRness
Evaluating FAIRnessEvaluating FAIRness
Evaluating FAIRness
 
BioCADDIE: Descriptive Metadata for Datasets WG3 - ELIXIR All Hands
BioCADDIE: Descriptive Metadata for Datasets WG3 - ELIXIR All HandsBioCADDIE: Descriptive Metadata for Datasets WG3 - ELIXIR All Hands
BioCADDIE: Descriptive Metadata for Datasets WG3 - ELIXIR All Hands
 
Knowledge graphs dedicated to the memory of amrapali zaveri 3388748
Knowledge graphs dedicated to the memory of amrapali zaveri 3388748Knowledge graphs dedicated to the memory of amrapali zaveri 3388748
Knowledge graphs dedicated to the memory of amrapali zaveri 3388748
 
Amrapali Zaveri Defense
Amrapali Zaveri DefenseAmrapali Zaveri Defense
Amrapali Zaveri Defense
 

Semelhante a Metadata Quality Assurance

How to clean data less through Linked (Open Data) approach?
How to clean data less through Linked (Open Data) approach?How to clean data less through Linked (Open Data) approach?
How to clean data less through Linked (Open Data) approach?andrea huang
 
A metadata standard for Knowledge Graphs
A metadata standard for Knowledge GraphsA metadata standard for Knowledge Graphs
A metadata standard for Knowledge GraphsMichel Dumontier
 
NIH Data Summit - The NIH Data Commons
NIH Data Summit - The NIH Data CommonsNIH Data Summit - The NIH Data Commons
NIH Data Summit - The NIH Data CommonsVivien Bonazzi
 
Introduction To Data Mining
Introduction To Data MiningIntroduction To Data Mining
Introduction To Data Miningdataminers.ir
 
Introduction To Data Mining
Introduction To Data Mining   Introduction To Data Mining
Introduction To Data Mining Phi Jack
 
Data Quality
Data QualityData Quality
Data Qualityjerdeb
 
The NIH Data Commons - BD2K All Hands Meeting 2015
The NIH Data Commons -  BD2K All Hands Meeting 2015The NIH Data Commons -  BD2K All Hands Meeting 2015
The NIH Data Commons - BD2K All Hands Meeting 2015Vivien Bonazzi
 
Tag.bio: Self Service Data Mesh Platform
Tag.bio: Self Service Data Mesh PlatformTag.bio: Self Service Data Mesh Platform
Tag.bio: Self Service Data Mesh PlatformSanjay Padhi, Ph.D
 
Trends in DM.pptx
Trends in DM.pptxTrends in DM.pptx
Trends in DM.pptxImXaib
 
Bonazzi commons bd2 k ahm 2016 v2
Bonazzi commons bd2 k ahm 2016 v2Bonazzi commons bd2 k ahm 2016 v2
Bonazzi commons bd2 k ahm 2016 v2Vivien Bonazzi
 
Data-centric AI and the convergence of data and model engineering: opportunit...
Data-centric AI and the convergence of data and model engineering:opportunit...Data-centric AI and the convergence of data and model engineering:opportunit...
Data-centric AI and the convergence of data and model engineering: opportunit...Paolo Missier
 
Enabling Self-service Data Provisioning Through Semantic Enrichment of Data |...
Enabling Self-service Data Provisioning Through Semantic Enrichment of Data |...Enabling Self-service Data Provisioning Through Semantic Enrichment of Data |...
Enabling Self-service Data Provisioning Through Semantic Enrichment of Data |...Ahmad Assaf
 
The Commons: Leveraging the Power of the Cloud for Big Data
The Commons: Leveraging the Power of the Cloud for Big DataThe Commons: Leveraging the Power of the Cloud for Big Data
The Commons: Leveraging the Power of the Cloud for Big DataPhilip Bourne
 
Komatsoulis internet2 executive track
Komatsoulis internet2 executive trackKomatsoulis internet2 executive track
Komatsoulis internet2 executive trackGeorge Komatsoulis
 
Data Quality - Standards and Application to Open Data
Data Quality - Standards and Application to Open DataData Quality - Standards and Application to Open Data
Data Quality - Standards and Application to Open DataMarco Torchiano
 
Project E: Citation
Project E: CitationProject E: Citation
Project E: CitationLizLyon
 
Towards metrics to assess and encourage FAIRness
Towards metrics to assess and encourage FAIRnessTowards metrics to assess and encourage FAIRness
Towards metrics to assess and encourage FAIRnessMichel Dumontier
 
FAIR Metrics - Presentation to NIH KC1
FAIR Metrics - Presentation to NIH KC1FAIR Metrics - Presentation to NIH KC1
FAIR Metrics - Presentation to NIH KC1Mark Wilkinson
 
Prof. Melinda Laituri, Colorado State University | Map Data Integrity | SotM ...
Prof. Melinda Laituri, Colorado State University | Map Data Integrity | SotM ...Prof. Melinda Laituri, Colorado State University | Map Data Integrity | SotM ...
Prof. Melinda Laituri, Colorado State University | Map Data Integrity | SotM ...Kathmandu Living Labs
 

Semelhante a Metadata Quality Assurance (20)

How to clean data less through Linked (Open Data) approach?
How to clean data less through Linked (Open Data) approach?How to clean data less through Linked (Open Data) approach?
How to clean data less through Linked (Open Data) approach?
 
A metadata standard for Knowledge Graphs
A metadata standard for Knowledge GraphsA metadata standard for Knowledge Graphs
A metadata standard for Knowledge Graphs
 
NIH Data Summit - The NIH Data Commons
NIH Data Summit - The NIH Data CommonsNIH Data Summit - The NIH Data Commons
NIH Data Summit - The NIH Data Commons
 
Introduction To Data Mining
Introduction To Data MiningIntroduction To Data Mining
Introduction To Data Mining
 
Introduction To Data Mining
Introduction To Data Mining   Introduction To Data Mining
Introduction To Data Mining
 
Data Quality
Data QualityData Quality
Data Quality
 
The NIH Data Commons - BD2K All Hands Meeting 2015
The NIH Data Commons -  BD2K All Hands Meeting 2015The NIH Data Commons -  BD2K All Hands Meeting 2015
The NIH Data Commons - BD2K All Hands Meeting 2015
 
Tag.bio: Self Service Data Mesh Platform
Tag.bio: Self Service Data Mesh PlatformTag.bio: Self Service Data Mesh Platform
Tag.bio: Self Service Data Mesh Platform
 
Trends in DM.pptx
Trends in DM.pptxTrends in DM.pptx
Trends in DM.pptx
 
Bonazzi commons bd2 k ahm 2016 v2
Bonazzi commons bd2 k ahm 2016 v2Bonazzi commons bd2 k ahm 2016 v2
Bonazzi commons bd2 k ahm 2016 v2
 
Data-centric AI and the convergence of data and model engineering: opportunit...
Data-centric AI and the convergence of data and model engineering:opportunit...Data-centric AI and the convergence of data and model engineering:opportunit...
Data-centric AI and the convergence of data and model engineering: opportunit...
 
Enabling Self-service Data Provisioning Through Semantic Enrichment of Data |...
Enabling Self-service Data Provisioning Through Semantic Enrichment of Data |...Enabling Self-service Data Provisioning Through Semantic Enrichment of Data |...
Enabling Self-service Data Provisioning Through Semantic Enrichment of Data |...
 
The Commons: Leveraging the Power of the Cloud for Big Data
The Commons: Leveraging the Power of the Cloud for Big DataThe Commons: Leveraging the Power of the Cloud for Big Data
The Commons: Leveraging the Power of the Cloud for Big Data
 
Komatsoulis internet2 executive track
Komatsoulis internet2 executive trackKomatsoulis internet2 executive track
Komatsoulis internet2 executive track
 
McGeary Data Curation Network: Developing and Scaling
McGeary Data Curation Network: Developing and ScalingMcGeary Data Curation Network: Developing and Scaling
McGeary Data Curation Network: Developing and Scaling
 
Data Quality - Standards and Application to Open Data
Data Quality - Standards and Application to Open DataData Quality - Standards and Application to Open Data
Data Quality - Standards and Application to Open Data
 
Project E: Citation
Project E: CitationProject E: Citation
Project E: Citation
 
Towards metrics to assess and encourage FAIRness
Towards metrics to assess and encourage FAIRnessTowards metrics to assess and encourage FAIRness
Towards metrics to assess and encourage FAIRness
 
FAIR Metrics - Presentation to NIH KC1
FAIR Metrics - Presentation to NIH KC1FAIR Metrics - Presentation to NIH KC1
FAIR Metrics - Presentation to NIH KC1
 
Prof. Melinda Laituri, Colorado State University | Map Data Integrity | SotM ...
Prof. Melinda Laituri, Colorado State University | Map Data Integrity | SotM ...Prof. Melinda Laituri, Colorado State University | Map Data Integrity | SotM ...
Prof. Melinda Laituri, Colorado State University | Map Data Integrity | SotM ...
 

Mais de Péter Király

Requirements of DARIAH community for a Dataverse repository (SSHOC 2020)
Requirements of DARIAH community for a Dataverse repository (SSHOC 2020)Requirements of DARIAH community for a Dataverse repository (SSHOC 2020)
Requirements of DARIAH community for a Dataverse repository (SSHOC 2020)Péter Király
 
Validating 126 million MARC records (DATeCH 2019)
Validating 126 million MARC records (DATeCH 2019)Validating 126 million MARC records (DATeCH 2019)
Validating 126 million MARC records (DATeCH 2019)Péter Király
 
Measuring Metadata Quality (doctoral defense 2019)
Measuring Metadata Quality (doctoral defense 2019)Measuring Metadata Quality (doctoral defense 2019)
Measuring Metadata Quality (doctoral defense 2019)Péter Király
 
Empirical evaluation of library catalogues (SWIB 2019)
Empirical evaluation of library catalogues (SWIB 2019)Empirical evaluation of library catalogues (SWIB 2019)
Empirical evaluation of library catalogues (SWIB 2019)Péter Király
 
GRO.data - Dataverse in Göttingen (Dataverse Europe 2020)
GRO.data - Dataverse in Göttingen (Dataverse Europe 2020)GRO.data - Dataverse in Göttingen (Dataverse Europe 2020)
GRO.data - Dataverse in Göttingen (Dataverse Europe 2020)Péter Király
 
Data element constraints for DDB (DDB 2021)
Data element constraints for DDB (DDB 2021)Data element constraints for DDB (DDB 2021)
Data element constraints for DDB (DDB 2021)Péter Király
 
Incubating Göttingen Cultural Analytics Alliance (SUB 2021)
Incubating Göttingen Cultural Analytics Alliance (SUB 2021)Incubating Göttingen Cultural Analytics Alliance (SUB 2021)
Incubating Göttingen Cultural Analytics Alliance (SUB 2021)Péter Király
 
Continuous quality assessment for MARC21 catalogues (MINI ELAG 2021)
Continuous quality assessment for MARC21 catalogues (MINI ELAG 2021)Continuous quality assessment for MARC21 catalogues (MINI ELAG 2021)
Continuous quality assessment for MARC21 catalogues (MINI ELAG 2021)Péter Király
 
Introduction to data quality management (BVB KVB FDM-KompetenzPool, 2021)
Introduction to data quality management (BVB KVB FDM-KompetenzPool, 2021)Introduction to data quality management (BVB KVB FDM-KompetenzPool, 2021)
Introduction to data quality management (BVB KVB FDM-KompetenzPool, 2021)Péter Király
 
Magyar irodalom idegen nyelven (BTK ITI 2021)
Magyar irodalom idegen nyelven (BTK ITI 2021)Magyar irodalom idegen nyelven (BTK ITI 2021)
Magyar irodalom idegen nyelven (BTK ITI 2021)Péter Király
 
Validating JSON, XML and CSV data with SHACL-like constraints (DINI-KIM 2022)
Validating JSON, XML and CSV data with SHACL-like constraints (DINI-KIM 2022)Validating JSON, XML and CSV data with SHACL-like constraints (DINI-KIM 2022)
Validating JSON, XML and CSV data with SHACL-like constraints (DINI-KIM 2022)Péter Király
 
FRBR a book history perspective (Bibliodata WG 2022)
FRBR a book history perspective (Bibliodata WG 2022)FRBR a book history perspective (Bibliodata WG 2022)
FRBR a book history perspective (Bibliodata WG 2022)Péter Király
 
GRO.data - Dataverse in Göttingen (Magdeburg Coffee Lecture, 2022)
GRO.data - Dataverse in Göttingen (Magdeburg Coffee Lecture, 2022)GRO.data - Dataverse in Göttingen (Magdeburg Coffee Lecture, 2022)
GRO.data - Dataverse in Göttingen (Magdeburg Coffee Lecture, 2022)Péter Király
 
Understanding, extracting and enhancing catalogue data (CE Book history works...
Understanding, extracting and enhancing catalogue data (CE Book history works...Understanding, extracting and enhancing catalogue data (CE Book history works...
Understanding, extracting and enhancing catalogue data (CE Book history works...Péter Király
 
Measuring cultural heritage metadata quality (Semantics 2017)
Measuring cultural heritage metadata quality (Semantics 2017)Measuring cultural heritage metadata quality (Semantics 2017)
Measuring cultural heritage metadata quality (Semantics 2017)Péter Király
 
Measuring Metadata Quality in Europeana (ADOCHS 2017)
Measuring Metadata Quality in Europeana (ADOCHS 2017)Measuring Metadata Quality in Europeana (ADOCHS 2017)
Measuring Metadata Quality in Europeana (ADOCHS 2017)Péter Király
 
Measuring library catalogs (ADOCHS 2017)
Measuring library catalogs (ADOCHS 2017)Measuring library catalogs (ADOCHS 2017)
Measuring library catalogs (ADOCHS 2017)Péter Király
 
Evaluating Data Quality in Europeana: Metrics for Multilinguality (MTSR 2018)
Evaluating Data Quality in Europeana: Metrics for Multilinguality (MTSR 2018)Evaluating Data Quality in Europeana: Metrics for Multilinguality (MTSR 2018)
Evaluating Data Quality in Europeana: Metrics for Multilinguality (MTSR 2018)Péter Király
 
Researching metadata quality (ORKG 2018)
Researching metadata quality (ORKG 2018)Researching metadata quality (ORKG 2018)
Researching metadata quality (ORKG 2018)Péter Király
 
Metadata quality in cultural heritage institutions (ReIRes-FAIR 2018)
Metadata quality in cultural heritage institutions (ReIRes-FAIR 2018)Metadata quality in cultural heritage institutions (ReIRes-FAIR 2018)
Metadata quality in cultural heritage institutions (ReIRes-FAIR 2018)Péter Király
 

Mais de Péter Király (20)

Requirements of DARIAH community for a Dataverse repository (SSHOC 2020)
Requirements of DARIAH community for a Dataverse repository (SSHOC 2020)Requirements of DARIAH community for a Dataverse repository (SSHOC 2020)
Requirements of DARIAH community for a Dataverse repository (SSHOC 2020)
 
Validating 126 million MARC records (DATeCH 2019)
Validating 126 million MARC records (DATeCH 2019)Validating 126 million MARC records (DATeCH 2019)
Validating 126 million MARC records (DATeCH 2019)
 
Measuring Metadata Quality (doctoral defense 2019)
Measuring Metadata Quality (doctoral defense 2019)Measuring Metadata Quality (doctoral defense 2019)
Measuring Metadata Quality (doctoral defense 2019)
 
Empirical evaluation of library catalogues (SWIB 2019)
Empirical evaluation of library catalogues (SWIB 2019)Empirical evaluation of library catalogues (SWIB 2019)
Empirical evaluation of library catalogues (SWIB 2019)
 
GRO.data - Dataverse in Göttingen (Dataverse Europe 2020)
GRO.data - Dataverse in Göttingen (Dataverse Europe 2020)GRO.data - Dataverse in Göttingen (Dataverse Europe 2020)
GRO.data - Dataverse in Göttingen (Dataverse Europe 2020)
 
Data element constraints for DDB (DDB 2021)
Data element constraints for DDB (DDB 2021)Data element constraints for DDB (DDB 2021)
Data element constraints for DDB (DDB 2021)
 

Metadata Quality Assurance

  • 1. Metadata Quality Assurance
       Péter Király, peter.kiraly@gwdg.de
       Heyne Haus, Göttingen, 18/12/2015
       Oberseminar Datenmanagement, Cloud und e-Infrastructure
       Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen
  • 2. Metadata Quality Assurance Framework: What is metadata?
       - Data about data
       - Specifically: descriptive data about ...
           - digitized (or physical) objects such as paintings, books, photos
           - larger datasets such as research data
       - Provides access points to the underlying data
  • 3. Why is data quality important?
       - "Fitness for purpose"
       - No metadata, no access to data, no data usage
       - More explanation: Data on the Web Best Practices, W3C Working Draft, 17 December 2015, http://www.w3.org/TR/2015/WD-dwbp-20151217/
  • 4. Symptoms of bad quality metadata
       - Hard to identify ("What is it?")
       - Hard to distinguish from other records
       - Misleading descriptions
       - Uninterpretable descriptions
       - Missing fields
       - Unreusable (lost original context)
       - Hard to find
  • 5. Some typical issues
       - Title is not informative
  • 6. Mixing different data types
       - Numeric
       - RDF resource
  • 7. Field overuse
       - What is the meaning of the field?
       - identifier
       - relation
       - source
       (TextGrid OAI-PMH response)
  • 8. Copy & paste cataloguing
       - Keeping placeholders / templates
  • 9. Same entity, differently recorded
       - lucas cranach der ältere
       - Cranach, Lucas (der Ältere) [Herstellung]
       - Cranach, Lucas (I) (naar tekening van)
       - Cranach, Lucas vanem (autor)
       Result of entity detection:
       - http://dbpedia.org/resource/Lucas_Cranach_the_Elder
       - http://viaf.org/viaf/49268177/
       - none
  • 10. Same entity recorded differently
       Different displays and content:
       - http://dbpedia.org/resource/Lucas_Cranach_the_Elder
       - http://viaf.org/viaf/49268177/
       - none
  • 11. What to measure?
       [Figure: grid of records (doc1, doc2, doc3, ...) against fields (field1 to field4)]
       An overall value for a record
  • 12. What to measure?
       [Figure: same grid of records against fields]
       An overall value for a record set (e.g. a collection from the same source)
  • 13. What to measure?
       [Figure: same grid of records against fields]
       An overall value for a field: how do users utilize the field?
  • 14. What to measure?
       [Figure: same grid of records against fields]
       Field group: a group of fields that together support a given functionality, e.g. display, search, identify, re-use, multilinguality.
  • 15. Grouping fields by functionalities
       Functionalities (columns): Mandatory, Descriptiveness, Searchability, Contextualisation, Identification, Browsing, Viewing, Re-Usability, Multilinguality
       Fields (rows) and their marks: dc:title × × × × ×; dcterms:alternative × × × ×; dc:description × × × × × ×; dc:creator × × × ×; dc:publisher × ×; dc:contributor ×
       Created by Valentine Charles, Europeana Research and Development team
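A minimal sketch of how a per-functionality score could be derived from such a grouping. The field-to-functionality mapping below is a placeholder, since the exact cells of the table are not recoverable from this transcript; the function name and the record layout (field name mapped to a list of values) are likewise assumptions made for illustration.

```python
# Hypothetical mapping: functionality -> fields that support it.
# These assignments are placeholders, not the values from the original table.
FIELD_GROUPS = {
    "searchability": ["dc:title", "dc:description", "dc:creator"],
    "identification": ["dc:title", "dcterms:alternative"],
}


def group_completeness(record, groups=FIELD_GROUPS):
    """Return, per functionality, the share of its fields that are non-empty."""
    scores = {}
    for name, fields in groups.items():
        # A field counts as present only if it exists and has at least one value.
        present = sum(1 for field in fields if record.get(field))
        scores[name] = present / len(fields)
    return scores


record = {
    "dc:title": ["Portrait"],
    "dc:creator": ["Cranach, Lucas"],
    "dc:description": [],  # present but empty, so it does not count
}
print(group_completeness(record))
```

A record can then be strong for one functionality and weak for another, which a single overall completeness value would hide.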
  • 16. Metrics
       The foundational metrics were set by Bruce–Hillmann, Stvilia, Ochoa–Duval, Gavrilis et al.
       - Completeness
       - Accuracy
       - Conformance to expectations
       - Logical consistency and coherence
       - Accessibility
       - Timeliness
       - Provenance
  • 17. Data sources
       - Europeana, the European digital library, museum and archive: 48M+ metadata records in the EDM (Europeana Data Model) schema
       - TextGrid repository: Dublin Core metadata and TEI (Text Encoding Initiative) records
       - Research data from the Göttingen Campus
       - Library catalogue records in the MARC (Machine-Readable Cataloging) schema
       - Other open data
  • 18. Method: collection – measuring – sharing
       - Data collection (ingestion) via REST API, OAI-PMH harvesting, file download etc.
       - Issues:
           - GWDG cloud: 160 GB, Europeana: 300 GB
           - low I/O performance
           - Europeana OAI-PMH is in a "beta" state
           - OAI-PMH requires 10M+ HTTP requests
           - REST API requires 50M+ HTTP requests
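The large request counts follow from how OAI-PMH pages its results: each response carries a limited batch of records plus a resumptionToken for the next batch. A minimal sketch of that loop, using only the Python standard library; the function names are made up for this example, the `fetch` parameter is an assumption added so the loop can be exercised without a network, and error responses from the server are not handled.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def http_fetch(base_url, params):
    """Fetch one OAI-PMH response page as bytes."""
    url = base_url + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as response:
        return response.read()


def harvest(base_url, metadata_prefix, fetch=http_fetch):
    """Yield <record> elements, following resumptionToken until it runs out."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        root = ET.fromstring(fetch(base_url, params))
        for record in root.iter(OAI_NS + "record"):
            yield record
        token = root.find(".//" + OAI_NS + "resumptionToken")
        # A missing or empty token marks the last page.
        if token is None or not (token.text or "").strip():
            break
        # Follow-up requests carry only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
```

Usage would look like `for record in harvest("https://example.org/oai", "oai_dc"): ...`, where the URL is a placeholder rather than a real endpoint. One HTTP request is issued per page, so the total number of requests grows with the size of the collection.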
  • 19. Method: collection – measuring – sharing
       Measuring records
       - Big data, so it should be scalable
       - Apache Hadoop: MapReduce and friends
       - Pluggable architecture: "meters"
       - UI: set parameters for meters
       - input: records, schema, meters, config files
       - output:
           - identifier, projected metadata fields
           - metric1, metric2, metric3 ... metricN
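A minimal sketch of the pluggable "meter" idea in plain Python rather than Hadoop: each meter is a function from a record and a schema to a number, and the output row is the identifier followed by one value per meter. The two meters, the registry name METERS and the record layout are assumptions made for illustration, not the interface of the actual toolkit.

```python
import csv
import sys


def completeness(record, schema):
    """Share of schema fields that have at least one non-empty value."""
    filled = sum(1 for field in schema if record.get(field))
    return filled / len(schema)


def title_length(record, schema):
    """Length of the first title, a crude proxy for informativeness."""
    titles = record.get("dc:title") or [""]
    return len(titles[0])


# Registry of meters; adding a metric means adding one entry here.
METERS = {"completeness": completeness, "title_length": title_length}


def measure(record, schema, meters=METERS):
    """Return one output row: identifier, then one value per meter."""
    return [record["id"]] + [meter(record, schema) for meter in meters.values()]


schema = ["dc:title", "dc:creator", "dc:description", "dc:publisher"]
records = [
    {"id": "r1", "dc:title": ["Adam and Eve"], "dc:creator": ["Cranach, Lucas"]},
    {"id": "r2", "dc:title": ["Unknown"]},
]

writer = csv.writer(sys.stdout)
writer.writerow(["id"] + list(METERS))
for record in records:
    writer.writerow(measure(record, schema))
```

Because each record is measured independently, the same `measure` function could serve as the map step of a distributed job.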
  • 20. Method: collection – measuring – sharing
       Statistical analysis
       - Calculating descriptive statistics with R/Julia/other tool
       - Derivation of numbers representing collections and fields from the record-level measurements
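A minimal sketch of deriving collection-level numbers from record-level measurements. The slide names R and Julia; Python's standard `statistics` module is used here only to keep the example self-contained, and the input layout (pairs of collection name and score) is an assumption.

```python
from collections import defaultdict
from statistics import mean, median, pstdev


def summarize(rows):
    """Group record-level scores by collection and compute summary statistics."""
    by_collection = defaultdict(list)
    for collection, score in rows:
        by_collection[collection].append(score)
    return {
        name: {
            "n": len(scores),
            "mean": mean(scores),
            "median": median(scores),
            "min": min(scores),
            "max": max(scores),
            "stdev": pstdev(scores),  # population standard deviation
        }
        for name, scores in by_collection.items()
    }


# Illustrative completeness scores for two made-up collections.
rows = [("A", 0.5), ("A", 0.75), ("A", 1.0), ("B", 0.25), ("B", 0.25)]
for name, stats in summarize(rows).items():
    print(name, stats)
```

A standard deviation of zero, as for collection B here, indicates that all records in the collection scored the same, while a wide spread points to more varied records.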
  • 21. Method: collection – measuring – sharing
       [Figure: completeness of 3 collections. Annotations: 2 response types; best in collection; worst in collection; similar records; heterogeneous records; different manifestations]
  • 22. Method: collection – measuring – sharing
       Outputs
       - Display results in an interactive dashboard
       - REST API to share the raw data
       Images: i) European Data Portal Metadata Quality Dashboard ii) Kibana promotional video
  • 23. Method: collection – measuring – sharing
       Data Quality Vocabulary (W3C Working Draft), http://w3c.github.io/dwbp/vocab-dqg.html

       :myDatasetDistribution
           dqv:hasQualityMeasure :measure1, :measure2 .

       :measure1 a dqv:QualityMeasure ;
           dqv:computedOn :myDatasetDistribution ;
           dqv:hasMetric :csvAvailabilityMetric ;
           dqv:value "1.0"^^xsd:double .

       :measure2 a dqv:QualityMeasure ;
           dqv:computedOn :myDatasetDistribution ;
           dqv:hasMetric :csvConsistencyMetric ;
           dqv:value "0.5"^^xsd:double .
  • 24. What is it good for?
       - Improve the metadata
       - Improve the metadata schema and its documentation
       - Propagate "good practice"
       - Improve services: "good" data is ranked higher in the search result list
       Specifically for GWDG:
       - Could be built into current and planned data management / data archiving tools
  • 25. Further steps
       - Define meters with a domain-specific language
       - Pattern discovery, machine learning, clustering
       - Connectors for data sources
       - "Jenkins for data publication"
       [Diagram labels: Problem catalogue, Data source, Schema, Metadata QA, Report]
  • 26. Follow me
       - Project plan and blog: http://pkiraly.github.io
       - Software development:
           - https://github.com/pkiraly/europeana-oai-pmh-client: Harvester for Europeana OAI-PMH Service
           - https://github.com/pkiraly/oai-pmh-lib: OAI-PMH client library
           - https://github.com/pkiraly/europeana-api-php-client: PHP client for Europeana's REST API
           - https://github.com/pkiraly/europeana-qa: Europeana Metadata Quality Assurance Toolkit
       - @kiru, https://www.linkedin.com/in/peterkiraly