This document summarizes a webinar on metadata for managing scientific research data. The webinar covered why metadata matters for scientific data management; definitions of data and metadata; selected metadata standards, including Dublin Core, Darwin Core, and FGDC; challenges in generating metadata and opportunities to address them; and advice for getting started with metadata. It emphasized that metadata standards are guidelines, not strict rules, and encouraged participants to keep metadata simple while aiming to facilitate reuse of data.
1. Metadata for Managing
Scientific Research Data
NISO/DCMI Webinar:
August 22, 2012
Jane Greenberg, Professor and Director of
the SILS Metadata Research Center
janeg@email.unc.edu
2. Overview
▪ Why should we care?
▪ What is data?
▪ What is metadata's role w.r.t. data?
▪ Selected metadata standards
▪ Challenges, opportunities, and jumping in
▪ Concluding comments
▪ Q&A
3. Why should we care?
BIG stuff
▪ Digital data deluge (Hey & Trefethen, 2003)
▪ Big data (New York Times)
2008
▪ The fourth paradigm (Jim Gray, 2007)
Just as important
▪ The long tail (Heidorn, 2008)
▪ CODATA/Data-at-Risk Task Group
▪ Scholarly communications, data citation
Technological affordances for improving and
advancing science
4. Cultural shift toward data sharing
▪ National and international policies
– US NSF and NIH [1, 2]
– OECD (Organisation for Economic Co-operation and
Development) [3]
– INSPIRE (Infrastructure for Spatial Information in the European
Community), EU Commission [4]
– UK Medical Research Council [5]
Dryad "enables scientists to validate
published findings, explore new analysis
methodologies, repurpose data for research
questions unanticipated by the original
authors, and perform synthetic studies."
(http://datadryad.org/)
6. Data
▪ No single agreed upon definition
▪ One person's data is another person's
information
▪ Data often implies the "raw" stuff lacking
context
– Scholarly context, written assessment
▪ "Essence of science" (Greenberg, et al, 2009)
▪ What is science?
– The Archaeology Data Service (ADS)
archaeologydataservice.ac.uk
7. Data
I know it when I see it
By example: Traditional observations, numbers, and measures stored in spreadsheets and databases, fossils, phylogenetic trees, and herbarium samples (White, 2008)
Other disciplines
▪ Bioinformatics: Gene expressions, DNA transcription to RNA translation
▪ Geology, agriculture, surveillance, and historical manuscript research: Hyperspectral remote sensing
The Dryad Repository (email w/R. Scherle, July 2012)
Quantity  Type
3162      Plain Text
476       Microsoft Excel
308       Adobe Portable Document Format
302       Comma-separated values
252       Nexus
153       Microsoft Excel OpenXML
108       Microsoft Word
80        Zip file
62        JPEG image
45        Microsoft Word OpenXML
40        Extensible Markup Language
35        Hypertext Markup Language
21        Rich Text Format
16        FASTA sequence file
15        Tag Image File Format
14        Postscript Files
2         Video Quicktime
2         Mathematica Notebook
1         Microsoft Powerpoint
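To make the Dryad numbers easier to compare, the counts can be turned into shares of the total. The sketch below only transcribes the quantities listed on this slide; the names DRYAD_FILE_COUNTS and format_shares are illustrative and not part of the webinar.

```python
# File-type counts transcribed from the Dryad Repository list on this slide.
DRYAD_FILE_COUNTS = {
    "Plain Text": 3162,
    "Microsoft Excel": 476,
    "Adobe Portable Document Format": 308,
    "Comma-separated values": 302,
    "Nexus": 252,
    "Microsoft Excel OpenXML": 153,
    "Microsoft Word": 108,
    "Zip file": 80,
    "JPEG image": 62,
    "Microsoft Word OpenXML": 45,
    "Extensible Markup Language": 40,
    "Hypertext Markup Language": 35,
    "Rich Text Format": 21,
    "FASTA sequence file": 16,
    "Tag Image File Format": 15,
    "Postscript Files": 14,
    "Video Quicktime": 2,
    "Mathematica Notebook": 2,
    "Microsoft Powerpoint": 1,
}


def format_shares(counts):
    """Return each type's percentage of all files, rounded to one decimal."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


total_files = sum(DRYAD_FILE_COUNTS.values())
shares = format_shares(DRYAD_FILE_COUNTS)

print(f"Total files: {total_files}")
# List the five most common formats with their share of the total.
for name in sorted(shares, key=shares.get, reverse=True)[:5]:
    print(f"{shares[name]:5.1f}%  {name}")
```

On these counts, plain text alone accounts for well over half of the deposited files.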
9. Metadata defined
...data about data
...information about data
▪ "Metadata or 'data about data' describes the
content, quality, condition, and other
characteristics of data." (FGDC Metadata WG,
1998)
▪ Structured information about an object (data)
that facilitates functions associated with the
object. (Greenberg, 2002, 2003, 2009)
10. Typical functions
▪ Discover
▪ Manage
▪ Control rights
▪ Identify versions
▪ Certify authenticity
▪ Indicate status
▪ Mark content structure
▪ Situate geospatially
▪ Describe processes
13. Metadata for Scientific Research Data
Descriptive
– General to granular
▪ Value (addressing a topic, "aboutness")
– Topical (ontologies, subject heading lists/thesauri,
taxonomies)
▪ Named entities
– Name authority files (people, organizations,
geographical jurisdictions, structures, and events)
▪ Geo-spatial (coordinates)
▪ Temporal data (ISO 8601/W3CDTF, or …)
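As a rough illustration of the temporal and geo-spatial bullets, the sketch below writes dates in ISO 8601 form (which W3CDTF profiles) and range-checks decimal-degree coordinates. The field names and the coordinate and time-of-day values are hypothetical; only the date comes from the title slide.

```python
from datetime import date, datetime, timezone


def valid_point(lat, lon):
    """Check that decimal-degree coordinates fall within the valid ranges."""
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0


# Date of the webinar; the time of day below is a made-up example value.
event_day = date(2012, 8, 22)
event_start = datetime(2012, 8, 22, 13, 0, 0, tzinfo=timezone.utc)

# Hypothetical descriptive record; keys are illustrative, not from a standard.
record = {
    "temporal_day": event_day.isoformat(),      # YYYY-MM-DD
    "temporal_start": event_start.isoformat(),  # date, time, and UTC offset
    "latitude": 35.91,    # illustrative coordinates only
    "longitude": -79.05,
}

if not valid_point(record["latitude"], record["longitude"]):
    raise ValueError("coordinates out of range")

print(record)
```

Writing dates this way keeps them sortable as plain strings and unambiguous across locales.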
14. Given the messiness…
"I cannot tell you exactly what metadata
standards, vocabularies, etc. to use…"
15. Examining metadata schemes
▪ Objectives and principles
– Objectives
– Principles
▪ Domains
– Discipline
– Genre
– Format
▪ Architectural layout
– Structural design
– Extent
– Granularity
Metadata Objectives and principles, Domain, and
Architectural Layout (MODAL) framework
(Greenberg, 2005; Willis, et al, JASIST 2012)
16. Simple schemes [6]
▪ Objectives and principles
– Interoperability
– Easy to generate, lower barrier to produce
▪ Domains
– Multi-disciplinary
– Any genre or format
▪ Architectural layout
– Primarily flat
– Minimal with means to extend
– General (not granular)
Examples
▪ Dublin Core Metadata Element Set (DCMES) ver.1.1
▪ US MARC bibliographic format
– Need training
– Primarily flat
– Extensible
▪ DataCite
– Primarily flat
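To show what "primarily flat" can look like in practice, here is a minimal sketch that serializes a few Dublin Core elements as XML with Python's standard library. The element namespace is the published DCMES 1.1 one; the wrapper element name "metadata", the function name, and the subject value are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def build_dc_record(fields):
    """Build a flat record: one child element per (element name, value) pair."""
    root = ET.Element("metadata")  # hypothetical wrapper element
    for name, value in fields:
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return root


# Values taken from the title slide, except the subject, which is illustrative.
dc_record = build_dc_record([
    ("title", "Metadata for Managing Scientific Research Data"),
    ("creator", "Greenberg, Jane"),
    ("date", "2012-08-22"),
    ("subject", "metadata"),
])

xml_text = ET.tostring(dc_record, encoding="unicode")
print(xml_text)
```

Every element sits directly under the wrapper with no nesting, which is part of why simple schemes are easy to generate and to map between systems.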
18. DataCite example, ver.2.2 [8]
National Institute for
Environmental Studies and
Center for Climate System
Research Japan
19. US MARC bibliographic
format: World Ocean
Circulation Experiment global
data (Moss Landing Marine
Labs and the Monterey Bay
Aquarium Research Institute
Library) [9]
20. Simple/moderate schemes
▪ Objectives and principles
– Interoperability balanced w/specific needs
– Generation requires more expertise
▪ Domains
– Greater domain focus
– Genre diversity within a domain
▪ Architectural layout
– Primarily flat
– Extensibility via connecting to other schemes
– Slightly more granular
Examples
▪ Darwin Core
▪ Access to Biological Collections Data (ABCD)
– Not as flat
▪ Ecological Metadata Language
▪ DCMI Terms
– Graph approach
21. Wieczorek, et al. (2012). Darwin Core: An Evolving Community-Developed Biodiversity Data Standard.
PLoS One, 7(1): e29715. doi: 10.1371/journal.pone.0029715.
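A Darwin Core record is often exchanged as a simple table whose column headers are Darwin Core term names. The sketch below writes one such row as CSV; the column names are Darwin Core terms, while the identifier, species, date, and coordinates are invented sample values.

```python
import csv
import io

# Column headers are Darwin Core term names.
DWC_FIELDS = [
    "occurrenceID",
    "basisOfRecord",
    "scientificName",
    "eventDate",
    "decimalLatitude",
    "decimalLongitude",
]

# One invented occurrence record, for illustration only.
occurrences = [
    {
        "occurrenceID": "example:herbarium:0001",
        "basisOfRecord": "PreservedSpecimen",
        "scientificName": "Quercus alba",
        "eventDate": "2012-08-22",
        "decimalLatitude": "35.91",
        "decimalLongitude": "-79.05",
    }
]


def to_dwc_csv(records):
    """Serialize records as CSV with Darwin Core terms as the header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=DWC_FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


csv_text = to_dwc_csv(occurrences)
print(csv_text)
```

The layout stays flat like the simple schemes, but the terms are specific to biodiversity data, matching the greater domain focus described for simple/moderate schemes.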
23. Properties in the /terms/ namespace
abstract               educationLevel       modified
accessRights           extent               provenance
accrualMethod          format               publisher
accrualPeriodicity     hasFormat            references
accrualPolicy          hasPart              relation
alternative            hasVersion           replaces
audience               identifier           requires
available              instructionalMethod  rights
bibliographicCitation  isFormatOf           rightsHolder
conformsTo             isPartOf             source
contributor            isReferencedBy       spatial
coverage               isReplacedBy         subject
created                isRequiredBy         tableOfContents
creator                issued               temporal
date                   isVersionOf          title
dateAccepted           language             type
dateCopyrighted        license              valid
dateSubmitted          mediator
description            medium
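One practical use of this property list is to catch misspelled or invented property names before a record is published. The sketch below transcribes the properties shown on this slide and expands valid names to full URIs in the DCMI Terms namespace; the function names and the draft record are hypothetical.

```python
DCTERMS_NS = "http://purl.org/dc/terms/"

# Property names transcribed from the slide above.
DCTERMS_PROPERTIES = frozenset("""
abstract accessRights accrualMethod accrualPeriodicity accrualPolicy
alternative audience available bibliographicCitation conformsTo
contributor coverage created creator date dateAccepted dateCopyrighted
dateSubmitted description educationLevel extent format hasFormat hasPart
hasVersion identifier instructionalMethod isFormatOf isPartOf
isReferencedBy isReplacedBy isRequiredBy issued isVersionOf language
license mediator medium modified provenance publisher references
relation replaces requires rights rightsHolder source spatial subject
tableOfContents temporal title type valid
""".split())


def unknown_terms(record):
    """Return record keys that are not properties in the list above."""
    return sorted(name for name in record if name not in DCTERMS_PROPERTIES)


def expand(record):
    """Map short property names to full URIs in the /terms/ namespace."""
    return {DCTERMS_NS + name: value for name, value in record.items()}


# Hypothetical draft record with one misspelled property name.
draft = {"title": "Example dataset", "created": "2012-08-22", "licence": "CC0"}
problems = unknown_terms(draft)
print(problems)

clean = {k: v for k, v in draft.items() if k not in problems}
print(expand(clean))
```

A check like this only confirms that names exist in the vocabulary; it says nothing about whether the values are sensible.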
24. Complex schemes
▪ Objectives and principles
– Interoperability level
– Generation requires greater expertise
▪ Domains
– Genre focus
– Format variation
▪ Architectural layout
– Hierarchical
– Extensive
– Granular
Examples: FGDC, DDI
Content Standard for Digital Geospatial Metadata (CSDGM)/FGDC
  1. Identification Information (M)
  2. Data Quality Information
  3. Spatial Data Organization Information
  4. Spatial Reference Information
  5. Entity and Attribute Information
  6. Distribution Information
  7. Metadata Reference Information (M)
Data Documentation Initiative (DDI)
  1. Concept
  2. Collecting
  3. Processing / Archiving
  4. Distribution / Archiving
  5. Discovery
  6. Analysis
  7. Repurposing
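The seven CSDGM sections lend themselves to a simple completeness check. The sketch below reads the (M) marker on the slide as "mandatory", which is an assumption about the slide's notation, and reports which mandatory sections a draft record still lacks. The record structure and function name are hypothetical and do not reflect the standard's actual XML encoding.

```python
# Section names from the slide; the flag records whether the slide marks (M).
# Reading (M) as "mandatory" is an assumption made for this sketch.
CSDGM_SECTIONS = [
    ("Identification Information", True),
    ("Data Quality Information", False),
    ("Spatial Data Organization Information", False),
    ("Spatial Reference Information", False),
    ("Entity and Attribute Information", False),
    ("Distribution Information", False),
    ("Metadata Reference Information", True),
]


def missing_mandatory(record):
    """List (M) sections that are absent or empty in a draft record."""
    return [
        name
        for name, mandatory in CSDGM_SECTIONS
        if mandatory and not record.get(name)
    ]


# Hypothetical draft: a dict keyed by section name.
fgdc_draft = {
    "Identification Information": {"title": "Hypothetical survey data"},
    "Data Quality Information": {"completeness": "partial"},
}

gaps = missing_mandatory(fgdc_draft)
print(gaps)
```

A real record in a complex scheme nests many elements under each section, which is where the extra expertise comes in.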
25. Summary for descriptive schemes
▪ Simple: Interoperable, easy to generate/low barrier,
generally multidisciplinary, genre/format agnostic,
primarily flat, general (not granular), 15-25 properties
▪ Simple/moderate: Interoperability balanced
w/specific needs, generation requires more expertise,
greater domain focus, extensible via connecting to
other schemes, more granular, more properties
▪ Complex: Interoperability level, generation requires
expertise, genre focus/format variation, hierarchical,
granular, and extensive (100+ properties)
28. Challenges and opportunities
▪ Challenge: Workflow/When to generate the metadata?
– Opportunity: Educate scientists early (Qin, 2009)
– Opportunity: Integrate into social setting w/Center for Embedded Networked Sensing (CENS) (Borgman, Mayernik, etc., 2009-current; Mayernik's dissertation, 2011)
▪ Challenge: Methods for generating metadata (labor intensive)
– Opportunity: Use automatic techniques as much as possible, leverage human expertise (Dryad, DataONE Excel project)
▪ Challenge: Too many standards. Which one do I use?
– Opportunity: Don't panic, join communities, look for examples. (If you can't find them?)
▪ Challenge: Do I need to implement my metadata as linked data?
– Opportunity: No. Explore and develop a best practice. Pursue a 2 pronged approach (Greenberg, et al, 2009)
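As one small example of "use automatic techniques as much as possible", file-level metadata such as name, format, size, and modification date can be read from the file system and left for a person to enrich. This is a minimal sketch using only the Python standard library; it is not the approach used by Dryad or DataONE, and the output keys are chosen for illustration.

```python
import mimetypes
import os
import tempfile
from datetime import datetime, timezone


def describe_file(path):
    """Derive basic descriptive metadata from the file system."""
    stat = os.stat(path)
    mime, _encoding = mimetypes.guess_type(path)
    modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
    return {
        "identifier": os.path.basename(path),
        "format": mime or "application/octet-stream",
        "extent": f"{stat.st_size} bytes",
        "modified": modified.date().isoformat(),
    }


# Create a small throwaway file so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "observations.txt")
    with open(sample, "w", encoding="utf-8", newline="") as fh:
        fh.write("site,count\nA,3\n")
    file_record = describe_file(sample)

print(file_record)
```

Fields a machine cannot infer, such as subject, creator, or collection method, still need the human expertise the slide mentions.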
29. Jumping in…
1. DCMI/NISO Seminars !!
2. DCMI Science and Metadata Community
(http://wiki.dublincore.org/index.php/DCMI_Science_And_Metadata)
3. Digital Curation Center (DCC)
(http://www.dcc.ac.uk/)
4. The Research Data Management
Training, or MANTRA project
(http://datalib.edina.ac.uk/mantra/)
5. DataONE workshops and tutorials
(www.dataone.org/)
31. Concluding comments
▪ Standards are guidelines; no police
– Aim for reasonable quality
▪ KISS: Keep it simple, stupid
– What's vital; what will aid reuse?
▪ Help to move the practice forward
– Share what you learn
▪ Nothing new/it's all new
– Data documentation since ancient times
– SILOS; let's break them down (Willis, et al, 2012)
– Greater connectivity than ever
– Cross-disciplinary approaches for problem solving
33. Footnotes
[1] NSF Data Sharing Policy: http://www.nsf.gov/bfa/dias/policy/dmp.jsp.
[2] NIH Data Sharing Policy: http://grants.nih.gov/grants/policy/data_sharing/.
[3] Organisation for Economic Co-operation and Development, Data and Metadata Reporting and Presentation Handbook: http://www.oecd.org/std/37671574.pdf.
[4] INSPIRE (Infrastructure for Spatial Information in the European Community): http://inspire.ec.europa.eu/index.cfm/pageid/48. The directive was released 15 May 2007 and will be implemented in various stages, with full implementation required by 2019; it aims to create a European Union (EU) spatial data infrastructure.
[5] UK Medical Research Council: http://www.mrc.ac.uk/Ourresearch/Ethicsresearchguidance/datasharing/index.html.
[6] The DCMI Glossary (scroll down for "schema" entry): http://dublincore.org/documents/usageguide/glossary.shtml#schema.
[7] Dublin Core Example: Data from: Divergence time estimation using fossils as terminal taxa and the origins of Lissamphibia (Dryad repository): http://datadryad.org/resource/doi:10.5061/dryad.8120?show=full.
[8] National Institute for Environmental Studies and Center for Climate System Research Japan, animation data (DataCite): http://schema.datacite.org/meta/kernel-2.2/example/datacite-metadata-sample-v2.2.xml.
[9] US MARC bibliographic format: World Ocean Circulation Experiment global data (Moss Landing Marine Labs and the Monterey Bay Aquarium Research Institute Library): http://mlml.kohalibrary.com/cgi-bin/koha/opac-detail.pl?biblionumber=9282.