A presentation by Bill Michener (University of New Mexico and DataONE) about data sharing, archiving and discovery. It was an introduction to a session co-hosted by FRB-CESAB and CEFE (CNRS) in Montpellier.
1. Data Sharing*, Archiving, and Discovery: Tips and Tools
William Michener
College of University Libraries & Learning Sciences
DataONE
University of New Mexico
*Making data available for others to use
3. Data Entropy
[Figure: data entropy curve. Axis labels: Content, Time. Annotations: time of publication, specific details, general details, accident, retirement or career change, death.]
(Michener et al. 1997)
Vines, T.H. et al. Curr. Biol. http://dx.doi.org/10.1016/j.cub.2013.11.014 (2013).
4. Dark data in the long tail
Specific Data are Hard to Find …
The Rest are Inaccessible
PB Heidorn (2008) Library Trends 57 (2), 280-299
5. “the merging of ideas, approaches and
technologies from widely diverse fields of
knowledge to stimulate innovation and
discovery”
Convergent Science
8. The International Biological Program
(IBP): 1964-1974
“… data policies and protocols were never
elaborated nor even agreed to in principle.”
(Porter & Callahan 1994)
A brief history of ecological data
sharing
Michener (2015) Ecological Informatics 29:33-44
9. A brief history of ecological data
sharing
Long Term Ecological
Research Network
(LTER): 1980-present
• LTER Guidelines for Site
Data Management Policies
issued in 1990 (Porter &
Callahan 1994)
• LTER Network Data Access
Policy, Data Access
Requirements, and General
Data Use Agreement
(approved by the LTER
Coordinating Committee
April 6, 2005)
Michener (2015) Ecological Informatics 29:33-44
Approx. 20,000 data packages available
11. NSF Policy from Grant General Conditions
(April 1, 2001)
“NSF … expects investigators to share with
other researchers, at no more than incremental
cost and within a reasonable time, the data,
samples, physical collections and other
supporting materials created or gathered in the
course of the work.”
America COMPETES Act (August 9, 2007)
requires civilian federal agencies to provide
guidelines, policy and procedures, to facilitate
and optimize the open exchange of data and
research between agencies, the public and …
A brief history of ecological data
sharing
1 Michener (2015) Ecological Informatics 29:33-44
12. A brief history of ecological data sharing
1 Michener (2015) Ecological Informatics 29:33-44
13. [Journal] requires, as a condition for publication, that data
supporting the results in the paper should be archived in an
appropriate public archive, such as [list of approved archives here].
Data are important products of the scientific enterprise, and they
should be preserved and usable for decades in the future. Authors
may elect to have the data publicly available at time of publication,
or, if the technology of the archive allows, may opt to embargo
access to the data for a period up to a year after publication.
Exceptions may be granted at the discretion of the editor, especially
for sensitive information such as human subject data or the location
of endangered species.
The 2011 Joint Data Archiving
Policy (JDAP; see datadryad.org)
Michener (2015) Ecological Informatics 29:33-44
14. “PLOS journals require authors to make all
data underlying the findings described in their
manuscript fully available without restriction,
with rare exception¹.”
Nature, Science, Ecological Monographs, …
A brief history of ecological data
sharing
¹ Michener (2015) Ecological Informatics 29:33-44
17. Community Practices and Perceptions
[Figure: survey results on a 0 to 100 axis, comparing Baseline (2010) and Follow-up (2014) responses under Perception and Satisfaction. Item shown: “Use others’ datasets if their data were easily accessible”.]
Views: 35,693; Citations: 188 (published Jun 2011)
Views: 8,342; Citations: 8 (published Aug 2015)
21. “data sharing accelerates the pace of science by
enabling researchers to discover and re-use relevant
data, combine data from multiple sources, and ask
new questions”
“public trust increases as science is made more
transparent and findings can be reproduced and
verified”
Researchers “benefit from the credit attributed to
them when their archived data are cited and used by
others” and “citation rates of publication increase
when the research data are shared”
Benefits of Data Sharing
Michener (2015) Ecological Informatics 29:33-44
23. Best Practices for Sharing Data:
1. Create and Follow a Data Management Plan
Michener WK (2015) Ten Simple Rules
for Creating a Good Data Management Plan.
PLoS Comput Biol 11(10): e1004525.
doi:10.1371/journal.pcbi.1004525
24. Best Practices for Sharing Data:
2. Adopt/follow Data Sharing & Attribution Policies
Joint Data Archiving Policy: [Journal]
requires, as a condition for publication, that
data supporting the results in the paper
should be archived in an appropriate public
archive, such as [list of approved archives
here]. Data are important products of the
scientific enterprise, and they should be
preserved and usable for decades in the
future. Authors may elect to have the data
publicly available at time of publication, or, if
the technology of the archive allows, may opt
to embargo access to the data for a period up
to a year after publication. Exceptions may be
granted at the discretion of the editor,
especially for sensitive information such as
human subject data or the location of
endangered species.
http://datadryad.org/pages/jdap
Whitlock, M. C., M. A. McPeek, M. D. Rausher, L.
Rieseberg, and A. J. Moore. 2010. Data Archiving.
American Naturalist. 175(2):145-146,
http://dx.doi.org/10.1086/650340
Creative Commons Licenses
(https://creativecommons.org)
25. Best Practices for Sharing Data:
3. Fully Document the Data
Darwin Core – species and biodiversity
collections
EML – Ecological Metadata Language
ISO 19115 – for wide variety of geospatial
data
https://knb.ecoinformatics.org/#tools/morpho
http://rs.tdwg.org/dwc/
26. Best Practices for Sharing Data:
4. Preserve the Data, Software and Workflows
http://specifyx.specifysoftware.org
Catalog of 1,500+ Data Repositories
27. Best Practices for Sharing Data:
5. “Publish” and Disseminate the Data Products
http://www.gbif.org
http://www.vertnet.org
http://www.nature.com/sdata/
29. Role of the Data Archive
Cook et al. (In press) Preserve: Protecting Data for Long-Term Use. In: Recknagel F, Michener WK
(eds) Ecological Informatics, 4th edn. Springer.
30. Bad Practices for Preserving Data
Example from Lesson 4 in DataONE education modules (see DataONE.org)
31. Bad Practices for Preserving Data
Example from Lesson 4 in DataONE education modules (see DataONE.org)
32. Best Practices for Preserving Data
Cook et al. (In press) Preserve: Protecting Data for Long-Term Use. In: Recknagel F, Michener WK (eds)
Ecological Informatics, 4th edn. Springer.
1. “Keep similar measurements together in one data
set”
2. Follow standard approaches (e.g. International
System of Units, SI) when defining names, units & formats
(e.g., yyyy-mm-dd or yyyymmdd for dates:
2016-12-20 or 20161220)
3. Use consistent data organization
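The date convention in point 2 can be applied mechanically when cleaning a data set. The sketch below is only an illustration, not part of the original slides: the function name `to_iso_date` and the example input formats are assumptions, and it relies solely on Python's standard `datetime` module.

```python
from datetime import datetime


def to_iso_date(text, source_format, compact=False):
    """Convert a date string in a known source format to yyyy-mm-dd.

    `source_format` is a strptime pattern describing how the date was
    originally recorded (hypothetical examples below). With compact=True
    the result is yyyymmdd instead.
    """
    parsed = datetime.strptime(text, source_format).date()
    return parsed.strftime("%Y%m%d") if compact else parsed.isoformat()


if __name__ == "__main__":
    # Day-first and month-first field notes end up in one consistent form.
    print(to_iso_date("20/12/2016", "%d/%m/%Y"))
    print(to_iso_date("12/20/2016", "%m/%d/%Y", compact=True))
```

Stating the source format explicitly, rather than guessing it, avoids silently swapping day and month in ambiguous dates such as 03/04/2016.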
33. Best Practices for Preserving Data
Cook et al. (In press) Preserve: Protecting Data for Long-Term Use. In: Recknagel F, Michener WK (eds)
Ecological Informatics, 4th edn. Springer.
4. Use stable file formats:
Text/CSV, shapefile, GeoTIFF, HDF, netCDF
5. Specify spatial & temporal coordinates
6. Assign descriptive file names
“Soil carbon and nitrogen concentrations in
Barrow….”
7. Save raw data in read-only format and save processing
scripts (R, MATLAB, SAS)
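Points 4 and 7 can be combined in a few lines of scripting. The following is a minimal sketch, assuming raw observations are held in memory as rows. The function name `save_raw_readonly`, the column names, and the file name are hypothetical, and the slides themselves mention R, MATLAB and SAS rather than Python.

```python
import csv
import stat
from pathlib import Path


def save_raw_readonly(rows, header, path):
    """Write raw data as plain CSV, then remove write permission.

    Any later processing should read this file and write its results
    elsewhere, so the raw data are never modified in place.
    """
    path = Path(path)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)
    # Read-only for owner, group and others.
    path.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return path


if __name__ == "__main__":
    # Hypothetical descriptive file name and columns, for illustration only.
    save_raw_readonly(
        rows=[["2016-12-20", "3.1", "0.2"]],
        header=["date", "soil_carbon_pct", "soil_nitrogen_pct"],
        path="barrow_soil_carbon_nitrogen_raw.csv",
    )
```

File permissions only guard against accidental edits on the local machine. They do not replace the backup copies described in point 10.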
34. Best Practices for Preserving Data
Michener (In press) Quality assurance and quality control. In: Recknagel F, Michener WK (eds) Ecological
Informatics, 4th edn. Springer.
8. Assure data quality
9. Provide complete documentation
10. Protect data (1 original, 1 copy onsite, 1 off-site)
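The "1 original, 1 copy onsite, 1 off-site" rule in point 10 is easier to trust when every copy is verified against the original. The sketch below is one possible approach and is not taken from the slides: the helper names `sha256_of` and `replicate` are assumptions, as are the destination directories, which stand in for wherever the onsite and off-site storage is mounted.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replicate(original, destinations):
    """Copy `original` into each destination directory and verify each copy.

    Returns the checksum of the original and the list of verified copies.
    Raises OSError if a copy does not match the original.
    """
    original = Path(original)
    expected = sha256_of(original)
    copies = []
    for directory in destinations:
        directory = Path(directory)
        directory.mkdir(parents=True, exist_ok=True)
        target = directory / original.name
        shutil.copy2(original, target)  # copy2 also keeps timestamps
        if sha256_of(target) != expected:
            raise OSError(f"checksum mismatch for {target}")
        copies.append(target)
    return expected, copies
```

A call such as `replicate("raw.csv", ["/mnt/onsite_backup", "/mnt/offsite_backup"])` would produce the two extra copies. Both paths are placeholders. Storing the returned checksum alongside the documentation lets the copies be re-checked later.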
35. The Data Repository Will Ensure:
Cook et al. (In press) Preserve: Protecting Data for Long-Term Use. In: Recknagel F, Michener WK (eds)
Ecological Informatics, 4th edn. Springer.
1. Files are received as sent
2. Documentation describes files
3. Parameters and units are defined
4. File content is consistent
5. Parameter values are reasonable
6. Files are reformatted and reorganized
if necessary
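Data providers can run some of these checks themselves before submission. The sketch below illustrates checks 3 to 5 on a CSV table, assuming a small dictionary of parameter definitions. The parameter names, units and plausible ranges are hypothetical, and a real repository's validation is likely to be more extensive.

```python
import csv
import io

# Hypothetical parameter definitions: unit and plausible range per column.
PARAMETERS = {
    "air_temp_c": {"unit": "degree Celsius", "min": -60.0, "max": 60.0},
    "soil_carbon_pct": {"unit": "percent by mass", "min": 0.0, "max": 100.0},
}


def check_table(text, parameters=PARAMETERS):
    """Return a list of problems found in CSV text (empty list if none)."""
    problems = []
    reader = csv.DictReader(io.StringIO(text))
    # Check 3: every column must have a definition with units.
    for name in reader.fieldnames or []:
        if name not in parameters:
            problems.append(f"column '{name}' has no parameter definition")
    for row in reader:
        line = reader.line_num
        # Check 4: DictReader files extra fields under the key None and
        # fills missing fields with the value None.
        if None in row or None in row.values():
            problems.append(f"line {line}: wrong number of fields")
            continue
        # Check 5: values must be numeric and within the plausible range.
        for name, spec in parameters.items():
            if name not in row:
                continue
            try:
                value = float(row[name])
            except ValueError:
                problems.append(f"line {line}: {name}={row[name]!r} is not a number")
                continue
            if not spec["min"] <= value <= spec["max"]:
                problems.append(f"line {line}: {name}={value} outside plausible range")
    return problems
```

For example, `check_table("air_temp_c,soil_carbon_pct\n12.5,3.1\n999,3.0\n")` reports the second data row, because 999 lies outside the assumed temperature range.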