1. Research Data Management:
a gentle introduction
Martin Donnelly, Digital Curation Centre, University of Edinburgh
CLS Live, University of Huddersfield, 3 June 2014
2. OVERVIEW
1. Introductions and definitions
The Digital Curation Centre
Research data management
What do we mean by ‘data’, exactly?
2. Data as a hot topic: politics and practical concerns
3. Barriers and current activities
Quick interactive session
4. Support and resources
A few rules of thumb / do’s and don’ts
Take-home messages
4. The Digital Curation Centre
The DCC (est. 2004) is…
A UK centre of expertise in digital
preservation, with a particular focus on
research data management (RDM)
Based across three sites: Universities of
Edinburgh, Glasgow and Bath
Working with a number of UK universities
to identify gaps in RDM provision and
raise capabilities across the sector
Also involved in a variety of international
collaborations
7. What is RD(M)?
“the active management and
appraisal of data over the
lifecycle of scholarly and
scientific interest”
Data management is a part of
good research practice.
- RCUK Policy and Code of Conduct on the
Governance of Good Research Conduct
8. The old way of doing things
1. Researcher collects data (information)
2. Researcher interprets/synthesises data
3. Researcher writes paper based on data
4. Paper is published (and preserved)
5. Data is left to benign neglect, and
eventually ceases to be accessible
9. The new way of doing things
Plan
Collect
Assure
Describe
Preserve
Discover
Integrate
Analyze
SHARE
…and
RE-USE
The DataONE
lifecycle model
10. Helicopter view: What are the benefits of RDM?
TRANSPARENCY: The data that underpins research
can be made open for anyone to scrutinise, and to use
in attempts to replicate findings.
EFFICIENCY: Data collection can be funded once, and
used many times for a variety of purposes.
RISK MANAGEMENT: A pro-active approach to data
management reduces the risk of inappropriate
disclosure of sensitive data, whether commercial or
personal.
PRESERVATION: Lots of data is unique, and can only
be captured once. If lost, it can’t be replaced.
11. Definitions vary from discipline to discipline, and from funder to funder…
Here’s a science-centric definition:
“The recorded factual material commonly accepted in the scientific community as
necessary to validate research findings.” (US Office of Management and Budget,
Circular 110)
[Addendum: This policy applies to scientific collections, known in some disciplines
as institutional collections, permanent collections, archival collections, museum
collections, or voucher collections, which are assets with long-term scientific value.
(US Office of Science and Technology Policy, Memorandum, 20 March 2014)]
And another from the visual arts:
“Evidence which is used or created to generate new knowledge and
interpretations. ‘Evidence’ may be intersubjective or subjective; physical or
emotional; persistent or ephemeral; personal or public; explicit or tacit; and is
consciously or unconsciously referenced by the researcher at some point during
the course of their research.”
(Leigh Garrett, KAPTUR project: see http://kaptur.wordpress.com/
2013/01/23/what-is-visual-arts-research-data-revisited/)
Okay, but what is ‘data’ exactly?
13. Nature, 09/08; Economist, 02/10
Popular Science, Science, 02/11
Nature, 09/09; ACM, 12/08
InformationWeek, 08/10; Computerworld,
A hot topic: 5 years of front pages…
14. Developments in sensor technology,
networking and digital storage enable
new research and scientific paradigms
As costs also fall, possibilities for data
sharing, citation and re-use become
much more widespread
Journals dedicated solely to publishing
data have even started to appear. That’s
not to say it’s an entirely new thing:
journals have always published data,
just never before at such scale…
Technology
16. Repurposing / value for money (VfM) via data re-use
Ships’ log books build picture of climate
change 14 October 2010
You can now help scientists understand the
climate of the past and unearth new historical
information by revisiting the voyages of First
World War Royal Navy warships.
Visitors to OldWeather.org will be able to
retrace the routes taken by any of 280 Royal
Navy ships. These include historic vessels such
as HMS Caroline, the last survivor of the 1916
Battle of Jutland still afloat. By transcribing
information about the weather and interesting
events from images of each ship's logbook, web
volunteers will help scientists build a more
accurate picture of how our climate has
changed over the last century.
http://www.nationalarchives.gov.uk/news/503.htm
Detail from Royal Navy Recruitment poster, RNVR
Signals branch, 1917 (Catalogue reference: ADM
1/8331)
Endeavour, 1768-71
(Captain Cook)
HMS Beagle,
1830-34
HMS Torch,
1918
17. 6.9 The Research Councils expect the researchers they fund
to deposit published articles or conference proceedings in
an open access repository at or around the time of
publication. But this practice is unevenly enforced.
Therefore, as an immediate step, we have asked the
Research Councils to ensure the researchers they fund
fulfil the current requirements. Additionally, the Research
Councils have now agreed to invest £2 million in the
development, by 2013, of a UK ‘Gateway to Research’. In
the first instance this will allow ready access to Research
Council funded research information and related data but
it will be designed so that it can also include research
funded by others in due course. The Research Councils will
work with their partners and users to ensure information is
presented in a readily reusable form, using common
formats and open standards.
Government pressure/support
http://www.bis.gov.uk/assets/biscore/innovation/docs/i/11-1387-innovation-and-research-strategy-for-growth.pdf
18. Funder principles/expectations
1. Public good
2. Preservation
3. Discovery
4. Confidentiality
5. First use
6. Recognition
7. Public funding
Six of the seven RCUK
councils require data
management plans (or
equivalent), as do
Wellcome Trust, Cancer
Research UK, and more…
20. (Aside: Open Data)
Open Data is a philosophy, underpinned by
pragmatism… transparency + utility.
“Open data is the idea that certain data should be
freely available to everyone to use and republish as
they wish, without restrictions from copyright, patents
or other mechanisms of control.” – Wikipedia
Governments, cities etc are all getting onboard
Open Knowledge Foundation is basically the political /
activist wing: http://okfn.org/
From the government / industry side, we have the
Open Data Institute: http://theodi.org/
21. Controversial FOI requests to…
- University of East Anglia
- Queen’s University Belfast
- University of Stirling
Risk management
22. - Reinhart & Rogoff (2010) “Growth in a Time of Debt” - paper not peer-reviewed, data
not initially made available…
- Very influential and repeatedly cited by politicians to lend weight to economic strategy
- Multiple issues (selective exclusions, unconventional weightings, coding error)
identified by a postgrad researcher attempting to replicate the paper’s findings
- Widespread embarrassment, but at least the errors were discovered!
Research quality and integrity
24. Why don’t we live in a data sharing utopia?
Four main reasons…
Lack of understanding of the fundamental
issues
Lack of joined-up thinking within
institutions, countries, internationally…
Issues around ownership / privacy
Technical/financial limitations and the need
for appraisal
25. What are UK HEIs doing about it?
Three principal areas of focus
Developing and integrating their technical
infrastructure (storage space, repositories/
CRIS systems, data catalogues, etc)
Developing human infrastructure (creating
policies, assessing current data management
capabilities, identifying areas of good practice,
data management plan templates, tailoring
training and guidance materials…)
Developing business plans for sustainable
services / roles
Forming cross-function (hybrid) working groups,
advisory groups, task forces, etc…
http://blog.soton.ac.uk/keepit/2010/01/28/aida-and-institutional-wobbliness/
26. Quick interactive session: data management
planning
Checklist for a Data
Management Plan, v4.0
(2013)
www.dcc.ac.uk/resources/data-management-plans
Questions
How confident would
you be about completing
each section?
What help or advice is
available in the
university?
DMP SECTIONS
1. Administrative Data, e.g. project name,
description, PI, funder, etc
2. Data Collection, e.g. description, capture
methods, etc
3. Documentation and Metadata, e.g. what
information is needed for the data to be
accessed and understood in the future?
4. Ethics and Legal Compliance, e.g. consent,
sensitivity, copyright/IPR
5. Storage and Backup, e.g. where will data be
held and backed up? Security and access
issues
6. Selection and Preservation, e.g. keep it all or
just some? How long should it be kept?
7. Data Sharing, e.g. how will data be found and
accessed, any restrictions?
8. Responsibilities and Resources, e.g. who will
do it and who will pay?
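One way to make the checklist concrete is to treat the eight sections as a simple structure and track which ones still need an answer. The sketch below is illustrative only: the class name `DataManagementPlan`, its methods, and the sample answers are hypothetical and are not part of the DCC checklist or of DMPonline.

```python
from dataclasses import dataclass, field

# Section titles follow the eight DMP sections listed above.
DMP_SECTIONS = [
    "Administrative Data",
    "Data Collection",
    "Documentation and Metadata",
    "Ethics and Legal Compliance",
    "Storage and Backup",
    "Selection and Preservation",
    "Data Sharing",
    "Responsibilities and Resources",
]


@dataclass
class DataManagementPlan:
    """Hypothetical container: one free-text answer per checklist section."""

    answers: dict = field(default_factory=dict)

    def fill(self, section, text):
        """Record an answer, rejecting section names not in the checklist."""
        if section not in DMP_SECTIONS:
            raise ValueError(f"Unknown section: {section}")
        self.answers[section] = text.strip()

    def missing_sections(self):
        """Return sections with no (or empty) answer, in checklist order."""
        return [s for s in DMP_SECTIONS if not self.answers.get(s)]


# Illustrative answers only; the project details are made up.
plan = DataManagementPlan()
plan.fill("Administrative Data", "Example project; PI: A. Researcher")
plan.fill("Storage and Backup", "University network drive, backed up nightly")

for section in plan.missing_sections():
    print("Still to complete:", section)
```

A list of unanswered sections like this is a reasonable prompt for the two questions above: which sections you are unsure about, and who in the university can help with each.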
27. Quick interactive session: data management
planning
Outcomes
It’s not necessary – or even desirable – for every researcher
to become expert in every aspect of data management
Universities have an increasing obligation to provide
infrastructure and support
Huddersfield have developed a dedicated web area at
https://www.hud.ac.uk/cls/researchdata/
Specific expertise may also be available from the research
office, library, IT, departmental support staff, legal services,
etc…
29. i. DCC resources
Publications
The DCC publishes a series of themed Briefing Papers, How-To Guides
and Case Studies, pitched at different audiences / levels of detail
http://www.dcc.ac.uk/resources/briefing-papers
http://www.dcc.ac.uk/resources/how-guides
http://www.dcc.ac.uk/resources/developing-rdm-services
Training
e.g. DC101 courses and Curation Reference Manual
Advice
e.g. Disciplinary metadata, www.dcc.ac.uk/resources/metadata-standards
Tools
DMPonline, CARDIO, Data Asset Framework, DRAMBORA
Events
International Digital Curation Conference (most recent was in San
Francisco, February 2014)
Research Data Management Forum (themed events – next one is
on Workflows and Lifecycle Models, London, 20 June 2014)
30. ii. Other resources
Jisc services and resources
RDM resources, www.jisc.ac.uk/guides/research-data-management
EDINA and Mimas (national data centres)
JISCMRD projects – Phase 1 (2009-2011) and Phase 2 (2011-2013)
1) Research Data Management Infrastructure (RDMI)
2) Research Data Management Planning (RDMP)
3) Support and Tools
4) Citing, Linking, Integrating and Publishing Research Data (CLIP)
5) Research Data Management Training Materials
6) Enhancing DMPonline
7) Events
Universities
Good materials are available from Edinburgh, Cambridge, Oxford,
Glasgow, Bristol, and many others
35. But! You generally
need a reason NOT to
share, e.g.
- Commercial interests
- Ethical concerns
- Data Protection Act
So… don’t share it all
36. Why not?
1. We probably can’t afford the
costs of storage: increasing
volumes outpace declining
storage hardware costs
and
2. We probably can’t afford the
time it will take to ensure it
remains
accessible/discoverable
According to John Gantz and David Reinsel (2011), Extracting
Value from Chaos, http://www.emc.com/digital_universe
And… don’t keep it all
38. How to decide?
1. Relevance to Mission – including any legal/funder
requirement to retain the data beyond its
immediate use.
2. Scientific or Historical Value – significance and
relationship to publications etc.
3. Uniqueness – can it be found elsewhere / if we
don’t preserve it, who will?
4. Potential for Redistribution – quality / IP / ethical
concerns are addressed.
5. Non-Replicability – either impossible to replicate
(e.g. atmospheric or social science data) or not
financially viable.
6. Economic Case – costs of managing and
preserving the resource stack up well against
potential future benefits.
7. Full Documentation – surrounding / contextual
information necessary to facilitate future
discovery, access, and reuse is adequate.
How to Appraise & Select Research Data
for Curation
Angus Whyte, Digital Curation Centre,
and Andrew Wilson, Australian National
Data Service (2010)
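The seven criteria above can be recorded as a simple yes/no checklist per dataset. The sketch below is an assumption-laden illustration, not the method of the DCC guide: the guide does not prescribe a numeric score or threshold, and the function name `appraise` and the example judgements are hypothetical. It only organises the judgements so that gaps are visible.

```python
# Criterion names follow the seven headings on the slide above.
CRITERIA = [
    "Relevance to Mission",
    "Scientific or Historical Value",
    "Uniqueness",
    "Potential for Redistribution",
    "Non-Replicability",
    "Economic Case",
    "Full Documentation",
]


def appraise(dataset_name, judgements):
    """Summarise yes/no judgements against the seven criteria.

    `judgements` maps a criterion name to True or False. Criteria that
    are left out are reported as unassessed rather than treated as 'no'.
    """
    unknown = set(judgements) - set(CRITERIA)
    if unknown:
        raise ValueError(f"Unknown criteria: {sorted(unknown)}")
    return {
        "dataset": dataset_name,
        "met": [c for c in CRITERIA if judgements.get(c) is True],
        "not_met": [c for c in CRITERIA if judgements.get(c) is False],
        "unassessed": [c for c in CRITERIA if c not in judgements],
    }


# Illustrative judgements for a made-up dataset.
summary = appraise(
    "field-survey-2013",
    {
        "Relevance to Mission": True,
        "Uniqueness": True,
        "Non-Replicability": True,
        "Full Documentation": False,
    },
)
print(summary["not_met"])
print(summary["unassessed"])
```

The output is deliberately a summary rather than a keep/discard verdict: the decision still rests with the people doing the appraisal.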
39. A few do’s and don’ts
DO: Have a plan for your data
DON’T: Make it up as you go along

DO: Keep backups. Make this easy with automated syncing services like Dropbox, provided your data isn’t too sensitive
DON’T: Carry the only copy around on a memory card, your laptop, your phone, etc

DO: Describe your data as you collect it. This makes it possible for others to interpret it, and for you to do the same a few years down the line
DON’T: Leave this till later. The quality of metadata decreases with time, and the best metadata is created at the moment of data capture

DO: Save your work in open file formats, where possible, and use accepted metadata standards to enable like-with-like comparison
DON’T: Invent new ‘standards’ where community norms already exist

DO: Deposit your data in a data centre or repository, and link it to your publications
DON’T: Be afraid to ask for help. This will exist both within your institution, and via national support organisations like the DCC
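As an illustration of "describe your data as you collect it", the sketch below writes a small JSON description next to a data file at the moment of capture, together with a SHA-256 checksum so that a backup copy can later be checked against the original. It is a minimal sketch under stated assumptions: the function names, the `.metadata.json` suffix, and the field names (`description`, `creator`, and so on) are hypothetical and do not follow any particular metadata standard. Where a disciplinary standard exists, use that instead.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def sidecar_path(data_path):
    """Return the (hypothetical) sidecar location for a data file."""
    data_path = Path(data_path)
    return data_path.with_name(data_path.name + ".metadata.json")


def describe_file(data_path, description, creator):
    """Write a JSON description next to the data file and return its path."""
    data_path = Path(data_path)
    record = {
        "file": data_path.name,
        "description": description,
        "creator": creator,
        "described_at": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(data_path.read_bytes()).hexdigest(),
    }
    sidecar = sidecar_path(data_path)
    # JSON is an open, plain-text format readable without special software.
    sidecar.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return sidecar


def is_unchanged(data_path):
    """Compare the file's current checksum with the one in its sidecar."""
    data_path = Path(data_path)
    record = json.loads(sidecar_path(data_path).read_text(encoding="utf-8"))
    current = hashlib.sha256(data_path.read_bytes()).hexdigest()
    return current == record["sha256"]


if __name__ == "__main__":
    # Demonstration on a throwaway file; the contents are made up.
    with tempfile.TemporaryDirectory() as folder:
        data_file = Path(folder) / "readings.csv"
        data_file.write_text("time,temperature\n0,11.5\n", encoding="utf-8")
        describe_file(data_file, "Example temperature readings", "A. Researcher")
        print("Unchanged since description:", is_unchanged(data_file))
```

Running the same check on a backup copy gives a quick indication that the copy still matches the file that was originally described.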
40. Last slide: take-home messages
Research data management (RDM) is…
An integral part of doing quality research in the 21st
century
Increasingly expected / mandated by funders,
publishers and others
An opportunity for new discoveries and different
approaches to research
A safeguard against inappropriate data disclosure
An activity that requires careful planning and
consideration, and – ideally – coordination and support
across many stakeholder types
41. Thank you
Questions?
Image credits
Slide 2 (forest) – http://assets.worldwildlife.org/photos/934/images/hero_small/forest-overview-HI_115486.jpg?1345533675
Slide 3 (dictionary) – http://www.flickr.com/photos/dougbelshaw/
Slide 12 (politics) – https://www.flickr.com/photos/junglearctic/
Slide 23 (barriers) – http://www.flickr.com/photos/thetrapezium/
Slide 24 (utopia) – http://www.flickr.com/photos/burningmax/
Slide 28 (Thierry) – https://twitter.com/AFC_Fisher/
Slide 33 (greenhouse) – http://www.flickr.com/photos/mykl/
Slide 41 (love note) – http://www.edawax.de/wp-content/uploads/2013/01/Metadata_love250.jpg
Thanks to Sarah Callaghan, PREPARDE, for the Rosse example
This work is licensed under the
Creative Commons Attribution
2.5 UK: Scotland License.
For more about DCC services see www.dcc.ac.uk
or follow us on twitter @digitalcuration and #ukdcc
Martin Donnelly
Digital Curation Centre
University of Edinburgh
martin.donnelly@ed.ac.uk
@mkdDCC
Editor’s Notes
First cohort of institutional engagements, 2011-2013
Painting in broad strokes here, of course…
Share = deposit, link, publish, etc
Will unpack these over the course of the presentation, but first
Think about what you do in your own research
…and as the worlds of business and academia continue to merge… Interest in data is not limited to academia: the business world sees data as a valuable and potentially lucrative resource, a real game-changer…
Earliest academic scientific journal is Journal des sçavans, published on 5 Jan 1665
We can now publish and re-use data in a much more structured way, automating the process and crunching more data via computers than we could when it was only available on paper.
https://www.youtube.com/watch?v=n603rEnEGXA
Philip Morris International vs University of Stirling (2011) - another example of unanticipated data re-use!
There’s a delicate balance between the rights of researchers, of human research subjects, of funders, and other interested stakeholders to enable or prevent access to research data…
So, those are the benefits, but there are still barriers to this utopia…
Forming cross-function (hybrid) working groups, advisory groups, task forces, etc
IT departments in particular tend to think of data management as primarily a hardware/technical problem. It’s not – the human side is bigger
The two main goals of data management are (1) to make data more widely accessible, and (2) to prevent access to sensitive data
2. Prioritise based on relationship with publications, e.g. underpins scientific record (cf. Sarah Callaghan, PREPARDE)
5. Privilege irreproducible data…
A DMP is a basic statement of how you will create, manage, share and preserve your data
Funders expect the decisions to be justified, particularly where they are not in line with the funder’s policy (e.g. limits on data sharing)