Small Data: How Elsevier Might Help with Research Data Management
1. Small Data: How Elsevier Might Help with Research Data Management
David Marques
27 February 2013
Research Data Symposium
Columbia University
2. Assertions
• We share a common goal: an open system of
ubiquitous sharing of research data in
repositories that are
– discipline-specific
– controlled-vocabulary annotated
– normalized
• Only a very small portion of research data is being
shared in discipline-specific repositories
3. Problem statement
• There are many barriers to sharing data
• There are problems with sustainable funding for
repositories
4. Points of this presentation
• We can help remove the barriers by
– applying a rigorous yet efficient process
– using discipline-specific informatics skills
– providing credit assignment and assessment
– helping capture metadata early and digitally
• It is possible to create sustainable funding models
for open data repositories, and we can help
5. Big Data vs Research Data
Data life cycle taken from DataONE
[Figure: two copies of the data life cycle (Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, Analyze), one labeled 'Big Data' Emphasis and the other labeled Research Data Pain]
6. Dataset Repositories: MANY solutions
• Figshare [http://figshare.com/] (Digital Science)
• GigaDB [http://gigadb.org/] (BioMed Central)
• DataDryad [http://datadryad.org/]
• Australian National Data Service [http://www.ands.org.au/]
– but: their goal is to move from
• Amazon’s Glacier [http://aws.amazon.com/glacier/]
7. Problem 1: Barriers to Data Disclosure and Sharing
• Non-digital metadata
• Different skill sets
• Takes time and mindset away from research
• Requires common nomenclature
• Cost
• It is a long-tail problem: thousands of narrow solutions provide the best value
• Open to misinterpretation
• Lack of credit
• Intellectual property or possible patent issues
• Easier contradiction
• No incentive, little value to the sharer
• Privacy and security concerns
8. Are supplemental files the answer?
• Scope
– 15% of 2012 Elsevier articles had supplemental files
– ~ 1% have spreadsheets
– ~ 2% have either spreadsheets or zip files
• Extracting value
– no rules for supplemental files
– no common nomenclatures
– analytics, comparisons, trends are hard
• Elsevier recommends (and some journals such as Cell Press
journals require) that authors share/deposit data in
discipline repositories
• Linking helps overcome the credit barrier
– Elsevier links articles to/from datasets in open repositories
– 35 today (including EarthChem)
– 10 more in progress
13. Problem 2: Sustainability
• Many repositories are grant-funded initially, as research projects – and
funding bodies often do not intend to fund repositories long
term
• Can we fund from a Gold Open Access model?
• Can we fund from high-end analytics subscriptions?
• Can we fund some of them from health care and corporate
use?
14. [Figure: chart of repository activity costs, with segments:]
• Plan / Propose
• Support Services
• Acquisition: submission agreement, data formats, IP rules, user documentation and support
• Access: searching and ordering, user guides, delivery of result sets and reports
• Storage, Data Management
• Ingest: receive, QA and validation, transform, create metadata (taxonomies), updates, reference linking
• Produce / Publish / Manage
Percentage shares shown in the chart: 10%, 25%, 19%, 15%, 6%, 25%
Summary of data in: Keeping Research Data Safe 2, Beagrie et al., 2010, funded by JISC
15. Pain Points and Elsevier Strengths and Expertise
• Taxonomies
– 50+ discipline-specific taxonomies – core to Elsevier
• At-scale, efficient, best-practices process
• At-scale analytics
• Turning freely-available data into high-value solutions for corporate use
without advertising (advertising models require very large customer groups)
• Impact analysis and reporting
16. Research Data Services – new group at Elsevier
• Goals
– Increase archiving and sharing of research data (as
requested by funding bodies)
– Increase the value of shared data (with metadata)
– Foster and assist with the credit and impact assessment of
research data for the researcher, the institution, and the funding
bodies
– Increase the sustainability of data repositories
• Principles
– Open data – all data remain open and available
– Collaborative – with institutions, the research community,
funding bodies
– Transparent business model – if we make money, some goes
back to fund the repositories
17. Research Data Management
[Figure: data life cycle (Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, Analyze) annotated with pilots. Other labels in the figure: Data Management; Method Tools; Linked Data Repositories; RDM (VizTrails); Repositories, Data Mgmt Plans]
• Pilot: see if we can scale a repository and make it financially sustainable
• Pilot: collecting data with an app, integrating and sharing with a dashboard
• Pilot: collect and standardize data from different repositories to create insight
• Pilot: annotate data and methods with standard taxonomies
• Pilot: create directories to help discover data in shared repositories
• IEDA/EarthC collaboration with Kerstin Lehnert
18. Disclosure Pilot Benefits for the Researcher
• Immediate visibility and overview of the research (PI
Dashboard)
• Enhanced discoverability of research data attributable to the
university and the research team
• Credit/impact for the university, the research team, and the
funding bodies
• Acknowledgement by the funding bodies of the
disclosure/sharing of the data
• [better, faster science]
19. Disclosure Pilot Benefits for the Institution
• Increased rigor of data management
– consistency
– best practices
– overview metadata in research management information systems
• Step toward completeness of research data management
• Compliance with funding body requirements, stronger base
from which to request
• Increased visibility, discoverability, credit
20. Disclosure Pilot Benefits for the Funding Body
• Increased data disclosure and sharing
• Increased discoverability of data (with funding body credit)
• Increased opportunity for ‘fourth paradigm’ (analytics-derived)
science – better, faster science
• Credit/impact for sponsored research
• Standardization and best practices in data management plans
and actual data curation/preservation
24. An interesting quote at the IDCC13 cost workshop
[loosely quoted, I did not catch it verbatim]
We can’t do this by ourselves. We should get someone with
business savvy to partner with us.
• Metadata are not captured digitally
• Open to misinterpretation
• Lack of credit
• Intellectual property or possible patent issues
• Easier contradiction
• No incentive, little value to the sharer, even disincentivized by current reward models
• Different skill sets
• Takes time and mindset away from research
• Requires common nomenclature
– missing in many domains
– nomenclature convergence only happens in mature science
– many researchers are invested in nomenclature discussions
• Privacy and security concerns
• Cost
• It is a long-tail problem: thousands of narrow solutions provide the best value
• Analytics at scale
– MEDai: analyze every treatment event in a hospital for protocol variations
– Risk Solutions: analyze public data for fraud detection and prediction
– Shepardizing legal cases
• Funding from free
– Reaxys (chemical reactions database, literature and patents)
– Chemical resistance of plastics (manufacturer data normalized)
– Pathway Studio (enzymatic pathways for drug discovery, eventually personalized medicine)
– Geofacets (geologic information for exploration)
– LexisNexis