Slides from my PhD proposal defense for the project "Foundational studies for measuring the impact, prevalence, and patterns of publicly sharing biomedical research data." Dept of Biomedical Informatics, University of Pittsburgh.
4. Sharing research data
PAST MEDICAL HISTORY:
Past medical history showed she had
superficial phlebitis times two in the past, had
non-insulin dependent diabetes mellitus for
four years.
She had been hypothyroid for three years.
HISTORY OF PRESENT ILLNESS:
The patient is a 58-year-old female, …
http://upload.wikimedia.org/wikipedia/commons/7/76/PeptideMSMS.jpg; http://en.wikipedia.org/wiki/Image:Helices.png;
http://en.wikipedia.org/wiki/Image:Heatmap.png; http://en.wikipedia.org/wiki/Image:Microarray2.gif;
http://zellig.cpmc.columbia.edu/medlee/demo/; http://www.plosone.org/article/fetchArticle.action?articleURI=info:doi/10.1371/journal.pone.0000441
9. But... costly for authors
Find
Organize
Document
Deidentify
Format
Decide
Ask
Submit
Answer questions
Worry about mistakes being found
Worry about data being misinterpreted
Worry about being scooped
Forgo money and IP and prestige???
11. ... on initiatives, requests,
requirements, and tools
NIH data sharing plan requirement
Journal requirements
Databases
Data sharing grids like BIRN and caBIG
Standards
Editorials, letters to the editor, discussion....
16. Long-term motivation:
I believe that analysis of the impact,
prevalence, and patterns with which
investigators share and withhold gene
expression microarray research data
can uncover rewards, best practices, and
opportunities for increased adoption of data
sharing.
19. Prevalence of data sharing
via manual audit
DNA sequences
gene expression microarrays
proteomics spectra
[Bar chart of the three data types above; axis 0% to 100%]
Noor et al. PLoS Biology 2006.
Ochsner et al. Nature Methods 2008.
Piwowar et al. PLoS ONE 2007.
Editorial. Nature Biotech 2007.
20. Prevalence of data withholding
via surveys
self-reported denying a request in last 3 years
trainees self-reported denying a request
been denied access to data, materials, code
authors “not able to retrieve raw data”
not willing to release data
[Bar chart of the five survey findings above; axis 0% to 40%]
Campbell et al. JAMA. 2002.
Kyzas et al. J Natl Cancer Inst. 2005.
Vogeli et al. Acad Med. 2006.
Reidpath et al. Bioethics 2001.
21. Self‐reported reasons for data
withholding
sharing is too much effort
want student or jr faculty to publish more
they themselves want to publish more
cost
industrial sponsor
confidentiality
commercial value of results
[Bar chart of the reasons above; axis 0% to 80%]
Campbell et al. JAMA 2002.
22. Correlates with self‐reported data
withholding
industry involvement
perceived competitiveness of field
male
sharing discouraged in training
human participants
academic productivity
[Bar chart of the correlates above; axis 0 to 3]
Blumenthal et al. Acad Med. 2006
30. Limitations of the related research
• manual audits: small sample sizes
• surveys: few variables + self-reporting bias
• not much focus on measuring demonstrated behavior
• not much focus on impact or policy
• not much focus on biomedical data other than
DNA sequences
31. Needed:
a study of data sharing behavior and impact
that includes
• a measurement of demonstrated behavior
• policy variables
• estimate of rewards
• a broad and deep selection of data creation instances
• a focus on biomedical data other than DNA sequences
50. Look for wetlab methods in full text:
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1522022&tool=pmcentrez
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1590031&tool=pmcentrez
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1482311&tool=pmcentrez#id331936
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2082469&tool=pmcentrez
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=126870&tool=pmcentrez#id442745
57. Aim 2a: Identify studies that create
gene expression microarray data
Development approach?
• Pattern building via manual inspection
• Classification decision trees with n‐grams
• Borrow approaches from
    • Autoslog‐TS
    • automated regular expression building
    • semi‐supervised learning
    • retrieval query aspects
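To make the first two options concrete, here is a minimal sketch, assuming a handful of hand-written regular expressions and a simple word n-gram extractor. The patterns, the `min_hits` threshold, and the function names are illustrative assumptions, not the patterns or features developed in this project.

```python
import re

# Hypothetical, hand-written patterns of the kind manual inspection might yield.
MICROARRAY_PATTERNS = [
    re.compile(r"\bmicroarrays?\b", re.IGNORECASE),
    re.compile(r"\bgene expression profil(?:e|es|ing)\b", re.IGNORECASE),
    re.compile(r"\bhybridi[sz](?:ed|ation)\b", re.IGNORECASE),
    re.compile(r"\bAffymetrix\b", re.IGNORECASE),
]


def count_pattern_hits(text):
    """Return how many distinct patterns match somewhere in the text."""
    return sum(1 for pattern in MICROARRAY_PATTERNS if pattern.search(text))


def looks_like_microarray_study(text, min_hits=2):
    """Flag a passage when at least `min_hits` patterns match (assumed threshold)."""
    return count_pattern_hits(text) >= min_hits


def word_ngrams(text, n=2):
    """Lowercased word n-grams, usable as features for a decision tree."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


methods_text = "Total RNA was hybridized to Affymetrix microarrays."
print(looks_like_microarray_study(methods_text))
print(word_ngrams("Gene expression microarray data"))
```

Requiring more than one pattern to match is one simple way to trade recall for precision; a learned classifier over the n-gram features would replace the fixed threshold.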
74. Aim 3a: Prevalence of data sharing
PubMed ID   Portal   Created data?   Shared data?
234         PMC      Yes             Yes
345         HighPr   Yes             Yes
456         Scirus   Yes             Yes
567         PMC      Yes             NO
678         PMC      Yes             NO
Prevalence = (Number with Shared data) / (Number with Created data)
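As a worked example of the prevalence ratio, the sketch below applies it to the five illustrative rows on this slide. The tuple layout and the function name `prevalence` are assumptions made for the example.

```python
from fractions import Fraction

# Illustrative rows from the slide: (PubMed ID, portal, created data?, shared data?)
rows = [
    (234, "PMC", True, True),
    (345, "HighPr", True, True),
    (456, "Scirus", True, True),
    (567, "PMC", True, False),
    (678, "PMC", True, False),
]


def prevalence(rows):
    """Share of data-creating studies that also shared their data."""
    created = [row for row in rows if row[2]]
    if not created:
        raise ValueError("no studies created data")
    shared = [row for row in created if row[3]]
    return Fraction(len(shared), len(created))


p = prevalence(rows)
print(f"{p} = {float(p):.0%}")  # 3/5 = 60%
```

Only studies that created data enter the denominator, so a study that neither created nor shared data does not lower the estimate.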
76. Aim 3b: Correlates with data sharing
Covariates
PubMed ID   Portal   Created data?   Shared data?
234         PMC      Yes             Yes
345         HighPr   Yes             Yes
456         Scirus   Yes             Yes
567         PMC      Yes             NO
678         PMC      Yes             NO
77. Aim 3b: Correlates with data sharing
Features to include:
• Does the journal have a data sharing policy?
• Is the study funded by the NIH?
• Number of authors
• Research-orientation of the primary
institution
• Journal impact factor
• Are the samples from humans?
• Disease of study
• Year of publication
• …
78. Aim 3b: Correlates with data sharing
Covariates
PubMed ID   Portal   Created data?   Shared data?   Journal policy   NIH funds?   # authors   ...
234         PMC      Yes             Yes            strong           yes          2
345         HighPr   Yes             Yes            weak             yes          5
456         Scirus   Yes             Yes            weak             no           6
567         PMC      Yes             NO             strong           yes          5
678         PMC      Yes             NO             strong           no           2
79. Aim 3b: Correlates with data sharing
Covariates
PubMed ID   Portal   Created data?   Shared data?   Journal policy   NIH funds?   # authors   ...
234         PMC      Yes             Yes            strong           yes          2
345         HighPr   Yes             Yes            weak             yes          5
456         Scirus   Yes             Yes            weak             no           6
567         PMC      Yes             NO             strong           yes          5
678         PMC      Yes             NO             strong           no           2
[Diagram: Journal policy?, NIH funded?, # authors, ... as covariates of Shared data?]
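A first descriptive look at such a table could tabulate the sharing rate within each level of a covariate. The sketch below does this for the five illustrative rows on the slide; the dictionary keys and the function name `sharing_rate_by` are assumptions, and a tabulation like this is no substitute for a multivariate analysis on a real cohort.

```python
from collections import defaultdict

# Illustrative rows from the slide's table (all five studies created data).
studies = [
    {"pmid": 234, "shared": True, "journal_policy": "strong", "nih_funded": True, "n_authors": 2},
    {"pmid": 345, "shared": True, "journal_policy": "weak", "nih_funded": True, "n_authors": 5},
    {"pmid": 456, "shared": True, "journal_policy": "weak", "nih_funded": False, "n_authors": 6},
    {"pmid": 567, "shared": False, "journal_policy": "strong", "nih_funded": True, "n_authors": 5},
    {"pmid": 678, "shared": False, "journal_policy": "strong", "nih_funded": False, "n_authors": 2},
]


def sharing_rate_by(studies, covariate):
    """Return {covariate level: fraction of studies at that level that shared data}."""
    counts = defaultdict(lambda: [0, 0])  # level -> [number shared, number total]
    for study in studies:
        level = study[covariate]
        counts[level][1] += 1
        if study["shared"]:
            counts[level][0] += 1
    return {level: shared / total for level, (shared, total) in counts.items()}


print(sharing_rate_by(studies, "journal_policy"))
print(sharing_rate_by(studies, "nih_funded"))
```

With only five toy rows the rates carry no meaning; the point is the shape of the computation, one row per study and one column per covariate.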
85. Assumptions
That the following limitations are randomly distributed:
• Ambiguous author names
• The method of describing data generation
• Studies with data in GEO but no submission links
• Studies that don’t mention sharing in the full-text article
The first and last authors are usually primary decision-makers about whether to share data
Citations are a valued, though imperfect, measure of research impact
86. Limitations
Association does not imply causation
Only one datatype: microarray data.
Only considering sharing in the primary
centralized databases.
Many variables are USA-centric.
Results will only be generalizable to research
studies made available in full-text portals.
87. Risks and contingency plans
NLP performance may be inadequate
    supplement with manual annotation via Mechanical Turk
Author ambiguity may introduce extreme outliers
    use Author-ity software on extreme outliers
Unable to derive a robust exploratory factor model
    try other clustering techniques
Several variables may be unexpectedly difficult to extract
    if not essential, defer the analysis of that variable to future work
88. Contributions
• an assessment of the observed and measured
rewards, prevalence, and patterns of gene
expression microarray dataset sharing
• a publicly available dataset associating microarray
study publications with data sharing status
• a generalizable approach for developing practical,
real-world information retrieval using
centralized full-text query portals
• preliminary models of data sharing behavior
89. Publication plan
http://www.flickr.com/photos/linkwize/926334421/
90. Publication plan: Aim 1
Do studies with publicly shared datasets receive
more citations?
Published in PLoS ONE in February 2007
91. Publication plan: Aim 2a
How can we identify studies that generate
certain data, given full-text query access
through centralized portals?
Targeted journal:
Journal of Medical Internet Research?
BMC Bioinformatics?
other?
92. Publication plan: Aim 2b, 3a, 3b
What factors are associated with demonstrated
data sharing behavior?
Targeted journal:
BMC Bioinformatics?
BMC Biology?
PLoS Biology?
a research policy journal?
other?
93. Publication plan: Aim 3c
Derive (and validate?) a preliminary model of
demonstrated research data sharing behavior
Targeted journal:
JASIST?
(Journal of the American Society for Information
Science and Technology)
Information Research?
Journal of Documentation?
Science Communication?
Data Science Journal?
other?
94. Future work
1. Identify and model data reuse
2. Citation analysis of the large cohort
3. Supplement with survey responses
4. Generalize the method for creating
queries for full-text portals
http://www.flickr.com/photos/cogdog/123072/
95. Data sharing plan
I plan to share my code, data, and process openly
during the research via blogs and repositories.
http://www.flickr.com/photos/myklroventine/892446624/
96. Thanks to
the Dept of Biomedical Informatics at the U of Pittsburgh,
the NLM for funding through training grant 5 T15 LM007059-22,
those with photos on Flickr under a Creative Commons license,
Wendy for her support and feedback, and my committee for
anticipated feedback....
Questions and Suggestions?
102. Audience
• Funders, policy makers and thought leaders.
• Database, software, and data standard
developers.
• Biomedical informatics community.
• Information science and digital library
community.
• Open Science community.
• Primary Investigators.
104. Recent related grants
NIH: Haga, S.
Exploring Attitudes About Data Disclosure and Data-Sharing
in Genomics Research.
NSF: Hedstrom, M.
Incentives for Data Producers to Create Archive-Ready
Data Sets.
National Inst of Nursing Research: Pienta, A.
Barriers and Opportunities for Sharing Research Data.
+others