Building an NIH Data Catalog:
Bit by Bit
Kevin Read
NLM Associate Fellowship Presentation
July 24, 2013
1
NIH Big Data to Knowledge
Facilitating Broad Use of Biomedical Big Data
2
NIH Data Catalog
What is it designed to do?
3
NIH Data Catalog
Data sets are CITABLE
Data sets are DISCOVERABLE
Data sets are LINKED TO THE LITERATURE
Data sets are PART OF THE RESEARCH ECOSYSTEM
4
NIH Data Catalog
What do we need to know in order to build it?
Minimal Metadata Elements
How do current data repositories describe their data?
Orphaned Data sets
How many data sets are not currently represented in a data repository?
5
Finding Common Metadata
Elements
Exploring how NIH Data Repositories describe their data
6
7
Categorizing Metadata
Descriptors
Common Metadata Elements
Authorship
Data
Description
Title
Information
8
Identifying Metadata Variations
Date: Study Date; Date Processed; Release Date; Completion Date; Last Updated Date; Prepared on Date
Authorship: Authors; Creators; Data Provider; Principal Investigator(s); Contributors; Data Authors
9
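The variant labels on this slide can be folded into the two common elements with a simple lookup. A minimal sketch, assuming a hand-built mapping; the names `VARIANT_TO_COMMON` and `normalize_label` are illustrative and do not come from the presentation.

```python
from typing import Optional

# Hypothetical lookup: repository-specific labels (as listed on the slide)
# mapped to the common metadata element they are variations of.
VARIANT_TO_COMMON = {
    "study date": "Date",
    "date processed": "Date",
    "release date": "Date",
    "completion date": "Date",
    "last updated date": "Date",
    "prepared on date": "Date",
    "authors": "Authorship",
    "creators": "Authorship",
    "data provider": "Authorship",
    "principal investigator(s)": "Authorship",
    "contributors": "Authorship",
    "data authors": "Authorship",
}


def normalize_label(label: str) -> Optional[str]:
    """Return the common element for a repository label, or None if unknown."""
    # Collapse runs of whitespace and ignore case before the lookup.
    key = " ".join(label.split()).lower()
    return VARIANT_TO_COMMON.get(key)


print(normalize_label("Principal Investigator(s)"))
print(normalize_label("  Release   Date "))
print(normalize_label("Funding"))
```

Labels that are not in the table return `None`, so unmapped repository fields surface for manual review rather than being silently assigned.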
Mapping Metadata
Commonalities to Existing
Standards
Common
Metadata
Elements
Common
Metadata
Elements
10
Mapping Metadata to MEDLINE
Common Metadata Element: Proposed Definition
Data Unique Identifier: A unique ID string that identifies a data set within the catalog
Author: Individuals involved in producing or contributing to data
Affiliation: Affiliation of each author, associated with the appropriate author occurrence
Data Title: Name or title by which the data set is known
Data Location: The name of the entity that holds, archives, publishes, distributes, releases, issues, or produces the data, with its associated accession number
Date: The year, month, and day when the data was made available
Data Description (structured narrative): Structured narrative description for efficient indexing
Data Descriptors: Metadata describing data contents using controlled labels (e.g., Organism, Disease, Perturbation, Gender, Cell type)
PMID: Identifier that will link the data set to associated article(s) and be provided for the data catalog entry
Availability/Accessibility of Data: Indication of whether the data is available to use and how to access it
Award Number: Grant/award numbers associated with the data set
Related Data: Data that was used in the creation of the new data set
11
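The proposed elements could be carried as a single record type. A minimal sketch, assuming Python dataclasses; the class name, field names, types, and the `missing_elements` helper are all illustrative and not part of the proposal. The example values are taken from the mock citation on the next slide.

```python
from dataclasses import dataclass, field, fields
from typing import Dict, List, Optional


@dataclass
class CatalogRecord:
    """One data catalog entry. Field names are hypothetical."""
    data_unique_identifier: str
    authors: List[str]
    data_title: str
    data_location: str  # holding entity plus its accession number
    affiliations: Dict[str, str] = field(default_factory=dict)  # author -> affiliation
    date: Optional[str] = None  # when the data was made available
    data_description: str = ""  # structured narrative
    data_descriptors: Dict[str, str] = field(default_factory=dict)  # e.g. Organism
    pmids: List[str] = field(default_factory=list)
    availability: str = ""
    award_numbers: List[str] = field(default_factory=list)
    related_data: List[str] = field(default_factory=list)


def missing_elements(record: CatalogRecord) -> List[str]:
    """Return the names of elements that are still empty on a record."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


example = CatalogRecord(
    data_unique_identifier="DUID00001",
    authors=["Marazita ML", "Weynat RJ", "Feingold E",
             "Weeks D", "Crout R", "McNeill D"],
    data_title=("Dental Caries: Whole Genome Association "
                "and Gene x Environment Studies"),
    data_location="dbGaP/pht002543.v2.p1",
    pmids=["22123456"],
)
print(missing_elements(example))
```

Listing the empty elements gives a quick check of how much of the minimal metadata set a given repository record actually supplies.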
Data Catalog Citation
Marazita ML, Weynat RJ, Feingold E, Weeks D, Crout R, McNeill D. Dental Caries: Whole Genome Association and Gene x Environment Studies. NIH Data Catalog. 2014 Jan;1(1):DUID00001. PubMed PMID: 22123456. SI: dbGaP/pht002543.v2.p1
Author: Marazita ML, Weynat RJ, Feingold E, Weeks D, Crout R, McNeill D
Data Title: Dental Caries: Whole Genome Association and Gene x Environment Studies
Data Description Location: NIH Data Catalog
Date of NIH Data Catalog issue: 2014 Jan
NIH Data Catalog Volume (Issue): 1(1)
Data Unique Identifier: DUID00001
PMID Assigned to NIH Data Catalog Record: 22123456
Secondary source ID (Link to actual dataset): dbGaP/pht002543.v2.p1
12
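The citation pattern on this slide can be assembled from its labeled parts. A minimal sketch, assuming the punctuation of the mock citation is the intended format; `format_citation` and its parameter names are illustrative.

```python
from typing import List


def format_citation(authors: List[str], title: str, year: int, month: str,
                    volume: int, issue: int, duid: str, pmid: str,
                    repository: str, accession: str) -> str:
    """Build a MEDLINE-style data catalog citation (hypothetical helper)."""
    return (
        f"{', '.join(authors)}. {title}. NIH Data Catalog. "
        f"{year} {month};{volume}({issue}):{duid}. "
        f"PubMed PMID: {pmid}. SI: {repository}/{accession}"
    )


citation = format_citation(
    authors=["Marazita ML", "Weynat RJ", "Feingold E",
             "Weeks D", "Crout R", "McNeill D"],
    title=("Dental Caries: Whole Genome Association "
           "and Gene x Environment Studies"),
    year=2014, month="Jan", volume=1, issue=1,
    duid="DUID00001", pmid="22123456",
    repository="dbGaP", accession="pht002543.v2.p1",
)
print(citation)
```

Keeping the secondary source ID as a separate repository and accession pair means the link to the actual data set can be rebuilt without parsing the citation string.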
Searching for NIH-funded
‘Orphaned’ data sets in PubMed
and PubMed Central
13
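NIH-funded articles for 2011: 113,089
After excluding Non-PMC Articles: 88,592
After excluding Non-research Articles: 78,901
After excluding Molecular Sequence Data MH: 75,441
After excluding SI Field: 71,913
After excluding PMC Acknowledgements: 71,680
After excluding XML: 69,857
Remaining articles with orphaned data sets: 69,857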
14
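Reading the counts on this slide as a sequence of exclusion steps makes it possible to see how many articles each filter removed. A minimal sketch; pairing each count with a step is an assumption drawn from the slide layout, and the names `FUNNEL` and `exclusions` are illustrative.

```python
from typing import List, Tuple

START = 113_089  # NIH-funded articles for 2011

# (exclusion step, articles remaining after the step). The pairing of
# counts to steps is an assumption based on the slide layout.
FUNNEL: List[Tuple[str, int]] = [
    ("Non-PMC Articles", 88_592),
    ("Non-research Articles", 78_901),
    ("Molecular Sequence Data MH", 75_441),
    ("SI Field", 71_913),
    ("PMC Acknowledgements", 71_680),
    ("XML", 69_857),
]


def exclusions(start: int, funnel: List[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Return (step, number of articles removed by that step)."""
    removed = []
    previous = start
    for step, remaining in funnel:
        removed.append((step, previous - remaining))
        previous = remaining
    return removed


for step, count in exclusions(START, FUNNEL):
    print(f"{step}: {count:,} removed")
```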
SI Field Exclusions
[Bar chart: Excluded Articles]
15
PMC Acknowledgement Exclusions
[Bar chart: Excluded keywords]
16
XML Keyword Exclusions
[Bar chart: Excluded keywords]
Keyword groups: FlyBase, GeneNetwork, Mouse Genome Informatics, Neuroscience Information Framework, Rat Genome Database, WormBase, Zebrafish Model Organism Database; GenBank, PDB
17
Total # of articles collected for 2011 after exclusion: 69,657
Random sample with 95% confidence interval: 383
18
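The sample size on this slide is consistent with the standard formula for estimating a proportion, with a finite population correction. A minimal sketch, assuming a 5% margin of error and a worst-case proportion of 0.5, neither of which is stated on the slide; `sample_size` is an illustrative name.

```python
import math
from statistics import NormalDist


def sample_size(population: int, confidence: float = 0.95,
                margin: float = 0.05, proportion: float = 0.5) -> int:
    """Sample size for a proportion, with finite population correction."""
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size for an effectively infinite population.
    n0 = z ** 2 * proportion * (1 - proportion) / margin ** 2
    # Shrink it for the finite population being sampled.
    return math.ceil(n0 / (1 + (n0 - 1) / population))


print(sample_size(69_657))  # 383 under the assumptions above
```

Under these assumptions the uncorrected size is about 384, and the correction for a population of 69,657 brings it to 383.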
383
What category of data
set was used for the
research described in
the article?
Were live human or
animal subjects used
in the collection of the
data?
What were the
subject(s) of study
(from which or
whom the data was
collected)?
If new data set(s) were
created, what type(s)
of data were
collected?
What existing data
set(s) were used? If
any?
How many data sets
are there in each
article?
19
Measuring blood
pressure in mice
Measuring left
hemisphere of brain
for growth factor
Staining and imaging
Analysis of images
using software
20
Preliminary Results
‘Orphaned’ Data
50 articles
21
Average number of data
sets per article:
5.84
22
% of data sets that use live
subjects
51%
Human
60%
Animal
40%
23
% of data sets that
were considered to
be new
74%
% of data sets that
used existing data
with mods or added
value
12%
% of data sets that
used existing data
as is
13%
% with no data
1%
24
% of articles
that collected
only new data:
56%
% of articles that
used only
existing data:
32%
% of articles
that used a
combination of
data:
8%
% of articles that
used no data:
4%
25
Data Types
26
Building an NIH Data Catalog
Questions to Consider
27
What do we consider to be
a data set?
All of the data created within a paper?
Multiple data sets of different data
types within a paper?
Every individual collection of data
within a paper?
28
Where in the
collection/processing pipeline
should data be described?
29
Is there a convenient way to
point to data sets within an
article?
Abstract?
Labeled area?
Reference list?
30
How do we adequately
describe data sets so that they
are discoverable?
Develop a strategy to create appropriate
data descriptors
31
How do we adequately describe data
sets so that they are discoverable?
Is there a convenient way to point to
data sets within an article?
Where in the data collection/processing
pipeline should data be described?
What do we consider to be a data set?
32
Acknowledgements
Project Sponsors
Jerry Sheehan & Mike Huerta
Special Thanks
Lou Knecht & Jim Mork
Annotators
Preeti Kochar, Helen Ochej, Susan Schmidt, Melissa Yorks, Shari Mohary, Olga
Printseva, Janice Ward, Oleg Rodionov, Sally Davidson, Jennie Larkin, Peter
Lyster, Matt McAuliffe, Greg Farber, Betsy Humphreys, Jerry Sheehan, Mike
Huerta, Lou Knecht, Suzy Roy, Swapna Abhyankar, Olivier Bodenreider, Karen
Gutzman, Dina Demner Fusman, Laritza Rodriguez, Sonya Shooshan, Samantha
Tate, Matthew Simpson, Tracy Edinger, Olubumi Akiwumi, Mary Ann Hantakas, Corinn
Sinnott
Support
Kathel Dunn & David Gillikin
Library Operations
Joyce Backus & Dianne Babski
NLM Leadership
Donald Lindberg & Betsy Humphreys
All images are CC
33

Building an NIH Data Catalog: Bit by Bit

  • 1. Building an NIH Data Catalog: Bit by Bit Kevin Read NLM Associate Fellowship Presentation July 24, 2013 1
  • 2. NIH Big Data to Knowledge Facilitating Broad Use of Biomedical Big Data 2
  • 3. NIH Data Catalog What is it designed to do? 3
  • 4. NIH Data Catalog Data sets are CITABLE Data sets are DISCOVERABLE Data sets are LINKED TO THE LITERATURE Data sets are PART OF THE RESEARCH ECOSYSTEM 4
  • 5. NIH Data Catalog What do we need to know in order to build it? Minimal Metadata Elements How do current data repositories describe their data? Orphaned Data sets How many data sets are not currently represented in a data repository? 5
  • 6. Finding Common Metadata Elements Exploring how NIH Data Repositories describe their data 6
  • 7. 7
  • 8. Categorizing Metadata Descriptors Common Metadata Elements Authorship Data Description Title Information 8
  • 9. Identifying Metadata Variations Date Study Date Date Processed Release Date Completion Date Last Updated Date Prepared on Date Authorship Authors Creators Data Provider Principal Investiga tor(s) Contribu tors Data Authors 9
  • 10. Mapping Metadata Commonalities to Existing Standards Common Metadata Elements Common Metadata Elements 10
  • 11. Mapping Metadata to MEDLINE Common Metadata Elements Proposed Definition Data Unique Identifier A unique ID string that identifies a data set within the catalog Author Individuals involved in producing or contributing to data Affiliation Affiliation of each author associated with the appropriate author occurrence Data Title Name or title by which the data set is known Data Location The name of the entity that holds, archives, publishes, distributes, releases, issues, or produces the data w/ its associated accession number. Date The year, month and date when the data was made available Data Description (structured narrative) Structured narrative description for efficient indexing Data Descriptors Metadata describing data contents using controlled labels (e.g. Organism, Disease, Perturbation, Gender, Cell type) PMID Identifier that will link dataset to associated article(s) AND be provided for the data catalog entry Availability/Accessibility of Data Indication of whether the data is available to use and how to access it Award Number Grant/award numbers associated with the data set Related Data Data that was used in the creation of the new data set 11
  • 12. Data Catalog Citation Marazita ML, Weynat RJ, Feingold E, Weeks D, Crout R, McNeill D. Dental Caries: Whole Genome Association and Gene x Environment Studies. NIH Data Catalog. 2014 Jan;1(1):DUID00001. PubMed PMID: 22123456. SI: dbGaP/pht002543.v2.p1 Marazita ML, Weynat RJ, Feingold E, Weeks D, Crout R, McNeill D. Dental Caries: Whole Genome Association and Gene x Environment Studies. NIH Data Catalog. 2014 Jan;1(1):DUID00001. PubMed PMID: 22123456. SI: dbGaP/pht002543.v2.p1 Author Marazita ML, Weynat RJ, Feingold E, Weeks D, Crout R, McNeill D. Dental Caries: Whole Genome Association and Gene x Environment Studies. NIH Data Catalog. 2014 Jan;1(1):DUID00001. PubMed PMID: 22123456. SI: dbGaP/pht002543.v2.p1 Data Title Marazita ML, Weynat RJ, Feingold E, Weeks D, Crout R, McNeill D. Dental Caries: Whole Genome Association and Gene x Environment Studies. NIH Data Catalog. 2014 Jan;1(1):DUID00001. PubMed PMID: 22123456. SI: dbGaP/pht002543.v2.p1 Data Description Location Marazita ML, Weynat RJ, Feingold E, Weeks D, Crout R, McNeill D. Dental Caries: Whole Genome Association and Gene x Environment Studies. NIH Data Catalog. 2014 Jan;1(1):DUID00001. PubMed PMID: 22123456. SI: dbGaP/pht002543.v2.p1 Date of NIH Data Catalog issue Marazita ML, Weynat RJ, Feingold E, Weeks D, Crout R, McNeill D. Dental Caries: Whole Genome Association and Gene x Environment Studies. NIH Data Catalog. 2014 Jan;1(1):DUID00001. PubMed PMID: 22123456. SI: dbGaP/pht002543.v2.p1 NIH Data Catalog Volume (Issue) Marazita ML, Weynat RJ, Feingold E, Weeks D, Crout R, McNeill D. Dental Caries: Whole Genome Association and Gene x Environment Studies. NIH Data Catalog. 2014 Jan;1(1):DUID00001. PubMed PMID: 22123456. SI: dbGaP/pht002543.v2.p1 Data Unique Identifier Marazita ML, Weynat RJ, Feingold E, Weeks D, Crout R, McNeill D. Dental Caries: Whole Genome Association and Gene x Environment Studies. NIH Data Catalog. 2014 Jan;1(1):DUID00001. PubMed PMID: 22123456. 
SI: dbGaP/pht002543.v2.p1 PMID Assigned to NIH Data Catalog Record Secondary source ID (Link to actual dataset) Marazita ML, Weynat RJ, Feingold E, Weeks D, Crout R, McNeill D. Dental Caries: Whole Genome Association and Gene x Environment Studies. NIH Data Catalog. 2014 Jan;1(1):DUID00001. PubMed PMID: 22123456. SI: dbGaP/pht002543.v2.p1 12
  • 13. Searching for NIH-funded ‘Orphaned’ data sets in PubMed and PubMed Central 13
  • 14. 113,089 75,441 Remaining articles with orphaned data sets NIH-funded articles for 2011: 88,592 78,901 Non-PMC Articles Non-research Articles Molecular Sequence Data MH 71,913SI Field 71,680PMC Acknowledgements 69,857XML 14
  • 17. XML Keyword Exclusions 0 100 200 300 400 500 600 Excluded keywords FlyBase:GeneNetwork:Mouse Genome Informatics:Neuroscience Information Framework:Rat Genome Database:WormBase:Zebrafish Model Organism Database GenBank:PDB 17
  • 18. Total # of articles collected for 2011 after exclusion: 69,657 Random sample with 95% confid. interval: 383 18
  • 19. 383 What category of data set was used for the research described in the article? Were live human or animal subjects used in the collection of the data? What were the subject(s) of study (from which or whom the data was collected)? If new data set(s) were created, what type(s) of data were collected? What existing data set(s) were used? If any? How many data sets are there in each article? 19
  • 20. Measuring blood pressure in mice Measuring left hemisphere of brain for growth factor Staining and imaging Analysis of images using software 20
  • 22. Average number of data sets per article: 5.84 22
  • 23. % of data sets that use live subjects: 51%; Human: 60%; Animal: 40%. 23
  • 24. % of data sets that were considered to be new: 74%; % of data sets that used existing data with mods or added value: 12%; % of data sets that used existing data as is: 13%; % with no data: 1%. 24
  • 25. % of articles that collected only new data: 56%; % of articles that used only existing data: 32%; % of articles that used a combination of data: 8%; % of articles that used no data: 4%. 25
  • 27. Building an NIH Data Catalog Questions to Consider 27
  • 28. What do we consider to be a data set? All of the data created within a paper? Multiple data sets of different data types within a paper? Every individual collection of data within a paper? 28
  • 29. Where in the collection/processing pipeline should data be described? 29
  • 30. Is there a convenient way to point to data sets within an article? Abstract? Labeled area? Reference list? 30
  • 31. How do we adequately describe data sets so that they are discoverable? Develop a strategy to create appropriate data descriptors 31
  • 32. How do we adequately describe data sets so that they are discoverable? Is there a convenient way to point to data sets within an article? Where in the data collection/processing pipeline should data be described? What do we consider to be a data set? 32
  • 33. Acknowledgements. Project Sponsors: Jerry Sheehan & Mike Huerta. Special Thanks: Lou Knecht & Jim Mork. Annotators: Preeti Kochar, Helen Ochej, Susan Schmidt, Melissa Yorks, Shari Mohary, Olga Printseva, Janice Ward, Oleg Rodionov, Sally Davidson, Jennie Larkin, Peter Lyster, Matt McAuliffe, Greg Farber, Betsy Humphreys, Jerry Sheehan, Mike Huerta, Lou Knecht, Suzy Roy, Swapna Abhyankar, Olivier Bodenreider, Karen Gutzman, Dina Demner Fusman, Laritza Rodriguez, Sonya Shooshan, Samantha Tate, Matthew Simpson, Tracy Edinger, Olubumi Akiwumi, Mary Ann Hantakas, Corinn Sinnott. Support: Kathel Dunn & David Gillikin. Library Operations: Joyce Backus & Dianne Babski. NLM Leadership: Donald Lindberg & Betsy Humphreys. All images are CC. 33
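
The exclusion funnel on slide 14 and the sample size on slide 18 can be checked with a few lines of arithmetic. The sketch below is illustrative only: the pairing of each count with an exclusion step is inferred from the descending order of the numbers on the slide, and the sample-size calculation assumes Cochran's formula with a finite population correction, a 5% margin of error, and maximum variability (p = 0.5). The slides state only the 95% confidence figure, so those parameters and the function names are assumptions, not the presenter's documented method.

```python
import math

# Article counts remaining after each exclusion step (slide 14).
# ASSUMPTION: the step-to-count pairing is inferred from the slide's
# descending order; the slide itself is a flattened graphic.
FUNNEL = [
    ("NIH-funded articles for 2011", 113_089),
    ("Non-PMC articles", 88_592),
    ("Non-research articles", 78_901),
    ("Molecular Sequence Data MH", 75_441),
    ("SI field", 71_913),
    ("PMC acknowledgements", 71_680),
    ("XML keywords", 69_857),
]


def step_drops(funnel):
    """Return (exclusion label, number of articles removed) per step."""
    return [
        (label, previous - count)
        for (_, previous), (label, count) in zip(funnel, funnel[1:])
    ]


def sample_size(population, z=1.96, margin=0.05, p=0.5):
    """Cochran's sample size with finite population correction.

    ASSUMPTION: z=1.96 (95% confidence), a 5% margin of error and
    p=0.5 are conventional defaults, not values given on the slides.
    """
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    return math.ceil(n0 / (1 + (n0 - 1) / population))


if __name__ == "__main__":
    for label, removed in step_drops(FUNNEL):
        print(f"{label}: {removed:,} articles removed")
    # Population figure as stated on slide 18.
    print(sample_size(69_657))  # 383
```

Under these assumptions the calculation reproduces the 383 articles reported on slide 18. Slide 14 ends at 69,857 while slide 18 states 69,657; both populations give the same sample size, so the difference does not affect the result.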

Editor's Notes

  1. Build a Catalog of Research Datasets to Facilitate Data Discovery. 125 members across all the ICs are currently working on this, and they are holding a variety of workshops in August and the fall. One initiative is to create an NIH Data Catalog, and this is where I was able to contribute.
  2. So I’ve been talking about an NIH Data Catalog, but it is important that I describe what a data catalog is designed to do.
  3. Data sets are discoverable, in that they can be searched for and retrieved for use or analysis. Data sets are citable, in that when other researchers use a data set they can cite it so that its authors receive credit for their work. Data sets are linked to the literature, in that data sets created in an article would be automatically linked to that article (or articles). Data sets are part of the research ecosystem, in that when researchers apply for NIH grants, data sets would have an impact on the decision-making process.
  4. Keeping those goals in mind, it then came time to figure out what we actually need to know in order to build this thing. For my project, it came down to looking at how current data repositories describe their data sets, to come up with potential metadata elements that could be used for the data catalog. The other part of my project focused on how we can deal with all of the data that is not currently represented in a data repository, and how to manage and describe it.
  5. First we will look at the first component of my project, and how I went about exploring how NIH data repositories describe their data.
  6. This project actually brought me back to my fall project, where I was responsible for curating and collecting a list of all the NIH data sharing repositories. Each of these repositories has a submission requirement where they require researchers to provide specific information about their data before they deposit it. I took this submission metadata from each repository, and attempted to extract commonalities from each.
  7. Once I started to extract the metadata commonalities from each repository, I developed broad categories that they fell into here. You can see on this slide that unique identifiers, data information, data description and authorship are all major categories that were represented in each.
  8. From there it is important to point out that there were a number of variations for each major category for which I’ve included two examples here. Notice in the date graphic on the left that date is represented in a number of different ways that mean entirely different things. These variations are just one of many examples that illustrate the complexities of describing data, and how there is currently no standard way of describing data effectively.
  9. Because there were so many variations in the commonalities I found, I felt it would be useful to map my common metadata elements to existing standards to gain a better sense of how data was being described at a broad level. In this case we chose to look at three: DataCite, a platform where researchers can register their data sets to receive a DOI; Dryad, an open data repository that links data to literature, is prominently used in the scientific community, and describes a wide range of data sets; and finally MEDLINE, in order to ascertain whether or not MEDLINE’s existing XML metadata schema could adequately describe a data set. We chose DataCite because it maintains a relatively up-to-date version of its metadata schema and also makes it open source, so the scientific community can provide feedback and make changes. Dryad was chosen because it is a very popular data repository that deals with a wide variety of data, and we felt that it would provide a good indication of a baseline set of metadata. For the purposes of this presentation I will not go into DataCite and Dryad in detail, and will instead show the findings from each platform in the context of MEDLINE.
  10. After looking at the metadata standards from both DataCite and Dryad, we felt that this set of metadata could be integrated into MEDLINE with a few modifications and additions. You’ll see the common metadata that you would expect in a basic description of an object: a unique identifier, author and title. But I would like to point out a few of the modifications and challenges we faced. Affiliation: in PubMed it is only available for the first author, but because data can go through changes or modifications in specific labs, we felt it was important to include an affiliation for each data author. Data descriptors: this is the biggest issue when it comes to describing data, and you’ll see the same theme emerge in the second half of my presentation. We felt that a data description should include a structured narrative, as you would see in a structured abstract, but also a list of data descriptors such as organism and disease to tag the data set. Neither DataCite nor Dryad included biomedical data descriptors, so this is an area that requires further study. PMID: it was felt that a record in the data catalog could also receive a PMID so that it could be searched within PubMed as well as the data catalog. Similarly, all PMIDs associated with a data set would be included in the background metadata so that it could be easily linked to the literature. Related data: finally, a related data field would indicate whether or not the data that was created used pre-existing data. For example, if a researcher used previously created questionnaire data to perform a new analysis, it would be important to account for that in the metadata record for the sake of provenance and transparency.
  11. Finally, because it was felt that a data catalog record could potentially look like a data publication, and we would want data to be citable, I have provided an example here breaking down each component of a citation.
  12. Keeping in mind some of the issues we faced with data description, and the lack of standards for developing metadata for biomedical data sets, I am now going to switch gears to talk about the second component of my project, which involved searching for data sets in PubMed and PMC that have not been deposited in a repository. This exercise is useful for discovering how much work would be required to describe all the data sets that are created in an article, and for figuring out whether there is a sufficient way to describe the different types of data that are created.
  13. This slide is meant to demonstrate the stages of exclusion taken to arrive at the final sample for analysis.
  14. It is important to mention here that while the search found a large number of name variations among the exclusions, once it was overlapped with PubMed’s search only 230 articles in total were excluded.
  15. XML keyword exclusions. There is an issue in this case: you’ll notice the bar on the right is titled “Multiple Keywords”. This is unfortunate, because whenever more than one of the repositories was mentioned it counted as a multiple rather than towards the repository itself. What is interesting about these multiples, however, is that a number of articles mentioned several repositories.
  16. We had 30 NLM staff and BD2K staff look at the 383 articles: 25 articles each, with two people looking at the same 25 for validation.
  17. This slide is designed to illustrate the different measurements and data collection that occurs within an article and really exemplifies the complexities of data that we are working with.
  18. If we go back to the issues we saw earlier with data types and data descriptors: Do we tackle the entire scope of biomedical data? Do we adopt existing standards for ways to describe data? Who better to answer these questions than the National Library of Medicine?
  19. Who better to answer these questions than the National Library of Medicine?
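
The metadata elements discussed in notes 10 and 11 can be pictured as a small record type that renders the citation shown on slide 12. This is a minimal sketch, not an NLM or MEDLINE schema: the class names, field names, and the `citation` method are hypothetical, chosen only to mirror the elements named in the presentation (a unique identifier, an affiliation for every data author, data descriptors, a PMID, related data, and a secondary source ID). Affiliations and descriptors are left empty in the example because the slides give no values for them.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DataAuthor:
    """One data author; the notes propose an affiliation for each author."""
    name: str
    affiliation: str = ""  # not given on the slides, so left blank here


@dataclass
class CatalogRecord:
    """Hypothetical data catalog record; field names are illustrative."""
    duid: str                      # Data Unique Identifier
    authors: list[DataAuthor]
    title: str                     # Data Title
    issue_date: str                # Date of NIH Data Catalog issue
    volume: int
    issue: int
    pmid: str                      # PMID assigned to the catalog record
    secondary_source_id: str       # link to the actual data set
    catalog_name: str = "NIH Data Catalog"
    descriptors: dict[str, str] = field(default_factory=dict)  # e.g. Organism
    related_data: list[str] = field(default_factory=list)      # reused data

    def citation(self) -> str:
        """Render the record in the citation layout shown on slide 12."""
        names = ", ".join(author.name for author in self.authors)
        return (
            f"{names}. {self.title}. {self.catalog_name}. "
            f"{self.issue_date};{self.volume}({self.issue}):{self.duid}. "
            f"PubMed PMID: {self.pmid}. SI: {self.secondary_source_id}"
        )


# Values copied from the example citation on slide 12.
record = CatalogRecord(
    duid="DUID00001",
    authors=[DataAuthor(n) for n in (
        "Marazita ML", "Weynat RJ", "Feingold E",
        "Weeks D", "Crout R", "McNeill D",
    )],
    title=("Dental Caries: Whole Genome Association and "
           "Gene x Environment Studies"),
    issue_date="2014 Jan",
    volume=1,
    issue=1,
    pmid="22123456",
    secondary_source_id="dbGaP/pht002543.v2.p1",
)

if __name__ == "__main__":
    print(record.citation())
```

Keeping the affiliation on each author rather than on the record as a whole reflects the point in note 10 that data can change hands between labs, so a single first-author affiliation would lose provenance.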