Data Landscapes: The Neuroscience Information Framework

Data Landscapes: The Neuroscience
Information Framework
neuinfo.org
Maryann E. Martone, Ph. D.
University of California, San
Diego

Organization
• Introduction
• The Neuroscience Information Framework
• A tour of NIF
• The NIF Framework
– Ontologies
– NIF Analytics: What can we learn from the data
space?
• Where do we go from here?
– Resource Identification Initiative
– Conclusions

• NIF is an initiative of the NIH Blueprint consortium of institutes
– What types of resources (data, tools, materials, services) are available to the
neuroscience community?
– How many are there?
– What domains do they cover? What domains do they not cover?
– Where are they?
• Web sites
• Databases
• Literature
• Supplementary material
• PDF files
• Desk drawers
– Who uses them?
– Who creates them?
– How can we find them?
– How can we make them better in the future?
http://neuinfo.org

Old Model: Single type of content;
single mode of distribution
Scholar
Library
Scholar
Publisher
FORCE11.org: Future of research communications and e-scholarship

Diagram labels: Scholar; Consumer; Libraries; Data Repositories; Code Repositories; Community databases/platforms; OA; Curators; Peer Reviewers; Social Networks
Content types: Narrative; Workflows; Data; Models; Multimedia; Code; Nanopublications

Solving the large problems of
science?
• Observation
• Experimentation
• Modeling
• Cooperative data
intensive science
“An unaided human’s ability to process
large data sets is comparable to a dog’s
ability to do arithmetic, and not much more
valuable.” –Michael Nielsen, Reinventing
Discovery, 2012.

NIF: A New Type of Entity for New Modes
of Scientific Dissemination
• NIF’s mission is to maximize the awareness of, access to and
utility of digital resources produced worldwide to enable better
science and promote efficient use
– NIF unites neuroscience information without respect to domain, funding
agency, institute or community
– NIF is a library for scholarly output that is a web enabled resource and
not a paper
– Aggregates all the different databases, tools and resources now
produced by the scientific community
– Makes them searchable from a single interface
– A practical approach to the data deluge
– Educate neuroscientists and students about effective data sharing

Surveying the resource landscape
NIF resource registry: listing of > 12000 databases, tools,
materials, services, websites (> 2500 databases)

NIF data federation: PubMed Central for data
200 sources
> 800 M records
NIF was designed to accommodate the multiplicity of heterogeneous and distributed data
resources, providing deep query of the contents and unified views

Registry vs Federation: Metadata about resource vs
metadata/data in database
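A minimal sketch of this distinction, under stated assumptions: the class names, field names, and the example resource below are hypothetical and are not NIF's actual schema. The registry holds a description of each resource; the federation holds records drawn from inside the resource.

```python
from dataclasses import dataclass, field


@dataclass
class RegistryEntry:
    """Metadata ABOUT a resource (hypothetical fields, not NIF's schema)."""
    name: str
    resource_type: str          # e.g. "database", "tool", "material", "service"
    url: str
    keywords: list = field(default_factory=list)


@dataclass
class FederatedRecord:
    """A record FROM INSIDE a database, tagged with the source it came from."""
    source: str
    values: dict                # the row itself


def search_registry(entries, term):
    """Registry search: matches only the descriptive metadata."""
    term = term.lower()
    return [e for e in entries
            if term in e.name.lower()
            or any(term in k.lower() for k in e.keywords)]


def search_federation(records, term):
    """Federation search: matches the contents of the records themselves."""
    term = term.lower()
    return [r for r in records
            if any(term in str(v).lower() for v in r.values.values())]


# Made-up example resource and record.
registry = [RegistryEntry("Example Gene Expression DB", "database",
                          "http://example.org/gx", ["gene expression"])]
federation = [FederatedRecord("Example Gene Expression DB",
                              {"gene": "GRM1", "region": "striatum"})]

registry_hits = search_registry(registry, "GRM1")        # description never mentions GRM1
federation_hits = search_federation(federation, "GRM1")  # the record itself does
```

In this toy case the registry search returns nothing for GRM1 while the federation search finds the record, which is the gap that descriptive metadata alone leaves open.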

What resources are available for Addiction and GRM1?
With the thousands of databases and other information sources
available, simple descriptive metadata will not suffice

How do resources get added to the
NIF?
•NIF curators
•Nomination by the
community
•Semi-automated text
mining pipelines
NIF Registry
Requires no special
skills
Site map available
for local hosting
•NIF Data Federation
•DISCO interop
•Requires some
programming skill
•Open Source Brain <
2 hr
Low barrier to entry; incremental refinement

What about my data?
•Best practice:
•Put it in a repository
•What repository?
•Community
repository for your
data type, e.g., GEO
•General repository:
•Dryad
•FigShare
•Institutional repository
•Research libraries are
setting up repositories
to manage their
“digital assets”
NIF can help you find a place for your data

Requirements for effective
data sharing
• Discoverability
– Data can be found
• Accessibility
– Data can be accessed and
access rights are clear
– Links to data are stable
• Assessability
– The reliability of the data can
be determined
• Understandability
– The data can be understood
• Usability
– The data are in a usable form
• Publishing data on your
website or as
supplemental material is
not the best way to make
it available
Duality of modern scholarship: A machine and
human dimension to each

But we have Google!
• Current web is
designed to share
documents
– Documents are
unstructured data
• Much of the
content of digital
resources is part of
the “hidden web”
• Wikipedia: The Deep Web
(also called Deepnet, the
invisible Web, DarkNet,
Undernet or the hidden
Web) refers to World
Wide Web content that is
not part of the Surface
Web, which is indexed by
standard search engines.

What do you mean by data?
Databases come in many shapes and sizes
• Primary data:
– Data available for reanalysis, e.g.,
microarray data sets from GEO;
brain images from XNAT;
microscopic images (CCDB/CIL)
• Secondary data
– Data features extracted through
data processing and sometimes
normalization, e.g., brain structure
volumes (IBVD), gene expression
levels (Allen Brain Atlas); brain
connectivity statements (BAMS)
• Tertiary data
– Claims and assertions about the
meaning of data
• E.g., gene
upregulation/downregulation,
brain activation as a function
of task
• Registries:
– Metadata
– Pointers to data sets or
materials stored elsewhere
• Data aggregators
– Aggregate data of the same
type from multiple sources,
e.g., Cell Image Library,
SUMSdb, Brede
• Single source
– Data acquired within a single
context, e.g., Allen Brain Atlas
Researchers are producing a variety of
information artifacts using a multitude of
technologies

Which databases do you use?
• Mouse Genome Database
• Clinical Trials.gov
• Pub Med
• dbGAP
• GEO
• NIH Reporter
• OMIM
• Bionumbers
– a database of numerical values extracted from literature
• Epigenomics
– human epigenomic data to catalyze basic biology and
disease-oriented research
• Antibody Registry
– 2M antibodies
• BioGrid
– an interaction repository of protein and genetic
interactions
Most resources are largely unknown and underutilized

NIF unifies look, feel and access

Making it easier to access and understand
distributed databases
Each resource implements a different, though related model;
systems are complex and difficult to learn, in many cases

Facets and filters: Progressive
refinement of search
More effective to start with a general query and use
the navigation to refine search
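Progressive refinement can be sketched as counting facet values over a broad result set and then filtering on the values the user picks. This is only an illustration: the result rows, field names, and source names below are invented, and nothing here reflects how NIF's search is implemented.

```python
from collections import Counter

# Hypothetical search results; field names and values are illustrative only.
results = [
    {"source": "DB-A", "organism": "mouse", "region": "striatum"},
    {"source": "DB-A", "organism": "rat",   "region": "striatum"},
    {"source": "DB-B", "organism": "mouse", "region": "cerebellum"},
    {"source": "DB-C", "organism": "mouse", "region": "striatum"},
]


def facet_counts(rows, facet):
    """Count how many results fall under each value of one facet."""
    return Counter(row[facet] for row in rows)


def refine(rows, **filters):
    """Keep only the rows that match every selected facet value."""
    return [row for row in rows
            if all(row.get(key) == value for key, value in filters.items())]


# Start broad, inspect the facet counts, then narrow one step at a time.
by_organism = facet_counts(results, "organism")
mouse_only = refine(results, organism="mouse")
mouse_striatum = refine(mouse_only, region="striatum")
```

Each refinement step works on the output of the previous one, so the user never has to formulate the full, narrow query up front.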

Current challenge: With so much
available, how do I find what I need?
• “What genes are upregulated
by chronic morphine?”
– It depends
• Most often use cases require
connecting a researcher to
relevant data sets and
appropriate tools
– Depending upon the data and
tools, the answers may differ
• Many databases have tool
bases and workflows that
they support

Exploration of NIF: 1. Progressive
refinement of search

2. “Data trails”: Linking data and analysis
tools

Same data: different analysis
Chronic vs acute morphine in striatum
• Gemma: Gene ID + Gene Symbol
• DRG: Gene name + Probe ID
• Gemma presented results relative to baseline chronic
morphine; DRG with respect to saline, so direction of change is
opposite in the 2 databases
• Analysis:
•1370 statements from Gemma regarding gene expression as a function of chronic
morphine
•617 were consistent with DRG;  over half of the claims of the paper were not
confirmed in this analysis
•Results for 1 gene were opposite in DRG and Gemma
•45 did not have enough information provided in the paper to make a judgment
NIF is working to make it easier to find where data
has gone and what has been done with it
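The baseline problem described on this slide can be illustrated with a small sketch. It assumes a simple two-condition comparison (treatment vs saline), so that a change reported against the treatment baseline is the mirror image of one reported against saline. The gene names and claims are made up, and the function names are hypothetical.

```python
def normalize_direction(direction, baseline, reference="saline"):
    """Express a reported change relative to a common reference baseline.

    direction: "up" or "down" as reported by the source.
    baseline:  the condition the source compared against.
    Assumes only two conditions, so a non-reference baseline means the
    reported direction must be flipped.
    """
    if baseline == reference:
        return direction
    return "down" if direction == "up" else "up"


def compare(claims_a, claims_b):
    """Split genes present in both sources into consistent and conflicting."""
    consistent, conflicting = [], []
    for gene in sorted(claims_a.keys() & claims_b.keys()):
        if claims_a[gene] == claims_b[gene]:
            consistent.append(gene)
        else:
            conflicting.append(gene)
    return consistent, conflicting


# Made-up claims: gene -> (reported direction, baseline used by the source).
source_a = {"GeneX": ("down", "chronic morphine"), "GeneY": ("up", "chronic morphine")}
source_b = {"GeneX": ("up", "saline"), "GeneY": ("up", "saline")}

norm_a = {g: normalize_direction(d, b) for g, (d, b) in source_a.items()}
norm_b = {g: normalize_direction(d, b) for g, (d, b) in source_b.items()}
consistent, conflicting = compare(norm_a, norm_b)
```

Without the normalization step, GeneX would look like a disagreement and GeneY like an agreement, which is the reverse of what the two sources actually claim.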

3. SciCrunch: A social network for
data and tools
• NIF platform has been adapted to
create SciCrunch
– Beta release: http://scicrunch.com
• Create more narrow community-based
portals based on common
data platform
• Select your data; organize it as you
wish
• Cost effective: a data portal can be
set up in a few hours
• Connects communities through data
and tools
• Shared curation-shared knowledge

Community Built Uniform Resource Layer
SciCrunch
Shared Resources
Community Outreach
Diagram labels: Undiagnosed Disease Program; Model Organism Databases; Phenotype RCN; One Mind for Research; Consortia-Pedia Faster Cures; Resource Identification Portal; Aging; Neuroscience; dkNET; Phenotypes; NSF Earthcube
Breaking down silos: Community
enrichment

Phases of NIF
• 2006-2008: A survey of what was out there
• 2008-2009: Strategy for resource discovery
– NIF Registry vs NIF data federation
– Ingestion of data contained within different technology platforms, e.g., XML vs
relational vs RDF
– Effective search across semantically diverse sources
• NIFSTD ontologies
• 2009-2011: Strategy for data integration
– Unified views across common sources
– Mapping of content to NIF vocabularies
• 2011-present: Data analytics and linking data
– Uniform external data references
• 2013-present: SciCrunch: unified biomedical resource
services
• “data trails”
NIF provides a strategy and set of tools applicable to all
biomedical science

INFORMATION FRAMEWORKS
– A tool for analyzing and structuring information (“a reduction of
uncertainty”)

What is an effective information
framework for neuroscience?
Knowledge in space and spatial relationships
(the “where”)
Knowledge in words, terminologies and
logical relationships (the “what”)

NIF Semantic Framework: NIFSTD ontology
• NIF covers multiple structural scales and domains of relevance to neuroscience
• Aggregate of community ontologies with some extensions for neuroscience, e.g., Gene
Ontology, Chebi, Protein Ontology
NIFSTD diagram labels: Organism; Anatomical Structure; Cell; Subcellular structure; Molecule; Macromolecule; Gene; Molecule Descriptors; NS Function; Dysfunction; Quality; Investigation; Techniques; Reagent; Protocols; Instrument; Resource
Ontologies provide the universals for integrating across disparate
data by linking them to human knowledge models

Space limitations: Multiscale integration is not obvious
Cerebellar
cortex
Purkinje
Cell
Axon
Terminal
Axon
Dendritic
Tree
Dendrite
Dendritic
Spine
Cell body
There is little obvious connection between
data sets taken at different scales using
different microscopies without an explicit
representation of the biological objects that
the data represent

Neurolex: > 1 million triples
Dr. Yi Zeng: Chinese neural knowledge base
NIF Cell Graph
This is your brain on
computers

NIF “translates” common concepts through
ontology and annotation standards
What genes are upregulated by drugs of abuse in the
adult mouse? (show me the data!)
Morphine
Increased
expression
Adult Mouse
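The "translation" step on this slide can be pictured as query expansion: a user's phrase is mapped to a preferred label, and a general concept is expanded to its more specific terms before searching. The tiny ontology and synonym table below are hypothetical stand-ins, not content from NIFSTD, and the function names are invented for illustration.

```python
# Toy ontology: each term maps to more specific terms (hypothetical content).
SUBCLASSES = {
    "drug of abuse": ["morphine", "cocaine"],
    "morphine": [],
    "cocaine": [],
}
# Toy synonym table linking a user's wording to an annotation standard.
SYNONYMS = {
    "upregulated": "increased expression",
    "drugs of abuse": "drug of abuse",
}


def canonical(term):
    """Map a user's phrase to the preferred label, if one is known."""
    term = term.lower()
    return SYNONYMS.get(term, term)


def expand(term):
    """Return the preferred label plus all of its descendants."""
    found, stack = [], [canonical(term)]
    while stack:
        current = stack.pop()
        if current not in found:
            found.append(current)
            stack.extend(SUBCLASSES.get(current, []))
    return found


drug_terms = expand("drugs of abuse")
change_term = canonical("upregulated")
```

With these tables, a question phrased as "upregulated by drugs of abuse" can reach records annotated with "morphine" and "increased expression", even though neither phrase appears in the question.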

Another search tip: Custom
query syntax

Ontologies as a data integration framework
•NIF Connectivity: 7 databases containing connectivity primary data or claims
from literature on connectivity between brain regions
•Brain Architecture Management System (rodent)
•Temporal lobe.com (rodent)
•Connectome Wiki (human)
•Brain Maps (various)
•CoCoMac (primate cortex)
•UCLA Multimodal database (Human fMRI)
•Avian Brain Connectivity Database (Bird)
•Total: 1800 unique brain terms (excluding Avian)
•Number of exact terms used in > 1 database: 42
•Number of synonym matches: 99
•Number of 1st order partonomy matches: 385
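The three kinds of match counted above (exact term, synonym, first-order partonomy) can be sketched as a small classifier. The lookup tables hold a single illustrative entry each and are not drawn from any of the seven databases; the function names are hypothetical.

```python
# Illustrative lookup tables; a real ontology would supply these.
SYNONYMS = {"cortex of cerebellum": "cerebellar cortex"}   # synonym -> preferred label
PART_OF = {"cerebellar cortex": "cerebellum"}              # part -> immediate parent


def preferred(term):
    """Resolve a term to its preferred label (lower-cased)."""
    term = term.lower()
    return SYNONYMS.get(term, term)


def match_type(term_a, term_b):
    """Classify how two brain-region terms relate, most specific test first."""
    a, b = term_a.lower(), term_b.lower()
    if a == b:
        return "exact"
    pa, pb = preferred(a), preferred(b)
    if pa == pb:
        return "synonym"
    # First-order partonomy: one term is the immediate parent of the other.
    if PART_OF.get(pa) == pb or PART_OF.get(pb) == pa:
        return "partonomy"
    return "none"
```

For example, `match_type("cortex of cerebellum", "cerebellum")` resolves the synonym first and then finds the part-of link, a connection that plain string comparison would miss.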

What can we learn from the data space?
NIF ANALYTICS

Data Federation Growth
NIF searches the largest collation of
neuroscience-relevant data on the web

Definition: “The long tail of small
data”
• Long tail data: large numbers of small data
sets
Estimate: ~50% of long tail data is “Dark
data”: data not available for search
http://en.wikipedia.org/wiki/Long_tail

NIF Analytics: The Neuroscience Landscape
Where are the data?
Striatum
Hypothalamus
Olfactory bulb
Cerebral cortex
Ontologies provide a semantic framework for understanding
data/resource landscape
Brain
Brain region
Data source
Vadim Astakhov, Kepler Workflow Engine

Data and knowledge gaps
Data Sources
Legend (count bins): 0; 1-10; 11-100; >101
NIF lets us ask: where isn’t there data? What isn’t studied? Why?
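A heat map like this one can be produced by binning a count for each cell and flagging the zeros as gaps. The sketch below assumes the count is the number of records per (brain region, data source) pair, which the slide does not state, and it reads the top legend bin as "more than 100". The counts and source names are invented.

```python
def legend_bin(count):
    """Map a count to one of the heat-map legend bins."""
    if count < 0:
        raise ValueError("count must be non-negative")
    if count == 0:
        return "0"
    if count <= 10:
        return "1-10"
    if count <= 100:
        return "11-100"
    return ">100"  # assumption: the slide's ">101" is read as "more than 100"


# Hypothetical counts per (brain region, data source) cell.
counts = {
    ("striatum", "DB-A"): 250,
    ("striatum", "DB-B"): 12,
    ("olfactory bulb", "DB-A"): 3,
    ("olfactory bulb", "DB-B"): 0,
}

binned = {cell: legend_bin(n) for cell, n in counts.items()}
gaps = sorted(cell for cell, n in counts.items() if n == 0)
```

The `gaps` list is the part that answers "where isn't there data?": it names the region and source combinations with no records at all.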

Forebrain
Midbrain
Hindbrain
Data Sources
Legend (count bins): 0; 1-10; 11-100; >101

Adult mouse brain connectivity matrix: revenge of the
midbrain
SW Oh et al. Nature 000, 1-8 (2014) doi:10.1038/nature13186

The tale of the tail
“Human neuroimaging typically is performed on a whole brain basis.
However, for several reasons tail of the caudate activity can easily be missed.
•One reason is limitations in the normalization algorithms, that typically are
optimized to maximize accuracy for cortical rather than subcortical
structures. ...
•A second reason is that standard neuroimaging atlases such as the Harvard-
Oxford structural atlas used with neuroimaging analysis programs such as
FreeSurfer truncate the caudate at the body, and completely exclude the
tail...
•A final reason is that the tail of the caudate is close to the hippocampus,
and could be misidentified as such especially in tasks involving learning and
memory.
Therefore, the tail of the caudate may be recruited in additional cognitive
tasks, but yet not have been properly identified and reported in the
neuroimaging literature”
Seger CA. The visual corticostriatal loop through the tail of the caudate: circuitry and function. Front
Syst Neurosci. 2013 Dec 6;7:104. doi: 10.3389/fnsys.2013.00104. eCollection 2013.

“The Data Homunculus”
Beware of biases in the data space...

Access to data has changed over the
years
Genbank
PDB
The Encyclopedia of Life
A…
Tim Berners-Lee: Web of data
Wikipedia defines Linked Data as “a term used to describe a
recommended best practice for
exposing, sharing, and connecting
pieces of data, information, and
knowledge on the Semantic Web
using URIs and RDF.”
http://linkeddata.org/
“Whichever technology wins broad adoption will become, by
default, the data web. That’s why we don’t need to know
which technological vision of the data web will win to conclude
that the data web is inevitable” -Michael Nielsen

I am a number: ORCID ID
The web of data runs on the ability to uniquely
identify all the relevant entities
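One reason an identifier such as an ORCID iD works well for machines is that it can be checked before it is used. The sketch below validates the final character of an ORCID iD with the ISO 7064 MOD 11-2 checksum that ORCID documents for its identifiers; the function names are hypothetical, and the check says nothing about whether the iD is actually registered.

```python
def orcid_check_digit(base_digits):
    """Compute the final character of an ORCID iD (ISO 7064 MOD 11-2)."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """Check the 16-character form, with or without hyphens."""
    compact = orcid.replace("-", "").upper()
    if len(compact) != 16 or not compact[:15].isdigit():
        return False
    return orcid_check_digit(compact[:15]) == compact[15]
```

For example, `is_valid_orcid("0000-0002-1825-0097")`, the sample iD used in ORCID's own documentation, passes the check, while changing the last digit makes it fail.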

Resource Identification Initiative
• Have authors supply appropriate
identifiers for key resources used
within a study such that they are:
– Machine processible (i.e., unique
identifier that resolves to a single
resource)
– Outside of the paywall
– Uniform across journals and
publishers
• Goal: Proof of principle
– What infrastructure would be
needed
– Could authors perform the task
– Would authors perform the task
– Will it be useful?
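To show what "machine processible" buys, here is a minimal sketch that pulls RRIDs out of a methods paragraph. The pattern is an assumption based on the example RRID:AB_90755 from this talk (a registry prefix, an underscore, then an accession); real RRIDs may need a broader pattern. The sample sentence is invented.

```python
import re

# Assumed pattern: "RRID:" + registry prefix + "_" + accession, e.g. RRID:AB_90755.
RRID_PATTERN = re.compile(r"RRID:\s*([A-Za-z]+_[A-Za-z0-9]+)")


def extract_rrids(text):
    """Return the unique RRIDs found in a block of text, in order of appearance."""
    seen = []
    for match in RRID_PATTERN.finditer(text):
        rrid = "RRID:" + match.group(1)
        if rrid not in seen:
            seen.append(rrid)
    return seen


# Invented methods sentence that cites the same reagent twice.
methods = ("Sections were stained with an example antibody "
           "(RRID:AB_90755) and again with the same reagent (RRID: AB_90755).")
found = extract_rrids(methods)
```

Because the identifier has a uniform shape across journals, the same few lines can answer "what studies used this reagent?" over any text that sits outside the paywall.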
http://www.force11.org/resource_identification_initiative

What studies used ...?
•100 articles have appeared to date
•15 journals
•Data set being made available to
community
•> 600 RRID’s
•~10% disappeared after
copyediting
•5% were in error
•14% false negative rate
•> 200 antibodies were added
•> 75 software tools/databases
were added
RRID:AB_90755
Database available at: https://www.force11.org/node/5635

An ecosystem for research objects
Diagram labels: Article; Data; Code; Blogs; Workflows; Portals; Search engines; Persistent Identifiers
Unique and persistent identifiers and a system for
referencing them allow a scholarly ecosystem to function

Taking a global view on data:
microculture to ecosystem
• Several powerful trends should change the way we think about
our data: One → Many
– Many data
• Generation of data is getting easier → shared data
• Data space is getting richer: more –omes every day
• But...compared to the biological space, still sparse
– Many eyes
• Wisdom of crowds
• More than one way to interpret data
– Many algorithms
• Not a single way to analyze data
– Many analytics
• “Signatures” in data may not be directly related to the question for which they
were acquired but tell us something really interesting
One data set → one algorithm → one paper???

How you can contribute
• Register your tools/data to NIF
• Let us help you with your use cases
• Use RRID’s in your publications
– http://scicrunch.com/resources
• Get your ORCID ID!
• Put your data in a repository
– NIF can help you find one; NIF is one
• If you are planning on building your own data
resources, talk to us!

Future of Research Communications
and e-Scholarship (FORCE11.org)
Join us! http://force11.org

NIF team (past and present)
Jeff Grethe, UCSD, Co Investigator, Interim PI
Amarnath Gupta, UCSD, Co Investigator
Anita Bandrowski, NIF Project Leader
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen, Washington University
Erin Reid
Paul Sternberg, Cal Tech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli, George Mason University
Sridevi Polavarum
Fahim Imam
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Svetlana Sulima
Davis Banks
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark, Harvard University
Paolo Ciccarese
Karen Skinner, NIH, Program Officer
(retired)
Jonathan Pollock, NIH, Program Officer
And my colleagues in Monarch, dkNet, 3DVC, Force 11
