Data Landscapes: The Neuroscience 
Information Framework 
Maryann E. Martone, Ph. D. 
University of California, San 
• Introduction 
• The Neuroscience Information Framework 
• A tour of NIF 
• The NIF Framework 
– Ontologies 
– NIF Analytics: What can we learn from the data 
• Where do we go from here? 
– Resource Identification Initiative 
– Conclusions
• NIF is an initiative of the NIH Blueprint consortium of institutes 
– What types of resources (data, tools, materials, services) are available to the 
neuroscience community? 
– How many are there? 
– What domains do they cover? What domains do they not cover? 
– Where are they? 
• Web sites 
• Databases 
• Literature 
• Supplementary material 
• PDF files 
• Desk drawers 
– Who uses them? 
– Who creates them? 
– How can we find them? 
– How can we make them better in the future?
Old Model: Single type of content; 
single mode of distribution 
FORCE11.org: Future of research communications and e-scholarship
Data Repositories 
Code Repositories 
Peer Reviewers 
Solving the large problems of 
• Observation 
• Experimentation 
• Modeling 
• Cooperative data 
intensive science 
“An unaided human’s ability to process 
large data sets is comparable to a dog’s 
ability to do arithmetic, and not much more 
valuable.” –Michael Nielsen, Reinventing 
Discovery, 2012. 
NIF: A New Type of Entity for New Modes 
of Scientific Dissemination 
• NIF’s mission is to maximize the awareness of, access to and 
utility of digital resources produced worldwide to enable better 
science and promote efficient use 
– NIF unites neuroscience information without respect to domain, funding 
agency, institute or community 
– NIF is a library for scholarly output that is a web enabled resource and 
not a paper 
– Aggregates all the different databases, tools and resources now 
produced by the scientific community 
– Makes them searchable from a single interface 
– A practical approach to the data deluge 
– Educate neuroscientists and students about effective data sharing
Surveying the resource landscape 
NIF resource registry: listing of > 12000 databases, tools, 
materials, services, websites (> 2500 databases) 
NIF data federation: PubMed Central for data 
200 sources 
> 800 M records 
NIF was designed to accommodate the multiplicity of heterogeneous and distributed data 
resources, providing deep query of the contents and unified views 
Registry vs Federation: Metadata about resource vs 
metadata/data in database
What resources are available for Addiction and GRM1? 
With the thousands of databases and other information sources 
available, simple descriptive metadata will not suffice 
How do resources get added to the 
•NIF curators 
•Nomination by the 
•Semi-automated text 
mining pipelines 
NIF Registry 
Requires no special 
Site map available 
for local hosting 
•NIF Data Federation 
•DISCO interop 
•Requires some 
programming skill 
•Open Source Brain < 
2 hr 
Low barrier to entry; incremental refinement
What about my data? 
•Best practice: 
•Put it in a repository 
•What repository? 
repository for your 
data type, e.g., GEO 
•General repository: 
•Institutional repository 
•Research libraries are 
setting up repositories 
to manage their 
“digital assets” 
NIF can help you find a place for your data
Requirements for effective data sharing 
• Discoverability 
– Data can be found 
• Accessibility 
– Data can be accessed and 
access rights are clear 
– Links to data are stable 
• Assessability 
– The reliability of the data can 
be determined 
• Understandability 
– The data can be understood 
• Usability 
– The data are in a usable form 
• Publishing data on your 
website or as 
supplemental material is 
not the best way to make 
it available 
Duality of modern scholarship: A machine and 
human dimension to each
But we have Google!! 
• Current web is 
designed to share 
– Documents are 
unstructured data 
• Much of the 
content of digital 
resources is part of 
the “hidden web” 
• Wikipedia: The Deep Web 
(also called Deepnet, the 
invisible Web, DarkNet, 
Undernet or the hidden 
Web) refers to World 
Wide Web content that is 
not part of the Surface 
Web, which is indexed by 
standard search engines.
What do you mean by data? 
Databases come in many shapes and sizes 
• Primary data: 
– Data available for reanalysis, e.g., 
microarray data sets from GEO; 
brain images from XNAT; 
microscopic images (CCDB/CIL) 
• Secondary data 
– Data features extracted through 
data processing and sometimes 
normalization, e.g., brain structure 
volumes (IBVD), gene expression 
levels (Allen Brain Atlas); brain 
connectivity statements (BAMS) 
• Tertiary data 
– Claims and assertions about the 
meaning of data 
• E.g., gene 
brain activation as a function 
of task 
• Registries: 
– Metadata 
– Pointers to data sets or 
materials stored elsewhere 
• Data aggregators 
– Aggregate data of the same 
type from multiple sources, 
e.g., Cell Image Library, 
SUMSdb, Brede 
• Single source 
– Data acquired within a single 
context, e.g., Allen Brain Atlas 
Researchers are producing a variety of 
information artifacts using a multitude of 
Which databases do you use? 
• Mouse Genome 
• PubMed 
• dbGaP 
• GEO 
• NIH Reporter 
• Bionumbers 
– A database of numerical values extracted from the literature 
• Epigenomics 
– Human epigenomic data to catalyze basic biology and 
disease-oriented research 
• Antibody Registry 
– 2M antibodies 
• BioGRID 
– An interaction repository of protein and genetic 
Most resources are largely unknown and underutilized
NIF unifies look, feel and access
Making it easier to access and understand 
distributed databases 
Each resource implements a different, though related model; 
systems are complex and difficult to learn, in many cases 
Exploring the data space
Facets and filters: Progressive 
refinement of search 
More effective to start with a general query and use 
the navigation to refine search 
Some NIF(ty) Features
Current challenge: With so much 
available, how do I find what I need? 
• “What genes are upregulated 
by chronic morphine?” 
– It depends 
• Most often use cases require 
connecting a researcher to 
relevant data sets and 
appropriate tools 
– Depending upon the data and 
tools, the answers may differ 
• Many databases have tool 
bases and workflows that 
they support
Exploration of NIF: 1. Progressive 
refinement of search 
More effective to start with a general query and use 
the navigation to refine search 
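Progressive refinement can be sketched in a few lines of Python. This is only an illustration of the idea of facets and filters, not NIF's implementation; the records, field names and function names below are all invented for the example.

```python
from collections import Counter

# Hypothetical records; sources, organisms and text are made up for illustration.
RECORDS = [
    {"source": "Gemma", "organism": "mouse", "text": "morphine striatum expression"},
    {"source": "DRG", "organism": "mouse", "text": "morphine saline expression"},
    {"source": "GEO", "organism": "rat", "text": "morphine microarray"},
    {"source": "GEO", "organism": "mouse", "text": "cocaine microarray"},
]


def search(records, term):
    """General query: keep records whose text contains the term as a word."""
    return [r for r in records if term in r["text"].split()]


def facet_counts(records, field):
    """Count how many of the current hits fall under each value of a facet."""
    return Counter(r[field] for r in records)


def refine(records, field, value):
    """Apply one facet filter to the current result set."""
    return [r for r in records if r[field] == value]


hits = search(RECORDS, "morphine")             # start with a broad query
by_organism = facet_counts(hits, "organism")   # counts a navigation panel could show
mouse_hits = refine(hits, "organism", "mouse") # narrow using one facet
```

Each refinement works on the previous result set, so the facet counts always describe what is left to choose from.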
2. “Data trails”: Linking data and analysis 
Same data: different analysis 
Chronic vs acute morphine in striatum 
• Gemma: Gene ID + Gene Symbol 
• DRG: Gene name + Probe ID 
• Gemma presented results relative to baseline chronic 
morphine; DRG with respect to saline, so direction of change is 
opposite in the 2 databases 
• Analysis: 
•1370 statements from Gemma regarding gene expression as a function of chronic 
•617 were consistent with DRG;  over half of the claims of the paper were not 
confirmed in this analysis 
•Results for 1 gene were opposite in DRG and Gemma 
•45 did not have enough information provided in the paper to make a judgment 
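The baseline problem described above can be made concrete with a small sketch. It assumes each statement records the gene, the direction of change and the condition it is reported against; the gene names, the data and the function names are hypothetical and do not come from Gemma or DRG.

```python
# Hypothetical statements: (gene, direction, baseline the source reports against).
GEMMA = [("GeneA", "down", "chronic_morphine"), ("GeneB", "up", "chronic_morphine")]
DRG = [("GeneA", "up", "saline"), ("GeneB", "up", "saline")]

FLIP = {"up": "down", "down": "up"}


def to_common_baseline(direction, baseline, common="saline"):
    """Express a direction of change relative to one shared baseline.

    A result reported against the other condition has its direction flipped.
    """
    return direction if baseline == common else FLIP[direction]


def compare(statements_a, statements_b):
    """Label each gene found in both sources as 'consistent' or 'opposite'."""
    b_directions = {g: to_common_baseline(d, base) for g, d, base in statements_b}
    result = {}
    for gene, direction, baseline in statements_a:
        if gene in b_directions:
            same = to_common_baseline(direction, baseline) == b_directions[gene]
            result[gene] = "consistent" if same else "opposite"
    return result


comparison = compare(GEMMA, DRG)
```

Without the normalization step, GeneA would look contradictory across the two sources even though both report the same biology.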
NIF is working to make it easier to find where data 
has gone and what has been done with it 
3. SciCrunch: A social network for 
data and tools 
• NIF platform has been adapted to 
create SciCrunch 
– Beta release: 
• Create more narrow community-based 
portals based on common 
data platform 
• Select your data; organize it as you 
• Cost effective: a data portal can be 
set up in a few hours 
• Connects communities through data 
and tools 
• Shared curation-shared knowledge
Community Built Uniform Resource 
Disease Program 
Model Organism 
Phenotype RCN 
One Mind for Faster Cures 
Layer 
Resource Identification Portal 
NSF Earthcube
Breaking down silos: Community 
Phases of NIF 
• 2006-2008: A survey of what was out there 
• 2008-2009: Strategy for resource discovery 
– NIF Registry vs NIF data federation 
– Ingestion of data contained within different technology platforms, e.g., XML vs 
relational vs RDF 
– Effective search across semantically diverse sources 
• NIFSTD ontologies 
• 2009-2011: Strategy for data integration 
– Unified views across common sources 
– Mapping of content to NIF vocabularies 
• 2011-present: Data analytics and Linking data 
– Uniform external data references 
• 2013-present: SciCrunch: unified biomedical resource 
• “data trails” 
NIF provides a strategy and set of tools applicable to all 
biomedical science
-a tool for analyzing and structuring information (“ a reduction of 
What is an effective information 
framework for neuroscience? 
Knowledge in space and spatial relationships 
(the “where”) 
Knowledge in words, terminologies and 
logical relationships (the “what”)
NIF Semantic Framework: NIFSTD ontology 
• NIF covers multiple structural scales and domains of relevance to neuroscience 
• Aggregate of community ontologies with some extensions for neuroscience, e.g., Gene 
Ontology, ChEBI, Protein Ontology 
Modules shown: Anatomical, Dysfunction, Quality, Molecule, NS Function, Investigation, 
Subcellular, Macromolecule, Gene, Molecule Descriptors, Reagent, Protocols, Resource, 
Instrument 
Ontologies provide the universals for integrating across disparate 
data by linking them to human knowledge models
Space limitations: Multiscale integration is not obvious 
Cell body 
There is little obvious connection between 
data sets taken at different scales using 
different microscopies without an explicit 
representation of the biological objects that 
the data represent 
Neurolex: > 1 million triples 
Dr. Yi Zeng: Chinese neural knowledge base 
NIF Cell Graph 
This is your brain on 
NIF “translates” common concepts through 
ontology and annotation standards 
What genes are upregulated by drugs of abuse in the 
adult mouse? (show me the data!) 
Adult Mouse
Another search tip: Custom 
query syntax
Ontologies as a data integration framework 
•NIF Connectivity: 7 databases containing connectivity primary data or claims 
from literature on connectivity between brain regions 
•Brain Architecture Management System (rodent) 
•Temporal (rodent) 
•Connectome Wiki (human) 
•Brain Maps (various) 
•CoCoMac (primate cortex) 
•UCLA Multimodal database (Human fMRI) 
•Avian Brain Connectivity Database (Bird) 
•Total: 1800 unique brain terms (excluding Avian) 
•Number of exact terms used in > 1 database: 42 
•Number of synonym matches: 99 
•Number of 1st order partonomy matches: 385
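The three kinds of match counted above (exact term, synonym, first-order partonomy) can be illustrated with a toy lookup. This is a sketch under stated assumptions: the synonym and part-of tables are a made-up miniature, not NIFSTD content, and the function names are hypothetical.

```python
# Hypothetical mini-ontology, for illustration only.
SYNONYMS = {"caudoputamen": "striatum", "neostriatum": "striatum"}  # synonym -> preferred label
PART_OF = {"putamen": "striatum", "caudate nucleus": "striatum", "ca1": "hippocampus"}  # part -> whole


def normalize(term):
    """Lower-case and trim a term so that spelling variants compare equal."""
    return term.strip().lower()


def preferred(term):
    """Map a term to its preferred label, if it is a known synonym."""
    t = normalize(term)
    return SYNONYMS.get(t, t)


def match_type(term_a, term_b):
    """Return 'exact', 'synonym', 'partonomy', or None for two brain region terms."""
    a, b = normalize(term_a), normalize(term_b)
    if a == b:
        return "exact"
    pa, pb = preferred(a), preferred(b)
    if pa == pb:
        return "synonym"
    # First-order partonomy: one term is a direct part of the other.
    if PART_OF.get(pa) == pb or PART_OF.get(pb) == pa:
        return "partonomy"
    return None
```

Applied pairwise across databases, each tier recovers matches the previous tier misses, which is the pattern the counts above show.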
What can we learn from the data space? 
Data Federation Growth 
NIF searches the largest collation of 
neuroscience-relevant data on the web 
Definition: "The long tail of small 
• Long tail data: large numbers of small data 
Estimate: ~50% of long tail data is “Dark 
data”: data not available for search 
NIF Analytics: The Neuroscience Landscape 
Where are the data? 
Olfactory bulb 
Cerebral cortex 
Ontologies provide a semantic framework for understanding 
data/resource landscape 
Brain region 
Data source 
Vadim Astakhov, Kepler Workflow Engine
Data and knowledge gaps 
Data Sources 
NIF lets us ask: where isn't there data? What isn't studied? Why?
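One way to ask "where isn't there data?" is to tabulate records by brain region and data source and look for empty cells. The sketch below assumes each record has already been annotated with a region term; the regions, source names and counts are invented for the example.

```python
from collections import Counter

# Hypothetical (brain region, data source) annotations, one pair per record.
RECORDS = [
    ("cerebral cortex", "SourceA"),
    ("cerebral cortex", "SourceB"),
    ("olfactory bulb", "SourceA"),
    ("cerebral cortex", "SourceA"),
]
REGIONS = ["cerebral cortex", "olfactory bulb", "tail of caudate"]
SOURCES = ["SourceA", "SourceB"]

counts = Counter(RECORDS)

# Region-by-source matrix of record counts; missing pairs count as zero.
matrix = {r: {s: counts[(r, s)] for s in SOURCES} for r in REGIONS}

# Cells with no records, and regions with no records in any source.
gaps = [(r, s) for r in REGIONS for s in SOURCES if matrix[r][s] == 0]
unstudied = [r for r in REGIONS if sum(matrix[r].values()) == 0]
```

The region list has to come from the ontology rather than from the data, otherwise a region with no records would never appear in the table at all.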
Adult mouse brain connectivity matrix: revenge of the 
SW Oh et al. Nature 000, 1-8 (2014) doi:10.1038/nature13186
The tale of the tail 
“Human neuroimaging typically is performed on a whole brain basis. 
However, for several reasons tail of the caudate activity can easily be missed. 
•One reason is limitations in the normalization algorithms, that typically are 
optimized to maximize accuracy for cortical rather than subcortical 
structures. ... 
•A second reason is that standard neuroimaging atlases such as the Harvard- 
Oxford structural atlas used with neuroimaging analysis programs such as 
FreeSurfer truncate the caudate at the body, and completely exclude the 
•A final reason is that the tail of the caudate is close to the hippocampus, 
and could be misidentified as such especially in tasks involving learning and 
Therefore, the tail of the caudate may be recruited in additional cognitive 
tasks, but yet not have been properly identified and reported in the 
neuroimaging literature” 
Seger CA. The visual corticostriatal loop through the tail of the caudate: circuitry and function. Front 
Syst Neurosci. 2013 Dec 6;7:104. doi: 10.3389/fnsys.2013.00104. eCollection 2013. 
“The Data Homunculus” 
Beware of biases in the data space......
of Life 
Access to data has 
changed over the 
Tim Berners-Lee: Web of data 
Wikipedia defines Linked Data as "a 
term used to describe a 
recommended best practice for 
exposing, sharing, and connecting 
pieces of data, information, and 
knowledge on the Semantic Web 
using URIs and RDF.” 
“Whichever technology wins broad adoption will become, by 
default, the data web. That’s why we don’t need to know 
which technological vision of the data web will win to conclude 
that the data web is inevitable” –Michael Nielsen 
I am a number: ORCID ID 
The web of data runs on the ability to uniquely 
identify all the relevant entities
Resource Identification Initiative 
• Have authors supply appropriate 
identifiers for key resources used 
within a study such that they are: 
– Machine processible (i.e., unique 
identifier that resolves to a single 
– Outside of the paywall 
– Uniform across journals and 
• Goal: Proof of principle 
– What infrastructure would be 
– Could authors perform the task 
– Would authors perform the task 
– Will it be useful? 
What studies used ...? 
•100 articles have appeared to date 
•15 journals 
•Data set being made available to 
•> 600 RRID’s 
•~10% disappeared after 
•5% were in error 
•14% false negative rate 
•> 200 antibodies were added 
•> 75 software tools/databases 
were added 
Database available at: https://www.force11.org/node/5635
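A machine-processible identifier is one that software can pull out of a methods section without human help. The sketch below is a minimal illustration, assuming identifiers are written with a literal "RRID:" prefix followed by letters, digits, underscores, hyphens or colons; the pattern is an assumption rather than the initiative's official syntax, and the identifiers in the sample text are made up.

```python
import re

# Assumed pattern: "RRID:" followed by an identifier of letters, digits,
# underscores, hyphens or colons. Not an official grammar.
RRID_PATTERN = re.compile(r"RRID:\s?([A-Za-z0-9_:\-]+)")


def extract_rrids(text):
    """Return the identifiers found in the text, in order, without repeats."""
    found = []
    for identifier in RRID_PATTERN.findall(text):
        if identifier not in found:
            found.append(identifier)
    return found


# Invented methods text; the identifiers do not refer to real resources.
methods = (
    "Sections were stained with an anti-GFAP antibody (RRID:AB_0000001) "
    "and images were analyzed with custom software (RRID:SCR_0000002). "
    "The same antibody (RRID:AB_0000001) was used for controls."
)
rrids = extract_rrids(methods)
```

A resource cited without the prefix is invisible to this kind of extraction, which is one way a false negative can arise.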
An ecosystem for research objects 
Unique and persistent identifiers and a system for 
referencing them allow a scholarly ecosystem to function 
Search engines 
Taking a global view on data: 
microculture to ecosystem 
• Several powerful trends should change the way we think about 
our data: One → Many 
– Many data 
• Generation of data is getting easier → shared data 
• Data space is getting richer: more –omes everyday 
• But...compared to the biological space, still sparse 
– Many eyes 
• Wisdom of crowds 
• More than one way to interpret data 
– Many algorithms 
• Not a single way to analyze data 
– Many analytics 
• “Signatures” in data may not be directly related to the question for which they 
were acquired but tell us something really interesting 
One data set → one algorithm → one paper???
How you can contribute 
• Register your tools/data to NIF 
• Let us help you with your use cases 
• Use RRID’s in your publications 
• Get your ORCID ID! 
• Put your data in a repository 
– NIF can help you find one; NIF is one 
• If you are planning on building your own data 
resources, talk to us!
Future of Research Communications 
and e-Scholarship ( 
Join us! http://force11.org
NIF team (past and present) 
Jeff Grethe, UCSD, Co Investigator, Interim PI 
Amarnath Gupta, UCSD, Co Investigator 
Anita Bandrowski, NIF Project Leader 
Gordon Shepherd, Yale University 
Perry Miller 
Luis Marenco 
Rixin Wang 
David Van Essen, Washington University 
Erin Reid 
Paul Sternberg, Cal Tech 
Arun Rangarajan 
Hans Michael Muller 
Yuling Li 
Giorgio Ascoli, George Mason University 
Sridevi Polavarum 
Fahim Imam 
Larry Lui 
Andrea Arnaud Stagg 
Jonathan Cachat 
Jennifer Lawrence 
Svetlana Sulima 
Davis Banks 
Vadim Astakhov 
Xufei Qian 
Chris Condit 
Mark Ellisman 
Stephen Larson 
Willie Wong 
Tim Clark, Harvard University 
Paolo Ciccarese 
Karen Skinner, NIH, Program Officer 
Jonathan Pollock, NIH, Program Officer 
And my colleagues in Monarch, dkNet, 3DVC, Force 11

  5. Add an addiction dimension to this query
  6. Google Knowledge Graph: