1. Big data from small data: A deep survey of the neuroscience landscape data via the Neuroscience Information Framework
Maryann Martone, Ph. D.
University of California, San Diego
2. “Neural Choreography”
“A grand challenge in neuroscience is to elucidate brain function in relation
to its multiple layers of organization that operate at different spatial and
temporal scales. Central to this effort is tackling “neural choreography” --
the integrated functioning of neurons into brain circuits-- Neural
choreography cannot be understood via a purely reductionist approach.
Rather, it entails the convergent use of analytical and synthetic tools to
gather, analyze and mine information from each level of analysis, and
capture the emergence of new layers of function (or dysfunction) as we
move from studying genes and proteins, to cells, circuits, thought, and
behavior....
However, the neuroscience community is not yet fully engaged in exploiting the
rich array of data currently available, nor is it adequately poised to capitalize
on the forthcoming data explosion.”
Akil et al., Science, Feb 11, 2011
3. “Data choreography”
In that same issue, Science asked its peer reviewers from the previous year about the availability and use of data:
About half of those polled store their data only in their
laboratories—not an ideal long-term solution.
Many bemoaned the lack of common metadata and archives as a
main impediment to using and storing data, and most of the
respondents have no funding to support archiving
And even where accessible, much data in many fields is too poorly
organized to enable it to be efficiently used.
“...it is a growing challenge to ensure that data produced during the
course of reported research are appropriately
described, standardized, archived, and available to all.” Lead Science
editorial (Science 11 February 2011: Vol. 331 no. 6018 p. 649 )
4. A data federation problem
No single technology serves these all equally well.
Multiple data types; multiple scales; multiple databases:
Whole brain data (20 um microscopic MRI)
Mosaic LM images (1 GB+)
Conventional LM images
Individual cell morphologies
EM volumes & reconstructions
Solved molecular structures
Neuroscience is unlikely to be served by a few large databases like the genomics and proteomics community
5. NIF is an initiative of the NIH Blueprint consortium of institutes
What types of resources (data, tools, materials, services) are
available to the neuroscience community?
How many are there?
What domains do they cover? What domains do they not cover?
Where are they?
Web sites
Databases
Literature
Supplementary material
PDF files
Desk drawers
Who uses them?
Who creates them?
How can we find them?
How can we make them better in the future?
http://neuinfo.org
6. We need more databases (?)
•NIF Registry: A catalog of neuroscience-relevant resources
•> 5000 currently listed
•> 2000 databases
•And we are finding more every day
7. But we have Google!
Current web is designed to share documents
Documents are unstructured data
Much of the content of digital resources is part of the “hidden web”
Wikipedia: The Deep Web (also called Deepnet, the invisible Web, DarkNet, Undernet or the hidden Web) refers to World Wide Web content that is not part of the Surface Web, which is indexed by standard search engines.
8. NIF must work with ecosystem as
it is today
NIF has developed a production technology platform for
researchers to discover, share, access, analyze, and
integrate neuroscience-relevant information
Semantically-enabled search engine and interface that customizes
results for neuroscience
System that searches the “hidden web”, i.e., content not well served by
search engines
Data resources are predominantly relational, xml, text, rdf, owl
Automated data harvesting technologies that produce dynamic indices
of data content including databases, web pages, text, xml etc.
Tools to make products and data available
Designed to be populated rapidly; set up process for progressive
refinement
9. NIF accomplishments
Assembled the largest searchable collation of neuroscience data on the web
Data federation
Resource registry (materials, data, tools, services)
PubMed literature
Full text of open access
The largest ontology for neuroscience
NIF search portal: simultaneous search over data, NIF catalog and biomedical literature
Neurolex Wiki: a community wiki serving neuroscience concepts
A unique technology platform
A reservoir of cross-disciplinary biomedical data expertise
NIF is poised to capitalize on the new tools and emphasis on big data and open science
UCSD, Yale, Cal Tech, George Mason, Washington Univ
10. NIF data federation
> 180 sources; 350 M records: NIF was designed to be populated rapidly, with progressive refinement of data
[Pie charts: percentage of data records per data type. Categories shown: brain activation foci, animals, images, pathways, drugs, connectivity, antibodies, microarray, grants; one slice is labeled 98%. A second chart shows the percentage of data records per data type for everything but microarray.]
11. What do you mean by data?
Databases come in many shapes and sizes
Primary data: Data available for reanalysis, e.g., microarray data sets from GEO; brain images from XNAT; microscopic images (CCDB/CIL)
Secondary data: Data features extracted through data processing and sometimes normalization, e.g., brain structure volumes (IBVD), gene expression levels (Allen Brain Atlas); brain connectivity statements (BAMS)
Tertiary data: Claims and assertions about the meaning of data, e.g., gene upregulation/downregulation,
Registries: Metadata; pointers to data sets or materials stored elsewhere
Data aggregators: Aggregate data of the same type from multiple sources, e.g., Cell Image Library, SUMSdb, Brede
Single source: Data acquired within a single context, e.g., Allen Brain Atlas
Researchers are producing a variety of information artifacts using a multitude of technologies
12. What types of questions can I ask?
We’d like to be able to find:
What is known:
What is the average diameter of a Purkinje neuron?
Is GRM1 expressed in cerebral cortex?
What are the projections of hippocampus?
What genes have been found to be upregulated in chronic drug abuse in adults?
Is there a database of fMRI studies?
What studies used my polyclonal antibody against GABA in humans?
What rat strains have been used most extensively in research during the last 20 years?
What is not known:
Connections among data
Gaps in knowledge
Without some sort of framework, very difficult to
do
13. What are the connections of the
hippocampus?
Hippocampus OR “Cornu Ammonis” OR “Ammon’s horn”
Query expansion: Synonyms and related concepts
Boolean queries
Data sources categorized by “data type” and level of nervous system
Tutorials for using full resource when getting there from NIF
Common views across multiple sources
Link back to record in original source
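The query expansion on this slide can be sketched in a few lines. This is a minimal illustration, not NIF's implementation: the `SYNONYMS` table and the `expand_query` function are hypothetical names, and a real system would draw its synonyms from an ontology rather than a hard-coded dictionary.

```python
# Hypothetical synonym table, seeded with the synonyms shown on the slide.
SYNONYMS = {
    "hippocampus": ["Cornu Ammonis", "Ammon's horn"],
}


def expand_query(term, synonyms=SYNONYMS):
    """Return a Boolean OR query covering the term and its known synonyms."""
    variants = [term] + synonyms.get(term.lower(), [])
    # Quote multi-word variants so they are searched as phrases.
    quoted = [f'"{v}"' if " " in v else v for v in variants]
    return " OR ".join(quoted)


print(expand_query("Hippocampus"))
```

A term with no entry in the table passes through unchanged, so the expansion step never blocks a search.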
14. Results are organized within a common
framework
[Diagram: connectivity relationship labels used across resources: Target site; Synapsed by; Innervates; Connects to; Input region; Synapsed with; Cellular contact; Projects to; Axon innervates; Subcellular contact; Source site]
Each resource implements a different, though related model;
systems are complex and difficult to learn, in many cases
15. The scourge of neuroanatomical nomenclature:
Importance of NIF semantic framework
•NIF Connectivity: 7 databases containing connectivity primary data or claims
from literature on connectivity between brain regions
•Brain Architecture Management System (rodent)
•Temporal-lobe.com (rodent)
•Connectome Wiki (human)
•Brain Maps (various)
•CoCoMac (primate cortex)
•UCLA Multimodal database (Human fMRI)
•Avian Brain Connectivity Database (Bird)
•Total: 1800 unique brain terms (excluding Avian)
•Number of exact terms used in > 1 database: 42
•Number of synonym matches: 99
•Number of 1st order partonomy matches: 385
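The kind of count reported above (exact matches versus synonym matches across databases) can be sketched as follows. The database names are taken from the slide, but the term lists, the synonym mapping, and the function name `shared_terms` are made up for illustration; this is not the analysis NIF actually ran.

```python
# Toy term lists: real database names, invented contents.
DB_TERMS = {
    "BAMS": {"hippocampus", "caudate nucleus", "striatum"},
    "CoCoMac": {"ammon's horn", "caudate nucleus"},
    "BrainMaps": {"cornu ammonis", "putamen"},
}

# Hypothetical synonym-to-preferred-label mapping.
SYNONYM_TO_PREFERRED = {
    "ammon's horn": "hippocampus",
    "cornu ammonis": "hippocampus",
}


def shared_terms(db_terms, normalize=lambda term: term):
    """Return terms that, after normalization, occur in more than one database."""
    counts = {}
    for terms in db_terms.values():
        # Normalize within each database first so a database counts once per concept.
        for term in {normalize(t) for t in terms}:
            counts[term] = counts.get(term, 0) + 1
    return {term for term, n in counts.items() if n > 1}


exact = shared_terms(DB_TERMS)
with_synonyms = shared_terms(DB_TERMS, lambda t: SYNONYM_TO_PREFERRED.get(t, t))
print(sorted(exact))
print(sorted(with_synonyms))
```

In this toy data, string matching finds one shared term, while synonym normalization recovers a second concept that three databases name differently.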
16. NIF’s minimum requirements for
effective data sharing
You (and the machine) have to be able to
find it
Accessible through the web
Annotations
You have to be able to use it
Data type specified and in a usable form
You have to know what the data mean
Some semantics
Context: Experimental metadata
Provenance: Where did the data come from?
Reporting neuroscience data within a consistent framework helps enormously
17. What is an ontology?
Ontology: an explicit, formal representation of concepts and relationships among them within a particular domain that expresses human knowledge in a machine readable form
Branch of philosophy: a theory of what is
e.g., Gene ontologies
[Diagram: Brain has a Cerebellum; Cerebellum has a Purkinje Cell Layer; Purkinje Cell Layer has a Purkinje cell; Purkinje cell is a neuron]
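As a rough sketch of what "machine readable" means here, the Brain-to-neuron example on this slide can be written as subject, relation, object triples and traversed by a program. The relation names `has_a` and `is_a` and the helper functions are illustrative choices, not the NIFSTD encoding.

```python
# Triples transcribed from the slide's example.
TRIPLES = [
    ("Brain", "has_a", "Cerebellum"),
    ("Cerebellum", "has_a", "Purkinje Cell Layer"),
    ("Purkinje Cell Layer", "has_a", "Purkinje cell"),
    ("Purkinje cell", "is_a", "neuron"),
]


def parts_of(whole, triples=TRIPLES):
    """Return everything reachable from `whole` by following has_a transitively."""
    found, stack = [], [whole]
    while stack:
        current = stack.pop()
        for subject, relation, obj in triples:
            if subject == current and relation == "has_a" and obj not in found:
                found.append(obj)
                stack.append(obj)
    return found


def contains_kind(whole, kind, triples=TRIPLES):
    """True if some part of `whole` is asserted to be of class `kind`."""
    return any((part, "is_a", kind) in triples for part in parts_of(whole, triples))


print(parts_of("Brain"))
print(contains_kind("Brain", "neuron"))
```

The second call answers a question that no single triple states directly: the brain contains neurons, inferred by chaining the part relations.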
18. You need to use ontology identifiers instead of strings
Blah, blah, ontology blah
“Ontology as mathematics, computer science or Esperanto” - Andrey Rzhetsky and James A. Evans
19. What can ontology do for us?
“Esperanto!”
Express neuroscience concepts in a way that is machine readable
Classes are identified by unique identifiers
Synonyms, lexical variants
Definitions
Provide means of disambiguation of strings
Nucleus part of cell; nucleus part of brain; nucleus part of atom
Rules by which a class is defined, e.g., a GABAergic neuron is a neuron that releases
GABA as a neurotransmitter
Properties
Provide universals for navigating across different data sources
Semantic “index”
Perform reasoning
Link data through relationships not just one-to-one mappings
“Concept-based queries”
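The disambiguation point (nucleus of a cell, of the brain, of an atom) can be illustrated with a small lookup keyed by identifier rather than by string. The identifiers below (`EX:0001` and so on) are invented placeholders, not real ontology IDs, and the field names are assumptions.

```python
# Hypothetical concept table: one label, three distinct identifiers.
CONCEPTS = {
    "EX:0001": {"label": "nucleus", "part_of": "cell"},
    "EX:0002": {"label": "nucleus", "part_of": "brain"},
    "EX:0003": {"label": "nucleus", "part_of": "atom"},
}


def candidates(label, concepts=CONCEPTS):
    """All identifiers whose label matches the string (case-insensitive)."""
    return sorted(cid for cid, c in concepts.items() if c["label"] == label.lower())


def resolve(label, context, concepts=CONCEPTS):
    """Narrow the candidates using the thing the concept is part of."""
    return [cid for cid in candidates(label, concepts)
            if concepts[cid]["part_of"] == context]


print(candidates("nucleus"))
print(resolve("nucleus", "brain"))
```

The string alone matches three concepts; the identifier plus one property picks out exactly one.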
20. Power of unique identifiers: Are you the M Martone who...
The Gene Wiki: community intelligence applied to human gene annotation.
Huss JW 3rd, Lindenbaum P, Martone M, Roberts D, Pizarro A, Valafar F, Hogenesch
JB, Su AI. Nucleic Acids Res. 2010 Jan;38(Database issue):D633-9.
Ontologies for Neuroscience: What are they and What are they Good for? Larson
SD, Martone ME. Front Neurosci. 2009 May;3(1):60-7. Epub 2009 May 1.
Three-dimensional electron microscopy reveals new details of membrane systems for
Ca2+ signaling in the heart. Hayashi T, Martone ME, Yu Z, Thor A, Doi M, Holst
MJ, Ellisman MH, Hoshijima M. J Cell Sci. 2009 Apr 1;122(Pt 7):1005-13.
Some analyses of forgetting of pictorial material in amnesic and demented
patients. Martone M, Butters N, Trauner D. J Clin Exp Neuropsychol. 1986 Jun;8(3):161-78.
Traumatic brain injury and the goals of care. Martone M. Hastings Cent Rep. 2006 Mar-
Apr;36(2):3.
Three-dimensional pattern of enkephalin-like immunoreactivity in the caudate nucleus of the
cat. Groves PM, Martone M, Young SJ, Armstrong DM. J Neurosci. 1988 Mar;8(3):892-900.
21. I am not a number (but I should
be)
Full URI (Uniform Resource Identifier): http://orcid.org/1234567
Label: Maryann Elizabeth Martone
Synonym: ME Martone, M M Martone; Martone, Maryann
Abbreviation: MEM
[Diagram nodes: Female (Is a); Publications (Has a); Dept of Psychiatry, UCSD; Boston VA Hospital; Nelson Butters]
Is that entity which has these properties
Text mining algorithms can discover a lot of things about me
ORCID project: Author IDs
22. NIF Semantic Framework: NIFSTD ontology
NIFSTD modules: Organism; Anatomical Structure; Cell; Subcellular structure; Molecule; Macromolecule; Gene; Molecule Descriptors; NS Function; Dysfunction; Quality; Investigation; Techniques; Resource; Instrument; Reagent; Protocols
NIF covers multiple structural scales and domains of relevance to neuroscience
Aggregate of community ontologies with some extensions for neuroscience, e.g., Gene Ontology, ChEBI, Protein Ontology
Simple, basic “is a” hierarchies that can be used “as is” or to form the building blocks for more complex representations
23. “We studied the behavior of CA2-binding proteins in
Ca2 neurons under high and low Ca2 conditions ”
NIF queries across 170+ independent databases
BioGrid
Allen Brain Atlas
Brain Info
24. But you don’t have what I need!
•Provide a simple framework for defining the concepts required
•Cell, Part of brain, subcellular structure, molecule
•Community based:
•Communities contribute their vocabularies
•Reconcile and align concepts used by different domains
•Each concept gets its own unique identifier
•Creating a computable index for neuroscience data
•INCF Demo D03
http://neurolex.org
Stephen Larson/INCF
25. Concept-based search: search by meaning
Search Google: GABAergic neuron
Search NIF: GABAergic neuron
NIF automatically searches for types of
GABAergic neurons
Types of GABAergic
neurons
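A concept-based search of this kind can be sketched as a subclass expansion followed by an ordinary match. The hierarchy below is a small illustrative fragment, the records are invented, and `descendants` and `concept_search` are hypothetical names; none of this is NIF's actual code or the NIFSTD hierarchy.

```python
# Illustrative child -> parent ("is a") links.
SUBCLASS_OF = {
    "Purkinje cell": "GABAergic neuron",
    "medium spiny neuron": "GABAergic neuron",
    "GABAergic neuron": "neuron",
    "pyramidal cell": "glutamatergic neuron",
    "glutamatergic neuron": "neuron",
}

# Hypothetical data records, each annotated with a cell type.
RECORDS = [
    {"id": 1, "cell_type": "Purkinje cell"},
    {"id": 2, "cell_type": "pyramidal cell"},
    {"id": 3, "cell_type": "GABAergic neuron"},
]


def descendants(concept, subclass_of=SUBCLASS_OF):
    """Return every class that has `concept` somewhere in its ancestor chain."""
    result = set()
    for child in subclass_of:
        node = child
        while node in subclass_of:
            node = subclass_of[node]
            if node == concept:
                result.add(child)
                break
    return result


def concept_search(concept, records=RECORDS):
    """Match records annotated with the concept or any of its subclasses."""
    wanted = {concept} | descendants(concept)
    return [r["id"] for r in records if r["cell_type"] in wanted]


print(concept_search("GABAergic neuron"))
```

A plain string search for "GABAergic neuron" would return only record 3; the expanded search also returns the Purkinje cell record, which never mentions the word GABAergic.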
26. Esperanto!
“The trouble is that if I make up all of my own URIs, my [data]
has no meaning to anyone else unless I explain what each URI is
intended to denote or mean. Two [data sets] with no URIs in
common have no information that can be interrelated.”
NIF favors reuse of identifiers rather than mapping
NIF imports many ontologies
Creating ontologies to be used as common building blocks:
modularity, low semantic overhead, is important
Many community ontologies available covering multiple domains
NIFSTD available via web services
Bioportal (http://bioportal.bioontology.org/)
http://www.rdfabout.com/intro/#Introducing%20RDF
27. NIF Analytics: The Neuroscience Ecosystem
Where are the data?
[Figure: data source versus brain region; labeled regions include Brain, Striatum, Hypothalamus, Olfactory bulb, Cerebral cortex]
NIF is in a unique position to answer questions about the neuroscience ecosystem
Vadim Astakhov, Kepler Workflow Engine
28. Whither neuroscience information?
What is potentially knowable: ∞
What is known: Literature, images, human knowledge
Unstructured; Natural language processing, entity recognition, image processing and analysis; communication
What is easily machine processable and accessible
29. Open world meets closed world
But...NIF has > 900,000
antibodies, 250,000 model
organisms, and 3 million microarray
records
Query for “reference” brain structures and their parts in NIF Connectivity database
30. Gender bias
NIF can start to
answer interesting
questions about
neuroscience
research, not just
about neuroscience
NIF Reports:
Male vs Female
31. What have we learned: Grabbing
the long tail of small data
Analysis of NIF shows multiple databases with similar scope and content
Many contain partially overlapping data
Data “flows” from one resource to the next
Data is reinterpreted, reanalyzed or added to
Is duplication good or bad?
32. Embracing duplication: Data Mash ups
•NIF queries across 3 of approximately 10 fMRI databases
•~300 PMIDs were common between Brede and SUMSdb
•PMID serves as a unique identifier for an article
•Same information; value added
Same data; different aspects
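Because the PMID identifies the article in both sources, a mash-up reduces to a join on that key. The sketch below assumes each source can be represented as a dictionary keyed by PMID; the PMIDs and field names (`foci`, `surface_maps`) are placeholders, not real records from Brede or SUMSdb.

```python
# Hypothetical records keyed by PMID; values are placeholders.
brede = {
    "10000001": {"foci": 12},
    "10000002": {"foci": 5},
}
sumsdb = {
    "10000002": {"surface_maps": 3},
    "10000003": {"surface_maps": 1},
}


def mash_up(a, b):
    """Merge the records of two sources for every PMID they share."""
    return {pmid: {**a[pmid], **b[pmid]} for pmid in a.keys() & b.keys()}


combined = mash_up(brede, sumsdb)
print(combined)
```

Each merged record carries the aspects contributed by both sources, which is the "same data; different aspects" value the slide describes.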
33. Same data: different analysis
Chronic vs acute morphine in striatum
Gemma: Gene ID + Gene Symbol
DRG: Gene name + Probe ID
Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so direction of change is opposite in the 2 databases
Analysis:
1370 statements from Gemma regarding gene expression as a function of chronic morphine
617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis
Results for 1 gene were opposite in DRG and Gemma
45 did not have enough information provided in the paper to make a judgment
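The baseline issue described above can be made concrete with a small comparison routine. This is a sketch under stated assumptions: gene names and directions are invented, each source is reduced to a gene-to-direction dictionary, and the Gemma direction is flipped before comparison because the two sources report change relative to opposite baselines. It is not the procedure used for the analysis on the slide.

```python
# Flipping direction converts "relative to chronic morphine" into "relative to saline".
FLIP = {"up": "down", "down": "up"}

# Hypothetical statements: gene -> reported direction of change.
gemma = {"geneA": "up", "geneB": "down", "geneC": "up"}
drg = {"geneA": "down", "geneB": "down"}


def compare(gemma_calls, drg_calls):
    """Classify each Gemma statement after putting it on the DRG baseline."""
    outcome = {"consistent": [], "opposite": [], "unconfirmed": []}
    for gene, direction in sorted(gemma_calls.items()):
        if gene not in drg_calls:
            outcome["unconfirmed"].append(gene)
        elif FLIP[direction] == drg_calls[gene]:
            outcome["consistent"].append(gene)
        else:
            outcome["opposite"].append(gene)
    return outcome


result = compare(gemma, drg)
print(result)
```

Without the flip, geneA would look contradictory and geneB would look confirmed, which is exactly the misreading the differing baselines invite.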
34. Taking a global view on data:
microculture to ecosystem
Several powerful trends should change the way we think about our data:
Many data
Generation of data is getting easier
Data space is getting richer: more –omes every day
But...compared to the biological space, still sparse
Many eyes
Wisdom of crowds
More than one way to interpret data
Many algorithms
Not a single way to analyze data
Many analytics
“Signatures” in data may not be directly related to the question for which they were acquired but tell us something really interesting
[Diagram: One; Many; shared data]
Are you exposing or burying your work?
35. The future of scientific
communication
We have learned over the years how to write a scientific paper for other humans to read and for other agents to index
We now have to learn how to write papers for automated agents (and their humans) to mine
We have learned over the years to report data in papers for humans to read
We now have to learn how to publish data in a form and on a suitable platform for automated agents (and their humans) to mine
[Images: Printing press; Linked data cloud; Watson]
Reporting neuroscience data within a consistent framework helps enormously
36. Why does it matter?
47/50 major preclinical published cancer studies could not be replicated
“The scientific community assumes that the claims in a preclinical study can be taken at face value - that although there might be some errors in detail, the main message of the paper can be relied on and the data will, for the most part, stand the test of time. Unfortunately, this is not always the case.”
“There are no guidelines that require all data sets to be reported in a paper; often, original data are removed during the peer review and publication process.”
Begley and Ellis, 29 MARCH 2012 | VOL 483 | NATURE | 531
Getting data out sooner in a form where they can be exposed to many eyes and many analyses, and easily compared, may allow us to expose errors and develop better metrics to evaluate the validity of data
Data, not just stories about them!
37. Register your resource to NIF!
“How do I share my data?”
“There is no database for my data”
[Diagram: numbered options 1 to 4, with labels: Institutional repositories; Cloud; INCF: Global infrastructure; Community database: beginning; Community database: end; and sectors: Education, Industry, Government]
NIF is designed to leverage existing investments in resources and infrastructure
38. It’s a messy ecosystem (and that’s OK)
NIF favors a hybrid, tiered, federated system
Domain knowledge: Ontologies (Gene, Organism, Neuron, Brain part, Disease)
Claims about results: Virtuoso RDF triples, e.g., Caudate projects to Snpc; Grm1 is upregulated in chronic cocaine; Betz cells degenerate in ALS
Data: Data federation
Workflows
Narrative
39. Future of Research Communications
and e-Scholarship
FORCE11: http://force11.org
Founded by Phil Bourne, Tim
Clark, Ed Hovy, Anita de Waard
and Ivan Herman
Bring together stakeholders with
an interest in moving scholarly
communication beyond reliance
on papers and traditional impact
metrics
Beyond the PDF 2: Spring 2013
40. NIF team (past and present)
Jeff Grethe, UCSD, Co Investigator, Interim PI
Amarnath Gupta, UCSD, Co Investigator
Anita Bandrowski, NIF Project Leader
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen, Washington University
Erin Reid
Paul Sternberg, Cal Tech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli, George Mason University
Sridevi Polavarum
Karen Skinner, NIH, Program Officer
Fahim Imam, NIF Ontology Engineer
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Lee Hornbrook
Binh Ngo
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark, Harvard University
Paolo Ciccarese
41. Why do we create so many
overlapping products?
“That which I cannot build, I cannot understand”
Don’t trust any data you haven’t generated
Oh, now I see what you are saying
Scientists know the domain, not informatics
Yes, we are planning to do that...
We are all time and resource constrained
We extend projects in time
Science is incremental; we build on the results of others
It’s ingrained in our culture
“Build a better mousetrap and the world will beat down our doors”
Little credit for making someone else’s product better
There’s more than one way to skin a cat....
We are still mastering the medium
Technology is developing fast
42. You need to use ontology identifiers instead of strings
Blah, blah, ontology blah
When I talk to resource providers, neuroscientists (and journal editors)...
Editor's Notes
Doesn’t do it well; doesn’t organize the results in a domain specific way; doesn’t search across it
For use as content goal
Dynamic inventory for deep coverage of neuroscience data: Genes -> Systems