2. Assembled the largest searchable collation of neuroscience data on the web
The largest catalog of biomedical resources (data, tools, materials, services) available
The largest ontology for neuroscience
NIF search portal: simultaneous search over data, the NIF catalog, and the biomedical literature
NeuroLex Wiki: a community wiki serving neuroscience concepts
A unique technology platform
Cross-neuroscience analytics
A reservoir of cross-disciplinary biomedical data expertise
3. Uneven distribution of data volume, velocity, variability, location, and availability across the sharing spectrum, from least shared (useful for deep re-analysis) to most shared (useful for comprehension and discovery):
Raw data (in files) and data sets (in directories): local offline/online storage, IRs, PRs?
Data collections and databases: specialized and general PRs, DBs
Processed data products and processes: DBs, web PRs, publications
Papers, with or without data: publications
Formal knowledge/ontologies and extracted/analyzed fact collections
Aggregates and resource hubs
NIF is aware of 761 repositories
4. 47 of 53 major published preclinical cancer studies could not be replicated
“The scientific community assumes that the claims in a preclinical study can be taken at face value – that although there might be some errors in detail, the main message of the paper can be relied on and the data will, for the most part, stand the test of time. Unfortunately, this is not always the case.”
Getting data out sooner, in a form where they can be exposed to many eyes and many analyses and easily compared, may allow us to expose errors and develop better metrics to evaluate the validity of data.
Begley and Ellis, Nature 483:531, 29 March 2012
“There are no guidelines that require all data sets to be reported in a paper; often, original data are removed during the peer review and publication process.”
“There must be more opportunities to present negative data.”
Significant cross-linking between original papers and supporting/refuting papers/data
Courtesy: Maryann Martone
5. Hello All,
Thank you to the people who are taking a look at the data in tera15 :-)
There is a whole lot of data (about 8+ TB) that can be looked at and/or removed.
If you had assistants, students, or volunteers who assisted you in processing data, please locate those folders and remove any duplicate or unused data.
This will help EVERYONE have space to process new data.
Any old data that has been sitting in tera15 untouched for more than 4 years will be moved to a different area for deletion.
Please take a look carefully!
6. For every neuroscientist, for every experiment he/she runs:
For every data set that leads to positive or negative results
Store the data in some shared or on-demand repository
Annotate the data with experimental and other contextual information
Perform some analysis and contribute your analysis method to the repository where the data is being stored
For every analysis result
Keep the complete processing provenance of the result
Point back to the data set or data element that contributed to the analysis; specifically mark positively and negatively contributing data
If an error is pointed out in some result
Provide an explanation of the error
Create a pointer back to the relevant part of the publication and to the part of the data set or data element that produced the error
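A minimal sketch of what such a submission record could look like; the schema, the field names, and the DOI-style identifier are illustrative assumptions, not an existing standard:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List

@dataclass
class AnalysisStep:
    """One link in the processing provenance chain (hypothetical schema)."""
    method_id: str               # the analysis method, itself stored in the repository
    parameters: dict             # parameter set used for this invocation
    positive_inputs: List[str]   # data elements that contributed positively
    negative_inputs: List[str]   # data elements that contributed negatively

@dataclass
class DataSetRecord:
    """A shared-repository entry bundling data, context, and provenance."""
    dataset_id: str           # unique structured ID (e.g., an extended DOI)
    experiment_context: dict  # experimental and other contextual annotations
    outcome: str              # "positive" or "negative" result
    provenance: List[AnalysisStep] = field(default_factory=list)
    submitted_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

# A negative result is stored with the same care as a positive one:
record = DataSetRecord(
    dataset_id="doi:10.9999/lab42.ds.0017",  # hypothetical identifier
    experiment_context={"species": "mouse", "assay": "patch clamp"},
    outcome="negative",
)
```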
7. For every publication, for every result reported:
Create a pointer back to all data used in that section
For every experimental object (e.g., reagents, or auxiliary data from another group) used, create an appropriate, if needed time-stamped, pointer to the correct version of the data
For every repository/database … that holds the data:
Ensure rapid availability
Allow scientists to download or perform in-place analyses
Adhere to appropriate data standards
Keep all data and references consistent
Permit multiple simultaneous analyses by different users
Allow searching/browsing/querying over all possible metadata
Diverse distributed infrastructures consisting of individual researchers in different institutions, institutional repositories, public data centers, publishers, annotators and aggregators, bioinformaticians …
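One way such a time-stamped, version-pinned pointer could be encoded; the base-ID/version/interval scheme below is an illustration, not a published identifier standard:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class DataPointer:
    """A pointer from a publication result to the exact data it used
    (hypothetical scheme: base ID, pinned version, optional sub-interval)."""
    base_id: str                       # e.g., the dataset's DOI
    version: str                       # exact version used in the reported result
    interval: Optional[str] = None     # interval of the data object, if applicable
    accessed_at: Optional[str] = None  # time stamp of use, for reproducibility

    def to_uri(self) -> str:
        uri = f"{self.base_id}@{self.version}"
        return uri + (f"#{self.interval}" if self.interval else "")

# A result section would carry one such pointer per experimental object used:
ref = DataPointer("doi:10.9999/lab42.ds.0017", "v2.1",
                  interval="rows:100-250", accessed_at="2013-05-01T12:00:00Z")
print(ref.to_uri())  # doi:10.9999/lab42.ds.0017@v2.1#rows:100-250
```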
8. Scalable, Elastic Storage and Computation
Service Expectations
Scalable search and query across structured, semi-structured, and unstructured data
Facts – What neurons do Purkinje cells project to?
Resources – What are recent data sets on biomarkers for SMA?
Analytical Results – What animal models have similar phenotypes to Parkinson’s disease?
Landscape Surveys – Who has what data holdings on neurodegenerative diseases?
Active Analyses
Combining these data and mine, compute how the connectivity of the human brain differs from that of non-human primates
Perform GO-enrichment analysis on all genes upregulated in Alzheimer’s on all available data and compare with my results
Tracebacks
What data and processing have been used to reach this result in this paper?
Which publication refuted the claims in this paper, and how?
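The traceback questions amount to walking a provenance graph backward from a reported result; a minimal sketch, with the graph hard-coded where a real ecosystem would query a provenance store (all IDs are invented for illustration):

```python
from collections import deque

# "derived_from" edges: a result mapped to the data/code/analyses it came from.
derived_from = {
    "paper:X/fig3": ["analysis:go-enrichment-run-7"],
    "analysis:go-enrichment-run-7": ["doi:10.9999/ds.0017@v2.1", "code:go-tool@1.4"],
}

def traceback(result_id: str) -> set:
    """Collect every data and processing ID upstream of result_id."""
    seen, queue = set(), deque([result_id])
    while queue:
        for parent in derived_from.get(queue.popleft(), []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

print(traceback("paper:X/fig3"))
# -> the analysis run, the pinned data set version, and the tool version
```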
10. If all neuroscientists wanted to comply with this data sharing today, would the current infrastructure be able to support it?
Is enough attention being paid to an overarching architecture and interoperation protocol for data sharing?
Is today’s technology properly harnessed to create a holistic data-sharing infrastructure?
What would motivate neuroscientists and other players to really play their parts in data sharing?
Should there be a “monitoring scheme” to ensure proper data-sharing practices are actually happening?
11. The Data-Sharing Ecosystem is a distributed system that can be viewed as an operating system, where:
Each object has a set of unique structured IDs (e.g., extended DOIs) that identify
any data set, data object, or any interval of a data object
the semantic category of the data element
any human/software agent
any parameter set of a software invocation
A log is maintained and transmitted for each activity by any agent on any data element
Submission, transfer to a repository, pickup by an aggregator, creation of a derived product, being crawled by search services, …
These logs can be accessed by a central monitoring system covering the ecosystem, using a Twitter Storm-like infrastructure
Think of Facebook maintaining a log of different actions such as being present in the system, sending and accepting friend requests, posting comments and photos, starting and ending chat sessions, …
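A sketch of the kind of log event each agent would emit; the event fields are assumptions, and the in-memory list stands in for the Storm-like stream the monitoring system would consume:

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Optional

@dataclass
class ActivityEvent:
    """One log entry: an agent performed an action on a data element (hypothetical schema)."""
    agent_id: str   # unique structured ID of the human or software agent
    action: str     # e.g., "submit", "transfer", "aggregate", "derive", "crawl"
    object_id: str  # structured ID of the data set, object, or interval acted on
    params_id: Optional[str] = None  # parameter-set ID, for software invocations
    timestamp: str = ""

    def emit(self, stream: list) -> None:
        self.timestamp = datetime.now(timezone.utc).isoformat()
        stream.append(json.dumps(asdict(self)))  # in reality: publish to the monitor

stream = []
ActivityEvent("orcid:0000-0000-0000-0001", "submit", "doi:10.9999/ds.0017").emit(stream)
ActivityEvent("svc:aggregator-3", "derive", "doi:10.9999/ds.0017@v2.1").emit(stream)
```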
13. Update activities on data elements from data centers and repositories
Resource references from literature and web sites, including opinion sites like blogs and forums
Citation categories from automated/human-driven annotation systems like DataCite or DOMEO
Provenance chains from workflow systems like Kepler
Data derivation changes from rule-based metadata management systems like iRODS
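Each of these feeds has its own native schema, so adapters would normalize them into a common event form; a sketch, with both input shapes invented for illustration (real DataCite, DOMEO, Kepler, or iRODS records differ):

```python
# Adapters mapping source-specific records onto common event fields.
def from_citation_annotation(rec: dict) -> dict:
    """Annotation-system record (invented shape) -> common event."""
    return {"agent_id": rec["annotator"],
            "action": "cite:" + rec["category"],   # e.g., cite:supporting
            "object_id": rec["cited_data"]}

def from_workflow_provenance(rec: dict) -> dict:
    """Workflow-provenance record (invented shape) -> common event."""
    return {"agent_id": rec["workflow"], "action": "derive",
            "object_id": rec["output"], "params_id": rec.get("params")}

event = from_citation_annotation({"annotator": "domeo:user-19",
                                  "category": "supporting",
                                  "cited_data": "doi:10.9999/ds.0017"})
```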
14. Frequency and regularity of data creation vis-à-vis submission to the data-sharing ecosystem
Frequency and regularity of data usage of various kinds
Viewing, downloading, replication, uptake by software, …
Number of derived data products
Compounded by cascades of derived data
Cross-referencing of data and resources in publications
Compounded by publication data-citation cascades
Human and programmatic access to data
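Given the event log, these observables reduce to simple aggregations; a sketch over event dicts of the shape used above:

```python
from collections import Counter

def usage_counts(events: list) -> Counter:
    """Frequency of each usage action (view, download, derive, ...) per agent."""
    return Counter((e["agent_id"], e["action"]) for e in events)

def derived_products(events: list, dataset_id: str) -> int:
    """Direct derivations of one data set; cascades would recurse over these."""
    return sum(1 for e in events
               if e["action"] == "derive" and e["object_id"].startswith(dataset_id))

events = [{"agent_id": "svc:aggregator-3", "action": "derive",
           "object_id": "doi:10.9999/ds.0017@v2.1"}]
print(derived_products(events, "doi:10.9999/ds.0017"))  # 1
```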
15. Accountability Score: a measure of “good data citizenship”
Of people:
Increases with contribution of data and analyses
Decays (slowly) with time
Increases with references and citations
Increases with supporting work by others
Decreases with refutation
Decreases (rapidly) with paper retraction
Of publications:
Increases with addition of reference-able data
Increases with data access
Increases with being kept up to date as the underlying data are updated
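One plausible functional form, assuming nothing beyond the behaviors listed above: event-driven increments on top of a slow exponential decay. The weights and the decay constant are placeholders, and are exactly the community-tunable parameters discussed on slide 18:

```python
import math

# Illustrative event weights: negative events are penalized asymmetrically,
# with retraction hit hardest. All numbers are placeholders.
WEIGHTS = {"contribution": 5.0, "citation": 2.0, "support": 3.0,
           "refutation": -4.0, "retraction": -50.0}
DECAY = 0.002  # slow decay per day; a community-tunable parameter

def updated_score(score: float, days_since_last_event: float, event: str) -> float:
    """Decay the current score over idle time, then apply the event's weight."""
    return score * math.exp(-DECAY * days_since_last_event) + WEIGHTS[event]

s = updated_score(40.0, 30.0, "citation")  # a citation after a quiet month
s = updated_score(s, 0.0, "retraction")    # a retraction drops the score sharply
```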
16. Influence: a classification and measure of the professional engagement one has in terms of data activity
A longer-term measure than the accountability score
Applies to all types of players in the ecosystem, including those who are only users
18. These measures do not hold for scientists who do not produce data
The measures are mostly designed for online activities and must be modified to match the dynamics of different scientific communities
Parameters like decay constants
Time window for score revision
Global scores should be supplemented by community scores, where a community is defined by the ontological regions in which one’s research lies, and computed per activity type rather than as a single overall score
19. This is Big Brother for science
This is going to create a bias against “non-performers”
Scientific errors will be penalized more than necessary
The algorithms can be manipulated to the advantage of some people over others
Smaller individuals/organizations will be penalized relative to better-funded, higher-throughput organizations
This will be hard to implement due to opposition from different groups and institutions
20. My speculations:
If the community decides that it needs data sharing, it will naturally gravitate toward some degree of judgment of those who don’t comply
Technology frameworks similar to what we discussed will be adopted within individual e-infrastructures
As more data become available and data-sharing efforts become successful, third-party watchers, like credit bureaus, that monitor scientists’ products with respect to data will emerge
Such scores would be used for community perception and in-kind incentives earlier than their adoption for formal evaluations
21. The real question is “How do we promote data sharing?”
Creating infrastructural elements and reusing today’s (and tomorrow’s) technological capabilities is not enough
We need a more holistic approach that factors in the human component
Using social activity analysis as a starting point, we should be able to build a monitoring-cum-incentivizing scheme for data sharing