2. Assembled the largest searchable collation of neuroscience data on the web
The largest catalog of biomedical resources (data, tools, materials, services) available
The largest ontology for neuroscience
NIF search portal: simultaneous search over data, the NIF catalog, and the biomedical literature
NeuroLex Wiki: a community wiki serving neuroscience concepts
A unique technology platform
Cross-neuroscience analytics
A reservoir of cross-disciplinary biomedical data expertise
3. Uneven distribution of data volume, velocity, variability, location, and availability across the sharing spectrum, from least shared (useful for deep re-analysis) to most shared (useful for comprehension and discovery):
Raw data (in files) and data sets (in directories): local offline/online storage, IRs, PRs?
Data collections and databases: specialized and general PRs, DBs
Processed data products and processes: DBs, web PRs, publications
Papers, with or without data: publications
Formal knowledge/ontologies and extracted/analyzed fact collections
Aggregates and resource hubs
NIF is aware of 761 repositories
4. 47 of 53 major published preclinical cancer studies could not be replicated
“The scientific community assumes that the claims in a preclinical study can be taken at face value – that although there might be some errors in detail, the main message of the paper can be relied on and the data will, for the most part, stand the test of time. Unfortunately, this is not always the case.”
Getting data out sooner, in a form where they can be exposed to many eyes and many analyses and easily compared, may allow us to expose errors and develop better metrics to evaluate the validity of data.
Begley and Ellis, Nature 483:531, 29 March 2012
“There are no guidelines that require all data sets to be reported in a paper; often, original data are removed during the peer review and publication process.”
“There must be more opportunities to present negative data.”
Significant cross-linking between original papers and supporting/refuting papers/data
Courtesy: Maryann Martone
5. Hello All,
Thank you to the people who are taking a look at the data in tera15 :-)
There is a whole lot of data (about 8+ TB) that can be looked at and/or removed.
If you had assistants, students, or volunteers who assisted you in processing data, please locate those folders and remove any duplicate or unused data.
This will help EVERYONE have space to process new data.
Any old data that has been sitting in tera15 untouched for more than 4 years will be moved to a different area for deletion.
Please take a look carefully!
6. For every neuroscientist, for every experiment he/she runs:
For every data set that leads to positive or negative results
Store the data in some shared or on-demand repository
Annotate the data with experimental and other contextual information
Perform some analysis and contribute your analysis method to the repository where the data is being stored
For every analysis result
Keep the complete processing provenance of the result
Point back to the data set or data element that contributed to the analysis; specifically mark positively and negatively contributing data
If an error is pointed out in some result
Provide an explanation of the error
Create a pointer back to the relevant part of the publication and to the part of the data set or data element that produced the error
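A minimal sketch of what such a submission record could look like; the schema, the field names, and the DOI-style identifier are illustrative assumptions, not an existing standard:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List

@dataclass
class AnalysisStep:
    """One link in the processing provenance chain (hypothetical schema)."""
    method_id: str               # the analysis method, itself stored in the repository
    parameters: dict             # parameter set used for this invocation
    positive_inputs: List[str]   # data elements that contributed positively
    negative_inputs: List[str]   # data elements that contributed negatively

@dataclass
class DataSetRecord:
    """A shared-repository entry bundling data, context, and provenance."""
    dataset_id: str           # unique structured ID (e.g., an extended DOI)
    experiment_context: dict  # experimental and other contextual annotations
    outcome: str              # "positive" or "negative" result
    provenance: List[AnalysisStep] = field(default_factory=list)
    submitted_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

# A negative result is stored with the same care as a positive one:
record = DataSetRecord(
    dataset_id="doi:10.9999/lab42.ds.0017",  # hypothetical identifier
    experiment_context={"species": "mouse", "assay": "patch clamp"},
    outcome="negative",
)
```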
7. For every publication, for every result reported:
Create a pointer back to all data used in that section
For every experimental object (e.g., reagents, or auxiliary data from another group) used, create an appropriate, if needed time-stamped, pointer to the correct version of the data
For every repository/database … that holds the data:
Ensure rapid availability
Allow scientists to download or perform in-place analyses
Adhere to appropriate data standards
Keep all data and references consistent
Permit multiple simultaneous analyses by different users
Allow searching/browsing/querying over all possible metadata
Diverse distributed infrastructures consisting of individual researchers in different institutions, institutional repositories, public data centers, publishers, annotators and aggregators, bioinformaticians …
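One way such a time-stamped, version-pinned pointer could be encoded; the base-ID/version/interval scheme below is an illustration, not a published identifier standard:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class DataPointer:
    """A pointer from a publication result to the exact data it used
    (hypothetical scheme: base ID, pinned version, optional sub-interval)."""
    base_id: str                       # e.g., the dataset's DOI
    version: str                       # exact version used in the reported result
    interval: Optional[str] = None     # interval of the data object, if applicable
    accessed_at: Optional[str] = None  # time stamp of use, for reproducibility

    def to_uri(self) -> str:
        uri = f"{self.base_id}@{self.version}"
        return uri + (f"#{self.interval}" if self.interval else "")

# A result section would carry one such pointer per experimental object used:
ref = DataPointer("doi:10.9999/lab42.ds.0017", "v2.1",
                  interval="rows:100-250", accessed_at="2013-05-01T12:00:00Z")
print(ref.to_uri())  # doi:10.9999/lab42.ds.0017@v2.1#rows:100-250
```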
8. Scalable, Elastic Storage and Computation
Service Expectations
Scalable search and query across structured, semi-structured, and unstructured data
Facts – What neurons do Purkinje cells project to?
Resources – What are recent data sets on biomarkers for SMA?
Analytical Results – What animal models have similar phenotypes to Parkinson’s disease?
Landscape Surveys – Who has what data holdings on neurodegenerative diseases?
Active Analyses
Combining these data and mine, compute how the connectivity of the human brain differs from that of non-human primates
Perform GO-enrichment analysis on all genes upregulated in Alzheimer’s on all available data and compare with my results
Tracebacks
What data and processing have been used to reach this result in this paper?
Which publication refuted the claims in this paper, and how?
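The traceback questions amount to walking a provenance graph backward from a reported result; a minimal sketch, with the graph hard-coded where a real ecosystem would query a provenance store (all IDs are invented for illustration):

```python
from collections import deque

# "derived_from" edges: a result mapped to the data/code/analyses it came from.
derived_from = {
    "paper:X/fig3": ["analysis:go-enrichment-run-7"],
    "analysis:go-enrichment-run-7": ["doi:10.9999/ds.0017@v2.1", "code:go-tool@1.4"],
}

def traceback(result_id: str) -> set:
    """Collect every data and processing ID upstream of result_id."""
    seen, queue = set(), deque([result_id])
    while queue:
        for parent in derived_from.get(queue.popleft(), []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

print(traceback("paper:X/fig3"))
# -> the analysis run, the pinned data set version, and the tool version
```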
10. If all neuroscientists wanted to comply with this data sharing today, would the current infrastructure be able to support it?
Is enough attention being paid to an overarching architecture and interoperation protocol for data sharing?
Is today’s technology properly harnessed to create a holistic data-sharing infrastructure?
What would motivate neuroscientists and other players to really play their parts in data sharing?
Should there be a “monitoring scheme” to ensure proper data-sharing practices are actually happening?
11. The Data-Sharing Ecosystem is a distributed system that can be viewed as an operating system, where:
Each object has a set of unique structured IDs (e.g., extended DOIs) that identify
any data set, data object, or any interval of a data object
the semantic category of the data element
any human/software agent
any parameter set of a software invocation
A log is maintained and transmitted for each activity by any agent on any data element
Submission, transfer to a repository, pickup by an aggregator, creation of a derived product, being crawled by search services, …
These logs can be accessed by a central monitoring system covering the ecosystem, using a Twitter Storm-like infrastructure
Think of Facebook maintaining a log of different actions such as being present in the system, sending and accepting friend requests, posting comments and photos, starting and ending chat sessions, …
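A sketch of the kind of log event each agent would emit; the event fields are assumptions, and the in-memory list stands in for the Storm-like stream the monitoring system would consume:

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Optional

@dataclass
class ActivityEvent:
    """One log entry: an agent performed an action on a data element (hypothetical schema)."""
    agent_id: str   # unique structured ID of the human or software agent
    action: str     # e.g., "submit", "transfer", "aggregate", "derive", "crawl"
    object_id: str  # structured ID of the data set, object, or interval acted on
    params_id: Optional[str] = None  # parameter-set ID, for software invocations
    timestamp: str = ""

    def emit(self, stream: list) -> None:
        self.timestamp = datetime.now(timezone.utc).isoformat()
        stream.append(json.dumps(asdict(self)))  # in reality: publish to the monitor

stream = []
ActivityEvent("orcid:0000-0000-0000-0001", "submit", "doi:10.9999/ds.0017").emit(stream)
ActivityEvent("svc:aggregator-3", "derive", "doi:10.9999/ds.0017@v2.1").emit(stream)
```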
13. Update activities on data elements from data centers and repositories
Resource references from literature and web sites, including opinion sites like blogs and forums
Citation categories from automated/human-driven annotation systems like DataCite or DOMEO
Provenance chains from workflow systems like Kepler
Data derivation changes from rule-based metadata management systems like iRODS
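Each of these feeds has its own native schema, so adapters would normalize them into a common event form; a sketch, with both input shapes invented for illustration (real DataCite, DOMEO, Kepler, or iRODS records differ):

```python
# Adapters mapping source-specific records onto common event fields.
def from_citation_annotation(rec: dict) -> dict:
    """Annotation-system record (invented shape) -> common event."""
    return {"agent_id": rec["annotator"],
            "action": "cite:" + rec["category"],   # e.g., cite:supporting
            "object_id": rec["cited_data"]}

def from_workflow_provenance(rec: dict) -> dict:
    """Workflow-provenance record (invented shape) -> common event."""
    return {"agent_id": rec["workflow"], "action": "derive",
            "object_id": rec["output"], "params_id": rec.get("params")}

event = from_citation_annotation({"annotator": "domeo:user-19",
                                  "category": "supporting",
                                  "cited_data": "doi:10.9999/ds.0017"})
```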
14. Frequency and regularity of data creation vis-à-vis submission to the data-sharing ecosystem
Frequency and regularity of data usage of various kinds
Viewing, downloading, replication, uptake by software, …
Number of derived data products
Compounded by cascades of derived data
Cross-referencing of data and resources in publications
Compounded by publication data-citation cascades
Human and programmatic access to data
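Given the event log, these observables reduce to simple aggregations; a sketch over event dicts of the shape used above:

```python
from collections import Counter

def usage_counts(events: list) -> Counter:
    """Frequency of each usage action (view, download, derive, ...) per agent."""
    return Counter((e["agent_id"], e["action"]) for e in events)

def derived_products(events: list, dataset_id: str) -> int:
    """Direct derivations of one data set; cascades would recurse over these."""
    return sum(1 for e in events
               if e["action"] == "derive" and e["object_id"].startswith(dataset_id))

events = [{"agent_id": "svc:aggregator-3", "action": "derive",
           "object_id": "doi:10.9999/ds.0017@v2.1"}]
print(derived_products(events, "doi:10.9999/ds.0017"))  # 1
```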
15. Accountability Score: a measure of “good data citizenship”
Of people:
Increases with contribution of data and analyses
Decays (slowly) with time
Increases with references and citations
Increases with supporting work by others
Decreases with refutation
Decreases (rapidly) with paper retraction
Of publications:
Increases with addition of reference-able data
Increases with data access
Increases with being kept up to date as the underlying data are updated
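One plausible functional form, assuming nothing beyond the behaviors listed above: event-driven increments on top of a slow exponential decay. The weights and the decay constant are placeholders, and are exactly the community-tunable parameters discussed on slide 18:

```python
import math

# Illustrative event weights: negative events are penalized asymmetrically,
# with retraction hit hardest. All numbers are placeholders.
WEIGHTS = {"contribution": 5.0, "citation": 2.0, "support": 3.0,
           "refutation": -4.0, "retraction": -50.0}
DECAY = 0.002  # slow decay per day; a community-tunable parameter

def updated_score(score: float, days_since_last_event: float, event: str) -> float:
    """Decay the current score over idle time, then apply the event's weight."""
    return score * math.exp(-DECAY * days_since_last_event) + WEIGHTS[event]

s = updated_score(40.0, 30.0, "citation")  # a citation after a quiet month
s = updated_score(s, 0.0, "retraction")    # a retraction drops the score sharply
```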
16. Influence: a classification and measure of the professional engagement one has in terms of data activity
A longer-term measure than the accountability score
Applies to all types of players in the ecosystem, including those who are only users
18. These measures do not hold for scientists who do not produce data
The measures are mostly designed for online activities and must be modified to match the dynamics of different scientific communities
Parameters like decay constants
Time window for score revision
Global scores should be supplemented by community scores, where a community is defined by the ontological regions in which one’s research lies, and computed per activity type rather than as a single overall score
19. This is Big Brother for science
This is going to create a bias against “non-performers”
Scientific errors will be penalized more than necessary
The algorithms can be manipulated to the advantage of some people over others
Smaller individuals/organizations will be penalized relative to better-funded, higher-throughput organizations
This will be hard to implement due to opposition from different groups and institutions
20. My speculations:
If the community decides that it needs data sharing, it will naturally gravitate toward some degree of judgment of those who don’t comply
Technology frameworks similar to what we discussed will be adopted within individual e-infrastructures
As more data become available and data-sharing efforts become successful, third-party watchers, like credit bureaus, that monitor scientists’ products with respect to data will emerge
Such scores would be used for community perception and in-kind incentives earlier than their adoption for formal evaluations
21. The real question is “How do we promote data sharing?”
Creating infrastructural elements and reusing today’s (and tomorrow’s) technological capabilities is not enough
We need a more holistic approach that factors in the human component
Using social activity analysis as a starting point, we should be able to build a monitoring-cum-incentivizing scheme for data sharing