FORCE11: Creating a data and tools ecosystem

Maryann E. Martone, Ph. D.
Executive Director
Professor of Neuroscience, University of California, San Diego
Future of Research Communications and E-Scholarship

What is FORCE11?
Future of Research Communications and E-Scholarship:
A grass roots effort to accelerate the pace and nature
of scholarly communications and e-scholarship through
technology, education and community
Why 11? We were born in 2011 in Dagstuhl,
Germany
Principles laid out in the FORCE11 Manifesto
FORCE11 launched in July 2012

Who is FORCE11?
Anyone who has a stake in moving scholarly communication into the 21st century
Publishers
Library and information scientists
Policy makers
Tool builders
Funders
Scholars
Science
Humanities
Social Sciences

FORCE11 Vision
• Modern technologies enable vastly improved knowledge transfer and far wider
impact; freed from the restrictions of paper, numerous advantages appear
• We see a future in which scientific information and scholarly communication more
generally become part of a global, universal and explicit network of knowledge
• To enable this vision, we need to create and use new forms of scholarly
publication that work with reusable scholarly artifacts
• To obtain the benefits that networked knowledge promises, we have to put in
place reward systems that encourage scholars and researchers to participate and
contribute
• To ensure that this exciting future can develop and be sustained, we have to
support the rich, variegated, integrated and disparate knowledge offerings
that new technologies enable
Beyond the PDF Visual Notes by De Jongens van de Tekeningen is licensed under a Creative Commons Attribution 3.0 Unported License.

Old Model: Single type of content;
single mode of distribution
Scholar
Library
Scholar
Publisher

The future is now...
Scholar
Consumer
Libraries
Data Repositories
Code Repositories
Community databases/platforms
OA
Curators
Social Networks
Peer Reviewers
Workflows
Data
Blogs/Wikis
Multimedia
Nanopublications
Narrative
Code

The duality of modern scholarship
Observation: Those who build information systems from the
machine side don’t understand the requirements of humans
very well.
Those who build information systems from the human side
don’t understand the requirements of machines very well.
Scholarship requires the ability to cite and track usage of
scholarly artifacts. In our current mode of working, there is no
way to easily track artifacts as they move through the
ecosystem; no way to incrementally add human expertise; no
way to alert everyone when things go wrong

Digital objects are a new beast
New modes of representation and verification
will be necessary
Trust: Not just
who produced it
but what
produced it

Impetus for change: Is our current
method serving science?
47/50 major preclinical
published cancer studies
could not be replicated
“The scientific community
assumes that the claims in a
preclinical study can be taken at
face value - that although there
might be some errors in detail,
the main message of the paper
can be relied on and the data
will, for the most part, stand
the test of time. Unfortunately,
this is not always the case.”
Begley and Ellis, 29 MARCH 2012 | VOL 483 | NATURE | 531

The scientific corpus is fragmented
• ~25 million articles
total, each covering a
fragment of the
biomedical space
• Each publisher owns a
fragment of a particular
field
• The current process is
inefficient and slow
Wiley
Elsevier
Macmillan
Oxford
Spinal Muscular Atrophy
Machine-based access requires that we take a global view
of the body scholarly and allow mining across content

A new platform for scholarly
communications
Components
• Authoring tools
– Optimized for mark up and linked content
• Containers
– Expand the objects that are considered “publications”
– Optimize the container for the content
• Processes
– Scholarship is code
• Mark up
– Data, claims, content suitable for the web
– Suitable identifier systems
• Reward systems
– Incentives to change
– Reward for new objects
Scholarship must move from a “single currency system”;
platforms must recognize diversity of output and representation

FORCE11.org
• Community platform
– Meetings
– Discussions
– Tools and resources
– Blogs
– Event calendar
– Community projects
• Promote
interoperability
– Data Citation
– Resource identification
initiative
500 members from diverse stakeholder groups
700

Beyond the PDF
• Conference/unconference where all
stakeholders come
together as equals to
discuss issues
– Publishers
– Technologists
– Scholars
– Library scientists
• Incubator for change
• What would you do to
change scholarly
communication?
San Diego, Jan 2011 ...... Amsterdam, March 2013........?2015
http://www.force11.org/beyondthepdf2
YES!!!
FORCE

Promote community, cross-fertilization and interoperability
• FORCE11 helps facilitate
communications across
disciplines and
communities
• Issues are not identical but
we can learn from each
other
– Enhanced publications
• Digital humanities +
– Dealing with data
• Science +
– Open Access
• Science +
“What is an ORCID id?”-computer scientist

ORCID
Data journals
Research Data Alliance
PeerJ, eLife
Workflows 4Ever
Dataverse
Impact Story, Rubriq
Sadie
Scalar
Resource for scholarly communications:
People, organizations, publications, tools

FORCE11 Working Groups
• FORCE11 provides a neutral convening place
for individuals to come together around issues
in scholarly communication
– FORCE11 provides web working space and
facilitation where possible
– 1K Challenge: Beyond the PDF
– Short term working groups with clear focus
• Deliverable specified
• Time line determined

Data: Whose problem is it?
Scholar
Library
Scholar
Publisher
Domain-specific Repository
Web site/Personal data management
Computing
Scholars, Data Repositories, Institutional Repositories taking ownership of
data. Where should it go? Sometimes it can’t go anywhere.

Is data like a
bibliographic record?
• Not uniform in
size
• Not uniform in
type
• Curation requires
deep
understanding of
domain
• Data is dynamic
• Data is fluid
Geoff Bilder, CrossRef

Surveying the resource
landscape
Neuroscience Information Framework http://neuinfo.org

Deep metadata
http://neuinfo.org
With the thousands of databases and other information sources
available, simple descriptive metadata will not suffice

A place to come together: Data
citation principles
•FORCE11 provides a neutral
space for bringing groups
together
•35 individuals
representing > 20
organizations concerned
with data citation
•Conducted a review of
current data citation
recommendations from 4
different organizations
•Arrived at a sense of
consensus principles
Data citation synthesis group:
http://www.force11.org/node/4381

Process
Synthesis: July-Sept 2013
Community feedback: Nov-Dec 2013
Revision: Jan 2014
Dissemination: Now
Data Citation Principles: Open for Endorsement

Joint Declaration of Data Citation
Principles
• Designed to be high
level and easy to
understand
• Supplemented with
a glossary,
references and
examples
http://www.force11.org/datacitation
1. Importance
2. Credit and attribution
3. Evidence
4. Unique Identification
5. Access
6. Persistence
7. Specificity and verifiability
8. Interoperability and
flexibility

Significance & Scope
• Sound, reproducible scholarship rests upon a
foundation of robust, accessible data.
• Data should be considered legitimate, citable products
of research.
• Data citation, like the citation of other evidence and
sources, is good research practice.
• The Joint Principles cover purpose, function and
attributes of citations.
• Specific practices vary across communities and
technologies – we recommend communities develop
practices for machine and human citations consistent
with these general principles.

Purpose
1. Importance. Data should be considered legitimate, citable
products of research. Data citations should be accorded the same
importance in the scholarly record as citations of other research
objects, such as publications [1].
2. Credit and attribution. Data citations should facilitate giving
scholarly credit and normative and legal attribution to all
contributors to the data, recognizing that a single style or
mechanism of attribution may not be applicable to all data [2].
3. Evidence. In scholarly literature, whenever and wherever a claim
relies upon data, the corresponding data should be cited [3].

Function
4. Unique Identification. A data citation should include a persistent
method for identification that is machine-actionable, globally
unique, and widely used by a community [4].
5. Access. Data citations should facilitate access to the data
themselves and to such associated metadata, documentation, code,
and other materials, as are necessary for both humans and
machines to make informed use of the referenced data [5].

Attributes
6. Persistence. Unique identifiers, and metadata describing the data
and its disposition, should persist -- even beyond the lifespan of
the data they describe [6].
7. Specificity and verifiability. Data citations should facilitate
identification of, access to, and verification of the specific data
that support a claim. Citations or citation metadata should include
information about provenance and fixity sufficient to facilitate
verifying that the specific timeslice, version and/or granular
portion of data retrieved subsequently is the same as was
originally cited [7].
8. Interoperability and flexibility. Data citation methods should be
sufficiently flexible to accommodate the variant practices among
communities, but should not differ so much that they compromise
interoperability of data citation practices across communities [8].

Generic Data Citation
(as it appears in printed reference list)
Note:
● Neither the format nor specific required elements are intended to be defined with this example. Formats, optional
elements, and required elements will vary across publishers and communities. [Principle 8: Interoperability and flexibility].
● As illustrated in the previous examples, intra-work citations may be accompanied with information including the specific
portion used. [Principles 7,8].
● As illustrated in the next example, printed citations should be accompanied by metadata that support credit, attribution,
specificity, and verification. [Principles 2, 5 and 7].
Author(s), Year, Dataset Title, Data Repository or Archive, Version, Global
Persistent Identifier
Principle 2: Credit and
Attribution (e.g. authors,
repositories or other
distributors and contributors)
Principle 4: Unique Identifier (e.g.
DOI, Handle). Principles 5 and 6,
Access and Persistence: A persistent
identifier that provides access and
metadata
Principle 7: Specificity and verification (e.g. the specific
version used).
Versioning or timeslice information should be supplied with
any updated or dynamic dataset.
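
As a rough illustration of the generic citation above, the following Python sketch assembles the six elements into a printed reference-list entry. The class name, field names, separator choices and all sample values are assumptions made for this example; the principles deliberately leave format and required elements to publishers and communities.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DataCitation:
    """Elements of the generic data citation; field names are illustrative."""
    authors: List[str]
    year: int
    title: str
    repository: str
    version: str
    identifier: str  # global persistent identifier, e.g. a DOI or Handle

    def format(self) -> str:
        """Render the elements in the order used by the generic example."""
        parts = [
            "; ".join(self.authors),
            str(self.year),
            self.title,
            self.repository,
            self.version,
            self.identifier,
        ]
        return ", ".join(parts) + "."


# Hypothetical values, invented purely for illustration.
citation = DataCitation(
    authors=["Doe J", "Roe R"],
    year=2014,
    title="Example Survey Dataset",
    repository="Example Data Archive",
    version="V2",
    identifier="doi:10.1234/example",
)
print(citation.format())
```

Keeping the version and the persistent identifier as separate fields reflects Principles 4 and 7: the identifier locates the dataset, while the version pins down which state of it was used.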

Placement of Citations
Intra-work:
● Should provide sufficient information to identify the cited data reference within the included
reference list.
● Citation to data should be in close proximity to claims relying on data. [Principle 3]
● May include additional information identifying the specific portion of data
supporting that claim. [Principle 7]
Example: The plots shown in Figure X show the distribution of selected measures from the main
data [Author(s), Year, portion or subset used].
Full Citation:
Citation may vary in style, but should be included in the full reference list along with citations to other
types of works.
Example:
References Section
Author(s), Year, Article Title, Journal, Publisher, DOI.
Author(s), Year, Dataset Title, Data Repository or Archive, Version, Global Persistent Identifier.
Author(s), Year, Book Title, Publisher, ISBN.

Citation Metadata
Author(s), Year, Dataset Title,
Data Repository or Archive,
Version, Global Persistent
Identifier.
Metadata
retrieval

<contributor role="" ORCIDid="">Name</contributor>
<!-- FIXITY and PROVENANCE -->
<fixity type="MD5">XXXX</fixity>
<fixity type="UNF">UNF:XXXX</fixity>
<!-- MACHINE UNDERSTANDABILITY -->
<content-type>data</content-type>
<format>HDF5</format>
Note:
● Metadata location, formats, and elements will vary
across publishers and communities. [Principle 8]
● Citation metadata is needed in addition to the
information in the printed citation.
● Metadata describing the data and its disposition
should persist beyond the lifespan of the data.
[Principle 6]
● Citation metadata should support attribution and
credit [Principle 2]; machine use [Principle 5];
specificity and verification [Principle 7].
● For example, additional citation metadata may be
embedded in the citing document; attached to the
persistent identifier for the citation, through its
resolution service; stored in a separate community
indexing service (e.g. DataCite, CrossRef); or provided
in a machine-readable way through the surrogate
(“landing page”) presented by the repository to which
the identifier is resolved.
For more detail, see the References section.
http://www.force11.org/node/4772
EXAMPLE METADATA
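
One way to read the fixity element in the example metadata is as a checksum that lets a reader confirm the retrieved data are the same as the data originally cited (Principle 7). The Python sketch below is a minimal illustration under stated assumptions: the `<citation>` wrapper element, the function names and the sample data are hypothetical, and only the MD5 fixity type is handled (UNF is not computed here).

```python
import hashlib
import xml.etree.ElementTree as ET


def recorded_md5(metadata_xml: str) -> str:
    """Return the MD5 value stored in the first <fixity type="MD5"> element."""
    root = ET.fromstring(metadata_xml)
    for node in root.iter("fixity"):
        if node.get("type") == "MD5":
            return (node.text or "").strip().lower()
    raise ValueError("no MD5 fixity element in citation metadata")


def matches_citation(data: bytes, metadata_xml: str) -> bool:
    """True if the retrieved bytes hash to the MD5 recorded in the metadata."""
    return hashlib.md5(data).hexdigest() == recorded_md5(metadata_xml)


# Stand-in for the dataset as it was when cited (hypothetical content).
cited_data = b"subject,score\n1,0.5\n2,0.7\n"

# Hypothetical metadata record; the <citation> root is an assumption added
# so that the fragment parses as well-formed XML.
metadata = (
    "<citation>"
    f'<fixity type="MD5">{hashlib.md5(cited_data).hexdigest()}</fixity>'
    "<format>CSV</format>"
    "</citation>"
)

print(matches_citation(cited_data, metadata))               # same bytes as cited
print(matches_citation(cited_data + b"3,0.9\n", metadata))  # dataset changed later
```

The first check succeeds and the second fails, which is the behavior a versioned or dynamic dataset needs: a later revision no longer matches the fixity recorded with the original citation.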

Growing Adoption
https://www.force11.org/datacitation/endorsements

Endorse the Principles!
• http://www.force11.org/datacitation/endorsements
148 individuals; 60 organizations

Unique ID’s for all! Resource
Identification Initiative
• It is currently impossible
to query the biomedical
literature to find out
what research resources
have been used to
produce the results of a
study
• Impossible to find all
studies that used a
resource
• Critical for
reproducibility and data
mining
• Critical for troubleshooting
http://www.force11.org/resource_identification_initiative
Faulty Antibodies Continue to Enter US and
European Markets, Warns Top Clinical
Chemistry Researcher (Genome Web Daily,
October 11, 2013)

Resource Identification Initiative
• Have authors supply
appropriate identifiers for
key resources used within
a study such that they
are:
– Machine processible (i.e.,
unique identifier that
resolves to a single
resource)
– Outside of the paywall
– Uniform across journals
and publishers
Launched February 2014: > 30 journals
participating

Pilot Project
• Have authors identify 3 different types
of research resources:
– Software tools and databases
– Antibodies
– Genetically modified animals
• Include RRID in methods section
• RRID=RRID:Accession number
– Just a string at this point
• Voluntary for authors
• Journals did not have to modify their
submission system
• Journals have flexibility in
implementation. Send request to
author at:
– Submission
– During review
– After acceptance
http://scicrunch.com/resources
Resource Identification Portal: Aggregates
accession numbers from >10 different
databases that are the authorities for
registering research resources
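
Because an RRID is just a string of the form RRID:accession number, it can be picked out of a methods section with ordinary text matching. The Python sketch below shows one possible approach; the regular expression, the function name and the sample methods text are assumptions for illustration (the accession AB_0000000 is a made-up placeholder, while nif-0000-30467 is the identifier quoted elsewhere in this deck), and the pattern is not an official RRID grammar.

```python
import re
from typing import List

# Assumed pattern: "RRID:" followed by letters, digits, underscores or hyphens.
RRID_PATTERN = re.compile(r"RRID:\s*([A-Za-z0-9_-]+)")


def extract_rrids(text: str) -> List[str]:
    """Return the distinct RRIDs found in text, in order of first appearance."""
    seen: List[str] = []
    for accession in RRID_PATTERN.findall(text):
        rrid = f"RRID:{accession}"
        if rrid not in seen:
            seen.append(rrid)
    return seen


# Hypothetical methods text written for this example.
methods = (
    "Images were analyzed with a registered software tool (RRID:nif-0000-30467). "
    "Sections were stained with a primary antibody, RRID:AB_0000000, and "
    "re-imaged with the same tool (RRID:nif-0000-30467)."
)
print(extract_rrids(methods))
```

Running the extraction over many papers and grouping the results by identifier is the simplest way to answer the question of which studies used a given resource.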

First results are in the literature
Google Scholar: Search RRID; select since 2014

What studies used X?
To date:
•30 articles have appeared
•2 articles have disappeared, i.e.,
the RRID’s were removed at
copyediting
•195 RRID’s were reported
•14 were in error = 7.2%
•> 200 antibodies were added
•> 75 software tools/databases
were added
•A resolver service has been
created
•3rd party tools are being created
to provide linkage between
resources and papers
RRID:nif-0000-30467

What have we learned?
Utopia plug-in: Steve Pettifer
•Authors are willing to
adopt new types of
citations
•RRID = usage of
research resource
•Ideal: resolved by
search engines without
requiring specialized
citation services
•Citation drives
registration
•Clear role for
repositories as
authorities
•Should RRID’s be DOI’s?
Will system work
for data citation
and more
complicated
research objects?

Data Citation Implementation Group

FORCE11 Vision
• Modern technologies enable vastly improved knowledge transfer and far wider
impact; freed from the restrictions of paper, numerous advantages appear
• We see a future in which scientific information and scholarly communication more
generally become part of a global, universal and explicit network of knowledge
• To enable this vision, we need to create and use new forms of scholarly
publication that work with reusable scholarly artifacts
• To obtain the benefits that networked knowledge promises, we have to put in
place reward systems that encourage scholars and researchers to participate and
contribute
• To ensure that this exciting future can develop and be sustained, we have to
support the rich, variegated, integrated and disparate knowledge offerings
that new technologies enable
No single infrastructure serves everything; cooperation
in defining a global system of scholarly communication

Notes & References for Data Citation Principles
Notes
[1] CODATA 2013, Sec 3.2.1; Uhlir (ed.) 2012, Ch. 14; Altman & King 2007
[2] CODATA 2013, Sec 3.2, 7.2.3; Uhlir (ed.) 2012, Ch. 14
[3] CODATA 2013, Sec 3.1, 7.2.3; Uhlir (ed.) 2012, Ch. 14
[4] Altman & King 2007; CODATA 2013, Sec 3.2.3, Ch. 5; Ball & Duke 2012
[5] CODATA 2013, Sec 3.2.4, 3.2.5, 3.2.8
[6] Altman & King 2007; Ball & Duke 2012; CODATA 2013, Sec 3.2.2
[7] Altman & King 2007; CODATA 2013, Sec 3.2.7, 3.2.8
[8] CODATA 2013, Sec 3.2.10
References
• Altman, M., King, G. (2007). A Proposed Standard for the Scholarly Citation of
Quantitative Data. D-Lib Magazine.
• Ball, A., Duke, M. (2012). Data Citation and Linking. DCC Briefing Papers.
Edinburgh: Digital Curation Centre.
• CODATA-ICSTI Task Group on Data Citation (2013). Out of Cite, Out of Mind: The
Current State of Practice, Policy, and Technology for the Citation of Data. Data
Science Journal.
• Uhlir, P. (ed.) (2012). For Attribution: Developing Data Attribution and Citation
Practices and Standards. National Academies of Sciences.
