...how licensing can change
the way we do research
Nicole Nogoy
VUW, 7 March 2014
Open-Review
Open-Source
Open-Access
Open-Data
Journal, data-platform and database
for large-scale data
in conjunction with
Editor-in-Chief: Laurie Goodman
Executive Editor: Scott Edmunds
Commissioning Editor: Nicole Nogoy
Lead Curator: Chris Hunter
Data Platform: Peter Li
Data Scientist: Rob Davidson
www.gigasciencejournal.com
Take home message:
It's all about the re-use
To do this everything needs to be free
and accessible to be read by humans &
machines*
* See: http://www.biomedcentral.com/about/datamining
Era of Data-Driven Science
Big Potential:
Using networking power of the internet to tackle problems
Can ask new questions & find patterns & connections hidden in
others' data
Build on each other's efforts quicker & more efficiently
Harness wisdom of the crowds: crowdsourcing, citizen science
Big Challenges: cultural and technical
Removing silos and putting in the commons
Usability: interoperable standards/formats for humans/machines
Good for a field:
Genomics/Bioinformatics
Long term sharing infrastructure:
Strong use of standards/policies:
Plummeting cost/explosion in volumes:
Sharing aids specific communities…
Rice v Wheat: consequences of publicly available
genome data.
[Figure: number of papers, rice vs wheat; y-axis: Papers, 0 to 700]
Sharing aids authors…
Sharing Detailed Research
Data Is Associated with
Increased Citation Rate.
Piwowar HA, Day RS, Fridsma DB (2007)
PLoS ONE 2(3): e308.
doi:10.1371/journal.pone.0000308
Every 10 datasets collected contributes to at least 4 papers in the
following 3 years.
Piwowar, HA, Vision, TJ, & Whitlock, MC (2011). Data archiving is a good investment. Nature, 473
(7347), 285-285. DOI: 10.1038/473285a
Problem: growing replication gap
Out of 18 microarray papers, results
from 10 could not be reproduced
1. Ioannidis et al., (2009). Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14
2. Ioannidis JPA (2005) Why Most Published Research Findings Are False. PLoS Med 2(8)
Growing Issue: increasing number of retractions
>15X increase in last decade
Strong correlation of “retraction index” with
higher impact factor
At the current rate of increase, by 2045
as many papers will be retracted as
published!
1. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html
2. Retracted Science and the Retraction Index http://iai.asm.org/content/79/10/3855.abstract?
Reasons
• Data not available
• From the start – Lost over time
• Software not available
• From the start – Lost over time
• Lack of standards
• None established – Not followed
• Unclear methods
• Missing information
• Honest errors
• Pure and simple data fabrication
How a New Hope in Cancer Fell Apart - NYTimes.com
http://www.nytimes.com/2011/07/08/health/research/08genes.h...
July 7, 2011
How Bright Promise in Cancer Testing
Fell Apart
By GINA KOLATA
When Juliet Jacobs found out she had lung cancer, she was terrified, but realized that her
hope lay in getting the best treatment medicine could offer. So she got a second opinion,
then a third. In February of 2010, she ended up at Duke University, where she entered a
research study whose promise seemed stunning.
Doctors would assess her tumor cells, looking for gene patterns that would determine which
drugs would best attack her particular cancer. She would not waste precious time with
ineffective drugs or trial-and-error treatment. The Duke program — considered a
breakthrough at the time — was the first fruit of the new genomics, a way of letting a cancer
cell’s own genes reveal the cancer’s weaknesses.
But the research at Duke turned out to be wrong. Its gene-based tests proved worthless, and
the research behind them was discredited. Ms. Jacobs died a few months after treatment,
and her husband and other patients’ relatives have retained lawyers.
The episode is a stark illustration of serious problems in a field in which the medical
community has placed great hope: using patterns from large groups of genes or other
molecules to improve the detection and treatment of cancer. Companies have been formed
and products have been introduced that claim to use genetics in this way, but assertions
have turned out to be unfounded. While researchers agree there is great promise in this
science, it has yet to yield many reliable methods for diagnosing cancer or identifying the
best treatment.
Instead, as patients and their doctors try to make critical decisions about serious illnesses,
they may be getting worthless information that is based on bad science. The scientific world
is concerned enough that two prominent groups, the National Cancer Institute and the
Institute of Medicine, have begun examining the Duke case; they hope to find new ways to
evaluate claims based on emerging and complex analyses of patterns of genes and other
molecules.
GigaSolution: deconstructing the paper
Provide infrastructure and mechanisms of reward for:
• Data availability
• Metadata/curation
• Interoperability
• Availability of workflows
• Transparent analyses
[Diagram labels: Data, Metadata, Methods, Analyses]
GigaSolution: deconstructing the paper
Combines and integrates:
Open-access journal
Data Publishing Platform
Data Analysis Platform
Utilizes big-data infrastructure and expertise from:
World's largest genomics organisation with:
20PB storage, 20.5K cores, 212TFlops,
>1000 bioinformaticians
www.gigadb.org
www.gigasciencejournal.com
Importance of licensing: ability to mine & reuse content
Budapest Open Access Initiative:
“By “open access” to [peer-reviewed research literature], we mean its
free availability on the public internet, permitting any users to
read, download, copy, distribute, print, search, or link to the full texts
of these articles, crawl them for indexing, pass them as data to
software, or use them for any other lawful purpose, without
financial, legal, or technical barriers other than those inseparable from
gaining access to the internet itself. The only constraint on
reproduction and distribution, and the only role for copyright in this
domain, should be to give authors control over the integrity of their
work and the right to be properly acknowledged and cited.”
Needs to be:
NC and ND clauses impose unnecessary restrictions and are not counted as “true OA”
CC0 is better than CC-BY for datasets, to prevent “attribution stacking”
Importance of licensing: ability to mine & reuse content
• Gives authors control over the integrity of their work and the right
to be properly acknowledged and cited.
• Does not grant publicity rights, and attribution can be used to
clearly disclaim endorsement
• Restrictions rarely benefit author, and inhibit reuse
Prevents translations; causes incompatibility issues when mixing with other
licenses, with some combinations illegal (e.g. CC-NC-SA & CC-BY-SA); hinders
non-profits and mixed collaborations; is practically unenforceable; and
dealing with requests is more trouble than it's worth.
Use of non CC-BY by publishers = “double dipping” (selling content, reprints, etc.)
Further reading:
http://www.nature.com/nature/journal/v495/n7442/full/495440a.html
http://blogs.ch.cam.ac.uk/pmr/2011/11/29/scientists-should-never-use-cc-nc-this-explains-why/
New incentives/credit
Credit where credit is overdue:
“One option would be to provide researchers who release data to
public repositories with a means of accreditation.”
“An ability to search the literature for all online papers that used a
particular data set would enable appropriate attribution for those
who share.”
Nature Biotechnology 27, 579 (2009)
Prepublication data sharing
(Toronto International Data Release Workshop)
“Data producers benefit from creating a citable reference, as it can
later be used to reflect impact of the data sets.”
Nature 461, 168-170 (2009)
New incentives/credit
= Data Citation?
“increase acceptance of research data as
legitimate, citable contributions to the
scholarly record”.
“data generated in the course of research
are just as valuable to the ongoing
academic discourse as papers and
monographs”.
http://www.force11.org/datacitation
Anatomy of a Publication
Idea
Study
Metadata
Data
Analysis
Answer
Anatomy of a Data Publication
Idea
Study
Metadata
Data
Analysis
Answer
BGI Datasets Get DOIs
Invertebrate
Ant
- Florida carpenter ant
- Jerdon’s jumping ant
- Leaf-cutter ant
Roundworm
Schistosoma
Silkworm
Parasitic nematode
Pacific oyster
Human
Asian individual (YH)
- DNA Methylome
- Genome Assembly v1+2
- Transcriptome
Cancer (14TB)
Single cell bladder cancer
HBV infected exomes
Ancient DNA
- Saqqaq Eskimo
- Aboriginal Australian
Released pre-publication
Paper Published in GigaScience
Vertebrates
Darwin’s Finch
Giant panda
Macaque
- Chinese rhesus
- Crab-eating
Mini-Pig
Naked mole rat
Parrot, Puerto Rican
Penguin
- Emperor penguin
- Adelie penguin
Pigeon, domestic
Polar bear
DA and F344 rats
Sheep
Tibetan antelope
Microbe/metagenomics
E. coli O104:H4 TY-2482
T2D gut metagenome
Bulk pooled insects
T. tengcongensis proteome
Cell-Lines
Chinese Hamster Ovary
Mouse methylomes
Cancer quantitative proteomics
Plants
Chinese cabbage
Cucumber
Foxtail millet
Pigeonpea
Potato
Sorghum
Wheat A+B
Other
fMRI
Reward better handling of metadata…
Novel tools/formats for data interoperability/handling.
Cloud solutions?
BMC Research Awards 2013
Winner of open data award
Open-Source: the source of it all
Software community understands benefits
• Transparent, fast, collaborative
• Long history, large community
• Many licenses
• Many repositories
• Many users/platforms
New & more transparent peer-review:
Pre-publication: pre-prints
New & more transparent peer-review:
During-publication: open-review
BMC Series
Medical Journals
New & more transparent peer-review:
Post-publication review
Open content lets you do interesting things post-publication:
New pub models:
Comments, blogs, online journal clubs
Altmetrics:
Our first DOI:
To maximize its utility to the research community and aid those fighting
the current epidemic, genomic data is released here into the public domain
under a CC0 license. Until the publication of research papers on the
assembly and whole-genome analysis of this isolate we would ask you to
cite this dataset as:
Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G;
Wang, J; Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S;
Li, J; Peng, Y; Pu, F; Sun, Y; Chen, Y; Zong, Y; Ma, X; Yang, X; Cen, Z;
Zhao, X; Chen, F; Yin, X; Song, Y; Rohde, H; Li, Y; Wang, J; Wang, J and
the Escherichia coli O104:H4 TY-2482 isolate genome sequencing
consortium (2011)
Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI
Shenzhen. doi:10.5524/100001
http://dx.doi.org/10.5524/100001
To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to
Genomic Data from the 2011 E. coli outbreak. This work is published from: China.
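A minimal sketch of how a dataset citation like the one above can be handled by machines as well as humans. It assumes only what the slide shows: a DOI written as doi:10.5524/100001 and the resolver prefix http://dx.doi.org/. The function names (normalize_doi, doi_url, cite_dataset) are hypothetical helpers for illustration, not part of GigaDB or any DOI service.

```python
# Hypothetical helpers: turn a DOI as printed on a slide into a resolvable
# link and a consistently formatted dataset citation.

RESOLVER = "http://dx.doi.org/"  # resolver prefix as shown on the slide


def normalize_doi(raw: str) -> str:
    """Strip a leading 'doi:' label or resolver URL, leaving the bare DOI."""
    doi = raw.strip()
    lowered = doi.lower()
    for prefix in ("doi:", "http://dx.doi.org/", "https://doi.org/"):
        if lowered.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.strip()


def doi_url(raw: str, resolver: str = RESOLVER) -> str:
    """Build a resolvable link from any of the accepted DOI spellings."""
    return resolver + normalize_doi(raw)


def cite_dataset(authors: str, year: int, title: str, publisher: str, doi: str) -> str:
    """Format a dataset citation in the 'Authors (Year) Title. Publisher. doi:...' style."""
    return f"{authors} ({year}) {title}. {publisher}. doi:{normalize_doi(doi)}"


if __name__ == "__main__":
    print(doi_url("doi:10.5524/100001"))
    print(cite_dataset(
        "Li, D; et al.",
        2011,
        "Genomic data from Escherichia coli O104:H4 isolate TY-2482",
        "BGI Shenzhen",
        "http://dx.doi.org/10.5524/100001",
    ))
```

Storing the bare DOI and deriving the link and the citation text from it keeps the two from drifting apart when a dataset is cited in several places.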
The People's Parrot: Amazona vittata
Puerto Rican Parrot Genome Project
Rarest parrot, national bird of Puerto Rico
Community funded from artworks, fashion shows, crowdfunding…
Genome annotated by students in community college as part of bioinformatics education
Paper and Data published in GigaScience and GigaDB
Taras K Oleksyk, et al., (2012) A Locally Funded Puerto Rican Parrot (Amazona vittata) Genome Sequencing Project Increases Avian Data and Advances Young
Researcher Education. GigaScience 2012, 1:14
Steven J. O’Brien. (2012): Genome empowerment for the Puerto Rican parrot – Amazona vittata. GigaScience 2012, 1:13
Oleksyk et al., (2012): Genomic data of the Puerto Rican Parrot (Amazona vittata) from a locally funded project. GigaScience.
http://dx.doi.org/10.5524/100039
How are we supporting data
reproducibility?
Open-Data
Open-Paper
Data sets
DOI:10.5524/100038
78GB CC0 data
DOI:10.1186/2047-217X-1-18
~21,000 accesses
Open-Pipelines
Open-Workflows
Analyses
DOI:10.5524/100044
Open-Review
8 reviewers tested the data on the FTP server; named reports published
Open-Code
~21,000 downloads
Enabled the code to be picked apart by bloggers in a wiki
http://homolog.us/wiki/index.php?title=SOAPdenovo2
Code on SourceForge under GPLv3: http://soapdenovo2.sourceforge.net/
New & more transparent peer-review:
The GigaScience way:
8 referees downloaded & tested data, then signed reports
New & more transparent peer-review:
The GigaScience way:
Real-time open-review = paper in arXiv + blogged reviews
Implement workflows in a community-accepted format
Open source
Over 36,000 main Galaxy server users
Over 1000 papers citing Galaxy use
Over 55 Galaxy servers deployed
http://galaxyproject.org
Help us make it
happen!
Give us your data, papers
& pipelines*
Contact us:
nicole@gigasciencejournal.com
editorial@gigasciencejournal.com
database@gigasciencejournal.com
* APCs currently FREE until the end of
December 2014, saving you up to £1,250 –
courtesy of BGI
www.gigasciencejournal.com
Thanks to:
team:
Peter Li
Chris Hunter
Rob Davidson
Jesse Si Zhe
Scott Edmunds
Nicole Nogoy
Laurie Goodman
Follow us:
Our collaborators:
Ruibang Luo (BGI/HKU)
Shaoguang Liang (BGI-SZ)
Tin-Lap Lee (CUHK)
Huayen Gao (CUHK)
Qiong Luo (HKUST)
Senghong Wang (HKUST)
Yan Zhou (HKUST)
Funding from:
CBIIT
@gigascience
facebook.com/GigaScience
blogs.openaccesscentral.com/blogs/gigablog/
www.gigadb.org
galaxy.cbiit.cuhk.edu.hk
www.gigasciencejournal.com
Editor's Notes
Biology and biomedicine
Humor journal established in 1995. Fun, friendly JIR is a great escape from the harsh and the hassle of research. BUT… (next slide)
We are tracked by the Web of Science's Data Citation Index
Galaxy has a massively growing user base (>1000 new users a month)
Over 20,000 users on the main server
Over 500 papers citing the use of Galaxy
Over 55 servers deployed on the Web
That just leaves me to thank the GigaScience team: Laurie, Scott, Rob, Chris, Peter and Jesse; BGI for their support, specifically Shaoguang for IT and bioinformatics support; our collaborators on the database, website and tools: Tin-Lap, Qiong, Senhong and Yan; the Cogini web design team; DataCite for providing the DOI service; and the isacommons team for their support and advocacy for best-practice use of metadata reporting and sharing. Thank you for listening.