GigaDB is a new database integrated with the GigaScience journal to meet the needs of biological and biomedical research in the era of big data. It currently contains 36 public datasets across various domains including humans, plants, microbes, vertebrates and invertebrates. The database aims to improve reproducibility, usability, standards, and sharing of data.
Tam Sneddon: Revolutionizing data dissemination, organization and use.
1. Now taking submissions…
– revolutionizing data dissemination,
organization and use.
Tam Sneddon
BGI-Hong Kong
www.gigadb.org
2. Overview
Introduction
What is GigaScience, GigaDB / why we want your data and why you should submit to us?
Published datasets
Data Publishing
New database features
DOIs
Future tools: Galaxy/Cloud
4. DataCite goal: “increase acceptance of research as legitimate, citable contributions to the scholarly record”
5.
6. Currently: 36 public datasets
Humans
  Ancient DNA
    - Aboriginal Australian
    - Saqqaq Eskimo
  Asian individual (YH)
    - DNA methylome
    - Genome assembly
    - Transcriptome
  Cancer
    - Hepatocellular carcinoma
    - Single-cell bladder cancer
  Human exome – chronic hepatitis B infection predisposing variants
Vertebrates
  Darwin finch
  Giant panda
  Macaque
    - Chinese rhesus
    - Crab-eating
  Minipig
  Mouse methylomes
  Naked mole rat
  Penguin
    - Adelie penguin
    - Emperor penguin
  Pigeon, domestic
  Polar bear
  Sheep, domestic
  Tibetan antelope
Invertebrates
  Ant
    - Florida carpenter ant
    - Jerdon’s jumping ant
    - Leaf-cutter ant
  Roundworm
  Schistosoma haematobium
  Silkworm, domestic and wild
Plants
  Chinese cabbage
  Cucumber, domestic
  Foxtail millet
  Pigeonpea
  Potato
  Sorghum
Microbes
  E. coli O104:H4 TY-2482
  T2D gut metagenome
Cell-lines
  Chinese Hamster Ovary
7. Currently: 36 public datasets
Hepatocellular carcinoma (Cancer): *14TB*
8. Currently: 36 public datasets
***15 pre-publication***
9. Currently: 36 public datasets
*5 citations in the references*
Highlighted: Mouse methylomes, Sorghum, Transcriptome (Asian individual, YH), Polar bear, Single-cell bladder cancer
10. Currently: 36 public datasets
*5 citations in the references*
Highlighted: Sorghum
Complemented by data submitted to INSDC databases:
- Raw data: SRA:SRA046843
- Assemblies of 3 strains: GenBank:AHAO00000000-AHAQ00000000
- SNPs: dbSNP batch ids:1056306-10563068
- CNVs, InDels, SV: dbVar:nstd63
11. Currently: 36 public datasets
*5 citations in the references*
Highlighted: Transcriptome (Asian individual, YH)
12. Currently: 36 public datasets
*5 citations in the references*
Highlighted: Transcriptome (Asian individual, YH), Polar bear
13. Currently: 36 public datasets
*5 citations in the references*
Highlighted: Mouse methylomes, Single-cell bladder cancer
14. GigaDB is a new database integrated with the GigaScience journal to meet the needs of a new generation of biological
and biomedical research as it enters the era of “big-data”… (see more)
30. Thanks to:
Laurie Goodman
Scott Edmunds
Alexandra Basford
Peter Li
Jesse Si Zhe
Shaoguang Liang (BGI-SZ)
Tin-Lap Lee (CUHK)
Qiong Luo (HKUST)
Senghong Wang (HKUST)
Yan Zhou (HKUST)
Cogini
Contact us: editorial@gigasciencejournal.com, database@gigasciencejournal.com
Follow us: @gigascience, facebook.com/GigaScience, blogs.openaccesscentral.com/blogs/gigablog/
www.gigadb.org
Editor's Notes
I would like to thank the organizers for the opportunity to present the Giga database this evening. I realize I’m all that stands between you and the alcohol outside this room so I’ll try not to run over time!
I would first like to give a brief introduction to GigaDB and GigaScience, then I’ll describe GigaDB in more detail, say why we want your data, and hopefully give you convincing reasons WHY you should submit to us! I’ll then mention DataCite and what it means for a dataset to be assigned a DOI, then I’ll give you examples of some of the datasets in GigaDB and how they are cited and acknowledged, describe the features of the new GigaDB website (expected next month) and finally I’ll sum up with tools our team are working on and hope to integrate with GigaDB in the upcoming months.
Basically, GigaScience aims to be a home for large-scale biological and biomedical studies by providing a place for data hosting, and by giving authors additional credit for making their data available through the assignment of DOIs to each published dataset. The GigaScience journal is open-access and published online by BioMed Central in collaboration with BGI. Scott Edmunds, the Editor, is in the audience. GigaScience officially launched in July this year with GigaDB as the associated database built to host the supplementary files, images, software and any other data from the GigaScience article. The criteria and focus of both the GigaScience journal and the database include:
Reproducibility/Reuse: Because the GigaScience datasets are all open-access and assigned a DOI, they are stable and permanent, so results can be tested and reproduced and the data reused for reanalysis or comparison of new analyses.
Utility/Usability: The new GigaDB website will have integrated tools such as Galaxy and MyExperiment (which I’ll mention briefly at the end of my talk) to promote more widespread access, viewing and analysis of data. Integration of the BGI Cloud Computing resources for handling and analyzing large-scale data will allow any researcher to access and analyze the data, no matter how large or small their institution’s IT infrastructure.
Standards/Searchability/Sharing: We support the use of BioSharing and ISA-Tab to aid and promote best practice in metadata reporting and sharing so the data can be portable across other platforms. We mandate that all supporting data must be publicly available. And we encourage MIBBI (Minimum Information for Biological and Biomedical Investigations) compliance and use of community reporting checklists.
Data publishing/DOI: Finally, as mentioned, we register all datasets and their DOIs with DataCite so that they are citable, and we hope this will promote rapid release of data and encourage researchers to release their data pre-publication.
So, a little bit about DOIs or Digital Object Identifiers. DOIs are unique identifiers that are also resolvable to a webpage and have been used in the journal world for a long time to provide permanent identifiers and links to journal articles. We register our DOIs with DataCite, which was set up specifically for datasets and for providing incentives and credit to the data producers. Their goal is to “increase acceptance of research as legitimate, citable contributions to the scholarly record”. We automatically generate the metadata XML from GigaDB and provide as much as possible within the DataCite schema to aid discovery of the datasets via a central metadata repository (with an open API) and other metadata harvesters including the upcoming Data Citation Index by Thomson Reuters. For example, if you search DataCite for ‘GigaDB’ there are 35 records returned corresponding to the 35 published datasets in GigaDB. The 10.5524 prefix is unique to the GigaScience dataset project and our datasets start with Genomic Data from E. coli, the first DOI we released pre-publication, at 100001 and then go up sequentially. The first 5 datasets listed here just happen to be Genomic but we currently have Transcriptomic, Epigenomic and Metagenomic datasets, with Proteomic datasets in the pipeline and plans to extend to include the likes of biomedical imaging and environmental studies.
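As a rough illustration of the numbering scheme just described, the sketch below builds a dataset DOI from the 10.5524 prefix and a sequential suffix starting at 100001, and turns it into a resolvable link. The function names are hypothetical, and the assumption that suffixes are strictly sequential with no gaps is only what the description above suggests; the `https://doi.org/` resolver address is the general DOI resolver, not anything specific to GigaDB.

```python
GIGADB_PREFIX = "10.5524"  # prefix named in the talk
FIRST_SUFFIX = 100001      # suffix of the first dataset named in the talk


def gigadb_doi(n: int) -> str:
    """Return the DOI of the n-th dataset (1-based).

    Assumes suffixes are issued strictly sequentially, which is an
    illustration of the scheme, not a guarantee about GigaDB.
    """
    if n < 1:
        raise ValueError("dataset index must be >= 1")
    return f"{GIGADB_PREFIX}/{FIRST_SUFFIX + n - 1}"


def doi_url(doi: str) -> str:
    """Turn a DOI into a link that the general DOI resolver can follow."""
    return f"https://doi.org/{doi}"


print(gigadb_doi(1))             # 10.5524/100001
print(doi_url(gigadb_doi(15)))   # https://doi.org/10.5524/100015
```

Because the link goes through the resolver rather than straight to a GigaDB page, the landing page can move without the citation breaking.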
If we randomly select DOI:10.5524/100015 you can view the metadata associated with the Genome Sequence of the YH individual. The citation includes the authors, year of publication, title, publisher and resolvable DOI. This URL takes you to the GigaDB landing page for this study, so even if the URL changes we can update the metadata and the DOI will always resolve to the webpage. We have then registered the abstract, resource type, a subject tag of ‘Genomic’, the CC0 license, size of the dataset and related identifiers. In this case the DOI is referenced by the Nature article and is supplemented by the GigaScience datasets 100013 and 100014, which are the supplementary transcriptome and methylome datasets of the YH individual, respectively.
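To make the shape of such a record concrete, here is a minimal sketch of the fields listed above as a Python data class, with a citation string assembled from them. The class and field names are hypothetical, and the creators, year, title and publisher values are placeholders rather than the real YH record. The relation names `IsReferencedBy` and `IsSupplementedBy` are taken from the DataCite relation vocabulary, and the citation layout (creators, year, title, publisher, identifier) is assumed from the description above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RelatedIdentifier:
    relation: str    # e.g. "IsReferencedBy" or "IsSupplementedBy"
    identifier: str


@dataclass
class DatasetRecord:
    doi: str
    creators: List[str]
    year: int
    title: str
    publisher: str
    resource_type: str = "Dataset"
    subjects: List[str] = field(default_factory=list)
    rights: str = "CC0"
    related: List[RelatedIdentifier] = field(default_factory=list)

    def citation(self) -> str:
        """Assemble authors, year, title, publisher and resolvable DOI."""
        authors = "; ".join(self.creators)
        return (f"{authors} ({self.year}): {self.title}. "
                f"{self.publisher}. https://doi.org/{self.doi}")


# Placeholder values: only the DOIs and the 'Genomic' tag come from the talk.
record = DatasetRecord(
    doi="10.5524/100015",
    creators=["Author A", "Author B"],
    year=2011,
    title="Placeholder title",
    publisher="Placeholder publisher",
    subjects=["Genomic"],
    related=[
        RelatedIdentifier("IsSupplementedBy", "10.5524/100013"),  # transcriptome
        RelatedIdentifier("IsSupplementedBy", "10.5524/100014"),  # methylome
    ],
)

print(record.citation())
```

An `IsReferencedBy` entry for the Nature article would be added the same way; it is left out here because the article's identifier is not given in the talk.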
As you saw with the DOI search, to date we have issued DOIs to 36 datasets including human, vertebrates, invertebrates, plants, microbes and cell-lines.
We have the capacity to store very large datasets at BGI, which is exemplified by the Asian Cancer Research Group’s Hepatocellular carcinoma dataset, which is 14 terabytes in size. By providing tools and integration with the BGI Cloud we hope to make this important dataset available for anyone to access and analyze. Many of the datasets in GigaDB are also part of larger collaborations and projects such as the Genome 10K, which includes our most recent release of the Darwin finch genome assembly and annotation. With the new GigaDB interface you can search specifically for datasets from these projects.
Many of these datasets were made public and the DOI released prior to publication and, I would like to stress, this DID NOT prevent subsequent publication.
Indeed, five subsequent publications cite the respective GigaScience DOI in the references:
- The transcriptome from the YH lymphoblastoid cell line
- The single-cell whole exome sequence from an individual bladder cancer
- The MEDUSA computational pipeline used to identify differentially methylated regions in mouse
- The polar bear genome
- And the sorghum genome
Publications are in the pipeline for several of the remaining datasets on the list.
The first of these five publications was the Sorghum genome and analyses, published in Genome Biology last year. As noted, reference 62 cites the dataset DOI. I would also like to stress that the DOI is a complement to and not a replacement for deposition of relevant data in appropriate INSDC databases at EBI, NCBI or DDBJ, and it is a requirement prior to submission to GigaDB that data be deposited in such repositories. In the case of Sorghum we also worked with the authors to help them submit the SNP and structural variants to dbSNP and dbVar respectively.
A GigaScience dataset citation is also included in the YH Transcriptome paper published in Nature Biotechnology in February this year. As you can see, the dataset was published in 2011 but this did not prevent subsequent publication of the analysis paper.
In the case of the polar bear, a different group from the one that produced the original dataset published in Science, citing the GigaScience dataset.
Finally, there are two citations from the GigaScience journal in the last couple of months since its official launch. One is the Mouse Methylome computational pipeline and the other is the Single Cell Bladder Cancer genome. I would like to highlight that the dataset for the Mouse methylome paper includes not only the raw fastq and alignment files which were submitted to the SRA and GEO repositories but also the MEDUSA software and bigwig methylation files, all of which are represented in ISA-TAB format. So, I hope I have convinced you that making your data public prior to publication is not just in the best interests of science but also increases your publication and citation list to aid in grant applications and career advancement!
And now that you all want to submit to GigaDB, how do you do that, how will people search and find your data and, other than citing your DOI, what will they be able to do with the data? We have redesigned the underlying Giga database and we’re working on the front end, which we hope to make public early next month, so the following slides are a mix of screenshots from the development site overlaid with tweaks made in PowerPoint to illustrate features you can hope to see when we go live. These include:
- a home page image slider for browsing datasets
- a text box search, which I will demonstrate shortly
and an advanced search option…
…which, if you click it, gives you detailed instructions on the syntax used by the Sphinx search engine.
Here I would like to mention the login system where a user can save searches, sign up for email alerts and submit Excel submission files.
This is my profile page. I am logged in and have two saved searches. If new GigaScience datasets are released that match my search criteria I will be emailed a notification with links to the datasets so I don’t have to keep checking GigaDB for new content that I may be interested in.
Since I am logged in I also have the option to submit to GigaDB.
An Excel template file is provided for download, along with 2 completed example files for guidance.
There are also help pages with more detailed instructions on using the website and submitting data to GigaDB. Once I confirm that I have read the GigaDB Terms and Conditions, I can upload my Excel submission file and a member of the GigaDB team should contact me within 3-5 working days. We welcome feedback on the submission system, so please do let us know of any improvements to the Excel submission file that would ease the process.
Now, if we move on to the search facility, as an example, if we search for the YH individual in the search box we get 3 datasets returned: the original YH Genome and the supplementary methylome and transcriptome datasets from the same individual. If you have many results you can use the Filters to narrow down your search, restricting by Organism, Dataset type, project, publication date or modification date.
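The filtering described here can be pictured with a small in-memory sketch. It is not how the GigaDB website is implemented (the talk only says that searching uses Sphinx); the `Dataset` class, the `filter_datasets` function and the example titles and dates are all hypothetical. Only the filter names and the dataset type labels come from the talk.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Dataset:
    title: str
    organism: str
    dataset_type: str
    project: Optional[str]
    published: date


def filter_datasets(
    datasets: Iterable[Dataset],
    *,
    organism: Optional[str] = None,
    dataset_type: Optional[str] = None,
    project: Optional[str] = None,
    published_since: Optional[date] = None,
) -> List[Dataset]:
    """Keep datasets matching every filter supplied (None means no filter)."""
    result = []
    for d in datasets:
        if organism is not None and d.organism != organism:
            continue
        if dataset_type is not None and d.dataset_type != dataset_type:
            continue
        if project is not None and d.project != project:
            continue
        if published_since is not None and d.published < published_since:
            continue
        result.append(d)
    return result


# Placeholder records standing in for the three YH search results;
# titles and dates are illustrative, not taken from GigaDB.
yh_results = [
    Dataset("YH genome", "Homo sapiens", "Genomic", None, date(2011, 1, 1)),
    Dataset("YH methylome", "Homo sapiens", "Epigenomic", None, date(2011, 1, 1)),
    Dataset("YH transcriptome", "Homo sapiens", "Transcriptomic", None, date(2011, 1, 1)),
]

only_transcriptomic = filter_datasets(yh_results, dataset_type="Transcriptomic")
print([d.title for d in only_transcriptomic])  # ['YH transcriptome']
```

Filters combine with a logical AND, so each one you add can only narrow the result list, which matches the "narrow down your search" behaviour described above.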
You can also hover over a dataset to read the abstract before clicking through to a DOI landing page.
Alternatively, if you are looking for files to download across datasets, you can click on the File tab and use the Filters to further refine your file search, here narrowing it down by filtering on File type, File format, File size or Release date.
Incidentally, all the hover-over ‘i’ icons you see provide information, in this example describing what the different file formats are.
This download function is still being worked on but will also allow you to select multiple files for download or for direct upload to Galaxy and other tools in development which I’ll touch on at the end of my talk.
This is an example landing page for DOI 10.5524/100015 for the YH genome dataset. It will be accessible both from the GigaDB URL and the DOI URL. These pages are still in development but what you will see is the dataset metadata including:
- date released
- dataset type
- title
- abstract
- how the dataset should be cited
- links to related manuscripts, datasets, additional information, genome browsers, accessions and projects
- sample details
And finally at the bottom, file descriptions and options (not shown in this illustration) to download the files (or upload them to tools such as Galaxy).
Leading on from that, current and future plans include collaborating with Tin-Lap Lee at the Chinese University of Hong Kong to integrate an instance of the Galaxy bioinformatics platform with GigaDB so users can make full use of the data in GigaDB by linking it to other resources and we can incorporate fully executable papers. One such submission is a new SOAPdenovo pipeline. The SOAP tools have been wrapped in Galaxy, the workflow defined in MyExperiment and the data will be issued with a DOI and accessible via GigaDB. Utilizing the BGI cloud if necessary, users will then be able to reproduce all the steps described in the GigaScience paper to test, reanalyze, compare results etc. Since we would like GigaDB to be a host for data types that have no other home, such as imaging data, we are investigating adding other tools such as an image viewer and the like to support accessibility to and usability of the data. So, if you have a large-scale biological or biomedical dataset and/or a pipeline or software that you would like to submit to GigaScience we would love to hear from you, so please come and talk to Scott or myself.
That just leaves me to thank the GigaScience team: Laurie, Scott, Alexandra, Peter and Jesse; BGI for their support, specifically Shaoguang for IT and bioinformatics support; our collaborators on the database, website and tools: Tin-Lap, Qiong, Senhong, Yan and the Cogini web design team; DataCite for providing the DOI service; and the isacommons team for their support and advocacy for best practice in metadata reporting and sharing. Thank you for listening.