To promote life science research in Japan, the National Bioscience Database Center (NBDC) works to make databases easier to use. As one of the core technology development programs of NBDC, the Database Center for Life Science (DBCLS) has been tackling the problem of how to organize big data in the life sciences, including the huge amount of nucleotide sequence data from next-generation sequencers and various types of gene expression data.
For nucleotide sequence data, we organized the data deposited in the Sequence Read Archive (SRA) so that they can be reused, in collaboration with DDBJ, which is one of the organizations that collaboratively maintain SRA. By analyzing SRA metadata, we maintain statistics on SRA entries by study type, sequencer type (platform) and sample species, and this information is available from our DBCLS SRA website (http://sra.dbcls.jp/). Notably, we are collecting SRA entries associated with publications and diseases, and search forms for these entries are also accessible from the DBCLS SRA website.
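As a rough illustration of the kind of metadata tally described above, the following Python sketch counts entries by study type, platform and species. The records, the field names (study_type, platform, species) and the helper tally_by_field are placeholders assumed for this example; they are not the actual SRA metadata schema or the code behind the DBCLS SRA website.

```python
from collections import Counter

# Toy metadata records. The field names and values are placeholders
# chosen for illustration, not the actual SRA metadata schema.
records = [
    {"study_type": "Transcriptome Analysis", "platform": "ILLUMINA", "species": "Homo sapiens"},
    {"study_type": "Whole Genome Sequencing", "platform": "ILLUMINA", "species": "Mus musculus"},
    {"study_type": "Transcriptome Analysis", "platform": "LS454", "species": "Homo sapiens"},
    {"study_type": "Metagenomics", "species": "unidentified"},  # platform missing on purpose
]


def tally_by_field(entries, field):
    """Count entries per value of one metadata field.

    Entries that lack the field are counted under 'unknown'.
    """
    return Counter(entry.get(field, "unknown") for entry in entries)


# Print one tally per field of interest.
for field in ("study_type", "platform", "species"):
    print(field, dict(tally_by_field(records, field)))
```

Counting missing values under an explicit "unknown" key, rather than dropping them, keeps the totals for each field equal to the number of entries.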
We are also developing a search engine for nucleotide sequence data that uses a compressed suffix array. We have developed a Google-like search engine for RNA molecules, called GGRNA [1], which is available from the GGRNA website (http://ggrna.dbcls.jp/).
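To give a sense of why a suffix array suits sequence search, the sketch below builds a plain (uncompressed) suffix array for a short nucleotide string and finds every occurrence of a query by binary search. It is a simplified stand-in under stated assumptions: the function names are made up for this example, the construction is naive, and it does not reproduce the compressed suffix array or the implementation used in GGRNA.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in lexicographic order.

    Naive construction by sorting the suffixes directly; adequate for a
    short example, far too slow for real sequence databases.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return the sorted start positions at which pattern occurs in text.

    Because the suffixes are sorted, all suffixes that begin with the
    pattern form one contiguous run, located with two binary searches.
    """
    m = len(pattern)

    def prefix(k):
        # First m characters of the k-th smallest suffix.
        start = suffix_array[k]
        return text[start:start + m]

    # Lower bound: first suffix whose prefix is not smaller than the pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose prefix is greater than the pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])


sequence = "ACGTACGA"
sa = build_suffix_array(sequence)
print(find_occurrences(sequence, sa, "ACG"))  # [0, 4]
```

Each query needs only a logarithmic number of comparisons against the index, so lookup time grows slowly with the size of the indexed text; a compressed suffix array offers the same kind of lookup with a much smaller memory footprint.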
To handle various types of gene expression data, we built an integrated dataset and its interface, called RefEx (Reference Expression dataset: http://refex.dbcls.jp/), for browsing gene expression data in human, mouse and rat that are derived from public databases and were obtained by the following four methods:
1. Expressed Sequence Tag (EST) counts in the EST division of INSDC (DDBJ/ENA/GenBank)
2. DNA microarray (Affymetrix GeneChip)
3. CAGE (Cap Analysis Gene Expression) tag counts around transcription start sites
4. Transcriptome sequence counts from next-generation sequencers (RNA-seq)
The web interface for RefEx provides a form with which users can search by gene name, various types of IDs, chromosomal region in genetic maps, keyword and nucleotide sequence. Gene expression values are mapped onto the 3D body image from BodyParts3D [2], and graphical histograms of these values are available for each type of measurement method.
We will present the current status of the project and the utility of the systems we have developed.
[1] Naito, Y. and H. Bono (2012) GGRNA: an ultrafast, transcript-oriented search engine for genes and transcripts. Nucleic Acids Research. 40: W592-W596.
[2] Mitsuhashi, N., Fujieda, K., Tamura, T., Kawamoto, S., Takagi, T. and K. Okubo (2009) BodyParts3D: 3D structure database for anatomical concepts. Nucleic Acids Research. 37: D782-D785.
The TogoTV version of this presentation is available at http://togotv.dbcls.jp/20130903.html