This is a talk titled "Cloud-Based Services For Large Scale Analysis of Sequence & Expression Data: Lessons from Cistrack" that I gave at CAMDA 2009 on October 6, 2009.
Bioclouds CAMDA (Robert Grossman) 09-v9p
1. Cloud-Based Services For Large Scale Analysis of Sequence & Expression Data: Lessons from Cistrack. October 6, 2009. Robert Grossman, Laboratory for Advanced Computing, University of Illinois at Chicago; Open Data Group; Institute for Genomics & Systems Biology, University of Chicago.
10. Projected sequencing capabilities (world-wide). [Plot; y-axis: log10 billions of base pairs.] Projected milestones: 2019, one of each described species; 2023, one of each species (~10M estimate); 2031, one of each species (~100M estimate); 2060, total human population. Source: Kevin White, unpublished.
11. Is Biology a Large Data Science? CPUs double approximately every 18 months (Moore’s Law). Disks double every 12-15 months (Johnson’s Law). The amount of publicly available sequence data is doubling approximately every 12 months.
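To make the comparison on this slide concrete, the doubling times can be turned into growth factors. The sketch below is illustrative only: the doubling times come from the slide, while the 10-year horizon and the function and variable names are assumptions chosen for the example.

```python
def growth_factor(years, doubling_months):
    """Multiplicative growth after `years`, given a doubling time in months."""
    return 2.0 ** (years * 12.0 / doubling_months)


# Doubling times as quoted on the slide (months).
DOUBLING_MONTHS = {
    "CPUs": 18.0,
    "disks (slow end)": 15.0,
    "disks (fast end)": 12.0,
    "sequence data": 12.0,
}


def compare(years=10.0):
    """Growth factor for each resource over the same horizon (assumed: 10 years)."""
    return {name: growth_factor(years, months)
            for name, months in DOUBLING_MONTHS.items()}


if __name__ == "__main__":
    for name, factor in compare().items():
        print(f"{name:>18}: x{factor:,.0f} over 10 years")
```

Over ten years, a 12-month doubling time compounds to roughly a 1,000-fold increase, while an 18-month doubling time compounds to roughly a 100-fold increase, so sequence data outgrows CPU capacity by about an order of magnitude per decade under these rates.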
13. We Have a Problem. More and more of your colleagues (e.g. the biologist down the hall) with access to modern instruments are producing so much data that they cannot easily manage, analyze and archive it. Large projects build their own infrastructure. Almost all other biologists are on their own.
14. Point of View. To do research today…: analytic infrastructure; analytic algorithms & statistical models; data.
20. Idea Dates Back to the 1960s. Virtualization was first widely deployed with IBM VM/370. [Diagram: applications running on CMS, CMS and MVS guests, hosted by IBM VM/370 on an IBM mainframe.] Native (full) virtualization; example: VMware ESX.
21. One Definition. Clouds provide on-demand resources or services over a network, often the Internet, with the scale and reliability of a data center. No standard definition. Cloud architectures are not new. What is new: scale, ease of use, and pricing model.
30. Cistrack. Resource for cis-regulatory data. It is open source and based upon CUBioS. Currently used by the White Lab at University of Chicago for managing modENCODE fly data. Contains raw, intermediate, and analyzed data from approximately 240 experiments on Agilent, Affy and Solexa platforms.
31. Chromatin Developmental Time-Course. Factors profiled: H3K4me1 (enhancers); H3K4me3 (promoters & enhancers); H3K9Ac (activation); H3K9me3 (heterochromatin); H3K27Ac (activation); H3K27me3 (repression); PolII (transcript. & promoters); CBP (HAT, enhancers); total RNA (expression). × 12 time points for which chromatin and RNA have been collected (Carolyn Morrison & Nicolas Negre). 8 antibodies used for ChIP experiments (Zirong Li and Bing Ren).
32. 1. Cistrack Supports Cubes of Data. Drosophila regulatory elements from Drosophila modENCODE. ChIP-chip data using Agilent 244K dual-color arrays. Six histone modifications (H3K9me3, H3K27me3, H3K4me3, H3K4me1, H3K27Ac, H3K9Ac), PolII and CBP. Each factor has been studied at 12 different time points of Drosophila development.
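One way to picture the "cube" described on this slide is as a mapping from (factor, time point) cells to the experiments that populate them. This is a minimal sketch and not Cistrack's actual data model: the factor names come from the slide, while the integer time-point labels, the function names, and the experiment identifier are hypothetical.

```python
from itertools import product

# Factors listed on the slide: six histone modifications plus PolII and CBP.
FACTORS = ["H3K9me3", "H3K27me3", "H3K4me3", "H3K4me1",
           "H3K27Ac", "H3K9Ac", "PolII", "CBP"]

# 12 developmental time points; the integer labels are placeholders.
TIME_POINTS = list(range(1, 13))


def build_cube(factors, time_points):
    """Map every (factor, time_point) cell to an empty list of experiment records."""
    return {cell: [] for cell in product(factors, time_points)}


def slice_by_factor(cube, factor):
    """Return one row of the cube: all time points for a single factor."""
    return {tp: records for (f, tp), records in cube.items() if f == factor}


cube = build_cube(FACTORS, TIME_POINTS)
cube[("H3K4me3", 3)].append("experiment-0001")  # hypothetical identifier

print(len(cube))                                 # number of cells: 8 x 12
print(slice_by_factor(cube, "H3K4me3")[3])
```

Keying on the (factor, time point) pair makes both kinds of slice cheap to express: all time points for one factor, or all factors at one time point.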
33. 2. ChIP-Seq Data Volumes are Large. Cistrack integrates with large data clouds.
34. 3. Continuous Reanalysis is Desirable. In general, it is quite labor intensive to reanalyze your existing data with a new algorithm. Cistrack supports VMs that can simplify re-applying Cistrack pipelines that have been updated to include a new algorithm.
40. Basic Idea. [Diagram labels: Replace, Cloud, VM, VM, VM.] At any time, you can launch one or more virtual machines (VMs) to redo pipeline analysis, persisting the results to a database and the VM to a cloud.
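The persistence half of this idea can be sketched as follows: rerun every experiment through an updated pipeline and store the results keyed by pipeline version, so earlier results remain available for comparison. This is only an illustration under stated assumptions. It does not launch VMs, it uses an in-memory SQLite database as a stand-in for the results database, and the table layout, the two toy pipelines, and the experiment names are all hypothetical rather than taken from Cistrack.

```python
import sqlite3


def pipeline_v1(values):
    """Toy 'original' analysis: the mean of the measurements."""
    return sum(values) / len(values)


def pipeline_v2(values):
    """Toy 'updated' analysis: the median of the measurements."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def reanalyze(conn, experiments, version, pipeline):
    """Run `pipeline` over every experiment and persist results under `version`."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        "experiment TEXT, pipeline_version TEXT, value REAL, "
        "PRIMARY KEY (experiment, pipeline_version))"
    )
    for name, values in experiments.items():
        conn.execute(
            "INSERT OR REPLACE INTO results VALUES (?, ?, ?)",
            (name, version, pipeline(values)),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for the results database
experiments = {"exp-a": [1.0, 2.0, 9.0], "exp-b": [4.0, 4.0, 6.0, 10.0]}

reanalyze(conn, experiments, "v1", pipeline_v1)
reanalyze(conn, experiments, "v2", pipeline_v2)  # rerun after the pipeline update

rows = conn.execute(
    "SELECT experiment, pipeline_version, value FROM results "
    "ORDER BY experiment, pipeline_version"
).fetchall()
for row in rows:
    print(row)
```

Because results are keyed by both experiment and pipeline version, rerunning the same version overwrites its own rows, while a new version adds rows alongside the old ones.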
41. Raywulf. We have designed a cluster (called a Raywulf Cloud) that is optimized to serve as your own private cloud. About $2K/TB. Will be used by the Open Science Data Cloud.
43. Cis-Regulatory Map of the Drosophila Genome (modENCODE). Data generation: Kevin White, U. Chicago (antibody pipeline, ChIP-chip pipeline); Bing Ren, UCSD (antibody validation, ChIP-chip pipeline); Robert Grossman, U. Illinois (LIMS, data management & analysis). Computational identification of cis-regulatory motifs: Manolis Kellis, MIT (motif analysis, ChIP-chip data analysis). Biological validation: Jim Posakony, UCSD (promoters/enhancers); Steve Russell, Cambridge U. (insulators/silencers); Hugo Bellen, Baylor (element “necessity” validations).