Jsm madduri-august-2015


  1. Finding Needles in a Haystack – Big Data Management and Analysis using Globus
     Ravi Madduri, madduri@anl.gov
     JSM 2015, Seattle, Washington
     globus.org/genomics
  2. Who We Are
     • Globus Genomics is developed, operated, and supported by researchers, developers, and bioinformaticians at the Computation Institute, University of Chicago/Argonne National Lab
     • We are a non-profit organization building solutions for non-profit researchers
     • Our goal is to support the advancement of science by bringing together our strengths and capabilities to help meet the unique needs of researchers and research institutions
  3. Finding needles in haystacks
     [Diagram of the research cycle; labels: Pose question, Design experiment, Collect data, Analyze data, Identify patterns, Hypothesize explanation, Test hypothesis, Publish results]
  4. Imagine if a researcher, when tackling a problem, could easily:
     • Assemble, integrate, and interpret all relevant data within a knowledge network
     • Be informed of anomalies, patterns, gaps
     • Formulate & apply computational models
     • Outsource tasks if local expertise lacking
     • Launch automated processes to test hypotheses, expand knowledge network
     • Pay for all this by taking on other tasks
  5. We will cover
     • Accelerating Scientific Discovery Process by providing Science as a Service
       – Research Data Management
       – Analyzing Research Data
         • Interactive Analysis
         • Large-scale Analysis
       – Publishing Results so others can
         • Discover
         • Validate
         • Reproduce/Use
  6. 90% of cancer patients carry a mutation that may be responsive to a known drug
     Mark Rubin, Weill Cornell Medical College and NewYork-Presbyterian Hospital in New York, in Nature, April 2015
  7. Trying to find a single causative gene for diseases with a complex genetic background is like looking for the proverbial needle in a haystack
     Nancy Cox (Vanderbilt)
  8. Higgs discovery “only possible because of the extraordinary achievements of … grid computing”
     Rolf Heuer, CERN DG
     10s of PB, 100s of institutions, 1000s of scientists, 100Ks of CPUs, Bs of tasks
  9. How do we accelerate discovery without requiring that every lab acquire a haystack-sorting machine?
     [Image: Clayton & Shuttleworth thresher, 1910; Museum Victoria, Australia]
  10. Managing big data with Globus
      [Workflow diagram; locations shown: Light Source, Compute Facility, Personal Computer, Publication Repository. Steps, in the order they appear:]
      • PI initiates transfer request; or requested automatically by script, science gateway
      • Globus transfers files reliably, securely
      • PI selects files to share, selects user or group, and sets access permissions
      • Globus controls access to shared files on existing storage; no need to move files to cloud storage!
      • Researcher logs in to Globus and accesses shared files; no local account required; download via Globus
      • Researcher assembles data set; describes it using metadata (Dublin Core and domain-specific)
      • Curator reviews and approves; data set published on campus or other system
      • Peers, collaborators search and discover datasets; transfer and share using Globus
      Side notes:
      • SaaS: only a web browser required
      • Access using your campus credentials
      • Globus monitors and informs throughout
  11. Globus Platform-as-a-Service
      [Diagram labels: Identity, Group, Profile Management Services; …; Sharing Service; Transfer Service; Globus Toolkit; Globus APIs; Globus Connect]
  12. Globus Adoption and Usage
      • 166,449 active Globus endpoints
      • 27,961 users registered
      • Biggest transfer: 500.42 TB
      • Longest running transfer: 182 days
      • Fastest transfer: 58.5 Gbps (average)
      • 55 TB moved per day, on average, since the service was launched in November 2010
      • Average throughput: 637.7 Mbps (since service launch)
  13. Analyzing Big Data using Globus Galaxies
      [Architecture diagram; endpoints shown: Sequencing Centers, Public Data, Storage, Local Cluster/Cloud, Seq Center, Research Lab]
      • Data management: Globus provides for high-performance, fault-tolerant, secure file transfer between all data endpoints
      • Data analysis: Fastq, Ref Genome, Alignment, Variant Calling, Picard, GATK, Galaxy Data Libraries
      • Globus Genomics on Amazon EC2: analytical tools are automatically run on the scalable compute resources when possible
      • Galaxy-based workflow management
        – Globus integrated within Galaxy
        – Web-based UI
        – Drag-Drop workflow creations
        – Easily modify workflows with new tools
  14. Our Science Stack
      • Galaxy
        – Interactive execution, iPython, R
        – Creation, Execution, Sharing, Discovering Workflows
      • Globus
        – Data management
        – Identity Management
      • AWS
        – HTCondor, Chef, EC2, EBS, S3, SNS
        – Spot, Route 53, Cloud Formation
      [Layer labels on the slide: SaaS, PaaS, IaaS]
  15. Examples of what researchers have done
  16. Cox lab, UChicago
      • 134 samples and 4 workflows
      • 4 TB data initially
      • 2200 core hours in 6 days
  17. Consensus Caller
  18. [Figure panels: Rediscovery of previously observed variants; Transition/Transversion Ratio; Genotype Mendel Error Rate; Distributions of Mendel Error Counts per Trio]
  19. Contaminated Samples
  20. Olopade lab, UChicago
      A profile of inherited predisposition to breast cancer among Nigerian women
      Y. Zheng, T. Walsh, F. Yoshimatsu, M. Lee, S. Gulsuner, S. Casadei, A. Rodriguez, T. Ogundiran, C. Babalola, O. Ojengbede, D. Sighoko, R. Madduri, M.-C. King, O. Olopade
      • 200 targeted exomes
      • 200 GB data initially
      • 76,920 core hours in 1.25 days
  21. Expanding Consensus Genotyper – SNVs, Indels, SVs
      [Pipeline diagram: raw FASTQs feed GATK Pipeline/HC, FreeBayes, SAMtools mpileup, GATK Pipeline/UG, Atlas2, and Delly/Contra; each produces a VCF that goes to the Consensus Genotyper, which outputs a VCF]
  22. Preliminary Results are very encouraging
      14 deleterious SNVs and 11 damaging Indels (BRCA1: 15, BRCA2: 4, PALB2: 2, BRIP1: 1, CHEK2: 1, NBN: 1, TP53: 1) were found in 29 subjects, and they were all confidently detected among 5 callers. Identified SNVs and Indels were all confirmed by Sanger sequencing.
  23. Combining Data management and Analysis
      [Workflow diagram; data sources and labels: PPMI, ADNI, Adenocarcinoma, Adrenal, Brain, ERMrest, BDDS Collection; links: http://bit.ly/1M0h6Yx http://bit.ly/A10R89y]
      Step labels as they appear on the slide:
      • 1. Query and discover data
      • 2. Transfer bags
      • 3. Execute parallel alignment workflow on dynamically provisioned cloud resources (QC, Alignment, Feature count, Alignment QC; produces Alignment Files)
      • 3. Publish bags
      • 4. Discover published data and execute comparison workflow (Differential expression)
  24. Gene Expression Results
  25. Globus Genomics at a glance
      • 30 institutions, groups, labs
      • 10s million core hours
      • 2 PBs raw sequences analyzed
      • >1500 analysis tools
      • 1000s genomes processed
      • >50 workflows
      • 99% uptime over the past two years
      • 1 PB largest single transfer to do
      • 5 days longest running workflow
      • 100s different species
  26. Other Globus Genomics users
      • Dobyns Lab
      • Cox Lab
      • Volchenboum Lab
      • Olopade Lab
      • Nagarajan Lab
  27. Costs are remarkably low
      Pricing includes
      • Estimated compute
      • Storage (one month)
      • Globus Genomics platform usage
      • Support
  28. Globus Genomics – Making it routine to find needles in NGS haystacks
      www.globus.org/genomics
  29. Other Examples of Science as a Service
      • PDACS - Portal for data analysis services for cosmological simulations
      • CVRG Galaxy – Large-scale ECG Data Analysis
      • Globus Proteomics
      • eMatter – Material Science Simulations
      • FACE-IT - Framework to Advance Climate, Economic, and Impact Investigations with Information Technology (usefaceit.org)
  30. • More information on Globus Genomics: www.globus.org/genomics
      • More information on Globus: www.globus.org
  31. Our work is supported by: U.S. Department of Energy
  32. Thank you! @madduri
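
The core-hour figures on slides 16 and 20 can be turned into a rough measure of how many cores were busy at once. The sketch below is only back-of-the-envelope arithmetic: it assumes the quoted days are continuous wall-clock time and that load was spread evenly, neither of which the slides state. The function name `average_cores` is made up for this illustration.

```python
def average_cores(core_hours: float, wall_clock_days: float) -> float:
    """Average number of cores in use, assuming an evenly loaded, continuous run."""
    return core_hours / (wall_clock_days * 24.0)


# Figures quoted on the slides (core hours, elapsed days).
cox = average_cores(2200, 6)           # Cox lab: 134 samples, 4 workflows
olopade = average_cores(76920, 1.25)   # Olopade lab: 200 targeted exomes

print(f"Cox lab: about {cox:.1f} cores on average")
print(f"Olopade lab: about {olopade:.0f} cores on average")
```

Under those assumptions the Cox lab run averages roughly 15 cores, while the Olopade lab run averages roughly 2,564 cores. A burst of that size for a day or so fits the slides' case for elastic cloud provisioning over a fixed local cluster.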

Editor's Notes

  • The basic research process remains essentially unchanged since the emergence of the scientific method in the 17th Century.
    Collect data, analyze data, identify patterns within data, seek explanations for those patterns, collect new data to test explanations.

    Speed of discovery depends to a significant degree on the time required for this cycle. Here, new technologies are changing the research process rapidly and dramatically.
    Data collection time used to dominate research. For example, Janet Rowley took several years to collect data on gross chromosomal abnormalities for a few patients. Today, we can generate genome data at the rate of billions of base pairs per day. So other steps become bottlenecks, like managing and analyzing data—a key issue for Midway.

    It is important to realize that the vast majority of research is performed within “small and medium labs.” For example, almost all of the ~1000 faculty in BSD and PSD at UChicago work in their own lab.
    Academic research is a cottage industry—albeit one that is increasingly interconnected—and is likely to stay that way.
  • Given continued exponential growth along so many dimensions, process efficiencies must improve at a comparable rate just to maintain constant progress.
  • http://sciencelife.uchospitals.edu/2013/10/28/testing-a-whole-haystack-to-find-more-needles/
  • Peter Higgs

  • http://museumvictoria.com.au/collections/items/772499/negative-threshing-team-with-aveling-porter-traction-engine-bagging-grain-building-a-haystack-miners-rest-victoria-1910
  • Highlight CI Connect; coming up in Rob Gardner’s talk
    Highlight XSEDE’s planned adoption of user, group and profile management
  • Our goal is to operationalize key capabilities so researchers can depend on them. Think of Gmail for science.
  • We built this pipeline to create high quality variants using multiple genotyping algorithms
  • Applying previous pipeline to targeted exomes in breast cancer. http://abstracts.ashg.org/cgi-bin/2014/ashg14s.pl?author=madduri&sort=ptimes&sbutton=Detail&absno=140122328&sid=84013
  • Normal/tumor samples for 2 subjects (from GEO, South Korean population)
    Run workflow on each normal and tumor sample and publish
    QC, alignment, feature count, and alignment QC produce QC files, an alignment file, and a count file.
    Differential expression
  • http://gene.gmi.ac.kr/geneList/LUAD_ExpLevels_EGFR.jpg

    Picture shows how the gene EGFR is expressed in the lung and cancer tumor samples we have analyzed. We can easily do a very similar analysis for ADNI, PPMI, and other data sources.
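
The notes and slides 17 and 21 describe a pipeline that combines the output of several genotyping algorithms, but they do not say how the calls are merged. One simple way to picture a consensus step is majority voting over the variants each caller reports, sketched below. This is an illustrative assumption, not the Consensus Genotyper's actual method; the function `consensus_variants`, the caller names, and the example variants are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

# A variant is identified here by (chromosome, position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


def consensus_variants(calls: Dict[str, Set[Variant]], min_callers: int) -> List[Variant]:
    """Return the variants reported by at least `min_callers` callers, sorted."""
    support: Counter = Counter()
    for variants in calls.values():
        # Each caller contributes at most one vote per variant (sets are de-duplicated).
        support.update(variants)
    return sorted(v for v, votes in support.items() if votes >= min_callers)


# Hypothetical call sets from three callers (made-up coordinates).
calls = {
    "caller_a": {("chr17", 100, "A", "G"), ("chr13", 200, "C", "T")},
    "caller_b": {("chr17", 100, "A", "G")},
    "caller_c": {("chr17", 100, "A", "G"), ("chr13", 200, "C", "T"), ("chr2", 50, "G", "A")},
}

agreed = consensus_variants(calls, min_callers=2)
print(agreed)
```

In this toy input, the variant reported by only one caller is dropped and the two supported by at least two callers are kept. A production pipeline would read VCF files and would also have to reconcile genotypes, quality scores, and differing representations of the same indel, which this sketch ignores.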