Big process for big data
Process automation for data-driven science

Ian Foster
Computation Institute
Mathematics and Computer Science Division
Department of Computer Science
Argonne National Laboratory & The University of Chicago

Talk at DOE Big Data Technology Summit, Washington DC, October 9, 2012
                                                             www.ci.anl.gov
                                                             www.ci.uchicago.edu
Big data is not new at DOE
Large Hadron Collider
• 15 PB/year
• 173 TB/day
• 500 MB/sec
• LHC Computing Grid (10+ GB/sec)

Higgs discovery “only possible because of the extraordinary achievements of … grid computing” (Rolf Heuer, CERN DG)
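The slide does not say how its three LHC figures relate to one another. As a rough check, the sketch below converts the yearly and daily volumes to per-second rates, assuming decimal (SI) units and a 365-day year; both are assumptions, not statements from the talk. On this reading, 15 PB/year is close to the quoted 500 MB/sec, while 173 TB/day is a separate, larger flow of about 2 GB/sec.

```python
# Unit conversion for the LHC figures on the slide.
# Assumption: decimal (SI) prefixes and a 365-day year; the slide states neither.
PB = 1e15
TB = 1e12
GB = 1e9
MB = 1e6
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365


def per_year_to_per_second(bytes_per_year: float) -> float:
    """Convert a yearly data volume to an average rate in bytes per second."""
    return bytes_per_year / (DAYS_PER_YEAR * SECONDS_PER_DAY)


def per_day_to_per_second(bytes_per_day: float) -> float:
    """Convert a daily data volume to an average rate in bytes per second."""
    return bytes_per_day / SECONDS_PER_DAY


yearly_rate = per_year_to_per_second(15 * PB)  # 15 PB/year from the slide
daily_rate = per_day_to_per_second(173 * TB)   # 173 TB/day from the slide

print(f"15 PB/year is about {yearly_rate / MB:.0f} MB/sec")
print(f"173 TB/day is about {daily_rate / GB:.1f} GB/sec")
```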


But it is now ubiquitous: e.g., genomics
Growth over 6 years:
• Computing x10 (x30 at DOE)
• Genome sequencing x10^5
Source: Kahn, Science, 331 (6018): 728-729
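To make the gap between the two growth rates concrete, the following sketch converts each six-year growth factor into an equivalent doubling time. It assumes steady exponential growth over the period, and it reads the slide's sequencing figure as a factor of 10^5; both are assumptions made for illustration.

```python
import math


def doubling_time_years(growth_factor: float, period_years: float) -> float:
    """Doubling time implied by growing by `growth_factor` over `period_years`.

    Assumes steady exponential growth, which the slide does not claim.
    """
    return period_years * math.log(2) / math.log(growth_factor)


PERIOD_YEARS = 6  # from the slide

computing = doubling_time_years(10, PERIOD_YEARS)      # computing x10
computing_doe = doubling_time_years(30, PERIOD_YEARS)  # computing x30 at DOE
sequencing = doubling_time_years(1e5, PERIOD_YEARS)    # sequencing, read as x10^5

for label, years in [
    ("Computing", computing),
    ("Computing at DOE", computing_doe),
    ("Genome sequencing", sequencing),
]:
    print(f"{label}: doubles roughly every {years * 12:.1f} months")
```

Under these assumptions computing doubles roughly every 22 months (about 15 months at DOE), while sequencing output doubles roughly every 4 months.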
Now ubiquitous: e.g., light sources
• 18 orders of magnitude in 5 decades!
• 12 orders of magnitude in 6 decades
Credit: Linda Young
Now ubiquitous: e.g., light sources
Source: Francesco de Carlo
Local flows already exceed those of LHC
[Figure: Argonne data flows in TB/day (estimates). Nodes: external sources, Advanced Photon Source, Argonne Leadership Computing Facility, other sources that remain to be quantified, short-term storage, long-term storage, data analysis. Flow values shown on the diagram: 163, 9, 9, 143, 10, 100, 50, 150, 100.]
Big data demands new analysis models
[Figure: analysis model today compared with the desired model]
Source: Francesco de Carlo
It’s velocity and variety as well as volume
[Figure: diagram linking data types and models. Labels: Proteomics, Phenotypes, Transcriptomics, Genomes, Growth curves, Metabolomics, Assembly, Annotation, Metabolic model, Reconciled model, Integrated model, Regulon prediction, Regulatory model, Phenotype predictions, Flux predictions, Hypotheses, Pathway designs.]
Credit: Chris Henry et al.
Exponentially increasing complexity
• Run experiment
• Collect data
• Move data
• Check data
• Annotate data
• Share data
• Find similar data
• Link to literature
• Analyze data
• Publish data
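The ten steps above can be read as an ordered pipeline that a service could run on the researcher's behalf. The sketch below is only an illustration of that idea: every function and variable name is hypothetical, and each step is a placeholder that records that it ran rather than doing real work.

```python
from typing import Callable, Dict, List

# Step names taken from the slide; everything else here is hypothetical.
STEPS: List[str] = [
    "Run experiment",
    "Collect data",
    "Move data",
    "Check data",
    "Annotate data",
    "Share data",
    "Find similar data",
    "Link to literature",
    "Analyze data",
    "Publish data",
]

Record = Dict[str, object]
Step = Callable[[Record], Record]


def make_step(name: str) -> Step:
    """Return a placeholder step that only appends its name to the history."""
    def step(record: Record) -> Record:
        history = list(record.get("history", []))
        history.append(name)
        return {**record, "history": history}
    return step


def run_pipeline(record: Record, steps: List[Step]) -> Record:
    """Apply each step in order, passing the record from one to the next."""
    for step in steps:
        record = step(record)
    return record


pipeline = [make_step(name) for name in STEPS]
result = run_pipeline({"dataset": "example"}, pipeline)
print(result["history"])
```

In a real system each placeholder would be replaced by a call to a service, and the recorded history would double as a simple provenance log.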
TripIt exemplifies process automation
Me:
• Book flights
• Book hotel
Other services:
• Record flights
• Suggest hotel
• Record hotel
• Get weather
• Prepare maps
• Share info
• Monitor prices
• Monitor flight
Big data requires big process
• Run experiment
• Collect data
• Move data
• Check data
• Annotate data
• Share data
• Find similar data
• Link to literature
• Analyze data
• Publish data

Research IT as a service
• Outsourced, intuitive, integrative
• Secure, performant, reliable
Characterizing big process requirements
In millions of labs worldwide, researchers struggle with massive data, advanced software, complex protocols, burdensome reporting
[Figure: data flow diagram. Labels: Simulation, Telescope, Next-gen genome sequencer, Staging, Ingest, Analysis, Registry, Community Repository, Archive, Mirror.]
Accelerate discovery and innovation by outsourcing difficult tasks
Characterizing big process requirements
Data movement is a frequent challenge
• Between facilities, archives, researchers
• Many files, large data volumes
• With security, reliability, performance
Globus Online: Big process for big data
Data movement as a service
Secure, automated, reliable, high-speed movement, synchronization of many files
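As a toy illustration of what "reliable" movement and "synchronization of many files" can mean, the sketch below copies only files that are missing or changed and verifies each copy by checksum, retrying on mismatch. It is not how Globus Online is implemented: it uses only the Python standard library, works on a local filesystem, and ignores the security and high-speed aspects entirely. The function names are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path
from typing import List


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sync_tree(source: Path, destination: Path, max_attempts: int = 3) -> List[str]:
    """Copy files that are missing or differ, verifying each copy by checksum.

    Returns the relative paths that were copied. Hypothetical helper.
    """
    copied: List[str] = []
    for src in sorted(p for p in source.rglob("*") if p.is_file()):
        relative = src.relative_to(source)
        dst = destination / relative
        expected = sha256_of(src)
        if dst.is_file() and sha256_of(dst) == expected:
            continue  # destination already matches; nothing to move
        dst.parent.mkdir(parents=True, exist_ok=True)
        for _attempt in range(max_attempts):
            shutil.copyfile(src, dst)
            if sha256_of(dst) == expected:
                copied.append(relative.as_posix())
                break
        else:
            raise IOError(f"could not verify {relative} after {max_attempts} attempts")
    return copied


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src_dir, dst_dir = Path(tmp) / "src", Path(tmp) / "dst"
        src_dir.mkdir()
        (src_dir / "run1.dat").write_text("example data")
        print(sync_tree(src_dir, dst_dir))  # first pass copies the file
        print(sync_tree(src_dir, dst_dir))  # second pass finds nothing to do
```

Running the synchronization a second time copies nothing, which is the property that makes it safe to restart after a failure.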




6,000 users
500 M files, 7 PB moved
99.9% availability
Examples of Globus Online in action
•    K. Heitmann (ANL) moves 22 TB of
     cosmology data at 5 Gb/s from LANL to ANL

•    B. Winjum (UCLA) moves 900K-file
     plasma physics datasets from UCLA to NERSC

•    Dan Kozak (Caltech) replicates 1 PB of
     LIGO astronomy data for resilience

•    Supercomputer centers, genome facilities, light
     sources, and universities all recommend it
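For a sense of scale, the first example above can be turned into a transfer time. The sketch assumes decimal units and that the quoted 5 Gb/s is sustained for the whole transfer; the slide states neither, so the result is an estimate only.

```python
# Assumption: decimal (SI) units and a constant rate for the whole transfer.
TB = 1e12  # bytes
Gb = 1e9   # bits


def transfer_hours(size_bytes: float, rate_bits_per_sec: float) -> float:
    """Hours needed to move `size_bytes` at a constant `rate_bits_per_sec`."""
    seconds = size_bytes * 8 / rate_bits_per_sec
    return seconds / 3600


hours = transfer_hours(22 * TB, 5 * Gb)  # 22 TB at 5 Gb/s, from the slide
print(f"22 TB at 5 Gb/s takes about {hours:.1f} hours")
```

Under these assumptions the transfer takes a little under ten hours.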
Automation expands use of networks
[Figure: sizes of transfers, January to June 2012. Vertical axis: bytes transferred, log scale from 1e+00 to 1e+12. Horizontal axis: time, January to July. Circle size proportional to log size (one legend on the slide instead reads log of transfer rate). Colors: Red = NERSC/LBL/ESnet; Green = ORNL/BNL; Blue = ANL; Yellow = FNAL; Grey = Other.]
Need much more than data movement
• Ingest, cataloging, integration
• Sharing, collaboration, annotation
• Identity, groups, security
• Analysis, simulation, visualization
• ...
Earth System Grid: Data movement




•    Outsource data transfer
     –   Client data download
     –   Replication between sites
•    No ESGF client software needed
•    20+ times faster than HTTP

earthsystemgrid.org
Kbase: Identity, group, data movement




kbase.science.energy.gov
Genomics: Data movement and analysis
Data management
• Data sources: sequencing centers, public data, research lab
• Storage; local cluster/cloud
• Globus Online provides high-performance, fault-tolerant, secure file transfer between all data endpoints
Data analysis (Galaxy in cloud)
• Galaxy data libraries
• Galaxy-based workflow management
     –   Globus Online integrated
     –   Web-based UI
     –   Drag-n-drop workflow creation
     –   Easily add new tools
     –   Analytical tools run on scalable computers
Source: Ravi Madduri
Integrating observation and simulation
[Figure: percentage of mapped radar domain in Darwin with returns >10 dBZ over the period 19 to 22 January 2006]
• Retrieve
• Construct structured 4-D atmospheric state (“CAN”)
• Analytics: precipitating storm structures; storm lifecycles; statistical representation of storm scale properties; predictive cloud models
• Compare: cloud properties and precipitation characteristics in large-scale models and cloud-resolving models (e.g., CMIP5 models, GCRM)
Scott Collis
Integrating observation and simulation
• Level 1: PBs
• Level 2: TBs
• Level 3: GBs
Salman Habib, Katrin Heitmann
In summary: Big process for big data

Accelerate discovery and innovation worldwide
by providing research IT as a service
Outsource time-consuming tasks to
• provide large numbers of researchers with
   unprecedented access to powerful tools;
• enable a massive shortening of cycle times in
   time-consuming research processes; and
• reduce research IT costs via economies of scale
Accelerate existing science; enable new science
Thank you!


foster@anl.gov
www.ci.anl.gov
www.mcs.anl.gov
www.globusonline.org
                       www.ci.anl.gov
                       www.ci.uchicago.edu

Mais conteúdo relacionado

Destaque

Internet2 Bio IT 2016 v2
Internet2 Bio IT 2016 v2Internet2 Bio IT 2016 v2
Internet2 Bio IT 2016 v2
Dan Taylor
 
Raskar UIST Keynote 2015 November
Raskar UIST Keynote 2015 NovemberRaskar UIST Keynote 2015 November
Raskar UIST Keynote 2015 November
Camera Culture Group, MIT Media Lab
 
What is SIGGRAPH NEXT? Intro by Ramesh Raskar
What is SIGGRAPH NEXT? Intro by Ramesh RaskarWhat is SIGGRAPH NEXT? Intro by Ramesh Raskar
What is SIGGRAPH NEXT? Intro by Ramesh Raskar
Camera Culture Group, MIT Media Lab
 

Destaque (19)

Jsm madduri-august-2015
Jsm madduri-august-2015Jsm madduri-august-2015
Jsm madduri-august-2015
 
re:Invent 2013-foster-madduri
re:Invent 2013-foster-maddurire:Invent 2013-foster-madduri
re:Invent 2013-foster-madduri
 
Supporting Barack Obama for President
Supporting Barack Obama for PresidentSupporting Barack Obama for President
Supporting Barack Obama for President
 
Internet2 Bio IT 2016 v2
Internet2 Bio IT 2016 v2Internet2 Bio IT 2016 v2
Internet2 Bio IT 2016 v2
 
Big Data and Genomics
Big Data and GenomicsBig Data and Genomics
Big Data and Genomics
 
May 2013 HUG: Apache Sqoop 2 - A next generation of data transfer tools
May 2013 HUG: Apache Sqoop 2 - A next generation of data transfer toolsMay 2013 HUG: Apache Sqoop 2 - A next generation of data transfer tools
May 2013 HUG: Apache Sqoop 2 - A next generation of data transfer tools
 
Raskar UIST Keynote 2015 November
Raskar UIST Keynote 2015 NovemberRaskar UIST Keynote 2015 November
Raskar UIST Keynote 2015 November
 
Coded Photography - Ramesh Raskar
Coded Photography - Ramesh RaskarCoded Photography - Ramesh Raskar
Coded Photography - Ramesh Raskar
 
Leap Motion Development (Rohan Puri)
Leap Motion Development (Rohan Puri)Leap Motion Development (Rohan Puri)
Leap Motion Development (Rohan Puri)
 
What is Media in MIT Media Lab, Why 'Camera Culture'
What is Media in MIT Media Lab, Why 'Camera Culture'What is Media in MIT Media Lab, Why 'Camera Culture'
What is Media in MIT Media Lab, Why 'Camera Culture'
 
What is SIGGRAPH NEXT? Intro by Ramesh Raskar
What is SIGGRAPH NEXT? Intro by Ramesh RaskarWhat is SIGGRAPH NEXT? Intro by Ramesh Raskar
What is SIGGRAPH NEXT? Intro by Ramesh Raskar
 
Globus Genomics: Democratizing NGS Analysis
Globus Genomics: Democratizing NGS AnalysisGlobus Genomics: Democratizing NGS Analysis
Globus Genomics: Democratizing NGS Analysis
 
Google Glass Breakdown
Google Glass BreakdownGoogle Glass Breakdown
Google Glass Breakdown
 
Stereo and 3D Displays - Matt Hirsch
Stereo and 3D Displays - Matt HirschStereo and 3D Displays - Matt Hirsch
Stereo and 3D Displays - Matt Hirsch
 
Multiview Imaging HW Overview
Multiview Imaging HW OverviewMultiview Imaging HW Overview
Multiview Imaging HW Overview
 
基因大数据分析入门 Slideshare
基因大数据分析入门   Slideshare基因大数据分析入门   Slideshare
基因大数据分析入门 Slideshare
 
Deep two-photon brain imaging with a red-shifted fluorometric Ca2+ indicator
Deep two-photon brain imaging with a red-shifted fluorometric Ca2+ indicatorDeep two-photon brain imaging with a red-shifted fluorometric Ca2+ indicator
Deep two-photon brain imaging with a red-shifted fluorometric Ca2+ indicator
 
Focused Ultrasound Neuromodulation
Focused Ultrasound NeuromodulationFocused Ultrasound Neuromodulation
Focused Ultrasound Neuromodulation
 
Introduction to Camera Challenges - Ramesh Raskar
Introduction to Camera Challenges - Ramesh RaskarIntroduction to Camera Challenges - Ramesh Raskar
Introduction to Camera Challenges - Ramesh Raskar
 

Semelhante a Big Process for Big Data

Aaas Data Intensive Science And Grid
Aaas Data Intensive Science And GridAaas Data Intensive Science And Grid
Aaas Data Intensive Science And Grid
Ian Foster
 
Services For Science April 2009
Services For Science April 2009Services For Science April 2009
Services For Science April 2009
Ian Foster
 
Running Hot October 2008
Running Hot October 2008Running Hot October 2008
Running Hot October 2008
Ian Foster
 
SOLE: Linking Research Papers with Science Objects
SOLE: Linking Research Papers with Science ObjectsSOLE: Linking Research Papers with Science Objects
SOLE: Linking Research Papers with Science Objects
Tanu Malik
 
Utility HPC: Right Systems, Right Scale, Right Science
Utility HPC: Right Systems, Right Scale, Right ScienceUtility HPC: Right Systems, Right Scale, Right Science
Utility HPC: Right Systems, Right Scale, Right Science
Chef Software, Inc.
 

Semelhante a Big Process for Big Data (20)

Mexico talk foster march 2012
Mexico talk foster march 2012Mexico talk foster march 2012
Mexico talk foster march 2012
 
Rethinking how we provide science IT in an era of massive data but modest bud...
Rethinking how we provide science IT in an era of massive data but modest bud...Rethinking how we provide science IT in an era of massive data but modest bud...
Rethinking how we provide science IT in an era of massive data but modest bud...
 
Multiscale Modeling
Multiscale ModelingMultiscale Modeling
Multiscale Modeling
 
Aaas Data Intensive Science And Grid
Aaas Data Intensive Science And GridAaas Data Intensive Science And Grid
Aaas Data Intensive Science And Grid
 
Services For Science April 2009
Services For Science April 2009Services For Science April 2009
Services For Science April 2009
 
Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling
Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data HandlingScott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling
Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling
 
Opportunities for X-Ray science in future computing architectures
Opportunities for X-Ray science in future computing architecturesOpportunities for X-Ray science in future computing architectures
Opportunities for X-Ray science in future computing architectures
 
Trip Report Seattle
Trip Report SeattleTrip Report Seattle
Trip Report Seattle
 
2015 illinois-talk
2015 illinois-talk2015 illinois-talk
2015 illinois-talk
 
Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and Knowledge
 
The Transformation of Systems Biology Into A Large Data Science
The Transformation of Systems Biology Into A Large Data ScienceThe Transformation of Systems Biology Into A Large Data Science
The Transformation of Systems Biology Into A Large Data Science
 
Cyberinfrastructure Day 2010: Applications in Biocomputing
Cyberinfrastructure Day 2010: Applications in BiocomputingCyberinfrastructure Day 2010: Applications in Biocomputing
Cyberinfrastructure Day 2010: Applications in Biocomputing
 
Running Hot October 2008
Running Hot October 2008Running Hot October 2008
Running Hot October 2008
 
Research Automation for Data-Driven Discovery
Research Automationfor Data-Driven DiscoveryResearch Automationfor Data-Driven Discovery
Research Automation for Data-Driven Discovery
 
Research Automation for Data-Driven Discovery
Research Automation for Data-Driven DiscoveryResearch Automation for Data-Driven Discovery
Research Automation for Data-Driven Discovery
 
SOLE: Linking Research Papers with Science Objects
SOLE: Linking Research Papers with Science ObjectsSOLE: Linking Research Papers with Science Objects
SOLE: Linking Research Papers with Science Objects
 
An Overview of Bionimbus (March 2010)
An Overview of Bionimbus (March 2010)An Overview of Bionimbus (March 2010)
An Overview of Bionimbus (March 2010)
 
Utility HPC: Right Systems, Right Scale, Right Science
Utility HPC: Right Systems, Right Scale, Right ScienceUtility HPC: Right Systems, Right Scale, Right Science
Utility HPC: Right Systems, Right Scale, Right Science
 
Bionimbus Cambridge Workshop (3-28-11, v7)
Bionimbus Cambridge Workshop (3-28-11, v7)Bionimbus Cambridge Workshop (3-28-11, v7)
Bionimbus Cambridge Workshop (3-28-11, v7)
 
Seth A. Faith - Building a PaaS for Forensic DNA analysis using AWS
 Seth A. Faith - Building a PaaS for Forensic DNA analysis using AWS Seth A. Faith - Building a PaaS for Forensic DNA analysis using AWS
Seth A. Faith - Building a PaaS for Forensic DNA analysis using AWS
 

Mais de Ian Foster

Foster CRA March 2022.pptx
Foster CRA March 2022.pptxFoster CRA March 2022.pptx
Foster CRA March 2022.pptx
Ian Foster
 
Data Automation at Light Sources
Data Automation at Light SourcesData Automation at Light Sources
Data Automation at Light Sources
Ian Foster
 

Mais de Ian Foster (20)

Global Services for Global Science March 2023.pptx
Global Services for Global Science March 2023.pptxGlobal Services for Global Science March 2023.pptx
Global Services for Global Science March 2023.pptx
 
The Earth System Grid Federation: Origins, Current State, Evolution
The Earth System Grid Federation: Origins, Current State, EvolutionThe Earth System Grid Federation: Origins, Current State, Evolution
The Earth System Grid Federation: Origins, Current State, Evolution
 
Better Information Faster: Programming the Continuum
Better Information Faster: Programming the ContinuumBetter Information Faster: Programming the Continuum
Better Information Faster: Programming the Continuum
 
ESnet6 and Smart Instruments
ESnet6 and Smart InstrumentsESnet6 and Smart Instruments
ESnet6 and Smart Instruments
 
Linking Scientific Instruments and Computation
Linking Scientific Instruments and ComputationLinking Scientific Instruments and Computation
Linking Scientific Instruments and Computation
 
A Global Research Data Platform: How Globus Services Enable Scientific Discovery
A Global Research Data Platform: How Globus Services Enable Scientific DiscoveryA Global Research Data Platform: How Globus Services Enable Scientific Discovery
A Global Research Data Platform: How Globus Services Enable Scientific Discovery
 
Foster CRA March 2022.pptx
Foster CRA March 2022.pptxFoster CRA March 2022.pptx
Foster CRA March 2022.pptx
 
Big Data, Big Computing, AI, and Environmental Science
Big Data, Big Computing, AI, and Environmental ScienceBig Data, Big Computing, AI, and Environmental Science
Big Data, Big Computing, AI, and Environmental Science
 
AI at Scale for Materials and Chemistry
AI at Scale for Materials and ChemistryAI at Scale for Materials and Chemistry
AI at Scale for Materials and Chemistry
 
Coding the Continuum
Coding the ContinuumCoding the Continuum
Coding the Continuum
 
Data Tribology: Overcoming Data Friction with Cloud Automation
Data Tribology: Overcoming Data Friction with Cloud AutomationData Tribology: Overcoming Data Friction with Cloud Automation
Data Tribology: Overcoming Data Friction with Cloud Automation
 
Scaling collaborative data science with Globus and Jupyter
Scaling collaborative data science with Globus and JupyterScaling collaborative data science with Globus and Jupyter
Scaling collaborative data science with Globus and Jupyter
 
Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for Science
 
Data Automation at Light Sources
Data Automation at Light SourcesData Automation at Light Sources
Data Automation at Light Sources
 
Team Argon Summary
Team Argon SummaryTeam Argon Summary
Team Argon Summary
 
Thoughts on interoperability
Thoughts on interoperabilityThoughts on interoperability
Thoughts on interoperability
 
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
 
NIH Data Commons Architecture Ideas
NIH Data Commons Architecture IdeasNIH Data Commons Architecture Ideas
NIH Data Commons Architecture Ideas
 
Going Smart and Deep on Materials at ALCF
Going Smart and Deep on Materials at ALCFGoing Smart and Deep on Materials at ALCF
Going Smart and Deep on Materials at ALCF
 
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
 

Último

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Big Process for Big Data

  • 1. Big process for big data: Process automation for data-driven science. Ian Foster, Computation Institute; Mathematics and Computer Science Division; Department of Computer Science; Argonne National Laboratory & The University of Chicago. Talk at DOE Big Data Technology Summit, Washington DC, October 9, 2012. www.ci.anl.gov, www.ci.uchicago.edu
  • 2. Big data is not new at DOE. Large Hadron Collider: 15 PB/year; 173 TB/day; 500 MB/sec. LHC Computing Grid (10+ GB/sec). Higgs discovery “only possible because of the extraordinary achievements of … grid computing” (Rolf Heuer, CERN DG)
  • 3. But it is now ubiquitous: e.g., genomics (Kahn, Science, 331 (6018): 728-729)
  • 4. But it is now ubiquitous: e.g., genomics. In 6 years: computing x10 (x30 at DOE) (Kahn, Science, 331 (6018): 728-729)
  • 5. But it is now ubiquitous: e.g., genomics. In 6 years: computing x10 (x30 at DOE); genome sequencing x10^5 (Kahn, Science, 331 (6018): 728-729)
  • 6. Now ubiquitous: e.g., light sources. 18 orders of magnitude in 5 decades! 12 orders of magnitude in 6 decades (Credit: Linda Young)
  • 7. Now ubiquitous: e.g., light sources (Source: Francesco de Carlo)
  • 8. Local flows already exceed those of LHC. Argonne data flows in TB/day (estimates). [Flow diagram: External data sources; Advanced Photon Source; Argonne Leadership Computing Facility; Short-term storage; Long-term storage; Data analysis; Other sources that remain to be quantified (shown twice). Flow labels: 163, 9, 9, 143, 10, 100, 50, 150, 100]
  • 9. Big data demands new analysis models. [Panels: Today; Desired] (Source: Francesco de Carlo)
  • 10. It’s velocity and variety as well as volume. [Diagram labels: Proteomics; Phenotypes; Transcriptomics; Genomes; Growth curves; Metabolomics; Assembly; Annotation; Metabolic model; Reconciled model; Integrated model; Regulatory model; Phenotype predictions; Flux predictions; Regulon prediction; Hypotheses; Pathway designs] (Credit: Chris Henry et al.)
  • 11. Exponentially increasing complexity: Run experiment; Collect data; Move data; Check data; Annotate data; Share data; Find similar data; Link to literature; Analyze data; Publish data
  • 12. [No text on this slide]
  • 13. Tripit exemplifies process automation. [Columns: Me; Other services] Steps: Book flights; Record flights; Suggest hotel; Book hotel; Record hotel; Get weather; Prepare maps; Share info; Monitor prices; Monitor flight
  • 14. Big data requires big process. Steps: Run experiment; Collect data; Move data; Check data; Annotate data; Share data; Find similar data; Link to literature; Analyze data; Publish data. Research IT as a service: Outsourced; Intuitive; Integrative; Secure; Performant; Reliable
  • 15. Characterizing big process requirements. In millions of labs worldwide, researchers struggle with massive data, advanced software, complex protocols, burdensome reporting. [Diagram labels: Telescope; Simulation; Next-gen genome sequencer; Staging; Ingest; Registry; Community Repository; Analysis; Archive; Mirror] Accelerate discovery and innovation by outsourcing difficult tasks
  • 16. Characterizing big process requirements. In millions of labs worldwide, researchers struggle with massive data, advanced software, complex protocols, burdensome reporting. Data movement is a frequent challenge: between facilities, archives, researchers; many files, large data volumes; with security, reliability, performance. [Diagram labels: Telescope; Simulation; Next-gen genome sequencer; Staging; Ingest; Registry; Community Repository; Analysis; Archive; Mirror] Accelerate discovery and innovation by outsourcing difficult tasks
  • 17. Globus Online: Big process for big data. Data movement as a service: secure, automated, reliable, high-speed movement, synchronization of many files
  • 18. 6,000 users; 500 M files, 7 PB moved; 99.9% availability
  • 19. Examples of Globus Online in action: K. Heitmann (ANL) moves 22 TB cosmology data at 5 Gb/s, LANL → ANL; B. Winjum (UCLA) moves 900K-file plasma physics datasets, UCLA - NERSC; Dan Kozak (Caltech) replicates 1 PB LIGO astronomy data for resilience; supercomputer centers, genome facilities, light sources, universities all recommend it
  • 20. Automation expands use of networks. Caption: Sizes of transfers Jan-Jun; size of circles prop. to log size; Red=NERSC/LBL/ESnet; Green=ORNL/BNL; Blue=ANL; Yellow=FNAL; Grey=Other. [Plot: Transfers Jan-June 2012, size (bytes) vs time; y-axis bytes_xfered, 1e+00 to 1e+12; x-axis Jan to Jul; size ∝ log(transfer rate); legend: Red: NERSC/LBL/ESnet; Green: ORNL, LBL; Blue: ANL; Yellow: FNAL; Grey: Other]
  • 21. Need much more than data movement. In millions of labs worldwide, researchers struggle with massive data, advanced software, complex protocols, burdensome reporting. [Diagram labels: Telescope; Simulation; Next-gen genome sequencer; Staging; Ingest; Registry; Community Repository; Analysis; Archive; Mirror] Accelerate discovery and innovation by outsourcing difficult tasks
  • 22. Need much more than data movement: Ingest, cataloging, integration; Sharing, collaboration, annotation; Identity, groups, security; Analysis, simulation, visualization; ... [Diagram labels: Next-gen genome sequencer; Staging; Ingest; Registry; Community Repository; Analysis; Archive; Mirror] Accelerate discovery and innovation by outsourcing difficult tasks
  • 23. Earth System Grid: Data movement. Outsource data transfer (client data download; replication between sites); no ESGF client software needed; 20+ times faster than HTTP (earthsystemgrid.org)
  • 24. Kbase: Identity, group, data movement (kbase.science.energy.gov)
  • 25. Genomics: Data movement and analysis. Galaxy-based workflow management: Globus Online integrated; web-based UI; drag-n-drop workflow creation; easily add new tools; analytical tools run on scalable computers. Globus Online provides high-performance, fault-tolerant, secure file transfer between all data endpoints. [Diagram labels: Public Data; Sequencing Centers; Seq Center; Research Lab; Storage; Local Cluster/Cloud; Galaxy data libraries; Galaxy in Cloud; Data management; Data analysis] (Source: Ravi Madduri)
  • 26. Integrating observation and simulation. Cloud properties and precipitation characteristics in large-scale models and cloud-resolving models (e.g., CMIP5 models, GCRM). Construct structured 4-D atmospheric state (“CAN”). Precipitating storm structures; storm lifecycles; statistical representation of storm scale properties; predictive cloud models. [Figure: percentage of mapped radar domain in Darwin with returns >10 dBz over the period 19 to 22 January 2006. Diagram labels: 1; 2; 3; Retrieve; Compare; Analytics (twice)] (Scott Collis)
  • 27. Integrating observation and simulation. Level 1: PBs; Level 2: TBs; Level 3: GBs (Salman Habib, Katrin Heitmann)
  • 28. Integrating observation and simulation (Salman Habib, Katrin Heitmann)
  • 29. In summary: Big process for big data. Accelerate discovery and innovation worldwide by providing research IT as a service. Outsource time-consuming tasks to: provide large numbers of researchers with unprecedented access to powerful tools; enable a massive shortening of cycle times in time-consuming research processes; and reduce research IT costs via economies of scale. Accelerate existing science; enable new science

Editor's Notes

  1. We will hear numerous talks today on issues relating to the management and analysis of big data: data that stresses our capabilities in terms of its volume, velocity, variety, or variability. I’d like to spend my time speaking to the importance of the related problems of process. I’ll do so from the perspective of the sciences, because that is where I have the most experience. As data volumes increase exponentially, the individual’s ability to operate on that data has to improve exponentially too, if big data is to be an opportunity and not a curse. This is especially true as the number of data sources grows rapidly, so that even the smallest lab (or company) is exposed to the data deluge.
  2. A single next-generation sequencing machine can generate 40 Gbase/day. Gap of >1000, AND many more systems as people jump on the bandwagon. Meanwhile, other resources [money, people] stay flat.
  3. Storage statistics synthesis
  4. See http://en.wikipedia.org/wiki/File:LLNL_US_Energy_Flow_2009.png for inspiration. Data rates are in TB/day; line thicknesses are 5 TB/day/pt. Numbers: APS is 163 TB/day, preliminary data from de Carlo. ALCF is 150 TB/day, a number given in Carns et al., but presumably it is meant there to be input *and* output?? External sources is 8.6 TB/day (100 MB/s), a WAG. Others are made up; they are just WAGs. By comparison: all observational and simulation data from LHC is 15 PB/yr (Wikipedia): 475 MB/s.
  5. http://labmed.ascpjournals.org/content/40/1/5/F7.expansion.html Old tools (PCs, spreadsheets, etc.) can’t handle these issues effectively.
  6. Aside: Another area in which I encounter substantial and growing complexity is travel. This being consumer space, there’s an app for that! A “software as a service” (aka cloud) app.
  7. Small labs …. Potential solution? Outsource complex, time-consuming, mundane activities to third parties, to software-as-a-service (SaaS) providers, to a “research cloud” focused on process automation. Question: Which steps can we outsource in that way?
  8. https://plasmasim.physics.ucla.edu/research/winjum
  9. Automated ingest. Cataloging.
  10. Diagnosis. Provenance.
  11. Geophysical variables: Wind speeds, rainfall rate, temperatures, liquid water content, raindrop shape properties, etc.
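
The unit conversions quoted in note 4 (8.6 TB/day as about 100 MB/s, and 15 PB/yr as about 475 MB/s) can be checked with a short calculation. The following is a minimal sketch, assuming decimal SI prefixes (1 TB = 10^12 bytes, 1 PB = 10^15 bytes) and a 365-day year; the function names are illustrative and do not come from the talk.

```python
# Sketch: sanity-check the data-rate conversions quoted in note 4.
# Assumptions (not stated in the talk): decimal SI prefixes, 365-day year.

SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY


def tb_per_day_to_mb_per_s(tb_per_day: float) -> float:
    """Convert a sustained rate in TB/day to MB/s."""
    bytes_per_second = tb_per_day * 1e12 / SECONDS_PER_DAY
    return bytes_per_second / 1e6


def pb_per_year_to_mb_per_s(pb_per_year: float) -> float:
    """Convert a sustained rate in PB/year to MB/s."""
    bytes_per_second = pb_per_year * 1e15 / SECONDS_PER_YEAR
    return bytes_per_second / 1e6


if __name__ == "__main__":
    # Figures taken from note 4 of the talk.
    print(f"External sources, 8.6 TB/day: {tb_per_day_to_mb_per_s(8.6):.1f} MB/s")
    print(f"APS, 163 TB/day: {tb_per_day_to_mb_per_s(163):.1f} MB/s")
    print(f"LHC, 15 PB/yr: {pb_per_year_to_mb_per_s(15):.1f} MB/s")
```

Under these assumptions the external-sources and LHC figures come out at roughly 99.5 MB/s and 475.6 MB/s, consistent with the rounded values in the note, while the APS estimate of 163 TB/day corresponds to a sustained rate several times higher than the LHC figure.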