Online Workflow Management and Performance Analysis with Stampede

Dan Gunter (1), Taghrid Samak (1), Monte Goode (1),
Ewa Deelman (2), Gaurang Mehta (2), Fabio Silva (2), Karan Vahi (2),
Christopher Brooks (3),
Priscilla Moraes (4), Martin Swany (4)

(1) Lawrence Berkeley National Laboratory
(2) University of Southern California, Information Sciences Institute
(3) University of San Francisco
(4) University of Delaware

Background



  CNSM 2011, October
  24-28, Paris, France   2
Goal: Predict behavior of
running scientific workflows
—  Primarily failures
  —    Is a given workflow going to “fail”?
  —    Are specific resources causing problems?
  —    Which application sub-components are failing?
  —    Is the data staging a problem?

—  In large workflows, some failures are normal
  —  This work is about learning, from known problems, which
        patterns of failures are unusual and require adaptation

—  Do all of this as generally as possible: Can we provide a
  solution that can apply to all workflow engines?

Approach

—  Model the monitoring data from running workflows
—  Collect all the data in real-time
—  Run analysis, also in real-time, on the collected
  data
  —  map low-level failures to application-level
     characteristics

—  Feed back analysis to user, workflow engine



Scientific Applications
—  Montage: Astronomy
—  Epigenome: Bioinformatics
—  LIGO: Astrophysics
—  CyberShake: Geophysics

Domain: Large Scientific Workflows
  SCEC-2009: Millions of tasks completed per day

  [Figure annotation: Radius = 11 million]
Workflow structure




Basic terms and concepts


[Diagram labels: Workflow, Resources, Workflow Management System, Execution, Success, Fail]
Base technologies

—  Workflow management systems
  —  Pegasus
  —  www.pegasus.isi.edu


—  Monitoring and data analysis
  —  NetLogger
  —  www.netlogger.lbl.gov




Data Model



Data Model Goals

—  Be widely applicable: there are many workflow engines out
  there that could benefit.
—  Provide everything we need for Pegasus workflows




10/27/11
Abstract and Executable Workflows
—  Workflows start as a resource-independent
  statement of computations, input and output data,
  and dependencies
  —  This is called the Abstract Workflow (AW)
—  For each workflow run, Pegasus-WMS plans the
  workflow, adding helper tasks and clustering small
  computations together
  —  This is called the Executable Workflow (EW)
—  Note: Most of the logs are from the EW but the
  user really only knows the AW.

Additional Terminology
—  Workflow: Container for an entire computation
—  Sub-workflow: Workflow that is contained in another workflow
—  Task: Representation of a computation in the AW
—  Job: Node in the EW
  —  May represent part of a task (e.g., a stage-in/out), one task,
     or many tasks
—  Job instance: A job scheduled or run by the underlying system
  —  Due to retries, there may be multiple job instances per job
—  Invocation: One or more executables for a job instance
  —  Invocations are the instantiation of tasks, whereas jobs are an
     intermediate abstraction for use by the planning and
     scheduling sub-systems
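
The relationships between these terms can be sketched as plain data classes. This is only an illustration of the terminology above, not the Stampede schema: every class name, field name, and value below is a hypothetical choice made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Invocation:
    """One executable run as part of a job instance (hypothetical fields)."""
    executable: str
    exit_code: int


@dataclass
class JobInstance:
    """One scheduled attempt at running a job."""
    attempt: int
    invocations: List[Invocation] = field(default_factory=list)

    def succeeded(self) -> bool:
        # An attempt counts as successful only if it ran something
        # and every invocation exited with code 0.
        return bool(self.invocations) and all(
            inv.exit_code == 0 for inv in self.invocations
        )


@dataclass
class Job:
    """Node in the executable workflow; may cover zero, one, or many tasks."""
    name: str
    task_ids: List[str] = field(default_factory=list)
    instances: List[JobInstance] = field(default_factory=list)

    def retries(self) -> int:
        # Every instance after the first is a retry.
        return max(len(self.instances) - 1, 0)


# A clustered job covering two tasks: the first attempt fails, the retry succeeds.
clustered = Job(
    name="merged_job_1",
    task_ids=["task_a", "task_b"],
    instances=[
        JobInstance(1, [Invocation("task_a.exe", 0), Invocation("task_b.exe", 1)]),
        JobInstance(2, [Invocation("task_a.exe", 0), Invocation("task_b.exe", 0)]),
    ],
)
print(clustered.retries(), clustered.instances[-1].succeeded())
```

Keeping the retry count on the job and the exit codes on the invocations mirrors the point made above: several job instances can exist for one job, and the invocations are what actually ran.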


Denormalized Data Model
—  Stream of timestamped “events”:
  —  unique, hierarchical name
  —  unique identifiers (workflow, job, etc.)
  —  values and metadata
—  Used NETCONF YANG data-modeling language, keyed on
  event name [RFCs: 6020 6021 (6087)]
  —  YANG schema (see bit.ly/nQfPd1) documents and validates
     each log event
Snippet of schema:
container stampede.xwf.start {
  description "Start of executable workflow";
  uses base-event;
  leaf restart_count {
    type uint32;
    description "Number of times workflow was restarted (due to failures)";
  }
}
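
As a rough illustration of what "validates each log event" could mean, the sketch below checks an event dictionary against a hand-written mirror of the snippet above. It does not parse YANG. The `SCHEMA` table, the `event` key, and the function names are assumptions made for this example.

```python
UINT32_MAX = 2**32 - 1

# Hypothetical, hand-written mirror of the schema snippet: event name -> leaf types.
SCHEMA = {"stampede.xwf.start": {"restart_count": "uint32"}}


def is_uint32(value) -> bool:
    """True for integers in the YANG uint32 range (bool is excluded on purpose)."""
    return (
        isinstance(value, int)
        and not isinstance(value, bool)
        and 0 <= value <= UINT32_MAX
    )


def validate_event(event: dict) -> list:
    """Return a list of problems found in the event; an empty list means it passed."""
    name = event.get("event")
    if name not in SCHEMA:
        return [f"unknown event name: {name!r}"]
    errors = []
    for leaf, leaf_type in SCHEMA[name].items():
        # Only leaves that are present are type-checked in this sketch.
        if leaf in event and leaf_type == "uint32" and not is_uint32(event[leaf]):
            errors.append(f"{leaf} must be a uint32, got {event[leaf]!r}")
    return errors


print(validate_event({"event": "stampede.xwf.start", "restart_count": 2}))
print(validate_event({"event": "stampede.xwf.start", "restart_count": -1}))
```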


Relational data model
[Diagram: tables grouped under Abstract Workflow (AW), Executable Workflow (EW), and AW and EW]
—  task_edge: Task parent and child
—  task: Task
—  job: Job
—  job_edge: Job parent and child
—  job_instance: Job Instance
—  jobstate: Job status
—  invocation: Invocation
—  workflow: Workflow
—  workflow_state: Workflow status
Infrastructure



Infrastructure overview


[Diagram: Raw logs -> Normalized logs -> Query / Subscribe]
Detailed data flow

[Diagram labels: Pegasus, NetLogger, Log collection and normalization, Real-time analysis, Relational archive, Failure detection]
Message bus usage
—  BP log events are published to an AMQP exchange
—  Routing key = event name
—  The exchange feeds one or more queues
—  Analysis clients subscribe and receive data from the queues
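
Because the routing key is the hierarchical event name, a client can select events by name pattern. The sketch below simulates that selection in plain Python, assuming the exchange is an AMQP topic exchange (the slide does not say which exchange type is used). In a topic binding, `*` matches exactly one dot-separated word and `#` matches zero or more. The queue names and binding keys are hypothetical.

```python
def topic_matches(binding_key: str, routing_key: str) -> bool:
    """AMQP topic-style matching: '*' = exactly one word, '#' = zero or more words."""

    def match(pattern, words):
        if not pattern:
            return not words
        head, rest = pattern[0], pattern[1:]
        if head == "#":
            # '#' may absorb any number of leading words, including none.
            return any(match(rest, words[i:]) for i in range(len(words) + 1))
        if not words:
            return False
        if head == "*" or head == words[0]:
            return match(rest, words[1:])
        return False

    return match(binding_key.split("."), routing_key.split("."))


def route(bindings: dict, event_name: str) -> list:
    """Return the queues whose binding key matches the event name."""
    return [queue for queue, key in bindings.items() if topic_matches(key, event_name)]


# Hypothetical queues and binding keys for three analysis clients.
bindings = {
    "xwf_only": "stampede.xwf.*",
    "all_stampede": "stampede.#",
    "unrelated": "other.#",
}
print(route(bindings, "stampede.xwf.start"))
```

With these bindings, the `stampede.xwf.start` event from the schema snippet reaches the first two queues and not the third.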
Analysis



Experimental Dataset summary

Application   Workflows       Jobs      Tasks      Edges
CyberShake          881    288,668    577,330  1,245,845
Periodograms         45     80,158  1,894,921     80,113
Epigenome            46     10,059     29,837     23,425
Montage              76     56,018    613,107    287,146
Broadband            66     44,182    104,275    141,922
LIGO                 26      2,116      2,141      6,203
Total             1,140    481,201  3,221,611  1,784,654

Workflow clustering

—  Features collected for each workflow run
  —    Successful jobs
  —    Failed jobs
  —    Success duration
  —    Fail duration

—  Offline clustering on historical data
  —  Algorithm: k-means
—  Online analysis classifies workflows according to
  nearest cluster
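
The two steps above (offline k-means, then online nearest-cluster classification) can be sketched in a few lines of plain Python. This is a minimal illustration, not the analysis code used in this work: the feature values are made up, the initial centroids are passed in explicitly so the run is repeatable, and no feature scaling is applied.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, initial_centroids, iterations=50):
    """Lloyd's algorithm: assign points to the nearest centroid, then recompute means."""
    centroids = list(initial_centroids)
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        updated = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
        if updated == centroids:  # assignments are stable
            break
        centroids = updated
    return centroids


# Made-up historical runs:
# (successful jobs, failed jobs, success duration, fail duration)
history = [
    (95, 5, 900, 40),
    (90, 10, 880, 60),
    (98, 2, 950, 10),
    (20, 80, 150, 700),
    (10, 90, 100, 820),
    (15, 85, 120, 760),
]

# Offline step: cluster the historical data.
centroids = kmeans(history, initial_centroids=[history[0], history[3]])

# Online step: classify a running workflow by its nearest cluster.
running = (12, 70, 90, 650)
print(nearest(running, centroids))
```

In practice the features would likely need scaling before clustering, since job counts and durations are measured in different units.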


“High Failure” Workflows
            (HFW)
—  The workflow engine keeps retrying workflows until
  they complete or time out

—  But in the experimental logs, workflows are never
  marked as “failed”
  —  Aside: this is fixed in the newest version
—  Therefore, we use a simple heuristic for identifying
  workflows as problematic:
  —  HFW means: > 50% of jobs failed
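
The heuristic is a single comparison. A minimal sketch follows; the function name and the choice to treat a workflow with no jobs as not high-failure are assumptions made for this example.

```python
def is_high_failure(failed_jobs: int, total_jobs: int) -> bool:
    """HFW heuristic: strictly more than 50% of the workflow's jobs failed."""
    if total_jobs <= 0:
        return False  # assumption: a workflow with no jobs is not flagged
    return failed_jobs / total_jobs > 0.5


print(is_high_failure(512, 905))  # more than half of the jobs failed
print(is_high_failure(5, 10))     # exactly half does not qualify
```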



HFW failure patterns
Montage application

[Figure: HFW failure patterns for the Montage application]
—  X-axis is normalized workflow execution time
—  Y-axis shows the percent of total job failures for this workflow, so far
—  Legend shows, for each workflow, jobs failed/jobs total
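
A curve with the axes described above can be computed from the timestamps of a workflow's job failures. The sketch below is one way to do it; the function name and the input values are hypothetical.

```python
def failure_curve(failure_times, start, end):
    """Return (normalized time, percent of total failures so far) points.

    failure_times: timestamps of job failures within one workflow run
    start, end:    start and end time of the run (end > start)
    """
    total = len(failure_times)
    span = end - start
    points = []
    for seen, t in enumerate(sorted(failure_times), start=1):
        points.append(((t - start) / span, 100.0 * seen / total))
    return points


# Made-up run lasting 100 time units with four failures.
for x, y in failure_curve([40, 10, 80, 20], start=0, end=100):
    print(x, y)
```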
More HFW Failure Patterns
[Figure panels: Epigenome, Broadband, Montage, CyberShake]
Offline clustering
[Figure: workflow clusters projected onto the first 2 principal components (x-axis: Component 1, y-axis: Component 2). Annotated regions: Epigenome, high-failure workflow cluster, other 3 clusters.]
Online classification
[Figure: workflow classification over time. X-axis: Lifetime %; y-axis: Class (1 to 4). Annotations: "High-failure workflow class", "Doesn't converge".]
Workflows (legend): 21:512/905, 24:28/29, 25:28/29, 27:4/4, 33:28/30, 41:64/89
Anomaly detection
Montage application

[Figure: cumulative distribution of failures per time window. X-axis: total number of failures; y-axis (Cumulative Percent): proportion of time-windows experiencing that number of failures or less. One curve is annotated "Anomalous! See Slide #24".]
Workflows (legend): 46:281/496, 48:62/65, 49:44/73, 50:36/65, 51:22/37, 52:38/51, 53:42/57, 54:32/48
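
The quantity on the y-axis is an empirical cumulative distribution over time windows. A minimal sketch of how such a curve could be computed follows; the window size, the failure times, and the function names are made up for the example. A workflow whose curve rises much more slowly than the others has many windows with high failure counts.

```python
import math
from collections import Counter


def failures_per_window(failure_times, window, end):
    """Count failures in consecutive windows of equal length covering [0, end)."""
    n_windows = int(math.ceil(end / window))
    counts = [0] * n_windows
    for t in failure_times:
        # Clamp so a failure exactly at 'end' falls into the last window.
        counts[min(int(t // window), n_windows - 1)] += 1
    return counts


def ecdf(counts):
    """Return (failures, proportion of windows with that many failures or less)."""
    tally = Counter(counts)
    seen = 0
    points = []
    for value in sorted(tally):
        seen += tally[value]
        points.append((value, seen / len(counts)))
    return points


counts = failures_per_window([1, 2, 11, 25], window=10, end=30)
print(counts)
print(ecdf([0, 0, 1, 3]))
```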
System performance

[Figure: median queries per minute (log10 scale, 10^0 to 10^4) by query type, one panel per application: broadband, cybershake, epigenome, ligo, montage, periodograms]
—  Bars show the rate for each type of query
—  Each panel is an application
—  Dashed black lines are median arrival rate for the application
Query types: 01-JobsTot, 02-JobsState, 03-JobsType, 04-JobsHost, 05-TimeTot, 06-TimeState, 07-TimeType, 08-TimeHost, 09-JobDelay, 10-WfSumm, 11-HostSumm
Summary
—  Real-time failure prediction for scientific workflows
  is a challenging but important task

—  Unsupervised learning can be used to model high-
  level workflow failures from historical data

—  High failure classes of workflows can be predicted
  in real-time with high accuracy

—  Future directions
  —  Analysis; root-cause investigation
  —  System; notifications and updates
  —  Working with data from other workflow systems


Thank you!
   For more information, visit the Stampede wiki at:
https://confluence.pegasus.isi.edu/display/stampede/
Extra slides..



Scalability




Pegasus

—  Maps from abstract to concrete workflow
  —  Algorithmic and AI-based techniques

—  Automatically locates physical locations for both
  workflow components and data

—  Finds appropriate resources to execute
—  Reuses existing data products where applicable
—  Publishes newly derived data products
  —  Provides provenance information

NetLogger

—  Logging Methodology
  —  Timestamped, named messages at the start and end
    of significant events, with additional identifiers and
    metadata in a standard line-oriented ASCII format (Best
    Practices, or BP)
    —  APIs are provided, including in-memory log aggregation for
      high-frequency events, but message generation is often
      best done within an existing framework

—  Logging and Analysis Tools
  —  Parse many existing formats to BP
  —  Load BP into message bus, MySQL, MongoDB, etc.
  —  Generate profiles, graphs, and CSV from BP data
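
To make the line-oriented format concrete, here is a small parsing sketch. It assumes a BP line is a sequence of whitespace-separated name=value pairs, with double quotes around values that contain spaces; the sample line and its field names other than the event name are invented for this example, and this is not the NetLogger parser.

```python
import shlex


def parse_bp(line: str) -> dict:
    """Parse one log line of name=value pairs into a dictionary.

    Assumed layout: pairs separated by whitespace, values optionally
    wrapped in double quotes when they contain spaces.
    """
    fields = {}
    for token in shlex.split(line):
        name, sep, value = token.partition("=")
        if not sep or not name:
            raise ValueError(f"not a name=value pair: {token!r}")
        fields[name] = value
    return fields


# Invented sample line; only the event name comes from the schema snippet.
sample = 'ts=2011-10-27T10:15:00Z event=stampede.xwf.start restart_count=0 note="retry after failure"'
record = parse_bp(sample)
print(record["event"], record["restart_count"], record["note"])
```

All values come back as strings; converting `restart_count` to an integer would be a separate, schema-driven step.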

DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 

Online Workflow Management and Performance Analysis with Stampede

  • 1. Online Workflow Management and Performance Analysis with Stampede
      Dan Gunter (1), Taghrid Samak (1), Monte Goode (1), Ewa Deelman (2), Gaurang Mehta (2), Fabio Silva (2), Karan Vahi (2), Christopher Brooks (3), Priscilla Moraes (4), Martin Swany (4)
      (1) Lawrence Berkeley National Laboratory
      (2) University of Southern California, Information Sciences Institute
      (3) University of San Francisco
      (4) University of Delaware
  • 2. Background CNSM 2011, October 24-28, Paris, France 2
  • 3. Goal: Predict behavior of running scientific workflows
      - Primarily failures
          - Is a given workflow going to “fail”?
          - Are specific resources causing problems?
          - Which application sub-components are failing?
          - Is the data staging a problem?
      - In large workflows, some failures, etc. are normal
          - This work is about learning, from known problems, which patterns of failures, etc. are unusual and require adaptation
      - Do all of this as generally as possible: Can we provide a solution that can apply to all workflow engines?
  • 4. Approach
      - Model the monitoring data from running workflows
      - Collect all the data in real time
      - Run analysis, also in real time, on the collected data
          - Map low-level failures to application-level characteristics
      - Feed analysis back to the user and the workflow engine
  • 5. Scientific Applications
      - Montage: Astronomy
      - Epigenome: Bioinformatics
      - LIGO: Astrophysics
      - CyberShake: Geophysics
  • 6. Domain: Large Scientific Workflows
      - SCEC-2009: Millions of tasks completed per day
      - [Figure label: Radius = 11 million]
  • 7. Workflow structure
  • 8. Basic terms and concepts
      [Diagram labels: Workflow, Workflow Management System, Resources, Execution, Success, Fail]
  • 9. Base technologies
      - Workflow management systems
          - Pegasus
          - www.pegasus.isi.edu
      - Monitoring and data analysis
          - NetLogger
          - www.netlogger.lbl.gov
  • 10. Data Model
  • 11. Data Model Goals
      - Be widely applicable: there are many workflow engines out there that could benefit.
      - Provide everything we need for Pegasus workflows
  • 12. Abstract and Executable Workflows
      - Workflows start as a resource-independent statement of computations, input and output data, and dependencies
          - This is called the Abstract Workflow (AW)
      - For each workflow run, Pegasus-WMS plans the workflow, adding helper tasks and clustering small computations together
          - This is called the Executable Workflow (EW)
      - Note: Most of the logs are from the EW, but the user really only knows the AW.
  • 13. Additional Terminology
      - Workflow: Container for an entire computation
      - Sub-workflow: Workflow that is contained in another workflow
      - Task: Representation of a computation in the AW
      - Job: Node in the EW
          - May represent part of a task (e.g., a stage-in/out), one task, or many tasks
      - Job instance: Job scheduled or run by the underlying system
          - Due to retries, there may be multiple job instances per job
      - Invocation: One or more executables for a job instance
          - Invocations are the instantiation of tasks, whereas jobs are an intermediate abstraction for use by the planning and scheduling sub-systems
  • 14. Denormalized Data Model
      - Stream of timestamped “events”:
          - unique, hierarchical name
          - unique identifiers (workflow, job, etc.)
          - values and metadata
      - Used NETCONF YANG data-modeling language, keyed on event name [RFCs: 6020 6021 (6087)]
      - YANG schema (see bit.ly/nQfPd1) documents and validates each log event
      Snippet of schema:
          container stampede.xwf.start {
              description "Start of executable workflow";
              uses base-event;
              leaf restart_count {
                  type uint32;
                  description "Number of times workflow was restarted (due to failures)";
              }
          }
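As a rough illustration of validating each log event against a schema keyed on event name, here is a minimal Python sketch. It is not the YANG tooling used by Stampede: the `SCHEMA` dictionary, the `validate_event` helper, and the base-event fields `ts` and `event` are assumptions made for the example. Only the event name `stampede.xwf.start` and the uint32 leaf `restart_count` come from the slide.

```python
# Hypothetical stand-in for a YANG-derived schema: event name -> {field: checker}.
# Only `restart_count` (uint32) is taken from the slide; base-event fields are assumed.

UINT32_MAX = 2**32 - 1


def is_uint32(value):
    """True for non-negative integers that fit in 32 bits (bool excluded)."""
    return (
        isinstance(value, int)
        and not isinstance(value, bool)
        and 0 <= value <= UINT32_MAX
    )


def is_nonempty_str(value):
    return isinstance(value, str) and value != ""


BASE_EVENT = {"ts": is_nonempty_str, "event": is_nonempty_str}  # assumed fields

SCHEMA = {
    "stampede.xwf.start": {**BASE_EVENT, "restart_count": is_uint32},
}


def validate_event(event):
    """Return a list of problems; an empty list means the event is valid."""
    name = event.get("event")
    fields = SCHEMA.get(name)
    if fields is None:
        return [f"unknown event name: {name!r}"]
    problems = []
    for field, check in fields.items():
        if field not in event:
            problems.append(f"missing field: {field}")
        elif not check(event[field]):
            problems.append(f"bad value for {field}: {event[field]!r}")
    return problems


good = {"ts": "2011-10-27T12:00:00Z", "event": "stampede.xwf.start", "restart_count": 0}
bad = {"ts": "2011-10-27T12:00:00Z", "event": "stampede.xwf.start", "restart_count": -1}
print(validate_event(good))
print(validate_event(bad))
```

Keying the lookup on the event name mirrors the slide's description; a real deployment would presumably generate the checks from the YANG schema rather than write them by hand.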
  • 15. Relational data model
      [Diagram: tables grouped into Abstract Workflow (AW), Executable Workflow (EW), and "AW and EW" regions]
      - task: Task
      - task_edge: Task parent and child
      - job: Job
      - job_edge: Job parent and child
      - job_instance: Job Instance
      - jobstate: Job status
      - invocation: Invocation
      - workflow: Workflow
      - workflow_state: Workflow status
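To make the relationships between these tables concrete, the following sketch builds a small subset of them in an in-memory SQLite database. The table names (`workflow`, `job`, `job_instance`, `jobstate`) follow the slide, but every column name, the `SUCCESS`/`FAILED` state strings, and the sample rows are assumptions for illustration, not the actual Stampede schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Table names follow the slide; all columns are illustrative assumptions.
conn.executescript("""
CREATE TABLE workflow (wf_id INTEGER PRIMARY KEY, parent_wf_id INTEGER);
CREATE TABLE job (
    job_id INTEGER PRIMARY KEY,
    wf_id INTEGER NOT NULL REFERENCES workflow(wf_id),
    name TEXT
);
CREATE TABLE job_instance (
    job_instance_id INTEGER PRIMARY KEY,
    job_id INTEGER NOT NULL REFERENCES job(job_id),
    attempt INTEGER
);
CREATE TABLE jobstate (
    job_instance_id INTEGER NOT NULL REFERENCES job_instance(job_instance_id),
    state TEXT,
    ts REAL
);
""")

conn.execute("INSERT INTO workflow VALUES (1, NULL)")
conn.executemany("INSERT INTO job VALUES (?, ?, ?)",
                 [(1, 1, "stage_in"), (2, 1, "compute")])
# The "compute" job is retried once, so it has two job instances.
conn.executemany("INSERT INTO job_instance VALUES (?, ?, ?)",
                 [(1, 1, 1), (2, 2, 1), (3, 2, 2)])
conn.executemany("INSERT INTO jobstate VALUES (?, ?, ?)",
                 [(1, "SUCCESS", 10.0), (2, "FAILED", 20.0), (3, "SUCCESS", 30.0)])

# Per job: how many instances were run, and how many of them failed.
rows = conn.execute("""
SELECT j.name,
       COUNT(ji.job_instance_id) AS instances,
       SUM(CASE WHEN s.state = 'FAILED' THEN 1 ELSE 0 END) AS failures
FROM job j
JOIN job_instance ji ON ji.job_id = j.job_id
JOIN jobstate s ON s.job_instance_id = ji.job_instance_id
GROUP BY j.job_id
ORDER BY j.job_id
""").fetchall()
print(rows)
```

The query shows why job and job instance are separate tables: retries produce several instances for one job, so failure counts have to be aggregated across instances.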
  • 16. Infrastructure
  • 17. Infrastructure overview
      [Diagram labels: Raw logs, Normalized logs, Query, Subscribe]
  • 18. Detailed data flow
      [Diagram labels: Pegasus, Log collection and normalization, NetLogger, Failure detection, Real-time analysis, Relational archive]
  • 19. Message bus usage
      [Diagram: BP log events are published to an AMQP exchange with routing key = event name; the exchange feeds queues, and analysis clients subscribe to receive the data]
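The routing idea on this slide (routing key = event name, queues bound to an exchange, analysis clients reading from queues) can be sketched without a broker. The code below is an in-memory stand-in, not an AMQP client. It assumes a topic-style exchange, where `*` matches exactly one dot-separated word and `#` matches zero or more; the slide does not state the exchange type. The `Exchange` class and the event name `stampede.job.end` are hypothetical.

```python
from collections import deque


def topic_matches(pattern, routing_key):
    """AMQP-style topic match: '*' is exactly one word, '#' is zero or more words."""
    def match(p, k):
        if not p:
            return not k
        head, rest = p[0], p[1:]
        if head == "#":
            # '#' may absorb zero or more words of the key.
            return any(match(rest, k[i:]) for i in range(len(k) + 1))
        if not k:
            return False
        return (head == "*" or head == k[0]) and match(rest, k[1:])

    return match(pattern.split("."), routing_key.split("."))


class Exchange:
    """Toy exchange: copies each published message to every matching queue."""

    def __init__(self):
        self.bindings = []  # list of (pattern, queue)

    def bind(self, pattern):
        queue = deque()
        self.bindings.append((pattern, queue))
        return queue

    def publish(self, routing_key, body):
        for pattern, queue in self.bindings:
            if topic_matches(pattern, routing_key):
                queue.append((routing_key, body))


exchange = Exchange()
all_events = exchange.bind("stampede.#")      # e.g. an archiving client
wf_events = exchange.bind("stampede.xwf.*")   # e.g. a workflow-level analysis client

exchange.publish("stampede.xwf.start", {"restart_count": 0})
exchange.publish("stampede.job.end", {"status": -1})  # hypothetical event name

print(len(all_events), len(wf_events))
```

Because the event names are hierarchical, each analysis client can subscribe to just the subtree of events it needs.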
  • 20. Analysis
  • 21. Experimental Dataset summary
      Application     Workflows       Jobs      Tasks      Edges
      CyberShake            881    288,668    577,330  1,245,845
      Periodograms           45     80,158  1,894,921     80,113
      Epigenome              46     10,059     29,837     23,425
      Montage                76     56,018    613,107    287,146
      Broadband              66     44,182    104,275    141,922
      LIGO                   26      2,116      2,141      6,203
      Total               1,140    481,201  3,221,611  1,784,654
  • 22. Workflow clustering
      - Features collected for each workflow run
          - Successful jobs
          - Failed jobs
          - Success duration
          - Fail duration
      - Offline clustering on historical data
          - Algorithm: k-means
      - Online analysis classifies workflows according to nearest cluster
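A minimal sketch of the two phases described above: k-means over historical workflow runs, then nearest-centroid classification of a running workflow. The feature order (successful jobs, failed jobs, success duration, fail duration) follows the slide. The data values, the fixed starting centroids, and the lack of feature scaling are assumptions made to keep the example small and deterministic; they do not reproduce the authors' setup.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, init_centroids, iterations=20):
    """Plain Lloyd's algorithm with caller-supplied starting centroids."""
    centroids = [list(c) for c in init_centroids]
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Recompute each centroid as the mean of its group; keep it if the group is empty.
        new_centroids = [
            [sum(col) / len(g) for col in zip(*g)] if g else c
            for g, c in zip(groups, centroids)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids


# Hypothetical history: (successful jobs, failed jobs, success duration, fail duration)
history = [
    (95, 5, 900.0, 40.0),
    (90, 10, 850.0, 60.0),
    (98, 2, 950.0, 10.0),
    (20, 80, 200.0, 700.0),
    (30, 70, 260.0, 640.0),
    (25, 75, 240.0, 690.0),
]

# Offline step: cluster the historical runs (k = 2 here).
centroids = kmeans(history, init_centroids=[history[0], history[3]])

# Online step: classify a running workflow by its nearest cluster.
running = (28, 72, 250.0, 650.0)
print(nearest(running, centroids))
```

In practice the features would likely need normalizing so that durations do not dominate job counts, and the number of clusters would be chosen from the data.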
  • 23. “High Failure” Workflows (HFW)
      - The workflow engine keeps retrying workflows until they complete or time out
      - But in the experimental logs, workflows are never marked as “failed”
          - Aside: this is fixed in the newest version
      - Therefore, we use a simple heuristic for identifying workflows as problematic:
          - HFW means: > 50% of jobs failed
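The heuristic above amounts to a single comparison. A small sketch follows; the function name `is_high_failure` and the sample counts are hypothetical.

```python
def is_high_failure(failed_jobs, total_jobs):
    """HFW heuristic from the slide: strictly more than 50% of jobs failed."""
    if total_jobs <= 0:
        raise ValueError("total_jobs must be positive")
    # Integer comparison avoids floating-point rounding at the 50% boundary.
    return 2 * failed_jobs > total_jobs


print(is_high_failure(60, 100))  # more than half failed
print(is_high_failure(50, 100))  # exactly half does not count as HFW
```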
  • 24. HFW failure patterns
      [Figure: Montage application]
      - X-axis is normalized workflow execution time
      - Y-axis shows the percent of total job failures for this workflow, so far
      - Legend shows, for each workflow, jobs failed/jobs total
  • 25. More HFW Failure Patterns
      [Figure panels: Epigenome, Broadband, Montage, CyberShake]
  • 26. Offline clustering
      [Figure: Epigenome workflows projected onto the first 2 principal components (axes: Component 1, Component 2), showing the high-failure workflow cluster and the other 3 clusters]
  • 27. Online classification
      [Figure: Workflow classification. X-axis: Lifetime %; Y-axis: Class. Legend (workflows): 21:512/905, 24:28/29, 25:28/29, 27:4/4, 33:28/30, 41:64/89. Annotations: "Doesn’t converge"; "High-failure workflow class"]
  • 28. Anomaly detection
      [Figure: Montage application. Legend (workflows): 46:281/496, 48:62/65, 49:44/73, 50:36/65, 51:22/37, 52:38/51, 53:42/57, 54:32/48. Annotation: "Anomalous! See Slide #24"]
      - X (Failures): total number of failures
      - Y (Cumulative Percent): proportion of time-windows experiencing that number of failures or fewer
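The curve described on this slide is an empirical cumulative distribution over per-window failure counts. A minimal sketch of computing it is below. The window counts are invented, and the slide does not say how a curve is judged anomalous, so the sketch only produces the curve values.

```python
from bisect import bisect_right


def ecdf(failures_per_window):
    """Return f(x) = proportion of time windows with at most x failures."""
    ordered = sorted(failures_per_window)
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one time window")
    return lambda x: bisect_right(ordered, x) / n


# Hypothetical failure counts, one entry per fixed-length time window.
typical = ecdf([0, 0, 1, 0, 2, 1, 0, 0, 1, 0])
bursty = ecdf([0, 5, 12, 20, 0, 18, 25, 9, 0, 30])

# Proportion of windows with at most 2 failures, for each workflow.
print(typical(2), bursty(2))
```

A workflow whose curve rises much more slowly than its peers spends more of its windows at high failure counts, which is the kind of difference the plot makes visible.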
  • 29. System performance
      - Bars show the rate for each type of query
      - Each panel is an application
      - Dashed black lines are the median arrival rate for the application
      [Figure: panels broadband, cybershake, epigenome, ligo, montage, periodograms. X-axis: Query type; Y-axis: Median queries/minute, log10 scale. Query types: 01-JobsTot, 02-JobsState, 03-JobsType, 04-JobsHost, 05-TimeTot, 06-TimeState, 07-TimeType, 08-TimeHost, 09-JobDelay, 10-WfSumm, 11-HostSumm]
  • 30. Summary
      - Real-time failure prediction for scientific workflows is a challenging but important task
      - Unsupervised learning can be used to model high-level workflow failures from historical data
      - High-failure classes of workflows can be predicted in real time with high accuracy
      - Future directions
          - Analysis: root-cause investigation
          - System: notifications and updates
          - Working with data from other workflow systems
  • 31. Thank you! For more information, visit the Stampede wiki at: https://confluence.pegasus.isi.edu/display/stampede/
  • 32. Extra slides
  • 33. Scalability
  • 34. Pegasus
      - Maps from abstract to concrete workflow
      - Algorithmic and AI-based techniques
      - Automatically locates physical locations for both workflow components and data
      - Finds appropriate resources to execute
      - Reuses existing data products where applicable
      - Publishes newly derived data products
      - Provides provenance information
  • 35. NetLogger
      - Logging Methodology
          - Timestamped, named messages at the start and end of significant events, with additional identifiers and metadata in a standard line-oriented ASCII format (Best Practices or BP)
          - APIs are provided, incl. in-memory log aggregation for high-frequency events; but message generation is often best done within an existing framework
      - Logging and Analysis Tools
          - Parse many existing formats to BP
          - Load BP into message bus, MySQL, MongoDB, etc.
          - Generate profiles, graphs, and CSV from BP data
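To show what consuming a line-oriented log of timestamped, named messages might involve, here is a short parsing sketch. It assumes each line is a series of whitespace-separated `name=value` pairs with optional double quotes around values, and that the timestamp and event name appear as `ts` and `event`. The sample line, the `note` field, and `parse_bp_line` are hypothetical; NetLogger's own parsers should be preferred for real logs.

```python
import shlex


def parse_bp_line(line):
    """Split one line of name=value pairs into a dict (values stay strings)."""
    record = {}
    # shlex keeps quoted values with spaces together and strips the quotes.
    for token in shlex.split(line):
        name, sep, value = token.partition("=")
        if not sep or not name:
            raise ValueError(f"not a name=value pair: {token!r}")
        record[name] = value
    return record


# Hypothetical log line; only the event name and restart_count come from the slides.
line = (
    "ts=2011-10-27T12:00:00.000000Z event=stampede.xwf.start "
    'restart_count=0 note="first attempt"'
)
record = parse_bp_line(line)
print(record["event"], record["restart_count"], record["note"])
```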