Fault Tolerant Clustering in Scientific Workflows

Weiwei Chen, Ewa Deelman
Information Sciences Institute
University of Southern California
  
Outline
•  Introduction
•  Workflow and Failure Model
•  Fault Tolerant Clustering
•  Experiments
•  Task Specific Failures
•  Location Specific Failures
  
Introduction
•  Task-based Scientific Workflows
    –  Task
    –  Job
•  Task Clustering
    –  Merges multiple small tasks into a job
    –  Reduces scheduling and submit overhead
•  Fault Tolerance in Task Clustering
    –  Existing techniques underestimate or ignore the influence of failures
  
Task Clustering
•  Task Clustering
    –  Horizontal Clustering
    –  Vertical Clustering
    –  Arbitrary Clustering

Clustering Factor (k): number of tasks in a job
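A minimal Python sketch of horizontal clustering with clustering factor k. It assumes one workflow level is given as a plain list of task names; the function name `horizontal_cluster` and this list representation are illustrative assumptions, not part of the slides.

```python
from typing import List, Sequence


def horizontal_cluster(level_tasks: Sequence[str], k: int) -> List[List[str]]:
    """Group the tasks of one workflow level into jobs of at most k tasks each."""
    if k < 1:
        raise ValueError("clustering factor k must be at least 1")
    # Consecutive slices of length k; the last job may hold fewer than k tasks.
    return [list(level_tasks[i:i + k]) for i in range(0, len(level_tasks), k)]


if __name__ == "__main__":
    tasks = [f"task_{i}" for i in range(7)]
    for job in horizontal_cluster(tasks, k=3):
        print(job)
```

A larger k means fewer jobs to schedule and submit, at the price of more work lost when a job has to be retried.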
  
System Overview
[Figure: execution timeline without clustering and with clustering, marking the scheduling and submit delay and the resulting improvement]
  
Task Failures and Job Failures
•  We focus only on transient failures and job retry
•  We do not differentiate the causes of failures; we are concerned only with the average failure rate
•  Assumption: a failure is a random event independent of workflow characteristics or execution environment
•  Two Categories
    o  Task Failure: a task fails; other tasks in the same job may not fail
        §  E.g. Application
    o  Job Failure: a job fails; all of its tasks fail
        §  E.g. Scheduling System
  
Influence of Failures on Clustering

    Symbol     Meaning
    t_total    Estimated overall runtime
    n          Number of tasks to run
    t          Runtime of a single task
    r          Number of available resources
    d          Time delay between jobs
    N          Expected retry times for a single task
    k          Number of tasks in a job
    β          Job failure rate
    α          Task failure rate

Target Function: Min(t_total)
given n tasks to run on r resources, where
the task failure rate (α) is measurable (Task Failure Model)
or the job failure rate (β) is measurable (Job Failure Model)

Assumption: n >> r, but n/k >> r
  
Job Failure Model

Runtime for a single job:
\[ t_{job} = kt + d \]

Average retry times for a single job:
\[ N_{job} = \frac{1}{1 - \beta} \]

Retry times for all jobs:
\[
N_{total} =
\begin{cases}
\dfrac{N_{job}\, n}{rk}, & \text{if } \dfrac{n}{k} \ge r \\[6pt]
N_{job}, & \text{if } \dfrac{n}{k} < r
\end{cases}
\]

Overall runtime:
\[
t_{total} = t_{job} N_{total} =
\begin{cases}
\dfrac{N_{job}\, n (kt + d)}{rk} = \dfrac{n (kt + d)}{rk (1 - \beta)}, & \text{if } \dfrac{n}{k} \ge r \\[6pt]
N_{job} (kt + d) = \dfrac{kt + d}{1 - \beta}, & \text{if } \dfrac{n}{k} < r
\end{cases}
\]
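The Job Failure Model above can be evaluated directly. The following is a small Python sketch of those formulas; the function name `jfm_total_runtime` is an illustrative assumption, and the parameters follow the symbols on the slide.

```python
def jfm_total_runtime(n: int, t: float, r: int, d: float, k: int, beta: float) -> float:
    """Estimated overall runtime t_total under the Job Failure Model."""
    if not 0.0 <= beta < 1.0:
        raise ValueError("job failure rate beta must be in [0, 1)")
    t_job = k * t + d              # runtime of a single job
    n_job = 1.0 / (1.0 - beta)     # average retry times for a single job
    if n / k >= r:
        n_total = n_job * n / (r * k)   # more jobs than resources
    else:
        n_total = n_job                 # all jobs fit on the resources at once
    return t_job * n_total


if __name__ == "__main__":
    print(jfm_total_runtime(n=1000, t=5.0, r=20, d=5.0, k=50, beta=0.5))
```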
  
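Job Failure Model

\[
t_{total} =
\begin{cases}
\dfrac{N_{job}\, n (kt + d)}{rk} = \dfrac{n (kt + d)}{rk (1 - \beta)}, & \text{if } \dfrac{n}{k} \ge r \\[6pt]
N_{job} (kt + d) = \dfrac{kt + d}{1 - \beta}, & \text{if } \dfrac{n}{k} < r
\end{cases}
\]

k* is independent of β

It's not necessary to adjust k. Just set it to be
\[ k^* = \frac{n}{r} \]
\[ t^*_{total} = \frac{k^* t + d}{1 - \beta} \]

n = 1000, t = 5 sec, d = 5 sec, r = 20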
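A quick numerical check of the claim that k* does not depend on β, using the slide's example values (n = 1000, t = 5, d = 5, r = 20). This is only a sketch: it scans integer values of k by brute force, and the helper names are illustrative assumptions.

```python
def jfm_total_runtime(n: int, t: float, r: int, d: float, k: int, beta: float) -> float:
    """Estimated overall runtime under the Job Failure Model (beta in [0, 1))."""
    n_job = 1.0 / (1.0 - beta)
    n_total = n_job * n / (r * k) if n / k >= r else n_job
    return (k * t + d) * n_total


def best_k(n: int, t: float, r: int, d: float, beta: float) -> int:
    """Integer clustering factor in 1..n with the smallest estimated runtime."""
    return min(range(1, n + 1), key=lambda k: jfm_total_runtime(n, t, r, d, k, beta))


if __name__ == "__main__":
    for beta in (0.0, 0.2, 0.5, 0.8):
        print(beta, best_k(1000, 5.0, 20, 5.0, beta))
```

Because β only enters through the common factor 1/(1 − β), it scales every candidate k equally and cannot move the minimum away from n/r.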
  
Task Failure Model

Runtime for a single job:
\[ t_{job} = kt + d \]

Average retry times for a single job:
\[ N_{job} = \frac{1}{(1 - \alpha)^k} \]

Retry times for all jobs:
\[
N_{total} =
\begin{cases}
\dfrac{N_{job}\, n}{rk}, & \text{if } \dfrac{n}{k} \ge r \\[6pt]
N_{job}, & \text{if } \dfrac{n}{k} < r
\end{cases}
\]

Overall runtime:
\[
t_{total} = t_{job} N_{total} =
\begin{cases}
\dfrac{N_{job}\, n (kt + d)}{rk} = \dfrac{n (kt + d)}{rk (1 - \alpha)^k}, & \text{if } \dfrac{n}{k} \ge r \\[6pt]
N_{job} (kt + d) = \dfrac{kt + d}{(1 - \alpha)^k}, & \text{if } \dfrac{n}{k} < r
\end{cases}
\]
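A Python sketch of the Task Failure Model formulas above. The only change from the Job Failure Model is the retry factor: a job of k tasks succeeds only if all k tasks succeed, which happens with probability (1 − α)^k. The function name `tfm_total_runtime` is an illustrative assumption.

```python
def tfm_total_runtime(n: int, t: float, r: int, d: float, k: int, alpha: float) -> float:
    """Estimated overall runtime t_total under the Task Failure Model."""
    if not 0.0 <= alpha < 1.0:
        raise ValueError("task failure rate alpha must be in [0, 1)")
    t_job = k * t + d                     # runtime of a single job
    n_job = 1.0 / (1.0 - alpha) ** k      # average retry times for a single job
    if n / k >= r:
        n_total = n_job * n / (r * k)
    else:
        n_total = n_job
    return t_job * n_total


if __name__ == "__main__":
    # The retry factor grows quickly with k once alpha is not tiny.
    for k in (1, 5, 10, 50):
        print(k, tfm_total_runtime(n=1000, t=5.0, r=20, d=5.0, k=k, alpha=0.2))
```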
  
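Task Failure Model

\[
t_{total} =
\begin{cases}
\dfrac{N_{job}\, n (kt + d)}{rk} = \dfrac{n (kt + d)}{rk (1 - \alpha)^k}, & \text{if } \dfrac{n}{k} \ge r \\[6pt]
N_{job} (kt + d) = \dfrac{kt + d}{(1 - \alpha)^k}, & \text{if } \dfrac{n}{k} < r
\end{cases}
\]

k* depends on α

It's necessary to adjust k according to α
\[ k^* = \frac{-d + \sqrt{d^2 - \dfrac{4td}{\ln(1 - \alpha)}}}{2t}, \quad \text{if } n \gg r \]
\[ t^*_{total} = \frac{n (k^* t + d)}{r k^* (1 - \alpha)^{k^*}} \]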
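For the branch n/k ≥ r, the optimal clustering factor can be derived by treating k as a continuous variable and setting the derivative of ln t_total to zero. This is a sketch of that step, not a derivation taken from the slides:

```latex
\begin{align*}
\ln t_{total} &= \ln\frac{n}{r} + \ln(kt + d) - \ln k - k \ln(1 - \alpha) \\
\frac{d}{dk} \ln t_{total} &= \frac{t}{kt + d} - \frac{1}{k} - \ln(1 - \alpha) = 0 \\
\Rightarrow\quad -\frac{d}{k (kt + d)} &= \ln(1 - \alpha) \\
\Rightarrow\quad t k^2 + d k + \frac{d}{\ln(1 - \alpha)} &= 0 \\
\Rightarrow\quad k^* &= \frac{-d + \sqrt{d^2 - \dfrac{4td}{\ln(1 - \alpha)}}}{2t}
\end{align*}
```

Since ln(1 − α) < 0 for 0 < α < 1, the term under the square root exceeds d², so the root taken with the plus sign is the positive one.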
  
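Comparing TFM and JFM
1. Linear increase vs exponential increase
2. Optimal clustering factor

JFM:
\[ k^* = \frac{n}{r}, \qquad t^*_{total} = \frac{k^* t + d}{1 - \beta} \]

TFM:
\[ k^* = \frac{-d + \sqrt{d^2 - \dfrac{4td}{\ln(1 - \alpha)}}}{2t}, \quad \text{if } n \gg r \]
\[ t^*_{total} = \frac{n (k^* t + d)}{r k^* (1 - \alpha)^{k^*}} \]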
  
Fault Tolerant Clustering
•  Job Failure Model: k = n/r
•  Selective Reclustering (SR)
    –  Selects the failed tasks in a clustered job and clusters them into a new clustered job
    –  It requires the identification of failed tasks.
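A minimal Python sketch of Selective Reclustering, assuming the system reports a success flag per task. The function name `selective_recluster` and the dictionary of flags are illustrative assumptions; tasks with no recorded status are treated as failed here, which is also an assumption.

```python
from typing import Dict, List


def selective_recluster(job: List[str], succeeded: Dict[str, bool]) -> List[str]:
    """Return a new clustered job holding only the tasks of `job` that failed."""
    # Tasks without a recorded status are conservatively retried as well.
    return [task for task in job if not succeeded.get(task, False)]


if __name__ == "__main__":
    job = ["t1", "t2", "t3", "t4"]
    status = {"t1": True, "t2": False, "t3": True, "t4": False}
    print(selective_recluster(job, status))
```

Only the failed tasks are retried, so work already completed inside the failed job is not repeated.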
  
Fault Tolerant Clustering
•  Dynamic Clustering (DC)
    –  Adjusts the clustering factor dynamically according to the task failure rate

\[ k^* = \frac{-d + \sqrt{d^2 - \dfrac{4td}{\ln(1 - \alpha)}}}{2t}, \quad \text{if } n \gg r \]
\[ t^*_{total,DC} = \frac{n (k^* t + d)}{r k^* (1 - \alpha)^{k^*}} \]
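A Python sketch of how Dynamic Clustering might pick the clustering factor from a measured task failure rate. `optimal_k` evaluates the positive root of t·k² + d·k + d/ln(1 − α) = 0, which is what setting the derivative of t_total to zero yields for n >> r. Rounding to an integer of at least 1 in `dynamic_clustering_factor` is an assumption of this sketch, as are both function names.

```python
import math


def optimal_k(t: float, d: float, alpha: float) -> float:
    """Continuous optimum k* of the Task Failure Model, valid for n >> r."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("task failure rate alpha must be in (0, 1)")
    log_term = math.log(1.0 - alpha)  # negative for 0 < alpha < 1
    return (-d + math.sqrt(d * d - 4.0 * t * d / log_term)) / (2.0 * t)


def dynamic_clustering_factor(t: float, d: float, alpha: float) -> int:
    """Integer clustering factor: k* rounded, never below 1 (sketch assumption)."""
    return max(1, round(optimal_k(t, d, alpha)))


if __name__ == "__main__":
    for alpha in (0.001, 0.2, 0.4, 0.6, 0.8):
        print(alpha, optimal_k(5.0, 5.0, alpha), dynamic_clustering_factor(5.0, 5.0, alpha))
```

As α grows, k* shrinks toward single-task jobs, because a large job is increasingly unlikely to complete without any of its tasks failing.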
  
Fault Tolerant Clustering
•  Dynamic Reclustering (DR)
    –  A combination of SR and DC
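One way to read "a combination of SR and DC" is sketched below in Python: collect only the failed tasks (as in SR) and regroup them with a clustering factor recomputed from the observed task failure rate (as in DC). The function name, the way α is estimated from the current round, and the integer rounding are all assumptions of this sketch rather than details given on the slide.

```python
import math
from typing import List, Set


def dynamic_recluster(jobs: List[List[str]], failed: Set[str],
                      t: float, d: float) -> List[List[str]]:
    """Regroup the failed tasks of `jobs` into new jobs sized by the observed failure rate."""
    tasks = [task for job in jobs for task in job]
    retry = [task for task in tasks if task in failed]   # SR step: failed tasks only
    if not retry:
        return []
    alpha = len(retry) / len(tasks)                       # observed task failure rate
    if alpha >= 1.0:
        k = 1                                             # everything failed: one task per job
    else:
        # DC step: positive root of t*k^2 + d*k + d/ln(1 - alpha) = 0, rounded to >= 1
        root = math.sqrt(d * d - 4.0 * t * d / math.log(1.0 - alpha))
        k = max(1, round((-d + root) / (2.0 * t)))
    return [retry[i:i + k] for i in range(0, len(retry), k)]


if __name__ == "__main__":
    jobs = [["a", "b", "c", "d"], ["e", "f", "g", "h"], ["i", "j"]]
    print(dynamic_recluster(jobs, failed={"b", "g"}, t=5.0, d=5.0))
```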
  
Evaluation
•  Simulations are based on real traces of workflows run by the Pegasus group.
•  Each workflow was simulated 100 times so that the standard deviation is less than 10%.
•  Two workflows were used.
•  20 worker nodes were used in each experiment.
  
Workflows Used
•  Montage
    –  An astronomy application used to construct large image mosaics of the sky.
    –  Montage has complex data dependencies between tasks.
    –  10,422 tasks, 57GB data.

Image from http://montage.ipac.caltech.edu/
  
Workflows Used
•  Periodogram
    –  Identifies periodic signals from light curves that arise from transiting planets.
    –  216,600 tasks, 19GB input data.
    –  Periodogram has only one level.

Image from http://pegasus.isi.edu/presentations/2011/sci709-voeckler-talk.ppt/
  
Simulator
•  Extension to CloudSim
    –  Workflow Engine
    –  Clustering Engine
    –  Scheduler
    –  Failure Generator
    –  Failure Monitor
  
Performance
•  NOOP: no optimization (k = n/r)
•  DC (Dynamic Clustering)
•  SR (Selective Reclustering)
•  DR (Dynamic Reclustering)
•  Overall Runtime in seconds
  
Performance
•  Periodogram
  
Performance
•  Montage
  
Task Specific Failure Detection (TSFD)
•  Task failures are related to the type of task
•  Failure Monitor classifies failures based on the task type
•  Clustering Engine merges tasks based on their different task failure rates
•  In this Montage experiment, we set the task failure rate of mProjectPP and mDiffFit to 0.001, while that of mBackground ranges from 0.2 to 0.8.

                    Optimization Methods
    α1      DR        DR+TSFD    DC         DC+TSFD
    0.2     10415     10412      13804      13820
    0.4     11830     11839      22946      22923
    0.6     14704     14688      60429      60414
    0.8     23238     23229      436638     435297
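A minimal Python sketch of the kind of bookkeeping a failure monitor could do for TSFD: tally attempts and failures per task type and report an observed failure rate for each. The function name and the `(task_type, failed)` record format are illustrative assumptions, not the simulator's actual interface.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def failure_rates_by_type(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Observed failure rate per task type from (task_type, failed) records."""
    runs: Dict[str, int] = defaultdict(int)
    fails: Dict[str, int] = defaultdict(int)
    for task_type, failed in records:
        runs[task_type] += 1
        if failed:
            fails[task_type] += 1
    return {task_type: fails[task_type] / runs[task_type] for task_type in runs}


if __name__ == "__main__":
    log = [("mBackground", True), ("mBackground", False),
           ("mProjectPP", False), ("mDiffFit", False)]
    print(failure_rates_by_type(log))
```

A clustering engine could then use a separate rate, and hence a separate clustering factor, for each task type.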
  
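Task Failure Model

\[
t_{total} =
\begin{cases}
\dfrac{N_{job}\, n (kt + d)}{rk} = \dfrac{n (kt + d)}{rk (1 - \alpha)^k}, & \text{if } \dfrac{n}{k} \ge r \\[6pt]
N_{job} (kt + d) = \dfrac{kt + d}{(1 - \alpha)^k}, & \text{if } \dfrac{n}{k} < r
\end{cases}
\]

t_total is not sensitive to α

\[ k^* = \frac{-d + \sqrt{d^2 - \dfrac{4td}{\ln(1 - \alpha)}}}{2t}, \quad \text{if } n \gg r \]
\[ t^*_{total} = \frac{n (k^* t + d)}{r k^* (1 - \alpha)^{k^*}} \]

Simplification of failures is acceptable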
  
Location Specific Failure Detection (LSFD)
•  Task failures are related to the location of execution
•  Failure Monitor classifies failures based on resource id
•  Scheduler orders resources based on their reliability.
•  Two out of twenty nodes have a higher task failure rate (from 0.2 to 0.8), while the others still have a task failure rate of 0.001.

DC generates many small tasks if the task failure rate is high
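A small Python sketch of ordering resources by reliability for LSFD, assuming the failure monitor supplies `(resource_id, failed)` records. The function name, the record format, and the tie-break on resource id are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def order_by_reliability(records: Iterable[Tuple[str, bool]]) -> List[str]:
    """Resource ids sorted from lowest to highest observed task failure rate."""
    runs: Dict[str, int] = defaultdict(int)
    fails: Dict[str, int] = defaultdict(int)
    for resource_id, failed in records:
        runs[resource_id] += 1
        if failed:
            fails[resource_id] += 1
    rates = {rid: fails[rid] / runs[rid] for rid in runs}
    # Ties are broken by resource id so the ordering is deterministic.
    return sorted(rates, key=lambda rid: (rates[rid], rid))


if __name__ == "__main__":
    log = [("node-1", True), ("node-1", False), ("node-2", False), ("node-3", True)]
    print(order_by_reliability(log))
```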
  
Conclusion
•  We present three basic methods to improve fault tolerance in task clustering
•  If the system supports identification of failed tasks, dynamic reclustering performs best
•  Otherwise, use dynamic clustering
•  Improvement is significant even for a very basic method
  
Future Work
•  Vertical Clustering and Arbitrary Clustering
•  Intelligent Scheduler
•  More Workflow Examples
•  Distribution of Failures
  
Questions?
•  Thank you for coming!
•  For further info, please visit pegasus.isi.edu or email wchen@isi.edu
  
Refinements
•  When n >> r does not hold at the end of execution
•  Default: \( k_{actual} = k^* \), \( n_{jobs} = n_{task} / k^* < r \)
•  Replicative: \( k_{actual} = k^* \), \( n_{jobs} = r \); replicate jobs by \( r / (n_{task} / k) \)
•  Even: \( k_{actual} = n_{task} / r \), \( n_{jobs} = r \)
  
Dynamic Performance
•  TFM and DC
  

Fault Tolerant Clustering (IEEE Services 2012)

08448380779 Call Girls In Civil Lines Women Seeking Men
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 

Fault Tolerant Clustering (IEEE Services 2012)

  • 1. Fault Tolerant Clustering in Scientific Workflows
      Weiwei Chen, Ewa Deelman
      Information Sciences Institute
      University of Southern California
  • 2. Outline
      • Introduction
      • Workflow and Failure Model
      • Fault Tolerant Clustering
      • Experiments
      • Task Specific Failures
      • Location Specific Failures
  • 3. Introduction
      • Task based Scientific Workflows
          – Task
          – Job
      • Task Clustering
          – Merges multiple small tasks into a job
          – Reduces scheduling and submit overhead
      • Fault Tolerance in Task Clustering
          – Existing techniques underestimate or ignore the influences of failures
  • 4. Task Clustering
      • Task Clustering
          – Horizontal Clustering
          – Vertical Clustering
          – Arbitrary Clustering
      Clustering Factor (k): number of tasks in a job
  • 5. System Overview
      [Figure: timeline comparing execution without clustering and with clustering; labels: scheduling and submit delay, Improvement]
  • 6. Task Failures and Job Failures
      • We focus only on transient failures and job retry
      • We do not differentiate the causes of failures; we are concerned with the average failure rate
      • Assumption: a failure is a random event independent of workflow characteristics or execution environment
      • Two Categories
          o Task Failure: a task fails, other tasks in the same job may not fail
              § E.g. Application
          o Job Failure: a job fails, all of its tasks fail
              § E.g. Scheduling System
  • 7. Influence of Failures on Clustering
      Symbol     Meaning
      t_total    Estimated overall runtime
      n          Number of tasks to run
      t          Runtime of a single task
      r          Number of available resources
      d          Time delay between jobs
      N          Expected retry times for a single task
      k          Number of tasks in a job
      β          Job failure rate
      α          Task failure rate
      Target Function: Min(t_total)
          given n tasks to run on r resources
          task failure rate (α) is measurable (Task Failure Model)
          or job failure rate (β) is measurable (Job Failure Model)
      Assumption: n >> r, but n/k >> r
  • 8. Job Failure Model (symbols as defined on slide 7)
      Runtime for a single job:
      $$ t_{job} = kt + d $$
      Avg retry time for a single job:
      $$ N_{job} = \frac{1}{1-\beta} $$
      Retry time for all jobs:
      $$ N_{total} = \begin{cases} N_{job}\,\dfrac{n}{rk}, & \text{if } \dfrac{n}{k} \ge r \\ N_{job}, & \text{if } \dfrac{n}{k} < r \end{cases} $$
      Overall runtime:
      $$ t_{total} = t_{job}\,N_{total} $$
      $$ t_{total} = \begin{cases} \dfrac{N n (kt+d)}{rk} = \dfrac{n(kt+d)}{rk(1-\beta)}, & \text{if } \dfrac{n}{k} \ge r \\ N(kt+d) = \dfrac{kt+d}{1-\beta}, & \text{if } \dfrac{n}{k} < r \end{cases} $$
      Here N stands for N_job.
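  • 9. Job Failure Model
      $$ t_{total} = \begin{cases} \dfrac{N n (kt+d)}{rk} = \dfrac{n(kt+d)}{rk(1-\beta)}, & \text{if } \dfrac{n}{k} \ge r \\ N(kt+d) = \dfrac{kt+d}{1-\beta}, & \text{if } \dfrac{n}{k} < r \end{cases} $$
      k* is independent of β
      It's not necessary to adjust k. Just set it to be
      $$ k^{*} = \frac{n}{r} $$
      $$ t_{total}^{*} = \frac{kt+d}{1-\beta} \quad \text{at } k = k^{*} $$
      (n = 1000, t = 5 sec, d = 5 sec, r = 20)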
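The Job Failure Model formulas can be checked numerically. The following is a minimal sketch, not the authors' simulator: the function name `jfm_total_runtime` is hypothetical, and the search over integer k is an illustrative assumption.

```python
def jfm_total_runtime(n, t, r, d, k, beta):
    """Estimated overall runtime under the Job Failure Model.

    n: tasks to run, t: runtime of one task, r: available resources,
    d: delay between jobs, k: tasks per job, beta: job failure rate.
    """
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must be in [0, 1)")
    t_job = k * t + d                 # runtime of a single clustered job
    n_job = 1.0 / (1.0 - beta)        # average retries of a single job
    if n / k >= r:
        n_total = n_job * n / (r * k)  # more jobs than resources: several waves
    else:
        n_total = n_job                # all jobs fit on the resources at once
    return t_job * n_total


def jfm_best_integer_k(n, t, r, d, beta):
    """Brute-force search for the integer k that minimizes the estimate."""
    return min(range(1, n + 1),
               key=lambda k: jfm_total_runtime(n, t, r, d, k, beta))
```

With the slide's parameters (n = 1000, t = 5, d = 5, r = 20) the search returns k = n/r = 50 for any β tried here, which agrees with the statement that k* does not depend on β.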
  • 10. Task Failure Model (symbols as defined on slide 7)
      Runtime for a single job:
      $$ t_{job} = kt + d $$
      Avg retry time for a single job:
      $$ N_{job} = \frac{1}{(1-\alpha)^{k}} $$
      Retry time for all jobs:
      $$ N_{total} = \begin{cases} N_{job}\,\dfrac{n}{rk}, & \text{if } \dfrac{n}{k} \ge r \\ N_{job}, & \text{if } \dfrac{n}{k} < r \end{cases} $$
      Overall runtime:
      $$ t_{total} = t_{job}\,N_{total} $$
      $$ t_{total} = \begin{cases} \dfrac{N n (kt+d)}{rk} = \dfrac{n(kt+d)}{rk(1-\alpha)^{k}}, & \text{if } \dfrac{n}{k} \ge r \\ N(kt+d) = \dfrac{kt+d}{(1-\alpha)^{k}}, & \text{if } \dfrac{n}{k} < r \end{cases} $$
      Here N stands for N_job.
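  • 11. Task Failure Model
      $$ t_{total} = \begin{cases} \dfrac{N n (kt+d)}{rk} = \dfrac{n(kt+d)}{rk(1-\alpha)^{k}}, & \text{if } \dfrac{n}{k} \ge r \\ N(kt+d) = \dfrac{kt+d}{(1-\alpha)^{k}}, & \text{if } \dfrac{n}{k} < r \end{cases} $$
      k* depends on α
      It's necessary to adjust k according to α
      $$ k^{*} = \frac{-d + \sqrt{d^{2} - \dfrac{4d}{\ln(1-\alpha)}}}{2t}, \quad \text{if } n \gg r $$
      $$ t_{total}^{*} = \frac{n(k^{*}t + d)}{r k^{*} (1-\alpha)^{k^{*}}} $$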
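The dependence of the best clustering factor on α can be explored numerically. This is a sketch under stated assumptions rather than the authors' code: all function names are hypothetical. The closed form used here is re-derived from the n/k ≥ r branch of t_total by setting its derivative to zero, which gives t·k² + d·k + d/ln(1−α) = 0; its positive root carries a factor t under the square root and coincides with the slide's expression when t = 1.

```python
import math


def tfm_total_runtime(n, t, r, d, k, alpha):
    """Estimated overall runtime under the Task Failure Model."""
    t_job = k * t + d
    n_job = 1.0 / (1.0 - alpha) ** k   # a job succeeds only if all k tasks do
    n_total = n_job * n / (r * k) if n / k >= r else n_job
    return t_job * n_total


def tfm_stationary_k(t, d, alpha):
    """Positive root of t*k^2 + d*k + d/ln(1-alpha) = 0, for 0 < alpha < 1."""
    disc = d * d - 4.0 * t * d / math.log(1.0 - alpha)
    return (-d + math.sqrt(disc)) / (2.0 * t)


def tfm_best_integer_k(n, t, r, d, alpha):
    """Brute-force search for the integer k that minimizes the estimate."""
    return min(range(1, n + 1),
               key=lambda k: tfm_total_runtime(n, t, r, d, k, alpha))
```

For n = 1000, t = 5, d = 5, r = 20, the brute-force minimum stays within one task of the stationary point, and both shrink as α grows, which illustrates why k has to be adjusted to the task failure rate.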
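  • 12. Comparing TFM and JFM
      1. Linear increase vs exponential increase
      2. Optimal clustering factor
      JFM:
      $$ k^{*} = \frac{n}{r}, \qquad t_{total}^{*} = \frac{kt+d}{1-\beta} $$
      TFM:
      $$ k^{*} = \frac{-d + \sqrt{d^{2} - \dfrac{4d}{\ln(1-\alpha)}}}{2t}, \quad \text{if } n \gg r $$
      $$ t_{total}^{*} = \frac{n(k^{*}t + d)}{r k^{*} (1-\alpha)^{k^{*}}} $$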
  • 13. Fault Tolerant Clustering
      • Job Failure Model: k = n/r
      • Selective Reclustering (SR)
          – Select the failed tasks in a clustered job and cluster them into a new clustered job
          – It requires the identification of failed tasks
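Selective Reclustering can be sketched as a retry loop that resubmits only the failed tasks. This is an illustration under assumptions: the function name and the `attempt(task, round_no)` callback, which stands in for whatever mechanism identifies failed tasks, are hypothetical and not part of the authors' simulator.

```python
from typing import Callable, Hashable, List, Sequence


def run_with_selective_reclustering(
    job: Sequence[Hashable],
    attempt: Callable[[Hashable, int], bool],
    max_rounds: int = 100,
) -> List[List[Hashable]]:
    """Run a clustered job, re-clustering only the failed tasks each round.

    attempt(task, round_no) returns True when the task succeeded.
    Returns the list of clustered jobs that were submitted, in order.
    """
    submitted: List[List[Hashable]] = []
    pending = list(job)
    round_no = 0
    while pending:
        if round_no >= max_rounds:
            raise RuntimeError("tasks still failing after max_rounds")
        submitted.append(list(pending))
        # Keep only the tasks that failed; they form the next clustered job.
        pending = [task for task in pending if not attempt(task, round_no)]
        round_no += 1
    return submitted
```

Each retry job is no larger than the one before it, so succeeded tasks are never re-executed, in contrast to retrying the whole clustered job.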
  • 14. Fault Tolerant Clustering
      • Dynamic Clustering (DC)
          – Adjust the clustering factor according to the task failure rates dynamically
      $$ k^{*} = \frac{-d + \sqrt{d^{2} - \dfrac{4d}{\ln(1-\alpha)}}}{2t}, \quad \text{if } n \gg r $$
      $$ t_{total,DC}^{*} = \frac{n(k^{*}t + d)}{r k^{*} (1-\alpha)^{k^{*}}} $$
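One way to picture Dynamic Clustering is to re-estimate α from observed task outcomes and then re-pick k. The sketch below is an assumption-laden illustration: the function name, the failed/attempted counters, the cap on α, and the restriction of k to at most n // r (the n/k ≥ r regime) are all choices made here, and k is found by direct search over the Task Failure Model cost rather than by a closed form.

```python
def dynamic_clustering_factor(n, t, r, d, failed, attempted):
    """Pick a clustering factor from the observed task failure rate.

    failed / attempted is used as the estimate of alpha (0 if nothing ran yet).
    """
    alpha = failed / attempted if attempted else 0.0
    alpha = min(alpha, 0.999)           # keep (1 - alpha) strictly positive
    k_max = max(1, n // r)              # stay in the n/k >= r regime

    def cost(k):
        # Per-task cost from the Task Failure Model, up to the constant n/r.
        return (k * t + d) / (k * (1.0 - alpha) ** k)

    return min(range(1, k_max + 1), key=cost)
```

With no observed failures the result falls back to k = n/r, the unoptimized setting; as the observed failure rate rises, the chosen k shrinks toward 1.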
  • 15. Fault Tolerant Clustering
      • Dynamic Reclustering (DR)
          – A combination of SR and DC
  • 16. Evaluation
      • Simulations are based on real traces of workflows run by the Pegasus group
      • Each workflow was simulated 100 times so that the standard deviation is less than 10%
      • Two workflows were used
      • 20 worker nodes were used in each experiment
  • 17. Workflows Used
      • Montage
          – An astronomy application used to construct large image mosaics of the sky
          – Montage has complex data dependencies between tasks
          – 10,422 tasks, 57GB data
      Image from http://montage.ipac.caltech.edu/
  • 18. Workflows Used
      • Periodogram
          – Identifies periodic signals from light curves that arise from transiting planets
          – 216,600 tasks, 19GB input data
          – Periodogram has only one level
      Image from http://pegasus.isi.edu/presentations/2011/sci709-voeckler-talk.ppt/
  • 19. Simulator
      • Extension to CloudSim
          – Workflow Engine
          – Clustering Engine
          – Scheduler
          – Failure Generator
          – Failure Monitor
  • 20. Performance
      • NOOP: no optimization (k = n/r)
      • DC (Dynamic Clustering)
      • SR (Selective Reclustering)
      • DR (Dynamic Reclustering)
      • Overall Runtime in seconds
  • 23. Task Specific Failure Detection (TSFD)
      • Task failures are related to the type of tasks
      • Failure Monitor classifies failures based on the type
      • Clustering Engine merges tasks based on different task failure rates
      • In this experiment of Montage, we set the task failure rate of mProjectPP and mDiffFit to be 0.001 while mBackground ranges from 0.2 to 0.8

                Optimization Methods
      α1      DR       DR+TSFD    DC        DC+TSFD
      0.2     10415    10412      13804     13820
      0.4     11830    11839      22946     22923
      0.6     14704    14688      60429     60414
      0.8     23238    23229      436638    435297
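  • 24. Task Failure Model
      $$ t_{total} = \begin{cases} \dfrac{N n (kt+d)}{rk} = \dfrac{n(kt+d)}{rk(1-\alpha)^{k}}, & \text{if } \dfrac{n}{k} \ge r \\ N(kt+d) = \dfrac{kt+d}{(1-\alpha)^{k}}, & \text{if } \dfrac{n}{k} < r \end{cases} $$
      t_total is not sensitive to α
      $$ k^{*} = \frac{-d + \sqrt{d^{2} - \dfrac{4d}{\ln(1-\alpha)}}}{2t}, \quad \text{if } n \gg r $$
      $$ t_{total}^{*} = \frac{n(k^{*}t + d)}{r k^{*} (1-\alpha)^{k^{*}}} $$
      Simplification of failures is acceptable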
  • 25. Location Specific Failure Detection (LSFD)
      • Task failures are related to the location of execution
      • Failure Monitor classifies failures based on resource id
      • Scheduler orders resources based on their reliability
      • Two out of twenty nodes have higher task failure rates (from 0.2 to 0.8) while the others still have a task failure rate of 0.001
      DC generates many small tasks if task failure rate is high
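The monitor-and-order idea behind LSFD can be sketched as follows. The class and method names are hypothetical and the ordering rule (ascending observed failure rate, with unseen resources treated as rate 0) is an assumption made for illustration; the slides do not specify how the scheduler ranks resources.

```python
from collections import defaultdict


class LocationFailureMonitor:
    """Tallies task outcomes per resource id and ranks resources by them."""

    def __init__(self):
        self._attempts = defaultdict(int)
        self._failures = defaultdict(int)

    def record(self, resource_id, failed):
        """Record one task execution on resource_id."""
        self._attempts[resource_id] += 1
        if failed:
            self._failures[resource_id] += 1

    def failure_rate(self, resource_id):
        """Observed task failure rate; 0.0 for a resource with no history."""
        attempts = self._attempts.get(resource_id, 0)
        if attempts == 0:
            return 0.0
        return self._failures.get(resource_id, 0) / attempts

    def order_by_reliability(self, resource_ids):
        """Most reliable first; sorted() is stable, so ties keep input order."""
        return sorted(resource_ids, key=self.failure_rate)
```

A scheduler could then hand jobs to the front of the ordered list first, so that the unreliable nodes are used only when the reliable ones are busy.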
  • 26. Conclusion
      • We present three basic methods to improve fault tolerance in task clustering
      • If the system supports identification of failed tasks, dynamic reclustering performs best
      • Otherwise, use dynamic clustering
      • Improvement is significant even for a very basic method
  • 27. Future Work
      • Vertical Clustering and Arbitrary Clustering
      • Intelligent Scheduler
      • More Workflow Examples
      • Distribution of Failures
  • 28. Questions?
      • Thank you for coming!
      • For further info, please visit: pegasus.isi.edu or email wchen@isi.edu
  • 29. Refinements
      • When n >> r does not hold at the end of execution
      • Default:
      $$ k_{actual} = k^{*}, \qquad n_{jobs} = \frac{n_{task}}{k^{*}} < r $$
      • Replicative: replicate jobs by r / (n_task / k*)
      $$ k_{actual} = k^{*}, \qquad n_{jobs} = r $$
      • Even:
      $$ k_{actual} = \frac{n_{task}}{r}, \qquad n_{jobs} = r $$
  • 30. Dynamic Performance
      • TFM and DC