SlideShare uma empresa Scribd logo
1 de 47
Learning Webops
  the Hard Way

What could possibly go wrong?

                    Cosimo Streppone
         WebOps Lead – Opera Software
The Hard Way ?
failure
teams organization

     mail




  webops    sysadmin
1
a cascade of errors
#
# Cyrus IMAPD annotation definitions file
#
/vendor/messagingengine.com/
preview,message,string,backend,value.shared,
misplaced comma                           +
 fix didn't make it to master             +
  unintended general rollout              +
    parser choked on comma                +
      fork with no rate limiting          +
        fatal() dumped core               +
         kernel.core_uses_pid = 1         +
           small SSD metadata partition   +
             indexes corruption           =
                massive outage (no data loss)
DO
Rate limit fork of children

Test disk full conditions

Master your infrastructure
DO NOT
Underestimate Mighty Comma

Rollout everywhere at once

Leave your CI builds messy
read more
“A cascade of errors”
http://blog.fastmail.fm/2011/05/15/outage-
report-a-cascade-of-errors/
2
magic numbers
physical bladecenters?
  LVS?                               network?
              kernel?
                                      solar storms?
  WTF?!?
           random failures in our
defective cpus?
                infrastructure
                                       DDoS?
                  Mayas?
     bnx2?                     traffic?
           recent deploys?
what we experienced

random performance degradation
general instability
steady increase of WTFs/min!
real problem
●
  2.6.32 = debian squeeze kernel
● sched – find_busiest_group()

● TSC register wraparound
Proof
            64
        2
                             = 208,49
 10                      9
2     · 86400 · 10
Subject:    [PATCH] sched: avoid unnecessary overflow in sched_clock
From:       Salman Qazi <sqazi@google.com>
Date:       2011-11-16 20:55:31

In hundreds of days, the __cycles_2_ns calculation in sched_clock
has an overflow. cyc * per_cpu(cyc2ns, cpu) exceeds 64 bits, causing
the final value to become zero. We can solve this without losing
any precision.

We can decompose TSC into quotient and remainder of division by the
scale factor, and then use this to convert TSC into nanoseconds.

Reviewed-by: Paul Turner <pjt@google.com>
Acked-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Salman Qazi <sqazi@google.com>
---
 arch/x86/include/asm/timer.h |   23 ++++++++++++++++++++++-
 1 files changed, 22 insertions(+), 1 deletions(-)

                                   Patch #1, Nov 16th 2011
diff --git a/arch/x86/include/asm/timer.h b/arch/x86/include/asm/timer.h
index fa7b917..431793e 100644
--- a/arch/x86/include/asm/timer.h
+++ b/arch/x86/include/asm/timer.h
@@ -32,6 +32,22 @@ extern int no_timer_check;
  * (mathieu.desnoyers@polymtl.ca)
  *
  *        -johnstul@us.ibm.com "math is hard, lets go shopping!"
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -608,6 +608,8 @@ static void set_cyc2ns_scale(unsigned long cpu_khz, ...)
 {
    unsigned long long tsc_now, ns_now, *offset;
    unsigned long flags, *scale;
+ unsigned long long quot;
+ unsigned long long rem;             Patch #2, Mar 8th 2012
    local_irq_save(flags);
    sched_clock_idle_sleep_event();
@@ -620,7 +622,15 @@ static void set_cyc2ns_scale(unsigned long cpu_khz, ...)

    if (cpu_khz) {
        *scale = (NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR)/cpu_khz;
-       *offset = ns_now - (tsc_now * *scale >> CYC2NS_SCALE_FACTOR);
+
+       /*
+        * Avoid premature overflow by splitting into quotient
+        * and remainder. See the comment above __cycles_2_ns
+        */
+       quot = (tsc_now >> CYC2NS_SCALE_FACTOR);
+       rem = tsc_now & ((1ULL << CYC2NS_SCALE_FACTOR) - 1);
+       *offset = ns_now - (quot * *scale +
+                  ((rem * *scale) >> CYC2NS_SCALE_FACTOR));
    }
32
     2
               = 49,7
           3
86400 · 10
sven's explanation video
DO
 Be perseverant and creative :)

 Learn more about your kernel

 Improve tools to collect data
DO NOT
 Run servers continuously
 for more than 208 days?
3
#Leapocalypse
23:59:60
t - 4y 2m
From: Roman Zippel <zippel@linux-m68k.org>
Date: Thu, 1 May 2008 04:34:41 -0700
Subject: [PATCH] ntp: handle leap second via timer

Remove the leap second handling from second_overflow(), which doesn't have to
check for it every second anymore. With CONFIG_NO_HZ this also makes sure the
leap second is handled close to the full second. Additionally this makes it
possible to abort a leap second properly by resetting the STA_INS/STA_DEL status bits.

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 include/linux/clocksource.h |    2 +
 include/linux/timex.h       |    1 +
 kernel/time/ntp.c           | 133 +++++++++++++++++++++++++++++--------------
 kernel/time/timekeeping.c   |    4 +-
t – 9m
lie(t) = (1 – cos(πt / w)) / 2
lie(s)




                                   t
T - 6m




http://bit.ly/NmA47E

http://my.opera.com/marcomarongiu/blog/index.dml/tag/ntp
T – 1 month
package {
    ntpdate: ensure => installed;
    adjtimex: ensure => installed;
}

file { "/usr/local/bin/leap-adjust.pl":
    ensure => present,
    source => "puppet:///modules/ntp/leap-adjust.pl",
}

file { "/etc/cron.d/ntp-leap-second":
    ensure => present,
    source => "puppet:///modules/ntp/leap-crontab",
    require => [ Package["ntp"], Package["adjtimex"] ],
}
T - 2d


June
29th
T – 1 day
  June 30th 2012



chaos begins
T - 8h




http://bit.ly/PSBMRP

http://serverfault.com/questions/403732/leapocalypse
the work around

 # date -s now
T + {1,2}m
 {August,September} 1st, 2012



fake leap seconds
read more
A story of leaping seconds
   http://blog.fastmail.fm/2012/07/03/a-story-of-leaping-seconds/


Tips and tricks to deal with leap seconds
   http://my.opera.com/marcomarongiu/blog/index.dml/tag/ntp



Serverfault question on random debian crashes
   http://serverfault.com/questions/403732/leapocalypse

Wired article about leap second problems
    http://www.wired.com/wiredenterprise/2012/07/leap-second-bug-
wreaks-havoc-with-java-linux/
DO
Keep your kernel updated

Use valuable external resources
(serverfault etc...)
DO NOT
Underestimate the
importance of time
¿questions?
failure lessons learned



             }
    expect
   assume
   prepare
  simulate       failure
  measure
  embrace
ops lessons learned
Don't repeat yourself (DRY)
Always keep it simple (KISS)
Separate ops team doesn't work well
Practice Continuous deployment. Now.
Communication makes the difference
Learn your tools
Master your infrastructure
RTFM
...
Thanks!

@cstrep
cosimo@opera.com

Mais conteúdo relacionado

Mais procurados

NetConf 2018 BPF Observability
NetConf 2018 BPF ObservabilityNetConf 2018 BPF Observability
NetConf 2018 BPF ObservabilityBrendan Gregg
 
LISA2019 Linux Systems Performance
LISA2019 Linux Systems PerformanceLISA2019 Linux Systems Performance
LISA2019 Linux Systems PerformanceBrendan Gregg
 
LISA17 Container Performance Analysis
LISA17 Container Performance AnalysisLISA17 Container Performance Analysis
LISA17 Container Performance AnalysisBrendan Gregg
 
Linux 4.x Tracing: Performance Analysis with bcc/BPF
Linux 4.x Tracing: Performance Analysis with bcc/BPFLinux 4.x Tracing: Performance Analysis with bcc/BPF
Linux 4.x Tracing: Performance Analysis with bcc/BPFBrendan Gregg
 
Performance Wins with BPF: Getting Started
Performance Wins with BPF: Getting StartedPerformance Wins with BPF: Getting Started
Performance Wins with BPF: Getting StartedBrendan Gregg
 
LSFMM 2019 BPF Observability
LSFMM 2019 BPF ObservabilityLSFMM 2019 BPF Observability
LSFMM 2019 BPF ObservabilityBrendan Gregg
 
Kernel Recipes 2019 - Kernel documentation: past, present, and future
Kernel Recipes 2019 - Kernel documentation: past, present, and futureKernel Recipes 2019 - Kernel documentation: past, present, and future
Kernel Recipes 2019 - Kernel documentation: past, present, and futureAnne Nicolas
 
YOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at NetflixYOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at NetflixBrendan Gregg
 
Linux Performance 2018 (PerconaLive keynote)
Linux Performance 2018 (PerconaLive keynote)Linux Performance 2018 (PerconaLive keynote)
Linux Performance 2018 (PerconaLive keynote)Brendan Gregg
 
Kernel Recipes 2017: Performance Analysis with BPF
Kernel Recipes 2017: Performance Analysis with BPFKernel Recipes 2017: Performance Analysis with BPF
Kernel Recipes 2017: Performance Analysis with BPFBrendan Gregg
 
LISA18: Hidden Linux Metrics with Prometheus eBPF Exporter
LISA18: Hidden Linux Metrics with Prometheus eBPF ExporterLISA18: Hidden Linux Metrics with Prometheus eBPF Exporter
LISA18: Hidden Linux Metrics with Prometheus eBPF ExporterIvan Babrou
 
Performance Tuning EC2 Instances
Performance Tuning EC2 InstancesPerformance Tuning EC2 Instances
Performance Tuning EC2 InstancesBrendan Gregg
 
Linux Systems Performance 2016
Linux Systems Performance 2016Linux Systems Performance 2016
Linux Systems Performance 2016Brendan Gregg
 
bcc/BPF tools - Strategy, current tools, future challenges
bcc/BPF tools - Strategy, current tools, future challengesbcc/BPF tools - Strategy, current tools, future challenges
bcc/BPF tools - Strategy, current tools, future challengesIO Visor Project
 
QCon 2015 Broken Performance Tools
QCon 2015 Broken Performance ToolsQCon 2015 Broken Performance Tools
QCon 2015 Broken Performance ToolsBrendan Gregg
 
DOXLON November 2016: Facebook Engineering on cgroupv2
DOXLON November 2016: Facebook Engineering on cgroupv2DOXLON November 2016: Facebook Engineering on cgroupv2
DOXLON November 2016: Facebook Engineering on cgroupv2Outlyer
 

Mais procurados (20)

NetConf 2018 BPF Observability
NetConf 2018 BPF ObservabilityNetConf 2018 BPF Observability
NetConf 2018 BPF Observability
 
BPF Tools 2017
BPF Tools 2017BPF Tools 2017
BPF Tools 2017
 
LISA2019 Linux Systems Performance
LISA2019 Linux Systems PerformanceLISA2019 Linux Systems Performance
LISA2019 Linux Systems Performance
 
LISA17 Container Performance Analysis
LISA17 Container Performance AnalysisLISA17 Container Performance Analysis
LISA17 Container Performance Analysis
 
Linux 4.x Tracing: Performance Analysis with bcc/BPF
Linux 4.x Tracing: Performance Analysis with bcc/BPFLinux 4.x Tracing: Performance Analysis with bcc/BPF
Linux 4.x Tracing: Performance Analysis with bcc/BPF
 
Performance Wins with BPF: Getting Started
Performance Wins with BPF: Getting StartedPerformance Wins with BPF: Getting Started
Performance Wins with BPF: Getting Started
 
LSFMM 2019 BPF Observability
LSFMM 2019 BPF ObservabilityLSFMM 2019 BPF Observability
LSFMM 2019 BPF Observability
 
Kernel Recipes 2019 - Kernel documentation: past, present, and future
Kernel Recipes 2019 - Kernel documentation: past, present, and futureKernel Recipes 2019 - Kernel documentation: past, present, and future
Kernel Recipes 2019 - Kernel documentation: past, present, and future
 
YOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at NetflixYOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at Netflix
 
Linux Performance 2018 (PerconaLive keynote)
Linux Performance 2018 (PerconaLive keynote)Linux Performance 2018 (PerconaLive keynote)
Linux Performance 2018 (PerconaLive keynote)
 
Linux System Troubleshooting
Linux System TroubleshootingLinux System Troubleshooting
Linux System Troubleshooting
 
Kernel Recipes 2017: Performance Analysis with BPF
Kernel Recipes 2017: Performance Analysis with BPFKernel Recipes 2017: Performance Analysis with BPF
Kernel Recipes 2017: Performance Analysis with BPF
 
LISA18: Hidden Linux Metrics with Prometheus eBPF Exporter
LISA18: Hidden Linux Metrics with Prometheus eBPF ExporterLISA18: Hidden Linux Metrics with Prometheus eBPF Exporter
LISA18: Hidden Linux Metrics with Prometheus eBPF Exporter
 
Performance Tuning EC2 Instances
Performance Tuning EC2 InstancesPerformance Tuning EC2 Instances
Performance Tuning EC2 Instances
 
Linux Systems Performance 2016
Linux Systems Performance 2016Linux Systems Performance 2016
Linux Systems Performance 2016
 
The Internet
The InternetThe Internet
The Internet
 
bcc/BPF tools - Strategy, current tools, future challenges
bcc/BPF tools - Strategy, current tools, future challengesbcc/BPF tools - Strategy, current tools, future challenges
bcc/BPF tools - Strategy, current tools, future challenges
 
QCon 2015 Broken Performance Tools
QCon 2015 Broken Performance ToolsQCon 2015 Broken Performance Tools
QCon 2015 Broken Performance Tools
 
ZFSperftools2012
ZFSperftools2012ZFSperftools2012
ZFSperftools2012
 
DOXLON November 2016: Facebook Engineering on cgroupv2
DOXLON November 2016: Facebook Engineering on cgroupv2DOXLON November 2016: Facebook Engineering on cgroupv2
DOXLON November 2016: Facebook Engineering on cgroupv2
 

Semelhante a Velocity 2012 - Learning WebOps the Hard Way

Troubleshooting Complex Performance issues - Oracle SEG$ contention
Troubleshooting Complex Performance issues - Oracle SEG$ contentionTroubleshooting Complex Performance issues - Oracle SEG$ contention
Troubleshooting Complex Performance issues - Oracle SEG$ contentionTanel Poder
 
Kernel Recipes 2015 - Kernel dump analysis
Kernel Recipes 2015 - Kernel dump analysisKernel Recipes 2015 - Kernel dump analysis
Kernel Recipes 2015 - Kernel dump analysisAnne Nicolas
 
[CCC-28c3] Post Memory Corruption Memory Analysis
[CCC-28c3] Post Memory Corruption Memory Analysis[CCC-28c3] Post Memory Corruption Memory Analysis
[CCC-28c3] Post Memory Corruption Memory AnalysisMoabi.com
 
1 m+ qps on mysql galera cluster
1 m+ qps on mysql galera cluster1 m+ qps on mysql galera cluster
1 m+ qps on mysql galera clusterOlinData
 
Debugging linux issues with eBPF
Debugging linux issues with eBPFDebugging linux issues with eBPF
Debugging linux issues with eBPFIvan Babrou
 
Docker and friends at Linux Days 2014 in Prague
Docker and friends at Linux Days 2014 in PragueDocker and friends at Linux Days 2014 in Prague
Docker and friends at Linux Days 2014 in Praguetomasbart
 
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...Amazon Web Services
 
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...Amazon Web Services
 
Trying and evaluating the new features of GlusterFS 3.5
Trying and evaluating the new features of GlusterFS 3.5Trying and evaluating the new features of GlusterFS 3.5
Trying and evaluating the new features of GlusterFS 3.5Keisuke Takahashi
 
Reverse engineering Swisscom's Centro Grande Modem
Reverse engineering Swisscom's Centro Grande ModemReverse engineering Swisscom's Centro Grande Modem
Reverse engineering Swisscom's Centro Grande ModemCyber Security Alliance
 
Performance tweaks and tools for Linux (Joe Damato)
Performance tweaks and tools for Linux (Joe Damato)Performance tweaks and tools for Linux (Joe Damato)
Performance tweaks and tools for Linux (Joe Damato)Ontico
 
Debugging Ruby
Debugging RubyDebugging Ruby
Debugging RubyAman Gupta
 
Debugging Ruby Systems
Debugging Ruby SystemsDebugging Ruby Systems
Debugging Ruby SystemsEngine Yard
 
Slackware Demystified [SELF 2011]
Slackware Demystified [SELF 2011]Slackware Demystified [SELF 2011]
Slackware Demystified [SELF 2011]Vincent Batts
 
Windows Kernel Exploitation : This Time Font hunt you down in 4 bytes
Windows Kernel Exploitation : This Time Font hunt you down in 4 bytesWindows Kernel Exploitation : This Time Font hunt you down in 4 bytes
Windows Kernel Exploitation : This Time Font hunt you down in 4 bytesPeter Hlavaty
 

Semelhante a Velocity 2012 - Learning WebOps the Hard Way (20)

Troubleshooting Complex Performance issues - Oracle SEG$ contention
Troubleshooting Complex Performance issues - Oracle SEG$ contentionTroubleshooting Complex Performance issues - Oracle SEG$ contention
Troubleshooting Complex Performance issues - Oracle SEG$ contention
 
Kernel Recipes 2015 - Kernel dump analysis
Kernel Recipes 2015 - Kernel dump analysisKernel Recipes 2015 - Kernel dump analysis
Kernel Recipes 2015 - Kernel dump analysis
 
Sge
SgeSge
Sge
 
Hacking the swisscom modem
Hacking the swisscom modemHacking the swisscom modem
Hacking the swisscom modem
 
[CCC-28c3] Post Memory Corruption Memory Analysis
[CCC-28c3] Post Memory Corruption Memory Analysis[CCC-28c3] Post Memory Corruption Memory Analysis
[CCC-28c3] Post Memory Corruption Memory Analysis
 
1 m+ qps on mysql galera cluster
1 m+ qps on mysql galera cluster1 m+ qps on mysql galera cluster
1 m+ qps on mysql galera cluster
 
Debugging linux issues with eBPF
Debugging linux issues with eBPFDebugging linux issues with eBPF
Debugging linux issues with eBPF
 
Docker and friends at Linux Days 2014 in Prague
Docker and friends at Linux Days 2014 in PragueDocker and friends at Linux Days 2014 in Prague
Docker and friends at Linux Days 2014 in Prague
 
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
 
Rac 12c optimization
Rac 12c optimizationRac 12c optimization
Rac 12c optimization
 
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
 
Trying and evaluating the new features of GlusterFS 3.5
Trying and evaluating the new features of GlusterFS 3.5Trying and evaluating the new features of GlusterFS 3.5
Trying and evaluating the new features of GlusterFS 3.5
 
Reverse engineering Swisscom's Centro Grande Modem
Reverse engineering Swisscom's Centro Grande ModemReverse engineering Swisscom's Centro Grande Modem
Reverse engineering Swisscom's Centro Grande Modem
 
Performance tweaks and tools for Linux (Joe Damato)
Performance tweaks and tools for Linux (Joe Damato)Performance tweaks and tools for Linux (Joe Damato)
Performance tweaks and tools for Linux (Joe Damato)
 
Debugging Ruby
Debugging RubyDebugging Ruby
Debugging Ruby
 
Debugging Ruby Systems
Debugging Ruby SystemsDebugging Ruby Systems
Debugging Ruby Systems
 
Slackware Demystified [SELF 2011]
Slackware Demystified [SELF 2011]Slackware Demystified [SELF 2011]
Slackware Demystified [SELF 2011]
 
Mysql Latency
Mysql LatencyMysql Latency
Mysql Latency
 
Linux Network Stack
Linux Network StackLinux Network Stack
Linux Network Stack
 
Windows Kernel Exploitation : This Time Font hunt you down in 4 bytes
Windows Kernel Exploitation : This Time Font hunt you down in 4 bytesWindows Kernel Exploitation : This Time Font hunt you down in 4 bytes
Windows Kernel Exploitation : This Time Font hunt you down in 4 bytes
 

Mais de Cosimo Streppone

How we use and deploy Varnish at Opera
How we use and deploy Varnish at OperaHow we use and deploy Varnish at Opera
How we use and deploy Varnish at OperaCosimo Streppone
 
Puppet at Opera Sofware - PuppetCamp Oslo 2013
Puppet at Opera Sofware - PuppetCamp Oslo 2013Puppet at Opera Sofware - PuppetCamp Oslo 2013
Puppet at Opera Sofware - PuppetCamp Oslo 2013Cosimo Streppone
 
VUG5: Varnish at Opera Software
VUG5: Varnish at Opera SoftwareVUG5: Varnish at Opera Software
VUG5: Varnish at Opera SoftwareCosimo Streppone
 
Velocity 2011 - Our first DDoS attack
Velocity 2011 - Our first DDoS attackVelocity 2011 - Our first DDoS attack
Velocity 2011 - Our first DDoS attackCosimo Streppone
 
Mojolicious: what works and what doesn't
Mojolicious: what works and what doesn'tMojolicious: what works and what doesn't
Mojolicious: what works and what doesn'tCosimo Streppone
 
Surge 2010 - from disaster to stability - scaling my.opera.com
Surge 2010 - from disaster to stability - scaling my.opera.comSurge 2010 - from disaster to stability - scaling my.opera.com
Surge 2010 - from disaster to stability - scaling my.opera.comCosimo Streppone
 
My Opera meets Varnish, Dec 2009
My Opera meets Varnish, Dec 2009My Opera meets Varnish, Dec 2009
My Opera meets Varnish, Dec 2009Cosimo Streppone
 
YAPC::EU::2009 - How Opera Software uses Perl
YAPC::EU::2009 - How Opera Software uses PerlYAPC::EU::2009 - How Opera Software uses Perl
YAPC::EU::2009 - How Opera Software uses PerlCosimo Streppone
 
NPW2009 - my.opera.com scalability v2.0
NPW2009 - my.opera.com scalability v2.0NPW2009 - my.opera.com scalability v2.0
NPW2009 - my.opera.com scalability v2.0Cosimo Streppone
 
IPW2008 - my.opera.com scalability
IPW2008 - my.opera.com scalabilityIPW2008 - my.opera.com scalability
IPW2008 - my.opera.com scalabilityCosimo Streppone
 

Mais de Cosimo Streppone (11)

How we use and deploy Varnish at Opera
How we use and deploy Varnish at OperaHow we use and deploy Varnish at Opera
How we use and deploy Varnish at Opera
 
Puppet at Opera Sofware - PuppetCamp Oslo 2013
Puppet at Opera Sofware - PuppetCamp Oslo 2013Puppet at Opera Sofware - PuppetCamp Oslo 2013
Puppet at Opera Sofware - PuppetCamp Oslo 2013
 
Italian, do you speak it?
Italian, do you speak it?Italian, do you speak it?
Italian, do you speak it?
 
VUG5: Varnish at Opera Software
VUG5: Varnish at Opera SoftwareVUG5: Varnish at Opera Software
VUG5: Varnish at Opera Software
 
Velocity 2011 - Our first DDoS attack
Velocity 2011 - Our first DDoS attackVelocity 2011 - Our first DDoS attack
Velocity 2011 - Our first DDoS attack
 
Mojolicious: what works and what doesn't
Mojolicious: what works and what doesn'tMojolicious: what works and what doesn't
Mojolicious: what works and what doesn't
 
Surge 2010 - from disaster to stability - scaling my.opera.com
Surge 2010 - from disaster to stability - scaling my.opera.comSurge 2010 - from disaster to stability - scaling my.opera.com
Surge 2010 - from disaster to stability - scaling my.opera.com
 
My Opera meets Varnish, Dec 2009
My Opera meets Varnish, Dec 2009My Opera meets Varnish, Dec 2009
My Opera meets Varnish, Dec 2009
 
YAPC::EU::2009 - How Opera Software uses Perl
YAPC::EU::2009 - How Opera Software uses PerlYAPC::EU::2009 - How Opera Software uses Perl
YAPC::EU::2009 - How Opera Software uses Perl
 
NPW2009 - my.opera.com scalability v2.0
NPW2009 - my.opera.com scalability v2.0NPW2009 - my.opera.com scalability v2.0
NPW2009 - my.opera.com scalability v2.0
 
IPW2008 - my.opera.com scalability
IPW2008 - my.opera.com scalabilityIPW2008 - my.opera.com scalability
IPW2008 - my.opera.com scalability
 

Último

UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 

Último (20)

UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 

Velocity 2012 - Learning WebOps the Hard Way

  • 1. Learning Webops the Hard Way What could possibly go wrong? Cosimo Streppone WebOps Lead – Opera Software
  • 2.
  • 3.
  • 6. teams organization mail webops sysadmin
  • 7. 1 a cascade of errors
  • 8. # # Cyrus IMAPD annotation definitions file # /vendor/messagingengine.com/ preview,message,string,backend,value.shared,
  • 9. misplaced comma + fix didn't make it to master + unintended general rollout + parser choked on comma + fork with no rate limiting + fatal() dumped core + kernel.core_uses_pid = 1 + small SSD metadata partition + indexes corruption = massive outage (no data loss)
  • 10.
  • 11. DO Rate limit fork of children Test disk full conditions Master your infrastructure
  • 12. DO NOT Underestimate Mighty Comma Rollout everywhere at once Leave your CI builds messy
  • 13.
  • 14. read more “A cascade of errors” http://blog.fastmail.fm/2011/05/15/outage- report-a-cascade-of-errors/
  • 16. physical bladecenters? LVS? network? kernel? solar storms? WTF?!? random failures in our defective cpus? infrastructure DDoS? Mayas? bnx2? traffic? recent deploys?
  • 17. what we experienced random performance degradation general instability steady increase of WTFs/min!
  • 18. real problem ● 2.6.32 = debian squeeze kernel ● sched – find_busiest_group() ● TSC register wraparound
  • 19. Proof 64 2 = 208,49 10 9 2 · 86400 · 10
  • 20. Subject: [PATCH] sched: avoid unnecessary overflow in sched_clock From: Salman Qazi <sqazi@google.com> Date: 2011-11-16 20:55:31 In hundreds of days, the __cycles_2_ns calculation in sched_clock has an overflow. cyc * per_cpu(cyc2ns, cpu) exceeds 64 bits, causing the final value to become zero. We can solve this without losing any precision. We can decompose TSC into quotient and remainder of division by the scale factor, and then use this to convert TSC into nanoseconds. Reviewed-by: Paul Turner <pjt@google.com> Acked-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Salman Qazi <sqazi@google.com> --- arch/x86/include/asm/timer.h | 23 ++++++++++++++++++++++- 1 files changed, 22 insertions(+), 1 deletions(-) Patch #1, Nov 16th 2011 diff --git a/arch/x86/include/asm/timer.h b/arch/x86/include/asm/timer.h index fa7b917..431793e 100644 --- a/arch/x86/include/asm/timer.h +++ b/arch/x86/include/asm/timer.h @@ -32,6 +32,22 @@ extern int no_timer_check; * (mathieu.desnoyers@polymtl.ca) * * -johnstul@us.ibm.com "math is hard, lets go shopping!"
  • 21. --- a/arch/x86/kernel/tsc.c +++ b/arch/x86/kernel/tsc.c @@ -608,6 +608,8 @@ static void set_cyc2ns_scale(unsigned long cpu_khz, ...) { unsigned long long tsc_now, ns_now, *offset; unsigned long flags, *scale; + unsigned long long quot; + unsigned long long rem; Patch #2, Mar 8th 2012 local_irq_save(flags); sched_clock_idle_sleep_event(); @@ -620,7 +622,15 @@ static void set_cyc2ns_scale(unsigned long cpu_khz, ...) if (cpu_khz) { *scale = (NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR)/cpu_khz; - *offset = ns_now - (tsc_now * *scale >> CYC2NS_SCALE_FACTOR); + + /* + * Avoid premature overflow by splitting into quotient + * and remainder. See the comment above __cycles_2_ns + */ + quot = (tsc_now >> CYC2NS_SCALE_FACTOR); + rem = tsc_now & ((1ULL << CYC2NS_SCALE_FACTOR) - 1); + *offset = ns_now - (quot * *scale + + ((rem * *scale) >> CYC2NS_SCALE_FACTOR)); }
  • 22. 32 2 = 49,7 3 86400 · 10
  • 24. DO Be perseverant and creative :) Learn more about your kernel Improve tools to collect data
  • 25. DO NOT Run servers continuously for more than 208 days?
  • 27.
  • 29. t - 4y 2m From: Roman Zippel <zippel@linux-m68k.org> Date: Thu, 1 May 2008 04:34:41 -0700 Subject: [PATCH] ntp: handle leap second via timer Remove the leap second handling from second_overflow(), which doesn't have to check for it every second anymore. With CONFIG_NO_HZ this also makes sure the leap second is handled close to the full second. Additionally this makes it possible to abort a leap second properly by resetting the STA_INS/STA_DEL status bits. Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Cc: john stultz <johnstul@us.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> --- include/linux/clocksource.h | 2 + include/linux/timex.h | 1 + kernel/time/ntp.c | 133 +++++++++++++++++++++++++++++-------------- kernel/time/timekeeping.c | 4 +-
  • 31. lie(t) = (1 – cos(πt / w)) / 2 lie(s) t
  • 33. T – 1 month package { ntpdate: ensure => installed; adjtimex: ensure => installed; } file { "/usr/local/bin/leap-adjust.pl": ensure => present, source => "puppet:///modules/ntp/leap-adjust.pl", } file { "/etc/cron.d/ntp-leap-second": ensure => present, source => "puppet:///modules/ntp/leap-crontab", require => [ Package["ntp"], Package["adjtimex"] ], }
  • 35. T – 1 day June 30th 2012 chaos begins
  • 37.
  • 38.
  • 39. the work around # date -s now
  • 40. T + {1,2}m {August,September} 1st, 2012 fake leap seconds
  • 41. read more A story of leaping seconds http://blog.fastmail.fm/2012/07/03/a-story-of-leaping-seconds/ Tips and tricks to deal with leap seconds http://my.opera.com/marcomarongiu/blog/index.dml/tag/ntp Serverfault question on random debian crashes http://serverfault.com/questions/403732/leapocalypse Wired article about leap second problems http://www.wired.com/wiredenterprise/2012/07/leap-second-bug- wreaks-havoc-with-java-linux/
  • 42. DO Keep your kernel updated Use valuable external resources (serverfault etc...)
  • 45. failure lessons learned } expect assume prepare simulate failure measure embrace
  • 46. ops lessons learned Don't repeat yourself (DRY) Always keep it simple (KISS) Separate ops team doesn't work well Practice Continuous deployment. Now. Communication makes the difference Learn your tools Master your infrastructure RTFM ...