
Cloud Journey: Lifting a Major Product to Kubernetes


Meetup presentation on Feb 27th 2019 at the Dock8s Meetup in Heidelberg/Rhein-Neckar, at the verivox campus.

The talk touches on all areas involved in a cloud journey of a major product (iDesk2) of the Haufe Group: planning & politics, technology, and doing operations for that product as a DevOps team.



  1. Welcome to the Dock8s Meetup. Robert Werlich, Site Reliability Engineer, robert.werlich@verivox.com; Marlen Blaube, Senior HR Business Partner, marlen.blaube@verivox.com
  2. Cloud Journey: Lifting a Major Product to Kubernetes. Dock8s Meetup Heidelberg, Feb 27th 2019. Martin Danielsson, Haufe Group, Freiburg. @donmartin76 (Twitter, GitHub)
  3. whoami. C:\> WINDOWS.EXE: C/C++/C# background, 10+ years. $ docker ps: containers & Kubernetes for ~4 years. wicked.haufe.io maintainer (OSS API Management). Solution Architect, Developer since 2006.
  4.
  5. Agenda: Planning & Politics, Technology, Operations
  6. Planning & Politics. We'll set the scene a little.
  7. Some numbers: 100+ active git repos, 874k LOC, 10-15 developers, 200-500 concurrent users, typically 100 req/s, 448 GB RAM, 56 cores
  8. Major revenue; strategic move to containers; modular architecture, but without container experience; hosted with a hoster (€€€); long release cycles; (LOTS of) manual work for releases; little operations insight; error tracking very difficult; non-parity of dev/test/prod (cost!); legacy web app (Java based)
  9. Vision – Goals: enabling CI/CD, automatic provisioning, full insight, minimize ops
  10. Let's go DevOps in the Cloud!
  11. Project interfaces: Technology, Processes, HR topics (Operations), Stakeholder Management
  12. Stakeholder Management: CONVINCE THEM, DON'T PERSUADE THEM. COMMUNICATE OFTEN AND CLEARLY. DON'T UNDERESTIMATE THE TASKS AT HAND. BE TRANSPARENT. SHARE SUCCESSES, BUT ALSO FAILURES!
  13. Team Setup – Vision: 100% DevOps engineers, T-shaped engineers, no dedicated manual testers. Automate! YBI, YRI. Ops experience?
  14. Some HR topics: Release managers? Operations responsibility? Quality engineers (testers)? On-call duty?
  15. Technology. This is what you came for…?
  16. Technology stack: Kubernetes on Azure (public cloud)
  17. Steps to DevOps happiness: Provision, Deploy, CI/CD. Weekly for production, daily for dev/test. Ship when ready!
  18. Wait, uh, what…? Target "No-Ops": no long-running systems; enable validation of 3rd-party component upgrades; incremental changes; practice disaster recovery daily; 100% reproducible deployments; on-demand production-identical environments
  19. Code & Pipelines: so, it's all code… and pipelines are also code
  20. Incremental backend development: merge feature to master (after code review, including test suite changes) → build master branch (includes unit testing, first integration tests) → deploy to integration system (blue/green with integration tests) → deploy to production (blue/green with integration tests)
  21. Incremental frontend development: merge feature to master (after code review, including test suite changes) → build master branch (includes unit testing, first integration tests) → deploy to integration system (run e2e integration tests, rollback if failing) → deploy to production (run e2e integration tests, rollback if failing)
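The "deploy, run e2e tests, roll back if failing" step of these pipelines can be sketched as a small driver function. Everything here is hypothetical glue: the deploy/verify/rollback callables stand in for the real pipeline commands (e.g. kubectl or Helm invocations).

```python
from typing import Callable

def deploy_with_rollback(deploy: Callable[[], None],
                         verify: Callable[[], bool],
                         rollback: Callable[[], None]) -> bool:
    """Deploy a new version, run e2e checks, roll back on failure.

    Returns True if the new version stayed live, False if it was rolled back.
    """
    deploy()        # e.g. switch traffic to the "green" deployment
    if verify():    # e.g. run the e2e integration test suite
        return True
    rollback()      # e.g. point traffic back at the "blue" deployment
    return False

# Toy run: a deployment whose e2e check fails gets rolled back.
state = {"live": "blue"}
ok = deploy_with_rollback(
    deploy=lambda: state.update(live="green"),
    verify=lambda: False,   # pretend the e2e suite failed
    rollback=lambda: state.update(live="blue"),
)
print(ok, state["live"])    # False blue
```

The point of the pattern is that the production deploy is only considered done once the e2e suite has passed against the newly switched environment.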
  22. Stateless components vs. stateful components
  23. Full provisioning: create backup → provision new infrastructure (from backups; same as disaster recovery!) → deploy components (using deployment pipelines, partly parallelized) → top-level DNS switch (using a DNS traffic manager) → destroy old infrastructure (if tests succeed)
  24. Persistence options. Roll your own persistence: self-managed VMs (incl. NFS), Gluster/Ceph FS (cluster). Persistence "as a service": managed disks (AWS EBS, Azure Managed Disks), DBaaS (many options), files as a service (AWS EFS, Azure Files)
  25. iDesk2 deployment architecture: resource group with a Kubernetes cluster (k8s master, k8s agents 1…n) plus NFS VM(s) and Postgres VM(s) with disks. Why: Azure Files not fast enough; legacy components depend on UNIX rights (Azure Files is SMB); Azure Disks are ReadWriteOnce only; Azure PGaaS was not yet available; more "bang for your buck"; PG admin knowledge in the team
  26. Endless variants
  27. Some hints… Assess your persistence needs early on. If possible, use DBaaS (avoid NIH syndrome). Externalize configuration. Shared file storage is not "Cloud Native".
  28. Operations. No, you don't get around it. Sorry.
  29. Now that we have Kubernetes…? Self-healing, robust, production-ready, battle-proven. But: complex, an additional abstraction layer. "Vertrauen ist gut… Kontrolle ist besser!" ("Trust is good… control is better!")
  30. "Kontrolle"? What do you mean? Detecting these things is a start…
  31. Fail: Lyin' Monitors. The end-to-end monitoring said ALL GOOD while people logging in got 500s… for an entire weekend.
  32. Instrument. Monitor and Alert. Enable Insight.
  33. Prometheus: scrapes metrics endpoints (e.g. http://<host>:8080/metrics) into a time series DB. Sources: JVM metrics, Node.js metrics, VM exporters (node_exporter), DB exporters (pg_exporter), Kubernetes statistics, custom exporters based on the Prometheus client…
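For illustration, here is a minimal, hand-rolled sketch of the text exposition format such a /metrics endpoint serves. A real service would use the official Prometheus client library instead; the metric names and values below are made up.

```python
# Sketch of the Prometheus text exposition format that a /metrics
# endpoint serves. Prometheus scrapes this plain-text output at
# regular intervals; HELP and TYPE lines describe each metric.

def render_metrics(metrics: dict) -> str:
    """metrics maps name -> (help text, type, value)."""
    lines = []
    for name, (help_text, mtype, value) in metrics.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {mtype}")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

exposition = render_metrics({
    "http_requests_total": ("Total HTTP requests served.", "counter", 12345),
    "process_open_fds": ("Open file descriptors.", "gauge", 42),
})
print(exposition)
```

In a real application the client library keeps the live metric objects (counters, gauges, histograms) and renders this format on demand; exporters like node_exporter do the same for systems you cannot instrument directly.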
  34. Alertmanager
  35. Metrics: white box vs. black box. Types: counters, gauges, histograms, summaries. Sources: application; network (latencies, errors, timeouts); infrastructure (disk space, CPU, memory, pod status)
  36. Friday 9 o'clock newsletter
  37. 205'886
  38. Alerting? On what?
  39. Availability. Infrastructure.
  40. Charity Majors (@mipsytipsy) says… https://www.zazzle.com/nines_dont_matter_t_shirt-235118578582589495
  41. Service Level Indicators and Agreements: percentage of document retrieval requests served within 0.25s and 1s (95% and 98.5%); percentage of search requests answered within 1s, 3s and 7.5s (50%, 95% and 98.5%); percentage of error pages (<1%)
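An indicator such as "percentage of requests served within 1s" is just a ratio over observed latencies. A minimal sketch with made-up sample data (in practice you would derive this from Prometheus histogram buckets rather than raw samples):

```python
def percent_within(latencies_s: list, threshold_s: float) -> float:
    """Share of requests (in %) that completed within the threshold."""
    if not latencies_s:
        return 100.0
    within = sum(1 for t in latencies_s if t <= threshold_s)
    return 100.0 * within / len(latencies_s)

# Made-up latency samples (seconds) for illustration.
samples = [0.12, 0.30, 0.95, 1.40, 0.20, 0.75, 3.10, 0.05, 0.60, 0.90]

# Check against a hypothetical "95% within 1s" target.
slo_ok = percent_within(samples, 1.0) >= 95.0
print(percent_within(samples, 1.0), slo_ok)   # 80.0 False
```

Alerting on the indicator (the ratio dropping below the agreed target) rather than on individual slow requests is what keeps alerts actionable.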
  42.
  43. Holistic view: instrument early (and lots); deployments easier; less fear of change; we are in control! (well, we hope and think we are)
  44. Fails: resiliency issues. VMs are sometimes patched and restarted, or they just die; so will any service on them. Networks are unreliable; connections will fail; use (libraries for) circuit breakers and retries. Re-establishing TLS on each call to external services is expensive… and the service will hate you; use Keep-Alive. SPOFs will eventually fail; assess and act. Learn how to detect problems.
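The circuit-breaker advice can be illustrated with a toy, stdlib-only breaker. This is a sketch of the pattern, not the team's actual implementation (the editor's notes mention Hystrix, with Istio/linkerd under evaluation):

```python
import time

class CircuitBreaker:
    """Toy circuit breaker: after `max_failures` consecutive failures the
    circuit opens and calls fail fast until `reset_after` seconds pass."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open - failing fast")
            self.opened_at = None   # half-open: allow one probe call
            self.failures = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0           # success resets the failure count
        return result

# Demo: two consecutive failures open the circuit; the third call
# then fails fast without ever reaching the broken backend.
breaker = CircuitBreaker(max_failures=2, reset_after=60.0)
def flaky():
    raise ConnectionError("backend down")

for _ in range(2):
    try:
        breaker.call(flaky)
    except ConnectionError:
        pass
try:
    breaker.call(flaky)
except RuntimeError as e:
    print(e)   # circuit open - failing fast
```

Failing fast protects both sides: the caller stops waiting on timeouts, and the struggling backend gets breathing room instead of a retry storm.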
  45. Conclusion. Was it worth it?
  46. Would we do it again?
  47. Key performance indicators: >70% cost saving; release effort down >98% via automation; higher release pace (3-5/year to 15-20/month); performance measurable; faster reaction to issues; unlocks cloud technology
  48. k8s ops is possible as a team. Requires full automation (also of tests) and team dedication. Rethinking ops is challenging. No silver bullet: assess your requirements.
  49. Some links… kubernetes.io, prometheus.io, grafana.com, azure.com, aws.amazon.com. Twitter @donmartin76, GitHub donmartin76. We're hiring! www.haufegroup.com/en/career

Editor's Notes

  • YBIYRI = You build it, you run it.
  • Could just as well have been AWS; Azure was investigated first, as we didn't know whether we would need to go to Azure Germany (this was 2017).
  • This has a couple of implications:

    You need backups for persistent data inside the cluster
    You must be able to automatically restore them

    You will also get a certain amount of "non-persisted" time (time where you cannot persist user changes); for Aurora, this is around 90 minutes each Tuesday early morning. Acceptable for us, may not be acceptable for other teams.
  • Instrument your components to expose (possibly) interesting metrics.

    Rather instrument more; if you do it from the start, it doesn't hurt much. And adding more later is also rather easy.

    Monitor and alert on anticipated failures or known previous issues, if for some reason you cannot find or fix the root cause. With "Monitor and Alert", I also subsume logging and tracing here.

    Enable insight and visualization - or “debugging” if you will - to see inside your system what might have gone wrong.
  • “This is what you would call ‘instrumenting’ your code” - exporting metrics from it

    You would use a client library (there are client libraries for most programming languages). It takes your application's current state of all tracked metrics, transforms it into a format that Prometheus understands, and exposes it via an HTTP endpoint, which Prometheus scrapes at regular intervals.

    There are a number of libraries and servers which help in exporting existing metrics from third-party systems as Prometheus metrics. This is useful for cases where it is not feasible to instrument a given system with Prometheus metrics directly
  • What can we do with that data - Two examples: Dashboarding and Alerting

    E.g. Grafana can use Prometheus as a data source via the Prometheus Query Language to display time series as a graph, e.g. for dashboarding.

    Simultaneously, Prometheus can evaluate certain expressions to see whether alerts have to be triggered. These are then passed on to another component of Prometheus, the Alertmanager, which in turn makes sure the alerts are delivered to wherever they should be delivered to. For us, that’s (both) Rocket Chat and E-Mail.
  • One step back, what kind of metrics exist? Let’s look at a couple of categories - first, white box and black box. That’s where the metrics come from - do you measure them inside your stack (white box), or do you probe from the outside - black box.

    Hint: You should do both.

    Bottom left you see the different metric types Prometheus specifically supports: counters (things which only increase), gauges (things which go up and down), histograms (to see a discrete distribution) and summaries (for seeing quantiles).

    Bottom right you see the sources of metrics - infrastructure (things like disk space, CPU and memory utilization), network (latencies, errors, timeouts and such) and perhaps the most interesting bit - your own application metrics.

    Recall - there is no automatic way of retrieving all of your application specific metrics - this is the instrumentation bit.

    It was in parts an eye opener to us when we started looking at metrics...
  • By simply inspecting response times on various endpoints, we could pinpoint issues we weren't really aware of; fixing them helped us provide an even better experience on our web site.

    Mind you, all of these things were already in the logs. But who reads logs unless you REALLY have a problem? Any takers?
  • Typical "Newsletter Friday": the editors of one of the largest products send out a newsletter each Friday, which we immediately see in the login numbers.
  • So, what’s this number?

    It's the number of individual time series we collect from our production system. Prometheus can handle lots more, up to millions, but it's still quite a number of things to look at and evaluate.
  • OK, so, great. We have a bunch of metrics. What do we do with those?
  • Of course you should alert on infrastructure failure, if the failure entails any need for intervention. If you can recover automatically, there is no need to alert.

    Rule of thumb: Alerts should be ACTIONABLE. If there’s an alert - you should have to do something (even if it’s just investigating). If an alert doesn’t require any actions - chances are good you should not alert on it (and just collect statistics).

    We have found out that this is dang hard though.

    The other thing that is just plain clear is that you must make sure that your application is available - probably by using some black box type of end to end test. If your application isn’t available - that must be your top priority to get it back up and running (but that’s obvious).

    Is that enough though?
  • Let's say we have 99.99% availability; does that mean everything is fine? No. We must find additional metrics to measure how well we are doing.
  • Actually, we would like to measure user happiness. We are doing that with NPS and “Kundenbarometer”, but we’d like to have at least an approximation in real time. Well, you can’t do that, but you can approximate via the definition of functional and non-functional requirements you know (or at least assume) are important for customer happiness.

    Typical things are: Latencies or expected runtimes, and of course that your application does what it’s intended to do.

    This takes us back to metrics and calculated metrics, in other words KPIs, or SLIs, Service Level Indicators.

    Disclaimer: This is not an exact science, but always a guesstimate. Rule of thumb should at least be: If these indicators are off, the customer will definitely be UNHAPPY.
  • And in addition to these, we of course also track the availability, where we also have an SLA.

    So, as these are the values to which we will be held accountable, we better also alert on these.
  • We have gathered a more holistic view on our application - we no longer just look at what has to be developed, we also, from the start, look at how the components will behave at runtime, and how we can observe them.

    We don't have to think very hard about how and where to run things; we have solved most tricky problems using Kubernetes and the toolset around it. We just have to re-apply patterns, relying on the fact that most things aren't so complicated that nobody has solved them yet.

    We have a lot less fear of changing things. Since everything is built up as code, everything is easily and fairly quickly reproducible, and we can efficiently test changes up front.

    We have gathered a feeling that we are in control. At least, we hope and think we are in control. And that’s a nice feeling.
  • Restarted VMs:

    Redis cluster failed after restart
    AppServer could not reconnect to Redis
    Pods running only once? → SPOF
    Expect failures

    Circuit breakers: Currently Hystrix, investigating Istio/linkerd

    TLS: External semantic search – clogged up their load balancer after a couple of hours of traffic.
