VMWare Tech Talk: "The Road from Rugged DevOps to Security Chaos Engineering"

Cyber Security Industry Executive & Entrepreneur at Open to Opportunities
Sep 25, 2019

  1. @aaronrinehart @verica_io #chaosengineering Road from “Rugged to Chaos”
  2. In this Session we will cover
  3. Areas Covered ● Rugged DevOps Journey at UnitedHealth Group ● Combating Complexity in Software ● Chaos Engineering ● Resilience Engineering & Security ● Security Chaos Engineering
  4. Aaron Rinehart, CTO, Founder, Verica ● Former Chief Security Architect @UnitedHealth responsible for security engineering strategy ● Led the DevOps and Open Source Transformation at UnitedHealth Group ● Former (DOD, NASA, DHS, CollegeBoard) ● Frequent speaker and author on Chaos Engineering & Security ● Pioneer behind Security Chaos Engineering ● Led ChaoSlingr team at UnitedHealth
  5. My knowledge of DevOps & Agile when I started.
  6. What is DevOps? “DevOps, a movement of people who care about developing and operating reliable, secure, high performance systems at scale, has always — intentionally — lacked a definition or manifesto.” – Jez Humble, author “The DevOps Handbook”
  7. Our path begins… ● The Phoenix Project: A Novel about IT, DevOps, and Helping Your Business Win, by Gene Kim, Kevin Behr and George Spafford ● The DevOps Handbook: How to create world-class agility, reliability, and security in technology organizations, by Gene Kim, Patrick Debois, John Willis, Jez Humble and John Allspaw
  8. Our Journey: Developer Enablement ● Develop the tools, techniques, and processes needed to deliver security services in a world of Continuous Delivery.
  9. A New Paradigm: Bold Steps ● Drive Security as a Function of Quality ● Building a Better Model: Continuous Delivery is Better Security ○ Focus on Delivering Value ○ Continuous Security Model ○ Enable DevOps Strategy and Automation
  10. A Grass Roots Beginning ● Teams across Silos & Disciplines ○ 60 Developers, Operations Engineers, and Security Leaders from across the entire company. ● Began with Six Core DevOps Security Problem Sets ○ Security Baseline + Configuration Validation w/ Chef & InSpec ○ Gauntlt Rugged Attack Framework ○ Static Code Analysis (SAST): Automating Fortify with Jenkins via API ○ Application Vulnerability Scans (DAST): Automating WebInspect with Jenkins via API ○ DevOps Self-Governance & Operationalization Framework: How does this world look from an operational support perspective? ○ Clair Container Image Scanning: Building Image Scanning into Jenkins
  11. Chef + InSpec: Automated Security Configuration & Validation at Speed ● Case Study: State Health Exchange
  12. Shift Left ● Enable Deployment & Compliance at Speed and Scale ● Allow developers to leverage “Security”-approved Chef server compliance cookbooks ● Compliance is built into the initial server standup process and immediately confirmed prior to release for use ○ No longer a late “add on” ○ It is “just another cookbook” that can be automatically applied
  13. DevOps & State Health Exchange Migration (timeline: March to July) ● Initial Approach: 15+ weeks ○ Stood up 300+ servers from the service catalog over 1 weekend ○ Waited weeks for extra build services beyond the catalog ○ Allowed app and middleware teams to configure in parallel ○ After 2+ months, were able to apply compliance rules using Security Blanket; this required 2+ weeks just to run and resulted in compliance tickets, remediation, and rework ● Alternatively: “Orders of Magnitude Differential” ○ Run Time: 300 servers → 6 mins → 30 hours ○ Setup time: 40-100 hours
  14. Gauntlt: “Be Mean to Your Code” ● Case Study: Driving Security Testing into the Pipeline: Automated Vulnerability Scanning
  15. Security as a Function of Quality: Gauntlt ○ An open source application vulnerability scanner engine that enables a self-service vulnerability resolution solution ○ Automates use of multiple vulnerability security scanning tools ○ Provides packages allowing developers to easily run self-service security checks against their applications ○ Scans begin immediately and take only minutes to complete
  16. Lessons Learned in DevOps Transformation: Takeaways that will fundamentally change the entire strategy.
  17. Automation & Tools are Important but “Don’t be Distracted by it”: Emphasize Simplification & Standardization over Automation
  18. Start Small & Focus: Shift Left… One capability at a time…
  19. Embrace Failure as a Friend: Plan and expect failure as a positive outcome. Encourage teams to fail quickly and learn from their failures.
  20. Seek the Input & Passion of Others: In the end, it has been the folks most passionate about each problem that achieved success.
  21. Voice of the Customer: Define, understand, and listen to your customer as part of your journey. You will be surprised how eager they are to help you.
  22. DevSecOps over the Next 5 Years (written 3 years ago): (1) The Next Generation of Security Professionals will be Chosen from DevOps Teams. (2) A Big Data Problem: The challenge becomes more about the data outputs than the toolsets. (3) Shared Responsibility becomes more of a reality. (4) Security is seen as an integral part of the value stream. (5) There will be a new breed of security capabilities created by Inner Source efforts, e.g. Netflix Security Monkey.
  23. Key Takeaways • Fail small, fail fast • It's a culture shift, not just about automation • Drive out complexity: Complex things don’t scale • Avoid Analysis Paralysis: DevOps is a culture and a living organism • DevOps is not a fad, it is the future • Automation: Focus on where the human adds value. Automate everything else.
  24. Incidents, Outages, & Breaches are Costly
  25. The Obvious Problem
  26. Why do they seem to be happening more often?
  27. Combating Complexity in Software
  28. “The growth of complexity in society has got ahead of our understanding of how complex systems work and fail” - Sidney Dekker
  29. Our systems have evolved beyond human ability to mentally model their behavior.
  30. Our systems have evolved beyond human ability to mentally model their behavior. (everyone else)
  31. Complex? ● Circuit Breaker Patterns ● Continuous Delivery ● Distributed Systems ● Blue/Green Deployments ● Cloud Computing ● Service Mesh ● Containers ● Immutable Infrastructure ● Infracode ● Continuous Integration ● Microservice Architectures ● API ● Auto Canaries ● CI/CD ● DevOps ● Automation Pipelines
  32. Security? ● Mostly Monolithic ● Requires Domain Knowledge ● Prevention Focused ● Poorly Aligned ● Defense in Depth ● Stateful in Nature ● DevSecOps Not Widely Adopted ● Expert Systems ● Adversary Focused
  33. Simplify?
  34. Software has officially taken over
  35. Software Only Increases in Complexity
  36. Software Complexity: Accidental ● Essential
  37. Woods' Theorem: “As the complexity of a system increases, the accuracy of any single agent’s own model of that system decreases” - Dr. David Woods
  38. What about my systems?
  39. How well do you really understand how your system works?
  40. In Reality… Systems Engineering is Messy
  41. In the beginning...we think it looks like
  42. After a few months… ● Hard Coded Passwords ● Identity Conflicts ● Lead Software Engineering finds a new job at Google ● New Security Tool ● Refactor Pricing ● 300 Microservices Δ-> 850 Microservices ● Cloud Provider API Outage ● WAF Outage -> Disabled ● Scalability Issues ● Network is Unreliable ● Autoscaling Keeps Breaking ● Large Customer Outage ● Delayed Features ● DNS Resolution Errors ● Expired Certificate ● Regulatory Audit ● Rolling Sev1 Outage on Portal ● Code Freeze
  43. Years?… ● Hard Coded Passwords ● Identity Conflicts ● Lead Software Engineering finds a new job at Google ● New Security Tool ● Refactor Pricing ● 300 Microservices Δ-> 850 Microservices Δ-> 4000 Microservices ● Cloud Provider API Outage ● WAF Outage -> Disabled ● Firewall Outage -> Disabled ● Scalability Issues ● Network is Unreliable ● Autoscaling Keeps Breaking ● Large Customer Outage ● Delayed Features ● DNS Resolution Errors ● Expired Certificate ● Regulatory Audit ● Rolling Sev1 Outages on Portal ● Code Freeze ● Merger with competitor ● Misconfigured FW Rule Outage ● Database Outage ● Portal Retry Storm Outage ● Orphaned Documentation ● Corporate Reorg ● Budget Freeze ● Outsource overseas development ● Exposed Secrets on Github ● Migration to New CSP ● Upgrade to Java SE 12
  44. Our systems become more complex and messy than we remember them
  45. Difficult to Mentally Model
  46. Avoid Running in the Dark
  47. So what does all of this $&%* have to do with Security?
  48. Failure Happens A Lot
  49. The Normal Condition is to FAIL
  50. We need failure to Learn & Grow
  51. “things that have never happened before happen all the time” –Scott Sagan “The Limits of Safety”
  52. What happens when our Security fails?
  53. How do we typically discover when our security measures fail?
  54. Security Incidents: Typically we don't find out our security is failing until there is a security incident.
  55. Vanishing Traces: All we typically ever see is the “Footsteps in the Sand” (Allspaw): Logs, Stack Traces, Alerts
  56. Security incidents are not effective measures of detection because at that point it's already too late
  57. What typically causes our security to fail?
  58. 2018 Causes of Data Breaches
  59. 2018 Causes of Data Breaches
  60. 2018 Causes of Data Breaches
  61. 2018 Causes of Data Breaches
  62. ‘Human-Error’, Root Cause, & Blame Culture
  63. No System is inherently Secure by Default; it's Humans that make them that way.
  64. People Operate Differently when they expect things to fail
  65. Chaos Engineering
  66. “Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s ability to withstand turbulent conditions” Chaos Engineering
  67. Who is doing Chaos?
  68. “[Chaos Engineering is] empirical rather than formal. We don’t use models to understand what the system should do. We run experiments to learn what it does.” - Michael T. Nygard
  69. Use Chaos to Establish Order
  70. Testing vs. Experimentation
  71. Chaos Engineering Maturity: Despite what has been popularized on online tech blogs, you do not start off performing Chaos Engineering on live production systems. There is a maturity ramp to getting there. ● Validate Chaos Tools in Lower Environment ● Develop Competency & Confidence in Tooling ● Dry-run experiments. Warning: Still be careful in Non-Prod environments, as you will be surprised what hazards lie in Non-Prod. (Kafka Story)
  72. Chaos Monkey Story ● During Business Hours ● Born out of Netflix Cloud Transformation ● Put well-defined problems in front of engineers ● Terminate VMs on Random VPC Instances
  73. Chaos Pitfalls: Auto-Remediation ● “…an operator will only be able to generate successful new strategies for unusual situations if he has an adequate knowledge of the process.” “Long term knowledge develops only through use and feedback about its effectiveness.” - Lisanne Bainbridge, The Ironies of Automation (1983) ● Bring context or chase down vulnerabilities for the service owner instead of automating fixes, as this leads to a Fiery Hell! Reference: Nora Jones, 8 Traps of Chaos Engineering
  74. Chaos Pitfalls: Breaking Things on Purpose ● “I'm pretty sure I won’t have a job very long if I break things on purpose all day.” - Casey Rosenthal ● The purpose of Chaos Engineering is NOT to “Break Things on Purpose”. If anything, we are trying to “Fix them on Purpose”! Reference: Nora Jones, 8 Traps of Chaos Engineering
  75. GameDay Exercises ● 2-4 hrs in Length ● Diverse Cross Functional Group of Engineers ● Focused on Increasing Resilience ● Used for Manual Chaos Engineering ● Great Introduction to Chaos Engineering. Recommendations ● Use GameDays for New Chaos Experiments ● Use GameDays for Initial Experiment Deployment on New Targets ● Use GameDays for Proving New Chaos Engineering Tools ● Get Everyone in the Same Location
  76. Experiment Lifecycle ● Define steady state ● Formulate hypothesis ● Outline methodology ● Identify blast radius ● Observability is key ● Readily abortable. (1) Perform a GameDay Exercise: Plan, schedule, and run a GameDay exercise for new experiments. (2) Validate Experiment Hypothesis: The goal is to validate that the experiment ran successfully and that the results are credible. (3) Remediate Findings & Repeat Experiment: If the hypothesis failed for the experiment, develop and remediate a list of findings. Once remediated, repeat the experiment. (4) Once Successful, Automate Experiment: Once the experiment has been proven to run successfully and validate your hypothesis, you can automate it to run periodically.
  77. GameDays: The Basics Plan & Organize GameDay Exercise Execute Live GameDay Operations Automate & Evangelize Results & Take Action Chaos Experiment Develop & Evaluate Conduct Pre-Incident Review
  78. Security Chaos Engineering
  79. Security Chaos Engineering is... “The discipline of instrumentation, identification, and remediation of failure within security controls through proactive experimentation to build confidence in the system's ability to defend against malicious conditions in production.”
  80. Continuous Security Verification
  81. Proactively Manage & Measure
  82. Reduce Uncertainty by Building Confidence
  83. Build Confidence in What Actually Works
  84. Security Chaos Engineering Use Cases
  85. Incident Response
  86. Security Incidents are Subjective in Nature
  87. We really don't know very much: Where? Why? Who? What? How?
  88. “Response” is the problem with Incident Response
  89. Let's face it, when outages happen… teams spend too much time reacting to outages instead of building more resilient systems.
  90. Let's Flip the Model: Post Mortem = Preparation
  91. Solution Architecture: “More men (people) die from their remedies, not their illnesses” - Jean-Baptiste Poquelin
  92. Solutions Architecture needs reinvention ● Patterns never worked ● Ivory Tower Architecture
  93. An Open Source Tool
  94. ChaoSlingr Product Features • ChatOps Integration • Configuration-as-Code • Example Code & Open Framework • Serverless App in AWS • 100% Native AWS • Configurable Operational Mode & Frequency • Opt-In | Opt-Out Model
  95. Hypothesis: If someone accidentally or maliciously introduced a misconfigured port, then we would immediately detect, block, and alert on the event. (Diagram labels: Misconfigured Port Injection ● Firewall? ● Config Mgmt? ● Log data? ● Alert SOC? ● IR Triage ● Wait...)
  96. Result: Hypothesis disproved. The firewall did not detect or block the change on all instances. The standard port AAA security policy was out of sync on the Portal Team instances. The port change did not trigger an alert, and log data indicated a successful change audit. However, we unexpectedly learned that the configuration management tool caught the change and alerted the SOC. (Diagram labels: Misconfigured Port Injection ● Firewall? ● Config Mgmt? ● Log data? ● Alert SOC? ● IR Triage ● Wait...)
  97. Stop looking for better answers and start asking better questions. - John Allspaw
  98. Q&A @aaronrinehart aaron@verica.io
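
The misconfigured-port hypothesis and the experiment lifecycle from the slides (steady state, hypothesis, blast radius, readily abortable) can be sketched roughly as follows. This is a minimal simulation under stated assumptions, not ChaoSlingr's actual code: the `Instance` and `Observation` types, the function names, port 2323, and the rule that the firewall only blocks where a team's policy is in sync are all hypothetical, and no real AWS API is called.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """A simulated server: a name, an owning team, and its open ports."""
    name: str
    team: str
    open_ports: set = field(default_factory=lambda: {443})


@dataclass
class Observation:
    """How each security control reacted to the injected change."""
    firewall_blocked: bool
    config_mgmt_alerted: bool


def inject_port(instance, port, firewall_synced_teams):
    """Open an unauthorized port on one instance, observe, then roll back."""
    steady_state = set(instance.open_ports)   # known-good baseline
    instance.open_ports.add(port)             # the injected misconfiguration
    try:
        # Simulated controls (assumptions, not real firewall or CM behavior):
        # the firewall blocks only where the team's policy is in sync, and
        # config management alerts whenever ports drift from the baseline.
        firewall_blocked = instance.team in firewall_synced_teams
        config_mgmt_alerted = instance.open_ports != steady_state
        return Observation(firewall_blocked, config_mgmt_alerted)
    finally:
        instance.open_ports = steady_state    # always restore steady state


def run_experiment(instances, port, firewall_synced_teams, blast_radius):
    """Run the injection on at most `blast_radius` instances."""
    targets = instances[:blast_radius]
    return {i.name: inject_port(i, port, firewall_synced_teams) for i in targets}


def hypothesis_holds(observations):
    """Hypothesis: every injected change is both blocked and alerted on."""
    return all(o.firewall_blocked and o.config_mgmt_alerted
               for o in observations.values())


fleet = [
    Instance("api-1", team="api"),
    Instance("portal-1", team="portal"),
    Instance("portal-2", team="portal"),
]
observations = run_experiment(
    fleet, port=2323, firewall_synced_teams={"api"}, blast_radius=2
)
print("hypothesis holds:", hypothesis_holds(observations))
for name, obs in observations.items():
    print(name, obs)
```

In this toy setup the hypothesis fails for the same kind of reason reported on slide 96: the firewall policy is out of sync for the portal team, while the simulated configuration management check still notices the drift. The `finally` rollback stands in for "readily abortable", and `blast_radius` limits how many instances are touched.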