Cloud Operations and Analytics
Improving Distributed Systems Reliability
Using Fault Injection
December 12, 2016
Technical University of Munich
www.tum.de
Dr. Jorge Cardoso (jorge.cardoso@huawei.com)
Chief Architect for Cloud Operations and Analytics
IT R&D Division
1
About Me
Jorge Cardoso
http://jorge-cardoso.github.io/
Interests
Cloud Computing
Service Science and Internet of Services
Business Process Management
Semantic Web
Positions in Industry
Prof. Jorge Cardoso obtained his PhD degree in Computer Science from the University of Georgia (US) in 2002. He is Chief Architect for
Cloud Operations and Analytics at Huawei GRC in Munich, Germany, and Professor at the University of Coimbra, Portugal.
He frequently publishes papers in first-tier conferences such as ICWS, CAISE, and ESWC, and first-tier journals such as IEEE TSC and
Journal of Web Semantics. He has published several books on distributed systems, process management systems, and service systems.
Short Bio
2
Contents
1 Cloud Computing
2 Cloud Operating Systems
3 Cloud Reliability
4 Fault Injection Techniques
5 Butterfly Effect Project
3
From Virtualization to Clouds
Cloud Computing Deployment Stages of Enterprises
• Computing virtualization
• Storage virtualization
• Network and security virtualization
• Automatic management
• Elastic resource scheduling
• HA based on large clusters
• Consolidation of multiple DCs
• Multi-level backup and DR
• Software-defined networking (SDN)
• Unified management
• Optimal resource allocation
• Flexible service migration
[Figure: deployment stages: Virtualization, Private Cloud, Data Center Consolidation, Hybrid Cloud (spanning private and public clouds). The focus shifts from resources, to gradually focusing on business, to global business, to flexible and service-driven operation.]
4
Server virtualization is the partitioning of a physical server into smaller
virtual servers to maximize resource utilization. The physical resources of
the server are hidden from users. Software divides the physical server into
multiple virtual environments.
Communications of the ACM, vol 17, no 7, 1974, pp.412-421
Virtualization
[Figure: before virtualization, four separate x86 servers (Windows XP, Windows 2003, SUSE, Red Hat) run their applications at 12%, 15%, 18%, and 10% hardware utilization. After virtualization, the same four systems and their applications run on one x86 multi-core, multi-processor server at 70% hardware utilization.]
5
6
• Azure, Amazon, Google, Oracle, OpenStack, SoftLayer, etc.
• Transforms datacenters into pools of resources
• Provides a management layer for controlling, automating, and efficiently allocating resources
• Adopts a self-service mode
• Enables developers to build cloud-aware applications via standard APIs
Cloud Operating Systems
7
 Started by Rackspace and NASA (2010)
 Driven by the emergence of virtualization
 Rackspace wanted to rewrite its cloud servers offering
 NASA had published code for Nova, a Python-based
cloud computing controller
OpenStack History
Series | Status | Initial Release Date | EOL Date
Queens | Future | TBD | TBD
Pike | Future | TBD | TBD
Ocata | Under development | 2017-02-22 (planned) | TBD
Newton | Current stable release, security-supported | 2016-10-06 | TBD
Mitaka | Security-supported | 2016-04-07 | 2017-04-10
Liberty | Security-supported | 2015-10-15 | 2016-11-17
Kilo | EOL | 2015-04-30 | 2016-05-02
Juno | EOL | 2014-10-16 | 2015-12-07
Icehouse | EOL | 2014-04-17 | 2015-07-02
Havana | EOL | 2013-10-17 | 2014-09-30
Grizzly | EOL | 2013-04-04 | 2014-03-29
Folsom | EOL | 2012-09-27 | 2013-11-19
Essex | EOL | 2012-04-05 | 2013-05-06
Diablo | EOL | 2011-09-22 | 2013-05-06
Cactus | Deprecated | 2011-04-15 |
Bexar | Deprecated | 2011-02-03 |
Austin | Deprecated | 2010-10-21 |
https://www.nextplatform.com/2016/11/03/building-stack-openstack/
8
OpenStack Community
 1,500+ active participants!
 17 countries represented at Design Summit!
 60,000+ downloads!
 Worldwide network of user groups (North
America, South America, Europe, Asia and
Africa)
9
OpenStack Architecture
https://access.redhat.com/documentation/en/red-hat-openstack-platform/8/paged/architecture-guide/chapter-1-components
10
OpenStack User Survey: A snapshot of OpenStack users’ attitudes and deployments. April 2016. (https://www.openstack.org/assets/survey/April-2016-User-Survey-Report.pdf). Fig. 4.6, p. 31.
11
Compute Architecture
https://access.redhat.com/documentation/en/red-hat-openstack-platform/8/paged/architecture-guide/chapter-1-components
12
Adopters
Apr 6, 2016
http://cloud.telekom.de/Deutsche-Cloud‎
13
$ sudo yum install -y centos-release-openstack-newton
$ sudo yum update -y
$ sudo yum install -y openstack-packstack
$ packstack --allinone
Deploying OpenStack
https://www.rdoproject.org/install/quickstart/
14
15
One reason [Netflix]: the lack of control over the underlying
hardware and the inability to configure it to ensure 100% uptime
Why does using a cloud infrastructure require
advanced approaches for resiliency?
16
Unplanned downtime
is caused by*
software bugs … 27%
hardware … 23%
human error … 18%
network failures … 17%
natural disasters … 8%
* Marcus, E., and Stern, H. Blueprints for High Availability: Designing Resilient Distributed Systems. John Wiley & Sons, Inc., 2003.
Google's 2007 study found the following
annualized failure rates (AFRs) for drives:
1 year old: 1.7%
3 years old: >8.6%
Eduardo Pinheiro, Wolf-Dietrich Weber, and Luiz André Barroso. 2007. Failure trends in a large disk drive population. In Proc.
of the 5th USENIX conference on File and Storage Technologies (FAST '07). USENIX Association, Berkeley, CA, USA, 2-2.
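To put these rates in perspective, the sketch below converts an annualized failure rate into an expected number of failed drives per year and into the probability that at least one drive in a group fails. It assumes independent failures and a constant rate over the year, which is a simplification not claimed by the cited study; the fleet size of 10,000 drives and the group size of 12 are hypothetical figures, and 8.6% is only the lower bound reported for three-year-old drives.

```python
import math


def expected_failures(num_drives: int, afr: float) -> float:
    """Expected number of drive failures in one year for a given AFR."""
    return num_drives * afr


def prob_at_least_one_failure(num_drives: int, afr: float) -> float:
    """Probability that at least one of num_drives fails within a year.

    Assumes drive failures are independent (a simplification).
    """
    return 1.0 - (1.0 - afr) ** num_drives


if __name__ == "__main__":
    fleet = 10_000  # hypothetical fleet size
    group = 12      # hypothetical number of drives in one server
    for label, afr in (("1 year old", 0.017), ("3 years old", 0.086)):
        print(
            label,
            "expected failures per year:", expected_failures(fleet, afr),
            "P(at least one failure in group):",
            round(prob_at_least_one_failure(group, afr), 3),
        )
```

Even at the lower rate, a large fleet sees drive failures as a routine event, which is why failure handling has to be automated rather than treated as an exception.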
17
A program designed to increase resilience by purposely injecting
major failures
Discover flaws and subtle dependencies
Amazon AWS: GameDay
“That seems totally bizarre on the face of it, but as you dig down, you end up finding
some dependency no one knew about previously […] We’ve had situations where we
brought down a network in, say, São Paulo, only to find that in doing so we broke our
links in Mexico.”
18
• Google DiRT (Disaster Recovery Test)
  • Annual disaster recovery and testing exercise
  • 8 years since inception
  • Multi-day exercise triggering (controlled) failures in systems and processes
• Premise
  • 30-day incapacitation of headquarters following a disaster
  • Other offices and facilities may be affected
• When
  • “Big disaster”: annually for 3-5 days
  • Continuous testing: year-round
• Who
  • 100s of engineers (Site Reliability, Network, Hardware, Software, Security, Facilities)
  • Business units (Human Resources, Finance, Safety, Crisis Response, etc.)
Google: DiRT
Source http://flowcon.org/dl/flowcon-sanfran-2014/slides/KripaKrishnan_LearningContinuouslyFromFailures.pdf
19
Netflix: Chaos Monkey
Fewer alerts for ops team
Amazon EC2 and Amazon RDS Service Disruption in the US East Region (April 29, 2011)
September 20th, 2015: Amazon’s DynamoDB service experienced an availability issue in their US-EAST-1
Transfer traffic to east region
20
• Dependability. Concepts, techniques, and tools developed over the past four decades. Its attributes include:
  • Availability. Readiness for correct service.
  • Reliability. Continuity of correct service.
  • Safety. Absence of catastrophic consequences on the user(s) and the environment.
  • Integrity. Absence of improper system alterations.
  • Maintainability. Ability to undergo modifications and repairs.
• Means to attain dependability
  • Fault prevention: prevent the occurrence or introduction of faults.
  • Fault tolerance: avoid service failures in the presence of faults [Voas98].
  • Fault removal: reduce the number and severity of faults.
  • Fault forecasting: estimate the present number, the future incidence, and the likely consequences of faults.
Reliability
A. Avizienis, JC Laprie, B. Randell, and C. Landwehr. 2004. Basic Concepts and Taxonomy of
Dependable and Secure Computing. IEEE Trans. Dependable Secur. Comput. 1, 1, 11-33
J. Voas, G. McGraw. “Software Fault Injection: Inoculating programs against errors”. Edit. Wiley. USA, 1998.
Dependability
21
• Fault. Adjudged or hypothesized cause of an error.
• Error. Discrepancy between a computed, observed, or measured value or condition and a true, specified, or theoretically correct value or condition. An error is a consequence of a fault.
• Failure. Deviation of the delivered service from fulfilling the system function.
Threats
Marcello Cinque, Domenico Cotroneo, Antonio Pecchia. Event Logs for the Analysis of Software Failures: A Rule-Based Approach. IEEE Transactions on Software Engineering, vol. 39, no. 6, pp. 806-821.
[Figure: propagation chain Fault (Ft) → Error (E) → Failure (Fl)]
22
23
Fault Injection
• FI on simulated models
  • VHDL simulation models
  • Other languages
• FI on prototypes
  • Hardware injection (HWIFI)
    • External: HWIFI at pin level, electromagnetic perturbations
    • Internal: heavy-ion radiation, laser radiation, scan chain
  • Software injection, SWIFI (1)
    • Time: static, dynamic
    • Level: high level, machine language
Injection Objectives
• Prediction
• Elimination
Fault Injection Techniques
Software-implemented fault injection (SWIFI)
Fault injection techniques introduce faults that perturb the normal flow of a program in order to extend test coverage or stress-test the system. SWIFI injects a fault into a software system at run time.
Advantages
• Experiments can be run in near real time.
• No model development is needed.
• Can be extended to new classes of faults.
Limitations
• Limited set of injection instants.
• Cannot inject faults into locations that are inaccessible to software.
• Requires modification of the source code to support the fault injection.
24
Huawei: Butterfly Effect
-- Butterfly Effect System --
Automatically tests and repairs OpenStack and cloud applications
CLOUD APPLICATION
HUAWEI FusionSphere
The system works by intentionally injecting different failures, testing the ability to
survive them, and learning how to predict and repair failures preemptively
Failure
Repair
Test
In chaos theory, the butterfly effect is the sensitive dependence on initial conditions in which a small change in one state of a deterministic nonlinear
system can result in large differences in a later state. [Wikipedia]
25
The Strategy
 VM failures
 send VM creation request
 find compute node where request was scheduled
 damage to the compute server
 check if the VM creation was re-scheduled to another node
 Disk temporarily unavailable
 unmount a disk
 wait for replicas to regenerate
 remount the disk with the data intact
 wait for replicas to regenerate
 the extra replicas from handoff nodes should get removed
 Disk replacement
 unmount a disk
 wait for replicas to regenerate
 delete the disk and remount it
 wait for replicas to regenerate
 Extra replicas from handoff nodes should get removed
 Replication
 damage three disks at the same time
 more if the replica count is higher
 check that the replicas didn’t regenerate even after some time period
 fail if the replicas regenerated
 this tests if the tests themselves are correct
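The "disk temporarily unavailable" and "replication" scenarios above can be sketched against an in-memory stand-in for a replicated object store. Everything here is hypothetical: `FakeObjectStore` and its methods are a toy model written for this example, not an OpenStack API, and a real experiment would unmount disks over ssh and wait for the storage service's replicators instead of calling `replicate()` directly.

```python
class FakeObjectStore:
    """Toy model of a replicated object store with handoff replicas."""

    def __init__(self, disks, replica_count=3):
        self.mounted = set(disks)
        self.replica_count = replica_count
        # Primary replicas live on the first `replica_count` disks.
        self.primaries = set(disks[:replica_count])
        self.handoffs = set()

    def unmount(self, disk):
        self.mounted.discard(disk)

    def remount(self, disk):
        self.mounted.add(disk)

    def live_replicas(self):
        return len((self.primaries & self.mounted) | self.handoffs)

    def replicate(self):
        """One replication pass: rebuild missing copies, drop stale handoffs."""
        live = (self.primaries & self.mounted) | self.handoffs
        if not live:
            return  # no surviving copy to replicate from
        if self.primaries <= self.mounted:
            self.handoffs.clear()  # primaries are back: remove extra copies
            return
        spare = sorted(self.mounted - self.primaries - self.handoffs)
        while len(live) < self.replica_count and spare:
            disk = spare.pop(0)
            self.handoffs.add(disk)
            live.add(disk)


def disk_temporarily_unavailable(store, disk):
    """Unmount a disk, expect regeneration, remount, expect handoff cleanup."""
    store.unmount(disk)   # inject the fault
    store.replicate()     # wait for replicas to regenerate
    regenerated = store.live_replicas() == store.replica_count
    store.remount(disk)   # bring the disk back with its data intact
    store.replicate()     # extra replicas on handoff disks get removed
    cleaned_up = not store.handoffs
    return regenerated and cleaned_up and store.live_replicas() == store.replica_count


def replicas_must_not_regenerate(store):
    """Negative test: with every primary gone, nothing can regenerate."""
    for disk in sorted(store.primaries):
        store.unmount(disk)
    store.replicate()
    return store.live_replicas() == 0
```

The negative test plays the role described in the last bullet: if replicas appear to regenerate after all copies were destroyed, the test harness itself is wrong.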
26
• Approach
  • Fully automated and customizable
  • Simple: uses ssh and bash scripting
• FusionServer RH2288
  • Deploy and destroy: 2 hours to deploy OpenStack infrastructure with 32 VMs… 32 seconds to destroy…
• Vagrant. Provides easy-to-configure, reproducible, and portable environments for OpenStack
  • Interfaces to VirtualBox, VMware, AWS, and other providers
• VirtualBox. Free open-source hypervisor for x86 computers from Oracle
  • Management of virtual machines
• RDO. Freely available distribution of OpenStack from Red Hat
  • OpenStack Mitaka
Test Environment
[Figure: test environment stack: VMs running on VirtualBox, managed by Vagrant, on a Huawei RH2288 server running Fedora]
27
Service to Destroy
Database
Message Queue
Authentication
Hypervisor
Hard drive
The main testing framework of OpenStack is Tempest, an open-source project with more than 2000 tests. It performs only black-box testing (tests access only the public interfaces).
Nodes, services, processes, network, hypervisor, storage, etc.
Nova-Compute
28
1. Request provisioning UI/CLI
2. Validate Auth data
3. Send API request to NOVA API
4. Validate API token
5. Process API request
6. Publish Provisioning Request
7. Schedule Provisioning
8. Start VM provisioning
9. Configure Network
10. Request Volume
11. Request VM image from Glance
12. Get image URL from Glance
13. Direct Image File Copy
14. Start VM rendering via Hypervisor
Scenario Driven
http://www.slideshare.net/mirantis/openstack-architecture-43160012
http://docs.openstack.org/developer/tempest/field_guide/scenario.html
Create Server
• Create server
Inject Faults
Scenario Faults Process
flavor create
flavor delete
flavor list
host list
hypervisor list
hypervisor show
image add project
image create
image delete
image list
image show
ip fixed add
…
openstack server create --flavor m1.medium --image "fedora-23" \
  --key-name ayoung-pubkey --security-group default \
  --nic net-id=63258623-1fd5-497c-b62d-e0651e03bdca windows_dev
29
Localized Injection
 State based
 Time based
 State
 Time
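One way to read the two kinds of localized injection is as triggers that decide when a fault fires while a scenario runs: a state-based trigger fires when the workflow reaches a given step, and a time-based trigger fires once a given amount of time has elapsed. The sketch below follows that reading; it is an assumption about the slide's intent, and the class names, the abbreviated step names, and `run_scenario` are hypothetical.

```python
import time
from dataclasses import dataclass

# Abbreviated, hypothetical step names for a VM provisioning scenario.
PROVISIONING_STEPS = [
    "validate_auth",
    "api_request",
    "schedule",
    "start_vm",
    "configure_network",
]


@dataclass
class StateTrigger:
    """Fire when the scenario reaches a specific step."""
    step: str

    def should_fire(self, current_step, elapsed_s):
        return current_step == self.step


@dataclass
class TimeTrigger:
    """Fire once a given number of seconds has elapsed."""
    after_s: float

    def should_fire(self, current_step, elapsed_s):
        return elapsed_s >= self.after_s


def run_scenario(steps, trigger, fault, clock=time.monotonic):
    """Walk through the steps and inject `fault` once, when the trigger fires.

    `fault` is any callable taking the current step name. Returns a log of
    the steps, with a marker where the fault was injected.
    """
    start = clock()
    fired = False
    log = []
    for step in steps:
        if not fired and trigger.should_fire(step, clock() - start):
            fault(step)
            fired = True
            log.append(f"fault@{step}")
        log.append(step)  # a real runner would execute the step here
    return log
```

For example, `run_scenario(PROVISIONING_STEPS, StateTrigger("schedule"), print)` injects the fault just before the scheduling step, regardless of how long the earlier steps took.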
30
Faults to Inject
 Bit-flips - CPU registers/memory
 Memory errors - mem corruptions/leaks, lack of memory
 Disk faults - read/write errors, lack of disk space
 Network faults - packet loss, network congestion, etc.
 Terminate instance
 Introduce delays in message delivery
 Corrupt data in DB
 Service, process, and application crashes
 Reboot node
 Configuration error
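A fault list like the one above can be kept as a small fault library from which reproducible fault plans are drawn. The sketch below is illustrative only: the `Fault` record, the host names, and `build_fault_plan` are hypothetical, and the shell commands are examples of standard Linux tools for a few of the fault types (the interface name `eth0`, the file path, and the process name are assumptions). The code only builds the plan; it does not execute anything.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Fault:
    name: str
    category: str
    command: str  # example command; never executed by this sketch


# Illustrative entries covering some of the fault types listed above.
FAULT_LIBRARY = [
    Fault("packet_loss", "network", "tc qdisc add dev eth0 root netem loss 10%"),
    Fault("message_delay", "network", "tc qdisc add dev eth0 root netem delay 200ms"),
    Fault("disk_full", "disk", "fallocate -l 10G /var/tmp/fill.img"),
    Fault("service_crash", "process", "pkill -KILL -f nova-compute"),
    Fault("node_reboot", "node", "systemctl reboot"),
]


def build_fault_plan(hosts, library, count, seed):
    """Draw `count` (host, fault) pairs using a seeded RNG.

    The seed makes a plan reproducible, so an experiment that exposed a
    failure can be replayed with exactly the same faults.
    """
    rng = random.Random(seed)
    return [(rng.choice(hosts), rng.choice(library)) for _ in range(count)]


def describe(plan):
    """Render a plan as human-readable lines."""
    return [f"{host}: {fault.name} ({fault.command})" for host, fault in plan]
```

Keeping commands as data rather than code makes it easy to review a plan before anything is run against a platform.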
31
Detect Failures
The main testing framework of OpenStack is Tempest, an open-source project with more than 2000 tests. It performs only black-box testing (tests access only the public interfaces).
Network tests
• create keypairs
• create security groups
• create networks
Compute tests
• create a keypair
• create a security group
• boot an instance
Swift tests
• create a volume
• get the volume
• delete the volume
Identity tests
…
Cinder tests
…
Glance tests
…
echo "$ tempest init cloud-01"
echo "$ cp tempest/etc/tempest.conf cloud-01/etc/"
echo "$ cd cloud-01"
echo "Next is the full test suite:"
echo "$ ostestr -c 3 --regex '(?!.*\[.*\bslow\b.*\])(^tempest\.(api|scenario))'"
echo "Next is the minimum basic test:"
echo "$ ostestr -c 3 --regex '(?!.*\[.*\bslow\b.*\])(^tempest\.scenario\.test_minimum_basic)'"
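The selection regex combines a negative lookahead, which rejects any test ID whose bracketed tag list contains the word `slow`, with a prefix match on `tempest.api` or `tempest.scenario`. The sketch below applies the same pattern with Python's `re` module to a few made-up test IDs; the IDs are hypothetical and only mimic the usual `module.Class.test_name[tags]` shape.

```python
import re

# Full-suite filter: API and scenario tests that are not tagged "slow".
FULL_SUITE = re.compile(r"(?!.*\[.*\bslow\b.*\])(^tempest\.(api|scenario))")


def select_tests(test_ids, pattern=FULL_SUITE):
    """Return the test IDs that the pattern selects."""
    return [test_id for test_id in test_ids if pattern.search(test_id)]


# Hypothetical test IDs, for illustration only.
SAMPLE_IDS = [
    "tempest.api.compute.test_servers.ServersTest.test_list_servers[id-1]",
    "tempest.scenario.test_example.ExampleScenario.test_big_workload[compute,slow]",
    "tempest.tests.test_helpers.HelpersTest.test_parse",
]

if __name__ == "__main__":
    for test_id in select_tests(SAMPLE_IDS):
        print(test_id)
```

Only the first ID is selected: the second is excluded by the `slow` tag, and the third does not start with `tempest.api` or `tempest.scenario`.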
32
Detect Failures
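[Figure: reducing the Tempest suite for failure detection. Tempest 0: 1400 tests, 45 min to 2 h. Tempest 1, Tempest 2, and Tempest 3 are reduced variants; labels in the figure: overlapping tests, mutually exclusive tests, branch and bound, "100%, 100", "40%, 40", "5%, log2 40", "4%, log2 20".]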
• Side effects. Integration tests often have side effects and require specific setups, so they often cannot be used in production systems. For example, integration tests that delete all the virtual machines on a platform cannot be run in production.
• Reuse. Integration test suites are composed of many types of tests (e.g., unit tests, API tests, integration tests, scenario tests, and positive and negative tests). Reusing test code for damage detection is useful, but selecting the right tests can be difficult.
• Filtering. Most of the tests are not relevant for damage detection on production systems. Damage detection looks for components, services, and processes that are no longer working properly, whereas tests determine whether code commits introduce errors. Many functional tests written to test code are therefore irrelevant in production.
• Specificity. New code for damage detection always needs to be developed, since testing does not typically look for problems that can happen when a system is in a particular operational state.
Limitations of Integration Tests
33
Butterfly Effect: Example of Fault Injection
34
Butterfly Effect: Example of Fault Injection
Dmitri Zimine (Brocade) giving his talk on workflows for auto-remediation (credits to Johannes Weingart).
Sebastian Kirsch (Google), co-author of the bestselling book Site Reliability Engineering from Google, and the workshop organizer Jorge Cardoso (Huawei).
The International Industry-Academia Workshop on Cloud Reliability and Resilience was held in Berlin on 7-8 November 2016. The workshop gathered close to 50 participants from industry (Intel, Red Hat, CISCO, SAP, Google, LinkedIn, Microsoft, Mirantis, Brocade, T-Systems, SysEleven, Deutsche Telekom, Flexiant, Hastexo) and academia (TU Wien, TU Berlin, U Oxford, ETH, U Stuttgart, TU Chemnitz, U Potsdam, TU Darmstadt, U Lisbon, U Coimbra).
International Industry-Academia
Workshop on
Cloud Reliability and Resilience
Berlin on 7-8 November 2016
Current Team: Cloud Operations and Analytics
 Objective
 Planet-scale distributed systems = automation
 Highly complex systems = AI and machine learning
 Skills and knowledge
 OpenStack Software Development
 Machine Learning and Real-time Analysis
 Reliability for Cloud Native Applications
 Large-scale distributed systems
 Working Student
 Distributed Execution Graphs (DEG) for OpenStack.
 Master Students
 Efficient Diagnosis in Cloud Platforms.
 DEG-driven Fault Injection for Cloud platforms.
 PhD Students
 Risk-aware Cloud Recovery using Machine Learning
(automation + AI).
 Internship for PhD student
 Next generation of DEG-driven systems beyond
Google’s Dapper and Twitter’s Zipkin.
 Working & MSc students
 Fault injection, fault models, fault libraries, fault plans, break and rebuild systems all day long, …
 PhD Students
 Rapid prototyping of cool
ideas: propose it today, code
it, and show it running in 3
months…
 Postdocs
 Solving difficult challenges of
real problems using quick and
dirty prototyping
Open Positions
Copyright©2015 Huawei Technologies Co., Ltd. All Rights Reserved.
The information in this document may contain predictive statements including, without limitation, statements regarding the future financial and operating results, future product
portfolio, new technology, etc. There are a number of factors that could cause actual results and developments to differ materially from those expressed or implied in the predictive
statements. Therefore, such information is provided for reference purpose only and constitutes neither an offer nor an acceptance. Huawei may change the information at any time
without notice.
HUAWEI ENTERPRISE ICT SOLUTIONS A BETTER WAY
38
• The complexity and dynamicity of large-scale cloud platforms require automated solutions to reduce the risk of potential failures.
• Fault injection mechanisms make it possible to determine (and repair), in controlled environments, the types of failures that platforms cannot tolerate, rather than passively waiting for Murphy’s law to come into play on a Sunday at 2am when engineers are off duty.
• Pioneers such as Amazon, Google, and Netflix have already developed fault injection mechanisms and have also changed their mindset with respect to the importance of the resiliency of cloud platforms.
• As an innovation topic, we go one step further towards fault-tolerant platforms by exploring not only fault injection but also the automated recovery of platforms.
Executive Summary
39
• FIAT: Fault Injection Based Automated Testing Environment, Carnegie Mellon University.
• EFI, PROFI: Processor Fault Injector, Dortmund University.
• FERRARI: Fault and ERRor Automatic Real-time Injector, University of Texas.
• SFI, DOCTOR: integrateD sOftware implemented fault injeCTiOn enviRonment, University of Michigan.
• FINE: Fault Injection and moNitoring Environment, University of Illinois.
• FTAPE: Fault Tolerance and Performance Evaluator, University of Illinois.
• XCEPTION: University of Coimbra.
• MAFALDA, MAFALDA-RT: Microkernel Assessment by Fault injection AnaLysis and Design Aid, LAAS-CNRS, Toulouse.
• BALLISTA: Carnegie Mellon University.
SW Fault Injection Tools

Mais conteúdo relacionado

Mais procurados

Adversary Emulation using CALDERA
Adversary Emulation using CALDERAAdversary Emulation using CALDERA
Adversary Emulation using CALDERAErik Van Buggenhout
 
Lessons Learned in Automated Decision Making / How to Delay Building Skynet
Lessons Learned in Automated Decision Making / How to Delay Building SkynetLessons Learned in Automated Decision Making / How to Delay Building Skynet
Lessons Learned in Automated Decision Making / How to Delay Building SkynetSounil Yu
 
SAST vs. DAST: What’s the Best Method For Application Security Testing?
SAST vs. DAST: What’s the Best Method For Application Security Testing?SAST vs. DAST: What’s the Best Method For Application Security Testing?
SAST vs. DAST: What’s the Best Method For Application Security Testing?Cigital
 
Cisco Security Presentation
Cisco Security PresentationCisco Security Presentation
Cisco Security PresentationSimplex
 
Lessons Learned from the NIST CSF
Lessons Learned from the NIST CSFLessons Learned from the NIST CSF
Lessons Learned from the NIST CSFDigital Bond
 
Building an API Security Strategy
Building an API Security StrategyBuilding an API Security Strategy
Building an API Security StrategySmartBear
 
NICE Cybersecurity Workforce Framework: Close your skills gap with role-based...
NICE Cybersecurity Workforce Framework: Close your skills gap with role-based...NICE Cybersecurity Workforce Framework: Close your skills gap with role-based...
NICE Cybersecurity Workforce Framework: Close your skills gap with role-based...Infosec
 
The Heatmap
 - Why is Security Visualization so Hard?
The Heatmap
 - Why is Security Visualization so Hard?The Heatmap
 - Why is Security Visualization so Hard?
The Heatmap
 - Why is Security Visualization so Hard?Raffael Marty
 
Cyber Security Operations Center (C-SOC)
Cyber Security Operations Center (C-SOC) Cyber Security Operations Center (C-SOC)
Cyber Security Operations Center (C-SOC) BGA Cyber Security
 
Cyber Threat hunting workshop
Cyber Threat hunting workshopCyber Threat hunting workshop
Cyber Threat hunting workshopArpan Raval
 
Security architecture, engineering and operations
Security architecture, engineering and operationsSecurity architecture, engineering and operations
Security architecture, engineering and operationsPiyush Jain
 
FIRST CTI Symposium: Turning intelligence into action with MITRE ATT&CK™
FIRST CTI Symposium: Turning intelligence into action with MITRE ATT&CK™FIRST CTI Symposium: Turning intelligence into action with MITRE ATT&CK™
FIRST CTI Symposium: Turning intelligence into action with MITRE ATT&CK™Katie Nickels
 
MITRE ATT&CK framework
MITRE ATT&CK frameworkMITRE ATT&CK framework
MITRE ATT&CK frameworkBhushan Gurav
 
Thinking like a hacker - Introducing Hacker Vision
Thinking like a hacker - Introducing Hacker VisionThinking like a hacker - Introducing Hacker Vision
Thinking like a hacker - Introducing Hacker VisionPECB
 
Welcome to the world of Cyber Threat Intelligence
Welcome to the world of Cyber Threat IntelligenceWelcome to the world of Cyber Threat Intelligence
Welcome to the world of Cyber Threat IntelligenceAndreas Sfakianakis
 
Purple Teaming with ATT&CK - x33fcon 2018
Purple Teaming with ATT&CK - x33fcon 2018Purple Teaming with ATT&CK - x33fcon 2018
Purple Teaming with ATT&CK - x33fcon 2018Christopher Korban
 

Mais procurados (20)

Adversary Emulation using CALDERA
Adversary Emulation using CALDERAAdversary Emulation using CALDERA
Adversary Emulation using CALDERA
 
Lessons Learned in Automated Decision Making / How to Delay Building Skynet
Lessons Learned in Automated Decision Making / How to Delay Building SkynetLessons Learned in Automated Decision Making / How to Delay Building Skynet
Lessons Learned in Automated Decision Making / How to Delay Building Skynet
 
SAST vs. DAST: What’s the Best Method For Application Security Testing?
SAST vs. DAST: What’s the Best Method For Application Security Testing?SAST vs. DAST: What’s the Best Method For Application Security Testing?
SAST vs. DAST: What’s the Best Method For Application Security Testing?
 
DevSecOps What Why and How
DevSecOps What Why and HowDevSecOps What Why and How
DevSecOps What Why and How
 
Cisco Security Presentation
Cisco Security PresentationCisco Security Presentation
Cisco Security Presentation
 
Lessons Learned from the NIST CSF
Lessons Learned from the NIST CSFLessons Learned from the NIST CSF
Lessons Learned from the NIST CSF
 
Building an API Security Strategy
Building an API Security StrategyBuilding an API Security Strategy
Building an API Security Strategy
 
NICE Cybersecurity Workforce Framework: Close your skills gap with role-based...
NICE Cybersecurity Workforce Framework: Close your skills gap with role-based...NICE Cybersecurity Workforce Framework: Close your skills gap with role-based...
NICE Cybersecurity Workforce Framework: Close your skills gap with role-based...
 
The Heatmap
 - Why is Security Visualization so Hard?
The Heatmap
 - Why is Security Visualization so Hard?The Heatmap
 - Why is Security Visualization so Hard?
The Heatmap
 - Why is Security Visualization so Hard?
 
Cyber Security Operations Center (C-SOC)
Cyber Security Operations Center (C-SOC) Cyber Security Operations Center (C-SOC)
Cyber Security Operations Center (C-SOC)
 
Cyber Threat hunting workshop
Cyber Threat hunting workshopCyber Threat hunting workshop
Cyber Threat hunting workshop
 
Security architecture, engineering and operations
Security architecture, engineering and operationsSecurity architecture, engineering and operations
Security architecture, engineering and operations
 
FIRST CTI Symposium: Turning intelligence into action with MITRE ATT&CK™
FIRST CTI Symposium: Turning intelligence into action with MITRE ATT&CK™FIRST CTI Symposium: Turning intelligence into action with MITRE ATT&CK™
FIRST CTI Symposium: Turning intelligence into action with MITRE ATT&CK™
 
CLOUD NATIVE SECURITY
CLOUD NATIVE SECURITYCLOUD NATIVE SECURITY
CLOUD NATIVE SECURITY
 
MITRE ATT&CK framework
MITRE ATT&CK frameworkMITRE ATT&CK framework
MITRE ATT&CK framework
 
Thinking like a hacker - Introducing Hacker Vision
Thinking like a hacker - Introducing Hacker VisionThinking like a hacker - Introducing Hacker Vision
Thinking like a hacker - Introducing Hacker Vision
 
Threat Hunting with Cyber Kill Chain
Threat Hunting with Cyber Kill ChainThreat Hunting with Cyber Kill Chain
Threat Hunting with Cyber Kill Chain
 
Red Team Framework
Red Team FrameworkRed Team Framework
Red Team Framework
 
Welcome to the world of Cyber Threat Intelligence
Welcome to the world of Cyber Threat IntelligenceWelcome to the world of Cyber Threat Intelligence
Welcome to the world of Cyber Threat Intelligence
 
Purple Teaming with ATT&CK - x33fcon 2018
Purple Teaming with ATT&CK - x33fcon 2018Purple Teaming with ATT&CK - x33fcon 2018
Purple Teaming with ATT&CK - x33fcon 2018
 

Destaque

Building Fault Tolerant Applications in the cloud - AWS Summit 2012 - NYC
Building Fault Tolerant Applications in the cloud - AWS Summit 2012 - NYC Building Fault Tolerant Applications in the cloud - AWS Summit 2012 - NYC
Building Fault Tolerant Applications in the cloud - AWS Summit 2012 - NYC Amazon Web Services
 
È l'ora del Cloud Managed IT
È l'ora del Cloud Managed ITÈ l'ora del Cloud Managed IT
È l'ora del Cloud Managed ITMatteo Masi
 
Azure Services Platform Oc Event Ned
Azure Services Platform Oc Event NedAzure Services Platform Oc Event Ned
Azure Services Platform Oc Event NedWes Yanaga
 
Gov 2.0: Scaling, Automation, & Management in the Cloud
Gov 2.0: Scaling, Automation, & Management in the CloudGov 2.0: Scaling, Automation, & Management in the Cloud
Gov 2.0: Scaling, Automation, & Management in the CloudJesse Robbins
 
Oracle Management Cloud
Oracle Management CloudOracle Management Cloud
Oracle Management CloudFabio Batista
 
Smau Bologna 2015 - Microsoft - Azure
Smau Bologna 2015 - Microsoft - AzureSmau Bologna 2015 - Microsoft - Azure
Smau Bologna 2015 - Microsoft - AzureSMAU
 
Meraki cloud managed products
Meraki cloud managed productsMeraki cloud managed products
Meraki cloud managed productsAtanas Gergiminov
 
Simplify IT Operations by Unifying Element Management with Vistara
Simplify IT Operations by Unifying Element Management with VistaraSimplify IT Operations by Unifying Element Management with Vistara
Simplify IT Operations by Unifying Element Management with VistaraVistara
 
The Microsoft Cloud - Azure | Office 365 | Intune
The Microsoft Cloud - Azure | Office 365 | IntuneThe Microsoft Cloud - Azure | Office 365 | Intune
The Microsoft Cloud - Azure | Office 365 | IntuneRola Ezzeddine
 
Meraki Company And Product Overview
Meraki Company And Product OverviewMeraki Company And Product Overview
Meraki Company And Product Overviewxanstevenson
 
The Power and Promise of SaaS: CA Cloud Service Management Case Study
The Power and Promise of SaaS: CA Cloud Service Management Case StudyThe Power and Promise of SaaS: CA Cloud Service Management Case Study
The Power and Promise of SaaS: CA Cloud Service Management Case StudyCA Technologies
 
Microsoft Operations Management Suite
Microsoft Operations Management Suite Microsoft Operations Management Suite
Microsoft Operations Management Suite Engin Özkurt
 
Without Self-Service Operations, the Cloud is Just Expensive Hosting 2.0 - (a...
Without Self-Service Operations, the Cloud is Just Expensive Hosting 2.0 - (a...Without Self-Service Operations, the Cloud is Just Expensive Hosting 2.0 - (a...
Without Self-Service Operations, the Cloud is Just Expensive Hosting 2.0 - (a...dev2ops
 
Meraki Cloud Networking Workshop
Meraki Cloud Networking WorkshopMeraki Cloud Networking Workshop
Meraki Cloud Networking WorkshopCisco Canada
 
Microsoft Azure Explained - Hitesh D Kesharia
Microsoft Azure Explained - Hitesh D KeshariaMicrosoft Azure Explained - Hitesh D Kesharia
Microsoft Azure Explained - Hitesh D KeshariaHARMAN Services
 
Delivering operations management success at Morningstar (a case study)
Delivering operations management success at Morningstar (a case study)Delivering operations management success at Morningstar (a case study)
Delivering operations management success at Morningstar (a case study)BMC Software
 
Microsoft Azure And The Competitive Cloud Industry - Collab365
Microsoft Azure And The Competitive Cloud Industry - Collab365Microsoft Azure And The Competitive Cloud Industry - Collab365
Microsoft Azure And The Competitive Cloud Industry - Collab365Richard Harbridge
 
Service-now.com SaaS vs. ASP vs. traditional software
Service-now.com   SaaS vs. ASP vs. traditional softwareService-now.com   SaaS vs. ASP vs. traditional software
Service-now.com SaaS vs. ASP vs. traditional softwareRhett Glauser
 
Lattice Energy LLC - Adequate reasonably priced dispatchable power generation...
Lattice Energy LLC - Adequate reasonably priced dispatchable power generation...Lattice Energy LLC - Adequate reasonably priced dispatchable power generation...
Lattice Energy LLC - Adequate reasonably priced dispatchable power generation...Lewis Larsen
 

Cloud Operations and Analytics: Improving Distributed Systems Reliability using Fault Injection

  • 1. Cloud Operations and Analytics: Improving Distributed Systems Reliability Using Fault Injection. December 12, 2016. Technical University of Munich (www.tum.de). Dr. Jorge Cardoso (jorge.cardoso@huawei.com), Chief Architect for Cloud Operations and Analytics, IT R&D Division
  • 2. About Me
      Jorge Cardoso (http://jorge-cardoso.github.io/)
      Interests: Cloud Computing; Service Science and Internet of Services; Business Process Management; Semantic Web
      Short Bio: Prof. Jorge Cardoso obtained his PhD degree in Computer Science from the University of Georgia (US) in 2002. He is Chief Architect for Cloud Operations and Analytics at Huawei GRC in Munich, Germany, and Professor at the University of Coimbra, Portugal. He frequently publishes papers in first-tier conferences such as ICWS, CAISE, and ESWC, and first-tier journals such as IEEE TSC and Journal of Web Semantics. He has published several books on distributed systems, process management systems, and service systems.
  • 3. Contents: 1 Cloud Computing; 2 Cloud Operating Systems; 3 Cloud Reliability; 4 Fault Injection Techniques; 5 Butterfly Effect Project
  • 4. From Virtualization to Clouds: Cloud Computing Deployment Stages of Enterprises
      Virtualization: computing virtualization; storage virtualization; network and security virtualization
      Private Cloud: automatic management; elastic resource scheduling; HA based on large clusters
      Data Center Consolidation: consolidation of multiple DCs; multi-level backup and DR; software-defined networking (SDN)
      Hybrid Cloud (private, public, hybrid): unified management; optimal resource allocation; flexible service migration
      Focus across the stages: focus on resources; gradually focus on business; focus on global business; flexible and service-driven
  • 5. Virtualization
      Server virtualization is the partitioning of a physical server into smaller virtual servers to maximize resources. The resources of the server are hidden from users. Software is used to divide the physical server into multiple virtual environments. (Communications of the ACM, vol. 17, no. 7, 1974, pp. 412-421)
      Before: four separate x86 servers (Windows XP, Windows 2003, Suse, Red Hat), each running two apps, at 12%, 15%, 18%, and 10% hardware utilization.
      After: one x86 multi-core, multi-processor server hosting the same four systems (Windows XP, Windows 2003, Suse, Red Hat) and their eight apps at 70% hardware utilization.
  • 7. Cloud Operating Systems
      • Examples: Azure, Amazon, Google, Oracle, OpenStack, SoftLayer, etc.
      • Transforms datacenters into pools of resources
      • Provides a management layer for controlling, automating, and efficiently allocating resources
      • Adopts a self-service mode
      • Enables developers to build cloud-aware applications via standard APIs
  • 8. OpenStack History
      • Started by Rackspace and NASA (2010)
      • Driven by the emergence of virtualization
      • Rackspace wanted to rewrite its cloud servers offering
      • NASA had published code for Nova, a Python-based cloud computing controller
      Series | Status | Initial Release Date | EOL Date
      Queens | Future | TBD | TBD
      Pike | Future | TBD | TBD
      Ocata | Under Development | 2017-02-22 (planned) | TBD
      Newton | Current stable release, security-supported | 2016-10-06 | TBD
      Mitaka | Security-supported | 2016-04-07 | 2017-04-10
      Liberty | Security-supported | 2015-10-15 | 2016-11-17
      Kilo | EOL | 2015-04-30 | 2016-05-02
      Juno | EOL | 2014-10-16 | 2015-12-07
      Icehouse | EOL | 2014-04-17 | 2015-07-02
      Havana | EOL | 2013-10-17 | 2014-09-30
      Grizzly | EOL | 2013-04-04 | 2014-03-29
      Folsom | EOL | 2012-09-27 | 2013-11-19
      Essex | EOL | 2012-04-05 | 2013-05-06
      Diablo | EOL | 2011-09-22 | 2013-05-06
      Cactus | Deprecated | 2011-04-15 | not listed
      Bexar | Deprecated | 2011-02-03 | not listed
      Austin | Deprecated | 2010-10-21 | not listed
      https://www.nextplatform.com/2016/11/03/building-stack-openstack/
  • 9. OpenStack Community
      • 1,500+ active participants!
      • 17 countries represented at Design Summit!
      • 60,000+ downloads!
      • Worldwide network of user groups (North America, South America, Europe, Asia and Africa)
  • 11. OpenStack User Survey: A snapshot of OpenStack users’ attitudes and deployments. April 2016. (https://www.openstack.org/assets/survey/April-2016-User-Survey-Report.pdf). Fig. 4.6, p. 31.
  • 14. Deploying OpenStack
      $ sudo yum install -y centos-release-openstack-newton
      $ sudo yum update -y
      $ sudo yum install -y openstack-packstack
      $ packstack --allinone
      Source: https://www.rdoproject.org/install/quickstart/
  • 16. Why does using a cloud infrastructure require advanced approaches for resiliency? One reason [Netflix]: the lack of control over the underlying hardware and the inability to configure it to ensure 100% uptime.
  • 17. Causes of unplanned downtime [1]: software bugs 27%; hardware 23%; human error 18%; network failures 17%; natural disasters 8%.
      Google's 2007 study found the following annualized failure rates (AFRs) for disk drives [2]: 1.7% for 1-year-old drives; more than 8.6% for 3-year-old drives.
      [1] Marcus, E., and Stern, H. Blueprints for High Availability: Designing Resilient Distributed Systems. John Wiley & Sons, Inc., 2003.
      [2] Eduardo Pinheiro, Wolf-Dietrich Weber, and Luiz André Barroso. 2007. Failure trends in a large disk drive population. In Proc. of the 5th USENIX conference on File and Storage Technologies (FAST '07). USENIX Association, Berkeley, CA, USA, 2-2.
  • 18. Amazon AWS: GameDay
      • A program designed to increase resilience by purposely injecting major failures
      • Discover flaws and subtle dependencies
      “That seems totally bizarre on the face of it, but as you dig down, you end up finding some dependency no one knew about previously […] We’ve had situations where we brought down a network in, say, São Paulo, only to find that in doing so we broke our links in Mexico.”
  • 19. Google: DiRT (Disaster Recovery Test)
      • Annual disaster recovery and testing exercise
      • 8 years since inception
      • Multi-day exercise triggering (controlled) failures in systems and processes
      • Premise: 30-day incapacitation of headquarters following a disaster; other offices and facilities may be affected
      • When: “Big disaster” annually for 3-5 days; continuous testing year-round
      • Who: 100s of engineers (Site Reliability, Network, Hardware, Software, Security, Facilities); business units (Human Resources, Finance, Safety, Crisis response, etc.)
      Source: http://flowcon.org/dl/flowcon-sanfran-2014/slides/KripaKrishnan_LearningContinuouslyFromFailures.pdf
  • 20. Netflix: Chaos Monkey
      • April 29, 2011: Amazon EC2 and Amazon RDS service disruption in the US East Region
      • September 20th, 2015: Amazon’s DynamoDB service experienced an availability issue in their US-EAST-1 region
      • Slide annotations: fewer alerts for ops team; transfer traffic to east region
  • 21. Dependability
      • Dependability. Concepts, techniques, and tools developed over the past four decades. It includes the attributes:
          • Availability. Readiness for correct service.
          • Reliability. Continuity of correct service.
          • Safety. Absence of catastrophic consequences on the user(s) and the environment.
          • Integrity. Absence of improper system alterations.
          • Maintainability. Ability to undergo modifications and repairs.
      • Means to attain dependability
          • Fault prevention: means to prevent the occurrence or introduction of faults.
          • Fault tolerance: means to avoid service failures in the presence of faults [Voas98].
          • Fault removal: means to reduce the number and severity of faults.
          • Fault forecasting: means to estimate the present number, the future incidence, and the likely consequences of faults.
      A. Avizienis, J.-C. Laprie, B. Randell, and C. Landwehr. 2004. Basic Concepts and Taxonomy of Dependable and Secure Computing. IEEE Trans. Dependable Secur. Comput. 1, 1, 11-33.
      [Voas98] J. Voas, G. McGraw. “Software Fault Injection: Inoculating programs against errors”. Wiley, USA, 1998.
  • 22. Threats
      • Fault. Adjudged or hypothesized cause of an error.
      • Error. Discrepancy between a computed, observed, or measured value or condition and a true, specified, or theoretically correct value or condition. An error is a consequence of a fault.
      • Failure. Deviation of the delivered service from fulfilling the system function.
      Figure: Fault → Error → Failure.
      Marcello Cinque, Domenico Cotroneo, Antonio Pecchia. Event Logs for the Analysis of Software Failures: A Rule-Based Approach, vol. 39, no. 6, pp. 806-821.
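The fault, error, failure chain can be illustrated with a toy example. The sketch below is not from the slides: the function names and the deliberate off-by-one defect are hypothetical, chosen only to show that a fault always distorts the computed value here (an error), while the delivered service is wrong (a failure) only for some inputs.

```python
def mean_latency_ms(samples):
    """Average latency. Contains a deliberate defect for illustration."""
    total = sum(samples)
    count = len(samples) + 1   # FAULT: off-by-one in the divisor
    return total / count       # ERROR: the computed value deviates from the true mean


def sla_met(samples, limit_ms=100.0):
    """Delivered service: report whether the mean latency is within the limit."""
    return mean_latency_ms(samples) <= limit_ms


# Error without failure: the mean is wrong (about 33.3 instead of 50),
# but the answer delivered to the user is still correct.
masked = sla_met([50.0, 50.0])

# Failure: the true mean is 120 ms, yet the service reports that the SLA is met.
failed = sla_met([120.0, 120.0])
```

In the first call the error stays masked; in the second it propagates to the service interface and becomes a failure.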
  • 24. Fault Injection Techniques
      Taxonomy of fault injection (FI):
      • FI on simulated models: VHDL simulation models; other languages
      • FI on prototypes
          • Hardware injection (HWIFI): external (HWIFI at pin level; electromagnetic perturbations); internal (heavy-ion radiation; laser radiation; scan chain)
          • Software injection (SWIFI): by time (static; dynamic); by level (high level; machine language)
      • Injection objectives: prediction; elimination
      Software-implemented fault injection (SWIFI): fault injection techniques introduce faults to perturb the normal flow of a program in order to extend test coverage or stress test the system. The fault is injected into a software system at run time.
      • Advantages: experiments can be run in near real time; no model development is needed; can be expanded for new classes of faults.
      • Limitations: limited set of injection instants; cannot inject faults into locations that are inaccessible to software; requires modification of the source code to support the fault injection.
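Run-time software fault injection can be sketched in a few lines. The example below is an illustration under stated assumptions, not the tooling used in the talk: `inject_fault`, `InjectedFault`, `read_block`, and `read_with_retry` are hypothetical names, and the injected fault is simply an exception raised in place of the real call.

```python
import functools
import random


class InjectedFault(RuntimeError):
    """Raised instead of running the real call when the injector fires."""


def inject_fault(probability, rng):
    """Wrap a function so that each call fails with the given probability."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if rng.random() < probability:
                raise InjectedFault(f"injected fault in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


def read_block(block_id):
    """Stand-in for a real I/O operation (hypothetical)."""
    return f"data-{block_id}"


def read_with_retry(read, block_id, attempts=3):
    """Fault-tolerance mechanism under test: retry a failed read."""
    for _ in range(attempts):
        try:
            return read(block_id)
        except InjectedFault:
            continue
    return None


always_failing = inject_fault(1.0, random.Random(0))(read_block)
never_failing = inject_fault(0.0, random.Random(0))(read_block)
```

With `probability=1.0` the retry loop is exhausted and returns `None`; with `probability=0.0` the wrapped function behaves like the original. Intermediate probabilities with a seeded `random.Random` give reproducible experiments.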
  • 25. Huawei: Butterfly Effect
      The Butterfly Effect System enables automatic testing and repair of OpenStack and cloud applications (cloud application on top of Huawei FusionSphere).
      The system works by intentionally injecting different failures, testing the ability to survive them, and learning how to predict and repair failures preemptively. Cycle: Failure, Test, Repair.
      In chaos theory, the butterfly effect is the sensitive dependence on initial conditions in which a small change in one state of a deterministic nonlinear system can result in large differences in a later state. [Wikipedia]
  • 26. The Strategy
      • VM failures
          1. Send a VM creation request
          2. Find the compute node where the request was scheduled
          3. Damage the compute server
          4. Check whether the VM creation was re-scheduled to another node
      • Disk temporarily unavailable
          1. Unmount a disk
          2. Wait for replicas to regenerate
          3. Remount the disk with the data intact
          4. Wait for replicas to regenerate; the extra replicas from handoff nodes should get removed
      • Disk replacement
          1. Unmount a disk
          2. Wait for replicas to regenerate
          3. Delete the disk and remount it
          4. Wait for replicas to regenerate; the extra replicas from handoff nodes should get removed
      • Replication
          1. Damage three disks at the same time (more if the replica count is higher)
          2. Check that the replicas did not regenerate even after some time period
          3. Fail if the replicas regenerated
          4. This tests whether the tests themselves are correct
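The four steps of the VM-failure scenario can be sketched against an in-memory model. Everything below is a hypothetical stand-in: `ToyCloud` is not an OpenStack API, and its first-healthy-node placement is only an assumption made to keep the example self-contained.

```python
class ToyCloud:
    """In-memory stand-in for a cloud scheduler (hypothetical, not an OpenStack API)."""

    def __init__(self, nodes):
        self.alive = {node: True for node in nodes}
        self.placement = {}  # VM name -> compute node

    def _pick_node(self):
        """Return the first healthy node, or None if every node is down."""
        return next((n for n, up in self.alive.items() if up), None)

    def create_vm(self, vm):
        node = self._pick_node()
        if node is None:
            raise RuntimeError("no compute node available")
        self.placement[vm] = node
        return node

    def damage_node(self, node):
        """Fault injection: mark the compute node as failed."""
        self.alive[node] = False

    def evacuate(self):
        """Recovery under test: move VMs off failed nodes when possible."""
        for vm, host in list(self.placement.items()):
            if not self.alive[host]:
                target = self._pick_node()
                if target is not None:
                    self.placement[vm] = target


def vm_failure_test(cloud, vm="vm-1"):
    first = cloud.create_vm(vm)    # steps 1-2: create the VM, find its node
    cloud.damage_node(first)       # step 3: damage the compute server
    cloud.evacuate()               # let the recovery mechanism run
    second = cloud.placement[vm]   # step 4: was the VM re-scheduled?
    return second != first and cloud.alive[second]
```

With two compute nodes the test passes; with a single node there is nowhere to re-schedule, so the test reports the missing redundancy.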
  • 27. Test Environment
      • Approach: fully automated and customizable; simple, using ssh and bash scripting
      • FusionServer RH2288. Deploy and destroy: 2 hours to deploy an OpenStack infrastructure with 32 VMs, 32 seconds to destroy it
      • Vagrant. Provides easy-to-configure, reproducible, and portable environments for OpenStack; interfaces to VirtualBox, VMware, AWS, and other providers
      • VirtualBox. Free open-source hypervisor for x86 computers from Oracle; management of virtual machines
      • RDO. Freely available distribution of OpenStack from Red Hat; OpenStack Mitaka
      Figure: stack of Huawei RH2288 + Fedora, Vagrant, VirtualBox, and the VMs running on top.
  • 28. Service to Destroy
      • Targets shown: database, message queue, authentication, hypervisor, hard drive, Nova-Compute
      • In general: nodes, services, processes, network, hypervisor, storage, etc.
      The main testing framework of OpenStack is called Tempest, an open-source project with more than 2000 tests. It performs only black-box testing (tests only access the public interfaces).
  • 29. Scenario Driven
      VM provisioning flow:
          1. Request provisioning (UI/CLI)
          2. Validate auth data
          3. Send API request to Nova API
          4. Validate API token
          5. Process API request
          6. Publish provisioning request
          7. Schedule provisioning
          8. Start VM provisioning
          9. Configure network
          10. Request volume
          11. Request VM image from Glance
          12. Get image URL from Glance
          13. Direct image file copy
          14. Start VM rendering via hypervisor
      Sources: http://www.slideshare.net/mirantis/openstack-architecture-43160012 and http://docs.openstack.org/developer/tempest/field_guide/scenario.html
      Scenario: Create Server. Process: create server. Faults: inject faults.
      OpenStack CLI commands: flavor create, flavor delete, flavor list, host list, hypervisor list, hypervisor show, image add project, image create, image delete, image list, image show, ip fixed add, …
      Example: openstack server create --flavor m1.medium --image "fedora-23" --key-name ayoung-pubkey --security-group default --nic net-id=63258623-1fd5-497c-b62d-e0651e03bdca windows_dev
  • 30. Localized Injection
      • State based
      • Time based
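One way to read the two localization modes is as trigger predicates evaluated while a scenario runs. The sketch below is an interpretation, not the Butterfly Effect implementation: the function names are hypothetical, the step names are borrowed from the provisioning flow, and the one-second step durations are invented for illustration.

```python
def state_trigger(step_name):
    """Fire when the scenario enters the named step (state-based injection)."""
    return lambda step, elapsed: step == step_name


def time_trigger(at_seconds):
    """Fire once the scenario clock reaches at_seconds (time-based injection)."""
    return lambda step, elapsed: elapsed >= at_seconds


def injection_point(steps, trigger):
    """Walk (name, duration) steps and return the step at which the fault is injected."""
    elapsed = 0.0
    for name, duration in steps:
        if trigger(name, elapsed):
            return name
        elapsed += duration
    return None


# Hypothetical durations, in seconds, for a few provisioning steps.
STEPS = [
    ("Validate auth data", 1.0),
    ("Schedule provisioning", 1.0),
    ("Configure network", 1.0),
    ("Request volume", 1.0),
]

by_state = injection_point(STEPS, state_trigger("Schedule provisioning"))
by_time = injection_point(STEPS, time_trigger(2.5))
```

The state-based trigger always hits the same logical step, whereas the time-based trigger hits whichever step happens to be starting once the deadline has passed.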
  • 31. Faults to Inject
      • Bit-flips: CPU registers and memory
      • Memory errors: memory corruption, memory leaks, lack of memory
      • Disk faults: read/write errors, lack of disk space
      • Network faults: packet loss, network congestion, etc.
      • Terminate instance
      • Introduce delays in message delivery
      • Corrupt data in DB
      • Crash services, processes, and applications
      • Reboot node
      • Configuration error
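As a concrete instance of the first fault type, a single bit-flip in a memory buffer can be emulated in software. This is a minimal sketch: `flip_bit` is a hypothetical helper, and the CRC-32 comparison is just one possible detector, added here to show the corruption being noticed.

```python
import zlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    byte_index, offset = divmod(bit_index, 8)
    if not 0 <= byte_index < len(data):
        raise IndexError("bit_index outside buffer")
    corrupted = bytearray(data)
    corrupted[byte_index] ^= 1 << offset   # invert the selected bit
    return bytes(corrupted)


original = b"replica-0001"
corrupted = flip_bit(original, 3)

# A checksum comparison is one way a system under test could detect the fault.
detected = zlib.crc32(original) != zlib.crc32(corrupted)
```

Flipping the same bit twice restores the original buffer, which makes the injection easy to undo after an experiment.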
  • 32. Detect Failures
      The main testing framework of OpenStack is called Tempest, an open-source project with more than 2000 tests. It performs only black-box testing (tests only access the public interfaces).
      • Network tests: create keypairs; create security groups; create networks
      • Compute tests: create a keypair; create a security group; boot an instance
      • Swift tests: create a volume; get the volume; delete the volume
      • Identity tests: …
      • Cinder tests: …
      • Glance tests: …
      echo "$ tempest init cloud-01"
      echo "$ cp tempest/etc/tempest.conf cloud-01/etc/"
      echo "$ cd cloud-01"
      echo "Next is the full test suite:"
      echo "$ ostestr -c 3 --regex '(?!.*\[.*\bslow\b.*\])(^tempest\.(api|scenario))'"
      echo "Next is the minimum basic test:"
      echo "$ ostestr -c 3 --regex '(?!.*\[.*\bslow\b.*\])(^tempest\.scenario\.test_minimum_basic)'"
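The test-selection pattern is easier to follow when applied to a few test identifiers. The sketch below assumes the intended full-suite pattern is `(?!.*\[.*\bslow\b.*\])(^tempest\.(api|scenario))`, with the backslash escapes written out, and evaluates it with Python's `re` module. The test IDs are made up for illustration and are not real Tempest tests.

```python
import re

# Negative lookahead drops any test whose bracketed tag list contains the
# whole word "slow"; the second group keeps only tempest.api and
# tempest.scenario tests.
FULL_SUITE = re.compile(r"(?!.*\[.*\bslow\b.*\])(^tempest\.(api|scenario))")

# Hypothetical test identifiers in the "dotted.path[tags]" style.
TEST_IDS = [
    "tempest.api.compute.test_servers.ServersTest.test_create[id-1,smoke]",
    "tempest.scenario.test_volumes.VolumeTest.test_boot[id-2,slow]",
    "tempest.scenario.test_minimum_basic.BasicTest.test_basic[id-3]",
    "neutron.tests.test_agents.AgentTest.test_list[id-4]",
]


def select(test_ids, pattern=FULL_SUITE):
    """Return the test IDs the pattern would select."""
    return [test_id for test_id in test_ids if pattern.search(test_id)]


selected = select(TEST_IDS)
```

Here the test tagged `slow` and the non-Tempest test are filtered out, leaving the first and third identifiers.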
  • 33. Detect Failures: Limitations of Integration Tests
      Figure (test-set reduction; labels as they appear on the slide): Tempest 0: 1400 tests, 45 min to 2 h; Tempest 1: 100%, 100; 40%, 40; Tempest 2; Tempest 3; overlapping tests; mutually exclusive tests; 5%, Log2 40; branch and bound; 4%, Log2 20.
      • Side effects. Integration tests often have side effects and require specific setups, so they often cannot be used on production systems. For example, an integration test that deletes all the virtual machines running on a platform cannot be run in production.
      • Reuse. Integration test suites are composed of many types of tests (e.g., unit tests, API tests, integration tests, scenario tests, and positive and negative tests). Reusing test code for damage detection is useful, but selecting the right tests can be difficult.
      • Filtering. Most of the tests are not relevant for damage detection on production systems. Damage detection looks for components, services, and processes that are no longer working properly, whereas tests determine whether commits to the code generate errors. Many functional tests written to test software code are irrelevant in production.
      • Specificity. New code for damage detection always needs to be developed, since testing does not typically look for problems that can happen when a system is in a particular operational state.
  • 34. Butterfly Effect: Example of Fault Injection
  • 35. Butterfly Effect: Example of Fault Injection
  • 36. International Industry-Academia Workshop on Cloud Reliability and Resilience
      The workshop was held in Berlin on 7-8 November 2016. It gathered close to 50 participants from industry (Intel, Red Hat, CISCO, SAP, Google, LinkedIn, Microsoft, Mirantis, Brocade, T-Systems, SysEleven, Deutsche Telekom, Flexiant, Hastexo) and academia (TU Wien, TU Berlin, U Oxford, ETH, U Stuttgart, TU Chemnitz, U Potsdam, TU Darmstadt, U Lisbon, U Coimbra).
      Photo: Dmitri Zimine (Brocade) giving his speech on workflows for auto-remediation (credits to Johannes Weingart).
      Photo: Sebastian Kirsch (Google), co-author of the bestselling book Site Reliability Engineering from Google, and the workshop organizer Jorge Cardoso (Huawei).
  • 37. Current Team: Cloud Operations and Analytics
      • Objective
          • Planet-scale distributed systems = automation
          • Highly complex systems = AI and machine learning
      • Skills and knowledge
          • OpenStack software development
          • Machine learning and real-time analysis
          • Reliability for cloud-native applications
          • Large-scale distributed systems
      Open Positions
      • Working student: Distributed Execution Graphs (DEG) for OpenStack.
      • Master students: Efficient Diagnosis in Cloud Platforms; DEG-driven Fault Injection for Cloud Platforms.
      • PhD students: Risk-aware Cloud Recovery using Machine Learning (automation + AI).
      • Internship for PhD student: next generation of DEG-driven systems beyond Google’s Dapper and Twitter’s Zipkin.
      • Working and MSc students: fault injection, fault models, fault libraries, fault plans, break and rebuild systems all day long, …
      • PhD students: rapid prototyping of cool ideas: propose it today, code it, and show it running in 3 months…
      • Postdocs: solving difficult challenges of real problems using quick and dirty prototyping.
  • 39. Executive Summary
      • The complexity and dynamic nature of large-scale cloud platforms require automated solutions to reduce the risk of eventual failures.
      • Fault injection mechanisms make it possible to determine (and repair), under controlled conditions, the types of failures that platforms cannot tolerate, rather than passively waiting for Murphy’s law to come into play on a Sunday at 2am when engineers are off duty.
      • Pioneers such as Amazon, Google, and Netflix have already developed fault injection mechanisms and have also changed their mindset with respect to the importance of the resiliency of cloud platforms.
      • As an innovation topic, we take one step further towards fault-tolerant platforms by exploring not only fault injection but also the automated recovery of platforms.
  • 40. SW Fault Injection Tools
      • FIAT: Fault Injection Based Automated Testing Environment, Carnegie Mellon University.
      • EFI, PROFI: Processor Fault Injector, Dortmund University.
      • FERRARI: Fault and ERRor Automatic Real-time Injector, Texas University.
      • SFI, DOCTOR: integrateD sOftware implemented fault injeCTiOn enviRonment, Michigan University.
      • FINE: Fault Injection and moNitoring Environment, Illinois University.
      • FTAPE: Fault Tolerance and Performance Evaluator, Illinois University.
      • XCEPTION: Coimbra University.
      • MAFALDA, MAFALDA-RT: Microkernel Assessment by Fault injection AnaLysis and Design Aid, LAAS-CNRS, Toulouse.
      • BALLISTA: Carnegie Mellon University.