4. About NICTA
National ICT Australia
• Federal and state funded research company established in 2002
• Largest ICT research resource in Australia
• National impact is an important success metric
• ~700 staff/students working in 5 labs across major capital cities
• 7 university partners
• Providing R&D services, knowledge transfer to Australian (and global) ICT industry
NICTA technology is in over 1 billion mobile phones
NICTA Copyright 2012 From imagination to impact
5. Research Areas at NICTA
• Networks: Aruna Seneviratne
• Machine Learning: Bob Williamson
• Software Systems: Anna Liu, Gernot Heiser
• Computer Vision: Nick Barnes, Richard Hartley
• Optimisation: Mark Wallace, Sylvie Thiebaux, Toby Walsh
• Control & Signal Processing: Peter Corke, Rob Evans
6. Our team's mission: help enterprises take full
advantage as software extends into cloud!
Cost optimised · High availability · Onsite/offsite · Hybrid cloud ·
Real-time monitoring · Disaster recovery · Actionable analytics ·
Business continuity · Intelligent management · Systems resilience ·
Dynamic · Elastic · Real time · High performance
Our applied R&D capability
spans cloud computing, web, SOA,
distributed systems, data management,
analytics, performance monitoring, DR,
automated reasoning, ontologies, AI…
7. Who are we?
• Anna
• Len
8. Who are you?
What would you like from this tutorial?
9. Outline
• Introduction
• Cloud Computing Platforms
• Nature and causes of outages and down-time
• Characteristics of Dependability in Cloud
• Achieving high dependability
• The importance of stateless components
• Techniques to handle performance problems
• Techniques to handle availability problems
• Techniques to handle security problems
• Case Studies: Netflix, Yuruware
• Conclusions
10. Introduction
• Intro to the cloud: xxx-as-a-service, regions/zones
• What is dependability?
• Why is dependability a concern in the cloud?
• Types of dependability and high-level problem descriptions
– performance
– availability
– security
12. What is Cloud Computing?
Cloud computing is a model for enabling convenient, on-demand
network access to a shared pool of configurable computing
resources (e.g., networks, servers, storage, applications, and
services) that can be rapidly provisioned and released with
minimal management effort or service provider interaction.
This cloud model is composed of five essential
characteristics, three service models, and four
deployment models.
- US National Institute of Standards and Technology
13. Characterising Cloud Computing
Measured Service · Resource Pooling · Elasticity · Self Service ·
Ubiquitous Network Access
14. Five Characteristics – NIST Definition
• On-demand Self-Service
– A consumer can provision computing capabilities without human
interaction
• Broad network access
– Computing capabilities are available over the network and accessed
through standard mechanisms
• Resource pooling
– Provider's computing resources are pooled to serve multiple consumers
with different resources dynamically assigned according to consumers'
demands
• Rapid elasticity
– Computing capabilities can be rapidly and elastically provisioned to
quickly scale out and rapidly released to scale in
• Measured service
– Resource usage can be monitored, controlled, and reported, providing
transparency for both the provider and consumer
15. Leading Provider: Amazon EC2
Let's see how Amazon EC2, a leading commercial cloud, looks
I want my cloud!
16. 1. Grab your credit card and create an
account (10 min). Then, access the
console.
2. Select where you want to create your
virtual machines (US East, US West,
Ireland or Singapore).
3. Hit this button.
17. 4. Select a machine
image
• Many pre-configured
images are available
• You can register your
machine images as well
18. 5. Determine the amount of resources to allocate
• <1.0 GHz CPU + 600MB RAM: 0.01 USD/hour
• 1.0 GHz CPU + 1.7GB RAM: 0.04 USD/hour
• 3.0 GHz x 8 CPUs + 68GB RAM: 1.1 USD/hour
• You can pay Win/SQL Server license fees pay-per-hour
19. 6. Define a set of
access control rules
20. 7. Done! (< 5 minutes in total)
• You have your virtual machine at
ec2-184-74-14-28.us-west-
1.compute.amazonaws.com
I got my virtual machine!
21. 8. Connect to my virtual machine
• Just SSH to the address
• You have root access!
You're in an Amazon Datacenter in CA
This is my desktop in Sydney
22. If you like Windows, just
launch a Windows virtual
machine and remote-desktop
to it
Connected through
a VPN connection
You're in an Amazon Datacenter in NV
This is my desktop in Sydney
23. 9. Terminate or hibernate virtual machines
when they are not in use
• In some systems, we use a script to
hibernate virtual machines at 8:00PM
• Restart instances in the morning if necessary.
It takes just a couple of minutes
24. 10. Check your bill in real time
• Hours to run virtual machines
• Network in/out
• VPN
• Disk access
• # of requests made
…
25. Three Service Models – NIST definition
Technology exposed to customers, by provider layer:
Software as a Service
Platform as a Service
Infrastructure as a Service
Datacenter Infrastructure
26. Three Delivery Models
• Infrastructure as a Service (IaaS)
– The consumer has control over operating systems,
storage and deployed applications
• Platform as a Service (PaaS)
– Consumers can deploy applications created using programming
languages and tools supported by the provider (e.g., Java Servlet)
– The provider shields the complexity of its infrastructure
• Scale up/down, load balancing, replication, disaster recovery,
database management, …
• Software as a Service (SaaS)
– Consumers use the provider's applications
– The consumer does not manage the underlying cloud
infrastructure
27. Leading Provider: Google App Engine
Let's see how Google App Engine, a leading
commercial PaaS, looks
I want my PaaS!
28. 1. Create an account (5 min). GAE
offers a large amount of quota for free.
2. Write an application using GAE's
framework.
29. 3. Deploy your application on GAE!
Scale up/down, load balancing,
replication, disaster recovery, database
management, … many functions are
implemented by GAE.
30. 4. Check your resource
usage (CPU, storage, #
of API calls, …)
Pay only when usage
exceeds the free quota
31. Provider Services - 1
• Consumer is allocated some number of virtual
machine instances.
– Number of instances is under the control of the
consumer
– Provider allows consumer to set rules for
"autoscaling", automatically creating and removing
instances
– When new instance is launched it has
• Software as specified by either the consumer or the provider
• Private IP address available only from within cloud. Private IP
address exists for life of instance and will not change
• Public IP address. Addressable from outside the cloud. May
change under certain circumstances
32. Provider Services – 2
• Cloud data centers
– hosted in different geographic regions
– Cloud provider responsible for physical security
• SLAs from cloud providers are for 99.9%+ up
time for the cloud. No guarantee for any
individual instance
• Cloud provider will replicate databases to
different regions or within a region.
35. What is dependability?
• Dependability of a computing system is the
ability to deliver service that can justifiably be
trusted.
– The service delivered by a system is its behaviour as
it is perceived by its user(s);
– a user is another system (physical, human) that
interacts with the former at the service interface.
– The function of a system is what the system is
intended for, and is described by the system
specification.
[ A. Avizienis, J.-C. Laprie and B. Randell: Fundamental Concepts of Dependability.
Research Report No 1145, LAAS-CNRS, April 2001]
36. Parsing the definition
• Dependability is relative
– “justifiably be trusted”
• May be different users with different
expectations
• Users can be systems or humans
• Systems may deliver many services and
dependability may be different for each service
40. Cloud vis-à-vis private data center
• Cloud providers remove some of the problems
of operating a private data center
– Acquisition of physical hardware
– Hiring/training data center staff
– Physical security
• Other problems remain basically the same
– Security threats from internet connections
– Separation of production/test environments
– Patch installation
• Other problems are new or exist in changed
form
It is these other problems that we now focus on.
41. Cloud-Specific Dependability Problems
• Failure
– Instance failure
– Data failure/consistency
– Operator error
– Upgrade error
• Performance
– Latency of provisioning
– Over/under provisioning
– Latency of communication
• Security/privacy
– Credentials and keys
– Multi-tenancy
– Location dependency/governance
• Disaster recovery
42. Provisioning
• Consumer or cloud infrastructure can launch or
delete instance of virtual machine
• When new instance launched it consists of
– Virtual hardware with public and private IP address
– Executable image
– Virtual hard disk
• Provisioning is important both in failure recovery
and performance
43. Elasticity - Over or Under Provisioning
• Elasticity is the defining characteristic of cloud
– Traditional 'scalability' or 'throughput' measures no longer helpful
– "the ability of software to meet changing capacity demands,
deploying and releasing relevant necessary resources on-demand"
• There is often over or under provisioning
45. Instance Failure – recognition
• Basic failure recognition mechanism is
“heartbeat”.
• Instance must periodically show it is still alive
– Send a message
– Respond to query
• Must be an entity that is responsible for
monitoring “aliveness” of instance
– Entity can be infrastructure
– Entity can be other portion of the application
– Entity can be client
• Failed instances are not automatically deleted
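A minimal sketch of the heartbeat mechanism described above, assuming the monitoring entity keeps a last-seen timestamp per instance (class and method names are illustrative, not any provider's API):

```python
import time

class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each instance."""
    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_seen = {}  # instance id -> timestamp of last heartbeat

    def heartbeat(self, instance_id, now=None):
        # An instance periodically calls this to show it is still alive.
        self.last_seen[instance_id] = now if now is not None else time.time()

    def failed_instances(self, now=None):
        # Any instance silent for longer than the timeout is presumed failed.
        now = now if now is not None else time.time()
        return [i for i, t in self.last_seen.items() if now - t > self.timeout]

monitor = HeartbeatMonitor(timeout_seconds=30)
monitor.heartbeat("vm-1", now=100.0)
monitor.heartbeat("vm-2", now=100.0)
monitor.heartbeat("vm-1", now=125.0)   # vm-1 keeps reporting; vm-2 goes silent
print(monitor.failed_instances(now=140.0))  # ['vm-2']
```

The monitoring entity here could equally be the infrastructure, another part of the application, or the client, as the slide notes; only the bookkeeping is shown.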
46. Monitoring for Pending Failure
• Besides PING…
• A dashboard of flashing lights
• Monitoring ongoing CPU, memory utilization,
disk activities, Network activities
• Environmental controls, water/coolant flow,
power and temperature
Akamai’s NOC in Cambridge, Massachusetts
47. State
• An instance can be stateful or stateless
• A stateful instance remembers information from
one message to another. State can be stored
either within instance memory or on external
memory device
• A stateless instance must be sent the necessary
state along with each message.
• HTTP is a stateless protocol so every message
must contain information allowing the instance to
understand the context.
• Recovery process is different for stateful
instances than for stateless instances.
48. Stateful Recovery
• Strategy depends on how much loss of
computation and events can be tolerated.
• Strategy - 1
– Checkpoint image periodically
– On recovery, provision with checkpointed image and
computation will restart from last checkpoint
– Any computation and messages between last
checkpoint and failure will be lost.
– Assumes no state stored on external device.
• Only for cloud because of checkpointing image
49. Stateful Recovery Strategy – 2
• Periodically save important state on persistent
external device.
• When image is activated, it checks whether any
state has been saved. If so, it reads that state
and resumes computation
• Any computation and messages between last
checkpoint and failure will be lost
• The difference from the prior strategy is that it does not
assume a checkpointed image exists; state is explicitly
checkpointed by the application
50. Stateful Recovery Strategy – 3
• Periodically save important state on persistent
external device
• Log incoming messages on persistent external
device
• When image is activated, it checks whether any
state has been saved. If so, it reads that state.
• Activated image then reads log and replays
activity.
• No computation or messages will be lost unless
there is a failure between message arrival and
recording that message on the log. Acks to the client
allow the client to resend a message if necessary.
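Strategy 3 can be sketched as periodic state checkpoints plus a message log that a replacement instance replays on recovery. The service and the dict-backed store below are illustrative stand-ins for a real application and a persistent external device:

```python
class RecoverableCounter:
    """Stateful service: periodically checkpoints state and logs every message,
    so a replacement instance can replay messages arriving after the checkpoint."""
    def __init__(self, store):
        self.store = store            # persistent external device (here, a dict)
        self.total = 0
        self.processed = 0            # messages handled since last checkpoint

    def handle(self, amount):
        self.store.setdefault("log", []).append(amount)  # log before processing
        self.total += amount
        self.processed += 1
        if self.processed >= 3:       # checkpoint important state periodically
            self.store["checkpoint"] = {"total": self.total,
                                        "log_position": len(self.store["log"])}
            self.processed = 0

    @classmethod
    def recover(cls, store):
        # New instance: read saved state, then replay logged messages after it.
        svc = cls(store)
        cp = store.get("checkpoint", {"total": 0, "log_position": 0})
        svc.total = cp["total"]
        for amount in store.get("log", [])[cp["log_position"]:]:
            svc.total += amount       # replay; no computation is lost
        return svc

store = {}
svc = RecoverableCounter(store)
for amount in [5, 10, 20, 40]:       # a checkpoint happens after the third message
    svc.handle(amount)
# ... instance fails here; a replacement recovers from the store ...
replacement = RecoverableCounter.recover(store)
print(replacement.total)  # 75: checkpointed 35 plus replayed 40
```

As the slide notes, a message is only safe once it reaches the log; logging before processing narrows that window.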
51. Comments on Stateful recovery strategies
• Only strategy 1 (provision with checkpointed
image) is specific to cloud
• Other strategies apply also to non-cloud
environments.
• Strategy 3 achieves least data loss since
messages are logged and replayed upon
recovery.
52. Stateless images
• If instance is stateless then
– Infrastructure can send any message to any instance
– Can create new instances for performance or
reliability reasons.
– Router/load balancer/controller is responsible for
getting messages to instances
Clients → Load balancer → Cloud servers
53. How do messages get to instances?
• Two models
– Push. Load balancer decides which instance should
get message
– Pull. Load balancer maintains queue of messages
and instances retrieve messages from queue.
54. Push Architecture Pattern
Clients → Load balancer (with Monitor) → Servers
55. Push Pattern Description
Client sends a request (e.g. HTTP message) to
the app in the cloud.
Request arrives at a load balancer
Load balancer forwards request to one of the VMs
Load balancer uses scheduling strategy to decide
which VM gets the request, e.g. round robin
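The push pattern above can be sketched as a load balancer applying a round-robin scheduling strategy to forward each request to one VM (names are illustrative):

```python
import itertools

class PushLoadBalancer:
    """Forwards each incoming request to one VM, chosen round robin."""
    def __init__(self, vms):
        self.vms = vms
        self._next = itertools.cycle(vms)   # round-robin scheduling strategy

    def forward(self, request):
        vm = next(self._next)               # load balancer decides which VM
        return vm, request

lb = PushLoadBalancer(["vm-a", "vm-b", "vm-c"])
assignments = [lb.forward(f"req-{i}")[0] for i in range(5)]
print(assignments)  # ['vm-a', 'vm-b', 'vm-c', 'vm-a', 'vm-b']
```

Round robin is only one possible scheduling strategy; a real balancer might instead weight by the CPU or request metrics the monitor collects.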
56. Monitor
The load balancer knows
CPU utilization for each VM (through the monitor)
how many requests each VM has gotten
possibly how long it took to service the requests.
The monitor decides (based on rules) when new
resources are needed
57. Failure management within Push Pattern
• Monitor will recognize failure of instance through
non-responsiveness.
• Load Balancer will not send further messages to
instance
• Messages currently being processed by failed
instance are lost
• Client must detect message not processed
(through timeout) and resend message.
59. Pull architecture description
Each request from the client is application specific
and typed.
The queue keeps separate queues for each
application running on the VMs.
A VM requests the next message of a particular
type (pull) and processes it.
When the VM has processed a message, it
informs the controller to remove the message
from the queue.
60. Monitor
The monitor can now see
how long a request waits in a queue
the average queue length
This is an indication of the load on the VMs that
have applications that service requests of that
type.
Allows better scheduling of messages to VMs.
61. Failure Management within Pull Pattern
• Controller knows when message has been
processed.
• If message is not processed within time
interval, controller can reassign it.
• Failed instances will not request further
messages and so take themselves out of
service.
• It is possible for a failed instance to recover and
continue processing on a message that has
been rescheduled so checks must be in place to
keep a message from being double processed.
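A sketch of the pull pattern with the double-processing check described above, using an in-memory stand-in for the message queue and controller (all names are illustrative):

```python
from collections import deque

class MessageQueueController:
    """Keeps a separate queue per request type; VMs pull messages and then
    acknowledge them. Ids of processed messages are remembered so a message
    rescheduled from a presumed-failed VM cannot be double processed."""
    def __init__(self):
        self.queues = {}       # request type -> queue of (msg_id, payload)
        self.done = set()      # ids of messages already processed

    def enqueue(self, msg_type, msg_id, payload):
        self.queues.setdefault(msg_type, deque()).append((msg_id, payload))

    def pull(self, msg_type):
        # A VM requests the next message of a particular type.
        queue = self.queues.get(msg_type)
        return queue.popleft() if queue else None

    def acknowledge(self, msg_id):
        # Returns False if this message was already processed (duplicate).
        if msg_id in self.done:
            return False
        self.done.add(msg_id)
        return True

controller = MessageQueueController()
controller.enqueue("resize-image", msg_id=1, payload="photo.png")
msg = controller.pull("resize-image")          # a VM pulls the message
print(controller.acknowledge(msg[0]))          # True: first completion counts
print(controller.acknowledge(msg[0]))          # False: duplicate is rejected
```

A production queue service would also need a visibility timeout so an unacknowledged message is re-queued; only the duplicate check is shown here.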
62. Cleaning up
• When an instance fails it is not automatically
deallocated
• The consumer must deallocate the failed instance
• When an instance is deallocated
– Public and private IP addresses become available for reallocation
– Possible to tell the infrastructure that the public IP address is
to be assigned to a replacement instance
• Within AWS, charging continues until the instance is
deallocated
63. Data Failure
• Data storage can be “ephemeral” or “persistent”
• Ephemeral storage disappears if instance fails
• Persistent storage is maintained by cloud
provider
– Replicated automatically
– Replicas may be geographically separated
• May lead to problems with data consistency
64. Data Consistency
• Takes time to replicate data
• Means that different replicas of the data may not
be instantaneously consistent
• CAP Theorem: data cannot simultaneously be
– Consistent
– Fully available
– Partition-tolerant (distributed across multiple data stores)
• May take ½ second for data to become
consistent
• Most cloud providers offer “consistent reads” but
at a potential cost in latency
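One common client-side way to cope with eventual consistency is to re-read until the replica catches up. A sketch against a simulated lagging replica follows; the retry helper is an assumption for illustration, not a SimpleDB API:

```python
import time

def read_with_retry(read_fn, expected_version, attempts=5, delay=0.1):
    """Re-reads until the replica reflects the expected version or gives up.
    A simple client-side tactic for eventually consistent reads."""
    for _ in range(attempts):
        value, version = read_fn()
        if version >= expected_version:
            return value
        time.sleep(delay)          # replica may need ~half a second to catch up
    raise TimeoutError("replica never became consistent")

# Simulated replica that lags one read behind the latest write.
state = {"reads": 0}
def lagging_read():
    state["reads"] += 1
    if state["reads"] > 1:
        return ("new", 2)          # replication has caught up
    return ("old", 1)              # stale replica on the first read

print(read_with_retry(lagging_read, expected_version=2, delay=0.0))  # new
```

The alternative, as the slide notes, is to pay the latency cost of the provider's "consistent read" option instead.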
65. Characterising Eventual Consistency in
Amazon SimpleDB
• The probability of reading updated data in SimpleDB in US West
– An application reads data X ms after it has written the data
• SimpleDB has two read operations
– Eventual Consistent Read
– Consistent Read
• This pattern is consistent regardless of the time of day
66. Operator error
• After trying out something in AWS, may want to
go back to original state
• Not always that straight-forward:
– Attaching volume is no problem while the instance is
running, detaching might be problematic
– Creating / changing auto-scaling rules has effect on
number of running instances
• Cannot terminate additional instances, as the rule would
create new ones!
– Deleted / terminated / released resources are gone!
67. Undo for System Operators
Administrator: begin-transaction → do, do, do → rollback
+ commit
+ pseudo-delete
68. Approach
Administrator: begin-transaction → do, do, do → rollback
Undo System: senses cloud resource states (before and after)
69. Approach
Administrator: begin-transaction → do, do, do → rollback
Undo System: senses cloud resource states, capturing the
initial state and the goal state
70. Approach
Administrator: begin-transaction → do, do, do → rollback
Undo System: senses cloud resource states, captures the initial
and goal states, plans a set of actions, generates code, and
executes it
71. Location of instances
• Amazon divides the cloud into
– Regions (currently eight)
• US – east (Northern Va), west (Oregon, Northern Calif), gov
• Asia Pacific – Singapore, Tokyo
• Europe – Ireland
• South America (Sao Paulo)
– Each region has some number of availability zones.
• Each availability zone has distinct physical location, power
sources
• Communication
– within availability zones is high speed,
– across availability zones is lower speed,
– across regions is lowest speed
• Availability zones and regions can be exploited
to improve availability
72. User Visible Failures
• Operator error is largest cause of user visible
errors in large Internet systems
• Largest cause of operator error is configuration
errors during upgrade
– Data may be dated
– Data is based on a world where monthly updates
were considered frequent. Updates may be as
frequent as weekly (Facebook) or even more
frequently – Jan Bosch talks about “continuous
deployment”.
– I have not seen recent data describing sources of
operator error
73. Upgrade Frequency
Upgrades to systems are a very common occurrence.
Upgrade frequency of some common systems:
Application          Average release interval
Facebook (platform)  < 7 days
Google Docs          < 50 days
Media Wiki           21 days (171 schema updates in 4.5 years)
Joomla               30 days
This frequency would suggest it is important to get
the updates correct
74. Configuration parameters
• Options are extensive
– Hadoop – 206
– Cassandra – 36
– HBase – 64
• Massive numbers of dependencies, many
hidden
– File path
– Network address
– Dynamically loaded libraries
– Database schema
– …
75. Basic upgrade strategies
• Rolling Upgrade
– Perform upgrade one node at a time
• Does not require additional resources
• Allows for determination of correctness in an incremental
fashion
• Implies that multiple versions may be simultaneously in
service
• Takes time
• Big flip
– Perform upgrade to a cluster at a time
• Keep users from accessing cluster until upgrade completed
• Takes resources out of service until upgrade is completed
• General industrial practice is Rolling Upgrade
76. Potential error condition during rolling
upgrade
• Multiple versions are simultaneously active
during rolling upgrade
• Opens door to errors resulting from version
incompatibility
• During a single session a client can deal with
multiple versions of a single component.
• May result in “mixed-version” race condition
• “…these race conditions occur frequently during
rolling updates of large Internet systems, such
as Facebook” From “To Upgrade or Not to Upgrade”
77. Mixed Version Race Condition
Client (browser) / Server interaction:
1. Rolling upgrade starts on the server side
2. Client sends initial request
3. A new-version instance replies: HTTP reply with embedded JavaScript
4. Client issues an AJAX callback
5. The callback reaches an old-version instance → ERROR
78. Assumptions/Requirements for a Solution
• Requirements
– Clients never interact with decreasing versions, i.e.,
once a client interacts with version xxx, it will never
interact with a version less than xxx.
– Messages are balanced across all instances of an
application, whether new or old versions.
• Assumptions
– Versions are backwards compatible. i.e. any message
can be processed by the latest version without
creating mixed-version race condition
– Client behavior with respect to the versions with
which it interacts is governed by mobile code sent to
the browser from the server side.
79. Key Ideas of Proposed Solution - 1
• Consider different versions as separate
endpoints for a message. Each version is
www.sample.com/<version number>
• Each instance knows its version number.
• Client knows the largest version number with
which it has interacted.
80. Key ideas of Proposed Solution - 2
• Load Balancer portion
– Use a load balancer that routes messages to different
endpoints
– The load balancer is the entry point for messages.
– Messages with /<version number> in the header are
routed to an instance with version greater than or equal to the
version number, according to the load balancing algorithm
for those instances.
– Messages without version information are routed
according to normal load balancing
• Load balancers are hierarchical
– Ensure that top level is updated before used to route
messages
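A sketch of the version-aware routing idea above: requests tagged with a version are routed only to instances at that version or newer, so a client never sees a decreasing version during a rolling upgrade (instance names and data structures are illustrative):

```python
import random

class VersionAwareBalancer:
    """Routes a request carrying /<version number> only to instances whose
    version is greater than or equal to it."""
    def __init__(self, instances):
        self.instances = instances    # instance name -> version number

    def route(self, version=None):
        if version is None:           # no version info: normal load balancing
            eligible = list(self.instances)
        else:
            eligible = [name for name, v in self.instances.items()
                        if v >= version]
        if not eligible:
            raise RuntimeError("no instance satisfies the version constraint")
        return random.choice(eligible)   # any balancing algorithm works here

# Mid-rolling-upgrade: i-1 still runs version 3, the others version 4.
lb = VersionAwareBalancer({"i-1": 3, "i-2": 4, "i-3": 4})
print(lb.route(version=4) in {"i-2", "i-3"})  # True: never routed to version 3
```

This relies on the stated assumption that versions are backwards compatible, so routing an untagged or old-version request to a newer instance is always safe.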
82. Achieving Elasticity
• Elasticity means the ability to create new (virtual)
resources on demand
• Providers allow consumer to set up “autoscaling”
rules. These rules make the demand automatic
without necessity for operator manual action.
– E.g. create a new instance when an existing instance
is utilizing greater than 75% of CPU for more than 5
minutes.
• Correct strategy for autoscaling is a matter of
research because of the time it takes to create a
new instance, provision it, boot it, and start an
application.
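An autoscaling rule like the example above can be sketched as a simple predicate over recent utilization samples (the threshold, window, and function names are illustrative):

```python
def autoscale_decision(cpu_samples, threshold=0.75, sustained_minutes=5):
    """Returns 'scale-out' when utilization has exceeded the threshold for the
    whole window, mirroring a rule like 'create a new instance when an
    existing instance is over 75% CPU for more than 5 minutes'."""
    window = cpu_samples[-sustained_minutes:]      # one sample per minute
    if len(window) == sustained_minutes and all(u > threshold for u in window):
        return "scale-out"
    return "hold"

print(autoscale_decision([0.80, 0.82, 0.79, 0.91, 0.88]))  # scale-out
print(autoscale_decision([0.80, 0.60, 0.79, 0.91, 0.88]))  # hold
```

Because a new instance takes minutes to provision and boot, the window length is itself a tuning problem: too short and the rule thrashes, too long and the surge is over before capacity arrives.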
83. Provisioning Latency
• Small Instance
– 1.7 GB of memory, 1 EC2 Compute Unit (1 virtual core with 1
EC2 Compute Unit), 160 GB of instance storage, 32-bit platform
with a base install of CentOS 5.3 AMI
– Between 5 and 6 minutes us-east-1c from launch to availability
• Large Instance
– 7.5 GB of memory, 4 EC2 Compute Units (2 virtual cores with 2
EC2 Compute Units each), 850 GB of instance storage, 64-bit
platform with a base install of CentOS 5.3 AMI
– Between 11 and 18 minutes us-east-1c
[http://www.philchen.com/2009/04/21/how-long-does-it-take-to-launch-an-amazon-ec2-
instance]
84. Provisioning Forecasting
• Approaches to predict appropriate number of
instances
• Technique 1 (due to Sadeka Islam)
– Calculate cost of having instances that are unused
(overprovisioning)
– Calculate cost of having requests go unsatisfied
(underprovisioning)
– Allocate additional instances to optimize costs under
various usage scenarios
• Technique 2 (due to Matthew Sladescu )
– Sniff out events that might lead to surge in demand
and use that to predict appropriate number of
instances
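A simplified rendering of Technique 1: choose the instance count that minimizes the expected cost of overprovisioning (idle instances) plus underprovisioning (unsatisfied requests) across usage scenarios. The costs and scenario probabilities below are made up for illustration:

```python
def best_instance_count(demand_scenarios, cost_idle, cost_unmet, max_instances=20):
    """demand_scenarios: list of (probability, instances_needed) pairs.
    Returns the instance count with the lowest expected cost."""
    def expected_cost(n):
        total = 0.0
        for prob, needed in demand_scenarios:
            idle = max(0, n - needed)          # overprovisioned capacity
            unmet = max(0, needed - n)         # unsatisfied demand
            total += prob * (idle * cost_idle + unmet * cost_unmet)
        return total
    return min(range(max_instances + 1), key=expected_cost)

# Usage scenarios: usually 4 instances suffice, occasionally 8 or 12 are needed.
scenarios = [(0.7, 4), (0.2, 8), (0.1, 12)]
print(best_instance_count(scenarios, cost_idle=1.0, cost_unmet=5.0))  # 8
```

With unmet demand costed at five times an idle instance, the optimizer hedges above the common case (4) but does not pay for the rare peak (12).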
85. Latency of Communication
• Measurements by Robin Meehan based on http-
ping
• Within EU region but across availability zones
– Roundtrip to local host within cloud (control) avg = 1.0 ms
– Roundtrip to public IP in same AZ avg = 1.4 ms
• Out of cloud (local England facility) to within
cloud
– us-east = 231 ms
– eu-west = 96 ms
http://smart421.wordpress.com/2011/02/15/amazon-web-services-inter-az-latency-
measurements/
http://smart421.wordpress.com/2011/01/17/which-amazon-web-services-region-should-
you-use-for-your-service/
87. Security topics
• Credentials and keys
• Management of credentials and keys in the
cloud
• Multi-tenancy
• Location dependency/governance
88. Credentials and keys
• A credential identifies you
– As an individual
– As having certain privileges
– As having certain qualifications
• Credentials are used in
– Authentication (you are who you say you are)
– Authorization (you have the rights to perform certain actions)
– Non-repudiation (you cannot deny you did something)
• A key is a magic number used in cryptography
for
– Encrypting/decrypting data
– Digital credentials
89. Basic Data Protection
App outside of cloud (data unencrypted)
→ https: data is encrypted for transfer into the cloud
→ App inside of cloud (data unencrypted, communication encrypted)
→ Data is stored encrypted (by vendor)
90. What can go wrong with the Basic Data
Protection?
• Suppose cloud provider has to respond to
subpoena for data. Your data
may, potentially, be included.
• Cloud provider must decrypt data to respond to
subpoena.
• You may wish to encrypt your data (double
encryption) so that cloud provider can only
provide encrypted data.
• Of course, if subpoena is directed at you, you
must comply with decrypted data.
91. Use of credentials
• Log into app in the cloud
• Attach a disk volume
• Download application from a non-public location
• Access particular data bases.
• For non-public applications, protect your
credentials and your data will be protected.
92. Vulnerabilities to Credentials
• Compromised inadvertently through social
engineering means or carelessness
• Held by disgruntled employee
• Compromised through some sort of attack
93. Goals for credential storage
• Easy to do. If it is difficult to store credentials,
people will avoid their use. A script can
automate the provisioning of credentials but then
the script needs to be protected
• Possible to change in a running instance? Once
an instance has been launched, can the
credentials it uses be changed?
• Possible to change for instances launched in the
future? This issue is related to building
credentials into scripts. If scripts have
credentials built in then it makes it difficult to
change them in the future.
94. Options for getting credentials to App in the
cloud
• Send credentials from client outside the cloud
– HTTPS will negotiate encryption of credentials over the internet
– Assumes credentials can be kept private on clients that have
them.
– Credentials need to be sent every time there is a new instance –
• Pass credentials in as a parameter during
launch of instance
– Credentials persist for the life of the instance so if credentials
change, can re-instantiate instance
– Means credentials are stored on a server – itself a vulnerability
95. More options for getting credentials to App
server
• Build credentials into the image
– App server is instantiated from an image in the image library
– Could install credentials in the image when building it
– Makes it difficult to change credentials
– Prevents reuse of image (or makes reusing image a very bad
idea)
• Keep credentials in persistent storage.
– Access control list for persistent storage provides protection
based on credentials
– Credentials may be based on a different account
96. Conclusion with respect to credential
management
• No insurmountable problem
• Needs to be thought through
– Who has access to credentials?
– Will I ever need to change credentials?
97. What is Multi-tenancy?
VM for customer 1 | VM for customer 2 | VM for customer 3
Hypervisor
Server
Local Network
Storage (holding data of multiple customers)
98. Multi Tenancy Gets More Complicated
End users
VM for customer 1 | VM for customer 2 | VM for customer 3
Hypervisor
99. Multi Tenancy Means “Sharing”
• Consumers share hardware
– CPU
– Network
– Storage media
• Consumers share software
– Hypervisor
• End users share applications
– E.g. Salesforce.com
100. What are the problems with Multi-tenancy?
• Performance – other users or consumers will
consume resources and, potentially, keep you
from achieving your performance requirements.
– Some providers allow consumers to reserve complete
machines that would prevent multi-tenancy from
occurring.
• Security – other users could potentially break
confidentiality or integrity
– Provider uses isolation for security. Consumer must
have trust in provider
– Consumer uses encryption to protect data.
101. Isolation assumptions
• Virtual machines are isolated based on virtual
memory technology and addressing scheme
– Processor manufacturers have specialized hardware
to support virtualization
– Hypervisor introduces a new layer of privileged
software that could be attacked.
• Hypervisors provide facilities to isolate networks.
• Disk isolation is the same as in a non-cloud
environment. OSs or shared software provide
facilities.
102. Personally Identifiable Information
• Personally identifiable information (US NIST)
– Information which can be used to distinguish or trace an
individual's identity, such as their name, social security number,
biometric records, etc. alone, or when combined with other
personal or identifying information which is linked or linkable to a
specific individual, such as date and place of birth, mother’s
maiden name, etc.
• Personal data (EU)
– ‘personal data' shall mean any information relating to an
identified or identifiable natural person ('data subject'); an
identifiable person is one who can be identified, directly or
indirectly, in particular by reference to an identification number or
to one or more factors specific to his physical, physiological,
mental, economic, cultural or social identity
103. Location dependency/governance
• Some jurisdictions require that personal
information for their jurisdiction is not stored
outside of the jurisdiction
– The EU requires that personal information can leave
the EU only for locations that have equivalent privacy
guarantees
– Australia has a similar policy
– "If offshore cloud compromises your data, we'll sue you, not them", Victorian Privacy Commissioner
• Some jurisdictions claim rights to access any
data stored within their borders
– US Patriot Act gives US government right to examine
any data stored in the US.
104. What does this mean in the cloud?
• Knowing location of data centers
– Amazon provides locations of their data centers
– Google does not
• Does this mean just use Amazon data center in
region compliant with your requirements?
– Not so fast!
– Backup locations may be chosen by the provider and could be anywhere
– Controlling the backup location based on data content is a complicated problem
• Amazon does have a gov region that almost
certainly complies with US government
regulations
105. Use tokens as a replacement for PII
• A token is an identifier that has no mathematical
mapping to the individual being identified
– E.g., number the people in this tutorial arbitrarily
– Your number becomes a unique identifier for your PII stored in the cloud
– I keep the mapping between you and your token privately, according to jurisdictional laws
106. Example of token use
• Original data
– John Doe
– Sensitive information
• Token table (kept locally to conform to privacy
laws)
– John Doe
– Token for John Doe
• Data stored in cloud
– Token
– Sensitive information
• Take join of token table and data table in cloud
and the original data is restored
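The token scheme above can be sketched end-to-end. This is a toy illustration: the table and column names are hypothetical, and two in-memory sqlite3 databases stand in for the local (in-jurisdiction) store and the cloud store.

```python
import sqlite3
import uuid

# Local database: keeps the PII-to-token mapping inside the jurisdiction.
local = sqlite3.connect(":memory:")
local.execute("CREATE TABLE token_map (name TEXT, token TEXT)")

# "Cloud" database: stores only the token plus the sensitive payload.
cloud = sqlite3.connect(":memory:")
cloud.execute("CREATE TABLE records (token TEXT, payload TEXT)")

def store(name, payload):
    """Replace the identifying name with an opaque token before upload."""
    token = uuid.uuid4().hex  # random: no mathematical mapping to the person
    local.execute("INSERT INTO token_map VALUES (?, ?)", (name, token))
    cloud.execute("INSERT INTO records VALUES (?, ?)", (token, payload))

def restore(name):
    """Join the local token table with the cloud data to rebuild the record."""
    row = local.execute(
        "SELECT token FROM token_map WHERE name = ?", (name,)).fetchone()
    if row is None:
        return None
    payload = cloud.execute(
        "SELECT payload FROM records WHERE token = ?", row).fetchone()
    return (name, payload[0]) if payload else None

store("John Doe", "sensitive information")
print(restore("John Doe"))  # → ('John Doe', 'sensitive information')
```

The cloud side never sees "John Doe", only the random token; the join that restores the original data can only be performed where the token table lives.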
107. How about jurisdictional problem?
• Tokens
– Technique for decoupling PII from identifier.
– Adds a level of indirection and protects that level
locally
• Does this solve jurisdictional problems?
– I don't know
– PerspecSys claims it does: http://www.perspecsys.com/how-we-help/data-residency/
110. Netflix Corporation
• Launched in 1998 after its founder was irritated at having to pay late fees on a DVD rental.
• DVD Model
– Pay monthly membership fee that includes
rentals, shipping and no late fees
– Maintain online queue of desired rentals
– When you return your last rental (depending on your service plan), the next item in your queue is mailed to you together with a return envelope.
• Customers rate movies, and Netflix makes recommendations based on their preferences
111. Streaming video - 1
• Streaming video service introduced in 2008
• Customers can watch Netflix streaming video on a wide variety of devices, many of which feed into a TV
– Roku set top box
– Blu-ray disc players
– Xbox 360
– TV directly
– PlayStation 3
– …
• Customers can stop and restart video at will.
Netflix calls these locations in the films
“bookmarks”.
112. Streaming video - 2
• Initially, one hour of streaming video was
available to customers for every dollar they
spent on their plan
• In January 2008, every customer became entitled to unlimited streaming video.
• In November 2011, Netflix changed its billing model to separate charges for DVDs and streaming
113. Internet statistics
• In May 2011, Netflix streaming video accounted for 22% of all internet traffic, and 30% of traffic during peak usage hours.
• Three bandwidth tiers
– Continuous bandwidth to the client of 5 Mbit/s. HDTV, surround
sound
– Continuous bandwidth to the client of 3Mbit/s – better than DVD
– Continuous bandwidth to the client of 1.5Mbit/s – DVD quality
114. Netflix's move to the cloud
• In late 2008, Netflix had a single data center with
Oracle as the main database system.
• With the growth of subscriptions and streaming
video, it was clear that they would soon outgrow
the data center.
• Two options:
– Build more data centers
– Use the cloud
• Netflix chose the Amazon EC2 platform
115. Why EC2?
• Four reasons cited by Netflix for moving to the
cloud
1. Every layer of the software stack needed to scale horizontally, be
more reliable, redundant, and fault tolerant. This leads to reason #2
2. Outsourcing data center infrastructure to Amazon allowed Netflix
engineers to focus on building and improving their business.
3. Netflix is not very good at predicting customer growth or device
engagement. They underestimated their growth rate. The cloud
supports rapid scaling.
4. Cloud computing is the future. This will help Netflix with recruiting
engineers who are interested in honing their skills, and will help
scale the business. It will also ensure competition among cloud
providers helping to keep costs down.
• Why Amazon and EC2? In 2008, Amazon was
the leading supplier. Netflix wanted an IaaS so
they could focus on their core competencies.
116. Netflix applications
Video ratings, reviews, and recommendations
Video streaming
User registration, log-in
Video queues
Billing
DVD disc management – inventory and shipping
Video metadata management – movie cast
information
117. Netflix Reliability
• Deep service dependency hierarchy
• 1 billion incoming calls/day
• Across 1000s of instances
• Intermittent failure guaranteed
118. Approach to detecting faults
• Fast network timeouts and retries
• Separate threads on per-dependency thread pools
• Semaphores instead of threads for services that do not perform network calls
• Circuit breaker
– Service calls are decorated with code to test whether the service is failing too often
119. If failure detected
• Custom fallback
– Each service has specific fallback plan
• Fail silent
– Service returns a null value and invoking service
knows it has failed
• The API should show what is happening now, in real time, not as of some past time. The dashboard shown to operators has red/yellow/green lights for important services
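The circuit breaker and fallback behaviour described on the last two slides can be sketched in a few lines. This is a minimal illustration, not Netflix's implementation: the failure threshold, reset timeout, and fallback value are arbitrary placeholders.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after `max_failures` consecutive failures
    the circuit opens, and calls return the fallback immediately (fail
    silent) for `reset_timeout` seconds instead of waiting on a
    dependency that is known to be failing too often."""

    def __init__(self, max_failures=3, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback=None):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback          # circuit open: skip the real call
            self.opened_at = None        # half-open: allow one trial call
            self.failures = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return fallback              # custom fallback / fail silent
        self.failures = 0
        return result
```

Decorating each service call this way is what lets the invoking service distinguish "dependency failed, use the fallback" from a healthy response, and keeps threads from piling up behind a dead dependency.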
120. Netflix test suite - 1
• Netflix has a variety of test programs they call
the Simian Army. These programs include
– Chaos monkey. Randomly kill a process and monitor the effect.
– Latency monkey. Randomly introduce latency and monitor the
effect.
– Doctor monkey. The Doctor Monkey taps into health checks that
run on each instance as well as monitors other external signs of
health (e.g. CPU load) to detect unhealthy instances.
– Janitor Monkey. The Janitor Monkey ensures that the Netflix
cloud environment is running free of clutter and waste. It
searches for unused resources and disposes of them.
121. Netflix test suite - 2
– Conformity Monkey. The Conformity Monkey finds instances that don't adhere to best practices and shuts them down. For example, if an instance does not belong to an auto-scaling group, that is a potential problem.
– Security Monkey The Security Monkey is an extension of
Conformity Monkey. It finds security violations or vulnerabilities,
such as improperly configured AWS security groups, and
terminates the offending instances. It also ensures that all our
SSL and DRM certificates are valid and are not coming up for
renewal.
– 10-18 Monkey The 10-18 Monkey (Localization-
Internationalization) detects configuration and run time problems
in instances serving customers in multiple geographic regions,
using different languages and character sets. The name 10-18 comes from L10n and I18n, where 10 and 18 are the number of letters elided from the words localization and internationalization.
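The core idea behind Chaos Monkey, the first member of the Simian Army above, can be simulated in-process: kill a random instance and check that the system still serves. This toy fleet model is hypothetical; the real tool terminates actual cloud instances.

```python
import random

def chaos_monkey(fleet, rng=random):
    """Randomly terminate one instance from a fleet, in the spirit of
    Netflix's Chaos Monkey: the surviving system should keep serving.
    `fleet` maps instance-id -> request handler; returns the victim id."""
    victim = rng.choice(sorted(fleet))
    del fleet[victim]            # simulate killing the instance
    return victim

def serve(fleet, request):
    """Route a request to any live instance; tolerates single failures."""
    if not fleet:
        raise RuntimeError("total outage")
    instance = next(iter(fleet.values()))
    return instance(request)

fleet = {f"i-{n}": (lambda req: f"ok:{req}") for n in range(3)}
chaos_monkey(fleet)
print(serve(fleet, "GET /"))  # → ok:GET /  (still answers after a kill)
```

Running this continuously against production, as Netflix does, turns "intermittent failure guaranteed" from a risk into a routinely exercised code path.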
122. Performance
• Create new auto-scaling group for each new
version of code
– Copy entire configuration to new group
– Test behaviour under load by squeezing traffic in
production to a smaller set of servers or generating
artificial load against a single server
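Generating artificial load against a single server, as above, amounts to driving concurrent requests at it and watching latency percentiles. A sketch, with the "server" reduced to a callable so the shape of a squeeze test is visible:

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def generate_load(handler, requests=200, concurrency=8):
    """Drive artificial load against a single server (here just a
    callable) and report latency percentiles. A squeeze test keeps
    raising load until the response-time budget is exceeded."""
    latencies = []

    def one_call(i):
        start = time.perf_counter()
        handler(i)
        latencies.append(time.perf_counter() - start)

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(one_call, range(requests)))  # wait for all calls
    return {
        "median_s": statistics.median(latencies),
        "p95_s": statistics.quantiles(latencies, n=20)[-1],  # ~95th pct
    }
```

Raising `concurrency` while watching `p95_s` approximates the squeeze: the load at which the 95th percentile blows past the budget is the effective capacity of one server, which then sizes the new auto-scaling group.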
123. SmugMug
• Photo sharing site
• Survived the April 2011 AWS outage
• Recommendations
– Spread across as many availability zones as possible
– Spread across regions if possible
– Build for failure (like Chaos Monkey)
– Understand how components fail (yours and the cloud provider's services)
124. Others
• Bizo
– Use circuit breakers. Assume services will fail, cache
data and monitor extensively to detect failure.
• SimpleGeo
– Share nothing, redundancy, automated failover, automated replication
• Twilio
– Unit of failure is a single host
• Simple, replicable services
– Short timeouts and quick retries
– Idempotent service interfaces (stateless)
– Relax consistency requirements
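Twilio's combination of short timeouts, quick retries and idempotent interfaces can be sketched as a retry wrapper. `fn` here is a hypothetical client call taking a timeout; retrying blindly like this is only safe because the service interface is assumed idempotent, so repeating a request that may already have succeeded cannot change the outcome.

```python
import time

def call_with_retries(fn, attempts=3, timeout=0.5, backoff=0.1):
    """Short timeout and quick retry. `fn(timeout)` is a hypothetical
    client call that raises TimeoutError when the host does not answer
    in time; a failed single host is simply retried (ideally against
    another replica)."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fn(timeout)
        except TimeoutError as exc:
            last_error = exc
            time.sleep(backoff * (attempt + 1))  # brief, bounded backoff
    raise last_error
```

Keeping the timeout short is what bounds the blast radius of a single failed host: callers spend milliseconds discovering the failure instead of queuing behind it.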
126. Enterprise DR under pressure?
Issues:
• DR requirements are growing, driven by (a) changing customer expectations and the associated reputational risks, and (b) government and industry regulations
• Infrastructure for DR is expensive: sophisticated DR is only affordable for a small percentage of applications, which forces compromises and prioritisation
• Confidence in initiating a recovery is often less than it should be (too long, too much loss, uncertain integrity)
• DR solutions are often too "local" and insufficiently resilient
• Enterprise IT is becoming more complex
Cost of DR is increasing:
• Improving business continuity (BC) and DR is the 2nd-highest priority for enterprises for 2010/2011
• BC/DR typically claims 6-7% of the total IT budget
• 32% of enterprises plan to increase spending on BC/DR by at least 5% in 2010/2011
(Forrester global survey of 2,803 IT decision-makers, Sept 2010)
[Chart: pyramid of DR coverage – good DR is only affordable for a few higher-priority applications; the rest get limited coverage or no cover]
Hypothesis: we can use cloud to extend DR at 1/10th the cost.
127. Using Cloud for Business Continuity
• Two main usages of cloud for Business Continuity:
– Provides highly available systems for day-to-day business
– Serves as a technology platform to implement disaster recovery
• Some definitions:
– Business Continuity: “Activity performed by an organisation to
ensure that critical business functions will be available to
customers, suppliers, regulators and other entities…”
– Disaster Recovery: “A small subset of business continuity. The
process, policies and procedures related to preparing for
recovery or continuation of technology infrastructure critical to an
organisation after a natural or human-induced disaster”
– Fault Tolerance: “The property that enables a system to
continue operating properly, possibly at a reduced quality
level…”
128. Building Highly Reliable Systems with Cloud
• Must address potential failures at two levels:
– Hardware/Infrastructure
• To prevent Single-Point-of-Failure (SPOF) by adding
redundancy in all hardware components (i.e., redundant
disks, redundant network devices, redundant power supply,
etc.)
• NOT all cloud providers provide 100% availability. Check
your SLA!!
– Application
• Prepare a fail-over system to take over in case of a failure
• Replicate databases to minimise downtime and loss of data
• Replicate to a geographically different location (e.g., to avoid natural disasters such as floods)
129. DR As A Service – Requirements
• Cost Effective DR-As-A-Service is essential to
get the DR solution deployed
• Deep architectural expertise does not exist in
many businesses
• Needs solutions that achieve dependability and are:
• Non-intrusive at runtime
• Do not require changes to the application architecture
• Work across platforms
• Cheaper and easier to use than the current state of practice
130. Case Study: Building Reliable System using EC2
• The highly replicated architecture of the cloud makes it a great foundation for business continuity solutions
• The globally distributed nature of the cloud further enhances its disaster recovery capability
• Availability limitations mean you need to be realistic about Hot vs Warm vs Cold standby options
[Diagram: an Elastic Load Balancer forwards client requests to EC2 instances created under Auto Scaling rules (minimum sizes 1 and 2) across Availability Zones A, B and C, with an Elastic IP address allocated to the service]
131. Case Study: Building Reliable System using EC2 (Contd)
• Data backup in AWS
– Amazon S3 is best for off-site data backup
• Stores large binary files
• Designed to provide 99.999999999% durability
• Objects are redundantly stored in multiple facilities in a
Region
– Back up using EBS
• Uses a regular file system
• Takes image (or snapshot) of the partition
– VM Import
• Allows for easy replication from on-premises to cloud
• Not trivial to replicate configuration details such as network settings and disk drives
132. The Business Opportunity
Three standby options, with downtime increasing from left to right:
• Hot Standby – run transactions on multiple sites but use only one; mirror data via a dedicated high-speed network (e.g., SANs). Downtime: seconds (auto failover, minimum data loss). Hot standby implies "always-on" costs even in the cloud; for many applications the cost is not feasible.
• Warm Standby – regularly back up app/data to a backup site; launch systems upon a disaster. Downtime: minutes to a few hours (auto failover, some data loss).
• Cold Standby – ship backups offsite; hardware is not already set up; recover systems after the disaster. Downtime: a few days to weeks (manual failover, large data loss).
With traditional DR, the cost of warm and cold standby is comparable; cloud DR changes this.
135. Conclusions
• Cloud Computing brings unique dependability
challenges
• Latency across the global links
• Full automation means faster than ever error propagation
• Multi-tenancy issues
• Many traditional dependability patterns still work, but some new techniques are needed in the cloud era
• Traditional Patterns: stateless, etc
• Upgrade, undo/redo
• Simian armies, DR-As-A-Service
136. References
• How to keep your AWS credentials on an EC2 Instance Securely,
Shlomo Swidler, http://shlomoswidler.com/2009/08/how-to-keep-
your-aws-credentials-on-ec2.html
• http://techblog.netflix.com/
• Cloud Performance Benchmark Series, Network Performance:
Rackspace.com, Sumit Sanghrajka, Radu Sion,
http://www.cs.stonybrook.edu/~sion/research/sion2011cloud-
net2.pdf
• How long does it take to launch an Amazon EC2 instance, Phil
Chen, http://www.philchen.com/2009/04/21/how-long-does-it-take-
to-launch-an-amazon-ec2-instance
• Basic Concepts and Taxonomy of Dependable and Secure
Computing, Avizienis, Laprie, Randell, Landwehr, IEEE
Transactions on Dependable and Secure Computing, Vol 1, No 1,
Jan-March 2004
137. References - 2
• Cloud Software Updates: Challenges and Opportunities, Neamtiu, Dumitras,
http://www.ece.cmu.edu/~tdumitra/public_documents/neamtiu11clou
dupgrades11.pdf
• To upgrade or not to Upgrade, Dumitras, Narasimhan, Tilevich,
Onward! 2010
• Cloud Application Architectures, George Reese, O'Reilly, 2009
• Why do internet services fail and what can be done about it?
Oppenheimer, et al. Usenix Symposium on Internet Technologies
and Systems, 2003
• Data Consistency properties and the trade-offs in commercial cloud
storages: the consumers' perspectives, Wada, et al. 5th Biennial
conference on Innovative Data Systems Research, CiDR, 2011
http://www.nicta.com.au/pub?id=4341
138. References - 3
• Why do upgrades fail and what can we do about it? Tudor Dumitras and Priya Narasimhan, Proceedings of the ACM/IFIP/USENIX 10th International Conference on Middleware (Middleware '09), 2009
• Using Program Analysis to Reduce Misconfiguration in Open Source
Systems Software, Ariel Rabkin, PhD thesis, University of California, Berkeley, 2012
• A method for preventing mixed version race conditions, Bass, Wada
https://docs.google.com/open?id=0ByLr8SO1MsAiaXVxcmNNcDhV
czg, 2012
• Automatic Undo for Cloud Management via AI Planning, Ingo
Weber, Hiroshi Wada, Alan Fekete, Anna Liu, Len
Bass, Proceedings of the 12th Hot Topics in System Dependability
http://www.nicta.com.au/pub?id=5994
139. References - 4
• How a consumer can measure elasticity for cloud platforms, Sadeka
Islam, Kevin Lee, Alan Fekete, Anna Liu, Proceedings of the 3rd
Joint WOSP/SIPEW International Conference on Performance
Engineering, p.85-96, 2012
• Empirical prediction models for adaptive resource provisioning in the
cloud, Sadeka Islam, Jacky Keung, Kevin Lee, Anna Liu, Future
Generation Computer Systems, Vol 28, No.1, p.155-162, 2012
140. Q&A
Thank You!
Research study opportunities in dependable cloud computing:
• Software Architecture
• Data Management
• Performance Engineering
• Autonomic Computing
To find out more, send your CV and undergraduate details to
students@nicta.com.au
Editor's Notes
Reduce cost, reduce complexity
Need to cut out more words on this slide – just tell the story!! Still need to do good EA, planning, monitoring, governance and management. Risk management approach to security and privacy. Plan for integration with existing assets. Come pick our brains at UNSW/NICTA.
NICTA will focus on six research groups of significant scale and focus in which we have a genuine opportunity to be ranked in the top five in an area in the world. Research groups have been selected on the basis of current NICTA strengths in research and research leadership.
• Software Systems – aims to develop game-changing techniques, frameworks and methodologies for the design of integrated, secure, reliable, performant and adaptive software architectures. Software systems have pervasive application in real-world settings ranging from enterprise ecosystems to embedded systems.
• Networks – the networks research group will develop new theories, models and methods to support future networked applications and services. Networked systems will address issues such as radio spectrum scarcity, wired bandwidth abundance, context and content, improvements to computing, energy constraints, and data privacy.
• Machine Learning – the science of interpreting and understanding data. The core problems are jointly statistical and computational. NICTA research will aim to develop machine learning as an engineering discipline, drawing on a spectrum of work from conceptual theory through algorithmics. Machine learning applications will aim to find commonalities between problems, developing implementation frameworks that genuinely encourage reuse across different domains.
• Computer Vision – aims to understand the world through images and video. NICTA will focus on areas including geometry, detection and recognition, optimisation, segmentation, scene understanding, shape/illumination and reflectance, biologically inspired approaches and the interfaces between them, drawing on approaches including statistical methods, learning and optimisation. Computer vision is a key enabling research discipline for many applications, including visual surveillance, the bionic eye, and mapping of the environment.
• Control and Signal Processing – comprises a substantial group of sub-disciplines dealing with optimisation, estimation, detection, identification, behaviour modification, feedback control and stability of a very large class of dynamical systems. It is likely that NICTA will focus on problems of control and signal processing in large-scale decentralised systems, which are core to many new ICT systems. Techniques from information theory, Bayesian networks, large-scale optimisation, etc. are employed to address this important class of problem.
• Optimisation – the "science of better". Research will focus on the interface between constraint programming, operations research, satisfiability, search, automated reasoning, machine learning, simulation and game theory, exploring methods that combine algorithms from these different areas. Optimisation applications will address multi-faceted questions such as how best to schedule in a network, whether there is a better folding for a protein, or how best to operate a supply chain.
Also comment on Public vs Private, and the need to prepare for Hybrid.
• Rapid Elasticity: elasticity is defined as the ability to scale resources both up and down as needed. To the consumer, the cloud appears to be infinite, and the consumer can purchase as much or as little computing power as they need. This is one of the essential characteristics of cloud computing in the NIST definition.
• Measured Service: in a measured service, aspects of the cloud service are controlled and monitored by the cloud provider. This is crucial for billing, access control, resource optimization, capacity planning and other tasks.
• On-Demand Self-Service: the on-demand and self-service aspects of cloud computing mean that a consumer can use cloud services as needed without any human interaction with the cloud provider.
• Ubiquitous Network Access: ubiquitous network access means that the cloud provider's capabilities are available over the network and can be accessed through standard mechanisms by both thick and thin clients.
• Resource Pooling: resource pooling allows a cloud provider to serve its consumers via a multi-tenant model. Physical and virtual resources are assigned and reassigned according to consumer demand. There is a sense of location independence in that the customer generally has no control or knowledge over the exact location of the provided resources but may be able to specify location at a higher level of abstraction (e.g., country, state, or datacenter).
We have this data from our own studies!! Ping Kevin to get our own reference...
We also have this sort of data ourselves!! From Australia, obviously!
Where does Amadeus sit? Can we identify a set of apps that's in cold standby now and can be pushed into warm standby easily/cheaply using cloud?