As a service provider, Rackspace is constantly bringing new OpenStack capacity online. In this session, we detail the myriad challenges around adding new compute capacity, including planning, automation, organizational change, quality assurance, monitoring, security, networking, integration, and more.
1. Capacity Management and Provisioning (Cloud's full, can't build here)
Matt Van Winkle, Manager Cloud Engineering @mvanwink
Andy Hill, Systems Engineer @andyhky
Joel Preas, Systems Engineer @joelintheory
2. Public Cloud Capacity at Rackspace
• Rackspace Public Cloud has deployed 100+ cells in ~2 years
• New cells used to require hands-on engineer assembly, taking 3-5 weeks after a bare OS install
• One year later, builds are done by on-shift operators in ~1 week (as low as 1 day)
• Usually constrained by networking
3. Control Plane Sizing
• Data plane operations impact both the cell and top-level control plane
– Image downloads/uploads
• How large should the Nova DB be?
– Breaking point of ‘standard’ cell control plane buildout, particularly the database
4. Cell Sizing Considerations
• Efficient use of private IP address space
– Used for connections to services like Swift and the dedicated environment
• Broadcast domains
• Aim for a minimal control plane to limit overhead/complexity
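As a rough sketch of the address-space arithmetic behind cell sizing, the snippet below estimates how many guest-usable addresses one cell's private range can offer. The CIDR and the reserved-address count are illustrative assumptions, not Rackspace figures.

```python
import ipaddress

# Hedged sketch: guest-usable addresses in a cell's private range.
# The `reserved` count (gateways, infra addresses) is an assumption.
def cell_ip_budget(cidr: str, reserved: int = 8) -> int:
    net = ipaddress.ip_network(cidr)
    return net.num_addresses - reserved

print(cell_ip_budget("10.4.0.0/20"))  # 4096 - 8 = 4088 usable addresses
```

A /20 per cell keeps the broadcast domain bounded while leaving room for several thousand guests; the right split depends on per-host guest density.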
5. Hypervisor Sizing Considerations
• Enough spare drive space for COW images
– A XenServer VHD can easily be 2x the space given to the guest during normal operation!
– Errors in cleaning up “snapshots” are exacerbated by tight disk overhead constraints
• Drive space for pre-cached images
– cache_images=some # nova
– use_cow_images=True # nova
– cache_in_nova=True # glance
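The headroom math the bullets above imply can be sketched as follows. The 2x copy-on-write factor comes from the slide; the guest count, image cache size, and emergency reserve are illustrative assumptions.

```python
# Hedged sketch of hypervisor local-disk sizing: COW VHDs can balloon
# to ~2x the flavor's disk allotment, plus space for pre-cached base
# images and an emergency reserve. All figures are illustrative.
def required_local_disk_gb(guests, flavor_disk_gb, cow_factor=2.0,
                           cached_images=10, image_size_gb=5,
                           emergency_reserve_gb=100):
    guest_space = guests * flavor_disk_gb * cow_factor
    cache_space = cached_images * image_size_gb
    return guest_space + cache_space + emergency_reserve_gb

print(required_local_disk_gb(guests=40, flavor_disk_gb=20))
# 40*20*2 + 10*5 + 100 = 1750 GB
```

Sizing to the flavor's nominal disk alone, without the COW factor, is exactly how the snapshot-cleanup errors above become outages.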
6. Other Sizing Notes
• Need reserve space for emergencies (host evac)
• Reserve space is cell-bound, since instances cannot move between cells
– https://review.openstack.org/#/c/125607/
– cells.host_reserve_percent
• VM overhead
– https://wiki.openstack.org/wiki/XenServer/Overhead
– https://review.openstack.org/#/c/60087/
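The cell-bound reserve works out to simple per-cell arithmetic, sketched below. The percentages and host counts are illustrative, not production values.

```python
import math

# Hedged sketch: because instances cannot move between cells, each cell
# must carry its own evacuation headroom (cf. cells.host_reserve_percent).
def reserved_hosts(total_hosts: int, host_reserve_percent: float) -> int:
    """Hosts to hold back in a cell for host-evacuation emergencies."""
    return math.ceil(total_hosts * host_reserve_percent / 100)

print(reserved_hosts(200, 5))  # 10 hosts kept free for host evac
```

Note the ceiling: a small cell with a low percentage still holds back at least one whole host, which is one reason many tiny cells cost more reserve capacity than a few large ones.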
7. Problems
• Load Balancers
• Glance and Swift
• Fraud / Non Payment
• Routes
• Road Testing
8. Load Balancers
• Alternate Routes needed for high BW operations
– Generally Glance
• Load Balancer can become bottleneck
• Database queries returning lots of rows (cell sizing)
9. Swift and Glance Bandwidth
Problems:
• The shared image path creates a single bottleneck
• Imaging speeds are monitored; exceeding thresholds triggers investigation / scale-out
• Cache is not shared between glance-api nodes
10. Swift and Glance Bandwidth
Monitoring / Solutions:
• Need to get downloads out of the path of the control plane (compute direct to image store)
• Cache base images
– Pre-seed when possible
– Can cache images to HV ahead of time for fast-cloning: https://wiki.openstack.org/wiki/FastCloningForXenServer
• Glance and Swift having shared request IDs would be nice
• Shared cache might elevate hit-rate, save bandwidth
What about when scaling out doesn’t work? Rearchitecture.
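The "cache not shared between glance-api nodes" point can be made concrete with a toy model: independent per-node caches each warm their own copy, so the effective hit-rate is lower than a shared cache would achieve. Request counts, image size, and hit-rates below are assumptions for illustration.

```python
# Toy model of Swift egress saved by a shared glance-api cache.
# All inputs are illustrative assumptions.
def swift_egress_gb(requests: int, image_gb: float, hit_rate: float) -> float:
    misses = requests * (1 - hit_rate)
    return misses * image_gb

per_node = swift_egress_gb(10_000, 5, hit_rate=0.60)  # independent caches
shared = swift_egress_gb(10_000, 5, hit_rate=0.90)    # hypothetical shared cache
print(round(per_node), round(shared))  # 20000 vs 5000 GB pulled from Swift
```

Even a modest hit-rate improvement translates directly into bandwidth the load balancers and Swift proxies never have to carry.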
11. Fraud and Non-Payment
Fraud
• Mark instance as suspended
• Still takes capacity
• What do?
• Account Actioneer
Non-Payment
• Similar to fraud but worse for capacity!
• Try to give the customer as much time as possible to return to the fold
• Same overall strategy as fraud, but instances are kept longer
12. Road Testing nodes before enabling
• New Cell
– Bypass URLs (cell-specific API nodes)
• Different nova.conf not using cells
– compute_api_class=nova.compute.api.API # before
• Cell tenant restrictions
• Existing Cell/Rekick - not as easy :(
– How to ensure customer builds don’t land on a box that isn’t road tested?
13. Managing the Capacity Management
• Supply Chain/Resource Pipeline
• Impact from Product Development
• Gaps/Challenges from upstream
14. Capacity Pipeline
• Large Customer Requests
• Triggers
– % Used
– # Largest Slots per flavor
• IPv4 Addresses
– Cells and scheduler unaware :(
– Auditor + Resolver
• Control Plane (runs on OpenStack too)
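The two triggers above (% used, largest slots per flavor) can be sketched as a simple check run against each cell. The thresholds and flavor size here are illustrative assumptions, not the values Rackspace uses.

```python
# Hedged sketch of the capacity-pipeline triggers: flag a cell when RAM
# utilization crosses a threshold OR remaining slots for the largest
# flavor drop too low. All thresholds are illustrative assumptions.
def needs_capacity(used_ram_gb, total_ram_gb, largest_flavor_gb,
                   pct_threshold=80, min_slots=20):
    pct_used = 100 * used_ram_gb / total_ram_gb
    free_slots = (total_ram_gb - used_ram_gb) // largest_flavor_gb
    return pct_used >= pct_threshold or free_slots < min_slots

print(needs_capacity(used_ram_gb=8500, total_ram_gb=10000,
                     largest_flavor_gb=120))  # True: 85% used
```

The slot trigger matters because a cell can look healthy on aggregate utilization while fragmentation has already made the largest flavor unbuildable.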
15. Product Implications
• Keep up with code deploys (hotpatches)
• Adjusting provisioning playbooks to:
– new flavor types
– new configurations/applications (quantum->neutron, nova-conductor)
– control plane changes (10g glance)
– new hardware manufacturers (OCP)
• Non production environments
16. Upstream Challenges
• Disabled flag for cells
– Blueprint: http://bit.do/CellDisableBP
– Bug: http://bit.do/CellDisableBug
• Build to “disabled” host
– Testing after a re-provision
– Testing for adding new capacity to existing cell
• Scheduling based on IP capacity
– New scheduler service?
– Currently handled by outside service “Resolver”, similar to Entropy
• General “Cells as first class citizen” effort led by alaski