As a service provider, Rackspace is constantly bringing new OpenStack capacity online. In this session, we detail the myriad challenges around adding new compute capacity, including planning, automation, organizational change, quality assurance, monitoring, security, networking, integration, and more.
1. Capacity Management and Provisioning (Cloud's full, can't build here)
Matt Van Winkle, Manager Cloud Engineering @mvanwink
Andy Hill, Systems Engineer @andyhky
Joel Preas, Systems Engineer @joelintheory
2. Public Cloud Capacity at Rackspace
• Rackspace Public Cloud has deployed 100+ cells in ~2 years
• New cells used to require hands-on engineer assembly, taking 3-5 weeks after a bare OS install
• One year later, builds are done by on-shift operators in ~1 week (as low as 1 day)
• Usually constrained by networking
3. Control Plane Sizing
• Data plane operations impact both the cell and top-level control plane
– Image downloads/uploads
• How large should the Nova DB be?
– Breaking point of ‘standard’ cell control plane buildout, particularly the database
4. Cell Sizing Considerations
• Efficient use of private IP address space
– Used for connections to services like Swift and the dedicated environment
• Broadcast domains
• Aim for a minimal control plane to limit overhead/complexity
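As a rough sketch of the address-space arithmetic behind cell sizing, the snippet below estimates how many guest-usable addresses one cell's private range can offer. The CIDR and the reserved-address count are illustrative assumptions, not Rackspace figures.

```python
import ipaddress

# Hedged sketch: guest-usable addresses in a cell's private range.
# The `reserved` count (gateways, infra addresses) is an assumption.
def cell_ip_budget(cidr: str, reserved: int = 8) -> int:
    net = ipaddress.ip_network(cidr)
    return net.num_addresses - reserved

print(cell_ip_budget("10.4.0.0/20"))  # 4096 - 8 = 4088 usable addresses
```

A /20 per cell keeps the broadcast domain bounded while leaving room for several thousand guests; the right split depends on per-host guest density.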
5. Hypervisor Sizing Considerations
• Enough spare drive space for COW images
– A XenServer VHD can easily be 2x the space given to the guest during normal operation!
– Errors in cleaning up “snapshots” are exacerbated by tight disk overhead constraints
• Drive space for pre-cached images
– cache_images=some # nova
– use_cow_images=True # nova
– cache_in_nova=True # glance
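The headroom math the bullets above imply can be sketched as follows. The 2x copy-on-write factor comes from the slide; the guest count, image cache size, and emergency reserve are illustrative assumptions.

```python
# Hedged sketch of hypervisor local-disk sizing: COW VHDs can balloon
# to ~2x the flavor's disk allotment, plus space for pre-cached base
# images and an emergency reserve. All figures are illustrative.
def required_local_disk_gb(guests, flavor_disk_gb, cow_factor=2.0,
                           cached_images=10, image_size_gb=5,
                           emergency_reserve_gb=100):
    guest_space = guests * flavor_disk_gb * cow_factor
    cache_space = cached_images * image_size_gb
    return guest_space + cache_space + emergency_reserve_gb

print(required_local_disk_gb(guests=40, flavor_disk_gb=20))
# 40*20*2 + 10*5 + 100 = 1750 GB
```

Sizing to the flavor's nominal disk alone, without the COW factor, is exactly how the snapshot-cleanup errors above become outages.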
6. Other Sizing Notes
• Need reserve space for emergencies (host evac)
• Reserve space is cell-bound, since instances cannot move between cells
– https://review.openstack.org/#/c/125607/
– cells.host_reserve_percent
• VM overhead
– https://wiki.openstack.org/wiki/XenServer/Overhead
– https://review.openstack.org/#/c/60087/
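The cell-bound reserve works out to simple per-cell arithmetic, sketched below. The percentages and host counts are illustrative, not production values.

```python
import math

# Hedged sketch: because instances cannot move between cells, each cell
# must carry its own evacuation headroom (cf. cells.host_reserve_percent).
def reserved_hosts(total_hosts: int, host_reserve_percent: float) -> int:
    """Hosts to hold back in a cell for host-evacuation emergencies."""
    return math.ceil(total_hosts * host_reserve_percent / 100)

print(reserved_hosts(200, 5))  # 10 hosts kept free for host evac
```

Note the ceiling: a small cell with a low percentage still holds back at least one whole host, which is one reason many tiny cells cost more reserve capacity than a few large ones.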
7. Problems
• Load Balancers
• Glance and Swift
• Fraud / Non Payment
• Routes
• Road Testing
8. Load Balancers
• Alternate Routes needed for high BW operations
– Generally Glance
• Load Balancer can become bottleneck
• Database queries returning lots of rows (cell sizing)
9. Swift and Glance Bandwidth
Problems:
• The shared image path creates a single bottleneck
• Imaging speeds are monitored; exceeding thresholds triggers investigation / scale-out
• Cache is not shared between glance-api nodes
10. Swift and Glance Bandwidth
Monitoring / Solutions:
• Need to get downloads out of the path of the control plane (compute direct to image store)
• Cache base images
– Pre-seed when possible
– Can cache images to HV ahead of time for fast-cloning: https://wiki.openstack.org/wiki/FastCloningForXenServer
• Glance and Swift having shared request IDs would be nice
• Shared cache might elevate hit-rate, save bandwidth
What about when scaling out doesn’t work? Rearchitecture.
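The "cache not shared between glance-api nodes" point can be made concrete with a toy model: independent per-node caches each warm their own copy, so the effective hit-rate is lower than a shared cache would achieve. Request counts, image size, and hit-rates below are assumptions for illustration.

```python
# Toy model of Swift egress saved by a shared glance-api cache.
# All inputs are illustrative assumptions.
def swift_egress_gb(requests: int, image_gb: float, hit_rate: float) -> float:
    misses = requests * (1 - hit_rate)
    return misses * image_gb

per_node = swift_egress_gb(10_000, 5, hit_rate=0.60)  # independent caches
shared = swift_egress_gb(10_000, 5, hit_rate=0.90)    # hypothetical shared cache
print(round(per_node), round(shared))  # 20000 vs 5000 GB pulled from Swift
```

Even a modest hit-rate improvement translates directly into bandwidth the load balancers and Swift proxies never have to carry.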
11. Fraud and Non-Payment
Fraud
• Mark instance as suspended
• Still takes capacity
• What do?
• Account Actioneer
Non-Payment
• Similar to fraud but worse for capacity!
• Try to give the customer as much time as possible to return to the fold
• Same overall strategy as fraud, but instances are kept longer
12. Road Testing nodes before enabling
• New Cell
– Bypass URLs (cell-specific API nodes)
• Different nova.conf not using cells
– compute_api_class=nova.compute.api.API # before
• Cell tenant restrictions
• Existing Cell/Rekick - not as easy :(
– How to ensure customer builds don’t land on a box that isn’t road tested?
13. Managing the Capacity Management
• Supply Chain/Resource Pipeline
• Impact from Product Development
• Gaps/Challenges from upstream
14. Capacity Pipeline
• Large Customer Requests
• Triggers
– % Used
– # Largest Slots per flavor
• IPv4 Addresses
– Cells and scheduler unaware :(
– Auditor + Resolver
• Control Plane (runs on OpenStack too)
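The two triggers above (% used, largest slots per flavor) can be sketched as a simple check run against each cell. The thresholds and flavor size here are illustrative assumptions, not the values Rackspace uses.

```python
# Hedged sketch of the capacity-pipeline triggers: flag a cell when RAM
# utilization crosses a threshold OR remaining slots for the largest
# flavor drop too low. All thresholds are illustrative assumptions.
def needs_capacity(used_ram_gb, total_ram_gb, largest_flavor_gb,
                   pct_threshold=80, min_slots=20):
    pct_used = 100 * used_ram_gb / total_ram_gb
    free_slots = (total_ram_gb - used_ram_gb) // largest_flavor_gb
    return pct_used >= pct_threshold or free_slots < min_slots

print(needs_capacity(used_ram_gb=8500, total_ram_gb=10000,
                     largest_flavor_gb=120))  # True: 85% used
```

The slot trigger matters because a cell can look healthy on aggregate utilization while fragmentation has already made the largest flavor unbuildable.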
15. Product Implications
• Keep up with code deploys (hotpatches)
• Adjusting provisioning playbooks to:
– new flavor types
– new configurations/applications (quantum->neutron, nova-conductor)
– control plane changes (10g glance)
– new hardware manufacturers (OCP)
• Non production environments
16. Upstream Challenges
• Disabled flag for cells
– Blueprint: http://bit.do/CellDisableBP
– Bug: http://bit.do/CellDisableBug
• Build to “disabled” host
– Testing after a re-provision
– Testing for adding new capacity to existing cell
• Scheduling based on IP capacity
– New scheduler service?
– Currently handled by outside service “Resolver”, similar to Entropy
• General “Cells as first class citizen” effort led by alaski