OpenStack DevOps Challenges outlines the journey of CloudRX, a fictitious company, to set up a production-grade OpenStack cloud using DevOps practices. It discusses the challenges of implementing continuous integration/delivery pipelines for OpenStack and its heterogeneous components, managing configuration, automating the testing of environments, packaging applications, and managing bare-metal servers.
1. OpenStack DevOps Challenges
A journey from dumb bare-metal servers to a production-grade OpenStack cloud system
Harish Kumar (hkumar@d4devops.org)
Ritesh Raj Sarraf (rrs@researchut.com)
2. An Adventurous Journey Begins..
● CloudRX: a fictitious company that wants to set up a production OpenStack cloud
● Implemented using a DevOps culture
● A production-grade cloud has many heterogeneous components
– OpenStack components
– Non-OpenStack components: storage systems like Ceph and GlusterFS; SDN like ONOS, OpenContrail, OpenDaylight
– Other support systems: DNS, DHCP, monitoring, log aggregation, etc.
– Bare-metal systems: hardware configuration, OS provisioning, network device setup
3. Components in Cloud system
● Multi-node Openstack controllers
– All APIs, schedulers, message queues
● Multi-node Ceph cluster
● A number of compute nodes
● Database servers
● SDN Controllers
● Load balancers
● Other supporting systems like DNS, monitoring, etc
4. CICD Pipeline
● Commit changes to a branch
● Unit tests, then gate tests
● Packages are created and pushed to the unstable repo
● Create a repo snapshot (v100) and select it for further testing
● v100: acceptance, integration, and upgrade testing
● Promote v100 based on test results and push it to the staging/prod repo
● Staging, then production
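The promotion step of the pipeline can be sketched roughly as follows. This is a minimal illustration, not the tooling CloudRX used: the `Snapshot` class, the `promote` function, and the rule that all three suites must pass are assumptions. Only the repo names and the v100 snapshot come from the slide.

```python
from dataclasses import dataclass, field

# Repo names follow the slide; the ordering and data model are hypothetical.
PROMOTION_ORDER = ["unstable", "staging", "production"]
REQUIRED_SUITES = ("acceptance", "integration", "upgrade")


@dataclass
class Snapshot:
    version: str                                   # e.g. "v100"
    repo: str = "unstable"                         # repo the snapshot currently lives in
    results: dict = field(default_factory=dict)    # suite name -> True/False


def promote(snapshot: Snapshot) -> Snapshot:
    """Move the snapshot one repo forward only if every required suite passed."""
    if not all(snapshot.results.get(suite) is True for suite in REQUIRED_SUITES):
        return snapshot  # stays where it is; nothing is promoted
    idx = PROMOTION_ORDER.index(snapshot.repo)
    if idx + 1 < len(PROMOTION_ORDER):
        snapshot.repo = PROMOTION_ORDER[idx + 1]
    return snapshot


if __name__ == "__main__":
    v100 = Snapshot("v100", results={s: True for s in REQUIRED_SUITES})
    print(promote(v100).repo)
```

Because every environment pulls from a named snapshot rather than from the moving unstable repo, the packages tested are the packages that reach production.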
5. CICD – general guidelines
● Gate all applications before they become part of the pipeline
● Use the same tools in all phases of the pipeline to avoid changes in behavior
● Reduce assumptions and hard-coded configuration to keep the pipeline adaptable
● Handle scalable, distributed systems
● Handle heterogeneous applications that have different release cycles and dependencies
6. Initial Challenges
● Implement a build and test pipeline, plus various other jobs to support it
– Jenkins was the answer without a second thought
● Manage configuration and automation
– Options
● Puppet
● Chef
● Ansible
– We chose Puppet
● Puppet had the most complete plugins for our technology stack
7. Challenges on initial pipeline phases
● Need parallel test environments so we can run gate and acceptance tests (AT) in parallel
● They should be easy to provision and remove
● Virtual environments are an answer
– Provision a miniature cloud on top of a cloud
– Built a tool to provision a test cloud on top of an OpenStack cloud based on a provided spec
– Easy to provision, easy to delete; uses APIs to build a virtual OpenStack test cloud on top of OpenStack
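The slides do not show the spec format the tool consumed. As a hedged illustration of the idea, the sketch below expands a hypothetical role-to-count spec into the list of virtual machines a test cloud would need. The function name, spec shape, and naming scheme are all assumptions, and the calls to the OpenStack compute API that would actually create the servers are left out.

```python
def expand_spec(env_name, spec):
    """Expand a {role: count} spec into one node description per VM.

    Hypothetical format: each node gets a name unique to the environment,
    so several test clouds can live side by side on the same host cloud.
    """
    nodes = []
    for role, count in sorted(spec.items()):
        for index in range(1, count + 1):
            nodes.append({"name": f"{env_name}-{role}-{index}", "role": role})
    return nodes


if __name__ == "__main__":
    for node in expand_spec("gate42", {"controller": 2, "compute": 1}):
        print(node)
```

Prefixing every node with the environment name is what makes teardown cheap: deleting a test cloud is deleting everything that carries its prefix.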
8. Automated environment setup Challenges
● Bootstrapping a distributed system like an OpenStack cloud is complicated
– Bootstrap the whole OpenStack cloud
– Bootstrap clusters such as RabbitMQ, MySQL, and Ceph
– Handle inter-service dependencies in a multi-node environment
● How to validate that the system is ready for testing?
9. Automated environment setup (continued)
● Introduction of a service discovery tool
– Options: etcd, Consul, ZooKeeper
– We chose Consul
– What Consul is and why we chose it
● We built an orchestration system around Consul
– All nodes are provisioned with userdata that installs Puppet, Consul, etc.
– Nodes configure themselves with Puppet according to their role
– Each service registers itself with Consul as it comes up
– Dependents wait until their dependencies are available before configuring
– Leader election with Consul session locking to bootstrap clusters
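The last two points (waiting on dependencies and electing a bootstrap leader) can be sketched against the Consul HTTP API. This is a minimal sketch under stated assumptions, not the orchestration code from the talk: it assumes a local agent on the default port, and the KV key layout `bootstrap/<cluster>/leader`, the helper names, and the retry defaults are all hypothetical. The endpoints themselves (`/v1/health/service/<name>?passing`, `/v1/session/create`, `/v1/kv/<key>?acquire=<session>`) are standard Consul endpoints.

```python
import json
import time
import urllib.request

CONSUL = "http://127.0.0.1:8500"  # assumption: local Consul agent, default port


def consul_request(method, path, body=None):
    """Tiny helper around the Consul HTTP API; returns the parsed JSON reply."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(CONSUL + path, data=data, method=method)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode())


def wait_for_service(name, request=consul_request, interval=5, retries=60,
                     sleep=time.sleep):
    """Poll until at least one instance of `name` passes its health checks."""
    for _ in range(retries):
        # With ?passing, Consul returns only instances whose checks all pass.
        if request("GET", f"/v1/health/service/{name}?passing"):
            return True
        sleep(interval)
    return False


def try_become_leader(cluster, node, request=consul_request):
    """Create a session and try to take the cluster's bootstrap lock.

    The KV acquire call answers true only for the session that gets the
    lock, so exactly one node ends up bootstrapping the cluster.
    """
    session = request("PUT", "/v1/session/create",
                      {"Name": f"{cluster}-bootstrap"})["ID"]
    # Hypothetical key layout; the node name is stored as a JSON string.
    key = f"/v1/kv/bootstrap/{cluster}/leader?acquire={session}"
    return request("PUT", key, node) is True
```

A node in a RabbitMQ role might call `wait_for_service("consul")` style checks for its own dependencies first, then `try_become_leader("rabbitmq", hostname)`: the winner initializes the cluster and the others join it once it registers as healthy.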
10. Automated environment setup (continued)
● All services have a health check registered in Consul, so only healthy services are exposed to the network
● Each deployed facility installs a validation script
● Each node continuously runs validations and writes its own state to the Consul KV store
● An external system can query Consul centrally to get the system state
● Consul KV also records various other things, such as orchestration and operational tooling data
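The central query could look roughly like the sketch below. It assumes each node writes the string "ok" (or anything else on failure) under a shared KV prefix, with the node name as the last path segment; that layout and the function name are hypothetical. The input shape matches what Consul returns for `GET /v1/kv/<prefix>?recurse`: a list of entries whose `Value` field is base64-encoded.

```python
import base64


def cloud_ready(entries):
    """Summarize per-node validation state read from the Consul KV store.

    `entries` is the decoded JSON list from GET /v1/kv/<prefix>?recurse.
    Returns (ready, failing_nodes). An empty list counts as not ready,
    since no node has reported yet.
    """
    states = {}
    for entry in entries:
        node = entry["Key"].rsplit("/", 1)[-1]
        # Consul returns null for empty values, hence the `or ""`.
        states[node] = base64.b64decode(entry["Value"] or "").decode()
    failing = sorted(node for node, state in states.items() if state != "ok")
    return (len(states) > 0 and not failing), failing
```

A pipeline job can poll this until it reports ready, which answers the earlier question of how to know the environment is ready for testing.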
16. Staging and production
● Bare-metal management is very complicated
– Have to work with heterogeneous physical systems
– Hardware configuration differs across vendors and models
– Operating system provisioning across different hardware configurations can be complicated
– Different systems may need different capabilities
● Are rolling upgrades possible?
● Handling upgrade failures
● Possible rollback in certain situations
17. Baremetal server management
● Undercloud controller with OpenStack Ironic
– All-in-one OpenStack system: Nova with Ironic, Neutron with a flat provider network, Glance, Keystone
– Easy to provision, delete, and rebuild bare-metal servers through the undercloud
– Enables the same tooling on dev/test virtual environments and staging/production physical environments
● Tools for various bare-metal management tasks
– Hardware configuration, such as RAID setup
– Automated server enrollment in Ironic
– Recording server locations in Ironic, which can be used in various places such as the Ceph CRUSH map
● Some ideas about rolling upgrades, easier rollback support, etc.
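As an illustration of how a recorded server location might feed the Ceph CRUSH map, the sketch below turns a location dictionary into a CRUSH location string of `key=value` pairs. How the talk's tooling stored locations in Ironic is not shown; the dictionary shape, the chosen subset of CRUSH bucket types, and the `root=default` fallback are assumptions.

```python
# Subset of CRUSH bucket types, ordered from the top of the hierarchy down.
CRUSH_LEVELS = ("root", "datacenter", "room", "row", "rack", "host")


def crush_location(hostname, location):
    """Build a CRUSH location string for a storage node.

    `location` is a hypothetical dict recorded per server at enrollment,
    e.g. {"row": "b", "rack": "r12"}. Missing levels are simply skipped.
    """
    merged = {"root": "default", **location, "host": hostname}
    return " ".join(f"{level}={merged[level]}"
                    for level in CRUSH_LEVELS if level in merged)


if __name__ == "__main__":
    print(crush_location("stor-01", {"rack": "r12", "row": "b"}))
```

Deriving the string from enrollment data keeps the physical layout in one place, so failure domains in Ceph follow the racks the servers actually sit in.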