2. Germán Gutiérrez
Linux Systems @Booking.com
15+ years of experience in high-scale deployments, DevOps, Linux/Unix administration, DNS,
web servers, MySQL, OpenLDAP, shell/Perl/Python scripting, and more
At Booking.com, leading the implementation and scaling of OpenNebula in the
development environment, impacting 1700+ tech and product users
3. Team Carmen:
● Nate Nuss - Team Leader
● Maria Scerbikova - Product Owner
● Giordano Fechio - System Administrator
● Lily Chen - Developer
● Omar Othman - Developer
6. Our size:
● Over 1000 hosts
● Over 13k VMs running
● Over 2000 Users
● And counting...
7. In short.
2016 MAY Joined the company and a team of 2.
2016 OCT Hackathon with OpenNebula.
2017 The year of OpenNebula: test, partial migration, team growth.
2018 The final move and current state.
8. Our use case.
● Development environment
● Many templates/roles
● User oriented
● Template/role ownership
● Our team provides the infrastructure
9. How it was.
● In house solution (perl, python, bash)
● Master & hosts
● Host with local storage
● VMs in the same network (bridged)
● The SoT (puppet class, DNS, IP assignment) was an internal app for physical servers
● The scheduler was a cronjob
10. How it is.
● Network: OVS + VLANs
● Storage:
○ NFS/NetApp
○ Image / Template
● Tooling: CLI in Python
● We are the SoT:
○ Web Service in python
○ IP assignment by ONe
○ Updates DNS via hook
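OpenNebula hooks can be configured to receive the VM template as base64-encoded XML (via `$TEMPLATE`). A minimal sketch of a DNS-updating hook under that assumption; the zone name, TTL, and the use of `nsupdate` are illustrative placeholders, not our actual setup:

```python
import base64
import subprocess
import sys
import xml.etree.ElementTree as ET

def parse_vm(template_b64):
    """Extract the VM name and first NIC IP from a base64-encoded VM XML template."""
    root = ET.fromstring(base64.b64decode(template_b64))
    return root.findtext("NAME"), root.findtext(".//NIC/IP")

def dns_commands(name, ip, zone="example.com", ttl=300):
    """Build an nsupdate script adding an A record for the VM (zone/TTL are assumptions)."""
    return f"update add {name}.{zone} {ttl} A {ip}\nsend\n"

if __name__ == "__main__" and len(sys.argv) > 1:
    vm_name, vm_ip = parse_vm(sys.argv[1])
    # Feed the generated script to nsupdate to apply the DNS change.
    subprocess.run(["nsupdate"], input=dns_commands(vm_name, vm_ip), text=True, check=True)
```

Registering such a script as a `VM_HOOK` on instantiation is what lets ONe own IP assignment while DNS follows automatically.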
12. What didn’t work as expected.
● Networking:
○ One big network to rule them all
○ An incident
○ Lesson learned, and plans for the future
13. What didn’t work as expected.
● SoT - Source of truth:
○ The ONe API is slow*, so it needs a cache!
■ Tuning oned helped
■ What can possibly go wrong with cache?
○ Lesson learned
■ Using the FQDN as the ID is a bad idea and bad design.
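The two lessons above can be sketched together: a minimal TTL cache keyed by the immutable numeric VM ID rather than the FQDN (an FQDN can be reused by a new VM and silently serve stale data for the wrong machine). The `loader` callback stands in for a slow ONe API call; this is an illustration, not our actual "bone" code:

```python
import time

class TTLCache:
    """A minimal TTL cache: stale entries are the classic 'what can go wrong'."""

    def __init__(self, ttl_seconds=60, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock      # injectable for testing
        self._store = {}        # key -> (expires_at, value)

    def get(self, key, loader):
        now = self.clock()
        hit = self._store.get(key)
        if hit and hit[0] > now:
            return hit[1]       # fresh cached entry
        value = loader(key)     # the slow call, e.g. the ONe XML-RPC API
        self._store[key] = (now + self.ttl, value)
        return value

# Key the cache by the numeric VM ID, never the FQDN: IDs are immutable,
# FQDNs get recycled.
```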
14. What didn’t work as expected.
● Tooling: Python with python-oca
○ Supports ONe only up to 4.10
○ python-oca's last commit was in 2017
○ Cumbersome to maintain
○ Lesson learned
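One way around an unmaintained binding is to talk to ONe's XML-RPC API directly with the standard library. A hedged sketch, assuming the usual `one.vm.info(session, vm_id)` call shape; the endpoint and session string are placeholders:

```python
import xmlrpc.client

def one_vm_info(endpoint, session, vm_id):
    """Fetch a VM's XML via OpenNebula's XML-RPC API (one.vm.info).

    ONe XML-RPC calls return a triple: (success flag, body or error
    message, error code).
    """
    proxy = xmlrpc.client.ServerProxy(endpoint)
    ok, body, _errcode = proxy.one.vm.info(session, vm_id)
    if not ok:
        raise RuntimeError(body)
    return body

# Example (placeholder credentials):
# xml = one_vm_info("http://frontend:2633/RPC2", "user:password", 42)
```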
15. What didn’t work as expected.
● Storage Shared/NetApp
○ Huge impact.
○ We had one Volume
○ We feared space issues
○ We had high CPU usage due to I/O
○ First actions: “easy” because of our use case.
○ “Solving” the issue.
○ Lesson learned.
16. Where are we?
● Still caching, enter “bone” as a web service.
● As a team, we are the MITM (man in the middle):
○ Working on a self service page for troubleshooting.
○ The same for role/template owners.
○ We need to split the bone and the self-service code
● The NFS issue is “gone” (expensive)
● WIP: Rewriting the CLI in ruby: brone.
17. Where are we? (cont.)
● We still don’t know how to deal with retries.
● Networking: Raised an issue to support SDN, making the
network drivers pluggable like other parts of the
system.
● We waste resources: users don't destroy their
VMs after use.
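The open retry question can be sketched as jittered exponential backoff, the usual starting point for a slow or flaky API: retry only known-transient errors, back off exponentially, and add jitter to avoid thundering herds. The error types and limits here are illustrative assumptions, not a policy we have settled on:

```python
import random
import time

def retry(fn, attempts=5, base_delay=0.5, max_delay=8.0,
          retriable=(TimeoutError, ConnectionError), sleep=time.sleep):
    """Call fn(), retrying retriable errors with jittered exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except retriable:
            if attempt == attempts - 1:
                raise                       # out of attempts: re-raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * random.uniform(0.5, 1.0))   # jitter

# Usage (hypothetical): retry(lambda: client.vm_info(42))
```

The hard part, and the reason we "still don't know", is deciding which operations are safe to retry: a read like `vm_info` is idempotent, while a create is not.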