At the end of 2016, Dailymotion revamped the whole company. In these slides, we explain how we used the DevOps mindset as an enabler to scale up our engineering team and our architecture.
How we scale up our architecture and organization at Dailymotion
1. From a French monolith to a worldwide platform: a human story
2. Stan Chollet
Chapter lead Core API
Tribe Scale @ Dailymotion
https://stan.life
President of the Orléans Tech association
Kubernetes & GraphQL trainer
3. 3 billion video views per month
300 million unique visitors per month
150 million videos in our catalogue
Dailymotion, one of the leading video destination platforms in the world
4. OUR MISSION
Transforming our video platform into a global destination for must-see videos.
Building the best “go-to” experience where users can get their daily dose of must-see videos, and partners can leverage the latest tools to grow and monetise their audience.
5. FROM MONOLITH TO SOA
Our road to a micro-service architecture (SOA)
FROM
• monolith LAMP stack
• hosted on bare-metal
• mono-datacenter (Paris)
• REST API
• fullstack website
TO
• geo-distributed
• apps run in containers (Docker)
• orchestrated on top of Kubernetes
• multiple languages (mainly Python / Golang)
• GraphQL API
• fully API-centric
6. GRAPHQL - AN ENABLER FOR OUR FRONTEND AND OUR BACKEND
FROM.
Monolith PHP: Website (HTML), REST API
TO.
GraphQL: svc 1 (python), svc 2 (golang), svc 3 (java)
7. TRIBES? SQUADS?
[Diagram: two tribes, each containing squads and chapters]
8. SOA AS AN ORGANIZATIONAL ENABLER.
SOA (geo-distributed) architecture
GraphQL
Data service
User service
Partner service
Monolith (mono-datacenter)
Website
HTML
REST API
Ownership: product enabler, product tribes, mixed
9. FIRST STEP
• Built & managed by one team (2 people)
• Deployed in 3 regions on AWS
• Orchestrated on top of kubernetes
• Apps deployed with custom bash scripts
• Good application monitoring
• Poor infrastructure monitoring
FROM SEPTEMBER 2016 TO JANUARY 2017.
GraphQL
REST
Legacy PHP
Search
python
Kubernetes on AWS
FOUNDATIONS
10. SECOND STEP
TIME TO SCALE
FROM JANUARY 2017 TO JUNE 2017.
People
• from 2 to ~30 people.
• from 1 to 5 teams
Services
• from 1 to ~15 services.
• from 1 to ~10 languages / technologies
Release
• from an average of 1 deployment per day to more than 10
11. HUMAN FIRST
• Hired more than 30 people over a couple of months
• Organised training sessions for newcomers
• Optimised and reviewed our on-boarding process
• Optimised the way to work on an SOA stack
• Evangelised (GraphQL + Infrastructure)
FROM 2 TO ~30 PEOPLE.
12. • Only one dependency on the developer's laptop: Docker
• Simplify the technical on-boarding process
• Simplify switching between projects across our 500+ repositories
• Use generic task names to launch code quality checks
• Let developers use the technologies they want
make style
make test
make test-unit
make test-functional
make test-integration
make complexity
make run
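One possible way to read the targets above is that each `make` task runs inside a container, so Docker stays the only host dependency. The sketch below only builds the corresponding `docker run` command line; the image name, the `/app` mount point and the helper name `docker_make_command` are assumptions made for illustration, not details from the talk.

```python
# Generic task names taken from the slide; every repository exposes the same ones.
GENERIC_TASKS = (
    "style", "test", "test-unit", "test-functional",
    "test-integration", "complexity", "run",
)


def docker_make_command(task, image, project_dir):
    """Build the `docker run` argv that executes `make <task>` in a container.

    `image` and the `/app` mount point are hypothetical. Only Docker is
    needed on the host if make and the toolchain live in the image.
    """
    if task not in GENERIC_TASKS:
        raise ValueError(f"unknown task {task!r}, expected one of {GENERIC_TASKS}")
    return [
        "docker", "run", "--rm",
        "-v", f"{project_dir}:/app",  # mount the checkout into the container
        "-w", "/app",                 # run from the project root
        image,
        "make", task,
    ]


if __name__ == "__main__":
    # Hypothetical image and path, printed rather than executed.
    print(" ".join(docker_make_command("test-unit", "example/service-dev", "/home/dev/service")))
```

Because the task names are identical in every repository, the same call works whatever language the service is written in.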
13. FROM AWS TO GCP
• Worldwide network (subnets can be routed from one region to another)
• Ingress anycast IP, easy to set up
• A hosted Kubernetes managed service with cool features such as node autoscaling
• Connection to Dailymotion’s private network in Paris
• Currently deployed in 3 regions across the world (~80 nodes)
FROM 1 SERVICE TO 10 SERVICES.
14. NEW HIGHLY SCALABLE HYBRID ARCHITECTURE
Geo-Distributed: for high performance everywhere in the world
Hybrid Infra: on-premises together with Google Cloud
Auto-scaling: adapts to the audience
Google Cloud POP
On-Premises POP
CDN
15. GIVE ROOT ACCESS TO DEVELOPERS 😎
• Implement continuous deployment (except production, which needs human approval)
• Let developers deploy by themselves
• Delegate the deployment workflow to developers through a Jenkinsfile (Pipeline)
• Enforce common interfaces, minimum code quality, and deployment guidelines built by the DevOps team
FROM 1 DEPLOYMENT PER DAY TO MORE THAN 10.
16. WE ARE LEARNING FROM OUR MISTAKES
STEP #1:
First, we deployed our applications sequentially, region by region, using bash scripts.
STEP #2:
We wanted to manage our clusters from a single API endpoint: Federation.
Some API objects were missing in the Federation, which led to mixed deployment methods: some objects in the Federation and others deployed region by region.
STEP #3 (déjà-vu):
Now we deploy our applications sequentially, region by region, using Helm.
FROM 1 DEPLOYMENT PER DAY TO MORE THAN 10.
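The sequential, region-by-region flow of step #3 can be sketched as a loop over kubeconfig contexts that stops at the first failure. This is a minimal illustration, not the team's actual tooling: the function names, the context names and the chart name in the usage are hypothetical, and it assumes one kubeconfig context per region. `--kube-context` and `--set` are standard Helm flags.

```python
import subprocess


def helm_upgrade_command(release, chart, kube_context, image_tag):
    """Return the helm argv for one region (one kubeconfig context)."""
    return [
        "helm", "upgrade", "--install", release, chart,
        "--kube-context", kube_context,
        "--set", f"imageTag={image_tag}",
    ]


def deploy_region_by_region(release, chart, kube_contexts, image_tag,
                            run=subprocess.call):
    """Deploy to each region in order and stop at the first failure.

    `run` takes an argv list and returns an exit code; it is injectable so
    the flow can be exercised without a cluster.
    Returns the list of contexts that were deployed successfully.
    """
    deployed = []
    for context in kube_contexts:
        exit_code = run(helm_upgrade_command(release, chart, context, image_tag))
        if exit_code != 0:
            break  # leave the remaining regions on their previous revision
        deployed.append(context)
    return deployed
```

Stopping at the first failing region limits the impact of a bad release to the regions already updated, which is the main reason to deploy sequentially rather than everywhere at once.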
17. CHARTS EVERYWHERE !
• Manage dependencies between our applications.
• Deploy a complete stack with a single command.
• Help us to manage different environments/regions within a chart.
• Easy to rollback: each deployment has a unique revision id
• Ongoing: provision a staging environment per pull request
FROM 1 DEPLOYMENT PER DAY TO MORE THAN 10.
18. FROM SLA 99,999% TO 99,9999999999999999999999999999999999%
• APM with Open Tracing Specification
• Monitoring / Alerting
• Logging Specification for each service
• Feature Flipping, Progressive rollout, Experimentation (A/B)
HOW WE OPERATE OUR PLATFORM?
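Progressive rollout and feature flipping are often implemented with deterministic bucketing: each user is hashed into a bucket from 0 to 99 and a feature is on when the bucket falls below the rollout percentage. The sketch below shows that general technique under assumed names (`bucket`, `is_enabled`); the slides do not describe how Dailymotion implements it.

```python
import hashlib


def bucket(flag, user_id):
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag, user_id, rollout_percent):
    """Return True when the user falls inside the rollout percentage.

    The same user always gets the same answer for a given flag, and raising
    the percentage only ever adds users; it never removes them.
    """
    return bucket(flag, user_id) < rollout_percent


if __name__ == "__main__":
    # Hypothetical flag name and user ids.
    users = [f"user-{i}" for i in range(1000)]
    enabled = sum(is_enabled("new-player", u, 10) for u in users)
    print(f"{enabled} of {len(users)} users see the feature at 10%")
```

Including the flag name in the hash keeps rollouts independent, so the users who get one experiment early are not systematically the same ones who get every other experiment early.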
19. WE ARE NOT ROBOTS
FROM SOFTWARE / SYSTEM ENGINEER TO PRODUCTION ENGINEER.
FROM
BUILD. Software Engineer
• Write code
• Build applications which aren’t easy to operate
SHIP. Release Engineer
• Package & deploy applications
RUN. System Engineer
• Operate infrastructure & apps
• Unable to fix applications by themselves
TO
BUILD / SHIP / RUN. Production Engineer
• Can build applications
• Package & deploy applications
• Operate applications in production
• Build their applications with a “RUN” mindset
• Build tools for software engineers
20. helm upgrade --install westeros --reuse-values --set imageTag=30610c5 dailymotion/westeros-gbased-raulicache
BOOM !
WHAT: Bad parameter applied to a helm command
• 3 clusters emptied (~1,300 containers)
• All our products were unusable
AND: We were down for 19 minutes
• ~10 minutes to be notified
• ~7 minutes to understand
• ~2 minutes to recover the entire architecture from scratch
NOW: Grow up
• Wrap destructive commands
• Improve monitoring
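Wrapping destructive commands can be as simple as a guard that inspects the command line before it is executed and refuses anything risky unless the operator confirms explicitly. The sketch below is an assumption about what such a wrapper might look like: the deny-lists and the names `risk_reasons` and `guard` are hypothetical, and the slides do not say which parameter caused the incident.

```python
# Hypothetical deny-lists; adjust to whatever a team considers destructive.
RISKY_SUBCOMMANDS = frozenset({"delete", "uninstall"})
RISKY_FLAGS = frozenset({"--force"})


def risk_reasons(argv):
    """List why a helm argv (starting with "helm") looks destructive."""
    reasons = []
    if len(argv) > 1 and argv[1] in RISKY_SUBCOMMANDS:
        reasons.append(f"subcommand {argv[1]!r}")
    for arg in argv[2:]:
        flag = arg.split("=", 1)[0]  # handle both --flag and --flag=value
        if flag in RISKY_FLAGS:
            reasons.append(f"flag {flag!r}")
    return reasons


def guard(argv, confirmed=False):
    """Return argv unchanged if it is safe or explicitly confirmed.

    Raises PermissionError for a risky command that was not confirmed, so
    the caller never reaches the point of executing it.
    """
    reasons = risk_reasons(argv)
    if reasons and not confirmed:
        raise PermissionError(
            "refusing to run without confirmation: " + ", ".join(reasons)
        )
    return argv
```

A caller would pass the result of `guard(...)` to its process runner, and set `confirmed=True` only after an interactive prompt or a second reviewer has approved the command.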
21. INFINITE AND BEYOND
• Hybrid architecture (on premises)
• Stateful use cases: manage volume provisioning in the same way we orchestrate applications
• Performance improvements (Service mesh)
• Security: user authentication and auditing, secrets encryption.
• Open Source our GraphQL Engine (Python, performance oriented)
AND NOW ?