In this talk we explore Hailo's H2 platform under the hood, taking a peek into the orchestration layer and introducing various patterns for building a scalable and resilient microservices platform. We share insights into our architecture and how it evolved into a cloud-agnostic, self-managed system.
1. Patterns for building resilient and scalable microservices platform on AWS
Boyan Dimitrov,
Platform Automation Lead @ Hailo
@nathariel
3. Back in 2011 we started simple: a frontend, a backend and MySQL.
We quickly found out that supporting monoliths is hard:
• Hard to maintain the codebase
• Hard to build new features
• Hard to scale the dev teams
→ Failure to deliver business value
5. At present we have:
• Microservices ecosystem (99.9% written in Go)
• Designed specifically for the cloud – different building blocks and
components will constantly be in flux, broken or unavailable
• 1000+ AWS instances spanning multiple regions
• 200+ services in production
8. • Lowest-level building blocks
• We mostly use basic PaaS components and services, as they cover most of our needs
• We expect every underlying component to fail, and we have designed for this
11. Basic principles
• We use auto scaling groups for everything
– Guarantees each component can be rebuilt automatically
– Including our database clusters, which run on ephemeral storage (we keep 6 copies of each piece of data across 2 regions)
• Minimum of 3 AZs in every region
• Every workflow is automated
• Every component has to be self-healing and scalable
12. Orchestration
• Our "cloud provider abstraction" layer
• Main purpose is infrastructure and workflow automation and discovery
• Has a global view of everything happening across our infrastructure
• Provides additional capabilities on top of AWS
• Runs in dedicated VPCs across two regions
[Diagram: the orchestration services – Env, DNS, Release, AutoScaling, Compute, EIP and Whisper]
13. It all started with a small challenge we had to overcome: payment providers whitelist sources.
14. EIP Service
Elastic IP Provisioning Service: maintains elastic IP pools across all our accounts and matches them against auto scaling groups and environments.
[Diagram: two NAT auto scaling groups – "nat live" holding 51.x.x.1, 51.x.x.2 and 51.x.x.3, and "nat foo" holding 50.x.x.5]
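The matching the EIP Service performs can be sketched as a simple tag-based lookup. This is a minimal illustration, not Hailo's actual implementation; the `ElasticIP` type and `assignIPs` function are hypothetical names.

```go
package main

import "fmt"

// ElasticIP models an address in one of our per-account pools.
// The fields are illustrative, not Hailo's actual schema.
type ElasticIP struct {
	Addr string
	Role string // e.g. "nat"
	Env  string // e.g. "live", "foo"
}

// assignIPs picks elastic IPs for an auto scaling group by matching
// its role and environment tags, up to the number of instances needed.
func assignIPs(pool []ElasticIP, role, env string, needed int) []ElasticIP {
	var out []ElasticIP
	for _, ip := range pool {
		if len(out) == needed {
			break
		}
		if ip.Role == role && ip.Env == env {
			out = append(out, ip)
		}
	}
	return out
}

func main() {
	pool := []ElasticIP{
		{"51.x.x.1", "nat", "live"},
		{"51.x.x.2", "nat", "live"},
		{"51.x.x.3", "nat", "live"},
		{"50.x.x.5", "nat", "foo"},
	}
	for _, ip := range assignIPs(pool, "nat", "live", 2) {
		fmt.Println(ip.Addr)
	}
}
```

With a fixed pool per account, the service only has to reconcile the pool against the current set of auto scaling groups whenever either side changes.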
15. We do a lot of server discovery
• Both external and internal orchestration tools rely on AWS APIs for server discovery
• Puppet has AWS integration for clustering infra
• The result: "RequestLimitExceeded"
• Exponential back-off mitigates the issue but does not solve it if you have many clients
16. Compute service to the rescue
• A distributed cache of all compute instances and their metadata
• Powerful query API (very fast!)
• Main interface for creating new compute instances
• Reconciles any changes in any AWS account within seconds
[Diagram: services, internal tools and external tools all query the Compute Service, which fronts AWS and other providers]
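The core idea can be sketched as an in-memory cache with tag-based queries; reads hit local memory instead of the AWS APIs, so they never throttle. The types and method names below are illustrative, not the actual Compute Service API.

```go
package main

import (
	"fmt"
	"sync"
)

// Instance is a cached view of a compute instance and its metadata.
type Instance struct {
	ID   string
	Tags map[string]string
}

// Cache is a toy version of the compute service's distributed cache.
type Cache struct {
	mu        sync.RWMutex
	instances map[string]Instance
}

func NewCache() *Cache {
	return &Cache{instances: map[string]Instance{}}
}

// Reconcile upserts fresh state, e.g. from a periodic sweep of the
// AWS APIs or from an event feed.
func (c *Cache) Reconcile(in Instance) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.instances[in.ID] = in
}

// Query returns all instances whose tags match every given key/value.
func (c *Cache) Query(match map[string]string) []Instance {
	c.mu.RLock()
	defer c.mu.RUnlock()
	var out []Instance
	for _, in := range c.instances {
		ok := true
		for k, v := range match {
			if in.Tags[k] != v {
				ok = false
				break
			}
		}
		if ok {
			out = append(out, in)
		}
	}
	return out
}

func main() {
	c := NewCache()
	c.Reconcile(Instance{ID: "i-1", Tags: map[string]string{"env": "live", "role": "nat"}})
	c.Reconcile(Instance{ID: "i-2", Tags: map[string]string{"env": "foo"}})
	fmt.Println(len(c.Query(map[string]string{"env": "live"}))) // 1
}
```

Because the cache absorbs all read traffic, only the reconciliation loop counts against the AWS request limits.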
17. Everything in our platform emits events
So naturally we want to capture all external events as well!
18. Whisper Service
It's all about event-driven compute – think Lambda, but within our platform.
[Diagram: external sources emit events into NSQ topics, fanning out to hundreds of publishers & subscribers, which trigger actions]
To subscribe to any new event source we only have to change a single service.
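The fan-out at the heart of this pattern can be sketched as a tiny in-process topic router: external events are republished onto topics, and subscribed actions fire per topic. This is a toy illustration of the idea (minus NSQ and durability); the `Router` type is hypothetical.

```go
package main

import "fmt"

// Event is an external occurrence republished onto an internal topic.
type Event struct {
	Topic   string
	Payload string
}

// Router fans events out to the actions subscribed to each topic --
// the core idea behind Whisper, without NSQ or persistence.
type Router struct {
	subs map[string][]func(Event)
}

func NewRouter() *Router {
	return &Router{subs: map[string][]func(Event){}}
}

func (r *Router) Subscribe(topic string, action func(Event)) {
	r.subs[topic] = append(r.subs[topic], action)
}

func (r *Router) Publish(e Event) {
	for _, action := range r.subs[e.Topic] {
		action(e)
	}
}

func main() {
	r := NewRouter()
	// A hypothetical action: clean up when an instance goes away.
	r.Subscribe("instance.terminated", func(e Event) {
		fmt.Println("deregistering", e.Payload)
	})
	r.Publish(Event{Topic: "instance.terminated", Payload: "i-abc123"})
}
```

Because every consumer subscribes to topics rather than to sources, wiring in a new event source only touches the service that ingests it.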
20. AWS Auth under the hood
• Each external orchestration service instance has a "global" view of our infrastructure
• Relies heavily on STS to operate across different accounts and regions
• Each service has a designated role for every account and region
[Diagram: a service assumes a role in AWS Account X and in AWS Account Y, obtaining temporary security credentials for each]
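A designated role per service and account implies a predictable naming scheme, so a service can derive the ARN to pass to `sts:AssumeRole`. The scheme below is hypothetical, purely to illustrate the shape of the convention.

```go
package main

import "fmt"

// roleARN builds the ARN of the designated role a given orchestration
// service assumes in a given account. The "orchestration-<service>"
// naming convention here is an assumption, not Hailo's actual scheme.
func roleARN(accountID, service string) string {
	return fmt.Sprintf("arn:aws:iam::%s:role/orchestration-%s", accountID, service)
}

func main() {
	// With the AWS SDK, this ARN would be passed to sts:AssumeRole to
	// obtain temporary security credentials scoped to that account.
	fmt.Println(roleARN("123456789012", "compute"))
}
```

The payoff of the convention: adding a new account only requires creating the roles there, with no per-service configuration changes.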
21. Shared environments create contention. We decided to boost our developers' productivity and give them on-demand environments.
22. Environment Service
On-demand environments, built on CloudFormation and our orchestration layer:
• Single Instance Environment (SIE): core platform on a single server on AWS
• Multi Instance Environment (MIE): full infrastructure across hundreds of servers in a single AWS region
23. Environment Service
The Release Service clones services, config and *data from any environment (including prod) into an SIE or MIE.
ETA: ~12 min for an SIE, ~40 min for an MIE.
24. Environment Service
The same again, plus Vagrant support for SIEs and multi-region environments for MIEs.
26. All of this so we can do:
[Diagram: multiple SIEs and MIEs alongside Pre Prod and Live environments, all managed by the orchestration layer]
28. • The only services directly aware of our cloud provider specifics – this gives us a lot of flexibility and lets us introduce changes quickly
• Each of them fulfils a very specific task, and together they create powerful workflows
• Nothing else in our platform is aware of the underlying cloud layer
• We did not envision being "cloud agnostic" – it just happened
29. Provides the most essential platform functions for every service:
• Service Discovery
• Service Provisioning
• Routing & Load Balancing
• Authentication/Authorization
• Monitoring
• Configuration
31. Provisioning overview
[Diagram: the build pipeline publishes artifacts to Amazon S3 and the Docker Registry; the Provisioning Manager drives a Provisioning Service on each instance, which runs services as processes or containers across auto scaling groups]
32. Service deployment specifics
• Each service is decoupled from the rest and deployed individually
• We run multiple services on the same instance, but each service is deployed in at least 3 AZs
• We rely on auto scaling groups for organizing and scaling our workload
• We use static partitioning to match a service to an auto scaling group, and this results in non-optimal resource utilisation (25%–50%)
35. So what does this mean?
An elastic resource pool spanning eu-west-1a, eu-west-1b and eu-west-1c, at 75–80% utilization.
One word – such difference!
36. Why build our own scheduler?
We want a cloud-native scheduler that is aware of the cloud specifics and our microservices ecosystem:
• Service priority
• Service-specific runtime metrics
• Interference
• Cloud awareness (availability zones, pool elasticity…)
Running services in a pay-as-you-go fashion will soon be as much of a reality as today's on-demand compute.
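Cloud awareness in scheduling can be illustrated with a toy scoring function: prefer availability zones the service is not yet running in, then the least-loaded node. This is only a sketch of the idea; the types and weights are invented for illustration.

```go
package main

import "fmt"

// node is a candidate instance in the elastic resource pool.
type node struct {
	ID   string
	AZ   string
	Used float64 // fraction of capacity already in use
}

// pick chooses where to place a new service instance: a large bonus for
// zones the service is not yet in (AZ spread), plus a preference for
// spare capacity. A real scheduler would also weigh service priority,
// runtime metrics and interference, as listed above.
func pick(nodes []node, usedAZs map[string]bool) node {
	best, bestScore := 0, -1.0
	for i, n := range nodes {
		score := 1.0 - n.Used
		if !usedAZs[n.AZ] {
			score += 10 // strongly prefer an uncovered AZ
		}
		if score > bestScore {
			best, bestScore = i, score
		}
	}
	return nodes[best]
}

func main() {
	nodes := []node{
		{"i-1", "eu-west-1a", 0.2},
		{"i-2", "eu-west-1b", 0.7},
	}
	// The service already runs in eu-west-1a, so i-2 wins despite its load.
	fmt.Println(pick(nodes, map[string]bool{"eu-west-1a": true}).ID)
}
```

The key difference from a datacenter scheduler is that the pool itself is elastic: the scheduler can also decide to grow or shrink the auto scaling group.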
37. • Self-contained units of execution
• Built around business capabilities or domain objects
• Small enough to be rewritten in a few days
• They are all about adding business value
39. A microservice under the hood
[Diagram: a Service is split into a service-layer holding the business logic and storage behind a Handler, and a platform-layer providing a library for abstracting service-to-service comms and self-configuring external service adapters]
Any service gets for free:
• Service-to-service communication libs
• Discovery
• Configuration
• A/B testing capabilities
• Monitoring & instrumentation
• … and much more
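The handler-based split between platform-layer and service-layer can be sketched as follows. This is a minimal illustration of the pattern, not Hailo's actual service library; all type and method names are invented.

```go
package main

import (
	"errors"
	"fmt"
)

// Request, Response and Handler mirror the handler-based layout above.
type Request struct{ Body string }
type Response struct{ Body string }
type Handler func(Request) (Response, error)

// Service is the platform-layer shell: in a real implementation it would
// own discovery, config and transport, and dispatch to handlers.
type Service struct {
	name     string
	handlers map[string]Handler
}

func NewService(name string) *Service {
	return &Service{name: name, handlers: map[string]Handler{}}
}

// Register wires a business-logic handler (the service-layer) to an endpoint.
func (s *Service) Register(endpoint string, h Handler) {
	s.handlers[endpoint] = h
}

// Call dispatches a request to the registered handler.
func (s *Service) Call(endpoint string, req Request) (Response, error) {
	h, ok := s.handlers[endpoint]
	if !ok {
		return Response{}, errors.New("unknown endpoint: " + endpoint)
	}
	return h(req)
}

func main() {
	svc := NewService("greeter")
	svc.Register("hello", func(req Request) (Response, error) {
		return Response{Body: "hello " + req.Body}, nil
	})
	rsp, _ := svc.Call("hello", Request{Body: "world"})
	fmt.Println(rsp.Body)
}
```

Because the shell owns everything except the handlers, the cross-cutting capabilities listed above come for free with every new service.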
We built our custom provisioning system, and we started by running a number of services on a single instance.
Initially we ran services as normal processes on the instance, but this started causing noisy-neighbour problems.
Several months ago we gradually started moving to containers, aiming for isolation and resource-control capabilities.
We want an elastic resource pool where services are scheduled on an as-needed basis.
We don't want to manage services manually – we leave that to a smart scheduler.