Scaling Infrastructure at Carousell
Harshad Rotithor & Ankur Shrivastava
January 12, 2017
Harshad Rotithor & Ankur Shrivastava Scaling Infrastructure at Carousell January 12, 2017 1 / 48
Who are we?
Harshad Rotithor
Principal Software Engineer
Leads Infrastructure team
Previously at Flipkart,
Airpush, Zynga, etc.
harshad@carousell.com
Who are we?
Ankur Shrivastava
Senior Software Engineer
Engineer in the Infrastructure
team
Previously at Flipkart,
Amazon, Zynga, etc.
ankur@carousell.com
Where are we currently?
Started in 2012 at a Hackathon
7 countries, 19 cities
57M+ listings
23M+ items sold
Carousell makes buying and selling
simple, so that you can fill your life
with more meaningful things
Where are we currently?
400+ servers
Multiple Services see 2000+ requests per second
Self Managed deployments
PostgreSQL
ElasticSearch
Cassandra
RabbitMQ
Kafka
Redis
Memcache
and more ...
Uptime of 99.95%
Ability to handle AZ failures
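To put the uptime figure in perspective, the arithmetic below converts an availability percentage into a downtime budget. This is only a sketch: the function name and the 30-day month are assumptions made for illustration, not figures from the talk.

```python
def downtime_budget_minutes(availability_percent: float, period_days: float) -> float:
    """Minutes of downtime permitted over a period at a given availability."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    unavailable_fraction = 1.0 - availability_percent / 100.0
    # period_days * 24 hours * 60 minutes gives the period length in minutes.
    return unavailable_fraction * period_days * 24 * 60


if __name__ == "__main__":
    print(round(downtime_budget_minutes(99.95, 30), 1))   # budget for a 30-day month
    print(round(downtime_budget_minutes(99.95, 365), 1))  # budget for a 365-day year
```

At 99.95% this works out to roughly 21.6 minutes in a 30-day month, or a little under 4.4 hours in a year.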
So what is this talk about?
What it took to reach here
And what lies ahead!
Current Infrastructure - Overview
Infrastructure is:
Architecture
Systems
Operations
Stateful components most important
We self-manage user path data
stores
Enable choice of data stores
Right tradeoff in terms of
consistency
Enable possibilities of
workarounds during rough times
Have flexibility in node
configuration etc
Current Infrastructure
Current Infrastructure - Data Stores (PostgreSQL)
Master + 2 Slaves in each AZ (Total
7)
pgbouncer + HAProxy
(config-service)
Dedicated data disks (always use
SSDs)
Master disk snapshot every 3hr
(fsync enabled)
Don’t turn off Autovacuum
(transaction id)
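The "(transaction id)" note refers to PostgreSQL's 32-bit transaction IDs: vacuuming freezes old rows so that the counter can wrap safely, which is why autovacuum should stay on. A minimal sketch of how the remaining headroom might be tracked is given here. The function names are hypothetical, and the input is assumed to come from a query such as `SELECT age(datfrozenxid) FROM pg_database`.

```python
# Rows older than about 2^31 transactions would be affected by XID wraparound.
XID_WRAPAROUND_LIMIT = 2**31
# PostgreSQL's default for autovacuum_freeze_max_age.
DEFAULT_FREEZE_MAX_AGE = 200_000_000


def wraparound_headroom(xid_age: int) -> float:
    """Fraction of the XID space still available (1.0 means freshly frozen)."""
    if xid_age < 0:
        raise ValueError("xid_age must be non-negative")
    return max(0.0, 1.0 - xid_age / XID_WRAPAROUND_LIMIT)


def needs_attention(xid_age: int, freeze_max_age: int = DEFAULT_FREEZE_MAX_AGE) -> bool:
    """True once the age exceeds the point where a freeze should already have run."""
    return xid_age > freeze_max_age
```

An alert on `needs_attention` would be one way to notice a stalled autovacuum long before the headroom runs out; the threshold to alert on is a judgment call.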
Current Infrastructure - Data Stores (ElasticSearch)
3 clusters, largest being close to 75 nodes
Shard allocation awareness
Use Plugins (kopf/head/cerebro)
Keep masters in different AZ
HAProxy with L7 healthchecks
(config-service)
Incremental backups
Set the shard count correctly; err on the higher side.
Rely on the Linux page cache
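Shard allocation awareness aims to spread the copies of each shard across zones, so that losing one AZ does not take out a primary and its replica together. The sketch below is an audit-style check of that property over a plain list of placements. It does not call any ElasticSearch API, and the shard names and zone labels are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def shards_confined_to_one_zone(copies: Iterable[Tuple[str, str]]) -> List[str]:
    """Return the shards whose copies (primary or replica) all sit in one zone.

    `copies` holds one (shard_id, zone) pair per shard copy.
    """
    zones_by_shard: Dict[str, Set[str]] = defaultdict(set)
    for shard_id, zone in copies:
        zones_by_shard[shard_id].add(zone)
    return sorted(shard for shard, zones in zones_by_shard.items() if len(zones) < 2)


if __name__ == "__main__":
    placement = [
        ("listings-0", "az-a"), ("listings-0", "az-b"),
        ("listings-1", "az-a"), ("listings-1", "az-a"),
    ]
    print(shards_confined_to_one_zone(placement))
```

Any shard reported by this check would become unavailable if its single zone failed.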
History
Cloud provider ’x’
Everyday firefighting
We hit upper limits
Network
Disk
Noisy neighbours
Limited types of instances
Lack of features
Load balancer
Autoscaling
Security!
Decided on Migration
Planning
Around June 2016
250+ Nodes
Identify ALL nodes and their functionalities
Identify ALL traffic flows and patterns
Architecture Freeze
Perform comparative benchmarks
Redefine node and cluster configuration
Isolated deployment in GCP
Dry run data migration for all clusters
Estimate time
Preparation
July 2016
VPN across the providers (Heavy
Duty)
Replicate all that can be replicated
(inter DC)
Keep stateless nodes ready
Make DNS nameserver changes in
advance (3-4 days)
Script everything - node creation,
data movement, etc.
Aim for only data movement during
Migration
Preparation
Practice, Practice, Practice!
Migration
29th July 2016 at 3am
Queues - RabbitMQ, Kafka, etc
Drain on X
Switch to new on GCP
DB
Replicated slaves across DC
Promote to master and create
slaves
ElasticSearch & Cassandra
Snapshot/Restore
Very Quick - Fast GCP network
Redis
RDB restore, create slaves
Beware of cluster state in case of
Redis Cluster
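The "drain, then switch" step for queues can be sketched with two in-memory queues. This is an illustration of the ordering only, under the assumption that producers are pointed at the new queue first and the old queue is emptied before consumers move over. The class and method names are hypothetical, and no RabbitMQ or Kafka API is involved.

```python
from collections import deque
from typing import Callable, Deque


class QueueCutover:
    """Tracks an old and a new queue during a cutover."""

    def __init__(self) -> None:
        self.old: Deque[str] = deque()
        self.new: Deque[str] = deque()
        self.producers_on_new = False

    def publish(self, message: str) -> None:
        # Producers write to whichever side is currently active.
        (self.new if self.producers_on_new else self.old).append(message)

    def switch_producers(self) -> None:
        self.producers_on_new = True

    def drain_old(self, handle: Callable[[str], None]) -> int:
        """Process everything left on the old queue; return the count."""
        drained = 0
        while self.old:
            handle(self.old.popleft())
            drained += 1
        return drained

    def consume_new(self, handle: Callable[[str], None]) -> int:
        """Process what has accumulated on the new queue; return the count."""
        consumed = 0
        while self.new:
            handle(self.new.popleft())
            consumed += 1
        return consumed
```

Draining the old side completely before consuming from the new side keeps messages in order and avoids losing anything that was in flight at cutover time.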
Post Migration
5-6hr of Maintenance
Latency dropped to 1/4th on GCP
DNS propagation issue (even after 2 days)
L7 tunnels over VPN
Ensure monitoring is taken over after migration
Key Take Away
Practice makes the migration
perfect!
Keep stateless nodes ready
Keep configuration updated
Expect issues
Redis cluster state switch
DNS caching by ISPs for days
Keep Calm!
From Pets To Cattle
Static Infrastructure is a myth!
Manual updates can be faulty
Nodes can fail quickly, one after
another
Configuration can quickly become
stale
Misconfiguration of Nodes
Salt propagation issues
Recent config update
Painful to detect and fix
Production impact!
From Pets To Cattle
Infrastructure at scale needs →
Centralized configurations
Dynamic Discovery
Automatic recovery from failures
Autoscaling
Scripts for stateful nodes
(create/update/migrate)
Aggressive Monitoring and Alerting
Streamline Deployments
Configuration and Service Discovery
For Configuration we needed →
Centralized configuration storage
Consistent store
Audit of configuration changes
Versioning for quick reverts
Easy to deploy and manage
Configuration and Service Discovery
For Service Discovery we needed →
Decoupled from application code
Health checks
Easy to Scale Out
Easy to deploy and manage
Configuration and Service Discovery
We built ’Config-Service’ on top of
’Consul’
Configuration on nodes using Consul
Template & Envconsul
Installation on instances using
internal Debian package and repo
’Config-Service’ package takes care
of consul cluster configuration and
health check registration
Configuration Management
Git repository to manage
configuration
Filename is the key, content is the
value
Single source of truth
Audit log of changes
Easy reverts and versioning (just use
git revert)
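The "filename is the key, content is the value" convention can be sketched as a walk over a checked-out repository. This is a hedged illustration rather than the Config-Service implementation: using the path relative to the repository root as the key is an assumption, and the key names in `demo` are hypothetical.

```python
import os
import tempfile
from typing import Dict


def load_config_tree(root: str) -> Dict[str, str]:
    """Map each file's path relative to root (the key) to its content (the value)."""
    config: Dict[str, str] = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip git's own metadata directory.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            key = os.path.relpath(path, root).replace(os.sep, "/")
            with open(path, encoding="utf-8") as handle:
                config[key] = handle.read().strip()
    return config


def demo() -> Dict[str, str]:
    """Build a throwaway tree with two hypothetical keys and load it."""
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "search"))
        with open(os.path.join(root, "search", "timeout_ms"), "w", encoding="utf-8") as f:
            f.write("250\n")
        with open(os.path.join(root, "feature_flag"), "w", encoding="utf-8") as f:
            f.write("on\n")
        return load_config_tree(root)
```

Because every key is a file, the git history doubles as the audit log, and reverting a commit restores the previous value of exactly the keys it touched.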
Service Discovery
Named discovery
Loose coupling
Auto failover
Load balancing
Auto scaling on CPU usage /
Number of Requests
Node Maintenance
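Named discovery, health checks, automatic failover and load balancing fit together roughly as in the following sketch. It is a toy in-memory registry written for illustration, not Consul's API: the class, the method names and the addresses are all hypothetical.

```python
from typing import Callable, Dict, List


class ServiceRegistry:
    """Looks up instances by service name and skips unhealthy ones."""

    def __init__(self) -> None:
        self._instances: Dict[str, Dict[str, Callable[[], bool]]] = {}
        self._counters: Dict[str, int] = {}

    def register(self, service: str, address: str, health_check: Callable[[], bool]) -> None:
        self._instances.setdefault(service, {})[address] = health_check

    def healthy(self, service: str) -> List[str]:
        """Addresses whose health check currently passes, in registration order."""
        checks = self._instances.get(service, {})
        return [address for address, check in checks.items() if check()]

    def pick(self, service: str) -> str:
        """Round-robin over healthy instances; failed nodes drop out automatically."""
        candidates = self.healthy(service)
        if not candidates:
            raise LookupError(f"no healthy instance for {service!r}")
        index = self._counters.get(service, 0)
        self._counters[service] = index + 1
        return candidates[index % len(candidates)]
```

Callers only know the service name, which is the loose coupling the slide refers to. Taking a node out for maintenance amounts to making its health check fail.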
Config-Service Overview
Auto Scaling
Pay as you go, lower cost
Better fault tolerance
Availability zone failures
Handle sudden increases in traffic (especially at midnight!)
Key Take Away
Assume things will
break
Set Convention
Script everything
Use deb/rpm packages
Instance groups for
stateless services
More Cattle, less Pets
Kubernetes
Partial Kubernetes deployment since
Oct, 2016
Full Production deployment since
Nov, 2016
Using Google Container Engine
30+ deployments
500+ containers (At Peak)
Autoscale on CPU targets
Not all services onboarded yet
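Autoscaling on a CPU target is commonly a proportional rule. The slides do not say which autoscaler is used, so the sketch below simply follows the rule documented for the Kubernetes Horizontal Pod Autoscaler, desired = ceil(current replicas × current metric / target metric). The function name and the default bounds are assumptions, and the real autoscaler's tolerance band and stabilization behaviour are left out.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_cpu_percent: float,
    target_cpu_percent: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    """Scale the replica count in proportion to how far CPU is from its target."""
    if target_cpu_percent <= 0:
        raise ValueError("target_cpu_percent must be positive")
    raw = math.ceil(current_replicas * current_cpu_percent / target_cpu_percent)
    # Clamp to the configured bounds so a spike cannot scale without limit.
    return max(min_replicas, min(max_replicas, raw))
```

For example, 4 replicas running at 90% CPU against a 60% target would scale to 6, and the same 4 replicas at 30% would scale down to 2.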
Kubernetes
We don’t use K8S Ingress/Service
Config-Service (consul) as
DaemonSet
Containers get registered on
Config-Service (NodePort) from
health check
No change in existing architecture
needed
Service discovery from
Internal/External HAProxy still
works
Kubernetes
’Config-Service’ allows us to have a hybrid model
Instance groups can coexist with Kubernetes
Recovery mechanism / Transitioning
Instance group size set to zero (Fully on K8S)
Deployment Pipeline
Jenkins Pipeline
Pipeline triggers Jenkins jobs
3 Clicks to Deploy
Approval Steps
Jobs to pause, resume or
revert deployment
Tracked in Slack channels
Soon to be transformed to
CI/CD
Monitoring & Alerting
Monitoring is critical
Know your Infrastructure
Capture everything, always
Use Proper tools
Prometheus (with
exporters)
ELK
Sentry
StatsD
NewRelic
OpsGenie
Pingdom
Identify Retention
Monitoring & Alerting
Bare minimum required metrics→
Load Average
CPU percent
Memory Available
Network Bandwidth
Network Connections
Disk IOPS
Disk Usage
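Two of the metrics listed above, load average and disk usage, can be read with the Python standard library alone. The sketch is for illustration on a Unix-like host; the function name and the dictionary keys are assumptions, and in practice an exporter (the slides mention Prometheus exporters) would collect the full set.

```python
import os
import shutil
from typing import Dict, Optional


def collect_basic_metrics(path: str = "/") -> Dict[str, Optional[float]]:
    """Sample load average and disk usage for the filesystem holding `path`."""
    try:
        load_1m, load_5m, load_15m = os.getloadavg()
    except OSError:
        # Load average is not obtainable on every platform.
        load_1m = load_5m = load_15m = None
    disk = shutil.disk_usage(path)
    used_percent = 100.0 * disk.used / disk.total if disk.total else 0.0
    return {
        "load_average_1m": load_1m,
        "load_average_5m": load_5m,
        "load_average_15m": load_15m,
        "cpu_count": float(os.cpu_count() or 1),
        "disk_used_percent": used_percent,
    }


if __name__ == "__main__":
    for name, value in collect_basic_metrics().items():
        print(name, value)
```

Comparing the load average with `cpu_count` gives a quick sense of whether a node is saturated.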
Build Dashboards
Monitoring & Alerting
’Config-Service’ logs auto
failover
Slack for notifications
On Call
Avoid alert blindness
Keep links handy
Schedule jobs
Automate
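One common way to reduce alert blindness is to suppress repeats of the same alert inside a cooldown window, so a flapping check does not flood the channel. The sketch below shows that idea only; it is not taken from the talk, and the class name, alert keys and cooldown value are hypothetical.

```python
from typing import Dict


class AlertThrottle:
    """Lets an alert through once per cooldown window for each alert key."""

    def __init__(self, cooldown_seconds: float) -> None:
        self.cooldown_seconds = cooldown_seconds
        self._last_sent: Dict[str, float] = {}

    def should_notify(self, alert_key: str, now: float) -> bool:
        """Return True if this alert should be sent at time `now` (in seconds)."""
        last = self._last_sent.get(alert_key)
        if last is not None and now - last < self.cooldown_seconds:
            return False  # same alert fired recently, stay quiet
        self._last_sent[alert_key] = now
        return True
```

Passing the timestamp in explicitly keeps the class easy to test; a caller would normally supply `time.time()`.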
Future Plans
Hire more engineers!
Move more services to Kubernetes
Move away from PG (don’t need ACID)
Transition to Microservices
Improve monitoring further
More fault tolerance
Microservices
Golang (go-kit inspired)
Cassandra for storage
ElasticSearch for lookup
gRPC for communication
Hystrix for real time
monitoring
Zipkin for request tracing
Prometheus for metrics
Flash Sale
Ultimate test of scalability
Hard to judge peak
Throughput can multiply in
short time
Planned for 2x throughput
Flash Sale - Latency
Flash Sale
Cache read calls at multiple layers
Upsized ES nodes, eventually
replacing entire cluster
Local SSD PG slaves with RAID 0
(100k IOPS)
Identify network bottlenecks
Recheck ulimit and connection limits
Build and keep SOP handy
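Caching read calls at multiple layers can be sketched as a lookup that tries each layer in order and back-fills the faster layers on a hit further down. This is a minimal illustration under simplifying assumptions: plain dictionaries stand in for the layers (for example an in-process cache in front of a shared one), there is no expiry or eviction, and all names are hypothetical.

```python
from typing import Callable, Dict, List


class LayeredReadCache:
    """Read-through cache over several layers, fastest layer first."""

    def __init__(self, layers: List[Dict[str, str]], load_from_store: Callable[[str], str]) -> None:
        self.layers = layers
        self.load_from_store = load_from_store

    def get(self, key: str) -> str:
        for depth, layer in enumerate(self.layers):
            if key in layer:
                value = layer[key]
                # Back-fill the faster layers that missed.
                for upper in self.layers[:depth]:
                    upper[key] = value
                return value
        # Every layer missed: hit the data store once and fill all layers.
        value = self.load_from_store(key)
        for layer in self.layers:
            layer[key] = value
        return value
```

During a traffic spike, only the first read of a hot key reaches the data store; later reads are served from a cache layer.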
Flash Sale - Standard Operating Procedure
Infrastructure Team at Carousell
400+ servers
Thousands of requests per second
Production Issues get looked after in < 5 Mins
Uptime of 99.95%
Failures don’t result in outages
All thanks to Planning, Monitoring and Automation
Take Away
Isolate stateful and stateless components
Isolating compute is equally important
Choose data stores carefully, you won’t be changing them
frequently
Use Abstractions only after understanding them
Perform Root Cause Analysis, not just workarounds/isolations
Identify bottlenecks
Monitor everything
Blame CODE not CODER
Thank You
Q&A
P.S. We are hiring: http://careers.carousell.com/
Scaling Infrastructure at Carousell

  • 1. Scaling Infrastructure at Carousell Harshad Rotithor & Ankur Shrivastava January 12, 2017 Harshad Rotithor & Ankur Shrivastava Scaling Infrastructure at Carousell January 12, 2017 1 / 48
  • 2. Who are we? Harshad Rotithor Principle Software Engineer Leads Infrastructure team Previously at Flipkart, Airpush, Zynga, etc. harshad@carousell.com Harshad Rotithor & Ankur Shrivastava Scaling Infrastructure at Carousell January 12, 2017 2 / 48
  • 3. Who are we? Ankur Shrivastava Senior Software Engineer Engineer in the Infrastructure team Previously at Flipkart, Amazon, Zynga, etc. ankur@carousell.com Harshad Rotithor & Ankur Shrivastava Scaling Infrastructure at Carousell January 12, 2017 3 / 48
  • 4. Where are we currently? Started in 2012 at a Hackathon Harshad Rotithor & Ankur Shrivastava Scaling Infrastructure at Carousell January 12, 2017 4 / 48
  • 5. Where are we currently? Started in 2012 at a Hackathon 7 countries, 19 cities 57M+ listings 23M+ items sold Harshad Rotithor & Ankur Shrivastava Scaling Infrastructure at Carousell January 12, 2017 4 / 48
  • 6. Where are we currently? Started in 2012 at a Hackathon 7 countries, 19 cities 57M+ listings 23M+ items sold Carousell makes buying and selling simple, so that you can fill our life with more meaningful things Harshad Rotithor & Ankur Shrivastava Scaling Infrastructure at Carousell January 12, 2017 4 / 48
  • 7. Where are we currently? 400+ servers Multiple Services see 2000+ requests per second Harshad Rotithor & Ankur Shrivastava Scaling Infrastructure at Carousell January 12, 2017 5 / 48
  • 8. Where are we currently? 400+ servers Multiple Services see 2000+ requests per second Self Managed deployments PostgresSQL ElasticSearch Cassandra RabbitMQ Kafka Redis Memcache and more ... Harshad Rotithor & Ankur Shrivastava Scaling Infrastructure at Carousell January 12, 2017 5 / 48
  • 9. Where are we currently? 400+ servers Multiple Services see 2000+ requests per second Self Managed deployments PostgresSQL ElasticSearch Cassandra RabbitMQ Kafka Redis Memcache and more ... Uptime of 99.95 Ability to handle AZ failures Harshad Rotithor & Ankur Shrivastava Scaling Infrastructure at Carousell January 12, 2017 5 / 48
  • 10. So what is this talk about ? Harshad Rotithor & Ankur Shrivastava Scaling Infrastructure at Carousell January 12, 2017 6 / 48
  • 11. What it took to reach here And what lies ahead! Harshad Rotithor & Ankur Shrivastava Scaling Infrastructure at Carousell January 12, 2017 7 / 48
  • 12. Current Infrastructure - Overview Infrastructure is: Harshad Rotithor & Ankur Shrivastava Scaling Infrastructure at Carousell January 12, 2017 8 / 48
  • 13. Current Infrastructure - Overview Infrastructure is: Architecture Systems Operations Harshad Rotithor & Ankur Shrivastava Scaling Infrastructure at Carousell January 12, 2017 8 / 48
  • 14. Current Infrastructure - Overview Infrastructure is: Architecture Systems Operations Stateful components most important We self-manage user path data stores Enable choice of data stores Right tradeoff in terms of consistency Enable possibilities of workarounds during rough times Have flexibility in node configuration etc Harshad Rotithor & Ankur Shrivastava Scaling Infrastructure at Carousell January 12, 2017 8 / 48
  • 15. Current Infrastructure Harshad Rotithor & Ankur Shrivastava Scaling Infrastructure at Carousell January 12, 2017 9 / 48
  • 16. Current Infrastructure - Data Stores Master + 2 Slaves in each AZ (Total 7) pgbouncer + HA Proxy (config-service) Harshad Rotithor & Ankur Shrivastava Scaling Infrastructure at Carousell January 12, 2017 10 / 48
  • 17. Current Infrastructure - Data Stores. Master + 2 slaves in each AZ (total 7); pgbouncer + HAProxy (config-service); dedicated data disks (always use SSDs); master disk snapshot every 3 hr (fsync enabled); don’t turn off autovacuum (transaction ID).
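The "(transaction ID)" note refers to PostgreSQL transaction ID wraparound: IDs are 32-bit, so tables must be frozen by vacuum before a database's oldest unfrozen ID is about 2 billion transactions old. A minimal sketch of the headroom arithmetic follows. The input would come from a query such as `SELECT age(datfrozenxid) FROM pg_database`; the helper name and report fields are hypothetical, and 200 million is PostgreSQL's documented default for `autovacuum_freeze_max_age`.

```python
XID_WRAPAROUND_LIMIT = 2**31           # roughly 2.1 billion transactions
DEFAULT_FREEZE_MAX_AGE = 200_000_000   # default autovacuum_freeze_max_age


def xid_status(xid_age: int, freeze_max_age: int = DEFAULT_FREEZE_MAX_AGE) -> dict:
    """Summarise how close a database is to transaction ID wraparound."""
    return {
        "remaining_xids": XID_WRAPAROUND_LIMIT - xid_age,
        "percent_used": round(100 * xid_age / XID_WRAPAROUND_LIMIT, 2),
        # Past this age PostgreSQL schedules an anti-wraparound vacuum.
        "forced_autovacuum_due": xid_age >= freeze_max_age,
    }


if __name__ == "__main__":
    print(xid_status(150_000_000))
```

Alerting on `percent_used` is one way to notice vacuum falling behind long before it becomes an incident.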
  • 20. Current Infrastructure - Data Stores. 3 clusters, the largest close to 75 nodes; shard allocation awareness; use plugins (kopf/head/cerebro); keep masters in different AZs; HAProxy with L7 health checks (config-service); incremental backups; set shard count correctly, erring on the higher side; rely on the Linux page cache.
  • 24. History. Cloud provider ’x’: everyday firefighting; we hit upper limits (network, disk); noisy neighbours; limited types of instances; lack of features (load balancer, autoscaling); security! Decided on migration.
  • 27. Planning. Around June 2016; 250+ nodes; identify ALL nodes and their functionalities; identify ALL traffic flows and patterns; architecture freeze; perform comparative benchmarks; redefine node and cluster configuration; isolated deployment in GCP; dry-run data migration for all clusters; estimate time.
  • 29. Preparation. July 2016; VPN across the providers (heavy duty); replicate all that can be replicated (inter-DC); keep stateless nodes ready; make DNS nameserver changes in advance (3-4 days); script everything (node creation, data movement, etc.); aim for only data movement during migration.
  • 30. Preparation. Practice, practice, practice!
  • 35. Migration. 29th July 2016 at 3am. Queues (RabbitMQ, Kafka, etc.): drain on X, switch to new on GCP. DB: replicated slaves across DCs, promote to master and create slaves. ElasticSearch & Cassandra: snapshot/restore, very quick thanks to the fast GCP network. Redis: RDB restore, create slaves; beware of cluster state in case of Redis Cluster.
  • 36. Post Migration. 5-6 hr of maintenance; latency dropped to 1/4th on GCP; DNS propagation issue (even after 2 days); L7 tunnels over VPN; ensure monitoring is taken over after migration.
  • 38. Key Take Away. Practice makes the migration perfect! Keep stateless nodes ready; keep configuration updated; expect issues (Redis cluster state switch, DNS caching by ISPs for days); keep calm!
  • 39. From Pets To Cattle ⇓
  • 44. From Pets To Cattle. Static infrastructure is a myth! Manual updates can be faulty; nodes can fail quickly, one after another; configuration can quickly become stale. Misconfiguration of nodes: Salt propagation issues; recent config update; painful to detect and fix; production impact!
  • 51. From Pets To Cattle. Infrastructure at scale needs → centralized configurations; dynamic discovery; automatic recovery from failures; autoscaling; scripts for stateful nodes (create/update/migrate); aggressive monitoring and alerting; streamlined deployments.
  • 52. Configuration and Service Discovery. For configuration we needed → centralized configuration storage; consistent store; audit of configuration changes; versioning for quick reverts; easy to deploy and manage.
  • 53. Configuration and Service Discovery. For service discovery we needed → decoupled from application code; health checks; easy to scale out; easy to deploy and manage.
  • 55. Configuration and Service Discovery. We built ’Config-Service’ on top of ’Consul’. Configuration on nodes using Consul Template & Envconsul; installation on instances using an internal Debian package and repo; the ’Config-Service’ package takes care of Consul cluster configuration and health check registration.
  • 57. Configuration Management. Git repository to manage configuration; filename is the key, content is the value; single source of truth; audit log of changes; easy reverts and versioning (just use git revert).
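The "filename is the key, content is the value" convention can be sketched as a walk over a checked-out repository that produces a flat key/value map, which a sync job could then write to a store such as Consul KV. This is only an illustration under assumptions: the function name, the skipping of `.git`, and the sample keys are hypothetical, and the talk does not describe how Config-Service actually loads the repository.

```python
import tempfile
from pathlib import Path


def load_config_tree(root: Path) -> dict[str, str]:
    """Map each file under `root` to key = relative path, value = file content."""
    config = {}
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        if path.is_file() and ".git" not in rel.parts:
            config[rel.as_posix()] = path.read_text().strip()
    return config


if __name__ == "__main__":
    # Build a throwaway tree standing in for the Git checkout.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "search").mkdir()
        (root / "search" / "timeout_ms").write_text("250\n")
        (root / "feature_flag").write_text("on\n")
        for key, value in load_config_tree(root).items():
            print(f"{key} = {value}")
```

Because every change is a commit, the audit log and `git revert` workflow on the slide come for free.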
  • 61. Service Discovery. Named discovery; loose coupling; auto failover; load balancing; auto scaling on CPU usage / number of requests; node maintenance.
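Named discovery with auto failover and load balancing boils down to choosing among the instances that currently pass health checks. In the talk this is handled by HAProxy and Config-Service; the class below is only a hypothetical in-process illustration of the idea, using round-robin selection.

```python
class ServicePool:
    """Round-robin over the instances of one named service, skipping unhealthy ones."""

    def __init__(self, instances):
        self._instances = list(instances)
        self._healthy = set(self._instances)
        self._next = 0

    def mark(self, instance, healthy: bool) -> None:
        """Record the latest health-check result for an instance."""
        if healthy:
            self._healthy.add(instance)
        else:
            self._healthy.discard(instance)

    def pick(self):
        """Return the next healthy instance, or raise if none are healthy."""
        for _ in range(len(self._instances)):
            candidate = self._instances[self._next % len(self._instances)]
            self._next += 1
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy instances")


if __name__ == "__main__":
    pool = ServicePool(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
    pool.mark("10.0.0.2", healthy=False)  # failed health check: traffic fails over
    print([pool.pick() for _ in range(4)])
```

Node maintenance fits the same model: marking a node unhealthy drains it without touching application code.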
  • 62. Config-Service Overview
  • 64. Auto Scaling. Pay as you go, lower cost; better fault tolerance; availability zone failures; handle sudden increases in traffic (especially at midnight!).
  • 70. Key Take Away. Assume things will break; set conventions; script everything; use deb/rpm packages; instance groups for stateless services; more cattle, fewer pets.
  • 73. Kubernetes. Partial Kubernetes deployment since Oct 2016; full production deployment since Nov 2016; using Google Container Engine; 30+ deployments; 500+ containers (at peak); autoscale on CPU targets; not all services onboarded yet.
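"Autoscale on CPU targets" most likely refers to the Kubernetes Horizontal Pod Autoscaler, although the slides do not name it. Its documented scaling rule is desired = ceil(current replicas × current utilisation / target utilisation). The sketch below applies that rule with min/max clamping; it omits the tolerance band the real controller uses, and the parameter names are illustrative.

```python
import math


def desired_replicas(current_replicas: int, current_cpu_percent: float,
                     target_cpu_percent: float,
                     min_replicas: int = 1, max_replicas: int = 100) -> int:
    """Replica count suggested by the HPA ratio rule, clamped to [min, max]."""
    raw = math.ceil(current_replicas * current_cpu_percent / target_cpu_percent)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # 4 pods at 90% CPU against a 60% target
    print(desired_replicas(4, 90, 60))
```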
  • 76. Kubernetes. We don’t use K8S Ingress/Service; Config-Service (Consul) runs as a DaemonSet; containers get registered on Config-Service (NodePort) from health check; no change in existing architecture needed; service discovery from internal/external HAProxy still works.
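The talk does not show how a container ends up registered in Config-Service, so the following is only a guess at the shape of the data involved. It builds the JSON body that Consul's agent endpoint `PUT /v1/agent/service/register` accepts, pointing at a node address and NodePort. The service name, address, port, health path, and check intervals are all hypothetical.

```python
import json


def consul_registration(name: str, address: str, node_port: int,
                        health_path: str = "/health") -> dict:
    """Build a Consul service definition with an HTTP health check."""
    return {
        "ID": f"{name}-{address}-{node_port}",
        "Name": name,
        "Address": address,
        "Port": node_port,
        "Check": {
            "HTTP": f"http://{address}:{node_port}{health_path}",
            "Interval": "10s",
            "Timeout": "2s",
        },
    }


if __name__ == "__main__":
    payload = consul_registration("listing-api", "10.128.0.7", 31080)
    print(json.dumps(payload, indent=2))
```

Because the registration carries a plain address and port, a proxy that already reads services from Consul would not need to know whether the backend is a VM or a pod, which is consistent with the "no change in existing architecture" point.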
  • 79. Kubernetes. ’Config-Service’ allows us to have a hybrid model; instance groups can coexist with Kubernetes; recovery mechanism / transitioning; instance group size set to zero (fully on K8S).
  • 80. Deployment Pipeline
  • 87. Deployment Pipeline. Jenkins Pipeline; pipeline triggers Jenkins jobs; 3 clicks to deploy; approval steps; jobs to pause, resume or revert deployment; tracked in Slack channels; soon to be transformed to CI/CD.
  • 88. Monitoring & Alerting
  • 93. Monitoring & Alerting. Monitoring is critical; know your infrastructure; capture everything, always; use proper tools: Prometheus (with exporters), ELK, Sentry, StatsD, NewRelic, OpsGenie, Pingdom; identify retention.
  • 97. Monitoring & Alerting. Bare minimum required metrics → load average; CPU percent; memory available; network bandwidth; network connections; disk IOPS; disk usage.
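Two of the metrics above, load average and disk usage, can be read with the Python standard library alone. The sketch below is a stand-in for what an exporter would collect, not the tooling used in the talk; it assumes a Unix host (`os.getloadavg` is not available on Windows), and the function and field names are hypothetical.

```python
import os
import shutil


def host_snapshot(path: str = "/") -> dict:
    """Collect load average and disk usage for a quick health snapshot."""
    load_1m, _load_5m, _load_15m = os.getloadavg()  # Unix only
    cpus = os.cpu_count() or 1
    disk = shutil.disk_usage(path)
    return {
        "load_1m": load_1m,
        "load_1m_per_cpu": load_1m / cpus,  # sustained values above 1.0 suggest saturation
        "disk_used_percent": 100 * disk.used / disk.total,
    }


if __name__ == "__main__":
    for name, value in host_snapshot().items():
        print(f"{name}: {value:.2f}")
```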
  • 98. Build Dashboards
  • 108. Monitoring & Alerting. ’Config-Service’ logs auto failover; Slack for notifications; on call; avoid alert blindness; keep links handy; schedule jobs; automate.
  • 114. Future Plans. Hire more engineers! Move more services to Kubernetes; move away from PG (don’t need ACID); transition to microservices; improve monitoring further; more fault tolerance.
  • 121. Microservices. Golang (go-kit inspired); Cassandra for storage; ElasticSearch for lookup; gRPC for communication; Hystrix for real-time monitoring; Zipkin for request tracing; Prometheus for metrics.
  • 123. Flash Sale. Ultimate test of scalability; hard to judge peak; throughput can multiply in a short time; planned for 2x throughput.
  • 124. Flash Sale - Latency
  • 129. Flash Sale. Cache read calls at multiple layers; upsized ES nodes, eventually replacing the entire cluster; local-SSD PG slaves with RAID 0 (100k IOPS); identify network bottlenecks; recheck ulimit and connection limits; build and keep SOP handy.
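"Recheck ulimit and connection limits" can be partly automated. The sketch below reads the process's open-file limit through Python's `resource` module (Unix only) and checks whether an expected connection count fits with some reserve. The 20% reserve and the helper names are assumptions for illustration, not values from the talk.

```python
import resource


def open_file_limits() -> tuple:
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def has_headroom(expected_connections: int, soft_limit: int,
                 reserve: float = 0.2) -> bool:
    """True if the expected connections fit under the soft limit with `reserve` spare."""
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return expected_connections <= soft_limit * (1 - reserve)


if __name__ == "__main__":
    soft, hard = open_file_limits()
    print(f"soft={soft} hard={hard} ok_for_10k={has_headroom(10_000, soft)}")
```

Each open socket consumes a file descriptor, so a low soft limit caps concurrent connections regardless of how large the node is.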
  • 130. Flash Sale - Standard Operating Procedure
  • 131. Infrastructure Team at Carousell. 400+ servers; thousands of requests per second; production issues get looked at in < 5 mins; uptime of 99.95%; failures don’t result in outages. All thanks to planning, monitoring and automation.
  • 139. Take Away. Isolate stateful and stateless components; isolating compute is equally important; choose data stores carefully, you won’t be changing them frequently; use abstractions only after understanding them; perform root cause analysis, not just workarounds/isolations; identify bottlenecks; monitor everything; blame CODE not CODER.
  • 140. Thank You. Q&A. P.S. we are hiring: http://careers.carousell.com/