The document describes LinkedIn's use of Couchbase for caching and the automation of Couchbase clusters using SaltStack. Key points:
- LinkedIn uses Couchbase to store cached data for read scaling across hundreds of clusters totaling thousands of servers.
- Automation is achieved using SaltStack's states, pillars and grains to configure Couchbase installation, cluster expansion/reduction, and uninstall remotely.
- A Couchbase execution module and Salt runners implement cluster operations like setup, expansion, reduction through the REST API and CLI while providing output to the user.
1. Couchbase Orchestration and Scaling a Caching Infrastructure at LinkedIn
Issa Fattah
Senior SRE for New York Engineering
Member of Couchbase Virtual Team
April 20th, 2016
2. Who We Are
Couchbase supported by Virtual Team
– 10 SREs
– 2 Software Engineers
– Sponsored by a Director
– Dedicate 5-90% of their time to supporting Couchbase
– In addition to their day to day responsibilities
Everyone can and is encouraged to contribute
3. Overview
LinkedIn Story
Why Couchbase?
Development and Operations
Clusters and Numbers
Automation Use Cases
SaltStack
– States, Pillars, Grains, Runner, Execution Module
Automation Output
– Caveats/Improvements
Conclusion
4. The LinkedIn Story
Founded in 2002
Grown into the world's largest professional social
media network
Offices in 24 countries, 30 cities around the world
Available in 24 languages
Revenue of $862M in Q4 2015
5. The LinkedIn Story
Growth in Site Features
– Member Profiles,
Connections, and Sharing
– Post and apply for Jobs
– LinkedIn Groups and
Company Pages
– Premium tools for hiring,
marketing, and sales
Growth in Internet traffic
– Billions of page hits per
day
– Global, round the clock
traffic
Growth in Audience
– 400+M members
– 3M+ company pages
– 2.1M+ groups
6. The LinkedIn Story
Difficult for storage systems to keep up
“Read-scaling”
– Store data in cache memory
– Replicate entire databases
– Temporary data such as for de-duping
– Memcached, EHCache, Custom
Infrastructure around storage systems
– Cache invalidations
– Reliable data replication
7. Why Couchbase?
Evaluated systems to replace Memcached: Mongo,
Redis, and others
Couchbase had advantages
– Drop-in replacement for Memcached
– Built in replication and cluster expansion
– Memory latency for operations
– Persistence i.e. asynchronous writes to disk
– Utilize some of the development infrastructure we’ve built
8. Why Couchbase?
Partitioning
– Partitioning done automatically
– Expansion and rebalancing of cluster
Warm Caches
– Replication to protect against server failures
– On-disk data for server reboot
– Backup/restore and live data pumps for data transfer
across data centers
9. Development and Operations
Increasing number of servers, focused on remote
caches
Important for our caches to remain warm
Built up operational tools and standards
– Deployment configuration scripts
– Common pattern for development
– Support libraries for our developers
– Monitoring of the servers
10. Development and Operations: Tools
SALT modules to build and configure a new cluster
Set a master, setup/expand/reduce a cluster
Custom RPM with some backported bug fixes
Integrate with monitoring dashboards
– Grab over 300 metrics
– Per-host: Key to value ratio
– Aggregated metrics: QPS, data vs memory size
Monitor high watermark, latency, data ejection, etc.
11. Clusters and Numbers
Approximately 300 couchbase clusters in all colos
Current versions supported: 2.2.0 and 3.0.1
Average cluster size of 10 hosts
Largest cluster is 72 hosts.
Single and Multi-tenant clusters
Highest throughput cluster is 1.8M QPS.
12. Clusters and Numbers:
Credential Cache
Stores metadata for user credentials
of 2 providers.
~8000 QPS
Avg. GET call
time ~1 ms.
18 hosts
13. Clusters and Numbers:
Credential Cache
~98% hit-ratio
All in Memory.
No ejections to disk
No performance
penalty (when
compared to RAM
access)
14. Couchbase Summary
In-Memory cache that fits into our existing
infrastructure
Provides eventual-consistency persistence
Read-scaling with acceptable latencies
Management and monitoring of the clusters
Rich set of tooling extended by members of CBVT
using Salt
15. Automation Use Cases
Build and deploy Couchbase in a reliable and consistent
way
Support multiple versions of Couchbase
Metadata to describe the cluster/buckets being
provisioned
Provide a layer of abstraction that is easy to use
Scale cluster sizes as needed.
Decommission clusters entirely.
17. Terminology
Salt Master
– central server that manages hosts which run agents, called Salt Minions.
Salt Runner
– application of convenience executed by the salt-run command on the Salt Master.
Execution Module
– similar to a Salt Runner, except that it is executed on the Minion host.
The method of configuration management provided by Salt consists of:
– Pillars: centrally managed data, rendered on the master
– Grains: data specific to the minion host that is being targeted
Range
– distributed metadata store that contains information about clusters of hosts.
18. Salt State
Expresses the state of a host in a small, easy-to-read and
easy-to-understand file:
# If couchbase user and group do not
# exist, create them.
couchbase:
  group.present:
    - gid: 500
  user.present:
    - shell: /bin/bash
    - home: /opt/couchbase
    - createhome: True
    - uid: 500
    - gid: 500
19. Salt State
Specify path for couchbase bucket data that is eventually persisted to disk.
Uniform configuration across all hosts in a cluster.
# Create path where couchbase will
# store bucket data:
create_data_dir:
  file.directory:
    - name: '/path/to/data'
    - user: couchbase
    - group: couchbase
    - dir_mode: 700
20. Salt State
Install couchbase.
Version and package are hard-coded.
# Install couchbase-server version.
couchbase-server:
  pkg.installed:
    - name: 'couchbase-server'
    - require:
      - user: couchbase
      - group: couchbase
      - file: password_file
    - fromrepo: ifattah-repo,RPMS.os
    - refresh: True
    - version: '3.0.1-1444.10'
21. Salt State
State successful if:
– All previous blocks were successful.
– couchbase-server process is running
Vanilla state
Robust templates
# Make sure that service is running
# after installation
couchbase-server:
  service:
    - running
    - require:
      - pkg: couchbase-server
22. Salt State Templating
Parameterize the version we want to install
Jinja Template + Pillar metadata
{% if salt['pillar.get']('couchbase_pillar:version') == '2.2.0' %}
  {% set pkg_name = 'prod-couchbase-server' %}
  {% set version = '2.2.0-85' %}
{% elif salt['pillar.get']('couchbase_pillar:version') == '3.0.1' %}
  {% set pkg_name = 'prod-couchbase-server' %}
  {% set version = '3.0.1-14' %}
{% endif %}
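The same version-to-package selection can be mirrored outside of Jinja, for example when a runner wants to report what will be installed. A minimal Python sketch, assuming only the two versions shown on the slide; the names `PACKAGES` and `resolve_package` are hypothetical and not part of the original tooling:

```python
# Hypothetical lookup table mirroring the two Jinja branches above.
PACKAGES = {
    "2.2.0": ("prod-couchbase-server", "2.2.0-85"),
    "3.0.1": ("prod-couchbase-server", "3.0.1-14"),
}


def resolve_package(pillar):
    """Return (pkg_name, version) for the version requested in pillar data."""
    requested = pillar.get("couchbase_pillar", {}).get("version")
    if requested not in PACKAGES:
        # Unlike the template, fail loudly on an unsupported version.
        raise ValueError("unsupported couchbase version: %r" % (requested,))
    return PACKAGES[requested]
```

Raising on an unknown version is a design choice of this sketch: the template as shown simply leaves `pkg_name` and `version` undefined in that case.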
23. Salt State Templating
Install couchbase.
Version and package are taken from pillar data.
State (SLS file) can be applied via CLI or via Python Salt client
# Install couchbase-server version.
couchbase-server:
  pkg.installed:
    - name: {{ pkg_name }}
    - require:
      - user: couchbase
      - group: couchbase
    - fromrepo: some-repo
    - refresh: True
    - version: {{ version }}
24. Salt State
Can be applied via CLI or Salt Runner:
sudo salt hostname state.sls couchbase.setup
hostname1:
----------
ID: required_packages
Function: pkg.installed
Result: True
Comment: All specified packages are already installed.
Started: 20:37:00.729487
Duration: 6502.956 ms
Changes:
25. Salt Pillars
YAML-defined.
Every cluster’s composition.
– Buckets
– Replicas
– RAM Size
– Type
SASL or no SASL?
Host lists are obtained by Range.
Grain stores the name of this file so
that we can retrieve the admin
password for this cluster only.
#!yaml|gpg
couchbase_pillar:
  admin_password: '<encrypted string>'
  host_range: '%ifattah.couchbase.99'
  data_path: /mnt/foo
  version: 3.0.1
  buckets:
    bucketA:
      type: couchbase
      auth: none
      port: 13337
      replicas: 1
      ram: 256
    bucketB:
      type: couchbase
      auth: none
      port: 13338
      replicas: 1
      ram: 256
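Because every cluster is described by pillar metadata like the above, the metadata can be sanity-checked before anything is provisioned. A minimal Python sketch, assuming `ram` is in MB and is compared against the cluster RAM quota; the function `validate_buckets` and its checks are hypothetical and not taken from the original tooling:

```python
def validate_buckets(pillar, cluster_ramsize_mb):
    """Return a list of problems found in the bucket definitions."""
    buckets = pillar["couchbase_pillar"]["buckets"]
    problems = []

    # Two buckets should not be configured with the same port.
    seen_ports = {}
    for name, spec in sorted(buckets.items()):
        port = spec["port"]
        if port in seen_ports:
            problems.append(
                "%s reuses port %d of %s" % (name, port, seen_ports[port])
            )
        else:
            seen_ports[port] = name

    # The sum of the bucket quotas must fit into the cluster RAM quota.
    total_ram = sum(spec["ram"] for spec in buckets.values())
    if total_ram > cluster_ramsize_mb:
        problems.append(
            "buckets need %d MB but only %d MB is allocated"
            % (total_ram, cluster_ramsize_mb)
        )
    return problems
```

An empty list means the metadata passed both checks; a runner could refuse to proceed otherwise.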
26. Salt Grains
Host-specific data:
– OS version
– Kernel version
– Total Physical RAM
Set the pillar file which stores the encrypted admin password.
When pillar data is made available to a targeted minion:
– Include pillar file matching the name of the grain:
• couchbase.cluster.{{ grains['couchbase_cluster'] }}
Ensures a cluster only accesses its own pillar metadata
Set the cluster_ramsize (RAM to allocate to couchbase):
Ensures all hosts' RAM is utilized in the same way.
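The grain-to-pillar mapping described on this slide can be sketched in a few lines of Python. This is only an illustration: the grain name `couchbase_cluster` and the `couchbase.cluster.` prefix come from the slide, while the function `pillar_sls_for` is hypothetical:

```python
def pillar_sls_for(grains):
    """Build the pillar SLS name for the cluster this minion belongs to."""
    cluster = grains.get("couchbase_cluster")
    if not cluster:
        # Without the grain the include cannot be resolved.
        raise KeyError("grain 'couchbase_cluster' is not set")
    return "couchbase.cluster.%s" % cluster
```

Checking for the grain up front makes the failure explicit instead of surfacing later as a pillar render error.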
27. Salt Execution Module
Available functions executed by Salt Minion on targeted
hosts
Post-installation steps required by couchbase
Constructs cli commands and API requests
28. Salt Execution Module
Constructs Couchbase commands:
– Set cluster’s admin password, data_path (taken from pillar data)
– Add/remove host(s) from a cluster
– Issue a rebalance of data
– Get status/stop rebalance
– Create new buckets
– Flush existing buckets
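As an illustration of how such a command might be constructed from pillar data, here is a minimal Python sketch for bucket creation. The flag names are assumed to match the `couchbase-cli` of the 3.x series and should be checked against the installed version; the function name, the install path and the use of port 8091 are assumptions of this sketch, not the original module:

```python
def bucket_create_command(master, admin_user, admin_password, name, spec):
    """Build the argument list for `couchbase-cli bucket-create`.

    `spec` is one bucket entry from the pillar (type, port, ram, replicas).
    The list form can be handed to subprocess without going through a shell.
    """
    return [
        "/opt/couchbase/bin/couchbase-cli", "bucket-create",
        "-c", "%s:8091" % master,          # assumed admin port
        "-u", admin_user,
        "-p", admin_password,
        "--bucket=%s" % name,
        "--bucket-type=%s" % spec["type"],
        "--bucket-port=%d" % spec["port"],
        "--bucket-ramsize=%d" % spec["ram"],
        "--bucket-replica=%d" % spec["replicas"],
    ]
```

Building an argument list rather than a single string avoids quoting problems when the password contains shell metacharacters.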
Issue requests via HTTP REST API:
– Get Couchbase version from running node
– Enable auto-failover (when a node fails to respond).
– Rename a node
– Get membership status/health of all hosts in a cluster.
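The membership check can be illustrated with the JSON that Couchbase returns from its `/pools/default` REST endpoint, where each entry under `nodes` carries `hostname`, `clusterMembership` and `status`. A minimal Python sketch that parses an already-fetched response body; the function name `summarize_membership` is hypothetical:

```python
import json


def summarize_membership(pools_default_body):
    """Map hostname -> (clusterMembership, status) from a /pools/default reply."""
    doc = json.loads(pools_default_body)
    summary = {}
    for node in doc.get("nodes", []):
        # Hostnames are reported as "host:port"; keep only the host part.
        host = node["hostname"].rsplit(":", 1)[0]
        summary[host] = (node["clusterMembership"], node["status"])
    return summary
```

Keeping the HTTP request separate from the parsing makes the parsing easy to test without a running cluster.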
29. Salt Runner
Provides 4 functions to the user:
– setup_cluster:
• Build a cluster from scratch with provided metadata from pillar and host list
from range.
• Applies ‘setup’ state for installation.
– expand_cluster (reduce_cluster)
• Compares range host list with couchbase membership.
• Adds hosts to (or removes hosts from) the couchbase cluster to match range (source of truth)
– Uninstall
• Ensures couchbase is removed by applying another state ‘uninstall.sls’
All above runner functions verify the results of all functions
executed by the minion (execution module)
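The comparison between the range host list and the current Couchbase membership amounts to two set differences. A minimal Python sketch, with the hypothetical function name `plan_membership_change`:

```python
def plan_membership_change(range_hosts, cluster_hosts):
    """Return (hosts_to_add, hosts_to_remove), treating range as source of truth."""
    wanted = set(range_hosts)
    current = set(cluster_hosts)
    to_add = sorted(wanted - current)      # in range, not yet in the cluster
    to_remove = sorted(current - wanted)   # in the cluster, no longer in range
    return to_add, to_remove
```

In this sketch, expand_cluster would act on the first list and reduce_cluster on the second.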
30. Salt Runner
Ties everything together:
– sudo salt-run couchbase.setup_cluster %ifattah.couchbase.99
Apply ‘couchbase.setup’ state to given hosts
– Ensure all pre-installation and installation pre-requisites are met.
Sets grain value for a host to obtain the decrypted admin
password (name of pillar file)
Salt’s cmd_iter() invokes a function in the execution module
on the targeted minion host.
Runner output informs the user throughout the process
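How a runner might verify what comes back from the minions can be sketched as follows. This assumes each item yielded by the iterator is a dictionary of the form `{minion_id: {"ret": value}}` and that the execution module function returns `True` on success; both are assumptions of this sketch, and `verify_returns` is a hypothetical name:

```python
def verify_returns(returns, expected_minions):
    """Split the targeted minions into (succeeded, failed).

    `returns` is any iterable of {minion_id: {"ret": value}} dictionaries.
    Minions that never answer are reported as failed.
    """
    succeeded = set()
    for chunk in returns:
        for minion, data in chunk.items():
            if data.get("ret") is True:
                succeeded.add(minion)
    failed = sorted(set(expected_minions) - succeeded)
    return sorted(succeeded), failed
```

Counting silent minions as failures keeps the runner from reporting success when a host simply did not respond.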
31. Salt Runner Output (setup_cluster)
Couchbase will be installed on:
salt-minion1.linkedin.local
salt-minion2.linkedin.local
salt-minion3.linkedin.local
salt-minion4.linkedin.local
Master node is: salt-minion1.linkedin.local
couchbase-server version: 3.0.1
Couchbase data path: /mnt/foo
Decrypted Couchbase Administrator password: adminadmin
Decrypted Couchbase readonly password: readonly
MB of RAM allocated for couchbase-server (cluster-init-ramsize): 1419
The following buckets will be created:
- bucketA
- bucketB
32. Salt Runner Output (setup_cluster)
[INFO] Non-interactive: Answered yes to 'Is the above information correct? Proceed?'
[INFO] Beginning installation ...
[INFO] SUCCESS: Successfully ran couchbase.setup on salt-minion1…4.linkedin.local
[INFO] SUCCESS: couchbase.setup state applied to all hosts.
[INFO] Syncing modules from base env…
[INFO] Modules on salt-minion1.linkedin.local were synced.
[INFO] Modules on salt-minion2.linkedin.local were synced.
[INFO] Modules on salt-minion3.linkedin.local were synced.
[INFO] Modules on salt-minion4.linkedin.local were synced.
[INFO] Initializing couchbase on all cluster nodes…
[INFO] SUCCESS: Successfully set data_path on salt-minion4.linkedin.local
[INFO] SUCCESS: Successfully set data_path on salt-minion2.linkedin.local
[INFO] SUCCESS: Successfully set data_path on salt-minion3.linkedin.local
[INFO] SUCCESS: Successfully set data_path on salt-minion1.linkedin.local
[INFO] SUCCESS: Successfully set cluster_ramsize to 1419M
33. Salt Runner Output (setup_cluster)
[INFO] SUCCESS: Successfully renamed salt-minion1.linkedin.local with FQDN.
[INFO] SUCCESS: Successfully created readonly admin account
[INFO] Adding salt-minion2.linkedin.local to cluster...
[INFO] SUCCESS: Successfully added salt-minion2.linkedin.local to the cluster
[INFO] Adding salt-minion3.linkedin.local to cluster...
[INFO] SUCCESS: Successfully added salt-minion3.linkedin.local to the cluster
[INFO] Adding salt-minion4.linkedin.local to cluster...
[INFO] SUCCESS: Successfully added salt-minion4.linkedin.local to the cluster
[INFO] Starting Rebalance...
[INFO] SUCCESS: Successfully rebalanced the cluster
[INFO] SUCCESS: Successfully created bucketA bucket
[INFO] SUCCESS: Successfully created bucketB bucket
[INFO] Starting Rebalance...
[INFO] SUCCESS: Successfully rebalanced the cluster
[INFO] SUCCESS: Successfully enabled autofailover for the cluster
[INFO] INSTALLATION COMPLETE!
35. Salt Runner Output (reduce_cluster)
[INFO] ** Starting preliminary procedures for cluster reduction... **
[INFO] Setting grain...
[INFO] Summary of operations to be performed:
[INFO] The following hosts will be removed from the couchbase cluster:
[INFO] - salt-minion4.linkedin.local
[INFO] - salt-minion3.linkedin.local
[INFO] All commands will be executed on the chosen master node: salt-minion1.linkedin.local
[INFO] Non-interactive: Answered yes to 'Is the above information correct? Proceed?'
[INFO] WARNING: Cluster reduction can be a dangerous operation.
[INFO] Non-interactive: Answered yes to 'Are you sure this cluster can operate with 2 fewer nodes? Proceed?'
[INFO] Marking hosts for removal and beginning rebalance...
[INFO] - salt-minion4.linkedin.local
[INFO] - salt-minion3.linkedin.local
[INFO] Rebalancing may take a while…
36. Salt Runner Output (reduce_cluster)
[INFO] Waiting 30 seconds after rebalance, before uninstall.
[INFO] Uninstalling couchbase from removed hosts...
[INFO] Couchbase will be uninstalled from:
[INFO] - salt-minion4.linkedin.local
[INFO] - salt-minion3.linkedin.local
[INFO] Non-interactive: Answered yes to 'Is the above information correct? Proceed?'
[INFO] Syncing modules from base env...
[INFO] Modules on salt-minion3.linkedin.local were synced.
[INFO] Modules on salt-minion4.linkedin.local were synced.
[INFO] Beginning removal ...
[INFO] SUCCESS: removed couchbase_cluster grain from salt-minion4.linkedin.local
[INFO] SUCCESS: removed couchbase_cluster grain from salt-minion3.linkedin.local
[INFO] SUCCESS: Successfully ran couchbase.uninstall on salt-minion3…
[INFO] SUCCESS: Successfully ran couchbase.uninstall on salt-minion4…
[INFO] UNINSTALLED!
[INFO] Cluster reduction complete!
True
39. Salt Runner Output (expand_cluster)
[INFO] ** Starting preliminary procedures for cluster expansion... **
[INFO] Setting grain...
[INFO] The following hosts belong to the range cluster %ifattah.couchbase.9999 :
[INFO] - salt-minion1.linkedin.local, clusterMember=active, status=healthy
[INFO] - salt-minion2.linkedin.local, clusterMember=active, status=healthy
[INFO] The following hosts are being added for cluster expansion:
- salt-minion3.linkedin.local
- salt-minion4.linkedin.local
[INFO] Non-interactive: Answered yes to 'Is the above information correct?'
[INFO] Beginning installation ...
[INFO] SUCCESS: Successfully ran couchbase.setup on salt-minion4.linkedin.local
[INFO] SUCCESS: Successfully ran couchbase.setup on salt-minion3.linkedin.local
[INFO] SUCCESS: couchbase.setup state applied to new hosts.
40. Salt Runner Output (expand_cluster)
[INFO] Syncing modules from base env...
[INFO] Modules on salt-minion4.linkedin.local were synced.
[INFO] Modules on salt-minion3.linkedin.local were synced.
[INFO] Initializing couchbase on new cluster nodes...
[INFO] SUCCESS: Successfully set data_path on salt-minion4.linkedin.local
[INFO] SUCCESS: Successfully set data_path on salt-minion3.linkedin.local
[INFO] Adding salt-minion3.linkedin.local to cluster...
[INFO] SUCCESS: Successfully added salt-minion3.linkedin.local to the cluster
[INFO] Adding salt-minion4.linkedin.local to cluster...
[INFO] SUCCESS: Successfully added salt-minion4.linkedin.local to the cluster
[INFO] Starting Rebalance...
[INFO] SUCCESS: Successfully rebalanced the cluster
[INFO] EXPANSION COMPLETE!
True
42. Caveats/Improvements
If the grain isn't set, a pillar render error occurs on the
minion.
– To avoid leaking the password to minion logs, keep
pillar_safe_render_error set to True.
Frequent updates to execution module as functions are
tested for Couchbase 4.x
Logic to infer how many hosts can be safely removed
Provide a frontend interface
– Instead of running from salt master host.
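One possible shape for the "how many hosts can be safely removed" logic is sketched below in Python. It is purely illustrative and not part of the tooling described: the two rules (keep at least replicas + 1 hosts, and keep used memory under a fraction of the remaining quota), the `headroom` default and the function name are all assumptions of this sketch:

```python
import math


def max_removable_hosts(host_count, replicas, used_mb, quota_mb_per_host,
                        headroom=0.8):
    """Upper bound on hosts that could be removed under two assumed rules."""
    # Rule 1 (assumed): keep at least replicas + 1 hosts so that each copy
    # of the data can live on a different node.
    min_for_replicas = replicas + 1
    # Rule 2 (assumed): used memory must stay below headroom * total quota.
    min_for_ram = math.ceil(used_mb / (quota_mb_per_host * headroom))
    needed = max(min_for_replicas, min_for_ram, 1)
    return max(host_count - needed, 0)
```

A runner could compare this bound with the number of hosts about to be removed before asking the operator to confirm.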
43. Conclusions
Couchbase
– Provides a robust caching layer
– Powers critical parts of linkedin.com
SaltStack
– Quickly and dynamically provision clusters
– Reliably scale clusters as needed.
Automation is your friend
Open-source version can be made available if there is
enough interest.