What are the techniques and technologies used by popular social networking sites such as Facebook, Twitter, Tumblr, Pinterest or Instagram? How do they architect their systems to scale to hundreds of millions of visits per day?
Lessons from Highly Scalable Architectures at Social Networking Sites
1. Software Engineering in a Cloud World
Lessons from highly scalable architectures at social networking sites
Patrick Senti
patrick.senti@gmail.com
2. Social Networking – Trends 2012
more users … a higher share of time … for longer
Source: State of Media: The Social Media Report 2012, Nielsen, http://is.gd/LYHmnm
3. User Adoption Faster for New Entrants
[Chart: user growth in millions (logarithmic scale) vs. years since launch, for Facebook, Twitter, Tumblr, Instagram and Pinterest]
Source: author's compilation of company data, press statements, technical blogs & presentations
4. Staggering Volumes
Tumblr
- Page views: 500 million/day
- Reads: ~40k requests/second
- Writes: ~1 million/second
- New data: ~3 TB/day
- Servers: 1000
- Engineers: 20
Facebook
- Likes (counter): 2.7 billion/day
- Photos: 300 million/day
- Queries: 70'000/day
- New data: 500 TB/day
- Servers: “tens of thousands”
- Engineers: ~1700
Twitter
- Tweets (peak): ~25'000/second
- Tweets (avg): ~250 million/day (1000/second)
- API calls: 6 billion/day (70'000/second)
- New data: ~8 TB/day (80 MB/second)
- Engineers: 500 (of 1000 total employees)
Pinterest
- Page views: 2.3 billion/month
- Growth rate: 50% (visitors, March 2012)
- Machinery: 150 web servers, 90 caching servers, 70 database instances, 35 logging/internal
- Data size: 410 TB (user data)
- Employees: ~65 (NB: until end of 2011: 12)
Sources: http://is.gd/mpdOPN, http://is.gd/1vJ1il, http://is.gd/58X8ns, http://is.gd/LGexI6, http://is.gd/tZfNPA, http://is.gd/bcpCJc, http://is.gd/kXVEEF
5. Methodology
● Author's synthesis
  ● Information collected 2010 – 2012
  ● Mostly secondary research conducted on the internet
● Sources of information
  ● Engineering blogs by social network companies
  ● Research reports
  ● Technology documentation
  ● Public presentations at industry conferences
  ● Author's data analysis
● Threats to validity
  ● Subjective selection of information sources
  ● Non-systematic analysis and synthesis of the data gathered
6. Typical Scalability Approaches
● Load Balancing
● Static content on dedicated servers
● Caching
● Database Partitioning
● Replication (high availability)
● (How) Do these work at social-network scale?
7. Facebook
Functionality
- a type of blog
- user profile with personal data
- users 'friend' each other
- post public or private messages
Data Center
- owned by Facebook
Software Architecture
[Architecture diagram not reproduced – see source]
Source: Aaditya Agarwal, Facebook Architecture, QCon 2008, London
8. Twitter
Functionality
- 140-character messages
- users follow each other
- posts can contain pictures, media links etc.
Software architecture
- Ruby on Rails, Erlang
- since 2009: JVM, Scala
- MySQL
- Memcached
- Unicorn (Mongrel) web server
Data Center
- dedicated data center (outsourced)
Source: Krikorian R., Twitter's Real Time Architecture, QCon NYC 2012
9. tumblr
Functionality
- microblogging
- users follow each other
- dashboard similar to a Facebook page
Software architecture
- PHP, Ruby, Scala
- Redis, HBase, MySQL
- Memcache
- Thrift
Data Center
- started at Rackspace
- co-located, dedicated
Source: Tumblr Architecture – 15 Billion Page Views a Month and Harder to Scale than Twitter, High Scalability blog
Source: tumblr.com
10. Pinterest
Functionality
- photo-sharing pinboards
- categorize images, share with others
- mostly used by women (2012: 83%)
Software architecture
- Python
- Django
Data Center
- Amazon EC2, EBS, S3
Source: pinterest.com
Source: Jackson B., Pinterest growth driven by Amazon cloud scalability, 04.2012, techworld.com
11. Instagram
Functionality
- smartphone photo sharing
- post to other social networks
- send messages
Software architecture
- Python, Django
- PostgreSQL
- Redis
- Nginx
- Node.js
- Android
Data Center
- started with a single small-scale PC (up to 30+ million users)
- 100+ instances at Amazon (EC2, EBS; S3 for photos)
Source: Wikipedia
Employees
- 2010: 2 engineers; 2012: 5 engineers
- that's the total employee count
Source: Instagram, What Powers Instagram: Hundreds of Instances, Dozens of Technologies, Instagram Engineering Blog
12. Scalability Options
Scale up (more #CPUs, RAM, disk per machine)
- transparent scalability – scales 'out of the box'
- complex hardware (high cost)
- specialised knowledge
- more complex software (multi-core)
Scale out (more #machines)
- simple hardware (low cost)
- scale by numbers
- difficult to implement
- difficult to maintain (a myth?)
- more complex software (expensive licenses)
Either way
- scale by parallelization
- partition for fault tolerance
- replicate for reliability
This means
- decouple components
- asynchronous processing
- monitor to operate
13. Caching
● Goal: reduce response times for web site & data access
● Product: memcached (open source, initially developed 2003)
● Benefits: all accesses (read & write) are O(1)
14. memcached
Features
● Remote-accessible in-memory key/value cache
● Least Recently Used (LRU) eviction
● Shared-nothing, distributed architecture
Implementation
● memcached nodes map to key ranges (client-side hashing – no SPOF)
● Multi-threaded, event-based async network I/O (200'000 requests/s at Facebook)
● Single-node fault tolerance by consistent hashing scheme
[Diagram: web servers behind a load balancer; each client hashes a key to one of several memcached nodes holding disjoint key ranges: server = hashf(key) % #servers – sketched below]
Source: memcached.org
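To make the client-side hashing concrete, here is a minimal Python sketch of the idea. It is illustrative only: the node addresses are hypothetical and plain dicts stand in for real memcached connections.

import hashlib

# Hypothetical node list; a real client would read this from configuration.
NODES = ["10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211"]
nodes = {n: {} for n in NODES}  # dicts stand in for memcached connections

def node_for(key: str) -> str:
    # Every client computes the same mapping, so no coordinator is
    # needed (no SPOF): server = hashf(key) % #servers
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[h % len(NODES)]

def cache_set(key: str, value) -> None:
    nodes[node_for(key)][key] = value

def cache_get(key: str):
    return nodes[node_for(key)].get(key)

cache_set("user:42:profile", {"name": "Alice"})
print(cache_get("user:42:profile"))

The weakness of the plain modulo mapping: when a node is added or removed, #servers changes and almost every key suddenly maps to a different node – which is exactly what consistent hashing (next slide) fixes.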
15. Consistent Hashing in a nutshell
'Traditional' hashing: buckets contain a pre-defined range
=> at worst this requires re-building the full cache; every node may be affected
Consistent hashing: buckets are located on a ring and contain up to a pre-defined limit
=> at worst, only the keys of the failing node need to be re-mapped
server = min(s | s.location >= (hashf(key) % #locations))
[Diagram: key ranges placed on a hash ring; when a node fails, only its keys move to the next node on the ring]
Source: David Karger et al., Web Caching with Consistent Hashing, Computer Networks, Vol. 31, 1999
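A minimal consistent-hash ring in Python, matching the formula above. This is a sketch: production clients additionally place many virtual points per server on the ring to even out the key distribution, which is omitted here.

import bisect
import hashlib

class HashRing:
    """Toy consistent-hash ring: nodes sit at hashed positions on a ring."""

    def __init__(self, nodes):
        self._points = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        # server = min(s | s.location >= hashf(key)), wrapping at the top
        locations = [loc for loc, _ in self._points]
        i = bisect.bisect_left(locations, self._hash(key))
        return self._points[i % len(self._points)][1]

ring = HashRing(["cache-a", "cache-b", "cache-c"])
print(ring.node_for("user:42"))
# Dropping "cache-b" re-maps only the keys that lived on cache-b;
# with modulo hashing nearly all keys would move.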
16. Memcached Results
Results at Twitter
● 20 TB of data covering >30 services
● 2 trillion queries/day (>23 million queries/second)
● 100s of servers
● Modified memcached, released as “Twemcache”
Key objectives
● High availability
● Predictable performance
● Dynamic adaptation to size (grow/shrink)
● Monitoring of cache effectiveness
Source: Chris Aniszczyk, Caching with Twemcache, 07.2012, Twitter Engineering Blog
17. Shard your data
Shards
● horizontal partitions (e.g. by user, time, ...)
● distributed to multiple physical nodes => parallelized data access
● data typically denormalized
● similar data is replicated to all shards – e.g. static data
[Diagram: the web server's db-client routes each user to one of several database nodes holding disjoint userid ranges: node = hashf(userid) % #nodes – sketched below]
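A minimal sketch of user-based shard routing in Python, with in-memory SQLite standing in for the MySQL shard nodes; node names and the schema are illustrative.

import hashlib
import sqlite3

# In-memory SQLite databases stand in for the physical shard nodes.
SHARD_NAMES = ["node1", "node2", "node3", "node4"]
shards = {name: sqlite3.connect(":memory:") for name in SHARD_NAMES}
for db in shards.values():
    db.execute("CREATE TABLE posts (userid TEXT, body TEXT)")

def shard_for(userid: str) -> sqlite3.Connection:
    # node = hashf(userid) % #nodes – all of a user's rows land on one shard
    h = int(hashlib.md5(userid.encode()).hexdigest(), 16)
    return shards[SHARD_NAMES[h % len(SHARD_NAMES)]]

def save_post(userid: str, body: str) -> None:
    shard_for(userid).execute(
        "INSERT INTO posts VALUES (?, ?)", (userid, body))

def posts_of(userid: str):
    return shard_for(userid).execute(
        "SELECT body FROM posts WHERE userid = ?", (userid,)).fetchall()

save_post("alice", "hello world")
print(posts_of("alice"))

Because different users resolve to different nodes, their reads and writes run in parallel across the fleet.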
18. Sharding Results
Impressive results at Facebook
● 4 ms reads, 5 ms writes
● 60M queries/second (peak)
● 1800 MySQL servers
● Growth 20x (overall data, over two years)
What works
● Shard by user – group similar data into the same shard
● Linking across shards – store cross-references in both shards for two-way access (see the sketch below)
● Fault tolerance: a single-instance failure only affects a subset of users
● Consistent hashing – lets the node set grow and shrink gracefully
What doesn't
● Joins across shards – not possible efficiently
● Sharding by time – not helpful; one shard keeps running “hot”
● Sharding by function – not helpful; non-uniform distribution, hot spots, unique access patterns
● Fixed hashing – nodes become unbalanced, difficult to grow or shrink
Source: Facebook Techtalks, MySQL & HBase, December 5, 2011
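A self-contained sketch of the cross-reference idea (pure Python stand-ins, not Facebook's actual schema): a friendship between users living on different shards is written to both shards, so either side can be read without a cross-shard join.

# Three in-memory "shards"; each set stands in for a per-shard friends table.
shards = [set(), set(), set()]

def shard_for(userid: str) -> set:
    # hash() is stable within one process, which suffices for this sketch
    return shards[hash(userid) % len(shards)]

def add_friendship(a: str, b: str) -> None:
    shard_for(a).add((a, b))   # answerable from a's shard alone
    shard_for(b).add((b, a))   # answerable from b's shard alone

def friends_of(userid: str):
    return [f for u, f in shard_for(userid) if u == userid]

add_friendship("alice", "bob")
print(friends_of("alice"), friends_of("bob"))

The write is duplicated, but every read stays local to one shard – the trade-off that makes "what works" work.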
19. Managing shards
Results at Tumblr
● grouped into 5 global pools / 58 shard pools
● 28 TB
● 100 billion rows
● 200 DB servers
● no DBAs – 2 engineers keep this running at 50% of their time
Jetpants – DB management toolkit
● split shards into new shards
● master promotions
● clone slaves efficiently
● command line to work with the topology
● open sourced: https://github.com/tumblr/jetpants
Source: Elias E., Managing Large Sharded Topologies with Jetpants, 12.2012, Percona Live MySQL Conference
20. Asynchronous & Distributed Work
● Problem: do more work in less time
● Solution: distributed, asynchronous processing (MapReduce)
● Requirements
  ● split a job into multiple pieces
  ● distribute the work
  ● collect the results
  ● fault tolerant
● Technologies
  ● message queueing
  ● Gearman
  ● Hadoop / Pig
21. Asynchronous Work Example
Instagram Push Notifications
● Image uploads: all uploads go into a task queue
● ~200 worker processes asynchronously process the images (see the sketch below)
Gearman
● open source
● framework to distribute work
● load balancing
● no SPOF
Source: gearman.org
Source: Instagram, What Powers Instagram: Hundreds of Instances, Dozens of Technologies, 2012, Instagram Engineering Blog
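The pattern in miniature, as a generic Python sketch using the standard-library queue and threads rather than Gearman's actual API; process_image is a hypothetical handler.

import queue
import threading

tasks: queue.Queue = queue.Queue()

def process_image(photo_id: str) -> None:
    print(f"resized and filtered {photo_id}")   # hypothetical image work

def worker() -> None:
    # Workers pull jobs as they become free – load balancing falls out of
    # the shared queue; Instagram ran ~200 such worker processes.
    while True:
        photo_id = tasks.get()
        process_image(photo_id)
        tasks.task_done()

for _ in range(4):
    threading.Thread(target=worker, daemon=True).start()

# The upload request handler just enqueues and returns immediately:
tasks.put("photo-123")
tasks.join()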
22. Apache Hadoop
What it is
● distributed MapReduce engine
● fault tolerant
● asynchronous job scheduling
● scalable: e.g. a 4000-node cluster sorts 1 TB in 62 seconds
Data storage
● HDFS – distributed storage, scalable to multiple PB
● written in Java
● data replicated among 3 nodes
● block storage of 64 MB/block
● no SPOF
Apache Pig
● high-level query language
Sources: Apache Hadoop, Wikipedia, The Free Encyclopedia, accessed January 8, 2013;
Weil K., NoSQL at Twitter, 04.2010, NoSQL EU 2010
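To show the MapReduce programming model that Hadoop distributes across a cluster, here is a toy in-process version in Python, counting tweets per user over made-up records.

from itertools import groupby

# Made-up input records: (user, tweet text)
tweets = [("alice", "hello"), ("bob", "hi"), ("alice", "again")]

def map_phase(record):
    user, _text = record
    yield (user, 1)                    # emit one key/value pair per record

def reduce_phase(user, counts):
    return (user, sum(counts))         # combine all values for one key

# Shuffle: group the intermediate pairs by key – on a real cluster Hadoop
# does this over the network between the map and reduce phases.
pairs = sorted(kv for record in tweets for kv in map_phase(record))
results = [reduce_phase(user, [v for _, v in group])
           for user, group in groupby(pairs, key=lambda kv: kv[0])]
print(results)   # [('alice', 2), ('bob', 1)]

Pig provides the high-level query language; a job like the tweet count on the next slide compiles down to exactly this map/shuffle/reduce shape.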
23. Results
NoSQL at Twitter
● store 7 TB of new data/day
● at HD speed (~80 MB/s), writing that sequentially takes 24.3 hours
● => need to parallelize writes and reads
Analysis using Pig
● count all tweets: 12 billion, in 5 minutes
Source: Weil K., NoSQL at Twitter, 04.2010, NoSQL EU 2010
25. Service Oriented Architecture
“Onion-Style”
outer services
- public (e.g. REST)
- user interface
- typically scripted (Python, Ruby, JavaScript)
inner services
- private & highly efficient
- data access, calculation etc.
- workers to accomplish work in parallel
- mix of languages (Java, Scala, Python, C, ...)
fire hose
- highly available, scalable service bus
- distribute services as needed
- typically asynchronous
26. Tumblr Firehose
Results
- 4 x CPU @ 72 GB RAM, 2 disks
- provide 1 week of streams (1 week of Tumblr posts)
- ~400k messages/second
Apache Kafka
- O(1) persistent message queue
- several 100K messages/s
- pub/sub interface
finagle
- asynchronous RPC system
- JVM-hosted languages (Java, Scala, ...)
- connection pools, failure detectors, failover, load-balancing, back-pressure ...
Apache ZooKeeper (cluster)
- distributed coordination
- highly available
[Diagram: a new post enters Kafka; finagle-based services expose an internal API (Thrift) and a public API (JSON) to HTTP clients, coordinated by ZooKeeper]
Source: Blake M., Tumblr Firehose – The Gory Details, 2012, Tumblr Engineering Blog
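The pub/sub core of such a firehose, sketched with the modern kafka-python client. Tumblr's 2012 deployment predates this API, so treat it as illustrative; the broker address and topic name are assumptions.

import json
from kafka import KafkaConsumer, KafkaProducer  # pip install kafka-python

# Producer side: the application publishes each new post to a topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode())
producer.send("posts", {"blog": "example", "body": "a new post"})
producer.flush()

# Consumer side: each subscriber reads the retained stream independently;
# auto_offset_reset lets a new consumer replay from the start of retention
# (the "1 week of streams" above).
consumer = KafkaConsumer(
    "posts",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode()))
for message in consumer:
    print(message.value)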
27. SOA revisited – network efficiency
consumer: 1. serialize, 2. wait for response, 3. deserialize
interface: CORBA, HTTP/JSON, WSDL/XML/SOAP, ...
provider: 1. deserialize, 2. provide response, 3. serialize
efficient?
28. Apache thrift – optimized wire protocol
What it is
● cross-language service implementation
● human-readable interface definition language (non-XML)
● code-generation engine (C++, Java, Python, JavaScript, …)
● binary wire protocol
Benefits
● low-overhead serialization/deserialization
● native language bindings (no XML parsing or XSD)
● efficient protocol implementation
29. thrift example
interface
struct UserProfile {
  1: i32 uid,
  2: string name,
  3: string blurb
}
service UserStorage {
  void store(1: UserProfile user),
  UserProfile retrieve(1: i32 uid)
}
client (Python)
# plus imports of the thrift runtime (TSocket, TBinaryProtocol) and of the
# UserStorage / UserProfile modules generated by the thrift compiler
# Make an object
up = UserProfile(uid=1,
                 name="Test User",
                 blurb="Thrift is great")
# Talk to a server via TCP sockets, binary protocol
transport = TSocket.TSocket("localhost", 9090)
transport.open()
protocol = TBinaryProtocol.TBinaryProtocol(transport)
# Use the service we already defined
service = UserStorage.Client(protocol)
service.store(up)
up2 = service.retrieve(1)
service implementation (C++)
class UserStorageHandler : virtual public UserStorageIf {
 public:
  UserStorageHandler() {
    // Your initialization goes here
  }
  void store(const UserProfile& user) {
    // Your implementation goes here
    printf("store\n");
  }
  void retrieve(UserProfile& _return, const int32_t uid) {
    // Your implementation goes here
    printf("retrieve\n");
  }
};
// main ...
Source: thrift.apache.org
30. Serialization / Deserialization Performance
Benchmark
- CPU: Core i7 2.7 GHz
- serialization of a service message (media descriptor of a video)
Results
- serialization time: thrift -66%
- deserialization time: thrift -92%
- message size: thrift -19%
Source: author's testing
31. redis: In-Memory DB
Problem
Require the speed of a cache with the query semantics, persistence and fault tolerance of a DB (cluster)
Solution
redis.io – a distributed in-memory DB
Redis
● fast: O(1) access times – 100'000 writes/second, 80'000 reads/second
● fault-tolerant
● datatypes: strings, hashes, lists, sets, sorted sets
● complex queries: intersection, subset, sort, …
● more than just a DB: pub/sub channels
[Diagram: redis masters hold disjoint key ranges and replicate asynchronously to slaves, which serve consumers]
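A quick tour of those datatypes and queries with the redis-py client – a sketch that assumes a redis server on localhost and uses illustrative key names.

import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379)

r.set("user:1:name", "alice")                        # string
r.lpush("user:1:notifications", "you got a like")    # list
r.sadd("likes:post:7", "alice", "bob")               # sets ...
r.sadd("likes:post:9", "bob")
print(r.sinter("likes:post:7", "likes:post:9"))      # ... intersection query
r.zadd("leaderboard", {"alice": 42, "bob": 17})      # sorted set
print(r.zrange("leaderboard", 0, -1, withscores=True))

r.publish("events", "new-post")                      # pub/sub channel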
32. redis results
tumblr
● >7500 notifications/second (well above MySQL's max. concurrent limit)
● <5 ms response-time requirement
● Redis: 30'000 requests/second
Source: Blake M., Staircar: Redis-powered notifications, 07.2011, Tumblr Engineering Blog
33. Automate everything & Monitor
If just two engineers are to
● maintain dozens of databases
● run 100+ servers
● scale a system to 30+ million users
… then automation is like air to breathe …
… and monitoring is the lifeline
[Screenshot: operations dashboard at Twitter]
Source: Adams J., Scaling Twitter, 2010, Chirp Conference
34. Cell Architecture
● self-contained cells of data + logic
● each cell itself made up of a cluster of nodes
● cells provide internal failover
● reliability
● scalability
[Diagram: a client asks a discovery service (consistent hashing by user-id) for its cell; each cell is an application server cluster with a metadata store (HBase)]
Source: Malik P., Scaling the Messages Application Back End, 04.2011, Facebook Engineering Notes
36. Take Away for Application Development
Scalability => Distribution
● loosely coupled components (accessible via APIs, services)
● shared nothing
● efficiency at every level
Reliability => Replication
● monitoring
● automation
● fast provisioning of replicas
Flexibility => Simplification
● build for simple use
● abstract to simplify (e.g. Pig/Hadoop, Redis/in-memory DB)
● API-everything
37. Paradigm Shift?
New normal
● <5 engineers
● 100s of machines
● PBs of data
● distributed workload
● horizontal scalability
Drivers
● low barriers to entry – free or low-cost hosting
● declining cost of CPU, storage and networking
● web-scale-ready open-source software