What are the techniques and technologies used by popular social networking sites such as Facebook, Twitter, Tumblr, Pinterest or Instagram? How do they architect their systems to scale to hundreds of millions of visits per day?
Lessons from Highly Scalable Architectures at Social Networking Sites
1. Software Engineering in a Cloud World
Lessons from highly scalable architectures at social networking sites
Patrick Senti
patrick.senti@gmail.com
2. Social Networking – Trends 2012
more users … a higher share of time … for longer
Source: State of Media: The Social Media Report 2012, Nielsen, http://is.gd/LYHmnm
3. User Adoption Faster for New Entrants
[Chart: user growth in millions (logarithmic scale) vs. years since launch, for Facebook, Twitter, Tumblr, Instagram and Pinterest]
Source: author's compilation of company data, press statements, technical blogs & presentations
4. Staggering Volumes
Tumblr
- Page views: 500 million/day
- Reads: ~40k requests/second
- Writes: ~1 million/second
- New data: ~3 TB/day
- Servers: 1000
- Engineers: 20
Facebook
- Likes (counter): 2.7 billion/day
- Photos: 300 million/day
- Queries: 70'000/day
- New data: 500 TB/day
- Servers: “tens of thousands”
- Engineers: ~1700
Twitter
- Tweets (peak): ~25'000/second
- Tweets (avg): ~250 million/day (1000/second)
- API calls: 6 billion/day (70'000/second)
- New data: ~8 TB/day (80 MB/second)
- Engineers: 500 (of 1000 total employees)
Pinterest
- Page views: 2.3 billion/month
- Growth rate: 50% (visitors, March 2012)
- Machinery: 150 web servers, 90 caching servers, 70 database instances, 35 logging/internal
- Data size: 410 TB (user data)
- Employees: ~65 (NB: until end of 2011: 12)
Sources: http://is.gd/mpdOPN, http://is.gd/1vJ1il, http://is.gd/58X8ns, http://is.gd/LGexI6, http://is.gd/tZfNPA, http://is.gd/bcpCJc, http://is.gd/kXVEEF
5. Methodology
● Author's synthesis
  ● Information collected 2010 – 2012
  ● Mostly secondary research conducted on the internet
● Sources of information
  ● Engineering blogs by social network companies
  ● Research reports
  ● Technology documentation
  ● Public presentations at industry conferences
  ● Author's data analysis
● Threats to validity
  ● Subjective selection of information sources
  ● Non-systematic analysis and synthesis of the data gathered
6. Typical Scalability Approaches
● Load Balancing
● Static content on dedicated servers
● Caching
● Database Partitioning
● Replication (high availability)
● (How) Do these work at social-network scale?
7. Facebook
Functionality
- a type of blog
- user profile with personal data
- users 'friend' each other
- post public or private messages
Data Center
- owned by Facebook
Software Architecture
[Architecture diagram not reproduced – see source]
Source: Aaditya Agarwal, Facebook Architecture, QCon 2008, London
8. Twitter
Functionality
- 140-character messages
- users follow each other
- posts can contain pictures, media links etc.
Software architecture
- Ruby on Rails, Erlang
- since 2009: JVM, Scala
- MySQL
- Memcached
- Unicorn (Mongrel) web server
Data Center
- dedicated data center (outsourced)
Source: Krikorian R., Twitter's Real Time Architecture, QCon NYC 2012
9. tumblr
Functionality
- microblogging
- users follow each other
- dashboard similar to a Facebook page
Software architecture
- PHP, Ruby, Scala
- Redis, HBase, MySQL
- Memcache
- Thrift
Data Center
- started at Rackspace
- co-located, dedicated
Source: Tumblr Architecture – 15 Billion Page Views a Month and Harder to Scale than Twitter, High Scalability blog
Source: tumblr.com
10. Pinterest
Functionality
- photo-sharing pinboards
- categorize images, share with others
- mostly used by women (2012: 83%)
Software architecture
- Python
- Django
Data Center
- Amazon EC2, EBS, S3
Source: pinterest.com
Source: Jackson B., Pinterest growth driven by Amazon cloud scalability, 04.2012, techworld.com
11. Instagram
Functionality
- smartphone photo sharing
- post to other social networks
- send messages
Software architecture
- Python, Django
- PostgreSQL
- Redis
- Nginx
- Node.js
- Android
Data Center
- started with a single small-scale PC (up to 30+ million users)
- 100+ instances at Amazon (EC2, EBS; S3 for photos)
Source: Wikipedia
Employees
- 2010: 2 engineers; 2012: 5 engineers
- that's the total employee count
Source: Instagram, What Powers Instagram: Hundreds of Instances, Dozens of Technologies, Instagram Engineering Blog
12. Scalability Options
Scale up (more #CPUs, RAM, disk per machine)
- transparent scalability – scales 'out of the box'
- complex hardware (high cost)
- specialised knowledge
- more complex software (multi-core)
Scale out (more #machines)
- simple hardware (low cost)
- scale by numbers
- difficult to implement
- difficult to maintain (a myth?)
- more complex software (expensive licenses)
Either way
- scale by parallelization
- partition for fault tolerance
- replicate for reliability
This means
- decouple components
- asynchronous processing
- monitor to operate
13. Caching
● Goal: reduce response times for web site & data access
● Product: memcached (open source, initially developed 2003)
● Benefits: all accesses (read & write) are O(1)
14. memcached
Features
● Remote-accessible in-memory key/value cache
● Least Recently Used (LRU) eviction
● Shared-nothing, distributed architecture
Implementation
● memcached nodes map to key ranges (client-side hashing – no SPOF)
● Multi-threaded, event-based async network I/O (200'000 requests/s at Facebook)
● Single-node fault tolerance by consistent hashing scheme
[Diagram: web servers behind a load balancer; each client hashes a key to one of several memcached nodes holding disjoint key ranges: server = hashf(key) % #servers – sketched below]
Source: memcached.org
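To make the client-side hashing concrete, here is a minimal Python sketch of the idea. It is illustrative only: the node addresses are hypothetical and plain dicts stand in for real memcached connections.

import hashlib

# Hypothetical node list; a real client would read this from configuration.
NODES = ["10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211"]
nodes = {n: {} for n in NODES}  # dicts stand in for memcached connections

def node_for(key: str) -> str:
    # Every client computes the same mapping, so no coordinator is
    # needed (no SPOF): server = hashf(key) % #servers
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[h % len(NODES)]

def cache_set(key: str, value) -> None:
    nodes[node_for(key)][key] = value

def cache_get(key: str):
    return nodes[node_for(key)].get(key)

cache_set("user:42:profile", {"name": "Alice"})
print(cache_get("user:42:profile"))

The weakness of the plain modulo mapping: when a node is added or removed, #servers changes and almost every key suddenly maps to a different node – which is exactly what consistent hashing (next slide) fixes.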
15. Consistent Hashing in a nutshell
'Traditional' hashing: buckets contain a pre-defined range
=> at worst this requires re-building the full cache; every node may be affected
Consistent hashing: buckets are located on a ring and contain up to a pre-defined limit
=> at worst, only the keys of the failing node need to be re-mapped
server = min(s | s.location >= (hashf(key) % #locations))
[Diagram: key ranges placed on a hash ring; when a node fails, only its keys move to the next node on the ring]
Source: David Karger et al., Web Caching with Consistent Hashing, Computer Networks, Vol. 31, 1999
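A minimal consistent-hash ring in Python, matching the formula above. This is a sketch: production clients additionally place many virtual points per server on the ring to even out the key distribution, which is omitted here.

import bisect
import hashlib

class HashRing:
    """Toy consistent-hash ring: nodes sit at hashed positions on a ring."""

    def __init__(self, nodes):
        self._points = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        # server = min(s | s.location >= hashf(key)), wrapping at the top
        locations = [loc for loc, _ in self._points]
        i = bisect.bisect_left(locations, self._hash(key))
        return self._points[i % len(self._points)][1]

ring = HashRing(["cache-a", "cache-b", "cache-c"])
print(ring.node_for("user:42"))
# Dropping "cache-b" re-maps only the keys that lived on cache-b;
# with modulo hashing nearly all keys would move.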
16. Memcached Results
Results at Twitter
● 20 TB of data covering >30 services
● 2 trillion queries/day (>23 million queries/second)
● 100s of servers
● Modified memcached, released as “Twemcache”
Key objectives
● High availability
● Predictable performance
● Dynamic adaptation to size (grow/shrink)
● Monitoring of cache effectiveness
Source: Chris Aniszczyk, Caching with Twemcache, 07.2012, Twitter Engineering Blog
17. Shard your data
Shards
● horizontal partitions (e.g. by user, time, ...)
● distributed to multiple physical nodes => parallelized data access
● data typically denormalized
● similar data is replicated to all shards – e.g. static data
[Diagram: the web server's db-client routes each user to one of several database nodes holding disjoint userid ranges: node = hashf(userid) % #nodes – sketched below]
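A minimal sketch of user-based shard routing in Python, with in-memory SQLite standing in for the MySQL shard nodes; node names and the schema are illustrative.

import hashlib
import sqlite3

# In-memory SQLite databases stand in for the physical shard nodes.
SHARD_NAMES = ["node1", "node2", "node3", "node4"]
shards = {name: sqlite3.connect(":memory:") for name in SHARD_NAMES}
for db in shards.values():
    db.execute("CREATE TABLE posts (userid TEXT, body TEXT)")

def shard_for(userid: str) -> sqlite3.Connection:
    # node = hashf(userid) % #nodes – all of a user's rows land on one shard
    h = int(hashlib.md5(userid.encode()).hexdigest(), 16)
    return shards[SHARD_NAMES[h % len(SHARD_NAMES)]]

def save_post(userid: str, body: str) -> None:
    shard_for(userid).execute(
        "INSERT INTO posts VALUES (?, ?)", (userid, body))

def posts_of(userid: str):
    return shard_for(userid).execute(
        "SELECT body FROM posts WHERE userid = ?", (userid,)).fetchall()

save_post("alice", "hello world")
print(posts_of("alice"))

Because different users resolve to different nodes, their reads and writes run in parallel across the fleet.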
18. Sharding Results
Impressive results at Facebook
● 4 ms reads, 5 ms writes
● 60M queries/second (peak)
● 1800 MySQL servers
● Growth 20x (overall data, over two years)
What works
● Shard by user – group similar data into the same shard
● Linking across shards – store cross-references in both shards for two-way access (see the sketch below)
● Fault tolerance: a single-instance failure only affects a subset of users
● Consistent hashing – lets the node set grow and shrink gracefully
What doesn't
● Joins across shards – not possible efficiently
● Sharding by time – not helpful; one shard keeps running “hot”
● Sharding by function – not helpful; non-uniform distribution, hot spots, unique access patterns
● Fixed hashing – nodes become unbalanced, difficult to grow or shrink
Source: Facebook Techtalks, MySQL & HBase, December 5, 2011
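A self-contained sketch of the cross-reference idea (pure Python stand-ins, not Facebook's actual schema): a friendship between users living on different shards is written to both shards, so either side can be read without a cross-shard join.

# Three in-memory "shards"; each set stands in for a per-shard friends table.
shards = [set(), set(), set()]

def shard_for(userid: str) -> set:
    # hash() is stable within one process, which suffices for this sketch
    return shards[hash(userid) % len(shards)]

def add_friendship(a: str, b: str) -> None:
    shard_for(a).add((a, b))   # answerable from a's shard alone
    shard_for(b).add((b, a))   # answerable from b's shard alone

def friends_of(userid: str):
    return [f for u, f in shard_for(userid) if u == userid]

add_friendship("alice", "bob")
print(friends_of("alice"), friends_of("bob"))

The write is duplicated, but every read stays local to one shard – the trade-off that makes "what works" work.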
19. Managing shards
Results at Tumblr
● grouped into 5 global pools / 58 shard pools
● 28 TB
● 100 billion rows
● 200 DB servers
● no DBAs – 2 engineers keep this running at 50% of their time
Jetpants – DB management toolkit
● split shards into new shards
● master promotions
● clone slaves efficiently
● command line to work with the topology
● open sourced: https://github.com/tumblr/jetpants
Source: Elias E., Managing Large Sharded Topologies with Jetpants, 12.2012, Percona Live MySQL Conference
20. Asynchronous & Distributed Work
● Problem: do more work in less time
● Solution: distributed, asynchronous processing (MapReduce)
● Requirements
  ● split a job into multiple pieces
  ● distribute the work
  ● collect the results
  ● fault tolerant
● Technologies
  ● message queueing
  ● Gearman
  ● Hadoop / Pig
21. Asynchronous Work Example
Instagram Push Notifications
● Image uploads: all uploads go into a task queue
● ~200 worker processes asynchronously process the images (see the sketch below)
Gearman
● open source
● framework to distribute work
● load balancing
● no SPOF
Source: gearman.org
Source: Instagram, What Powers Instagram: Hundreds of Instances, Dozens of Technologies, 2012, Instagram Engineering Blog
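The pattern in miniature, as a generic Python sketch using the standard-library queue and threads rather than Gearman's actual API; process_image is a hypothetical handler.

import queue
import threading

tasks: queue.Queue = queue.Queue()

def process_image(photo_id: str) -> None:
    print(f"resized and filtered {photo_id}")   # hypothetical image work

def worker() -> None:
    # Workers pull jobs as they become free – load balancing falls out of
    # the shared queue; Instagram ran ~200 such worker processes.
    while True:
        photo_id = tasks.get()
        process_image(photo_id)
        tasks.task_done()

for _ in range(4):
    threading.Thread(target=worker, daemon=True).start()

# The upload request handler just enqueues and returns immediately:
tasks.put("photo-123")
tasks.join()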
22. Apache Hadoop
What it is
● distributed MapReduce engine
● fault tolerant
● asynchronous job scheduling
● scalable: e.g. a 4000-node cluster sorts 1 TB in 62 seconds
Data storage
● HDFS – distributed storage, scalable to multiple PB
● written in Java
● data replicated among 3 nodes
● block storage of 64 MB/block
● no SPOF
Apache Pig
● high-level query language
Sources: Apache Hadoop, Wikipedia, The Free Encyclopedia, accessed January 8, 2013;
Weil K., NoSQL at Twitter, 04.2010, NoSQL EU 2010
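To show the MapReduce programming model that Hadoop distributes across a cluster, here is a toy in-process version in Python, counting tweets per user over made-up records.

from itertools import groupby

# Made-up input records: (user, tweet text)
tweets = [("alice", "hello"), ("bob", "hi"), ("alice", "again")]

def map_phase(record):
    user, _text = record
    yield (user, 1)                    # emit one key/value pair per record

def reduce_phase(user, counts):
    return (user, sum(counts))         # combine all values for one key

# Shuffle: group the intermediate pairs by key – on a real cluster Hadoop
# does this over the network between the map and reduce phases.
pairs = sorted(kv for record in tweets for kv in map_phase(record))
results = [reduce_phase(user, [v for _, v in group])
           for user, group in groupby(pairs, key=lambda kv: kv[0])]
print(results)   # [('alice', 2), ('bob', 1)]

Pig provides the high-level query language; a job like the tweet count on the next slide compiles down to exactly this map/shuffle/reduce shape.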
23. Results
NoSQL at Twitter
● store 7 TB of new data/day
● at HD speed (~80 MB/s), writing that sequentially takes 24.3 hours
● => need to parallelize writes and reads
Analysis using Pig
● count all tweets: 12 billion, in 5 minutes
Source: Weil K., NoSQL at Twitter, 04.2010, NoSQL EU 2010
25. Service Oriented Architecture
“Onion-Style”
outer services
- public (e.g. REST)
- user interface
- typically scripted (Python, Ruby, JavaScript)
inner services
- private & highly efficient
- data access, calculation etc.
- workers to accomplish work in parallel
- mix of languages (Java, Scala, Python, C, ...)
fire hose
- highly available, scalable service bus
- distribute services as needed
- typically asynchronous
26. Tumblr Firehose
Results
- 4 x CPU @ 72 GB RAM, 2 disks
- provide 1 week of streams (1 week of Tumblr posts)
- ~400k messages/second
Apache Kafka
- O(1) persistent message queue
- several 100K messages/s
- pub/sub interface
finagle
- asynchronous RPC system
- JVM-hosted languages (Java, Scala, ...)
- connection pools, failure detectors, failover, load-balancing, back-pressure ...
Apache ZooKeeper (cluster)
- distributed coordination
- highly available
[Diagram: a new post enters Kafka; finagle-based services expose an internal API (Thrift) and a public API (JSON) to HTTP clients, coordinated by ZooKeeper]
Source: Blake M., Tumblr Firehose – The Gory Details, 2012, Tumblr Engineering Blog
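The pub/sub core of such a firehose, sketched with the modern kafka-python client. Tumblr's 2012 deployment predates this API, so treat it as illustrative; the broker address and topic name are assumptions.

import json
from kafka import KafkaConsumer, KafkaProducer  # pip install kafka-python

# Producer side: the application publishes each new post to a topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode())
producer.send("posts", {"blog": "example", "body": "a new post"})
producer.flush()

# Consumer side: each subscriber reads the retained stream independently;
# auto_offset_reset lets a new consumer replay from the start of retention
# (the "1 week of streams" above).
consumer = KafkaConsumer(
    "posts",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode()))
for message in consumer:
    print(message.value)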
27. SOA revisited – network efficiency
consumer: 1. serialize, 2. wait for response, 3. deserialize
interface: CORBA, HTTP/JSON, WSDL/XML/SOAP, ...
provider: 1. deserialize, 2. provide response, 3. serialize
efficient?
28. Apache thrift – optimized wire protocol
What it is
● cross-language service implementation
● human-readable interface definition language (non-XML)
● code-generation engine (C++, Java, Python, JavaScript, …)
● binary wire protocol
Benefits
● low-overhead serialization/deserialization
● native language bindings (no XML parsing or XSD)
● efficient protocol implementation
29. thrift example
interface
struct UserProfile {
  1: i32 uid,
  2: string name,
  3: string blurb
}
service UserStorage {
  void store(1: UserProfile user),
  UserProfile retrieve(1: i32 uid)
}
client (Python)
# plus imports of the thrift runtime (TSocket, TBinaryProtocol) and of the
# UserStorage / UserProfile modules generated by the thrift compiler
# Make an object
up = UserProfile(uid=1,
                 name="Test User",
                 blurb="Thrift is great")
# Talk to a server via TCP sockets, binary protocol
transport = TSocket.TSocket("localhost", 9090)
transport.open()
protocol = TBinaryProtocol.TBinaryProtocol(transport)
# Use the service we already defined
service = UserStorage.Client(protocol)
service.store(up)
up2 = service.retrieve(1)
service implementation (C++)
class UserStorageHandler : virtual public UserStorageIf {
 public:
  UserStorageHandler() {
    // Your initialization goes here
  }
  void store(const UserProfile& user) {
    // Your implementation goes here
    printf("store\n");
  }
  void retrieve(UserProfile& _return, const int32_t uid) {
    // Your implementation goes here
    printf("retrieve\n");
  }
};
// main ...
Source: thrift.apache.org
30. Serialization / Deserialization Performance
Benchmark
- CPU: Core i7 2.7 GHz
- serialization of a service message (media descriptor of a video)
Results
- serialization time: thrift -66%
- deserialization time: thrift -92%
- message size: thrift -19%
Source: author's testing
31. redis: In-Memory DB
Problem
Require the speed of a cache with the query semantics, persistence and fault tolerance of a DB (cluster)
Solution
redis.io – a distributed in-memory DB
Redis
● fast: O(1) access times – 100'000 writes/second, 80'000 reads/second
● fault-tolerant
● datatypes: strings, hashes, lists, sets, sorted sets
● complex queries: intersection, subset, sort, …
● more than just a DB: pub/sub channels
[Diagram: redis masters hold disjoint key ranges and replicate asynchronously to slaves, which serve consumers]
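A quick tour of those datatypes and queries with the redis-py client – a sketch that assumes a redis server on localhost and uses illustrative key names.

import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379)

r.set("user:1:name", "alice")                        # string
r.lpush("user:1:notifications", "you got a like")    # list
r.sadd("likes:post:7", "alice", "bob")               # sets ...
r.sadd("likes:post:9", "bob")
print(r.sinter("likes:post:7", "likes:post:9"))      # ... intersection query
r.zadd("leaderboard", {"alice": 42, "bob": 17})      # sorted set
print(r.zrange("leaderboard", 0, -1, withscores=True))

r.publish("events", "new-post")                      # pub/sub channel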
32. redis results
tumblr
● >7500 notifications/second (well above MySQL's max. concurrent limit)
● <5 ms response-time requirement
● Redis: 30'000 requests/second
Source: Blake M., Staircar: Redis-powered notifications, 07.2011, Tumblr Engineering Blog
33. Automate everything & Monitor
If just two engineers are to
● maintain dozens of databases
● run 100+ servers
● scale a system to 30+ million users
… then automation is like air to breathe …
… and monitoring is the lifeline
[Screenshot: operations dashboard at Twitter]
Source: Adams J., Scaling Twitter, 2010, Chirp Conference
34. Cell Architecture
● self-contained cells of data + logic
● each cell itself made up of a cluster of nodes
● cells provide internal failover
● reliability
● scalability
[Diagram: a client asks a discovery service (consistent hashing by user-id) for its cell; each cell is an application server cluster with a metadata store (HBase)]
Source: Malik P., Scaling the Messages Application Back End, 04.2011, Facebook Engineering Notes
36. Take Away for Application Development
Scalability => Distribution
● loosely coupled components (accessible via APIs, services)
● shared nothing
● efficiency at every level
Reliability => Replication
● monitoring
● automation
● fast provisioning of replicas
Flexibility => Simplification
● build for simple use
● abstract to simplify (e.g. Pig/Hadoop, Redis/in-memory DB)
● API-everything
37. Paradigm Shift?
New normal
● <5 engineers
● 100s of machines
● PBs of data
● distributed workload
● horizontal scalability
Drivers
● low barriers to entry – free or low-cost hosting
● declining cost of CPU, storage and networking
● web-scale-ready open-source software