4. Twitter’s infrastructure
● Twitter founded in 2006
● Global-scale application
● Unique scale and performance characteristics
● Real-time
● Purpose-built and well optimized
● Large data centers
5. Strategic questions
1. What is the long-term mix of cloud versus data center?
2. Which cloud provider(s) should we use?
3. How can we be confident in this type of decision?
4. Why should we evaluate this now (2016)?
6. Tactical questions
1. What is the feasibility and cost of large-scale adoption?
2. Which workloads are best suited for the cloud, and are they separable?
3. How would our architecture change on the cloud?
4. How do we get to an actionable plan?
7. Evaluation process
● Started evaluation in 2016
● Were able to make a patient, rigorous decision
● Defined baseline workload requirements
● Engaged major providers
● Analyzed clouds for each major workload
● Built overall cloud plan
● Iterated and optimized choices
8. Evaluation Timeline
● June ’16: Considering moving; initial Cloud RfP released
● Sept ’16: 27 synthetic PoCs on GCP begin; testing projects / network established
● Mar ’17: PoCs completed & results delivered
● July ’17: Kickoff of Dataproc, BigQuery, Dataflow experimentation
● Nov ’17: v1 Hadoop-on-GCP architecture ratified; consensus built with Product, Revenue, Eng
● Jan ’18: Proposal to migrate Hadoop to GCP formally accepted
● Apr ’18: Legal agreement with T&Cs ratified; security and platform review; build of migration plan begins
● June ’18: Migration kickoff
9. Built overall cloud plan
● Created a series of candidate architectures for each platform, with their resource requirements
● Developed a migration project plan & timeline
● Created financial projections
● Factored in other business considerations
10. Financial modeling
● 10-year time horizon to avoid timing artifacts
● Compared on-premise and multiple cloud scenarios
● Modeled both migration costs and long-term costs
● Long-term price/performance curves (e.g. Moore’s Law, historical pricing)
● Two independent models to guard against modeling errors
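The modeling approach above can be sketched roughly as follows. Every number here is an illustrative assumption (not Twitter's actual figures): the year-one costs, the growth rate, and the annual price/performance decline rates are placeholders.

```python
# Illustrative sketch of a 10-year cost comparison like the one described
# above. All numbers are made-up assumptions, not Twitter's actual figures.

def ten_year_cost(year1_cost, annual_price_decline, growth=0.0, years=10):
    """Sum yearly costs over a horizon, applying a price/performance
    decline (a Moore's-Law-style curve) and workload growth."""
    total = 0.0
    cost = year1_cost
    for _ in range(years):
        total += cost
        # Next year: workload grows, unit prices fall.
        cost = cost * (1 + growth) * (1 - annual_price_decline)
    return total

# Hypothetical scenarios: on-premise hardware vs. a cloud provider,
# with different assumed long-term price/performance curves.
on_prem = ten_year_cost(year1_cost=100.0, annual_price_decline=0.15, growth=0.20)
cloud = ten_year_cost(year1_cost=120.0, annual_price_decline=0.10, growth=0.20)
migration_cost = 30.0  # one-time cost, assumed

print(f"on-prem 10y: {on_prem:.0f}, cloud 10y: {cloud + migration_cost:.0f}")
```

Building two such models independently, as the slide suggests, makes it more likely that discrepancies reveal modeling errors rather than real cost differences.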
11. What we found
● An immediate all-in migration at Twitter scale is expensive, distracting, and risky
● More value from new architectures and transformation, so start smaller and learn as we go
● Hadoop offered several important, specific benefits with lower risk
● We gained confidence in our investments in both cloud projects and data centers
13. Twitter Hadoop cluster types

Type        Use                                                           Compute %
Real-time   Critical-performance production jobs with dedicated capacity  10%
Processing  Regularly scheduled production jobs with dedicated capacity   60%
Ad-hoc      One-off / ad-hoc queries and analysis                         30%
Cold        Dense storage clusters, not for compute                       minimal
14. Twitter Hadoop challenges
1. Scaling: Significant YoY Compute & Storage growth
2. Hardware: Designing, building, maintaining & operating
3. Capacity Planning: Hard to predict, especially for ad-hoc
4. Agility: Must respond fast, especially for ad-hoc compute
5. Deployment: Must deploy at scale and in-flight
6. Network: Both cross-DC and cross-cluster
7. Disaster Recovery: Durable copies needed in 2+ DCs
15. Twitter Hadoop requirements
● Network sustained bandwidth per core
● Disk (data) sustained bandwidth per core
● Large sequential reads & writes
● Throughput not latency
● Capacity
● CPU / RAM not usually the bottleneck
● Consistency of datasets (set of HDFS files)
16. Twitter Hadoop on-premise hardware numbers
● Clusters: 10 to 10K nodes
● Network: 10G, moving to 25G
● Data disks: 24T–72T over 12 HDDs
● CPU: 8 cores with 64G memory
● I/O:
○ Network: ~20MB/s sustained, peaks of 10x
○ HDFS read: 20 rq/s sustained, peaks of 3x
○ HDFS write: large variation
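The benchmark results later in this deck are "normalized by IO-per-core." A minimal sketch of that normalization, using the on-premise figures from this slide; the cloud-side measurement below is a placeholder assumption, not a real result.

```python
# Sketch of the "IO-per-core" normalization used later to compare cloud
# configurations against the on-premise hardware described above.
# The on-premise figures come from this slide; the cloud figure is a
# placeholder assumption for illustration.

def io_per_core(sustained_mb_s, cores):
    """Sustained I/O bandwidth divided by core count."""
    return sustained_mb_s / cores

# On-premise node: 8 cores, ~20 MB/s sustained network I/O
on_prem = io_per_core(sustained_mb_s=20.0, cores=8)  # 2.5 MB/s per core

# Hypothetical cloud VM measurement (e.g. an 8-vCPU instance)
cloud = io_per_core(sustained_mb_s=40.0, cores=8)  # 5.0 MB/s per core

speedup = cloud / on_prem
print(f"normalized speedup vs on premise: ~{speedup:.1f}x")
```

Normalizing per core matters because instance shapes differ: a cloud VM with half the cores but the same disk bandwidth would otherwise look slower than it effectively is.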
17. Cloud architectural options
1. Hadoop-as-a-Service (HaaS) from the cloud provider
2. Twitter Hadoop on cloud VMs
● Durable storage: cloud object store
● Scratch storage:
a. HDFS over cloud object store
b. HDFS on cloud block store
c. HDFS on local disks
18. Testing plan
1. Baseline tests
● TestDFSIO: low-level IO read/write
● Teragen: measure maximum write rate
● Terasort: read, shuffle, write
2. Functional test
● Gridmix: IO + compute
● Capture of real production cluster workload (1k–5k jobs)
● Replays reads, writes, shuffles, compute
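The GridMix idea above, replaying a captured production trace rather than running a synthetic micro-benchmark, can be sketched as follows. The trace format and job fields here are invented for illustration; real GridMix consumes Rumen job traces captured from a production cluster.

```python
# Minimal sketch of a GridMix-style replay: iterate over a captured trace
# of production jobs and re-issue their I/O in proportion. The JobTrace
# format is invented for illustration; real GridMix consumes Rumen traces.

from dataclasses import dataclass

@dataclass
class JobTrace:
    name: str
    read_bytes: int
    shuffle_bytes: int
    write_bytes: int

def replay(trace):
    """Aggregate the work a replay would re-issue, per phase."""
    totals = {"read": 0, "shuffle": 0, "write": 0}
    for job in trace:
        totals["read"] += job.read_bytes
        totals["shuffle"] += job.shuffle_bytes
        totals["write"] += job.write_bytes
    return totals

# A toy 3-job trace standing in for the 1k-5k job captures above
trace = [
    JobTrace("etl-1", read_bytes=10**9, shuffle_bytes=10**8, write_bytes=10**8),
    JobTrace("agg-2", read_bytes=5 * 10**8, shuffle_bytes=2 * 10**8, write_bytes=10**7),
    JobTrace("adhoc-3", read_bytes=10**8, shuffle_bytes=0, write_bytes=10**6),
]
print(replay(trace))
```

The point of the replay approach is that the read/write/shuffle mix matches real production, which synthetic benchmarks like Terasort alone do not capture.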
19. HDFS configurations tested
● Each type of object, block, and local storage
● Availability:
○ Critical data: 2 regions
○ Other data: 2 zones
● Dataset consistency — test cloud provider choices:
1. object store
2. object store with external consistency service
21. GCP HaaS: DataProc config
● Hadoop 2.7.2
● Performance tests with 800 vCPUs:
○ 100 x n1-standard-8 (8 vCPU, 30G memory)
○ 200 x n1-standard-4 (4 vCPU, 15G memory)
● Scale test with 8000 vCPUs:
○ 1000 x n1-standard-8 (8 vCPU, 30G memory)
● Modeled average CPU and average-to-peak CPU
● No preemptible instances in initial work
● Similar to on-premise hardware SKUs
● Decided to use DataProc for the evaluation
22. DataProc 100 x n1-standard-8 results

Durable Storage  Scratch Storage     HDFS speedup vs on premise (normalized by IO-per-core)
Cloud Storage    3 x 375G Local SSD  ~2x (but expensive)
Cloud Storage    1.5TB PD-HDD        ~1x
None             1.5TB PD-HDD        ~1x

Tuned Compute Engine instance types to get the optimum balance of network : cores : storage (this changes over time)
23. DataProc 200 x n1-standard-4 results

Durable Storage  Scratch Storage     HDFS speedup vs on premise (normalized by IO-per-core)
Cloud Storage    2 x 375G Local SSD  ~2x (but expensive)
Cloud Storage    1.5TB PD-HDD        1.4x
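The "~2x (but expensive)" vs "1.4x" trade-off above can be weighed as speedup per unit of cost. The speedups come from the tables; the relative storage costs below are placeholder assumptions, not actual GCP rates.

```python
# Weighing the results above as speedup-per-cost. The speedups come from
# the tables in this deck; the relative scratch-storage costs are
# placeholder assumptions, not actual GCP rates.

configs = {
    # name: (speedup vs on premise, assumed relative scratch-storage cost)
    "2 x 375G Local SSD": (2.0, 3.0),
    "1.5TB PD-HDD": (1.4, 1.0),
}

for name, (speedup, cost) in configs.items():
    print(f"{name}: {speedup / cost:.2f} speedup per unit cost")
```

Under these assumed prices the PD-HDD configuration wins on cost-efficiency even though Local SSD is faster in absolute terms, which is the kind of comparison the "but expensive" note is pointing at.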
24. Benchmark Findings
1. Application benchmarks are critical
Total job time is composed of multiple steps. We found variation, both better and worse, at each step.
Recommendation: rely on an application benchmark like GridMix rather than on micro-benchmarks.
2. Network storage can be treated like local disk
Both Cloud Storage and PD offered nearly as much bandwidth as typical direct-attached HDDs on premise.
25. Functional Test Findings
1. Live migration of VMs was not noticeable during Hadoop testing, though it was noticeable during other Twitter platform testing of Compute Engine (a cache serving small objects at very high rps)
2. Cloud Storage checksum vs HDFS checksum mismatch: fixed via HDFS-13056 in collaboration with Google
3. fsync() system call on Local SSD was slow (fixed)
27. Disqualified Lift-and-Shift *Everything*
Lift-and-shift everything:
+ Leads to the fastest migration
+ Limits duplication of costs during the migration period
- Introduces significant tech debt post-migration
- Requires a major rearchitecture post-migration to capture the benefits of cloud
- Concerns around the overall cost, risk, and distraction of this approach at Twitter scale
28. Hadoop to Cloud was Interesting
● Separable, with fewer dependencies
● Standard open source software:
○ Continue to develop in house and run on premise
○ Reduces lock-in risk
● Rearchitecting is achievable
○ Not a lift-and-shift
● Data in Cloud Storage:
○ Enables a broader diversity of data processing frameworks and services
● Long-term bet on Google’s Big Data ecosystem
29. Separate Hadoop Compute and Storage
Enables:
● Scaling the dimensions independently
● Running multiple clusters and processing frameworks easily over the same data
● Segmentation of access and cost structures via virtual network and project primitives
● Simpler deployments, upgrades, and testing, since state is preserved in Cloud Storage
● Treating storage as a commodity
31. Twitter production Hadoop remains on premise
● Not as separable from other production workloads
● Focusing on non-production workloads limits our risk
● Regular compute-intensive usage patterns
● Benefits more from purpose built hardware
● Fewer processing frameworks are needed
32. Twitter Strategic Benefits
What does this do overall for Twitter?
● Next-generation architecture with numerous enhancements: security, encryption, isolation, live migration
● Leverage Google’s capacity and R&D
● Larger ecosystem of open source & cloud software
● Long-term strategic collaboration with Google
● A beachhead that enables teams across Twitter to make tactical cloud adoption decisions
33. Twitter Functional Benefits
Infrastructure benefits:
● Large-scale ad-hoc analysis and backfills
● Cloud Storage avoids HDFS limits
● Offsite backup
● Increases availability of cold data
Platform benefits:
● Built-in compliance support (e.g. SOX)
● Direct chargeback using Projects
● Simplified retention
● GCP services such as BigQuery, Spanner, Cloud ML, TPUs, etc.
34. Finding: At Twitter Scale, Cloud has limits
● Cloud providers have limits for all sorts of things, and we often need them increased
● Cloud HaaS offerings do not generally support 10K-node Hadoop clusters
● Dynamic scaling down in less than O(days) is not yet feasible / cost-effective with current Hadoop at Twitter scale
● Capacity planning with cloud providers is encouraged for O(10K)-vCPU deltas and required for O(100K)-vCPU deltas
35. What we are working on now
Done:
✓ Foundational network (8x100Gbps)
✓ Copy cluster
✓ Copying PBs of data to the cloud
✓ Early Presto analytics use case: up to a 100K-core Dataproc cluster querying a 15PB dataset in Cloud Storage
In progress:
❏ Finalizing bucket & user creation and IAM designs
❏ Building replication, cluster deployment, and data management software
❏ Hadoop Cloud Storage connector improvements continue (open source)
❏ Retention and “directory” / dataset atomicity in GCS
37. Recommendations
1. Run the most informative tests
● Application-level benchmarking (e.g. GridMix)
● Scale testing
2. Compare application benchmark costs
Compare the cost of running an application using benchmark results; don’t just look at pricing pages. For example, the network is hugely important to performance.
3. Ensure the migration plan captures benefits
Lift-and-shift may not deliver value in all cases. Substantial iteration is required to balance tactical migration work with long-term strategy.
38. Conclusions
1. Separate compute and storage is a real thing
The better the network, the less locality matters. Life gets much easier when compute can be stateless. You can treat PD like direct-attached HDDs.
2. Cloud adoption is complex
Finding separable workloads can be a challenge. Architectural choices are non-obvious. Methodical evaluation is well worth the effort.
3. Very early in this process, and lots more to come
We’re excited to be gaining experience with the platform and learning from everyone.