How @twitterhadoop
chose Google Cloud
Joep Rottinghuis & Lohit VijayaRenu
Twitter Hadoop Team (@twitterhadoop)
1
1. Twitter infrastructure
2. Hadoop evaluation
3. Evaluation outcomes
4. Recommendations and conclusions
5. Q&A
Credit to presentation at GoogleNext 2019 by Derek Lyon & Dave
Beckett (https://youtu.be/4FLFcWgZdo4) 2
Twitter Infrastructure
3
Twitter’s infrastructure
● Twitter founded in 2006
● Global-scale application
● Unique scale and performance characteristics
● Real-time
● Built to purpose and well optimized
● Large data centers
4
Strategic questions
1. What is the long-term mix of cloud versus
datacenter?
2. Which cloud provider(s) should we use?
3. How can we be confident in this type of
decision?
4. Why should we evaluate this now (2016)?
5
Tactical questions
1. What is the feasibility and cost of large-scale
adoption?
2. Which workloads are best-suited for the cloud
and are they separable?
3. How would our architecture change on the
cloud?
4. How do we get to an actionable plan?
6
Evaluation process
● Started evaluation in 2016
● Were able to make a patient, rigorous
decision
● Defined baseline workload requirements
● Engaged major providers
● Analyzed clouds for each major workload
● Built overall cloud plan
● Iterated and optimized choices
7
Evaluation Timeline (June ’16 – June ’18)
Phases: Considering → Moving
Milestones (timeline markers: June ’16, Sept ’16, Mar ’17, July ’17, Nov ’17, Jan ’18, Apr ’18, June ’18):
● Initial Cloud RfP release; 27 Synthetic PoC’s on GCP begin; Testing Projects / Network established (June ’16)
● PoC’s Completed & Results Delivered
● Legal Agreement with T&C’s ratified; kickoff of Dataproc, BigQuery, Dataflow experimentation
● Security and Platform Review; v1 Hadoop on GCP Architecture Ratified; begin build of migration plan
● Consensus built with Product, Revenue, Eng
● Proposal to migrate Hadoop to GCP formally accepted
● Migration Kickoff
8
Built overall cloud plan
● Created a series of candidate architectures
for each platform with their resource
requirements
● Developed a migration project plan &
timeline
● Created financial projections
● Weighed other business considerations
9
Financial modeling
● 10-year time horizon to avoid timing artifacts
● Compared on premise and multiple cloud
scenarios
● Costs of migration and of long-term operation
● Long-term price/performance curves
(e.g. Moore’s Law, historical pricing)
● Two independent models to avoid model
errors
10
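To make the modeling approach concrete, below is a minimal sketch of the kind of arithmetic a 10-year comparison with price/performance curves involves. All figures are illustrative placeholders, not Twitter's actual inputs or either of its two models.

```python
# Illustrative sketch only: compares a flat on-premise scenario against a cloud
# scenario whose unit prices decline along an assumed price/performance curve.
# All numbers are made up; they are NOT Twitter's actual figures or model.

def total_cost(annual_cost: float, years: int, annual_price_decline: float) -> float:
    """Sum of yearly costs when unit prices decline by a fixed fraction each year."""
    return sum(annual_cost * (1 - annual_price_decline) ** y for y in range(years))

HORIZON_YEARS = 10                                   # long horizon to avoid timing artifacts
on_prem = total_cost(100.0, HORIZON_YEARS, 0.05)     # assumed 5%/yr hardware price/perf gain
cloud = total_cost(110.0, HORIZON_YEARS, 0.08)       # assumed 8%/yr cloud price decline
migration_one_time = 30.0                            # assumed one-time migration cost

print(f"on-prem 10y total: {on_prem:.1f}")
print(f"cloud 10y total:   {cloud + migration_one_time:.1f} (incl. migration)")
```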
What we found
● An immediate all-in migration at Twitter scale is expensive, distracting, and risky
● More value from new architectures and transformation, so start smaller and learn as we go
● Hadoop offered several important, specific benefits with lower risk
● We gained confidence in our investments in both cloud projects and data centers
11
Hadoop@Twitter scale
● >1.4T Messages Per Day
● >500K Compute Cores
● >300PB Logical Storage
● >12,500 Peak Cluster Size
12
Twitter Hadoop cluster types
Type | Use | Compute %
Real-time | Critical performance production jobs with dedicated capacity | 10%
Processing | Regularly scheduled production jobs with dedicated capacity | 60%
Ad-hoc | One-off / ad-hoc queries and analysis | 30%
Cold | Dense storage clusters, not for compute | minimal
13
Twitter Hadoop challenges
1. Scaling: Significant YoY Compute & Storage growth
2. Hardware: Designing, building, maintaining & operating
3. Capacity Planning: Especially hard to predict for ad-hoc workloads
4. Agility: Must respond fast, especially for ad-hoc compute
5. Deployment: Must deploy at scale and in-flight
6. Network: Both cross-DC and cross-cluster
7. Disaster Recovery: Durable copies needed in 2+ DCs
14
Twitter Hadoop requirements
● Network sustained bandwidth per core
● Disk (data) sustained bandwidth per core
● Large sequential reads & writes
● Throughput not latency
● Capacity
● CPU / RAM not usually the bottleneck
● Consistency of datasets (set of HDFS files)
15
Twitter Hadoop on premise hardware
numbers
Clusters: 10 to 10K nodes
Network: 10G moving to 25G
Data Disks: 24T-72T over 12 HDDs
CPU: 8 cores with 64G memory
I/O: Network: ~20MB/s sustained, peaks of 10x
HDFS read: 20 rq/s sustained, peaks of 3x
HDFS write: large variation, peaks of 10x
16
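Because the benchmark results later are normalized by IO-per-core, it helps to see how the per-node numbers above translate into per-core figures. A back-of-the-envelope sketch follows; the per-disk throughput is an assumed round number and the ~20 MB/s network figure is read as per node, both assumptions rather than stated facts.

```python
# Back-of-the-envelope sketch: turn the per-node numbers from the slide above into
# the per-core figures used when comparing against cloud VM shapes.
# Node shape from the slide: 8 cores, 12 HDDs, ~20 MB/s sustained network, 10x peaks.

CORES_PER_NODE = 8
HDDS_PER_NODE = 12
ASSUMED_MB_PER_SEC_PER_HDD = 100          # assumption: typical sequential HDD throughput

disk_bw_per_core = HDDS_PER_NODE * ASSUMED_MB_PER_SEC_PER_HDD / CORES_PER_NODE
net_sustained_per_core = 20 / CORES_PER_NODE      # reading ~20 MB/s as a per-node figure
net_peak_per_core = net_sustained_per_core * 10   # peaks of ~10x

print(f"disk bandwidth per core:    ~{disk_bw_per_core:.0f} MB/s")
print(f"network sustained per core: ~{net_sustained_per_core:.1f} MB/s")
print(f"network peak per core:      ~{net_peak_per_core:.0f} MB/s")
```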
Cloud architectural options
1. Hadoop-as-a-Service (HaaS) from the cloud provider
2. Twitter Hadoop on cloud VMs
Durable storage: cloud object store
Scratch storage:
a. with HDFS over cloud object store
b. with HDFS on cloud block store
c. with HDFS on local disks
17
Testing plan
1. Baseline Tests
● TestDFSIO: low-level IO read/write
● Teragen: measure maximum write rate
● Terasort: read, shuffle, write
2. Functional Test
● Gridmix: IO + Compute
● Capture of real production cluster workload (1k-5k jobs)
● Replays reads, writes, shuffles, compute
(A sketch of driving the baseline tests from the Hadoop CLI follows below.)
18
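A rough sketch of how the baseline tests listed above might be driven from the Hadoop CLI. Jar locations, data sizes, and some flags differ across Hadoop versions and distributions, so treat the exact arguments as placeholders rather than the team's actual harness.

```python
# Sketch of driving the baseline tests from the Hadoop CLI via Python.
# Jar paths, sizes, and some flags are version/distribution dependent placeholders.
import subprocess

JOBCLIENT_TESTS_JAR = "hadoop-mapreduce-client-jobclient-tests.jar"  # path is install-specific
EXAMPLES_JAR = "hadoop-mapreduce-examples.jar"                        # path is install-specific

def run(args):
    print("+", " ".join(args))
    subprocess.run(args, check=True)

# TestDFSIO: low-level HDFS write then read throughput
run(["hadoop", "jar", JOBCLIENT_TESTS_JAR, "TestDFSIO",
     "-write", "-nrFiles", "64", "-size", "1GB"])
run(["hadoop", "jar", JOBCLIENT_TESTS_JAR, "TestDFSIO",
     "-read", "-nrFiles", "64", "-size", "1GB"])

# Teragen: maximum write rate (100-byte rows), then Terasort: read + shuffle + write
rows = 10_000_000_000  # ~1 TB of teragen data; choose to match the cluster under test
run(["hadoop", "jar", EXAMPLES_JAR, "teragen", str(rows), "/bench/teragen"])
run(["hadoop", "jar", EXAMPLES_JAR, "terasort", "/bench/teragen", "/bench/terasort"])
```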
HDFS configurations tested
Storage: each type of Object, Block, and Local Storage
Availability
● Critical data: 2 regions
● Other data: 2 zones
Dataset consistency (cloud provider choices tested):
1. object store
2. object store with external consistency service
19
Hadoop Evaluation
20
GCP HaaS: Dataproc config
● Decided to use Dataproc for the evaluation (an illustrative cluster-creation sketch follows below)
● Hadoop 2.7.2
● Performance tests with 800 vCPUs:
○ 100 x n1-standard-8 (8 vCPU, 30G memory)
○ 200 x n1-standard-4 (4 vCPU, 30G memory)
● Scale test with 8000 vCPUs:
○ 1000 x n1-standard-8 (8 vCPU, 30G memory)
● Modeled average CPU and average-to-peak CPU
● No preemptible instances in initial work
● Similar to on premise hardware SKUs
21
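For illustration, here is what standing up a cluster of the 100 x n1-standard-8 shape looks like with today's google-cloud-dataproc Python client. The 2016 evaluation predates this client, and the project, region, and disk settings below are assumptions for the sketch, not the configuration Twitter actually used.

```python
# Illustrative sketch: create a Dataproc cluster roughly matching the benchmark shape
# (100 x n1-standard-8 workers with ~1.5 TB PD-HDD each). Values are placeholders.
from google.cloud import dataproc_v1

def create_benchmark_cluster(project_id: str, region: str, cluster_name: str):
    client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )
    cluster = {
        "project_id": project_id,
        "cluster_name": cluster_name,
        "config": {
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-8"},
            "worker_config": {
                "num_instances": 100,                 # 100 x n1-standard-8 = 800 vCPUs
                "machine_type_uri": "n1-standard-8",
                "disk_config": {
                    "boot_disk_type": "pd-standard",  # PD-HDD scratch/boot disk
                    "boot_disk_size_gb": 1500,        # ~1.5 TB per worker
                },
            },
        },
    }
    operation = client.create_cluster(
        request={"project_id": project_id, "region": region, "cluster": cluster}
    )
    return operation.result()  # blocks until the cluster is up

# create_benchmark_cluster("my-test-project", "us-central1", "gridmix-bench")  # placeholders
```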
Dataproc 100 x n1-standard-8 Results
Durable Storage | Scratch Storage | HDFS speedup vs on premise (normalized by IO-per-core)
Cloud Storage | Local SSD (3 x 375G SSD) | ~2x (but expensive)
Cloud Storage | PD-HDD (1.5TB PD-HDD) | ~1x
None | PD-HDD (1.5TB PD-HDD) | ~1x
Tuned Compute Engine instance types to get the optimum balance of network : cores : storage (this changes over time)
22
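The "normalized by IO-per-core" column divides measured aggregate throughput by vCPU count on each side before comparing, since the cloud and on-premise clusters differ in size. A tiny sketch with made-up numbers:

```python
# Sketch of the normalization used in these tables: compare throughput per vCPU
# rather than raw cluster totals. All figures below are placeholders.

def per_core_speedup(cloud_mb_per_sec: float, cloud_cores: int,
                     onprem_mb_per_sec: float, onprem_cores: int) -> float:
    return (cloud_mb_per_sec / cloud_cores) / (onprem_mb_per_sec / onprem_cores)

# e.g. an 800-vCPU Dataproc run vs. a larger on-prem reference cluster (made-up figures)
print(per_core_speedup(16_000, 800, 20_000, 2_000))   # -> 2.0, i.e. "~2x"
```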
Dataproc 200 x n1-standard-4 Results
Durable Storage | Scratch Storage | HDFS speedup vs on premise (normalized by IO-per-core)
Cloud Storage | Local SSD (2 x 375G SSD) | ~2x (but expensive)
Cloud Storage | PD-HDD (1.5TB PD-HDD) | 1.4x
23
Benchmark Findings
1. Application Benchmarks
are critical
Total job time is composed of
multiple steps. We found
variation both better and worse
at each step.
Recommendation: You should
rely on an application
benchmark like GridMix rather
than micro-benchmarks.
2. Can treat network
storage like local disk
Both Cloud Storage and PD
offered nearly as much
bandwidth as typical direct
attached HDDs on premise
24
Functional Test Findings
1. Live Migration of VMs was not noticeable during Hadoop testing. It was noticeable during other Twitter platform testing of Compute Engine (a cache serving small objects at very high rps).
2. Cloud Storage checksums differ from HDFS checksums, which broke copy validation. Fixed via HDFS-13056 in collaboration with Google (see the sketch below).
3. fsync() system call on Local SSD was slow (fixed)
25
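For context on finding 2: Cloud Storage exposes CRC32C object checksums, while HDFS historically exposed a block-size-dependent MD5-of-CRC file checksum, so tools validating copies between the two had nothing comparable; HDFS-13056 added a composite-CRC mode that is comparable across filesystems. The sketch below shows the properties commonly involved when copying with DistCp; the property names reflect my understanding of the Hadoop and GCS-connector options, so verify them against your versions before relying on this.

```python
# Hedged sketch: properties typically involved when DistCp validates copies between
# HDFS and Cloud Storage after HDFS-13056. Property names/values are believed correct
# but should be checked against your Hadoop and GCS-connector versions.
import subprocess

props = {
    # HDFS-13056: expose a block-size-independent composite CRC file checksum
    "dfs.checksum.combine.mode": "COMPOSITE_CRC",
    # GCS connector: report CRC32C file checksums so the two sides are comparable
    "fs.gs.checksum.type": "CRC32C",
}

args = ["hadoop", "distcp"]
for key, value in props.items():
    args.append(f"-D{key}={value}")
# Placeholder source and destination paths for the sketch
args += ["hdfs:///logs/2018/06/01", "gs://example-bucket/logs/2018/06/01"]

subprocess.run(args, check=True)
```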
Evaluation Outcomes
26
Disqualified Lift-and-Shift *Everything*
+ Leads to the fastest migration
+ Limits duplication of costs during migration period
- Introduces significant tech debt post-migration
- Requires a major rearchitecture post-migration to capture benefits of cloud
- Concerns around overall cost, risk, and distraction of this approach at Twitter scale
27
Hadoop to Cloud was Interesting
● Separable with fewer dependencies
● Standard open source software:
○ Continue to develop in house and run on premise
○ Reduces lock-in risk
● Rearchitecting is achievable
○ Not a lift-and-shift
● Data in Cloud Storage:
○ Enables broader diversity of data processing frameworks and services
● Long-term bet on Google’s Big Data ecosystem
28
Separate Hadoop Compute and Storage enables:
● Scaling the dimensions independently
● Running multiple clusters and processing frameworks over the same data
● Virtual network and project primitives providing segmentation of access and cost structures
● State preserved in Cloud Storage, so deployments, upgrades, and testing are simpler
● Treating storage as a commodity
29
Twitter Hadoop Rearchitected for Cloud
1. Cold Cluster
● Storage: Cloud Storage
● Compute: limited; ephemeral Dataproc is an option
● Scaling: mostly storage driven
2. Ad-Hoc Clusters
● Storage: Cloud Storage
● Compute: Compute Engine and Twitter build of Hadoop (long running clusters)
● Scaling: mixture, with spiky compute
30
Twitter production Hadoop remains on premise
● Not as separable from other production workloads
● Focusing on non-production workloads limits our risk
● Regular compute-intensive usage patterns
● Benefits more from purpose built hardware
● Fewer processing frameworks are needed
31
What does this do overall for Twitter?
Twitter Strategic Benefits
● Next-generation architecture with numerous enhancements:
○ security, encryption, isolation, live migration
● Leverage Google’s capacity and R&D
● Larger ecosystem of open source & cloud software
● Long-term strategic collaboration with Google
● Beachhead that enables teams across Twitter to make tactical cloud adoption decisions
32
Twitter Functional Benefits
Infrastructure benefits
● Large-scale ad-hoc analysis and backfills
● Cloud Storage avoids HDFS limits
● Offsite backup
● Increases availability of cold data
Platform benefits
● Built-in compliance support (e.g. SOX)
● Direct chargeback using Project
● Simplified retention
● GCP services such as BigQuery, Spanner, Cloud ML, TPUs, etc.
33
Finding: At Twitter Scale, Cloud has limits
● Cloud providers have limits for all sorts of things
and we often need them increased.
● Cloud HaaS offerings do not generally support 10K-node Hadoop clusters
● Dynamic scaling down < O(days) is not yet
feasible / cost-effective with current Hadoop at
Twitter scale
● Capacity planning with cloud providers is
encouraged for O(10K) vCPU deltas and required
for O(100K) vCPU deltas
34
What we are working on now
Done:
✓ Foundational network (8x100Gbps)
✓ Copy cluster
✓ Copying PBs of data to the cloud
✓ Early Presto analytics use case: up to 100K-core Dataproc cluster querying a 15PB dataset in Cloud Storage
In progress:
❏ Finalizing bucket & user creation and IAM designs
❏ Building replication, cluster deployment, and data management software
❏ Hadoop Cloud Storage connector improvements continue (open source)
❏ Retention and “directory” / dataset atomicity in GCS (one common publish pattern is sketched below)
35
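On the last item: Cloud Storage has no atomic directory rename, so "dataset atomicity" has to be built on top of single-object operations. One common pattern, sketched below as a general idea rather than Twitter's actual design, is to write all part objects first and publish a small marker/manifest object last, which readers treat as the signal that the dataset is complete.

```python
# General-purpose sketch of atomic "dataset publish" on an object store
# (not Twitter's actual design): data objects first, a marker/manifest object last.
from typing import List
from google.cloud import storage

def publish_dataset(bucket_name: str, dataset_prefix: str, part_payloads: List[bytes]):
    bucket = storage.Client().bucket(bucket_name)

    # 1. Write the (possibly many) data objects first.
    part_names = []
    for i, payload in enumerate(part_payloads):
        name = f"{dataset_prefix}/part-{i:05d}"
        bucket.blob(name).upload_from_string(payload)
        part_names.append(name)

    # 2. Publish the marker last; its presence makes the dataset "visible" to readers.
    manifest = "\n".join(part_names)
    bucket.blob(f"{dataset_prefix}/_SUCCESS").upload_from_string(manifest)

def dataset_is_ready(bucket_name: str, dataset_prefix: str) -> bool:
    bucket = storage.Client().bucket(bucket_name)
    return bucket.blob(f"{dataset_prefix}/_SUCCESS").exists()
```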
Recommendations and Conclusion
36
Recommendations
1. Run the most informative tests
Application-level benchmarking (e.g. GridMix).
Scale testing.
2. Compare application benchmark costs
Compare the cost of running an application using benchmark results. Don’t just look at pricing pages.
e.g. the network is hugely important to performance.
3. Ensure migration plan captures benefits
Lift-and-shift may not deliver value in all cases.
Substantial iteration is required to balance tactical migration work with long-term strategy.
37
Conclusions
1. Separate compute and storage is a real thing
The better the network, the less locality matters.
Life gets much easier when compute can be stateless.
You can treat PD like direct attached HDDs.
2. Cloud adoption is complex
Finding separable workloads can be a challenge.
Architectural choices are non-obvious.
Methodical evaluation is well worth the effort.
3. Very early in this process and lots more to come
We’re excited to be gaining experience with the platform and learning from everyone.
38
Thank You
Questions?
39