Extending Twitter's Data Platform to Google Cloud

DataWorks Summit
DataWorks SummitDataWorks Summit
Extending Twitter’s
Data Platform to
Google Cloud
1
Lohit VijayaRenu , Vrushali Channapattan
Data Platform @Twitter
Oxpecker
Roneobird
Data Access
Layer
ETL
Pipelines
Why Cloud?
- Provides a convenient way to test Hadoop changes at scale
- Temporarily rapidly grow / shrink
- A broader geographical footprint for locality and business continuity
- Access to other Google offerings such as BigQuery, CloudML, Cloud
DataFlow etc
Partly Cloudy
A project to extend Data Processing at Twitter
from an on-premises only model
to a hybrid on-premises and Cloud model
Before Partly Cloudy
Partly Cloudy
Design considerations
User Experience
Consistency in
user experience
for on-premises
& in cloud data
processing
Scalability
Ability scale out
to handle all
datasets & all
users from day
1
Onboarding
Seamless
onboarding
experience
New Avenues
Data access in
new processing
tools in cloud
Design principles
Authentication
Strong authentication
for all user and service
access to data
Authorization
Explicit authorization
for all user and service
access to data
Least privileged access
Audit
Ability to easily
determine who
performed what
actions on the data
Workstreams
● Various focus areas across the tech stack
○ Networking
○ GCP config
○ Replication
○ Data Processing Tools
○ Internal services
● Collaboration across teams within Twitter
● Collaboration with Google
Partly Cloudy Data Replication
Sync Datasets to GCS
Data Infrastructure for Analytics
`
Hadoop Cluster
Data
Access
Layer
Replication Service
Retention Service
Hadoop Cluster
Replication Service
Retention Service
Extending Replication to GCS
DataCenter 2DataCenter 1
Hadoop
ClusterM
Hadoop
ClusterN
Hadoop
ClusterC
Hadoop
ClusterZ
Hadoop
ClusterX-2
Hadoop
ClusterL
Hadoop
ClusterX-1
● Same dataset
available on GCS for
users
● Unlock Presto on
GCP, Hadoop on
GCP, BigQuery and
other tools
Destination Cluster
/ClusterY/logs/partly-cloudy/
2019/04/10/03
Data Replicator Copy
Source Cluster
/ClusterX/logs/partly-cloudy/
2019/04/10/03
Replicator : ClusterY
Distcp
2019/04/10/03
DAL
Dataset : partly-cloudy
/ClusterX/logs/partly-cloudy
/ClusterY/logs/partly-cloudy
Destination Cluster
/ClusterY/logs/partly-cloudy/
2019/04/10/03
Data Replicator Copy + Merge
Source Cluster
/ClusterX-2/logs/partly-cloudy/
2019/04/10/03
Replicator : ClusterY
Distcp
2019/04/10/03
DAL
Dataset : partly-cloudy
/ClusterX-1/logs/partly-cloudy
/ClusterX-2/logs/partly-cloudy
/ClusterY/logs/partly-cloudy
Type : Multiple Src
Source Cluster
/ClusterX-1/logs/partly-cloudy/
2019/04/10/03
Distcp
2019/04/10/03
Merge
Twitter
DataCenter
Architecture behind GCS replication
Copy Cluster
GCS
/gcs/logs/partly-cloud
/2019/04/10/03
Replicator : GCS
DAL
Source Cluster
/ClusterX/logs/partly-cloudy/
2019/04/10/03
Distcp
Dataset : partly-cloudy
/ClusterX/logs/partly-cloudy
/gcs/logs/partly-cloudy
Merge same dataset on GCS (Multi Region Bucket)
Twitter DataCenter X-2
Copy Cluster X-2
/gcs/logs/partly-
cloudy/2019/04/
10/03
Source ClusterX-2
/ClusterX-2/logs/partly-
cloudy//2019/04/10/03
Twitter DataCenter X-1
Copy Cluster X-1Source ClusterX-1
/ClusterX-1/logs/partly-
cloudy/2019/04/10/03
Distcp
Multi Region
Bucket
Distcp
Cloud Storage
Dataset via EagleEye
● View different
destination for
same dataset
● GCS is another
destination
● Also shows delay
for each hourly
partition
Partly Cloudy Resource Hierarchy
Organization and Project
structure
Partly Cloudy Resource Hierarchy
TWITTER Org
DATA INFRA
Folder
twitter-
product
twitter-revenue
twitter-infraeng GCP
Projects
Project
Dataset
bucket
User Bucket
Google Cloud Storage
Connector for Hadoop
Google Cloud Storage
Connector for Hadoop
Nest
Name
Nodes
Worker Nodes
Resource
Manager
Task
ViewFS filesystem layer
ViewFS filesystem layer
Shadow account based
access
User account based access
User account based access
Scratch
bucket
Scrubbed
bucket
Project contents
GCP Project ZGCP Project YGCP Project X
Replicators per project
Twitter DataCenter
Copy Cluster
/gcs/dataX/2019/
04/10/03
/gcs/dataY/2019/
04/10/03
/gcs/dataZ/2019/
04/10/03
DistcpDistcp
DistcpDistcp DistcpDistcp
Replicator X Replicator Y Replicator Z
Cloud Storage Cloud Storage Cloud Storage
Partly Cloudy Resource Hierarchy
Storage in the Cloud
GCS
On-premises
path
/dc1/cluster1/user/
helen/some/path/par
t-001.lzo
Logical Cloud
path
/gcs/user/helen/
some/path/part-
001.lzo
GCS bucket path
gs://user.helen.dp.
twitter.domain/some
/path/part-001.lzo
RegEx based path resolution
<property>
<name>fs.viewfs.mounttable.copycluster.linkRegex.replaceresolveddstpath:-:--
;replaceresolveddstpath:_:-#.^/gcs/logs/(?!((tst|test)(_|-)))(?&lt;dataset&gt;[^/]+)</name>
<value>gs://logs.${dataset}</value>
</property>
<property> <name>fs.viewfs.mounttable.copycluster.linkRegex.replaceresolveddstpath:-:--
;replaceresolveddstpath:_:-#.^/gcs/user/(?!((tst|test)(_|-)))(?&lt;userName&gt;[^/]+)</name>
<value>gs://user.${userName}</value>
</property>
/gcs/logs/partly-
cloudy/2019/04/10
/gcs/user/lohit/hadoop-stats
gs://logs.partly-
cloudy/2019/04/10
gs://user.lohit/hadoop-stats
Twitter ViewFS Path GCS bucket
Twitter ViewFS mounttable.xml
Bucket on GCS : gs://logs.partly-cloudy
Connector Path : /logs/partly-cloudy
Twitter Resolved Path : /gcs/logs/partly-cloudy
View FileSystem and Google Hadoop Connector
Twitter’s View FileSystem
Cluster-X Cluster-Y ClusterZ
Namespace 1 Namespace 2 Namespace 1 Namespace 1 Namespace 2
DataCenter-1 DataCenter-2 Cloud Storage
Connector
Replicator
Cloud Storage
Partly Cloudy Resource Hierarchy
User Management
User
UNIX
Kerberos
credentials
GSuite
OAuth2
credentials
GSuite
OAuth2
credentials
Shadow account
(GCP Service account)
Users & Accounts
Shadow account
Json key
Key Management
- A new key is generated every N days
- Each key is valid for 2N + N days
- Keys are distributed to compute nodes by Twitter’s key
distribution service
- The shadow account key is readable only by that user
- Key management & distribution is transparent to the user
Partly Cloudy Resource Hierarchy
DATA INFRA
twitter-[org]
twitter-
employee-users
How
do the
Data Processing Users
at Twitter get
to use Partly Cloudy
DemiGod
Services
What are DemiGod services
Demigod is a group of service(s) that are responsible for
configuring GCP for Twitter’s Data Platform.
They run in GCP.
Salient features of DemiGods
- Run asynchronously of each other.
- Run with exactly-scoped, privileged google service accounts
- Idempotent runs
- Puppet-like functionality. Will override any manual changes
- Modular in design
- Each kept as simple as possible
Twitter infra eng project Twitter product project
Partly Cloudy Admin Project
Twitter user project
bucket-creation
-ie org (svc-acc-ie)
bucket-creation
Product (svc-acc-
product)
shadow-user-
creation
policy-granting-ie
Key/
Secrets
store
LDAP/Googl
e Groups
GCS Config
bucket
key-
rotation/creation
Deployment of DemiGods
What
do the
Data Processing
Users
at Twitter get
❏ Datasets replicated on GCS
❏ A shadow account to access GCS
❏ GCS buckets for their scratch &
scrubbed data
❏ Access to a Twitter managed
Hadoop cluster in GCP
❏ Access to a Twitter managed
Presto cluster in GCP
❏ Exploring other Google offerings
(such as BigQuery, DataProc & DataFlow)
● Copied tens of petabytes of data
and keeping them in sync
● Tens of different projects with
hundreds of buckets
● Complex set of VPC rules
● Hundreds of users using GCP
● Unlocked multiple use cases on
GCP
Where are we today
Thank you!
Hiring https://careers.twitter.com
Tweet @TwitterHadoop
1 de 36

Recomendados

Rainbird: Realtime Analytics at Twitter (Strata 2011) por
Rainbird: Realtime Analytics at Twitter (Strata 2011)Rainbird: Realtime Analytics at Twitter (Strata 2011)
Rainbird: Realtime Analytics at Twitter (Strata 2011)Kevin Weil
77K visualizações60 slides
System design for video streaming service por
System design for video streaming serviceSystem design for video streaming service
System design for video streaming serviceNirmik Kale
448 visualizações13 slides
HDInsight for Architects por
HDInsight for ArchitectsHDInsight for Architects
HDInsight for ArchitectsAshish Thapliyal
2.5K visualizações52 slides
Apache Kudu: Technical Deep Dive

 por
Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Cloudera, Inc.
4.5K visualizações48 slides
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021 por
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021StreamNative
536 visualizações18 slides
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha... por
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...HostedbyConfluent
979 visualizações44 slides

Mais conteúdo relacionado

Mais procurados

Evolution from EDA to Data Mesh: Data in Motion por
Evolution from EDA to Data Mesh: Data in MotionEvolution from EDA to Data Mesh: Data in Motion
Evolution from EDA to Data Mesh: Data in Motionconfluent
473 visualizações56 slides
Building large scale transactional data lake using apache hudi por
Building large scale transactional data lake using apache hudiBuilding large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBill Liu
686 visualizações32 slides
Building Robust ETL Pipelines with Apache Spark por
Building Robust ETL Pipelines with Apache SparkBuilding Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache SparkDatabricks
34.6K visualizações43 slides
Real-Time Event Processing por
Real-Time Event ProcessingReal-Time Event Processing
Real-Time Event ProcessingAmazon Web Services
7K visualizações49 slides
Server monitoring using grafana and prometheus por
Server monitoring using grafana and prometheusServer monitoring using grafana and prometheus
Server monitoring using grafana and prometheusCeline George
2K visualizações13 slides
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D... por
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...Databricks
1.5K visualizações50 slides

Mais procurados(20)

Evolution from EDA to Data Mesh: Data in Motion por confluent
Evolution from EDA to Data Mesh: Data in MotionEvolution from EDA to Data Mesh: Data in Motion
Evolution from EDA to Data Mesh: Data in Motion
confluent473 visualizações
Building large scale transactional data lake using apache hudi por Bill Liu
Building large scale transactional data lake using apache hudiBuilding large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudi
Bill Liu686 visualizações
Building Robust ETL Pipelines with Apache Spark por Databricks
Building Robust ETL Pipelines with Apache SparkBuilding Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache Spark
Databricks34.6K visualizações
Real-Time Event Processing por Amazon Web Services
Real-Time Event ProcessingReal-Time Event Processing
Real-Time Event Processing
Amazon Web Services7K visualizações
Server monitoring using grafana and prometheus por Celine George
Server monitoring using grafana and prometheusServer monitoring using grafana and prometheus
Server monitoring using grafana and prometheus
Celine George2K visualizações
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D... por Databricks
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
Databricks1.5K visualizações
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ... por Databricks
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Databricks3.8K visualizações
Lesson 3 - Understanding Native Applications, Tools, Mobility, and Remote Man... por Gene Carboni
Lesson 3 - Understanding Native Applications, Tools, Mobility, and Remote Man...Lesson 3 - Understanding Native Applications, Tools, Mobility, and Remote Man...
Lesson 3 - Understanding Native Applications, Tools, Mobility, and Remote Man...
Gene Carboni2K visualizações
Apache Hudi: The Path Forward por Alluxio, Inc.
Apache Hudi: The Path ForwardApache Hudi: The Path Forward
Apache Hudi: The Path Forward
Alluxio, Inc.495 visualizações
Hadoop Backup and Disaster Recovery por Cloudera, Inc.
Hadoop Backup and Disaster RecoveryHadoop Backup and Disaster Recovery
Hadoop Backup and Disaster Recovery
Cloudera, Inc.62.8K visualizações
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac... por HostedbyConfluent
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
HostedbyConfluent4.3K visualizações
How to build a streaming Lakehouse with Flink, Kafka, and Hudi por Flink Forward
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
Flink Forward488 visualizações
Managing 2000 Node Cluster with Ambari por DataWorks Summit
Managing 2000 Node Cluster with AmbariManaging 2000 Node Cluster with Ambari
Managing 2000 Node Cluster with Ambari
DataWorks Summit17.1K visualizações
Consensus and Raft Algorithm in Distributed System por Thao Huynh Quang
Consensus and  Raft Algorithm in Distributed SystemConsensus and  Raft Algorithm in Distributed System
Consensus and Raft Algorithm in Distributed System
Thao Huynh Quang45 visualizações
Apache Kafka – (Pattern and) Anti-Pattern por confluent
Apache Kafka – (Pattern and) Anti-PatternApache Kafka – (Pattern and) Anti-Pattern
Apache Kafka – (Pattern and) Anti-Pattern
confluent2.3K visualizações
SDM (Standardized Data Management) - A Dynamic Adaptive Ingestion Frameworks ... por DataWorks Summit
SDM (Standardized Data Management) - A Dynamic Adaptive Ingestion Frameworks ...SDM (Standardized Data Management) - A Dynamic Adaptive Ingestion Frameworks ...
SDM (Standardized Data Management) - A Dynamic Adaptive Ingestion Frameworks ...
DataWorks Summit1.1K visualizações
data platform on kubernetes por 창언 정
data platform on kubernetesdata platform on kubernetes
data platform on kubernetes
창언 정715 visualizações
Google Dremel. Concept and Implementations. por Vicente Orjales
Google Dremel. Concept and Implementations.Google Dremel. Concept and Implementations.
Google Dremel. Concept and Implementations.
Vicente Orjales7.7K visualizações
Apache Iceberg - A Table Format for Hige Analytic Datasets por Alluxio, Inc.
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
Alluxio, Inc.6.6K visualizações

Similar a Extending Twitter's Data Platform to Google Cloud

Extending twitter's data platform to google cloud por
Extending twitter's data platform to google cloud Extending twitter's data platform to google cloud
Extending twitter's data platform to google cloud Vrushali Channapattan
252 visualizações36 slides
Extending Twitter's Data Platform to Google Cloud por
Extending Twitter's Data Platform to Google Cloud Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud lohitvijayarenu
114 visualizações36 slides
SEC302 Twitter's GCP Architecture for its petabyte scale data storage in gcs... por
SEC302  Twitter's GCP Architecture for its petabyte scale data storage in gcs...SEC302  Twitter's GCP Architecture for its petabyte scale data storage in gcs...
SEC302 Twitter's GCP Architecture for its petabyte scale data storage in gcs...Vrushali Channapattan
404 visualizações43 slides
Managing 100s of PetaBytes of data in Cloud por
Managing 100s of PetaBytes of data in CloudManaging 100s of PetaBytes of data in Cloud
Managing 100s of PetaBytes of data in Cloudlohitvijayarenu
170 visualizações30 slides
Cloud Composer workshop at Airflow Summit 2023.pdf por
Cloud Composer workshop at Airflow Summit 2023.pdfCloud Composer workshop at Airflow Summit 2023.pdf
Cloud Composer workshop at Airflow Summit 2023.pdfLeah Cole
163 visualizações80 slides
How a distributed graph analytics platform uses Apache Kafka for data ingesti... por
How a distributed graph analytics platform uses Apache Kafka for data ingesti...How a distributed graph analytics platform uses Apache Kafka for data ingesti...
How a distributed graph analytics platform uses Apache Kafka for data ingesti...HostedbyConfluent
961 visualizações26 slides

Similar a Extending Twitter's Data Platform to Google Cloud(20)

Extending twitter's data platform to google cloud por Vrushali Channapattan
Extending twitter's data platform to google cloud Extending twitter's data platform to google cloud
Extending twitter's data platform to google cloud
Vrushali Channapattan252 visualizações
Extending Twitter's Data Platform to Google Cloud por lohitvijayarenu
Extending Twitter's Data Platform to Google Cloud Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
lohitvijayarenu114 visualizações
SEC302 Twitter's GCP Architecture for its petabyte scale data storage in gcs... por Vrushali Channapattan
SEC302  Twitter's GCP Architecture for its petabyte scale data storage in gcs...SEC302  Twitter's GCP Architecture for its petabyte scale data storage in gcs...
SEC302 Twitter's GCP Architecture for its petabyte scale data storage in gcs...
Vrushali Channapattan404 visualizações
Managing 100s of PetaBytes of data in Cloud por lohitvijayarenu
Managing 100s of PetaBytes of data in CloudManaging 100s of PetaBytes of data in Cloud
Managing 100s of PetaBytes of data in Cloud
lohitvijayarenu170 visualizações
Cloud Composer workshop at Airflow Summit 2023.pdf por Leah Cole
Cloud Composer workshop at Airflow Summit 2023.pdfCloud Composer workshop at Airflow Summit 2023.pdf
Cloud Composer workshop at Airflow Summit 2023.pdf
Leah Cole163 visualizações
How a distributed graph analytics platform uses Apache Kafka for data ingesti... por HostedbyConfluent
How a distributed graph analytics platform uses Apache Kafka for data ingesti...How a distributed graph analytics platform uses Apache Kafka for data ingesti...
How a distributed graph analytics platform uses Apache Kafka for data ingesti...
HostedbyConfluent961 visualizações
Best practices for application migration to public clouds interop presentation por esebeus
Best practices for application migration to public clouds interop presentationBest practices for application migration to public clouds interop presentation
Best practices for application migration to public clouds interop presentation
esebeus1.2K visualizações
仕事ではじめる機械学習 por Aki Ariga
仕事ではじめる機械学習仕事ではじめる機械学習
仕事ではじめる機械学習
Aki Ariga8.2K visualizações
Hybrid data lake on google cloud with alluxio and dataproc por Alluxio, Inc.
Hybrid data lake on google cloud  with alluxio and dataprocHybrid data lake on google cloud  with alluxio and dataproc
Hybrid data lake on google cloud with alluxio and dataproc
Alluxio, Inc.98 visualizações
Maximizing Data Lake ROI with Data Virtualization: A Technical Demonstration por Denodo
Maximizing Data Lake ROI with Data Virtualization: A Technical DemonstrationMaximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
Maximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
Denodo 138 visualizações
GDG Cloud Southlake #8 Steve Cravens: Infrastructure as-Code (IaC) in 2022: ... por James Anderson
GDG Cloud Southlake #8  Steve Cravens: Infrastructure as-Code (IaC) in 2022: ...GDG Cloud Southlake #8  Steve Cravens: Infrastructure as-Code (IaC) in 2022: ...
GDG Cloud Southlake #8 Steve Cravens: Infrastructure as-Code (IaC) in 2022: ...
James Anderson457 visualizações
ML Infrastracture @ Dropbox por Tsahi Glik
ML Infrastracture @ Dropbox ML Infrastracture @ Dropbox
ML Infrastracture @ Dropbox
Tsahi Glik924 visualizações
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu por Yahoo Developer Network
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuHow @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
Yahoo Developer Network473 visualizações
How @twitterhadoop chose google cloud por lohitvijayarenu
How @twitterhadoop chose google cloudHow @twitterhadoop chose google cloud
How @twitterhadoop chose google cloud
lohitvijayarenu192 visualizações
Journey to Containerized Application / Google Container Engine por Google Cloud Platform - Japan
Journey to Containerized Application / Google Container EngineJourney to Containerized Application / Google Container Engine
Journey to Containerized Application / Google Container Engine
Google Cloud Platform - Japan1.9K visualizações
GCP-pde.pdf por NirajKumar938204
GCP-pde.pdfGCP-pde.pdf
GCP-pde.pdf
NirajKumar938204181 visualizações
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ... por Alluxio, Inc.
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
Alluxio, Inc.467 visualizações
Architecting a Scalable Hadoop Platform: Top 10 considerations for success por DataWorks Summit
Architecting a Scalable Hadoop Platform: Top 10 considerations for successArchitecting a Scalable Hadoop Platform: Top 10 considerations for success
Architecting a Scalable Hadoop Platform: Top 10 considerations for success
DataWorks Summit1.6K visualizações
Getting started with GCP ( Google Cloud Platform) por bigdata trunk
Getting started with GCP ( Google  Cloud Platform)Getting started with GCP ( Google  Cloud Platform)
Getting started with GCP ( Google Cloud Platform)
bigdata trunk363 visualizações
Data platform modernization with Databricks.pptx por CalvinSim10
Data platform modernization with Databricks.pptxData platform modernization with Databricks.pptx
Data platform modernization with Databricks.pptx
CalvinSim1062 visualizações

Mais de DataWorks Summit

Data Science Crash Course por
Data Science Crash CourseData Science Crash Course
Data Science Crash CourseDataWorks Summit
19.3K visualizações47 slides
Floating on a RAFT: HBase Durability with Apache Ratis por
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
2.9K visualizações20 slides
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi por
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
2.1K visualizações19 slides
HBase Tales From the Trenches - Short stories about most common HBase operati... por
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
1.8K visualizações18 slides
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac... por
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
1.6K visualizações74 slides
Managing the Dewey Decimal System por
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
1K visualizações8 slides

Mais de DataWorks Summit(20)

Data Science Crash Course por DataWorks Summit
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
DataWorks Summit19.3K visualizações
Floating on a RAFT: HBase Durability with Apache Ratis por DataWorks Summit
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
DataWorks Summit2.9K visualizações
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi por DataWorks Summit
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
DataWorks Summit2.1K visualizações
HBase Tales From the Trenches - Short stories about most common HBase operati... por DataWorks Summit
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
DataWorks Summit1.8K visualizações
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac... por DataWorks Summit
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
DataWorks Summit1.6K visualizações
Managing the Dewey Decimal System por DataWorks Summit
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
DataWorks Summit1K visualizações
Practical NoSQL: Accumulo's dirlist Example por DataWorks Summit
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
DataWorks Summit834 visualizações
HBase Global Indexing to support large-scale data ingestion at Uber por DataWorks Summit
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit915 visualizações
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix por DataWorks Summit
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
DataWorks Summit714 visualizações
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi por DataWorks Summit
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
DataWorks Summit1.3K visualizações
Supporting Apache HBase : Troubleshooting and Supportability Improvements por DataWorks Summit
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
DataWorks Summit1.8K visualizações
Security Framework for Multitenant Architecture por DataWorks Summit
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit1.1K visualizações
Presto: Optimizing Performance of SQL-on-Anything Engine por DataWorks Summit
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
DataWorks Summit1.8K visualizações
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl... por DataWorks Summit
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
DataWorks Summit3.2K visualizações
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi por DataWorks Summit
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
DataWorks Summit4K visualizações
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger por DataWorks Summit
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
DataWorks Summit957 visualizações
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory... por DataWorks Summit
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
DataWorks Summit771 visualizações
Computer Vision: Coming to a Store Near You por DataWorks Summit
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit214 visualizações
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark por DataWorks Summit
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
DataWorks Summit615 visualizações
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ... por DataWorks Summit
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
DataWorks Summit236 visualizações

Último

PRODUCT LISTING.pptx por
PRODUCT LISTING.pptxPRODUCT LISTING.pptx
PRODUCT LISTING.pptxangelicacueva6
14 visualizações1 slide
The Research Portal of Catalonia: Growing more (information) & more (services) por
The Research Portal of Catalonia: Growing more (information) & more (services)The Research Portal of Catalonia: Growing more (information) & more (services)
The Research Portal of Catalonia: Growing more (information) & more (services)CSUC - Consorci de Serveis Universitaris de Catalunya
80 visualizações25 slides
Vertical User Stories por
Vertical User StoriesVertical User Stories
Vertical User StoriesMoisés Armani Ramírez
14 visualizações16 slides
Serverless computing with Google Cloud (2023-24) por
Serverless computing with Google Cloud (2023-24)Serverless computing with Google Cloud (2023-24)
Serverless computing with Google Cloud (2023-24)wesley chun
11 visualizações33 slides
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas... por
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...Bernd Ruecker
40 visualizações69 slides
Melek BEN MAHMOUD.pdf por
Melek BEN MAHMOUD.pdfMelek BEN MAHMOUD.pdf
Melek BEN MAHMOUD.pdfMelekBenMahmoud
14 visualizações1 slide

Último(20)

PRODUCT LISTING.pptx por angelicacueva6
PRODUCT LISTING.pptxPRODUCT LISTING.pptx
PRODUCT LISTING.pptx
angelicacueva614 visualizações
Serverless computing with Google Cloud (2023-24) por wesley chun
Serverless computing with Google Cloud (2023-24)Serverless computing with Google Cloud (2023-24)
Serverless computing with Google Cloud (2023-24)
wesley chun11 visualizações
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas... por Bernd Ruecker
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
Bernd Ruecker40 visualizações
Melek BEN MAHMOUD.pdf por MelekBenMahmoud
Melek BEN MAHMOUD.pdfMelek BEN MAHMOUD.pdf
Melek BEN MAHMOUD.pdf
MelekBenMahmoud14 visualizações
SUPPLIER SOURCING.pptx por angelicacueva6
SUPPLIER SOURCING.pptxSUPPLIER SOURCING.pptx
SUPPLIER SOURCING.pptx
angelicacueva616 visualizações
Microsoft Power Platform.pptx por Uni Systems S.M.S.A.
Microsoft Power Platform.pptxMicrosoft Power Platform.pptx
Microsoft Power Platform.pptx
Uni Systems S.M.S.A.53 visualizações
Piloting & Scaling Successfully With Microsoft Viva por Richard Harbridge
Piloting & Scaling Successfully With Microsoft VivaPiloting & Scaling Successfully With Microsoft Viva
Piloting & Scaling Successfully With Microsoft Viva
Richard Harbridge12 visualizações
Voice Logger - Telephony Integration Solution at Aegis por Nirmal Sharma
Voice Logger - Telephony Integration Solution at AegisVoice Logger - Telephony Integration Solution at Aegis
Voice Logger - Telephony Integration Solution at Aegis
Nirmal Sharma39 visualizações
virtual reality.pptx por G036GaikwadSnehal
virtual reality.pptxvirtual reality.pptx
virtual reality.pptx
G036GaikwadSnehal14 visualizações
Evolving the Network Automation Journey from Python to Platforms por Network Automation Forum
Evolving the Network Automation Journey from Python to PlatformsEvolving the Network Automation Journey from Python to Platforms
Evolving the Network Automation Journey from Python to Platforms
Network Automation Forum13 visualizações
Case Study Copenhagen Energy and Business Central.pdf por Aitana
Case Study Copenhagen Energy and Business Central.pdfCase Study Copenhagen Energy and Business Central.pdf
Case Study Copenhagen Energy and Business Central.pdf
Aitana16 visualizações
Special_edition_innovator_2023.pdf por WillDavies22
Special_edition_innovator_2023.pdfSpecial_edition_innovator_2023.pdf
Special_edition_innovator_2023.pdf
WillDavies2218 visualizações
Five Things You SHOULD Know About Postman por Postman
Five Things You SHOULD Know About PostmanFive Things You SHOULD Know About Postman
Five Things You SHOULD Know About Postman
Postman36 visualizações
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ... por Jasper Oosterveld
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...
Jasper Oosterveld19 visualizações
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N... por James Anderson
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...
James Anderson92 visualizações
Scaling Knowledge Graph Architectures with AI por Enterprise Knowledge
Scaling Knowledge Graph Architectures with AIScaling Knowledge Graph Architectures with AI
Scaling Knowledge Graph Architectures with AI
Enterprise Knowledge38 visualizações
Mini-Track: AI and ML in Network Operations Applications por Network Automation Forum
Mini-Track: AI and ML in Network Operations ApplicationsMini-Track: AI and ML in Network Operations Applications
Mini-Track: AI and ML in Network Operations Applications
Network Automation Forum10 visualizações

Extending Twitter's Data Platform to Google Cloud

  • 1. Extending Twitter’s Data Platform to Google Cloud 1 Lohit VijayaRenu , Vrushali Channapattan
  • 3. Why Cloud? - Provides a convenient way to test Hadoop changes at scale - Temporarily rapidly grow / shrink - A broader geographical footprint for locality and business continuity - Access to other Google offerings such as BigQuery, CloudML, Cloud DataFlow etc
  • 4. Partly Cloudy A project to extend Data Processing at Twitter from an on-premises only model to a hybrid on-premises and Cloud model
  • 7. Design considerations User Experience Consistency in user experience for on-premises & in cloud data processing Scalability Ability scale out to handle all datasets & all users from day 1 Onboarding Seamless onboarding experience New Avenues Data access in new processing tools in cloud
  • 8. Design principles Authentication Strong authentication for all user and service access to data Authorization Explicit authorization for all user and service access to data Least privileged access Audit Ability to easily determine who performed what actions on the data
  • 9. Workstreams ● Various focus areas across the tech stack ○ Networking ○ GCP config ○ Replication ○ Data Processing Tools ○ Internal services ● Collaboration across teams within Twitter ● Collaboration with Google
  • 10. Partly Cloudy Data Replication Sync Datasets to GCS
  • 11. Data Infrastructure for Analytics ` Hadoop Cluster Data Access Layer Replication Service Retention Service Hadoop Cluster Replication Service Retention Service
  • 12. Extending Replication to GCS DataCenter 2DataCenter 1 Hadoop ClusterM Hadoop ClusterN Hadoop ClusterC Hadoop ClusterZ Hadoop ClusterX-2 Hadoop ClusterL Hadoop ClusterX-1 ● Same dataset available on GCS for users ● Unlock Presto on GCP, Hadoop on GCP, BigQuery and other tools
  • 13. Destination Cluster /ClusterY/logs/partly-cloudy/ 2019/04/10/03 Data Replicator Copy Source Cluster /ClusterX/logs/partly-cloudy/ 2019/04/10/03 Replicator : ClusterY Distcp 2019/04/10/03 DAL Dataset : partly-cloudy /ClusterX/logs/partly-cloudy /ClusterY/logs/partly-cloudy
  • 14. Destination Cluster /ClusterY/logs/partly-cloudy/ 2019/04/10/03 Data Replicator Copy + Merge Source Cluster /ClusterX-2/logs/partly-cloudy/ 2019/04/10/03 Replicator : ClusterY Distcp 2019/04/10/03 DAL Dataset : partly-cloudy /ClusterX-1/logs/partly-cloudy /ClusterX-2/logs/partly-cloudy /ClusterY/logs/partly-cloudy Type : Multiple Src Source Cluster /ClusterX-1/logs/partly-cloudy/ 2019/04/10/03 Distcp 2019/04/10/03 Merge
  • 15. Twitter DataCenter Architecture behind GCS replication Copy Cluster GCS /gcs/logs/partly-cloud /2019/04/10/03 Replicator : GCS DAL Source Cluster /ClusterX/logs/partly-cloudy/ 2019/04/10/03 Distcp Dataset : partly-cloudy /ClusterX/logs/partly-cloudy /gcs/logs/partly-cloudy
  • 16. Merge same dataset on GCS (Multi Region Bucket) Twitter DataCenter X-2 Copy Cluster X-2 /gcs/logs/partly- cloudy/2019/04/ 10/03 Source ClusterX-2 /ClusterX-2/logs/partly- cloudy//2019/04/10/03 Twitter DataCenter X-1 Copy Cluster X-1Source ClusterX-1 /ClusterX-1/logs/partly- cloudy/2019/04/10/03 Distcp Multi Region Bucket Distcp Cloud Storage
  • 17. Dataset via EagleEye ● View different destination for same dataset ● GCS is another destination ● Also shows delay for each hourly partition
  • 18. Partly Cloudy Resource Hierarchy Organization and Project structure
  • 19. Partly Cloudy Resource Hierarchy TWITTER Org DATA INFRA Folder twitter- product twitter-revenue twitter-infraeng GCP Projects
  • 20. Project Dataset bucket User Bucket Google Cloud Storage Connector for Hadoop Google Cloud Storage Connector for Hadoop Nest Name Nodes Worker Nodes Resource Manager Task ViewFS filesystem layer ViewFS filesystem layer Shadow account based access User account based access User account based access Scratch bucket Scrubbed bucket Project contents
  • 21. GCP Project ZGCP Project YGCP Project X Replicators per project Twitter DataCenter Copy Cluster /gcs/dataX/2019/ 04/10/03 /gcs/dataY/2019/ 04/10/03 /gcs/dataZ/2019/ 04/10/03 DistcpDistcp DistcpDistcp DistcpDistcp Replicator X Replicator Y Replicator Z Cloud Storage Cloud Storage Cloud Storage
  • 22. Partly Cloudy Resource Hierarchy Storage in the Cloud
  • 24. RegEx based path resolution <property> <name>fs.viewfs.mounttable.copycluster.linkRegex.replaceresolveddstpath:-:-- ;replaceresolveddstpath:_:-#.^/gcs/logs/(?!((tst|test)(_|-)))(?&lt;dataset&gt;[^/]+)</name> <value>gs://logs.${dataset}</value> </property> <property> <name>fs.viewfs.mounttable.copycluster.linkRegex.replaceresolveddstpath:-:-- ;replaceresolveddstpath:_:-#.^/gcs/user/(?!((tst|test)(_|-)))(?&lt;userName&gt;[^/]+)</name> <value>gs://user.${userName}</value> </property> /gcs/logs/partly- cloudy/2019/04/10 /gcs/user/lohit/hadoop-stats gs://logs.partly- cloudy/2019/04/10 gs://user.lohit/hadoop-stats Twitter ViewFS Path GCS bucket Twitter ViewFS mounttable.xml
  • 25. Bucket on GCS : gs://logs.partly-cloudy Connector Path : /logs/partly-cloudy Twitter Resolved Path : /gcs/logs/partly-cloudy View FileSystem and Google Hadoop Connector Twitter’s View FileSystem Cluster-X Cluster-Y ClusterZ Namespace 1 Namespace 2 Namespace 1 Namespace 1 Namespace 2 DataCenter-1 DataCenter-2 Cloud Storage Connector Replicator Cloud Storage
  • 26. Partly Cloudy Resource Hierarchy User Management
  • 28. Key Management - A new key is generated every N days - Each key is valid for 2N + N days - Keys are distributed to compute nodes by Twitter’s key distribution service - The shadow account key is readable only by that user - Key management & distribution is transparent to the user
  • 29. Partly Cloudy Resource Hierarchy DATA INFRA twitter-[org] twitter- employee-users
  • 30. How do the Data Processing Users at Twitter get to use Partly Cloudy DemiGod Services
  • 31. What are DemiGod services Demigod is a group of service(s) that are responsible for configuring GCP for Twitter’s Data Platform. They run in GCP.
  • 32. Salient features of DemiGods - Run asynchronously of each other. - Run with exactly-scoped, privileged google service accounts - Idempotent runs - Puppet-like functionality. Will override any manual changes - Modular in design - Each kept as simple as possible
  • 33. Twitter infra eng project Twitter product project Partly Cloudy Admin Project Twitter user project bucket-creation -ie org (svc-acc-ie) bucket-creation Product (svc-acc- product) shadow-user- creation policy-granting-ie Key/ Secrets store LDAP/Googl e Groups GCS Config bucket key- rotation/creation Deployment of DemiGods
  • 34. What do the Data Processing Users at Twitter get ❏ Datasets replicated on GCS ❏ A shadow account to access GCS ❏ GCS buckets for their scratch & scrubbed data ❏ Access to a Twitter managed Hadoop cluster in GCP ❏ Access to a Twitter managed Presto cluster in GCP ❏ Exploring other Google offerings (such as BigQuery, DataProc & DataFlow)
  • 35. ● Copied tens of petabytes of data and keeping them in sync ● Tens of different projects with hundreds of buckets ● Complex set of VPC rules ● Hundreds of users using GCP ● Unlocked multiple use cases on GCP Where are we today

Notas do Editor

  1. To transfer data from on-premise to GCS Runs only yarn for GCS transfer, no local data Security Minimal in-DC hosts connect to GCS Networking Dedicated high bandwidth Requires separate dedicated configuration for routing to public end-points Each worker node has two IP addresses. Our DC RFC space that can't be used on the public Internet GCS traffic uses public IP Internal traffic (reading from cluster, observability, puppet, etc) uses internal IP
  2. Data is identified by a dataset name HDFS is the primary storage for Analytics Users configure replication rules for different clusters Dataset also has retention rules defined per cluster Dataset are always represented on fixed interval partitions (hourly/daily) Dataset is defined in system called Data Access Layer (DAL)* Data is made available at different destination using Replicator
  3. Long running daemon (on mesos) Daemon checks configuration and schedules copy on hourly partition Copy jobs are executed as Hadoop distcp jobs Jobs are on destination cluster After hourly copy, publish partition to DAL
  4. Some datasets are collected across multiple DataCenters Replicator would kick off multiple DistCP jobs to copy at tmp location Replicator then merges dataset into single directory and does atomic rename to final destination Renames on HDFS are cheap and atomic, which makes this operation easy
  5. Use same Replicator code to sync data to GCS Utilize ViewFileSystem abstraction to hide GCS /gcs/dataset/2019/04/10/03 maps to gs://dataset.bucket/2019/04/10/03 Use Google Hadoop Connector to interact with GCS using Hadoop APIs Distcp jobs runs on dedicated Copy cluster Create ViewFileSystem mount point on Copy cluster to fake GCS destination Distcp tasks stream data from source HDFS to GCS (no local copy)
  6. Data for same dataset is aggregated at multiple DataCenters (DC x and DC y) Replicators in each DC schedules individual DistCp jobs Data from multiple DC ends up under same path on GCS
  7. UI support via EagleEye to view all replication configurations Properties associated with configuration. Src, dest, owner, email, etc… CLI support to manage replication configurations Load new or modify existing configuration List all configurations Mark active/inactive configurations API support for clients and replicators Rich set of api access for all above operations
  8. GCP Projects are based on organization Deploy separate Replicator with its own credentials per project Shared copy cluster per DataCenter Enables independent updates and reduces risk of errors
  9. Logs vs user path resolution Projects and buckets have standard naming convention Logs at : gs://logs.<category name>.twttr.net/ User data at gs://user.<user name>.twtter.net/ Access to these buckets are via standard path Logs at /gcs/logs/<category name>/ User data at /gcs/user/<user name>/ Typically we need mapping of path prefix to bucket name in Hadoop ViewFileSystem mountable.xml Modified ViewFileSystem to dynamically create mountable mapping on demand since bucket name and path name are standard No configuration or update needed