Anthony Hsu
Staff Software Engineer
Scaling Deep Learning on Hadoop
at LinkedIn
DataWorks Summit, Washington, D.C., May 23, 2019
About Me: Anthony Hsu
• https://www.linkedin.com/in/erwaman/
• Staff Software Engineer at LinkedIn working on the Hadoop Dev team
• Been working in the Hadoop space for 5.5 years on workflow scheduling (Azkaban), dataset access (Dali), and machine learning infra (TonY, this talk)
LinkedIn's Vision
Create economic opportunity for every member of the global workforce
• 630M Members
• 30M Companies
• 20M Jobs
• 50K Skills
• 90K Schools
Machine Learning at LinkedIn
People You May Know
Job Recommendations
News Feed
LinkedIn Learning Recommendations
Why Deep Learning?
Building AI Applications Using Deep Learning
https://blog.easysol.net/building-ai-applications/
• Prediction accuracy of traditional ML models tends to plateau quickly as data increases
• Deep networks continue to improve as data increases
Which framework to use?
Andrej Karpathy, Director of AI at Tesla
https://twitter.com/karpathy/status/972295865187512320
Machine Learning process
• ML process has many parts
• At LinkedIn, we have a Productive ML (Pro-ML) initiative to accelerate this loop. We have teams working on every part of the ML pipeline.
• This talk will focus on model training.
Data Ingestion
Data Preparation
Model Training
Model Deployment
Model Serving
Early days: how AI engineers did training
• Copy code and dependencies to each host
• Manually specify host and port of each process
• Customize arguments for each process
# On ps0.example.com:
$ python trainer.py \
    --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
    --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
    --job_name=ps --task_index=0
# On ps1.example.com:
$ python trainer.py \
    --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
    --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
    --job_name=ps --task_index=1
# On worker0.example.com:
$ python trainer.py \
    --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
    --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
    --job_name=worker --task_index=0
# On worker1.example.com:
$ python trainer.py \
    --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
    --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
    --job_name=worker --task_index=1
Source: https://github.com/tensorflow/examples/blob/master/community/en/docs/deploy/distributed.md
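To make the manual burden concrete, here is a minimal sketch that generates the per-host commands above from a single cluster description. The function name `launch_commands` is hypothetical and introduced only for illustration; the flags are the ones shown on the slide. It only builds command strings; copying code to each host and starting each process would still be manual.

```python
from typing import Dict, List


def launch_commands(cluster: Dict[str, List[str]], script: str = "trainer.py") -> Dict[str, str]:
    """Return one command line per process, keyed by the host it must run on."""
    ps_hosts = ",".join(cluster["ps"])
    worker_hosts = ",".join(cluster["worker"])
    commands = {}
    for job_name, addresses in cluster.items():
        for task_index, address in enumerate(addresses):
            host = address.split(":")[0]
            # Every process gets the full host lists plus its own role and index.
            commands[host] = (
                f"python {script} "
                f"--ps_hosts={ps_hosts} "
                f"--worker_hosts={worker_hosts} "
                f"--job_name={job_name} --task_index={task_index}"
            )
    return commands


cluster = {
    "ps": ["ps0.example.com:2222", "ps1.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
}
commands = launch_commands(cluster)
for host, command in commands.items():
    print(f"# On {host}:\n$ {command}")
```

Every added host changes the arguments of every other process, which is why this approach becomes error-prone as the cluster grows.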
Challenges of scaling up training
• Managing code and dependencies
• Orchestrating distributed training
• Resource contention (especially for GPUs)
• Managing an ML workflow (data preparation, training, deployment)
• Fault tolerance
E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to
allocate 693.00M (726663168 bytes) from device:
CUDA_ERROR_OUT_OF_MEMORY: out of memory
Existing YARN features to leverage
• YARN is Hadoop's scheduler
• YARN supports
○ GPU resources and other resource types
○ Team-based and hierarchical queues
○ Elasticity between queues
○ User-based limits
New and upcoming YARN features useful for ML
• Docker container support productionized in Hadoop 3.x
• YARN Native Service in Hadoop 3.x
• Submarine ML CLI released in Hadoop 3.2.0, now its own Hadoop subproject
How can we do distributed training on YARN?
• Want to take a program developed on a single machine and run it in distributed mode with little or no modification
• Want to take advantage of YARN's features
• Some existing open-source solutions we looked at:
○ Kubeflow (Google)
○ TensorFlow on Spark (Yahoo!)
○ Spark Deep Learning (Databricks)
○ TOY: TensorFlow on YARN (Intel)
○ XLearning (Qihoo)
○ Horovod (Uber)
○ YARN Native Service (in Hadoop 3.x)
Kubeflow + Kubernetes
• Kubeflow is an ML toolkit built on Kubernetes
○ Has a rich ecosystem and active community
• Kubernetes is one of the most popular cluster managers
• Challenges in adopting Kubernetes at LinkedIn
○ Large investment in YARN
■ Many clusters of 1000s of nodes (our largest is ~6000)
■ Expertise and tooling for YARN
○ Scalability: "No more than 5000 nodes" (https://kubernetes.io/docs/setup/cluster-large/)
○ Need to integrate with Hadoop security (Kerberos and Hadoop delegation tokens)
○ Lack of hierarchical namespaces
Spark-based solutions
• TensorFlow on Spark (Yahoo!)
• Spark Deep Learning (Databricks)
• Pros
○ Integrates well with native Spark processing
• Cons
○ GPU resource requests not supported until Spark 3.0 (SPARK-20327)
○ No heterogeneous resource support (e.g., more memory + GPUs for workers, less memory + only CPUs for parameter servers)
YARN-native solutions
• TOY: TensorFlow on YARN (Intel)
• XLearning (Qihoo)
• Pros
○ Works with YARN out-of-the-box
• Cons
○ No GPU resource support
Horovod
• Horovod (Uber)
• Wraps existing optimizer to allow synchronous distributed training
• Works with many frameworks (TensorFlow, PyTorch, Keras, MXNet)
• Uses MPI or NCCL for communication
○ Multi-node MPI on YARN requires Docker containers running sshd daemons
YARN Native Service
• YARN Native Service (available in Hadoop 3.x)
• Configure distributed training jobs via XML, YAML, or JSON config file
• Distributed TensorFlow requires deploying YARN DNS Registry and ZooKeeper
• Relatively new; LinkedIn is still on Hadoop 2.x
Summary of open-source solutions
• Kubeflow / Kubernetes (Google)
○ Pros: large marketplace of libraries and plugins; active community
○ Cons: does not run on Hadoop; may not scale to very large clusters
• TensorFlow on Spark (Yahoo!) and Spark Deep Learning (Databricks)
○ Pros: integrates with Spark
○ Cons: no GPU resource support until Spark 3.0 (SPARK-20327); no heterogeneous resource support
• TOY: TensorFlow on YARN (Intel) and XLearning (Qihoo)
○ Pros: YARN native, works out-of-the-box
○ Cons: no GPU resource support
• Horovod (Uber)
○ Pros: supports synchronous distributed training
○ Cons: MPI on YARN requires Docker
• YARN Native Service
○ Pros: YARN native
○ Cons: distributed TensorFlow requires YARN DNS Registry and ZooKeeper
Building our own solution: TonY
• TonY is a YARN application for running distributed ML jobs
• We started with TensorFlow support (hence TensorFlow on YARN (TonY))
• Now we also support PyTorch and Horovod (so perhaps Things on YARN is more apt)
A Comparison of MapReduce, Spark, and TonY
• MapReduce
○ 2 task types
○ Map tasks connected to Reduce tasks
• Spark
○ 1 task type
○ All connected to all
• TonY
○ N task types
○ Heterogeneous connections
[Diagram: Map and Reduce tasks (MapReduce); Spark executors (Spark); Foo, Bar, Baz, and Qux tasks (TonY)]
TonY supports many different models
• Parallel tasks, no communication (diagram: five Scoring tasks)
• Worker + Parameter Server Model (diagram: three Worker tasks and two Parameter server tasks)
• Ring All-Reduce Model (diagram: four Worker tasks)
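As an illustration of the Ring All-Reduce Model named above, the following is a minimal single-process simulation of the algorithm, assuming a sum reduction and equal-length vectors whose length is divisible by the number of workers. It is a teaching sketch only: real systems such as Horovod exchange chunks over MPI or NCCL rather than Python lists, and the function name `ring_all_reduce` is hypothetical.

```python
from typing import List


def ring_all_reduce(vectors: List[List[float]]) -> List[List[float]]:
    """Simulate ring all-reduce (sum) over one equal-length vector per worker."""
    n = len(vectors)
    if n == 0:
        return []
    length = len(vectors[0])
    if any(len(v) != length for v in vectors) or length % n != 0:
        raise ValueError("vectors must share a length divisible by the worker count")
    size = length // n
    # data[w][c] is worker w's local copy of chunk c.
    data = [[list(v[c * size:(c + 1) * size]) for c in range(n)] for v in vectors]

    # Phase 1, scatter-reduce: in each step every worker sends one chunk to its
    # right-hand neighbour, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        sent = [(w, (w - step) % n, data[w][(w - step) % n][:]) for w in range(n)]
        for w, c, chunk in sent:
            dst = (w + 1) % n
            data[dst][c] = [a + b for a, b in zip(data[dst][c], chunk)]

    # Phase 2, all-gather: worker w now holds the fully reduced chunk (w + 1) % n
    # and completed chunks are passed around the ring, overwriting stale copies.
    for step in range(n - 1):
        sent = [(w, (w + 1 - step) % n, data[w][(w + 1 - step) % n][:]) for w in range(n)]
        for w, c, chunk in sent:
            data[(w + 1) % n][c] = chunk

    return [[x for chunk in worker for x in chunk] for worker in data]


result = ring_all_reduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
print(result)
```

Each worker sends and receives only one chunk per step, so per-worker traffic stays roughly constant as workers are added; that property is what makes the ring layout attractive compared with funnelling all gradients through parameter servers.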
TonY also supports more exotic setups
• Worker-PS with chief worker and evaluator (diagram: three Worker tasks, two Parameter server tasks, a Chief worker task, and an Evaluator task)
• Ring All-Reduce with in-memory distributed hash table (DHT) (diagram: four Worker tasks and three DHT tasks)
TonY supports multiple frameworks
TonY under the hood
[Architecture diagram]
• TonY Client and YARN ResourceManager
• TonY ApplicationMaster, in a YARN container
• Three TonY Task Executors, each in a YARN container
• Each Task Executor hosts a TensorFlow task: two Worker Tasks and one Parameter Server Task
Legend: TonY component, TensorFlow component, YARN component, YARN container
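The slides do not spell out how the TensorFlow tasks in the diagram learn each other's addresses. One plausible flow, sketched below purely as an assumption, is that each task executor reports a host:port to the ApplicationMaster, which hands back the full cluster spec once every task has registered. The class and function names are hypothetical and this is not TonY's actual code; the only external convention used is TensorFlow's `TF_CONFIG` JSON layout (`cluster` plus `task` with `type` and `index`).

```python
import json
from typing import Dict, List, Optional


class ApplicationMasterSketch:
    """Hypothetical stand-in for the coordinating role of an ApplicationMaster."""

    def __init__(self, instances: Dict[str, int]) -> None:
        # Expected number of tasks per type, e.g. {"worker": 2, "ps": 1}.
        self.slots: Dict[str, List[Optional[str]]] = {
            job: [None] * count for job, count in instances.items()
        }

    def register(self, job: str, index: int, address: str) -> None:
        """Record the host:port that a task executor reserved for its task."""
        self.slots[job][index] = address

    def cluster_spec(self) -> Optional[Dict[str, List[Optional[str]]]]:
        """Return the full spec once every task has registered, else None."""
        if any(addr is None for addrs in self.slots.values() for addr in addrs):
            return None
        return {job: list(addrs) for job, addrs in self.slots.items()}


def tf_config(spec: Dict[str, List[Optional[str]]], job: str, index: int) -> str:
    """Build the TF_CONFIG value a task executor could export for its task."""
    return json.dumps({"cluster": spec, "task": {"type": job, "index": index}})


am = ApplicationMasterSketch({"worker": 2, "ps": 1})
am.register("worker", 0, "host-a:4000")
am.register("ps", 0, "host-c:4002")
incomplete = am.cluster_spec()  # one worker has not registered yet
am.register("worker", 1, "host-b:4001")
spec = am.cluster_spec()
config = json.loads(tf_config(spec, "worker", 1))
print(config)
```

Centralizing registration this way would remove the need to hand-write host lists for every process, which is the pain point of the manual approach described earlier in the talk.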
Related YARN changes
• Backport of GPU support to Hadoop 2.x (YARN-8200)
• Support for updating tracking URL (YARN-7974)
○ Contributed to Hadoop 2.x and 3.x
Using TonY
• TonY client lets you easily launch a job with only a few required arguments
java -cp `hadoop classpath`:tony-cli-0.3.7-all.jar \
    com.linkedin.tony.cli.ClusterSubmitter \
    --python_venv=venv.zip \
    --python_binary_path=Python/bin/python \
    --src_dir=src \
    --executes=my_model.py \
    --conf_file=tony-test.xml
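If the submission needs to be scripted, the same command can be assembled as an argument list. This is a minimal sketch: the helper name `cluster_submitter_argv` is hypothetical, the classpath string is a placeholder for the output of `hadoop classpath`, and only the flags shown on the slide are used. It builds the arguments without running anything.

```python
from typing import List


def cluster_submitter_argv(hadoop_classpath: str, tony_jar: str, **options: str) -> List[str]:
    """Build the java argument list for the ClusterSubmitter command."""
    argv = [
        "java",
        "-cp",
        f"{hadoop_classpath}:{tony_jar}",
        "com.linkedin.tony.cli.ClusterSubmitter",
    ]
    # Keyword arguments become --name=value flags in the order given.
    argv += [f"--{name}={value}" for name, value in options.items()]
    return argv


argv = cluster_submitter_argv(
    "/path/to/hadoop/classpath",  # placeholder, not a real classpath
    "tony-cli-0.3.7-all.jar",
    python_venv="venv.zip",
    python_binary_path="Python/bin/python",
    src_dir="src",
    executes="my_model.py",
    conf_file="tony-test.xml",
)
print(" ".join(argv))
```

The resulting list could be passed to `subprocess.run(argv, check=True)` on a machine where Java, Hadoop, and the TonY jar are available.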
Using TonY
• For a list of all configurations, see https://github.com/linkedin/TonY/wiki/TonY-Configurations
• Example configuration file:
<configuration>
  <property>
    <name>tony.worker.instances</name>
    <value>3</value>
  </property>
  <property>
    <name>tony.worker.gpus</name>
    <value>1</value>
  </property>
  <property>
    <name>tony.ps.instances</name>
    <value>1</value>
  </property>
</configuration>
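When job sizes vary from run to run, a configuration file like the one on this slide can be generated rather than edited by hand. The sketch below uses only Python's standard `xml.etree.ElementTree` module; the helper name `build_tony_conf` is hypothetical, and the three property names are the ones shown in the example configuration.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def build_tony_conf(properties: Dict[str, str]) -> str:
    """Serialize name/value pairs in the Hadoop-style <configuration> layout."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


conf_xml = build_tony_conf({
    "tony.worker.instances": "3",
    "tony.worker.gpus": "1",
    "tony.ps.instances": "1",
})
print(conf_xml)
```

The output is a single unindented line of XML, equivalent in content to the hand-written example; writing it to a file such as tony-test.xml would let it be passed via --conf_file.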
Using TonY
$ java ... com.linkedin.tony.cli.ClusterSubmitter ...
...
INFO impl.YarnClientImpl: Submitted application application_XXX
INFO tony.TonyClient: URL to track running application
(will proxy to TensorBoard once it has started): http://...
INFO tony.TonyClient: ResourceManager web address for application: http://...
...
INFO tony.TonyClient: Logs for ps 0 at: http://...
INFO tony.TonyClient: Logs for worker 0 at: http://...
INFO tony.TonyClient: Logs for worker 1 at: http://...
INFO tony.TonyClient: Logs for worker 2 at: http://...
TonY Portal for accessing job events and configs
Using TonY to launch notebooks and tools on demand
• TonY can be used to launch
○ Jupyter notebooks
○ TensorBoard
○ MLflow
○ etc.
• Run any Python virtual environment, PEX, or shiv
• Run any Docker image
TonY is open-source
• Open-source repo: https://github.com/linkedin/tony
○ Contributions welcome!
• OpML '19 paper: https://arxiv.org/abs/1904.01631 (presented on May 20, 2019, 3 days before this talk)
• LinkedIn engineering blog post: https://bit.ly/2O6L5WD
TonY integrations with other projects
Azkaban workflow scheduler integration
• Azkaban is a workflow scheduler for Hadoop
• Run TonY jobs inside a workflow that includes Spark and other data processing jobs
TonY job tuning recommendations by Dr. Elephant
• Dr. Elephant is a job tuning and performance analysis tool for Hadoop jobs.
Run TonY on Google Cloud DataProc
• DataProc lets you run Hadoop and Spark on Google's Cloud
• TonY setup script for DataProc:
https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/tree/master/tony
• TonY on DataProc blog post: https://bit.ly/2HEYemT
TonY runtime for Hadoop Submarine
• Submarine is a deep learning CLI for Hadoop
• TonY is a supported runtime implementation for Submarine (SUBMARINE-40, in Submarine 0.2.0)
TonY on Microsoft Azure HDInsight (coming soon)
• HDInsight lets you run open-source frameworks on Azure, including Hadoop, Spark, and Kafka
• TonY integration is coming soon
Demo
• Live demo using TonY Client from CLI
• Video of using TonY job in Azkaban: https://youtu.be/DM89y8BGFaY
Future Work
• GPU metrics + tuning suggestions for Dr. Elephant
• Expand TonY Portal to support launching notebooks, visualization, and managing experiments
• TonY CLI + Python library
• TonY support on Azure HDInsight
• TonY support for other ML frameworks, schedulers, and cloud services
Thank you!
Questions?

UiPath Community: Communication Mining from Zero to Hero
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 

Scaling Deep Learning on Hadoop at LinkedIn

  • 1. Anthony Hsu Staff Software Engineer Scaling Deep Learning on Hadoop at LinkedIn DataWorks Summit, Washington, D.C., May 23, 2019
  • 2. About Me: Anthony Hsu • https://www.linkedin.com/in/erwaman/ • Staff Software Engineer at LinkedIn working on the Hadoop Dev team • Been working in the Hadoop space for 5.5 years on workflow scheduling (Azkaban), dataset access (Dali), machine learning infra (TonY, this talk)
  • 3. LinkedIn's Vision Create economic opportunity for every member of the global workforce 630M Members 30M Companies 20M Jobs 50K Skills 90K Schools
  • 4. Machine Learning at LinkedIn • People You May Know • Job Recommendations • News Feed • LinkedIn Learning Recommendations
  • 5. Why Deep Learning? (Building AI Applications Using Deep Learning, https://blog.easysol.net/building-ai-applications/) • Prediction accuracy of traditional ML models tends to plateau quickly as data increases • Deep networks continue to improve as data increases
  • 6. Which framework to use? (Andrej Karpathy, Director of AI at Tesla, https://twitter.com/karpathy/status/972295865187512320)
  • 7. Machine Learning process • ML process has many parts • Data Ingestion, Data Preparation, Model Training, Model Deployment, Model Serving
  • 8. Machine Learning process • ML process has many parts • At LinkedIn, we have a Productive ML (Pro-ML) initiative to accelerate this loop. We have teams working on every part of the ML pipeline. • Data Ingestion, Data Preparation, Model Training, Model Deployment, Model Serving
  • 9. Machine Learning process • ML process has many parts • At LinkedIn, we have a Productive ML (Pro-ML) initiative to accelerate this loop. We have teams working on every part of the ML pipeline. • This talk will focus on model training. • Data Ingestion, Data Preparation, Model Training, Model Deployment, Model Serving
  • 10. Early days: how AI engineers did training • Copy code and dependencies to each host • Manually specify host and port of each process • Customize arguments for each process
      # On ps0.example.com:
      $ python trainer.py --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 --job_name=ps --task_index=0
      # On ps1.example.com:
      $ python trainer.py --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 --job_name=ps --task_index=1
      # On worker0.example.com:
      $ python trainer.py --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 --job_name=worker --task_index=0
      # On worker1.example.com:
      $ python trainer.py --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 --job_name=worker --task_index=1
      Source: https://github.com/tensorflow/examples/blob/master/community/en/docs/deploy/distributed.md
  • 11. Challenges of scaling up training • Managing code and dependencies • Orchestrating distributed training • Resource contention (especially for GPUs) • Managing an ML workflow (data preparation, training, deployment) • Fault tolerance
      E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 693.00M (726663168 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
  • 12. Existing YARN features to leverage • YARN is Hadoop's scheduler
  • 13. Existing YARN features to leverage • YARN is Hadoop's scheduler • YARN supports ○ GPU resources and other resource types
  • 14. Existing YARN features to leverage • YARN is Hadoop's scheduler • YARN supports ○ GPU resources and other resource types ○ Team-based and hierarchical queues
  • 15. Existing YARN features to leverage • YARN is Hadoop's scheduler • YARN supports ○ GPU resources and other resource types ○ Team-based and hierarchical queues ○ Elasticity between queues
  • 16. Existing YARN features to leverage • YARN is Hadoop's scheduler • YARN supports ○ GPU resources and other resource types ○ Team-based and hierarchical queues ○ Elasticity between queues ○ User-based limits
  • 17. New and upcoming YARN features useful for ML • Docker container support productionized in Hadoop 3.x • YARN Native Service in Hadoop 3.x • Submarine ML CLI released in Hadoop 3.2.0, now its own Hadoop subproject
  • 18. How can we do distributed training on YARN? • Want to take a program developed on a single machine and run it in distributed mode with little or no modification • Want to take advantage of YARN's features • Some existing open-source solutions we looked at: ○ Kubeflow (Google) ○ TensorFlow on Spark (Yahoo!) ○ Spark Deep Learning (Databricks) ○ TOY: TensorFlow on YARN (Intel) ○ XLearning (Qihoo) ○ Horovod (Uber) ○ YARN Native Service (in Hadoop 3.x)
  • 19. Kubeflow + Kubernetes • Kubeflow is an ML toolkit built on Kubernetes ○ Has a rich ecosystem and active community • Kubernetes is one of the most popular cluster managers • Challenges in adopting Kubernetes at LinkedIn ○ Large investment in YARN ■ Many clusters of 1000s of nodes (our largest is ~6000 nodes) ■ Expertise and tooling for YARN ○ Scalability: "No more than 5000 nodes" (https://kubernetes.io/docs/setup/cluster-large/) ○ Need to integrate with Hadoop security (Kerberos and Hadoop delegation tokens) ○ Lack of hierarchical namespaces
  • 20. Spark-based solutions • TensorFlow on Spark (Yahoo!) • Spark Deep Learning (Databricks) • Pros ○ Integrates well with native Spark processing • Cons ○ GPU resource requests not supported until Spark 3.0 (SPARK-20327) ○ No heterogeneous resource support (e.g., more memory + GPUs for workers, less memory + only CPUs for parameter servers)
  • 21. YARN-native solutions • TOY: TensorFlow on YARN (Intel) • XLearning (Qihoo) • Pros ○ Works with YARN out-of-the-box • Cons ○ No GPU resource support
  • 22. Horovod • Horovod (Uber) • Wraps existing optimizer to allow synchronous distributed training • Works with many frameworks (TensorFlow, PyTorch, Keras, MXNet) • Uses MPI or NCCL for communication ○ Multi-node MPI on YARN requires Docker containers running sshd
  • 23. YARN Native Service • YARN Native Service (available in Hadoop 3.x) • Configure distributed training jobs via XML, YAML, or JSON config file • Distributed TensorFlow requires deploying YARN DNS Registry and ZooKeeper • Relatively new; LinkedIn is still on Hadoop 2.x
  • 24. Summary of open-source solutions
      ○ Kubeflow / Kubernetes (Google) ■ Pros: large marketplace of libraries and plugins; active community ■ Cons: does not run on Hadoop; may not scale to very large clusters
      ○ TensorFlow on Spark (Yahoo!), Spark Deep Learning (Databricks) ■ Pros: integrates with Spark ■ Cons: no GPU resource support until Spark 3.0 (SPARK-20327); no heterogeneous resource support
      ○ TOY: TensorFlow on YARN (Intel), XLearning (Qihoo) ■ Pros: YARN native, works out-of-the-box ■ Cons: no GPU resource support
      ○ Horovod (Uber) ■ Pros: supports synchronous distributed training ■ Cons: MPI on YARN requires Docker
      ○ YARN Native Service ■ Pros: YARN native ■ Cons: distributed TensorFlow requires YARN DNS Registry and ZooKeeper
  • 25. Building our own solution: TonY • TonY is a YARN application for running distributed ML jobs • We started with TensorFlow support (hence TensorFlow on YARN (TonY)) • Now we also support PyTorch and Horovod (so perhaps Things on YARN is more apt)
  • 26. A Comparison of MapReduce, Spark, and TonY • MapReduce ○ 2 task types (Map tasks, Reduce tasks) ○ Map tasks connected to Reduce tasks • Spark ○ 1 task type (Spark executors) ○ All connected to all • TonY ○ N task types (e.g., Foo, Bar, Baz, and Qux tasks) ○ Heterogeneous connections
  • 27. TonY supports many different models • Parallel tasks, no communication (scoring tasks) • Worker + Parameter Server Model (worker tasks, parameter server tasks) • Ring All-Reduce Model (worker tasks)
  • 28. TonY also supports more exotic setups • Worker-PS with chief worker and evaluator (worker tasks, parameter server tasks, chief worker task, evaluator task) • Ring All-Reduce with in-memory distributed hash table (DHT) (worker tasks, DHT tasks)
  • 29. TonY supports multiple frameworks
  • 30. TonY under the hood
  • 31. TonY under the hood • Diagram: TonY Client, YARN ResourceManager • Legend: TonY component, YARN component
  • 32. TonY under the hood • Diagram: TonY Client, YARN ResourceManager, TonY ApplicationMaster • Legend: TonY component, YARN component, YARN container
  • 33. TonY under the hood • Diagram: TonY Client, YARN ResourceManager, TonY ApplicationMaster, three TonY Task Executors • Legend: TonY component, YARN component, YARN container
  • 34. TonY under the hood • Diagram: TonY Client, YARN ResourceManager, TonY ApplicationMaster, three TonY Task Executors paired with two TensorFlow Worker Tasks and one TensorFlow Parameter Server Task • Legend: TonY component, TensorFlow component, YARN component, YARN container
  • 35. TonY under the hood • Diagram: TonY Client, YARN ResourceManager, TonY ApplicationMaster, three TonY Task Executors paired with two TensorFlow Worker Tasks and one TensorFlow Parameter Server Task • Legend: TonY component, TensorFlow component, YARN component, YARN container
  • 36. TonY under the hood • Diagram: TonY Client, YARN ResourceManager, TonY ApplicationMaster, three TonY Task Executors paired with two TensorFlow Worker Tasks and one TensorFlow Parameter Server Task • Legend: TonY component, TensorFlow component, YARN component, YARN container
  • 38. Related YARN changes • Backport of GPU support to Hadoop 2.x (YARN-8200)
  • 39. Related YARN changes • Backport of GPU support to Hadoop 2.x (YARN-8200) • Support for updating tracking URL (YARN-7974) ○ Contributed to Hadoop 2.x and 3.x
  • 40. Using TonY • TonY client lets you easily launch a job with only a few required arguments
      java -cp `hadoop classpath`:tony-cli-0.3.7-all.jar com.linkedin.tony.cli.ClusterSubmitter \
        --python_venv=venv.zip \
        --python_binary_path=Python/bin/python \
        --src_dir=src \
        --executes=my_model.py \
        --conf_file=tony-test.xml
  • 41. Using TonY • For a list of all configurations, see https://github.com/linkedin/TonY/wiki/TonY-Configurations • Example configuration file:
      <configuration>
        <property>
          <name>tony.worker.instances</name>
          <value>3</value>
        </property>
        <property>
          <name>tony.worker.gpus</name>
          <value>1</value>
        </property>
        <property>
          <name>tony.ps.instances</name>
          <value>1</value>
        </property>
      </configuration>
  • 42. Using TonY
      $ java ... com.linkedin.tony.cli.ClusterSubmitter ...
      ...
      INFO impl.YarnClientImpl: Submitted application application_XXX
      INFO tony.TonyClient: URL to track running application (will proxy to TensorBoard once it has started): http://...
      INFO tony.TonyClient: ResourceManager web address for application: http://...
      ...
      INFO tony.TonyClient: Logs for ps 0 at: http://...
      INFO tony.TonyClient: Logs for worker 0 at: http://...
      INFO tony.TonyClient: Logs for worker 1 at: http://...
      INFO tony.TonyClient: Logs for worker 2 at: http://...
  • 43. TonY Portal for accessing job events and configs
  • 44. Using TonY to launch notebooks and tools on demand • TonY can be used to launch ○ Jupyter notebooks ○ TensorBoard ○ MLflow ○ etc. • Run any Python virtual environment, PEX, or shiv • Run any Docker image
  • 45. TonY is open-source • Open-source repo: https://github.com/linkedin/tony ○ Contributions welcome! • OpML '19 paper: https://arxiv.org/abs/1904.01631 (presented 3 days ago) • LinkedIn engineering blog post: https://bit.ly/2O6L5WD
  • 46. TonY integrations with other projects
  • 47. Azkaban workflow scheduler integration • Azkaban is a workflow scheduler for Hadoop • Run TonY jobs inside a workflow that includes Spark and other data processing jobs
  • 48. TonY job tuning recommendations by Dr. Elephant • Dr. Elephant is a job tuning and performance analysis tool for Hadoop jobs.
  • 49. Run TonY on Google Cloud DataProc • DataProc lets you run Hadoop and Spark on Google's Cloud • TonY setup script for DataProc: https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/tree/master/tony • TonY on DataProc blog post: https://bit.ly/2HEYemT
  • 50. TonY runtime for Hadoop Submarine • Submarine is a deep learning CLI for Hadoop • TonY is a supported runtime implementation for Submarine (SUBMARINE-40, in Submarine 0.2.0)
  • 51. TonY on Microsoft Azure HDInsight (coming soon) • HDInsight lets you run open-source frameworks on Azure, including Hadoop, Spark, and Kafka • TonY integration is coming soon
  • 52. Demo • Live demo using TonY Client from CLI • Video of using TonY job in Azkaban: https://youtu.be/DM89y8BGFaY
  • 53. Future Work • GPU metrics + tuning suggestions for Dr. Elephant • Expand TonY Portal to support launching notebooks, visualization, and managing experiments • TonY CLI + Python library • TonY support on Azure HDInsight • TonY support for other ML frameworks, schedulers, and cloud services
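
The example configuration file on slide 41 is a Hadoop-style list of name/value properties, so it can be generated rather than written by hand. The following is a minimal sketch in Python using only the standard library. The helper name build_tony_conf is hypothetical and not part of TonY; the three property names are the ones shown on slide 41, and any other property would need to be checked against the TonY configuration wiki.

```python
import xml.etree.ElementTree as ET


def build_tony_conf(properties):
    """Return a Hadoop-style <configuration> XML string.

    `properties` maps property names to values. This helper is a
    hypothetical illustration, not a TonY API.
    """
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    # Pretty-print when available (ET.indent was added in Python 3.9).
    if hasattr(ET, "indent"):
        ET.indent(root, space="  ")
    return ET.tostring(root, encoding="unicode")


# Same settings as the slide 41 example: 3 workers with 1 GPU each,
# plus 1 parameter server.
conf_xml = build_tony_conf({
    "tony.worker.instances": 3,
    "tony.worker.gpus": 1,
    "tony.ps.instances": 1,
})
print(conf_xml)
```

The resulting string could be written to a file such as tony-test.xml and passed via the --conf_file argument shown on slide 40.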