1. Scaling TensorFlow to 100s of GPUs with Spark and Hops Hadoop
Global AI Conference, Santa Clara, January 18th 2018
Hops
Jim Dowling
Associate Prof @ KTH
Senior Researcher @ RISE SICS
CEO @ Logical Clocks AB
2. AI Hierarchy of Needs
- DDL (Distributed Deep Learning)
- Deep Learning, RL, Automated ML
- A/B Testing, Experimentation, ML
- B.I. Analytics, Metrics, Aggregates, Features, Training/Test Data
- Reliable Data Pipelines, ETL, Unstructured and Structured Data Storage, Real-Time Data Ingestion
[Adapted from https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007?gi=7e13a696e469 ]
3. AI Hierarchy of Needs
[Same pyramid, annotated: the lower layers serve Analytics, the upper layers Prediction.]
[Adapted from https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007?gi=7e13a696e469 ]
4. AI Hierarchy of Needs
[Same pyramid, with Hops overlaid across the layers.]
[Adapted from https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007?gi=7e13a696e469 ]
5. More Data means Better Predictions
[Chart: prediction performance vs. amount of labelled data, 1980s-2020s. Deep neural nets keep improving as labelled data grows; hand-crafted traditional AI can outperform them only when data is scarce.]
6. What about More Compute?
“Methods that scale with computation are the future of AI”*
- Rich Sutton (a founding father of Reinforcement Learning)
2018-01-19 6/46
* https://www.youtube.com/watch?v=EeMCEQa85tw
7. More Compute should mean Faster Training
[Chart: training performance vs. available compute, 2015-2018. Distributed training keeps scaling where single-host training levels off.]
8. Reduce DNN Training Time from 2 weeks to 1 hour
In 2017, Facebook reduced training time on ImageNet for a CNN from 2 weeks to 1 hour by scaling out to 256 GPUs using Ring-AllReduce on Caffe2.
https://arxiv.org/abs/1706.02677
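The paper behind this result pairs large minibatches with a linear learning-rate scaling rule (plus warmup). A minimal sketch of the scaling arithmetic; the function name here is illustrative, not from any library:

```python
# Linear scaling rule from https://arxiv.org/abs/1706.02677: when the
# minibatch size is multiplied by k, multiply the learning rate by k
# (combined in the paper with a gradual warmup over the first epochs).

def scaled_lr(base_lr: float, base_batch: int, batch: int) -> float:
    """Learning rate after linear scaling from a reference batch size."""
    return base_lr * (batch / base_batch)

# 256 GPUs x 32 images per GPU = 8192 images per minibatch
print(scaled_lr(0.1, 256, 8192))  # -> 3.2
```

With the paper's reference setting (lr 0.1 at batch 256), the 256-GPU run trains at lr 3.2.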
9. DNN Training Time and Researcher Productivity
•Distributed Deep Learning
- Interactive analysis! Instant gratification!
•Single-Host Deep Learning
- “My Model’s Training.” (suffer from Google-Envy)
15. NVLink vs PCI-E Single Root Complex
On a single host (distributed training), the bus can be the bottleneck: NVLink – 80 GB/s vs PCI-E – 16 GB/s.
[Images from: https://www.microway.com/product/octoputer-4u-10-gpu-server-single-root-complex/ ]
16. Scale: Remove Bus and Network Bandwidth Bottlenecks
Only one slow worker, bus, or network link is needed to bottleneck DNN training time.
Ring-AllReduce
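What Ring-AllReduce computes can be simulated in a single process with plain Python (a sketch, not any framework's implementation; it assumes the vector length divides evenly by the worker count): every worker ends up with the element-wise sum of all gradients, and each worker sends the same constant amount of data per step regardless of cluster size.

```python
# Single-process simulation of Ring-AllReduce among n workers.
# Each worker's vector is split into n chunks; n-1 scatter-reduce steps
# sum chunks around the ring, then n-1 allgather steps distribute the
# reduced chunks. Per-worker traffic is independent of n.

def ring_allreduce(vectors):
    n = len(vectors)
    chunk = len(vectors[0]) // n          # assumes even division
    chunks = [[v[i*chunk:(i+1)*chunk] for i in range(n)] for v in vectors]

    # Scatter-reduce: worker i sends chunk (i - step) % n to its ring
    # neighbour (i + 1) % n, which adds it element-wise to its own copy.
    for step in range(n - 1):
        for i in range(n):
            src = (i - step) % n
            dst = (i + 1) % n
            chunks[dst][src] = [a + b for a, b in
                                zip(chunks[dst][src], chunks[i][src])]

    # Allgather: the fully reduced chunks circulate once more, so every
    # worker ends with the complete summed vector.
    for step in range(n - 1):
        for i in range(n):
            src = (i + 1 - step) % n
            dst = (i + 1) % n
            chunks[dst][src] = list(chunks[i][src])

    return [[x for c in w for x in c] for w in chunks]
```

Dividing the result by the worker count gives the averaged gradient used for the SGD step; the point of the ring topology is that no single link or parameter server has to carry all the traffic.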
17. The Cloud is full of Bottlenecks…
[Chart: training performance vs. available compute. On-premise InfiniBand clusters keep scaling; public cloud networking (10 GbE) flattens out.]
18. Deep Learning Hierarchy of Scale (Training Time for ImageNet)
- Single GPU: Weeks
- Many GPUs on a Single GPU Server: Days
- Parallel Experiments on GPU Servers: Days/Hours
- DDL with GPU Servers and Parameter Servers: Hours
- DDL AllReduce on GPU Servers: Minutes
19. Lots of good GPUs > A few great GPUs
100 x Nvidia 1080Ti (DeepLearning11) vs 8 x Nvidia P/V100 (DGX-1)
Both the top option (100 GPUs) and the bottom option (8 GPUs) cost the same: $150K (2017).
20. Consumer GPU Server $15K (10 x 1080Ti)
https://www.oreilly.com/ideas/distributed-tensorflow
21. Cluster of Commodity GPU Servers
InfiniBand interconnect; max 1-2 GPU servers per rack (2-4 kW per server)
27. Don’t do this: Different Clusters for Big Data and ML
28. Hops: Single ML and Big Data Cluster
[Diagram: one cluster shared by Data Engineering and Data Science across projects (Project1 … ProjectN), combining a DataLake, GPUs, Compute, Kafka, and Elasticsearch.]
29. HopsFS: Next Generation HDFS*
- 16x faster (throughput)
- 37x bigger (number of files)
- Scale Challenge Winner (2017)
*https://www.usenix.org/conference/fast17/technical-sessions/presentation/niazi
30. Size Matters: Improving the Performance of Small Files in HDFS. Salman Niazi, Seif Haridi, Jim Dowling. Poster, EuroSys 2017.
HopsFS now stores small files in the DB.
31. GPUs supported as a Resource in Hops 2.8.2*
Hops is the only Hadoop distribution to support GPUs-as-a-Resource.
*Robin Andersson, GPU Integration for Deep Learning on YARN, MSc Thesis, 2017
32. GPU Resource Requests in Hops
HopsYARN/HopsFS resource requests:
- 4 GPUs on any host
- 10 GPUs on 1 host
- 100 GPUs on 10 hosts with ‘Infiniband’
- 20 GPUs on 2 hosts with ‘Infiniband_P100’
A mix of commodity GPUs and more powerful GPUs is good for (1) parallel experiments and (2) distributed training.
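What these requests express can be sketched as plain data (a hedged illustration only; the field names are mine, not the actual HopsYARN API): a GPU count, an optional host count to spread over, and an optional node label restricting placement.

```python
# Hypothetical encoding of the GPU requests listed on this slide.
# "label" restricts placement to hosts carrying that node label.
requests = [
    {"gpus": 4,   "hosts": None, "label": None},  # any host
    {"gpus": 10,  "hosts": 1,    "label": None},
    {"gpus": 100, "hosts": 10,   "label": "Infiniband"},
    {"gpus": 20,  "hosts": 2,    "label": "Infiniband_P100"},
]

def gpus_per_host(req):
    """GPUs each host must provide when the request pins a host count."""
    return req["gpus"] // req["hosts"] if req["hosts"] else None
```

For example, the 100-GPU request implies 10 GPUs on each of 10 InfiniBand-labelled hosts.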
33. Hopsworks Data Platform
[Diagram: Develop → Train → Test → Serve. Hopsworks (REST API; Projects, Datasets, Users) runs Spark, Flink, TensorFlow on HopsFS / YARN, with Jupyter, Zeppelin, Jobs, Kibana, and Grafana, backed by MySQL Cluster, Hive, InfluxDB, ElasticSearch, and Kafka.]
34. Python is a First-Class Citizen in Hopsworks
www.hops.io
42. Hops API
•Java/Scala library
- Secure streaming analytics with Kafka/Spark/Flink
- SSL/TLS certs, Avro schemas, endpoints for Kafka/Hopsworks/etc.
•Python library
- Managing TensorBoard; loading/saving models in HopsFS
- Distributed TensorFlow in Python
- Parameter sweeps for parallel experiments
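The parameter-sweep idea can be sketched with the standard library alone (the real hops Python API differs; `grid` here is illustrative): expand a hyperparameter space into independent experiments, each of which the platform would run as its own TensorFlow task on a GPU.

```python
from itertools import product

def grid(**space):
    """Expand lists of hyperparameter values into one dict per combination."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*space.values())]

experiments = grid(learning_rate=[0.01, 0.001], dropout=[0.3, 0.5])
# 2 x 2 = 4 combinations, e.g. {'learning_rate': 0.01, 'dropout': 0.3},
# each runnable as an independent parallel training job
```

Because the experiments are independent, they scale out trivially: one Spark executor (and GPU) per grid point.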
43. TensorFlow-as-a-Service in RISE SICS ICE
• Hops: Spark/Flink/Kafka/TensorFlow/Hadoop-as-a-service
www.hops.site
• RISE SICS ICE: 250 kW datacenter, ~400 servers; research and test environment
https://www.sics.se/projects/sics-ice-data-center-in-lulea
44. Summary
•Distribution can make Deep Learning practitioners more productive.
https://www.oreilly.com/ideas/distributed-tensorflow
•Hopsworks is a new data platform built on HopsFS with first-class support for Python and Deep Learning / ML
- TensorFlow / Spark
45. The Team
Jim Dowling, Seif Haridi, Tor Björn Minde, Gautier Berthou, Salman
Niazi, Mahmoud Ismail, Theofilos Kakantousis, Ermias
Gebremeskel, Antonios Kouzoupis, Alex Ormenisan, Fabio Buso,
Robin Andersson, August Bonds, Filotas Siskos, Mahmoud Hamed.
Active:
Alumni:
Vasileios Giannokostas, Johan Svedlund Nordström,Rizvi Hasan, Paul Mälzer, Bram
Leenders, Juan Roca, Misganu Dessalegn, K “Sri” Srijeyanthan, Jude D’Souza, Alberto
Lorente, Andre Moré, Ali Gholami, Davis Jaunzems, Stig Viaene, Hooman Peiro,
Evangelos Savvidis, Steffen Grohsschmiedt, Qi Qi, Gayana Chandrasekara, Nikolaos
Stanogias, Daniel Bali, Ioannis Kerkinos, Peter Buechler, Pushparaj Motamari, Hamid
Afzali, Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu, Fanti Machmount Al
Samisti, Braulio Grana, Adam Alpire, Zahin Azher Rashid, ArunaKumari Yedurupaka,
Tobias Johansson , Roberto Bampi.
www.hops.io
@hopshadoop