As data and models scale, multiple processing units become necessary for both training and inference. SignSGD is a gradient-compression algorithm that transmits only the sign of the stochastic gradients during distributed training, using 32x less communication per iteration than distributed SGD. We show that signSGD obtains a free lunch in both theory and practice: no loss in accuracy while yielding speedups. Pushing the current boundaries of deep learning also requires handling multiple dimensions and modalities. These can be encoded into tensors, which are natural extensions of matrices. These functionalities are available in the TensorLy package, with multiple backend interfaces for large-scale deep learning.
COMPUTE INFRASTRUCTURE FOR AI: GPU
MOORE'S LAW: A SUPERCHARGED LAW
• More than a billion operations per image.
• NVIDIA GPUs enable parallel operations.
• Enables large-scale AI.
TAKE-AWAYS FOR SIGN-SGD
• Convergence even under biased gradients and noise.
• Faster convergence than SGD in theory and in practice.
• For distributed training, variance reduction similar to SGD.
• In practice, similar accuracy but with far less communication.
PyTorch code at https://github.com/PermiJW/signSGD-with-Majority-Vote
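The communication pattern behind these take-aways can be sketched in a few lines of NumPy. This is a minimal illustration, not the repository's implementation: each worker sends only the sign of its stochastic gradient (1 bit per coordinate), the server tallies a majority vote, and the sign of the vote is broadcast back as the update direction. The function name `sign_sgd_majority_vote` is hypothetical.

```python
import numpy as np

def sign_sgd_majority_vote(grads, lr):
    """One signSGD step with majority vote across workers (sketch).

    `grads` is a list of per-worker stochastic gradients for the same
    parameter vector. Each worker contributes only sign(gradient);
    the aggregated update is the sign of the coordinate-wise vote.
    """
    votes = np.sum([np.sign(g) for g in grads], axis=0)  # tally +1/-1 votes
    return -lr * np.sign(votes)  # move against the majority direction

# Toy example with three workers: in each coordinate,
# two of the three workers vote that the gradient is positive.
grads = [np.array([0.3, -2.0]),
         np.array([0.1, 0.5]),
         np.array([-0.4, 1.2])]
step = sign_sgd_majority_vote(grads, lr=0.01)  # array([-0.01, -0.01])
```

Note that regardless of gradient magnitude, every coordinate moves by exactly `lr`, which is why only one bit per coordinate needs to cross the network.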
TENSORLY: HIGH-LEVEL API FOR TENSOR ALGEBRA
• Python programming
• User-friendly API
• Multiple backends: flexible + scalable
• Example notebooks in repository
TENSORS: TOPIC DETECTION IN TEXT
[Diagram: a library of news articles (Amazon Comprehend) is summarized by a co-occurrence tensor of word triplets, whose decomposition yields a list of topics; example topic words include "storm", "world series", "Australia", "stock market", "Washington", "health crisis", "machine learning".]
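The object in the diagram above can be made concrete with a small sketch (NumPy only; the function name `triplet_cooccurrence` and the toy corpus are illustrative, not the production pipeline): the empirical third-order tensor counts how often word triplets co-occur within a document, and its CP decomposition is what recovers the topics.

```python
import numpy as np
from itertools import permutations

def triplet_cooccurrence(docs, vocab):
    """Empirical 3rd-order co-occurrence tensor of word triplets (sketch).

    T[i, j, k] counts co-occurrences of words i, j, k within a document
    (over all orderings of three word positions), normalized by the
    number of documents. The result is symmetric in its three modes.
    """
    idx = {w: i for i, w in enumerate(vocab)}
    T = np.zeros((len(vocab),) * 3)
    for doc in docs:
        ids = [idx[w] for w in doc]
        for i, j, k in permutations(ids, 3):  # all ordered triplets of positions
            T[i, j, k] += 1
    return T / len(docs)

# Toy corpus of two "news articles".
docs = [["storm", "australia", "storm"],
        ["stock", "market", "crisis"]]
T = triplet_cooccurrence(docs, ["storm", "australia", "stock", "market", "crisis"])
```

A CP decomposition of `T` (e.g. with TensorLy's `parafac`) would then factor these triplet statistics into per-topic word distributions.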
TENSOR-BASED LDA TRAINING IS FASTER
• Mallet is an open-source framework for topic modeling
• Benchmarks on AWS SageMaker Platform
• Built into the AWS Comprehend NLP service.
[Charts: LDA training time in minutes vs. number of topics (5 to 100), spectral method vs. Mallet, on the NYTimes corpus (300,000 documents) and the PubMed corpus (8 million documents); the spectral method is 12x to 22x faster than Mallet on average.]
A New Vision for Autonomy
Center for Autonomous Systems and Technologies
RESEARCH LEADERS AT NVIDIA
• Chief Scientist: Bill Dally
• Robotics: Dieter Fox
• Learning & Perception: Jan Kautz
• Graphics: Dave Luebke, Alex Keller, Aaron Lefohn
• Architecture: Steve Keckler, Dave Nellans, Mike O'Connor
• Programming: Michael Garland
• VLSI: Brucek Khailany
• Circuits: Tom Gray
• Networks: Larry Dennison
• Computer vision: Sanja Fidler
• Core ML: Me!
• Applied research: Bryan Catanzaro
Editor's Notes
For 30 years, the dynamics of Moore’s law held true. But CPU performance scaling has slowed. GPU computing is defining a new, supercharged law. It starts with a highly specialized parallel processor called the GPU and continues through system design, system software, algorithms, and optimized applications. The world is jumping on board — today, there are some 800,000 GPU developers.
SignSGD: special case of Adam (averaging window in Adam = 1).
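The note above can be checked numerically. With both averaging windows collapsed to a single step (beta1 = beta2 = 0), the bias-uncorrected Adam direction g / (|g| + eps) reduces to sign(g), i.e. signSGD. This is a sketch with illustrative values, not Adam's full implementation (no bias correction, single step).

```python
import numpy as np

def adam_direction(g, m, v, beta1, beta2, eps=1e-12):
    """One bias-uncorrected Adam update direction (sketch)."""
    m = beta1 * m + (1 - beta1) * g          # first-moment estimate
    v = beta2 * v + (1 - beta2) * g ** 2     # second-moment estimate
    return m / (np.sqrt(v) + eps)

g = np.array([0.3, -2.0, 0.05])
# beta1 = beta2 = 0: m = g and v = g^2, so the direction is
# g / (|g| + eps), which is sign(g) up to eps -- exactly signSGD.
direction = adam_direction(g, m=0.0, v=0.0, beta1=0.0, beta2=0.0)
```

With nonzero betas the moving averages reappear and the magnitudes differ per coordinate; the single-step window is what makes every coordinate's step size identical.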