Language models help automate a wide range of natural language processing (NLP) tasks such as speech recognition, machine translation, and text summarization. The Transformer architecture, introduced in 2017, has significantly changed the NLP landscape since then, and Transformer-based models keep getting bigger and better, improving the state of the art on language understanding and generation tasks.
Accelerated Training of Transformer Models
1. Accelerated Training of Transformer Models
Kaarthik Sivashanmugam – Principal Engineering Manager
Sherlock Huang – Principal Engineer
Azure AI - Frameworks
2. Agenda
▪ ONNX Runtime for Training
  ▪ Introduction
  ▪ Integration with training frameworks
▪ Acceleration & Native Capabilities
  ▪ Memory usage and execution optimizations
  ▪ Mixed precision training, distributed training parallelism modes, gradient checkpointing, AdaSum, DeepSpeed ZeRO
▪ Training Recipes & Perf Results
  ▪ Pretraining and finetuning: BERT, GPT-2, Turing
▪ Demo: ONNX Runtime Training in Azure Databricks
5. ONNX IR (intermediate representation)
▪ ONNX operator schema
  ▪ Operation type
  ▪ Attributes
  ▪ Inputs/outputs
  ▪ Shape inference function
https://onnx.ai/
https://github.com/onnx/onnx/blob/master/docs/Operators.md
Example (Gemm node from the ONNX spec): inputs A (batch x 128), B (128 x 256), C (256); output Y (batch x 256); attributes alpha: 0.7, beta: 0.5. [Diagram: X (batch x 128) and weight (128 x 256) feed a Gemm node along with bias (256), producing Y (batch x 256).]
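To make the schema concrete, here is a minimal sketch that builds the Gemm example above with the onnx Python helpers; the graph name and the use of a string for the symbolic batch dimension are illustrative choices.

import onnx
from onnx import helper, TensorProto

# Gemm computes Y = alpha * (A @ B) + beta * C (no transposes by default)
gemm = helper.make_node(
    "Gemm", inputs=["A", "B", "C"], outputs=["Y"], alpha=0.7, beta=0.5
)

# Value infos carry the shapes; "batch" is a symbolic dimension
A = helper.make_tensor_value_info("A", TensorProto.FLOAT, ["batch", 128])
B = helper.make_tensor_value_info("B", TensorProto.FLOAT, [128, 256])
C = helper.make_tensor_value_info("C", TensorProto.FLOAT, [256])
Y = helper.make_tensor_value_info("Y", TensorProto.FLOAT, ["batch", 256])

graph = helper.make_graph([gemm], "gemm_example", [A, B, C], [Y])
model = helper.make_model(graph)
onnx.checker.check_model(model)  # validates against the ONNX operator schema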
6. ONNX Model
▪ Graph composed of computational nodes
▪ Built-in and custom operators
7. ONNX Runtime (ORT)
▪ Cross-platform accelerator for training and inferencing
▪ Core part of the ML stack at Microsoft for innovations from the company and the industry
ORT Training
▪ Adopted by first-party (1P) and third-party (3P) workloads for acceleration
▪ Current focus on large transformer models (based on demand and acceleration needs)
▪ Extensible; supports PyTorch, Keras/TensorFlow, …
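For a feel of the API surface, a minimal inferencing sketch follows; the model file name "model.onnx" and the input name "X" are assumptions for illustration.

import numpy as np
import onnxruntime as ort

# "model.onnx" and the input name "X" are illustrative assumptions
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
x = np.random.rand(1, 128).astype(np.float32)
outputs = sess.run(None, {"X": x})  # None requests all model outputs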
9. Training & ORT Acceleration
The training loop: define model, get data batch, compute loss, compute gradients & update weights, evaluate, with checkpointing along the way.
ORT's acceleration scope covers the loss, gradient, and weight-update computation: create an ORTTrainer using the model, then call ORTTrainer.train_step() inside the train loop.
10. ORT in PyTorch
PyTorch + ONNX Runtime backend:
import torch
from onnxruntime.training import ORTTrainer, optim
# Model definition
class NeuralNet(torch.nn.Module):
def __init__(self, input_size, hidden_size, num_classes):
...
def forward(self, x):
...
model = NeuralNet(input_size=784, hidden_size=500, num_classes=10)
criterion = torch.nn.functional.cross_entropy
model_description = {
    'inputs': [('data', ['in', 'batch_size']),
               ('target', ['label_x_batch_size'])],
    'outputs': [('loss', [], True),
                ('output', ['out', 'batch_size'])]
}
optimizer_config = optim.AdamConfig(lr=learning_rate)
trainer = ORTTrainer(model, model_description, optimizer_config,
                     loss_fn=criterion)
# Training Loop
for t in range(1000):
# forward + backward + weight update
loss, y_pred = trainer.train_step(x, y)
PyTorch (for comparison):
import torch
# Model definition
class NeuralNet(torch.nn.Module):
def __init__(self, input_size, hidden_size, num_classes):
...
def forward(self, x):
...
model = NeuralNet(input_size=784, hidden_size=500, num_classes=10)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
# Training Loop
for t in range(1000):
# forward
y_pred = model(x)
loss = criterion(y_pred, y)
# reset gradient buffer
optimizer.zero_grad()
# backward
loss.backward()
# weight update
optimizer.step()
11. ORT Frontend Adapters
PyTorch and TF/Keras scripts go through framework-specific ORTTrainer frontends that convert the model to ONNX; ONNX Runtime then executes training through the ORT TrainingSession Python API, with GPU buffers exchanged between the host framework and ORT.
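The "To ONNX" step for the PyTorch frontend corresponds to a standard ONNX export. A minimal sketch with torch.onnx.export follows, using an illustrative two-layer stand-in for the slide's NeuralNet and a dynamic batch axis; all names here are assumptions.

import torch

# Illustrative stand-in for the slide's NeuralNet model
model = torch.nn.Sequential(
    torch.nn.Linear(784, 500),
    torch.nn.ReLU(),
    torch.nn.Linear(500, 10),
)
dummy_input = torch.randn(1, 784)  # sample input used to trace the graph
torch.onnx.export(
    model, dummy_input, "neuralnet.onnx",
    input_names=["data"], output_names=["output"],
    dynamic_axes={"data": {0: "batch_size"}, "output": {0: "batch_size"}},
)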
13. Contributors to ORT Acceleration
▪ Optimal Gradient Graph: memory and compute optimized using global knowledge of data dependencies
▪ Graph Optimizations: static graph optimization techniques like constant folding and redundant node elimination
▪ Memory Efficiency: static graph used for preallocation of memory for weights and gradients; memory reuse
▪ CUDA Kernel Optimizations: op fusion, reimplemented cuDNN kernels, removed redundant computation
▪ Other Training Capabilities: mixed precision training, distributed training parallelism modes, gradient checkpointing, AdaSum, DeepSpeed ZeRO
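On the inference side, these static-graph optimizations are exposed through ORT's session options; a sketch follows, with the file paths as illustrative assumptions (the training path applies comparable optimizations internally).

import onnxruntime as ort

so = ort.SessionOptions()
# Enable all graph optimizations, including constant folding and node fusion
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
so.optimized_model_filepath = "optimized.onnx"  # dump the graph after optimization
sess = ort.InferenceSession("model.onnx", sess_options=so,
                            providers=["CPUExecutionProvider"])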
14. Native Capabilities in ORT
▪ Mixed Precision Training: 16-bit and 32-bit FP types to make training faster and use less memory
▪ Distributed Training Modes: parallelism modes: data, horizontal, and pipeline
▪ Gradient Accumulation: computed gradients are accumulated into a gradient buffer using partial execution of the graph, repeated for N steps; averaged gradients are then used by the optimizer for weight updates
▪ Gradient Checkpointing: stashed activations often dominate memory consumption in training; discarded activations are recomputed when needed, trading memory usage for computation cost
▪ AdaSum: combines gradients in a novel way to improve convergence; the model converges faster
▪ DeepSpeed ZeRO (Zero Redundancy Optimizer): optimizer state partitioning, gradient partitioning, parameter partitioning
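The gradient-accumulation mechanics above can be illustrated framework-agnostically. Below is a plain-PyTorch concept sketch; ORT implements the same idea via partial execution of the training graph, and the model, data, and N=4 here are stand-ins, not from the slide.

import torch

model = torch.nn.Linear(128, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
accumulation_steps = 4  # N micro-batches per weight update

optimizer.zero_grad()
for step in range(1000):
    x, y = torch.randn(8, 128), torch.randint(0, 10, (8,))  # stand-in micro-batch
    loss = torch.nn.functional.cross_entropy(model(x), y)
    # Scale so the accumulated gradient is the average over N micro-batches
    (loss / accumulation_steps).backward()  # gradients accumulate in .grad buffers
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()       # weight update with averaged gradients
        optimizer.zero_grad()  # reset the gradient buffer for the next N steps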
17. Training Recipes
▪ BERT Pretraining
▪ Nvidia’s implementation of BERT pretraining accelerated using ORT
▪ https://github.com/microsoft/onnxruntime-training-examples/tree/master/nvidia-bert
▪ GPT-2 Finetuning
▪ Finetuning of Hugging Face GPT-2 model
▪ https://github.com/microsoft/onnxruntime-training-examples/tree/master/huggingface-gpt2
▪ Turing Finetuning
▪ Finetuning of Microsoft Turing model for abstractive text summarization, sentiment analysis and suggested reply scenarios
▪ https://github.com/microsoft/Turing-NLR (private preview)
19. BERT Pretraining on 4x DGX-2

Metric                      | PyTorch 1.5 with NGC 20.03-py3 | PyTorch 1.5 with ONNX Runtime | % Gain with ONNX Runtime
Phase 1 throughput (ex/sec) | 11522.1                        | 12826.2                       | 11.32%
Phase 2 throughput (ex/sec) | 2150.0                         | 2464.1                        | 14.61%
Phase 1 time (hours)        | 11.12                          | 9.99                          | 10.16%
Phase 2 time (hours)        | 6.62                           | 5.77                          | 12.84%
Total time (hours)          | 17.74                          | 15.76                         | 11.16%

PyTorch w/ ORT can train with 2x the local batch size of PyTorch w/o ORT (the global batch size was kept the same for comparison).
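To make the batch-size note concrete, some illustrative arithmetic: a DGX-2 has 16 GPUs, so 4x DGX-2 = 64 GPUs; the batch sizes below are assumptions, not from the slide.

# Illustrative arithmetic (batch sizes are assumed for the example):
# with a fixed global batch size, doubling the local batch size halves
# the number of gradient-accumulation steps per weight update.
num_gpus = 64
global_batch = 65536                   # assumed phase-1 global batch size
local_batch_pt = 32                    # assumed largest batch that fits without ORT
local_batch_ort = 2 * local_batch_pt   # ORT's memory optimizations fit 2x

accum_pt = global_batch // (num_gpus * local_batch_pt)    # 32 steps
accum_ort = global_batch // (num_gpus * local_batch_ort)  # 16 steps
print(accum_pt, accum_ort)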
20. Perf Improvement with ORT

Model (scenario) / # params     | Perf improvement w/ ORT
Turing* (pretraining) / 340M    | 1.4x
Turing* (pretraining) / 350M    | 1.2x
RoBERTa XL (pretraining) / 500M | 3x
RoBERTa XL (finetuning) / 500M  | 1.2x
RoBERTa XXL (pretraining) / 1B  | 7x
GPT-2 M (pretraining) / 345M    | 1.2x

* https://msturing.org/
21. Demo: ONNX Runtime Training in Azure Databricks
https://github.com/skaarthik/onnxruntime-training-databricks
22. Summary
▪ Optimize and accelerate model training using ONNX Runtime (ORT)
▪ ORT is used in training very large models used in various Microsoft products/services
▪ https://github.com/microsoft/onnxruntime
▪ https://github.com/microsoft/onnxruntime-training-examples