15. Model training
for i in range(num_minibatches_to_train):
    # Extract a minibatch of training data
    features, labels = generate_random_data_sample(
        minibatch_size, input_dim, num_output_classes)
    # Train on the minibatch
    trainer.train_minibatch({feature: features, label: labels})
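The helper `generate_random_data_sample` is not shown on the slide. A minimal NumPy sketch of what such a generator might look like — the name and signature come from the slide, while the Gaussian-blob sampling is an assumption, not the slide's actual code:

```python
import numpy as np

def generate_random_data_sample(sample_size, feature_dim, num_classes):
    """Generate synthetic (features, one-hot labels) for a toy classifier.

    Sketch only: each sample is drawn from a Gaussian blob whose mean
    depends on its class index (an assumption about the slide's helper).
    """
    # Random class index per sample
    y = np.random.randint(0, num_classes, size=sample_size)
    # Features: Gaussian blobs shifted by (class index + 1) on every dimension
    x = (np.random.randn(sample_size, feature_dim) + (y[:, None] + 1)).astype(np.float32)
    # One-hot encode the labels
    labels = np.zeros((sample_size, num_classes), dtype=np.float32)
    labels[np.arange(sample_size), y] = 1.0
    return x, labels
```

The float32 dtype matches what CNTK's `train_minibatch` expects for its input variables.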
20. from cntk import distributed
...
learner = cntk.learner.momentum_sgd(...)   # create local learner
distributed_after = epoch_size             # number of samples to warm start with
distributed_learner = distributed.data_parallel_distributed_learner(
    learner=learner,
    num_quantization_bits=32,   # non-quantized gradient accumulation
    distributed_after=0)        # no warm start
Defining the loss function
21. minibatch_source = MinibatchSource(...)
...
trainer = Trainer(z, ce, pe, distributed_learner)
...
session = training_session(trainer=trainer, mb_source=minibatch_source, ...)
session.train()
...
distributed.Communicator.finalize()   # must be called to finalize MPI in case of successful distributed training
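A script built around a distributed learner like the one above is launched under MPI, one process per worker; CNTK reads the rank and world size from the MPI environment. An example invocation with 4 workers on one machine (the script name `training.py` is hypothetical):

```shell
# Launch 4 worker processes; each runs the same script and
# coordinates gradient exchange through the distributed_learner.
mpiexec -n 4 python training.py
```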
Defining the optimization method
https://docs.microsoft.com/en-us/cognitive-toolkit/multiple-gpus-and-machines#2-configuring-parallel-training-in-cntk-in-python
62. Azure ML integration
Model lifecycle management, from training through deployment
Hardware Accelerated Model Gallery
Brainwave Compiler & Runtime
"Brainslice" Soft Neural Processing Unit
63. Performance / Flexibility / Scale
- Rapidly adapt to evolving ML
- Inference-optimized numerical precision
- Exploit sparsity, deep compression
- Excellent inference at low batch sizes
- Ultra-low latency: 10x lower than CPU/GPU
- World's largest cloud investment in FPGAs
- Multiple exa-ops of aggregate AI capacity
- Runs on Microsoft's scale infrastructure
- Low cost: $0.21/million images on Azure FPGA
64. [Figure: Project Brainwave architecture — a pretrained DNN model (in CNTK, etc.) is mapped onto a scalable DNN hardware microservice; network switches interconnect FPGAs running the BrainWave Soft DPU (instruction decoder & control plus neural functional units).]
65. [Figure: Azure ML serving pipeline — the Model Management Service and Azure ML orchestrator (Python and TensorFlow) featurize images and train the classifier; at serving time, preprocessing (TensorFlow C++ API) and the classifier (TF/LGBM) run under the Control Plane Service, with the BrainWave runtime spanning FPGA and CPU.]
67. [Figure: Hardware acceleration plane — the traditional software (CPU) server plane (CPUs linked by QPI, with 40 Gb/s QSFP links to the ToR switch) sits alongside an FPGA hardware acceleration plane running workloads such as web search ranking, deep neural networks, SDN offload, and SQL across CPUs, FPGAs, and routers.]
The interconnected FPGAs operate as a plane separate from the traditional software layer, and can be managed and used independently of the CPUs.
98. Neural Functional Unit
[Figure: NFU block diagram — an Instruction Decoder feeds tensor arbiters (TA) that route tensor data to the Matrix-Vector Unit (kernels of matrix-vector multiply with matrix register files and VRFs, converting inputs to msft-fp and results back to float16) and to multifunction units (crossbar, multiply, add/sub, activation, VRFs). A Tensor Manager with matrix and vector memory managers fronts DRAM; input, control, and output message processors connect the unit to the network interface. Legend: memory, tensor data, instructions, commands; TA = Tensor Arbiter; x = multiply, + = add/sub, A = activation.]
109. [Figure: Azure ML serving pipeline, repeated from slide 65.]
110. [Figure: Global deployment — Brainwave stamps in Azure regions (EUS, SEA, WEU, WUS); a stamp is 20 racks, and each Azure box has 24 CPU cores and 4 FPGAs. The BrainWave and Azure ML wire services connect to the AML FPGA VM Extension, the Azure Host MonAgent, and the DNN pipeline on each host.]