oneDNN Graph API extends oneDNN with a graph interface that reduces deep learning integration costs and maximizes compute efficiency across a variety of AI hardware, including AI accelerators. Get started on your AI Developer Journey @ software.intel.com/ai.
2. Deep Learning Trends
Diverse and rapidly evolving:
• Deep learning steps: Training, Inference
• Data precision: FP32, BFloat16, INT8
• Topologies: Computer Vision (ResNet-50, SqueezeNet, MobileNet), Natural Language Processing (GNMT, BERT), Recommendation Systems (NCF, Wide & Deep), Reinforcement Learning (MiniGo)
• Frameworks
3. The driving forces of AI Optimization
• Diversifying AI applications: Computer Vision, Natural Language Processing, and Recommendation Engines all build on conv operations (conv: General Matrix Multiply)
• Hardware acceleration for AI: CPU + DL acceleration, GPU + DL acceleration, and dedicated accelerators
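The note "conv: General Matrix Multiply" refers to lowering convolution onto a matrix multiply, which is why GEMM hardware also accelerates conv. A minimal im2col sketch (NumPy, no padding, stride 1; the function name is illustrative and not from oneDNN):

```python
import numpy as np

def conv2d_as_gemm(x, w):
    """Direct 2D convolution (no padding, stride 1) lowered to one GEMM
    via im2col, illustrating why conv inherits GEMM acceleration."""
    H, W = x.shape
    kh, kw = w.shape
    oh, ow = H - kh + 1, W - kw + 1
    # im2col: each output position becomes one row of flattened patch values
    cols = np.empty((oh * ow, kh * kw), dtype=x.dtype)
    for i in range(oh):
        for j in range(ow):
            cols[i * ow + j] = x[i:i + kh, j:j + kw].ravel()
    # The convolution is now a single matrix multiply with the flat kernel.
    return (cols @ w.ravel()).reshape(oh, ow)
```

Real libraries avoid materializing the im2col buffer, but the mapping to GEMM is the same.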
4. Deep learning workload time breakdown
• Accelerating matrix multiplication alone doesn't solve the problem
• Conv and MatMul operations are less dominant beyond computer vision applications
• Low precision introduces memory-bound quantize operations
• Amdahl's law: overall speedup is limited by the time spent in unaccelerated operations
• Need to have aggressive fusion
*Profiling data collected from internal performance study
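The Amdahl's-law point can be made concrete with the standard formula (not from the slides): if a fraction p of the runtime is accelerated by a factor s, the overall speedup is 1 / ((1 − p) + p / s).

```python
def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of runtime is sped up by factor s."""
    return 1.0 / ((1.0 - p) + p / s)

# If matmul is 60% of the time and is accelerated 10x, the workload as a
# whole only speeds up by about 2.17x -- the remaining 40% dominates.
print(amdahl_speedup(0.6, 10.0))  # ≈ 2.17
```

This is why fusing the surrounding memory-bound ops matters as much as accelerating the matmul itself.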
5. Accelerating Matrix Multiplication
[Figure: Matrix A (M×K) multiplied by Matrix B (K×N) produces Matrix C (M×N), shown first as scalar dot products, then as dot products computed with hardware matrix operations, with a potential fusion function applied to the output.]
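As a sketch of the "potential fusion function" idea, here is a naive matmul with a ReLU fused into the same loop, so C is written to memory only once instead of being re-read for a separate activation pass (illustrative only; real kernels are blocked and vectorized):

```python
import numpy as np

def matmul_fused_relu(A, B):
    """C = relu(A @ B), with the activation fused into the accumulation
    loop so the M x N output is written exactly once."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.empty((M, N), dtype=A.dtype)
    for m in range(M):
        for n in range(N):
            acc = 0.0
            for k in range(K):
                acc += A[m, k] * B[k, n]
            C[m, n] = max(acc, 0.0)  # fused epilogue (ReLU as the example)
    return C
```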
7. Limitation of Pattern Match
• Different frameworks pass different graph representations for Gelu, so a Gelu fusion pattern that matches one framework's passing graph is too rigid to match another's.
[Figure: two frameworks' passing-graph representations of Gelu, each lowered to a different operator subgraph.]
• Small patterns miss optimizations in large graphs: matching conv + relu pairs one at a time forces each intermediate output (Output0, Output1) back to NHWC, while matching the whole conv → relu → conv → relu → conv → relu chain lets the intermediates stay in an optimized blocked layout, converting only the final Output2 back to NHWC.
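The Gelu mismatch arises because frameworks lower Gelu to different operator subgraphs. Two common decompositions (standard formulas, not taken from the slides) illustrate this:

```python
import math

def gelu_erf(x):
    # "Exact" form: 0.5 * x * (1 + erf(x / sqrt(2)))
    # Lowers to a graph of: div, erf, add, mul, mul.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # Tanh approximation: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 x^3)))
    # Lowers to a different graph: pow, mul, add, mul, tanh, add, mul, mul.
    return 0.5 * x * (1.0 + math.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))
```

Both subgraphs compute nearly identical values, but a fusion pass keyed to one decomposition will not recognize the other — exactly the rigidity the slide describes.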
8. oneDNN is evolving…
• Graph API allows the HW backend to maximize performance
• Same integration for multiple AI HW: CPU, GPU, and accelerators
• Today: deep learning frameworks integrate oneDNN through the Primitives API, which covers CPU + DL acceleration and GPU + DL acceleration; dedicated HW accelerators require separate integration.
• Future: deep learning frameworks integrate oneDNN through the Primitives API + Graph API, covering CPU + DL acceleration, GPU + DL acceleration, and HW accelerators through one interface.
10. oneDNN Graph API Usage
[Diagram: a DL framework passes its graph to the oneDNN Graph API, which rewrites the framework graph into partitions that execute within the framework runtime context.]
• Leverage oneDNN-based framework integration and the oneDNN implementation (CPU: Intel®, Arm; GPU: Intel®, NVIDIA GPU)
• Leverage oneDNN-based framework integration and bring your own implementation based on the oneDNN Graph backend API (other implementations for accelerators)
• Unified API for DL acceleration libraries targeting AI HWs
* Other names and brands may be claimed as the property of others.
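The flow on this slide — framework graph in, partitions out, graph rewrite, execution in the framework runtime context — can be sketched with a toy partitioner. This is conceptual Python only; the names and the single conv + relu fusion rule are illustrative, not the real oneDNN Graph API:

```python
# Conceptual sketch of the usage flow: (1) the framework passes its ops to
# the library, (2) the library groups fusible ops into partitions, (3) the
# framework rewrites its graph around those partitions, (4) each partition
# is compiled and executed in the framework runtime context.

FUSIBLE_CHAINS = [("conv", "relu")]  # assumed fusion rule, for illustration

def get_partitions(ops):
    """Group a linear op list into partitions, fusing known chains."""
    partitions, i = [], 0
    while i < len(ops):
        fused = False
        for chain in FUSIBLE_CHAINS:
            if tuple(ops[i:i + len(chain)]) == chain:
                partitions.append(list(chain))  # one fused partition
                i += len(chain)
                fused = True
                break
        if not fused:
            partitions.append([ops[i]])  # single-op partition
            i += 1
    return partitions
```

Because the backend decides what lands in each partition, the same framework integration works whether the backend is a CPU, a GPU, or a vendor accelerator.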
11. Industry Momentum
• oneDNN implementation ported to the Fugaku A64FX CPU
• Optimized for the Armv8-A architecture and the SVE instruction set
• 9.3x speedup for TensorFlow ResNet-50 training and 7.8x for inference on A64FX
https://github.com/oneapi-src/oneDNN
https://blog.fltech.dev/entry/2020/11/19/fugaku-onednn-deep-dive-en
Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.
12. Call to action
• Join us on this journey:
• Hardware developers: read, provide feedback, and adopt oneDNN Graph for XPU computing!
https://spec.oneapi.com/onednn-graph/latest/
https://github.com/oneapi-src/oneDNN/tree/dev-graph
• Check out www.oneAPI.com for the oneAPI specification
• Software developers: try out oneAPI in the Intel DevCloud
https://software.intel.com/content/www/us/en/develop/tools/devcloud.html