SlideShare uma empresa Scribd logo
1 de 31
Baixar para ler offline
High-Throughput Convolutional Neural Network
on an FPGA by Customized JPEG Compression
Hiroki Nakahara
Tokyo Institute of Technology, JP
Zhiqiang Que Wayne Luk
Imperial College London, UK
Outline
• Background
• JPEG compression for a high-speed inference
• CNN model for an FPGA implementation
• Channel shift and point-wise decomposition
• Quantization strategy
• Channel shuffle
• Fully-pipelined CNN architecture
• Experimental results
• Conclusion
2
Convolutional Neural Networks (CNNs)
• High accuracy and many applications
• Image recognitions, NLPs, data mining [1]
• FPGAs on cloud services
• Amazon AWS, Microsoft Azure, etc.
3
[1] Y. Liang, K. Ouyang, L. Jing, S. Ruan, Y. Liu, J. Zhang, D. S. Rosenblum and Y. Zheng,
“UrbanFM: Inferring Fine-Grained Urban Flows,” ACM SIGKDD Conf. on knowledge discovery
and data mining (KDD), 2019, pp.3132–3142.
Problems
• Power consumption
• Performance bottleneck (Data-transfer)
• e.g., AWS F1 provides overall read/write at 6.5GB/s
from host CPU to FPGA [1]
4
Host
PC
Interconnect
PCIe
CNN
Kernel
.jpg
RAW(RGB) Img.
Accelerator Card
[1] Y. Chen, J. He, X. Zhang, C. Hao, and D. Chen, “Cloud-DNN: An Open Framework for
Mapping DNN Models to Cloud FPGAs,” FPGA, 2019, pp.73–82.
Our Contributions
• Customized JPEG for a high-speed data transfer
→ Compression ratio (Speed-up) vs. accuracy
• Fully pipelined inference architecture
w/ light-weight CNN
5
Host
PC
Interconnect
FPGA
PCIe
CNN
Kernel
Interconnect
FPGA
PCIe
CNN
Kernel
Decoder
.jpg
.jpg
Host
PC
RAW(RGB) Img.
(a) Conventional (b) Proposed
Low-quality Img.
Dog?
Cat?
6Image Source: https://www.kaggle.com/c/dogs-vs-cats/data
7
Labradoodle?
Fried Chicken?
Source: https://bit.ly/2zveHGT
8
Labradoodle?
Fried Chicken?
Our Contributions
• Customized JPEG for a high-speed data transfer
→ Compression ratio (Speed-up) vs. accuracy
• Fully pipelined inference architecture
w/ light-weight CNN
9
Host
PC
Interconnect
FPGA
PCIe
CNN
Kernel
Interconnect
FPGA
PCIe
CNN
Kernel
Decoder
.jpg
.jpg
Host
PC
RAW(RGB) Img.
(a) Conventional (b) Proposed
Low-quality Img.
Customized JPEG
for a High-speed Inference
10
JPEG Coding
11
Pre-
processing
DCT Quant.
Huffman
Encoding
Quant.
Table
Huffman
Coding
Table
Post-
processing
IDCT
Reverse
Quant.
Huffman
Decoding
Encoding
Decoding
CompressedImageData
RGB
Image
Picture
Matrix
DCT
Matrix
JPEG
Header
Proposed JPEG Coding
with a CNN Accelerator
12
Quant.
Huffman
Encoding
Fully
Pipelining
CNN
IDCT
Reverse
Quant.
Huffman
Decoding
Host PC
ImageStreamData
JPEG
Image
.jpg
Extreme
Quant. Value q
Quant.
Table
RAM
PCIe
Huffman
Decoding
& Reverse
Quant.
RAMRAM
Detection
Result
FPGA
Ping-pong
Buffer
Huffman
Coding Table
Huffman
Coding Table
Huffman Decoding
and Reverse Quantization Unit
13
0
1
2
3
4
2
2
2
3
4
Shift Register
Shift
Value
Quantized
Value
Quant.
Value q
Run-length
Decoder
00**
01**
10**
110*
1110
Image Data Stream
...
...
Priority
Encoder
Buffer RAM
Zig-zag writing
ADR
WDATA
Zig-zag
pattern ROM
• Decompose the 2D-IDCT with 16 1D-DCTs
14
2D-IDCT
AP-922
Application Note 922, “A Fast Precise Implementation of 8x8 Discrete Cosine
Transform Using the Streaming SIMD Extensions and MMX Instructions,”
https://www.cs.cmu.edu/ barbic/cs-740/ap922.pdf
2D-IDCT Unit
15
..
Controller
Operation
Units
Reg. 1D-IDCT Unit
RAM RAM RAM
• Two 1D-IDCT units
• Use half precision (16 bits)
CNN model for an FPGA
Implementation
16
Overview
1. Decomposing k×k convolution by
channel shift [1] and point-wise (1×1) convolution
2. Binary (1-bit) weight quantization [2]
3. Channel split and shuffle [3]
17
[3] X. Zhang, X. Zhou, M. Lin and J. Sun, “ShuffleNet: An Extremely Efficient Convolutional
Neural Network for Mobile Devices,” CVPR, 2018.
[1] B. Wu, A. Wan, X. Yue, P. H. Jin, S. Zhao, N. Golmant, A. Gholamine- jad, J. Gonzalez,
and K. Keutzer, “Shift: A Zero FLOP, Zero Parameter Alternative to Spatial Convolutions,”
CVPR, 2018, pp. 9127-9135.
[2] M. Courbariaux, Y. Bengio, and J.-P. David, “BinaryConnect: Training Deep Neural
Networks with Binary Weights During Propagations,” NIPS, 2015, pp.3105–3113.
Building Blocks
18
#channel x2
(a) Plain block
(b) Down-sampling block
Channel
Split
Shift PWConv Shift PWConv
Concat
&
Shuffle
Channel
Split
Shift
PWConv
(s=2)
Shift PWConv
Concat
&
Shuffle
Shift
PWConv
(s=2)
#channel/2
Our CNN
Model
19
Layer Output
size
Kernel size Stride #Output
channel
Image 224 3
PWConv 224 1 2 24
Norm 224 1 1 24
Shift 224 1 1 24
Pool 112 3 1 24
PWConv 112 2 2 24
Norm 112 1 1 24
ReLU 112 1 1 24
Shift 112 3 1 24
Pool 56 2 2 24
Stage 2 28 116
(4 repeats)
Stage 3 14 232
(8 repeats)
Stage 4 7 464
(16 repeats)
GAP 1 7 1 464
PWConv 1 1 1 1000
• Training-aware
quantization
• w: binary, a: 8-bit
• 2.54 M params,
0.616 GMACs
Fully-pipelined
CNN Architecture
20
Dataflow for a Residual Stage
of a Plain Block
• Double buffers for branch-flow
• Xilinx #pragma HLS dataflow
21
Layer
Unit
F.map Buffer
...
...
...
...
Layer
Unit
...
...
Shuffle
...
2D Convolutional Unit
22
...
...
AdderTree
BN Act
W.mem
...
...
...
...
c
n
p
c
n×p
Convolution Unit
Pooling Units
23
x00 x01 x02 x03 x04
x10 x11 x12 x13 x14
x20 x21 x22 x23 x24
x30 x31 x32 x33 x34
x40 x41 x42 x43 x44
x11 x10 x04 x03 x02 x01 x00
Write
Ctrl.
Logic
F. Map Mem. (n=5, k=2)
Shift Register
Max
Selector
+F. Map Mem.
Register
Reset
Write
Ctrl.
Logic
1
𝑛!
Controller
Max. Pooling
Unit
Global Ave.
Pooling
Unit
Experimental Results
24
Compression Ratio vs. Accuracy
25
162.2
124.6
82.1
53.5
34.9
11.5
59.61
66.64
70.8 71.1 71.2 71.2
50
55
60
65
70
75
0.0
50.0
100.0
150.0
200.0
q=1 q=2 q=3 q=4 q=5 Standard
speed-up acc
ImageNetTop-1Accuracy[%]
DataTransferSpeed-UpRatio
(Baseline:RGBImageTransfer)
JPEG Quantization Bit
• ImageNet2012 (224x224 pixel image) classification task
• PyTorch 1.4.0 + modified libjpeg library
Only decreases 0.3 point of accuracy and
achieves 82.1 times speed-up
Implementation Results
Module #LUTs #FFs #DSPs 18Kb BRAMs #URAMs
JPEG Decoder 11,675 6,646 34 2 0
Huffman Decoder 6,794 2,378 0 0 0
2D-IDCT 4,881 4,278 34 2 0
Pipelined-CNN 263,120 266,784 2,336 2,744 0
Total 274,795 273,440 2,370 2,746 16
(Ratio) (23.2%) (11.5%) (34.6%) (63.5%) (1.6%)
26
• Xilinx Inc. Virtex UltraScale+ FPGA
VCU1525 acceleration development kit
• Xilinx Inc. SDAccel 2018.2
• Operates 300MHz@75Watt
• System performance: 3321.25 FPS
• JPEG trans-decode: 81,120 FPS (c.f. conv. RGB transfer: 1242.8 FPS)
• JPEG decoder part of the LUT was only 4.2% of total system resource
Comparison with
Other FPGA Implementations
27
Method AlexNet1 FINN-R2 Synetgy3 MobNetV24 CouldDNN5 Ours
FPGA Stratix V Zynq
ZU3EG
Zynq
ZU3EG
Zynq ZU9EG Virtex US+
XCVU9P
Virtex US+
XCVU9P
FPS 864.7 200.0 96.5 809.8 123.1 3321.2
Top-1 Acc. 42.90% 50.30% 68.30% 68.1% --- 70.8%
Top-5 Acc. 66.80% --- 88.12% --- --- 90.1%
Precision
(W/Act)
16/16 1/2 4/4 8/8 16/16 1/8
Freq.(MHz) 150 220 250 333 214 300
Power (W) 26.2 10.2 5.5 --- 49.25 75.0
1 S. Liang, S. Yin, L. Liu, W. Luk, and S. Wei, “Fp-bnn: Binarized neural network on FPGA,” Neurocomputing, 275:10721086, 2018.
2 M. Blott, T. Preusser, N. Fraser, G. Gambardella, K. O’Brien, and Y. Umuroglu, “FINN-R: An end-to-end deep-learning framework for fast
exploration of quantized neural networks,” 2018.
3 Y. Yang, Q. Huang, B. Wu, T. Zhang, L. Ma, G. Gambardella, M. Blott, L. Lavagno, K. A. Vissers, J. Wawrzynek and K. Keutzer, “Synetgy:
Algorithm-hardware Co-design for ConvNet Accelerators on Embedded FPGAs,” FPGA, pp. 23-32, 2019.
4 D. Wu, Y. Zhang, X. Jia, L. Tian, T. L, L. Sui, D. Xie, and Y. Shan, “A High-performance CNN Processor Based on FPGA for MobileNets,” 29th
International Conference on Field Programmable Logic and Ap- plications (FPL), 2019, pp.136-143.
5 Y. Chen, J. He, X. Zhang, C. Hao, and D. Chen, “Cloud-DNN: An Open Framework for Mapping DNN Models to Cloud FPGAs,” FPGA, 2019,
pp.73–82.
Comparison with CPU and GPU
Platform CPU GPU FPGA
Device Xeon E5-2690 Tesla V100 Virtex US+ XCVU9P
Clock Freq. 2.6 GHz 1.53 GHz 0.3 GHz
Memory 32GB DDR4 16GB HBM2 9.49 MB BRAM
Throughput (FPS) 24.0 350.0 3321.25
Power (W) 95 295 75
Efficiency (FPS/W) 0.25 1.18 44.28
28
• Ubuntu 18.04 LTS with PyTorch 1.4.0
• 128 Batch with INT8 quantization (for CPU and GPU)
Note: CPU and GPU did not use our JPEG compression scheme
Conclusion
29
Conclusion
• Customized JPEG compression for a high-speed inference
• 82.1x speed-up, 0.3-point accuracy drop
• CNN model for a fully-pipelined implementation
• Channel shift and point-wise decomposition
• Binary weight quantization
• Channel split-shuffle operation
• Fully-pipelined CNN architecture
• Achieved 3,321 FPS@75W
• Speed-up: 138.4x CPU, 9.5x GPU
• Energy efficiency: 177.1x CPU, 37.5x GPU
• Future works
• Custom compression & Other DL applications
30
Thank you
Hiroki Nakahara (Tokyo Tech, JP)
nakahara@ict.e.titech.ac.jp
31

Mais conteúdo relacionado

Mais procurados

モデルアーキテクチャ観点からの高速化2019
モデルアーキテクチャ観点からの高速化2019モデルアーキテクチャ観点からの高速化2019
モデルアーキテクチャ観点からの高速化2019Yusuke Uchida
 
A brief introduction to recent segmentation methods
A brief introduction to recent segmentation methodsA brief introduction to recent segmentation methods
A brief introduction to recent segmentation methodsShunta Saito
 
Towards Machine Comprehension of Spoken Content
Towards Machine Comprehension of Spoken ContentTowards Machine Comprehension of Spoken Content
Towards Machine Comprehension of Spoken ContentNVIDIA Taiwan
 
AI&BigData Lab. Артем Чернодуб "Распознавание изображений методом Lazy Deep ...
AI&BigData Lab. Артем Чернодуб  "Распознавание изображений методом Lazy Deep ...AI&BigData Lab. Артем Чернодуб  "Распознавание изображений методом Lazy Deep ...
AI&BigData Lab. Артем Чернодуб "Распознавание изображений методом Lazy Deep ...GeeksLab Odessa
 
Deep Learning
Deep LearningDeep Learning
Deep LearningJun Wang
 
Faster R-CNN: Towards real-time object detection with region proposal network...
Faster R-CNN: Towards real-time object detection with region proposal network...Faster R-CNN: Towards real-time object detection with region proposal network...
Faster R-CNN: Towards real-time object detection with region proposal network...Universitat Politècnica de Catalunya
 
Promises of Deep Learning
Promises of Deep LearningPromises of Deep Learning
Promises of Deep LearningDavid Khosid
 
【DL輪読会】Incorporating group update for speech enhancement based on convolutio...
【DL輪読会】Incorporating group update for speech enhancement  based on convolutio...【DL輪読会】Incorporating group update for speech enhancement  based on convolutio...
【DL輪読会】Incorporating group update for speech enhancement based on convolutio...Deep Learning JP
 
[html5jロボット部 第7回勉強会] Microsoft Cognitive Toolkit (CNTK) Overview
[html5jロボット部 第7回勉強会] Microsoft Cognitive Toolkit (CNTK) Overview[html5jロボット部 第7回勉強会] Microsoft Cognitive Toolkit (CNTK) Overview
[html5jロボット部 第7回勉強会] Microsoft Cognitive Toolkit (CNTK) OverviewNaoki (Neo) SATO
 
Details of Lazy Deep Learning for Images Recognition in ZZ Photo app
Details of Lazy Deep Learning for Images Recognition in ZZ Photo appDetails of Lazy Deep Learning for Images Recognition in ZZ Photo app
Details of Lazy Deep Learning for Images Recognition in ZZ Photo appPAY2 YOU
 
NeuralProcessingofGeneralPurposeApproximatePrograms
NeuralProcessingofGeneralPurposeApproximateProgramsNeuralProcessingofGeneralPurposeApproximatePrograms
NeuralProcessingofGeneralPurposeApproximateProgramsMohid Nabil
 
Deep Learning - Convolutional Neural Networks
Deep Learning - Convolutional Neural NetworksDeep Learning - Convolutional Neural Networks
Deep Learning - Convolutional Neural NetworksChristian Perone
 
Cneuromod20210329 bellec parcellation
Cneuromod20210329 bellec parcellationCneuromod20210329 bellec parcellation
Cneuromod20210329 bellec parcellationPierre Bellec
 
[PR12] PR-063: Peephole predicting network performance before training
[PR12] PR-063: Peephole predicting network performance before training[PR12] PR-063: Peephole predicting network performance before training
[PR12] PR-063: Peephole predicting network performance before trainingTaegyun Jeon
 
Deep Learning through Examples
Deep Learning through ExamplesDeep Learning through Examples
Deep Learning through ExamplesSri Ambati
 
【DL輪読会】Physion: Evaluating Physical Prediction from Vision in Humans and Mach...
【DL輪読会】Physion: Evaluating Physical Prediction from Vision in Humans and Mach...【DL輪読会】Physion: Evaluating Physical Prediction from Vision in Humans and Mach...
【DL輪読会】Physion: Evaluating Physical Prediction from Vision in Humans and Mach...Deep Learning JP
 
[DL輪読会]Learning Visible Connectivity Dynamics for Cloth Smoothing (CoRL2021)
[DL輪読会]Learning Visible Connectivity Dynamics for Cloth Smoothing (CoRL2021)[DL輪読会]Learning Visible Connectivity Dynamics for Cloth Smoothing (CoRL2021)
[DL輪読会]Learning Visible Connectivity Dynamics for Cloth Smoothing (CoRL2021)Deep Learning JP
 

Mais procurados (20)

モデルアーキテクチャ観点からの高速化2019
モデルアーキテクチャ観点からの高速化2019モデルアーキテクチャ観点からの高速化2019
モデルアーキテクチャ観点からの高速化2019
 
A brief introduction to recent segmentation methods
A brief introduction to recent segmentation methodsA brief introduction to recent segmentation methods
A brief introduction to recent segmentation methods
 
Towards Machine Comprehension of Spoken Content
Towards Machine Comprehension of Spoken ContentTowards Machine Comprehension of Spoken Content
Towards Machine Comprehension of Spoken Content
 
AI&BigData Lab. Артем Чернодуб "Распознавание изображений методом Lazy Deep ...
AI&BigData Lab. Артем Чернодуб  "Распознавание изображений методом Lazy Deep ...AI&BigData Lab. Артем Чернодуб  "Распознавание изображений методом Lazy Deep ...
AI&BigData Lab. Артем Чернодуб "Распознавание изображений методом Lazy Deep ...
 
Deep Learning
Deep LearningDeep Learning
Deep Learning
 
Tutorial on Deep Learning
Tutorial on Deep LearningTutorial on Deep Learning
Tutorial on Deep Learning
 
Faster R-CNN: Towards real-time object detection with region proposal network...
Faster R-CNN: Towards real-time object detection with region proposal network...Faster R-CNN: Towards real-time object detection with region proposal network...
Faster R-CNN: Towards real-time object detection with region proposal network...
 
Promises of Deep Learning
Promises of Deep LearningPromises of Deep Learning
Promises of Deep Learning
 
Deep Learning Initiative @ NECSTLab
Deep Learning Initiative @ NECSTLabDeep Learning Initiative @ NECSTLab
Deep Learning Initiative @ NECSTLab
 
【DL輪読会】Incorporating group update for speech enhancement based on convolutio...
【DL輪読会】Incorporating group update for speech enhancement  based on convolutio...【DL輪読会】Incorporating group update for speech enhancement  based on convolutio...
【DL輪読会】Incorporating group update for speech enhancement based on convolutio...
 
[html5jロボット部 第7回勉強会] Microsoft Cognitive Toolkit (CNTK) Overview
[html5jロボット部 第7回勉強会] Microsoft Cognitive Toolkit (CNTK) Overview[html5jロボット部 第7回勉強会] Microsoft Cognitive Toolkit (CNTK) Overview
[html5jロボット部 第7回勉強会] Microsoft Cognitive Toolkit (CNTK) Overview
 
Details of Lazy Deep Learning for Images Recognition in ZZ Photo app
Details of Lazy Deep Learning for Images Recognition in ZZ Photo appDetails of Lazy Deep Learning for Images Recognition in ZZ Photo app
Details of Lazy Deep Learning for Images Recognition in ZZ Photo app
 
NeuralProcessingofGeneralPurposeApproximatePrograms
NeuralProcessingofGeneralPurposeApproximateProgramsNeuralProcessingofGeneralPurposeApproximatePrograms
NeuralProcessingofGeneralPurposeApproximatePrograms
 
Deep Learning - Convolutional Neural Networks
Deep Learning - Convolutional Neural NetworksDeep Learning - Convolutional Neural Networks
Deep Learning - Convolutional Neural Networks
 
Cneuromod20210329 bellec parcellation
Cneuromod20210329 bellec parcellationCneuromod20210329 bellec parcellation
Cneuromod20210329 bellec parcellation
 
[PR12] PR-063: Peephole predicting network performance before training
[PR12] PR-063: Peephole predicting network performance before training[PR12] PR-063: Peephole predicting network performance before training
[PR12] PR-063: Peephole predicting network performance before training
 
Deep Learning through Examples
Deep Learning through ExamplesDeep Learning through Examples
Deep Learning through Examples
 
【DL輪読会】Physion: Evaluating Physical Prediction from Vision in Humans and Mach...
【DL輪読会】Physion: Evaluating Physical Prediction from Vision in Humans and Mach...【DL輪読会】Physion: Evaluating Physical Prediction from Vision in Humans and Mach...
【DL輪読会】Physion: Evaluating Physical Prediction from Vision in Humans and Mach...
 
[DL輪読会]Learning Visible Connectivity Dynamics for Cloth Smoothing (CoRL2021)
[DL輪読会]Learning Visible Connectivity Dynamics for Cloth Smoothing (CoRL2021)[DL輪読会]Learning Visible Connectivity Dynamics for Cloth Smoothing (CoRL2021)
[DL輪読会]Learning Visible Connectivity Dynamics for Cloth Smoothing (CoRL2021)
 
Bellec cornell 2021
Bellec cornell 2021Bellec cornell 2021
Bellec cornell 2021
 

Semelhante a FCCM2020: High-Throughput Convolutional Neural Network on an FPGA by Customized JPEG Compression

Dp2 ppt by_bikramjit_chowdhury_final
Dp2 ppt by_bikramjit_chowdhury_finalDp2 ppt by_bikramjit_chowdhury_final
Dp2 ppt by_bikramjit_chowdhury_finalBikramjit Chowdhury
 
Convolutional neural networks for speech controlled prosthetic hands
Convolutional neural networks for speech controlled prosthetic handsConvolutional neural networks for speech controlled prosthetic hands
Convolutional neural networks for speech controlled prosthetic handsMohsen Jafarzadeh
 
"An adaptive modular approach to the mining of sensor network ...
"An adaptive modular approach to the mining of sensor network ..."An adaptive modular approach to the mining of sensor network ...
"An adaptive modular approach to the mining of sensor network ...butest
 
Cvpr 2018 papers review (efficient computing)
Cvpr 2018 papers review (efficient computing)Cvpr 2018 papers review (efficient computing)
Cvpr 2018 papers review (efficient computing)DonghyunKang12
 
Science and Cyberinfrastructure in the Data-Dominated Era
Science and Cyberinfrastructure in the Data-Dominated EraScience and Cyberinfrastructure in the Data-Dominated Era
Science and Cyberinfrastructure in the Data-Dominated EraLarry Smarr
 
Revisiting Sensor MAC for Periodic Monitoring: Why Should Transmitters Be Ear...
Revisiting Sensor MAC for Periodic Monitoring: Why Should Transmitters Be Ear...Revisiting Sensor MAC for Periodic Monitoring: Why Should Transmitters Be Ear...
Revisiting Sensor MAC for Periodic Monitoring: Why Should Transmitters Be Ear...deawoo Kim
 
Lifetime maximization of wireless sensor networks with a mobile
Lifetime maximization of wireless sensor networks with a mobileLifetime maximization of wireless sensor networks with a mobile
Lifetime maximization of wireless sensor networks with a mobileNexgen Technology
 
EDGE-Net: Efficient Deep-learning Gradients Extraction Network
EDGE-Net: Efficient Deep-learning Gradients Extraction NetworkEDGE-Net: Efficient Deep-learning Gradients Extraction Network
EDGE-Net: Efficient Deep-learning Gradients Extraction Networkgerogepatton
 
EDGE-Net: Efficient Deep-learning Gradients Extraction Network
EDGE-Net: Efficient Deep-learning Gradients Extraction NetworkEDGE-Net: Efficient Deep-learning Gradients Extraction Network
EDGE-Net: Efficient Deep-learning Gradients Extraction Networkgerogepatton
 
Stochastic Computing Correlation Utilization in Convolutional Neural Network ...
Stochastic Computing Correlation Utilization in Convolutional Neural Network ...Stochastic Computing Correlation Utilization in Convolutional Neural Network ...
Stochastic Computing Correlation Utilization in Convolutional Neural Network ...TELKOMNIKA JOURNAL
 
Online opportunistic routing using Reinforcement learning
Online opportunistic routing using Reinforcement learningOnline opportunistic routing using Reinforcement learning
Online opportunistic routing using Reinforcement learningHarshal Solao
 
Coding the Continuum
Coding the ContinuumCoding the Continuum
Coding the ContinuumIan Foster
 
OptIPuter Overview
OptIPuter OverviewOptIPuter Overview
OptIPuter OverviewLarry Smarr
 
FastV2C-HandNet - ICICC 2020
FastV2C-HandNet - ICICC 2020FastV2C-HandNet - ICICC 2020
FastV2C-HandNet - ICICC 2020RohanLekhwani
 
Moldable pipelines for CNNs on heterogeneous edge devices
Moldable pipelines for CNNs on heterogeneous edge devicesMoldable pipelines for CNNs on heterogeneous edge devices
Moldable pipelines for CNNs on heterogeneous edge devicesLEGATO project
 
Application Aware Topology Generation for Surface Wave Networks-on-Chip
Application Aware Topology Generation for Surface Wave Networks-on-ChipApplication Aware Topology Generation for Surface Wave Networks-on-Chip
Application Aware Topology Generation for Surface Wave Networks-on-Chipzhao fu
 
Accumulo and the Convergence of Machine Learning, Big Data, and Supercomputing
Accumulo and the Convergence of Machine Learning, Big Data, and SupercomputingAccumulo and the Convergence of Machine Learning, Big Data, and Supercomputing
Accumulo and the Convergence of Machine Learning, Big Data, and SupercomputingAccumulo Summit
 

Semelhante a FCCM2020: High-Throughput Convolutional Neural Network on an FPGA by Customized JPEG Compression (20)

Dp2 ppt by_bikramjit_chowdhury_final
Dp2 ppt by_bikramjit_chowdhury_finalDp2 ppt by_bikramjit_chowdhury_final
Dp2 ppt by_bikramjit_chowdhury_final
 
Convolutional neural networks for speech controlled prosthetic hands
Convolutional neural networks for speech controlled prosthetic handsConvolutional neural networks for speech controlled prosthetic hands
Convolutional neural networks for speech controlled prosthetic hands
 
"An adaptive modular approach to the mining of sensor network ...
"An adaptive modular approach to the mining of sensor network ..."An adaptive modular approach to the mining of sensor network ...
"An adaptive modular approach to the mining of sensor network ...
 
Cvpr 2018 papers review (efficient computing)
Cvpr 2018 papers review (efficient computing)Cvpr 2018 papers review (efficient computing)
Cvpr 2018 papers review (efficient computing)
 
Science and Cyberinfrastructure in the Data-Dominated Era
Science and Cyberinfrastructure in the Data-Dominated EraScience and Cyberinfrastructure in the Data-Dominated Era
Science and Cyberinfrastructure in the Data-Dominated Era
 
Revisiting Sensor MAC for Periodic Monitoring: Why Should Transmitters Be Ear...
Revisiting Sensor MAC for Periodic Monitoring: Why Should Transmitters Be Ear...Revisiting Sensor MAC for Periodic Monitoring: Why Should Transmitters Be Ear...
Revisiting Sensor MAC for Periodic Monitoring: Why Should Transmitters Be Ear...
 
EIS_REVIEW_1.pptx
EIS_REVIEW_1.pptxEIS_REVIEW_1.pptx
EIS_REVIEW_1.pptx
 
Lifetime maximization of wireless sensor networks with a mobile
Lifetime maximization of wireless sensor networks with a mobileLifetime maximization of wireless sensor networks with a mobile
Lifetime maximization of wireless sensor networks with a mobile
 
EDGE-Net: Efficient Deep-learning Gradients Extraction Network
EDGE-Net: Efficient Deep-learning Gradients Extraction NetworkEDGE-Net: Efficient Deep-learning Gradients Extraction Network
EDGE-Net: Efficient Deep-learning Gradients Extraction Network
 
EDGE-Net: Efficient Deep-learning Gradients Extraction Network
EDGE-Net: Efficient Deep-learning Gradients Extraction NetworkEDGE-Net: Efficient Deep-learning Gradients Extraction Network
EDGE-Net: Efficient Deep-learning Gradients Extraction Network
 
Stochastic Computing Correlation Utilization in Convolutional Neural Network ...
Stochastic Computing Correlation Utilization in Convolutional Neural Network ...Stochastic Computing Correlation Utilization in Convolutional Neural Network ...
Stochastic Computing Correlation Utilization in Convolutional Neural Network ...
 
Online opportunistic routing using Reinforcement learning
Online opportunistic routing using Reinforcement learningOnline opportunistic routing using Reinforcement learning
Online opportunistic routing using Reinforcement learning
 
Coding the Continuum
Coding the ContinuumCoding the Continuum
Coding the Continuum
 
OptIPuter Overview
OptIPuter OverviewOptIPuter Overview
OptIPuter Overview
 
FastV2C-HandNet - ICICC 2020
FastV2C-HandNet - ICICC 2020FastV2C-HandNet - ICICC 2020
FastV2C-HandNet - ICICC 2020
 
Moldable pipelines for CNNs on heterogeneous edge devices
Moldable pipelines for CNNs on heterogeneous edge devicesMoldable pipelines for CNNs on heterogeneous edge devices
Moldable pipelines for CNNs on heterogeneous edge devices
 
Application Aware Topology Generation for Surface Wave Networks-on-Chip
Application Aware Topology Generation for Surface Wave Networks-on-ChipApplication Aware Topology Generation for Surface Wave Networks-on-Chip
Application Aware Topology Generation for Surface Wave Networks-on-Chip
 
An35225228
An35225228An35225228
An35225228
 
B.tech_project_ppt.pptx
B.tech_project_ppt.pptxB.tech_project_ppt.pptx
B.tech_project_ppt.pptx
 
Accumulo and the Convergence of Machine Learning, Big Data, and Supercomputing
Accumulo and the Convergence of Machine Learning, Big Data, and SupercomputingAccumulo and the Convergence of Machine Learning, Big Data, and Supercomputing
Accumulo and the Convergence of Machine Learning, Big Data, and Supercomputing
 

Mais de Hiroki Nakahara

ROS User Group Meeting #28 マルチ深層学習とROS
ROS User Group Meeting #28 マルチ深層学習とROSROS User Group Meeting #28 マルチ深層学習とROS
ROS User Group Meeting #28 マルチ深層学習とROSHiroki Nakahara
 
DSF2018講演スライド
DSF2018講演スライドDSF2018講演スライド
DSF2018講演スライドHiroki Nakahara
 
(公開版)Reconf研2017GUINNESS
(公開版)Reconf研2017GUINNESS(公開版)Reconf研2017GUINNESS
(公開版)Reconf研2017GUINNESSHiroki Nakahara
 
(公開版)FPGAエクストリームコンピューティング2017
(公開版)FPGAエクストリームコンピューティング2017 (公開版)FPGAエクストリームコンピューティング2017
(公開版)FPGAエクストリームコンピューティング2017 Hiroki Nakahara
 
2値ディープニューラルネットワークと組込み機器への応用: 開発中のツール紹介
2値ディープニューラルネットワークと組込み機器への応用: 開発中のツール紹介2値ディープニューラルネットワークと組込み機器への応用: 開発中のツール紹介
2値ディープニューラルネットワークと組込み機器への応用: 開発中のツール紹介Hiroki Nakahara
 
2値化CNN on FPGAでGPUとガチンコバトル(公開版)
2値化CNN on FPGAでGPUとガチンコバトル(公開版)2値化CNN on FPGAでGPUとガチンコバトル(公開版)
2値化CNN on FPGAでGPUとガチンコバトル(公開版)Hiroki Nakahara
 
Tensor flow usergroup 2016 (公開版)
Tensor flow usergroup 2016 (公開版)Tensor flow usergroup 2016 (公開版)
Tensor flow usergroup 2016 (公開版)Hiroki Nakahara
 
FPGAX2016 ドキュンなFPGA
FPGAX2016 ドキュンなFPGAFPGAX2016 ドキュンなFPGA
FPGAX2016 ドキュンなFPGAHiroki Nakahara
 
電波望遠鏡用の分光器をAltera SDK for OpenCL使ってサクッと作ってみた
電波望遠鏡用の分光器をAltera SDK for OpenCL使ってサクッと作ってみた電波望遠鏡用の分光器をAltera SDK for OpenCL使ってサクッと作ってみた
電波望遠鏡用の分光器をAltera SDK for OpenCL使ってサクッと作ってみたHiroki Nakahara
 
Altera sdk for open cl アンケート集計結果(公開版)
Altera sdk for open cl アンケート集計結果(公開版)Altera sdk for open cl アンケート集計結果(公開版)
Altera sdk for open cl アンケート集計結果(公開版)Hiroki Nakahara
 
Nested RNSを用いたディープニューラルネットワークのFPGA実装
Nested RNSを用いたディープニューラルネットワークのFPGA実装Nested RNSを用いたディープニューラルネットワークのFPGA実装
Nested RNSを用いたディープニューラルネットワークのFPGA実装Hiroki Nakahara
 
私のファミコンのfpsは530000です。もちろんフルパワーで(以下略
私のファミコンのfpsは530000です。もちろんフルパワーで(以下略私のファミコンのfpsは530000です。もちろんフルパワーで(以下略
私のファミコンのfpsは530000です。もちろんフルパワーで(以下略Hiroki Nakahara
 
Verilog-HDL Tutorial (15) software
Verilog-HDL Tutorial (15) softwareVerilog-HDL Tutorial (15) software
Verilog-HDL Tutorial (15) softwareHiroki Nakahara
 
Verilog-HDL Tutorial (15) hardware
Verilog-HDL Tutorial (15) hardwareVerilog-HDL Tutorial (15) hardware
Verilog-HDL Tutorial (15) hardwareHiroki Nakahara
 
Verilog-HDL Tutorial (14)
Verilog-HDL Tutorial (14)Verilog-HDL Tutorial (14)
Verilog-HDL Tutorial (14)Hiroki Nakahara
 
Verilog-HDL Tutorial (13)
Verilog-HDL Tutorial (13)Verilog-HDL Tutorial (13)
Verilog-HDL Tutorial (13)Hiroki Nakahara
 
Verilog-HDL Tutorial (12)
Verilog-HDL Tutorial (12)Verilog-HDL Tutorial (12)
Verilog-HDL Tutorial (12)Hiroki Nakahara
 
Verilog-HDL Tutorial (11)
Verilog-HDL Tutorial (11)Verilog-HDL Tutorial (11)
Verilog-HDL Tutorial (11)Hiroki Nakahara
 

Mais de Hiroki Nakahara (20)

ROS User Group Meeting #28 マルチ深層学習とROS
ROS User Group Meeting #28 マルチ深層学習とROSROS User Group Meeting #28 マルチ深層学習とROS
ROS User Group Meeting #28 マルチ深層学習とROS
 
FPGAX2019
FPGAX2019FPGAX2019
FPGAX2019
 
SBRA2018講演資料
SBRA2018講演資料SBRA2018講演資料
SBRA2018講演資料
 
DSF2018講演スライド
DSF2018講演スライドDSF2018講演スライド
DSF2018講演スライド
 
(公開版)Reconf研2017GUINNESS
(公開版)Reconf研2017GUINNESS(公開版)Reconf研2017GUINNESS
(公開版)Reconf研2017GUINNESS
 
(公開版)FPGAエクストリームコンピューティング2017
(公開版)FPGAエクストリームコンピューティング2017 (公開版)FPGAエクストリームコンピューティング2017
(公開版)FPGAエクストリームコンピューティング2017
 
2値ディープニューラルネットワークと組込み機器への応用: 開発中のツール紹介
2値ディープニューラルネットワークと組込み機器への応用: 開発中のツール紹介2値ディープニューラルネットワークと組込み機器への応用: 開発中のツール紹介
2値ディープニューラルネットワークと組込み機器への応用: 開発中のツール紹介
 
2値化CNN on FPGAでGPUとガチンコバトル(公開版)
2値化CNN on FPGAでGPUとガチンコバトル(公開版)2値化CNN on FPGAでGPUとガチンコバトル(公開版)
2値化CNN on FPGAでGPUとガチンコバトル(公開版)
 
Tensor flow usergroup 2016 (公開版)
Tensor flow usergroup 2016 (公開版)Tensor flow usergroup 2016 (公開版)
Tensor flow usergroup 2016 (公開版)
 
FPGAX2016 ドキュンなFPGA
FPGAX2016 ドキュンなFPGAFPGAX2016 ドキュンなFPGA
FPGAX2016 ドキュンなFPGA
 
電波望遠鏡用の分光器をAltera SDK for OpenCL使ってサクッと作ってみた
電波望遠鏡用の分光器をAltera SDK for OpenCL使ってサクッと作ってみた電波望遠鏡用の分光器をAltera SDK for OpenCL使ってサクッと作ってみた
電波望遠鏡用の分光器をAltera SDK for OpenCL使ってサクッと作ってみた
 
Altera sdk for open cl アンケート集計結果(公開版)
Altera sdk for open cl アンケート集計結果(公開版)Altera sdk for open cl アンケート集計結果(公開版)
Altera sdk for open cl アンケート集計結果(公開版)
 
Nested RNSを用いたディープニューラルネットワークのFPGA実装
Nested RNSを用いたディープニューラルネットワークのFPGA実装Nested RNSを用いたディープニューラルネットワークのFPGA実装
Nested RNSを用いたディープニューラルネットワークのFPGA実装
 
私のファミコンのfpsは530000です。もちろんフルパワーで(以下略
私のファミコンのfpsは530000です。もちろんフルパワーで(以下略私のファミコンのfpsは530000です。もちろんフルパワーで(以下略
私のファミコンのfpsは530000です。もちろんフルパワーで(以下略
 
Verilog-HDL Tutorial (15) software
Verilog-HDL Tutorial (15) softwareVerilog-HDL Tutorial (15) software
Verilog-HDL Tutorial (15) software
 
Verilog-HDL Tutorial (15) hardware
Verilog-HDL Tutorial (15) hardwareVerilog-HDL Tutorial (15) hardware
Verilog-HDL Tutorial (15) hardware
 
Verilog-HDL Tutorial (14)
Verilog-HDL Tutorial (14)Verilog-HDL Tutorial (14)
Verilog-HDL Tutorial (14)
 
Verilog-HDL Tutorial (13)
Verilog-HDL Tutorial (13)Verilog-HDL Tutorial (13)
Verilog-HDL Tutorial (13)
 
Verilog-HDL Tutorial (12)
Verilog-HDL Tutorial (12)Verilog-HDL Tutorial (12)
Verilog-HDL Tutorial (12)
 
Verilog-HDL Tutorial (11)
Verilog-HDL Tutorial (11)Verilog-HDL Tutorial (11)
Verilog-HDL Tutorial (11)
 

Último

result management system report for college project
result management system report for college projectresult management system report for college project
result management system report for college projectTonystark477637
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756dollysharma2066
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VDineshKumar4165
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSISrknatarajan
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdfKamal Acharya
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performancesivaprakash250
 
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank  Design by Working Stress - IS Method.pdfIntze Overhead Water Tank  Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank Design by Working Stress - IS Method.pdfSuman Jyoti
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...Call Girls in Nagpur High Profile
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Christo Ananth
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdfankushspencer015
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTbhaskargani46
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxAsutosh Ranjan
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfKamal Acharya
 
chapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringchapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringmulugeta48
 
UNIT-IFLUID PROPERTIES & FLOW CHARACTERISTICS
UNIT-IFLUID PROPERTIES & FLOW CHARACTERISTICSUNIT-IFLUID PROPERTIES & FLOW CHARACTERISTICS
UNIT-IFLUID PROPERTIES & FLOW CHARACTERISTICSrknatarajan
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...ranjana rawat
 

Último (20)

result management system report for college project
result management system report for college projectresult management system report for college project
result management system report for college project
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - V
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdf
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank  Design by Working Stress - IS Method.pdfIntze Overhead Water Tank  Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPT
 
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
chapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringchapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineering
 
UNIT-IFLUID PROPERTIES & FLOW CHARACTERISTICS
UNIT-IFLUID PROPERTIES & FLOW CHARACTERISTICSUNIT-IFLUID PROPERTIES & FLOW CHARACTERISTICS
UNIT-IFLUID PROPERTIES & FLOW CHARACTERISTICS
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
 

FCCM2020: High-Throughput Convolutional Neural Network on an FPGA by Customized JPEG Compression

  • 1. High-Throughput Convolutional Neural Network on an FPGA by Customized JPEG Compression Hiroki Nakahara Tokyo Institute of Technology, JP Zhiqiang Que Wayne Luk Imperial College London, UK
  • 2. Outline • Background • JPEG compression for a high-speed inference • CNN model for an FPGA implementation • Channel shift and point-wise decomposition • Quantization strategy • Channel shuffle • Fully-pipelined CNN architecture • Experimental results • Conclusion 2
  • 3. Convolutional Neural Networks (CNNs) • High accuracy and many applications • Image recognitions, NLPs, data mining [1] • FPGAs on cloud services • Amazon AWS, Microsoft Azure, etc. 3 [1] Y. Liang, K. Ouyang, L. Jing, S. Ruan, Y. Liu, J. Zhang, D. S. Rosenblum and Y. Zheng, “UrbanFM: Inferring Fine-Grained Urban Flows,” ACM SIGKDD Conf. on knowledge discovery and data mining (KDD), 2019, pp.3132–3142.
  • 4. Problems • Power consumption • Performance bottleneck (Data-transfer) • e.g., AWS F1 provides overall read/write at 6.5GB/s from host CPU to FPGA [1] 4 Host PC Interconnect PCIe CNN Kernel .jpg RAW(RGB) Img. Accelerator Card [1] Y. Chen, J. He, X. Zhang, C. Hao, and D. Chen, “Cloud-DNN: An Open Framework for Mapping DNN Models to Cloud FPGAs,” FPGA, 2019, pp.73–82.
  • 5. Our Contributions • Customized JPEG for a high-speed data transfer → Compression ratio (Speed-up) vs. accuracy • Fully pipelined inference architecture w/ light-weight CNN 5 Host PC Interconnect FPGA PCIe CNN Kernel Interconnect FPGA PCIe CNN Kernel Decoder .jpg .jpg Host PC RAW(RGB) Img. (a) Conventional (b) Proposed Low-quality Img.
  • 9. Our Contributions • Customized JPEG for a high-speed data transfer → Compression ratio (Speed-up) vs. accuracy • Fully pipelined inference architecture w/ light-weight CNN 9 Host PC Interconnect FPGA PCIe CNN Kernel Interconnect FPGA PCIe CNN Kernel Decoder .jpg .jpg Host PC RAW(RGB) Img. (a) Conventional (b) Proposed Low-quality Img.
  • 10. Customized JPEG for a High-speed Inference 10
  • 12. Proposed JPEG Coding with a CNN Accelerator 12 Quant. Huffman Encoding Fully Pipelining CNN IDCT Reverse Quant. Huffman Decoding Host PC ImageStreamData JPEG Image .jpg Extreme Quant. Value q Quant. Table RAM PCIe Huffman Decoding & Reverse Quant. RAMRAM Detection Result FPGA Ping-pong Buffer Huffman Coding Table Huffman Coding Table
  • 13. Huffman Decoding and Reverse Quantization Unit 13 0 1 2 3 4 2 2 2 3 4 Shift Register Shift Value Quantized Value Quant. Value q Run-length Decoder 00** 01** 10** 110* 1110 Image Data Stream ... ... Priority Encoder Buffer RAM Zig-zag writing ADR WDATA Zig-zag pattern ROM
  • 14. • Decompose the 2D-IDCT with 16 1D-DCTs 14 2D-IDCT AP-922 Application Note 922, “A Fast Precise Implementation of 8x8 Discrete Cosine Transform Using the Streaming SIMD Extensions and MMX Instructions,” https://www.cs.cmu.edu/ barbic/cs-740/ap922.pdf
  • 15. 2D-IDCT Unit 15 .. Controller Operation Units Reg. 1D-IDCT Unit RAM RAM RAM • Two 1D-IDCT units • Use half precision (16 bits)
  • 16. CNN model for an FPGA Implementation 16
  • 17. Overview 1. Decomposing k×k convolution by channel shift [1] and point-wise (1×1) convolution 2. Binary (1-bit) weight quantization [2] 3. Channel split and shuffle [3] 17 [3] X. Zhang, X. Zhou, M. Lin and J. Sun, “ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices,” CVPR, 2018. [1] B. Wu, A. Wan, X. Yue, P. H. Jin, S. Zhao, N. Golmant, A. Gholamine- jad, J. Gonzalez, and K. Keutzer, “Shift: A Zero FLOP, Zero Parameter Alternative to Spatial Convolutions,” CVPR, 2018, pp. 9127-9135. [2] M. Courbariaux, Y. Bengio, and J.-P. David, “BinaryConnect: Training Deep Neural Networks with Binary Weights During Propagations,” NIPS, 2015, pp.3105–3113.
  • 18. Building Blocks 18 #channel x2 (a) Plain block (b) Down-sampling block Channel Split Shift PWConv Shift PWConv Concat & Shuffle Channel Split Shift PWConv (s=2) Shift PWConv Concat & Shuffle Shift PWConv (s=2) #channel/2
  • 19. Our CNN Model 19 Layer Output size Kernel size Stride #Output channel Image 224 3 PWConv 224 1 2 24 Norm 224 1 1 24 Shift 224 1 1 24 Pool 112 3 1 24 PWConv 112 2 2 24 Norm 112 1 1 24 ReLU 112 1 1 24 Shift 112 3 1 24 Pool 56 2 2 24 Stage 2 28 116 (4 repeats) Stage 3 14 232 (8 repeats) Stage 4 7 464 (16 repeats) GAP 1 7 1 464 PWConv 1 1 1 1000 • Training-aware quantization • w: binary, a: 8-bit • 2.54 M params, 0.616 GMACs
  • 21. Dataflow for a Residual Stage of a Plain Block • Double buffers for branch-flow • Xilinx #pragma HLS dataflow 21 Layer Unit F.map Buffer ... ... ... ... Layer Unit ... ... Shuffle ...
  • 22. 2D Convolutional Unit 22 ... ... AdderTree BN Act W.mem ... ... ... ... c n p c n×p Convolution Unit
  • 23. Pooling Units 23 x00 x01 x02 x03 x04 x10 x11 x12 x13 x14 x20 x21 x22 x23 x24 x30 x31 x32 x33 x34 x40 x41 x42 x43 x44 x11 x10 x04 x03 x02 x01 x00 Write Ctrl. Logic F. Map Mem. (n=5, k=2) Shift Register Max Selector +F. Map Mem. Register Reset Write Ctrl. Logic 1 𝑛! Controller Max. Pooling Unit Global Ave. Pooling Unit
  • 25. Compression Ratio vs. Accuracy 25 162.2 124.6 82.1 53.5 34.9 11.5 59.61 66.64 70.8 71.1 71.2 71.2 50 55 60 65 70 75 0.0 50.0 100.0 150.0 200.0 q=1 q=2 q=3 q=4 q=5 Standard speed-up acc ImageNetTop-1Accuracy[%] DataTransferSpeed-UpRatio (Baseline:RGBImageTransfer) JPEG Quantization Bit • ImageNet2012 (224x224 pixel image) classification task • PyTorch 1.4.0 + modified libjpeg library Only decreases 0.3 point of accuracy and achieves 82.1 times speed-up
  • 26. Implementation Results Module #LUTs #FFs #DSPs 18Kb BRAMs #URAMs JPEG Decoder 11,675 6,646 34 2 0 Huffman Decoder 6,794 2,378 0 0 0 2D-IDCT 4,881 4,278 34 2 0 Pipelined-CNN 263,120 266,784 2,336 2,744 0 Total 274,795 273,440 2,370 2,746 16 (Ratio) (23.2%) (11.5%) (34.6%) (63.5%) (1.6%) 26 • Xilinx Inc. Virtex UltraScale+ FPGA VCU1525 acceleration development kit • Xilinx Inc. SDAccel 2018.2 • Operates 300MHz@75Watt • System performance: 3321.25 FPS • JPEG trans-decode: 81,120 FPS (c.f. conv. RGB transfer: 1242.8 FPS) • JPEG decoder part of the LUT was only 4.2% of total system resource
  • 27. Comparison with Other FPGA Implementations 27 Method AlexNet1 FINN-R2 Synetgy3 MobNetV24 CouldDNN5 Ours FPGA Stratix V Zynq ZU3EG Zynq ZU3EG Zynq ZU9EG Virtex US+ XCVU9P Virtex US+ XCVU9P FPS 864.7 200.0 96.5 809.8 123.1 3321.2 Top-1 Acc. 42.90% 50.30% 68.30% 68.1% --- 70.8% Top-5 Acc. 66.80% --- 88.12% --- --- 90.1% Precision (W/Act) 16/16 1/2 4/4 8/8 16/16 1/8 Freq.(MHz) 150 220 250 333 214 300 Power (W) 26.2 10.2 5.5 --- 49.25 75.0 1 S. Liang, S. Yin, L. Liu, W. Luk, and S. Wei, “Fp-bnn: Binarized neural network on FPGA,” Neurocomputing, 275:10721086, 2018. 2 M. Blott, T. Preusser, N. Fraser, G. Gambardella, K. O’Brien, and Y. Umuroglu, “FINN-R: An end-to-end deep-learning framework for fast exploration of quantized neural networks,” 2018. 3 Y. Yang, Q. Huang, B. Wu, T. Zhang, L. Ma, G. Gambardella, M. Blott, L. Lavagno, K. A. Vissers, J. Wawrzynek and K. Keutzer, “Synetgy: Algorithm-hardware Co-design for ConvNet Accelerators on Embedded FPGAs,” FPGA, pp. 23-32, 2019. 4 D. Wu, Y. Zhang, X. Jia, L. Tian, T. L, L. Sui, D. Xie, and Y. Shan, “A High-performance CNN Processor Based on FPGA for MobileNets,” 29th International Conference on Field Programmable Logic and Ap- plications (FPL), 2019, pp.136-143. 5 Y. Chen, J. He, X. Zhang, C. Hao, and D. Chen, “Cloud-DNN: An Open Framework for Mapping DNN Models to Cloud FPGAs,” FPGA, 2019, pp.73–82.
  • 28. Comparison with CPU and GPU Platform CPU GPU FPGA Device Xeon E5-2690 Tesla V100 Virtex US+ XCVU9P Clock Freq. 2.6 GHz 1.53 GHz 0.3 GHz Memory 32GB DDR4 16GB HBM2 9.49 MB BRAM Throughput (FPS) 24.0 350.0 3321.25 Power (W) 95 295 75 Efficiency (FPS/W) 0.25 1.18 44.28 28 • Ubuntu 18.04 LTS with PyTorch 1.4.0 • 128 Batch with INT8 quantization (for CPU and GPU) Note: CPU and GPU did not use our JPEG compression scheme
  • 30. Conclusion • Customized JPEG compression for a high-speed inference • 82.1x speed-up, 0.3-point accuracy drop • CNN model for a fully-pipelined implementation • Channel shift and point-wise decomposition • Binary weight quantization • Channel split-shuffle operation • Fully-pipelined CNN architecture • Achieved 3,321 FPS@75W • Speed-up: 138.4x CPU, 9.5x GPU • Energy efficiency: 177.1x CPU, 37.5x GPU • Future works • Custom compression & Other DL applications 30
  • 31. Thank you Hiroki Nakahara (Tokyo Tech, JP) nakahara@ict.e.titech.ac.jp 31