[DL Seminar] EfficientDet: Scalable and Efficient Object Detection

DEEP LEARNING JP
[DL Seminar]
EfficientDet: Scalable and Efficient Object Detection
Hiromi Nakagawa ACES, Inc.
https://deeplearning.jp

Overview
• Mingxing Tan, Ruoming Pang, Quoc V. Le (Google Research, Brain Team)
– The same team that wrote EfficientNet
– Submitted to arXiv on 2019/11/20
• Applies the EfficientNet approach to object detection
– Weighted Bi-directional Feature Pyramid Network (BiFPN):
efficiently extracts multi-scale features
– Compound Scaling:
scales resolution, depth, and width with a single coefficient
• Sets a new SoTA on COCO in accuracy, model size, speed, and more (vs. the best previous detector)
– #Params: 4x smaller
– FLOPs: 9.3x fewer

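Introduction
• Recent object detection models tend to be very large
– AmoebaNet-based NAS-FPN: 167M parameters, 3045B FLOPs (30x more than RetinaNet)
– This hinders deployment in real-world settings such as robotics and self-driving cars
– Making models efficient is becoming increasingly important
• There is also work on lighter models, but it sacrifices accuracy
– One-stage, anchor-free, model compression
• Optimizing for one specific resource budget is not enough; we want models that can handle a wide range of resource constraints
– Roughly 3B to 300B FLOPs?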

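Introduction
• Can high accuracy and high efficiency be achieved together? The paper systematically studies detector design choices
• Challenge 1: Efficient Multi-Scale Feature Fusion
– Proposes the Bi-directional Feature Pyramid Network (BiFPN), which extracts multi-scale features simply and effectively
• Challenge 2: Model Scaling
– Proposes Compound Scaling, which jointly scales the network width and depth together with the input image resolution
• On top of that, uses the already-strong EfficientNet as the backbone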

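BiFPN: Bi-directional Feature Pyramid Network
• Multi-scale fusion => aggregate features at different resolutions: $\vec{P}^{in} = (P^{in}_{l_1}, \ldots, P^{in}_{l_n})$, where $P^{in}_{l_i}$ is the feature at level $l_i$
Figure: [Lin+CVPR'17] Feature Pyramid Networks
– ex. Faster-RCNN, YOLO: the upper layers have low resolution
– ex. SSD: feature extraction in the lower layers is insufficient
– FPN: the lower layers can also use global features (context), and their resolution is high
Ref. https://www.slideshare.net/ren4yu/single-shot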
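The slide defines the fusion input as a list of per-level features. As a rough sketch, assuming the paper's convention that level $l$ has resolution $1/2^l$ of the input and that EfficientDet uses levels 3 to 7, the side length of each $P^{in}_l$ can be computed as below. `level_sizes` is a hypothetical helper, not taken from the authors' code.

```python
def level_sizes(input_size: int, levels: range = range(3, 8)) -> dict[int, int]:
    """Return {level: feature-map side length} for a square input image."""
    sizes = {}
    for l in levels:
        # Level l has resolution 1 / 2**l of the input (integer division
        # assumes the input size is divisible by 2**max(levels)).
        sizes[l] = input_size // (2 ** l)
    return sizes


if __name__ == "__main__":
    # The multi-scale input list P_in = (P_3, ..., P_7) for a 640x640 image.
    print(level_sizes(640))  # {3: 80, 4: 40, 5: 20, 6: 10, 7: 5}
```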

• (a) Conventional top-down FPN
– Limited by the one-way information flow
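For reference, the conventional top-down FPN in (a) computes $P^{out}_7 = \mathrm{Conv}(P^{in}_7)$ and $P^{out}_l = \mathrm{Conv}(P^{in}_l + \mathrm{Resize}(P^{out}_{l+1}))$ for $l = 6, \ldots, 3$. The toy sketch below assumes 1-D features stored as lists of floats, 2x nearest-neighbour upsampling for Resize, and an identity Conv; it only illustrates that information flows one way, so $P^{out}_7$ never sees $P^{in}_3$.

```python
def resize_up(x: list[float]) -> list[float]:
    """2x nearest-neighbour upsampling of a 1-D feature (assumed Resize)."""
    return [v for v in x for _ in range(2)]


def conv(x: list[float]) -> list[float]:
    """Placeholder for a real convolution: identity in this sketch."""
    return list(x)


def top_down_fpn(p_in: dict[int, list[float]]) -> dict[int, list[float]]:
    """P_out[top] = Conv(P_in[top]); P_out[l] = Conv(P_in[l] + Resize(P_out[l+1]))."""
    levels = sorted(p_in)               # e.g. [3, 4, 5, 6, 7], assumed consecutive
    top = levels[-1]
    p_out = {top: conv(p_in[top])}
    for l in reversed(levels[:-1]):     # 6, 5, 4, 3: strictly top-down
        up = resize_up(p_out[l + 1])
        p_out[l] = conv([a + b for a, b in zip(p_in[l], up)])
    return p_out
```

With every input set to ones, e.g. `{l: [1.0] * 2 ** (7 - l) for l in range(3, 8)}`, the output values grow from 1.0 at P7 to 5.0 at P3, while changing P3's input leaves P7's output untouched.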

• (b) PANet
– Adds an extra bottom-up path aggregation network
• (c) NAS-FPN
– Neural architecture search
– Requires thousands of GPU hours for search
– Irregular network, difficult to interpret or modify

• (e) Simplified PANet
– PANet: accurate, but needs more parameters and computation
– Removes the nodes with only one input edge (they perform no feature fusion, so they contribute little)
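• (f) BiFPN
– Adds an extra edge from the original input to the output node at the same level, fusing more features at little extra cost
– Treats each bidirectional (top-down + bottom-up) path as one feature network layer and repeats that layer multiple times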
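The difference between (e) and (f) is easiest to see as a graph. This sketch assumes levels P3 to P7 and uses our own hypothetical node names (`P6_td` for top-down intermediate nodes, `P4_out` for output nodes). It lists each node's inputs for a single layer: BiFPN adds the same-level input edge at the intermediate levels, and no node is left with a single input. In BiFPN this whole layer would then be stacked several times.

```python
def build_layer(levels: tuple[int, ...] = (3, 4, 5, 6, 7),
                bifpn: bool = True) -> dict[str, list[str]]:
    """Return {node: [input nodes]} for one feature-network layer."""
    lo, hi = levels[0], levels[-1]
    graph: dict[str, list[str]] = {}
    # Top-down path: intermediate nodes only for levels hi-1 .. lo+1.
    # The top level would have a single input, so it gets no node.
    prev = f"P{hi}_in"
    for l in range(hi - 1, lo, -1):
        graph[f"P{l}_td"] = [f"P{l}_in", prev]
        prev = f"P{l}_td"
    # The bottom-level output closes the top-down path.
    graph[f"P{lo}_out"] = [f"P{lo}_in", prev]
    # Bottom-up path over the intermediate levels.
    for l in range(lo + 1, hi):
        inputs = [f"P{l}_td", f"P{l - 1}_out"]
        if bifpn:
            # BiFPN: extra edge from the original input at the same level.
            inputs.insert(0, f"P{l}_in")
        graph[f"P{l}_out"] = inputs
    graph[f"P{hi}_out"] = [f"P{hi}_in", f"P{hi - 1}_out"]
    return graph


def num_edges(graph: dict[str, list[str]]) -> int:
    """Total number of input edges in the layer."""
    return sum(len(inputs) for inputs in graph.values())
```

Under these assumptions the simplified PANet layer has 16 input edges and the BiFPN layer 19, i.e. three extra same-level edges at P4, P5 and P6.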

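• Weighted feature fusion: how should multi-scale features be fused?
– Simply sum them with equal weights? → No: input features at different resolutions usually contribute unequally to the output
– Introduce additional weights and let the network learn the importance of each input feature
– Unbounded fusion:
• $w_i$: scalar (per-feature), vector (per-channel), or tensor (per-pixel)
• A scalar is enough, but an unbounded weight needs to be bounded for stable training
– Softmax-based fusion:
• Causes a significant slowdown on GPU
– Fast normalized fusion:
• Efficient: bounds the weights without the cost of softmax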
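The three fusion options can be written as: unbounded $O = \sum_i w_i \cdot I_i$; softmax-based $O = \sum_i \frac{e^{w_i}}{\sum_j e^{w_j}} \cdot I_i$; and fast normalized $O = \sum_i \frac{w_i}{\epsilon + \sum_j w_j} \cdot I_i$, where a ReLU keeps each $w_i \ge 0$ and the paper uses $\epsilon = 0.0001$. Below is a minimal sketch with one scalar weight per input, assuming the inputs have already been resized to the same shape and are given as flat lists of floats; the function names are our own.

```python
import math

EPS = 1e-4  # epsilon for fast normalized fusion (value reported in the paper)


def weighted_sum(coeffs: list[float], inputs: list[list[float]]) -> list[float]:
    """Element-wise sum_i coeffs[i] * inputs[i]; all inputs share one shape."""
    return [sum(c * x[k] for c, x in zip(coeffs, inputs))
            for k in range(len(inputs[0]))]


def unbounded_fusion(weights: list[float], inputs: list[list[float]]) -> list[float]:
    # O = sum_i w_i * I_i (weights are unconstrained)
    return weighted_sum(weights, inputs)


def softmax_fusion(weights: list[float], inputs: list[list[float]]) -> list[float]:
    # O = sum_i exp(w_i) / sum_j exp(w_j) * I_i
    m = max(weights)  # subtract the max for numerical stability
    exps = [math.exp(w - m) for w in weights]
    total = sum(exps)
    return weighted_sum([e / total for e in exps], inputs)


def fast_normalized_fusion(weights: list[float], inputs: list[list[float]],
                           eps: float = EPS) -> list[float]:
    # O = sum_i w_i / (eps + sum_j w_j) * I_i, with ReLU keeping w_i >= 0
    w = [max(0.0, x) for x in weights]
    total = eps + sum(w)
    return weighted_sum([x / total for x in w], inputs)
```

Fast normalized fusion avoids the softmax, which the paper identifies as the source of the GPU slowdown; its normalized weights still lie in [0, 1) but sum to slightly less than 1 because of $\epsilon$.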

EfficientDet Architecture
• Backbone: ImageNet-pretrained EfficientNet
• BiFPN layers are repeated (stacked) multiple times
• The class and box prediction networks share weights across all feature levels
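Because the class and box networks share weights across levels, one head is run over every pyramid level: its weights are counted once, while the number of predictions is the sum over levels. The bookkeeping sketch below assumes 9 anchors per location (3 scales x 3 aspect ratios, as in RetinaNet), 80 COCO categories, and levels P3 to P7 with side length input_size / 2^l. `shared_head_outputs` is a hypothetical helper that counts outputs; it is not a network.

```python
def shared_head_outputs(input_size: int, num_anchors: int = 9,
                        num_classes: int = 80,
                        levels: range = range(3, 8)) -> dict[str, int]:
    """Count the predictions produced by one shared class/box head."""
    # The same class net and box net (one set of weights) are applied to
    # every level, so their parameter count does not depend on len(levels);
    # only the number of outputs grows with the total number of locations.
    anchors = 0
    for l in levels:
        side = input_size // 2 ** l        # feature-map side length at level l
        anchors += side * side * num_anchors
    return {
        "anchors": anchors,
        "class_outputs": anchors * num_classes,  # one logit per class per anchor
        "box_outputs": anchors * 4,              # 4 box-regression values per anchor
    }


if __name__ == "__main__":
    print(shared_head_outputs(512))
```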

Compound Scaling
• Use a compound coefficient $\phi$ to jointly scale up all dimensions
– Object detection models have many more scaling dimensions than image classification models
• Backbone network: EfficientNet-B0, …, B6
• BiFPN width (#channels): $W_{bifpn} = 64 \cdot (1.35^{\phi})$
• BiFPN depth (#layers): $D_{bifpn} = 2 + \phi$
• Class prediction network depth (#layers): $D_{class} = 3 + \lfloor \phi / 3 \rfloor$
• Input size: $R_{input} = 512 + \phi \cdot 128$
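The scaling rules can be tabulated directly from $\phi$. This sketch evaluates the raw formulas ($W_{bifpn} = 64 \cdot 1.35^{\phi}$, $D_{bifpn} = 2 + \phi$, $D_{class} = 3 + \lfloor \phi/3 \rfloor$, $R_{input} = 512 + 128\phi$) for $\phi = 0, \ldots, 6$. `compound_scaling` is a hypothetical helper, and the channel counts are left unrounded, so they may differ from the rounded values in the paper's configuration table.

```python
def compound_scaling(phi: int) -> dict[str, object]:
    """Raw compound-scaling formulas for EfficientDet-D{phi} (no rounding)."""
    if phi < 0:
        raise ValueError("phi must be non-negative")
    return {
        "backbone": f"EfficientNet-B{phi}",  # backbone follows the same phi
        "W_bifpn": 64 * 1.35 ** phi,         # BiFPN channels (unrounded)
        "D_bifpn": 2 + phi,                  # number of BiFPN layers
        "D_class": 3 + phi // 3,             # layers in the class/box nets
        "R_input": 512 + phi * 128,          # input image resolution
    }


if __name__ == "__main__":
    for phi in range(7):
        cfg = compound_scaling(phi)
        print(phi, cfg["backbone"], round(cfg["W_bifpn"], 1),
              cfg["D_bifpn"], cfg["D_class"], cfg["R_input"])
```

A single integer drives every dimension, which is what lets one search over $\phi$ replace a separate grid search per dimension.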

Experiments
• Trained with batch size 128 on 32 TPUv3 chips
• Achieves SoTA on COCO2017 in accuracy, parameter count, speed, and more

Experiments
• Real-world latency: each model is run 10 times with batch size 1
• GPU (Titan-V): up to 3.2x faster than other detectors
• CPU (single-thread Xeon): up to 8.1x faster than other detectors
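The latency protocol (run each model 10 times with batch size 1) can be sketched as a small timing harness. `measure_latency`, the single untimed warm-up call, and reporting the mean and standard deviation are assumptions for illustration; `model_fn` stands for any callable that runs inference on one batch.

```python
import statistics
import time
from typing import Any, Callable


def measure_latency(model_fn: Callable[[Any], Any], batch: Any,
                    runs: int = 10, warmup: int = 1) -> tuple[float, float]:
    """Return (mean, stdev) wall-clock latency in seconds over `runs` calls."""
    if runs < 2:
        raise ValueError("need at least 2 runs for a standard deviation")
    for _ in range(warmup):  # untimed warm-up calls (assumption, not from the slide)
        model_fn(batch)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        model_fn(batch)
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings), statistics.stdev(timings)


def dummy_model(batch: list) -> list:
    """Hypothetical stand-in for a detector: some CPU work per image."""
    return [sum(range(10_000)) for _ in batch]


if __name__ == "__main__":
    mean_s, std_s = measure_latency(dummy_model, batch=[None])  # batch size 1
    print(f"mean {mean_s * 1e3:.3f} ms, std {std_s * 1e3:.3f} ms")
```

On a GPU, frameworks usually launch work asynchronously, so a real measurement would also need to synchronize the device before reading the clock.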

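Experiments
• Ablation Study
– Just switching the backbone to EfficientNet already improves over RetinaNet
– Replacing FPN with BiFPN improves results further
– Compared with other feature networks, BiFPN achieves higher accuracy with fewer parameters and FLOPs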

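Experiments
• Ablation Study
– Switching feature fusion from softmax to fast fusion gives about a 30% speedup with almost no loss in accuracy
– Compound Scaling yields models with better mAP/FLOPs than optimizing the scaling of each dimension individually
[Figure panels: Softmax Fusion | Fast Fusion]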

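Summary
• Proposes EfficientDet, a fast, accurate, and compute-efficient object detection model
– Uses EfficientNet as the backbone
– Proposes the BiFPN module, which efficiently extracts multi-scale features; stacking several BiFPN layers also extracts higher-level features
– Compound Scaling jointly scales resolution/width/depth with a single shared coefficient, enabling an efficient search over model configurations
• Achieves SoTA accuracy/speed on COCO
– 4x smaller and 9.3x fewer FLOPs
– Latency: 3.2x faster on GPU, 8.1x faster on CPU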

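Impressions
• Accuracy and speed improve through simple tweaks and extensions. It feels like a result that naturally should work better
– None of the heavily over-engineered feel of NAS-FPN
• Compared with a certain graph in YOLOv3 (arXiv 2018.04), you can feel how fast the field is progressing
• Other thoughts
– It is efficient and also sets a new accuracy SoTA.
If we gave up efficiency to push accuracy even higher, which direction would we take?
– Comparisons start at a minimum resolution of 512. What happens at smaller resolutions?
– How does it perform on other metrics (mAPxx) and other datasets?
– How sensitive is it to the heuristics in Compound Scaling?
– What would it look like combined with keypoint-based approaches?
(Around here?)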
