Talk overview: CNNs for depth estimation based on the human visual system, and CNNs inspired by conventional methods
Case1: Cross-channel stereo matching
Case2: Depth from light field
Case3: Multiview stereo
Conclusion
1. Depth estimation: Do we need to throw old things away?
Hae-Gon Jeon (전해곤)
1
Assistant Professor
2. My Research Timeline
2011: MS course · 2013~2018: Ph.D. course · 2018~Present: Post-doc
- Coded exposure imaging: motion deblurring [ICCV'13, ICCV'15, IJCV'17, TIP'17]
- Light-field imaging (2015~Present) [ECCV'14, CVPR'15, ICCVW'15, PAMI'17, SPL'17, TPAMI'19, CVPR'18]
- Depth + denoising in low-light (2016~Present) [CVPR'16, CVPR'18, TIP'19]
- Depth from small motion [ICIP'15, ICCV'15, CVPR'16, ECCV'16, SPL'17, CVPR'17, TPAMI'19]
- Visual and AI system for rescue robotics (2018~Present): highly accurate 3D maps, optimized path generation from real map information
[ICRA'19, Submitted to IROS'19 (1/2), IROS'19 (1/2); IEEE TPAMI (Major Revision); IEEE TIP (Major Revision)]
2
9. 9
Human Visual System and DispNet
N. Mayer et al., A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation, CVPR 2016
10. 10
Human Visual System
The retina is a multilayered membrane that contains millions of light-sensitive cells that detect the image and translate it into a series of electrical signals.
The optic nerves from both eyes join at the optic chiasma, where information from the two retinas is correlated.
Humans constantly scan objects in their field of view, usually resulting in a perceived image that is uniformly sharp.
11. 11
Upconvolutional layers
• high-level information passed from
coarser feature maps
• fine local information provided in
lower layer feature maps
Correlation layer
• Multiplicative patch
comparisons between two
feature maps
• No trainable weights
Convolution layer
• identical processing streams for the two images
• With this architecture the network is constrained to
first produce meaningful representations of the two
images separately
DispNet: an end-to-end disparity estimation network (no separate optimization needed)
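The correlation layer described above can be sketched in a few lines. This is a minimal NumPy illustration of the idea (multiplicative patch comparison between two feature maps over a range of horizontal displacements, with no trainable weights), not DispNet's actual implementation; the toy feature maps are hypothetical.

```python
import numpy as np

def correlation_layer(f1, f2, max_disp=2):
    # Multiplicative patch comparison between two (C, H, W) feature maps
    # over horizontal displacements -max_disp..max_disp; no trainable weights.
    C, H, W = f1.shape
    out = np.zeros((2 * max_disp + 1, H, W), dtype=f1.dtype)
    for i, d in enumerate(range(-max_disp, max_disp + 1)):
        shifted = np.roll(f2, -d, axis=2)        # displace f2 by d pixels
        out[i] = (f1 * shifted).sum(axis=0) / C  # per-location dot product
    return out

# A single active column correlates only at zero displacement.
f = np.zeros((4, 3, 6))
f[:, :, 2] = 1.0
corr = correlation_layer(f, f)  # shape (5, 3, 6); peak at displacement index 2
```

In the real network the correlation volume is concatenated with features and fed to further convolutions; here it only shows that the layer is a fixed (parameter-free) similarity computation.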
14. 18
Asymmetric stereo Light-field camera Monocular camera
[CVPR’16, Silver Prize of Samsung Humantech
Paper Award, Submitted to IEEE TIP]
[ECCV’14, CVPR’15, ICCVW’15, CVPR’18, IEEE
TPAMI’17, IEEE SPL’17, IEEE TPAMI’19,
Robustness champion of CVPR’17 workshop]
[ICCV’15, CVPR’16, ECCV’16, CVPR’18,
ICLR’19, IEEE TPAMI’17, IEEE SPL’17, IEEE
TPAMI’19, IEEE TPAMI under minor revision]
Today’s Talk
15. Stereo Matching with Color and Monochrome cameras
Publications
• Stereo Matching with Color and Monochrome Cameras in Low-light Conditions
Hae-Gon Jeon, Joon-Young Lee, Sunghoon Im, Hyowon Ha and In So Kweon
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2016
• HumanTech Paper Award 2016, Silver Prize
• CMSNet: Deep Color and Monochrome Stereo
Hae-Gon Jeon, Sunghoon Im, Joon-Young Lee and Martial Hebert
Submitted to IEEE Transactions on Image Processing
19
16. Low-light imaging 1: Burst photography
[Ziwei Liu et al., SIGGRAPH Asia 14,
Sam Hasinoff et al., SIGGRAPH Asia 16]
Courtesy of S. Im, H.-G. Jeon and I.S. Kweon,
[Submitted to CVPR 18]
Results from Google Camera HDR+
Sam Hasinoff et al., SIGGRAPH Asia 16
Burst shots with a short exposure suffer from under-exposure
20
17. Noisy visible image · IR image · Fused output
Multi-spectral Video Fusion [IEEE TIP 07]
- Twin cameras: IR/Visible
- Temporal smoothing
- Cross-bilateral filter
Low-light imaging2: Multi-spectral fusion
21
18. Imaging performance
[Quantum efficiency (%) vs. wavelength (300–1000 nm): the color camera has separate blue/green/red curves, while the monochrome camera adds a gray curve with higher overall sensitivity.]
Color camera (RGB): color information; reduced sharpness; vulnerable to noise
Monochrome camera (W): no color information; sharp image; robust to noise
Issues in depth from an RGB-W image pair: 1) different spectral sensitivities, 2) severe noise
RGB2Gray · Monochrome · The proposed RGB-W stereo
22
20. Gain map Monochrome by
gain adjustment
Decolorized image
Disparity map
Monochrome
Overview of the proposed method
(1) Input
pair
(2) Gain map
(3) Decolorization
(4) Disparity map by
iterative gain
adjustment
(5) Refined map by
a tree-based filtering
(6) High-quality
color image
Solution
24
21. Decolorization and gain compensation
Gain compensation: a linear, global gain compensation is impossible due to the different spectral sensitivities, so the tractable solution is decolorization followed by gain compensation.
Decolorization cost:
I^γ = ω_r I_r + ω_g I_g + ω_b I_b,  ω_r + ω_g + ω_b = 1,  ω_r, ω_g, ω_b ≥ 0,  ω_{r,g,b} ∈ {0.1, 0.2, ⋯, 1.0}
1) Contrast preservation: E_c(γ) = ‖G(I, I) − G(I, Ī^γ)‖₁
2) Noise suppression: E_n(γ) = (‖∇_u I^γ‖₁ + ‖∇_v I^γ‖₁) / (‖∇_u I^γ‖₂ + ‖∇_v I^γ‖₂)
I^γ: the decolorized image; ω_{r,g,b}: weighting parameters of each color channel; I_{r,g,b}: the three color channels; G: the guided output image; ∇_{u,v}: image gradients in the horizontal u and vertical v directions
RGB2Gray · Only contrast · Proposed (cost: high → low)
25
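The decolorization weight search on this slide can be sketched as follows. This toy version scores candidate weights only with the noise-suppression ratio E_n (the full method also evaluates contrast preservation E_c with a guided filter), and the image and function names are illustrative.

```python
import itertools
import numpy as np

def decolorize(img, w):
    # img: (H, W, 3) float RGB image; w: (w_r, w_g, w_b) channel weights.
    return img @ np.asarray(w)

def noise_energy(gray):
    # Noise-suppression score: ratio of L1 to L2 gradient norms.  Sparse,
    # strong gradients (edges) give a small ratio; dense weak gradients
    # (noise) give a large one.
    gu = np.diff(gray, axis=1).ravel()
    gv = np.diff(gray, axis=0).ravel()
    l1 = np.abs(gu).sum() + np.abs(gv).sum()
    l2 = np.linalg.norm(gu) + np.linalg.norm(gv)
    return l1 / (l2 + 1e-12)

def best_weights(img):
    # Exhaustive search over the quantized simplex w_r + w_g + w_b = 1.
    steps = [i / 10 for i in range(11)]
    cands = [(r, g, round(1.0 - r - g, 1)) for r, g in
             itertools.product(steps, steps) if 1.0 - r - g > -1e-9]
    return min(cands, key=lambda w: noise_energy(decolorize(img, w)))

rng = np.random.default_rng(0)
img = np.zeros((20, 20, 3))
img[:, 10:, 1] = 1.0                      # clean step edge in the green channel
img[..., 0] = rng.uniform(size=(20, 20))  # noisy red channel
img[..., 2] = rng.uniform(size=(20, 20))  # noisy blue channel
w = best_weights(img)                     # search favors the clean channel
```

On this synthetic input the search concentrates all weight on the clean green channel, which is exactly the behavior the cost is designed to encourage.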
22. RGB-W stereo matching
V(x, l) = α V_SAD(x, l) + (1 − α) V_SIE(x, l)
Brightness consistency — sum of absolute differences (SAD), robust to image noise:
V_SAD(x, l) = Σ_{x∈Ω_x} min(|I_L(x) − I_R^γ(x + d)|, τ₁)
Edge similarity — sum of informative edges (SIE), robust to non-linear intensity variation:
V_SIE(x, l) = Σ_{x∈Ω_x} min(|J(I_L)(x) − J(I_R^γ)(x + d)|, τ₂)
s.t. J(I) = |Σ_{x∈Ω_x} ∇I(x)| / (Σ_{x∈Ω_x} |∇I(x)| + 0.5) + 0.5
The sum of signed gradients cancels out image noise, while the sum of absolute gradients measures how strong the edges are.
Ω_x: supporting window centered at pixel x; d: disparity; τ_{1,2}: truncation values
Color image · Informative edge map J(I) · Conventional gradient map ∇I (sum of intensity; intensity ×3)
26
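The combined SAD + SIE cost can be sketched for a single window pair. This is a simplified, window-level version (the informative-edge measure J is computed once per patch here, whereas the slide applies it per pixel over the supporting window); α and the truncation values τ are placeholders.

```python
import numpy as np

def informative_edge(patch):
    # J(I): |sum of signed gradients| / (sum of absolute gradients + 0.5) + 0.5.
    # Signed gradients cancel for noise; absolute gradients measure edge strength.
    g = np.diff(patch, axis=1)
    return np.abs(g.sum()) / (np.abs(g).sum() + 0.5) + 0.5

def matching_cost(pl, pr, alpha=0.5, tau1=0.5, tau2=0.5):
    # V = alpha * truncated SAD (brightness consistency)
    #   + (1 - alpha) * truncated SIE (edge similarity).
    sad = np.minimum(np.abs(pl - pr), tau1).sum()
    sie = min(abs(informative_edge(pl) - informative_edge(pr)), tau2)
    return alpha * sad + (1 - alpha) * sie
```

Identical patches get zero cost; patches that disagree in brightness or edge structure are penalized, with each term truncated so outliers cannot dominate.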
24. Quantitative evaluation
ANCC: Heo et al., Robust stereo matching using adaptive normalized cross-correlation, IEEE PAMI 2011
DASC: Kim et al., DASC: Dense adaptive self-correlation descriptor for multimodal and multi-spectral correspondence, CVPR 2015
JDMCC: Heo et al., Joint depth map and color consistency estimation for stereo images with different illuminations and cameras, IEEE PAMI 2013
CCNG: Holloway et al., Generalized assorted camera arrays: Robust cross-channel registration and applications, IEEE TIP 2015
Structured light · Ground truth
Bright illumination: 36.64% · 34.29% · 11.55% · 21.45% · 22.44%
Dark illumination: ANCC 37.89% · JDMCC 36.24% · Proposed 19.24% · CCNG 32.56% · DASC 41.71%
28
26. Evaluations
(·) is the bad-pixel rate.
Dataset: Reindeer — Monochrome · Color · ANCC (40.10%) · DASC (39.85%) · JDMCC (32.86%) · CCNG (31.80%) · Proposed (8.89%) · Ground truth
Dataset: Moebius — Monochrome · Color · ANCC (26.58%) · DASC (26.91%) · JDMCC (18.54%) · CCNG (18.44%) · Proposed (15.14%) · Ground truth
30
27. Colorization and enhancement — high-quality color image recovery
Colorization method: color image → Y and V channels of the color image, U & V channel mapping, SLIC super-pixels → colorization result
31
30. Problem
(1) Input pair → (2) Gain map → (3) Decolorization → (4) Disparity map by iterative gain adjustment → (5) Refined map by a tree-based filtering → (6) High-quality color image
User parameters: 1. gain threshold; 2. matching window size; 3. balance value; 4. # of iterations; 5. smoothness parameter; 6. # of super-pixels; 7. color similarity
34
32. 36
CNN version of RGB-W Stereo
Image recovery
Depth estimation
Encoder
Consistency
33. 37
W C
Denoising
Left - Mono
Right - Color
Disparity
Denoised - Chrominance
Denoised - Mono
Initial colorization
Final color image
-
Occlusion
Occlusion
Disparity Colorization
CNN version of RGB-W Stereo
42. 46
Asymmetric stereo Light-field camera Monocular camera
[CVPR’16, Silver Prize of Samsung Humantech
Paper Award, Submitted to IEEE TIP]
[ECCV’14, CVPR’15, ICCVW’15, CVPR’18, IEEE
TPAMI’17, IEEE SPL’17, IEEE TPAMI’19,
Robustness champion of CVPR’17 workshop]
[ICCV’15, CVPR’16, ECCV’16, CVPR’18,
ICLR’19, IEEE TPAMI’17, IEEE SPL’17, IEEE
TPAMI’19, IEEE TPAMI under minor revision]
Today’s Talk
43. Depth from Single Light Field Images
Publications
• Accurate Depth Map Estimation from a Lenslet Light Field Camera
Hae-Gon Jeon, Jaesik Park, Gyeongmin Choe, Jinsun Park, Yunsu Bok, Yu-Wing Tai and In So Kweon
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2015
• Depth from a Light Field Image with Learning- based Matching Costs
Hae-Gon Jeon, Jaesik Park, Gyeongmin Choe, Jinsun Park, Yunsu Bok, Yu-Wing Tai and In So Kweon
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), Feb 2019
• Depth Estimation Challenge: Robustness Champion, CVPR workshop on Light Field for Computer Vision
• EPINET: A Fully-Convolutional Neural Network using Epipolar Geometry for Depth from Light Field Images
Changha Shin, Hae-Gon Jeon, Youngjin Yoon, In So Kweon and Seon Joo Kim
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2018
47
44. Epipolar plane image (EPI)
Synthetic EPI
Estimate slopes of lines
[Wanner and Goldluecke PAMI 13, Tao et al. ICCV 13, Tao et al. CVPR 15,
Wang et al. CVPR 16, Williem et al. CVPR 16, Heber et al. CVPR 17]
Light-field image
48
45. Commercial light-field camera
Object · main lens · micro-lens array · sensor
Blocking penetration of light; capturing the angular information of rays on one sensor.
Sensor size 3280 × 3280 → sub-aperture image 328 × 328
Problem 1: reduced spatial resolution. Problem 2: increased photon noise.
49
46. Real-world EPI
Epipolar plane image
EPIs from a plenoptic camera are corrupted by noise and aliasing, and show vertical luminance changes due to the circular micro-lenses.
50
47. Multiview stereo-based approach [CVPR'15]
Sub-aperture images have a very narrow baseline (physically 0.45 mm; disparities within 1 px) — flipping adjacent views.
Sub-pixel shifts computed in the Fourier domain, accurate to 1/100 pixel:
ℱ{I(x + Δx)} = ℱ{I(x)} · e^{2πiΔx·ω}
I(x + Δx) = ℱ⁻¹{ℱ{I(x)} · e^{2πiΔx·ω}}
ℱ: Fourier transform over frequencies ω; Δx: sub-pixel displacement of x
[Averbuch and Keller, "A unified approach to FFT-based image registration", IEEE TIP 2003]
Bilinear · Bicubic · Phase · Original
51
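The Fourier phase-shift idea above can be sketched directly with NumPy's FFT; this is a minimal illustration of the shift theorem, not the paper's pipeline, and it assumes periodic boundaries.

```python
import numpy as np

def phase_shift(img, dx, dy):
    # Sub-pixel shift via the Fourier shift theorem:
    # I(x + dx) = F^-1{ F{I(x)} * exp(2*pi*i*(u*dx + v*dy)) }.
    H, W = img.shape
    u = np.fft.fftfreq(W)   # horizontal frequencies (cycles/sample)
    v = np.fft.fftfreq(H)   # vertical frequencies
    ramp = np.exp(2j * np.pi * (u[None, :] * dx + v[:, None] * dy))
    return np.real(np.fft.ifft2(np.fft.fft2(img) * ramp))

img = np.zeros((8, 8))
img[4, 4] = 1.0
shifted = phase_shift(img, 1.0, 0.0)  # integer shift reproduces np.roll
```

For integer displacements this reproduces a circular shift exactly; for fractional displacements (e.g. `dx=0.45`) it interpolates in the Fourier domain, which is why narrow-baseline light-field disparities can be handled with sub-pixel precision.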
48. Cost volume
A matching cost f(·,·) compares the reference view with each target sub-aperture view and is accumulated into a cost volume over depth labels:
- Sum of absolute differences (SAD)
- Sum of gradient differences (GRAD)
52
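The cost-volume construction can be sketched on toy data; this simplified version uses integer horizontal shifts in place of the phase-shift warping, and the weighting between the SAD and GRAD terms is an assumed parameter.

```python
import numpy as np

def cost_volume(ref, tgt, num_labels=8, w_grad=0.5):
    # Per-pixel SAD + GRAD matching cost for each (integer) disparity label,
    # shifting the target sub-aperture image horizontally.
    gx_ref = np.gradient(ref, axis=1)
    vol = np.zeros((num_labels,) + ref.shape)
    for d in range(num_labels):
        shifted = np.roll(tgt, d, axis=1)
        vol[d] = ((1 - w_grad) * np.abs(ref - shifted)
                  + w_grad * np.abs(gx_ref - np.gradient(shifted, axis=1)))
    return vol

rng = np.random.default_rng(1)
ref = rng.random((6, 16))
tgt = np.roll(ref, -3, axis=1)                    # true disparity = 3 px
depth = np.argmin(cost_volume(ref, tgt), axis=0)  # winner-take-all labels
```

The winner-take-all `argmin` recovers the synthetic disparity everywhere; the real method instead aggregates the volume and refines it before choosing labels.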
53. Center View 3D mesh Actual scale Measured distance in 3D
Lytro Illum
Our Simple Lens Light Field Camera Dataset
Our simple-lens camera without distortion correction · with distortion correction
Qualitative Evaluation
57
54. There are still problems
Problem 1: severe vignetting. Problem 2: severe noise.
1. Hard to find accurate correspondences under radiometric distortion and severe noise ⇒ use various hand-crafted matching costs
2. Which matching cost is correct? ⇒ predict the correct matching cost using two random forests
3. Does it work well on real-world light-field images? ⇒ generate a realistic dataset based on the imaging pipeline of the Lytro camera
58
55. Overview of the Proposed Method [TPAMI'19]
1. Realistic light field image generation: emulating the imaging pipeline of the Lytro camera
2. Making cost volumes using phase shift: overcoming the inherent degradation of light-field images caused by the microlens array
3. Random forest 1 (classification): selecting dominant matching costs
4. Random forest 2 (regression): predicting a disparity value with sub-pixel precision
Matching-cost vector: q = [SAD, GRAD, Census, ZNCC]
59
56. Light-field camera geometric calibration
Raw image → sub-aperture images.
Indirect line fitting with a template over (θ, t): a line x·sinθ + y·cosθ + t = 0 is fitted by finding the best match.
Projection model: a 3D point (X, Y, Z) and the micro-lens center (X_c, Y_c, Z_c) project through the main lens (focal length F) to the image point (x, y) and the projected micro-lens center (x_c, y_c).
Line feature: a·(u − u_c) + b·(v − v_c) + c = 0, with micro-lens center (u_c, v_c).
Closest point to the micro-lens center from the projection of adjacent corners (u, v) and (u′, v′):
[û, v̂]ᵀ = [u′, v′]ᵀ + k([u, v]ᵀ − [u′, v′]ᵀ)
[Y. Bok, H.-G. Jeon, and I. S. Kweon, Geometric Calibration of Micro-Lens-Based Light-Field Cameras using Line Features, ECCV 2014, IEEE TPAMI 2017]
Dansereau et al., ICCV13 · Proposed
60
57. Data Generation: Vignetting Map
Noise-free multi-view images Vignetting map from averaged
white plane images
Sub-aperture image with
vignetting map
61
58. Data Generation: Lenslet Image Generation
Sub-aperture image with
vignetting map
Extract a pixel from each sub-aperture image
Aggregate these pixels into a lenslet
62
59. Data Generation: Add Noise
Noise level estimation of each color channel: [standard deviation (0–0.025) vs. intensity (0.2–0.6) plots for the Green 1, Green 2, Blue, and Red channels]
Convert color image to raw image
Y. Schechner et al., "Multiplexing for optimal lighting", IEEE TPAMI 2007
63
60. Data Generation: Realistic Sub-aperture Image Generation
Noisy raw image → demosaicing → rearrange pixels at each lenslet into each sub-aperture image
64
61. Effectiveness of the augmented training dataset
Depth profile without, and with, Gaussian-noise augmentation
65
63. Cost Volumes: Matching Costs
- Sum of absolute differences (SAD): robust to image noise; acts as an averaging filter
- Zero-mean normalized cross-correlation (ZNCC): compensates for differences in both gain and offset
- Census transform (Census): tolerates radiometric distortions
- Sum of gradient differences (GRAD): imposes higher weights at edge boundaries; synergy with the other matching costs
H. Hirschmuller and D. Scharstein, "Evaluation of stereo matching costs on images with radiometric differences," IEEE TPAMI 2009.
67
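Two of these costs are easy to sketch concretely. This minimal NumPy version shows why ZNCC cancels gain/offset changes and why the census transform depends only on local intensity ordering; patch sizes and the epsilon are illustrative.

```python
import numpy as np

def zncc(p, q):
    # Zero-mean normalized cross-correlation: invariant to gain and offset.
    p = p - p.mean()
    q = q - q.mean()
    return float((p * q).sum() / (np.linalg.norm(p) * np.linalg.norm(q) + 1e-12))

def census(patch):
    # Census transform: bit string of "pixel > center" comparisons --
    # depends only on the local intensity ordering (radiometric robustness).
    c = patch[patch.shape[0] // 2, patch.shape[1] // 2]
    return (patch.ravel() > c).astype(np.uint8)

def census_cost(p, q):
    # Hamming distance between census bit strings.
    return int(np.count_nonzero(census(p) != census(q)))
```

A patch transformed by `2*p + 5` (gain and offset) still has ZNCC ≈ 1 and census cost 0, while inverting the intensities flips every census bit.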
71. Random Forest 1 — Classification
The matching-cost vector q = [q̂₁, …, q̂₁₁] (matching groups 1–4) is analyzed for importance: a set of important matching costs is retrieved using the permutation importance measure [L. Breiman, "Random forests," Machine Learning].
+ Removes unnecessary matching costs
+ Allows designing a better prediction model
75
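The permutation-importance measure itself can be sketched independently of the forest. This toy version uses a fixed linear predictor as a stand-in for the trained random forest (all data and names are synthetic); the principle is the same: shuffle one feature column and see how much the error grows.

```python
import numpy as np

def permutation_importance(predict, X, y, rng):
    # Increase in MSE when each feature column is shuffled -- features
    # whose permutation hurts the most matter the most.
    base = np.mean((predict(X) - y) ** 2)
    imps = []
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])
        imps.append(np.mean((predict(Xp) - y) ** 2) - base)
    return np.array(imps)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))        # 4 candidate matching costs
y = 3.0 * X[:, 0] + 0.1 * X[:, 3]    # disparity depends mostly on cost 0
predict = lambda Z: 3.0 * Z[:, 0] + 0.1 * Z[:, 3]  # stand-in for the forest
imp = permutation_importance(predict, X, y, rng)
keep = np.argsort(imp)[::-1]         # rank matching costs by importance
```

Costs the predictor ignores get exactly zero importance, so they can be dropped before training the regression forest.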
72. Random Forest 2 — Regression
The selected cost vector q̂ = [q̂₁, …, q̂₁₁] is the input of a random forest for regression, which estimates a disparity value with sub-pixel precision.
vs. SAD+GRAD [H.-G. Jeon et al., IEEE CVPR 2015] with weighted median filter [Z. Ma et al., IEEE ICCV 2013]
76
73. Real-world examples – Lytro Illum
Wanner and Goldluecke,
IEEE TPAMI 14
Yu et al,
ICCV 13
Ours,
CVPR 15
Williem et al,
CVPR 16
Wang et al,
IEEE TPAMI 16
Tao et al,
IEEE TPAMI 17 Proposed
Wanner and Goldluecke,
IEEE TPAMI 14
Yu et al,
ICCV 13
Williem et al,
CVPR 16
Wang et al,
IEEE TPAMI 16
Tao et al,
IEEE TPAMI 17 Proposed
Ours,
CVPR 15
77
75. Benchmark Bad pixel ratio (>0.07px) & Mean square error
Bad pixel ratio Mean square error
(2017.05.23)
Robustness Champion!!
79
76. Qualitative evaluation on different input setups
Kim et al., "Scene Reconstruction from High Spatio-Angular Resolution Light Fields", SIGGRAPH 2013
DSLR camera mounted on
motorized linear stage
Center view Kim et al. (# of input: 51) Proposed (# of input: 9)
80
77. Qualitative evaluation on different input setups
Samsung Galaxy Note 8
Input images SGM Proposed
H. Hirschmuller. Stereo processing by
semiglobal matching and mutual
information, IEEE PAMI 2008
81
86. Center view · (b) · (c) · (d) · Ours (e) · (f) · (g)  (two example scenes)
B: Globally consistent depth labeling of 4D light fields. S. Wanner and B. Goldluecke
C: Accurate depth map estimation from a lenslet light field camera. H.-G. Jeon et al.
D: Robust light field depth estimation for noisy scene with occlusion. W. Williem et al.
E: Occlusion-aware depth estimation using light-field cameras. T.-C. Wang et al.
F: Shape estimation from shading, defocus, and correspondence using light-field angular coherence. Tao et al.
G: Line assisted light field triangulation and stereo matching. Z. Yu et al.
90
88. 92
Asymmetric stereo Light-field camera Monocular camera
[CVPR’16, Silver Prize of Samsung Humantech
Paper Award, Submitted to IEEE TIP]
[ECCV’14, CVPR’15, ICCVW’15, CVPR’18, IEEE
TPAMI’17, IEEE SPL’17, IEEE TPAMI’19,
Robustness champion of CVPR’17 workshop]
[ICCV’15, CVPR’16, ECCV’16, CVPR’18,
ICLR’19, IEEE TPAMI’17, IEEE SPL’17, IEEE
TPAMI’19, IEEE TPAMI under minor revision]
Today’s Talk
89. Depth from Small Motion Video Clip
Publications (Co-author papers)
• High Quality Structure from Small Motion for Rolling Shutter Cameras
Sunghoon Im, Hyowon Ha, Gyeongmin Choe, Hae-Gon Jeon, Kyungdon Joo and In So Kweon
IEEE International Conference on Computer Vision (ICCV), Dec 2015
• High-quality Depth from Uncalibrated Small Motion Clip [Oral presentation]
Hyowon Ha, Sunghoon Im, Jaesik Park, Hae-Gon Jeon and In So Kweon
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , Jun 2016
• All-around Depth from Small Motion with A Spherical Panoramic Camera
Sunghoon Im, Hyowon Ha, François Rameau, Hae-Gon Jeon, Gyeongmin Choe and In So Kweon
European Conference on Computer Vision (ECCV), Oct 2016
• Robust Depth Estimation from Auto Bracketed Images
Sunghoon Im, Hae-Gon Jeon and In So Kweon
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2018
• Accurate 3D Reconstruction from Small Motion Clip for Rolling Shutter Cameras
Sunghoon Im, Hyowon Ha, Gyeongmin Choe, Hae-Gon Jeon, Kyungdon Joo and In So Kweon
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), Apr 2019
• DPSNet: End-to-end Deep Plane Sweep Stereo
Sunghoon Im, Hae-Gon Jeon, Steve Lin and In So Kweon
International Conference on Learning Representations (ICLR), May 2019
93
94. Sparse 3D reconstruction [ICCV'15]
Bundle adjustment:
C(P, X) = Σ_{i=1}^{N_I} Σ_{j=1}^{N_J} ‖x_ij − φ(K P_ij X_j)‖²
x = [u, v, 1]ᵀ: 2D image coordinate
X = [X, Y, Z, 1]ᵀ: world coordinate
φ([X, Y, Z]ᵀ) = [X/Z, Y/Z, 1]ᵀ
N_I: # of images, N_J: # of features
K: intrinsic matrix, P_ij: extrinsic matrix
K = [[f_x, α, c_x], [0, f_y, c_y], [0, 0, 1]]
[Jacobian structure: 2 × (# of re-projection points) rows by (# of refined values) columns, shown with and without rolling shutter for 8 features and 6 images.]
Small-angle approximation of the rotation matrix:
P_ij = [R(r_ij) | t_ij], where R(r_ij) ≈ [[1, −r_ij^z, r_ij^y], [r_ij^z, 1, −r_ij^x], [−r_ij^y, r_ij^x, 1]]
Rotation and translation components: r_ij = r_i + w(r_{i+1} − r_i), t_ij = t_i + w(t_{i+1} − t_i)
98
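The small-angle rotation and per-scanline pose interpolation on this slide can be sketched as follows; this is a minimal illustration with a hypothetical intrinsic matrix, not the actual bundle-adjustment code.

```python
import numpy as np

def small_angle_R(r):
    # First-order rotation for small motion: R(r) ~ I + [r]_x.
    rx, ry, rz = r
    return np.array([[1.0, -rz,  ry],
                     [ rz, 1.0, -rx],
                     [-ry,  rx, 1.0]])

def rs_pose(r_i, r_i1, t_i, t_i1, w):
    # Rolling-shutter pose: interpolate rotation/translation along the
    # scanline fraction w in [0, 1]:  r_ij = r_i + w (r_{i+1} - r_i).
    return small_angle_R(r_i + w * (r_i1 - r_i)), t_i + w * (t_i1 - t_i)

def project(K, R, t, X):
    # phi(K (R X + t)): perspective projection to pixel coordinates.
    x = K @ (R @ X + t)
    return x[:2] / x[2]

K = np.array([[500.0, 0.0, 320.0],        # example intrinsics
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R, t = rs_pose(np.zeros(3), np.zeros(3), np.zeros(3), np.zeros(3), 0.5)
uv = project(K, R, t, np.array([0.0, 0.0, 2.0]))  # point on the optical axis
```

The reprojection residual minimized in the bundle adjustment is then simply the difference between an observed feature location and `project(K, R, t, X)`.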
96. Input Image Only color smoothness
• Results of dense 3D reconstruction
Input Image Only color
smoothness
Dense 3D Reconstruction
100
97. Dense 3D Reconstruction
Geometry guidance term
Key idea: neighboring pixels with similar color should have similar normals, and normal vectors guide the 3D positions of neighboring pixels, i.e. n_p · (D_q X̂_q − D_p X̂_p) should vanish:
E_g(D) = Σ_p Σ_{q∈W_p} w^g_{pq} ( D_p − (n_p · X̂_q / n_p · X̂_p) D_q )²
w^g_{pq} = (1/N_g) exp( −‖n_p − n_q‖ / γ_g )
Energy function: E(D) = E_d(D) + λ_c E_c(D) + λ_g E_g(D)
X̂_p = [x, y, 1]ᵀ: normalized image coordinate
D_p: depth value at p
X_p = D_p X̂_p: 3D coordinate
n_p: normal vector
W_p: 8-neighbors
N_g, γ_g: constants
Input image · sparse 3D points · normals of 3D points · normal map
101
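The geometry-guidance idea can be sketched numerically. This simplified version penalizes the coplanarity residual n_p · (D_q X̂_q − D_p X̂_p) for right-hand neighbors only and omits the similarity weights (the slide's version sums over the 8-neighborhood with weights); the test scene is synthetic.

```python
import numpy as np

def geometry_energy(D, X_hat, n):
    # Penalize n_p . (D_q X_hat_q - D_p X_hat_p): neighboring pixels should
    # lie on the local plane implied by the normal at p.
    # D: (H, W) depth, X_hat: (H, W, 3) normalized coords, n: (H, W, 3) normals.
    P = D[..., None] * X_hat  # back-projected 3D point per pixel
    resid = np.einsum('ijk,ijk->ij', n[:, :-1], P[:, 1:] - P[:, :-1])
    return float((resid ** 2).sum())

H, W = 4, 5
xs, ys = np.meshgrid(np.linspace(-1, 1, W), np.linspace(-1, 1, H))
X_hat = np.dstack([xs, ys, np.ones((H, W))])  # normalized image coordinates
n = np.zeros((H, W, 3))
n[..., 2] = 1.0                               # fronto-parallel normals
flat = geometry_energy(np.full((H, W), 2.0), X_hat, n)  # constant-depth plane
```

A fronto-parallel plane with constant depth has zero energy, while any depth variation against those normals is penalized; this is the sense in which normals "guide" the 3D positions of neighbors.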
98. Input Image
Sparse 3D points Proposed method
Conventional method
• Results of dense 3D reconstruction
Dense 3D Reconstruction
102
102. There are still problems
1. Exact definition of small motion 2. Blurry depth
106
103. Small Motion Issue [CVPR'16]
Reference image · depth difference · ground truth · our depth map (b = 1.5)
Camera motion : closest object distance = 1 : 100
107
104. Solution to Blurry Depth: Plane Sweeping
Sweep depth planes P1, P2, P3 from near to far. Given the intrinsic and extrinsic camera parameters, the reference view and the other views are warped onto each plane: at the correct depth plane the mean image and intensity profile are sharp, while at wrong planes they are flat.
108
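The sharp-vs-flat consensus behind plane sweeping can be sketched on a toy scene. This version assumes a fronto-parallel scene with pure horizontal camera translation, so warping to a depth plane reduces to a horizontal shift; real plane sweeping uses per-plane homographies.

```python
import numpy as np

def plane_sweep_variance(ref, others, shifts_per_depth):
    # For each candidate depth plane, warp (here: shift) the other views
    # onto the reference and measure disagreement across the view stack.
    # The correct depth gives a consistent stack (low variance); wrong
    # depths give a flat, inconsistent average (high variance).
    costs = []
    for shifts in shifts_per_depth:  # one shift per non-reference view
        stack = [ref] + [np.roll(im, s, axis=1) for im, s in zip(others, shifts)]
        costs.append(np.var(np.stack(stack), axis=0).mean())
    return np.array(costs)

rng = np.random.default_rng(2)
ref = rng.random((8, 16))
others = [np.roll(ref, -2, axis=1), np.roll(ref, -4, axis=1)]  # baselines 1x, 2x
costs = plane_sweep_variance(ref, others, [[1, 2], [2, 4], [3, 6]])
```

The middle candidate (shifts proportional to the true disparity) has zero variance, so `argmin(costs)` selects the correct plane.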
105. Depth Quantization Error [TPAMI'19]
Caused by the quantized depth range in plane sweeping
Depth range
109
106. Adaptive Matching Window [TPAMI'19]
Adaptive depth range: use the blurry initial depth to estimate a per-pixel min–max depth range.
One parameter controls the confidence weight in [0, 1]; another controls the steepness of the exponential function.
Confidence map · initial depth
110
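The per-pixel depth-range idea can be sketched as follows; this is a minimal illustration (local min/max of the blurry initial depth over a square window), with the window radius and margin as assumed parameters.

```python
import numpy as np

def per_pixel_depth_range(blurry_depth, radius=2, margin=0.0):
    # Bound the plane sweep per pixel using the blurry initial depth:
    # min/max of the (2r+1)^2 neighborhood, optionally padded by a margin,
    # instead of one global [d_min, d_max] for the whole image.
    H, W = blurry_depth.shape
    pad = np.pad(blurry_depth, radius, mode='edge')
    lo = np.full((H, W), np.inf)
    hi = np.full((H, W), -np.inf)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            win = pad[dy:dy + H, dx:dx + W]
            lo = np.minimum(lo, win)
            hi = np.maximum(hi, win)
    return lo - margin, hi + margin

D = np.arange(16.0).reshape(4, 4)          # toy blurry initial depth
lo, hi = per_pixel_depth_range(D, radius=1)
```

Sweeping only within `[lo, hi]` at each pixel concentrates the same number of depth planes into a much narrower interval, which is what reduces the quantization error.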
110. Common Pipeline of Traditional Approaches
Procedure of stereo matching: image → cost computation → cost volume → cost aggregation (input, guide, output) → graph cuts → iterative refinement
Matching costs:
- Parametric: AD, SAD, BT, mean filter, Laplacian of Gaussian, bilateral filtering, ZSAD, NCC, ZNCC
- Nonparametric: rank filter, soft rank filter, census filter, ordinal
- Mutual information: hierarchical MI
Energy: E = E_data + E_smooth + E_consistency + E_regularization
Sub-pixel refinement around the winning label l_r:
d* = l_r − (C(l⁺) − C(l⁻)) / (2(C(l⁺) + C(l⁻) − 2C(l_r)))
Accurate Depth Map Estimation from a Lenslet Light Field Camera, Hae-Gon Jeon et al., IEEE CVPR, Jun 2015
Stereo Matching with Color and Monochrome Cameras in Low-light Conditions, Hae-Gon Jeon et al., IEEE CVPR, Jun 2016
→ Fully end-to-end process
114
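The standard sub-pixel refinement step in this pipeline, a parabola fit through the cost curve around the winning label, is a one-liner:

```python
import numpy as np

def subpixel_disparity(C, l):
    # Parabola fit around the winning label l of the 1D cost curve C:
    # d* = l - (C[l+1] - C[l-1]) / (2 * (C[l+1] + C[l-1] - 2*C[l])).
    num = C[l + 1] - C[l - 1]
    den = 2.0 * (C[l + 1] + C[l - 1] - 2.0 * C[l])
    return l - num / den if den != 0 else float(l)

# Costs sampled from an exact parabola with minimum at d = 2.3.
C = (np.arange(6.0) - 2.3) ** 2
d_star = subpixel_disparity(C, int(np.argmin(C)))
```

Because the fit is exact for quadratic cost curves, the refined disparity lands on the true continuous minimum even though the labels are integers.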
111. Design a network inspired by the traditional plane sweeping algorithm
Overview of DPSNet [ICLR'19]:
- Input image (4W × 4H × 3) and i-th pair image (i ∈ {1, …, N}) → feature extraction (2D CNN, 1/4 downsampled, W × H × CH)
- Warping through the l-th plane (sweep), l = 1, …, L; feature concatenation for volume generation (W × H × 2CH × L; no learnable parameters)
- Cost volume generation (3D CNN) → i-th and (i+1)-th cost volumes; averaging the costs over i (W × H × L; no learnable parameters)
- Cost aggregation + upsampling (2D CNN, with the reference image) → depth regression (softmax; 4W × 4H × 1; no learnable parameters)
115
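The parameter-free depth-regression stage at the end of the pipeline (softmax over the label axis followed by an expectation) can be sketched in NumPy; this stands in for the network's differentiable regression layer, with toy costs and depth labels.

```python
import numpy as np

def soft_argmax_depth(cost_volume, depth_values):
    # Differentiable depth regression: softmax over negated costs along the
    # label axis, then the expected depth (sub-label precision, no argmax).
    s = -cost_volume
    s = s - s.max(axis=0, keepdims=True)  # numerical stability
    p = np.exp(s)
    p = p / p.sum(axis=0, keepdims=True)  # per-pixel label probabilities
    return np.tensordot(depth_values, p, axes=(0, 0))

C = np.full((4, 2, 3), 10.0)
C[1] = 0.0                                # label 1 is clearly the cheapest
depths = np.array([1.0, 2.0, 3.0, 4.0])
est = soft_argmax_depth(C, depths)        # (2, 3) map of expected depths
```

Because the output is an expectation rather than a hard argmax, gradients flow through the whole cost volume, which is what lets such a network train end to end.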
112. Training & Test Process
Testing: run the same process using the reference and the i-th image (i = 1, …, N); iteratively add all N cost volumes, then average them.
[The pipeline diagram repeats the DPSNet overview from the previous slide.]
116
113. Deep Cost Aggregation
Filter each cost volume slice using reference-image features, inspired by traditional cost volume filtering [Rhemann et al., CVPR 2011]. Shared weights are used for all layers.
Cost volume slice + reference image feature → context network (2D convolution) → initial + residual → aggregated volume
117
114. Ablation Study: Cost Aggregation
Reference · GT depth · estimated depth
Slices of the volume along a label (far/close) and along the green row in the reference image (x: column, y: cost layer), before and after aggregation
118
115. Ablation Study: Cost Aggregation — Confidence Measures
Depth map evaluation: lower is better / higher is better
- Winner margin (WM): difference between the maximum and the second maximum response
- Curvature (CUR): difference near the maximum response
119
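Both confidence measures can be sketched on a response volume; this is a minimal illustration operating on a probability-like volume where higher values mean better matches (the axis conventions and toy data are assumptions).

```python
import numpy as np

def winner_margin(P):
    # WM: best minus second-best response along the label axis (axis 0);
    # a larger margin means a more confident match.
    s = np.sort(P, axis=0)
    return s[-1] - s[-2]

def curvature(P):
    # CUR: 2*P[l] - P[l-1] - P[l+1] around the winning label l;
    # a sharp peak (large curvature) indicates a confident match.
    l = np.argmax(P, axis=0)
    lm = np.clip(l - 1, 0, P.shape[0] - 1)[None]
    lp = np.clip(l + 1, 0, P.shape[0] - 1)[None]
    take = lambda idx: np.take_along_axis(P, idx, axis=0)[0]
    return 2 * take(l[None]) - take(lm) - take(lp)

P = np.zeros((5, 2, 2))
P[2, 0, 0] = 1.0   # sharp, confident peak at one pixel
P[:, 0, 1] = 0.2   # flat, ambiguous response at another
P[:, 1, :] = 0.2
```

A sharp peak yields a large margin and curvature, while a flat response yields zero for both, which is why these maps can be thresholded into a confidence mask.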
116. ⇐ Error metrics
w.r.t the number of images
⇐ Depth map result
w.r.t the number of images
Reference GT depth 2-view 3-view 4-view
Ablation Study: Number of Input Images
120
119. Summary
Step 1: Solution
- Propose new ideas: iterative decolorization, phase shift, rolling-shutter bundle adjustment → heavy computational burden
- Cascade optimization: cost aggregation, graph cuts, weighted median filtering, tree-based filtering → needs careful tuning of user parameters
Step 2: Maturity
- Handle remaining issues: photometric distortion → random forest prediction; depth quantization error → adaptive matching window
- → Still suffering from computational issues
Step 3: Breakthrough
- Design of CNNs via Re-Search: CMSNet (a fraction of iterative stereo matching), EPINet (merging traditional approaches), DPSNet (inspired by the traditional plane sweeping algorithm)
- → No user parameters in the test phase; fast depth prediction; accurate results; large number of training parameters; new applications
123
Personal website https://sites.google.com/site/hgjeoncv/home