1. Deep Convolutional Neural Fields for
Depth Estimation from a Single Image
Fayao Liu, Chunhua Shen, Guosheng Lin
University of Adelaide, Australia; Australian Centre for Robotic Vision
2016/8/11 1
2. Australian Centre for Robotic Vision
• University of Adelaide, Australia
Chunhua Shen
Compressive sensing, tracking, detection, …
Weibo: @沈春華_ADL
Fayao Liu (PhD student)
Depth estimation, image segmentation, CRF learning
Guosheng Lin
Graphical models, hashing
3. Depth Estimation in Monocular Images
No reliable depth cues
• No stereo correspondence
• No motion in videos
4. Previous works
• Enforcing geometric assumptions
– Hedau et al. ECCV 2010
– Lee et al. NIPS 2010
– Gupta et al. ECCV 2010
• Non-parametric methods
– Candidate image retrieval + scene alignment + depth inference
– Karsch et al. PAMI 2014
5. Contributions
Propose to formulate depth estimation as a deep
continuous CRF learning problem, without relying on any
geometric priors or any extra information
– joint training of a deep CNN and a graphical model
– the partition function can be calculated analytically, so the
log-likelihood can be optimized directly
– the gradients can be calculated exactly in back-propagation
training
– Inference (MAP problem) is in closed form
– Jointly train unary and pairwise potentials of the CRF
6. Overview
• 𝐱: image
• 𝐲 = (𝑦1, … , 𝑦𝑛) ∈ ℝ^𝑛: continuous depth values
corresponding to all 𝑛 superpixels in 𝐱
• conditional probability distribution of the data
• Z(𝐱) is the partition function
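The distribution itself did not survive extraction; from the definitions above and the speaker notes ("an exponential family distribution"), it presumably takes the standard CRF form:

```latex
\Pr(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})} \exp\!\big(-E(\mathbf{y}, \mathbf{x})\big),
\qquad
Z(\mathbf{x}) = \int \exp\!\big(-E(\mathbf{y}, \mathbf{x})\big)\, \mathrm{d}\mathbf{y}.
```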
7. Overview
• conditional probability distribution of the data
• Z(𝐱) is the partition function
• Inference: maximum a posteriori (MAP) problem
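The MAP formula is missing from the extracted slide; it is presumably:

```latex
\mathbf{y}^\ast
  = \operatorname*{argmax}_{\mathbf{y} \in \mathbb{R}^n} \Pr(\mathbf{y} \mid \mathbf{x}).
```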
8. Energy Function
• Typical combination of unary and pairwise potentials
• 𝑈 regresses the depth from a single superpixel
• 𝑉 encourages smoothness between neighboring
superpixels
• 𝑈 and 𝑉 are jointly learned in a unified CNN framework
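The energy equation itself is missing here; based on the description (and the speaker note that N is the set of superpixels and S the set of edges), it presumably reads:

```latex
E(\mathbf{y}, \mathbf{x})
  = \sum_{p \in N} U(y_p, \mathbf{x})
  + \sum_{(p,q) \in S} V(y_p, y_q, \mathbf{x}).
```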
10. Unary Potential
• Regress the depth value of each superpixel using a
least-squares loss
[Figure: the CNN takes a 224 × 224 patch around each superpixel and regresses its depth; the loss compares the prediction with the ground truth]
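The unary formula is missing from the extracted slide; a least-squares unary potential presumably takes the form

```latex
U(y_p, \mathbf{x}) = (y_p - z_p)^2,
```

where 𝑧 𝑝 is the depth regressed by the CNN from the 224 × 224 patch of superpixel 𝑝.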
11. Pairwise Potential
• Pairwise potentials are constructed from 𝐾 types of
similarity observations
• Here 𝑅 𝑝𝑞 is the output of the network
• Only 1 fully connected layer (without activation)
12. Pairwise Potential
• Only 1 fully connected layer (without activation)
• 𝑆 𝑝𝑞^(𝑘): the 𝑘th similarity type, computed as
𝑆 𝑝𝑞^(𝑘) = exp(−𝛾‖𝑠 𝑝^(𝑘) − 𝑠 𝑞^(𝑘)‖)
• 3 types are used in the paper
– color difference
– color histogram difference
– LBP texture disparity
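A minimal sketch of this pairwise construction: 𝐾 similarities of the form exp(−𝛾‖𝑠 𝑝 − 𝑠 𝑞‖), combined by a single fully connected layer with no activation. The feature vectors and the weights `beta` are illustrative values, not the paper's learned parameters.

```python
import numpy as np

def similarity(s_p, s_q, gamma=1.0):
    """One similarity type: S_pq^(k) = exp(-gamma * ||s_p^(k) - s_q^(k)||)."""
    return np.exp(-gamma * np.linalg.norm(np.asarray(s_p) - np.asarray(s_q)))

def pairwise_output(S_pq, beta):
    """R_pq: one fully connected layer (no activation) over the K similarities."""
    return float(np.dot(beta, S_pq))

# Hypothetical feature vectors for two neighbouring superpixels:
color_p, color_q = [0.2, 0.4, 0.6], [0.2, 0.4, 0.6]   # identical colors
hist_p, hist_q   = [0.1, 0.9], [0.3, 0.7]             # color histograms
lbp_p, lbp_q     = [1.0, 0.0], [0.0, 1.0]             # LBP texture features

S = np.array([similarity(color_p, color_q),
              similarity(hist_p, hist_q),
              similarity(lbp_p, lbp_q)])
beta = np.array([0.5, 0.3, 0.2])   # illustrative FC-layer weights
R_pq = pairwise_output(S, beta)
```

Identical features give similarity 1; differing features give values in (0, 1), so with non-negative weights the pairwise strength 𝑅 𝑝𝑞 grows with superpixel similarity.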
14. The Energy Function
• The energy
• For ease of expression, we introduce
– 𝐈 is the 𝑛 × 𝑛 identity matrix
– 𝐑 is the matrix composed of 𝑅 𝑝𝑞
– 𝐃 is a diagonal matrix with 𝐷 𝑝𝑝 = Σ 𝑞 𝑅 𝑝𝑞
• We have
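The resulting matrix form is not visible in the extracted slide. A small numerical sketch, assuming the energy is E = Σ 𝑝 (𝑦 𝑝 − 𝑧 𝑝)² + ½ Σ 𝑝,𝑞 𝑅 𝑝𝑞 (𝑦 𝑝 − 𝑦 𝑞)² (pairwise sum over ordered pairs, symmetric 𝐑 with zero diagonal), checks that it equals 𝐲ᵀ𝐀𝐲 − 2𝐳ᵀ𝐲 + 𝐳ᵀ𝐳 with 𝐀 = 𝐈 + 𝐃 − 𝐑:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
z = rng.normal(size=n)              # unary CNN regressions (illustrative)
R = rng.uniform(0, 1, size=(n, n))
R = (R + R.T) / 2                   # symmetric pairwise outputs
np.fill_diagonal(R, 0.0)

D = np.diag(R.sum(axis=1))          # D_pp = sum_q R_pq
A = np.eye(n) + D - R               # A = I + D - R

y = rng.normal(size=n)

# Energy as a sum of unary and pairwise potentials
# (pairwise summed over all ordered pairs, hence the 1/2 factor):
E_sum = np.sum((y - z) ** 2) + 0.5 * sum(
    R[p, q] * (y[p] - y[q]) ** 2 for p in range(n) for q in range(n))

# Matrix form: E = y^T A y - 2 z^T y + z^T z
E_mat = y @ A @ y - 2 * z @ y + z @ z

assert np.isclose(E_sum, E_mat)
```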
15. Partition and Conditional Probability
Distribution
• Recall that
• and the energy
• Due to the quadratic form in 𝐲 and the positive
definiteness of 𝐀, we have
• Gaussian integral (n-dimensional with linear term)
• Hence the conditional probability distribution is
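The integrals are missing from the extracted slide; taking the matrix-form energy E = 𝐲ᵀ𝐀𝐲 − 2𝐳ᵀ𝐲 + 𝐳ᵀ𝐳 as given, they presumably are:

```latex
% n-dimensional Gaussian integral with a linear term:
\int \exp\!\big\{-\mathbf{y}^\top \mathbf{A} \mathbf{y}
  + 2\mathbf{z}^\top \mathbf{y}\big\}\, \mathrm{d}\mathbf{y}
  = \frac{\pi^{n/2}}{|\mathbf{A}|^{1/2}}
    \exp\!\big\{\mathbf{z}^\top \mathbf{A}^{-1} \mathbf{z}\big\},
```

so that

```latex
Z(\mathbf{x})
  = \frac{\pi^{n/2}}{|\mathbf{A}|^{1/2}}
    \exp\!\big\{\mathbf{z}^\top \mathbf{A}^{-1} \mathbf{z}
      - \mathbf{z}^\top \mathbf{z}\big\},
\qquad
\Pr(\mathbf{y} \mid \mathbf{x})
  = \frac{|\mathbf{A}|^{1/2}}{\pi^{n/2}}
    \exp\!\big\{-\mathbf{y}^\top \mathbf{A} \mathbf{y}
      + 2\mathbf{z}^\top \mathbf{y}
      - \mathbf{z}^\top \mathbf{A}^{-1} \mathbf{z}\big\}.
```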
16. Negative log-likelihood
• Given
• The negative log-likelihood is
• During learning, we minimize the negative log-
likelihood of the training data with regularization:
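Neither formula survived extraction; from the conditional distribution above, the negative log-likelihood is presumably

```latex
-\log \Pr(\mathbf{y} \mid \mathbf{x})
  = \mathbf{y}^\top \mathbf{A} \mathbf{y}
  - 2\mathbf{z}^\top \mathbf{y}
  + \mathbf{z}^\top \mathbf{A}^{-1} \mathbf{z}
  - \tfrac{1}{2} \log |\mathbf{A}|
  + \tfrac{n}{2} \log \pi,
```

with a training objective of the usual regularized form (λ a regularization weight over the network parameters θ):

```latex
\min_{\boldsymbol\theta}\;
  -\sum_{i} \log \Pr\!\big(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}\big)
  + \frac{\lambda}{2} \|\boldsymbol\theta\|_2^2.
```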
17. Partial Derivatives
• We then calculate the partial derivatives of the
negative log-likelihood
• where 𝐉 is an 𝑛 × 𝑛 matrix with elements
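The derivative formulas are missing from the extracted slide. Differentiating the negative log-likelihood above (a sketch, assuming 𝐉 = ∂𝐀/∂𝑅 𝑝𝑞 and using the standard log-determinant and matrix-inverse derivative identities) gives, presumably:

```latex
\frac{\partial \{-\log \Pr\}}{\partial \mathbf{z}}
  = 2\mathbf{A}^{-1}\mathbf{z} - 2\mathbf{y},
\qquad
\frac{\partial \{-\log \Pr\}}{\partial R_{pq}}
  = \mathbf{y}^\top \mathbf{J} \mathbf{y}
  - \mathbf{z}^\top \mathbf{A}^{-1} \mathbf{J} \mathbf{A}^{-1} \mathbf{z}
  - \tfrac{1}{2}\,\mathrm{tr}\!\big(\mathbf{A}^{-1} \mathbf{J}\big).
```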
19. Depth Prediction
• Prediction solves the MAP inference problem, which
has a closed-form solution
• Discussion: if 𝑅 𝑝𝑞 = 0 (i.e., the pairwise term is
discarded), then 𝐲∗ = 𝐳, which reduces to a
conventional regression model.
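A minimal numerical sketch of the closed-form inference, assuming 𝐲∗ = 𝐀⁻¹𝐳 with 𝐀 = 𝐈 + 𝐃 − 𝐑 (illustrative random values, not the paper's learned network outputs):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
z = rng.normal(size=n)              # unary network outputs (illustrative)
R = rng.uniform(0, 1, size=(n, n))
R = (R + R.T) / 2                   # symmetric pairwise outputs
np.fill_diagonal(R, 0.0)
A = np.eye(n) + np.diag(R.sum(axis=1)) - R

# Closed-form MAP inference: y* = A^{-1} z (solve, don't invert explicitly)
y_star = np.linalg.solve(A, z)

# Sanity check from the slide: with R = 0 the pairwise term vanishes,
# A = I, and prediction reduces to plain regression y* = z.
y_plain = np.linalg.solve(np.eye(n), z)
assert np.allclose(y_plain, z)
```

Since 𝐀 is the identity plus a graph Laplacian of the non-negative 𝐑, it is positive definite, so the linear solve is always well posed.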
20. Experiment Datasets
• Make3D: outdoor scene reconstruction
– 534 images
• NYU v2: indoor scene reconstruction
– 1449 RGBD images (795 training; 654 testing)
Achieved No. 1 for the task of semantic pixel labelling on PASCAL VOC 2012 (as of July 2015).
Guosheng Lin is a winner of the Google PhD Fellowship in 2014.
Conditional distribution: given an image, the probability of the depth values assigned to all the superpixels is defined as an exponential family distribution.
Here E is the energy, and Z is the partition function.
In general, Z is difficult to compute. However, in this paper the CRF model is continuous, since the depth values are continuous. Under certain conditions, Z can be calculated analytically; we will discuss this later.
Given the conditional probability, inference of the depth values becomes a MAP problem.
The energy function is defined as a typical combination of unary and pairwise potentials.
N is the set of all superpixels, and S is the set of edges in the graphical model.