Introduction to the CVPR 2015 (Best Paper Award) paper
"DynamicFusion: Reconstruction and Tracking of Non-‐rigid Scenes in Real-‐Time"
Richard A. Newcombe, Dieter Fox, Steven M. Seitz
If you notice anything about the content, we would appreciate it if you could contact the e-mail address given on the slides.
4. DynamicFusion
Dense SLAM system
• Integrates depth images to reconstruct a dynamic scene in 3D in real time
– Extends KinectFusion to dynamic scenes
Video: https://www.youtube.com/watch?v=i1eZekcc_lM
(a) Initial Frame at t = 0s (b) Raw (noisy) depth maps for frames at t = 1s, 10s, 15s, 20s (c) Node Distance
(d) Canonical Model (e) Canonical model warped into its live frame (f) Model Normals
Figure 2: DynamicFusion takes an online stream of noisy depth maps (a,b) and outputs a real-time dense reconstruction of the moving scene (d,e). To achieve this, we estimate a volumetric warp (motion) field that transforms the canonical model space into the live frame, enabling the scene motion to be undone, and all depth maps to be densely fused into a single rigid TSDF reconstruction (d,f). Simultaneously, the structure of the warp field is constructed as a set of sparse 6D transformation nodes that are smoothly interpolated through a k-nearest node average in the canonical frame (c). The resulting per-frame warp field estimate enables the progressively denoised and completed scene geometry to be transformed into the live frame in real-time (e).
5. KinectFusion
A dense SLAM system for static scenes
• Builds a dense surface model from multiple depth images
• Estimates the camera pose by registering the latest depth image against the obtained model
“KinectFusion: Real-Time Dense Surface Mapping and Tracking” (ISMAR 2011)
Richard A. Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim, Andrew J. Davison,
Pushmeet Kohli, Jamie Shotton, Steve Hodges, Andrew Fitzgibbon
(Screenshot of the first page of the KinectFusion paper.)
Figure 1: Example output from our system, generated in real-time with a handheld Kinect depth camera and no other sensing infrastructure. Normal maps (colour) and Phong-shaded renderings (greyscale) from our dense reconstruction system are shown. On the left for comparison is an example of the live, incomplete, and noisy data from the Kinect sensor (used as input to our system).
Video: https://www.youtube.com/watch?v=quGhaggn3cQ
9. KinectFusion: Processing Pipeline
Joint estimation of the camera motion and the surface
(System workflow diagram, "From Depth to a Dense Oriented Point Cloud": Raw Depth → Depth Map Conversion → measured vertex and normal maps → Model-Frame Camera Track (ICP, producing the 6DoF pose and rejecting ICP outliers) → Volumetric Integration (TSDF Fusion of the raw depth at the estimated pose) → Model Rendering (TSDF Raycast) → predicted vertex and normal maps, fed back into the next frame's tracking.)
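A minimal per-frame sketch of this loop in Python is given below; all function and method names (depth_map_conversion, icp_point_to_plane, tsdf.integrate, tsdf.raycast) are hypothetical placeholders for the diagram's stages, not APIs of any released KinectFusion implementation.

def kinectfusion_process_frame(raw_depth, K, tsdf, pose_prev, model_vmap, model_nmap):
    # 1. Depth Map Conversion: filter the raw depth and back-project each pixel with the
    #    intrinsics K to obtain the measured vertex and normal maps.
    meas_vmap, meas_nmap = depth_map_conversion(raw_depth, K)
    # 2. Model-Frame Camera Track (ICP): align the measured maps against the maps
    #    predicted from the model at the previous pose; outputs the 6DoF pose
    #    (correspondences failing the compatibility test become the ICP outliers).
    pose = icp_point_to_plane(meas_vmap, meas_nmap, model_vmap, model_nmap, pose_prev, K)
    # 3. Volumetric Integration (TSDF Fusion): fuse the raw depth into the global TSDF
    #    volume at the newly estimated pose.
    tsdf.integrate(raw_depth, pose, K)
    # 4. Model Rendering (TSDF Raycast): predict vertex and normal maps from the updated
    #    model; these feed the next frame's tracking step and the rendered output.
    model_vmap, model_nmap = tsdf.raycast(pose, K)
    return pose, model_vmap, model_nmap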
19. KinectFusion: Surface Representation
(System workflow diagram repeated from slide 9.)
27. KinectFusion: Surface Representation
Updating the TSDF F_k and its weight W_k (F_k: fused TSDF value, W_k: weight)
• Simply setting W_{R_k}(p) = 1 gives good results
(Excerpt from the KinectFusion paper:) ... million new point measurements are made per second). Storing a weight W_k(p) with each value allows an important aspect of the global minimum of the convex L2 de-noising metric to be exploited for real-time fusion; that the solution can be obtained incrementally as more data terms are added using a simple weighted running average [7], defined point-wise for {p | F_{R_k}(p) ≠ null}:

F_k(p) = ( W_{k−1}(p) F_{k−1}(p) + W_{R_k}(p) F_{R_k}(p) ) / ( W_{k−1}(p) + W_{R_k}(p) )   (11)
W_k(p) = W_{k−1}(p) + W_{R_k}(p)   (12)

No update on the global TSDF is performed for values resulting from unmeasurable regions specified in Equation 9. While W_k(p) provides weighting of the TSDF proportional to the uncertainty of surface measurement, we have also found that in practice simply letting W_{R_k}(p) = 1, resulting in a simple average, provides good results. Moreover, by truncating the updated weight over some value W_η,

W_k(p) ← min( W_{k−1}(p) + W_{R_k}(p), W_η ),   (13)

a moving average surface reconstruction can be obtained, enabling reconstruction in scenes with dynamic object motion. Although a large number of voxels can be visited that will not project into the current image, the simplicity of the kernel means operation time is memory, not computation, bound ...

(Figure annotations: accumulated so far / new measurement.)
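As a concrete illustration of Equations (11)-(13), here is a minimal Python/NumPy sketch that folds one frame's TSDF measurement into the global volume with the weighted running average and the weight truncation; the array layout and the NaN convention for unmeasurable voxels are assumptions of this sketch, not taken from the paper.

import numpy as np

def fuse_tsdf(F, W, F_R, W_R, W_eta=100.0):
    # F, W:     global TSDF values and weights over the voxel grid.
    # F_R, W_R: this frame's TSDF measurement and its weight (W_R = 1 already gives
    #           good results, as noted above); F_R is NaN for unmeasurable voxels.
    W_R = np.broadcast_to(np.asarray(W_R, dtype=float), F.shape)
    measured = ~np.isnan(F_R)
    F_new, W_new = F.copy(), W.copy()
    # Eq. (11): weighted running average of the old and new TSDF values.
    F_new[measured] = (W[measured] * F[measured] + W_R[measured] * F_R[measured]) / (
        W[measured] + W_R[measured])
    # Eqs. (12)-(13): accumulate the weight, truncated at W_eta to keep a moving average.
    W_new[measured] = np.minimum(W[measured] + W_R[measured], W_eta)
    return F_new, W_new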
28. KinectFusion: Sensor Pose Estimation
Minimize the distance between the point cloud and the surface (point-plane energy)
(Excerpt from the KinectFusion paper:) ... (V_{k−1}, N_{k−1}), which is used in our experimental section for a comparison between frame-to-frame and frame-model tracking. Utilising the surface prediction, the global point-plane energy, under the L2 norm for the desired camera pose estimate T_{g,k}, is:

E(T_{g,k}) = Σ_{u ∈ U, Ω_k(u) ≠ null} || ( T_{g,k} V̇_k(u) − V̂^g_{k−1}(û) )^⊤ N̂^g_{k−1}(û) ||_2   (16)

where each global frame surface prediction is obtained using the previous fixed pose estimate T_{g,k−1}. The projective data association algorithm produces the set of vertex correspondences {V_k(u), V̂_{k−1}(û) | Ω(u) ≠ null} by computing the perspectively projected point û = π(K T̃_{k−1,k} V̇_k(u)), using an estimate for the frame-frame transform T̃^z_{k−1,k} = T^{−1}_{g,k−1} T̃^z_{g,k}, and testing the predicted and measured vertex and normal for compatibility. A threshold on the distance of vertices and difference in normal values suffices to reject grossly incorrect correspondences, also illustrated in Figure 7.
(Background, summarised from Low [2004]:) Since its introduction by Besl and McKay [Besl92], the ICP algorithm has become the most widely used method for registering 3D shapes (a similar algorithm was also introduced by Chen and Medioni [Chen92]); Rusinkiewicz and Levoy [Rus01] provide a recent survey of variants of the original ICP. In Besl and McKay's formulation, each point in one set is paired with the closest point in the other set, and a point-to-point error metric, the sum of squared distances between corresponding points, is minimized; the process is iterated until the error falls below a threshold or stops decreasing. Chen and Medioni [Chen92] instead used a point-to-plane error metric, in which the object of minimization is the distance between each point and the tangent plane at its corresponding point. Unlike the point-to-point metric, the point-to-plane metric is usually minimized with standard nonlinear least squares methods such as the Levenberg-Marquardt method [Press92]. Although each iteration of the point-to-plane ICP algorithm is generally slower, researchers have observed significantly better convergence rates with it; a more theoretical explanation is described by Pottmann et al.
... source points such that the total error between the corresponding points, under a certain chosen error metric, is minimal. When the point-to-plane error metric is used, the object of minimization is the sum of the squared distance between each source point and the tangent plane at its corresponding destination point (see Figure 1). More specifically, if s_i = (s_ix, s_iy, s_iz, 1)^T is a source point, d_i = (d_ix, d_iy, d_iz, 1)^T is the corresponding destination point, and n_i = (n_ix, n_iy, n_iz, 0)^T is the unit normal vector at d_i, then the goal of each ICP iteration is to find M_opt such that

M_opt = arg min_M Σ_i ( (M · s_i − d_i) · n_i )²   (1)

where M and M_opt are 4×4 3D rigid-body transformation matrices.
Figure 1: Point-to-plane error between two surfaces. (Diagram: source points s_1, s_2, s_3 on the source surface; corresponding destination points d_1, d_2, d_3 with unit normals n_1, n_2, n_3 and tangent planes on the destination surface; point-to-plane distances l_1, l_2, l_3.)
Figure reference: Low, Kok-Lim. "Linear Least-Squares Optimization for Point-to-Plane ICP Surface Registration." Chapel Hill, University of North Carolina (2004).
The error between the two surfaces
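This energy is what KinectFusion linearises (small-angle assumption on the incremental rotation) and solves as a 6×6 linear least-squares problem at each ICP iteration. A minimal Python/NumPy sketch of one such linearised point-to-plane step, following Low's formulation, is given below; the function name and array layout are illustrative.

import numpy as np

def point_to_plane_step(src, dst, nrm):
    # src, dst: (N, 3) corresponding source / destination points; nrm: (N, 3) unit
    # normals at the destination points. Returns an incremental 4x4 rigid transform.
    A = np.zeros((src.shape[0], 6))
    A[:, :3] = np.cross(src, nrm)               # partial derivatives w.r.t. rotation
    A[:, 3:] = nrm                               # partial derivatives w.r.t. translation
    b = -np.einsum('ij,ij->i', src - dst, nrm)   # negated point-to-plane residuals
    x = np.linalg.lstsq(A, b, rcond=None)[0]     # x = [alpha, beta, gamma, tx, ty, tz]
    alpha, beta, gamma = x[:3]
    T = np.eye(4)
    T[:3, :3] = np.array([[1.0, -gamma, beta],   # linearised rotation, I + [omega]_x
                          [gamma, 1.0, -alpha],
                          [-beta, alpha, 1.0]])
    T[:3, 3] = x[3:]
    return T

In KinectFusion this step is iterated in a coarse-to-fine pyramid against the raycast model prediction (frame-to-model tracking), rather than against the previous raw frame.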
35. KinectFusion: Experimental Results
Voxel resolution versus processing time
From top to bottom:
• Depth map integration
• Raycasting for surface generation
• Camera pose optimization using the pyramid maps
• Correspondence association between the pyramid map scales
• Depth map preprocessing
Figure 12: A reconstruction result using 1/64 the memory (64³ voxels) of the previous figures, and using only every 6th sensor frame, demonstrating graceful degradation with drastic reductions in memory and processing resources.
(Plot: time in ms per stage versus voxel resolution, from 64³ to 512³.)
37. From KinectFusion to DynamicFusion
KinectFusion's assumption
• The observed scene is mostly static
DynamicFusion
• Extends KinectFusion to dynamic, non-rigid scenes while keeping real-time processing
(Screenshot of the first page of the KinectFusion paper [ISMAR 2011].)
ABSTRACT (excerpt): We present a system for accurate real-time mapping of complex and arbitrary indoor scenes in variable lighting conditions, using only a moving low-cost depth camera and commodity graphics hardware. We fuse all of the depth data streamed from a Kinect sensor into a single global implicit surface model of the observed scene in real-time. The current sensor pose is simultaneously obtained by ...
1 INTRODUCTION (excerpt): Real-time infrastructure-free tracking of a handheld camera whilst simultaneously mapping the physical scene in high-detail promises new possibilities for augmented and mixed reality applications. In computer vision, research on structure from motion (SfM) and multi-view stereo (MVS) has produced many compelling results, in particular accurate camera tracking and sparse reconstructions ...
DynamicFusion: Reconstruction and Tracking of Non-rigid Scenes in Real-Time
Richard A. Newcombe
newcombe@cs.washington.edu
Dieter Fox
fox@cs.washington.edu
University of Washington, Seattle
Steven M. Seitz
seitz@cs.washington.edu
Figure 1: Real-time reconstructions of a moving scene with DynamicFusion; both the person and the camera are moving. The initially
noisy and incomplete model is progressively denoised and completed over time (left to right).
38. DynamicFusion: Overview
• Input: a stream of noisy depth images
• Output: a dense 3D reconstruction of the dynamic scene in real time
(Figure 2 of the DynamicFusion paper; see the caption above.)
39. DynamicFusion: Overview
• The warp field is represented as a weighted average of the warps (6 DoF each) of sparse nodes
• Each voxel of the canonical space is warped into the live frame
(Excerpt from the DynamicFusion paper:) ... of objects with both translation and rotation results in significantly better tracking and reconstruction. For each canonical point v_c ∈ S, T_lc = 𝒲(v_c) transforms that point from canonical space into the live, non-rigidly deformed frame of reference. The warp at a point is obtained by blending the unit dual-quaternions of the k-nearest transformation nodes, with a weight function w_k : R³ → R that defines the radius of influence of each node, and SE3(·) converting the blended result back into an SE(3) transformation matrix.
(Figure annotations: sparse nodes / interpolation.)
Estimating the warp field 𝒲_t
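A minimal sketch of how such a warp field can be evaluated at a canonical point: the k-nearest deformation nodes are blended with dual-quaternion blending (DQB) and the result is converted back to a rotation and translation. The node storage ('pos', 'radius', 'q_rot', 't') and the helper names are assumptions of this sketch, not the paper's data structures.

import numpy as np

def quat_mul(a, b):
    # Hamilton product of quaternions stored as (w, x, y, z).
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

def quat_to_rotmat(q):
    # Rotation matrix of a unit quaternion (w, x, y, z).
    w, x, y, z = q
    return np.array([[1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
                     [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
                     [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)]])

def warp_point(xc, nodes, k=4):
    # nodes: list of dicts with 'pos' (dg_v), 'radius' (dg_w), 'q_rot' (unit quaternion)
    # and 't' (translation) describing each node's 6-DoF warp.
    nearest = sorted(nodes, key=lambda n: np.linalg.norm(xc - n['pos']))[:k]
    qr_sum, qd_sum = np.zeros(4), np.zeros(4)
    for n in nearest:
        # Gaussian influence weight: exp(-||xc - dg_v||^2 / (2 dg_w^2)).
        w = np.exp(-np.sum((xc - n['pos'])**2) / (2.0 * n['radius']**2))
        qr = n['q_rot']
        qd = 0.5 * quat_mul(np.array([0.0, *n['t']]), qr)   # dual part encodes translation
        qr_sum += w * qr
        qd_sum += w * qd
    norm = np.linalg.norm(qr_sum)
    qr, qd = qr_sum / norm, qd_sum / norm                    # normalise the blended dual quaternion
    R = quat_to_rotmat(qr)
    t = 2.0 * quat_mul(qd, qr * np.array([1.0, -1.0, -1.0, -1.0]))[1:]  # 2 * q_d * conj(q_r)
    return R @ xc + t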
40. DynamicFusion: Overview
The TSDF is integrated in the space of the live frame
• The rays of the live frame are curved (warped) in the canonical space
Non-rigid scene deformation / Introducing an occlusion
(a) Live frame t = 0 (b) Live Frame t = 1 (c) Canonical → Live (d) Live frame t = 0 (e) Live Frame t = 1 (f) Canonical → Live
Figure 3: An illustration of how each point in the canonical frame maps, through a correct warp field, onto a ray in the live camera frame when observing a deforming scene. In (a) the first view of a dynamic scene is observed. In the corresponding canonical frame, the warp is initialized to the identity transform and the three rays shown in the live frame also map as straight lines in the canonical frame. As the scene deforms in the live frame (b), the warp function transforms each point from the canonical and into the corresponding live frame location, causing the corresponding rays to bend (c). Note that this warp can be achieved with two 6D deformation nodes (shown as circles), where the left node applies a clockwise twist. In (d) we show a new scene that includes a cube that is about to occlude the bar. In the live frame ...
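As a rough sketch of what fusing through the warp means in code: each canonical voxel centre is warped into the live frame (e.g. with a warp_point function like the one sketched above), projected into the depth map, and its truncated signed distance is folded in with the same running average used by KinectFusion. The names and the projective signed-distance approximation below are assumptions of this sketch.

import numpy as np

def fuse_canonical_voxel(xc, F, W, warp_point, depth, K, trunc=0.02, w_new=1.0):
    # xc: canonical voxel centre; (F, W): its current TSDF value and weight.
    x_live = warp_point(xc)                        # warp into the live (camera) frame
    u = K @ x_live                                 # project with the camera intrinsics
    px, py = int(u[0] / u[2]), int(u[1] / u[2])
    if not (0 <= py < depth.shape[0] and 0 <= px < depth.shape[1]):
        return F, W                                # warped point falls outside the image
    d = depth[py, px]
    if d <= 0.0:
        return F, W                                # no valid depth measurement on this ray
    sdf = d - x_live[2]                            # projective signed distance (approximation)
    if sdf < -trunc:
        return F, W                                # far behind the observed surface: no update
    psdf = min(1.0, sdf / trunc)                   # truncate and normalise
    F_new = (W * F + w_new * psdf) / (W + w_new)   # weighted running average (cf. Eq. 11)
    W_new = W + w_new
    return F_new, W_new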
54. DynamicFusion: Estimating the Warp Field 𝒲_t
Data term

Data(𝒲, 𝒱, D_t) ≡ Σ_{u ∈ Ω} ψ_data( n̂_u^⊤ ( v̂_u − vl_u ) )

ψ_data: robust Tukey penalty function
v̂_u, n̂_u: predicted vertex and normal (transformed from the canonical space into the live frame)
vl_u: measured point (from the depth map)
The residual is the error along the vertex normal direction in the live frame
(Reference excerpt on robust M-estimation / IRLS:)
• This gives the "solution" as a simple least-squares problem:

  â = ( Σ_i w_i x_i x_i^⊤ )^{−1} Σ_i w_i y_i x_i.   (8)

  Note that this solution depends on the w_i values, which in turn depend on â.
• The idea is to alternate calculating â and recalculating w_i = w( (y_i − â^⊤ x_i) / σ_i ).
• Here are the weight functions associated with the two estimates. For the Cauchy ρ function,

  w_C(u) = 1 / ( 1 + (u/c)² )   (9)

  and, for the Beaton-Tukey ρ function,

  w_T(u) = ( 1 − (u/a)² )²  for |u| ≤ a, and 0 for |u| > a.   (10)

(Plots: examples of the Cauchy and Beaton-Tukey weight functions on u ∈ [−6, 6].)
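For concreteness, a small Python/NumPy sketch of the Beaton-Tukey weight of Eq. (10) and of the per-pixel data residuals r = n̂_u^⊤ (v̂_u − vl_u) it would down-weight in an IRLS-style solver; the names and the tuning constant are illustrative, not taken from the paper.

import numpy as np

def tukey_weight(u, a=4.685):
    # Beaton-Tukey weight: (1 - (u/a)^2)^2 for |u| <= a, and 0 otherwise.
    w = (1.0 - (u / a) ** 2) ** 2
    return np.where(np.abs(u) <= a, w, 0.0)

def data_residuals_and_weights(n_hat, v_hat, v_live, a=4.685):
    # n_hat, v_hat: (N, 3) predicted normals and vertices warped into the live frame;
    # v_live: (N, 3) measured depth vertices. r is the point-to-plane data residual.
    r = np.einsum('ij,ij->i', n_hat, v_hat - v_live)
    return r, tukey_weight(r, a)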
55. DynamicFusion: Estimating the Warp Field 𝒲_t
Regularization term
• Constrains the motion of regions not measured in the live frame

Reg(𝒲, ℰ) ≡ Σ_{i=0}^{n} Σ_{j ∈ ℰ(i)} α_ij ψ_reg( T_ic dg_v^j − T_jc dg_v^j )

ψ_reg: Huber penalty
α_ij = max( dg_w^i, dg_w^j )
A constraint that keeps each edge between nodes i and j as rigid as possible
Huber loss function:

L_δ(a) = (1/2) a² for |a| ≤ δ, and δ ( |a| − (1/2) δ ) otherwise

(Plot: example of the Huber loss function compared with the squared error loss.)
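A minimal Python sketch of the Huber penalty and of the per-edge residual T_ic dg_v^j − T_jc dg_v^j it is applied to in the regularization term; the node/edge layout and the component-wise application of the penalty are assumptions of this sketch.

import numpy as np

def huber(a, delta=1.0):
    # Huber loss: 0.5 a^2 for |a| <= delta, and delta (|a| - 0.5 delta) otherwise.
    abs_a = np.abs(a)
    return np.where(abs_a <= delta, 0.5 * a ** 2, delta * (abs_a - 0.5 * delta))

def regularization_term(nodes, edges, delta=1.0):
    # nodes: list of dicts with 'T' (4x4 node transform T_ic), 'pos' (dg_v), 'radius' (dg_w);
    # edges: list of (i, j) index pairs forming the deformation-graph edges.
    total = 0.0
    for i, j in edges:
        p = np.append(nodes[j]['pos'], 1.0)                 # dg_v^j in homogeneous coordinates
        r = (nodes[i]['T'] @ p - nodes[j]['T'] @ p)[:3]     # how differently nodes i and j move dg_v^j
        alpha = max(nodes[i]['radius'], nodes[j]['radius']) # alpha_ij = max(dg_w^i, dg_w^j)
        total += alpha * huber(r, delta).sum()
    return total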
58. DynamicFusion: Experimental Results
(Figure 5, partial view: canonical models and canonical models warped into the live frame for "drinking from a cup" and "crossing fingers"; the full caption is on the next slide.)
59. DynamicFusion: Experimental Results
Canonical Model for “drinking from a cup”
(a) Canonical model warped into the live frame for “drinking from a cup”
Canonical Model for “Crossing fingers”
(b) Canonical model warped into the live frame for “crossing fingers”
Figure 5: Real-time non-rigid reconstructions for two deforming scenes. Upper rows of (a) and (b) show the canonical models as they
evolve over time, lower rows show the corresponding warped geometries tracking the scene. In (a) complete models of the arm and the
cup are obtained. Note the system’s ability to deal with large motion and add surfaces not visible in the initial scene, such as the bottom of
the cup and the back side of the arm. In (b) we show full body motions including clasping of the hands where we note that the model stays
consistent throughout the interaction.
(Excerpt, cropped:) ... tracking scenes with more fluid deformations than shown in the results, but the long term stability can degrade and tracking will fail when the observed data term is not able to ...