Slides of Nathan Piasco's ICRA 2019 oral presentation on the paper "Learning Scene Geometry for Visual Localization in Challenging Conditions". Best Paper in Robot Vision finalist.
15. Our method: learning through missing modality
[Architecture diagram: an image triplet is fed to two CNN encoders producing image and depth descriptors (Desc.), which are concatenated; a CNN decoder reconstructs depth. Losses: L1 loss L_θG on the reconstructed depth; triplet losses L_θI on the image descriptor, L_θD on the depth descriptor, and L_θD;θI on the concatenated descriptor.]
We use a triplet loss to produce a strong image descriptor.
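The triplet objective above can be sketched as follows; the margin value and the toy descriptors are illustrative assumptions, not the paper's actual settings.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.1):
    """Hinge triplet loss: pull the positive (matching) descriptor closer
    to the anchor than the negative (non-matching) one, by at least
    `margin` (illustrative value)."""
    d_pos = np.linalg.norm(anchor - positive)  # distance to matching image
    d_neg = np.linalg.norm(anchor - negative)  # distance to non-matching image
    return max(0.0, d_pos - d_neg + margin)

# Toy L2-normalized descriptors.
a = np.array([1.0, 0.0])
p = np.array([0.9, np.sqrt(1 - 0.81)])   # close to the anchor
n = np.array([0.0, 1.0])                 # far from the anchor
assert triplet_loss(a, p, n) < triplet_loss(a, n, p)
```

Minimizing this loss over many triplets shapes the descriptor space so that images of the same place end up close together.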
16. Our method: learning through missing modality
The latent image representation is fed to a CNN decoder that reproduces the scene geometry. The decoder is trained simply by minimizing the L1 distance between the ground-truth and the reconstructed depth.
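The decoder objective is just a mean absolute error over depth values; a minimal sketch with toy depth maps (shapes and values are illustrative):

```python
import numpy as np

def l1_depth_loss(pred_depth, gt_depth):
    """Mean absolute error between reconstructed and ground-truth depth."""
    return np.mean(np.abs(pred_depth - gt_depth))

gt = np.array([[1.0, 2.0], [3.0, 4.0]])      # toy ground-truth depth map
pred = np.array([[1.5, 2.0], [2.5, 4.0]])    # toy reconstruction
assert l1_depth_loss(pred, gt) == 0.25       # (0.5 + 0 + 0.5 + 0) / 4
```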
17. Our method: learning through missing modality
We use another CNN to produce a strong depth-map descriptor.
18. Our method: learning through missing modality
The final descriptor is obtained by concatenating the image and depth-map descriptors.
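The fusion step can be sketched as plain concatenation; the final L2 normalization is an assumption (common for retrieval descriptors so that distances stay comparable), not stated on the slide.

```python
import numpy as np

def fuse_descriptors(img_desc, depth_desc):
    """Concatenate the image and depth-map descriptors, then
    L2-normalize the result (normalization is an illustrative choice)."""
    fused = np.concatenate([img_desc, depth_desc])
    return fused / np.linalg.norm(fused)

d = fuse_descriptors(np.array([3.0, 4.0]), np.array([0.0, 1.0]))
assert d.shape == (4,)
assert np.isclose(np.linalg.norm(d), 1.0)
```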
19. Training policy
Data required to train the descriptor part of our system: image triplets.
21. System deployment
[Diagram: an unknown-pose query goes through the NN encoder to a compact descriptor (the NN decoder can additionally infer a depth map); nearest-neighbor fast indexing matches it against reference descriptors precomputed from the reference {image, pose} database, returning the most similar image with its pose.]
The depth information is only needed during the training step!
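Deployment therefore reduces to nearest-neighbor search over the precomputed reference descriptors. A brute-force sketch (a real system would use a fast index; the pose labels here are hypothetical):

```python
import numpy as np

def localize(query_desc, ref_descs, ref_poses):
    """Return the pose of the reference image whose descriptor is
    closest (L2) to the query descriptor, plus the descriptor distance."""
    dists = np.linalg.norm(ref_descs - query_desc, axis=1)
    best = int(np.argmin(dists))
    return ref_poses[best], float(dists[best])

refs = np.array([[1.0, 0.0], [0.0, 1.0]])   # toy reference descriptors
poses = ["pose_A", "pose_B"]                # hypothetical pose labels
pose, dist = localize(np.array([0.9, 0.1]), refs, poses)
assert pose == "pose_A"
```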
25. Dataset for Localization in challenging conditions
Dataset: we train and test our proposal on the RobotCar¹ dataset.
Deep architectures: ResNet18 or AlexNet encoder combined with a NetVLAD or MAC descriptor.
Competitors: Hallucination network² & the same descriptor trained with RGB only.
Metric: we compare the distance between the position of the top-ranked returned database image and the query's ground-truth position.
¹ Will Maddern, Geoffrey Pascoe, Chris Linegar, and Paul Newman, 1 year, 1000 km: The Oxford RobotCar dataset, The International Journal of Robotics Research (IJRR), 2016.
² Judy Hoffman, Saurabh Gupta, and Trevor Darrell, Learning with Side Information through Modality Hallucination, In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 826–834, 2016.
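This metric is a recall at distance threshold: for a threshold D, Recall@D is the fraction of queries whose top-ranked database image lies within D meters of the query's ground-truth position. A minimal sketch (the error values are illustrative):

```python
import numpy as np

def recall_at_d(top1_errors_m, d_threshold_m):
    """Fraction of queries whose top-1 retrieved image lies within
    d_threshold_m meters of the ground-truth query position."""
    errors = np.asarray(top1_errors_m)
    return float(np.mean(errors <= d_threshold_m))

errors = [2.0, 8.0, 12.0, 60.0]           # toy top-1 position errors (m)
assert recall_at_d(errors, 10.0) == 0.5   # 2 of 4 queries within 10 m
assert recall_at_d(errors, 25.0) == 0.75  # 3 of 4 queries within 25 m
```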
26. Results: long-term localization
[Figure: example queries (top) and the nearest images retrieved from the database (bottom).]
[Plot: Recall@D (%) vs. D, the distance to the top-1 candidate (m), for all method/descriptor combinations: RGBtD (our), RGBtD (hall), and RGB, each with NetVLAD or MAC descriptors on ResNet (R) or AlexNet (A) encoders.]
Competitors: --- Only RGB [RGB]; -x- Hallucination network [RGBtD (hall)]; -o- Our proposal [RGBtD (our)].
27. Results: cross-season localization
[Figure: example queries (top) and the nearest images retrieved from the database (bottom).]
[Plot: Recall@D (%) vs. D, the distance to the top-1 candidate (m), for all method/descriptor combinations: RGBtD (our), RGBtD (hall), and RGB, each with NetVLAD or MAC descriptors on ResNet (R) or AlexNet (A) encoders.]
Competitors: --- Only RGB [RGB]; -x- Hallucination network [RGBtD (hall)]; -o- Our proposal [RGBtD (our)].
28. Failure case: Night to day
[Figure: example queries (top) and the nearest images retrieved from the database (bottom).]
[Plot: Recall@D (%) vs. D, the distance to the top-1 candidate (m), for all method/descriptor combinations: RGBtD (our), RGBtD (hall), and RGB, each with NetVLAD or MAC descriptors on ResNet (R) or AlexNet (A) encoders.]
Competitors: --- Only RGB [RGB]; -x- Hallucination network [RGBtD (hall)]; -o- Our proposal [RGBtD (our)].
29. Improving night to day localization
[Figure: night images that serve as input to the generative network, their ground-truth depth maps, and the generated depth maps.]
Our network is not able to generate proper depth maps from night images.
31. Recall - Training policy
Thanks to the design of our method, we can improve the generation performance of the decoder without affecting the descriptor networks.
We fine-tune our system with pairs of night images and depth maps.
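Because the losses are decoupled, the decoder can be fine-tuned in isolation while the descriptor weights stay frozen. A schematic sketch (the parameter split and update rule are illustrative, not the paper's actual training code):

```python
import numpy as np

# Hypothetical parameter groups: encoder/descriptor weights stay frozen,
# only the decoder (generator) weights receive gradient updates.
params = {
    "encoder": {"w": np.ones(3), "frozen": True},
    "decoder": {"w": np.ones(3), "frozen": False},
}

def sgd_step(params, grads, lr=0.01):
    """Apply one gradient step, skipping frozen parameter groups."""
    for name, p in params.items():
        if not p["frozen"]:
            p["w"] = p["w"] - lr * grads[name]
    return params

grads = {"encoder": np.full(3, 5.0), "decoder": np.full(3, 5.0)}
params = sgd_step(params, grads)
assert np.allclose(params["encoder"]["w"], 1.0)   # untouched: descriptors intact
assert np.allclose(params["decoder"]["w"], 0.95)  # updated: 1 - 0.01 * 5
```

Freezing the descriptor branches is what lets the night-image fine-tuning improve depth generation without degrading the retrieval performance already obtained.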
32. Improving night to day localization
[Figure: night images that serve as input to the generative network, their ground-truth depth maps, and the generated depth maps without and with fine-tuning.]
33. Results: Night to day (fine tuning)
[Plots: Recall@D (%) vs. D, the distance to the top-1 candidate (m), for night-vs-daytime queries, without fine-tuning (left) and with fine-tuning (right).]
By fine-tuning on a small amount of data, we nearly double localization performance.
36. Conclusion
Summary:
• a new global image descriptor trained with side depth modality,
• especially well suited to long-term and cross-season outdoor localization,
• promising results for challenging night-to-day image retrieval.
Future work:
• use other modalities,
• investigate less trivial descriptor fusion methods,
• test on other visual localization tasks (e.g. direct pose regression).
37. References
Relja Arandjelović, Petr Gronat, Akihiko Torii, Tomas Pajdla, and Josef Sivic, NetVLAD: CNN
architecture for weakly supervised place recognition, IEEE Transactions on Pattern Analysis and
Machine Intelligence (TPAMI), pages 5297–5307, 2017.
Judy Hoffman, Saurabh Gupta, and Trevor Darrell, Learning with Side Information through
Modality Hallucination, In IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), pages 826–834, 2016.
Will Maddern, Geoffrey Pascoe, Chris Linegar, and Paul Newman, 1 year, 1000 km: The Oxford
RobotCar dataset, The International Journal of Robotics Research (IJRR), 2016.