A talk from the Develop Track at AWE USA 2018, the World's #1 XR Conference & Expo, Santa Clara, California, May 30 to June 1, 2018.
Chris Varekamp (Philips Group Innovation, Research): Depth estimation, Processing & Rendering for Dynamic 6DoF VR
In this talk I will discuss how a real-time depth-based processing chain can be built using our experience in stereo-to-depth conversion for autostereoscopic displays.
http://AugmentedWorldExpo.com
3. Example results of real-time processing
• Multi-view on 4K auto-stereo display
• Rendering multiple views + weaving from left image/depth
• Depth estimation from Left -> Right
• Depth estimation from Right -> Left
Footage: copyright 2006, Blender Foundation / Netherlands Media Art Institute / www.elephantsdream.org [https://orange.blender.org/blog/creative-commons-license-2/]
4. Real-time stereo to multi-view conversion
Pipeline: left eye video + right eye video → depth estimation → dense depth video → view synthesis → multi-view video.
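For intuition, a minimal sketch of the view-synthesis step under simplifying assumptions (rectified views, pixels shifted horizontally by a scaled disparity); function and parameter names are illustrative, not the actual implementation:

```cpp
#include <vector>
#include <cstdint>

// Synthesize one intermediate view from the left image and its dense
// disparity map by shifting pixels horizontally. 'alpha' selects the
// virtual camera position: 0 = left view, 1 = right view.
std::vector<uint8_t> synthesizeView(const std::vector<uint8_t>& left,
                                    const std::vector<float>& disparity,
                                    int width, int height, float alpha) {
    std::vector<uint8_t> view(left.size(), 0);  // 0 marks holes (disocclusions)
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            // Shift each left-view pixel by a fraction of its disparity.
            int xNew = x - static_cast<int>(alpha * disparity[y * width + x] + 0.5f);
            if (xNew >= 0 && xNew < width)
                view[y * width + xNew] = left[y * width + x];
        }
    }
    return view;
}
```

A real renderer additionally resolves occlusion order, fills holes, and weaves the synthesized views into the auto-stereo display format.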
Implemented in real-time:
• Multi-core on the Amazon cloud (C++): 30 FPS @ 16 cores (2x depth estimation)
• FPGA/IC
• Algorithms largely suitable for implementation on GPU
• With > 10 user licenses worldwide
5. Real-time depth estimation
Processing chain (input: left/right stereo; output: depth):
recursive-search block matching → error classification → confidence/colour-adaptive filtering, then the same steps repeated with pixel-based matching → error classification → confidence/colour-adaptive filtering → depth coding.
FPGA implementation (joint work with Dimenco):
• Altera Arria V device
• DVI and HDMI input
• 4x DDR3 memory
• LVDS output (current 3D display); SDI output (for AR/VR)
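For intuition, a toy software sketch of the recursive-search matching stage, in the spirit of [1] on the next slide: each block evaluates only a handful of candidate disparities taken from neighbours plus a random update, which is what keeps the complexity low. Block size and candidate set here are illustrative assumptions:

```cpp
#include <vector>
#include <cstdlib>
#include <limits>
#include <cstdint>

// Sum of absolute differences between an 8x8 block in the left image and a
// horizontally shifted block in the right image (assumes 8x8-aligned sizes).
static int blockSAD(const std::vector<uint8_t>& L, const std::vector<uint8_t>& R,
                    int width, int bx, int by, int d) {
    int sad = 0;
    for (int y = by; y < by + 8; ++y)
        for (int x = bx; x < bx + 8; ++x) {
            int xr = x - d;
            if (xr < 0) xr = 0;
            if (xr >= width) xr = width - 1;
            sad += std::abs(int(L[y * width + x]) - int(R[y * width + xr]));
        }
    return sad;
}

// One scan of recursive-search block matching: each block tests only a few
// candidates (spatial neighbours, previous value, random update) instead of
// performing a full disparity search.
void recursiveSearchScan(const std::vector<uint8_t>& L, const std::vector<uint8_t>& R,
                         int width, int height, std::vector<int>& disp /* per block */) {
    int bw = width / 8, bh = height / 8;
    for (int by = 0; by < bh; ++by)
        for (int bx = 0; bx < bw; ++bx) {
            int prev = disp[by * bw + bx];
            int cand[4] = {
                bx > 0 ? disp[by * bw + (bx - 1)] : prev,  // left neighbour
                by > 0 ? disp[(by - 1) * bw + bx] : prev,  // upper neighbour
                prev,                                       // previous estimate
                prev + (std::rand() % 3 - 1) };             // random update
            int best = prev, bestSAD = std::numeric_limits<int>::max();
            for (int c : cand) {
                int sad = blockSAD(L, R, width, bx * 8, by * 8, c);
                if (sad < bestSAD) { bestSAD = sad; best = c; }
            }
            disp[by * bw + bx] = best;
        }
}
```

Error classification and the adaptive filtering (next slide) then repair the mismatches this cheap search leaves behind.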
6. Ingredients for real-time and high quality
Disparity detection and correction
• Low-complexity disparity estimation using recursive-search block matching [1]
• Disparity error detection and correction via supervised learning [2]
• Both steps repeated pixel-wise
Confidence and colour-adaptive filtering (see the sketch after the references)
• Efficient filtering [3]
• Confident pixels should not be filtered
• Low-confidence pixels are filtered using colour similarity
[1] G. de Haan, P.W.A.C. Biezen, H. Huijgen, O.A. Ojo. True-motion estimation with 3-D recursive search block matching. IEEE Transactions on Circuits and Systems for Video Technology, vol. 3, no. 5, October 1993.
[2] C. Varekamp, K. Hinnen, W. Simons. Detection and correction of disparity estimation errors via supervised learning. International Conference on 3D Imaging, 3-5 Dec. 2013.
[3] L. Vosters, C. Varekamp, G. de Haan. Overview of efficient high-quality state-of-the-art depth enhancement methods by thorough design space exploration. Journal of Real-Time Image Processing, pp. 1–21, 2015.
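A minimal sketch of the confidence/colour-adaptive filtering idea: confident depth pixels pass through unchanged, while low-confidence pixels take a colour-similarity-weighted average of confident neighbours. The grayscale guidance image, window radius, and Gaussian weight are illustrative assumptions, not the specific method of [3]:

```cpp
#include <vector>
#include <cmath>
#include <cstdint>

// Colour-adaptive filtering of a disparity map: pixels flagged as confident
// are kept; low-confidence pixels are replaced by an average over confident
// neighbours, weighted by colour similarity to the centre pixel.
void colourAdaptiveFilter(const std::vector<uint8_t>& gray,  // guidance image
                          std::vector<float>& disp,
                          const std::vector<bool>& confident,
                          int width, int height, int radius = 5,
                          float sigmaColour = 10.0f) {
    std::vector<float> out = disp;
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x) {
            int idx = y * width + x;
            if (confident[idx]) continue;           // confident: no filtering
            float sumW = 0.0f, sumD = 0.0f;
            for (int dy = -radius; dy <= radius; ++dy)
                for (int dx = -radius; dx <= radius; ++dx) {
                    int nx = x + dx, ny = y + dy;
                    if (nx < 0 || nx >= width || ny < 0 || ny >= height) continue;
                    int nIdx = ny * width + nx;
                    if (!confident[nIdx]) continue; // borrow only from confident pixels
                    float dc = float(gray[idx]) - float(gray[nIdx]);
                    float w = std::exp(-(dc * dc) / (2 * sigmaColour * sigmaColour));
                    sumW += w;
                    sumD += w * disp[nIdx];
                }
            if (sumW > 0) out[idx] = sumD / sumW;
        }
    disp = out;
}
```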
7. Approaches to 6DoF
Approaches compared:
• Accurate 3D geometry: reflectance/scattering properties, light sources; requires multi-modal sensing (vision, laser), object recognition and post-production.
• Dense camera/lens array (light field).
• Multiple views with depth: camera spacing ~6 cm for indoor; avoid costly post-production; reduced hardware complexity.
8. Multi-camera design rules
[Figure: camera pair with baseline $B$ and focal length $f$ imaging a scene at nearest depth $z_{\text{near}}$; sensor shown behind the lens.]
$$\text{Disparity} = \frac{f B}{z_{\text{near}}} \ [\text{pixel}]$$
where $B$ = baseline [m], $f$ = focal length [pixel], $z$ = depth [m]. Example: $f = 1000$ pixel (for a 2K sensor, HFOV ≈ 90°).
Holds for regular lenses (perspective projection). For fisheye lenses the relation is different but the principle remains the same.
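Plugging the slide's numbers into the formula: with $f = 1000$ pixel and $B = 0.06$ m (the indoor spacing from the previous slide), an assumed nearest scene depth of $z_{\text{near}} = 1$ m gives a maximum disparity of 1000 × 0.06 / 1 = 60 pixels. A tiny helper to sanity-check camera spacings:

```cpp
#include <cstdio>

// Maximum disparity in pixels for a camera pair (perspective projection):
// disparity = f * B / z_near.
double maxDisparity(double focalPx, double baselineM, double zNearM) {
    return focalPx * baselineM / zNearM;
}

int main() {
    // f = 1000 pixel (2K sensor, HFOV ~ 90 deg), B = 6 cm, z_near = 1 m (assumed).
    std::printf("max disparity = %.1f pixel\n", maxDisparity(1000.0, 0.06, 1.0));
    // Prints: max disparity = 60.0 pixel
    return 0;
}
```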
9. 6DoF processing flow
Pipeline: N cameras → multi-view registration → disparity estimation for camera pairs → multi-view disparity refinement → compositing → image + depth compression → image + depth decompression → view synthesis → left/right stereo.
Capture/processing side: software/hardware, real-time/offline. Real-time client: OpenVR/SteamVR, GTX 1000-series graphics cards.
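As a reading aid, the flow can be sketched as a staged pipeline. All type and function names below are hypothetical placeholders, not from the talk:

```cpp
#include <vector>

// Placeholder data types; real ones hold pixels, disparities, poses, packets.
struct Image {}; struct Depth {}; struct Pose {}; struct Bitstream {};

// Server side (software/hardware, real-time or offline).
std::vector<Pose> registerViews(const std::vector<Image>& v) { return std::vector<Pose>(v.size()); }
std::vector<Depth> estimatePairwiseDisparity(const std::vector<Image>& v) { return std::vector<Depth>(v.size()); }
void refineMultiViewDisparity(std::vector<Depth>&, const std::vector<Pose>&) {}
Bitstream compressImagePlusDepth(const std::vector<Image>&, const std::vector<Depth>&) { return {}; }

// Client side (OpenVR/SteamVR app): decompress, then synthesize both eyes.
void decodeAndSynthesize(const Bitstream&, Image& leftEye, Image& rightEye) {}

int main() {
    std::vector<Image> cams(6);                     // N cameras
    auto poses  = registerViews(cams);              // multi-view registration
    auto depths = estimatePairwiseDisparity(cams);  // per camera pair
    refineMultiViewDisparity(depths, poses);        // cross-view consistency
    Bitstream bs = compressImagePlusDepth(cams, depths);
    Image leftEye, rightEye;
    decodeAndSynthesize(bs, leftEye, rightEye);     // view synthesis at client
    return 0;
}
```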
10. Camera calibration/Multi-view registration
Offline
• Intrinsic parameters (focal, principal point, distortion) from a known pattern (see the calibration sketch below)
• Extrinsic parameters (rotation/translation) from a known pattern
– Not robust to handling of the camera setup
– Some frequently used algorithms cannot deal with more than two cameras
Partially online
• Intrinsic parameters offline
• Extrinsic parameters online, using images and estimated depth
– Multi-view registration method
– More practical and robust for rig handling
– More relevant for larger (possibly outdoor) setups
– We implemented two versions: (a) feature-based; (b) image-based on GPU
[Figure: original fisheye image vs. rectified image.]
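A minimal sketch of the offline intrinsic step referenced above, assuming OpenCV's standard chessboard workflow; the talk does not name a library, and the pattern and square sizes are illustrative:

```cpp
#include <opencv2/calib3d.hpp>
#include <vector>

// Estimate intrinsics (focal, principal point, distortion) from several
// images of a known chessboard pattern: here a 9x6 inner-corner board with
// 25 mm squares (assumed values).
bool calibrateIntrinsics(const std::vector<cv::Mat>& boardImages,
                         cv::Mat& cameraMatrix, cv::Mat& distCoeffs) {
    const cv::Size pattern(9, 6);
    const float square = 0.025f;  // 25 mm

    // One set of 3D corner positions on the (planar) board, reused per view.
    std::vector<cv::Point3f> board;
    for (int y = 0; y < pattern.height; ++y)
        for (int x = 0; x < pattern.width; ++x)
            board.emplace_back(x * square, y * square, 0.0f);

    std::vector<std::vector<cv::Point3f>> objectPoints;
    std::vector<std::vector<cv::Point2f>> imagePoints;
    for (const cv::Mat& img : boardImages) {
        std::vector<cv::Point2f> corners;
        if (cv::findChessboardCorners(img, pattern, corners)) {
            imagePoints.push_back(corners);
            objectPoints.push_back(board);
        }
    }
    if (imagePoints.size() < 3) return false;  // need several good views

    std::vector<cv::Mat> rvecs, tvecs;  // per-view extrinsics (by-product)
    cv::calibrateCamera(objectPoints, imagePoints, boardImages[0].size(),
                        cameraMatrix, distCoeffs, rvecs, tvecs);
    return true;
}
```

For fisheye lenses, OpenCV's cv::fisheye module provides the analogous calibration; the partially online extrinsic refinement from the slide is a separate multi-view registration step.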
12. Formats, packing, coding, compression
Projection formats: equirectangular projection, perspective projection, fish-eye, cube map, etc.
Encoded depth value $D$: for equirectangular projection $D \propto r_{\min}/r$ (inverse radial distance, normalized to the nearest distance $r_{\min}$); for perspective projection an affine mapping $D_{\text{enc}} = a D + b$, where $D$ is the disparity between the stereo pair.
Standard video codecs can be used (e.g. HEVC).
Optional packing of image and depth.
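A sketch of how these two encodings could be implemented when quantizing depth for a standard codec; the bit depth and scaling constants are assumptions:

```cpp
#include <cstdint>
#include <algorithm>

// Equirectangular convention: encoded value proportional to r_min / r,
// so the nearest allowed distance r_min maps to the maximum code value.
uint16_t encodeEquirect(float r, float rMin, int bits = 10) {
    float maxCode = float((1 << bits) - 1);
    float code = maxCode * rMin / r;               // D ~ r_min / r
    return uint16_t(std::clamp(code, 0.0f, maxCode));
}

// Perspective convention: affine mapping of disparity, D_enc = a*D + b,
// with a and b chosen so the expected disparity range fills the code range.
uint16_t encodePerspective(float disparity, float a, float b, int bits = 10) {
    float maxCode = float((1 << bits) - 1);
    return uint16_t(std::clamp(a * disparity + b, 0.0f, maxCode));
}
```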
13. Playback via depth to mesh at client-side
Mesh options: fixed mesh topology vs. depth-map-adaptive mesh topology; each vertex carries $(u_i, v_i, D_i)$.
Vertex transform: clip position $= P\,V\,M\,Q\,(u,\ v,\ D(u,v),\ 1)^{\top}$, where $M$, $V$, $P$ are the model, view and projection matrices and $Q$ is the re-projection matrix.
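A plain-C++ sketch of this vertex transform; the talk uses standard OpenGL, and here a small matrix helper just makes the $PVMQ$ order explicit (matrix contents are placeholders):

```cpp
#include <array>

using Mat4 = std::array<std::array<float, 4>, 4>;
using Vec4 = std::array<float, 4>;

// y = M * x (column-vector convention, as in OpenGL).
Vec4 mul(const Mat4& M, const Vec4& x) {
    Vec4 y{0, 0, 0, 0};
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            y[r] += M[r][c] * x[c];
    return y;
}

// Clip-space position of a mesh vertex: clip = P * V * M * Q * (u, v, D, 1)^T.
// Q lifts (u, v, encoded depth, 1) to a homogeneous 3D point; P, V, M are the
// usual projection, view and model matrices.
Vec4 vertexClipPosition(const Mat4& P, const Mat4& V, const Mat4& M,
                        const Mat4& Q, float u, float v, float D) {
    Vec4 p{u, v, D, 1.0f};
    return mul(P, mul(V, mul(M, mul(Q, p))));
}
```

In practice this product is folded into one matrix per frame and evaluated in the vertex shader.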
14. Example real-time configuration
Synchronized capture chain (per stereo camera pair): cameras → USB3 → mini PC → HDMI (1920x2160) → FPGA → SDI 4:2:2 (3840x2160) → render PC with SDI capture card → VR headset.
• Mini PC: image capture, lens un-distortion, stereo rectification; output: left + right (top-bottom).
• FPGA: depth estimation; output: L, R and depth maps D_L, D_R.
The diagram repeats the USB3/mini PC/FPGA chain for further camera pairs (elements numbered 1-6) feeding the same render PC.
16. Dynamic 6DoF: stereo with depth
[Figure: HTC Vive headset with position tracking; fish-eye anchor views L, R with depth maps D_L, D_R in the transmission format; the left eye moves within a sweet spot with motion freedom; a position tracker is used for the static scene part.]
Compared with stereo:
• More natural experience, allowing small head motions
• Depth packed with the image or carried in separate streams (e.g. HEVC)
• Efficient rendering by combining two meshes and blending the results for both eyes (sketched below)
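One plausible way to implement the two-mesh blend, assuming each anchor mesh has been rendered for one eye into a colour buffer with a validity/alpha mask; the buffer layout and the distance-based weight are my assumptions, not the talk's implementation:

```cpp
#include <vector>
#include <cstdint>
#include <cstddef>

// Blend the renderings of the two anchor meshes (left/right camera) for one
// eye. The weight favours the anchor closest to the current eye position;
// pixels missing in one rendering (alpha 0) fall back to the other.
void blendAnchors(const std::vector<uint8_t>& fromL, const std::vector<uint8_t>& alphaL,
                  const std::vector<uint8_t>& fromR, const std::vector<uint8_t>& alphaR,
                  float eyeX, float camLX, float camRX,
                  std::vector<uint8_t>& out) {
    // Normalized position of the eye between the two anchors (0 = left anchor).
    float t = (eyeX - camLX) / (camRX - camLX);
    if (t < 0.0f) t = 0.0f;
    if (t > 1.0f) t = 1.0f;
    for (std::size_t i = 0; i < out.size(); ++i) {
        float wL = (1.0f - t) * (alphaL[i] / 255.0f);  // weight for left anchor
        float wR = t * (alphaR[i] / 255.0f);           // weight for right anchor
        float sum = wL + wR;
        out[i] = sum > 0 ? uint8_t((wL * fromL[i] + wR * fromR[i]) / sum) : 0;
    }
}
```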
19. Dynamic 6DoF: linear array
[Figure: HTC Vive headset with position tracking; a linear array of anchor views; the left eye moves within a sweet spot.]
Compared with stereo:
• Larger motion freedom
• For different applications: more cameras, different configurations
• Scalable approach: decode only the video streams in the vicinity of the eye locations (sketched below)
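A sketch of the stream-selection policy behind that scalability point, assuming uniform camera spacing and a decode-the-two-nearest-anchors rule (both are my assumptions, not stated in the talk):

```cpp
#include <utility>

// For a linear camera array with uniform spacing (numCams >= 2), return the
// indices of the two anchor views bracketing the eye's x position; only
// these streams need to be decoded for rendering this eye.
std::pair<int, int> anchorsToDecode(float eyeX, float firstCamX,
                                    float spacing, int numCams) {
    float t = (eyeX - firstCamX) / spacing;  // eye position in camera units
    int left = static_cast<int>(t);
    if (left < 0) left = 0;
    if (left > numCams - 2) left = numCams - 2;
    return {left, left + 1};                 // decode these two streams
}
```

Calling this per eye gives at most four active streams for a headset, independent of the total array size.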
21. Conclusions/future work
• Demonstrated use of our real-time depth estimation algorithms for 6DoF VR
• Depth can play a role in most components of a full system, including playback
• The depth-based approach has the potential to achieve high quality at low latency
• A live streaming demo is possible (work in progress)
Contact: chris.varekamp@philips.com
Speaker notes:
• Explain that our 6DoF approach essentially means taking many photos/videos and selecting/interpolating between these using estimated depth.
• We developed depth technology for auto-stereoscopic (3D) displays.
• The effect of disparity estimation errors will influence quality more at the larger baselines.
• Potential problem regions are occlusion regions and reflections.
• For static scene parts we can composite based on image and depth from separate camera views; the result is a composite image in equirectangular format.
• Symbol D denotes encoded disparity; for perspective projection, D is the disparity between the stereo pair.
• Use standard OpenGL to convert from (u, v, D(u,v)) to normalized device coordinates using the re-projection matrix Q and the standard OpenGL model-view-projection matrices.
• Conversion of the depth format to an internal mesh representation happens at the client side: regular mesh or adaptive mesh (initial testing).
• We are currently building this setup to demonstrate LIVE streaming.
• Refer to the tracker as a method for creating a mix of static and dynamic scene parts.