Fisheye/Omnidirectional View
in Autonomous Driving
Yu Huang
Yu.huang07@gmail.com
Sunnyvale, California
Outline
• Graph-Based Classification of Omnidirectional Images
• Flat2Sphere: Learning Spherical Convolution for Fast Features from 360° Imagery
• Spherical CNNs
• Scene Understanding Networks for AD based on Around View Monitoring System
• Eliminating the Blind Spot: Adapting 3D Object Detection and Mono Depth Estimation to
360◦ Panoramic Imagery
• SphereNet: Learning Spherical Representations for Detection and Classification in
Omnidirectional Images
• FisheyeMODNet: Moving Object detection on Surround-view Cameras for AD
• OmniDRL: Robust Pedestrian Detection using DRL on Omnidirectional Cameras
• WoodScape: A multi-task, multi-camera fisheye dataset for AD
• FisheyeDistanceNet: Self-Supervised Scale-Aware Distance Estimation using Mono Fisheye
Camera for AD
Graph-Based Classification of Omnidirectional Images
• Omnidirectional cameras are widely used in such areas as robotics and virtual reality as they
provide a wide field of view.
• Their images are often processed with classical methods, which might unfortunately lead to
non-optimal solutions as these methods are designed for planar images that have different
geometrical properties than omnidirectional ones.
• Here, image classification is performed by taking into account the specific geometry of omnidirectional
cameras with graph-based representations.
• In particular, deep learning architectures for data on graphs are used.
• The graph is constructed in a principled way such that convolutional filters respond similarly
to the same pattern at different positions of the image, regardless of lens distortions.
• Reference: “Graph-based Isometry Invariant Representation Learning”, ICML, 2017
Graph-Based Classification of Omnidirectional Images
• Transformation Invariant Graph-based Network (TIGraNet):
• It takes as input images that are represented as signals on a grid graph and gives
classification labels as output.
• Briefly, this approach proposes a network of alternately stacked spectral convolutional and
dynamic pooling layers, which creates features that are equivariant to isometric
transformations.
• Further, the output of the last layer is processed by a statistical layer, which makes the
equivariant representation of data invariant to isometric transformations.
• Finally, the resulting feature vector is fed to a number of fully-connected layers and a
softmax layer, which outputs the probability distribution that the signal belongs to each of
the given classes.
• This transformation-invariant classification algorithm is extended to omnidirectional images
by incorporating the knowledge about the camera lens geometry in the graph structure.
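A minimal sketch (not the authors' code) of the building block behind a TIGraNet-style layer: a grid-graph Laplacian and a polynomial spectral filter applied to an image treated as a graph signal. The graph size, unit edge weights and filter order are illustrative assumptions; for an omnidirectional image the edge weights would instead encode the lens geometry, as described above.

```python
import numpy as np
import scipy.sparse as sp

def grid_graph_laplacian(h, w):
    """4-connected grid graph with unit edge weights; normalized Laplacian."""
    n = h * w
    rows, cols = [], []
    for y in range(h):
        for x in range(w):
            i = y * w + x
            if x + 1 < w:                       # horizontal edge
                rows += [i, i + 1]; cols += [i + 1, i]
            if y + 1 < h:                       # vertical edge
                rows += [i, i + w]; cols += [i + w, i]
    A = sp.coo_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))
    d = np.asarray(A.sum(axis=1)).ravel()
    D_inv_sqrt = sp.diags(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    return sp.eye(n) - D_inv_sqrt @ A @ D_inv_sqrt    # L = I - D^-1/2 A D^-1/2

def spectral_filter(L, y, alpha):
    """Apply F(L) = sum_k alpha_k L^k to the graph signal y (polynomial spectral filter)."""
    out, Lky = np.zeros_like(y), y.copy()
    for a in alpha:
        out += a * Lky
        Lky = L @ Lky
    return out

h, w = 16, 16
L = grid_graph_laplacian(h, w)
y0 = np.random.rand(h * w)                            # image flattened as a graph signal
feat = spectral_filter(L, y0, alpha=[0.5, -0.3, 0.1]) # order-2 filter response
```

Because the filter is a polynomial of the Laplacian, its response depends only on the local graph structure, which is what lets a geometry-aware graph make the response insensitive to where a pattern sits on the sphere.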
Graph-Based Classification of Omnidirectional Images
The graph construction method makes the filter response
similar regardless of the position of the pattern in an
image from an omnidirectional camera.
Graph-Based Classification of Omnidirectional Images
TIGraNet architecture. The network is composed of an alternation of spectral convolution layers Fl and dynamic
pooling layers Pl, followed by a statistical layer H, multiple fully-connected layers (FC) and a softmax operator
(SM). The input of the network is an image that is represented as a signal y0 on the grid-graph with Laplacian
matrix L. The output of the system is a label that corresponds to the most likely class for the input sample.
Graph-Based Classification of Omnidirectional Images
Example of the gnomonic projection. An object
from tangent plane Ti is projected to the sphere
at tangency point X0,i, which is defined by
spherical coordinates φi, θi. The point Xk,i is
defined by coordinates (xk,i, yk,i) on the plane.
Example of the equirectangular representation of
the image. On the left, the figure depicts the
original image on the tangent plane Ti; on the right,
it is projected to the points of the sphere. To build an
equirectangular image, the values of points on the
discrete regular grid are often approximated from
the values of projected points by interpolation.
Graph-Based Classification of Omnidirectional Images
a) Choose pattern p0, ..., p4 from an object on tangent plane Te at the
equator (φe = 0, θe = 0) (red points) and then, b) move this object
on the sphere by moving the tangent plane Ti to point (φi, θi). c)
Thus, the filter localized at tangency point (φi, θi) uses values pi,1,
pi,3 (blue points), which can be obtained by interpolation.
The goal is to develop a transformation
invariant system, which can recognize
the same object on different planes Ti
that are tangent to S at different points
(φi , θi ) without any extra training.
The challenge of building such a system
is to design a proper graph signal
representation that allows compensating
for the distortion effects that appear at
different elevations of S.
Graph-Based Classification of Omnidirectional Images
Comparison to the state-of-the-art methods on the ETH-80 dataset.
The architectures of the different methods are selected to feature a similar number
of convolutional filters and neurons in the fully-connected layers.
Flat2Sphere: Learning Spherical Convolution
for Fast Features from 360° Imagery
• While 360° cameras offer tremendous new possibilities in vision, graphics, and augmented
reality, the spherical images they produce make core feature extraction non-trivial.
• Convolutional neural networks (CNNs) trained on images from perspective cameras yield
“flat” filters, yet 360° images cannot be projected to a single plane without significant
distortion.
• A naive solution that repeatedly projects the viewing sphere to all tangent planes is accurate,
but much too computationally intensive for real problems.
• Flat2Sphere learns a spherical convolutional network that translates a planar CNN to process
360° imagery directly in its equirectangular projection.
• This approach learns to reproduce the flat filter outputs on 360° data, sensitive to the
varying distortion effects across the viewing sphere.
• The key benefits are 1) efficient feature extraction for 360° images and video, and 2) the
ability to leverage powerful pre-trained networks researchers have carefully honed (together
with massive labeled image training sets) for perspective images.
Flat2Sphere: Learning Spherical Convolution
for Fast Features from 360° Imagery
Strategies for applying CNNs to 360° images. Top: The 1st strategy unwraps the 360° input into a single planar image
using a global projection (equirectangular), then applies the CNN on the distorted planar image. Bottom: The 2nd
strategy samples multiple tangent planar projections to obtain multiple perspective images, to which the CNN is
applied independently to obtain local results for the original 360° image. Strategy I is fast but inaccurate; Strategy II is
accurate but slow. The approach learns to replicate flat filters on spherical imagery, offering both speed and accuracy.
Flat2Sphere: Learning Spherical Convolution
for Fast Features from 360° Imagery
Spherical convolution differs from an ordinary CNN. (a) The kernel weight in spherical convolution is tied only
along each row, and each kernel convolves along the row to generate 1D output. Note that the kernel size
differs at different rows and layers, and it expands near the top and bottom of the image. (b) Inverse
perspective projections P−1 to equirectangular projections at different polar angles θ. The same square image
will distort to different sizes and shapes depending on θ.
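The row-tied weight sharing above can be sketched as follows; this is a hedged, illustrative example, not the Flat2Sphere implementation. Kernel widths per row, channel counts and image size are assumptions; in the method itself the per-row kernel sizes are derived from the inverse perspective projection shown in (b).

```python
import torch
import torch.nn.functional as F

def row_tied_conv(x, row_kernels):
    """x: [B, C, H, W] equirectangular map; row_kernels[r]: [C_out, C, kh, kw_r] for output row r."""
    rows = []
    for r, k in enumerate(row_kernels):
        kh, kw = k.shape[-2:]
        # Pad so this row's output stays W wide and centered on input row r.
        src = F.pad(x, (kw // 2, kw // 2, kh // 2, kh // 2))
        patch = src[:, :, r:r + kh, :]            # the rows this kernel looks at
        rows.append(F.conv2d(patch, k))           # [B, C_out, 1, W]
    return torch.cat(rows, dim=2)                 # [B, C_out, H, W]

x = torch.randn(1, 3, 8, 16)
# Wider kernels near the top/bottom rows (poles), narrower near the equator.
widths = [9, 7, 5, 3, 3, 5, 7, 9]
kernels = [torch.randn(4, 3, 3, w) for w in widths]
y = row_tied_conv(x, kernels)                     # [1, 4, 8, 16]
```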
Flat2Sphere: Learning Spherical Convolution
for Fast Features from 360° Imagery
Object detection examples on 360° PASCAL test images. Images show the top 40% of equirectangular
projection; black regions are undefined pixels. Text gives predicted label, multi-class probability, and IoU, resp.
Spherical CNNs
• Convolutional Neural Networks (CNNs) have become the method of choice for learning
problems involving 2D planar images.
• However, a number of problems of recent interest have created a demand for models that
can analyze spherical images.
• Examples include omnidirectional vision for drones, robots, and autonomous cars, molecular
regression problems, and global weather and climate modelling.
• A naive application of convolutional networks to a planar projection of the spherical signal is
destined to fail, because the space-varying distortions introduced by such a projection will
make translational weight sharing ineffective.
• This work presents building blocks for constructing spherical CNNs.
• It defines a spherical cross-correlation that is both expressive and rotation-equivariant.
• The spherical correlation satisfies a generalized Fourier theorem, which allows it to be computed
efficiently using a generalized (non-commutative) Fast Fourier Transform (FFT) algorithm.
Spherical CNNs
• S2 and SO(3) correlations are defined by analogy to the classical planar Z2 correlation.
• The planar correlation can be understood as follows:
• The value of the output feature map at translation x ∈ Z2 is computed as an inner
product between the input feature map and a filter, shifted by x.
• Similarly, the spherical correlation can be understood as follows:
• The value of the output feature map evaluated at rotation R ∈ SO(3) is computed as an
inner product between the input feature map and a filter, rotated by R.
• For functions on the sphere and rotation group, there is an analogous transform, which is
referred to as generalized Fourier transform (GFT) and a corresponding fast algorithm (GFFT).
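A hedged restatement of the two correlations compared above (notation mirrors the bullets; the sum over input channels c and the measure on S2 are written out explicitly):

```latex
% Planar correlation: output value at translation x \in \mathbb{Z}^2
[\psi \star f](x) \;=\; \sum_{y \in \mathbb{Z}^2} \sum_{c} f_c(y)\, \psi_c(y - x)

% Spherical correlation: output value at rotation R \in SO(3)
[\psi \star f](R) \;=\; \int_{S^2} \sum_{c} f_c(x)\, \psi_c(R^{-1} x)\, dx
```

The GFT/GFFT mentioned above turns these correlations into block-wise products of Fourier coefficients, which is what the figure on the next slide illustrates.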
Spherical CNNs
Spherical correlation in the spectrum. The signal f and the locally-supported filter ψ are Fourier transformed,
block-wise tensored, summed over input channels, and finally inverse transformed. Note that because the
filter is locally supported, it is faster to use a matrix multiplication (DFT) than an FFT algorithm for it. It
parameterizes the sphere using spherical coordinates α, β, and SO(3) with ZYZ-Euler angles α, β, γ.
Scene Understanding Networks for Autonomous
Driving based on Around View Monitoring System
• Modern driver assistance systems rely on a wide range of sensors (RADAR, LIDAR,
ultrasound and cameras) for scene understanding and prediction.
• These sensors are typically used for detecting traffic participants and scene elements
required for navigation.
• Relying on camera-based systems, specifically the Around View Monitoring (AVM) system, has
great potential to achieve these goals in both parking and driving modes with decreased
costs.
• This is a new end-to-end solution for delimiting the safe drivable area for each frame by
means of identifying the closest obstacle in each direction from the driving vehicle;
• It calculates the distance to the nearest obstacles and is incorporated into a unified end-to-
end architecture capable of joint object detection, curb detection and safe drivable area
detection.
• Augmentation of the base architecture with 3D object detection.
Scene Understanding Networks for Autonomous
Driving based on Around View Monitoring System
This approach for detecting the curb and the free
drivable area is inspired by a Stixel representation
of the world. Originally, the network takes as input
each vertical column of an image. The input
columns that the network used had a width of 24
pixels, overlapping by 23 pixels. Each column would
then be passed through a convolutional network
to output one-of-k labels, with k being the height
dimension. As a result, it would learn to classify
the position of the bottom pixel of the obstacle
corresponding to that column. The union of all
columns would build either the curb or the free
drivable area of the scene.
Scene Understanding Networks for Autonomous
Driving based on Around View Monitoring System
• In this architecture, due to the overlapping between the columns, more than 95% of the
computation is redundant.
• Motivated by this observation, replace the column-wise network implementation with an
end-to-end architecture.
• This network encoded the image into a deep feature map using multiple convolutional
layers and then used multiple upsampling layers to generate a feature map having the same
resolution as the input image.
• Crop hardcoded regions of the image corresponding to the pixel columns augmented with
the neighboring area of 23 pixels.
• As a result, the regions of interest for cropping the upsampled feature map are 23 pixels
wide and 720 (height) pixels tall.
• Slide this window horizontally over the image at each x-coordinate.
• The resulting crops are then resized to a fixed size (e.g. 7x7) in the ROI pooling layer and
are classified into one-of-k classes (k is the height of the image), to ultimately predict the
bottom point.
Scene Understanding Networks for Autonomous
Driving based on Around View Monitoring System
Bottom prediction architecture using ROI pooling for each column
Use a single shot method for the final classification layer of the bottom prediction task. Moreover,
to make the network more efficient, replace the decoder part of the network corresponding to the
multiple upsample layers with a single dense horizontal upsampling layer. The resulting feature
map generated from the encoder after applying multiple convolutions with stride > 1 has a
resolution of [width/16, height/16], i.e. reduced 16 times from the original image size.
Scene Understanding Networks for Autonomous
Driving based on Around View Monitoring System
Finally, add another fully connected layer on top of the horizontal upsampling layer to make a linear combination of each
column’s input. A softmax is used to classify each of the resulting columns into one-of-k categories, where k is the height of
the image being predicted. Each column classification subtask automatically takes into account the pixels displayed in the
proximity of the center column being classified and represents the final bottom prediction.
Bottom-Net architecture
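A minimal PyTorch-style sketch of the bottom-prediction head described on the two slides above. The encoder channel count, the kernel/stride of the dense horizontal upsampling layer and the 1280x720 image size are illustrative assumptions, not the exact Bottom-Net.

```python
import torch
import torch.nn as nn

class BottomHead(nn.Module):
    """Per-column bottom-pixel classifier on top of an encoder feature map."""
    def __init__(self, in_ch=512, img_h=720, img_w=1280):
        super().__init__()
        # Dense horizontal upsampling: restore full width, keep the reduced height.
        self.h_upsample = nn.ConvTranspose2d(in_ch, 64, kernel_size=(1, 16), stride=(1, 16))
        # Linear combination of each column's features -> one-of-k classes (k = image height).
        self.column_fc = nn.Linear(64 * (img_h // 16), img_h)

    def forward(self, feat):                      # feat: [B, in_ch, H/16, W/16]
        x = self.h_upsample(feat)                 # [B, 64, H/16, W]
        x = x.permute(0, 3, 1, 2).flatten(2)      # [B, W, 64 * H/16]
        logits = self.column_fc(x)                # [B, W, H] per-column class scores
        return logits.softmax(dim=-1)

feat = torch.randn(1, 512, 720 // 16, 1280 // 16)
bottom_probs = BottomHead()(feat)                 # [1, 1280, 720]: one distribution per column
```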
Scene Understanding Networks for Autonomous
Driving based on Around View Monitoring System
Unified architectures which combine the bottom prediction and the object detection networks usually take
advantage of shared computation of the encoder for better training optimization and runtime performance.
Scene Understanding Networks for Autonomous
Driving based on Around View Monitoring System
The final architecture consists of two branches, for object orientation estimation based on angle
discretization and for object dimensions regression, respectively.
3D-Net architecture
Scene Understanding Networks for Autonomous
Driving based on Around View Monitoring System
Side view detections. (left) left view. (right) right view.
Scene Understanding Networks for Autonomous
Driving based on Around View Monitoring System
Captured frame from the
high accuracy solution.
SphereNet: Learning Spherical Representations for Detection
and Classification in Omnidirectional Images
• Omnidirectional cameras offer great benefits over classical cameras wherever a wide field of
view is essential, such as in virtual reality applications or in autonomous robots.
• Unfortunately, standard convolutional neural networks are not well suited for this scenario as
the natural projection surface is a sphere which cannot be unwrapped to a plane without
introducing significant distortions, particularly in the polar regions.
• SphereNet is a deep learning framework which encodes invariance against such distortions
explicitly into convolutional neural networks.
• Towards this goal, SphereNet adapts the sampling locations of the convolutional filters,
effectively reversing distortions, and wraps the filters around the sphere.
• By building on regular convolutions, SphereNet enables the transfer of existing perspective
convolutional neural network models to the omnidirectional case.
• On the tasks of image classification and object detection, it is evaluated on two newly created semi-
synthetic and real-world omnidirectional datasets.
SphereNet: Learning Spherical Representations for Detection
and Classification in Omnidirectional Images
Overview. (a+b) Capturing images with a fisheye or 360◦ action camera results in images which are
best represented on the sphere. (c) Using regular convolutions (e.g., with 3 × 3 filter kernels) on
the rectified equirectangular representation (see Fig. 2b) suffers from distortions of the
sampling locations (red) close to the poles. (d) In contrast, our SphereNet kernel exploits
projections (red) of the sampling pattern on the tangent plane (blue), yielding filter outputs
which are invariant to latitudinal rotations.
SphereNet: Learning Spherical Representations for Detection
and Classification in Omnidirectional Images
Kernel Sampling Pattern at φ = 0 (blue) and φ = 1.2 (red) in spherical (a) and equirectangular (b)
representation. Note the distortion of the kernel at φ = 1.2 in (b).
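The distorted sampling pattern in (b) can be reproduced with the inverse gnomonic projection: place the kernel's sampling points on the tangent plane and map them back onto the sphere. This is a hedged sketch assuming a 3x3 kernel and an illustrative tangent-plane step size; it uses the standard inverse gnomonic formula, not SphereNet's exact code.

```python
import numpy as np

def gnomonic_kernel_locations(phi0, lam0, delta=0.01, size=3):
    """(phi, lam) sampling locations of a size x size kernel centered at latitude phi0, longitude lam0."""
    r = (size - 1) // 2
    xs, ys = np.meshgrid(np.arange(-r, r + 1) * delta, np.arange(-r, r + 1) * delta)
    rho = np.sqrt(xs ** 2 + ys ** 2)
    c = np.arctan(rho)                                   # angular distance from the tangency point
    with np.errstate(invalid="ignore", divide="ignore"):
        phi = np.arcsin(np.cos(c) * np.sin(phi0) + ys * np.sin(c) * np.cos(phi0) / rho)
        lam = lam0 + np.arctan2(xs * np.sin(c),
                                rho * np.cos(phi0) * np.cos(c) - ys * np.sin(phi0) * np.sin(c))
    center = rho == 0                                    # the kernel center maps to itself
    phi[center], lam[center] = phi0, lam0
    return phi, lam

# Near the pole the same tangent-plane pattern spreads over a much wider longitude
# range, which is exactly the distortion visible in panel (b).
phi_eq, lam_eq = gnomonic_kernel_locations(0.0, 0.0)     # equator
phi_po, lam_po = gnomonic_kernel_locations(1.2, 0.0)     # close to the pole
```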
SphereNet: Learning Spherical Representations for Detection
and Classification in Omnidirectional Images
Uniform Sphere Sampling. Comparison of an equirectangular sampling grid on the sphere with N =
200 points (a) to an approximation of evenly distributing N = 127 sampling points on a sphere with
the Saff-Kuijlaars method (b, c). Note that the sampling points at the poles are much more evenly
spaced in the uniform sphere sampling (b) compared to the equirectangular representation (a)
which oversamples the image in these regions.
SphereNet: Learning Spherical Representations for Detection
and Classification in Omnidirectional Images
• SphereNet can be integrated into a convolutional neural network for image classification by
adapting the sampling locations of the convolution and pooling kernels.
• Furthermore, it is straightforward to additionally utilize a uniform sphere sampling, which is
compared to nearest neighbor and bilinear interpolation on an equirectangular representation in
the experiments.
• The integration of SphereNet into an image classification network does not introduce novel
model parameters and no changes to the training of the network are required.
• In order to perform object detection on the sphere, the Spherical Single Shot MultiBox
Detector (Sphere-SSD) adapts the Single Shot MultiBox Detector (SSD) to objects located on
tangent planes of a sphere.
• SSD exploits a fully convolutional architecture, predicting category scores and box offsets for a
set of default anchor boxes of different scales and aspect ratios.
• Sphere-SSD uses a weighted sum between a localization loss and confidence loss.
• However, in contrast to the original SSD, anchor boxes are now placed on tangent planes of the
sphere and are defined in terms of spherical coordinates of their respective tangent plane, the
width/height of the box on the tangent plane as well as an in-plane rotation.
SphereNet: Learning Spherical Representations for Detection
and Classification in Omnidirectional Images
Spherical Anchor Boxes are gnomonic projections of 2D bounding boxes of various scales, aspect
ratios and orientations on tangent planes of the sphere. The figure visualizes anchors of the same
orientation at different scales and aspect ratios on a 16 × 8 feature map on a sphere (a) and an
equirectangular grid (b).
SphereNet: Learning Spherical Representations for Detection
and Classification in Omnidirectional Images
Detection Results on FlyingCars Dataset. The ground truth is shown in green, SphereNet (NN) results in red.
Eliminating the Blind Spot: Adapting 3D Object Detection and
Monocular Depth Estimation to 360◦ Panoramic Imagery
• Recent automotive vision work has focused on processing forward-facing cameras.
• However, future autonomous vehicles will not be viable without a more comprehensive
surround sensing, akin to a human driver, as can be provided by 360◦ panoramic cameras.
• Here is an approach to adapt contemporary deep network architectures developed on
conventional rectilinear imagery to work on equirectangular 360◦ panoramic imagery.
• To address the lack of annotated panoramic automotive datasets, it adapts a
contemporary automotive dataset, via style and projection transformations, to facilitate the
cross-domain retraining of contemporary algorithms for panoramic imagery.
• Following this approach, it retrains and adapts existing architectures to recover scene depth
and 3D pose of vehicles from monocular panoramic imagery without any panoramic training
labels or calibration parameters.
• This approach is evaluated qualitatively on crowd-sourced panoramic images and
quantitatively using an automotive environment simulator to provide the first benchmark for
such techniques within panoramic imagery.
Eliminating the Blind Spot: Adapting 3D Object Detection and
Monocular Depth Estimation to 360◦ Panoramic Imagery
Panoramic images are typically represented using an equirectangular projection (A); in contrast, a
conventional camera uses a rectilinear projection. In this projection, the image-space coordinates are
proportional to the latitude and longitude of observed points rather than the usual projection onto a focal plane.
The adapted networks recover monocular depth and the full 3D pose of vehicles (B) from panoramic imagery.
Eliminating the Blind Spot: Adapting 3D Object Detection and
Monocular Depth Estimation to 360◦ Panoramic Imagery
Convolutions are computed seamlessly across horizontal image boundaries using the padding approach.
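A hedged sketch of that padding idea: wrap the feature map horizontally (the 360° seam) before convolving, and zero-pad vertically. Kernel size and channel counts are illustrative, not the paper's architecture.

```python
import torch
import torch.nn.functional as F

def pano_conv2d(x, weight, bias=None, pad=1):
    """x: [B, C, H, W] equirectangular feature map; weight: [C_out, C, k, k]."""
    x = F.pad(x, (pad, pad, 0, 0), mode="circular")             # wrap across the horizontal seam
    x = F.pad(x, (0, 0, pad, pad), mode="constant", value=0.0)  # ordinary padding at the poles
    return F.conv2d(x, weight, bias)

x = torch.randn(1, 3, 256, 512)
w = torch.randn(16, 3, 3, 3)
y = pano_conv2d(x, w)          # [1, 16, 256, 512], seamless across the left/right border
```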
Eliminating the Blind Spot: Adapting 3D Object Detection and
Monocular Depth Estimation to 360◦ Panoramic Imagery
Monocular depth recovery and
3D object detection with our
approach. Left: Real-world
images. Right: Synthetic images.
FisheyeMODNet: Moving Object detection on
Surround-view Cameras for Autonomous Driving
• Moving Object Detection is an important task for achieving robust autonomous driving.
• An autonomous vehicle has to estimate collision risk with other interacting objects in
the environment and calculate an optimal trajectory.
• Collision risk is typically higher for moving objects than static ones due to the need to
estimate the future states and poses of the objects for decision making.
• This is particularly important for near-range objects around the vehicle which are typically
detected by a fisheye surround-view system that captures a 360◦ view of the scene.
• This work presents a CNN architecture for moving object detection using fisheye images
captured in an autonomous driving environment.
• To target embedded deployment, it designs a lightweight encoder sharing weights across
sequential images.
FisheyeMODNet: Moving Object detection on
Surround-view Cameras for Autonomous Driving
Images from the surround-view camera network showing near-field sensing and a wide
field of view. Four fisheye cameras (marked green) provide a 360◦ surround view.
FisheyeMODNet: Moving Object detection on
Surround-view Cameras for Autonomous Driving
Network architecture adapted from the ShuffleSeg base network. Two sequential images encoding
the motion information across time are utilized to train the network end-to-end for MOD.
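A minimal sketch of the two-frame, shared-weight idea (layer sizes and the simple fuse/decode stages are assumptions; the actual network is built on ShuffleSeg, which is not reproduced here).

```python
import torch
import torch.nn as nn

class TinyMODNet(nn.Module):
    """Two consecutive frames -> moving/static mask, with one encoder shared across frames."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.encoder = nn.Sequential(                     # shared weights for both frames
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.fuse = nn.Conv2d(64, 32, 1)                  # fuse motion information across time
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(16, num_classes, 4, stride=2, padding=1))

    def forward(self, frame_t0, frame_t1):
        f0, f1 = self.encoder(frame_t0), self.encoder(frame_t1)   # same weights, two time steps
        return self.decoder(self.fuse(torch.cat([f0, f1], dim=1)))

t0, t1 = torch.randn(1, 3, 128, 256), torch.randn(1, 3, 128, 256)
mask_logits = TinyMODNet()(t0, t1)                        # [1, 2, 128, 256]
```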
OmniDRL: Robust Pedestrian Detection using Deep
Reinforcement Learning on Omnidirectional Cameras
• Pedestrian detection is one of the most explored topics in computer vision and robotics.
• Deep Reinforcement Learning (DRL) has proved to be among the state-of-the-art (SoA) for both detection in
perspective cameras and robotics applications.
• However, for detection in omnidirectional cameras, the literature is still scarce, mostly
because of their high levels of distortion.
• This is an efficient technique for robust pedestrian detection in omnidirectional images.
• The method uses deep RL that takes advantage of the distortion in the image.
• By considering the 3D bounding boxes and their distorted projections into the image, this
method is able to provide the pedestrian’s position in the world, in contrast to the image
positions provided by most SoA methods for perspective cameras.
• The method avoids the need for pre-processing steps to remove the distortion, which is
computationally expensive.
OmniDRL: Robust Pedestrian Detection using Deep
Reinforcement Learning on Omnidirectional Cameras
Illustration of the method, using a
Multi-task network, for pedestrian
detection in omnidirectional cameras.
The input is an omnidirectional image
with an initial state of the bounding
box, represented in the world
coordinate system. Using this
information, a set of possible actions
are applied in order to detect the
pedestrian in the 3D environment.
After the trigger is activated, the line
segments of the estimated 3D bounding
box are projected to the
omnidirectional image. Then, the IoU
between the ground truth and our
estimation is computed in the image
coordinates.
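A hedged sketch of the detection-as-RL loop the caption describes. The discrete action set, step sizes and the IoU-improvement reward are illustrative assumptions; the paper's exact actions, multi-task network and training procedure are not reproduced. `q_network` and `project_and_iou` are hypothetical callables standing in for the learned Q-function and the 3D-box-to-image projection plus IoU computation.

```python
import numpy as np

ACTIONS = ["+x", "-x", "+y", "-y", "+z", "-z", "grow", "shrink", "trigger"]

def apply_action(box, action, step=0.1):
    """box: np.ndarray [X, Y, Z, w, h, l] of a 3D bounding box in world coordinates."""
    box = box.copy()
    if action == "trigger":
        return box, True
    axis = {"+x": 0, "-x": 0, "+y": 1, "-y": 1, "+z": 2, "-z": 2}
    if action in axis:
        box[axis[action]] += step if action.startswith("+") else -step
    elif action == "grow":
        box[3:] *= 1.0 + step
    else:                                   # "shrink"
        box[3:] *= 1.0 - step
    return box, False

def episode(q_network, box, project_and_iou, max_steps=50):
    """Greedy roll-out: adjust the 3D box until 'trigger'; reward is the IoU improvement in the image."""
    prev_iou, total_reward = project_and_iou(box), 0.0
    for _ in range(max_steps):
        action = ACTIONS[int(np.argmax(q_network(box)))]
        box, done = apply_action(box, action)
        iou = project_and_iou(box)          # IoU measured in image coordinates, as above
        total_reward += iou - prev_iou
        prev_iou = iou
        if done:
            break
    return box, total_reward
```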
OmniDRL: Robust Pedestrian Detection using Deep
Reinforcement Learning on Omnidirectional Cameras
Depiction of the scheme of the proposed
network, where the first convolutional layers
are shared, and then split into branches (DQN
and Classification).
OmniDRL: Robust Pedestrian Detection using Deep
Reinforcement Learning on Omnidirectional Cameras
This figure shows the image formation using unified central catadioptric cameras. (a) the projection of a
point R ∈ R3 onto the normalized image plane {i−, i+} (intermediate projection on the unitary sphere {n− ,
n+ }). (b) the projection of 3D straight line segments for images using this model (x1 and x2 are the endpoints of
the line segment).
WoodScape: A multi-task, multi-camera
fisheye dataset for autonomous driving
• Fisheye cameras are commonly employed for obtaining a large field of view in surveillance,
augmented reality and in particular automotive applications.
• In spite of their prevalence, there are few public datasets for detailed evaluation of computer
vision algorithms on fisheye images.
• The 1st extensive fisheye automotive dataset, WoodScape, named after Robert Wood who
invented the fisheye camera in 1906.
• WoodScape comprises 4 surround-view cameras and nine tasks including segmentation,
depth estimation, 3D bounding box detection and soiling detection.
• Semantic annotation of 40 classes at the instance level is provided for over 10,000 images
and annotations for other tasks are provided for over 100,000 images.
WoodScape: A multi-task, multi-camera
fisheye dataset for autonomous driving
WoodScape, the first fisheye image dataset dedicated to autonomous driving. It contains four cameras covering
360°, accompanied by an HD laser scanner, IMU and GNSS. Annotations are made available for nine tasks, notably
3D object detection, depth estimation (overlaid on the front camera) and semantic segmentation.
WoodScape: A multi-task, multi-camera
fisheye dataset for autonomous driving
Comparison of fisheye models.
WoodScape: A multi-task, multi-camera
fisheye dataset for autonomous driving
Undistorting the fisheye image: (a)
Rectilinear correction; (b) Piecewise
linear correction; (c) Cylindrical
correction. Left: raw image; Right:
undistorted image.
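A hedged sketch of cylindrical correction, assuming a simple equidistant fisheye model (r = f_fisheye * theta); WoodScape's actual calibration uses a higher-order polynomial model, so this is only illustrative. Each output pixel of the cylindrical image is mapped to a 3D ray and then back into the raw fisheye image.

```python
import numpy as np
import cv2

def cylindrical_undistort(fisheye_img, f_fisheye, f_cyl, out_w, out_h):
    cx, cy = fisheye_img.shape[1] / 2.0, fisheye_img.shape[0] / 2.0
    u, v = np.meshgrid(np.arange(out_w), np.arange(out_h))
    azimuth = (u - out_w / 2.0) / f_cyl                  # cylindrical: u = f * azimuth
    elevation = np.arctan((v - out_h / 2.0) / f_cyl)     # cylindrical: v = f * tan(elevation)
    # 3D ray for each output pixel (camera looks along +z).
    x = np.cos(elevation) * np.sin(azimuth)
    y = np.sin(elevation)
    z = np.cos(elevation) * np.cos(azimuth)
    theta = np.arccos(np.clip(z, -1.0, 1.0))             # angle to the optical axis
    r = f_fisheye * theta                                # equidistant fisheye model (assumed)
    norm = np.sqrt(x ** 2 + y ** 2) + 1e-9
    map_x = (cx + r * x / norm).astype(np.float32)
    map_y = (cy + r * y / norm).astype(np.float32)
    return cv2.remap(fisheye_img, map_x, map_y, cv2.INTER_LINEAR)
```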
WoodScape: A multi-task, multi-camera
fisheye dataset for autonomous driving
Segmentation using ENet (top) and Object detection using Faster RCNN (bottom).
FisheyeDistanceNet: Self-Supervised Scale-Aware Distance
Estimation using Monocular Fisheye Camera for Autonomous Driving
• Fisheye cameras are commonly used in applications like autonomous driving and
surveillance to provide a large field of view (> 180◦).
• However, they come at the cost of strong non-linear distortion, which requires more complex
algorithms.
• Here, Euclidean distance estimation is performed on fisheye cameras for automotive scenes.
• Obtaining accurate and dense depth supervision is difficult in practice, but self-supervised
learning approaches show promising results and could potentially overcome the problem.
• This is a self-supervised scale-aware framework for learning Euclidean distance and ego-
motion from raw monocular fisheye videos without applying rectification.
• While it is possible to perform a piece-wise linear approximation of the fisheye projection surface
and apply standard rectilinear models, this has its own set of issues, like re-sampling distortion
and discontinuities in transition regions.
FisheyeDistanceNet: Self-Supervised Scale-Aware Distance
Estimation using Monocular Fisheye Camera for Autonomous Driving
Overview: the 1st row represents ego masks Mt-1,
Mt+1, which indicate which pixel coordinates are valid
when constructing It−1 from It and It from It+1,
respectively. The 2nd row indicates the masking of
static pixels computed after 2 epochs, where black
pixels are filtered from the photometric loss (i.e. σ
= 0). It prevents dynamic objects moving at a speed
similar to the ego car and low-texture regions from
contaminating the loss. The masks are computed
for forward and backward sequences from the
input sequence S and reconstructed images. The
3rd row represents the distance estimates
corresponding to their input frames. Finally, the
vehicle’s odometry data is used to resolve the
scale factor issue.
FisheyeDistanceNet: Self-Supervised Scale-Aware Distance
Estimation using Monocular Fisheye Camera for Autonomous Driving
• The overall self-supervised structure-from-motion (SfM) objective consists of a photometric loss term
Lp, imposed between the reconstructed target image Iˆt and the target image It, and a
distance regularization term Ls ensuring edge-aware smoothing of the distance estimates (both
terms are sketched after this list).
• Finally, Ldc is a cross-sequence distance consistency loss derived from the chain of frames in the
training sequence S.
• To prevent the training objective from getting stuck in local minima due to the gradient
locality of the bilinear sampler, 4 scales are adopted to train the network.
• The distance estimation network is mainly based on the U-Net architecture, an encoder-
decoder network with skip connections.
• After testing different variants of the ResNet family, a ResNet18 is chosen as the encoder.
• The key aspect is replacing regular convolutions with deformable convolutions, since regular CNNs are inherently
limited in modeling large, unknown geometric distortions due to their fixed structures, such
as fixed filter kernels, fixed receptive field sizes, and fixed pooling kernels.
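A hedged sketch of the photometric term Lp and the edge-aware smoothness term Ls described at the top of this list. The SSIM/L1 weighting, the SSIM constants and the mean normalization of the distance map are common Monodepth2-style choices assumed here, not necessarily the paper's exact values.

```python
import torch
import torch.nn.functional as F

def photometric_loss(target, recon, alpha=0.85):
    """Per-pixel L_p between the target image and its reconstruction (SSIM + L1 mix)."""
    l1 = (target - recon).abs().mean(1, keepdim=True)
    mu_x, mu_y = F.avg_pool2d(target, 3, 1, 1), F.avg_pool2d(recon, 3, 1, 1)
    sigma_x = F.avg_pool2d(target ** 2, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(recon ** 2, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(target * recon, 3, 1, 1) - mu_x * mu_y
    ssim = ((2 * mu_x * mu_y + 1e-4) * (2 * sigma_xy + 9e-4)) / (
        (mu_x ** 2 + mu_y ** 2 + 1e-4) * (sigma_x + sigma_y + 9e-4))
    ssim_term = ((1 - ssim.clamp(-1, 1)) / 2).mean(1, keepdim=True)
    return alpha * ssim_term + (1 - alpha) * l1

def smoothness_loss(dist, img):
    """Edge-aware smoothness L_s on the mean-normalized distance map."""
    d = dist / (dist.mean(dim=(2, 3), keepdim=True) + 1e-7)
    dx = (d[:, :, :, :-1] - d[:, :, :, 1:]).abs()
    dy = (d[:, :, :-1, :] - d[:, :, 1:, :]).abs()
    ix = (img[:, :, :, :-1] - img[:, :, :, 1:]).abs().mean(1, keepdim=True)
    iy = (img[:, :, :-1, :] - img[:, :, 1:, :]).abs().mean(1, keepdim=True)
    return (dx * torch.exp(-ix)).mean() + (dy * torch.exp(-iy)).mean()
```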
FisheyeDistanceNet: Self-Supervised Scale-Aware Distance
Estimation using Monocular Fisheye Camera for Autonomous Driving
• The backbone of the pose estimation network is based on the paper “Digging into self-supervised
monocular depth estimation”, which predicts rotation using an Euler angle parameterization.
• Replace normal convolutions with deformable convolutions for the encoder-decoder setting.
• Predict the rotation using an axis-angle representation, and scale the rotation and
translation outputs by 0.01.
• For monocular training, use a sequence length of three frames, while the pose network is
formed from a ResNet18, modified to accept a pair of color images (or six channels) as input
and to predict a single 6-DoF relative pose between It−1→t and It→t−1.
• Perform horizontal flips and the following training augmentations: random brightness, contrast,
saturation, and hue jitter with respective ranges of ±0.2, ±0.2, ±0.2, and ±0.1.
• Importantly, the color augmentations are only applied to the images which are fed to the
networks, not to those used to compute the photometric loss term Lp.
• All 3 images fed to the pose and depth networks are augmented with the same parameters.
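A hedged sketch of this augmentation policy, assuming a recent torchvision where ColorJitter accepts tensor images; the 0.5 flip probability and the width-concatenation trick used to draw one set of jitter parameters for all three frames are assumptions of this sketch.

```python
import random
import torch
from torchvision import transforms

def augment_triplet(frames, flip_p=0.5):
    """frames: list of three [3, H, W] tensors (I_{t-1}, I_t, I_{t+1}), values in [0, 1]."""
    if random.random() < flip_p:                          # geometric aug applies everywhere
        frames = [torch.flip(f, dims=[-1]) for f in frames]
    loss_frames = [f.clone() for f in frames]             # unaugmented copies for the photometric loss
    jitter = transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1)
    stacked = torch.cat(frames, dim=-1)                   # one jitter draw shared by all 3 frames
    net_frames = list(torch.chunk(jitter(stacked), 3, dim=-1))
    return net_frames, loss_frames                        # network inputs vs. loss images
```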
FisheyeDistanceNet: Self-Supervised Scale-Aware Distance
Estimation using Monocular Fisheye Camera for Autonomous Driving
(a) Depth network: U-Net. (b) Pose network: A separate pose network. (c) Per-pixel minimum reprojection: When
correspondences are good, the reprojection loss should be low. (d) Full-resolution multi-scale: Upsample depth
predictions at intermediate layers and compute all losses at the input resolution, reducing texture-copy artifacts.
FisheyeDistanceNet: Self-Supervised Scale-Aware Distance
Estimation using Monocular Fisheye Camera for Autonomous Driving
FisheyeDistanceNet produces sharp distance maps on distorted fisheye images.
Fisheye Omnidirectional View in Autonomous Driving

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

SfM Learner系単眼深度推定手法について
SfM Learner系単眼深度推定手法についてSfM Learner系単眼深度推定手法について
SfM Learner系単眼深度推定手法について
 
PR-302: NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
PR-302: NeRF: Representing Scenes as Neural Radiance Fields for View SynthesisPR-302: NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
PR-302: NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
 
Image processing.pdf
Image processing.pdfImage processing.pdf
Image processing.pdf
 
GRPHICS06 - Shading
GRPHICS06 - ShadingGRPHICS06 - Shading
GRPHICS06 - Shading
 
Image enhancement techniques
Image enhancement techniquesImage enhancement techniques
Image enhancement techniques
 
U-Net (1).pptx
U-Net (1).pptxU-Net (1).pptx
U-Net (1).pptx
 
カメラでの偏光取得における円偏光と位相遅延の考え方
カメラでの偏光取得における円偏光と位相遅延の考え方カメラでの偏光取得における円偏光と位相遅延の考え方
カメラでの偏光取得における円偏光と位相遅延の考え方
 
NetVLAD: CNN architecture for weakly supervised place recognition
NetVLAD:  CNN architecture for weakly supervised place recognitionNetVLAD:  CNN architecture for weakly supervised place recognition
NetVLAD: CNN architecture for weakly supervised place recognition
 
semantic segmentation サーベイ
semantic segmentation サーベイsemantic segmentation サーベイ
semantic segmentation サーベイ
 
CV_Chap 6 Motion Representation
CV_Chap 6 Motion RepresentationCV_Chap 6 Motion Representation
CV_Chap 6 Motion Representation
 
Depth estimation using deep learning
Depth estimation using deep learningDepth estimation using deep learning
Depth estimation using deep learning
 
Region based segmentation
Region based segmentationRegion based segmentation
Region based segmentation
 
【DL輪読会】Monocular real time volumetric performance capture
【DL輪読会】Monocular real time volumetric performance capture 【DL輪読会】Monocular real time volumetric performance capture
【DL輪読会】Monocular real time volumetric performance capture
 
fusion of Camera and lidar for autonomous driving II
fusion of Camera and lidar for autonomous driving IIfusion of Camera and lidar for autonomous driving II
fusion of Camera and lidar for autonomous driving II
 
Ray Tracing in Computer Graphics
Ray Tracing in Computer GraphicsRay Tracing in Computer Graphics
Ray Tracing in Computer Graphics
 
Image Degradation & Resoration
Image Degradation & ResorationImage Degradation & Resoration
Image Degradation & Resoration
 
Passive stereo vision with deep learning
Passive stereo vision with deep learningPassive stereo vision with deep learning
Passive stereo vision with deep learning
 
Image Retrieval (D4L5 2017 UPC Deep Learning for Computer Vision)
Image Retrieval (D4L5 2017 UPC Deep Learning for Computer Vision)Image Retrieval (D4L5 2017 UPC Deep Learning for Computer Vision)
Image Retrieval (D4L5 2017 UPC Deep Learning for Computer Vision)
 
Semantic Segmentation - Fully Convolutional Networks for Semantic Segmentation
Semantic Segmentation - Fully Convolutional Networks for Semantic SegmentationSemantic Segmentation - Fully Convolutional Networks for Semantic Segmentation
Semantic Segmentation - Fully Convolutional Networks for Semantic Segmentation
 
Image Processing using Matlab ( using a built in Highboost filtering,averagin...
Image Processing using Matlab ( using a built in Highboost filtering,averagin...Image Processing using Matlab ( using a built in Highboost filtering,averagin...
Image Processing using Matlab ( using a built in Highboost filtering,averagin...
 

Semelhante a Fisheye Omnidirectional View in Autonomous Driving

Report bep thomas_blanken
Report bep thomas_blankenReport bep thomas_blanken
Report bep thomas_blanken
xepost
 
Urban 3D Semantic Modelling Using Stereo Vision, ICRA 2013
Urban 3D Semantic Modelling Using Stereo Vision, ICRA 2013Urban 3D Semantic Modelling Using Stereo Vision, ICRA 2013
Urban 3D Semantic Modelling Using Stereo Vision, ICRA 2013
Sunando Sengupta
 
Super Resolution of Image
Super Resolution of ImageSuper Resolution of Image
Super Resolution of Image
Satheesh K
 

Semelhante a Fisheye Omnidirectional View in Autonomous Driving (20)

Tomographic reconstruction in nuclear medicine
Tomographic reconstruction in nuclear medicineTomographic reconstruction in nuclear medicine
Tomographic reconstruction in nuclear medicine
 
Poster_Final
Poster_FinalPoster_Final
Poster_Final
 
Introduction to Real Time Rendering
Introduction to Real Time RenderingIntroduction to Real Time Rendering
Introduction to Real Time Rendering
 
DICTA 2017 poster
DICTA 2017 posterDICTA 2017 poster
DICTA 2017 poster
 
Fisheye Omnidirectional View in Autonomous Driving II
Fisheye Omnidirectional View in Autonomous Driving IIFisheye Omnidirectional View in Autonomous Driving II
Fisheye Omnidirectional View in Autonomous Driving II
 
Report bep thomas_blanken
Report bep thomas_blankenReport bep thomas_blanken
Report bep thomas_blanken
 
Optic flow estimation with deep learning
Optic flow estimation with deep learningOptic flow estimation with deep learning
Optic flow estimation with deep learning
 
CT Image reconstruction
CT Image reconstructionCT Image reconstruction
CT Image reconstruction
 
Image reconstruction
Image reconstructionImage reconstruction
Image reconstruction
 
Vf sift
Vf siftVf sift
Vf sift
 
N045077984
N045077984N045077984
N045077984
 
The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)
 
SIGGRAPH 2014 Course on Computational Cameras and Displays (part 4)
SIGGRAPH 2014 Course on Computational Cameras and Displays (part 4)SIGGRAPH 2014 Course on Computational Cameras and Displays (part 4)
SIGGRAPH 2014 Course on Computational Cameras and Displays (part 4)
 
Urban 3D Semantic Modelling Using Stereo Vision, ICRA 2013
Urban 3D Semantic Modelling Using Stereo Vision, ICRA 2013Urban 3D Semantic Modelling Using Stereo Vision, ICRA 2013
Urban 3D Semantic Modelling Using Stereo Vision, ICRA 2013
 
E0343034
E0343034E0343034
E0343034
 
TransNeRF
TransNeRFTransNeRF
TransNeRF
 
Super Resolution of Image
Super Resolution of ImageSuper Resolution of Image
Super Resolution of Image
 
06 image features
06 image features06 image features
06 image features
 
998-isvc16
998-isvc16998-isvc16
998-isvc16
 
Multiple UGV SLAM Map Sharing
Multiple UGV SLAM Map SharingMultiple UGV SLAM Map Sharing
Multiple UGV SLAM Map Sharing
 

Mais de Yu Huang

Mais de Yu Huang (20)

Application of Foundation Model for Autonomous Driving
Application of Foundation Model for Autonomous DrivingApplication of Foundation Model for Autonomous Driving
Application of Foundation Model for Autonomous Driving
 
The New Perception Framework in Autonomous Driving: An Introduction of BEV N...
The New Perception Framework  in Autonomous Driving: An Introduction of BEV N...The New Perception Framework  in Autonomous Driving: An Introduction of BEV N...
The New Perception Framework in Autonomous Driving: An Introduction of BEV N...
 
Data Closed Loop in Simulation Test of Autonomous Driving
Data Closed Loop in Simulation Test of Autonomous DrivingData Closed Loop in Simulation Test of Autonomous Driving
Data Closed Loop in Simulation Test of Autonomous Driving
 
Techniques and Challenges in Autonomous Driving
Techniques and Challenges in Autonomous DrivingTechniques and Challenges in Autonomous Driving
Techniques and Challenges in Autonomous Driving
 
BEV Joint Detection and Segmentation
BEV Joint Detection and SegmentationBEV Joint Detection and Segmentation
BEV Joint Detection and Segmentation
 
BEV Object Detection and Prediction
BEV Object Detection and PredictionBEV Object Detection and Prediction
BEV Object Detection and Prediction
 
Fisheye based Perception for Autonomous Driving VI
Fisheye based Perception for Autonomous Driving VIFisheye based Perception for Autonomous Driving VI
Fisheye based Perception for Autonomous Driving VI
 
Fisheye/Omnidirectional View in Autonomous Driving V
Fisheye/Omnidirectional View in Autonomous Driving VFisheye/Omnidirectional View in Autonomous Driving V
Fisheye/Omnidirectional View in Autonomous Driving V
 
Fisheye/Omnidirectional View in Autonomous Driving IV
Fisheye/Omnidirectional View in Autonomous Driving IVFisheye/Omnidirectional View in Autonomous Driving IV
Fisheye/Omnidirectional View in Autonomous Driving IV
 
Prediction,Planninng & Control at Baidu
Prediction,Planninng & Control at BaiduPrediction,Planninng & Control at Baidu
Prediction,Planninng & Control at Baidu
 
Cruise AI under the Hood
Cruise AI under the HoodCruise AI under the Hood
Cruise AI under the Hood
 
LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)
LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)
LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)
 
Scenario-Based Development & Testing for Autonomous Driving
Scenario-Based Development & Testing for Autonomous DrivingScenario-Based Development & Testing for Autonomous Driving
Scenario-Based Development & Testing for Autonomous Driving
 
How to Build a Data Closed-loop Platform for Autonomous Driving?
How to Build a Data Closed-loop Platform for Autonomous Driving?How to Build a Data Closed-loop Platform for Autonomous Driving?
How to Build a Data Closed-loop Platform for Autonomous Driving?
 
Annotation tools for ADAS & Autonomous Driving
Annotation tools for ADAS & Autonomous DrivingAnnotation tools for ADAS & Autonomous Driving
Annotation tools for ADAS & Autonomous Driving
 
Simulation for autonomous driving at uber atg
Simulation for autonomous driving at uber atgSimulation for autonomous driving at uber atg
Simulation for autonomous driving at uber atg
 
Multi sensor calibration by deep learning
Multi sensor calibration by deep learningMulti sensor calibration by deep learning
Multi sensor calibration by deep learning
 
Prediction and planning for self driving at waymo
Prediction and planning for self driving at waymoPrediction and planning for self driving at waymo
Prediction and planning for self driving at waymo
 
Jointly mapping, localization, perception, prediction and planning
Jointly mapping, localization, perception, prediction and planningJointly mapping, localization, perception, prediction and planning
Jointly mapping, localization, perception, prediction and planning
 
Data pipeline and data lake for autonomous driving
Data pipeline and data lake for autonomous drivingData pipeline and data lake for autonomous driving
Data pipeline and data lake for autonomous driving
 

Último

Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Christo Ananth
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
rknatarajan
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
dollysharma2066
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Dr.Costas Sachpazis
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
dharasingh5698
 

Último (20)

University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdf
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - V
 
Vivazz, Mieres Social Housing Design Spain
Vivazz, Mieres Social Housing Design SpainVivazz, Mieres Social Housing Design Spain
Vivazz, Mieres Social Housing Design Spain
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
 
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
 
Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.ppt
 
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
 
UNIT-IFLUID PROPERTIES & FLOW CHARACTERISTICS
UNIT-IFLUID PROPERTIES & FLOW CHARACTERISTICSUNIT-IFLUID PROPERTIES & FLOW CHARACTERISTICS
UNIT-IFLUID PROPERTIES & FLOW CHARACTERISTICS
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank  Design by Working Stress - IS Method.pdfIntze Overhead Water Tank  Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
 

Fisheye Omnidirectional View in Autonomous Driving

  • 1. Fisheye/Ominidirectional View in Autonomous Driving YuHuang Yu.huang07@gmail.com Sunnyvale,California
  • 2. Outline • Graph-Based Classification of Omnidirectional Images • Flat2Sphere: Learning Spherical Convolution for Fast Features from 360° Imagery • Spherical CNNs • Scene Understanding Networks for AD based on Around View Monitoring System • Eliminating the Blind Spot: Adapting 3D Object Detection and Mono Depth Estimation to 360◦ Panoramic Imagery • SphereNet: Learning Spherical Representations for Detection and Classification in Omnidirectional Images • FisheyeMODNet: Moving Object detection on Surround-view Cameras for AD • OmniDRL: Robust Pedestrian Detection using DRL on Omnidirectional Cameras • WoodScape: A multi-task, multi-camera fisheye dataset for AD • FisheyeDistanceNet: Self-Supervised Scale-Aware Distance Estimation using Mono Fisheye Camera for AD
  • 3. Graph-Based Classification of Omnidirectional Images • Omnidirectional cameras are widely used in such areas as robotics and virtual reality as they provide a wide field of view. • Their images are often processed with classical methods, which might unfortunately lead to non-optimal solutions as these methods are designed for planar images that have different geometrical properties than omnidirectional ones. • Here, image classification by taking into account the specific geometry of omnidirectional cameras with graph-based representations. • In particular, deep learning architectures for data on graphs. • It is a principled way of graph construction such that convolutional filters respond similarly for the same pattern on different positions of the image regardless of lens distortions. • Reference: “Graph-based Isometry Invariant Representation Learning”, ICML, 2017
  • 4. Graph-Based Classification of Omnidirectional Images • Transformation Invariant Graph-based Network (TIGraNet): • It takes as input images that are represented as signals on a grid graph and gives classification labels as output. • Briefly this approach proposes a network of alternatively stacked spectral convolutional and dynamic pooling layers, which creates features that are equivariant to the isometric transformation. • Further, the output of the last layer is processed by a statistical layer, which makes the equivariant representation of data invariant to isometric transformations. • Finally, the resulting feature vector is fed to a number of fully-connected layers and a softmax layer, which outputs the probability distribution that the signal belongs to each of the given classes. • This transformation-invariant classification algorithm is extended to omnidirectional images by incorporating the knowledge about the camera lens geometry in the graph structure.
  • 5. Graph-Based Classification of Omnidirectional Images The graph construction method makes response of the filter similar regardless of different position of the pattern on an image from an omnidirectional camera.
  • 6. Graph-Based Classification of Omnidirectional Images TIGraNet architecture. The network is composed of an alternation of spectral convolution layers Fl and dynamic pooling layers Pl, followed by a statistical layer H, multiple fully-connected layers (FC) and a softmax operator (SM). The input of the network is an image that is represented as a signal y0 on the grid-graph with Laplacian matrix L. The output of the system is a label that corresponds to the most likely class for the input sample.
  • 7. Graph-Based Classification of Omnidirectional Images Example of the gnomonic projection. An object from tangent plane Ti is projected to the sphere at tangency point X0,i, which is defined by spherical coordinates φi , θi . The point Xk,I is defined by coordinates (xk,i , yk,i ) on the plane. Example of the equirectangular representation of the image. On the left, the figure depicts the original image on the tangent plane Ti, on the right, projected to the points of the sphere. To build an equirectangular image the values points on the discrete regular grid are often approximated from the values of projected points by interpolation.
  • 8. Graph-Based Classification of Omnidirectional Images a) Choose pattern p0 , .., p4 from an object on tangent plane Te at equator (φe = 0, θe = 0) (red points) and then, b) move this object on the sphere by moving the tangent plane Ti to point (φi,θi). c) Thus, the filter localized at tangency point (φi , θi ) uses values pi,1 , pi,3 (blue points) which we can obtain by interpolation. The goal is to develop a transformation invariant system, which can recognize the same object on different planes Ti that are tangent to S at different points (φi , θi ) without any extra training. The challenge of building such a system is to design a proper graph signal representation that allow compensating for the distortion effects that appear on different elevations of S.
  • 9. Graph-Based Classification of Omnidirectional Images Comparison to the state-of-the-art methods on the ETH- 80 datasets. Select the architecture of different methods to feature similar number of convolutional filters and neurons in the fully-connected layers.
  • 10. Flat2Sphere: Learning Spherical Convolution for Fast Features from 360° Imagery • While 360° cameras offer tremendous new possibilities in vision, graphics, and augmented reality, the spherical images they produce make core feature extraction non-trivial. • Convolutional neural networks (CNNs) trained on images from perspective cameras yield “flat" filters, yet 360° images cannot be projected to a single plane without significant distortion. • A naive solution that repeatedly projects the viewing sphere to all tangent planes is accurate, but much too computationally intensive for real problems. • Flat2Sphere learns a spherical convolutional network that translates a planar CNN to process 360° imagery directly in its equirectangular projection. • This approach learns to reproduce the flat filter outputs on 360° data, sensitive to the varying distortion effects across the viewing sphere. • The key benefits are 1) efficient feature extraction for 360° images and video, and 2) the ability to leverage powerful pre-trained networks researchers have carefully honed (together with massive labeled image training sets) for perspective images.
  • 11. Flat2Sphere: Learning Spherical Convolution for Fast Features from 360° Imagery Strategies for applying CNNs to 360° images. Top: The 1st strategy unwraps the 360° input into a single planar image using a global projection (equirectangular), then applies the CNN on the distorted planar image. Bottom: The 2nd strategy samples multiple tangent planar projections to obtain multiple perspective images, to which the CNN is applied independently to obtain local results for the original 360° image. Strategy I is fast but inaccurate; Strategy II is accurate but slow. The approach learns to replicate flat filters on spherical imagery, offering both speed and accuracy.
  • 12. Flat2Sphere: Learning Spherical Convolution for Fast Features from 360° Imagery Spherical convolution differs from ordinary CNN. (a) The kernel weight in spherical convolution is tied only along each row, and each kernel convolves along the row to generate 1D output. Note that the kernel size differs at different rows and layers, and it expands near the top and bottom of the image. (b) Inverse perspective projections P−1 to equirectangular projections at different polar angles θ. The same square image will distort to different sizes and shapes depending on θ.
  • 13. Flat2Sphere: Learning Spherical Convolution for Fast Features from 360° Imagery Object detection examples on 360° PASCAL test images. Images show the top 40% of equirectangular projection; black regions are undefined pixels. Text gives predicted label, multi-class probability, and IoU, resp.
• 14. Spherical CNNs • Convolutional Neural Networks (CNNs) have become the method of choice for learning problems involving 2D planar images. • However, a number of problems of recent interest have created a demand for models that can analyze spherical images. • Examples include omnidirectional vision for drones, robots, and autonomous cars, molecular regression problems, and global weather and climate modelling. • A naive application of convolutional networks to a planar projection of the spherical signal is destined to fail, because the space-varying distortions introduced by such a projection will make translational weight sharing ineffective. • This work provides the building blocks for constructing spherical CNNs. • It defines a spherical cross-correlation that is both expressive and rotation-equivariant. • The spherical correlation satisfies a generalized Fourier theorem, which allows it to be computed efficiently using a generalized (non-commutative) Fast Fourier Transform (FFT) algorithm.
• 15. Spherical CNNs • The S2 and SO(3) correlations are defined by analogy to the classical planar Z2 correlation. • The planar correlation can be understood as follows: the value of the output feature map at translation x ∈ Z2 is computed as an inner product between the input feature map and a filter shifted by x. • Similarly, the spherical correlation can be understood as follows: the value of the output feature map evaluated at rotation R ∈ SO(3) is computed as an inner product between the input feature map and a filter rotated by R. • For functions on the sphere and the rotation group, there is an analogous transform, referred to as the generalized Fourier transform (GFT), with a corresponding fast algorithm (GFFT).
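Written out schematically (omitting normalization and following the inner-product description above; exact conventions as in the paper), the two correlations are

\[
[\psi \star f](x) \;=\; \sum_{y \in \mathbb{Z}^2} \sum_{k} \psi_k(y - x)\, f_k(y),
\qquad
[\psi \star f](R) \;=\; \int_{S^2} \sum_{k} \psi_k\!\left(R^{-1} u\right) f_k(u)\, \mathrm{d}u,
\quad R \in SO(3).
\]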
  • 16. Spherical CNNs Spherical correlation in the spectrum. The signal f and the locally-supported filter ψ are Fourier transformed, block-wise tensored, summed over input channels, and finally inverse transformed. Note that because the filter is locally supported, it is faster to use a matrix multiplication (DFT) than an FFT algorithm for it. It parameterizes the sphere using spherical coordinates α, β, and SO(3) with ZYZ-Euler angles α, β, γ.
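The generalized Fourier theorem exploited here can be stated, roughly, as: correlation in the signal domain becomes a block-wise matrix product in the spectrum,

\[
\widehat{\psi \star f}_{\,\ell} \;=\; \hat{f}_\ell\, \hat{\psi}_\ell^{\dagger},
\]

where \(\hat{f}_\ell\) and \(\hat{\psi}_\ell\) are the degree-ℓ generalized Fourier coefficients and † denotes the conjugate transpose (the precise normalization and block structure follow the paper's conventions).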
• 17. Scene Understanding Networks for Autonomous Driving based on Around View Monitoring System • Modern driver assistance systems rely on a wide range of sensors (RADAR, LIDAR, ultrasound and cameras) for scene understanding and prediction. • These sensors are typically used for detecting traffic participants and scene elements required for navigation. • Relying on camera-based systems, specifically the Around View Monitoring (AVM) system, has great potential to achieve these goals in both parking and driving modes with decreased costs. • This is a new end-to-end solution for delimiting the safe drivable area in each frame by identifying the closest obstacle in each direction from the driving vehicle. • It calculates the distance to the nearest obstacles and is incorporated into a unified end-to-end architecture capable of joint object detection, curb detection and safe drivable area detection. • The base architecture is further augmented with 3D object detection.
• 18. Scene Understanding Networks for Autonomous Driving based on Around View Monitoring System This approach for detecting the curb and the free drivable area is inspired by a Stixel representation of the world. Originally, the network takes as input each vertical column of an image. The input columns had width 24 and overlapped by 23 pixels. Each column is passed through a convolutional network that outputs one-of-k labels, with k being the height dimension. As a result, the network learns to classify the position of the bottom pixel of the obstacle corresponding to that column. The union of all columns builds either the curb or the free drivable area of the scene.
• 19. Scene Understanding Networks for Autonomous Driving based on Around View Monitoring System • In this architecture, due to the overlap between the columns, more than 95% of the computation is redundant. • Motivated by this observation, the column-wise network implementation is replaced with an end-to-end architecture. • This network encodes the image into a deep feature map using multiple convolutional layers and then uses multiple upsampling layers to generate a feature map with the same resolution as the input image. • Hardcoded regions of the image corresponding to the pixel columns, augmented with a neighboring area of 23 pixels, are cropped. • As a result, the regions of interest for cropping the upsampled feature map are 23 pixels wide and 720 (height) pixels tall. • This window is slid horizontally over the image at each x-coordinate. • The resulting crops are then resized to a fixed size (e.g. 7×7) in the ROI pooling layer and classified into one-of-k classes (k is the height of the image) to ultimately predict the bottom point.
• 20. Scene Understanding Networks for Autonomous Driving based on Around View Monitoring System Bottom prediction architecture using ROI pooling for each column. A single-shot method is used for the final classification layer of the bottom prediction task. Moreover, to make the network more efficient, the decoder part of the network corresponding to the multiple upsampling layers is replaced with a single dense horizontal upsampling layer. The feature map generated by the encoder after applying multiple convolutions with stride > 1 has a resolution of [width/16, height/16], i.e. it is reduced 16× relative to the original image size.
• 21. Scene Understanding Networks for Autonomous Driving based on Around View Monitoring System Finally, another fully connected layer is added on top of the horizontal upsampling layer to form a linear combination of each column's input. A softmax is used to classify each of the resulting columns into one-of-k categories, where k is the height of the image being predicted. Each column classification subtask automatically takes into account the pixels in the proximity of the center column being classified and represents the final bottom prediction. Bottom-Net architecture
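A minimal, hypothetical sketch of the column-wise classification head described above (class name, layer sizes and the way the vertical dimension is collapsed are assumptions, not the authors' exact configuration): a dense horizontal upsampling restores the full image width from the encoder width, and a softmax assigns each column to one-of-k (k = image height) bottom rows.

import torch
import torch.nn as nn

class BottomNetHead(nn.Module):
    """Illustrative bottom-prediction head (not the paper's exact layers).
    Input: encoder features (N, C, H/16, W/16). Output: per-column log-probabilities
    (N, W, k), where k is the image height (the candidate bottom row of each column)."""
    def __init__(self, in_ch, enc_h, enc_w, img_w, img_h, col_feat=128):
        super().__init__()
        self.img_w = img_w
        # dense horizontal upsampling: each encoder column emits features for
        # img_w // enc_w image columns (16 when the encoder downsamples by 16)
        self.col_embed = nn.Linear(in_ch * enc_h, col_feat * (img_w // enc_w))
        # maps each column's feature vector to one-of-k bottom-row logits
        # (the neighbourhood context described in the slides comes from the encoder)
        self.classify = nn.Linear(col_feat, img_h)

    def forward(self, feats):                                  # feats: (N, C, H/16, W/16)
        n, c, h, w = feats.shape
        cols = feats.permute(0, 3, 1, 2).reshape(n, w, c * h)  # one vector per encoder column
        up = self.col_embed(cols)                              # (N, W/16, col_feat * 16)
        up = up.reshape(n, self.img_w, -1)                     # (N, W, col_feat)
        logits = self.classify(up)                             # (N, W, img_h)
        return logits.log_softmax(dim=-1)                      # one-of-k per image column

# example sizing for a 1280x720 input with a /16 encoder (values assumed):
# head = BottomNetHead(in_ch=512, enc_h=45, enc_w=80, img_w=1280, img_h=720)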
  • 22. Scene Understanding Networks for Autonomous Driving based on Around View Monitoring System Unified architectures which combine the bottom prediction and the object detection networks usually take advantage of shared computation of the encoder for better training optimization and runtime performance.
  • 23. Scene Understanding Networks for Autonomous Driving based on Around View Monitoring System The final architecture consists of two branches, for object orientation estimation based on angle discretization and for object dimensions regression, respectively. 3D-Net architecture
  • 24. Scene Understanding Networks for Autonomous Driving based on Around View Monitoring System Side view detections. (left) left view. (right) right view.
  • 25. Scene Understanding Networks for Autonomous Driving based on Around View Monitoring System Captured frame from the high accuracy solution.
• 26. SphereNet: Learning Spherical Representations for Detection and Classification in Omnidirectional Images • Omnidirectional cameras offer great benefits over classical cameras wherever a wide field of view is essential, such as in virtual reality applications or in autonomous robots. • Unfortunately, standard convolutional neural networks are not well suited for this scenario as the natural projection surface is a sphere which cannot be unwrapped to a plane without introducing significant distortions, particularly in the polar regions. • SphereNet is a deep learning framework which encodes invariance against such distortions explicitly into convolutional neural networks. • Towards this goal, SphereNet adapts the sampling locations of the convolutional filters, effectively reversing distortions, and wraps the filters around the sphere. • By building on regular convolutions, SphereNet enables the transfer of existing perspective convolutional neural network models to the omnidirectional case. • On the tasks of image classification and object detection, it exploits two newly created semi-synthetic and real-world omnidirectional datasets.
• 27. SphereNet: Learning Spherical Representations for Detection and Classification in Omnidirectional Images Overview. (a+b) Capturing images with a fisheye or 360◦ action camera results in images which are best represented on the sphere. (c) Using regular convolutions (e.g., with 3 × 3 filter kernels) on the rectified equirectangular representation suffers from distortions of the sampling locations (red) close to the poles. (d) In contrast, the SphereNet kernel exploits projections (red) of the sampling pattern on the tangent plane (blue), yielding filter outputs which are invariant to latitudinal rotations.
  • 28. SphereNet: Learning Spherical Representations for Detection and Classification in Omnidirectional Images Kernel Sampling Pattern at φ = 0 (blue) and φ = 1.2 (red) in spherical (a) and equirectangular (b) representation. Note the distortion of the kernel at φ = 1.2 in (b).
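A rough sketch of how SphereNet-style sampling locations can be computed: take a regular 3×3 grid on the plane tangent to the sphere at the kernel centre (φ, θ), map it back to spherical coordinates with the inverse gnomonic projection, and convert to equirectangular pixel positions. Step size, coordinate conventions and the function name are assumptions, not the paper's implementation.

import numpy as np

def sphere_kernel_locations(phi0, theta0, step, img_w, img_h):
    """Equirectangular pixel sampling locations of a 3x3 kernel whose pattern
    lives on the plane tangent to the sphere at (phi0, theta0).
    phi0: latitude in [-pi/2, pi/2], theta0: longitude in [-pi, pi]."""
    xs, ys = np.meshgrid(np.arange(-1, 2) * step, np.arange(-1, 2) * step)
    rho = np.sqrt(xs**2 + ys**2)
    c = np.arctan(rho)                                    # angular distance from the centre
    with np.errstate(invalid="ignore", divide="ignore"):  # centre point divides by zero
        phi = np.arcsin(np.cos(c) * np.sin(phi0) + ys * np.sin(c) * np.cos(phi0) / rho)
        theta = theta0 + np.arctan2(
            xs * np.sin(c),
            rho * np.cos(phi0) * np.cos(c) - ys * np.sin(phi0) * np.sin(c))
    phi = np.where(rho == 0, phi0, phi)                   # kernel centre -> tangency point
    theta = np.where(rho == 0, theta0, theta)
    u = np.mod((theta / np.pi + 1.0) * 0.5 * img_w, img_w)  # longitude -> x, with wrap-around
    v = (0.5 - phi / np.pi) * img_h                          # latitude -> y (north pole at top)
    return u, v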
• 29. SphereNet: Learning Spherical Representations for Detection and Classification in Omnidirectional Images Uniform Sphere Sampling. Comparison of an equirectangular sampling grid on the sphere with N = 200 points (a) to an approximation of evenly distributing N = 127 sampling points on a sphere with the Saff-Kuijlaars method (b, c). Note that the sampling points at the poles are much more evenly spaced in the uniform sphere sampling (b) compared to the equirectangular representation (a), which oversamples the image in these regions.
• 30. SphereNet: Learning Spherical Representations for Detection and Classification in Omnidirectional Images • SphereNet can be integrated into a convolutional neural network for image classification by adapting the sampling locations of the convolution and pooling kernels. • Furthermore, it is straightforward to additionally utilize a uniform sphere sampling, which is compared to nearest-neighbor and bilinear interpolation on an equirectangular representation in the experiments. • The integration of SphereNet into an image classification network does not introduce novel model parameters, and no changes to the training of the network are required. • In order to perform object detection on the sphere, the Spherical Single Shot MultiBox Detector (Sphere-SSD) adapts the Single Shot MultiBox Detector (SSD) to objects located on tangent planes of a sphere. • SSD exploits a fully convolutional architecture, predicting category scores and box offsets for a set of default anchor boxes of different scales and aspect ratios. • Sphere-SSD uses a weighted sum of a localization loss and a confidence loss. • However, in contrast to the original SSD, anchor boxes are now placed on tangent planes of the sphere and are defined in terms of the spherical coordinates of their respective tangent plane, the width/height of the box on the tangent plane, as well as an in-plane rotation.
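For reference, the weighted sum follows the standard SSD objective (the spherical variant changes the anchor parameterization, not this loss structure):

\[
L(x, c, l, g) \;=\; \frac{1}{N}\Big(L_{\mathrm{conf}}(x, c) + \alpha\, L_{\mathrm{loc}}(x, l, g)\Big),
\]

with N the number of matched anchor boxes, \(L_{\mathrm{conf}}\) a softmax confidence loss, \(L_{\mathrm{loc}}\) a smooth-L1 localization loss over the box parameters, and α a weighting factor.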
  • 31. SphereNet: Learning Spherical Representations for Detection and Classification in Omnidirectional Images Spherical Anchor Boxes are gnomonic projections of 2D bounding boxes of various scales, aspect ratios and orientations on tangent planes of the sphere. The figure visualizes anchors of the same orientation at different scales and aspect ratios on a 16 × 8 feature map on a sphere (a) and an equirectangular grid (b).
  • 32. SphereNet: Learning Spherical Representations for Detection and Classification in Omnidirectional Images Detection Results on FlyingCars Dataset. The ground truth is shown in green, SphereNet (NN) results in red.
• 33. Eliminating the Blind Spot: Adapting 3D Object Detection and Monocular Depth Estimation to 360◦ Panoramic Imagery • Recent automotive vision work has focused on processing forward-facing cameras. • However, future autonomous vehicles will not be viable without more comprehensive surround sensing, akin to that of a human driver, as can be provided by 360◦ panoramic cameras. • Here is an approach to adapt contemporary deep network architectures developed on conventional rectilinear imagery to work on equirectangular 360◦ panoramic imagery. • To address the lack of annotated panoramic automotive datasets, it adapts a contemporary automotive dataset, via style and projection transformations, to facilitate the cross-domain retraining of contemporary algorithms for panoramic imagery. • Following this approach, it retrains and adapts existing architectures to recover scene depth and 3D pose of vehicles from monocular panoramic imagery without any panoramic training labels or calibration parameters. • This approach is evaluated qualitatively on crowd-sourced panoramic images and quantitatively using an automotive environment simulator to provide the first benchmark for such techniques within panoramic imagery.
• 34. Eliminating the Blind Spot: Adapting 3D Object Detection and Monocular Depth Estimation to 360◦ Panoramic Imagery Panoramic images are typically represented using an equirectangular projection (A); in contrast, a conventional camera uses a rectilinear projection. In the equirectangular projection, the image-space coordinates are proportional to the latitude and longitude of observed points rather than the usual projection onto a focal plane. The adapted architectures recover monocular depth and the full 3D pose of vehicles (B) from panoramic imagery.
  • 35. Eliminating the Blind Spot: Adapting 3D Object Detection and Monocular Depth Estimation to 360◦ Panoramic Imagery Convolutions are computed seamlessly across horizontal image boundaries using the padding approach.
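The horizontal wrap-around can be realized, for instance, by stitching the left and right image borders before each convolution; a minimal sketch of this idea follows (it mirrors the stated padding approach, not necessarily the authors' exact implementation).

import torch
import torch.nn as nn

class WrapConv2d(nn.Module):
    """3x3 convolution that wraps horizontally across the 360° image seam:
    the last image column is copied to the front and the first to the back,
    while the vertical direction uses ordinary zero padding."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=(1, 0))

    def forward(self, x):                                    # x: (N, C, H, W)
        x = torch.cat([x[..., -1:], x, x[..., :1]], dim=-1)  # horizontal wrap padding
        return self.conv(x)                                  # output width == input width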
  • 36. Eliminating the Blind Spot: Adapting 3D Object Detection and Monocular Depth Estimation to 360◦ Panoramic Imagery Monocular depth recovery and 3D object detection with our approach. Left: Real-world images. Right: Synthetic images.
• 37. FisheyeMODNet: Moving Object detection on Surround-view Cameras for Autonomous Driving • Moving Object Detection is an important task for achieving robust autonomous driving. • An autonomous vehicle has to estimate the collision risk with other interacting objects in the environment and calculate an optimal trajectory. • Collision risk is typically higher for moving objects than static ones due to the need to estimate the future states and poses of the objects for decision making. • This is particularly important for near-range objects around the vehicle, which are typically detected by a fisheye surround-view system that captures a 360◦ view of the scene. • This work is a CNN architecture for moving object detection using fisheye images captured in an autonomous driving environment. • To target embedded deployment, it designs a lightweight encoder sharing weights across sequential images.
• 38. FisheyeMODNet: Moving Object detection on Surround-view Cameras for Autonomous Driving Images from the surround-view camera network showing near field sensing and wide field of view. Four fisheye cameras (marked green) provide 360◦ surround view.
• 39. FisheyeMODNet: Moving Object detection on Surround-view Cameras for Autonomous Driving Network Architecture adapted from the ShuffleSeg base network. Two sequential images encoding the motion information across time are utilized to train the network end-to-end for MOD.
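A minimal sketch of the weight-shared two-frame encoder idea (the backbone and fusion below are placeholders, not the ShuffleSeg configuration used in the paper): the same lightweight encoder processes both frames, the features are concatenated, and a small decoder predicts a per-pixel moving/static mask.

import torch
import torch.nn as nn

class TwoFrameMODNet(nn.Module):
    """Illustrative moving-object-detection network with a shared-weight encoder."""
    def __init__(self, feat=64, n_classes=2):
        super().__init__()
        self.encoder = nn.Sequential(                 # placeholder lightweight encoder (/4)
            nn.Conv2d(3, feat, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat, feat, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.decoder = nn.Sequential(                 # fuse both time steps, predict mask
            nn.Conv2d(2 * feat, feat, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(feat, n_classes, 1))

    def forward(self, frame_t, frame_t1):
        f_t = self.encoder(frame_t)                   # same weights for both frames
        f_t1 = self.encoder(frame_t1)
        return self.decoder(torch.cat([f_t, f_t1], dim=1))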
• 40. OmniDRL: Robust Pedestrian Detection using Deep Reinforcement Learning on Omnidirectional Cameras • Pedestrian detection is one of the most explored topics in computer vision and robotics. • Deep Reinforcement Learning has proven to be within the SoA in terms of both detection in perspective cameras and robotics applications. • However, for detection in omnidirectional cameras, the literature is still scarce, mostly because of their high levels of distortion. • This is an efficient technique for robust pedestrian detection in omnidirectional images. • The method uses deep RL and takes advantage of the distortion in the image. • By considering the 3D bounding boxes and their distorted projections into the image, this method is able to provide the pedestrian's position in the world, in contrast to the image positions provided by most SoA methods for perspective cameras. • The method avoids the need for pre-processing steps to remove the distortion, which are computationally expensive.
• 41. OmniDRL: Robust Pedestrian Detection using Deep Reinforcement Learning on Omnidirectional Cameras Illustration of the method, using a multi-task network, for pedestrian detection in omnidirectional cameras. The input is an omnidirectional image with an initial state of the bounding box, represented in the world coordinate system. Using this information, a set of possible actions is applied in order to detect the pedestrian in the 3D environment. After the trigger is activated, the line segments of the estimated 3D bounding box are projected onto the omnidirectional image. Then, the IoU between the ground truth and our estimation is computed in image coordinates.
  • 42. OmniDRL: Robust Pedestrian Detection using Deep Reinforcement Learning on Omnidirectional Cameras Depiction of the scheme of the proposed network, where the first convolutional layers are shared, and then split into branches (DQN and Classification).
• 43. OmniDRL: Robust Pedestrian Detection using Deep Reinforcement Learning on Omnidirectional Cameras This figure shows the image formation using unified central catadioptric cameras. (a) The projection of a point R ∈ R3 onto the normalized image plane {i−, i+} (with intermediate projection onto the unitary sphere {n−, n+}). (b) The projection of 3D straight line segments onto images using this model (x1 and x2 are the endpoints of the line segment).
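The unified central projection model referenced here can be written, in one common form (notation may differ from the paper), as a two-step mapping: projection onto the unit sphere, followed by projection to the normalized image plane from a point at distance ξ above the sphere centre:

\[
P = (X, Y, Z) \;\mapsto\; \frac{P}{\lVert P \rVert} = (X_s, Y_s, Z_s)
\;\mapsto\;
m = \left(\frac{X_s}{Z_s + \xi},\; \frac{Y_s}{Z_s + \xi},\; 1\right).
\]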
• 44. WoodScape: A multi-task, multi-camera fisheye dataset for autonomous driving • Fisheye cameras are commonly employed for obtaining a large field of view in surveillance, augmented reality and, in particular, automotive applications. • In spite of their prevalence, there are few public datasets for detailed evaluation of computer vision algorithms on fisheye images. • WoodScape is the first extensive fisheye automotive dataset, named after Robert Wood, who invented the fisheye camera in 1906. • WoodScape comprises four surround-view cameras and nine tasks including segmentation, depth estimation, 3D bounding box detection and soiling detection. • Semantic annotation of 40 classes at the instance level is provided for over 10,000 images, and annotations for the other tasks are provided for over 100,000 images.
• 45. WoodScape: A multi-task, multi-camera fisheye dataset for autonomous driving WoodScape, the first fisheye image dataset dedicated to autonomous driving. It contains four cameras covering 360°, accompanied by an HD laser scanner, IMU and GNSS. Annotations are made available for nine tasks, notably 3D object detection, depth estimation (overlaid on the front camera) and semantic segmentation.
  • 46. WoodScape: A multi-task, multi-camera fisheye dataset for autonomous driving Comparison of fisheye models.
  • 47. WoodScape: A multi-task, multi-camera fisheye dataset for autonomous driving Undistorting the fisheye image: (a) Rectilinear correction; (b) Piecewise linear correction; (c) Cylindrical correction. Left: raw image; Right: undistorted image.
  • 48. WoodScape: A multi-task, multi-camera fisheye dataset for autonomous driving Segmentation using ENet (top) and Object detection using Faster RCNN (bottom).
• 49. FisheyeDistanceNet: Self-Supervised Scale-Aware Distance Estimation using Monocular Fisheye Camera for Autonomous Driving • Fisheye cameras are commonly used in applications like autonomous driving and surveillance to provide a large field of view (> 180◦). • However, they come at the cost of strong non-linear distortions which require more complex algorithms. • Here is Euclidean distance estimation on fisheye cameras for automotive scenes. • Obtaining accurate and dense depth supervision is difficult in practice, but self-supervised learning approaches show promising results and could potentially overcome the problem. • This is a self-supervised scale-aware framework for learning Euclidean distance and ego-motion from raw monocular fisheye videos without applying rectification. • While it is possible to perform a piece-wise linear approximation of the fisheye projection surface and apply standard rectilinear models, this has its own set of issues, like re-sampling distortion and discontinuities in transition regions.
• 50. FisheyeDistanceNet: Self-Supervised Scale-Aware Distance Estimation using Monocular Fisheye Camera for Autonomous Driving Overview: the 1st row represents the ego masks Mt-1, Mt+1, which indicate which pixel coordinates are valid when reconstructing It−1 from It and It from It+1, respectively. The 2nd row indicates the masking of static pixels computed after 2 epochs, where black pixels are filtered from the photometric loss (i.e. σ = 0). It prevents dynamic objects moving at a speed similar to the ego car, as well as low-texture regions, from contaminating the loss. The masks are computed for forward and backward sequences from the input sequence S and the reconstructed images. The 3rd row represents the distance estimates corresponding to the input frames. Finally, the vehicle's odometry data is used to resolve the scale factor issue.
• 51. FisheyeDistanceNet: Self-Supervised Scale-Aware Distance Estimation using Monocular Fisheye Camera for Autonomous Driving • The overall self-supervised structure-from-motion objective consists of a photometric loss term Lp imposed between the reconstructed target image Iˆt and the target image It, and a distance regularization term Ls ensuring edge-aware smoothing of the distance estimates. • Finally, Ldc is a cross-sequence distance consistency loss derived from the chain of frames in the training sequence S. • To prevent the training objective from getting stuck in local minima due to the gradient locality of the bilinear sampler, 4 scales are adopted to train the network. • The distance estimation network is mainly based on the U-Net architecture, an encoder-decoder network with skip connections. • After testing different variants of the ResNet family, ResNet18 was chosen as the encoder. • A key aspect is replacing regular convolutions with deformable convolutions, since regular CNNs are inherently limited in modeling large, unknown geometric distortions due to their fixed structures, such as fixed filter kernels, fixed receptive field sizes, and fixed pooling kernels.
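The deformable-convolution substitution mentioned above can be sketched with torchvision's DeformConv2d, where a small regular convolution predicts per-location sampling offsets; the block below is illustrative (layer sizes and the offset-prediction design are assumptions, not the paper's exact configuration).

import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformBlock(nn.Module):
    """Replace a regular 3x3 convolution with a deformable one: an auxiliary conv
    predicts 2 * kH * kW sampling offsets per output location, letting the kernel
    adapt its sampling locations to the fisheye distortion."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.offset = nn.Conv2d(in_ch, 2 * 3 * 3, kernel_size=3, padding=1)
        self.deform = DeformConv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):
        return self.deform(x, self.offset(x))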
• 52. FisheyeDistanceNet: Self-Supervised Scale-Aware Distance Estimation using Monocular Fisheye Camera for Autonomous Driving • The backbone of the pose estimation network is based on the paper "Digging into Self-Supervised Monocular Depth Estimation", which predicts rotation using an Euler angle parameterization. • Normal convolutions are replaced with deformable convolutions in the encoder-decoder setting. • The rotation is predicted using an axis-angle representation, and the rotation and translation outputs are scaled by 0.01. • For monocular training, a sequence length of three frames is used, while the pose network is formed from a ResNet18, modified to accept a pair of color images (six channels) as input and to predict a single 6-DoF relative pose for It−1→t and It→t−1. • Horizontal flips and the following training augmentations are performed: random brightness, contrast, saturation, and hue jitter with respective ranges of ±0.2, ±0.2, ±0.2, and ±0.1. • Importantly, the color augmentations are only applied to the images fed to the networks, not to those used to compute the photometric loss term Lp. • All three images fed to the pose and depth networks are augmented with the same parameters.
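A hedged sketch of two of the implementation details above: widening ResNet18's first convolution to accept the concatenated image pair (six channels), applying the stated colour-jitter ranges, and scaling the predicted pose by 0.01. Function names are illustrative, and how the extra input channels would be initialized from pretrained weights is not specified here.

import torch
import torch.nn as nn
import torchvision
from torchvision import transforms

def make_pose_encoder():
    """ResNet18 modified to take a pair of RGB frames (6 channels) as input."""
    net = torchvision.models.resnet18()
    old = net.conv1
    net.conv1 = nn.Conv2d(6, old.out_channels, kernel_size=7, stride=2, padding=3, bias=False)
    return net

# colour augmentations with the ranges quoted above; applied only to the
# network inputs, not to the frames used for the photometric loss Lp
color_jitter = transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1)

def scale_pose(raw_pose):
    """Scale the raw network output; raw_pose: (N, 6) = [axis-angle (3), translation (3)]."""
    return 0.01 * raw_pose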
  • 53. FisheyeDistanceNet: Self-Supervised Scale-Aware Distance Estimation using Monocular Fisheye Camera for Autonomous Driving (a) Depth network: U-Net. (b) Pose network: A separate pose network. (c) Per-pixel minimum reprojection: When correspondences are good, the reprojection loss should be low. (d) Full-resolution multi-scale: Upsample depth predictions at intermediate layers and compute all losses at the input resolution, reducing texture-copy artifacts.
  • 54. FisheyeDistanceNet: Self-Supervised Scale-Aware Distance Estimation using Monocular Fisheye Camera for Autonomous Driving FisheyeDistanceNet produces sharp distance maps on distorted fisheye images.