Fisheye/Omnidirectional View
in Autonomous Driving
Yu Huang
Yu.huang07@gmail.com
Sunnyvale, California
Outline
• Graph-Based Classification of Omnidirectional Images
• Flat2Sphere: Learning Spherical Convolution for Fast Features from 360° Imagery
• Spherical CNNs
• Scene Understanding Networks for AD based on Around View Monitoring System
• Eliminating the Blind Spot: Adapting 3D Object Detection and Mono Depth Estimation to
360◦ Panoramic Imagery
• SphereNet: Learning Spherical Representations for Detection and Classification in
Omnidirectional Images
• FisheyeMODNet: Moving Object detection on Surround-view Cameras for AD
• OmniDRL: Robust Pedestrian Detection using DRL on Omnidirectional Cameras
• WoodScape: A multi-task, multi-camera fisheye dataset for AD
• FisheyeDistanceNet: Self-Supervised Scale-Aware Distance Estimation using Mono Fisheye
Camera for AD
Graph-Based Classification of Omnidirectional Images
• Omnidirectional cameras are widely used in such areas as robotics and virtual reality as they
provide a wide field of view.
• Their images are often processed with classical methods, which might unfortunately lead to
non-optimal solutions as these methods are designed for planar images that have different
geometrical properties than omnidirectional ones.
• Here, image classification is performed by taking into account the specific geometry of omnidirectional
cameras with graph-based representations.
• In particular, deep learning architectures for data on graphs are used.
• The graph is constructed in a principled way such that convolutional filters respond similarly
to the same pattern at different positions of the image, regardless of lens distortions.
• Reference: “Graph-based Isometry Invariant Representation Learning”, ICML, 2017
Graph-Based Classification of Omnidirectional Images
• Transformation Invariant Graph-based Network (TIGraNet):
• It takes as input images that are represented as signals on a grid graph and gives
classification labels as output.
• Briefly, this approach proposes a network of alternately stacked spectral convolutional and
dynamic pooling layers, which creates features that are equivariant to isometric
transformations.
• Further, the output of the last layer is processed by a statistical layer, which makes the
equivariant representation of data invariant to isometric transformations.
• Finally, the resulting feature vector is fed to a number of fully-connected layers and a
softmax layer, which outputs the probability distribution that the signal belongs to each of
the given classes.
• This transformation-invariant classification algorithm is extended to omnidirectional images
by incorporating the knowledge about the camera lens geometry in the graph structure.
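A minimal sketch (not the authors' code) of the building block behind a TIGraNet-style layer: a grid-graph Laplacian and a polynomial spectral filter applied to an image treated as a graph signal. The graph size, unit edge weights and filter order are illustrative assumptions; for an omnidirectional image the edge weights would instead encode the lens geometry, as described above.

```python
import numpy as np
import scipy.sparse as sp

def grid_graph_laplacian(h, w):
    """4-connected grid graph with unit edge weights; normalized Laplacian."""
    n = h * w
    rows, cols = [], []
    for y in range(h):
        for x in range(w):
            i = y * w + x
            if x + 1 < w:                       # horizontal edge
                rows += [i, i + 1]; cols += [i + 1, i]
            if y + 1 < h:                       # vertical edge
                rows += [i, i + w]; cols += [i + w, i]
    A = sp.coo_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))
    d = np.asarray(A.sum(axis=1)).ravel()
    D_inv_sqrt = sp.diags(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    return sp.eye(n) - D_inv_sqrt @ A @ D_inv_sqrt    # L = I - D^-1/2 A D^-1/2

def spectral_filter(L, y, alpha):
    """Apply F(L) = sum_k alpha_k L^k to the graph signal y (polynomial spectral filter)."""
    out, Lky = np.zeros_like(y), y.copy()
    for a in alpha:
        out += a * Lky
        Lky = L @ Lky
    return out

h, w = 16, 16
L = grid_graph_laplacian(h, w)
y0 = np.random.rand(h * w)                            # image flattened as a graph signal
feat = spectral_filter(L, y0, alpha=[0.5, -0.3, 0.1]) # order-2 filter response
```

Because the filter is a polynomial of the Laplacian, its response depends only on the local graph structure, which is what lets a geometry-aware graph make the response insensitive to where a pattern sits on the sphere.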
Graph-Based Classification of Omnidirectional Images
The graph construction method makes the filter response
similar regardless of the position of the pattern in an
image from an omnidirectional camera.
Graph-Based Classification of Omnidirectional Images
TIGraNet architecture. The network is composed of an alternation of spectral convolution layers Fl and dynamic
pooling layers Pl, followed by a statistical layer H, multiple fully-connected layers (FC) and a softmax operator
(SM). The input of the network is an image that is represented as a signal y0 on the grid-graph with Laplacian
matrix L. The output of the system is a label that corresponds to the most likely class for the input sample.
Graph-Based Classification of Omnidirectional Images
Example of the gnomonic projection. An object
from tangent plane Ti is projected to the sphere
at tangency point X0,i, which is defined by
spherical coordinates φi, θi. The point Xk,i is
defined by coordinates (xk,i, yk,i) on the plane.
Example of the equirectangular representation of
the image. On the left, the figure depicts the
original image on the tangent plane Ti; on the right,
it is projected to the points of the sphere. To build an
equirectangular image, the values of points on the
discrete regular grid are often approximated from
the values of projected points by interpolation.
Graph-Based Classification of Omnidirectional Images
a) Choose pattern p0, ..., p4 from an object on tangent plane Te at the
equator (φe = 0, θe = 0) (red points) and then, b) move this object
on the sphere by moving the tangent plane Ti to point (φi, θi). c)
Thus, the filter localized at tangency point (φi, θi) uses values pi,1,
pi,3 (blue points), which can be obtained by interpolation.
The goal is to develop a transformation
invariant system, which can recognize
the same object on different planes Ti
that are tangent to S at different points
(φi , θi ) without any extra training.
The challenge of building such a system
is to design a proper graph signal
representation that allows compensating
for the distortion effects that appear at
different elevations of S.
Graph-Based Classification of Omnidirectional Images
Comparison to the state-of-the-art methods on the ETH-80 dataset.
The architectures of the different methods are selected to feature a similar number
of convolutional filters and neurons in the fully-connected layers.
Flat2Sphere: Learning Spherical Convolution
for Fast Features from 360° Imagery
• While 360° cameras offer tremendous new possibilities in vision, graphics, and augmented
reality, the spherical images they produce make core feature extraction non-trivial.
• Convolutional neural networks (CNNs) trained on images from perspective cameras yield
“flat” filters, yet 360° images cannot be projected to a single plane without significant
distortion.
• A naive solution that repeatedly projects the viewing sphere to all tangent planes is accurate,
but much too computationally intensive for real problems.
• Flat2Sphere learns a spherical convolutional network that translates a planar CNN to process
360° imagery directly in its equirectangular projection.
• This approach learns to reproduce the flat filter outputs on 360° data, sensitive to the
varying distortion effects across the viewing sphere.
• The key benefits are 1) efficient feature extraction for 360° images and video, and 2) the
ability to leverage powerful pre-trained networks researchers have carefully honed (together
with massive labeled image training sets) for perspective images.
Flat2Sphere: Learning Spherical Convolution
for Fast Features from 360° Imagery
Strategies for applying CNNs to 360° images. Top: The 1st strategy unwraps the 360° input into a single planar image
using a global projection (equirectangular), then applies the CNN on the distorted planar image. Bottom: The 2nd
strategy samples multiple tangent planar projections to obtain multiple perspective images, to which the CNN is
applied independently to obtain local results for the original 360° image. Strategy I is fast but inaccurate; Strategy II is
accurate but slow. The approach learns to replicate flat filters on spherical imagery, offering both speed and accuracy.
Flat2Sphere: Learning Spherical Convolution
for Fast Features from 360° Imagery
Spherical convolution differs from an ordinary CNN. (a) The kernel weight in spherical convolution is tied only
along each row, and each kernel convolves along the row to generate 1D output. Note that the kernel size
differs at different rows and layers, and it expands near the top and bottom of the image. (b) Inverse
perspective projections P−1 to equirectangular projections at different polar angles θ. The same square image
will distort to different sizes and shapes depending on θ.
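The row-tied weight sharing above can be sketched as follows; this is a hedged, illustrative example, not the Flat2Sphere implementation. Kernel widths per row, channel counts and image size are assumptions; in the method itself the per-row kernel sizes are derived from the inverse perspective projection shown in (b).

```python
import torch
import torch.nn.functional as F

def row_tied_conv(x, row_kernels):
    """x: [B, C, H, W] equirectangular map; row_kernels[r]: [C_out, C, kh, kw_r] for output row r."""
    rows = []
    for r, k in enumerate(row_kernels):
        kh, kw = k.shape[-2:]
        # Pad so this row's output stays W wide and centered on input row r.
        src = F.pad(x, (kw // 2, kw // 2, kh // 2, kh // 2))
        patch = src[:, :, r:r + kh, :]            # the rows this kernel looks at
        rows.append(F.conv2d(patch, k))           # [B, C_out, 1, W]
    return torch.cat(rows, dim=2)                 # [B, C_out, H, W]

x = torch.randn(1, 3, 8, 16)
# Wider kernels near the top/bottom rows (poles), narrower near the equator.
widths = [9, 7, 5, 3, 3, 5, 7, 9]
kernels = [torch.randn(4, 3, 3, w) for w in widths]
y = row_tied_conv(x, kernels)                     # [1, 4, 8, 16]
```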
Flat2Sphere: Learning Spherical Convolution
for Fast Features from 360° Imagery
Object detection examples on 360° PASCAL test images. Images show the top 40% of equirectangular
projection; black regions are undefined pixels. Text gives predicted label, multi-class probability, and IoU, resp.
Spherical CNNs
• Convolutional Neural Networks (CNNs) have become the method of choice for learning
problems involving 2D planar images.
• However, a number of problems of recent interest have created a demand for models that
can analyze spherical images.
• Examples include omnidirectional vision for drones, robots, and autonomous cars, molecular
regression problems, and global weather and climate modelling.
• A naive application of convolutional networks to a planar projection of the spherical signal is
destined to fail, because the space-varying distortions introduced by such a projection will
make translational weight sharing ineffective.
• This work presents building blocks for constructing spherical CNNs.
• It defines a spherical cross-correlation that is both expressive and rotation-equivariant.
• The spherical correlation satisfies a generalized Fourier theorem, which allows it to be computed
efficiently using a generalized (non-commutative) Fast Fourier Transform (FFT) algorithm.
Spherical CNNs
• S2 and SO(3) correlations are defined by analogy to the classical planar Z2 correlation.
• The planar correlation can be understood as follows:
• The value of the output feature map at translation x ∈ Z2 is computed as an inner
product between the input feature map and a filter, shifted by x.
• Similarly, the spherical correlation can be understood as follows:
• The value of the output feature map evaluated at rotation R ∈ SO(3) is computed as an
inner product between the input feature map and a filter, rotated by R.
• For functions on the sphere and rotation group, there is an analogous transform, which is
referred to as generalized Fourier transform (GFT) and a corresponding fast algorithm (GFFT).
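A hedged restatement of the two correlations compared above (notation mirrors the bullets; the sum over input channels c and the measure on S2 are written out explicitly):

```latex
% Planar correlation: output value at translation x \in \mathbb{Z}^2
[\psi \star f](x) \;=\; \sum_{y \in \mathbb{Z}^2} \sum_{c} f_c(y)\, \psi_c(y - x)

% Spherical correlation: output value at rotation R \in SO(3)
[\psi \star f](R) \;=\; \int_{S^2} \sum_{c} f_c(x)\, \psi_c(R^{-1} x)\, dx
```

The GFT/GFFT mentioned above turns these correlations into block-wise products of Fourier coefficients, which is what the figure on the next slide illustrates.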
Spherical CNNs
Spherical correlation in the spectrum. The signal f and the locally-supported filter ψ are Fourier transformed,
block-wise tensored, summed over input channels, and finally inverse transformed. Note that because the
filter is locally supported, it is faster to use a matrix multiplication (DFT) than an FFT algorithm for it. It
parameterizes the sphere using spherical coordinates α, β, and SO(3) with ZYZ-Euler angles α, β, γ.
Scene Understanding Networks for Autonomous
Driving based on Around View Monitoring System
• Modern driver assistance systems rely on a wide range of sensors (RADAR, LIDAR,
ultrasound and cameras) for scene understanding and prediction.
• These sensors are typically used for detecting traffic participants and scene elements
required for navigation.
• Relying on camera-based systems, specifically the Around View Monitoring (AVM) system, has
great potential to achieve these goals in both parking and driving modes with decreased
costs.
• This is a new end-to-end solution for delimiting the safe drivable area for each frame by
means of identifying the closest obstacle in each direction from the driving vehicle;
• It calculates the distance to the nearest obstacles and is incorporated into a unified end-to-
end architecture capable of joint object detection, curb detection and safe drivable area
detection.
• Augmentation of the base architecture with 3D object detection.
Scene Understanding Networks for Autonomous
Driving based on Around View Monitoring System
This approach for detecting the curb and the free
drivable area is inspired by a Stixel representation
of the world. Originally, the network takes as input
each vertical column of an image. The input
columns that the network used had a width of 24
pixels, overlapping by 23 pixels. Each column would
then be passed through a convolutional network
to output one-of-k labels, with k being the height
dimension. As a result, it would learn to classify
the position of the bottom pixel of the obstacle
corresponding to that column. The union of all
columns would build either the curb or the free
drivable area of the scene.
Scene Understanding Networks for Autonomous
Driving based on Around View Monitoring System
• In this architecture, due to the overlapping between the columns, more than 95% of the
computation is redundant.
• Motivated by this observation, replace the column-wise network implementation with an
end-to-end architecture.
• This network encoded the image into a deep feature map using multiple convolutional
layers and then used multiple upsampling layers to generate a feature map having the same
resolution as the input image.
• Crop hardcoded regions of the image corresponding to the pixel columns augmented with
the neighboring area of 23 pixels.
• As a result, the regions of interest for cropping the upsampled feature map are 23 pixels
wide and 720 (height) pixels tall.
• Slide this window horizontally over the image at each x-coordinate.
• The resulting crops are then resized to a fixed size (e.g. 7x7) in the ROI pooling layer and
are classified into one-of-k classes (k is the height of the image), to ultimately predict the
bottom point.
Scene Understanding Networks for Autonomous
Driving based on Around View Monitoring System
Bottom prediction architecture using ROI pooling for each column
Use a single shot method for the final classification layer of the bottom prediction task. Moreover,
to make the network more efficient, replace the decoder part of the network corresponding to the
multiple upsample layers with a single dense horizontal upsampling layer. The resulting feature
map generated from the encoder after applying multiple convolutions with stride > 1 has a
resolution of [width/16, height/16], i.e. reduced 16 times from the original image size.
Scene Understanding Networks for Autonomous
Driving based on Around View Monitoring System
Finally, add another fully connected layer on top of the horizontal upsampling layer to make a linear combination of each
column’s input. A softmax is used to classify each of the resulting columns into one-of-k categories, where k is the height of
the image being predicted. Each column classification subtask automatically takes into account the pixels displayed in the
proximity of the center column being classified and represents the final bottom prediction.
Bottom-Net architecture
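A minimal PyTorch-style sketch of the bottom-prediction head described on the two slides above. The encoder channel count, the kernel/stride of the dense horizontal upsampling layer and the 1280x720 image size are illustrative assumptions, not the exact Bottom-Net.

```python
import torch
import torch.nn as nn

class BottomHead(nn.Module):
    """Per-column bottom-pixel classifier on top of an encoder feature map."""
    def __init__(self, in_ch=512, img_h=720, img_w=1280):
        super().__init__()
        # Dense horizontal upsampling: restore full width, keep the reduced height.
        self.h_upsample = nn.ConvTranspose2d(in_ch, 64, kernel_size=(1, 16), stride=(1, 16))
        # Linear combination of each column's features -> one-of-k classes (k = image height).
        self.column_fc = nn.Linear(64 * (img_h // 16), img_h)

    def forward(self, feat):                      # feat: [B, in_ch, H/16, W/16]
        x = self.h_upsample(feat)                 # [B, 64, H/16, W]
        x = x.permute(0, 3, 1, 2).flatten(2)      # [B, W, 64 * H/16]
        logits = self.column_fc(x)                # [B, W, H] per-column class scores
        return logits.softmax(dim=-1)

feat = torch.randn(1, 512, 720 // 16, 1280 // 16)
bottom_probs = BottomHead()(feat)                 # [1, 1280, 720]: one distribution per column
```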
Scene Understanding Networks for Autonomous
Driving based on Around View Monitoring System
Unified architectures which combine the bottom prediction and the object detection networks usually take
advantage of shared computation of the encoder for better training optimization and runtime performance.
Scene Understanding Networks for Autonomous
Driving based on Around View Monitoring System
The final architecture consists of two branches, for object orientation estimation based on angle
discretization and for object dimensions regression, respectively.
3D-Net architecture
Scene Understanding Networks for Autonomous
Driving based on Around View Monitoring System
Side view detections. (left) left view. (right) right view.
Scene Understanding Networks for Autonomous
Driving based on Around View Monitoring System
Captured frame from the
high accuracy solution.
SphereNet: Learning Spherical Representations for Detection
and Classification in Omnidirectional Images
• Omnidirectional cameras offer great benefits over classical cameras wherever a wide field of
view is essential, such as in virtual reality applications or in autonomous robots.
• Unfortunately, standard convolutional neural networks are not well suited for this scenario as
the natural projection surface is a sphere which cannot be unwrapped to a plane without
introducing significant distortions, particularly in the polar regions.
• SphereNet is a deep learning framework which encodes invariance against such distortions
explicitly into convolutional neural networks.
• Towards this goal, SphereNet adapts the sampling locations of the convolutional filters,
effectively reversing distortions, and wraps the filters around the sphere.
• By building on regular convolutions, SphereNet enables the transfer of existing perspective
convolutional neural network models to the omnidirectional case.
• On the tasks of image classification and object detection, it is evaluated on two newly created semi-
synthetic and real-world omnidirectional datasets.
SphereNet: Learning Spherical Representations for Detection
and Classification in Omnidirectional Images
Overview. (a+b) Capturing images with a fisheye or 360◦ action camera results in images which are
best represented on the sphere. (c) Using regular convolutions (e.g., with 3 × 3 filter kernels) on
the rectified equirectangular representation (see Fig. 2b) suffers from distortions of the
sampling locations (red) close to the poles. (d) In contrast, our SphereNet kernel exploits
projections (red) of the sampling pattern on the tangent plane (blue), yielding filter outputs
which are invariant to latitudinal rotations.
SphereNet: Learning Spherical Representations for Detection
and Classification in Omnidirectional Images
Kernel Sampling Pattern at φ = 0 (blue) and φ = 1.2 (red) in spherical (a) and equirectangular (b)
representation. Note the distortion of the kernel at φ = 1.2 in (b).
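The distorted sampling pattern in (b) can be reproduced with the inverse gnomonic projection: place the kernel's sampling points on the tangent plane and map them back onto the sphere. This is a hedged sketch assuming a 3x3 kernel and an illustrative tangent-plane step size; it uses the standard inverse gnomonic formula, not SphereNet's exact code.

```python
import numpy as np

def gnomonic_kernel_locations(phi0, lam0, delta=0.01, size=3):
    """(phi, lam) sampling locations of a size x size kernel centered at latitude phi0, longitude lam0."""
    r = (size - 1) // 2
    xs, ys = np.meshgrid(np.arange(-r, r + 1) * delta, np.arange(-r, r + 1) * delta)
    rho = np.sqrt(xs ** 2 + ys ** 2)
    c = np.arctan(rho)                                   # angular distance from the tangency point
    with np.errstate(invalid="ignore", divide="ignore"):
        phi = np.arcsin(np.cos(c) * np.sin(phi0) + ys * np.sin(c) * np.cos(phi0) / rho)
        lam = lam0 + np.arctan2(xs * np.sin(c),
                                rho * np.cos(phi0) * np.cos(c) - ys * np.sin(phi0) * np.sin(c))
    center = rho == 0                                    # the kernel center maps to itself
    phi[center], lam[center] = phi0, lam0
    return phi, lam

# Near the pole the same tangent-plane pattern spreads over a much wider longitude
# range, which is exactly the distortion visible in panel (b).
phi_eq, lam_eq = gnomonic_kernel_locations(0.0, 0.0)     # equator
phi_po, lam_po = gnomonic_kernel_locations(1.2, 0.0)     # close to the pole
```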
SphereNet: Learning Spherical Representations for Detection
and Classification in Omnidirectional Images
Uniform Sphere Sampling. Comparison of an equirectangular sampling grid on the sphere with N =
200 points (a) to an approximation of evenly distributing N = 127 sampling points on a sphere with
the Saff-Kuijlaars method (b, c). Note that the sampling points at the poles are much more evenly
spaced in the uniform sphere sampling (b) compared to the equirectangular representation (a)
which oversamples the image in these regions.
SphereNet: Learning Spherical Representations for Detection
and Classification in Omnidirectional Images
• SphereNet can be integrated into a convolutional neural network for image classification by
adapting the sampling locations of the convolution and pooling kernels.
• Furthermore, it is straightforward to additionally utilize a uniform sphere sampling, which is
compared to nearest neighbor and bilinear interpolation on an equirectangular representation in
the experiments.
• The integration of SphereNet into an image classification network does not introduce novel
model parameters and no changes to the training of the network are required.
• In order to perform object detection on the sphere, the Spherical Single Shot MultiBox
Detector (Sphere-SSD) adapts the Single Shot MultiBox Detector (SSD) to objects located on
tangent planes of a sphere.
• SSD exploits a fully convolutional architecture, predicting category scores and box offsets for a
set of default anchor boxes of different scales and aspect ratios.
• Sphere-SSD uses a weighted sum between a localization loss and confidence loss.
• However, in contrast to the original SSD, anchor boxes are now placed on tangent planes of the
sphere and are defined in terms of spherical coordinates of their respective tangent plane, the
width/height of the box on the tangent plane as well as an in-plane rotation.
SphereNet: Learning Spherical Representations for Detection
and Classification in Omnidirectional Images
Spherical Anchor Boxes are gnomonic projections of 2D bounding boxes of various scales, aspect
ratios and orientations on tangent planes of the sphere. The figure visualizes anchors of the same
orientation at different scales and aspect ratios on a 16 × 8 feature map on a sphere (a) and an
equirectangular grid (b).
SphereNet: Learning Spherical Representations for Detection
and Classification in Omnidirectional Images
Detection Results on FlyingCars Dataset. The ground truth is shown in green, SphereNet (NN) results in red.
Eliminating the Blind Spot: Adapting 3D Object Detection and
Monocular Depth Estimation to 360◦ Panoramic Imagery
• Recent automotive vision work has focused on processing forward-facing cameras.
• However, future autonomous vehicles will not be viable without a more comprehensive
surround sensing, akin to a human driver, as can be provided by 360◦ panoramic cameras.
• Here is an approach to adapt contemporary deep network architectures developed on
conventional rectilinear imagery to work on equirectangular 360◦ panoramic imagery.
• To address the lack of annotated panoramic automotive datasets, it adapts a
contemporary automotive dataset, via style and projection transformations, to facilitate the
cross-domain retraining of contemporary algorithms for panoramic imagery.
• Following this approach, it retrains and adapts existing architectures to recover scene depth
and 3D pose of vehicles from monocular panoramic imagery without any panoramic training
labels or calibration parameters.
• This approach is evaluated qualitatively on crowd-sourced panoramic images and
quantitatively using an automotive environment simulator to provide the first benchmark for
such techniques within panoramic imagery.
Eliminating the Blind Spot: Adapting 3D Object Detection and
Monocular Depth Estimation to 360◦ Panoramic Imagery
Panoramic images are typically represented using an equirectangular projection (A); in contrast, a
conventional camera uses a rectilinear projection. In this projection, the image-space coordinates are
proportional to the latitude and longitude of observed points rather than the usual projection onto a focal plane.
The adapted networks recover monocular depth and the full 3D pose of vehicles (B) from panoramic imagery.
Eliminating the Blind Spot: Adapting 3D Object Detection and
Monocular Depth Estimation to 360◦ Panoramic Imagery
Convolutions are computed seamlessly across horizontal image boundaries using the padding approach.
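A hedged sketch of that padding idea: wrap the feature map horizontally (the 360° seam) before convolving, and zero-pad vertically. Kernel size and channel counts are illustrative, not the paper's architecture.

```python
import torch
import torch.nn.functional as F

def pano_conv2d(x, weight, bias=None, pad=1):
    """x: [B, C, H, W] equirectangular feature map; weight: [C_out, C, k, k]."""
    x = F.pad(x, (pad, pad, 0, 0), mode="circular")             # wrap across the horizontal seam
    x = F.pad(x, (0, 0, pad, pad), mode="constant", value=0.0)  # ordinary padding at the poles
    return F.conv2d(x, weight, bias)

x = torch.randn(1, 3, 256, 512)
w = torch.randn(16, 3, 3, 3)
y = pano_conv2d(x, w)          # [1, 16, 256, 512], seamless across the left/right border
```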
Eliminating the Blind Spot: Adapting 3D Object Detection and
Monocular Depth Estimation to 360◦ Panoramic Imagery
Monocular depth recovery and
3D object detection with our
approach. Left: Real-world
images. Right: Synthetic images.
FisheyeMODNet: Moving Object detection on
Surround-view Cameras for Autonomous Driving
• Moving Object Detection is an important task for achieving robust autonomous driving.
• An autonomous vehicle has to estimate collision risk with other interacting objects in
the environment and calculate an optimal trajectory.
• Collision risk is typically higher for moving objects than static ones due to the need to
estimate the future states and poses of the objects for decision making.
• This is particularly important for near-range objects around the vehicle which are typically
detected by a fisheye surround-view system that captures a 360◦ view of the scene.
• This work presents a CNN architecture for moving object detection using fisheye images
captured in an autonomous driving environment.
• To target embedded deployment, it designs a lightweight encoder sharing weights across
sequential images.
FisheyeMODNet: Moving Object detection on
Surround-view Cameras for Autonomous Driving
Images from the surround-view camera network showing near-field sensing and a wide
field of view. Four fisheye cameras (marked green) provide a 360◦ surround view.
FisheyeMODNet: Moving Object detection on
Surround-view Cameras for Autonomous Driving
Network architecture adapted from the ShuffleSeg base network. Two sequential images encoding
the motion information across time are utilized to train the network end-to-end for MOD.
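A minimal sketch of the two-frame, shared-weight idea (layer sizes and the simple fuse/decode stages are assumptions; the actual network is built on ShuffleSeg, which is not reproduced here).

```python
import torch
import torch.nn as nn

class TinyMODNet(nn.Module):
    """Two consecutive frames -> moving/static mask, with one encoder shared across frames."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.encoder = nn.Sequential(                     # shared weights for both frames
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.fuse = nn.Conv2d(64, 32, 1)                  # fuse motion information across time
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(16, num_classes, 4, stride=2, padding=1))

    def forward(self, frame_t0, frame_t1):
        f0, f1 = self.encoder(frame_t0), self.encoder(frame_t1)   # same weights, two time steps
        return self.decoder(self.fuse(torch.cat([f0, f1], dim=1)))

t0, t1 = torch.randn(1, 3, 128, 256), torch.randn(1, 3, 128, 256)
mask_logits = TinyMODNet()(t0, t1)                        # [1, 2, 128, 256]
```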
OmniDRL: Robust Pedestrian Detection using Deep
Reinforcement Learning on Omnidirectional Cameras
• Pedestrian detection is one of the most explored topics in computer vision and robotics.
• Deep Reinforcement Learning (DRL) has proved to be among the state-of-the-art (SoA) for both detection in
perspective cameras and robotics applications.
• However, for detection in omnidirectional cameras, the literature is still scarce, mostly
because of their high levels of distortion.
• This is an efficient technique for robust pedestrian detection in omnidirectional images.
• The method uses deep RL that takes advantage of the distortion in the image.
• By considering the 3D bounding boxes and their distorted projections into the image, this
method is able to provide the pedestrian’s position in the world, in contrast to the image
positions provided by most SoA methods for perspective cameras.
• The method avoids the need for pre-processing steps to remove the distortion, which is
computationally expensive.
OmniDRL: Robust Pedestrian Detection using Deep
Reinforcement Learning on Omnidirectional Cameras
Illustration of the method, using a
Multi-task network, for pedestrian
detection in omnidirectional cameras.
The input is an omnidirectional image
with an initial state of the bounding
box, represented in the world
coordinate system. Using this
information, a set of possible actions
are applied in order to detect the
pedestrian in the 3D environment.
After the trigger is activated, the line
segments of the estimated 3D bounding
box are projected to the
omnidirectional image. Then, the IoU
between the ground truth and our
estimation is computed in the image
coordinates.
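A hedged sketch of the detection-as-RL loop the caption describes. The discrete action set, step sizes and the IoU-improvement reward are illustrative assumptions; the paper's exact actions, multi-task network and training procedure are not reproduced. `q_network` and `project_and_iou` are hypothetical callables standing in for the learned Q-function and the 3D-box-to-image projection plus IoU computation.

```python
import numpy as np

ACTIONS = ["+x", "-x", "+y", "-y", "+z", "-z", "grow", "shrink", "trigger"]

def apply_action(box, action, step=0.1):
    """box: np.ndarray [X, Y, Z, w, h, l] of a 3D bounding box in world coordinates."""
    box = box.copy()
    if action == "trigger":
        return box, True
    axis = {"+x": 0, "-x": 0, "+y": 1, "-y": 1, "+z": 2, "-z": 2}
    if action in axis:
        box[axis[action]] += step if action.startswith("+") else -step
    elif action == "grow":
        box[3:] *= 1.0 + step
    else:                                   # "shrink"
        box[3:] *= 1.0 - step
    return box, False

def episode(q_network, box, project_and_iou, max_steps=50):
    """Greedy roll-out: adjust the 3D box until 'trigger'; reward is the IoU improvement in the image."""
    prev_iou, total_reward = project_and_iou(box), 0.0
    for _ in range(max_steps):
        action = ACTIONS[int(np.argmax(q_network(box)))]
        box, done = apply_action(box, action)
        iou = project_and_iou(box)          # IoU measured in image coordinates, as above
        total_reward += iou - prev_iou
        prev_iou = iou
        if done:
            break
    return box, total_reward
```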
OmniDRL: Robust Pedestrian Detection using Deep
Reinforcement Learning on Omnidirectional Cameras
Depiction of the scheme of the proposed
network, where the first convolutional layers
are shared, and then split into branches (DQN
and Classification).
OmniDRL: Robust Pedestrian Detection using Deep
Reinforcement Learning on Omnidirectional Cameras
This figure shows the image formation using unified central catadioptric cameras. (a) the projection of a
point R ∈ R3 onto the normalized image plane {i−, i+} (intermediate projection on the unitary sphere {n− ,
n+ }). (b) the projection of 3D straight line segments for images using this model (x1 and x2 are the endpoints of
the line segment).
WoodScape: A multi-task, multi-camera
fisheye dataset for autonomous driving
• Fisheye cameras are commonly employed for obtaining a large field of view in surveillance,
augmented reality and in particular automotive applications.
• In spite of their prevalence, there are few public datasets for detailed evaluation of computer
vision algorithms on fisheye images.
• The 1st extensive fisheye automotive dataset, WoodScape, named after Robert Wood who
invented the fisheye camera in 1906.
• WoodScape comprises 4 surround-view cameras and nine tasks including segmentation,
depth estimation, 3D bounding box detection and soiling detection.
• Semantic annotation of 40 classes at the instance level is provided for over 10,000 images
and annotations for other tasks are provided for over 100,000 images.
WoodScape: A multi-task, multi-camera
fisheye dataset for autonomous driving
WoodScape, the first fisheye image dataset dedicated to autonomous driving. It contains four cameras covering
360°, accompanied by an HD laser scanner, IMU and GNSS. Annotations are made available for nine tasks, notably
3D object detection, depth estimation (overlaid on the front camera) and semantic segmentation.
WoodScape: A multi-task, multi-camera
fisheye dataset for autonomous driving
Comparison of fisheye models.
WoodScape: A multi-task, multi-camera
fisheye dataset for autonomous driving
Undistorting the fisheye image: (a)
Rectilinear correction; (b) Piecewise
linear correction; (c) Cylindrical
correction. Left: raw image; Right:
undistorted image.
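A hedged sketch of cylindrical correction, assuming a simple equidistant fisheye model (r = f_fisheye * theta); WoodScape's actual calibration uses a higher-order polynomial model, so this is only illustrative. Each output pixel of the cylindrical image is mapped to a 3D ray and then back into the raw fisheye image.

```python
import numpy as np
import cv2

def cylindrical_undistort(fisheye_img, f_fisheye, f_cyl, out_w, out_h):
    cx, cy = fisheye_img.shape[1] / 2.0, fisheye_img.shape[0] / 2.0
    u, v = np.meshgrid(np.arange(out_w), np.arange(out_h))
    azimuth = (u - out_w / 2.0) / f_cyl                  # cylindrical: u = f * azimuth
    elevation = np.arctan((v - out_h / 2.0) / f_cyl)     # cylindrical: v = f * tan(elevation)
    # 3D ray for each output pixel (camera looks along +z).
    x = np.cos(elevation) * np.sin(azimuth)
    y = np.sin(elevation)
    z = np.cos(elevation) * np.cos(azimuth)
    theta = np.arccos(np.clip(z, -1.0, 1.0))             # angle to the optical axis
    r = f_fisheye * theta                                # equidistant fisheye model (assumed)
    norm = np.sqrt(x ** 2 + y ** 2) + 1e-9
    map_x = (cx + r * x / norm).astype(np.float32)
    map_y = (cy + r * y / norm).astype(np.float32)
    return cv2.remap(fisheye_img, map_x, map_y, cv2.INTER_LINEAR)
```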
WoodScape: A multi-task, multi-camera
fisheye dataset for autonomous driving
Segmentation using ENet (top) and Object detection using Faster RCNN (bottom).
FisheyeDistanceNet: Self-Supervised Scale-Aware Distance
Estimation using Monocular Fisheye Camera for Autonomous Driving
• Fisheye cameras are commonly used in applications like autonomous driving and
surveillance to provide a large field of view (> 180◦).
• However, they come at the cost of strong non-linear distortion, which requires more complex
algorithms.
• Here, Euclidean distance estimation is performed on fisheye cameras for automotive scenes.
• Obtaining accurate and dense depth supervision is difficult in practice, but self-supervised
learning approaches show promising results and could potentially overcome the problem.
• This is a self-supervised scale-aware framework for learning Euclidean distance and ego-
motion from raw monocular fisheye videos without applying rectification.
• While it is possible to perform a piece-wise linear approximation of the fisheye projection surface
and apply standard rectilinear models, this has its own set of issues, like re-sampling distortion
and discontinuities in transition regions.
FisheyeDistanceNet: Self-Supervised Scale-Aware Distance
Estimation using Monocular Fisheye Camera for Autonomous Driving
Overview: the 1st row represents ego masks Mt-1,
Mt+1, which indicate which pixel coordinates are valid
when constructing It−1 from It and It from It+1,
respectively. The 2nd row indicates the masking of
static pixels computed after 2 epochs, where black
pixels are filtered from the photometric loss (i.e. σ
= 0). It prevents dynamic objects moving at a speed
similar to the ego car and low-texture regions from
contaminating the loss. The masks are computed
for forward and backward sequences from the
input sequence S and reconstructed images. The
3rd row represents the distance estimates
corresponding to their input frames. Finally, the
vehicle’s odometry data is used to resolve the
scale factor issue.
FisheyeDistanceNet: Self-Supervised Scale-Aware Distance
Estimation using Monocular Fisheye Camera for Autonomous Driving
• The overall self-supervised structure-from-motion (SfM) objective consists of a photometric loss term
Lp, imposed between the reconstructed target image Iˆt and the target image It, and a
distance regularization term Ls ensuring edge-aware smoothing of the distance estimates (both
terms are sketched after this list).
• Finally, Ldc is a cross-sequence distance consistency loss derived from the chain of frames in the
training sequence S.
• To prevent the training objective from getting stuck in local minima due to the gradient
locality of the bilinear sampler, 4 scales are adopted to train the network.
• The distance estimation network is mainly based on the U-Net architecture, an encoder-
decoder network with skip connections.
• After testing different variants of the ResNet family, a ResNet18 is chosen as the encoder.
• The key aspect is replacing regular convolutions with deformable convolutions, since regular CNNs are inherently
limited in modeling large, unknown geometric distortions due to their fixed structures, such
as fixed filter kernels, fixed receptive field sizes, and fixed pooling kernels.
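A hedged sketch of the photometric term Lp and the edge-aware smoothness term Ls described at the top of this list. The SSIM/L1 weighting, the SSIM constants and the mean normalization of the distance map are common Monodepth2-style choices assumed here, not necessarily the paper's exact values.

```python
import torch
import torch.nn.functional as F

def photometric_loss(target, recon, alpha=0.85):
    """Per-pixel L_p between the target image and its reconstruction (SSIM + L1 mix)."""
    l1 = (target - recon).abs().mean(1, keepdim=True)
    mu_x, mu_y = F.avg_pool2d(target, 3, 1, 1), F.avg_pool2d(recon, 3, 1, 1)
    sigma_x = F.avg_pool2d(target ** 2, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(recon ** 2, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(target * recon, 3, 1, 1) - mu_x * mu_y
    ssim = ((2 * mu_x * mu_y + 1e-4) * (2 * sigma_xy + 9e-4)) / (
        (mu_x ** 2 + mu_y ** 2 + 1e-4) * (sigma_x + sigma_y + 9e-4))
    ssim_term = ((1 - ssim.clamp(-1, 1)) / 2).mean(1, keepdim=True)
    return alpha * ssim_term + (1 - alpha) * l1

def smoothness_loss(dist, img):
    """Edge-aware smoothness L_s on the mean-normalized distance map."""
    d = dist / (dist.mean(dim=(2, 3), keepdim=True) + 1e-7)
    dx = (d[:, :, :, :-1] - d[:, :, :, 1:]).abs()
    dy = (d[:, :, :-1, :] - d[:, :, 1:, :]).abs()
    ix = (img[:, :, :, :-1] - img[:, :, :, 1:]).abs().mean(1, keepdim=True)
    iy = (img[:, :, :-1, :] - img[:, :, 1:, :]).abs().mean(1, keepdim=True)
    return (dx * torch.exp(-ix)).mean() + (dy * torch.exp(-iy)).mean()
```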
FisheyeDistanceNet: Self-Supervised Scale-Aware Distance
Estimation using Monocular Fisheye Camera for Autonomous Driving
• The backbone of the pose estimation network is based on the paper “Digging into self-supervised
monocular depth estimation”, which predicts rotation using an Euler angle parameterization.
• Replace normal convolutions with deformable convolutions for the encoder-decoder setting.
• Predict the rotation using an axis-angle representation, and scale the rotation and
translation outputs by 0.01.
• For monocular training, use a sequence length of three frames, while the pose network is
formed from a ResNet18, modified to accept a pair of color images (or six channels) as input
and to predict a single 6-DoF relative pose between It−1→t and It→t−1.
• Perform horizontal flips and the following training augmentations: random brightness, contrast,
saturation, and hue jitter with respective ranges of ±0.2, ±0.2, ±0.2, and ±0.1.
• Importantly, the color augmentations are only applied to the images which are fed to the
networks, not to those used to compute the photometric loss term Lp.
• All 3 images fed to the pose and depth networks are augmented with the same parameters.
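A hedged sketch of this augmentation policy, assuming a recent torchvision where ColorJitter accepts tensor images; the 0.5 flip probability and the width-concatenation trick used to draw one set of jitter parameters for all three frames are assumptions of this sketch.

```python
import random
import torch
from torchvision import transforms

def augment_triplet(frames, flip_p=0.5):
    """frames: list of three [3, H, W] tensors (I_{t-1}, I_t, I_{t+1}), values in [0, 1]."""
    if random.random() < flip_p:                          # geometric aug applies everywhere
        frames = [torch.flip(f, dims=[-1]) for f in frames]
    loss_frames = [f.clone() for f in frames]             # unaugmented copies for the photometric loss
    jitter = transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1)
    stacked = torch.cat(frames, dim=-1)                   # one jitter draw shared by all 3 frames
    net_frames = list(torch.chunk(jitter(stacked), 3, dim=-1))
    return net_frames, loss_frames                        # network inputs vs. loss images
```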
FisheyeDistanceNet: Self-Supervised Scale-Aware Distance
Estimation using Monocular Fisheye Camera for Autonomous Driving
(a) Depth network: U-Net. (b) Pose network: A separate pose network. (c) Per-pixel minimum reprojection: When
correspondences are good, the reprojection loss should be low. (d) Full-resolution multi-scale: Upsample depth
predictions at intermediate layers and compute all losses at the input resolution, reducing texture-copy artifacts.
FisheyeDistanceNet: Self-Supervised Scale-Aware Distance
Estimation using Monocular Fisheye Camera for Autonomous Driving
FisheyeDistanceNet produces sharp distance maps on distorted fisheye images.
Fisheye Omnidirectional View in Autonomous Driving

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

SfM Learner系単眼深度推定手法について
SfM Learner系単眼深度推定手法についてSfM Learner系単眼深度推定手法について
SfM Learner系単眼深度推定手法について
 
PR-302: NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
PR-302: NeRF: Representing Scenes as Neural Radiance Fields for View SynthesisPR-302: NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
PR-302: NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
 
Image processing.pdf
Image processing.pdfImage processing.pdf
Image processing.pdf
 
GRPHICS06 - Shading
GRPHICS06 - ShadingGRPHICS06 - Shading
GRPHICS06 - Shading
 
Image enhancement techniques
Image enhancement techniquesImage enhancement techniques
Image enhancement techniques
 
U-Net (1).pptx
U-Net (1).pptxU-Net (1).pptx
U-Net (1).pptx
 
カメラでの偏光取得における円偏光と位相遅延の考え方
カメラでの偏光取得における円偏光と位相遅延の考え方カメラでの偏光取得における円偏光と位相遅延の考え方
カメラでの偏光取得における円偏光と位相遅延の考え方
 
NetVLAD: CNN architecture for weakly supervised place recognition
NetVLAD:  CNN architecture for weakly supervised place recognitionNetVLAD:  CNN architecture for weakly supervised place recognition
NetVLAD: CNN architecture for weakly supervised place recognition
 
semantic segmentation サーベイ
semantic segmentation サーベイsemantic segmentation サーベイ
semantic segmentation サーベイ
 
CV_Chap 6 Motion Representation
CV_Chap 6 Motion RepresentationCV_Chap 6 Motion Representation
CV_Chap 6 Motion Representation
 
Depth estimation using deep learning
Depth estimation using deep learningDepth estimation using deep learning
Depth estimation using deep learning
 
Region based segmentation
Region based segmentationRegion based segmentation
Region based segmentation
 
【DL輪読会】Monocular real time volumetric performance capture
【DL輪読会】Monocular real time volumetric performance capture 【DL輪読会】Monocular real time volumetric performance capture
【DL輪読会】Monocular real time volumetric performance capture
 
fusion of Camera and lidar for autonomous driving II
fusion of Camera and lidar for autonomous driving IIfusion of Camera and lidar for autonomous driving II
fusion of Camera and lidar for autonomous driving II
 
Ray Tracing in Computer Graphics
Ray Tracing in Computer GraphicsRay Tracing in Computer Graphics
Ray Tracing in Computer Graphics
 
Image Degradation & Resoration
Image Degradation & ResorationImage Degradation & Resoration
Image Degradation & Resoration
 
Passive stereo vision with deep learning
Passive stereo vision with deep learningPassive stereo vision with deep learning
Passive stereo vision with deep learning
 
Image Retrieval (D4L5 2017 UPC Deep Learning for Computer Vision)
Image Retrieval (D4L5 2017 UPC Deep Learning for Computer Vision)Image Retrieval (D4L5 2017 UPC Deep Learning for Computer Vision)
Image Retrieval (D4L5 2017 UPC Deep Learning for Computer Vision)
 
Semantic Segmentation - Fully Convolutional Networks for Semantic Segmentation
Semantic Segmentation - Fully Convolutional Networks for Semantic SegmentationSemantic Segmentation - Fully Convolutional Networks for Semantic Segmentation
Semantic Segmentation - Fully Convolutional Networks for Semantic Segmentation
 
Image Processing using Matlab ( using a built in Highboost filtering,averagin...
Image Processing using Matlab ( using a built in Highboost filtering,averagin...Image Processing using Matlab ( using a built in Highboost filtering,averagin...
Image Processing using Matlab ( using a built in Highboost filtering,averagin...
 

Semelhante a Fisheye Omnidirectional View in Autonomous Driving

Report bep thomas_blanken
Report bep thomas_blankenReport bep thomas_blanken
Report bep thomas_blanken
xepost
 
Urban 3D Semantic Modelling Using Stereo Vision, ICRA 2013
Urban 3D Semantic Modelling Using Stereo Vision, ICRA 2013Urban 3D Semantic Modelling Using Stereo Vision, ICRA 2013
Urban 3D Semantic Modelling Using Stereo Vision, ICRA 2013
Sunando Sengupta
 
Super Resolution of Image
Super Resolution of ImageSuper Resolution of Image
Super Resolution of Image
Satheesh K
 

Semelhante a Fisheye Omnidirectional View in Autonomous Driving (20)

Tomographic reconstruction in nuclear medicine
Tomographic reconstruction in nuclear medicineTomographic reconstruction in nuclear medicine
Tomographic reconstruction in nuclear medicine
 
Poster_Final
Poster_FinalPoster_Final
Poster_Final
 
Introduction to Real Time Rendering
Introduction to Real Time RenderingIntroduction to Real Time Rendering
Introduction to Real Time Rendering
 
DICTA 2017 poster
DICTA 2017 posterDICTA 2017 poster
DICTA 2017 poster
 
Fisheye Omnidirectional View in Autonomous Driving II
Fisheye Omnidirectional View in Autonomous Driving IIFisheye Omnidirectional View in Autonomous Driving II
Fisheye Omnidirectional View in Autonomous Driving II
 
Report bep thomas_blanken
Report bep thomas_blankenReport bep thomas_blanken
Report bep thomas_blanken
 
Optic flow estimation with deep learning
Optic flow estimation with deep learningOptic flow estimation with deep learning
Optic flow estimation with deep learning
 
CT Image reconstruction
CT Image reconstructionCT Image reconstruction
CT Image reconstruction
 
Image reconstruction
Image reconstructionImage reconstruction
Image reconstruction
 
Vf sift
Vf siftVf sift
Vf sift
 
N045077984
N045077984N045077984
N045077984
 
The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)
 
SIGGRAPH 2014 Course on Computational Cameras and Displays (part 4)
SIGGRAPH 2014 Course on Computational Cameras and Displays (part 4)SIGGRAPH 2014 Course on Computational Cameras and Displays (part 4)
SIGGRAPH 2014 Course on Computational Cameras and Displays (part 4)
 
Urban 3D Semantic Modelling Using Stereo Vision, ICRA 2013
Urban 3D Semantic Modelling Using Stereo Vision, ICRA 2013Urban 3D Semantic Modelling Using Stereo Vision, ICRA 2013
Urban 3D Semantic Modelling Using Stereo Vision, ICRA 2013
 
E0343034
E0343034E0343034
E0343034
 
TransNeRF
TransNeRFTransNeRF
TransNeRF
 
Super Resolution of Image
Super Resolution of ImageSuper Resolution of Image
Super Resolution of Image
 
06 image features
06 image features06 image features
06 image features
 
998-isvc16
998-isvc16998-isvc16
998-isvc16
 
Multiple UGV SLAM Map Sharing
Multiple UGV SLAM Map SharingMultiple UGV SLAM Map Sharing
Multiple UGV SLAM Map Sharing
 

Mais de Yu Huang

Mais de Yu Huang (20)

Application of Foundation Model for Autonomous Driving
Application of Foundation Model for Autonomous DrivingApplication of Foundation Model for Autonomous Driving
Application of Foundation Model for Autonomous Driving
 
The New Perception Framework in Autonomous Driving: An Introduction of BEV N...
The New Perception Framework  in Autonomous Driving: An Introduction of BEV N...The New Perception Framework  in Autonomous Driving: An Introduction of BEV N...
The New Perception Framework in Autonomous Driving: An Introduction of BEV N...
 
Data Closed Loop in Simulation Test of Autonomous Driving
Data Closed Loop in Simulation Test of Autonomous DrivingData Closed Loop in Simulation Test of Autonomous Driving
Data Closed Loop in Simulation Test of Autonomous Driving
 
Techniques and Challenges in Autonomous Driving
Techniques and Challenges in Autonomous DrivingTechniques and Challenges in Autonomous Driving
Techniques and Challenges in Autonomous Driving
 
BEV Joint Detection and Segmentation
BEV Joint Detection and SegmentationBEV Joint Detection and Segmentation
BEV Joint Detection and Segmentation
 
BEV Object Detection and Prediction
BEV Object Detection and PredictionBEV Object Detection and Prediction
BEV Object Detection and Prediction
 
Fisheye based Perception for Autonomous Driving VI
Fisheye based Perception for Autonomous Driving VIFisheye based Perception for Autonomous Driving VI
Fisheye based Perception for Autonomous Driving VI
 
Fisheye/Omnidirectional View in Autonomous Driving V
Fisheye/Omnidirectional View in Autonomous Driving VFisheye/Omnidirectional View in Autonomous Driving V
Fisheye/Omnidirectional View in Autonomous Driving V
 
Fisheye/Omnidirectional View in Autonomous Driving IV
Fisheye/Omnidirectional View in Autonomous Driving IVFisheye/Omnidirectional View in Autonomous Driving IV
Fisheye/Omnidirectional View in Autonomous Driving IV
 
Prediction,Planninng & Control at Baidu
Prediction,Planninng & Control at BaiduPrediction,Planninng & Control at Baidu
Prediction,Planninng & Control at Baidu
 
Cruise AI under the Hood
Cruise AI under the HoodCruise AI under the Hood
Cruise AI under the Hood
 
LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)
LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)
LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)
 
Scenario-Based Development & Testing for Autonomous Driving
Scenario-Based Development & Testing for Autonomous DrivingScenario-Based Development & Testing for Autonomous Driving
Scenario-Based Development & Testing for Autonomous Driving
 
How to Build a Data Closed-loop Platform for Autonomous Driving?
How to Build a Data Closed-loop Platform for Autonomous Driving?How to Build a Data Closed-loop Platform for Autonomous Driving?
How to Build a Data Closed-loop Platform for Autonomous Driving?
 
Annotation tools for ADAS & Autonomous Driving
Annotation tools for ADAS & Autonomous DrivingAnnotation tools for ADAS & Autonomous Driving
Annotation tools for ADAS & Autonomous Driving
 
Simulation for autonomous driving at uber atg
Simulation for autonomous driving at uber atgSimulation for autonomous driving at uber atg
Simulation for autonomous driving at uber atg
 
Multi sensor calibration by deep learning
Multi sensor calibration by deep learningMulti sensor calibration by deep learning
Multi sensor calibration by deep learning
 
Prediction and planning for self driving at waymo
Prediction and planning for self driving at waymoPrediction and planning for self driving at waymo
Prediction and planning for self driving at waymo
 
Jointly mapping, localization, perception, prediction and planning
Jointly mapping, localization, perception, prediction and planningJointly mapping, localization, perception, prediction and planning
Jointly mapping, localization, perception, prediction and planning
 
Data pipeline and data lake for autonomous driving
Data pipeline and data lake for autonomous drivingData pipeline and data lake for autonomous driving
Data pipeline and data lake for autonomous driving
 

Último

Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Christo Ananth
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
rknatarajan
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
dollysharma2066
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Dr.Costas Sachpazis
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
dharasingh5698
 

Último (20)

University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdf
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - V
 
Vivazz, Mieres Social Housing Design Spain
Vivazz, Mieres Social Housing Design SpainVivazz, Mieres Social Housing Design Spain
Vivazz, Mieres Social Housing Design Spain
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
 
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
 
Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.ppt
 
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
 
UNIT-IFLUID PROPERTIES & FLOW CHARACTERISTICS
UNIT-IFLUID PROPERTIES & FLOW CHARACTERISTICSUNIT-IFLUID PROPERTIES & FLOW CHARACTERISTICS
UNIT-IFLUID PROPERTIES & FLOW CHARACTERISTICS
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank  Design by Working Stress - IS Method.pdfIntze Overhead Water Tank  Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
 

Fisheye Omnidirectional View in Autonomous Driving

  • 1. Fisheye/Ominidirectional View in Autonomous Driving YuHuang Yu.huang07@gmail.com Sunnyvale,California
  • 2. Outline • Graph-Based Classification of Omnidirectional Images • Flat2Sphere: Learning Spherical Convolution for Fast Features from 360° Imagery • Spherical CNNs • Scene Understanding Networks for AD based on Around View Monitoring System • Eliminating the Blind Spot: Adapting 3D Object Detection and Mono Depth Estimation to 360◦ Panoramic Imagery • SphereNet: Learning Spherical Representations for Detection and Classification in Omnidirectional Images • FisheyeMODNet: Moving Object detection on Surround-view Cameras for AD • OmniDRL: Robust Pedestrian Detection using DRL on Omnidirectional Cameras • WoodScape: A multi-task, multi-camera fisheye dataset for AD • FisheyeDistanceNet: Self-Supervised Scale-Aware Distance Estimation using Mono Fisheye Camera for AD
  • 3. Graph-Based Classification of Omnidirectional Images • Omnidirectional cameras are widely used in such areas as robotics and virtual reality as they provide a wide field of view. • Their images are often processed with classical methods, which might unfortunately lead to non-optimal solutions as these methods are designed for planar images that have different geometrical properties than omnidirectional ones. • Here, image classification by taking into account the specific geometry of omnidirectional cameras with graph-based representations. • In particular, deep learning architectures for data on graphs. • It is a principled way of graph construction such that convolutional filters respond similarly for the same pattern on different positions of the image regardless of lens distortions. • Reference: “Graph-based Isometry Invariant Representation Learning”, ICML, 2017
  • 4. Graph-Based Classification of Omnidirectional Images • Transformation Invariant Graph-based Network (TIGraNet): • It takes as input images that are represented as signals on a grid graph and gives classification labels as output. • Briefly this approach proposes a network of alternatively stacked spectral convolutional and dynamic pooling layers, which creates features that are equivariant to the isometric transformation. • Further, the output of the last layer is processed by a statistical layer, which makes the equivariant representation of data invariant to isometric transformations. • Finally, the resulting feature vector is fed to a number of fully-connected layers and a softmax layer, which outputs the probability distribution that the signal belongs to each of the given classes. • This transformation-invariant classification algorithm is extended to omnidirectional images by incorporating the knowledge about the camera lens geometry in the graph structure.
  • 5. Graph-Based Classification of Omnidirectional Images The graph construction method makes response of the filter similar regardless of different position of the pattern on an image from an omnidirectional camera.
  • 6. Graph-Based Classification of Omnidirectional Images TIGraNet architecture. The network is composed of an alternation of spectral convolution layers Fl and dynamic pooling layers Pl, followed by a statistical layer H, multiple fully-connected layers (FC) and a softmax operator (SM). The input of the network is an image that is represented as a signal y0 on the grid-graph with Laplacian matrix L. The output of the system is a label that corresponds to the most likely class for the input sample.
  • 7. Graph-Based Classification of Omnidirectional Images Example of the gnomonic projection. An object from tangent plane Ti is projected to the sphere at tangency point X0,i, which is defined by spherical coordinates φi , θi . The point Xk,I is defined by coordinates (xk,i , yk,i ) on the plane. Example of the equirectangular representation of the image. On the left, the figure depicts the original image on the tangent plane Ti, on the right, projected to the points of the sphere. To build an equirectangular image the values points on the discrete regular grid are often approximated from the values of projected points by interpolation.
  • 8. Graph-Based Classification of Omnidirectional Images a) Choose pattern p0 , .., p4 from an object on tangent plane Te at equator (φe = 0, θe = 0) (red points) and then, b) move this object on the sphere by moving the tangent plane Ti to point (φi,θi). c) Thus, the filter localized at tangency point (φi , θi ) uses values pi,1 , pi,3 (blue points) which we can obtain by interpolation. The goal is to develop a transformation invariant system, which can recognize the same object on different planes Ti that are tangent to S at different points (φi , θi ) without any extra training. The challenge of building such a system is to design a proper graph signal representation that allow compensating for the distortion effects that appear on different elevations of S.
  • 9. Graph-Based Classification of Omnidirectional Images Comparison to the state-of-the-art methods on the ETH- 80 datasets. Select the architecture of different methods to feature similar number of convolutional filters and neurons in the fully-connected layers.
  • 10. Flat2Sphere: Learning Spherical Convolution for Fast Features from 360° Imagery • While 360° cameras offer tremendous new possibilities in vision, graphics, and augmented reality, the spherical images they produce make core feature extraction non-trivial. • Convolutional neural networks (CNNs) trained on images from perspective cameras yield “flat" filters, yet 360° images cannot be projected to a single plane without significant distortion. • A naive solution that repeatedly projects the viewing sphere to all tangent planes is accurate, but much too computationally intensive for real problems. • Flat2Sphere learns a spherical convolutional network that translates a planar CNN to process 360° imagery directly in its equirectangular projection. • This approach learns to reproduce the flat filter outputs on 360° data, sensitive to the varying distortion effects across the viewing sphere. • The key benefits are 1) efficient feature extraction for 360° images and video, and 2) the ability to leverage powerful pre-trained networks researchers have carefully honed (together with massive labeled image training sets) for perspective images.
  • 11. Flat2Sphere: Learning Spherical Convolution for Fast Features from 360° Imagery Strategies for applying CNNs to 360° images. Top: The 1st strategy unwraps the 360° input into a single planar image using a global projection (equirectangular), then applies the CNN on the distorted planar image. Bottom: The 2nd strategy samples multiple tangent planar projections to obtain multiple perspective images, to which the CNN is applied independently to obtain local results for the original 360° image. Strategy I is fast but inaccurate; Strategy II is accurate but slow. The approach learns to replicate flat filters on spherical imagery, offering both speed and accuracy.
  • 12. Flat2Sphere: Learning Spherical Convolution for Fast Features from 360° Imagery Spherical convolution differs from ordinary CNN. (a) The kernel weight in spherical convolution is tied only along each row, and each kernel convolves along the row to generate 1D output. Note that the kernel size differs at different rows and layers, and it expands near the top and bottom of the image. (b) Inverse perspective projections P−1 to equirectangular projections at different polar angles θ. The same square image will distort to different sizes and shapes depending on θ.
  • 13. Flat2Sphere: Learning Spherical Convolution for Fast Features from 360° Imagery Object detection examples on 360° PASCAL test images. Images show the top 40% of equirectangular projection; black regions are undefined pixels. Text gives predicted label, multi-class probability, and IoU, resp.
• 14. Spherical CNNs • Convolutional Neural Networks (CNNs) have become the method of choice for learning problems involving 2D planar images. • However, a number of problems of recent interest have created a demand for models that can analyze spherical images. • Examples include omnidirectional vision for drones, robots, and autonomous cars, molecular regression problems, and global weather and climate modelling. • A naive application of convolutional networks to a planar projection of the spherical signal is destined to fail, because the space-varying distortions introduced by such a projection will make translational weight sharing ineffective. • This work provides the building blocks for constructing spherical CNNs. • It defines a spherical cross-correlation that is both expressive and rotation-equivariant. • The spherical correlation satisfies a generalized Fourier theorem, which allows it to be computed efficiently using a generalized (non-commutative) Fast Fourier Transform (FFT) algorithm.
• 15. Spherical CNNs • The S2 and SO(3) correlations are defined by analogy to the classical planar Z2 correlation. • The planar correlation can be understood as follows: the value of the output feature map at translation x ∈ Z2 is computed as an inner product between the input feature map and a filter shifted by x. • Similarly, the spherical correlation can be understood as follows: the value of the output feature map evaluated at rotation R ∈ SO(3) is computed as an inner product between the input feature map and a filter rotated by R. • For functions on the sphere and the rotation group, there is an analogous transform, referred to as the generalized Fourier transform (GFT), with a corresponding fast algorithm (GFFT).
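Written out schematically (omitting normalization and following the inner-product description above; exact conventions as in the paper), the two correlations are

\[
[\psi \star f](x) \;=\; \sum_{y \in \mathbb{Z}^2} \sum_{k} \psi_k(y - x)\, f_k(y),
\qquad
[\psi \star f](R) \;=\; \int_{S^2} \sum_{k} \psi_k\!\left(R^{-1} u\right) f_k(u)\, \mathrm{d}u,
\quad R \in SO(3).
\]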
  • 16. Spherical CNNs Spherical correlation in the spectrum. The signal f and the locally-supported filter ψ are Fourier transformed, block-wise tensored, summed over input channels, and finally inverse transformed. Note that because the filter is locally supported, it is faster to use a matrix multiplication (DFT) than an FFT algorithm for it. It parameterizes the sphere using spherical coordinates α, β, and SO(3) with ZYZ-Euler angles α, β, γ.
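The generalized Fourier theorem exploited here can be stated, roughly, as: correlation in the signal domain becomes a block-wise matrix product in the spectrum,

\[
\widehat{\psi \star f}_{\,\ell} \;=\; \hat{f}_\ell\, \hat{\psi}_\ell^{\dagger},
\]

where \(\hat{f}_\ell\) and \(\hat{\psi}_\ell\) are the degree-ℓ generalized Fourier coefficients and † denotes the conjugate transpose (the precise normalization and block structure follow the paper's conventions).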
• 17. Scene Understanding Networks for Autonomous Driving based on Around View Monitoring System • Modern driver assistance systems rely on a wide range of sensors (RADAR, LIDAR, ultrasound and cameras) for scene understanding and prediction. • These sensors are typically used for detecting traffic participants and scene elements required for navigation. • Relying on camera-based systems, specifically the Around View Monitoring (AVM) system, has great potential to achieve these goals in both parking and driving modes with decreased costs. • This is a new end-to-end solution for delimiting the safe drivable area in each frame by identifying the closest obstacle in each direction from the driving vehicle. • It calculates the distance to the nearest obstacles and is incorporated into a unified end-to-end architecture capable of joint object detection, curb detection and safe drivable area detection. • The base architecture is further augmented with 3D object detection.
• 18. Scene Understanding Networks for Autonomous Driving based on Around View Monitoring System This approach for detecting the curb and the free drivable area is inspired by a Stixel representation of the world. Originally, the network takes as input each vertical column of an image. The input columns had width 24 and overlapped by 23 pixels. Each column is passed through a convolutional network that outputs one-of-k labels, with k being the height dimension. As a result, the network learns to classify the position of the bottom pixel of the obstacle corresponding to that column. The union of all columns builds either the curb or the free drivable area of the scene.
• 19. Scene Understanding Networks for Autonomous Driving based on Around View Monitoring System • In this architecture, due to the overlap between the columns, more than 95% of the computation is redundant. • Motivated by this observation, the column-wise network implementation is replaced with an end-to-end architecture. • This network encodes the image into a deep feature map using multiple convolutional layers and then uses multiple upsampling layers to generate a feature map with the same resolution as the input image. • Hardcoded regions of the image corresponding to the pixel columns, augmented with a neighboring area of 23 pixels, are cropped. • As a result, the regions of interest for cropping the upsampled feature map are 23 pixels wide and 720 (height) pixels tall. • This window is slid horizontally over the image at each x-coordinate. • The resulting crops are then resized to a fixed size (e.g. 7×7) in the ROI pooling layer and classified into one-of-k classes (k is the height of the image) to ultimately predict the bottom point.
• 20. Scene Understanding Networks for Autonomous Driving based on Around View Monitoring System Bottom prediction architecture using ROI pooling for each column. A single-shot method is used for the final classification layer of the bottom prediction task. Moreover, to make the network more efficient, the decoder part of the network corresponding to the multiple upsampling layers is replaced with a single dense horizontal upsampling layer. The feature map generated by the encoder after applying multiple convolutions with stride > 1 has a resolution of [width/16, height/16], i.e. it is reduced 16× relative to the original image size.
• 21. Scene Understanding Networks for Autonomous Driving based on Around View Monitoring System Finally, another fully connected layer is added on top of the horizontal upsampling layer to form a linear combination of each column's input. A softmax is used to classify each of the resulting columns into one-of-k categories, where k is the height of the image being predicted. Each column classification subtask automatically takes into account the pixels in the proximity of the center column being classified and represents the final bottom prediction. Bottom-Net architecture
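A minimal, hypothetical sketch of the column-wise classification head described above (class name, layer sizes and the way the vertical dimension is collapsed are assumptions, not the authors' exact configuration): a dense horizontal upsampling restores the full image width from the encoder width, and a softmax assigns each column to one-of-k (k = image height) bottom rows.

import torch
import torch.nn as nn

class BottomNetHead(nn.Module):
    """Illustrative bottom-prediction head (not the paper's exact layers).
    Input: encoder features (N, C, H/16, W/16). Output: per-column log-probabilities
    (N, W, k), where k is the image height (the candidate bottom row of each column)."""
    def __init__(self, in_ch, enc_h, enc_w, img_w, img_h, col_feat=128):
        super().__init__()
        self.img_w = img_w
        # dense horizontal upsampling: each encoder column emits features for
        # img_w // enc_w image columns (16 when the encoder downsamples by 16)
        self.col_embed = nn.Linear(in_ch * enc_h, col_feat * (img_w // enc_w))
        # maps each column's feature vector to one-of-k bottom-row logits
        # (the neighbourhood context described in the slides comes from the encoder)
        self.classify = nn.Linear(col_feat, img_h)

    def forward(self, feats):                                  # feats: (N, C, H/16, W/16)
        n, c, h, w = feats.shape
        cols = feats.permute(0, 3, 1, 2).reshape(n, w, c * h)  # one vector per encoder column
        up = self.col_embed(cols)                              # (N, W/16, col_feat * 16)
        up = up.reshape(n, self.img_w, -1)                     # (N, W, col_feat)
        logits = self.classify(up)                             # (N, W, img_h)
        return logits.log_softmax(dim=-1)                      # one-of-k per image column

# example sizing for a 1280x720 input with a /16 encoder (values assumed):
# head = BottomNetHead(in_ch=512, enc_h=45, enc_w=80, img_w=1280, img_h=720)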
  • 22. Scene Understanding Networks for Autonomous Driving based on Around View Monitoring System Unified architectures which combine the bottom prediction and the object detection networks usually take advantage of shared computation of the encoder for better training optimization and runtime performance.
  • 23. Scene Understanding Networks for Autonomous Driving based on Around View Monitoring System The final architecture consists of two branches, for object orientation estimation based on angle discretization and for object dimensions regression, respectively. 3D-Net architecture
  • 24. Scene Understanding Networks for Autonomous Driving based on Around View Monitoring System Side view detections. (left) left view. (right) right view.
  • 25. Scene Understanding Networks for Autonomous Driving based on Around View Monitoring System Captured frame from the high accuracy solution.
• 26. SphereNet: Learning Spherical Representations for Detection and Classification in Omnidirectional Images • Omnidirectional cameras offer great benefits over classical cameras wherever a wide field of view is essential, such as in virtual reality applications or in autonomous robots. • Unfortunately, standard convolutional neural networks are not well suited for this scenario as the natural projection surface is a sphere which cannot be unwrapped to a plane without introducing significant distortions, particularly in the polar regions. • SphereNet is a deep learning framework which encodes invariance against such distortions explicitly into convolutional neural networks. • Towards this goal, SphereNet adapts the sampling locations of the convolutional filters, effectively reversing distortions, and wraps the filters around the sphere. • By building on regular convolutions, SphereNet enables the transfer of existing perspective convolutional neural network models to the omnidirectional case. • On the tasks of image classification and object detection, it exploits two newly created semi-synthetic and real-world omnidirectional datasets.
• 27. SphereNet: Learning Spherical Representations for Detection and Classification in Omnidirectional Images Overview. (a+b) Capturing images with a fisheye or 360◦ action camera results in images which are best represented on the sphere. (c) Using regular convolutions (e.g., with 3 × 3 filter kernels) on the rectified equirectangular representation suffers from distortions of the sampling locations (red) close to the poles. (d) In contrast, the SphereNet kernel exploits projections (red) of the sampling pattern on the tangent plane (blue), yielding filter outputs which are invariant to latitudinal rotations.
  • 28. SphereNet: Learning Spherical Representations for Detection and Classification in Omnidirectional Images Kernel Sampling Pattern at φ = 0 (blue) and φ = 1.2 (red) in spherical (a) and equirectangular (b) representation. Note the distortion of the kernel at φ = 1.2 in (b).
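A rough sketch of how SphereNet-style sampling locations can be computed: take a regular 3×3 grid on the plane tangent to the sphere at the kernel centre (φ, θ), map it back to spherical coordinates with the inverse gnomonic projection, and convert to equirectangular pixel positions. Step size, coordinate conventions and the function name are assumptions, not the paper's implementation.

import numpy as np

def sphere_kernel_locations(phi0, theta0, step, img_w, img_h):
    """Equirectangular pixel sampling locations of a 3x3 kernel whose pattern
    lives on the plane tangent to the sphere at (phi0, theta0).
    phi0: latitude in [-pi/2, pi/2], theta0: longitude in [-pi, pi]."""
    xs, ys = np.meshgrid(np.arange(-1, 2) * step, np.arange(-1, 2) * step)
    rho = np.sqrt(xs**2 + ys**2)
    c = np.arctan(rho)                                    # angular distance from the centre
    with np.errstate(invalid="ignore", divide="ignore"):  # centre point divides by zero
        phi = np.arcsin(np.cos(c) * np.sin(phi0) + ys * np.sin(c) * np.cos(phi0) / rho)
        theta = theta0 + np.arctan2(
            xs * np.sin(c),
            rho * np.cos(phi0) * np.cos(c) - ys * np.sin(phi0) * np.sin(c))
    phi = np.where(rho == 0, phi0, phi)                   # kernel centre -> tangency point
    theta = np.where(rho == 0, theta0, theta)
    u = np.mod((theta / np.pi + 1.0) * 0.5 * img_w, img_w)  # longitude -> x, with wrap-around
    v = (0.5 - phi / np.pi) * img_h                          # latitude -> y (north pole at top)
    return u, v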
• 29. SphereNet: Learning Spherical Representations for Detection and Classification in Omnidirectional Images Uniform Sphere Sampling. Comparison of an equirectangular sampling grid on the sphere with N = 200 points (a) to an approximation of evenly distributing N = 127 sampling points on a sphere with the Saff-Kuijlaars method (b, c). Note that the sampling points at the poles are much more evenly spaced in the uniform sphere sampling (b) compared to the equirectangular representation (a), which oversamples the image in these regions.
• 30. SphereNet: Learning Spherical Representations for Detection and Classification in Omnidirectional Images • SphereNet can be integrated into a convolutional neural network for image classification by adapting the sampling locations of the convolution and pooling kernels. • Furthermore, it is straightforward to additionally utilize a uniform sphere sampling, which is compared to nearest-neighbor and bilinear interpolation on an equirectangular representation in the experiments. • The integration of SphereNet into an image classification network does not introduce novel model parameters, and no changes to the training of the network are required. • In order to perform object detection on the sphere, the Spherical Single Shot MultiBox Detector (Sphere-SSD) adapts the Single Shot MultiBox Detector (SSD) to objects located on tangent planes of a sphere. • SSD exploits a fully convolutional architecture, predicting category scores and box offsets for a set of default anchor boxes of different scales and aspect ratios. • Sphere-SSD uses a weighted sum of a localization loss and a confidence loss. • However, in contrast to the original SSD, anchor boxes are now placed on tangent planes of the sphere and are defined in terms of the spherical coordinates of their respective tangent plane, the width/height of the box on the tangent plane, as well as an in-plane rotation.
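For reference, the weighted sum follows the standard SSD objective (the spherical variant changes the anchor parameterization, not this loss structure):

\[
L(x, c, l, g) \;=\; \frac{1}{N}\Big(L_{\mathrm{conf}}(x, c) + \alpha\, L_{\mathrm{loc}}(x, l, g)\Big),
\]

with N the number of matched anchor boxes, \(L_{\mathrm{conf}}\) a softmax confidence loss, \(L_{\mathrm{loc}}\) a smooth-L1 localization loss over the box parameters, and α a weighting factor.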
  • 31. SphereNet: Learning Spherical Representations for Detection and Classification in Omnidirectional Images Spherical Anchor Boxes are gnomonic projections of 2D bounding boxes of various scales, aspect ratios and orientations on tangent planes of the sphere. The figure visualizes anchors of the same orientation at different scales and aspect ratios on a 16 × 8 feature map on a sphere (a) and an equirectangular grid (b).
  • 32. SphereNet: Learning Spherical Representations for Detection and Classification in Omnidirectional Images Detection Results on FlyingCars Dataset. The ground truth is shown in green, SphereNet (NN) results in red.
• 33. Eliminating the Blind Spot: Adapting 3D Object Detection and Monocular Depth Estimation to 360◦ Panoramic Imagery • Recent automotive vision work has focused on processing forward-facing cameras. • However, future autonomous vehicles will not be viable without more comprehensive surround sensing, akin to that of a human driver, as can be provided by 360◦ panoramic cameras. • Here is an approach to adapt contemporary deep network architectures developed on conventional rectilinear imagery to work on equirectangular 360◦ panoramic imagery. • To address the lack of annotated panoramic automotive datasets, it adapts a contemporary automotive dataset, via style and projection transformations, to facilitate the cross-domain retraining of contemporary algorithms for panoramic imagery. • Following this approach, it retrains and adapts existing architectures to recover scene depth and 3D pose of vehicles from monocular panoramic imagery without any panoramic training labels or calibration parameters. • This approach is evaluated qualitatively on crowd-sourced panoramic images and quantitatively using an automotive environment simulator to provide the first benchmark for such techniques within panoramic imagery.
• 34. Eliminating the Blind Spot: Adapting 3D Object Detection and Monocular Depth Estimation to 360◦ Panoramic Imagery Panoramic images are typically represented using an equirectangular projection (A); in contrast, a conventional camera uses a rectilinear projection. In the equirectangular projection, the image-space coordinates are proportional to the latitude and longitude of observed points rather than the usual projection onto a focal plane. The adapted architectures recover monocular depth and the full 3D pose of vehicles (B) from panoramic imagery.
  • 35. Eliminating the Blind Spot: Adapting 3D Object Detection and Monocular Depth Estimation to 360◦ Panoramic Imagery Convolutions are computed seamlessly across horizontal image boundaries using the padding approach.
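The horizontal wrap-around can be realized, for instance, by stitching the left and right image borders before each convolution; a minimal sketch of this idea follows (it mirrors the stated padding approach, not necessarily the authors' exact implementation).

import torch
import torch.nn as nn

class WrapConv2d(nn.Module):
    """3x3 convolution that wraps horizontally across the 360° image seam:
    the last image column is copied to the front and the first to the back,
    while the vertical direction uses ordinary zero padding."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=(1, 0))

    def forward(self, x):                                    # x: (N, C, H, W)
        x = torch.cat([x[..., -1:], x, x[..., :1]], dim=-1)  # horizontal wrap padding
        return self.conv(x)                                  # output width == input width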
  • 36. Eliminating the Blind Spot: Adapting 3D Object Detection and Monocular Depth Estimation to 360◦ Panoramic Imagery Monocular depth recovery and 3D object detection with our approach. Left: Real-world images. Right: Synthetic images.
• 37. FisheyeMODNet: Moving Object detection on Surround-view Cameras for Autonomous Driving • Moving Object Detection is an important task for achieving robust autonomous driving. • An autonomous vehicle has to estimate the collision risk with other interacting objects in the environment and calculate an optimal trajectory. • Collision risk is typically higher for moving objects than static ones due to the need to estimate the future states and poses of the objects for decision making. • This is particularly important for near-range objects around the vehicle, which are typically detected by a fisheye surround-view system that captures a 360◦ view of the scene. • This work is a CNN architecture for moving object detection using fisheye images captured in an autonomous driving environment. • To target embedded deployment, it designs a lightweight encoder sharing weights across sequential images.
• 38. FisheyeMODNet: Moving Object detection on Surround-view Cameras for Autonomous Driving Images from the surround-view camera network showing near field sensing and wide field of view. Four fisheye cameras (marked green) provide 360◦ surround view.
• 39. FisheyeMODNet: Moving Object detection on Surround-view Cameras for Autonomous Driving Network Architecture adapted from the ShuffleSeg base network. Two sequential images encoding the motion information across time are utilized to train the network end-to-end for MOD.
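A minimal sketch of the weight-shared two-frame encoder idea (the backbone and fusion below are placeholders, not the ShuffleSeg configuration used in the paper): the same lightweight encoder processes both frames, the features are concatenated, and a small decoder predicts a per-pixel moving/static mask.

import torch
import torch.nn as nn

class TwoFrameMODNet(nn.Module):
    """Illustrative moving-object-detection network with a shared-weight encoder."""
    def __init__(self, feat=64, n_classes=2):
        super().__init__()
        self.encoder = nn.Sequential(                 # placeholder lightweight encoder (/4)
            nn.Conv2d(3, feat, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat, feat, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.decoder = nn.Sequential(                 # fuse both time steps, predict mask
            nn.Conv2d(2 * feat, feat, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(feat, n_classes, 1))

    def forward(self, frame_t, frame_t1):
        f_t = self.encoder(frame_t)                   # same weights for both frames
        f_t1 = self.encoder(frame_t1)
        return self.decoder(torch.cat([f_t, f_t1], dim=1))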
• 40. OmniDRL: Robust Pedestrian Detection using Deep Reinforcement Learning on Omnidirectional Cameras • Pedestrian detection is one of the most explored topics in computer vision and robotics. • Deep Reinforcement Learning has proven to be within the SoA in terms of both detection in perspective cameras and robotics applications. • However, for detection in omnidirectional cameras, the literature is still scarce, mostly because of their high levels of distortion. • This is an efficient technique for robust pedestrian detection in omnidirectional images. • The method uses deep RL and takes advantage of the distortion in the image. • By considering the 3D bounding boxes and their distorted projections into the image, this method is able to provide the pedestrian's position in the world, in contrast to the image positions provided by most SoA methods for perspective cameras. • The method avoids the need for pre-processing steps to remove the distortion, which are computationally expensive.
• 41. OmniDRL: Robust Pedestrian Detection using Deep Reinforcement Learning on Omnidirectional Cameras Illustration of the method, using a multi-task network, for pedestrian detection in omnidirectional cameras. The input is an omnidirectional image with an initial state of the bounding box, represented in the world coordinate system. Using this information, a set of possible actions is applied in order to detect the pedestrian in the 3D environment. After the trigger is activated, the line segments of the estimated 3D bounding box are projected onto the omnidirectional image. Then, the IoU between the ground truth and our estimation is computed in image coordinates.
  • 42. OmniDRL: Robust Pedestrian Detection using Deep Reinforcement Learning on Omnidirectional Cameras Depiction of the scheme of the proposed network, where the first convolutional layers are shared, and then split into branches (DQN and Classification).
• 43. OmniDRL: Robust Pedestrian Detection using Deep Reinforcement Learning on Omnidirectional Cameras This figure shows the image formation using unified central catadioptric cameras. (a) The projection of a point R ∈ R3 onto the normalized image plane {i−, i+} (with intermediate projection onto the unitary sphere {n−, n+}). (b) The projection of 3D straight line segments onto images using this model (x1 and x2 are the endpoints of the line segment).
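The unified central projection model referenced here can be written, in one common form (notation may differ from the paper), as a two-step mapping: projection onto the unit sphere, followed by projection to the normalized image plane from a point at distance ξ above the sphere centre:

\[
P = (X, Y, Z) \;\mapsto\; \frac{P}{\lVert P \rVert} = (X_s, Y_s, Z_s)
\;\mapsto\;
m = \left(\frac{X_s}{Z_s + \xi},\; \frac{Y_s}{Z_s + \xi},\; 1\right).
\]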
• 44. WoodScape: A multi-task, multi-camera fisheye dataset for autonomous driving • Fisheye cameras are commonly employed for obtaining a large field of view in surveillance, augmented reality and, in particular, automotive applications. • In spite of their prevalence, there are few public datasets for detailed evaluation of computer vision algorithms on fisheye images. • WoodScape is the first extensive fisheye automotive dataset, named after Robert Wood, who invented the fisheye camera in 1906. • WoodScape comprises four surround-view cameras and nine tasks including segmentation, depth estimation, 3D bounding box detection and soiling detection. • Semantic annotation of 40 classes at the instance level is provided for over 10,000 images, and annotations for the other tasks are provided for over 100,000 images.
• 45. WoodScape: A multi-task, multi-camera fisheye dataset for autonomous driving WoodScape, the first fisheye image dataset dedicated to autonomous driving. It contains four cameras covering 360°, accompanied by an HD laser scanner, IMU and GNSS. Annotations are made available for nine tasks, notably 3D object detection, depth estimation (overlaid on the front camera) and semantic segmentation.
  • 46. WoodScape: A multi-task, multi-camera fisheye dataset for autonomous driving Comparison of fisheye models.
  • 47. WoodScape: A multi-task, multi-camera fisheye dataset for autonomous driving Undistorting the fisheye image: (a) Rectilinear correction; (b) Piecewise linear correction; (c) Cylindrical correction. Left: raw image; Right: undistorted image.
  • 48. WoodScape: A multi-task, multi-camera fisheye dataset for autonomous driving Segmentation using ENet (top) and Object detection using Faster RCNN (bottom).
• 49. FisheyeDistanceNet: Self-Supervised Scale-Aware Distance Estimation using Monocular Fisheye Camera for Autonomous Driving • Fisheye cameras are commonly used in applications like autonomous driving and surveillance to provide a large field of view (> 180◦). • However, they come at the cost of strong non-linear distortions which require more complex algorithms. • Here is Euclidean distance estimation on fisheye cameras for automotive scenes. • Obtaining accurate and dense depth supervision is difficult in practice, but self-supervised learning approaches show promising results and could potentially overcome the problem. • This is a self-supervised scale-aware framework for learning Euclidean distance and ego-motion from raw monocular fisheye videos without applying rectification. • While it is possible to perform a piece-wise linear approximation of the fisheye projection surface and apply standard rectilinear models, this has its own set of issues, like re-sampling distortion and discontinuities in transition regions.
• 50. FisheyeDistanceNet: Self-Supervised Scale-Aware Distance Estimation using Monocular Fisheye Camera for Autonomous Driving Overview: the 1st row represents the ego masks Mt-1, Mt+1, which indicate which pixel coordinates are valid when reconstructing It−1 from It and It from It+1, respectively. The 2nd row indicates the masking of static pixels computed after 2 epochs, where black pixels are filtered from the photometric loss (i.e. σ = 0). It prevents dynamic objects moving at a speed similar to the ego car, as well as low-texture regions, from contaminating the loss. The masks are computed for forward and backward sequences from the input sequence S and the reconstructed images. The 3rd row represents the distance estimates corresponding to the input frames. Finally, the vehicle's odometry data is used to resolve the scale factor issue.
• 51. FisheyeDistanceNet: Self-Supervised Scale-Aware Distance Estimation using Monocular Fisheye Camera for Autonomous Driving • The overall self-supervised structure-from-motion objective consists of a photometric loss term Lp imposed between the reconstructed target image Iˆt and the target image It, and a distance regularization term Ls ensuring edge-aware smoothing of the distance estimates. • Finally, Ldc is a cross-sequence distance consistency loss derived from the chain of frames in the training sequence S. • To prevent the training objective from getting stuck in local minima due to the gradient locality of the bilinear sampler, 4 scales are adopted to train the network. • The distance estimation network is mainly based on the U-Net architecture, an encoder-decoder network with skip connections. • After testing different variants of the ResNet family, ResNet18 was chosen as the encoder. • A key aspect is replacing regular convolutions with deformable convolutions, since regular CNNs are inherently limited in modeling large, unknown geometric distortions due to their fixed structures, such as fixed filter kernels, fixed receptive field sizes, and fixed pooling kernels.
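The deformable-convolution substitution mentioned above can be sketched with torchvision's DeformConv2d, where a small regular convolution predicts per-location sampling offsets; the block below is illustrative (layer sizes and the offset-prediction design are assumptions, not the paper's exact configuration).

import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformBlock(nn.Module):
    """Replace a regular 3x3 convolution with a deformable one: an auxiliary conv
    predicts 2 * kH * kW sampling offsets per output location, letting the kernel
    adapt its sampling locations to the fisheye distortion."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.offset = nn.Conv2d(in_ch, 2 * 3 * 3, kernel_size=3, padding=1)
        self.deform = DeformConv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):
        return self.deform(x, self.offset(x))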
• 52. FisheyeDistanceNet: Self-Supervised Scale-Aware Distance Estimation using Monocular Fisheye Camera for Autonomous Driving • The backbone of the pose estimation network is based on the paper "Digging into Self-Supervised Monocular Depth Estimation", which predicts rotation using an Euler angle parameterization. • Normal convolutions are replaced with deformable convolutions in the encoder-decoder setting. • The rotation is predicted using an axis-angle representation, and the rotation and translation outputs are scaled by 0.01. • For monocular training, a sequence length of three frames is used, while the pose network is formed from a ResNet18, modified to accept a pair of color images (six channels) as input and to predict a single 6-DoF relative pose for It−1→t and It→t−1. • Horizontal flips and the following training augmentations are performed: random brightness, contrast, saturation, and hue jitter with respective ranges of ±0.2, ±0.2, ±0.2, and ±0.1. • Importantly, the color augmentations are only applied to the images fed to the networks, not to those used to compute the photometric loss term Lp. • All three images fed to the pose and depth networks are augmented with the same parameters.
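A hedged sketch of two of the implementation details above: widening ResNet18's first convolution to accept the concatenated image pair (six channels), applying the stated colour-jitter ranges, and scaling the predicted pose by 0.01. Function names are illustrative, and how the extra input channels would be initialized from pretrained weights is not specified here.

import torch
import torch.nn as nn
import torchvision
from torchvision import transforms

def make_pose_encoder():
    """ResNet18 modified to take a pair of RGB frames (6 channels) as input."""
    net = torchvision.models.resnet18()
    old = net.conv1
    net.conv1 = nn.Conv2d(6, old.out_channels, kernel_size=7, stride=2, padding=3, bias=False)
    return net

# colour augmentations with the ranges quoted above; applied only to the
# network inputs, not to the frames used for the photometric loss Lp
color_jitter = transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1)

def scale_pose(raw_pose):
    """Scale the raw network output; raw_pose: (N, 6) = [axis-angle (3), translation (3)]."""
    return 0.01 * raw_pose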
  • 53. FisheyeDistanceNet: Self-Supervised Scale-Aware Distance Estimation using Monocular Fisheye Camera for Autonomous Driving (a) Depth network: U-Net. (b) Pose network: A separate pose network. (c) Per-pixel minimum reprojection: When correspondences are good, the reprojection loss should be low. (d) Full-resolution multi-scale: Upsample depth predictions at intermediate layers and compute all losses at the input resolution, reducing texture-copy artifacts.
  • 54. FisheyeDistanceNet: Self-Supervised Scale-Aware Distance Estimation using Monocular Fisheye Camera for Autonomous Driving FisheyeDistanceNet produces sharp distance maps on distorted fisheye images.