Pedestrian behavior/intention modeling for autonomous driving IV

Yu Huang
Yu.huang07@gmail.com
Sunnyvale, California

Outline
• Looking to Relations for Future Trajectory Forecast
• The Garden of Forking Paths: Towards Multi-Future Trajectory Prediction
• Stochastic Sampling Simulation for Pedestrian Trajectory Prediction
• Disentangling Human Dynamics for Pedestrian Locomotion Forecasting
with Noisy Supervision
• Social and Scene-Aware Trajectory Prediction in Crowded Spaces

Looking to Relations for Future Trajectory Forecast
• Inferring relational behavior between road users as well as road users and their surrounding
physical space is an important step toward effective modeling and prediction of navigation
strategies adopted by participants in road scenes.
• This paper proposes a relation-aware framework for future trajectory forecast.
• The system aims to infer relational information from the interactions of road users with each
other and with the environment.
• The first module involves visual encoding of spatio-temporal features, which captures human-
human and human-space interactions over time.
• The following module explicitly constructs pair-wise relations from spatio-temporal interactions
and identifies more descriptive relations that highly influence future motion of the target road
user by considering its past trajectory.
• The resulting relational features are used to forecast future locations of the target, in the form of
heatmaps with an additional guidance of spatial dependencies and consideration of the
uncertainty.

Spatio-temporal features are visually encoded from a discretized grid to
locally discover (i) human-human and (ii) human-space interactions over time. Then,
their pair-wise relations with respect to the past motion of the target
(→) are investigated from a global perspective for trajectory forecast.

Given a sequence of images, the GRE (gated relation encoder) visually analyzes spatial behavior of road
users and their temporal interactions with respect to environments. The subsequent RGM (relation gate
module) of GRE infers pair-wise relations from these interactions and determines which relations are
meaningful from a target agent’s perspective. The aggregated relational features are used to generate
initial heatmaps through the TPN (trajectory prediction network). Then, the following SRN (spatial
refinement network) further refines these initial predictions with a guidance of their spatial
dependencies. They additionally embed the uncertainty of the problem into the system at test time.

• They extend the definition of ‘object’ to a spatio-temporal feature representation extracted from
each region of the discretized grid over time.
• This makes it possible to visually discover (i) human-human interactions, where multiple road users
interact with each other over time, (ii) human-space interactions from their interactive
behavior with environments, and (iii) environmental representations by encoding structural
information of the road.
• The pair-wise relations between objects (i.e., local spatio-temporal features) are inferred from a
global perspective.
• Moreover, they design a new operation function to control information flow so that the network
can extract descriptive relational features by looking at relations that have a high potential to
influence the future motion of the target.

It visually extracts spatial representations of
the static road structures, the road topology,
and the appearance of road users from
individual frames using the spatial behavior
encoder (SBE) with 2D convolutions. They
individually process each entry of spatial
representations using the temporal
interaction encoder (TIE) with a 3D
convolution to model sequential changes of
road users and road structures with their
temporal interactions. The joint use of 2D
convolutions for spatial modeling and 3D
convolution for temporal modeling extracts
more discriminative spatio-temporal features
as compared to alternative methods such as
3D convolutions as a whole or 2D
convolutions with an LSTM.

They focus on the internal gating process of an LSTM unit, which controls
information flow through multiple switch gates. Specifically, the LSTM employs a
sigmoid function with a tanh layer to determine not only which information is
useful, but also how much weight it should be given. The efficacy of this control
process motivates the design of a relation gate module (RGM), which is essential for
generating more descriptive relational features from the target's perspective.
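
The gating idea can be illustrated with a minimal sketch. It assumes each object is a small feature vector, that a relation is formed by concatenating two objects with the target's past-motion feature, and that the gated relations are summed. The function names, the scalar output, and the weight vectors `w_feat` and `w_gate` are hypothetical and chosen only for illustration; the slides do not specify the actual RGM layers.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def gated_relation(obj_i, obj_j, past_motion, w_feat, w_gate):
    """Gate one pair-wise relation in the LSTM style described above."""
    # Assumption: a relation is the concatenation of both objects and the past motion.
    pair = list(obj_i) + list(obj_j) + list(past_motion)
    content = math.tanh(dot(w_feat, pair))   # tanh branch: what the relation says
    gate = sigmoid(dot(w_gate, pair))        # sigmoid branch: how much weight it gets
    return gate * content

def aggregate_relations(objects, past_motion, w_feat, w_gate):
    """Sum the gated relations over all unordered object pairs."""
    total = 0.0
    for i in range(len(objects)):
        for j in range(i + 1, len(objects)):
            total += gated_relation(objects[i], objects[j], past_motion, w_feat, w_gate)
    return total
```

With all gate weights at zero the sigmoid outputs 0.5, so every relation is passed at half strength; learned gate weights would instead suppress relations that matter little for the target.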

• To effectively identify the pixel-level probability map, it specifically designs a trajectory prediction
network (TPN) with a set of deconvolutional layers.
• It first reshapes the relational features extracted from the GRE to a 1 x 1 x w tensor before
running the proposed TPN.
• The reshaped features are then incrementally upsampled using six deconvolutional layers, each
with a subsequent ReLU activation function.
• As an output, the network predicts a set of activations in the form of heatmaps through the
learned parameters.
• In training, the sum of squared error between the ground truth heatmaps and the prediction is
minimized, all over the 2D locations.
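
The training objective in the last bullet can be sketched as follows. The sketch assumes the ground-truth heatmap is a 2D Gaussian centered on the true location (the slides only say "ground truth heatmaps") and that a predicted location is read off as the heatmap maximum. All function names are hypothetical.

```python
import math

def gaussian_heatmap(height, width, cx, cy, sigma=1.0):
    """Assumed ground-truth encoding: a Gaussian bump centered at pixel (cx, cy)."""
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
             for x in range(width)] for y in range(height)]

def sse_loss(pred, target):
    """Sum of squared errors over all 2D locations."""
    return sum((p - t) ** 2
               for pred_row, target_row in zip(pred, target)
               for p, t in zip(pred_row, target_row))

def heatmap_argmax(heatmap):
    """Return the (x, y) pixel with the highest activation."""
    best = max((v, y, x) for y, row in enumerate(heatmap) for x, v in enumerate(row))
    return best[2], best[1]
```

A prediction that peaks one pixel away from the target already incurs a positive loss, which is what drives the network toward spatially precise heatmaps.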

Figure panels: spatial behavior encoder + temporal interaction encoder; relation gate module; trajectory prediction network.

• Since the network independently predicts δ number of pixel-level probability maps, there is no
constraint to enforce heatmaps to be spatially aligned across predictions.
• They design a spatial refinement network (SRN) with large kernels, so the network can make use
of rich contextual information between the predicted locations.
• It first extracts intermediate activations from the TPN and passes them through a set of convolutional layers
with stride 2, so that the output feature map is the same size as the earlier activation of the TPN.
• Then, it upsamples the concatenated features using four deconvolutional layers followed by a 7 x
7 and 1 x 1 convolution.
• By using large receptive fields and increasing the number of layers, the network is able to
effectively capture dependencies, which results in less confusion between heatmap locations.
• In addition, the use of a 1 x 1 convolution forces the refinement process to make
pixel-level corrections in the filter space.
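
The resolution bookkeeping behind the TPN and SRN can be checked with the standard output-size formulas for convolutions and transposed convolutions. The kernel sizes, strides, paddings, and the number of stride-2 layers below are assumptions picked for illustration; the slides give only the layer counts.

```python
def deconv_out(size, kernel, stride, padding):
    """Output size of a transposed convolution (no dilation, no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel

def conv_out(size, kernel, stride, padding):
    """Output size of an ordinary convolution (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1

def stack(size, layer_fn, n_layers, **kwargs):
    """Apply the same layer n_layers times and record every intermediate size."""
    sizes = [size]
    for _ in range(n_layers):
        sizes.append(layer_fn(sizes[-1], **kwargs))
    return sizes

# Six deconvolutions starting from a 1 x 1 map (assumed kernel 4, stride 2, padding 1).
tpn_sizes = stack(1, deconv_out, 6, kernel=4, stride=2, padding=1)

# Two stride-2 convolutions (assumed kernel 3, padding 1) applied to the TPN output.
srn_down = stack(tpn_sizes[-1], conv_out, 2, kernel=3, stride=2, padding=1)
```

Under these assumed settings each deconvolution doubles the resolution, and two stride-2 convolutions bring the map back to the size of the TPN activation after its fourth layer, which is what allows the two feature maps to be concatenated.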

Figure: spatial refinement network.

The proposed approach properly encodes (a) human-human and (b) human-space interactions by
inferring relational behavior from a physical environment (highlighted by a dashed arrow ➔). However,
it sometimes fails to predict a future trajectory when a road user (c) unexpectedly changes the direction
of its motion or (d) does not consider the interactions with an environment. (Color codes: Yellow - given
past trajectory, Red - ground truth, and Green - the method’s prediction)

The Garden of Forking Paths: Towards Multi-Future Trajectory Prediction
• This paper studies the problem of predicting the distribution over multiple possible future paths
of people as they move through various visual scenes.
• It makes two main contributions:
• The first contribution is a new dataset, created in a realistic 3D simulator, which is based on real
world trajectory data, and then extrapolated by human annotators to achieve different latent
goals. This provides the first benchmark for quantitative evaluation of the models to predict
multi-future trajectories.
• The second contribution is a new model to generate multiple plausible future trajectories, which
contains novel designs of using multi-scale location encodings and convolutional RNNs over
graphs, called Multiverse.
• Website: https://next.cs.cmu.edu/multiverse/index.html

Illustration of person trajectory prediction. (1) A person walks towards a car (data from the VIRAT/ActEV dataset). The
green line is the actual future trajectory and the yellow-orange heatmaps are example future predictions. Although these
predictions near the cars are plausible, they would be considered errors in the real video dataset. (2) To combat this, it
proposes a new dataset called “Forking Paths”; here it illustrates 3 possible futures created by human annotators
controlling agents in a synthetic world derived from real data. (3) Here it shows semantic segmentation of the scene. (4-6)
The same scene is rendered from different viewing angles, where the red circles are future destinations.

Overview of the model. The input to the model is the ground truth location history, and a set of
video frames, which are preprocessed by a semantic segmentation model. This is encoded by
the “History Encoder” convolutional RNN. The output of the encoder is fed to the convolutional
RNN decoder for location prediction. The coarse location decoder outputs a heatmap over the
2D grid of size H × W . The fine location decoder outputs a vector offset within each grid cell.
These are combined to generate a multimodal distribution over R^2 for predicted locations.
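
How the coarse heatmap and the fine offsets might combine into continuous locations can be sketched as follows. The sketch assumes that each offset is measured from the center of its grid cell and that the most probable cells are kept as separate hypotheses. `cell_size`, `top_k`, and the function name are hypothetical.

```python
def decode_locations(coarse, offsets, cell_size, top_k=3):
    """Turn grid probabilities plus per-cell offsets into (x, y, prob) hypotheses."""
    cells = [(p, r, c) for r, row in enumerate(coarse) for c, p in enumerate(row)]
    cells.sort(reverse=True)                 # most probable cells first
    hypotheses = []
    for p, r, c in cells[:top_k]:
        dx, dy = offsets[r][c]
        # Assumption: the offset is relative to the cell center.
        x = (c + 0.5) * cell_size + dx
        y = (r + 0.5) * cell_size + dy
        hypotheses.append((x, y, p))
    return hypotheses
```

Keeping several cells rather than only the best one is what makes the decoded distribution multimodal: two cells with comparable probability yield two distinct future locations.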

• The history encoder computes a representation of the scene from the history of past locations. It
preprocesses each video frame using a pre-trained semantic segmentation model, the Deeplab
model, trained on the ADE20k dataset.
• Coarse Location Decoder: The graph-structured update function for the RNN ensures that the
probability mass “diffuses out” to nearby grid cells in a controlled manner, reflecting the prior
knowledge that people do not suddenly jump between distant locations.
• Fine Location Decoder: it trains a second convolutional RNN decoder to compute an offset vector
for each possible grid cell using a regression output.
• The loss function:

Stochastic Sampling Simulation for Pedestrian
Trajectory Prediction
• Urban environments pose a significant challenge for autonomous vehicles (AVs) as they must
safely navigate while in close proximity to many pedestrians.
• It is crucial for the AV to correctly understand and predict the future trajectories of pedestrians to
avoid collision and plan a safe path.
• This paper describes a method using a stochastic sampling-based simulation to train DNNs for
pedestrian trajectory prediction with social interaction.
• This simulation method can generate vast amounts of automatically-annotated, realistic, and
naturalistic synthetic pedestrian trajectories based on small amounts of real annotation.
• It then uses such synthetic trajectories to train an off-the-shelf, state-of-the-art deep learning
approach, Social GAN, to perform pedestrian trajectory prediction.
• The proposed architecture, trained only using synthetic trajectories, achieves better prediction
results compared to those trained on human-annotated real-world data using the same network.

System overview. It proposes using a novel stochastic sampling-based
simulation system to train a deep neural network (e.g., Social GAN) to
make socially acceptable pedestrian trajectory predictions.

Disentangling Human Dynamics for Pedestrian
Locomotion Forecasting with Noisy Supervision
• It tackles the problem of Human Locomotion Forecasting, a task for jointly predicting the spatial
positions of several keypoints on the human body in the near future under an egocentric setting.
• In contrast to the previous work that aims to solve either the task of pose prediction or trajectory
forecasting in isolation, it proposes a framework to unify these two problems and address the
practically useful task of pedestrian locomotion prediction in the wild.
• Among the major challenges in solving this task is the scarcity of annotated egocentric video
datasets with dense annotations for pose, depth, or egomotion.
• To surmount this difficulty, they use state-of-the-art models to generate (noisy) annotations and
propose robust forecasting models that can learn from this noisy supervision.
• This method disentangles the overall pedestrian motion into easier to learn subparts by utilizing a
pose completion and a decomposition module.
• The completion module fills in the missing key-point annotations and the decomposition module
breaks the cleaned locomotion down to global (trajectory) and local (pose keypoint movements).
• Further, with Quasi RNN as the backbone, they propose a hierarchical trajectory forecasting
network that utilizes low-level vision domain specific signals like egomotion and depth to predict
the global trajectory.

Egocentric pedestrian locomotion forecasting. Locomotion is defined as the overall motion of keypoints on
the pedestrian in contrast to predicting just the position (trajectory prediction) or the pose (pose forecasting).

An illustration for human locomotion forecasting with noisy supervision. The “Raw Pose” plane represents
the noisy input pose sequence with missing joint detection. The “Complete Pose” plane denotes the
output from the pose completion module with filled in joint positions. The completed pose is then split
into the global and local streams which separate concurrent motions. The prediction modules forecast the
future streams. Finally, these streams are merged to predict future pedestrian locomotion.

• They frame the task of forecasting human locomotion in egocentric view (of the vehicle) as a
sequence-to-sequence problem.
• They use state-of-the-art models for multi-person keypoint detection to autonomously
generate dense but noisy frame-level supervision for human poses.
• It autonomously estimates depth in a monocular camera using SuperDepth, which extends a
subpixel convolutional layer for depth super-resolution.
• It uses a state-of-the-art unsupervised model for autonomously estimating the camera
motion that occurs between consecutive frames due to the movement of the egovehicle.
• They propose a pose completion network for completing the detected human poses.
• This processing has a two-fold benefit. First, it fills in the joints that are not detected by the pose
detection module. It also suppresses noise by replacing low-confidence outputs with better
estimates. Second, it makes it possible to decompose the motion despite noisy data. Otherwise,
separating the global and local components of the uncompleted motion would be difficult, as
the joints flicker frequently.

Encoder-Recurrent-Decoder architecture. The lock denotes the sharing of the frame encoder weights across
different time steps of the input sequence. Dotted squares contain values concerned with the same frame.

Architecture of the pose completion and disentangling module. The shades represent the
confidence in locating the joint. Black represents highest confidence and white represents
missing data. All detections below a confidence threshold are replaced with the autoencoder
estimates (sky blue). The completed pose is then split into local and global streams for forecasting.
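
The replacement rule in this caption can be sketched in a few lines. The sketch assumes the autoencoder estimates are already available as a list of joints; the threshold value of 0.5 and the function name are hypothetical, since the slides do not state the threshold.

```python
def complete_pose(detected, confidence, estimated, threshold=0.5):
    """Keep confident detections; fill missing or low-confidence joints with estimates."""
    completed = []
    for joint, conf, estimate in zip(detected, confidence, estimated):
        if joint is None or conf < threshold:
            completed.append(estimate)   # undetected or unreliable: use the estimate
        else:
            completed.append(joint)      # reliable detection: keep as is
    return completed
```

For example, a pose with one missing joint and one joint detected at confidence 0.2 would have both replaced, while a joint detected at confidence 0.9 is left untouched.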

• The proposal to disentangle global and local motion is motivated by the relative difference in the
nature and magnitude of motion exhibited by these streams.
• This disentangling significantly reduces the overall complexity, since each of the streams
now models a much simpler and easier-to-predict motion.
• It proposes to use the neck joint sequence as a representation of the global stream, because the
neck is the most widely observed joint in the dataset.
• The Quasi-Recurrent Neural Network forms the backbone of the seq-to-seq learning structure.
• QRNNs consist of alternating convolutional and recurrent pooling modules and are designed to
parallelize more efficiently than vanilla LSTMs.
• The Quasi-RNN trains faster (825 ms/batch) than the LSTM (886 ms/batch) on a single GPU under
the same parameters and yields faster convergence for similar model capacity.
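
The global/local split described above can be sketched as follows. The sketch assumes the local stream is each joint's position relative to the neck in the same frame; the slides name the neck as the global stream but do not spell out the local encoding, so that part and the function names are assumptions.

```python
def decompose(pose_sequence, neck_index):
    """Split a pose sequence into a global (neck) stream and a neck-relative local stream."""
    global_stream = [pose[neck_index] for pose in pose_sequence]
    local_stream = [[(x - nx, y - ny) for (x, y) in pose]
                    for pose, (nx, ny) in zip(pose_sequence, global_stream)]
    return global_stream, local_stream

def merge(global_stream, local_stream):
    """Inverse of decompose: add the neck position back to every joint."""
    return [[(x + nx, y + ny) for (x, y) in pose]
            for pose, (nx, ny) in zip(local_stream, global_stream)]
```

Under this encoding the global stream carries the large translation across the image, while the local stream only carries small limb movements around a fixed origin, which is why each stream is easier to forecast on its own.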

Architecture for forecasting the local stream.
The Quasi-RNN encoder-decoder has N layers of
alternate convolutions and recurrent pooling,
both in the input encoder and the output
decoder. The recurrent pooling is a thin
aggregation function applied to the convolutional
activations. The encoder churns through the
latent representation of the previous poses and
encodes the necessary information into a context
vector. This vector is then consumed by the
QRNN decoder to forecast the future poses
mapped back to the same latent space.
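
The interplay of convolution and recurrent pooling in a QRNN layer can be sketched for a single channel. The sketch uses a causal filter of width 2 and the "f-pooling" variant h_t = f_t * h_(t-1) + (1 - f_t) * z_t from the original QRNN formulation; the filter width, the single channel, and the weight names `w_z` and `w_f` are simplifying assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def qrnn_layer(xs, w_z, w_f):
    """One single-channel QRNN layer: causal convolution, then f-pooling."""
    padded = [0.0] + list(xs)                     # left padding keeps the filter causal
    # Convolutional step: every timestep is independent, so it can run in parallel.
    zs = [math.tanh(w_z[0] * padded[t] + w_z[1] * padded[t + 1]) for t in range(len(xs))]
    fs = [sigmoid(w_f[0] * padded[t] + w_f[1] * padded[t + 1]) for t in range(len(xs))]
    # Pooling step: the only sequential part, and it has no learned weights.
    h, outputs = 0.0, []
    for z, f in zip(zs, fs):
        h = f * h + (1.0 - f) * z
        outputs.append(h)
    return outputs
```

The heavy computation sits in the convolutional step, which does not depend on the previous hidden state; this is the property that lets the layer parallelize across time.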

• The filled in and decomposed pose is used as the input to the pose prediction module.
• This module comprises a spatial encoder that maps each pose into a latent space.
• The weights of this spatial encoder are separately trained using the autoencoder while the
complexity of the latent space is similar.
• The forecasting is processed in the latent space with layers of the QRNN Encoder-Decoder module.
• It forecasts in the latent space because, as confirmed by the pose completion module
experiments, the human pose lies on a low-dimensional manifold due to the various
kinematic constraints enforced by the human body.
• Forecasting in this lower-dimensional, denser space makes the prediction easier for the Quasi-RNN module.
• The predicted pose is mapped back to the image space with the spatial decoder to forecast the pose.
• It proposes to predict residuals from the first observed positions instead of forecasting absolute
coordinates. In particular, it learns to predict the global stream from separately processed low
level vision signals (monocular depth, camera egomotion).
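
The residual formulation in the last bullet can be sketched as a pair of helper functions. The sketch assumes 2D image coordinates and takes the first observed position as the reference point, as the bullet states; the function names are hypothetical.

```python
def to_residuals(observed, future):
    """Express future positions as offsets from the first observed position."""
    x0, y0 = observed[0]
    return [(x - x0, y - y0) for (x, y) in future]

def from_residuals(observed, residuals):
    """Recover absolute coordinates from predicted residuals."""
    x0, y0 = observed[0]
    return [(dx + x0, dy + y0) for (dx, dy) in residuals]
```

The network is then trained on the output of `to_residuals`, and `from_residuals` is applied to its predictions, so targets stay in a small range regardless of where the pedestrian appears in the frame.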

Qualitative results from the pose prediction module. (A) shows the tp = 15 length input sequence of poses for a
pedestrian walking on the sidewalk. (B) and (C) show the cropped frames with the pedestrian and the corresponding
filled-in pose at the start and end of the input sequence. (D) shows the predicted pedestrian locomotion for the next tf
= 15 frames. (D), (E), and (F) also show the predicted poses at the start, intermediate, and end of the output sequence,
respectively. Note that the actual position of the pedestrian represents the ground truth in (D), (E) and (F).

Social and Scene-Aware Trajectory Prediction in
Crowded Spaces
• Mimicking human ability to forecast future positions or interpret complex interactions in
urban scenarios, such as streets, shopping malls or squares, is essential to develop socially
compliant robots or self-driving cars.
• Autonomous systems may gain an advantage by anticipating human motion to avoid collisions
or to behave naturally alongside people.
• To foresee plausible trajectories, it constructs an LSTM-based model considering three
fundamental factors: people interactions, past observations in terms of previously crossed
areas and semantics of surrounding space.
• The model encompasses several pooling mechanisms to join the above elements defining
multiple tensors, namely social, navigation and semantic tensors.
• The network is tested in unstructured environments where complex paths emerge according
to both internal (intentions) and external (other people, inaccessible areas) motivations.
• As demonstrated, modeling paths without awareness of social interactions or context information is
insufficient to correctly predict future positions.
• Codes: https://github.com/Oghma/sns-lstm/

Overview of the proposed model. Trajectories, navigation map and semantic image are fed
to the LSTM network and combined using three pooling mechanisms. Future positions are
obtained using linear layers to extract key parameters of a Gaussian distribution.
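
The Gaussian output described in this caption can be sketched as follows. The sketch assumes the five-parameter bivariate form common in LSTM trajectory models (two means, two standard deviations, one correlation) and a negative log-likelihood training loss; the slides only say "key parameters of a Gaussian distribution", so the parameterization and function names are assumptions.

```python
import math

def to_gaussian_params(raw):
    """Map five unconstrained linear-layer outputs to valid Gaussian parameters."""
    mu_x, mu_y, log_sx, log_sy, r = raw
    # exp keeps the standard deviations positive; tanh keeps the correlation in (-1, 1).
    return mu_x, mu_y, math.exp(log_sx), math.exp(log_sy), math.tanh(r)

def bivariate_nll(params, x, y):
    """Negative log-likelihood of the point (x, y) under a bivariate Gaussian."""
    mu_x, mu_y, sx, sy, rho = params
    zx = (x - mu_x) / sx
    zy = (y - mu_y) / sy
    one_minus_rho2 = 1.0 - rho * rho
    q = (zx * zx - 2.0 * rho * zx * zy + zy * zy) / one_minus_rho2
    log_norm = math.log(2.0 * math.pi * sx * sy * math.sqrt(one_minus_rho2))
    return 0.5 * q + log_norm
```

At prediction time a position can be taken as the mean (mu_x, mu_y) or sampled from the distribution, depending on whether a single path or several plausible paths are wanted.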

• Pedestrian dynamics in urban scenarios are highly influenced by static and dynamic factors which
guide people towards their destinations.
• To forecast realistic paths, it is important to allow human dynamics to be influenced by
surrounding space, not only in terms of other people in their neighborhood, but also considering
semantics of crossed areas as well as past observations.
• This framework models each pedestrian as an LSTM network interacting with the surrounding
space using three pooling mechanisms, namely Social, Navigation and Semantic pooling.
• The Social pooling mechanism takes into account the neighborhood in terms of other people, merging their
hidden states.
• The Navigation pooling mechanism exploits past observations to discriminate between equally likely
predicted positions using previous information about the scene.
• Finally, Semantic pooling uses semantic scene segmentation to recognize non-crossable areas.
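
The Social pooling step can be sketched as a grid of summed hidden states. The sketch assumes a square neighborhood centered on the target pedestrian, split into `grid_size` x `grid_size` cells, with neighbor hidden states summed per cell as in grid-based social pooling; the slides only say the states are merged, so the summation, the parameter names, and the function name are assumptions.

```python
def social_pooling(target, neighbors, grid_size, neighborhood):
    """Build a grid_size x grid_size x dim tensor of summed neighbor hidden states."""
    dim = len(neighbors[0][1]) if neighbors else 0
    tensor = [[[0.0] * dim for _ in range(grid_size)] for _ in range(grid_size)]
    cell = neighborhood / grid_size
    tx, ty = target
    for (x, y), hidden in neighbors:
        # Shift so the target sits at the center of the neighborhood window.
        col = int((x - tx + neighborhood / 2.0) // cell)
        row = int((y - ty + neighborhood / 2.0) // cell)
        if 0 <= row < grid_size and 0 <= col < grid_size:
            for k, value in enumerate(hidden):
                tensor[row][col][k] += value
        # Neighbors outside the window are ignored.
    return tensor
```

The Navigation and Semantic tensors could be built over the same kind of window, reading values from the navigation map and the segmentation instead of from neighbor hidden states.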

The Semantic map is generated from the reference
image while the Navigation map is obtained from
observed data. The image shows an example of
such maps for the ETH dataset.

Overview of the pooling mechanisms. Three tensors take
into account social neighborhood, past observations and
semantics of surrounding space, respectively. Tensors are
finally concatenated, processed by ReLU layers and fed to
LSTM networks along with embedded positions. The figure
also highlights dimensions of each introduced tensor.

Some examples of predicted trajectories for
the HOTEL dataset. Ground truths are shown as
solid lines, while predicted trajectories are shown as
dashed lines. The first column shows cases
where predicted positions are very close to
the real paths. The second column shows cases
where the SNS-LSTM fails to
correctly predict future positions.

Temporal sequence visualization for
different tracks drawn from both the HOTEL
and ETH datasets. The circles represent
ground truth (green), SNS-LSTM model
(blue) and S-LSTM model (red),
respectively. For each row, the first
image shows the observed path (in
green) which corresponds to 8 frames
(the three circles are superimposed),
while the remaining ones show the 9th,
13th, 17th and 20th predicted frames,
respectively.
