Pedestrian Behavior/Intention
Modeling for Autonomous Driving III
Yu Huang
Yu.huang07@gmail.com
Sunnyvale, California
Outline
• SR-LSTM: State Refinement for LSTM towards Pedestrian Trajectory Prediction
• Pedestrian Path, Pose, and Intention Prediction Through Gaussian Process Dynamical
Models
• StarNet: Pedestrian Trajectory Prediction using Deep Neural Network in Star Topology
• Social Ways: Learning Multi-Modal Distributions of Pedestrian Trajectories with GANs
• Multi-Agent Tensor Fusion for Contextual Trajectory Prediction
• Which Way Are You Going? Imitative Decision Learning for Path Forecasting in Dynamic
Scenes
• TraPHic: Trajectory Prediction in Dense and Heterogeneous Traffic Using Weighted
Interactions
• Learning to Infer Relations for Future Trajectory Forecast
• Peeking into the Future: Predicting Future Person Activities and Locations in Videos
SR-LSTM: State Refinement for LSTM
towards Pedestrian Trajectory Prediction
• 2019.3
• In crowd scenarios, reliable trajectory prediction of pedestrians requires insightful
understanding of their social behaviors.
• These behaviors have been investigated in many studies, yet they are hard to fully
express with hand-crafted rules.
• Recent studies based on LSTM networks have shown great ability to learn social behaviors.
• However, many of these methods rely on previous neighboring hidden states but ignore the
important current intention of the neighbors.
• To address this issue, a data-driven state refinement module for LSTM networks
(SR-LSTM) is proposed, which exploits the current intention of neighbors and
jointly and iteratively refines the current states of all participants in the crowd through a
message-passing mechanism.
• To effectively extract the social effect of neighbors, a social-aware
information selection mechanism is further introduced, consisting of an element-wise motion gate and a
pedestrian-wise attention that select useful messages from neighboring pedestrians (see the sketch below).
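A minimal sketch of how such an element-wise motion gate and pedestrian-wise attention could be wired together, with toy feature sizes; this is an illustration of the idea, not the authors' implementation:

```python
import torch
import torch.nn as nn

class SocialSelection(nn.Module):
    """Toy motion gate + pedestrian-wise attention (illustrative assumptions)."""
    def __init__(self, hidden=64):
        super().__init__()
        self.motion_gate = nn.Linear(2 * hidden + 2, hidden)  # element-wise gate
        self.attn = nn.Linear(2 * hidden + 2, 1)              # per-pedestrian score

    def forward(self, h_i, h_neighbors, rel_pos):
        # h_i: (H,) target state; h_neighbors: (N, H); rel_pos: (N, 2)
        pair = torch.cat([h_i.expand_as(h_neighbors), h_neighbors, rel_pos], dim=-1)
        g = torch.sigmoid(self.motion_gate(pair))   # select features element-wise
        a = torch.softmax(self.attn(pair), dim=0)   # weigh each neighbor's message
        return (a * g * h_neighbors).sum(dim=0)     # aggregated social message

sel = SocialSelection()
msg = sel(torch.randn(64), torch.randn(5, 64), torch.randn(5, 2))
print(msg.shape)  # torch.Size([64])
```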
SR-LSTM: State Refinement for LSTM
towards Pedestrian Trajectory Prediction
• Current states of neighbors are important for timely interaction inference.
When predicting for the lady at time t, considering the trajectory of the man on the right up to time t
(a), versus only up to time t − 1 (b), can cause large deviations in the predicted results (dashed lines).
SR-LSTM: State Refinement for LSTM
towards Pedestrian Trajectory Prediction
• Useful information should be adaptively selected from neighbors, based on
their motions and locations.
(a) Activation trajectory patterns of hidden neurons in an LSTM, starting from the origin. Each trajectory
pattern, marked by a certain color, contains the trajectories from the database with the top-20 responses for that
hidden neuron. (b) An example of a three-pedestrian interaction. How will the dyad pay attention to the other
pedestrian on the left?
SR-LSTM: State Refinement for LSTM
towards Pedestrian Trajectory Prediction
Framework overview of SR-LSTM. The state refinement module is an additional subnetwork of
the LSTM cells, which aligns pedestrians and updates their current states. The refined states
are used to predict the location at the next time step.
SR-LSTM: State Refinement for LSTM
towards Pedestrian Trajectory Prediction
• A vanilla LSTM extracts features from each pedestrian trajectory separately.
• The main difference is the state refinement (SR) module, which refines the cell
states by passing messages among pedestrians.
• The SR module takes three information sources for all pedestrians as input: the
pedestrians' current locations, and the hidden states and cell states from the LSTM.
• The output of the SR module is the refined cell states.
• In pedestrian trajectory prediction, further refinement can improve the quality
of the interaction model, reflecting the intention negotiation inherent in human interaction.
• The motion gate and the pedestrian-wise attention jointly select the important information
from neighboring pedestrians for message passing (see the sketch below).
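A rough sketch of this message-passing refinement, with assumed sizes and a simplified update in place of the paper's exact equations:

```python
import torch
import torch.nn as nn

class StateRefinement(nn.Module):
    """Illustrative SR-style refinement: pass messages, update cell states."""
    def __init__(self, hidden=64, iters=2):
        super().__init__()
        self.msg = nn.Linear(2 * hidden + 2, hidden)
        self.update = nn.Linear(2 * hidden, hidden)
        self.iters = iters

    def forward(self, pos, h, c):
        # pos: (N, 2) locations; h: (N, H) hidden states; c: (N, H) cell states
        N = pos.shape[0]
        for _ in range(self.iters):                  # joint, iterative refinement
            rel = pos[None, :, :] - pos[:, None, :]              # (N, N, 2)
            pair = torch.cat([h[:, None].expand(N, N, -1),
                              h[None, :].expand(N, N, -1), rel], dim=-1)
            m = torch.tanh(self.msg(pair)).mean(dim=1)           # (N, H) messages
            c = c + self.update(torch.cat([c, m], dim=-1))       # refine cell states
        return c

sr = StateRefinement()
c_refined = sr(torch.randn(4, 2), torch.randn(4, 64), torch.randn(4, 64))
print(c_refined.shape)  # torch.Size([4, 64])
```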
SR-LSTM: State Refinement for LSTM
towards Pedestrian Trajectory Prediction
In SR-LSTM, the current states of pedestrians can
refine each other in a timely manner, particularly
when pedestrians change their intentions.
SR-LSTM can implicitly account for common
social behaviors, yielding plausible future
predictions and relatively low errors.
Pedestrian Path, Pose, and Intention Prediction Through Gaussian
Process Dynamical Models and Pedestrian Activity Recognition
• 2019. 5 IEEE T-ITS
• Predictions of pedestrian paths can improve current automatic emergency braking systems.
• The goal is to predict future pedestrian paths, poses, and intentions up to 1 s in advance.
• This method is based on balanced Gaussian process dynamical models (B-GPDMs), which
reduce the 3-D time-related information extracted from key points or joints placed along
pedestrian bodies into low-dimensional spaces.
• The B-GPDM is also capable of inferring future latent positions and reconstructing their
associated observations.
• However, learning a generic model for all kinds of pedestrian activities normally provides
less accurate predictions.
• The proposed method obtains multiple models of four types of activity, i.e., walking,
stopping, starting, and standing, and selects the most similar model to estimate future
pedestrian states.
• This method detects starting activities 125 ms after gait initiation with an accuracy of 80%
and recognizes stopping intentions 58.33 ms before the event with an accuracy of 70%.
Pedestrian Path, Pose, and Intention Prediction Through Gaussian
Process Dynamical Models and Pedestrian Activity Recognition
General description of the method based on
B-GPDMs. The algorithm is divided into
two stages: offline training (top) and online
execution (bottom).
The method learns multiple models of each
type of pedestrian activity, i.e. walking,
stopping, starting and standing, and selects
the most appropriate one to estimate future
pedestrian states at each time step.
A training dataset of motion sequences, in
which pedestrians perform different activities,
is split into 8 subsets based on typical
crossing orientations and type of activity.
A B-GPDM is obtained for each sequence
with one activity contained in the dataset.
Pedestrian Path, Pose, and Intention Prediction Through Gaussian
Process Dynamical Models and Pedestrian Activity Recognition
• The proposed method is based on B-GPDMs, which reduce the 3D time-related
positions and displacements extracted from key points or joints placed along the
pedestrian bodies into low-dimensional latent spaces.
• The B-GPDM also has the peculiarity of inferring future latent positions and
reconstructing, from the latent space, the observation associated with a latent position.
• Therefore, it is possible to reconstruct future observations from future latent positions.
• In the online execution, given a new pedestrian observation, the current activity is
determined using an HMM.
• Thus, the selection of the most appropriate model among the trained ones is centered
solely on that activity.
• Finally, the selected model is used to predict the future latent positions and reconstruct
the future pedestrian path and poses.
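This selection-then-prediction loop could be sketched as follows; plain Euclidean template matching is used here as a deliberately simplified stand-in for the paper's HMM-based activity recognition and likelihood-based B-GPDM selection:

```python
import numpy as np

ACTIVITIES = ["walking", "stopping", "starting", "standing"]

def best_model(obs, model_bank):
    """Select the trained sequence most similar to the current observation window.

    obs: (T, D) recent joint observations; model_bank: activity -> list of (L, D)
    training sequences. Euclidean distance stands in for likelihood-based selection.
    """
    best, best_d = None, np.inf
    for activity, seqs in model_bank.items():
        for seq in seqs:
            d = np.linalg.norm(seq[: len(obs)] - obs)
            if d < best_d:
                best, best_d = (activity, seq), d
    return best

# Toy bank: two random placeholder sequences per activity.
rng = np.random.default_rng(0)
bank = {a: [rng.normal(size=(30, 6)) for _ in range(2)] for a in ACTIVITIES}
obs = rng.normal(size=(10, 6))
activity, seq = best_model(obs, bank)
prediction = seq[len(obs): len(obs) + 10]  # next poses from the matched template
print(activity, prediction.shape)          # (1 s ahead at an assumed 10 Hz)
```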
StarNet: Pedestrian Trajectory Prediction using
Deep Neural Network in Star Topology
• 2019.6
• Pedestrian trajectory prediction is crucial for many important applications.
• This problem is a great challenge because of complicated interactions among pedestrians.
• Previous methods model only the pairwise interactions between pedestrians, which not only
oversimplifies the interactions among pedestrians but also is computationally inefficient.
• StarNet has a star topology which includes a unique hub network and multiple host
networks.
• The hub network takes observed trajectories of all pedestrians to produce a comprehensive
description of the interpersonal interactions.
• Then the host networks, each of which corresponds to one pedestrian, consult the
description and predict future trajectories.
• The star topology gives StarNet two advantages over conventional models.
• StarNet is able to consider the collective influence among all pedestrians in the hub network,
making more accurate predictions.
• StarNet is computationally efficient since the number of host networks is linear in the number of
pedestrians.
StarNet: Pedestrian Trajectory Prediction using
Deep Neural Network in Star Topology
The structure of StarNet. StarNet mainly consists of a centralized hub network and several host
networks. The hub network collects movement information and generates a feature that
describes the joint interactions among pedestrians. Each host network, corresponding to a certain
pedestrian, queries the hub network and predicts that pedestrian’s trajectory.
StarNet: Pedestrian Trajectory Prediction using
Deep Neural Network in Star Topology
• Pedestrian path prediction is a great challenge due to the uncertainty of future movements.
• Conventional methods tackle this problem with manually crafted features.
• Data-driven methods remove the requirement for hand-crafted features and greatly
improve the ability to predict pedestrian trajectories.
• However, existing methods compute pairwise features and thus oversimplify the
interactions in real-world environments.
• Meanwhile, they suffer from a huge computational burden in crowded scenes.
• StarNet has two advantages over previous methods.
• 1) The representation describes not only pairwise interactions but also collective ones.
• Such a comprehensive representation enables StarNet to make accurate predictions.
• 2) the interactions between one pedestrian and others are efficiently computed.
• When predicting all pedestrians’ trajectories, the computational time increases linearly, rather
than quadratically, as the number of pedestrians increases.
StarNet: Pedestrian Trajectory Prediction using
Deep Neural Network in Star Topology
The process of predicting the coordinates.
StarNet: Pedestrian Trajectory Prediction using
Deep Neural Network in Star Topology
• The hub network takes all of the observed trajectories simultaneously and produces a
comprehensive representation r of the crowd of pedestrians.
• The representation r includes both spatial and temporal information of the crowd, which is
key to describing the interactions among pedestrians.
• The hub network produces r in two steps: 1) produce a spatial representation of the crowd
for each time step; 2) feed the spatial representation into an LSTM to produce the spatio-
temporal representation r.
• For the i-th pedestrian, the host network first embeds the observed trajectory Oi and then
combines the embedded trajectory with the spatio-temporal representation r to predict
the future trajectory.
• Specifically, the host network predicts the future trajectory in two steps: 1) take the
observed trajectory Oi and the spatio-temporal representation r as input and generate an
integrated representation; 2) predict the future trajectory of the i-th pedestrian from
the observed trajectory Oi and the integrated representation (see the sketch below).
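A toy sketch of the hub/host division of labor, with max-pooling as an assumed crowd aggregation and all sizes illustrative:

```python
import torch
import torch.nn as nn

class Hub(nn.Module):
    """Illustrative hub: spatial pooling per step, then an LSTM over time."""
    def __init__(self, hidden=32):
        super().__init__()
        self.embed = nn.Linear(2, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)

    def forward(self, traj):                # traj: (N, T, 2) observed positions
        spatial = self.embed(traj).max(dim=0).values   # (T, hidden) crowd summary
        r, _ = self.lstm(spatial.unsqueeze(0))         # spatio-temporal rep
        return r.squeeze(0)                            # (T, hidden)

class Host(nn.Module):
    """Illustrative host: fuse one pedestrian's trajectory with the hub output."""
    def __init__(self, hidden=32):
        super().__init__()
        self.embed = nn.Linear(2, hidden)
        self.lstm = nn.LSTM(2 * hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 2)

    def forward(self, traj_i, r):           # traj_i: (T, 2); r: (T, hidden)
        x = torch.cat([self.embed(traj_i), r], dim=-1).unsqueeze(0)
        h, _ = self.lstm(x)
        return self.out(h.squeeze(0))       # (T, 2) next-step displacements

hub, host = Hub(), Host()
trajs = torch.randn(5, 8, 2)                # 5 pedestrians, 8 observed steps
r = hub(trajs)
preds = [host(trajs[i], r) for i in range(5)]  # one shared host per pedestrian
```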
StarNet: Pedestrian Trajectory Prediction using
Deep Neural Network in Star Topology
Predicted trajectories and the
corresponding ground truths.
Different colors indicate different
trajectories. Ground-truth
trajectories are labeled with dots;
the predicted trajectories are
labeled with triangles.
Social Ways: Learning Multi-Modal Distributions
of Pedestrian Trajectories with GANs
• CVPRW 2019
• This paper proposes an approach for predicting the motion of pedestrians interacting with
others.
• It uses a Generative Adversarial Network (GAN) to sample plausible predictions for any
agent in the scene.
• As GANs are very susceptible to mode collapse and mode dropping, the authors show that the
recently proposed Info-GAN brings dramatic improvements in multi-modal pedestrian trajectory
prediction by avoiding these issues.
• The L2 loss is also left out when training the generator, unlike in some previous works, because it
causes serious mode collapse despite faster convergence.
• Experiments on real and synthetic data show that the proposed method generates more
diverse samples and preserves the modes of the predictive distribution.
• In particular, to support this claim, the authors designed a toy dataset of trajectories that
can be used to assess how well different methods preserve the predictive
distribution modes.
Social Ways: Learning Multi-Modal Distributions
of Pedestrian Trajectories with GANs
Illustration of the trajectory prediction problem. Having the observed trajectories
of a pedestrian of interest, here shown with a *, and the ones of other pedestrians
in the environment, the system should be able to build a predictive distribution of
possible trajectories (here with two modes in dashed yellow lines).
Social Ways: Learning Multi-Modal Distributions
of Pedestrian Trajectories with GANs
• When deciding on steering actions, a pedestrian anticipates likely scenarios
about the evolution of their surroundings in the near future.
• This anticipation may not always be easy, because of the
uncertainties in the neighbors’ future motions and intentions.
• In most recent NN-based motion prediction systems, the input is
the set of most recent observations of the surrounding pedestrians.
• Hence, the mappings from observations to predicted trajectories built by
the networks do not explicitly consider the uncertain and multimodal nature of
the neighbors’ future trajectories; in a way, the network is expected to
learn this too, which may be too much to expect.
• The Social Ways GAN generates independent random trajectory samples
that mimic the distribution of trajectories among our training data,
conditioned on observed initial tracklets of duration τ for all the agents in the
scene.
Social Ways: Learning Multi-Modal Distributions
of Pedestrian Trajectories with GANs
Block Diagram of the Social Ways prediction system. The yellow ellipses represent loss calculations. The dashed
arrows show the backpropagation directions. The bold arrows carry ground truth data.
Social Ways: Learning Multi-Modal Distributions
of Pedestrian Trajectories with GANs
• GAN training is known to be hard: it may not converge, may exhibit vanishing gradients when
there is an imbalance between the generator and the discriminator, or may be subject to mode
collapse, i.e., sampling of synthetic data without diversity.
• When predicting pedestrian motion, it is critical to avoid mode collapse, as it could result
in catastrophic decisions, e.g., for an autonomous driving agent.
• Two major changes are introduced into the GAN training.
• First, no L2 loss enforcing the generated samples to be close to the true data is used, because
this term was observed to harm the diversity of the generated samples.
• Second, an Info-GAN architecture is implemented, which has a very positive impact on avoiding
mode collapse relative to other GAN variants.
• Info-GAN learns disentangled representations of the sources of variation in the data, and
does so by introducing a new coding variable c as an input.
• Training adds a term that maximizes a lower bound on the mutual
information between the distribution of c and the distribution of the generated outputs;
this requires training another sub-network that serves as a surrogate to evaluate the
likelihoods over the generated data (see the sketch below).
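A minimal Info-GAN-style generator update in this spirit (toy networks, assumed loss weight): the generator loss contains no L2 term to the ground truth, and a surrogate network Q must recover the code c from the generated sample, which bounds the mutual information from below:

```python
import torch
import torch.nn as nn

# Toy networks (sizes are assumptions, not the paper's architecture).
G = nn.Sequential(nn.Linear(16 + 2, 32), nn.ReLU(), nn.Linear(32, 2))  # generator
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))       # discriminator
Q = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 2))       # surrogate for c

bce = nn.BCEWithLogitsLoss()
z = torch.randn(64, 16)                     # noise vector
c = torch.randn(64, 2)                      # latent code to be preserved in outputs
fake = G(torch.cat([z, c], dim=-1))

# Generator objective: fool D -- no L2 term to ground-truth samples (per the paper).
g_loss = bce(D(fake), torch.ones(64, 1))
# Mutual-information lower bound: Q must recover c from the generated sample.
info_loss = ((Q(fake) - c) ** 2).mean()     # Gaussian log-likelihood up to constants
total = g_loss + 0.1 * info_loss            # 0.1 is an assumed weighting
total.backward()
```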
Social Ways: Learning Multi-Modal Distributions
of Pedestrian Trajectories with GANs
Sample outputs (in magenta). Observed trajectories are shown in blue; the ground-truth
prediction and constant-velocity predictions are shown as cyan and orange lines, respectively.
Multi-Agent Tensor Fusion for Contextual
Trajectory Prediction
• Accurate prediction of others’ trajectories is essential for autonomous driving.
• Trajectory prediction is challenging because it requires reasoning about agents’ past
movements, social interactions among varying numbers and kinds of agents, constraints
from the scene context, and the stochasticity of human behavior.
• This approach models these interactions and constraints jointly within a Multi-Agent
Tensor Fusion (MATF) network.
• Specifically, the model encodes multiple agents’ past trajectories and the scene context
into a Multi-Agent Tensor, then applies convolutional fusion to capture multiagent
interactions while retaining the spatial structure of agents and the scene context.
• The model decodes recurrently to multiple agents’ future trajectories, using adversarial
loss to learn stochastic predictions.
• Experiments on both highway driving and pedestrian crowd datasets show that the model
achieves state-of-the-art prediction accuracy.
2019.7
Multi-Agent Tensor Fusion for Contextual
Trajectory Prediction
• There are two parallel encoding streams in the MATF architecture.
• One encodes the past trajectories of each individual agent xi independently using single agent
LSTM encoders, and another encodes the static scene context image c with a CNN.
• Each LSTM encoder shares the same set of parameters, so the architecture is invariant to the
number of agents in the scene.
• The outputs of the LSTM encoders are 1-D agent state vectors {x′1, x′2, ..., x′n} without
temporal structure.
• The output of the scene context encoder CNN is a scaled feature map c′ retaining the spatial
structure of the bird’s-eye view static scene context image.
• Next, the two encoding streams are concatenated spatially into a Multi-Agent Tensor.
• Agent encodings {x′1, x′2, .., x′n} are placed into one bird’s-eye view spatial tensor, which is
initialized to 0 and is of the same shape (width and height) as the encoded scene image c′.
Multi-Agent Tensor Fusion for Contextual
Trajectory Prediction
• The dimension axis of the encodings fits into the channel axis of the tensor.
• The agent encodings are placed into the spatial tensor with respect to their positions at the
last time step of their past trajectories.
• This tensor is then concatenated with the encoded scene image in the channel dimension to
get a combined tensor. If multiple agents are placed into the same cell in the tensor due to
discretization, element-wise max pooling is performed.
• The Multi-Agent Tensor is fed into fully convolutional layers, which learn to represent
interactions among multiple agents and between agents and the scene context, while
retaining spatial locality, to produce a fused Multi-Agent Tensor.
• Specifically, these layers operate at multiple spatial resolution scales by adopting
U-Net-like architectures to model interaction at different spatial scales.
• The output feature map of this fused model c′′ has exactly the same shape as c′ in width and
height to retain the spatial structure of the encoding.
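A sketch of the Multi-Agent Tensor construction just described, assuming coordinates normalized to [0, 1) and toy channel counts:

```python
import torch

def build_agent_channels(agent_vecs, positions, H=32, W=32):
    """Scatter agent encodings into a bird's-eye grid (illustrative sketch).

    agent_vecs: (N, C) LSTM-encoded agent states; positions: (N, 2) in [0, 1).
    Multiple agents falling into one cell are combined by element-wise max.
    """
    N, C = agent_vecs.shape
    grid = torch.zeros(C, H, W)                          # initialized to 0
    cells = (positions * torch.tensor([H, W])).long()    # discretize coordinates
    for vec, (r, col) in zip(agent_vecs, cells):
        grid[:, r, col] = torch.maximum(grid[:, r, col], vec)  # max pooling
    return grid

agents = torch.randn(4, 8).abs()              # 4 agents, 8-d encodings
pos = torch.rand(4, 2)                        # last observed positions
agent_channels = build_agent_channels(agents, pos)
scene_channels = torch.randn(16, 32, 32)      # CNN-encoded scene context c'
multi_agent_tensor = torch.cat([agent_channels, scene_channels], dim=0)
print(multi_agent_tensor.shape)  # torch.Size([24, 32, 32])
```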
Multi-Agent Tensor Fusion for Contextual
Trajectory Prediction
The Multi-Agent Tensor encoding is a spatial
feature map of the scene context and multiple
agents from an overhead perspective, including
agent channels (above) and context channels
(below). Agents’ feature vectors (red) output
from single-agent LSTM encoders are placed
spatially w.r.t. agents’ coordinates to form the
agent channels. The agent channels are aligned
spatially with the context channels (a context
feature map) output from scene context
encoding layers to retain the spatial structure.
Multi-Agent Tensor Fusion for Contextual
Trajectory Prediction
• To decode each agent’s predicted trajectory, agent-specific representations with fused
interaction features for each agent {x1′′, x2′′, ..., xn′′} are sliced out according to their
coordinates from the fused Multi-Agent Tensor output c′′.
• These agent-specific representations are then added as a residual to the original encoded
agent vectors to form final agent encoding vectors {x1′ + x1′′ , x2′ + x2′′ , ..., xn′ + xn′′ }, which
encode all the information from the past trajectories of the agents themselves, the static
scene context, and the interaction features among multiple agents.
• In this way, this approach allows each agent to get a different social and contextual
embedding focused on itself.
• Importantly, the model gets these embeddings for multiple agents using shared feature
extractors instead of operating n times for n agents.
• Finally, for each agent in the scene, its final vector xi′ + xi′′ is decoded to future trajectory
prediction yiˆ by LSTM decoders.
• Similar to the encoders for each agent, parameters are shared to guarantee that the network
can generalize well when the number of agents in the scene varies.
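A sketch of this decoding stage: agent-specific features are sliced from the fused tensor c′′, added as residuals to the original encodings, and decoded by a shared LSTM (all sizes assumed):

```python
import torch
import torch.nn as nn

C, H, W = 8, 32, 32
fused = torch.randn(C, H, W)                 # c'' from the convolutional fusion
x_enc = torch.randn(4, C)                    # x'_i: original agent encodings
cells = torch.randint(0, 32, (4, 2))         # agents' last observed grid cells

x_fused = torch.stack([fused[:, r, c] for r, c in cells])  # x''_i sliced out
final = x_enc + x_fused                      # residual: x'_i + x''_i

decoder = nn.LSTMCell(2, C)                  # weights shared across all agents
out = nn.Linear(C, 2)
h, c_state = final, torch.zeros_like(final)  # initialize from the final encoding
pos = torch.zeros(4, 2)
preds = []
for _ in range(12):                          # roll out 12 future steps
    h, c_state = decoder(pos, (h, c_state))
    pos = out(h)
    preds.append(pos)
future = torch.stack(preds, dim=1)           # (4 agents, 12 steps, 2)
```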
Multi-Agent Tensor Fusion for Contextual
Trajectory Prediction
Illustration of the Multi-Agent Tensor Fusion (MATF) architecture.
Multi-Agent Tensor Fusion for Contextual
Trajectory Prediction
Stanford Drone dataset. From left to right: MATF Multi Agent Scene, MATF Multi Agent, and LSTM. Blue: past trajectories,
red: ground truth, green: predictions. The closer the green predicted trajectory is to the red ground truth future
trajectory, the more accurate the prediction. The model predicts that (1) two agents entering the roundabout from the
top will exit to the left; (2) one agent coming from the left on the pathway above the roundabout is turning left to move
toward the top of the image; (3) one agent is decelerating at the door of the building above and to the right of the
roundabout. (4) In one interesting failure case, an agent on the top-right of the roundabout is turning right to move
toward the top of the image; the model predicts the turn, but not how sharp it will be.
Which Way Are You Going? Imitative Decision
Learning for Path Forecasting in Dynamic Scenes
• An Imitative Decision Learning (IDL) approach is proposed here, which delves deeper into the key
factor that inherently characterizes the multimodality: the latent decision.
• The proposed IDL first infers the distribution of such latent decisions by learning from moving
histories.
• A policy is then generated by taking the sampled latent decision into account to predict the
future.
• Different plausible upcoming paths correspond to each sampled latent decision.
• This approach significantly differs from the mainstream literature that relies on a predefined
latent variable to extrapolate diverse predictions.
• To augment the understanding of the latent decision and the resultant multimodal future,
the authors investigate their connection through mutual information optimization.
• Moreover, the proposed IDL integrates spatial and temporal dependencies into one single
framework, in contrast to handling them with two-step settings.
• This approach enables simultaneously anticipating the paths of all pedestrians in the scene.
CVPR 2019
Which Way Are You Going? Imitative Decision
Learning for Path Forecasting in Dynamic Scenes
The multimodal nature of future paths in a dynamic scene: there are multiple plausible
forthcoming paths (the dashed red and cyan lines) based on identical historical moving
records (the solid red and cyan lines). Only three possibilities are displayed here as an example.
One issue that has been challenging for path forecasting in dynamic scenes is the multimodal
nature of the future: given a set of historical observations, there is more than one probable future.
Despite tremendous progress in foreseeing a deterministic future, the
majority of existing studies fail to consider the multiple possibilities of the future.
Which Way Are You Going? Imitative Decision
Learning for Path Forecasting in Dynamic Scenes
• This work focuses on understanding and imitating the underlying human decision-making
process to anticipate future paths in dynamic scenes.
• Fundamentally, IDL can be viewed as jointly training:
• (1) an inference sub-network L that extrapolates the latent decision,
• (2) a policy/generator π that recovers a policy to generate upcoming paths,
• (3) a statistics sub-network Q that discovers the impact of latent decision on predictions,
• (4) a discriminator D that attempts to differentiate our generated outcomes from the expert
demonstrations.
• The detailed schematic diagram for forecasting future paths is shown in the following figure.
Which Way Are You Going? Imitative Decision
Learning for Path Forecasting in Dynamic Scenes
• The red arrows indicate the direction of information flow between each module.
• The black arrows suggest the direction of information flow inside a module.
• The historical trajectories are input into the inference sub-network to infer the distribution of latent decisions.
• The temporal convolutional sub-module receives the output from the pre-trained convolutional
sub-module and produces a two-unit vector.
• A pre-trained deconvolutional sub-module and a softmax layer read each unit to form the mean and
deviation of a Gaussian distribution over latent decisions.
• Meanwhile, the encoder of the policy/generator π processes the historical trajectories with a ConvGRU layer.
• An element-wise addition of the encoded hidden states h^enc_tk and the sampled latent decision S
initializes the decoder.
• The final predictions are generated from the decoded hidden states h^dec_t′ through a deconvolutional layer.
• The statistics sub-network reads the prediction and the latent decision to measure the significance of S
(see the sketch below).
• The discriminator distinguishes predictions from the ground-truth future paths (expert demonstrations).
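A toy version of this sampling step, where a Gaussian latent decision S is drawn and added element-wise to the encoder's final hidden state to initialize the decoder; a plain GRU stands in for the ConvGRU, and all sizes are assumptions:

```python
import torch
import torch.nn as nn

hidden = 16
mu = torch.zeros(hidden)                     # mean from the inference sub-network
sigma = torch.ones(hidden) * 0.5             # deviation from the softmax branch

def sample_decision():
    return mu + sigma * torch.randn(hidden)  # reparameterized latent decision S

encoder = nn.GRU(2, hidden, batch_first=True)    # stand-in for the ConvGRU
history = torch.randn(1, 8, 2)                   # one agent, 8 observed steps
_, h_enc = encoder(history)                      # hidden state at the last step

decoder = nn.GRU(2, hidden, batch_first=True)
out = nn.Linear(hidden, 2)
# Each sampled decision S yields a different plausible future path:
for k in range(3):
    h0 = h_enc + sample_decision()               # element-wise addition
    y, _ = decoder(torch.zeros(1, 12, 2), h0)
    print(k, out(y).shape)                       # (1, 12, 2) predicted path
```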
Which Way Are You Going? Imitative Decision
Learning for Path Forecasting in Dynamic Scenes
Qualitative comparisons on the SAP dataset. The top left shows the observed records and the matching ground
truth (G.T.). For a clear visualization of the multimodality, several trajectories and their
diverse predicted paths are illustrated separately in examples 1 to 5.
TraPHic: Trajectory Prediction in Dense and
Heterogeneous Traffic Using Weighted Interactions
• CVPR 2019
• An algorithm for predicting the near-term trajectories of road agents in dense traffic videos.
• This approach is designed for heterogeneous traffic, where the road agents may correspond
to buses, cars, scooters, bicycles, or pedestrians.
• It models the interactions between different road agents using a novel LSTM-CNN hybrid
network for trajectory prediction.
• In particular, it takes into account heterogeneous interactions that implicitly account for the
varying shapes, dynamics, and behaviors of different road agents.
• It also models horizon-based interactions which are used to implicitly model the driving
behavior of each road agent.
• The prediction algorithm, TraPHic, is evaluated on standard datasets and on a new dense,
heterogeneous traffic dataset of urban Asian videos and agent trajectories.
TraPHic: Trajectory Prediction in Dense and
Heterogeneous Traffic Using Weighted Interactions
• Two observations:
• 1) Road agents in such dense traffic do not react to every road agent
around them; rather, they selectively focus attention on key interactions in a semi-elliptical region
in the field of view, which is called the “horizon”;
• 2) To capture heterogeneous road-agent dynamics, their properties are embedded into the state-space
representation of the road agents and fed into the hybrid network.
• TraPHic Network:
• Input embeddings are generated for all agents based on trajectory information and heterogeneous
dynamic constraints such as agent shape, velocity, traffic concentration at the agent’s spatial
coordinates, and other parameters.
• These embeddings are passed through LSTMs and eventually used to construct the horizon map,
the neighbor map, and the ego agent’s own tensor map.
• The horizon and neighbor maps are passed through separate ConvNets and then concatenated
together with the ego agent tensor to produce latent representations.
• Finally, these latent representations are passed through an LSTM to generate a trajectory
prediction for the ego agent, as in the sketch below.
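A rough, shape-level sketch of this pipeline; the horizon and neighbor maps are treated as prebuilt grids of LSTM states, and all sizes and layer choices are assumptions:

```python
import torch
import torch.nn as nn

# Small ConvNet head shared in structure by both map branches (assumed design).
def conv_head(channels):
    return nn.Sequential(nn.Conv2d(channels, 8, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())

horizon_net, neighbor_net = conv_head(4), conv_head(4)
horizon_map = torch.randn(1, 4, 16, 16)   # LSTM states in the semi-elliptical horizon
neighbor_map = torch.randn(1, 4, 16, 16)  # LSTM states in the elliptical neighborhood
ego_tensor = torch.randn(1, 8)            # ego agent's own encoded state

latent = torch.cat([horizon_net(horizon_map),
                    neighbor_net(neighbor_map), ego_tensor], dim=-1)  # (1, 24)

decoder = nn.LSTM(24, 32, batch_first=True)
head = nn.Linear(32, 2)
steps = latent.unsqueeze(1).repeat(1, 10, 1)   # feed fused features at each step
h, _ = decoder(steps)
prediction = head(h)                           # (1, 10, 2) future trajectory
```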
TraPHic: Trajectory Prediction in Dense and
Heterogeneous Traffic Using Weighted Interactions
TraPHic Network Architecture: The ego agent is marked by the red dot. The green elliptical region around it is its
neighborhood and the cyan semi-elliptical region in front of it is its horizon.
TraPHic: Trajectory Prediction in Dense and
Heterogeneous Traffic Using Weighted Interactions
Trajectory prediction results: the performance of various trajectory prediction methods on the TRAF dataset with
different types of road agents. The ground-truth (GT) trajectory is a solid green line, and the TraPHic prediction a solid red line.
The prediction results of other methods (RNN-ED, S-LSTM, S-GAN, CS-LSTM) are drawn with different dashed lines.
Learning to Infer Relations for Future
Trajectory Forecast
• Relational inference flexibly defines ‘an object’ as a spatial feature representation
extracted from each region of the discretized grid, regardless of what exists in that region.
• Inferring relational behavior between road users as well as road users and their surrounding
physical space is an important step toward effective modeling and prediction of navigation
strategies adopted by participants in road scenes.
• Here is a relation-aware framework for future trajectory forecast, which aims to infer
relational information from the interactions of road users with each other and with
environments.
• To address the different importance of relations, a relation gate module (RGM) with
an internal gating process is designed.
• The RGM controls the information flow through multiple switch gates and
identifies descriptive relations that strongly influence the future motion of the target,
conditioned on its past trajectory.
CVPR 2019
Learning to Infer Relations for Future
Trajectory Forecast
The proposed gated relation encoder (GRE) visually discovers both human-human (j-th region: woman ↔ man)
and human-space interactions (i-th region: cyclist ↔ cone) from each region of the discretized grid over time.
Learning to Infer Relations for Future
Trajectory Forecast
• In this framework, an object is a visual encoding of the spatial behavior of road users (if they
exist) and environmental representations, together with their temporal interactions over time,
which naturally corresponds to local human-human and human-space interactions in each
region of the discretized grid.
• On top of this, relational behavior is inferred from all objects (i.e., from the spatio-temporal
interactions in the context) from a global perspective.
• Given a sequence of images, the gated relation encoder (GRE) visually extracts spatio-
temporal interactions (i.e., objects) through the spatial behavior encoder (SBE) and the
temporal interaction encoder (TIE).
• The RGM of the GRE infers pair-wise relations from objects and then focuses on
which relations will be potentially meaningful for forecasting the future motion of the target,
given its past behavior (see the sketch below).
• Future locations are predicted from the aggregated relational features through the trajectory
prediction network (TPN) in the form of heatmaps, which can be further refined by
considering spatial dependencies between predicted locations and extended to learn the
uncertainty of the future forecast at test time.
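A toy relation gate in this spirit: pairwise relations among region features are weighted by switch gates conditioned on the target's past trajectory; layer choices and sizes are assumptions:

```python
import torch
import torch.nn as nn

class RelationGate(nn.Module):
    """Illustrative relation gate: weight pairwise relations by the target's past."""
    def __init__(self, obj_dim=16, traj_dim=8):
        super().__init__()
        self.rel = nn.Linear(2 * obj_dim, obj_dim)
        self.gate = nn.Linear(2 * obj_dim + traj_dim, obj_dim)

    def forward(self, objects, target_past):
        # objects: (K, obj_dim) per-region spatio-temporal features
        # target_past: (traj_dim,) encoding of the target's own trajectory
        K = objects.shape[0]
        pair = torch.cat([objects[:, None].expand(K, K, -1),
                          objects[None, :].expand(K, K, -1)], dim=-1)
        r = torch.relu(self.rel(pair))                          # pairwise relations
        g = torch.sigmoid(self.gate(torch.cat(
            [pair, target_past.expand(K, K, -1)], dim=-1)))     # switch gates
        return (g * r).sum(dim=(0, 1))                          # aggregated relation

rgm = RelationGate()
feat = rgm(torch.randn(6, 16), torch.randn(8))
print(feat.shape)  # torch.Size([16])
```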
Learning to Infer Relations for Future
Trajectory Forecast
The efficacy of the spatial refinement network
(SRN) for spatial dependencies.
Learning to Infer Relations for Future
Trajectory Forecast
• Predicted heatmaps from the TPN are sometimes ambiguous.
• The main cause of this issue is a lack of spatial dependencies among predictions.
• Since the network independently predicts δ heatmaps, there is no constraint enforcing
them to be spatially aligned.
• Thus, a spatial refinement network (SRN) is designed to learn implicit spatial dependencies in
feature space.
• Intermediate activations (early and late features) of the TPN are first concatenated and passed
through the SRN, which uses large receptive fields.
• As a result, the outputs show less confusion between heatmap locations, making use of rich
contextual information from neighboring predictions.
Learning to Infer Relations for Future
Trajectory Forecast
The efficacy of the uncertainty embedding into our framework with MC dropout.
Learning to Infer Relations for Future
Trajectory Forecast
• Bayesian neural networks (BNNs) have been considered for tackling the uncertainty of the
network’s weight parameters.
• Inference in BNNs can be approximated by sampling from the posterior
distribution of the deterministic network’s weight parameters using Monte Carlo (MC) dropout.
• This performs approximate variational inference by using dropout at test time to draw multiple
samples from the dropout distribution.
• It literally enables capturing multiple plausible trajectories over the uncertainties of the
network’s learned weight parameters.
• Here, the mean of L samples is used as the prediction, which best approximates variational
inference in BNNs.
• The variance of L = 5 samples is computed to measure the uncertainty (see the sketch below).
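A minimal MC-dropout sketch: dropout is kept active at inference, and the mean and variance over L stochastic forward passes give the prediction and its uncertainty (the toy network is an assumption):

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(),
                    nn.Dropout(p=0.5), nn.Linear(32, 2))

def mc_predict(x, L=5):
    net.train()                      # keeps dropout stochastic during inference
    samples = torch.stack([net(x) for _ in range(L)])
    return samples.mean(dim=0), samples.var(dim=0)  # prediction, uncertainty

x = torch.randn(1, 4)
mean, var = mc_predict(x)
print(mean, var)
```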
Learning to Infer Relations for Future
Trajectory Forecast
Qualitative evaluation. (Color codes: Yellow - given past trajectory, Red - ground-truth, and Green - this prediction)
Illustrations of prediction during complicated human-human interactions. (a) A cyclist (•••) interacts with a person moving
slowly (•••). (b) A person (•••) meets a group of people. (c) A cyclist (•••) first interacts with another cyclist in front (•••)
and then considers the influence of a person (•••). This approach socially avoids potential collisions.
Peeking into the Future: Predicting Future
Person Activities and Locations in Videos
• CVPR 2019
• Deciphering human behaviors from videos to predict their future paths/trajectories and what
they would do is important in many applications.
• Therefore, this work studies predicting a pedestrian’s future path jointly with future activities.
• They propose an end-to-end, multi-task learning system, called Next, utilizing rich visual
features about human behavioral information and interaction with their surroundings.
• It encodes a person through rich semantic features about visual appearance, body
movement and interaction with the surroundings, motivated by the fact that humans derive
such predictions by relying on similar visual cues.
• To facilitate the training, the network is learned with an auxiliary task of predicting future
location in which the activity will happen.
• In the auxiliary task, a discretized grid called the Manhattan Grid is designed as the location
prediction target for the system.
Peeking into the Future: Predicting Future
Person Activities and Locations in Videos
The goal is to jointly predict a person’s future path and activity. The green and yellow lines show two
possible future trajectories, and two possible activities are shown in the green and yellow boxes.
Depending on the future activity, the person (top right) may take different paths, e.g., the yellow path
for “loading” and the green path for “object transfer”.
Peeking into the Future: Predicting Future
Person Activities and Locations in Videos
• Humans navigate through public spaces often with specific purposes in mind, ranging from
simple ones like entering a room to more complicated ones like putting things into a car.
• Such intention, however, is mostly neglected in existing work.
• The joint prediction model can have two benefits:
• 1) learning the activity together with the path may benefit the future path prediction;
Intuitively, humans are able to read from others’ body language to anticipate whether they
are going to cross the street or continue walking along the sidewalk.
• 2) the joint model advances the capability of understanding not only the future path but also
the future activity by taking into account the rich semantic context in videos; this increases
the capabilities of automated video analytics for social good, such as safety applications like
anticipating pedestrian movement at traffic intersections or a road robot helping humans
transport goods to a car.
Peeking into the Future: Predicting Future
Person Activities and Locations in Videos
Overview of the Next model. Given a sequence of frames containing the person for prediction, this model utilizes
a person behavior module and a person interaction module to encode rich visual semantics into a feature tensor.
Peeking into the Future: Predicting Future
Person Activities and Locations in Videos
• 4 Key components:
• Person behavior module extracts visual information from the behavioral sequence of
the person.
• Person interaction module looks at the interaction between a person and their
surroundings.
• Trajectory generator summarizes the encoded visual features and predicts the future
trajectory by the LSTM decoder with focal attention.
• Activity prediction utilizes rich visual semantics to predict the future activity label for
the person.
• In addition, the scene is divided into a discretized grid of multiple scales, called the
Manhattan Grid, on which classification and regression are computed for robust activity
location prediction.
Peeking into the Future: Predicting Future
Person Activities and Locations in Videos
To model appearance changes of a person, a pre-trained object detection model with “RoIAlign”
is utilized to extract fixed-size CNN features for each person bounding box.
The features are averaged along the spatial dimensions for each person and fed into an LSTM
encoder, yielding a feature representation of size Tobs × d, where d is the hidden size of the LSTM. To
capture body movement, a person keypoint detection model is utilized to extract person keypoint
information. A linear transformation embeds the keypoint coordinates before they are fed into
the LSTM encoder; the encoded feature also has shape Tobs × d (see the sketch below). These appearance
and movement features are commonly used in a wide variety of studies and thus do not introduce new
concerns about machine learning fairness.
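A shape-level sketch of these two encoders, assuming 256-channel RoIAlign features and 17 body keypoints (both are illustrative assumptions):

```python
import torch
import torch.nn as nn

T_obs, d = 8, 32

appearance = torch.randn(T_obs, 256, 7, 7)        # RoIAlign features per frame
app_feat = appearance.mean(dim=(2, 3))            # average over spatial dims
app_lstm = nn.LSTM(256, d, batch_first=True)
app_enc, _ = app_lstm(app_feat.unsqueeze(0))      # (1, T_obs, d)

keypoints = torch.randn(T_obs, 17 * 2)            # 17 keypoints, xy per frame
embed = nn.Linear(17 * 2, d)                      # linear keypoint embedding
kp_lstm = nn.LSTM(d, d, batch_first=True)
kp_enc, _ = kp_lstm(embed(keypoints).unsqueeze(0))  # (1, T_obs, d)

print(app_enc.shape, kp_enc.shape)  # both (1, 8, 32), i.e. Tobs x d each
```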
Peeking into the Future: Predicting Future
Person Activities and Locations in Videos
The person-objects feature can capture how far away the person is from other
people and from cars. The person-scene feature can capture whether the person is
near the sidewalk or grass. This information is provided to the model in the hope
that it learns, for example, that a person walks more often on the sidewalk than on
the grass and tends to avoid bumping into cars.
Peeking into the Future: Predicting Future
Person Activities and Locations in Videos
• An LSTM decoder is used to directly predict the future trajectory in xy-coordinates.
• The hidden state of this decoder is initialized using the last state of the person’s trajectory
LSTM encoder.
• An auxiliary task, activity location prediction, is added in addition to predicting the future
activity label of the person.
• At each time instant, the xy-coordinate is computed from the decoder state by a
fully connected layer.
• The model employs an effective focal attention, originally proposed for multimodal inference
over a sequence of images in visual question answering; its key idea is to project
multiple features into a correlation space, where discriminative features are easier for
the attention mechanism to capture (see the sketch below).
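A simplified reading of focal attention, correlating a decoder query with several feature sequences and attending hierarchically, first within and then across sequences (details are simplified assumptions):

```python
import torch

def focal_attention(q, K_feats):
    """Illustrative focal-attention pooling over multiple feature sequences.

    q: (d,) decoder query; K_feats: (M, T, d) M feature types over T steps.
    Correlations are computed per type and per step, then softmax-pooled
    hierarchically, which is the flavor of the mechanism (details simplified).
    """
    scores = (K_feats * q).sum(-1)                       # (M, T) correlations
    step_w = torch.softmax(scores, dim=-1)               # attend within each type
    pooled = (step_w.unsqueeze(-1) * K_feats).sum(1)     # (M, d)
    type_w = torch.softmax((pooled * q).sum(-1), dim=0)  # attend across types
    return (type_w.unsqueeze(-1) * pooled).sum(0)        # (d,) context vector

ctx = focal_attention(torch.randn(32), torch.randn(4, 8, 32))
print(ctx.shape)  # torch.Size([32])
```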
Peeking into the Future: Predicting Future
Person Activities and Locations in Videos
To bridge the gap between trajectory generation and activity label prediction, an activity
location prediction (ALP) module is proposed to predict the final location where the person will engage in the future
activity. The activity location prediction includes two tasks: location classification and location regression.
Peeking into the Future: Predicting Future
Person Activities and Locations in Videos
Qualitative comparison between this method and the baselines. Yellow path is the observable trajectory and
green path is the ground truth trajectory during the prediction period. Predictions are shown as blue heatmaps.
 
Techniques and Challenges in Autonomous Driving
Techniques and Challenges in Autonomous DrivingTechniques and Challenges in Autonomous Driving
Techniques and Challenges in Autonomous DrivingYu Huang
 
BEV Joint Detection and Segmentation
BEV Joint Detection and SegmentationBEV Joint Detection and Segmentation
BEV Joint Detection and SegmentationYu Huang
 
BEV Object Detection and Prediction
BEV Object Detection and PredictionBEV Object Detection and Prediction
BEV Object Detection and PredictionYu Huang
 
Fisheye based Perception for Autonomous Driving VI
Fisheye based Perception for Autonomous Driving VIFisheye based Perception for Autonomous Driving VI
Fisheye based Perception for Autonomous Driving VIYu Huang
 
Fisheye/Omnidirectional View in Autonomous Driving V
Fisheye/Omnidirectional View in Autonomous Driving VFisheye/Omnidirectional View in Autonomous Driving V
Fisheye/Omnidirectional View in Autonomous Driving VYu Huang
 
Fisheye/Omnidirectional View in Autonomous Driving IV
Fisheye/Omnidirectional View in Autonomous Driving IVFisheye/Omnidirectional View in Autonomous Driving IV
Fisheye/Omnidirectional View in Autonomous Driving IVYu Huang
 
Prediction,Planninng & Control at Baidu
Prediction,Planninng & Control at BaiduPrediction,Planninng & Control at Baidu
Prediction,Planninng & Control at BaiduYu Huang
 
Cruise AI under the Hood
Cruise AI under the HoodCruise AI under the Hood
Cruise AI under the HoodYu Huang
 
LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)
LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)
LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)Yu Huang
 
Scenario-Based Development & Testing for Autonomous Driving
Scenario-Based Development & Testing for Autonomous DrivingScenario-Based Development & Testing for Autonomous Driving
Scenario-Based Development & Testing for Autonomous DrivingYu Huang
 
How to Build a Data Closed-loop Platform for Autonomous Driving?
How to Build a Data Closed-loop Platform for Autonomous Driving?How to Build a Data Closed-loop Platform for Autonomous Driving?
How to Build a Data Closed-loop Platform for Autonomous Driving?Yu Huang
 
Annotation tools for ADAS & Autonomous Driving
Annotation tools for ADAS & Autonomous DrivingAnnotation tools for ADAS & Autonomous Driving
Annotation tools for ADAS & Autonomous DrivingYu Huang
 
Multi sensor calibration by deep learning
Multi sensor calibration by deep learningMulti sensor calibration by deep learning
Multi sensor calibration by deep learningYu Huang
 
Data pipeline and data lake for autonomous driving
Data pipeline and data lake for autonomous drivingData pipeline and data lake for autonomous driving
Data pipeline and data lake for autonomous drivingYu Huang
 
Open Source codes of trajectory prediction & behavior planning
Open Source codes of trajectory prediction & behavior planningOpen Source codes of trajectory prediction & behavior planning
Open Source codes of trajectory prediction & behavior planningYu Huang
 
Lidar in the adverse weather: dust, fog, snow and rain
Lidar in the adverse weather: dust, fog, snow and rainLidar in the adverse weather: dust, fog, snow and rain
Lidar in the adverse weather: dust, fog, snow and rainYu Huang
 
Autonomous Driving of L3/L4 Commercial trucks
Autonomous Driving of L3/L4 Commercial trucksAutonomous Driving of L3/L4 Commercial trucks
Autonomous Driving of L3/L4 Commercial trucksYu Huang
 

More from Yu Huang (20)

Application of Foundation Model for Autonomous Driving
Application of Foundation Model for Autonomous DrivingApplication of Foundation Model for Autonomous Driving
Application of Foundation Model for Autonomous Driving
 
The New Perception Framework in Autonomous Driving: An Introduction of BEV N...
The New Perception Framework  in Autonomous Driving: An Introduction of BEV N...The New Perception Framework  in Autonomous Driving: An Introduction of BEV N...
The New Perception Framework in Autonomous Driving: An Introduction of BEV N...
 
Data Closed Loop in Simulation Test of Autonomous Driving
Data Closed Loop in Simulation Test of Autonomous DrivingData Closed Loop in Simulation Test of Autonomous Driving
Data Closed Loop in Simulation Test of Autonomous Driving
 
Techniques and Challenges in Autonomous Driving
Techniques and Challenges in Autonomous DrivingTechniques and Challenges in Autonomous Driving
Techniques and Challenges in Autonomous Driving
 
BEV Joint Detection and Segmentation
BEV Joint Detection and SegmentationBEV Joint Detection and Segmentation
BEV Joint Detection and Segmentation
 
BEV Object Detection and Prediction
BEV Object Detection and PredictionBEV Object Detection and Prediction
BEV Object Detection and Prediction
 
Fisheye based Perception for Autonomous Driving VI
Fisheye based Perception for Autonomous Driving VIFisheye based Perception for Autonomous Driving VI
Fisheye based Perception for Autonomous Driving VI
 
Fisheye/Omnidirectional View in Autonomous Driving V
Fisheye/Omnidirectional View in Autonomous Driving VFisheye/Omnidirectional View in Autonomous Driving V
Fisheye/Omnidirectional View in Autonomous Driving V
 
Fisheye/Omnidirectional View in Autonomous Driving IV
Fisheye/Omnidirectional View in Autonomous Driving IVFisheye/Omnidirectional View in Autonomous Driving IV
Fisheye/Omnidirectional View in Autonomous Driving IV
 
Prediction,Planninng & Control at Baidu
Prediction,Planninng & Control at BaiduPrediction,Planninng & Control at Baidu
Prediction,Planninng & Control at Baidu
 
Cruise AI under the Hood
Cruise AI under the HoodCruise AI under the Hood
Cruise AI under the Hood
 
LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)
LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)
LiDAR in the Adverse Weather: Dust, Snow, Rain and Fog (2)
 
Scenario-Based Development & Testing for Autonomous Driving
Scenario-Based Development & Testing for Autonomous DrivingScenario-Based Development & Testing for Autonomous Driving
Scenario-Based Development & Testing for Autonomous Driving
 
How to Build a Data Closed-loop Platform for Autonomous Driving?
How to Build a Data Closed-loop Platform for Autonomous Driving?How to Build a Data Closed-loop Platform for Autonomous Driving?
How to Build a Data Closed-loop Platform for Autonomous Driving?
 
Annotation tools for ADAS & Autonomous Driving
Annotation tools for ADAS & Autonomous DrivingAnnotation tools for ADAS & Autonomous Driving
Annotation tools for ADAS & Autonomous Driving
 
Multi sensor calibration by deep learning
Multi sensor calibration by deep learningMulti sensor calibration by deep learning
Multi sensor calibration by deep learning
 
Data pipeline and data lake for autonomous driving
Data pipeline and data lake for autonomous drivingData pipeline and data lake for autonomous driving
Data pipeline and data lake for autonomous driving
 
Open Source codes of trajectory prediction & behavior planning
Open Source codes of trajectory prediction & behavior planningOpen Source codes of trajectory prediction & behavior planning
Open Source codes of trajectory prediction & behavior planning
 
Lidar in the adverse weather: dust, fog, snow and rain
Lidar in the adverse weather: dust, fog, snow and rainLidar in the adverse weather: dust, fog, snow and rain
Lidar in the adverse weather: dust, fog, snow and rain
 
Autonomous Driving of L3/L4 Commercial trucks
Autonomous Driving of L3/L4 Commercial trucksAutonomous Driving of L3/L4 Commercial trucks
Autonomous Driving of L3/L4 Commercial trucks
 

Recently uploaded

THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONTHE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONjhunlian
 
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)Dr SOUNDIRARAJ N
 
Class 1 | NFPA 72 | Overview Fire Alarm System
Class 1 | NFPA 72 | Overview Fire Alarm SystemClass 1 | NFPA 72 | Overview Fire Alarm System
Class 1 | NFPA 72 | Overview Fire Alarm Systemirfanmechengr
 
DM Pillar Training Manual.ppt will be useful in deploying TPM in project
DM Pillar Training Manual.ppt will be useful in deploying TPM in projectDM Pillar Training Manual.ppt will be useful in deploying TPM in project
DM Pillar Training Manual.ppt will be useful in deploying TPM in projectssuserb6619e
 
welding defects observed during the welding
welding defects observed during the weldingwelding defects observed during the welding
welding defects observed during the weldingMuhammadUzairLiaqat
 
home automation using Arduino by Aditya Prasad
home automation using Arduino by Aditya Prasadhome automation using Arduino by Aditya Prasad
home automation using Arduino by Aditya Prasadaditya806802
 
Industrial Safety Unit-IV workplace health and safety.ppt
Industrial Safety Unit-IV workplace health and safety.pptIndustrial Safety Unit-IV workplace health and safety.ppt
Industrial Safety Unit-IV workplace health and safety.pptNarmatha D
 
Internet of things -Arshdeep Bahga .pptx
Internet of things -Arshdeep Bahga .pptxInternet of things -Arshdeep Bahga .pptx
Internet of things -Arshdeep Bahga .pptxVelmuruganTECE
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort servicejennyeacort
 
multiple access in wireless communication
multiple access in wireless communicationmultiple access in wireless communication
multiple access in wireless communicationpanditadesh123
 
Configuration of IoT devices - Systems managament
Configuration of IoT devices - Systems managamentConfiguration of IoT devices - Systems managament
Configuration of IoT devices - Systems managamentBharaniDharan195623
 
Work Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvWork Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvLewisJB
 
Katarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School CourseKatarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School Coursebim.edu.pl
 
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfCCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfAsst.prof M.Gokilavani
 
Ch10-Global Supply Chain - Cadena de Suministro.pdf
Ch10-Global Supply Chain - Cadena de Suministro.pdfCh10-Global Supply Chain - Cadena de Suministro.pdf
Ch10-Global Supply Chain - Cadena de Suministro.pdfChristianCDAM
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionDr.Costas Sachpazis
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...asadnawaz62
 
Mine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxMine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxRomil Mishra
 

Recently uploaded (20)

THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONTHE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
 
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
 
Class 1 | NFPA 72 | Overview Fire Alarm System
Class 1 | NFPA 72 | Overview Fire Alarm SystemClass 1 | NFPA 72 | Overview Fire Alarm System
Class 1 | NFPA 72 | Overview Fire Alarm System
 
DM Pillar Training Manual.ppt will be useful in deploying TPM in project
DM Pillar Training Manual.ppt will be useful in deploying TPM in projectDM Pillar Training Manual.ppt will be useful in deploying TPM in project
DM Pillar Training Manual.ppt will be useful in deploying TPM in project
 
welding defects observed during the welding
welding defects observed during the weldingwelding defects observed during the welding
welding defects observed during the welding
 
Design and analysis of solar grass cutter.pdf
Design and analysis of solar grass cutter.pdfDesign and analysis of solar grass cutter.pdf
Design and analysis of solar grass cutter.pdf
 
home automation using Arduino by Aditya Prasad
home automation using Arduino by Aditya Prasadhome automation using Arduino by Aditya Prasad
home automation using Arduino by Aditya Prasad
 
Industrial Safety Unit-IV workplace health and safety.ppt
Industrial Safety Unit-IV workplace health and safety.pptIndustrial Safety Unit-IV workplace health and safety.ppt
Industrial Safety Unit-IV workplace health and safety.ppt
 
Internet of things -Arshdeep Bahga .pptx
Internet of things -Arshdeep Bahga .pptxInternet of things -Arshdeep Bahga .pptx
Internet of things -Arshdeep Bahga .pptx
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
 
multiple access in wireless communication
multiple access in wireless communicationmultiple access in wireless communication
multiple access in wireless communication
 
Configuration of IoT devices - Systems managament
Configuration of IoT devices - Systems managamentConfiguration of IoT devices - Systems managament
Configuration of IoT devices - Systems managament
 
Work Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvWork Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvv
 
Katarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School CourseKatarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School Course
 
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfCCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
 
Ch10-Global Supply Chain - Cadena de Suministro.pdf
Ch10-Global Supply Chain - Cadena de Suministro.pdfCh10-Global Supply Chain - Cadena de Suministro.pdf
Ch10-Global Supply Chain - Cadena de Suministro.pdf
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...
 
Designing pile caps according to ACI 318-19.pptx
Designing pile caps according to ACI 318-19.pptxDesigning pile caps according to ACI 318-19.pptx
Designing pile caps according to ACI 318-19.pptx
 
Mine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxMine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptx
 

Pedestrian behavior/intention modeling for autonomous driving III

  • 1. Pedestrian Behavior/Intention Modeling for Autonomous Driving III Yu Huang Yu.huang07@gmail.com Sunnyvale, California
  • 2. Outline • SR-LSTM: State Refinement for LSTM towards Pedestrian Trajectory Prediction • Pedestrian Path, Pose, and Intention Prediction Through Gaussian Process Dynamical Models • StarNet: Pedestrian Trajectory Prediction using Deep Neural Network in Star Topology • Social Ways: Learning Multi-Modal Distributions of Pedestrian Trajectories with GANs • Multi-Agent Tensor Fusion for Contextual Trajectory Prediction • Which Way Are You Going? Imitative Decision Learning for Path Forecasting in Dynamic Scenes • TraPHic: Trajectory Prediction in Dense and Heterogeneous Traffic Using Weighted Interactions • Learning to Infer Relations for Future Trajectory Forecast • Peeking into the Future: Predicting Future Person Activities and Locations in Videos
  • 3. SR-LSTM: State Refinement for LSTM towards Pedestrian Trajectory Prediction • 2019.3 • In crowd scenarios, reliable trajectory prediction of pedestrians requires insightful understanding of their social behaviors. • These behaviors have been well investigated by plenty of studies, while it is hard to be fully expressed by hand-craft rules. • Recent studies based on LSTM networks have shown great ability to learn social behaviors. • However, many of these methods rely on previous neighboring hidden states but ignore the important current intention of the neighbors. • In order to address this issue, this is a data-driven state refinement module for LSTM network (SR- LSTM), which activates the utilization of the current intention of neighbors, and jointly and iteratively refines the current states of all participants in the crowd through a message passing mechanism. • To effectively extract the social effect of neighbors, further introduce a social-aware information selection mechanism consisting of an element-wise motion gate and a pedestrian-wise attention to select useful message from neighboring pedestrians.
  • 4. SR-LSTM: State Refinement for LSTM towards Pedestrian Trajectory Prediction • Current states of neighbors are important for timely interaction inference. When predicting for the lady at time t, considering the trajectory of the man on the right up to time t (a), or the one up to time t − 1 (b), can cause great deviation in predicting results (dashed lines).
  • 5. SR-LSTM: State Refinement for LSTM towards Pedestrian Trajectory Prediction • Useful information should be adaptively selected from neighbors, based on their motions and locations. (a) Activation trajectory patterns of hidden neurons in LSTM, which start from the origin. Each trajectory pattern marked by certain color contains trajectories from database which has top- 20 responses for the hidden neuron. (b) A sample of three pedestrian interaction. How will the dyad pay attention to the other pedestrian on the left?
  • 6. SR-LSTM: State Refinement for LSTM towards Pedestrian Trajectory Prediction Framework overview of SR-LSTM. States refinement module is considered as an additional subnetwork of the LSTM cells, which aligns pedestrians together and updates current states of them. The refined states are used to predict the location at the next time step.
• The SR module takes three sources of information from all pedestrians as input: their current locations, and the hidden states and cell states from the LSTM.
• The output of the SR module is the refined cell states.
• In pedestrian trajectory prediction, iterative refinement improves the quality of the interaction model, mirroring the intention negotiation inherent in human interaction.
• The motion gate and the pedestrian-wise attention jointly select the important information from neighboring pedestrians for message passing.
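As a rough illustration of how such a refinement step could be wired up, below is a minimal PyTorch sketch assuming a sigmoid element-wise motion gate and a softmax pedestrian-wise attention over pairwise features; the class name, layer sizes, and the residual cell-state update are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a state-refinement step; names and dims are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StateRefinement(nn.Module):
    def __init__(self, hidden_dim: int):
        super().__init__()
        # Element-wise motion gate: selects which features of a neighbor's
        # hidden state are useful, conditioned on the relative position.
        self.motion_gate = nn.Linear(2 + 2 * hidden_dim, hidden_dim)
        # Pedestrian-wise attention: one scalar weight per neighbor.
        self.attn = nn.Linear(2 + 2 * hidden_dim, 1)
        self.update = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, pos, h, c):
        # pos: (N, 2) current locations; h, c: (N, D) LSTM hidden/cell states.
        N, D = h.shape
        rel = pos.unsqueeze(0) - pos.unsqueeze(1)           # (N, N, 2) offsets to neighbors
        pair = torch.cat([rel,
                          h.unsqueeze(0).expand(N, N, D),   # neighbor hidden states
                          h.unsqueeze(1).expand(N, N, D)],  # own hidden state
                         dim=-1)                            # (N, N, 2 + 2D)
        gate = torch.sigmoid(self.motion_gate(pair))        # element-wise selection
        score = self.attn(pair).squeeze(-1)                 # (N, N)
        eye = torch.eye(N, dtype=torch.bool, device=score.device)
        score = score.masked_fill(eye, float('-inf'))       # no self-message
        w = F.softmax(score, dim=1)                         # pedestrian-wise attention
        msg = torch.einsum('ij,ijd->id',
                           w, gate * h.unsqueeze(0).expand(N, N, D))
        # Refine the cell states with the aggregated social message.
        return c + torch.tanh(self.update(torch.cat([c, msg], dim=-1)))
```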
SR-LSTM: State Refinement for LSTM towards Pedestrian Trajectory Prediction
In SR-LSTM, the current states of pedestrians can refine each other in a timely manner, particularly when pedestrians change their intentions. SR-LSTM can implicitly account for common social behaviors, which yields plausible future predictions and relatively low errors.
Pedestrian Path, Pose, and Intention Prediction Through Gaussian Process Dynamical Models and Pedestrian Activity Recognition
• 2019.5, IEEE T-ITS
• Predictions of pedestrian paths can improve current automatic emergency braking systems.
• The goal is to predict future pedestrian paths, poses, and intentions up to 1 s in advance.
• The method is based on balanced Gaussian process dynamical models (B-GPDMs), which reduce the 3-D time-related information extracted from key points or joints placed along pedestrian bodies into low-dimensional spaces.
• The B-GPDM is also capable of inferring future latent positions and reconstructing their associated observations.
• However, learning one generic model for all kinds of pedestrian activities normally yields less accurate predictions.
• The proposed method therefore obtains multiple models for four types of activity, i.e., walking, stopping, starting, and standing, and selects the most similar model to estimate future pedestrian states.
• The method detects starting activities 125 ms after gait initiation with an accuracy of 80% and recognizes stopping intentions 58.33 ms before the event with an accuracy of 70%.
Pedestrian Path, Pose, and Intention Prediction Through Gaussian Process Dynamical Models and Pedestrian Activity Recognition
General description of the method based on B-GPDMs. The algorithm is divided into two stages: offline training (top) and online execution (bottom). The method learns multiple models for each type of pedestrian activity, i.e., walking, stopping, starting, and standing, and selects the most appropriate one to estimate future pedestrian states at each time step. A training dataset of motion sequences, in which pedestrians perform different activities, is split into 8 subsets based on typical crossing orientations and type of activity. A B-GPDM is obtained for each sequence with one activity contained in the dataset.
Pedestrian Path, Pose, and Intention Prediction Through Gaussian Process Dynamical Models and Pedestrian Activity Recognition
• The proposed method is based on B-GPDMs, which reduce the 3-D time-related positions and displacements extracted from key points or joints placed along the pedestrian bodies into low-dimensional latent spaces.
• The B-GPDM also has the peculiarity of inferring future latent positions and reconstructing, from the latent space, the observation associated with a latent position.
• Therefore, it is possible to reconstruct future observations from future latent positions.
• During online execution, given a new pedestrian observation, the current activity is determined using an HMM.
• Thus, the selection of the most appropriate model among the trained ones is restricted to that activity.
• Finally, the selected model is used to predict the future latent positions and reconstruct the future pedestrian path and poses.
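To make the latent-dynamics idea concrete, here is a minimal numpy sketch of GP-based one-step dynamics fitted on a latent trajectory and rolled forward by its posterior mean; the RBF kernel, hyperparameters, and function names are assumptions, and a full B-GPDM would additionally learn the latent space itself and a latent-to-observation mapping.

```python
# Minimal sketch: fit a GP mapping x_t -> x_{t+1} on a latent trajectory,
# then roll the posterior mean forward to infer future latent positions.
import numpy as np

def rbf(A, B, ls=1.0, var=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return var * np.exp(-0.5 * d2 / ls ** 2)

def gp_rollout(X, steps, noise=1e-4):
    # X: (T, d) latent positions of one training sequence.
    Xin, Xout = X[:-1], X[1:]                      # one-step dynamics pairs
    K = rbf(Xin, Xin) + noise * np.eye(len(Xin))
    alpha = np.linalg.solve(K, Xout)               # (T-1, d)
    x = X[-1:]                                     # start from the last latent state
    preds = []
    for _ in range(steps):
        x = rbf(x, Xin) @ alpha                    # GP posterior mean
        preds.append(x[0])
    return np.stack(preds)                         # (steps, d) future latent path

# Future observations (paths and poses) would then be reconstructed from
# these latent positions via the model's latent-to-observation mapping.
```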
StarNet: Pedestrian Trajectory Prediction using Deep Neural Network in Star Topology
• 2019.6
• Pedestrian trajectory prediction is crucial for many important applications.
• The problem is a great challenge because of the complicated interactions among pedestrians.
• Previous methods model only the pairwise interactions between pedestrians, which not only oversimplifies the interactions among pedestrians but is also computationally inefficient.
• StarNet has a star topology which includes a unique hub network and multiple host networks.
• The hub network takes the observed trajectories of all pedestrians to produce a comprehensive description of the interpersonal interactions.
• The host networks, each of which corresponds to one pedestrian, then consult this description and predict future trajectories.
• The star topology gives StarNet two advantages over conventional models.
• StarNet is able to consider the collective influence among all pedestrians in the hub network, making more accurate predictions.
• StarNet is computationally efficient since the number of host networks is linear in the number of pedestrians.
StarNet: Pedestrian Trajectory Prediction using Deep Neural Network in Star Topology
The structure of StarNet. StarNet mainly consists of a centralized hub network and several host networks. The hub network collects movement information and generates a feature which describes the joint interactions among pedestrians. Each host network, corresponding to a certain pedestrian, queries the hub network and predicts that pedestrian's trajectory.
StarNet: Pedestrian Trajectory Prediction using Deep Neural Network in Star Topology
• Pedestrian path prediction is a great challenge due to the uncertainty of future movements.
• Conventional methods tackle this problem with manually crafted features.
• Data-driven methods remove the requirement of hand-crafted features and greatly improve the ability to predict pedestrian trajectories.
• However, existing methods compute pairwise features and thus oversimplify the interactions in the real-world environment.
• Meanwhile, they suffer from a huge computational burden in crowded scenes.
• StarNet has two advantages over previous methods.
• 1) The representation describes not only pairwise interactions but also collective ones. Such a comprehensive representation enables StarNet to make accurate predictions.
• 2) The interactions between one pedestrian and the others are efficiently computed. When predicting all pedestrians' trajectories, the computational time increases linearly, rather than quadratically, with the number of pedestrians.
StarNet: Pedestrian Trajectory Prediction using Deep Neural Network in Star Topology
The process of predicting the coordinates.
StarNet: Pedestrian Trajectory Prediction using Deep Neural Network in Star Topology
• The hub network takes all of the observed trajectories simultaneously and produces a comprehensive representation r of the crowd of pedestrians.
• The representation r includes both spatial and temporal information of the crowd, which is the key to describing the interactions among pedestrians.
• The hub network produces r in two steps: 1) produce a spatial representation of the crowd for each time step; 2) feed the spatial representation into an LSTM to produce the spatio-temporal representation r.
• For the i-th pedestrian, the host network first embeds the observed trajectory Oi, and then combines the embedded trajectory with the spatio-temporal representation rt to predict the future trajectory.
• Specifically, the host network predicts the future trajectory in two steps: 1) take the observed trajectory Oi and the spatio-temporal representation rt as input and generate an integrated representation; 2) predict the future trajectory of the i-th pedestrian from the observed trajectory Oi and the integrated representation.
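A minimal PyTorch sketch of this star topology is given below, assuming max pooling for the per-time-step crowd representation and a simple concatenation when a host queries the hub; class names, layer sizes, and the pooling choice are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of StarNet's hub-and-host topology.
import torch
import torch.nn as nn

class HubNetwork(nn.Module):
    def __init__(self, d=64):
        super().__init__()
        self.spatial = nn.Sequential(nn.Linear(2, d), nn.ReLU())
        self.temporal = nn.LSTM(d, d, batch_first=True)

    def forward(self, obs):
        # obs: (N, T, 2) observed trajectories of all N pedestrians.
        # 1) per-time-step spatial representation of the whole crowd;
        crowd = self.spatial(obs).max(dim=0, keepdim=True).values  # (1, T, d)
        # 2) an LSTM turns it into the spatio-temporal representation r.
        r, _ = self.temporal(crowd)
        return r[0, -1]                                            # (d,)

class HostNetwork(nn.Module):
    def __init__(self, d=64, horizon=12):
        super().__init__()
        self.embed = nn.LSTM(2, d, batch_first=True)
        self.head = nn.Linear(2 * d, horizon * 2)
        self.horizon = horizon

    def forward(self, obs_i, r):
        # obs_i: (T, 2) one pedestrian; r: (d,) shared crowd representation.
        _, (h, _) = self.embed(obs_i.unsqueeze(0))
        z = torch.cat([h[-1, 0], r])               # query the hub's description
        return self.head(z).view(self.horizon, 2)  # (horizon, 2) future steps

obs = torch.randn(5, 8, 2)                         # 5 pedestrians, 8 observed steps
hub, host = HubNetwork(), HostNetwork()
r = hub(obs)                                       # computed once for the crowd
futures = [host(obs[i], r) for i in range(5)]      # cost linear in pedestrians
```

Because r is computed once and reused by every host, the total cost grows linearly with the number of pedestrians, which is the efficiency argument made above.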
StarNet: Pedestrian Trajectory Prediction using Deep Neural Network in Star Topology
Predicted trajectories and the corresponding ground truths. Different colors indicate different trajectories. The ground-truth trajectories are labeled with dots; the predicted trajectories are labeled with triangles.
Social Ways: Learning Multi-Modal Distributions of Pedestrian Trajectories with GANs
• CVPRW 2019
• This paper proposes an approach for predicting the motion of pedestrians interacting with others.
• It uses a Generative Adversarial Network (GAN) to sample plausible predictions for any agent in the scene.
• As GANs are very susceptible to mode collapsing and dropping, the authors show that the recently proposed Info-GAN allows dramatic improvements in multi-modal pedestrian trajectory prediction that avoid these issues.
• Unlike some previous works, the L2 loss is left out when training the generator, because it causes serious mode collapsing despite faster convergence.
• Experiments on real and synthetic data show that the proposed method generates more diverse samples and preserves the modes of the predictive distribution.
• In particular, to support this claim, the authors designed a toy dataset of trajectories that can be used to assess how well different methods preserve the modes of the predictive distribution.
Social Ways: Learning Multi-Modal Distributions of Pedestrian Trajectories with GANs
Illustration of the trajectory prediction problem. Given the observed trajectory of a pedestrian of interest, here shown with a *, and the ones of other pedestrians in the environment, the system should be able to build a predictive distribution of possible trajectories (here with two modes, in dashed yellow lines).
Social Ways: Learning Multi-Modal Distributions of Pedestrian Trajectories with GANs
• When deciding on steering actions, a pedestrian anticipates likely scenarios about how the surroundings will evolve in the near future.
• This anticipation may not always be easy, because of the uncertainty in the neighbors' future motions and intentions.
• In most recent NN-based motion prediction systems, the input is the set of most recent observations of the surrounding pedestrians.
• Hence, the mappings from observations to predicted trajectories built by the networks do not explicitly consider the uncertain and multimodal nature of the neighbors' future trajectories; in a way, the network is expected to learn this too, which may be too much to expect.
• The Social Ways GAN generates independent random trajectory samples that mimic the distribution of trajectories in the training data, conditioned on observed initial tracklets of duration τ for all the agents in the scene.
Social Ways: Learning Multi-Modal Distributions of Pedestrian Trajectories with GANs
Block diagram of the Social Ways prediction system. The yellow ellipses represent loss calculations. The dashed arrows show the backpropagation directions. The bold arrows carry ground-truth data.
Social Ways: Learning Multi-Modal Distributions of Pedestrian Trajectories with GANs
• GAN training is known to be hard: it may not converge, may exhibit vanishing gradients when there is an imbalance between the Generator and the Discriminator, or may be subject to mode collapsing, i.e., sampling of synthetic data without diversity.
• When predicting pedestrian motion, it is critical to avoid mode collapsing, as it could result in catastrophic decisions, e.g., for an autonomous driving agent.
• Two major changes are introduced in the GAN training.
• First, no L2 loss enforcing the generated samples to be close to the true data is used, because of the observed negative impact of this term on the diversity of the generated samples.
• Also, implementing an Info-GAN architecture has a very positive impact on avoiding the mode collapsing problem with respect to other versions of GANs.
• Info-GAN learns disentangled representations of the sources of variation among the data, and does so by introducing a new coding variable c as an input.
• Training adds another term that maximizes a lower bound of the mutual information between the distribution of c and the distribution of the generated outputs, which requires training another sub-network serving as a surrogate to evaluate the likelihoods over the generated data.
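The mutual-information term can be sketched as follows in PyTorch, assuming a continuous latent code with a fixed-variance Gaussian surrogate posterior, in which case the lower bound reduces to an L2 reconstruction of the code; network shapes and names are assumptions, and the conditioning on observed tracklets as well as the adversarial losses are omitted.

```python
# Minimal sketch of the Info-GAN ingredient: a surrogate network Q recovers
# the latent code c from the generated trajectory, maximizing a lower bound
# on the mutual information I(c; G(z, c)).
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16 + 2, 64), nn.ReLU(), nn.Linear(64, 12 * 2))  # z + c -> path
Q = nn.Sequential(nn.Linear(12 * 2, 64), nn.ReLU(), nn.Linear(64, 2))       # path -> c_hat

opt = torch.optim.Adam(list(G.parameters()) + list(Q.parameters()), lr=1e-3)
for _ in range(100):
    z = torch.randn(32, 16)                 # noise
    c = torch.rand(32, 2) * 2 - 1           # continuous latent code in [-1, 1]
    fake = G(torch.cat([z, c], dim=1))
    # For a Gaussian posterior with fixed variance, the MI lower bound
    # reduces to an L2 reconstruction of the code (up to constants).
    info_loss = ((Q(fake) - c) ** 2).mean()
    # ... adversarial losses for G and D would be added here; note the
    # deliberate absence of an L2 term pulling `fake` toward ground truth.
    opt.zero_grad()
    info_loss.backward()
    opt.step()
```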
Social Ways: Learning Multi-Modal Distributions of Pedestrian Trajectories with GANs
Sample outputs (in magenta). The observed trajectories are shown in blue; the ground-truth predictions and constant-velocity predictions are shown as cyan and orange lines, respectively.
Multi-Agent Tensor Fusion for Contextual Trajectory Prediction
• 2019.7
• Accurate prediction of others' trajectories is essential for autonomous driving.
• Trajectory prediction is challenging because it requires reasoning about agents' past movements, social interactions among varying numbers and kinds of agents, constraints from the scene context, and the stochasticity of human behavior.
• This approach models these interactions and constraints jointly within a Multi-Agent Tensor Fusion (MATF) network.
• Specifically, the model encodes multiple agents' past trajectories and the scene context into a Multi-Agent Tensor, then applies convolutional fusion to capture multi-agent interactions while retaining the spatial structure of agents and the scene context.
• The model decodes recurrently to multiple agents' future trajectories, using an adversarial loss to learn stochastic predictions.
• Experiments on both highway driving and pedestrian crowd datasets show that the model achieves state-of-the-art prediction accuracy.
Multi-Agent Tensor Fusion for Contextual Trajectory Prediction
• There are two parallel encoding streams in the MATF architecture.
• One encodes the past trajectory of each individual agent xi independently using single-agent LSTM encoders, and the other encodes the static scene context image c with a CNN.
• Each LSTM encoder shares the same set of parameters, so the architecture is invariant to the number of agents in the scene.
• The outputs of the LSTM encoders are 1-D agent state vectors {x′1, x′2, ..., x′n} without temporal structure.
• The output of the scene context encoder CNN is a scaled feature map c′ retaining the spatial structure of the bird's-eye view static scene context image.
• Next, the two encoding streams are concatenated spatially into a Multi-Agent Tensor.
• Agent encodings {x′1, x′2, ..., x′n} are placed into one bird's-eye view spatial tensor, which is initialized to 0 and has the same shape (width and height) as the encoded scene image c′.
Multi-Agent Tensor Fusion for Contextual Trajectory Prediction
• The dimension axis of the encodings fits into the channel axis of the tensor.
• The agent encodings are placed into the spatial tensor with respect to their positions at the last time step of their past trajectories.
• This tensor is then concatenated with the encoded scene image in the channel dimension to form a combined tensor. If multiple agents are placed into the same cell of the tensor due to discretization, element-wise max pooling is performed.
• The Multi-Agent Tensor is fed into fully convolutional layers, which learn to represent interactions among multiple agents and between agents and the scene context, while retaining spatial locality, to produce a fused Multi-Agent Tensor.
• Specifically, these layers operate at multiple spatial resolutions by adopting U-Net-like architectures to model interaction at different spatial scales.
• The output feature map of this fusion model, c′′, has exactly the same width and height as c′ to retain the spatial structure of the encoding.
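A minimal PyTorch sketch of assembling the Multi-Agent Tensor follows, assuming a square bird's-eye-view grid of a given metric extent; the grid size, channel counts, and the plain convolutional stand-in for the U-Net-like fusion are illustrative assumptions.

```python
# Minimal sketch: scatter agent encodings into a BEV grid at each agent's
# last observed position (element-wise max on collisions), concatenate
# with the scene feature map along channels, and fuse convolutionally.
import torch
import torch.nn as nn

def build_multi_agent_tensor(agent_vecs, positions, scene_feat, extent=20.0):
    # agent_vecs: (N, C); positions: (N, 2) in meters; scene_feat: (Cs, H, W)
    N, C = agent_vecs.shape
    Cs, H, W = scene_feat.shape
    grid = agent_vecs.new_zeros(C, H, W)
    ij = ((positions + extent) / (2 * extent)
          * torch.tensor([H - 1, W - 1])).long().clamp_min(0)
    ij[:, 0].clamp_(max=H - 1)
    ij[:, 1].clamp_(max=W - 1)
    for n in range(N):                          # element-wise max pooling
        i, j = ij[n]
        grid[:, i, j] = torch.maximum(grid[:, i, j], agent_vecs[n])
    return torch.cat([grid, scene_feat], dim=0)  # (C + Cs, H, W)

fuse = nn.Sequential(                            # stand-in for U-Net-like fusion
    nn.Conv2d(64 + 32, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 32, 3, padding=1))
mat = build_multi_agent_tensor(torch.randn(6, 64), torch.randn(6, 2) * 10,
                               torch.randn(32, 40, 40))
fused = fuse(mat.unsqueeze(0))                   # same spatial shape as the scene map
```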
Multi-Agent Tensor Fusion for Contextual Trajectory Prediction
The Multi-Agent Tensor encoding is a spatial feature map of the scene context and multiple agents from an overhead perspective, including agent channels (above) and context channels (below). Agents' feature vectors (red), output from single-agent LSTM encoders, are placed spatially w.r.t. the agents' coordinates to form the agent channels. The agent channels are aligned spatially with the context channels (a context feature map) output from the scene context encoding layers to retain the spatial structure.
Multi-Agent Tensor Fusion for Contextual Trajectory Prediction
• To decode each agent's predicted trajectory, agent-specific representations with fused interaction features {x1′′, x2′′, ..., xn′′} are sliced out of the fused Multi-Agent Tensor output c′′ according to the agents' coordinates.
• These agent-specific representations are then added as residuals to the original encoded agent vectors to form the final agent encoding vectors {x1′ + x1′′, x2′ + x2′′, ..., xn′ + xn′′}, which encode all the information from the past trajectories of the agents themselves, the static scene context, and the interaction features among multiple agents.
• In this way, each agent gets a different social and contextual embedding focused on itself.
• Importantly, the model obtains these embeddings for multiple agents using shared feature extractors instead of operating n times for n agents.
• Finally, for each agent in the scene, its final vector xi′ + xi′′ is decoded into a future trajectory prediction ŷi by an LSTM decoder.
• As with the per-agent encoders, parameters are shared to guarantee that the network generalizes well when the number of agents in the scene varies.
Multi-Agent Tensor Fusion for Contextual Trajectory Prediction
Illustration of the Multi-Agent Tensor Fusion (MATF) architecture.
Multi-Agent Tensor Fusion for Contextual Trajectory Prediction
Stanford Drone dataset. From left to right: MATF Multi-Agent Scene, MATF Multi-Agent, and LSTM. Blue: past trajectories; red: ground truth; green: predictions. The closer the green predicted trajectory is to the red ground-truth future trajectory, the more accurate the prediction. The model predicts that (1) two agents entering the roundabout from the top will exit to the left; (2) one agent coming from the left on the pathway above the roundabout is turning left to move toward the top of the image; (3) one agent is decelerating at the door of the building above and to the right of the roundabout. (4) In one interesting failure case, an agent on the top-right of the roundabout is turning right to move toward the top of the image; the model predicts the turn, but not how sharp it will be.
Which Way Are You Going? Imitative Decision Learning for Path Forecasting in Dynamic Scenes
• CVPR 2019
• This work proposes an Imitative Decision Learning (IDL) approach, which delves deeper into the key factor that inherently characterizes the multimodality: the latent decision.
• The proposed IDL first infers the distribution of such latent decisions by learning from moving histories.
• A policy is then generated by taking the sampled latent decision into account to predict the future.
• Different plausible upcoming paths correspond to each sampled latent decision.
• This approach differs significantly from the mainstream literature, which relies on a predefined latent variable to extrapolate diverse predictions.
• To augment the understanding of the latent decision and the resultant multimodal future, their connection is investigated through mutual information optimization.
• Moreover, the proposed IDL integrates spatial and temporal dependencies into a single framework, in contrast to handling them in two-step settings.
• This approach enables simultaneously anticipating the paths of all pedestrians in the scene.
Which Way Are You Going? Imitative Decision Learning for Path Forecasting in Dynamic Scenes
The multimodal nature of future paths in a dynamic scene: there are multiple plausible forthcoming paths (the dashed red and cyan lines) based on identical historical moving records (the solid red and cyan lines). Only three possibilities are displayed as an example. One issue that has been challenging for path forecasting in dynamic scenes is the multimodal nature of the future: given a set of historical observations, there is more than one probable future. Despite tremendous accomplishments in foreseeing a deterministic future, the majority of existing studies fail to consider the multiple possibilities of the future.
Which Way Are You Going? Imitative Decision Learning for Path Forecasting in Dynamic Scenes
• This work focuses on understanding and imitating the underlying human decision-making process to anticipate future paths in dynamic scenes.
• Fundamentally, IDL can be viewed as jointly training:
• (1) an inference sub-network L that extrapolates the latent decision,
• (2) a policy/generator π that recovers a policy to generate upcoming paths,
• (3) a statistics sub-network Q that discovers the impact of the latent decision on predictions,
• (4) a discriminator D that attempts to differentiate the generated outcomes from the expert demonstrations.
• The detailed schematic diagram for forecasting future paths is shown in the following figure.
Which Way Are You Going? Imitative Decision Learning for Path Forecasting in Dynamic Scenes
• The red arrows indicate the direction of information flow between modules; the black arrows indicate the flow inside a module.
• The historical trajectories are input into the inference sub-network to infer the distribution of latent decisions.
• The temporal convolutional sub-module receives the output from the pre-trained convolutional sub-module and produces a two-unit vector.
• A pre-trained deconvolutional sub-module and a softmax layer read each unit to form the mean and deviation of a Gaussian distribution over latent decisions.
• Meanwhile, the encoder of the policy/generator π processes the historical trajectories with a ConvGRU layer.
• An element-wise addition of the encoded hidden states h_enc and the sampled latent decision S initializes the decoder.
• The final predictions are generated from the decoded hidden states h_dec through a deconvolutional layer.
• The statistics sub-network reads the prediction and the latent decision to measure the significance of S.
• The discriminator distinguishes predictions from ground-truth future paths (expert demonstrations).
Which Way Are You Going? Imitative Decision Learning for Path Forecasting in Dynamic Scenes
Qualitative comparisons on the SAP dataset. The top left shows the observed records and the matching ground truth (G.T.). For a clear visualization of the multimodality, several trajectories and their diverse predicted paths are illustrated separately in examples 1 to 5.
TraPHic: Trajectory Prediction in Dense and Heterogeneous Traffic Using Weighted Interactions
• CVPR 2019
• An algorithm for predicting the near-term trajectories of road agents in dense traffic videos.
• The approach is designed for heterogeneous traffic, where the road agents may be buses, cars, scooters, bicycles, or pedestrians.
• It models the interactions between different road agents using a novel LSTM-CNN hybrid network for trajectory prediction.
• In particular, it takes into account heterogeneous interactions that implicitly account for the varying shapes, dynamics, and behaviors of different road agents.
• It also models horizon-based interactions, which are used to implicitly model the driving behavior of each road agent.
• The prediction algorithm, TraPHic, is evaluated on standard datasets and on a new dense, heterogeneous traffic dataset of urban Asian videos and agent trajectories.
TraPHic: Trajectory Prediction in Dense and Heterogeneous Traffic Using Weighted Interactions
• Two observations:
• 1) Road agents in such dense traffic do not react to every road agent around them; rather, they selectively focus attention on key interactions in a semi-elliptical region in the field of view, called the "horizon" (see the sketch after this list).
• 2) To capture heterogeneous road-agent dynamics, agent properties are embedded into the state-space representation of the road agents and fed into the hybrid network.
• TraPHic network:
• Input embeddings are generated for all agents based on trajectory information and heterogeneous dynamic constraints such as agent shape, velocity, traffic concentration at the agent's spatial coordinates, and other parameters.
• These embeddings are passed through LSTMs and eventually used to construct the horizon map, the neighbor map, and the ego agent's own tensor map.
• The horizon and neighbor maps are passed through separate ConvNets and then concatenated together with the ego-agent tensor to produce latent representations.
• Finally, these latent representations are passed through an LSTM to generate a trajectory prediction for the ego agent.
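The horizon idea can be sketched with a simple geometric test, as below in numpy, assuming fixed semi-ellipse axes in front of the ego agent; the axis lengths and function names are illustrative, and TraPHic would feed the selected agents' embeddings into the horizon map rather than return indices.

```python
# Minimal sketch: select agents inside a semi-elliptical "horizon" ahead
# of the ego agent.
import numpy as np

def horizon_agents(ego_pos, ego_heading, others, a=15.0, b=8.0):
    # ego_pos: (2,); ego_heading: heading vector (2,); others: (N, 2).
    fwd = ego_heading / np.linalg.norm(ego_heading)
    left = np.array([-fwd[1], fwd[0]])
    rel = others - ego_pos
    x = rel @ fwd          # longitudinal offset (ahead of ego if > 0)
    y = rel @ left         # lateral offset
    inside = (x > 0) & ((x / a) ** 2 + (y / b) ** 2 <= 1.0)  # front half-ellipse
    return np.where(inside)[0]

others = np.array([[5.0, 1.0], [-3.0, 0.0], [10.0, 5.0], [2.0, 9.0]])
idx = horizon_agents(np.zeros(2), np.array([1.0, 0.0]), others)
print(idx)   # agents the ego selectively attends to, here [0 2]
```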
TraPHic: Trajectory Prediction in Dense and Heterogeneous Traffic Using Weighted Interactions
TraPHic network architecture: the ego agent is marked by the red dot. The green elliptical region around it is its neighborhood, and the cyan semi-elliptical region in front of it is its horizon.
TraPHic: Trajectory Prediction in Dense and Heterogeneous Traffic Using Weighted Interactions
Trajectory prediction results: the performance of various trajectory prediction methods on the TRAF dataset with different types of road agents. The ground-truth (GT) trajectory is a solid green line, and the TraPHic prediction is a solid red line. The prediction results of the other methods (RNN-ED, S-LSTM, S-GAN, CS-LSTM) are drawn with different dashed lines.
Learning to Infer Relations for Future Trajectory Forecast
• CVPR 2019
• Relational inference is flexible enough to define 'an object' as a spatial feature representation extracted from each region of a discretized grid, regardless of what exists in that region.
• Inferring relational behavior between road users, as well as between road users and their surrounding physical space, is an important step toward effective modeling and prediction of the navigation strategies adopted by participants in road scenes.
• This is a relation-aware framework for future trajectory forecast, which aims to infer relational information from the interactions of road users with each other and with their environments.
• To address the differing importance of relations, a relation gate module (RGM) with an internal gating process is designed.
• The RGM controls the information flow through multiple switch gates and identifies descriptive relations that highly influence the future motion of the target by conditioning on its past trajectory.
Learning to Infer Relations for Future Trajectory Forecast
The proposed gated relation encoder (GRE) visually discovers both human-human (j-th region: woman ↔ man) and human-space interactions (i-th region: cyclist ↔ cone) from each region of the discretized grid over time.
Learning to Infer Relations for Future Trajectory Forecast
• In this framework, an object is a visual encoding of the spatial behavior of road users (if they exist) and of environmental representations, together with their temporal interactions over time, which naturally corresponds to local human-human and human-space interactions in each region of the discretized grid.
• On top of this, the framework learns to infer relational behavior from all objects (i.e., the spatio-temporal interactions in the context) from a global perspective.
• Given a sequence of images, the gated relation encoder (GRE) visually extracts spatio-temporal interactions (i.e., objects) through the spatial behavior encoder (SBE) and the temporal interaction encoder (TIE).
• The RGM of the GRE infers pairwise relations from objects and then focuses on which relations will be potentially meaningful to forecast the future motion of the target, given its past behavior.
• Future locations are predicted through the trajectory prediction network (TPN), using the aggregated relational features, in the form of heatmaps, which can be further refined by considering the spatial dependencies between predicted locations and extended to learn the uncertainty of the future forecast at test time.
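A minimal PyTorch sketch of such a gated relation module follows, assuming pairwise concatenation of object features and a sigmoid switch gate conditioned on the target's past-trajectory encoding; the gating form, dimensions, and names are assumptions rather than the paper's exact design.

```python
# Minimal sketch of a relation gate module (RGM): pairwise relations between
# grid "objects" are gated by the target's past trajectory, then aggregated.
import torch
import torch.nn as nn

class RelationGateModule(nn.Module):
    def __init__(self, obj_dim=32, traj_dim=16, rel_dim=64):
        super().__init__()
        self.rel = nn.Sequential(nn.Linear(2 * obj_dim, rel_dim), nn.ReLU())
        # Switch gate conditioned on the pair and the target's past motion.
        self.gate = nn.Sequential(nn.Linear(rel_dim + traj_dim, rel_dim),
                                  nn.Sigmoid())

    def forward(self, objects, traj):
        # objects: (K, obj_dim) features of grid regions; traj: (traj_dim,)
        K = objects.size(0)
        pairs = torch.cat([objects.unsqueeze(0).expand(K, K, -1),
                           objects.unsqueeze(1).expand(K, K, -1)], dim=-1)
        r = self.rel(pairs)                                   # (K, K, rel_dim)
        t = traj.expand(K, K, -1)
        g = self.gate(torch.cat([r, t], dim=-1))              # which relations matter
        return (g * r).sum(dim=(0, 1))                        # aggregated relation feature

rgm = RelationGateModule()
feat = rgm(torch.randn(9, 32), torch.randn(16))               # e.g. a 3x3 grid of objects
```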
Learning to Infer Relations for Future Trajectory Forecast
The efficacy of the spatial refinement network (SRN) for spatial dependencies.
Learning to Infer Relations for Future Trajectory Forecast
• Predicted heatmaps from the TPN are sometimes ambiguous.
• The main cause of this issue is a lack of spatial dependencies among predictions.
• Since the network independently predicts δ heatmaps, there is no constraint enforcing them to be spatially aligned across predictions.
• Thus, a spatial refinement network (SRN) is designed to learn implicit spatial dependencies in a feature space.
• Intermediate activations (early and late features) of the TPN are first concatenated and passed through the SRN, which uses large receptive fields.
• As a result, the outputs show less confusion between heatmap locations, making use of rich contextual information from neighboring predictions.
• The total loss combines the TPN and SRN prediction objectives (given as an equation in the original slide).
Learning to Infer Relations for Future Trajectory Forecast
The efficacy of embedding uncertainty into the framework with MC dropout.
Learning to Infer Relations for Future Trajectory Forecast
• Bayesian neural networks (BNNs) have been considered for tackling the uncertainty of a network's weight parameters.
• It has been found that inference in BNNs can be approximated by sampling from the posterior distribution of a deterministic network's weight parameters using Monte Carlo dropout.
• This performs approximate variational inference, using dropout at test time to draw multiple samples from the dropout distribution.
• It effectively enables capturing multiple plausible trajectories over the uncertainties of the network's learned weight parameters.
• The mean of the L samples is used as the prediction, which best approximates variational inference in BNNs.
• The variance of L = 5 samples is computed to measure the uncertainty.
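This test-time procedure is easy to sketch in PyTorch: keep dropout active during inference, draw L stochastic forward passes, and report their mean and variance. The network below is a stand-in, not the paper's TPN/SRN.

```python
# Minimal sketch of MC-dropout inference.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(),
                    nn.Dropout(p=0.2),               # the source of stochasticity
                    nn.Linear(64, 2))

def mc_dropout_predict(net, x, L=5):
    net.train()                    # keeps Dropout sampling at test time
    with torch.no_grad():
        samples = torch.stack([net(x) for _ in range(L)])  # (L, B, 2)
    return samples.mean(0), samples.var(0)                 # prediction, uncertainty

x = torch.randn(4, 8)
mean, var = mc_dropout_predict(net, x, L=5)
```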
Learning to Infer Relations for Future Trajectory Forecast
Qualitative evaluation. (Color codes: yellow - given past trajectory, red - ground truth, green - this prediction.) Illustrations of prediction during complicated human-human interactions. (a) A cyclist (•••) interacts with a person moving slowly (•••). (b) A person (•••) meets a group of people. (c) A cyclist (•••) first interacts with another cyclist in front (•••) and then considers the influence of a person (•••). This approach socially avoids potential collisions.
Peeking into the Future: Predicting Future Person Activities and Locations in Videos
• CVPR 2019
• Deciphering human behaviors from videos to predict their future paths/trajectories and what they will do is important in many applications.
• Therefore, this work studies predicting a pedestrian's future path jointly with future activities.
• The authors propose an end-to-end, multi-task learning system, called Next, utilizing rich visual features about human behavioral information and interaction with the surroundings.
• It encodes a person through rich semantic features about visual appearance, body movement, and interaction with the surroundings, motivated by the fact that humans derive such predictions by relying on similar visual cues.
• To facilitate training, the network is learned with an auxiliary task of predicting the future location in which the activity will happen.
• For the auxiliary task, a discretized grid called the Manhattan Grid is designed as the location prediction target for the system.
Peeking into the Future: Predicting Future Person Activities and Locations in Videos
The goal is to jointly predict a person's future path and activity. The green and yellow lines show two possible future trajectories, and two possible activities are shown in the green and yellow boxes. Depending on the future activity, the person (top right) may take different paths, e.g. the yellow path for "loading" and the green path for "object transfer".
Peeking into the Future: Predicting Future Person Activities and Locations in Videos
• Humans navigate through public spaces often with specific purposes in mind, ranging from simple ones like entering a room to more complicated ones like putting things into a car.
• Such intention, however, is mostly neglected in existing work.
• The joint prediction model can have two benefits:
• 1) Learning the activity together with the path may benefit future path prediction; intuitively, humans are able to read others' body language to anticipate whether they are going to cross the street or continue walking along the sidewalk.
• 2) The joint model advances the capability of understanding not only the future path but also the future activity by taking into account the rich semantic context in videos; this increases the capabilities of automated video analytics for social good, such as safety applications like anticipating pedestrian movement at traffic intersections, or a road robot helping humans transport goods to a car.
Peeking into the Future: Predicting Future Person Activities and Locations in Videos
Overview of the Next model. Given a sequence of frames containing the person for prediction, the model utilizes a person behavior module and a person interaction module to encode rich visual semantics into a feature tensor.
• 54. Peeking into the Future: Predicting Future Person Activities and Locations in Videos • 4 Key components: • Person behavior module extracts visual information from the behavioral sequence of the person. • Person interaction module looks at the interaction between a person and their surroundings. • Trajectory generator summarizes the encoded visual features and predicts the future trajectory by the LSTM decoder with focal attention. • Activity prediction utilizes rich visual semantics to predict the future activity label for the person. • In addition, the scene is divided into a discretized grid of multiple scales, called the Manhattan Grid, on which classification and regression are computed for robust activity location prediction (see the sketch below).
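A minimal sketch of a multi-scale discretized grid in this style, assuming only that the scene is divided into regular cells at several scales; the cell counts and helper names here are illustrative, not taken from the paper:

```python
def build_grids(h, w, scales=((18, 32), (36, 64))):
    """Return one (rows, cols) grid spec per scale for an h x w frame."""
    return [dict(rows=r, cols=c, ch=h / r, cw=w / c) for r, c in scales]

def point_to_cell(x, y, grid):
    """Map an (x, y) pixel to a flat cell index plus the offset to the cell
    center: a classification target and a regression target, respectively."""
    col = min(int(x // grid["cw"]), grid["cols"] - 1)
    row = min(int(y // grid["ch"]), grid["rows"] - 1)
    cx = (col + 0.5) * grid["cw"]
    cy = (row + 0.5) * grid["ch"]
    return row * grid["cols"] + col, (x - cx, y - cy)

grids = build_grids(h=1080, w=1920)
for g in grids:
    idx, (dx, dy) = point_to_cell(1000.0, 500.0, g)
    print(g["rows"], "x", g["cols"], "-> cell", idx,
          "offset", (round(dx, 1), round(dy, 1)))
```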
• 55. Peeking into the Future: Predicting Future Person Activities and Locations in Videos To model appearance changes of a person, a pre-trained object detection model with “RoIAlign” is used to extract fixed-size CNN features for each person bounding box. The features are averaged along the spatial dimensions for each person and fed into an LSTM encoder, yielding a feature representation of shape Tobs × d, where d is the hidden size of the LSTM. To capture body movement, a person keypoint detection model extracts person keypoint information; a linear transformation embeds the keypoint coordinates before they are fed into the LSTM encoder, and the encoded feature likewise has shape Tobs × d. These appearance and movement features are commonly used in a wide variety of studies and thus do not introduce new concerns regarding machine learning fairness.
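A rough sketch of these two behavior encoders (illustrative, not the released Next code): spatially averaged RoIAlign appearance features and embedded keypoint coordinates, each run through its own LSTM. `Tobs`, `d`, and the input feature sizes below are placeholders:

```python
import torch
import torch.nn as nn

Tobs, d = 8, 128
appearance_dim = 256     # e.g., spatially averaged RoIAlign features per box
num_keypoints = 17       # e.g., COCO-style body keypoints

appearance_lstm = nn.LSTM(appearance_dim, d, batch_first=True)
keypoint_embed = nn.Linear(num_keypoints * 2, d)  # embed (x, y) coordinates
keypoint_lstm = nn.LSTM(d, d, batch_first=True)

appearance = torch.randn(1, Tobs, appearance_dim)  # one person over Tobs frames
keypoints = torch.randn(1, Tobs, num_keypoints * 2)

app_feat, _ = appearance_lstm(appearance)               # (1, Tobs, d)
kp_feat, _ = keypoint_lstm(keypoint_embed(keypoints))   # (1, Tobs, d)
print(app_feat.shape, kp_feat.shape)
```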
• 56. Peeking into the Future: Predicting Future Person Activities and Locations in Videos The person-objects feature can capture how far away the person is from other people and cars. The person-scene feature can capture whether the person is near the sidewalk or grass. This information is provided to the model with the hope of learning patterns such as that a person walks more often on the sidewalk than on the grass and tends to avoid bumping into cars.
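A hypothetical sketch of a person-objects geometric feature of this kind: per frame, encode the offset from the person's box to every other detected box. The log scaling is a common choice for such geometric features and is assumed here, not taken from the paper:

```python
import torch

def person_object_feature(person_box, other_boxes):
    """person_box: (4,) as (x1, y1, x2, y2); other_boxes: (K, 4).
    Returns (K, 2) log-scaled center offsets from the person to each object."""
    pc = (person_box[:2] + person_box[2:]) / 2           # person center (2,)
    oc = (other_boxes[:, :2] + other_boxes[:, 2:]) / 2   # object centers (K, 2)
    delta = oc - pc
    return torch.sign(delta) * torch.log1p(delta.abs())

person = torch.tensor([100., 200., 150., 300.])
objects = torch.tensor([[400., 220., 480., 320.],   # e.g., a parked car
                        [90., 190., 140., 290.]])   # another pedestrian
print(person_object_feature(person, objects))
```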
• 57. Peeking into the Future: Predicting Future Person Activities and Locations in Videos • It uses an LSTM decoder to directly predict the future trajectory in xy-coordinates. • The hidden state of this decoder is initialized using the last state of the person’s trajectory LSTM encoder. • An auxiliary task, activity location prediction, is added in addition to predicting the future activity label of the person. • At each time instant, the xy-coordinate is computed from the decoder state by a fully connected layer. • It employs an effective focal attention, originally proposed to carry out multimodal inference over a sequence of images for visual question answering; its key idea is to project multiple features into a space of correlation, where discriminative features can be captured more easily by the attention mechanism.
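A simplified focal-attention sketch following the description above (the released implementation may differ): the decoder query attends first within each encoded feature sequence over time, then across the feature types, via dot-product correlations:

```python
import torch
import torch.nn.functional as F

def focal_attention(query, features):
    """query: (B, d) decoder state; features: (B, K, T, d) — K encoded
    sequences of length T. Returns a (B, d) context vector."""
    B, K, T, d = features.shape
    scores = torch.einsum('bd,bktd->bkt', query, features) / d ** 0.5
    within = F.softmax(scores, dim=2)                    # attention over time
    per_feature = torch.einsum('bkt,bktd->bkd', within, features)
    across = F.softmax(scores.amax(dim=2), dim=1)        # attention over features
    return torch.einsum('bk,bkd->bd', across, per_feature)

ctx = focal_attention(torch.randn(2, 128), torch.randn(2, 4, 8, 128))
print(ctx.shape)   # torch.Size([2, 128])
```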
• 58. Peeking into the Future: Predicting Future Person Activities and Locations in Videos To bridge the gap between trajectory generation and activity label prediction, it proposes an activity location prediction (ALP) module to predict the final location where the person will engage in the future activity. Activity location prediction comprises two tasks: location classification and location regression.
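A minimal sketch of these two ALP heads and their joint loss, assuming a flattened grid of G cells: cross-entropy over which cell the final activity lands in, plus a smooth-L1 regression of the offset inside that cell. The head names and the equal loss weighting are assumptions for illustration:

```python
import torch
import torch.nn as nn

G, d = 576, 128                          # e.g., an 18 x 32 grid, flattened
cls_head = nn.Linear(d, G)               # location classification logits
reg_head = nn.Linear(d, 2)               # (dx, dy) offset within the target cell

feat = torch.randn(4, d)                 # per-person decoded feature
target_cell = torch.randint(0, G, (4,))  # ground-truth cell index
target_offset = torch.randn(4, 2)        # ground-truth offset to the cell center

loss_cls = nn.functional.cross_entropy(cls_head(feat), target_cell)
loss_reg = nn.functional.smooth_l1_loss(reg_head(feat), target_offset)
loss = loss_cls + loss_reg               # relative weighting assumed
print(float(loss))
```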
  • 59. Peeking into the Future: Predicting Future Person Activities and Locations in Videos Qualitative comparison between this method and the baselines. Yellow path is the observable trajectory and green path is the ground truth trajectory during the prediction period. Predictions are shown as blue heatmaps.