Pedestrian Behavior/Intention
Modeling for Autonomous Driving IV
Yu Huang
Yu.huang07@gmail.com
Sunnyvale, California
Outline
• Looking to Relations for Future Trajectory Forecast
• The Garden of Forking Paths: Towards Multi-Future Trajectory Prediction
• Stochastic Sampling Simulation for Pedestrian Trajectory Prediction
• Disentangling Human Dynamics for Pedestrian Locomotion Forecasting
with Noisy Supervision
• Social and Scene-Aware Trajectory Prediction in Crowded Spaces
Looking to Relations for Future Trajectory Forecast
• Inferring relational behavior between road users as well as road users and their surrounding
physical space is an important step toward effective modeling and prediction of navigation
strategies adopted by participants in road scenes.
• This paper proposes a relation-aware framework for future trajectory forecast.
• The system aims to infer relational information from the interactions of road users with each
other and with the environment.
• The first module involves visual encoding of spatio-temporal features, which captures human-
human and human-space interactions over time.
• The following module explicitly constructs pair-wise relations from spatio-temporal interactions
and identifies more descriptive relations that highly influence future motion of the target road
user by considering its past trajectory.
• The resulting relational features are used to forecast future locations of the target in the form of
heatmaps, with additional guidance from spatial dependencies and consideration of
uncertainty.
Looking to Relations for Future Trajectory Forecast
Spatio-temporal features are visually encoded from a discretized grid to
locally discover (i) human-human and (ii) human-space interactions over time. Then,
their pair-wise relations with respect to the past motion of the target
(→) are investigated from a global perspective for trajectory forecast.
Looking to Relations for Future Trajectory Forecast
Given a sequence of images, the GRE (gated relation encoder) visually analyzes spatial behavior of road
users and their temporal interactions with respect to environments. The subsequent RGM (relation gate
module) of GRE infers pair-wise relations from these interactions and determines which relations are
meaningful from a target agent’s perspective. The aggregated relational features are used to generate
initial heatmaps through the TPN (trajectory prediction network). Then, the following SRN (spatial
refinement network) further refines these initial predictions under the guidance of their spatial
dependencies. They additionally embed the uncertainty of the problem into the system at test time.
Looking to Relations for Future Trajectory Forecast
• They extend the definition of ‘object’ to a spatio-temporal feature representation extracted from
each region of the discretized grid over time.
• This makes it possible to visually discover (i) human-human interactions where multiple road users
interact with each other over time, (ii) human-space interactions from their interactive
behavior with environments, and (iii) environmental representations by encoding structural
information of the road.
• The pair-wise relations between objects (i.e., local spatio-temporal features) are inferred from a
global perspective.
• Moreover, they design a new operation function to control information flow so that the network
can extract descriptive relational features by looking at relations that have a high potential to
influence the future motion of the target.
Looking to Relations for Future Trajectory Forecast
It visually extracts spatial representations of
the static road structures, the road topology,
and the appearance of road users from
individual frames using the spatial behavior
encoder (SBE) with 2D convolutions. They
individually process each entry of spatial
representations using the temporal
interaction encoder (TIE) with a 3D
convolution to model sequential changes of
road users and road structures with their
temporal interactions. The joint use of 2D
convolutions for spatial modeling and 3D
convolution for temporal modeling extracts
more discriminative spatio-temporal features
as compared to alternative methods such as
3D convolutions as a whole or 2D
convolutions with an LSTM.
Looking to Relations for Future Trajectory Forecast
They focus on the internal gating process of an LSTM unit, which controls
information flow through multiple switch gates. Specifically, the LSTM employs a
sigmoid function with a tanh layer to determine not only which information is
useful, but also how much weight it should be given. The efficacy of this control
process motivates the design of a relation gate module (RGM), which is essential for
generating more descriptive relational features from the target's perspective.
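The gating idea described above can be sketched in a few lines. This is only an illustration of an LSTM-style gate (a sigmoid deciding how much passes, a tanh giving the content), not the paper's RGM; the function names and the toy inputs are assumptions made for this example.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, squashing x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_feature(gate_logits, candidate_logits):
    """Element-wise gate: sigmoid sets the weight, tanh sets the content.

    Hypothetical helper for illustration; both inputs are plain lists of floats.
    """
    return [sigmoid(g) * math.tanh(c) for g, c in zip(gate_logits, candidate_logits)]


# Same candidate value under a closed, half-open, and open gate.
out = gated_feature([-10.0, 0.0, 10.0], [2.0, 2.0, 2.0])
```

With a strongly negative gate logit the feature is suppressed almost entirely, while a strongly positive one lets the tanh content through nearly unchanged.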
Looking to Relations for Future Trajectory Forecast
• To effectively identify the pixel-level probability map, they design a trajectory prediction
network (TPN) with a set of deconvolutional layers.
• It first reshapes the relational features extracted from the GRE to dimension 1 x 1 x w before
running the proposed TPN.
• The reshaped features are then incrementally upsampled using six deconvolutional layers, each
followed by a ReLU activation function.
• As an output, the network predicts a set of activations in the form of heatmaps through the
learned parameters.
• In training, the sum of squared errors between the ground-truth heatmaps and the prediction is
minimized over all 2D locations.
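The training objective in the last bullet can be written out as a small sketch. It assumes heatmaps are stored as nested lists indexed by future step, row, and column; the function name and the toy values are hypothetical.

```python
def heatmap_sse(pred, target):
    """Sum of squared errors over every 2D location of every heatmap."""
    return sum(
        (p - t) ** 2
        for pred_map, target_map in zip(pred, target)
        for pred_row, target_row in zip(pred_map, target_map)
        for p, t in zip(pred_row, target_row)
    )


# Two future steps, each a 2 x 2 heatmap (toy values for illustration).
target = [[[0.0, 1.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 1.0]]]
pred = [[[0.0, 0.5], [0.0, 0.0]], [[0.0, 0.0], [0.5, 1.0]]]
loss = heatmap_sse(pred, target)
```

Each future step contributes its own heatmap to the sum, so every predicted location is penalized independently of the others.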
Looking to Relations for Future Trajectory Forecast
[Figure: architectures of the spatial behavior encoder + temporal interaction encoder, the relation gate module, and the trajectory prediction network]
Looking to Relations for Future Trajectory Forecast
• Since the network independently predicts δ pixel-level probability maps, there is no
constraint that forces the heatmaps to be spatially aligned across predictions.
• They design a spatial refinement network (SRN) with large kernels, so the network can make use
of rich contextual information between the predicted locations.
• It first extracts intermediate activations from the TPN and passes them through a set of convolutional
layers with stride 2, so that the output feature map is the same size as the earlier activation of the TPN.
• Then, it upsamples the concatenated features using four deconvolutional layers followed by a 7 x
7 and 1 x 1 convolution.
• By using large receptive fields and increasing the number of layers, the network is able to
effectively capture dependencies, which results in less confusion between heatmap locations.
• In addition, the use of a 1 x 1 convolution pushes the refinement process further toward
pixel-level correction in the filter space.
Looking to Relations for Future Trajectory Forecast
[Figure: architecture of the spatial refinement network]
Looking to Relations for Future Trajectory Forecast
The proposed approach properly encodes (a) human-human and (b) human-space interactions by
inferring relational behavior from a physical environment (highlighted by a dashed arrow➔). However,
it sometimes fails to predict a future trajectory when a road user (c) unexpectedly changes the direction
of its motion or (d) does not consider the interactions with an environment. (Color codes: yellow: given
past trajectory; red: ground truth; green: the method’s prediction)
The Garden of Forking Paths: Towards Multi-
Future Trajectory Prediction
• This paper studies the problem of predicting the distribution over multiple possible future paths
of people as they move through various visual scenes.
• It makes two main contributions:
• The first contribution is a new dataset, created in a realistic 3D simulator, which is based on
real-world trajectory data that human annotators then extrapolate to achieve different latent
goals. This provides the first benchmark for quantitative evaluation of models that predict
multi-future trajectories.
• The second contribution is a new model, called Multiverse, that generates multiple plausible
future trajectories using multi-scale location encodings and convolutional RNNs over
graphs.
• Website: https://next.cs.cmu.edu/multiverse/index.html
The Garden of Forking Paths: Towards Multi-
Future Trajectory Prediction
Illustration of person trajectory prediction. (1) A person walks towards a car (data from the VIRAT/ActEV dataset). The
green line is the actual future trajectory and the yellow-orange heatmaps are example future predictions. Although these
predictions near the cars are plausible, they would be considered errors in the real video dataset. (2) To combat this, the
paper proposes a new dataset called “Forking Paths”; the panel illustrates 3 possible futures created by human annotators
controlling agents in a synthetic world derived from real data. (3) Semantic segmentation of the scene. (4-6) The same
scene rendered from different viewing angles, where the red circles are future destinations.
The Garden of Forking Paths: Towards Multi-
Future Trajectory Prediction
Overview of the model. The input to the model is the ground truth location history, and a set of
video frames, which are preprocessed by a semantic segmentation model. This is encoded by
the “History Encoder” convolutional RNN. The output of the encoder is fed to the convolutional
RNN decoder for location prediction. The coarse location decoder outputs a heatmap over the
2D grid of size H × W . The fine location decoder outputs a vector offset within each grid cell.
These are combined to generate a multimodal distribution over R² for predicted locations.
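One way to picture how the coarse heatmap and the fine offsets combine is the sketch below. It only decodes the single most likely location, and it assumes square grid cells and offsets measured from the cell center; neither assumption is confirmed by the slides, and the function and parameter names are hypothetical.

```python
def decode_location(heatmap, offsets, cell_size):
    """Pick the most likely grid cell, then add that cell's offset to its center.

    heatmap: H x W nested list of probabilities (coarse decoder output).
    offsets: H x W nested list of (dx, dy) tuples (fine decoder output).
    cell_size: assumed side length of a square grid cell, in pixels.
    """
    height, width = len(heatmap), len(heatmap[0])
    row, col = max(
        ((r, c) for r in range(height) for c in range(width)),
        key=lambda rc: heatmap[rc[0]][rc[1]],
    )
    center_x = (col + 0.5) * cell_size
    center_y = (row + 0.5) * cell_size
    dx, dy = offsets[row][col]
    return center_x + dx, center_y + dy


heatmap = [[0.1, 0.2], [0.6, 0.1]]
offsets = [[(0.0, 0.0), (0.0, 0.0)], [(3.0, -2.0), (0.0, 0.0)]]
x, y = decode_location(heatmap, offsets, cell_size=10.0)
```

Repeating the decoding for several high-probability cells, rather than only the argmax, would yield several candidate locations, which is where the multimodality comes from.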
The Garden of Forking Paths: Towards Multi-
Future Trajectory Prediction
• The history encoder computes a representation of the scene from the history of past locations. It
preprocesses each video frame using a pre-trained semantic segmentation model, the Deeplab
model, trained on the ADE20k dataset.
• Coarse Location Decoder: The graph-structured update function for the RNN ensures that the
probability mass “diffuses out” to nearby grid cells in a controlled manner, reflecting the prior
knowledge that people do not suddenly jump between distant locations.
• Fine Location Decoder: it trains a second convolutional RNN decoder to compute an offset vector
for each possible grid cell using a regression output.
• The loss function:
Stochastic Sampling Simulation for Pedestrian
Trajectory Prediction
• Urban environments pose a significant challenge for autonomous vehicles (AVs) as they must
safely navigate while in close proximity to many pedestrians.
• It is crucial for the AV to correctly understand and predict the future trajectories of pedestrians to
avoid collision and plan a safe path.
• This paper describes a method using a stochastic sampling-based simulation to train DNNs for
pedestrian trajectory prediction with social interaction.
• This simulation method can generate vast amounts of automatically-annotated, realistic, and
naturalistic synthetic pedestrian trajectories based on small amounts of real annotation.
• It then uses these synthetic trajectories to train an off-the-shelf, state-of-the-art deep learning
approach, Social GAN, to perform pedestrian trajectory prediction.
• The proposed architecture, trained only on synthetic trajectories, achieves better prediction
results than the same network trained on human-annotated real-world data.
Stochastic Sampling Simulation for Pedestrian
Trajectory Prediction
System overview. It proposes using a novel stochastic sampling-based
simulation system to train a deep neural network (e.g., Social GAN) to
make socially acceptable pedestrian trajectory predictions.
Disentangling Human Dynamics for Pedestrian
Locomotion Forecasting with Noisy Supervision
• It tackles the problem of Human Locomotion Forecasting, the task of jointly predicting the spatial
positions of several keypoints on the human body in the near future in an egocentric setting.
• In contrast to the previous work that aims to solve either the task of pose prediction or trajectory
forecasting in isolation, it proposes a framework to unify these two problems and address the
practically useful task of pedestrian locomotion prediction in the wild.
• Among the major challenges in solving this task is the scarcity of annotated egocentric video
datasets with dense annotations for pose, depth, or egomotion.
• To surmount this difficulty, they use state-of-the-art models to generate (noisy) annotations and
propose robust forecasting models that can learn from this noisy supervision.
• This method disentangles the overall pedestrian motion into easier-to-learn subparts by utilizing a
pose completion module and a decomposition module.
• The completion module fills in the missing keypoint annotations and the decomposition module
breaks the cleaned locomotion down into global (trajectory) and local (pose keypoint movement) components.
• Further, with Quasi RNN as the backbone, they propose a hierarchical trajectory forecasting
network that utilizes low-level vision domain specific signals like egomotion and depth to predict
the global trajectory.
Disentangling Human Dynamics for Pedestrian
Locomotion Forecasting with Noisy Supervision
Egocentric pedestrian locomotion forecasting. Locomotion is defined as the overall motion of keypoints on
the pedestrian in contrast to predicting just the position (trajectory prediction) or the pose (pose forecasting).
Disentangling Human Dynamics for Pedestrian
Locomotion Forecasting with Noisy Supervision
An illustration for human locomotion forecasting with noisy supervision. The “Raw Pose” plane represents
the noisy input pose sequence with missing joint detection. The “Complete Pose” plane denotes the
output from the pose completion module with filled in joint positions. The completed pose is then split
into the global and local streams which separate concurrent motions. The prediction modules forecast the
future streams. Finally, these streams are merged to predict future pedestrian locomotion.
Disentangling Human Dynamics for Pedestrian
Locomotion Forecasting with Noisy Supervision
• They frame the task of forecasting human locomotion in egocentric view (of the vehicle) as a
sequence-to-sequence problem.
• They use a state-of-the-art multi-person keypoint detection module to automatically
generate dense but noisy frame-level supervision for human poses.
• It automatically estimates depth from a monocular camera using SuperDepth, which extends a
subpixel convolutional layer for depth super-resolution.
• It uses a state-of-the-art unsupervised model to automatically estimate the camera
motion that occurs between consecutive frames due to the movement of the egovehicle.
• They propose a pose completion network for completing the detected human poses.
• This processing has a two-fold benefit. First, it fills in the joints that are not detected by the pose
detection module and suppresses noise by replacing low-confidence outputs with better
estimates. Second, it makes it possible to decompose the motion despite noisy data: without
completion, separating the global and local components of the motion would be unreliable because
the joints flicker frequently.
Disentangling Human Dynamics for Pedestrian
Locomotion Forecasting with Noisy Supervision
Encoder-Recurrent-Decoder architecture. The lock denotes the sharing of the frame encoder weights across
different time steps of the input sequence. Dotted squares contain values concerned with the same frame.
Disentangling Human Dynamics for Pedestrian
Locomotion Forecasting with Noisy Supervision
Architecture of the pose completion and disentangling module. The shades represent the
confidence in locating the joint. Black represents highest confidence and white represents
missing data. All detections below a confidence threshold are replaced with the autoencoder
estimates (sky blue). The completed pose is then split into local and global streams for forecasting.
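The replacement rule in this caption can be sketched as follows. The threshold value of 0.5 and all names are assumptions for illustration, and the autoencoder estimates are simply passed in as a list rather than computed.

```python
def complete_pose(detected, confidence, estimated, threshold=0.5):
    """Keep confident detections; swap the rest for autoencoder estimates.

    detected, estimated: lists of (x, y) joint positions.
    confidence: per-joint detection confidence in [0, 1].
    threshold: assumed cut-off; the slides do not state its value.
    """
    return [
        est if conf < threshold else det
        for det, conf, est in zip(detected, confidence, estimated)
    ]


detected = [(10, 20), (0, 0), (30, 40)]   # second joint is missing
confidence = [0.9, 0.0, 0.3]              # third joint is low-confidence
estimated = [(11, 21), (20, 30), (31, 41)]
completed = complete_pose(detected, confidence, estimated)
```

A missing joint is treated here as a detection with zero confidence, so the same rule covers both missing and noisy joints.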
Disentangling Human Dynamics for Pedestrian
Locomotion Forecasting with Noisy Supervision
• The proposal to disentangle global and local motion is motivated by the relative difference in the
nature and magnitude of motion exhibited by these streams.
• Disentangling significantly reduces the overall complexity, since each of the streams
now models a much simpler, easier-to-predict motion.
• It proposes to use the neck joint sequence as a representation of the global stream, because the
neck is the most widely observed joint in the dataset.
• The Quasi-Recurrent Neural Network forms the backbone of the seq-to-seq learning structure.
• QRNNs consist of alternating convolutional and recurrent pooling modules and are designed to
parallelize more efficiently than vanilla LSTMs.
• Quasi-RNN trains faster (825 ms/batch) than LSTM (886 ms/batch) on a single GPU under the
same parameters and yields faster convergence for similar model capacity.
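A minimal sketch of the global/local split follows, assuming the local stream is each keypoint expressed relative to the neck joint. The slides state only that the neck sequence represents the global stream, so the relative-coordinate form and the function names are assumptions.

```python
def split_streams(poses, neck_index):
    """Split a pose sequence into a global (neck) and a local (neck-relative) stream."""
    global_stream = [pose[neck_index] for pose in poses]
    local_stream = [
        [(x - nx, y - ny) for (x, y) in pose]
        for pose, (nx, ny) in zip(poses, global_stream)
    ]
    return global_stream, local_stream


def merge_streams(global_stream, local_stream):
    """Inverse of split_streams: add the neck position back to every keypoint."""
    return [
        [(x + nx, y + ny) for (x, y) in pose]
        for pose, (nx, ny) in zip(local_stream, global_stream)
    ]


# Two frames, two keypoints each; keypoint 0 plays the role of the neck.
poses = [[(5, 5), (6, 8)], [(7, 5), (9, 8)]]
global_stream, local_stream = split_streams(poses, neck_index=0)
```

After the split, the global stream carries the large translation across the image while the local stream stays small and centered, which is why each is simpler to forecast on its own.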
Disentangling Human Dynamics for Pedestrian
Locomotion Forecasting with Noisy Supervision
Architecture for forecasting the local stream.
The Quasi-RNN encoder-decoder has N layers of
alternate convolutions and recurrent pooling,
both in the input encoder and the output
decoder. The recurrent pooling is a thin
aggregation function applied to the convolutional
activations. The encoder churns through the
latent representation of the previous poses and
encodes the necessary information into a context
vector. This vector is then consumed by the
QRNN decoder to forecast the future poses
mapped back to the same latent space.
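The "thin aggregation function" can be illustrated with the simplest pooling variant from the QRNN literature, often called f-pooling. Which variant this work uses is not stated in the slides, so treat the sketch as an assumption; in a real QRNN the candidates and gates come from convolutions over the input, whereas here they are given as plain lists for one channel.

```python
def f_pooling(candidates, forget_gates, h0=0.0):
    """Recurrent pooling: h_t = f_t * h_{t-1} + (1 - f_t) * z_t for one channel."""
    h = h0
    states = []
    for z, f in zip(candidates, forget_gates):
        h = f * h + (1.0 - f) * z
        states.append(h)
    return states


# Toy candidate values z_t and forget gates f_t for three time steps.
states = f_pooling([1.0, 0.0, 1.0], [0.0, 0.5, 0.5])
```

The only sequential work is this cheap element-wise recurrence; the heavier convolutions that would produce the candidates and gates can run in parallel over all time steps.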
Disentangling Human Dynamics for Pedestrian
Locomotion Forecasting with Noisy Supervision
• The filled in and decomposed pose is used as the input to the pose prediction module.
• This module comprises a spatial encoder with the latent dimension.
• The weights of this spatial encoder are separately trained using the autoencoder while the
complexity of the latent space is similar.
• The forecasting is processed in the latent space with layers of the QRNN Encoder-Decoder module.
• Forecasting is done in the latent space because, as confirmed by the pose completion module
experiments, the human pose lies on a low-dimensional manifold owing to the various
kinematic constraints enforced by the human body.
• Forecasting in this lower-dimensional, denser space makes prediction easier for the Quasi-RNN module.
• The predicted pose is mapped back to the image space with the spatial decoder to forecast the pose.
• It proposes to predict residuals from the first observed positions instead of forecasting absolute
coordinates. In particular, it learns to predict the global stream from separately processed low
level vision signals (monocular depth, camera egomotion).
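The residual formulation in the last bullet amounts to a change of coordinates, sketched below. The function names are hypothetical, and the sketch covers only the encoding and decoding of positions, not the network that predicts the residuals.

```python
def to_residuals(track):
    """Express every position as an offset from the first observed position."""
    x0, y0 = track[0]
    return [(x - x0, y - y0) for x, y in track]


def from_residuals(residuals, first_observed):
    """Recover absolute positions by adding the first observed position back."""
    x0, y0 = first_observed
    return [(dx + x0, dy + y0) for dx, dy in residuals]


track = [(100, 50), (102, 51), (105, 53)]
residuals = to_residuals(track)
restored = from_residuals(residuals, track[0])
```

Working with offsets keeps the regression targets small and independent of where in the image the pedestrian happens to appear.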
Disentangling Human Dynamics for Pedestrian
Locomotion Forecasting with Noisy Supervision
Qualitative results from the pose prediction module. (A) shows the input sequence of poses, of length tp = 15, for a
pedestrian walking on the sidewalk. (B) and (C) show the cropped frames with the pedestrian and the corresponding
filled-in pose at the start and end of the input sequence. (D) shows the predicted pedestrian locomotion for the next tf
= 15 frames. (D), (E), and (F) also show the predicted poses at the start, intermediate, and end of the output sequence,
respectively. Note that the actual position of the pedestrian represents the ground truth in (D), (E) and (F).
Social and Scene-Aware Trajectory Prediction in
Crowded Spaces
• Mimicking human ability to forecast future positions or interpret complex interactions in
urban scenarios, such as streets, shopping malls or squares, is essential to develop socially
compliant robots or self-driving cars.
• Autonomous systems can benefit from anticipating human motion, both to avoid collisions
and to behave naturally alongside people.
• To foresee plausible trajectories, it constructs an LSTM-based model considering three
fundamental factors: people interactions, past observations in terms of previously crossed
areas and semantics of surrounding space.
• The model encompasses several pooling mechanisms to join the above elements defining
multiple tensors, namely social, navigation and semantic tensors.
• The network is tested in unstructured environments where complex paths emerge according
to both internal (intentions) and external (other people, inaccessible areas) motivations.
• As demonstrated, modeling paths without awareness of social interactions or context information is
insufficient to correctly predict future positions.
• Code: https://github.com/Oghma/sns-lstm/
Social and Scene-Aware Trajectory Prediction in
Crowded Spaces
Overview of the proposed model. Trajectories, navigation map and semantic image are fed
to the LSTM network and combined using three pooling mechanisms. Future positions are
obtained using linear layers to extract key parameters of a Gaussian distribution.
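A common way to use such Gaussian outputs, for example in the original Social LSTM, is to predict five parameters of a bivariate Gaussian and train with the negative log-likelihood of the true position. Whether this model uses exactly this parameterization is an assumption here; the sketch shows only the likelihood computation, with hypothetical names.

```python
import math


def bivariate_gaussian_nll(x, y, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Negative log-likelihood of (x, y) under a bivariate Gaussian.

    mu_*: means, sigma_*: standard deviations (> 0), rho: correlation in (-1, 1).
    """
    zx = (x - mu_x) / sigma_x
    zy = (y - mu_y) / sigma_y
    one_minus_rho2 = 1.0 - rho * rho
    quad = (zx * zx - 2.0 * rho * zx * zy + zy * zy) / (2.0 * one_minus_rho2)
    log_norm = math.log(2.0 * math.pi * sigma_x * sigma_y * math.sqrt(one_minus_rho2))
    return quad + log_norm


at_mean = bivariate_gaussian_nll(0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0)
far_away = bivariate_gaussian_nll(3.0, 3.0, 0.0, 0.0, 1.0, 1.0, 0.0)
```

The loss is lowest when the true position sits at the predicted mean and grows as the position moves away from it.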
Social and Scene-Aware Trajectory Prediction in
Crowded Spaces
• Pedestrian dynamics in urban scenarios are highly influenced by static and dynamic factors which
guide people towards their destinations.
• To forecast realistic paths, it is important to allow human dynamics to be influenced by
surrounding space, not only in terms of other people in their neighborhood, but also considering
semantics of crossed areas as well as past observations.
• This framework models each pedestrian as an LSTM network interacting with the surrounding
space using three pooling mechanisms, namely Social, Navigation and Semantic pooling.
• Social pooling mechanism takes into account the neighborhood in terms of other people, merging their
hidden states.
• Navigation pooling mechanism exploits past observations to discriminate between equally likely
predicted positions using previous information about the scene.
• Finally, Semantic pooling uses semantic scene segmentation to recognize non-traversable areas.
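Social pooling can be sketched in the style of the original Social LSTM: a small grid is centered on the target pedestrian and the hidden states of neighbors falling into the same cell are summed. The grid layout, the summation, and every name below are assumptions for illustration; the slides say only that neighbors' hidden states are merged.

```python
def social_tensor(target_pos, neighbors, grid_size, cell_size, hidden_dim):
    """Build a grid_size x grid_size x hidden_dim tensor around the target.

    neighbors: list of ((x, y), hidden_state) pairs, hidden_state a list of floats.
    Neighbors outside the grid are ignored.
    """
    tensor = [[[0.0] * hidden_dim for _ in range(grid_size)] for _ in range(grid_size)]
    half = grid_size * cell_size / 2.0
    tx, ty = target_pos
    for (nx, ny), hidden in neighbors:
        col = int((nx - tx + half) // cell_size)
        row = int((ny - ty + half) // cell_size)
        if 0 <= row < grid_size and 0 <= col < grid_size:
            for k in range(hidden_dim):
                tensor[row][col][k] += hidden[k]
    return tensor


neighbors = [
    ((0.5, 0.5), [1.0, 2.0]),
    ((0.6, 0.2), [0.5, 0.5]),    # same cell as the first neighbor
    ((-0.5, -0.5), [3.0, 0.0]),
    ((5.0, 5.0), [9.0, 9.0]),    # outside the grid, ignored
]
pooled = social_tensor((0.0, 0.0), neighbors, grid_size=2, cell_size=1.0, hidden_dim=2)
```

The resulting tensor has a fixed size regardless of how many neighbors are present, which is what lets it be fed to the LSTM at every step.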
Social and Scene-Aware Trajectory Prediction in
Crowded Spaces
Semantic map is generated from the reference
image while the Navigation map is obtained from
observed data. The image shows an example of
such maps for ETH dataset.
Overview of the pooling mechanisms. Three tensors take
into account social neighborhood, past observations and
semantics of surrounding space, respectively. Tensors are
finally concatenated, processed by ReLU layers and fed to
LSTM networks along with embedded positions. Figure
also highlights dimensions of each introduced tensor.
Social and Scene-Aware Trajectory Prediction in
Crowded Spaces
Some examples of predicted trajectories for
the HOTEL dataset. Ground truths are shown as
solid lines and predicted trajectories as
dashed lines. The first column shows cases
where predicted positions are very close to
the real paths. The second column shows cases
where the SNS-LSTM fails to
correctly predict future positions.
Social and Scene-Aware Trajectory Prediction in
Crowded Spaces
Visualization of temporal sequences for
different tracks drawn from both the HOTEL
and ETH datasets. The circles represent
ground truth (green), SNS-LSTM model
(blue) and S-LSTM model (red),
respectively. For each row, the first
image shows the observed path (in
green) which corresponds to 8 frames
(the three circles are superimposed),
while the remaining ones show the 9th,
13th, 17th and 20th predicted frames,
respectively.
Pedestrian behavior/intention modeling for autonomous driving IV

More Related Content

What's hot

What's hot (20)

Deep VO and SLAM
Deep VO and SLAMDeep VO and SLAM
Deep VO and SLAM
 
Pedestrian behavior/intention modeling for autonomous driving III
Pedestrian behavior/intention modeling for autonomous driving IIIPedestrian behavior/intention modeling for autonomous driving III
Pedestrian behavior/intention modeling for autonomous driving III
 
Driving Behavior for ADAS and Autonomous Driving III
Driving Behavior for ADAS and Autonomous Driving IIIDriving Behavior for ADAS and Autonomous Driving III
Driving Behavior for ADAS and Autonomous Driving III
 
Driving Behavior for ADAS and Autonomous Driving X
Driving Behavior for ADAS and Autonomous Driving XDriving Behavior for ADAS and Autonomous Driving X
Driving Behavior for ADAS and Autonomous Driving X
 
Pedestrian behavior/intention modeling for autonomous driving II
Pedestrian behavior/intention modeling for autonomous driving IIPedestrian behavior/intention modeling for autonomous driving II
Pedestrian behavior/intention modeling for autonomous driving II
 
Pedestrian behavior/intention modeling for autonomous driving V
Pedestrian behavior/intention modeling for autonomous driving VPedestrian behavior/intention modeling for autonomous driving V
Pedestrian behavior/intention modeling for autonomous driving V
 
Driving Behavior for ADAS and Autonomous Driving VIII
Driving Behavior for ADAS and Autonomous Driving VIIIDriving Behavior for ADAS and Autonomous Driving VIII
Driving Behavior for ADAS and Autonomous Driving VIII
 
Driving Behavior for ADAS and Autonomous Driving IX
Driving Behavior for ADAS and Autonomous Driving IXDriving Behavior for ADAS and Autonomous Driving IX
Driving Behavior for ADAS and Autonomous Driving IX
 
Jointly mapping, localization, perception, prediction and planning
Jointly mapping, localization, perception, prediction and planningJointly mapping, localization, perception, prediction and planning
Jointly mapping, localization, perception, prediction and planning
 
camera-based Lane detection by deep learning
camera-based Lane detection by deep learningcamera-based Lane detection by deep learning
camera-based Lane detection by deep learning
 
Driving Behavior for ADAS and Autonomous Driving VII
Driving Behavior for ADAS and Autonomous Driving VIIDriving Behavior for ADAS and Autonomous Driving VII
Driving Behavior for ADAS and Autonomous Driving VII
 
Deep Learning’s Application in Radar Signal Data II
Deep Learning’s Application in Radar Signal Data IIDeep Learning’s Application in Radar Signal Data II
Deep Learning’s Application in Radar Signal Data II
 
Driving behaviors for adas and autonomous driving xiv
Driving behaviors for adas and autonomous driving xivDriving behaviors for adas and autonomous driving xiv
Driving behaviors for adas and autonomous driving xiv
 
Depth Fusion from RGB and Depth Sensors IV
Depth Fusion from RGB and Depth Sensors  IVDepth Fusion from RGB and Depth Sensors  IV
Depth Fusion from RGB and Depth Sensors IV
 
3-d interpretation from stereo images for autonomous driving
3-d interpretation from stereo images for autonomous driving3-d interpretation from stereo images for autonomous driving
3-d interpretation from stereo images for autonomous driving
 
Deep Learning’s Application in Radar Signal Data
Deep Learning’s Application in Radar Signal DataDeep Learning’s Application in Radar Signal Data
Deep Learning’s Application in Radar Signal Data
 
Camera-Based Road Lane Detection by Deep Learning II
Camera-Based Road Lane Detection by Deep Learning IICamera-Based Road Lane Detection by Deep Learning II
Camera-Based Road Lane Detection by Deep Learning II
 
Lidar for Autonomous Driving II (via Deep Learning)
Lidar for Autonomous Driving II (via Deep Learning)Lidar for Autonomous Driving II (via Deep Learning)
Lidar for Autonomous Driving II (via Deep Learning)
 
Driving behavior for ADAS and Autonomous Driving
Driving behavior for ADAS and Autonomous DrivingDriving behavior for ADAS and Autonomous Driving
Driving behavior for ADAS and Autonomous Driving
 
Depth Fusion from RGB and Depth Sensors III
Depth Fusion from RGB and Depth Sensors  IIIDepth Fusion from RGB and Depth Sensors  III
Depth Fusion from RGB and Depth Sensors III
 

Similar to Pedestrian behavior/intention modeling for autonomous driving IV

Cooperative positioning and tracking in disruption tolerant networks
Cooperative positioning and tracking in disruption tolerant networksCooperative positioning and tracking in disruption tolerant networks
Cooperative positioning and tracking in disruption tolerant networks
JPINFOTECH JAYAPRAKASH
 

Similar to Pedestrian behavior/intention modeling for autonomous driving IV (20)

5438-Article Text-8663-1-10-20200511.pdf
5438-Article Text-8663-1-10-20200511.pdf5438-Article Text-8663-1-10-20200511.pdf
5438-Article Text-8663-1-10-20200511.pdf
 
Deep Multi-View Spatial-Temporal Network for Taxi Demand Prediction
Deep Multi-View Spatial-Temporal Network for Taxi Demand PredictionDeep Multi-View Spatial-Temporal Network for Taxi Demand Prediction
Deep Multi-View Spatial-Temporal Network for Taxi Demand Prediction
 
Learning Graph Representation for Data-Efficiency RL
Learning Graph Representation for Data-Efficiency RLLearning Graph Representation for Data-Efficiency RL
Learning Graph Representation for Data-Efficiency RL
 
NS-CUK Seminar: S.T.Nguyen, Review on "Multi-modal Trajectory Prediction for ...
NS-CUK Seminar: S.T.Nguyen, Review on "Multi-modal Trajectory Prediction for ...NS-CUK Seminar: S.T.Nguyen, Review on "Multi-modal Trajectory Prediction for ...
NS-CUK Seminar: S.T.Nguyen, Review on "Multi-modal Trajectory Prediction for ...
 
IRJET - A Review on Pedestrian Behavior Prediction for Intelligent Transport ...
IRJET - A Review on Pedestrian Behavior Prediction for Intelligent Transport ...IRJET - A Review on Pedestrian Behavior Prediction for Intelligent Transport ...
IRJET - A Review on Pedestrian Behavior Prediction for Intelligent Transport ...
 
Prediction of nodes mobility in 3-D space
Prediction of nodes mobility in 3-D space Prediction of nodes mobility in 3-D space
Prediction of nodes mobility in 3-D space
 
Laplacian-regularized Graph Bandits
Laplacian-regularized Graph BanditsLaplacian-regularized Graph Bandits
Laplacian-regularized Graph Bandits
 
Localization based range map stitching in wireless sensor network under non l...



Pedestrian behavior/intention modeling for autonomous driving IV

  • 1. Pedestrian Behavior/Intention Modeling for Autonomous Driving IV Yu Huang Yu.huang07@gmail.com Sunnyvale, California
  • 2. Outline • Looking to Relations for Future Trajectory Forecast • The Garden of Forking Paths: Towards Multi-Future Trajectory Prediction • Stochastic Sampling Simulation for Pedestrian Trajectory Prediction • Disentangling Human Dynamics for Pedestrian Locomotion Forecasting with Noisy Supervision • Social and Scene-Aware Trajectory Prediction in Crowded Spaces
  • 3. Looking to Relations for Future Trajectory Forecast • Inferring relational behavior between road users, as well as between road users and their surrounding physical space, is an important step toward effective modeling and prediction of the navigation strategies adopted by participants in road scenes. • This paper proposes a relation-aware framework for future trajectory forecast. • The system aims to infer relational information from the interactions of road users with each other and with the environment. • The first module involves visual encoding of spatio-temporal features, which captures human-human and human-space interactions over time. • The following module explicitly constructs pair-wise relations from spatio-temporal interactions and identifies the more descriptive relations that strongly influence the future motion of the target road user by considering its past trajectory. • The resulting relational features are used to forecast future locations of the target in the form of heatmaps, with additional guidance from spatial dependencies and consideration of the uncertainty.
  • 4. Looking to Relations for Future Trajectory Forecast Spatio-temporal features are visually encoded from a discretized grid to locally discover (i) human-human and (ii) human-space interactions over time. Then, their pair-wise relations with respect to the past motion of the target (→) are investigated from a global perspective for trajectory forecast.
  • 5. Looking to Relations for Future Trajectory Forecast Given a sequence of images, the GRE (gated relation encoder) visually analyzes the spatial behavior of road users and their temporal interactions with respect to environments. The subsequent RGM (relation gate module) of the GRE infers pair-wise relations from these interactions and determines which relations are meaningful from a target agent’s perspective. The aggregated relational features are used to generate initial heatmaps through the TPN (trajectory prediction network). Then, the following SRN (spatial refinement network) further refines these initial predictions, guided by their spatial dependencies. They additionally embed the uncertainty of the problem into the system at test time.
  • 6. Looking to Relations for Future Trajectory Forecast • They extend the definition of ‘object’ to a spatio-temporal feature representation extracted from each region of the discretized grid over time. • This makes it possible to visually discover (i) human-human interactions, where multiple road users interact with each other over time, (ii) human-space interactions, from their interactive behavior with environments, and (iii) environmental representations, by encoding structural information of the road. • The pair-wise relations between objects (i.e., local spatio-temporal features) are inferred from a global perspective. • Moreover, they design a new operation function to control information flow so that the network can extract descriptive relational features by looking at relations that have a high potential to influence the future motion of the target.
  • 7. Looking to Relations for Future Trajectory Forecast It visually extracts spatial representations of the static road structures, the road topology, and the appearance of road users from individual frames using the spatial behavior encoder (SBE) with 2D convolutions. They individually process each entry of spatial representations using the temporal interaction encoder (TIE) with a 3D convolution to model sequential changes of road users and road structures with their temporal interactions. The joint use of 2D convolutions for spatial modeling and 3D convolution for temporal modeling extracts more discriminative spatio-temporal features as compared to alternative methods such as 3D convolutions as a whole or 2D convolutions with an LSTM.
  • 8. Looking to Relations for Future Trajectory Forecast They focus on the internal gating process of an LSTM unit, which controls information flow through multiple switch gates. Specifically, the LSTM employs a sigmoid function with a tanh layer to determine not only which information is useful, but also how much weight it should be given. The efficacy of this control process motivates the design of a relation gate module (RGM), which is essential for generating more descriptive relational features from a target perspective.
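The sigmoid-with-tanh gating described on this slide can be sketched in a few lines of NumPy. This is only an illustration of the gating idea, not the paper's RGM: the function name `gated_relations`, the weight matrices `w_feat` and `w_gate`, the feature sizes, and the choice to sum over all object pairs are assumptions made for the example.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_relations(objects, target, w_feat, w_gate):
    """Sum gated pair-wise relation features, conditioned on the target.

    All names and shapes here are hypothetical, chosen for illustration.
    """
    n = objects.shape[0]
    total = np.zeros(w_feat.shape[1])
    for i in range(n):
        for j in range(i + 1, n):
            # One pair of object features plus the target's past-motion code.
            pair = np.concatenate([objects[i], objects[j], target])
            candidate = np.tanh(pair @ w_feat)  # content of the relation
            gate = sigmoid(pair @ w_gate)       # how much of it to let through
            total += gate * candidate
    return total


rng = np.random.default_rng(0)
objects = rng.normal(size=(4, 8))    # 4 grid-region features, 8-dim each
target = rng.normal(size=5)          # encoding of the target's past trajectory
w_feat = rng.normal(size=(21, 16))   # 21 = 8 + 8 + 5
w_gate = rng.normal(size=(21, 16))
relational = gated_relations(objects, target, w_feat, w_gate)
```

Because the gate lies in (0, 1) and the candidate in (-1, 1), each pair contributes a bounded amount, and pairs whose gate is near zero are effectively ignored.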
  • 9. Looking to Relations for Future Trajectory Forecast • To effectively identify the pixel-level probability map, they design a trajectory prediction network (TPN) with a set of deconvolutional layers. • It first reshapes the relational features extracted from the GRE to dimension 1 x 1 x w before running the proposed TPN. • The reshaped features are then incrementally upsampled using six deconvolutional layers, each followed by a ReLU activation function. • As output, the network predicts a set of activations in the form of heatmaps through the learned parameters. • In training, the sum of squared errors between the ground-truth heatmaps and the predictions is minimized over all 2D locations.
  • 10. Looking to Relations for Future Trajectory Forecast Figure panels: trajectory prediction network; relation gate module; spatial behavior encoder + temporal interaction encoder
  • 11. Looking to Relations for Future Trajectory Forecast • Since the network independently predicts δ pixel-level probability maps, there is no constraint that forces the heatmaps to be spatially aligned across predictions. • They design a spatial refinement network (SRN) with large kernels, so the network can make use of rich contextual information between the predicted locations. • It first extracts intermediate activations from the TPN and passes them through a set of convolutional layers with stride 2 so that the output feature map is the same size as the earlier activation of the TPN. • Then, it upsamples the concatenated features using four deconvolutional layers followed by a 7 x 7 and a 1 x 1 convolution. • By using large receptive fields and increasing the number of layers, the network is able to effectively capture dependencies, which results in less confusion between heatmap locations. • In addition, the use of a 1 x 1 convolution makes the refinement process achieve further pixel-level correction in the filter space.
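The training objective on this slide, a sum of squared errors between predicted and ground-truth heatmaps over all 2D locations, can be sketched as follows. The slide does not say how the ground-truth heatmaps are built, so the one-hot encoding in `one_hot_heatmaps` and the argmax decoding in `decode` are assumptions for illustration only.

```python
import numpy as np


def one_hot_heatmaps(points, height, width):
    """Build one heatmap per future step with a single 1 at (row, col).

    The one-hot encoding is an assumption; the slide does not specify it.
    """
    maps = np.zeros((len(points), height, width))
    for k, (row, col) in enumerate(points):
        maps[k, row, col] = 1.0
    return maps


def sse_loss(pred, gt):
    """Sum of squared errors over every heatmap and every 2D location."""
    return float(np.sum((pred - gt) ** 2))


def decode(heatmaps):
    """Read one (row, col) per heatmap by taking its strongest activation."""
    flat = heatmaps.reshape(len(heatmaps), -1).argmax(axis=1)
    rows, cols = np.unravel_index(flat, heatmaps.shape[1:])
    return list(zip(rows.tolist(), cols.tolist()))


gt = one_hot_heatmaps([(1, 2), (3, 0)], height=4, width=5)
blank = np.zeros_like(gt)
loss_perfect = sse_loss(gt, gt)
loss_blank = sse_loss(blank, gt)
decoded = decode(gt)
```

Each heatmap is scored independently here, which matches the observation on the later SRN slide that nothing in this loss ties the predicted locations of different time steps together.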
  • 12. Looking to Relations for Future Trajectory Forecast spatial refinement network
  • 13. Looking to Relations for Future Trajectory Forecast The proposed approach properly encodes (a) human-human and (b) human-space interactions by inferring relational behavior from a physical environment (highlighted by a dashed arrow➔). However, it sometimes fails to predict a future trajectory when a road user (c) unexpectedly changes the direction of its motion or (d) does not consider the interactions with an environment. (Color codes: Yellow - given past trajectory, Red - ground-truth, and Green – the method’s prediction)
  • 14. The Garden of Forking Paths: Towards Multi-Future Trajectory Prediction • This paper studies the problem of predicting the distribution over multiple possible future paths of people as they move through various visual scenes. • It makes two main contributions: • The first contribution is a new dataset, created in a realistic 3D simulator, which is based on real-world trajectory data and then extrapolated by human annotators to achieve different latent goals. This provides the first benchmark for quantitative evaluation of models that predict multi-future trajectories. • The second contribution is a new model, called Multiverse, that generates multiple plausible future trajectories using novel designs: multi-scale location encodings and convolutional RNNs over graphs. • Website: https://next.cs.cmu.edu/multiverse/index.html
  • 15. The Garden of Forking Paths: Towards Multi-Future Trajectory Prediction Illustration of person trajectory prediction. (1) A person walks towards a car (data from the VIRAT/ActEV dataset). The green line is the actual future trajectory and the yellow-orange heatmaps are example future predictions. Although these predictions near the cars are plausible, they would be considered errors in the real video dataset. (2) To combat this, the paper proposes a new dataset called “Forking Paths”; here it illustrates 3 possible futures created by human annotators controlling agents in a synthetic world derived from real data. (3) Semantic segmentation of the scene. (4-6) The same scene rendered from different viewing angles, where the red circles are future destinations.
  • 16. The Garden of Forking Paths: Towards Multi-Future Trajectory Prediction Overview of the model. The input to the model is the ground-truth location history and a set of video frames, which are preprocessed by a semantic segmentation model. This is encoded by the “History Encoder” convolutional RNN. The output of the encoder is fed to the convolutional RNN decoder for location prediction. The coarse location decoder outputs a heatmap over the 2D grid of size H × W. The fine location decoder outputs a vector offset within each grid cell. These are combined to generate a multimodal distribution over ℝ² for predicted locations.
  • 17. The Garden of Forking Paths: Towards Multi-Future Trajectory Prediction • The history encoder computes a representation of the scene from the history of past locations. It preprocesses each video frame using a pre-trained semantic segmentation model, the Deeplab model, trained on the ADE20k dataset. • Coarse Location Decoder: The graph-structured update function for the RNN ensures that the probability mass “diffuses out” to nearby grid cells in a controlled manner, reflecting the prior knowledge that people do not suddenly jump between distant locations. • Fine Location Decoder: it trains a second convolutional RNN decoder to compute an offset vector for each possible grid cell using a regression output. • The loss function:
  • 18. The Garden of Forking Paths: Towards Multi-Future Trajectory Prediction
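How a coarse heatmap and per-cell offsets combine into a continuous location can be sketched as below. This is a minimal reading of the slide, not the Multiverse implementation: taking the single most likely cell, measuring offsets from the cell center, and the names `predict_location` and `cell_size` are all assumptions.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def predict_location(coarse_logits, offsets, cell_size):
    """Combine a coarse grid heatmap with fine per-cell offsets.

    coarse_logits: (H, W) scores over grid cells.
    offsets:       (H, W, 2) regressed (dx, dy) for each cell, assumed here
                   to be measured from the cell center in pixels.
    """
    height, width = coarse_logits.shape
    probs = softmax(coarse_logits.ravel())
    best = int(probs.argmax())
    row, col = divmod(best, width)
    center = np.array([(col + 0.5) * cell_size, (row + 0.5) * cell_size])
    return center + offsets[row, col], float(probs[best])


logits = np.zeros((3, 4))
logits[1, 2] = 5.0                      # the model favors cell (row 1, col 2)
offsets = np.zeros((3, 4, 2))
offsets[1, 2] = [2.0, -3.0]             # fine correction inside that cell
location, confidence = predict_location(logits, offsets, cell_size=10.0)
```

Keeping the top few cells instead of only the argmax would give several candidate locations, which is closer in spirit to the multimodal distribution the slide describes.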
  • 19. Stochastic Sampling Simulation for Pedestrian Trajectory Prediction • Urban environments pose a significant challenge for autonomous vehicles (AVs), as they must navigate safely while in close proximity to many pedestrians. • It is crucial for the AV to correctly understand and predict the future trajectories of pedestrians to avoid collisions and plan a safe path. • This paper describes a method that uses a stochastic sampling-based simulation to train DNNs for pedestrian trajectory prediction with social interaction. • This simulation method can generate vast amounts of automatically annotated, realistic, and naturalistic synthetic pedestrian trajectories based on small amounts of real annotation. • It then uses such synthetic trajectories to train an off-the-shelf state-of-the-art deep learning approach, Social GAN, to perform pedestrian trajectory prediction. • The proposed architecture, trained using only synthetic trajectories, achieves better prediction results than the same network trained on human-annotated real-world data.
  • 20. Stochastic Sampling Simulation for Pedestrian Trajectory Prediction System overview. It proposes using a novel stochastic sampling-based simulation system to train a deep neural network (e.g., Social GAN) to make socially acceptable pedestrian trajectory predictions.
  • 21. Stochastic Sampling Simulation for Pedestrian Trajectory Prediction
  • 22. Disentangling Human Dynamics for Pedestrian Locomotion Forecasting with Noisy Supervision • It tackles the problem of Human Locomotion Forecasting, a task of jointly predicting the spatial positions of several keypoints on the human body in the near future under an egocentric setting. • In contrast to previous work that aims to solve either pose prediction or trajectory forecasting in isolation, it proposes a framework to unify these two problems and address the practically useful task of pedestrian locomotion prediction in the wild. • Among the major challenges in solving this task is the scarcity of egocentric video datasets with dense annotations for pose, depth, or egomotion. • To surmount this difficulty, they use state-of-the-art models to generate (noisy) annotations and propose robust forecasting models that can learn from this noisy supervision. • This method disentangles the overall pedestrian motion into easier-to-learn subparts by utilizing a pose completion module and a decomposition module. • The completion module fills in the missing keypoint annotations, and the decomposition module breaks the cleaned locomotion down into global (trajectory) and local (pose keypoint movement) components. • Further, with Quasi-RNN as the backbone, they propose a hierarchical trajectory forecasting network that utilizes low-level, vision-domain-specific signals such as egomotion and depth to predict the global trajectory.
  • 23. Disentangling Human Dynamics for Pedestrian Locomotion Forecasting with Noisy Supervision Egocentric pedestrian locomotion forecasting. Locomotion is defined as the overall motion of keypoints on the pedestrian in contrast to predicting just the position (trajectory prediction) or the pose (pose forecasting).
  • 24. Disentangling Human Dynamics for Pedestrian Locomotion Forecasting with Noisy Supervision An illustration for human locomotion forecasting with noisy supervision. The “Raw Pose” plane represents the noisy input pose sequence with missing joint detection. The “Complete Pose” plane denotes the output from the pose completion module with filled in joint positions. The completed pose is then split into the global and local streams which separate concurrent motions. The prediction modules forecast the future streams. Finally, these streams are merged to predict future pedestrian locomotion.
  • 25. Disentangling Human Dynamics for Pedestrian Locomotion Forecasting with Noisy Supervision • They frame the task of forecasting human locomotion in the egocentric view (of the vehicle) as a sequence-to-sequence problem. • They use a state-of-the-art multi-person keypoint detection model to automatically generate dense but noisy frame-level supervision for human poses. • Depth is estimated automatically from a monocular camera using SuperDepth, which extends a subpixel convolutional layer for depth super-resolution. • A state-of-the-art unsupervised model automatically estimates the camera motion that occurs between consecutive frames due to the movement of the ego vehicle. • They propose a pose completion network for completing the detected human poses. • This processing has a two-fold benefit. First, it fills in the joints that are not detected by the pose detection module, and it suppresses noise by replacing low-confidence outputs with better estimates. Second, it makes it possible to decompose the motion despite noisy data; without completion, separating the global and local components of the motion would be unreliable because the joints flicker frequently.
  • 26. Disentangling Human Dynamics for Pedestrian Locomotion Forecasting with Noisy Supervision Encoder-Recurrent-Decoder architecture. The lock denotes the sharing of the frame encoder weights across different time steps of the input sequence. Dotted squares contain values concerned with the same frame.
  • 27. Disentangling Human Dynamics for Pedestrian Locomotion Forecasting with Noisy Supervision Architecture of the pose completion and disentangling module. The shades represent the confidence in locating the joint. Black represents the highest confidence and white represents missing data. All detections below a confidence threshold are replaced with the autoencoder estimates (sky blue). The completed pose is then split into local and global streams for forecasting.
  • 28. Disentangling Human Dynamics for Pedestrian Locomotion Forecasting with Noisy Supervision • The proposal to disentangle global and local motion is motivated by the relative difference in the nature and magnitude of the motion exhibited by these streams. • This disentangling significantly reduces the overall complexity, since each stream now models a much simpler and easier-to-predict motion. • It proposes to use the neck joint sequence as a representation of the global stream, because the neck is the most widely observed joint in the dataset. • The Quasi-Recurrent Neural Network (QRNN) forms the backbone of the seq-to-seq learning structure. • QRNNs consist of alternating convolutional and recurrent pooling modules and are designed to parallelize better than vanilla LSTMs. • The Quasi-RNN trains faster (825 ms/batch) than an LSTM (886 ms/batch) on a single GPU under the same parameters and yields faster convergence for similar model capacity.
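The completion and disentangling steps in this figure can be sketched with NumPy array operations. The sketch assumes a fixed threshold of 0.3, a neck joint at index 1, and a local stream defined as each joint's position relative to the neck; none of these values come from the slides, and the autoencoder itself is replaced by a precomputed `estimate` array.

```python
import numpy as np

NECK = 1  # hypothetical index of the neck joint in the keypoint layout


def complete_pose(detected, confidence, estimate, threshold=0.3):
    """Replace low-confidence joints with estimates (e.g. from an autoencoder).

    detected, estimate: (T, J, 2) joint coordinates; confidence: (T, J).
    The threshold value is an assumption for illustration.
    """
    low = confidence < threshold
    completed = detected.copy()
    completed[low] = estimate[low]
    return completed


def split_streams(pose_seq, neck=NECK):
    """Split a pose sequence into a global (neck) and a local stream."""
    global_stream = pose_seq[:, neck, :]                  # (T, 2)
    local_stream = pose_seq - global_stream[:, None, :]   # joints relative to neck
    return global_stream, local_stream


def merge_streams(global_stream, local_stream):
    """Inverse of split_streams: recover absolute joint positions."""
    return local_stream + global_stream[:, None, :]


rng = np.random.default_rng(0)
detected = rng.normal(size=(3, 4, 2))            # 3 frames, 4 joints
confidence = np.array([[0.9, 0.8, 0.1, 0.7],
                       [0.2, 0.9, 0.9, 0.9],
                       [0.9, 0.9, 0.9, 0.05]])
estimate = np.full((3, 4, 2), 7.0)               # stand-in for autoencoder output
completed = complete_pose(detected, confidence, estimate)
global_stream, local_stream = split_streams(completed)
```

Splitting only after completion matters in this sketch for the same reason the deck gives: if the neck joint itself were missing in a frame, every local coordinate in that frame would be corrupted.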
  • 29. Disentangling Human Dynamics for Pedestrian Locomotion Forecasting with Noisy Supervision Architecture for forecasting the local stream. The Quasi-RNN encoder-decoder has N layers of alternate convolutions and recurrent pooling, both in the input encoder and the output decoder. The recurrent pooling is a thin aggregation function applied to the convolutional activations. The encoder churns through the latent representation of the previous poses and encodes the necessary information into a context vector. This vector is then consumed by the QRNN decoder to forecast the future poses mapped back to the same latent space.
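One QRNN layer, a causal convolution followed by a recurrent pooling step, can be sketched as follows. It uses the simplest pooling variant from the QRNN literature (f-pooling, h_t = f_t * h_{t-1} + (1 - f_t) * z_t); which variant, filter width, and layer sizes this deck's model uses is not stated, so those choices and the name `qrnn_layer` are assumptions.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def qrnn_layer(x, w_z, w_f):
    """One QRNN layer: causal convolution, then f-pooling over time.

    x:        (T, d_in) input sequence.
    w_z, w_f: (k, d_in, d_out) filters of width k for candidates and gates.
    """
    k, d_in, d_out = w_z.shape
    # Left-pad with zeros so step t only sees inputs t-k+1 .. t (causal).
    padded = np.vstack([np.zeros((k - 1, d_in)), x])
    h = np.zeros(d_out)
    outputs = []
    for t in range(x.shape[0]):
        window = padded[t:t + k]
        z = np.tanh(np.einsum("kd,kdo->o", window, w_z))   # candidate
        f = sigmoid(np.einsum("kd,kdo->o", window, w_f))   # forget gate
        h = f * h + (1.0 - f) * z                          # recurrent pooling
        outputs.append(h)
    return np.stack(outputs)


rng = np.random.default_rng(0)
x = rng.normal(size=(6, 3))
w_z = rng.normal(size=(2, 3, 4))
w_f = rng.normal(size=(2, 3, 4))
hidden = qrnn_layer(x, w_z, w_f)
```

For brevity the convolution is evaluated inside the time loop here. In an actual QRNN the candidates and gates for all time steps are computed in parallel, and only the cheap element-wise pooling runs sequentially, which is where the speed advantage over an LSTM comes from.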
  • 30. Disentangling Human Dynamics for Pedestrian Locomotion Forecasting with Noisy Supervision • The filled-in and decomposed pose is used as the input to the pose prediction module. • This module comprises a spatial encoder with the latent dimension. • The weights of this spatial encoder are separately trained using the autoencoder while the complexity of the latent space is similar. • The forecasting is processed in the latent space with layers of the QRNN Encoder-Decoder module. • It forecasts in the latent space because, as confirmed by the pose completion module experiments, the human pose lies on a low-dimensional manifold due to the various kinematic constraints enforced by the human body. • Forecasting in this lower-dimensional, denser space makes the prediction easier for the Quasi-RNN module. • The predicted pose is mapped back to the image space with the spatial decoder to forecast the pose. • It proposes to predict residuals from the first observed positions instead of forecasting absolute coordinates. In particular, it learns to predict the global stream from separately processed low-level vision signals (monocular depth, camera egomotion).
  • 31. Disentangling Human Dynamics for Pedestrian Locomotion Forecasting with Noisy Supervision Qualitative results from the pose prediction module. (A) shows the input sequence of poses, of length tp = 15, for a pedestrian walking on the sidewalk. (B) and (C) show the cropped frames with the pedestrian and the corresponding filled-in pose at the start and end of the input sequence. (D) shows the predicted pedestrian locomotion for the next tf = 15 frames. (D), (E), and (F) also show the predicted poses at the start, middle, and end of the output sequence, respectively. Note that the actual position of the pedestrian represents the ground truth in (D), (E), and (F).
  • 32. Social and Scene-Aware Trajectory Prediction in Crowded Spaces • Mimicking the human ability to forecast future positions or interpret complex interactions in urban scenarios, such as streets, shopping malls, or squares, is essential for developing socially compliant robots or self-driving cars. • Autonomous systems can benefit from anticipating human motion, both to avoid collisions and to behave naturally alongside people. • To foresee plausible trajectories, it constructs an LSTM-based model that considers three fundamental factors: interactions between people, past observations in terms of previously crossed areas, and the semantics of the surrounding space. • The model encompasses several pooling mechanisms to join the above elements, defining multiple tensors, namely the social, navigation, and semantic tensors. • The network is tested in unstructured environments where complex paths emerge according to both internal (intentions) and external (other people, inaccessible areas) motivations. • As demonstrated, modeling paths without awareness of social interactions or context information is insufficient to correctly predict future positions. • Code: https://github.com/Oghma/sns-lstm/
  • 33. Social and Scene-Aware Trajectory Prediction in Crowded Spaces Overview of the proposed model. Trajectories, navigation map and semantic image are fed to the LSTM network and combined using three pooling mechanisms. Future positions are obtained using linear layers to extract key parameters of a Gaussian distribution.
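The final step in the caption above, turning a linear layer's output into Gaussian parameters and drawing a position, can be sketched as follows. This is an illustration under stated assumptions, not the SNS-LSTM code: it assumes a bivariate Gaussian with five parameters (two means, two standard deviations, one correlation), as is common in LSTM trajectory models, and the names `gaussian_params` and `sample_position` are hypothetical.

```python
import math
import random
from typing import Sequence, Tuple

Params = Tuple[float, float, float, float, float]


def gaussian_params(raw: Sequence[float]) -> Params:
    """Map five unconstrained outputs to (mu_x, mu_y, sigma_x, sigma_y, rho).

    Assumption: exp keeps the standard deviations positive and tanh keeps
    the correlation inside [-1, 1].
    """
    mu_x, mu_y, s_x, s_y, r = raw
    return mu_x, mu_y, math.exp(s_x), math.exp(s_y), math.tanh(r)


def sample_position(params: Params, rng: random.Random) -> Tuple[float, float]:
    """Draw one (x, y) sample from the bivariate Gaussian described by params."""
    mu_x, mu_y, sigma_x, sigma_y, rho = params
    z1 = rng.gauss(0.0, 1.0)
    z2 = rng.gauss(0.0, 1.0)
    x = mu_x + sigma_x * z1
    # Mix z1 into y so that x and y have correlation rho.
    y = mu_y + sigma_y * (rho * z1 + math.sqrt(1.0 - rho * rho) * z2)
    return x, y


if __name__ == "__main__":
    params = gaussian_params([1.0, 2.0, -1.0, -1.0, 0.3])
    print(sample_position(params, random.Random(0)))
```

At test time a model of this kind can either sample from the distribution, as here, or simply take the predicted mean as the next position.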
  • 34. Social and Scene-Aware Trajectory Prediction in Crowded Spaces • Pedestrian dynamics in urban scenarios are highly influenced by static and dynamic factors which guide people towards their destinations. • To forecast realistic paths, it is important to allow human dynamics to be influenced by the surrounding space, not only in terms of other people in their neighborhood, but also considering the semantics of crossed areas as well as past observations. • This framework models each pedestrian as an LSTM network interacting with the surrounding space using three pooling mechanisms, namely Social, Navigation and Semantic pooling. • The Social pooling mechanism takes into account the neighborhood in terms of other people, merging their hidden states. • The Navigation pooling mechanism exploits past observations to discriminate between equally likely predicted positions using previous information about the scene. • Finally, Semantic pooling uses semantic scene segmentation to recognize non-crossable areas.
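The Social pooling bullet (merging neighbors' hidden states) can be sketched as a grid-based sum. The grid layout, the cell size and the summation rule are assumptions borrowed from common social-pooling designs rather than details confirmed by these slides, and `social_pool` is a hypothetical name.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def social_pool(
    target: Point,
    neighbor_pos: Sequence[Point],
    neighbor_hidden: Sequence[Sequence[float]],
    hidden_dim: int,
    grid_size: int = 2,
    cell: float = 1.0,
) -> List[List[List[float]]]:
    """Sum neighbors' hidden states into a grid centered on the target.

    Returns a grid_size x grid_size x hidden_dim nested list. Neighbors
    that fall outside the grid are ignored.
    """
    grid = [[[0.0] * hidden_dim for _ in range(grid_size)] for _ in range(grid_size)]
    half = grid_size * cell / 2.0
    for (nx, ny), hidden in zip(neighbor_pos, neighbor_hidden):
        col = math.floor((nx - target[0] + half) / cell)
        row = math.floor((ny - target[1] + half) / cell)
        if 0 <= row < grid_size and 0 <= col < grid_size:
            grid[row][col] = [a + b for a, b in zip(grid[row][col], hidden)]
    return grid


if __name__ == "__main__":
    pooled = social_pool(
        target=(0.0, 0.0),
        neighbor_pos=[(0.5, 0.5), (-0.5, 0.5), (5.0, 5.0)],
        neighbor_hidden=[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],
        hidden_dim=2,
    )
    print(pooled)
```

Navigation and Semantic pooling could reuse the same grid indexing, reading values from a navigation map or a segmentation map instead of from neighbors' hidden states; that parallel is an inference from the slide text, not a statement of the authors' implementation.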
  • 35. Social and Scene-Aware Trajectory Prediction in Crowded Spaces The Semantic map is generated from the reference image, while the Navigation map is obtained from observed data. The image shows an example of such maps for the ETH dataset. Overview of the pooling mechanisms. Three tensors take into account the social neighborhood, past observations and the semantics of the surrounding space, respectively. The tensors are finally concatenated, processed by ReLU layers and fed to the LSTM networks along with the embedded positions. The figure also highlights the dimensions of each introduced tensor.
  • 36. Social and Scene-Aware Trajectory Prediction in Crowded Spaces Some examples of predicted trajectories for the HOTEL dataset. Ground truths are shown as solid lines, and predicted trajectories as dashed lines. The first column shows cases where the predicted positions are very close to the real paths. The second column shows cases where the SNS-LSTM appears unable to correctly predict future positions.
  • 37. Social and Scene-Aware Trajectory Prediction in Crowded Spaces Temporal sequence visualization for different tracks drawn from both the HOTEL and ETH datasets. The circles represent the ground truth (green), the SNS-LSTM model (blue) and the S-LSTM model (red), respectively. For each row, the first image shows the observed path (in green), which corresponds to 8 frames (the three circles are superimposed), while the remaining ones show the 9th, 13th, 17th and 20th predicted frames, respectively.