Driving Behavior for ADAS and Autonomous Driving VIII

Driving Behaviors for ADAS
and Autonomous Driving VIII
Yu Huang
Yu.huang07@gmail.com
Sunnyvale, California

Outline
• Forecasting Trajectory and Behavior of Road-Agents Using Spectral Clustering in
Graph-LSTMs
• TrafficPredict: Trajectory Prediction for Heterogeneous Traffic-Agents
• Large-Scale extraction of accurate vehicle trajectories for driving behavior
learning
• Learning Traffic Behaviors for Simulation via Extraction of Vehicle Trajectories
from Video Streams
• Learning Vehicle Cooperative Lane-changing Behavior from Observed Trajectories
in the NGSIM Dataset
• Joint Prediction for Kinematic Trajectories in Vehicle-Pedestrian-Mixed Scenes

Forecasting Trajectory and Behavior of Road-Agents
Using Spectral Clustering in Graph-LSTMs
• Code, Video, Datasets at https://gamma.umd.edu/spectralcows/
• An approach for traffic forecasting in urban traffic scenarios using a combination of spectral graph
analysis and deep learning.
• It predicts both the low-level info (future trajectories) as well as the high-level info (road-agent
behavior) from the extracted trajectory of each road-agent.
• This formulation represents the proximity between the road agents using a dynamic weighted
traffic-graph.
• They use a two-stream graph convolutional LSTM network to perform traffic forecasting using
these weighted traffic-graphs.
• The first stream predicts the spatial coordinates of road-agents, while the second stream predicts
whether a road-agent is going to exhibit aggressive, conservative, or normal behavior.
• It introduces spectral cluster regularization to reduce the error margin in long-term prediction
(3-5 seconds) and improve the accuracy of the predicted trajectories.
• In practice, it reduces the average prediction error by more than 54% over prior algorithms and
achieves a weighted average accuracy of 91.2% for behavior prediction.

• Many studies have provided insights into the factors that contribute to different driver behavior
classes, such as aggressive, conservative, or moderate driving.
• These factors can be broadly categorized into four categories.
• The first category of factors that indicate road agent behavior is driver-related, such as age, gender,
blood pressure, personality, occupation, hearing, and so on.
• The second category corresponds to environmental factors such as weather or traffic conditions.
• The third category refers to psychological aspects that affect driving styles, like drunk driving, driving
under the influence, state of fatigue, and so on.
• The fourth category of factors contributing to driving behavior corresponds to vehicular factors such as
positions, acceleration, speed, throttle responses, steering wheel measurements, lane changes, and
brake pressure.

• Let’s represent traffic at each time instance with n road agents using a traffic-graph G, with the
spatial coordinates of the road-agent representing the set of vertices V = {v1, v2, . . . , vn} and a set
of undirected, weighted edges, E.
• Two road-agents are said to be connected through an edge if d(vi , vj ) < µ, where d(vi , vj )
represents the Euclidean distance between the road-agents and µ is a heuristically chosen
threshold parameter (µ = 10 meters, based on size of road agents and the width of the road).
• The overall flow can be described as follows:
• 1. The input consists of the spatial coordinates over the past T seconds, together with the
eigenvectors of the corresponding T traffic-graphs.
• 2. The first stream accepts the extracted spatial coordinates and uses an LSTM-based sequence model
to predict the trajectory of a road agent for the next τ seconds.
• 3. The second stream accepts the eigenvectors of the traffic-graphs and predicts the eigenvectors
corresponding to the traffic-graphs for the next τ seconds. The predicted eigenvectors are used within
the behavior prediction algorithm to assign a behavior label to the road-agent.
• 4. To improve long-term prediction, they propose a regularization algorithm.
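
The traffic-graph construction described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `traffic_graph_spectrum` is hypothetical, and the edge weight exp(-d) is an assumed choice, since the text only states that the edges are weighted.

```python
import numpy as np

def traffic_graph_spectrum(positions, mu=10.0):
    """Build a weighted traffic-graph from (n, 2) positions in meters.

    Returns the adjacency matrix A, the Laplacian L = D - A, and the
    eigenvalues and eigenvectors of L (eigenvalues in ascending order).
    """
    positions = np.asarray(positions, dtype=float)
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Two agents share an edge when their Euclidean distance is below mu.
    connected = (dist < mu) & ~np.eye(len(positions), dtype=bool)
    # Assumed weighting (not from the source): closer agents weigh more.
    A = np.where(connected, np.exp(-dist), 0.0)
    L = np.diag(A.sum(axis=1)) - A
    eigvals, eigvecs = np.linalg.eigh(L)
    return A, L, eigvals, eigvecs

# Three agents: two are 5 m apart, the third is far from both.
A, L, eigvals, eigvecs = traffic_graph_spectrum([[0, 0], [3, 4], [50, 50]])
```

Computing one such spectrum per time step gives the sequence of eigenvectors that the second stream consumes.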

Network Architecture: It shows the trajectory and behavior prediction for the i-th road-agent (red circle).
The input to the first stream consists of the spatial coordinates and the eigenvectors (green rectangles
and shade of green) of the traffic-graphs. It performs spectral clustering on the predicted eigenvectors
from the second stream (orange block) to regularize the loss function and perform backpropagation on
the new loss function to improve long-term prediction.

RMSE Curves: The plot shows the logarithm of the RMSE values for visualization purposes; lower values indicate
better performance. The prediction window is 5 seconds for the Lyft and Apolloscape datasets and 3 seconds
for the Argoverse dataset, which corresponds to frame lengths of 30, 10, and 30, respectively.

Behavior Prediction: It classifies the 3 behaviors, overspeeding (blue), neutral (green), and braking (red), for
all road-agents in one traffic video from the Lyft, Argoverse, and Apolloscape datasets, respectively.

TrafficPredict: Trajectory Prediction for
Heterogeneous Traffic-Agents
• To safely and efficiently navigate in complex urban traffic, autonomous vehicles must make responsible
predictions in relation to surrounding traffic-agents (vehicles, bicycles, pedestrians, etc.).
• A challenging task is to explore the movement patterns of different traffic-agents and predict their
future trajectories accurately to help the autonomous vehicle make reasonable navigation decisions.
• To solve this problem, a long short-term memory-based (LSTM-based) real-time traffic prediction
algorithm, TrafficPredict, is proposed.
• This approach uses an instance layer to learn instances’ movements and interactions and has a
category layer to learn the similarities of instances belonging to the same type to refine the prediction.
• In order to evaluate its performance, they collected trajectory datasets in a large city consisting of
varying conditions and traffic densities.
• The dataset includes many challenging scenarios where vehicles, bicycles, and pedestrians move
among one another.

• In urban traffic scenarios where various traffic-agents are interacting with others, each instance
has its own state in relation to the interaction with others at any time.
• Considering traffic-agents as instance nodes and relationships as edges, it can construct a graph in
the instance layer.
• The edge between two instance nodes in one frame is called spatial edge, which can transfer the
interaction information between two traffic-agents in spatial space.
• The edge between the same instance in adjacent frames is the temporal edge, which is able to
pass the historic information frame by frame in temporal space.
• All instances of the same type are integrated into one group and each group has an edge oriented
toward the corresponding super node.
• After summarizing the motion similarities, the super node passes the guidance through an
oriented edge to the group of instances.
• This category layer is specially designed for heterogeneous traffic and can make full use of the
data to extract valuable information to improve the prediction results.
• This layer is flexible and degenerates easily to handle situations in which several categories
are absent from some frames.
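
The edge types described above can be illustrated with a small bookkeeping sketch. The function `build_graph_edges` and its input layout are hypothetical, and connecting every pair of instances within a frame is an assumption; the source does not say which pairs receive a spatial edge.

```python
from collections import defaultdict

def build_graph_edges(frames):
    """frames: list of dicts mapping instance id -> category, one per frame."""
    spatial, temporal, to_super = [], [], defaultdict(list)
    for t, frame in enumerate(frames):
        ids = sorted(frame)
        # Spatial edges: pairs of instances observed in the same frame
        # (assumed fully connected here).
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                spatial.append((t, a, b))
        # Temporal edges: the same instance in adjacent frames.
        if t > 0:
            for a in ids:
                if a in frames[t - 1]:
                    temporal.append((a, t - 1, t))
        # Category layer: instances of one type feed one super node.
        for a in ids:
            to_super[(t, frame[a])].append(a)
    return spatial, temporal, dict(to_super)

frames = [
    {1: "vehicle", 2: "pedestrian"},
    {1: "vehicle", 2: "pedestrian", 3: "bicycle"},
]
spatial, temporal, to_super = build_graph_edges(frames)
```

In this example no bicycle appears in the first frame, so no bicycle super node is created for it, which mirrors the case where a category is absent from some frames.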

4D Graph for a traffic sequence. (a) Icons for instances and categories are shown on the left table. (b) The
instance layer of the 4D Graph with spatial edges as solid lines and temporal edges as dashed lines. (c) The
category layer with temporal edges of super nodes drawn by dashed lines.

• This yields the 4D Graph for a traffic sequence, with two dimensions for traffic-agents and their
interactions, one dimension for time series, and one dimension for high-level categories.
• The instance layer aims to capture the movement pattern of instances in traffic.
• Because different kinds of traffic-agents have different dynamic properties and motion rules, only
instances of the same type share the same parameters.
• There are three types of traffic-agents in the dataset: vehicles, bicycles, and pedestrians. So there
are three different LSTMs for instance nodes.
• Usually traffic-agents of the same category have similar dynamic properties, including speed,
acceleration, steering, etc., and similar reactions to other traffic-agents or the whole environment.
• By learning the movement patterns shared by instances of the same category, the model can better
predict trajectories for all instances.
• There are four important components: the super node for a specified category, the directed edge
from a group of instances to the super node, the directed edge from the super node to instances,
and the temporal edges for super nodes.

Architecture of the network for one super node
in the category layer
Assume there are n instances belonging to the same category in the current frame. The hidden state h1
and the cell state c have already been obtained from the instance LSTM and serve as the input for the
category layer. Because the cell state c contains the historical trajectory information of the instance,
a self-attention mechanism is applied to c through a softmax operation to explore the pattern of the
internal sequence.
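
A rough sketch of the softmax-based self-attention on the cell state follows. The source only says that softmax is applied to c; the element-wise reweighting, the averaging over the n instances, and the concatenation with the mean hidden state are assumptions made for illustration, and `category_feature` is a hypothetical name.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum for numerical stability.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def category_feature(h, c):
    """h, c: (n, d) hidden and cell states of n instances of one category."""
    # Self-attention: softmax over each instance's cell state reweights c.
    attended = softmax(c, axis=1) * c
    # Assumed aggregation: average over instances, then concatenate the
    # mean hidden state with the mean attended cell state.
    return np.concatenate([h.mean(axis=0), attended.mean(axis=0)])

h = np.ones((3, 4))
c = np.zeros((3, 4))
feature = category_feature(h, c)
```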

There are six scenarios with different road conditions and traffic situations. Only the trajectories of
several instances are shown in each scenario. The ground truth (GT) is drawn in green, and the predictions of
the other methods (ED, SL, SA) are shown with different dashed lines. The trajectories predicted by the
TrafficPredict (TP) algorithm (pink lines) are the closest to the ground truth in most cases.
Methods compared: RNN ED (ED), Social LSTM (SL), Social Attention (SA).

Large-Scale extraction of accurate vehicle
trajectories for driving behavior learning
• Urban environments are still a challenge for Autonomous Vehicles, due to strong interactions with
other vehicles and pedestrians.
• Machine learning methods are increasingly explored to tackle these situations, but their
performance depends heavily on the availability of vehicle trajectory datasets.
• As a result, only a few datasets of vehicle trajectories are currently available, representing very
specific situations such as highway driving, and containing a limited number of trajectories.
• To unleash the potential of behavior learning methods for autonomous vehicles, large datasets of
accurate vehicle trajectories are needed, covering interacting vehicles in very diverse situations.
• This paper introduces a fully automatic and scalable framework for accurate vehicle trajectory
extraction from single fixed monocular traffic cameras.
• It leverages the fact that traffic cameras represent a very large and cost-effective source of highly
diverse vehicle trajectories, as they are generally located at places where traffic is dense and
where a lot of interactions occur (e.g., intersections).
• It aims at developing a framework for creating accurate vehicle trajectory datasets at large scale.
• Open-source at https://gitlab.com/AubreyC/trajectory-extractor

• 1. Vehicles are detected frame by frame using a CNN object detector, and the ground position is
estimated for each detected vehicle.
• 2. Detections are grouped into tracks with an Intersection-over-Union (IoU) method.
• 3. Tracks are smoothed with a Rauch-Tung-Striebel smoother, which also estimates the ground
location, velocity, and heading of the vehicles.
Trajectory extraction framework architecture
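
Step 2 of the framework groups detections into tracks by IoU. The sketch below shows one simple way to do this, a greedy frame-to-frame association; the threshold value, the greedy matching, and the function names are assumptions and are not taken from the trajectory-extractor repository.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def track(frames, threshold=0.3):
    """Greedy association. frames: list of per-frame lists of boxes."""
    tracks = []  # each track is a list of (frame_index, box)
    for t, boxes in enumerate(frames):
        unmatched = list(boxes)
        for tr in tracks:
            last_t, last_box = tr[-1]
            # Only extend tracks that were seen in the previous frame.
            if last_t != t - 1 or not unmatched:
                continue
            best = max(unmatched, key=lambda b: iou(last_box, b))
            if iou(last_box, best) >= threshold:
                tr.append((t, best))
                unmatched.remove(best)
        # Detections that matched no track start new tracks.
        tracks.extend([[(t, b)] for b in unmatched])
    return tracks

frames = [
    [(0, 0, 10, 10), (100, 100, 110, 110)],
    [(1, 0, 11, 10), (101, 100, 111, 110)],
    [(2, 0, 12, 10)],
]
tracks = track(frames)
```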

• The camera model makes it possible to compute the projection of 3D boxes onto the image plane.
• For each class of vehicles (e.g., car, truck, bus) provided by the object detector, general 3D box
parameters (length, width and height) are predefined.
• Assuming the vehicles are on the ground, their x, y position (defined as the geometric center of
the vehicle on the ground) and orientation ψ can be estimated by maximizing the overlap between
the projection of the 3D box on the image plane and the 2D instance mask provided by Mask-RCNN.
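
The projection of a 3D box onto the image plane can be sketched with a pinhole camera model. The camera matrix below (a camera 10 m above the ground looking straight down) and the box dimensions are made-up values for illustration; the overlap maximization against the instance mask is not shown.

```python
import numpy as np

def box_corners(x, y, psi, length, width, height):
    """8 corners of a box resting on the ground, centered at (x, y)."""
    dx, dy = length / 2.0, width / 2.0
    local = np.array([[sx * dx, sy * dy, z]
                      for z in (0.0, height)
                      for sx in (-1, 1) for sy in (-1, 1)])
    c, s = np.cos(psi), np.sin(psi)
    # Rotation about the vertical axis by the heading psi.
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return local @ R.T + np.array([x, y, 0.0])

def project(points, P):
    """Project (n, 3) world points to pixels with a 3x4 camera matrix P."""
    homog = np.hstack([points, np.ones((len(points), 1))])
    uvw = homog @ P.T
    return uvw[:, :2] / uvw[:, 2:3]

# Hypothetical camera: focal length 100 px, principal point (320, 240).
K = np.array([[100.0, 0.0, 320.0], [0.0, 100.0, 240.0], [0.0, 0.0, 1.0]])
R_cam = np.diag([1.0, -1.0, -1.0])
t_cam = np.array([[0.0], [0.0], [10.0]])
P = K @ np.hstack([R_cam, t_cam])

corners = box_corners(2.0, 1.0, np.pi / 6, length=4.5, width=1.8, height=1.5)
pixels = project(corners, P)
```

A search over (x, y, ψ) would then compare the polygon spanned by `pixels` with the detector's mask and keep the pose with the largest overlap.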

They use the realistic simulation environment CARLA to generate traffic videos with ground-truth vehicle
information. The evaluation pipeline is as follows: first, generate traffic videos and save the ground-truth
vehicle information; then, apply the framework to the generated raw videos to extract vehicle trajectories;
finally, evaluate the accuracy of the extracted trajectories by comparing them to the ground truth.
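
The final comparison step can be illustrated with a position error computation. The source does not name the metric, so the use of RMSE here is an assumption, as is the premise that extracted and ground-truth positions are already matched by time step.

```python
import numpy as np

def position_rmse(extracted, ground_truth):
    """RMSE of the Euclidean position error over matched time steps."""
    extracted = np.asarray(extracted, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    err = np.linalg.norm(extracted - ground_truth, axis=1)
    return float(np.sqrt(np.mean(err ** 2)))

rmse = position_rmse([[0.0, 0.0], [1.0, 1.0]], [[3.0, 4.0], [1.0, 1.0]])
```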

Learning Traffic Behaviors for Simulation via Extraction of
Vehicle Trajectories from Online Video Streams
• To collect extensive data on realistic driving behavior for use in simulation, it proposes a
framework that uses online public traffic cam video streams to extract data of driving
behavior.
• To tackle challenges like frame-skip, perspective, and low resolution, they implement a
Traffic Camera Pipeline (TCP).
• TCP leverages recent advances in deep learning for object detection to extract
trajectories from the video stream to corresponding locations in a bird’s eye view traffic
simulator.
• After collecting 2618 vehicle trajectories, it compares models learned from the extracted
data with those from a simulator and finds that a held-out set of trajectories is more likely
to occur under the learned models at two levels of traffic behavior: high-level behaviors
describing where vehicles enter and exit the intersection, as well as the specific
sequences of points traversed.
• The learned models can be used to generate and simulate more plausible driving
behaviors.

Example trajectories extracted by TCP: top left
image is an illustration of trajectories overlaid
onto camera perspective; top right image shows
the trajectories in bird’s eye view in FLUIDS.
Bottom row compares three groups of left turn
trajectories of vehicles coming from the top of the
scene. A real-world held-out set consists of
trajectories that real drivers took, but are not
used in any training. The middle figure shows
trajectories generated by the RRT* algorithm. The
right figure shows TCP-generated trajectories
sampled from a model trained on collected
driving data. It is observed that the TCP-generated
trajectories better approximate the held-out set.
FLUIDS is an open-source, lightweight, Python-based traffic intersection simulator.

TCP system architecture (excluding learning and analysis). First, capture a video stream of a traffic intersection
and use SSD, a deep object detection network, to identify and label vehicles. Then, manually label the first
detection of each vehicle in the video stream. Finally, map the identified vehicles to a bird’s eye view using
homography and run a probabilistic grouping algorithm to extract trajectories.
Homography works by estimating a projective matrix that morphs pixel locations from a source domain into a
target domain. The target domain, in this case, is the simulator, and the source is the traffic camera view.

TCP captures a four-way
intersection in Canmore,
Alberta at different times of
day. It features a variety of
lighting conditions, weather,
and road conditions. For the
following experiments, only a
small subset of the daytime
videos was labeled.

It simulates vehicles at a four-way intersection by specifying
traffic behaviors in two steps. (Left) First, choose a starting lane
for a vehicle (the west lane in the figure), and an ending lane
(north lane). (Right) After the starting and end lanes are
chosen, specify a trajectory consisting of a sequence of points
for the vehicle to traverse.
High-level behaviors describe where a vehicle begins
and ends at the four-way intersection. They use two
types of distributions to capture these behaviors:
distributions over the starting lanes of the vehicles,
and distributions over the actions taken (left, right,
forward, or stopped) by vehicles given the starting
location. The high-level behaviors are given by
multinomial discrete probability distributions over a
set S containing k elements.
An agent’s motion at the traffic intersection can be
specified by a sequence of Cartesian coordinates in
the bird’s eye view perspective. Using trajectories
collected by TCP, they learn a data-driven trajectory
generator model. It partitions the collected
trajectories from TCP into 12 sets: one for each
combination of starting lane and the action (left,
forward, or right).
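
The two high-level distributions can be illustrated by fitting them from observed (starting lane, action) pairs and sampling a behavior. Using relative frequencies as the estimate is an assumption, and the function names and sample data are hypothetical.

```python
import random
from collections import Counter, defaultdict

def fit_multinomial(observations):
    """Relative-frequency estimate of a discrete distribution."""
    counts = Counter(observations)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def fit_high_level(trips):
    """trips: list of (start_lane, action) pairs."""
    start_dist = fit_multinomial([lane for lane, _ in trips])
    by_lane = defaultdict(list)
    for lane, action in trips:
        by_lane[lane].append(action)
    # One action distribution per starting lane.
    action_dist = {lane: fit_multinomial(a) for lane, a in by_lane.items()}
    return start_dist, action_dist

def sample_behavior(start_dist, action_dist, rng):
    """Draw a starting lane, then an action conditioned on that lane."""
    lane = rng.choices(list(start_dist), weights=list(start_dist.values()))[0]
    actions = action_dist[lane]
    action = rng.choices(list(actions), weights=list(actions.values()))[0]
    return lane, action

trips = [("west", "left"), ("west", "forward"),
         ("north", "right"), ("west", "left")]
start_dist, action_dist = fit_high_level(trips)
behavior = sample_behavior(start_dist, action_dist, random.Random(0))
```

Grouping the extracted trajectories by the same (starting lane, action) key gives the per-combination sets used to train the trajectory generator.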

Google Map view of the intersection in TCP. Left image shows the street names of the intersection.
Right image shows the surrounding area of the intersection, and the dropped pin shows the location
of the intersection.

Examples of held-out trajectories, trajectories
sampled from the baseline trajectory generator, and
trajectories sampled from the learned TCP generative
model. It shows five examples each for three
primitive behaviors: left turn from the north, right
turn from the north, and left turn from the west. It is
seen that the real-world held-out trajectories exhibit
greater variance in paths, and the learned generator
better matches this behavior. However, the difference
is not as apparent in the bottom row.
The dataset link:
https://berkeleyautomation.github.io/Traffic_Camera_Pipeline/

Learning Vehicle Cooperative Lane-changing Behavior
from Observed Trajectories in the NGSIM Dataset
• Lane changing has been regarded as one of the major factors causing traffic accidents.
• Lane-changing intention prediction has long been a hot topic in autonomous driving scenarios.
• As autonomous vehicles drive on highways, it is necessary for them to predict other vehicles’
lane-changing intention to prevent potential collisions.
• However, none of the existing literature has taken both the vehicle’s trajectory history and
neighbor information into consideration when making the predictions.
• There has been a lot of work attempting to model drivers’ lane-changing behaviors, which can be
divided into two types: rule-based algorithms and machine-learning-based algorithms.
• Here they propose a socially-aware LSTM algorithm for real-world scenarios to solve this intention
prediction problem, taking advantage of both the vehicles’ past trajectories and their neighbors’
current states.
• These two components can lead not only to higher accuracy, but also to lower lane-changing
prediction time, which plays an important role in potentially improving the autonomous vehicle’s
overall performance.

• A human-driven vehicle’s lane-changing intention can be based on various factors, including the
vehicle’s own properties such as heading angle and acceleration, as well as its relationship to
neighboring vehicles, such as its distance from the front vehicle.
• The open source Federal Highway Administration’s Next Generation Simulation (NGSIM) data set
was picked to extract vehicle trajectories and build the lane-changing prediction model.
First, all of the lane-changing points, i.e., the points where the vehicle crossed the dashed
line dividing the lanes, were gathered for each vehicle. If a vehicle was on a lane-changing
point at time step t, its trajectory in [t-δt, t+δt] (δt = 2 s) was checked and its heading
orientation θ during that time period was calculated. The starting point and ending point of
the lane-changing trajectory were then marked where θ reached a bounding value θ_bound:
|θ| = θ_bound.

Shifting method for extracting the input features and the output lane-changing intention for one vehicle.
n consecutive time steps were packed into one trajectory piece. If the n-th time step of a trajectory piece
was a lane-changing time step, the piece was labeled as a lane-changing piece (Piece 1 and Piece 2, marked
in blue); otherwise it was labeled as a car-following piece (Piece 3, marked in pink). The first time step
of each collected piece was shifted one step at a time to make the most use of the data. In this paper,
n is set to 6, 9, and 12 to determine the impact of the length of the trajectory history.
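
The shifting scheme amounts to a sliding window whose label is taken from its last time step. A minimal sketch, with hypothetical names and toy data:

```python
def make_pieces(features, is_lane_change, n):
    """Return (piece, label) for every window of n consecutive time steps."""
    pieces = []
    for start in range(len(features) - n + 1):
        end = start + n
        # The label depends only on the n-th (last) step of the piece.
        if is_lane_change[end - 1]:
            label = "lane_change"
        else:
            label = "car_following"
        pieces.append((features[start:end], label))
    return pieces

features = list(range(8))  # stand-in for per-step feature vectors
flags = [False, False, False, False, False, True, True, False]
pieces = make_pieces(features, flags, n=6)
```

With 8 time steps and n = 6 this yields three overlapping pieces, the first two labeled as lane-changing and the third as car-following.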

Each vehicle’s lane-changing intention was then predicted at each time step, given its
previous 11-time-step trajectory history and neighbor information in the test set. The
lane-changing prediction time was also calculated after filtering the results. Specifically,
a lane-changing prediction point is established if a vehicle is predicted to make a lane
change for 3 consecutive time steps, and the lane-changing prediction time is defined as
the time gap between the lane-changing point and the lane-changing prediction point.
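
The filtering rule can be sketched as a scan for the first run of 3 consecutive positive predictions. Taking the last step of that run as the prediction point and passing the step duration `dt` as a parameter are assumptions; the source does not specify either.

```python
def prediction_time(predictions, lane_change_index, dt, k=3):
    """Lead time in seconds, or None if no stable prediction occurs.

    predictions: per-step booleans (True = lane change predicted).
    lane_change_index: step at which the vehicle crosses the lane line.
    """
    run = 0
    for i, p in enumerate(predictions[:lane_change_index + 1]):
        run = run + 1 if p else 0
        if run == k:
            # Assumed: the prediction point is the k-th step of the run.
            return (lane_change_index - i) * dt
    return None

lead = prediction_time([0, 1, 0, 1, 1, 1, 1, 1], lane_change_index=7, dt=0.5)
```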

• The input features for each vehicle at each time step:
• a) the vehicle’s own information
• 1. vehicle acceleration
• 2. vehicle steering angle with respect to the road
• 3. the global lateral vehicle position with respect to the lane
• 4. the global longitudinal vehicle position with respect to the lane
• b) the vehicle’s neighbor information
• 1. the existence of a left lane (1 if it exists, 0 if not)
• 2. the existence of a right lane (1 if it exists, 0 if not)
• 3. the longitudinal distance between ego vehicle and left-front vehicle
• 4. the longitudinal distance between ego vehicle and front vehicle
• 5. the longitudinal distance between ego vehicle and right-front vehicle
• 6. the longitudinal distance between ego vehicle and left-rear vehicle
• 7. the longitudinal distance between ego vehicle and rear vehicle
• 8. the longitudinal distance between ego vehicle and right-rear vehicle

LSTM network structure for lane-changing intention prediction

Joint Prediction for Kinematic Trajectories in
Vehicle-Pedestrian-Mixed Scenes
• Trajectory prediction for objects is challenging and critical for various applications (e.g.,
autonomous driving, and anomaly detection).
• Most of the existing methods focus on homogeneous pedestrian trajectories prediction, where
pedestrians are treated as particles without size.
• However, they fall short of handling crowded vehicle-pedestrian-mixed scenes directly, since
vehicles, constrained by kinematics in reality, should ideally be treated as rigid, non-particle objects.
• This paper tackles this problem using separate LSTMs for heterogeneous vehicles and
pedestrians.
• Specifically, they use an oriented bounding box to represent each vehicle, calculated based on its
position and orientation, to denote its kinematic trajectories.
• It proposes a framework called VP-LSTM (Vehicle-Pedestrian) to predict the kinematic trajectories
of both vehicles and pedestrians simultaneously.
• In order to evaluate the model, a large dataset containing the trajectories of both vehicles and
pedestrians in vehicle-pedestrian-mixed scenes is specially built.

Illustration of various interactions in a vehicle-pedestrian-mixed scene. The
vehicle-vehicle, human-human, and vehicle-human interactions are represented
with solid blue lines, solid red lines, and orange dashed lines, respectively. The
vehicle a and pedestrian b in the gray dashed box have similar interactions with
surrounding pedestrians. b walks freely to avoid collisions with d. However,
the vehicle a, constrained by kinematics, stops to avoid collisions with c.

• In order to jointly predict the trajectories of both vehicles and pedestrians, they feed the
kinematic trajectory sequences of both pedestrians (x_i^t) and vehicles (P_j^t) in an observation
period from step t = 1 to t = T_obs as the input.
• Then, the positions of pedestrians, and both the positions (y_j^t) and orientations (a_j^t) of
vehicles in the prediction period from step t = 1 to t = T_pred, can be predicted simultaneously.
Illustration of mixed social pooling.
It applies mixed social pooling to collect the latent motion
representations of vehicles and pedestrians in the
neighborhood and uses a grid of No ×No cells, called
occupancy map, which is centered at the position of a
pedestrian or vehicle. The positions of all the neighbors,
including pedestrians and vehicles, are pooled on the
occupancy map. Through the occupancy map, pedestrians
and vehicles share latent representations with hidden
states. The occupancy maps VO and PO are built
for vehicles and pedestrians, respectively.
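
The pooling step can be sketched as summing the hidden states of neighbors into the cells of an No × No grid centered on the agent. The neighborhood side length, the summation, and the function name are assumptions in the style of Social LSTM pooling, not details confirmed by the source.

```python
import numpy as np

def occupancy_pool(center, neighbors, hidden, n_o=4, size=8.0):
    """Pool neighbors' hidden states into an n_o x n_o occupancy map.

    center: (2,) position of the pedestrian or vehicle being updated.
    neighbors: (m, 2) positions of nearby pedestrians and vehicles.
    hidden: (m, d) hidden states of those neighbors.
    size: assumed side length of the pooled neighborhood.
    """
    neighbors = np.asarray(neighbors, dtype=float)
    hidden = np.asarray(hidden, dtype=float)
    grid = np.zeros((n_o, n_o, hidden.shape[1]))
    cell = size / n_o
    # Shift so that the grid's lower-left corner is the origin.
    offsets = neighbors - np.asarray(center, dtype=float) + size / 2.0
    idx = np.floor(offsets / cell).astype(int)
    for (ix, iy), h in zip(idx, hidden):
        # Neighbors outside the grid are ignored.
        if 0 <= ix < n_o and 0 <= iy < n_o:
            grid[ix, iy] += h
    return grid

grid = occupancy_pool(
    center=[0.0, 0.0],
    neighbors=[[1.0, 1.0], [-3.0, -3.0], [10.0, 10.0]],
    hidden=[[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]],
)
```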

• VP-LSTM estimates d-variate conditional distributions for pedestrians and vehicles, respectively.
• For pedestrians, they create a bivariate Gaussian distribution (d = 2) to predict the position.
• Different from pedestrians, they use a four dimensional Gaussian multivariate distribution (d=4)
to predict the position and orientation of vehicles.
• The trajectory prediction is a multi-modal problem by nature, where each sampling produces one
of multiple possible future trajectories.
• The predicted kinematic trajectories of pedestrians and vehicles at time t are sampled from these
bivariate and four-dimensional Gaussian distributions, respectively.
• Apart from the variety loss function, kv acceptable kinematic trajectories for a vehicle can be
obtained by randomly sampling from its distribution, and kp possible trajectories for a
pedestrian can be generated in the same way.
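
Sampling several candidates and scoring only the best one can be sketched as below. The means and covariances are invented for illustration (in the model they come from the network outputs), and the variety loss is shown for a single time step as the minimum L2 distance over the k samples, following its usual definition in Social GAN.

```python
import numpy as np

def sample_candidates(mean, cov, k, rng):
    """Draw k candidate states from a d-variate Gaussian."""
    return rng.multivariate_normal(mean, cov, size=k)

def variety_loss(samples, target):
    """Minimum L2 distance between the k samples and the ground truth."""
    diffs = np.asarray(samples, dtype=float) - np.asarray(target, dtype=float)
    return float(np.linalg.norm(diffs, axis=-1).min())

rng = np.random.default_rng(0)
# Pedestrian: bivariate Gaussian over position (d = 2).
ped = sample_candidates([1.0, 2.0], [[0.1, 0.0], [0.0, 0.1]], k=5, rng=rng)
# Vehicle: four-dimensional Gaussian over position and orientation (d = 4).
veh = sample_candidates(np.zeros(4), np.eye(4) * 0.05, k=5, rng=rng)
loss = variety_loss(ped, [1.0, 2.0])
```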

• It compared VP-LSTM with state-of-the-art human trajectory prediction methods, including Vanilla
LSTM (VLSTM), Social LSTM (S-LSTM), and two Social GAN variants (SGAN-PV and SGAN-PV-20).
• Its methods are referred to as VPLSTM-OP-N, where O denotes that the vehicles in scenes are
treated as OBBs (oriented bounding boxes), whose orientations and positions are predicted jointly,
and P signifies that mixed social pooling is adopted in the model.
• The original video data was acquired with a drone from a top-down view.
• It chose two traffic scenarios, where large numbers of heterogeneous vehicles and pedestrians pass
through under different traffic densities.
• The trajectories in the two scenarios (called BJI and TJI, respectively) are carefully annotated,
including 6405 pedestrians and 6478 vehicles.
