Human Action Recognition Using
Attention Based Spatiotemporal Graph
Convolutional Network
Under the guidance of:
Prof. Anil Singh Parihar
Dept. of Computer Science &
Engineering
Submitted By:
Anshula Sharma
2K21/CSE/07
What is Human Action Recognition?
• Human Action Recognition (HAR) is concerned with predicting or classifying the
actions being performed by a human being.
• It is an active area of computer vision research.
• Different data modalities used for HAR:
1. Optical Flow
2. RGB Images
3. Body Skeletons
• Deep Learning techniques are used to predict and recognize human actions.
Problem Statement
• The major goal is to build and improve a model that can recognize and interpret human actions
using skeletal data.
• Traditional techniques frequently depend on RGB video data, which can be affected by lighting and
occlusions. By representing actions through the spatial configuration of body joints, skeleton-based
action recognition provides a more robust and efficient alternative.
• However, skeleton-based action identification faces a number of obstacles:
• It is still difficult to extract relevant information from skeletal data and properly capture the
spatiotemporal dynamics of human motions.
• Existing approaches usually process every skeleton in the entire sequence that represents
the action performed. This strategy is inefficient in terms of computation time and memory
utilization.
• We propose an attention-based spatiotemporal graph convolutional network to overcome these
challenges.
Recognition using Skeleton based data
• Body skeletons are increasingly used for
human action recognition due to their
compact and action-focused nature.
• Skeletons are three-dimensional or two-
dimensional coordinate representations of
human body joints.
• Skeletons are naturally represented as
graphs, where the nodes represent the
skeleton's joints and the edges indicate the
connections between body joints.
• Actions can be identified from the different
motion patterns of the joints of the skeletal
body.
Graph Convolutional Networks
• Graph Convolutional Networks (GCNs) have become quite prominent in the field of skeleton-based
action recognition.
• Using deep feed-forward architectures, graph convolutional networks successfully capture the
spatiotemporal characteristics inherent in human skeletons.
• GCNs are a variant of Convolutional Neural Networks (CNNs) that generalize convolution to
graph-structured data.
• GCNs operate, in a manner similar to CNNs, by inspecting the neighboring nodes.
• The input is non-Euclidean, graph-structured data, with each node having a varying number of
connections.
• Nodes and their connections (edges) with other nodes are represented with the help of an
adjacency matrix, which is then introduced to the forward propagation equation.
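This forward propagation rule can be written out explicitly, following Kipf and Welling's formulation (listed in the references); here \(\hat{A}\) is the adjacency matrix with added self-loops and \(\hat{D}\) its degree matrix:

```latex
% Layer-wise propagation rule of a GCN (Kipf & Welling, 2017)
H^{(l+1)} = \sigma\left(\hat{D}^{-1/2}\,\hat{A}\,\hat{D}^{-1/2}\,H^{(l)}\,W^{(l)}\right),
\qquad \hat{A} = A + I
```

where \(H^{(0)}\) is the matrix of input node features and \(W^{(l)}\) is the layer's weight matrix.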
Graph Convolutional Network
• The initial input to the GCN is the graph's node
features along with the adjacency matrix. It
records local connection patterns as well as
information from surrounding nodes.
• The aggregated node features are then
multiplied by a learnable weight matrix to
compute the transformed features for each
node.
• Finally, a non-linear activation function is
applied to obtain updated node
representations, which then serve as the
next layer’s input, along with the adjacency
matrix.
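The three steps above (neighborhood aggregation, weight multiplication, non-linearity) can be sketched in a few lines of NumPy; the simple row normalization and ReLU used here are illustrative assumptions, not necessarily the exact choices in the model:

```python
import numpy as np

def gcn_layer(X, A, W):
    """One GCN layer: aggregate neighbor features, apply the weight
    matrix, then a ReLU non-linearity."""
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    D_inv = np.diag(1.0 / A_hat.sum(axis=1))  # inverse degree matrix
    H = D_inv @ A_hat @ X                     # aggregate neighborhood features
    return np.maximum(H @ W, 0.0)             # weight matrix + ReLU

# Toy graph: 3 nodes in a chain, 2 features per node.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
X = np.ones((3, 2))
W = np.eye(2)
H = gcn_layer(X, A, W)
print(H.shape)  # (3, 2)
```

In a trained network the weight matrix W would of course be learned by backpropagation rather than fixed to the identity.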
Dataset: NTU-RGB+D
• The model is evaluated on a large-scale indoor action recognition dataset, NTU-RGB+D.
• It is one of the most comprehensive datasets with 3D annotations for human action recognition tasks.
• The video clips have been acquired by Microsoft Kinect v2 sensors.
• It contains 60 action classes and 56,880 action samples, with the actions carried out by 40
distinct subjects.
• The action classes cover a wide variety of daily actions, including walking, waving, clapping, sitting down,
getting up, and playing musical instruments. It consists of both individual actions and interactions
between two subjects.
• The annotations that are obtained by the Kinect depth sensors offer 3D joint positions (X, Y, Z).
• Each subject in the skeleton sequences is represented by 25 joints.
• The dataset consists of two evaluation benchmarks:
1. Cross-view (X-view): The videos are captured by 3 different cameras. The training set comprises 37,820 video clips and the
test set consists of 18,960 video clips.
2. Cross-subject (X-sub): The training and test sets contain different subjects. The training set has 40,320 videos and the test
set contains 16,560 videos.
Proposed Approach
• The proposed approach is an attention-based model for human action recognition that uses both
temporal and spatial attention modules to improve recognition accuracy.
• The temporal attention module selects the most informative frames from a sequence of skeletons,
capturing the action's critical temporal dynamics.
• Following that, the spatial attention mechanism highlights the most significant joints within the
selected frames, emphasizing their distinguishing characteristics.
• The computed attention scores are then used to select frames, allowing the identification of the
skeletons with the highest attention values.
• Both temporal and spatial relationships are effectively utilized in the skeletal data by incorporating
the attention modules into a graph convolutional network.
• Together, the temporal and spatial attention mechanisms improve the efficiency of skeleton-based
human action recognition, resulting in more accurate and robust recognition.
Spatial Graph Convolution Module
• The spatial graph convolution block focuses on capturing spatial relationships.
• It operates by considering the nodes in the network as skeletal joints and the edges as connections that
describe interactions between these joints.
• The block aggregates information from nearby nodes in order to capture local interactions and
dependencies.
• The block accepts a tensor of shape (N, Cin, T, V) as input, where N denotes the batch size, Cin
represents the number of input channels, T represents the sequence length, and V represents the
number of joints.
• Along with the input tensor, an adjacency matrix is used which depicts pairwise interactions between
the joints.
• The block then applies a graph convolution operation, which updates the features of each joint by
aggregating information from its neighboring joints.
• A non-linear transformation follows the graph convolution to incorporate non-linearity
and improve the discriminative capability of the features.
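A minimal NumPy sketch of a spatial graph convolution over a (N, Cin, T, V) tensor; the normalization scheme and the toy fully connected adjacency are assumptions made for illustration, not the model's exact configuration:

```python
import numpy as np

def spatial_gcn(x, A, W):
    """Spatial graph convolution on a skeleton tensor.
    x: (N, Cin, T, V) joint features, A: (V, V) skeleton adjacency,
    W: (Cin, Cout) weight matrix shared across joints and frames."""
    A_hat = A + np.eye(A.shape[0])                    # self-loops
    A_hat = A_hat / A_hat.sum(axis=0, keepdims=True)  # normalize aggregation
    x = np.einsum('nctv,vw->nctw', x, A_hat)          # aggregate neighbor joints
    x = np.einsum('nctv,cd->ndtv', x, W)              # mix channels
    return np.maximum(x, 0.0)                         # ReLU

x = np.random.rand(2, 3, 4, 5)    # N=2, Cin=3, T=4, V=5
A = np.ones((5, 5)) - np.eye(5)   # toy fully connected skeleton
W = np.random.rand(3, 8)          # Cin=3 -> Cout=8
y = spatial_gcn(x, A, W)
print(y.shape)  # (2, 8, 4, 5)
```

Note that the same (V, V) adjacency and (Cin, Cout) weights are applied at every frame, which is what makes the operation purely spatial.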
Temporal Graph Convolution Module
• The temporal convolution block captures the temporal dynamics of the skeletal data.
• The block takes as input the output of the spatial graph convolution block.
• For temporal graph convolution, the neighborhood of each vertex is extended to include each node's
temporal neighbors.
• Each node is linked to the same joint in the preceding and following skeleton frames, resulting in a
temporal neighborhood of size 2 for each node.
• A 2D convolution is performed to process the extended neighborhood, with a kernel size that
determines the operation's temporal receptive field.
• The temporal convolution block aggregates the features of each individual body joint at various time
steps by using a fixed kernel size, allowing the model to capture the temporal dynamics and motion
patterns contained in the skeletal data.
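As an illustration, the sketch below replaces the learned 2D convolution with a simple moving average over the time axis; a kernel size of 3 reproduces the temporal neighborhood of 2 (one preceding and one following frame) described above, while a trained block would use learned kernel weights instead of a plain mean:

```python
import numpy as np

def temporal_conv(x, kernel_size=3):
    """Average each joint's features over a sliding window of frames.
    kernel_size=3 links each frame to its preceding and following
    frame, i.e. a temporal neighborhood of size 2."""
    N, C, T, V = x.shape
    pad = kernel_size // 2
    xp = np.pad(x, ((0, 0), (0, 0), (pad, pad), (0, 0)), mode='edge')
    out = np.zeros_like(x)
    for t in range(T):
        out[:, :, t, :] = xp[:, :, t:t + kernel_size, :].mean(axis=2)
    return out

x = np.random.rand(1, 2, 6, 4)
y = temporal_conv(x)
print(y.shape)  # (1, 2, 6, 4)
```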
Temporal Attention Block
• Temporal attention is used to detect relevant frames within a sequence of frames.
• Its goal is to emphasize and recognize the frames that make a significant contribution to the overall
interpretation of an action.
• The temporal attention module computes the average activation over all joints and channels for each
frame.
• It aggregates the collective information inside each frame, indicating its overall significance.
• The aggregated frame-level activations are passed through a linear layer, allowing the model to learn
a weight for each frame.
• A sigmoid activation function is then applied to the linear layer's outputs, producing attention weights
for each frame.
• These attention weights serve as a mask, selectively amplifying or suppressing specific frame activations.
• The model can successfully focus on the frames that are deemed significant or informative for the given
action by applying attention weights to the input.
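A NumPy sketch of the temporal attention computation; for simplicity the linear layer is reduced to a single scalar weight and bias, which is an assumption for illustration, not the model's actual parameterization:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def temporal_attention(x, w, b):
    """x: (N, C, T, V). Average over channels C and joints V for each
    frame, apply a (scalar) linear layer and a sigmoid, then use the
    resulting weights as a mask over the frames."""
    frame_feat = x.mean(axis=(1, 3))              # (N, T) frame-level activations
    scores = sigmoid(w * frame_feat + b)          # (N, T) attention weights
    return x * scores[:, None, :, None], scores   # reweighted frames, weights

x = np.random.rand(2, 3, 5, 4)
weighted, scores = temporal_attention(x, w=1.0, b=0.0)
print(weighted.shape, scores.shape)  # (2, 3, 5, 4) (2, 5)
```

The sigmoid keeps every score in (0, 1), so the mask amplifies or suppresses frames without ever zeroing them out entirely.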
Spatial Attention Block
• Spatial attention is concerned with identifying important joints within the frames highlighted in the
temporal attention module.
• It computes the average activation for each joint over all frames and channels.
• The joint-level activations are then passed to a linear layer, which allows the model to learn the
weights associated with each joint.
• These weights are then fed into a sigmoid activation function, which produces attention weights that
indicate the relative significance of each joint.
• The generated attention weights operate as a mask, selectively adjusting the activations of individual
joints based on their relevance by applying the attention weights to the attended frames.
• Finally, a subset of the most informative skeletons is selected from the given sequence.
• The skeletons are sorted in decreasing order of their attention weights.
• These selected skeletons are then incorporated into the network's subsequent layers for additional
processing and analysis.
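The joint-level attention and the frame-selection step can be sketched as follows; the scalar linear layer and the stand-in frame scores are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def spatial_attention(x, w, b):
    """x: (N, C, T, V). Average over channels C and frames T for each
    joint, then sigmoid(linear) gives a per-joint attention mask."""
    joint_feat = x.mean(axis=(1, 2))              # (N, V) joint-level activations
    scores = sigmoid(w * joint_feat + b)          # (N, V) attention weights
    return x * scores[:, None, None, :], scores

def select_top_frames(x, frame_scores, k):
    """Keep the k frames with the highest attention scores, in
    decreasing order of their weights."""
    idx = np.argsort(-frame_scores, axis=1)[:, :k]
    return np.stack([x[n][:, idx[n], :] for n in range(x.shape[0])])

x = np.random.rand(1, 2, 6, 3)                    # N=1, C=2, T=6, V=3
masked, joint_scores = spatial_attention(x, w=1.0, b=0.0)
frame_scores = x.mean(axis=(1, 3))                # stand-in temporal scores
top = select_top_frames(masked, frame_scores, k=4)
print(top.shape)  # (1, 2, 4, 3)
```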
Network Architecture
• Skeleton data, comprising the coordinate locations of all joints of the skeleton, is fed as input to the
model.
• The model is made up of six graph convolutional layers and an attention module.
• Initially, two spatial graph convolutional layers are present in the model. Spatial Graph Convolution layer
focuses on capturing spatial relationships.
• Temporal and Spatial attention blocks are present after the spatial graph convolutional layers. The temporal
attention module captures the most informative frames from a sequence of skeletons. The spatial attention
mechanism emphasizes the most informative joints from the frames highlighted. Frame selection is then
performed to select the skeletons with the highest attention scores.
• The last four graph convolutional layers combine spatial and temporal convolutions. Temporal dynamics in the
skeletal data is captured by the temporal convolution block. The block takes as input the output of the spatial
graph convolution block.
• Each skeleton sequence's enhanced spatiotemporal features are passed through a global average
pooling layer, yielding a 256-dimensional output feature vector.
• Finally, human actions are classified using a fully connected layer with a softmax classifier. To reduce
classification error, the model is trained end-to-end via backpropagation.
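The final classification stage (global average pooling over the temporal and joint dimensions, followed by a softmax-equipped fully connected layer) can be sketched as follows; the random features and weights here are placeholders:

```python
import numpy as np

def classify(features, W_fc, b_fc):
    """Global average pooling over (T, V), then a fully connected
    layer with a softmax classifier."""
    pooled = features.mean(axis=(2, 3))                    # (N, C) feature vectors
    logits = pooled @ W_fc + b_fc                          # (N, num_classes)
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)                # class probabilities

feats = np.random.rand(2, 256, 10, 25)   # N=2, C=256, T=10 frames, V=25 joints
W_fc = np.random.rand(256, 60) * 0.01    # 60 NTU-RGB+D action classes
probs = classify(feats, W_fc, np.zeros(60))
print(probs.shape)  # (2, 60)
```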
Experimental Results
• The model achieved Top-1 and Top-5
accuracies of 83.58% and 97.07% on the
cross-subject benchmark, and Top-1 and
Top-5 accuracies of 91.22% and 98.85% on
the cross-view benchmark.
• The model is compared with other notable
approaches on Top-1 accuracy for both the
cross-subject and cross-view benchmarks.
Conclusion
• Utilization of temporal and spatial attention mechanisms helped in enhancing the model’s
performance.
• The model is evaluated on a widely used NTU-RGB+D benchmark, assessing its top-1 and top-5
accuracies on the dataset’s cross-view and cross-subject benchmarks.
• In comparison to other skeleton-based models, we observed a significant performance gap
between RNN- and CNN-based methods and our method. Furthermore, our method
outperformed other GCN-based models, demonstrating the benefit of incorporating spatial and
temporal attention mechanisms into a graph convolutional network, which enhanced the model's
accuracy and efficiency.
References
• S. Yan, Y. Xiong and D. Lin, "Spatial temporal graph convolutional networks for skeleton-based
action recognition," in Proceedings of the AAAI conference on artificial intelligence, 2018.
• N. Heidari and A. Iosifidis, "Temporal attention-augmented graph convolutional network for
efficient skeleton-based human action recognition," in 2020 25th International Conference on
Pattern Recognition (ICPR), 2021.
• T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks,"
in International Conference on Learning Representations, 2017.
• A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar and L. Fei-Fei, "Large-scale video
classification with convolutional neural networks," 2014.
• Y. Du, W. Wang and L. Wang, "Hierarchical recurrent neural network for skeleton based action
recognition," in Proceedings of the IEEE conference on computer vision and pattern recognition,
2015.
THANK YOU

Mais conteúdo relacionado

Semelhante a final_project_1_2k21cse07.pptx

intro-to-cnn-April_2020.pptx
intro-to-cnn-April_2020.pptxintro-to-cnn-April_2020.pptx
intro-to-cnn-April_2020.pptx
ssuser3aa461
 
Deep Learning
Deep LearningDeep Learning
Deep Learning
Pierre de Lacaze
 
Attention correlated appearance and motion feature followed temporal learning...
Attention correlated appearance and motion feature followed temporal learning...Attention correlated appearance and motion feature followed temporal learning...
Attention correlated appearance and motion feature followed temporal learning...
IJECEIAES
 

Semelhante a final_project_1_2k21cse07.pptx (20)

C1804011117
C1804011117C1804011117
C1804011117
 
A Survey of Convolutional Neural Networks
A Survey of Convolutional Neural NetworksA Survey of Convolutional Neural Networks
A Survey of Convolutional Neural Networks
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and Development
 
intro-to-cnn-April_2020.pptx
intro-to-cnn-April_2020.pptxintro-to-cnn-April_2020.pptx
intro-to-cnn-April_2020.pptx
 
230727_HB_JointJournalClub.pptx
230727_HB_JointJournalClub.pptx230727_HB_JointJournalClub.pptx
230727_HB_JointJournalClub.pptx
 
IRJET-Multiple Object Detection using Deep Neural Networks
IRJET-Multiple Object Detection using Deep Neural NetworksIRJET-Multiple Object Detection using Deep Neural Networks
IRJET-Multiple Object Detection using Deep Neural Networks
 
A Virtual Infrastructure for Mitigating Typical Challenges in Sensor Networks
A Virtual Infrastructure for Mitigating Typical Challenges in Sensor NetworksA Virtual Infrastructure for Mitigating Typical Challenges in Sensor Networks
A Virtual Infrastructure for Mitigating Typical Challenges in Sensor Networks
 
Deep Learning in Robotics: Robot gains Social Intelligence through Multimodal...
Deep Learning in Robotics: Robot gains Social Intelligence through Multimodal...Deep Learning in Robotics: Robot gains Social Intelligence through Multimodal...
Deep Learning in Robotics: Robot gains Social Intelligence through Multimodal...
 
[20240422_LabSeminar_Huy]Taming_Effect.pptx
[20240422_LabSeminar_Huy]Taming_Effect.pptx[20240422_LabSeminar_Huy]Taming_Effect.pptx
[20240422_LabSeminar_Huy]Taming_Effect.pptx
 
04 Deep CNN (Ch_01 to Ch_3).pptx
04 Deep CNN (Ch_01 to Ch_3).pptx04 Deep CNN (Ch_01 to Ch_3).pptx
04 Deep CNN (Ch_01 to Ch_3).pptx
 
Deep Learning
Deep LearningDeep Learning
Deep Learning
 
A systematic image compression in the combination of linear vector quantisati...
A systematic image compression in the combination of linear vector quantisati...A systematic image compression in the combination of linear vector quantisati...
A systematic image compression in the combination of linear vector quantisati...
 
Attention correlated appearance and motion feature followed temporal learning...
Attention correlated appearance and motion feature followed temporal learning...Attention correlated appearance and motion feature followed temporal learning...
Attention correlated appearance and motion feature followed temporal learning...
 
CONVOLUTIONAL NEURAL NETWORK BASED FEATURE EXTRACTION FOR IRIS RECOGNITION
CONVOLUTIONAL NEURAL NETWORK BASED FEATURE EXTRACTION FOR IRIS RECOGNITION CONVOLUTIONAL NEURAL NETWORK BASED FEATURE EXTRACTION FOR IRIS RECOGNITION
CONVOLUTIONAL NEURAL NETWORK BASED FEATURE EXTRACTION FOR IRIS RECOGNITION
 
CONVOLUTIONAL NEURAL NETWORK BASED FEATURE EXTRACTION FOR IRIS RECOGNITION
CONVOLUTIONAL NEURAL NETWORK BASED FEATURE EXTRACTION FOR IRIS RECOGNITION CONVOLUTIONAL NEURAL NETWORK BASED FEATURE EXTRACTION FOR IRIS RECOGNITION
CONVOLUTIONAL NEURAL NETWORK BASED FEATURE EXTRACTION FOR IRIS RECOGNITION
 
Deep Convolutional 3D Object Classification from a Single Depth Image and Its...
Deep Convolutional 3D Object Classification from a Single Depth Image and Its...Deep Convolutional 3D Object Classification from a Single Depth Image and Its...
Deep Convolutional 3D Object Classification from a Single Depth Image and Its...
 
Wits presentation 6_28072015
Wits presentation 6_28072015Wits presentation 6_28072015
Wits presentation 6_28072015
 
Accurate and Energy-Efficient Range-Free Localization for Mobile Sensor Networks
Accurate and Energy-Efficient Range-Free Localization for Mobile Sensor NetworksAccurate and Energy-Efficient Range-Free Localization for Mobile Sensor Networks
Accurate and Energy-Efficient Range-Free Localization for Mobile Sensor Networks
 
On Learning Navigation Behaviors for Small Mobile Robots With Reservoir Compu...
On Learning Navigation Behaviors for Small Mobile Robots With Reservoir Compu...On Learning Navigation Behaviors for Small Mobile Robots With Reservoir Compu...
On Learning Navigation Behaviors for Small Mobile Robots With Reservoir Compu...
 
Crowd Density Estimation Using Base Line Filtering
Crowd Density Estimation Using Base Line FilteringCrowd Density Estimation Using Base Line Filtering
Crowd Density Estimation Using Base Line Filtering
 

Último

Principle of erosion control- Introduction to contouring,strip cropping,conto...
Principle of erosion control- Introduction to contouring,strip cropping,conto...Principle of erosion control- Introduction to contouring,strip cropping,conto...
Principle of erosion control- Introduction to contouring,strip cropping,conto...
ZAPPAC1
 
Disaster risk reduction management Module 4: Preparedness, Prevention and Mit...
Disaster risk reduction management Module 4: Preparedness, Prevention and Mit...Disaster risk reduction management Module 4: Preparedness, Prevention and Mit...
Disaster risk reduction management Module 4: Preparedness, Prevention and Mit...
BrixsonLajara
 
Joka \ Call Girls Service Kolkata - 450+ Call Girl Cash Payment 8005736733 Ne...
Joka \ Call Girls Service Kolkata - 450+ Call Girl Cash Payment 8005736733 Ne...Joka \ Call Girls Service Kolkata - 450+ Call Girl Cash Payment 8005736733 Ne...
Joka \ Call Girls Service Kolkata - 450+ Call Girl Cash Payment 8005736733 Ne...
HyderabadDolls
 
Dubai Escorts Service 0508644382 Escorts in Dubai
Dubai Escorts Service 0508644382 Escorts in DubaiDubai Escorts Service 0508644382 Escorts in Dubai
Dubai Escorts Service 0508644382 Escorts in Dubai
Monica Sydney
 

Último (20)

Principle of erosion control- Introduction to contouring,strip cropping,conto...
Principle of erosion control- Introduction to contouring,strip cropping,conto...Principle of erosion control- Introduction to contouring,strip cropping,conto...
Principle of erosion control- Introduction to contouring,strip cropping,conto...
 
Presentation: Farmer-led climate adaptation - Project launch and overview by ...
Presentation: Farmer-led climate adaptation - Project launch and overview by ...Presentation: Farmer-led climate adaptation - Project launch and overview by ...
Presentation: Farmer-led climate adaptation - Project launch and overview by ...
 
Mira Road Reasonable Call Girls ,09167354423,Kashimira Call Girls Service
Mira Road Reasonable Call Girls ,09167354423,Kashimira Call Girls ServiceMira Road Reasonable Call Girls ,09167354423,Kashimira Call Girls Service
Mira Road Reasonable Call Girls ,09167354423,Kashimira Call Girls Service
 
Call girl in Sharjah 0503464457 Sharjah Call girl
Call girl in Sharjah 0503464457 Sharjah Call girlCall girl in Sharjah 0503464457 Sharjah Call girl
Call girl in Sharjah 0503464457 Sharjah Call girl
 
Trusted call girls in Fatehabad 9332606886 High Profile Call Girls You Can...
Trusted call girls in Fatehabad   9332606886  High Profile Call Girls You Can...Trusted call girls in Fatehabad   9332606886  High Profile Call Girls You Can...
Trusted call girls in Fatehabad 9332606886 High Profile Call Girls You Can...
 
Cyclone Case Study Odisha 1999 Super Cyclone in India.
Cyclone Case Study Odisha 1999 Super Cyclone in India.Cyclone Case Study Odisha 1999 Super Cyclone in India.
Cyclone Case Study Odisha 1999 Super Cyclone in India.
 
Presentation: Farmer-led climate adaptation - Project launch and overview by ...
Presentation: Farmer-led climate adaptation - Project launch and overview by ...Presentation: Farmer-led climate adaptation - Project launch and overview by ...
Presentation: Farmer-led climate adaptation - Project launch and overview by ...
 
Delivery in 20 Mins Call Girls Dungarpur 9332606886Call Girls Advance Cash O...
Delivery in 20 Mins Call Girls Dungarpur  9332606886Call Girls Advance Cash O...Delivery in 20 Mins Call Girls Dungarpur  9332606886Call Girls Advance Cash O...
Delivery in 20 Mins Call Girls Dungarpur 9332606886Call Girls Advance Cash O...
 
Disaster risk reduction management Module 4: Preparedness, Prevention and Mit...
Disaster risk reduction management Module 4: Preparedness, Prevention and Mit...Disaster risk reduction management Module 4: Preparedness, Prevention and Mit...
Disaster risk reduction management Module 4: Preparedness, Prevention and Mit...
 
Water Pollution
Water Pollution Water Pollution
Water Pollution
 
Hook Up Call Girls Rajgir 9332606886 High Profile Call Girls You Can Get T...
Hook Up Call Girls Rajgir   9332606886  High Profile Call Girls You Can Get T...Hook Up Call Girls Rajgir   9332606886  High Profile Call Girls You Can Get T...
Hook Up Call Girls Rajgir 9332606886 High Profile Call Girls You Can Get T...
 
Joka \ Call Girls Service Kolkata - 450+ Call Girl Cash Payment 8005736733 Ne...
Joka \ Call Girls Service Kolkata - 450+ Call Girl Cash Payment 8005736733 Ne...Joka \ Call Girls Service Kolkata - 450+ Call Girl Cash Payment 8005736733 Ne...
Joka \ Call Girls Service Kolkata - 450+ Call Girl Cash Payment 8005736733 Ne...
 
Environmental Topic : Soil Pollution by Afzalul Hoda.pptx
Environmental Topic : Soil Pollution by Afzalul Hoda.pptxEnvironmental Topic : Soil Pollution by Afzalul Hoda.pptx
Environmental Topic : Soil Pollution by Afzalul Hoda.pptx
 
Hertwich_EnvironmentalImpacts_BuildingsGRO.pptx
Hertwich_EnvironmentalImpacts_BuildingsGRO.pptxHertwich_EnvironmentalImpacts_BuildingsGRO.pptx
Hertwich_EnvironmentalImpacts_BuildingsGRO.pptx
 
Test bank for beckmann and ling s obstetrics and gynecology 8th edition by ro...
Test bank for beckmann and ling s obstetrics and gynecology 8th edition by ro...Test bank for beckmann and ling s obstetrics and gynecology 8th edition by ro...
Test bank for beckmann and ling s obstetrics and gynecology 8th edition by ro...
 
Call Girls in Dattatreya Nagar / 8250092165 Genuine Call girls with real Phot...
Call Girls in Dattatreya Nagar / 8250092165 Genuine Call girls with real Phot...Call Girls in Dattatreya Nagar / 8250092165 Genuine Call girls with real Phot...
Call Girls in Dattatreya Nagar / 8250092165 Genuine Call girls with real Phot...
 
Dubai Escorts Service 0508644382 Escorts in Dubai
Dubai Escorts Service 0508644382 Escorts in DubaiDubai Escorts Service 0508644382 Escorts in Dubai
Dubai Escorts Service 0508644382 Escorts in Dubai
 
High Profile Call Girls Service in Udhampur 9332606886 High Profile Call G...
High Profile Call Girls Service in Udhampur   9332606886  High Profile Call G...High Profile Call Girls Service in Udhampur   9332606886  High Profile Call G...
High Profile Call Girls Service in Udhampur 9332606886 High Profile Call G...
 
Vip Salem Call Girls 8250092165 Low Price Escorts Service in Your Area
Vip Salem Call Girls 8250092165 Low Price Escorts Service in Your AreaVip Salem Call Girls 8250092165 Low Price Escorts Service in Your Area
Vip Salem Call Girls 8250092165 Low Price Escorts Service in Your Area
 
Role of Copper and Zinc Nanoparticles in Plant Disease Management
Role of Copper and Zinc Nanoparticles in Plant Disease ManagementRole of Copper and Zinc Nanoparticles in Plant Disease Management
Role of Copper and Zinc Nanoparticles in Plant Disease Management
 

final_project_1_2k21cse07.pptx

  • 1. Human Action Recognition Using Attention Based Spatiotemporal Graph Convolutional Network Under the guidance of: Prof. Anil Singh Parihar Dept. of Computer Science & Engineering Submitted By: Anshula Sharma 2K21/CSE/07
  • 2. What is Human Action Recognition? • Human Action Recognition or (HAR) is concerned with predicting or classifying the actions being performed by a human being. • It is an important area of research. • Different data modalities used for HAR: 1. Optical Flow 2. RGB Images 3. Body Skeletons • Deep Learning techniques are used to predict and recognize human actions.
  • 3. Problem Statement • The major goal is to build and improve a model that can recognize and interpret human actions using skeletal data. • Traditional techniques frequently depend on RGB video data, which can be affected by lighting and occlusions. By capturing human actions using the spatial joints, action recognition based on skeletons provide a more robust and efficient alternative. • However, skeleton-based action identification faces a number of obstacles: • It is still difficult to extract relevant information from skeletal data and properly capture the spatiotemporal dynamics of human motions. • Existing approaches usually process the body skeletons in the entire sequence that represents the action performed. This strategy is inefficient in terms of computation time and memory utilization. • We propose an attention based spatiotemporal graph convolutional network to overcome these challenges.
  • 4. Recognition using Skeleton based data • Body skeletons are increasingly used for Human action recognition due to their compact and action-focused nature. • Skeletons are three-dimensional or two- dimensional coordinate representations of human body joints. • Skeletons are found in graph formations, where the graph’s nodes represents the skeleton's joints whereas the edges of the graph indicate the many connections between various body joints. • Actions can be identified from the different motion patterns of the joints of the skeletal body.
  • 5. Graph Convolutional Networks • Graph Convolutional Networks (GCNs) have become quite prominent in the field of skeleton-based action recognition. • Using deep feed-forward architectures, graph convolutional networks successfully capture the spatiotemporal characteristics inherent in human skeletons. • GCNs are a variant of Convolutional Neural Networks (CNN), and help to generalize graph- structured data. • GCNs operate, in a manner similar to CNNs, by inspecting the neighboring nodes. • The input is in non-Euclidean structural form data, with each node having varying numbers of connections. • Nodes and their connections (edges) with other nodes are represented with the help of an adjacency matrix, which is then introduced to the forward propagation equation.
  • 6. Graph Convolutional Network • The initial input to the GCN is the graph's node features along with the adjacency matrix. It records local connection patterns as well as information from surrounding nodes. • The aggregated node features are applied with the weight matrix. It is multiplied by the aggregated features to compute the modified features for each node. • Finally, non-linear activation function is applied to obtain updated node representations. The updated node representations are then served as the next layer’s input, along with the adjacency matrix.
  • 8. Dataset • The model is experimented on a large-scale indoor action recognition dataset, NTU-RGB+D. • It is the most comprehensive datasets for 3D annotations for human action recognition tasks. • The video clips have been acquired by Microsoft Kinect v2 sensors. • It covers 60 action classes covering over 56,880 action samples, with the actions carried out by 40 distinct test subjects. • The action classes cover a wide variety of daily actions, including walking, waving, clapping, sitting down, getting up, and playing musical instruments. It consists of both individual actions and interactions between two subjects. • The annotations that are obtained by the Kinect depth sensors offer 3D joint positions (X, Y, Z). • A total of 25 joints are present in each subject in the skeleton series. • The dataset consists of two evaluation benchmarks: 1. Cross-view (X-view): 3 different cameras capture the videos. The training set comprises of 37,820 video clips. The test set consists of 18,960 video clips. 2. Cross-subject (X-sub): Focuses on cross-subject action recognition. The training set has 40,320 videos and the test set contains 16,560 videos.
  • 10. Proposed Approach • The proposed approach is an attention-based model for human action recognition that uses both temporal and spatial attention modules to improve recognition accuracy. • The temporal attention module selects the most informative frames from a sequence of skeletons, capturing the action's critical temporal dynamics. • Following that, the spatial attention mechanism highlights the most significant joints within the selected frames, emphasizing their distinguishing characteristics. • The computed attention scores are then used to select frames, allowing the identification of the skeletons with the highest attention values. • Both temporal and spatial relationships are effectively utilized in the skeletal data by incorporating the attention modules into a graph convolutional network. • The temporal and spatial attention mechanisms together improve the efficiency of human action recognition based on skeletons, resulting in further accurate and robust identification results.
  • 11. Spatial Graph Convolution Module • Spatial Graph Convolution block focuses on capturing spatial relationships. • It operates by considering the nodes in the network as skeletal joints and the edges as connections that describe interactions between these joints. • The block aggregates information from nearby nodes in order to capture local interactions and dependencies. • The block accepts a tensor of shape (N, Cin, T, V) as input, where N denotes the batch size, T represents the sequence length, Cin represents the number of input channels, and V represents the number of joints. • Along with the input tensor, an adjacency matrix is used which depicts pairwise interactions between the joints. • The block then runs a graph convolution operation, which updates the characteristics of each joint by aggregating information from its nearby joints. • Non-linear transformations are used after the graph convolution function to incorporate non-linearity and improve discriminative capability of the features.
  • 12. Temporal Graph Convolution Module • Temporal convolution block captures the dynamics of temporal information in the skeletal data. • The block takes as input the output of the spatial graph convolution block. • For temporal graph convolution, the neighborhood of each vertex is extended to include each node's temporal neighbors. • Each node in the preceding and following skeleton frames is linked to the same node, resulting in a temporal neighborhood size of 2 for each node. • A 2D convolution is performed to process the extended neighborhood, with a kernel which determines the convolution operation's temporal receptive field. • The temporal convolution block aggregates the features of each individual body joint at various time steps by using a fixed kernel size, allowing the model to capture the dynamics of temporal dimension and patterns contained in the skeletal data.
13. Temporal Attention Block
• Temporal attention is used to detect relevant frames within a sequence of frames.
• Its goal is to emphasize and recognize the frames that contribute significantly to the overall interpretation of an action.
• The temporal attention module computes the average activation over all joints and channels for each frame.
• This aggregates the collective information within each frame, indicating its overall significance.
• The model learns the weight associated with each frame by passing the aggregated frame-level activations through a linear layer.
• A sigmoid activation function is then applied to these weights, producing attention weights for each frame.
• These attention weights serve as a mask, selectively amplifying or suppressing specific frame activations.
• By applying the attention weights to the input, the model can focus on the frames deemed significant or informative for the given action.
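The three steps (average, linear layer, sigmoid mask) can be sketched as follows. The linear-layer shape `(T, T)` is an assumption for illustration; the paper does not specify the layer's exact dimensions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def temporal_attention(x, W, b):
    # x: (N, C, T, V); W: (T, T) and b: (T,) form the learnable
    # linear layer (assumed shapes).
    # 1. Average over channels and joints: one score per frame.
    frame_scores = x.mean(axis=(1, 3))           # (N, T)
    # 2. Linear layer + sigmoid gives a weight in (0, 1) per frame.
    attn = sigmoid(frame_scores @ W + b)         # (N, T)
    # 3. Use the weights as a mask over the frame activations.
    return x * attn[:, None, :, None], attn

rng = np.random.default_rng(2)
x = rng.standard_normal((2, 16, 6, 5))           # N=2, C=16, T=6, V=5
W = rng.standard_normal((6, 6)) * 0.1
b = np.zeros(6)
y, attn = temporal_attention(x, W, b)
print(y.shape, attn.shape)                       # (2, 16, 6, 5) (2, 6)
```

Frames with attention weight near 1 pass through almost unchanged, while frames near 0 are suppressed.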
14. Spatial Attention Block
• Spatial attention identifies the important joints within the frames highlighted by the temporal attention module.
• It computes the average activation for each joint over all frames and channels.
• The joint-level activations are then passed to a linear layer, which allows the model to learn the weight associated with each joint.
• These weights are fed into a sigmoid activation function, producing attention weights that indicate the relative significance of each joint.
• The resulting attention weights act as a mask, selectively adjusting the activations of individual joints according to their relevance by applying the attention weights to the attended frames.
• Finally, a subset of the most informative skeletons is selected from the given sequence.
• The skeletons are sorted in decreasing order of their attention weights.
• The selected skeletons are then passed to the network's subsequent layers for further processing and analysis.
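A sketch of the joint-level masking plus frame selection, again under assumptions: the `(V, V)` linear-layer shape and the way the temporal weights `frame_attn` are passed in are illustrative, not taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def spatial_attention_and_select(x, frame_attn, W, b, k):
    # x: (N, C, T, V); frame_attn: (N, T) weights from the temporal
    # attention block; W: (V, V), b: (V,) learnable layer; keep k frames.
    # 1. Average over channels and frames: one score per joint.
    joint_scores = x.mean(axis=(1, 2))                 # (N, V)
    # 2. Linear layer + sigmoid gives a weight per joint; mask the input.
    joint_attn = sigmoid(joint_scores @ W + b)         # (N, V)
    x = x * joint_attn[:, None, None, :]
    # 3. Sort frames by temporal attention weight in decreasing order
    #    and keep the k highest-scoring skeletons per sample.
    idx = np.argsort(-frame_attn, axis=1)[:, :k]       # (N, k)
    return np.take_along_axis(x, idx[:, None, :, None], axis=2)

rng = np.random.default_rng(3)
x = rng.standard_normal((2, 16, 6, 5))                 # N=2, C=16, T=6, V=5
frame_attn = rng.random((2, 6))
W = rng.standard_normal((5, 5)) * 0.1
b = np.zeros(5)
y = spatial_attention_and_select(x, frame_attn, W, b, k=4)
print(y.shape)                                         # (2, 16, 4, 5)
```

The output keeps only k of the T frames, which is where the approach saves computation and memory relative to processing the entire sequence.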
15. Network Architecture
• Skeleton data, comprising the coordinate locations of all the joints of the skeleton, is fed as input to the model.
• The model is made up of six graph convolutional layers and an attention module.
• The first two layers are spatial graph convolutional layers, which focus on capturing spatial relationships.
• The temporal and spatial attention blocks follow the spatial graph convolutional layers. The temporal attention module captures the most informative frames from a sequence of skeletons, and the spatial attention mechanism emphasizes the most informative joints within the highlighted frames. Frame selection is then performed to keep the skeletons with the highest attention scores.
• The last four graph convolutional layers combine spatial and temporal convolutions. The temporal convolution block, which takes the output of the spatial graph convolution block as input, captures the temporal dynamics in the skeletal data.
• Each skeleton sequence's enhanced spatiotemporal features are passed through a global average pooling layer, yielding a 256-dimensional output feature vector.
• Finally, human actions are classified using a fully connected layer with a SoftMax classifier. The model is trained end-to-end via backpropagation to reduce classification error.
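The classification head at the end of the pipeline (global average pooling, fully connected layer, SoftMax) can be sketched as below. The 60-class output is an assumption matching the NTU-RGB+D action set; the weight names are hypothetical.

```python
import numpy as np

def classify(features, W_fc, b_fc):
    # features: (N, 256, T, V) spatiotemporal features produced by the
    # last graph convolutional layer.
    # Global average pooling over frames and joints -> (N, 256).
    pooled = features.mean(axis=(2, 3))
    # Fully connected layer followed by a numerically stable softmax.
    logits = pooled @ W_fc + b_fc
    z = logits - logits.max(axis=1, keepdims=True)
    return np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)

rng = np.random.default_rng(4)
feats = rng.standard_normal((2, 256, 4, 5))
W_fc = rng.standard_normal((256, 60)) * 0.05   # 60 classes assumed (NTU-RGB+D)
b_fc = np.zeros(60)
probs = classify(feats, W_fc, b_fc)
print(probs.shape)                             # (2, 60); each row sums to 1
```

During end-to-end training, the cross-entropy loss over these probabilities would be backpropagated through the attention and convolution blocks.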
17. Experimental Results
• The model achieved Top-1 and Top-5 accuracies of 83.58% and 97.07% on the cross-subject benchmark, and Top-1 and Top-5 accuracies of 91.22% and 98.85% on the cross-view benchmark.
• The model is compared with other notable approaches on Top-1 accuracy for both the cross-subject and cross-view benchmarks.
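For readers unfamiliar with the two metrics, a prediction counts as a Top-k hit if the true class appears among the k highest-scoring classes. A small illustrative computation (toy data, not the paper's results):

```python
import numpy as np

def top_k_accuracy(probs, labels, k):
    # probs: (N, num_classes) predicted class probabilities;
    # labels: (N,) ground-truth class indices.
    topk = np.argsort(-probs, axis=1)[:, :k]   # k highest-scoring classes
    hits = [labels[i] in topk[i] for i in range(len(labels))]
    return float(np.mean(hits))

probs = np.array([[0.1, 0.6, 0.3],
                  [0.5, 0.2, 0.3],
                  [0.2, 0.3, 0.5]])
labels = np.array([1, 2, 2])
print(top_k_accuracy(probs, labels, 1))   # 2 of 3 correct at Top-1
print(top_k_accuracy(probs, labels, 2))   # all 3 correct at Top-2
```

Top-5 accuracy is always at least as high as Top-1, which is why the 97-99% Top-5 figures sit above the Top-1 figures on both benchmarks.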
18. Conclusion
• The temporal and spatial attention mechanisms helped enhance the model's performance.
• The model is evaluated on the widely used NTU-RGB+D benchmark, assessing its Top-1 and Top-5 accuracies on the dataset's cross-view and cross-subject benchmarks.
• Compared to other skeleton-based models, we observed a significant performance gap between RNN- and CNN-based methods and our method. Furthermore, our method outperformed other GCN-based models, demonstrating the benefit of incorporating spatial and temporal attention mechanisms into a graph convolutional network, which enhanced the model's accuracy and efficiency.
19. References
• S. Yan, Y. Xiong and D. Lin, "Spatial temporal graph convolutional networks for skeleton-based action recognition," in Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
• N. Heidari and A. Iosifidis, "Temporal attention-augmented graph convolutional network for efficient skeleton-based human action recognition," in 2020 25th International Conference on Pattern Recognition (ICPR), 2021.
• T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," in International Conference on Learning Representations, 2017.
• A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar and L. Fei-Fei, "Large-scale video classification with convolutional neural networks," 2014.
• Y. Du, W. Wang and L. Wang, "Hierarchical recurrent neural network for skeleton based action recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.