Electronics and Computer Science
Faculty of Physical Sciences and Engineering
University of Southampton
Christopher J. Watts
April 28, 2015
Estimating Full-Body Demographics via Soft
Biometrics
Project Supervisor: Professor Mark S Nixon
Second Examiner: Professor George Chen
A project report submitted for the award of
Bachelor of Science (BSc.) in Computer Science
Abstract
Soft biometrics is becoming an increasingly realistic means of identifying individuals in the field of
computer vision. This project proposes a novel method of automatic demographic annotation
using categoric labels for a wide range of body features, including height, leg length, and shoulder
width, where previous research has been limited to facial images and very few biometric features.
Using common computer vision techniques, it is possible to categorise subjects’ body features from
still images or video frames and directly compare them to other known subjects, with strong
resistance to noise and image compression. This project explores the viability of this new technique
and its impact on soft biometrics as a whole.
Contents
1 Introduction 1
2 Requirements of Solution 2
3 Consideration of Approaches and Literature Review 3
3.1 Code Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
3.2 Region of Interest Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
3.2.1 Locating the Subject . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
3.3 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
3.3.1 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
3.3.2 Categoric Labelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
3.3.3 Weighting Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
4 Final Design and Justification 7
4.1 Technologies and Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
4.1.1 Code Libraries and Project Setup . . . . . . . . . . . . . . . . . . . . . . . . . 7
4.1.2 Subject Location . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
4.1.3 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
4.1.4 Labelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
4.1.5 Weighting Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
4.2 Processing Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
5 Implementation 12
5.1 Loading Training Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
5.1.1 Interpreting the GaitAnnotate Database . . . . . . . . . . . . . . . . . . . . . 12
5.2 Image Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
5.2.1 Limiting the Size of the Input . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
5.2.2 Background Removal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
5.2.3 Processing Video . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
5.3 Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
5.4 Training Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
5.4.1 Further Improvements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
5.5 Storing Training Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
5.5.1 Heuristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
5.5.2 Principal Component Data and Training Sets . . . . . . . . . . . . . . . . . . 18
5.5.3 Query Engine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
6 Results and Evaluation 19
6.1 Ability to Estimate Body Demographics . . . . . . . . . . . . . . . . . . . . . . . . . 19
6.1.1 How to Measure Success . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
6.1.2 Results on Test Subjects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
6.1.3 Robustness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
6.2 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
6.2.1 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
6.2.2 Querying . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
6.3 Viability of Use as a Human Identification System . . . . . . . . . . . . . . . . . . . 22
6.4 Background Invariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
6.5 Evaluation against Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
6.6 Evaluation against Other Known Techniques . . . . . . . . . . . . . . . . . . . . . . 24
6.6.1 As Demographic Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
6.6.2 As Human Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
7 Summary and Conclusion 28
7.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
7.1.1 Migrating to C++ and OpenCV . . . . . . . . . . . . . . . . . . . . . . . . . 28
7.1.2 Dataset Improvements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
7.1.3 Use of Comparative Labels . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
7.1.4 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
7.1.5 Weighting Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Appendices 32
A Project Management 33
B Results 40
C Design Archive 44
D Project Brief 45
Preface
This project makes use of the terms subject and suspect. A subject is a person who is scanned by
the system to serve as training data or test data. A suspect is a person who will be input as a
query to identify matches from the known dataset of subjects.
Acknowledgements
I, Christopher J. Watts, certify that this project represents solely my own work and that all work
referenced has been acknowledged appropriately. Further to this, I would like to thank those
involved in the creation of the Southampton Gait Database and GaitAnnotate projects which have
been used extensively in this work, and my supervisor Mark S. Nixon for providing guidance and
feedback throughout the duration of this project.
Chapter 1
Introduction
The original motivation for this project was to be able to identify a criminal suspect’s presence
in surveillance footage given one or more reference images of the suspect — even if part of the
subject, such as the face, is concealed. This would help particularly in law enforcement for finding
fugitives who appear in CCTV footage when traditional biometric data (such as facial features) is
obscured. During the course of this project, the goal developed into a generalised target of
calculating individual body features from an image to assist in identification processes associated
with finding criminals.
The method proposed in this project is a mixture of computer vision and machine learning to
approximate the metrics of an individual. Approaches of this sort reside under the category of
"soft-biometrics", and there have been several attempts to identify people this way with "acceptable
rates of accuracy" in comparison to traditional biometrics [3].
The focus of this implementation is to identify the body demographics of subjects from still im-
ages using a set of categoric labels. An example is to categorise height by [Very Short, Short,
Average, Tall, Very Tall]. There are two distinct advantages to using categories for features
rather than attempting to estimate an absolute value:
1. Estimating labels from video footage is more robust with greater invariance to noise, skew,
and low camera resolution
2. The accuracy of training data for each subject (which must be generated by hand) becomes
more reliable since research shows humans perform poorly when estimating absolute values
[19], [15].
These demographic categories were previously used by Sina Samangooei [19], who created a
database of subjects in collaboration with Mark Nixon for the GaitAnnotate project [20]. Previous
research has had some success with automatic demographic recognition [1], [25], [6], [7], but only
in the domain of facial imagery. Furthermore, these projects have been typically limited to a
small number of demographics — age, gender and race. This project builds on existing research to
generalise the techniques and detect a wide range of demographic information from images of full
bodies to further push the possibility of automatic human identification using computer vision.
Chapter 2
Requirements of Solution
In order to successfully achieve the goal of identifying body demographics, the following minimal
criteria were referred to when justifying all major decisions on the project.
ID    Type            Requirement
FR1   Functional      The system must accept colour images and video frames of any size as an input (although it may then convert the image to greyscale)
FR2   Functional      The system must process video frame-by-frame autonomously
FR3   Functional      The system must be able to perform all calculations without any additional user input
FR4   Functional      The system must be invariant to the background of each image
FR5   Functional      The system must be self-contained with any learning processes — storing all heuristic knowledge needed to perform identification
FR6   Functional      The system must be able to read from a database of subjects and feature categories (GaitAnnotate DB)
FR7   Functional      Once trained, the system must produce an estimate for each body feature on the subject given one or more query images
FR8   Functional      Once trained, the system must produce the best guess or best guesses from the database of known subjects when given one or more query images
R1    Non-functional  The system must not rely on pixel-for-pixel measurements when estimating lengths
R2    Non-functional  The system must demonstrate a measurable level of invariance to noise and resolution (as if a CCTV camera is in use)
R3    Non-functional  Once trained, the system should satisfy queries within one minute on a standard desktop or laptop computer (although 1/30th of a second is preferable)
R4    Non-functional  The accuracy when estimating each feature should be greater than random (100 / number of categories, in percent)
R5    Non-functional  The accuracy of subject retrieval should be better than random (retrieval correctly matches more subjects than a random test case for a sufficiently large number of queries)
Given the research-based nature of the project, the requirements have been kept to a
minimum to avoid over-specifying the system and eliminating possibilities before they are explored.
Chapter 3
Consideration of Approaches and
Literature Review
3.1 Code Libraries
As computer vision has progressed to a relatively mature field, there are programming tools to
abstract much of the functionality this project requires. One of which is OpenCV1
— a computer
vision library written in C/C++ with interfaces available for Java and Python. Due to the heavy
optimisation of the binaries, this library is computationally quick and memory efficient which makes
it ideal for real-time applications.
Another option is OpenIMAJ2
— a modernised approach to computer vision libraries written
in pure Java that makes best use of the Object-Orientated paradigm.
Finally, MATLAB provides a computer vision toolbox3
that offers rapid prototyping with many
important algorithms in-built and support for generating C-code.
3.2 Region of Interest Extraction
Working out the soft biometrics of subjects directly from the raw input images is intractable.
A series of filters must be used to abstract the necessary features for labelling.
3.2.1 Locating the Subject
One of the most-used methods of finding a person in an image is the Viola-Jones algorithm [22]. It is
commonly used today to detect faces on smartphones and cameras, and to automatically tag friends
on social networks. Using an alternate set of Haar-like features to those used for face detection,
the Viola-Jones algorithm is able to detect full bodies, and hence extract them from an image.
Figure 3.1: An example of the Viola-Jones algorithm detecting full bodies. Images credit: mzacha; RGBstock.com
Alternatively, background subtraction can be used. This is plausible in the solution domain because
most CCTV cameras are static, therefore a background reference model can be extracted over a
period of time using algorithms such as Temporal Median [16]. The de-facto algorithm for
background subtraction described by Horprasert et al. [10] illustrates a way of obtaining a
foreground subject from a background model by examining the change in brightness and the change
in colour separately — allowing for shadow elimination. The largest resulting connected components
can then be masked from the background — hopefully containing the subject with minimal background
pixel false-positives.
There have since been several extensions to the Horprasert algorithm, such as the approach taken
by Kim et al. [13], which combines the four-class thresholding of Horprasert et al. with silhouette
extraction to smooth out noise and false-negatives in connected components.
3.3 Training
3.3.1 Feature Extraction
In order to learn labels, a feature vector needs to be made for each subject. A basic example could
be the pixel vectors from head to toe, shoulder to shoulder, pelvis to knee etc., but the solution is
unlikely to be robust if the pose were to change.
The route recommended by Hare et al. is auto-annotation with Latent Semantic Analysis (LSA) [8].
LSA works by finding the eigenvectors (Q, Q') and eigenvalues (\Lambda) of a matrix of the common
terms between a set of documents (A) using the eigendecomposition equation:

A = Q \Lambda Q'^{-1}

In this case, Q is the matrix of eigenvectors for AA^T and Q' is for A^T A. The eigenvectors can then
be used for many purposes. A common task is to find the similarity between any two documents.
This is achieved by finding the cosine similarity between any two rows of the eigenvector matrix Q'.
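For concreteness, the similarity between documents i and j is the cosine of the angle between the
corresponding rows \mathbf{q}_i and \mathbf{q}_j of Q':

\mathrm{sim}(d_i, d_j) = \frac{\mathbf{q}_i \cdot \mathbf{q}_j}{\|\mathbf{q}_i\| \, \|\mathbf{q}_j\|}

with a value of 1 indicating identical term distributions and a value of 0 indicating no shared variance.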
LSA was originally used in document analysis to find common terms between many large doc-
uments. By considering images as ’documents’ and features as ’terms’, Hare et al. describe how
the process of finding and sorting the principal components of the image terms can be used for im-
age retrieval [8]. Weights can then be assigned to the individual principal components to describe
how relevant each component is with respect to a particular body feature. This type of process is
known as Principal Component Analysis (PCA).
To implement PCA, the eigenvectors are organised in descending order by the corresponding
eigenvalue \lambda \in \Lambda for each row. The eigenvector with the largest eigenvalue represents the principal
component: the most important variance that contributes the majority of change in an image.
Applying this to a matrix of images and their features or pixels requires finding the covariance
matrix. A quick and dirty trick is to approximate it with A = II^T, where I is the matrix of features
for n images. Since A is a covariance matrix, and therefore a real symmetric matrix, the
eigendecomposition can be reduced to

A = Q \Lambda Q^T
PCA has previously been used by Klare et al. [14] as part of the process in facial demographic recog-
nition for age, gender and race producing respectable results when trained with Linear Discriminant
Analysis as a classification-type technique.
3.3.2 Categoric Labelling
The training data lists a set of categoric measurements for each subject in the database of footage.
Since describing the relative traits of individuals varies based on personal experience [18], the
training data must be derived from the average ’vote’ of many judges. This project makes use of
Samangooei’s collection of categorical labels [19] for the subjects of the Southampton Gait Database
(SGDB, http://www.gait.ecs.soton.ac.uk/database/), which have been derived using this method.
Further work in the field of annotation revealed that using comparative descriptions in place of
categoric labels when identifying suspects from witness statements is more reliable [17] than
absolute labelling alone. The primary advantage is that it eliminates the bias of previous experience
(e.g. what a witness thinks is tall or short) by making the witness estimate if the suspect was
taller/shorter/slimmer/fatter than the subject shown. While this technique is particularly suited
to identifying an individual through iteratively narrowing down possibilities, it is less suited to iden-
tifying the individual categories of demographic features. Instead, the bias of the witness-generated
categories is minimised by taking the average of all witness statements when preparing the training
data.
3.3.3 Weighting Function
Modelling the correlation of principal components to semantic labels requires machine learning.
Given an n-dimensional vector of principal components \mathbf{p} and an expected category y, a model of
weights \mathbf{w} can be learned such that \mathbf{p} \cdot \mathbf{w} \approx y. Therefore, over the entire training set of m samples,
an error function can be defined as the squared sum of errors:

E = \|X\mathbf{w} - \mathbf{y}\|^2

where

X_{(m,n)} = \begin{pmatrix} p_{11} & p_{12} & \cdots & p_{1n} \\ p_{21} & p_{22} & \cdots & p_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ p_{m1} & p_{m2} & \cdots & p_{mn} \end{pmatrix} \qquad \mathbf{y}_{(m)} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix}
Perceptron Learning
One of the most common ways of learning weights is the perceptron training algorithm. The
principle is to adjust the weights iteratively until the error falls below a certain threshold:

\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla E
Linear Regression
Using the sum-of-squared-errors error function, linear regression offers a very simple (although
numerically unstable without a regularisation term) way of guessing the ideal set of weights for \mathbf{w}:

\mathbf{w} = (X^T X)^{-1} X^T \mathbf{y}
Radial Basis Function
The Radial Basis Function (RBF) regression algorithm improves upon linear regression. It performs
clustering, then uses some non-linear function \varphi(\alpha) on the distances from each cluster centre to map
the data onto new axes. From there, it is possible to find a linear classifier that models a non-linear
classifier on the real data, improving on both perceptron learning and linear regression:

\mathbf{w} = (\Phi^T \Phi)^{-1} \Phi^T \mathbf{y}

where

\Phi_{(m,n)} = \begin{pmatrix} \varphi(\|\mathbf{p}_1 - C_1\|) & \varphi(\|\mathbf{p}_1 - C_2\|) & \cdots & \varphi(\|\mathbf{p}_1 - C_n\|) \\ \varphi(\|\mathbf{p}_2 - C_1\|) & \varphi(\|\mathbf{p}_2 - C_2\|) & \cdots & \varphi(\|\mathbf{p}_2 - C_n\|) \\ \vdots & \vdots & \ddots & \vdots \\ \varphi(\|\mathbf{p}_m - C_1\|) & \varphi(\|\mathbf{p}_m - C_2\|) & \cdots & \varphi(\|\mathbf{p}_m - C_n\|) \end{pmatrix}

An example \varphi may be

\varphi(\alpha) = e^{-\alpha/\sigma^2}
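As a minimal sketch in the project's language, building \Phi reduces to evaluating \varphi on the distance
from each sample to each cluster centre (assuming the centres C have already been found, e.g. by
k-means; names here are illustrative):

    public class RbfFeatures {
        /** phi(alpha) = exp(-alpha / sigma^2), applied to distances from each cluster centre. */
        public static double[][] buildPhi(double[][] p, double[][] centres, double sigma) {
            double[][] phi = new double[p.length][centres.length];
            for (int i = 0; i < p.length; i++) {
                for (int j = 0; j < centres.length; j++) {
                    double sq = 0; // squared Euclidean distance ||p_i - C_j||^2
                    for (int d = 0; d < p[i].length; d++) {
                        double diff = p[i][d] - centres[j][d];
                        sq += diff * diff;
                    }
                    phi[i][j] = Math.exp(-Math.sqrt(sq) / (sigma * sigma));
                }
            }
            return phi;
        }
    }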
Neural Networks
Neural networks are another form of iterative learning in which multiple sigmoid-response perceptrons
are linked across multiple layers. Although training is much more complex and requires differential
calculus to solve, using the trained network is relatively fast. Unlike perceptron learning, which is limited
to linearly separable problems, neural networks are "capable of approximating any Borel measurable
function from one finite dimensional space to another" [9], which is comparable to the non-linear
attributes of RBF.
Chapter 4
Final Design and Justification
4.1 Technologies and Methods
4.1.1 Code Libraries and Project Setup
This project is written in Java 7 using the OpenIMAJ library described previously, for reasons of
prior experience with the language and object-oriented finesse. Maven is used for dependency
resolution. However, given some of OpenIMAJ's known flaws, choosing C or C++ would
have been beneficial for both performance and utility reasons.
4.1.2 Subject Location
Initially, the Viola-Jones algorithm seemed to be the ideal choice. It is well-used in computer vision
and trusted. However, preliminary tests indicated several issues:
1. The algorithm runs slowly on the large images in the training data (approximately
1900ms per full-scale image and 350ms per image when scaled to 800x600 pixels)
2. There were up to 15 false positives for each true positive (an example of which is shown in
figure 4.1)
3. For each true positive, only 60% of the bounding boxes contained the entire body (example
shown in figure 4.2)
Efforts to redeem the algorithm (detailed in section 5.2.2 on page 13) did not produce results
of a high enough standard, so a decision was made to revert to background subtraction
and silhouetting. Basic subtraction proved to be fast, but with too much noise to perform
any cropping. The Horprasert algorithm worked much better, but was slow and voluminous
in code. Better performance was achieved by using the "robust" algorithm of Kim et
al. [13] up until the labelling phase. This provides the same functionality as the Horprasert
algorithm, but working in HSI colour space rather than RGB gives a significant efficiency boost:
calculating the changes in luminance and the changes in saturation becomes much more intuitive.

Figure 4.1: A false positive
Figure 4.2: A true positive that has not been bounded correctly
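As a rough per-pixel sketch of that brightness/colour separation in plain Java (pixels and the
background expectation as RGB triples in double arrays; the full Horprasert algorithm also
normalises each channel by its background standard deviation, omitted here for brevity):

    public class BrightnessChromaticity {
        /** Brightness distortion: the scale of pixel i along the background expectation e. */
        public static double alpha(double[] i, double[] e) {
            double num = 0, den = 0;
            for (int c = 0; c < 3; c++) {
                num += i[c] * e[c];
                den += e[c] * e[c];
            }
            return den == 0 ? 0 : num / den;
        }

        /** Colour (chromaticity) distortion: distance from i to its brightness-scaled background. */
        public static double colourDistortion(double[] i, double[] e) {
            double a = alpha(i, e), sum = 0;
            for (int c = 0; c < 3; c++) {
                double d = i[c] - a * e[c];
                sum += d * d;
            }
            return Math.sqrt(sum);
        }
    }

Thresholding these two values separately is what allows shadows (low brightness, low colour
distortion) to be told apart from true foreground (high colour distortion).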
After background subtraction, assuming zero noise around the subject, the black-and-white mask
can be cropped to the smallest bounding box containing all the white pixels in the image. This
exclusively contains the subject and provides satisfactory alignment for Principal Component Anal-
ysis to work correctly. The mask can then be multiplied onto the image to extract the subject onto
a fully-black background before normalising the image.
4.1.3 Feature Extraction
Figure 4.3: Action Diagram for background removal: subtract background, take the largest connected component, trim, then apply the mask. Images credit: mzacha; RGBstock.com
Principal Component Analysis is used to extract feature
vectors from each image using OpenIMAJ’s EigenImages
class — an implementation of PCA for image sets.
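As a minimal sketch of this step (the component count and names are illustrative, and all inputs
are assumed to be equally sized greyscale images):

    import java.util.List;
    import org.openimaj.feature.DoubleFV;
    import org.openimaj.image.FImage;
    import org.openimaj.image.model.EigenImages;

    public class PcaStep {
        public static DoubleFV describe(List<FImage> training, FImage query) {
            EigenImages pca = new EigenImages(100); // keep the 100 largest components (illustrative)
            pca.train(training);                    // all images must share identical dimensions
            return pca.extractFeature(query);       // projection onto the principal components
        }
    }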
4.1.4 Labelling
Each feature (e.g. height, arm thickness) of each subject
is categorised by an enumerated type which represents a
number in the range [0, 6] — this is necessary to allow for
some of the more diverse categories, such as age, which
requires 7 categories. This is replaceable by an enumerated
category type for each class to resolve the issue of some
features requiring more categories than others. An added
benefit is improved comprehensibility by using labels such
as Age.Category.YOUNG rather than Category.LOW. In
the implementation, only generic categories are utilised,
but these class-based categories can be applied to
final results before outputting to the terminal.
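A sketch of such a class-based category (the bin names here are illustrative, not the project's
exact labels):

    public enum AgeCategory {
        INFANT, CHILD, TEEN, YOUNG, ADULT, MIDDLE_AGED, SENIOR; // 7 bins mapping onto [0, 6]

        public int value() {
            return ordinal(); // the numeric form consumed by the training algorithms
        }
    }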
4.1.5 Weighting Function
During preliminary training, benchmarking tests (described in section 5.4 on page 17) were
performed on each weighting function to discern the best-performing algorithm. Due to the
lack of a reliable neural network framework, there is no implementation of feed-forward
neural networks. In further work, it may be worthwhile writing the code to explore this option.
4.2 Processing Pipeline
Overall, the application will be trained and queried as follows:
Figure 4.4: Action Diagram for training the system. The pipeline: load training dataset; split dataset into training and testing subsets; preprocess images (extract subjects, crop and normalise); train PCA algorithm; analyse training set; learn weighting function; analyse testing set; check weighting function.
Figure 4.5: Action Diagram for querying the system. The pipeline: load footage; preprocess images (extract suspect, crop and normalise); analyse footage; apply weighting function; run query against database; return closest match(es).
Figure 4.6: Class Diagram for the various processing filters. An abstract SubjectProcessor decorator (processSubject, processImage, process, processAll, processAllInplace) implements the ImageProcessor interface and is extended by InputLimiter, SubjectNormaliser, SubjectResizer, SubjectTrimmer, and the abstract BackgroundRemover, which has BasicBackgroundRemover, HorprasertBackgroundRemover and TsukabaBackgroundRemover implementations (each holding a background model). SubjectVideoProcessor implements the VideoProcessor interface (processFrame) and wraps a SubjectProcessor.
Chapter 5
Implementation
5.1 Loading Training Data
Using OpenIMAJ’s library for datasets, all training and test data is grouped by the subject in the
image. For example, a dataset can contain 50 subjects, but multiple images per subject. This way,
it is possible to train with both side-view and frontal-view images from the gait database without
using complex iterators. Furthermore, OpenIMAJ keeps all images in datasets on disk until
they are needed; if all images were loaded at runtime, memory would be an issue.
In order to make training valid, it is important to choose random splits each time the system
is tested, which is done directly with OpenIMAJ’s group splitting class. Typically, the system is
trained with N − 20 subjects and tested with 20.
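A plain-Java sketch of the idea behind the subject-level split (this illustrates the approach, not
necessarily the exact OpenIMAJ splitter call the project uses):

    import java.util.ArrayList;
    import java.util.Collection;
    import java.util.Collections;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class SplitSketch {
        /** Hold out 20 subject IDs for testing; the remaining N - 20 subjects train. */
        public static Set<String> chooseTestSubjects(Collection<String> subjectIds) {
            List<String> ids = new ArrayList<String>(subjectIds);
            Collections.shuffle(ids); // a fresh random split on every run
            return new HashSet<String>(ids.subList(0, Math.min(20, ids.size())));
        }
    }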
5.1.1 Interpreting the GaitAnnotate Database
Since the demographics associated with each subject were kept in a separate MySQL database with
a table layout that was not ideally suited for this project, it was decided to migrate the database
into a new structure.
Initially, a database was set up using JavaDB/Derby to store the learned weightings. However,
it became clear that this was a heavy-weight solution for a light-weight problem so it was decided
to store weights using XML instead via the JAXB framework for simplicity and ease of manual
tweaking.
With use of a PHP script, an XML file was created for each subject in the training images contain-
ing the human-estimated categories for each feature.
Later in development, it became apparent that the querying engine requires speed and efficient
use of memory (R3), so XML could not be used to match subjects on the trained system.
Instead, the JavaDB solution was re-implemented using the same data as in the XML files, with
the added advantage of being able to use primary keys for searching, limiting the amount of data
required in memory at the time of execution. XML is still in use for the training data due to its
simplicity.
5.2 Image Processing
5.2.1 Limiting the Size of the Input
It is clear that the larger an image is, the more pixels need to be processed for subject extraction
and training. If an input image is very large (> 1000px for example), then a large amount of
processing time is wasted on insignificant details such as the buttons on a subject’s shirt when all
that’s really needed is enough resolution to identify demographics. The first processing filter is
therefore to limit the size of the image to a constant. In the default case, all images with a height
or a width greater than 800px are resized so the longest length is exactly 800px. Aspect ratio
remains the same.
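A sketch of this filter, assuming OpenIMAJ's ResizeProcessor width/height constructor (names
are illustrative):

    import org.openimaj.image.MBFImage;
    import org.openimaj.image.processing.resize.ResizeProcessor;

    public class InputLimiterSketch {
        /** Clamp the longest side to maxSide (e.g. 800), preserving aspect ratio. */
        public static MBFImage limit(MBFImage image, int maxSide) {
            int longest = Math.max(image.getWidth(), image.getHeight());
            if (longest <= maxSide) return image; // already small enough
            float scale = maxSide / (float) longest;
            int w = Math.round(image.getWidth() * scale);
            int h = Math.round(image.getHeight() * scale);
            return image.process(new ResizeProcessor(w, h));
        }
    }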
5.2.2 Background Removal
As described in section 3.2.1 on page 3, several algorithms were shortlisted to further remove un-
necessary details from the input. Work first started on implementing the Viola-Jones algorithm to
remove the bulk of the background. OpenIMAJ has built-in methods to run the algorithm, and
contains many Haar cascades for detecting different parts of the body.
The first runs of the algorithm picked up many false positives as described in section 4.1.2 on
page 7. To remedy this, the images were run through basic background subtraction to remove
as much of the background as possible. The still-image training data does not contain an image
of just the background itself, so one was derived using some image editing software. Background
removal resulted in fewer erroneous detections, but there were still as many as 15 false positives
per subject. In knowledge of the existence of a voting variant of Viola-Jones, the following basic
voting algorithm was implemented.
detected ← ViolaJones(image)
votes ← ∅
seen ← ∅
for all rectangle ∈ detected do
  if rectangle ∉ seen ∧ Area(rectangle) > 10000 then
    seen ← seen ∪ {rectangle}
    overlaps ← {rectangle}
    for all other ∈ detected do
      if other ∉ seen ∧ Overlapping(rectangle, other) ∧ Area(other) > 10000 then
        seen ← seen ∪ {other}
        overlaps ← overlaps ∪ {other}
      end if
    end for
    votes ← votes ∪ {overlaps}
  end if
end for
regions ← ∅
for all candidates ∈ votes do
  if |candidates| ≥ 3 then
    regions ← regions ∪ {MeanAverage(candidates)}
  end if
end for
return regions
This dramatically decreased the rate of false positives, and gave a more accurate subject boundary.
Figure 5.1 on the following page shows the result of running the voting algorithm. Taking the
median of voted images was also attempted, but with inferior results to taking the mean.
From here, edge detection was performed using the Canny operator to assist in isolating the bound-
ary of the subject — the top of their head, the bottom of their feet (although these are mostly
cropped out of the training images), and the sides of their shoulders.
However, it soon became apparent after running the application multiple times that using the
full-body cascades was too slow — much slower than using face cascades and upper-body cascades.
In some cases, a single image sized 579x1423px would take 20 seconds to process. Upon speaking
with OpenIMAJ’s author, Jonathon Hare, he advised that while the cascades were taken straight
from OpenCV, some cascades perform quite poorly compared to others in the library and "probably
need improving". In order to avoid the lengthy process of creating cascades, methods such as
subtraction and silhouetting became favourable.

Figure 5.1: An example result of using the above voting algorithm: (a) the candidate regions produced by the Viola-Jones algorithm; (b) the returned result
Second approach at Background Removal
The Horprasert algorithm immediately showed better results at background subtraction — but still
with a large amount of background being erroneously detected. The cause of the false positives
arose from the positioning of the camera in the training images. Although care had been taken
to minimise the variance of the images, occasionally the camera may have been kicked and the
treadmill moved — the repercussions of which are demonstrated in figure 5.2 on the next page.
Compared with video imagery, it is exceptionally difficult to work out the background of a still
image. Methods such as temporal median exist for video, but the backgrounds of still images must
be computed manually. To rectify the background issue, the training images were imported into
image editing software, and a script was run to automatically align and crop the images. The images
were cropped to not contain the treadmill, but as some subjects were standing behind the handles,
not all subjects fit fully into the image bounds (further experiments will need to be conducted to
examine whether this has adverse effects on the results). From this new aligned dataset, a suitable
background was derived. Figure 5.3 on the facing page shows the result of the Horprasert algorithm
using the new dataset.
The second background algorithm written by Kim et al. begins with a method very similar to
Horprasert’s, and this presented itself to be faster at the same job. However, the remainder of the
algorithm that includes labelling and silhouette extraction could not be implemented due to the
large amount of time OpenIMAJ’s default connected component labeller takes to execute on the
training images. Further work must be undertaken to make this possible.
Since the goal of background removal was to reveal a single silhouette mask encompassing most
of the subject's outline, a cropping algorithm was designed to remove all black areas, leaving
the subject's silhouette in full-frame.

Figure 5.2: Results of using background subtraction on two different training images — one of which is misaligned with the assumed background. The thresholds are manually guessed. (a) A correctly aligned input image; (b) result of the Horprasert algorithm; (c) an incorrectly aligned input image; (d) result of the Horprasert algorithm.

Figure 5.3: Results of using background subtraction on the cropped and aligned dataset: (a) a correctly aligned and cropped input image; (b) result of the Horprasert algorithm.
bounds ← ∅
for all pixel ∈ image do
  if pixel.value > 0.5 then
    if bounds = ∅ then bounds ← (x: pixel.x, y: pixel.y, width: 0, height: 0)
    end if
    if pixel.x < bounds.x then bounds.x ← pixel.x
    else if pixel.x − bounds.x > bounds.width then bounds.width ← pixel.x − bounds.x
    end if
    if pixel.y < bounds.y then bounds.y ← pixel.y
    else if pixel.y − bounds.y > bounds.height then bounds.height ← pixel.y − bounds.y
    end if
  end if
end for
The Deprecation of Background Removal
While use of both Horprasert [10] and Tsukaba [13] algorithms seemed to effectively remove the
background in the test examples, Horprasert was not robust enough to remove all non-subject
areas which prevented the cropping algorithm from working as planned. The Tsukaba algorithm
incorporates a stage for connected component labelling which mitigates this issue, but sadly the per-
formance of OpenIMAJ’s labeller was far too slow for realistic use. The feasibility of this algorithm
was later reduced when it was noted that the novel ’elastic’ borders would be almost impossible to
process at an acceptable speed in Java — only a native C compiled library would be realistic.
In order to remain on-track with the crucial research-based components of the project, background
removal was deprecated in favour of manually cropping the images and keeping the background
as the solid green screen from the laboratory. While this means the system will not work in non-
controlled conditions, it still serves as a convincing proof-of-concept with the potential to work on
more ambitious footage given a fast and optimal background removal implementation.
5.2.3 Processing Video
It was initially intended that the project could take videos as an input rather than singular still
images. Using Temporal Median (or Temporal Mode to eliminate the need for expensive sorting
operations), a representative background could be generated autonomously for use in image seg-
mentation provided there is enough movement in the scene. Another algorithm was discovered
offering a fast implementation of Temporal Median [11], but since this method relies on pixel values
being in the relatively low greyscale range of [0, 255], the 16 million possible values of an RGB
image made this algorithm unlikely to show any benefit (although this theory has not been tested).
Instead, the next best alternative was to use the Quickselect method for finding the median [21].
To further reduce processing time, the background image could be used to remove all frames with no
subjects present. This is simple to achieve by setting some threshold of foreground-to-background
pixels in the subtraction mask, or requiring that the primary connected component has a sufficiently
large area.
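A plain-Java sketch of the temporal-median idea on greyscale frames (a full sort per pixel for
clarity; Quickselect [21] would avoid the sorting cost):

    import java.util.Arrays;

    public class TemporalMedianSketch {
        /** frames[t][y][x] in [0, 1]; returns the per-pixel median as background[y][x]. */
        public static float[][] background(float[][][] frames) {
            int n = frames.length, h = frames[0].length, w = frames[0][0].length;
            float[][] bg = new float[h][w];
            float[] history = new float[n];
            for (int y = 0; y < h; y++) {
                for (int x = 0; x < w; x++) {
                    for (int t = 0; t < n; t++) history[t] = frames[t][y][x];
                    Arrays.sort(history); // the sort Quickselect would replace
                    bg[y][x] = history[n / 2];
                }
            }
            return bg;
        }
    }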
Unfortunately, certain problems emanated when attempting to load video footage. The Southamp-
ton Gait Database contains a very large repository of videos for each subject in the GaitAnnotate
database. All videos are encoded as raw DV footage which should be simple enough to process.
Since OpenIMAJ uses a library that relies on FFMPEG (https://www.ffmpeg.org/), a well-known and respected video codec
pack, there shouldn't be any issues. Despite this, there appears to be either some level of corruption
in the headers of the files or a bug that causes FFMPEG to load the frames, but skip all metadata
required for seeking which is essential for preparing video filters.
The only option left was to load the frames as images and work with them thus, but to train
the system this way is impossible without the hideous amounts of computer memory required to
hold each image, and the principal component data that results. It remains possible to query the
trained system with image sequences however.
5.3 Component Analysis
Despite not having completed a robust method of background subtraction, Principal Component
Analysis could still be performed on the raw cropped images of the training set, on the basis that
the background should be represented by the least significant components. As PCA requires the
inputs to have the same number of rows and columns, a normalising class was written to resize
each input image to exactly the same size without losing any of the original image (padding is
added to the outside if a dimension is too short). The PCA interfacing class directly invokes
OpenIMAJ’s EigenImages class, but also allows for Java serialization so it can be stored for later
use — an important time saving method for rapid training and a necessity for performing PCA on
any successive queries.
5.4 Training Algorithms
To gauge the success of a training algorithm, an error function of the total ’distance’ between the
features of each subject (f) and the features of the estimated subject (g) was devised such that

\text{distance} = \sum_{i=1}^{\text{num features}} |f_i - g_i|
When choosing the best algorithm to use, a set of tests were run for each algorithm using a 50/50
split of training to testing with 58 subjects in each group, and using both frontal and side-view
images.
Testing began with a perceptron learning algorithm to iteratively converge upon ideal weights for
each principal component using gradient descent. The weights are initialised as a uniform random
guess, and updated as described earlier using the gradient \nabla E = 2X^T(X\mathbf{w} - \mathbf{y}). After adjusting
the learning rate and iteration count, an overall error of ≈ 80 at \eta = 10^{-5} and 1000 iterations was
achieved. By comparison, a random guess produced an error of ≈ 115. A bias term was then added
to the data (an extra input with a constant value of 1) to minimise bias error, but this made no
significant difference to gradient descent.
This means that using perceptron-trained weights is not much better than random guessing, but
it is most certainly an improvement. Higher iteration counts were also tried, up to 10,000, but
this seemed to over-fit the data: error rates increased. In hindsight, using a validation set in
addition to training and testing could have prevented this and attained better results at higher
iteration counts.
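A minimal plain-Java sketch of this batch gradient-descent update (zero-initialised weights for
brevity, where the project used a random start; names are illustrative):

    public class GradientDescentSketch {
        /** Minimise E = ||Xw - y||^2 via w <- w - eta * 2 X^T (Xw - y). */
        public static double[] train(double[][] X, double[] y, double eta, int iters) {
            int m = X.length, n = X[0].length;
            double[] w = new double[n];
            for (int it = 0; it < iters; it++) {
                double[] grad = new double[n];
                for (int i = 0; i < m; i++) {
                    double err = -y[i];
                    for (int j = 0; j < n; j++) err += X[i][j] * w[j]; // (Xw - y)_i
                    for (int j = 0; j < n; j++) grad[j] += 2 * err * X[i][j];
                }
                for (int j = 0; j < n; j++) w[j] -= eta * grad[j];
            }
            return w;
        }
    }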
Linear regression was implemented next using the formula described earlier. The average error
was slightly higher, ≈ 83, but the algorithm took much less time to train than the perceptron
(4ms on average for each feature of 58 subjects, compared to 80ms). After further optimisation by
adding a regularisation term \lambda I for variance error and a bias term as described above, the error
decreased to ≈ 45, which is much more realistic, but still not very useful.
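The regularised closed form \mathbf{w} = (X^T X + \lambda I)^{-1} X^T \mathbf{y} can be sketched as follows, assuming the
Jama matrix library is on the classpath (an assumption; OpenIMAJ's maths modules build on it):

    import Jama.Matrix;

    public class RidgeSketch {
        public static Matrix solve(Matrix X, Matrix y, double lambda) {
            int n = X.getColumnDimension();
            Matrix A = X.transpose().times(X)
                        .plus(Matrix.identity(n, n).times(lambda));
            return A.solve(X.transpose().times(y)); // solves A w = X^T y directly
        }
    }

Solving the linear system rather than forming the inverse explicitly is the numerically safer choice
for the instability mentioned above.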
The final algorithm tested was the Radial Basis Function (RBF). After multiple failures and sev-
eral test cases that were statistically worse than random (error ≈ 140), results were achieved in the
range (30, 40) with the a priori variable α = 5. By manually adjusting α, it was found that a value
of 20 produces the most accurate and consistent results.
During development, it was conceived that there should be separate weightings for frontal-view
and side-view images to classify either style more accurately. However, this proved to be a bad
idea, as real CCTV footage won't guarantee a frontal or side view, but rather a range of oblique
angles. Training is therefore best done with multiple angles in an effort to reduce generalisation
error.
5.4.1 Further Improvements
Shortly after the initial algorithm tests, a bug was discovered in the implementation of finding the
total distance between subjects which meant scores were more than double what they should have
been. After fixing this issue, along with some smaller discrepancies, gradient descent reduced to
an average distance of ≈ 36, linear regression to ≈ 8, and RBF to ≈ 7 using 50/50 training-to-testing
splits. It was noted at this point that linear regression is definitely a contender for the final solution
due to its speed and accuracy. Despite this, since RBF adds little extra time to training for a small
decrease in distance, it is still preferred. For larger training sets in more extreme conditions, it
is quite possible that the non-linear training algorithm will have distinct advantages over linear
regression due to the flexibility of the model.
5.5 Storing Training Data
5.5.1 Heuristics
Since the JAXB library was already being used for loading demographic data, using XML to store
trained heuristics seemed a logical solution. The implementation is a trainable Heuristic class
with a subclass for each body feature. A storable JAXB version of a Heuristic was then created
with containers for the class name, the weightings map, and any serialized data that the training
algorithm may need to set up again such as centroid data for the RadialBasisFunctionTrainer.
This method proved to be effective when debugging as individual weights could be manually tweaked
with a standard text editor, and changes are easier to notice.
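A sketch of the XML round-trip for a trained heuristic via JAXB (the field names here are
illustrative, not the project's exact schema):

    import java.io.File;
    import javax.xml.bind.JAXBContext;
    import javax.xml.bind.annotation.XmlRootElement;

    @XmlRootElement
    public class StoredHeuristic {
        public String heuristicClass;  // e.g. the Heuristic subclass name
        public double[] weights;       // the learned weighting vector
        public byte[] serializedState; // e.g. centroid data for the RBF trainer

        public static void save(StoredHeuristic h, File f) throws Exception {
            JAXBContext.newInstance(StoredHeuristic.class)
                       .createMarshaller().marshal(h, f);
        }
    }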
5.5.2 Principal Component Data and Training Sets
Since Principal Component Analysis is a costly procedure, but the results are reusable, heuristic
training times could be reduced by caching both the principal component data and any generated
training sets (containing mappings of component data to categories) to disk. Since this data is not
editable, it is simply serialized using Java serialization.
5.5.3 Query Engine
With a trained set of heuristics, one or more images should produce an estimation for each body
feature, and ultimately, a guess of whom the subject may be. In order to achieve the latter, a
database needed to be re-implemented to avoid loading every single subject’s XML file separately
for each query as it would need to find the subjects with the closest matching features.
The database is SQL-based which makes querying relatively simple. Stored procedures and func-
tions were considered to calculate the distance between a subject in the database and a suspect
probe, but rejected in favour of dynamically building a SQL statement which takes the following
format (where question marks are replaced with the respective values of the probe’s features):
SELECT id, SUM(ABS(`age` - ?) + ... + ABS(`weight` - ?)) AS distance
FROM subjects
GROUP BY id
ORDER BY distance ASC
FETCH FIRST 5 ROWS ONLY
This produces a list of the top 5 matching subjects with their total distance in ascending order.
This can be used directly to identify the suspect, or to narrow down the possibilities in a wider
search.
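A hedged JDBC sketch of issuing that query (the embedded Derby connection URL and names
are illustrative; "sql" is the statement above with one placeholder per feature):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class QuerySketch {
        public static void topMatches(String sql, int[] probeFeatures) throws Exception {
            try (Connection c = DriverManager.getConnection("jdbc:derby:subjects");
                 PreparedStatement ps = c.prepareStatement(sql)) {
                for (int i = 0; i < probeFeatures.length; i++)
                    ps.setInt(i + 1, probeFeatures[i]); // one '?' per feature category
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next())
                        System.out.println(rs.getString("id") + " -> " + rs.getInt("distance"));
                }
            }
        }
    }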
Chapter 6
Results and Evaluation
6.1 Ability to Estimate Body Demographics
6.1.1 How to Measure Success
In this project, the success of recognition is measured by three metrics. Firstly, the percentage
accuracy for a particular feature. An example is age, which has seven categories. The accuracy of
the system for a feature is determined by the number of correct categorisations divided by the total
number of subjects used for testing.
Note that the system can either guess right or wrong — the distance from the correct category is
not taken into account. For an entirely random guess approach on a sufficiently large dataset of
test subjects, the accuracy for the seven-category age feature will be 100/7 = 14.29%.
The second metric, as described previously, is the total ’distance’ between the guessed demographics
(g) and the actual demographics (f) over all body features, such that:

\text{distance} = \sum_{i=1}^{\text{num features}} |f_i - g_i|
This metric does allow for guesses to be one or two categories out which gives a clearer overall
picture of how close the match is.
The final metric is the index of the actual subject when querying the database for most-probable
subjects. For example, if the system guesses the suspect’s body demographics slightly wrong, it
may mean that the closest match in the database is not the correct person, and instead the correct
subject is ranked as the 6th
most probable.
6.1.2 Results on Test Subjects
As mentioned with the implementation of different training algorithms, the best results were ob-
tained using a Radial Basis Function φ(α) with α = 20. The results in table B.1 on page 40 show
the performance of the RBF trainer as a function of the average percentage of correct estimations
in comparison to the ’expected’ accuracy and two random techniques — the median category, and
a random category between 1 and the maximum value within the training set. For this set of tests,
the training set was kept the same. Typically, the correct classification rate for a particular body
feature is 72 ± 3.8%. Some features are particularly accurate (e.g. 90% for Proportions), and some
are close to useless (namely 55.25% for Skin Colour, and 10.5% for Ethnicity).
It is safe to assume that human demographics approximate a Gaussian distribution for each
continuously measured biometric, so it is not surprising that choosing the median category yields
better results than choosing a random category between sensible limits. However, both random
methods produced results significantly worse than the informed method, which validates that it is
possible to use principal component analysis to estimate body demographics.
Figure 6.1: The correct classification rates of each biometric for the RBF method, shown as percentage accuracy (0-100) for all 23 features: Age, Arm Length, Arm Thickness, Chest, Ethnicity, Facial Hair Colour, Facial Hair Length, Figure, Hair Colour, Hair Length, Height, Hips, Leg Length, Leg Shape, Leg Thickness, Muscle Build, Neck Length, Neck Thickness, Proportions, Sex, Shoulder Shape, Skin Colour, Weight.
Algorithm   Average Distance   % Recalls within top 5
RBF         7.2                51.75%
Random      26.0               5.0%
Median      10.3               5.0%
It is clear that while the difference in average distance between RBF and Median algorithms is
relatively small, the slight advantage of RBF yields a large increase in the percentage of correct
recalls (defined as the proportion of test subjects that were recalled as one of the top 5 most likely
from the database). The relative success of guessing with the median indicates the category setup
is not unique enough to separate individuals — this is explored in further detail in section 6.3 on
page 22. Comparison box-and-whisker plots for the accuracy of each algorithm are shown in full
detail in figure B.1 on page 42.
The effects of occluded body features
Surprisingly, it seems that the difference in accuracy between frontal and side-views when estimat-
ing categories is minimal. The expectation is for features such as hips, chest and shoulder shape
to become less accurate as they can no longer be easily determined. However, it seems that the
proposed method infers information from alternate variances that aren't affected by the view angle.
In the example shown in figure 6.2 on page 26, a subject (not used in training) has been queried
against the known information in the database for both frontal and side views. However, the dif-
ferences are minor and unrelated to the aforementioned categories that should be affected in the
side image. This phenomenon may be of great help when identifying suspects at odd angles where
humans are unable to determine certain traits.
Difficulties and Limitations with Training Data
It is apparent from the results in table B.1 on page 40 that certain body features are more easily
detectable than others. The likely explanations are the following:
1. Features such as ethnicity do not have a natural order or scale. Far-Eastern and Black for
example are next to each other by their categorical order, yet they share few similarities. This
effect can reduce the accuracy of classification.
2. Subjects in the training set are predominantly male and have normal proportions which makes
guessing by median more accurate than informed estimation for certain characteristics.
3. The dataset is limited in size — with just 198 images to learn from.
4. The training data is written by humans in a consensus approach [19] which means while
an individual’s error is overruled, the entire group of annotators can be biased towards a
particular category based on their experience — especially if the annotators are mostly white
British males in their early twenties.
In addition to these, there are issues with the images themselves such as body parts being obstructed
by a treadmill, poor exposure and white balance issues leading to skin colour looking darker than
normal. It is entirely plausible that a more comprehensive, higher quality training set could greatly
augment the trained system — a suggestion that should be taken into account in future research.
6.1.3 Robustness
The most surprising results come with robustness testing which surpassed expectations. In par-
ticular, the system appears to be strong against resolution, quality, and noise constraints as the
following tests show:
Test                         Total Distance   Recall Rank
Normal Image (subject 010)   5.0              10/116
+10% Uniform Noise           5.0              10/116
+20% Uniform Noise           5.0              10/116
+30% Uniform Noise           5.0              10/116
+40% Uniform Noise           6.0              21/116
+50% Uniform Noise           7.0              48/116
+60% Uniform Noise           7.0              48/116
+70% Uniform Noise           8.0              56/116
+80% Uniform Noise           10.0             67/116
+90% Uniform Noise           11.0             69/116
+100% Uniform Noise          11.0             69/116
(a) Response in accuracy as noise levels increase (100% noise = every pixel differs from its original)

Test                         Total Distance   Recall Rank
Normal Image (subject 053)   5.0              1/116
50% Resolution               5.0              1/116
25% Resolution               6.0              1/116
12.5% Resolution             6.0              1/116
6.25% Resolution             6.0              1/116
3.13% Resolution             11.0             64/116
1.57% Resolution             15.0             107/116
(b) Response in accuracy as resolution decreases (original size = 579x1423px; 72dpi)

Test                         Total Distance   Recall Rank
Normal Image (subject 053)   5.0              1/116
Lowest JPEG Quality          5.0              1/116
100 Colour GIF               5.0              1/116
20 Colour GIF                6.0              1/116
(c) Response in accuracy as a result of image compression

Table 6.1: Results of degraded image quality tests
Though these tests are limited to a select few subjects (as each image requires manual editing),
repeating the tests with different subjects and training sets yields similar results. These are
crucially important findings, as they show a strong invariance to factors commonly associated with
CCTV footage, which would aid deployment using existing inexpensive camera equipment. By
these findings, even a cheap webcam could be used as a state-of-the-art demographic estimator,
whereas other methods such as gait recognition are more susceptible to resolution and frame rate [5].
To showcase the importance of these results, figure 6.3 on page 27 shows the input images for
6.25% resolution and 20 colour GIF.
Subroutine Execution Time
Image Loading 2.565s
PCA Analysis 170.9s
Training Set Generation 81.99s
Training (RBF) 15.48s
Total (Complete) 271.1s
Total (Cached) 18.22s
Table 6.2: Performance of Training Engine
Subroutine Execution Time
Image Analysis 0.367s
Categorisation 0.072s
Subject Matching 0.150s
Total 1.237s
Table 6.3: Performance of Query Engine on Desktop Computer
6.2 Performance
6.2.1 Training
Training performance depends on whether prior training has taken place. Since gathering principal
component data and creating training sets takes a long time and are both reusable, they are cached
for future use. On a desktop computer (Intel Core i7 2600K @ 3.6GHz; 8GB 1600MHz RAM;
7200RPM HDD; Java 8), table 6.2 shows the mean execution time for each on the GaitAnnotate
database with 198 images.
6.2.2 Querying
For the same desktop computer, the mean execution times for taking a single image and identifying
a list of 10 potential matches are shown in table 6.3. The code was also run on a Raspberry Pi 2
Model B (ARM Cortex-A7 quad core @ 900MHz; 1GB RAM; microSD; Java 8), the results
of which are shown in table 6.4.
Although subject matching is relatively fast when performing individual queries, multiple queries
made less than one second apart can form a queue on the database which can delay queries to up
to 20 seconds each if many thousands of requests are made.
6.3 Viability of Use as a Human Identification System
While the estimation of demographics is certainly an effective breakthrough, there appears to be
little hope of the current system becoming a way to identify masked criminals. With the small
population of 116 subjects in the database, there is already a significant overlap of ’average’ people.
This has meant that the fully trained system can only manage to retrieve a subject in the top 5
matches 50-60% of the time, as opposed to retrieving the correct subject 80-100% of the time, which
would be more realistic. Notwithstanding, random estimation has an average top-5 recall rate of 5%,
so the result is certainly significant, if not ideal.
Subroutine Execution Time
Image Analysis 9.489s
Categorisation 1.876s
Subject Matching 7.216s
Total 12.60s
Table 6.4: Performance of Query Engine on Raspberry Pi
Despite having a low recall rate, the proposed system may decrease the amount of searching re-
quired if the system could narrow down the possible candidates in a man-hunt from thousands to
just a few hundred.
The system can also be used to simply aid with witness descriptions. The labels in table 6.4b
on page 27 were generated from a single image of ’Jihadi John’ — the masked murderer of the
Islamic State — which seems to identify features with respectable accuracy despite the system not
being trained for backgrounds other than green baize. While some features are misguided (e.g. skin
colour), the majority of estimations are certainly fitting. The unusual proportions, short legs, and
small figure are explained by the legs being cropped out.
While this system was trained to use discrete categoric labels for identification, it may be pos-
sible to use comparative labels described in section 3.3.2 on page 4, whereby metrics are given as
relative to other subjects (e.g. taller, fatter), with the intention of reducing the amount of conflicts
by increasing the uniqueness of each database entry, ultimately leading to a system that can more
accurately identify a single person.
6.4 Background Invariance
Though the training data is limited to laboratory conditions, a limited set of tests, including the
annotation of ’Jihadi John’, indicates a slight invariance to background images that may render
background subtraction obsolete if results improve when using training data with non-uniform,
indoor and outdoor backgrounds. See figure B.2 on page 43 for the results of trying to classify an
individual standing outside, which show mostly correct or reasonable categories. It is interesting
to note that both frontal and side-on views under the same lighting produce more-or-less the same
results. This further validates the decision to use a single set of heuristics for all view angles in
section 5.4 on page 17.
6.5 Evaluation against Requirements
ID    Pass/Fail   Comments
FR1   Pass        The system resizes colour images to a constant dimension before processing
FR2   Fail        Video footage could not be loaded successfully using OpenIMAJ
FR3   Pass        Both the training and querying engines require no user input other than the number of matches to retrieve and the images to use
FR4   Fail*       *While background removal was not successfully implemented, tests in table 6.4b on page 27 and figure B.2 on page 43 demonstrate the possibility that the background could be rendered insignificant given a comprehensive training set
FR5   Pass        Heuristics are stored as XML; principal component data is stored in serialized form
FR6   Pass        XML files for each subject contain training inputs, and the query database contains all required information to find matches derived from the GaitAnnotate project
FR7   Pass        Example: table 6.4b on page 27
FR8   Pass        The system produces the n top matching subjects from the database when using the query engine
R1    Pass        The system uses statistical analysis for estimations
R2    Pass        See section 6.1.3 on page 21
R3    Pass        A single query, with the overhead of loading principal component data and heuristics, takes an average of 5.9 seconds on a standard desktop computer
R4    Pass        See table B.1 on page 40
R5    Pass        Average recall within top 5 for the Radial Basis Function is between 50% and 60%; average recall for random or median is less than 10%
6.6 Evaluation against Other Known Techniques
6.6.1 As Demographic Estimation
There has been previous research this domain of soft-biometrics which offers some comparison to
the performance of this project. The comparisons will be made mostly against research conducted
by Hu Han et al. [7] and their ”biologically inspired” framework which includes the performance of
human estimation for the demographics of age, gender and race.
However, this is as far as evaluation of demographic estimation can extend — nobody has yet
published a generalised form that is not limited to a select few categories. This project is not only
able to identify twenty-three different traits, but it may also use images of whole bodies, which is
more suitable for low-quality CCTV — a domain where research of this kind is most important.
In terms of simplicity, this project also takes a more general and malleable approach compared
to facial processing with biologically inspired features [7], facial surface normals [24], and Active
Appearance Models (AAM) [4], all of which require human faces as training images. By contrast, the
proposed method makes use of natural variances in any image (not necessarily that of a human)
and classifies with any regression-based or classification-based machine learning algorithm (RBF,
linear regression, neural networks), resulting in a more generic framework which can be built on
and improved for greater accuracy across a limitless number of demographic features.
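To illustrate the kind of abstraction this implies (the interface below is illustrative only, not the project's actual API), any such learner can sit behind a single training contract:

// Illustrative only: any regression-based learner can be swapped in
// behind a common training contract of this shape.
public interface WeightingFunctionTrainer {
    /** Learns a weight vector w such that Xw approximates y. */
    double[] train(double[][] X, double[] y);
}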
Age Estimation
Because age was estimated with an absolute figure rather than in categories in [7], it is not directly
comparable to the technique presented in this project. Nevertheless, Han et al. achieved a mean
absolute error of 3.8 ± 3.3 years on the MORPH II facial image database (without Quality
Assessment), compared to 6.2 ± 4.9 years for human estimation. By comparison, assuming each of
the 7 categories represents a range of 10 years, this solution obtains a mean absolute error of
4.3 ± 0.9 years, which by the research presented above is potentially better than human estimation.
This measure is flawed, however, as it assumes correct categorisation carries an error of exactly
0 years (and 10 years for a one-category difference).
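As a minimal sketch of how the quoted figure is interpreted here, assuming each category spans exactly 10 years (the class and method names are illustrative):

// Illustrative: mean absolute error in years from categoric age estimates,
// under the assumption that each of the 7 categories spans exactly 10 years.
public class AgeErrorExample {
    public static double meanAbsoluteErrorYears(int[] predicted, int[] actual) {
        double total = 0;
        for (int i = 0; i < predicted.length; i++) {
            // One category of difference is counted as 10 years of error.
            total += Math.abs(predicted[i] - actual[i]) * 10.0;
        }
        return total / predicted.length;
    }
}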
Gender Classification
For the gender demographic, Han et al. achieved an accuracy of 97.6%, while Guodong Guo and
Guowang Mu's method [6] achieves 98.5% on the MORPH II dataset. Both are higher than the
human performance on this dataset, which is 96.9%, and both exceed the accuracy of this project,
which has a mean accuracy of 85%. The decrease in accuracy could be a result of using a relatively
small dataset for training — just 198 training images, compared to the 2,000 images in the
MORPH II dataset used by the majority of papers on demographic estimation from facial images.
Race Classification
Han et al. classify race as black or white only, with an accuracy of 99.1% compared to the human
performance of 97.8% on the MORPH II dataset; Guodong Guo and Guowang Mu achieve an
accuracy of 99.0%. In contrast, this project performs poorly when judging both skin colour and
race: 55.25% for the former and 10.5% for the latter. The most likely reason is a mix of incorrect
data in the database (some samples are clearly European, yet labelled as 'Far-Eastern' or 'Other')
and poor colour balance in the photos, which can make some subjects appear over-saturated and
thus more tanned. Race and skin colour are also highly subjective categories: a white person may
think an Indian is black, while a black person might say they are white. Given such conflicts, it is
understandable that some of the training data is inconsistent.
6.6.2 As Human Identification
Gait biometrics has been an active area of research for human identification since the 1990s and
has proved to be a viable and robust method. Goffredo et al. have compiled a report on the
performance of gait recognition with the techniques in use at the University of Southampton [5].
In the report, the recognition rates for gait techniques peak at 96% on a database of 12 subjects
over 275 video sequences, tailing off to circa 52% for more acute view angles. The proposed method
is evidently not suitable for identification in its current state, with a mean recall rate of 51.75%
for matches within the top-5 retrievals. If recall is restricted to the top-ranking match from
the database exclusively, it decreases to 29.75% — implying that most successful top-5 recalls are
in fact the top-scoring result.
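The recall measure used here can be sketched as follows, assuming the query engine returns a ranked list of subject identifiers per query (the names and types below are illustrative, not the project's actual API):

// Illustrative: recall@k — the fraction of queries whose true subject
// appears within the top k retrievals returned by the query engine.
import java.util.List;

public class RecallAtK {
    public static double recallAtK(List<List<String>> rankings,
                                   List<String> groundTruth, int k) {
        int hits = 0;
        for (int i = 0; i < rankings.size(); i++) {
            List<String> ranked = rankings.get(i);
            List<String> topK = ranked.subList(0, Math.min(k, ranked.size()));
            if (topK.contains(groundTruth.get(i))) {
                hits++;
            }
        }
        return (double) hits / rankings.size();
    }
}

With k = 5 this measure gives the 51.75% figure above; with k = 1 it gives 29.75%.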
[Images (a) and (b), showing the subject with different body parts occluded, are omitted]
Feature Image A Image B Target
Age 4 / 5* / 5
Arm Length 3* 3* 3
Arm Thickness 2* 2* 2
Chest 3* 3* 3
Ethnicity 3 3 4
Facial Hair Colour 1* 1* 1
Facial Hair Length 1* 1* 1
Figure 3* 3* 3
Hair Colour 2 2 1
Hair Length 3* / 4 / 3
Height 4* / 3 / 4
Hips 2* 2* 2
Leg Length 3* 3* 3
Leg Shape 2* 2* 2
Leg Thickness 2 / 3* / 3
Muscle Build 2* 2* 2
Neck Length 3* 3* 3
Neck Thickness 3* 3* 3
Proportions 1* 1* 1
Sex 2* 2* 2
Shoulder Shape 3 3 4
Skin Colour 3* / 2 / 3
Weight 3* 3* 3
(c) Per-feature categories for each of the above images, and the actual category targets. Asterisks mark
correct estimations, slashes mark differences between A and B estimations.
Figure 6.2: The effects of occluded body parts on categoric estimations
(a) Subject 053 at 6.25% resolution (b) Subject 053 as a 20 colour GIF
Figure 6.3: Reduced quality images used for queries
(a) A freeze-frame of ’Jihadi John’
Feature Category
age MIDDLE AGED
armlength LONG
armthickness THICK
chest VERY SLIM
ethnicity OTHER
facialhaircolour NONE
facialhairlength NONE
figure SMALL
haircolour DYED
hairlength MEDIUM
height TALL
hips NARROW
leglength SHORT
legshape VERY STRAIGHT
legthickness AVERAGE
musclebuild AVERAGE
necklength VERY LONG
neckthickness THICK
proportions UNUSUAL
sex MALE
shouldershape ROUNDED
skincolour WHITE
weight THIN
(b) ’Jihadi John’s estimated labels
Chapter 7
Summary and Conclusion
This project details a novel, effective, and noise-invariant way of estimating body demographics
through computer vision. While the demographics themselves cannot be used as an identification
process in their current form, the estimation process can greatly assist in a broad range of
applications such as witness descriptions at crime scenes, targeted advertising for passers-by, and
even keeping track of wildlife (there is no reason in principle why the system cannot be trained
with images of animals).
Further research in this area may allow for multiple subjects in an image, improved accuracy,
improved speed, and greater uniqueness and differentiability in the demographics used (perhaps
through comparative labels), potentially enabling use as an identification mechanism to
automatically find people across an array of CCTV cameras in real time.
Since the system produced respectable results despite the background being present in the training
images, there is a possibility of utilising pedestrian detection techniques such as patterns of motion
and appearance [23], given a suitably fast implementation (which may require native C code), to
train on extracted, low-resolution subjects without requiring background removal. If training is
successful across a wide range of camera angles, then in principle any CCTV camera could be used
to estimate body demographics.
7.1 Future Work
7.1.1 Migrating to C++ and OpenCV
While Java and OpenIMAJ have produced a clear and coherent object-orientated solution, a
combination of bugs in OpenIMAJ and speed issues with Java indicates that migrating to C or C++
with OpenCV would be beneficial.
7.1.2 Dataset Improvements
As mentioned in section 6.1.2 on page 20, there are some issues with the dataset used. A more
comprehensive dataset of more than 1,000 images, spanning indoor and outdoor environments and
including some subjects wearing masks or other items of clothing that would inhibit other methods
of recognition, would be ideal for training a system that works in the real world.
7.1.3 Use of Comparative Labels
While categoric labels have been useful in determining categories for individual body features,
the comparative labels described in section 3.3.2 on page 4 could increase recall rates.
7.1.4 Feature Extraction
There exists research on choosing the correct number of principal components for optimal
compression [2], [12], including trial and improvement to adequately represent the human body for
these experiments. This was not explored here because the Radial Basis Function requires at least
as many samples as the number of clusters, which limits the number of principal components to the
quantity of images in the database. This project uses a constant 100 principal components;
improvements may be observed when using more.
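A common stopping rule from the cited literature retains the smallest number of components explaining a fixed share of the total variance. A minimal sketch, assuming eigenvalues are available sorted in descending order (the class and method names are illustrative):

// Illustrative: choose the smallest n whose leading eigenvalues explain
// at least `threshold` (e.g. 0.95) of the total variance.
public class ComponentSelector {
    public static int componentsForVariance(double[] descendingEigenvalues,
                                            double threshold) {
        double total = 0;
        for (double ev : descendingEigenvalues) {
            total += ev;
        }
        double running = 0;
        for (int n = 0; n < descendingEigenvalues.length; n++) {
            running += descendingEigenvalues[n];
            if (running / total >= threshold) {
                return n + 1;
            }
        }
        return descendingEigenvalues.length;
    }
}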
7.1.5 Weighting Functions
As discussed in section 4.1.5 on page 8, neural networks were not implemented due to the
complexity of setting them up. There is a realistic possibility that neural networks would
outperform RBF in noise, lighting, and background invariance, given the representational power of
feed-forward networks with a single hidden layer [9].
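As a sketch of the proposed direction only (weight learning by backpropagation is omitted, and all names are illustrative), the forward pass of a single-hidden-layer feed-forward network mapping principal components to a category score could look like:

// Illustrative: forward pass of a single-hidden-layer feed-forward network.
// hiddenWeights[h] holds the weights for hidden unit h plus a trailing bias;
// outputWeights holds one weight per hidden unit plus a trailing bias.
public class FeedForwardSketch {
    private final double[][] hiddenWeights;
    private final double[] outputWeights;

    public FeedForwardSketch(double[][] hiddenWeights, double[] outputWeights) {
        this.hiddenWeights = hiddenWeights;
        this.outputWeights = outputWeights;
    }

    private static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    /** Maps a principal component vector to a single category score. */
    public double predict(double[] components) {
        double[] hidden = new double[hiddenWeights.length];
        for (int h = 0; h < hidden.length; h++) {
            double sum = hiddenWeights[h][components.length]; // bias term
            for (int i = 0; i < components.length; i++) {
                sum += hiddenWeights[h][i] * components[i];
            }
            hidden[h] = sigmoid(sum);
        }
        double out = outputWeights[hidden.length]; // bias term
        for (int h = 0; h < hidden.length; h++) {
            out += outputWeights[h] * hidden[h];
        }
        return out;
    }
}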
Bibliography
[1] Niyati Chhaya and Tim Oates. Joint inference of soft biometric features. In Biometrics (ICB),
2012 5th IAPR International Conference on, pages 466–471. IEEE, 2012.
[2] Ralph B. D'Agostino and Heidy K. Russell. Scree Test. John Wiley & Sons, Ltd, 2005.
[3] S. Denman, C. Fookes, A. Bialkowski, and S. Sridharan. Soft-biometrics: Unconstrained
authentication in a surveillance environment. In Digital Image Computing: Techniques and
Applications, 2009. DICTA ’09., pages 196–203, Dec 2009.
[4] Xin Geng, Zhi-Hua Zhou, and Kate Smith-Miles. Automatic age estimation based on facial ag-
ing patterns. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 29(12):2234–
2240, 2007.
[5] Michela Goffredo, Imed Bouchrika, John N Carter, and Mark S Nixon. Performance analysis
for gait in camera networks. In Proceedings of the 1st ACM workshop on Analysis and retrieval
of events/actions and workflows in video streams, pages 73–80. ACM, 2008.
[6] Guodong Guo and Guowang Mu. Joint estimation of age, gender and ethnicity: Cca vs. pls.
In Automatic Face and Gesture Recognition (FG), 2013 10th IEEE International Conference
and Workshops on, pages 1–6. IEEE, 2013.
[7] H. Han, C. Otto, X. Liu, and A. Jain. Demographic estimation from face images: Human
vs. machine performance. Pattern Analysis and Machine Intelligence, IEEE Transactions on,
PP(99):1–1, 2014.
[8] Jonathon S. Hare, Sina Samangooei, Paul H. Lewis, and Mark S. Nixon. Semantic spaces
revisited: Investigating the performance of auto-annotation and semantic retrieval using se-
mantic spaces. In Proceedings of the 2008 International Conference on Content-based Image
and Video Retrieval, CIVR ’08, pages 359–368, New York, NY, USA, 2008. ACM.
[9] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are
universal approximators. Neural Networks, 2(5):359 – 366, 1989.
[10] T. Horprasert, D. Harwood, and L. S. Davis. A statistical approach for real-time robust
background subtraction and shadow detection. In Proc. IEEE ICCV, volume 99, pages 1–19, 1999.
[11] Mao-Hsiung Hung, Jeng-Shyang Pan, and Chaur-Heh Hsieh. Speed up temporal median fil-
ter for background subtraction. In Pervasive Computing Signal Processing and Applications
(PCSPA), 2010 First International Conference on, pages 297–300, Sept 2010.
[12] Donald A Jackson. Stopping rules in principal components analysis: a comparison of heuristical
and statistical approaches. Ecology, pages 2204–2214, 1993.
[13] Hansung Kim, Ryuuki Sakamoto, Itaru Kitahara, Tomoji Toriyama, and Kiyoshi Kogure.
Robust foreground extraction technique using gaussian family model and multiple thresholds.
In Yasushi Yagi, SingBing Kang, InSo Kweon, and Hongbin Zha, editors, Computer Vision
– ACCV 2007, volume 4843 of Lecture Notes in Computer Science, pages 758–768. Springer
Berlin Heidelberg, 2007.
[14] B.F. Klare, M.J. Burge, J.C. Klontz, R.W. Vorder Bruegge, and A.K. Jain. Face recognition
performance: Role of demographic information. Information Forensics and Security, IEEE
Transactions on, 7(6):1789–1801, Dec 2012.
[15] C. Neil Macrae and Galen V. Bodenhausen. Social cognition: Thinking categorically about
others. Annual Review of Psychology, 51(1):93–120, 2000.
[16] Mark Nixon and Alberto S. Aguado. Feature Extraction & Image Processing for Computer
Vision, Third Edition. Academic Press, 3rd edition, 2012.
[17] D.A. Reid, M.S. Nixon, and S.V. Stevenage. Soft biometrics; human identification using
comparative descriptions. Pattern Analysis and Machine Intelligence, IEEE Transactions on,
36(6):1216–1228, June 2014.
[18] Daniel A. Reid. Human Identification Using Soft Biometrics. PhD thesis, University of
Southampton, Apr 2013.
[19] S. Samangooei, Baofeng Guo, and M.S. Nixon. The use of semantic human description as
a soft biometric. In Biometrics: Theory, Applications and Systems, 2008. BTAS 2008. 2nd
IEEE International Conference on, pages 1–7, Sept 2008.
[20] Sina Samangooei and Mark S. Nixon. Performing content-based retrieval of humans using gait
biometrics. In David Duke, Lynda Hardman, Alex Hauptmann, Dietrich Paulus, and Steffen
Staab, editors, Semantic Multimedia, volume 5392 of Lecture Notes in Computer Science, pages
105–120. Springer Berlin Heidelberg, 2008.
[21] Ryan J Tibshirani. Fast computation of the median by successive binning. Unpublished
manuscript, http://stat.stanford.edu/ryantibs/median, 2008.
[22] Paul Viola and Michael J. Jones. Robust real-time face detection. International Journal of
Computer Vision, 57(2):137–154, 2004.
[23] Paul Viola, Michael J Jones, and Daniel Snow. Detecting pedestrians using patterns of motion
and appearance. International Journal of Computer Vision, 63(2):153–161, 2005.
[24] Jing Wu, William AP Smith, and Edwin R Hancock. Facial gender classification using shape-
from-shading. Image and Vision Computing, 28(6):1039–1048, 2010.
[25] Zhiguang Yang and Haizhou Ai. Demographic classification with local binary patterns. In
Advances in Biometrics, pages 464–473. Springer, 2007.
Appendices
Appendix A
Project Management
Figure A.1: Task list of proposed work
Figure A.2: Task list of proposed work
Figure A.3: Gantt chart of proposed work
Figure A.4: Task list of actual work
Figure A.5: Task list of actual work
Figure A.6: Gantt chart of actual work
Appendix B
Results
Average Accuracy Per Algorithm
Feature Num. Categories Expectation RBF Random Median
Age 7 14.25% 70.75% 12% 35%
Arm Length 5 20% 79.25% 21% 55%
Arm Thickness 5 20% 62% 23.75% 55%
Chest 5 20% 65% 22% 55%
Ethnicity 6 16.5% 10.5% 26% 10%
Facial Hair Colour 6 16.5% 87% 60.25% 90%
Facial Hair Length 5 20% 82.5% 36% 85%
Figure 5 20% 86.25% 26% 75%
Hair Colour 6 16.5% 61.75% 23.5% 60%
Hair Length 5 20% 67.25% 20.75% 60%
Height 5 20% 75.75% 15% 45%
Hips 5 20% 66% 24.25% 55%
Leg Length 5 20% 66.75% 20.75% 45%
Leg Shape 5 20% 59.5% 24.75% 50%
Leg Thickness 5 20% 68.5% 25.75% 60%
Muscle Build 5 20% 72.25% 22.25% 50%
Neck Length 5 20% 75.5% 19.5% 45%
Neck Thickness 5 20% 78.5% 22.5% 65%
Proportions 2 50% 85% 95% 95%
Sex 2 50% 80% 15% 85%
Shoulder Shape 5 20% 72.5% 20% 70%
Skin Colour 4 25% 55.25% 47% 65%
Weight 5 20% 75% 23.75% 59.25%
Table B.1: Per-algorithm accuracy for correctly estimating each feature on a human body. Expec-
tation is the expected accuracy for random guessing.
(a) Accuracy of RBF on 116 subjects
(b) Accuracy of Random Guessing on 116 subjects
(c) Accuracy of Median Guessing on 116 subjects
[Bar charts omitted; each plots Percentage Correct (0–100) per biometric feature]
Figure B.1: Correct classification percentages for each biometric feature for the preferred method
(RBF), and two guessing algorithms for comparison.
[Images (a)–(d), four outdoor photographs of the subject, are omitted]
Feature Image A Image B Image C Image D Target
Age Middle Aged Middle Aged Young Adult* Adult Young Adult
Arm Length Long Long Long Long Average
Arm Thickness Thick Thick Average* Thick Average
Chest Very Slim Slim* Slim* Slim* Slim
Ethnicity Other European* European* European* European
Facial Hair Colour None None None None Brown
Facial Hair Length Stubble* None None None Stubble
Figure Average Average Small* Small* Small
Hair Colour Grey Grey Blond* Grey Blond
Hair Length Medium* Medium* Short Short Medium
Height Tall Tall Tall Tall Average
Hips Average* Average* Average* Average* Average
Leg Length Average Short* Average Average Short
Leg Shape Very Straight Very Straight Straight Straight Average
Leg Thickness Average* Average* Average* Average* Average
Muscle Build Muscly Muscly Average* Average* Average
Neck Length Long* Long* Long* Long* Long
Neck Thickness Thick Thick Average* Average* Average
Proportions Average* Average* Average* Average* Average
Sex Male* Male* Male* Male* Male
Shoulder Shape Average Average Average Average Rounded
Skin Colour Tanned Tanned Tanned Tanned White
Weight Average* Fat Average* Average* Average
(e) Per-feature estimations for each of the above images, and a self-estimated target. Asterisks mark correct
estimations
Figure B.2: Estimating demographics with images taken outdoors on an unseen subject
Appendix C
Design Archive
Table of Contents
cache
Pre-built cache files for the PrincipleComponentExtractor class and TrainingSet class.
db
Populated GaitAnnotate database.
heuristics*
An example set of training weights for the RadialBasisFunctionTrainer. Heuristics marked
with numerical ranges indicate the particular queries they were used for in the results presented in
this document, for the purpose of repeatability.
queries
Images used to test the robustness of the solution.
scripts
Database table generation code.
src
Full source code in Java.
tests
Windows Batch files for invoking the QueryEngine class and other testing classes for quick perfor-
mance analysis.
trainingdata
Images used for training and testing.
Appendix D
Project Brief
Wally - A System to Identify Criminal Suspects by Generating
Labels from Video Footage and Still Images
Christopher Watts - cw17g12
Supervised by Mark Nixon
October 10, 2014
1 Problem Overview
Traditionally in law enforcement, an image of a criminal suspect is cross-referenced against databases such as
the Passport database for information. However, it is becoming increasingly common for well-organised
criminals to use fake identification or, for foreign criminals, no identification at all.
The proposition of this project is to create a system that, given an image or a video of a criminal
suspect, will identify metrics unique to the person from sets of comparative and categorical labels. For
example: height; length of forearms; width of shoulders. Comparative labels will be used over absolute
labels (e.g. 'taller than' rather than roughly '5'9"') because of observed accuracy benefits.
From this information, it should be possible to scan CCTV footage for pedestrians whose labels match
(within a certain error margin) those of the suspect so law enforcement can track the movements of the
suspect and potentially reveal who they really are.
2 Goals
The aim of the project is to assist law enforcement in finding fugitives and criminals in video footage
who are not initially identifiable through traditional techniques such as face and voice recognition.
Some current examples include finding "Jihadi John", responsible for the murders of, at the time
of writing, four British and American nationals in Syria, and Op Trebia, wanted for terrorism by the
Metropolitan Police since 2012.
3 Scope
The application domain of this project is rather large. Ideally, it would work on any CCTV footage from
any angle. However, due to time constraints on the project, some limitations will be imposed:
• The system will use computer vision to generate comparative labels for subjects whose full body
is visible in either still or video imagery.
• Initially, all footage must be front body view, front facial view or side body view at a constant
elevation and angle with the same lighting.
• When analysing footage of the full body, the subject may be masked.
• The system will use machine learning to train the label generator on a limited set of subjects from
the Soton HiD gait database with known comparative and categorical labels.
• The system will match subsequent footage of a suspect to the most likely subjects known by the
system, ranked by certainty.
1

Mais conteúdo relacionado

Mais procurados

Trade-off between recognition an reconstruction: Application of Robotics Visi...
Trade-off between recognition an reconstruction: Application of Robotics Visi...Trade-off between recognition an reconstruction: Application of Robotics Visi...
Trade-off between recognition an reconstruction: Application of Robotics Visi...
stainvai
 
Virtual Environments as Driving Schools for Deep Learning Vision-Based Sensor...
Virtual Environments as Driving Schools for Deep Learning Vision-Based Sensor...Virtual Environments as Driving Schools for Deep Learning Vision-Based Sensor...
Virtual Environments as Driving Schools for Deep Learning Vision-Based Sensor...
Artur Filipowicz
 
Final Report - Major Project - MAP
Final Report - Major Project - MAPFinal Report - Major Project - MAP
Final Report - Major Project - MAP
Arjun Aravind
 
Applicability of Interactive Genetic Algorithms to Multi-agent Systems: Exper...
Applicability of Interactive Genetic Algorithms to Multi-agent Systems: Exper...Applicability of Interactive Genetic Algorithms to Multi-agent Systems: Exper...
Applicability of Interactive Genetic Algorithms to Multi-agent Systems: Exper...
Yomna Mahmoud Ibrahim Hassan
 
ImplementationOFDMFPGA
ImplementationOFDMFPGAImplementationOFDMFPGA
ImplementationOFDMFPGA
Nikita Pinto
 
aniketpingley_dissertation_aug11
aniketpingley_dissertation_aug11aniketpingley_dissertation_aug11
aniketpingley_dissertation_aug11
Aniket Pingley
 
Single person pose recognition and tracking
Single person pose recognition and trackingSingle person pose recognition and tracking
Single person pose recognition and tracking
Javier_Barbadillo
 

Mais procurados (19)

Distributed Mobile Graphics
Distributed Mobile GraphicsDistributed Mobile Graphics
Distributed Mobile Graphics
 
Trade-off between recognition an reconstruction: Application of Robotics Visi...
Trade-off between recognition an reconstruction: Application of Robotics Visi...Trade-off between recognition an reconstruction: Application of Robotics Visi...
Trade-off between recognition an reconstruction: Application of Robotics Visi...
 
Im-ception - An exploration into facial PAD through the use of fine tuning de...
Im-ception - An exploration into facial PAD through the use of fine tuning de...Im-ception - An exploration into facial PAD through the use of fine tuning de...
Im-ception - An exploration into facial PAD through the use of fine tuning de...
 
Thesis_Prakash
Thesis_PrakashThesis_Prakash
Thesis_Prakash
 
dissertation
dissertationdissertation
dissertation
 
Visual, text and audio information analysis for hypervideo, first release
Visual, text and audio information analysis for hypervideo, first releaseVisual, text and audio information analysis for hypervideo, first release
Visual, text and audio information analysis for hypervideo, first release
 
Virtual Environments as Driving Schools for Deep Learning Vision-Based Sensor...
Virtual Environments as Driving Schools for Deep Learning Vision-Based Sensor...Virtual Environments as Driving Schools for Deep Learning Vision-Based Sensor...
Virtual Environments as Driving Schools for Deep Learning Vision-Based Sensor...
 
Final Report - Major Project - MAP
Final Report - Major Project - MAPFinal Report - Major Project - MAP
Final Report - Major Project - MAP
 
project Report on LAN Security Manager
project Report on LAN Security Managerproject Report on LAN Security Manager
project Report on LAN Security Manager
 
Applicability of Interactive Genetic Algorithms to Multi-agent Systems: Exper...
Applicability of Interactive Genetic Algorithms to Multi-agent Systems: Exper...Applicability of Interactive Genetic Algorithms to Multi-agent Systems: Exper...
Applicability of Interactive Genetic Algorithms to Multi-agent Systems: Exper...
 
ImplementationOFDMFPGA
ImplementationOFDMFPGAImplementationOFDMFPGA
ImplementationOFDMFPGA
 
aniketpingley_dissertation_aug11
aniketpingley_dissertation_aug11aniketpingley_dissertation_aug11
aniketpingley_dissertation_aug11
 
Computer security using machine learning
Computer security using machine learningComputer security using machine learning
Computer security using machine learning
 
Thesis
ThesisThesis
Thesis
 
SCE-0188
SCE-0188SCE-0188
SCE-0188
 
Sona project
Sona projectSona project
Sona project
 
Badripatro dissertation 09307903
Badripatro dissertation 09307903Badripatro dissertation 09307903
Badripatro dissertation 09307903
 
Real time emotion_detection_from_videos
Real time emotion_detection_from_videosReal time emotion_detection_from_videos
Real time emotion_detection_from_videos
 
Single person pose recognition and tracking
Single person pose recognition and trackingSingle person pose recognition and tracking
Single person pose recognition and tracking
 

Destaque

Categorical prepositions
Categorical prepositionsCategorical prepositions
Categorical prepositions
Glitzwyn
 
37454257 los-orishas-en-cuba
37454257 los-orishas-en-cuba37454257 los-orishas-en-cuba
37454257 los-orishas-en-cuba
Tania Campos
 
Ceremonias funebres-yrmino-valdes
Ceremonias funebres-yrmino-valdesCeremonias funebres-yrmino-valdes
Ceremonias funebres-yrmino-valdes
Mase Lobe
 
Los-guerreros
 Los-guerreros Los-guerreros
Los-guerreros
Mase Lobe
 
Historia de-ifa-1
Historia de-ifa-1Historia de-ifa-1
Historia de-ifa-1
Mase Lobe
 
Obras y-trabajos-de-santeria-para-defenderse
Obras y-trabajos-de-santeria-para-defenderseObras y-trabajos-de-santeria-para-defenderse
Obras y-trabajos-de-santeria-para-defenderse
Mase Lobe
 
Mis experiencias en la religión ( novela )
Mis experiencias en la religión ( novela )Mis experiencias en la religión ( novela )
Mis experiencias en la religión ( novela )
Mase Lobe
 

Destaque (20)

Bluetooth technology
Bluetooth technology Bluetooth technology
Bluetooth technology
 
Presentacion estadistica numero 2VARIABLES ALEATORIAS Y DISTRIBUCIONES DE PRO...
Presentacion estadistica numero 2VARIABLES ALEATORIAS Y DISTRIBUCIONES DE PRO...Presentacion estadistica numero 2VARIABLES ALEATORIAS Y DISTRIBUCIONES DE PRO...
Presentacion estadistica numero 2VARIABLES ALEATORIAS Y DISTRIBUCIONES DE PRO...
 
Creating ambitious web application with Ember.js
Creating ambitious web application with Ember.jsCreating ambitious web application with Ember.js
Creating ambitious web application with Ember.js
 
El gatito
El gatitoEl gatito
El gatito
 
Apuntes de ortografía y redacción
Apuntes de ortografía y redacciónApuntes de ortografía y redacción
Apuntes de ortografía y redacción
 
SureSkills Dublin Schedule Jan - June 2016
SureSkills Dublin Schedule Jan - June 2016SureSkills Dublin Schedule Jan - June 2016
SureSkills Dublin Schedule Jan - June 2016
 
Antropología y arte
Antropología y arteAntropología y arte
Antropología y arte
 
Categorical prepositions
Categorical prepositionsCategorical prepositions
Categorical prepositions
 
Tsem fa2016 mcarthurclass2
Tsem fa2016 mcarthurclass2Tsem fa2016 mcarthurclass2
Tsem fa2016 mcarthurclass2
 
37454257 los-orishas-en-cuba
37454257 los-orishas-en-cuba37454257 los-orishas-en-cuba
37454257 los-orishas-en-cuba
 
Ortografía y redacción
Ortografía y redacciónOrtografía y redacción
Ortografía y redacción
 
TLCon Amsterdam: Diversity Tactics That Work (Holly Fawcett, Social Talent)
TLCon Amsterdam: Diversity Tactics That Work (Holly Fawcett, Social Talent)TLCon Amsterdam: Diversity Tactics That Work (Holly Fawcett, Social Talent)
TLCon Amsterdam: Diversity Tactics That Work (Holly Fawcett, Social Talent)
 
Odù de nascimento
Odù de nascimentoOdù de nascimento
Odù de nascimento
 
Diversity Tactics that Work - In-house Recruitment Conference, Manchester
Diversity Tactics that Work - In-house Recruitment Conference, ManchesterDiversity Tactics that Work - In-house Recruitment Conference, Manchester
Diversity Tactics that Work - In-house Recruitment Conference, Manchester
 
Gifi ppt
Gifi pptGifi ppt
Gifi ppt
 
Ceremonias funebres-yrmino-valdes
Ceremonias funebres-yrmino-valdesCeremonias funebres-yrmino-valdes
Ceremonias funebres-yrmino-valdes
 
Los-guerreros
 Los-guerreros Los-guerreros
Los-guerreros
 
Historia de-ifa-1
Historia de-ifa-1Historia de-ifa-1
Historia de-ifa-1
 
Obras y-trabajos-de-santeria-para-defenderse
Obras y-trabajos-de-santeria-para-defenderseObras y-trabajos-de-santeria-para-defenderse
Obras y-trabajos-de-santeria-para-defenderse
 
Mis experiencias en la religión ( novela )
Mis experiencias en la religión ( novela )Mis experiencias en la religión ( novela )
Mis experiencias en la religión ( novela )
 

Semelhante a Report

High Performance Traffic Sign Detection
High Performance Traffic Sign DetectionHigh Performance Traffic Sign Detection
High Performance Traffic Sign Detection
Craig Ferguson
 
bkremer-report-final
bkremer-report-finalbkremer-report-final
bkremer-report-final
Ben Kremer
 
Automatic Detection of Performance Design and Deployment Antipatterns in Comp...
Automatic Detection of Performance Design and Deployment Antipatterns in Comp...Automatic Detection of Performance Design and Deployment Antipatterns in Comp...
Automatic Detection of Performance Design and Deployment Antipatterns in Comp...
Trevor Parsons
 
Content Based Image Retrieval
Content Based Image RetrievalContent Based Image Retrieval
Content Based Image Retrieval
Léo Vetter
 
Nweke digital-forensics-masters-thesis-sapienza-university-italy
Nweke digital-forensics-masters-thesis-sapienza-university-italyNweke digital-forensics-masters-thesis-sapienza-university-italy
Nweke digital-forensics-masters-thesis-sapienza-university-italy
AimonJamali
 
UCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_finalUCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_final
Gustavo Pabon
 
UCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_finalUCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_final
Gustavo Pabon
 

Semelhante a Report (20)

AUGUMENTED REALITY FOR SPACE.pdf
AUGUMENTED REALITY FOR SPACE.pdfAUGUMENTED REALITY FOR SPACE.pdf
AUGUMENTED REALITY FOR SPACE.pdf
 
High Performance Traffic Sign Detection
High Performance Traffic Sign DetectionHigh Performance Traffic Sign Detection
High Performance Traffic Sign Detection
 
Milan_thesis.pdf
Milan_thesis.pdfMilan_thesis.pdf
Milan_thesis.pdf
 
LinkedTV Deliverable 1.6 - Intelligent hypervideo analysis evaluation, final ...
LinkedTV Deliverable 1.6 - Intelligent hypervideo analysis evaluation, final ...LinkedTV Deliverable 1.6 - Intelligent hypervideo analysis evaluation, final ...
LinkedTV Deliverable 1.6 - Intelligent hypervideo analysis evaluation, final ...
 
LinkedTV Deliverable 4.7 - Contextualisation and personalisation evaluation a...
LinkedTV Deliverable 4.7 - Contextualisation and personalisation evaluation a...LinkedTV Deliverable 4.7 - Contextualisation and personalisation evaluation a...
LinkedTV Deliverable 4.7 - Contextualisation and personalisation evaluation a...
 
MSc_Thesis
MSc_ThesisMSc_Thesis
MSc_Thesis
 
Master_Thesis
Master_ThesisMaster_Thesis
Master_Thesis
 
bkremer-report-final
bkremer-report-finalbkremer-report-final
bkremer-report-final
 
Thesis_Report
Thesis_ReportThesis_Report
Thesis_Report
 
Automatic Detection of Performance Design and Deployment Antipatterns in Comp...
Automatic Detection of Performance Design and Deployment Antipatterns in Comp...Automatic Detection of Performance Design and Deployment Antipatterns in Comp...
Automatic Detection of Performance Design and Deployment Antipatterns in Comp...
 
Content Based Image Retrieval
Content Based Image RetrievalContent Based Image Retrieval
Content Based Image Retrieval
 
digiinfo website project report
digiinfo website project reportdigiinfo website project report
digiinfo website project report
 
iGUARD: An Intelligent Way To Secure - Report
iGUARD: An Intelligent Way To Secure - ReportiGUARD: An Intelligent Way To Secure - Report
iGUARD: An Intelligent Way To Secure - Report
 
Nweke digital-forensics-masters-thesis-sapienza-university-italy
Nweke digital-forensics-masters-thesis-sapienza-university-italyNweke digital-forensics-masters-thesis-sapienza-university-italy
Nweke digital-forensics-masters-thesis-sapienza-university-italy
 
ThesisB
ThesisBThesisB
ThesisB
 
UCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_finalUCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_final
 
UCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_finalUCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_final
 
Work Measurement Application - Ghent Internship Report - Adel Belasker
Work Measurement Application - Ghent Internship Report - Adel BelaskerWork Measurement Application - Ghent Internship Report - Adel Belasker
Work Measurement Application - Ghent Internship Report - Adel Belasker
 
An Optical Character Recognition Engine For Graphical Processing Units
An Optical Character Recognition Engine For Graphical Processing UnitsAn Optical Character Recognition Engine For Graphical Processing Units
An Optical Character Recognition Engine For Graphical Processing Units
 
Thesis
ThesisThesis
Thesis
 

Report

  • 1. Electronics and Computer Science Faculty of Physical Sciences and Engineering University of Southampton Christopher J. Watts April 28, 2015 Estimating Full-Body Demographics via Soft Biometrics Project Supervisor: Professor Mark S Nixon Second Examiner: Professor George Chen A project report submitted for the award of Bachelor of Science (BSc.) in Computer Science
  • 2. Abstract Soft-biometrics is increasingly becoming more realistic for identifying individuals in the field of computer vision. This project proposes a novel method of automatic demographic annotation using categoric labels for a wide range of body features including height, leg length, and shoulder width where previous research has been limited to facial images and very few biometric features. Using common computer vision techniques, it is possible to categorise subjects’ body features from still images or video frames and directly compare them to other known subjects with high levels of noise and image compression resistance. This project explores the viability of this new technique and its impact on soft-biometrics as a whole.
  • 3. Contents 1 Introduction 1 2 Requirements of Solution 2 3 Consideration of Approaches and Literature Review 3 3.1 Code Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 3.2 Region of Interest Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 3.2.1 Locating the Subject . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 3.3 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 3.3.1 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 3.3.2 Categoric Labelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 3.3.3 Weighting Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 4 Final Design and Justification 7 4.1 Technologies and Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 4.1.1 Code Libraries and Project Setup . . . . . . . . . . . . . . . . . . . . . . . . . 7 4.1.2 Subject Location . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 4.1.3 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 4.1.4 Labelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 4.1.5 Weighting Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 4.2 Processing Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 5 Implementation 12 5.1 Loading Training Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 5.1.1 Interpreting the GaitAnnotate Database . . . . . . . . . . . . . . . . . . . . . 12 5.2 Image Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 5.2.1 Limiting the Size of the Input . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 5.2.2 Background Removal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 5.2.3 Processing Video . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 5.3 Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 5.4 Training Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 5.4.1 Further Improvements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 5.5 Storing Training Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 5.5.1 Heuristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 5.5.2 Principal Component Data and Training Sets . . . . . . . . . . . . . . . . . . 18 5.5.3 Query Engine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 6 Results and Evaluation 19 6.1 Ability to Estimate Body Demographics . . . . . . . . . . . . . . . . . . . . . . . . . 19 6.1.1 How to Measure Success . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 6.1.2 Results on Test Subjects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 6.1.3 Robustness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 6.2 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 6.2.1 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 6.2.2 Querying . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 6.3 Viability of Use as a Human Identification System . . . . . . . . . . . . . . . . . . . 22 6.4 Background Invariance . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 6.5 Evaluation against Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 ii
  • 4. 6.6 Evaluation against Other Known Techniques . . . . . . . . . . . . . . . . . . . . . . 24 6.6.1 As Demographic Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 6.6.2 As Human Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 7 Summary and Conclusion 28 7.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 7.1.1 Migrating to C++ and OpenCV . . . . . . . . . . . . . . . . . . . . . . . . . 28 7.1.2 Dataset Improvements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 7.1.3 Use of Comparative Labels . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 7.1.4 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 7.1.5 Weighting Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 Appendices 32 A Project Management 33 B Results 40 C Design Archive 44 D Project Brief 45 iii
  • 5. Preface This project makes use of the terms subject and suspect. A subject is a person who is scanned by the system to serve as training data or test data. A suspect is a person who will be input as a query to identify matches from the known dataset of subjects. iv
  • 6. Acknowledgements I, Christopher J. Watts, certify that this project represents solely my own work and that all work referenced has been acknowledged appropriately. Further to this, I would like to thank those involved in the creation of the Southampton Gait Database and GaitAnnotate projects which have been used extensively in this work, and my supervisor Mark S. Nixon for providing guidance and feedback throughout the duration of this project. v
  • 7. Chapter 1 Introduction The original motivation for this project was to be able to identify a criminal suspect’s presence in surveillance footage given one or more reference images of the suspect — even if part of the subject, such as the face, is concealed. This would help particularly in law enforcement for finding fugitives who appear in CCTV footage with traditional biometric data being obscured (such as facial features). During the course of this project, the goal developed into a generalised target of calculating individual body features from an image to assist in identification processes associated with finding criminals. The method proposed in this project is a mixture of computer vision and machine learning to approximate the metrics of an individual. Approaches of this sort reside under the category of ”soft-biometrics” and there have been several attempts to identify people this way with ”accept- able rates of accuracy” in comparison to traditional biometrics [3]. The focus of this implementation is to identify the body demographics of subjects from still im- ages using a set of categoric labels. An example is to categorise height by [Very Short, Short, Average, Tall, Very Tall]. There are two distinct advantages to using categories for features rather than attempting to estimate an absolute value: 1. Estimating labels from video footage is more robust with greater invariance to noise, skew, and low camera resolution 2. The accuracy of training data for each subject (which must be generated by hand) becomes more reliable since research shows humans perform poorly when estimating absolute values [19], [15]. These demographic categories have been in use before by Sina Samangooei [19] who has created a database of subjects in collaboration with Mark Nixon for the GaitAnnotate project [20]. Previous research has had some success with automatic demographic recognition [1], [25], [6], [7], but only in the domain of facial imagery. Furthermore, these projects have been typically limited to a small number of demographics — age, gender and race. This project builds on existing research to generalise the techniques and detect a wide range of demographic information from images of full bodies to further push the possibility of automatic human identification using computer vision. 1
  • 8. Chapter 2 Requirements of Solution In order to successfully achieve the goal of identifying body demographics, the following minimal criteria were referred to when justifying all major decisions on the project. ID Type Requirement FR1 Functional The system must accept colour images and video frames of any size as an input (although it may then convert the image to greyscale) FR2 Functional The system must process video frame-by-frame autonomously FR3 Functional The system must be able to perform all calculations without any addi- tional user input FR4 Functional The system must be invariant to the background of each image FR5 Functional The system must be self-contained with any learning processes — storing all heuristic knowledge needed to perform identification FR6 Functional The system must be able to read from a database of subjects and feature categories (GaitAnnotate DB) FR7 Functional Once trained, the system must produce an estimate for each body feature on the subject given one or more query images FR8 Functional Once trained, the system must produce the best guess or best-guesses from the database of known subjects when given one or more query images R1 Non-functional The system must not rely on pixel-for-pixel measurements when esti- mating lengths R2 Non-functional The system must demonstrate a measurable level of invariance to noise and resolution (as if a CCTV camera is in use) R3 Non-functional Once trained, the system should satisfy queries within one minute on a standard desktop or laptop computer (although 1/30th of a second is preferable) R4 Non-functional The accuracy when estimating each feature should be greater than ran- dom ( 100 number of categories percent) R5 Non-functional The accuracy of subject retrieval should be better than random (retrieval correctly matches more subjects than a random test case for a sufficiently large number of queries) Understanding the research-based nature of the project, the requirements have been kept to a minimum to avoid over-specifying the system and eliminating possibilities before they are explored. 2
  • 9. Chapter 3 Consideration of Approaches and Literature Review 3.1 Code Libraries As computer vision has progressed to a relatively mature field, there are programming tools to abstract much of the functionality this project requires. One of which is OpenCV1 — a computer vision library written in C/C++ with interfaces available for Java and Python. Due to the heavy optimisation of the binaries, this library is computationally quick and memory efficient which makes it ideal for real-time applications. Another option is OpenIMAJ2 — a modernised approach to computer vision libraries written in pure Java that makes best use of the Object-Orientated paradigm. Finally, MATLAB provides a computer vision toolbox3 that offers rapid prototyping with many important algorithms in-built and support for generating C-code. 3.2 Region of Interest Extraction On their own, working out the soft-biometrics of subjects from the raw input images is intractable. A series of filters must be used to abstract the necessary features for labelling. 3.2.1 Locating the Subject One of the most-used methods of finding a person in an image is the Viola-Jones algorithm [22]. It is commonly used today to detect faces on smartphones and cameras, and to automatically tag friends on social networks. Using an alternate set of Haar-like features to the ones used for face detection, the Viola-Jones algorithm is able to detect full-bodies, and hence extract them from an image. Figure 3.1: An example of the Viola Jones algorithm detecting full bodies. Images credit: mzacha; RGBstock.com Alternatively, background subtraction can be used. This is plausible in the solution domain because most CCTV cameras are static, therefore a background reference model can be extracted over a period of time using al- gorithms such as Temporal Median [16]. The de-facto algorithm for background subtraction described by Hor- prasert et al. [10] illustrates a way of obtaining a fore- ground subject from a background model by examining the change in brightness and the change in colour sep- arately — allowing for shadow elimination. The largest resulting connected components can then be masked from 1http://opencv.org/ 2http://www.openimaj.org/ 3http://uk.mathworks.com/products/computer- vision/ 3
  • 10. the background — hopefully containing the subject with minimal background pixel false-positives. There have since been several extensions to the Horprasert algorithm such as the approach taken by Kim et al. [13] which combines the four-class thresholding of Horprasert et al. with silhouette extraction to smooth out noise and false-negatives in connected components. 3.3 Training 3.3.1 Feature Extraction In order to learn labels, a feature vector needs to be made for each subject. A basic example could be the pixel vectors from head to toe, shoulder to shoulder, pelvis to knee etc., but the solution is unlikely to be robust if the pose were to change. The route recommended by Hare et al. is auto-annotation with Latent Semantic Analysis (LSA) [8]. LSA works by finding the eigenvectors (Q, Q ) and eigenvalues (Λ) of a matrix of the common terms between a set of documents (A) using the eigendecomposition equation: A = QΛQ −1 In this case, Q is the matrix of eigenvectors for AAT and Q is for AT A. The eigenvectors can then be used for many purposes. A common task is to find the similarity between any two documents. This is achieved by finding the cosine similarity between any two rows of the eigenvector matrix Q . LSA was originally used in document analysis to find common terms between many large doc- uments. By considering images as ’documents’ and features as ’terms’, Hare et al. describe how the process of finding and sorting the principal components of the image terms can be used for im- age retrieval [8]. Weights can then be assigned to the individual principal components to describe how relevant each component is with respect to a particular body feature. This type of process is known as Principal Component Analysis (PCA). To implement PCA, the eigenvectors are organised in descending order by the corresponding eigen- value λλλ ∈ Λ for each row. The eigenvector with the largest eigenvalue represents the principal component: the most important variance that contributes the majority of change in an image. Applying this to a matrix of images and their features or pixels requires finding the covariance matrix. A quick and dirty trick is to approximate it with A = IIT where I is the matrix of fea- tures for n images. Since A is a covariance matrix, and therefore a real symmetric matrix, the eigendecomposition can be reduced to A = QΛQT PCA has previously been used by Klare et al. [14] as part of the process in facial demographic recog- nition for age, gender and race producing respectable results when trained with Linear Discriminant Analysis as a classification-type technique. 3.3.2 Categoric Labelling The training data lists a set of categoric measurements for each subject in the database of footage. Since describing the relative traits of individuals varies based on personal experience [18], the training data must be derived from the average ’vote’ of many judges. This project makes use of Samangooei’s collection of categorical labels [19] for the subjects of the Southampton Gait Database (SGDB)4 which have been derived using this method. Further work in the field of annotation revealed that using comparative descriptions in place of categoric labels when identifying suspects from witness statements is more reliable [17] than abso- lute labelling alone. The primary advantage is that it eliminates the bias of previous experience 4http://www.gait.ecs.soton.ac.uk/database/ 4
  • 11. (e.g. what a witness thinks is tall or short) by making the witness estimate if the suspect was taller/shorter/slimmer/fatter than the subject shown. While this technique is particularly suited to identifying an individual through iteratively narrowing down possibilities, it is less suited to iden- tifying the individual categories of demographic features. Instead, the bias of the witness-generated categories is minimised by taking the average of all witness statements when preparing the training data. 3.3.3 Weighting Function Modelling the correlation of principal components to semantic labels requires machine learning. Given an n-dimensional vector of principal components ppp and an expected category y, a model of weights www can be learned such that pppwww ≈ y. Therefore, over the entire m-dimensional training set, an error function can be defined as the squared sum of errors: E = Xwww − yyy 2 where X(m,n) =      ppp11 ppp12 · · · ppp1n ppp21 ppp22 · · · ppp2n ... ... ... ... pppm1 pppm2 · · · pppmn      yyy(m) =      y1 y2 ... ym      Perceptron Learning One of the most common types of learning weights is using the perceptron training algorithm. The principle is to adjust weights iteratively until the error falls below a certain threshold. www = www − η E Linear Regression Using the sum of squared errors error function, linear regression offers a very simple (although numerically unstable without a regularisation term) way of guessing the ideal set of weights for www. www = (X X)−1 X yyy Radial Basis Function The Radial Basis Function (RBF) regression algorithm improves upon linear regression. It performs clustering, then uses some non-linear function φ(α) on the distances from each cluster to map the data onto new axes. From there, it is possible to find a linear classifier that models a non-linear classifier on the real data, improving on both perceptron learning and linear regression. www = (Φ Φ)−1 Φ yyy where Φ(m,n) =      φ( ppp1 − C1 ) φ( ppp1 − C2 ) · · · φ( ppp1 − Cn ) φ( ppp2 − C1 ) φ( ppp2 − C2 ) · · · φ( ppp2 − Cn ) ... ... ... ... φ( pppm − C1 ) φ( pppm − C2 ) · · · φ( pppm − Cn )      an example φ may be φ(α) = e−α/σ2 5
  • 12. Neural Networks Neural Networks are another form of iterative learning in which multiple sigmoid-response percep- trons are linked to multiple layers. Although training is much more complex and requires differential equations to solve, using the network is relatively fast. Unlike perceptron learning, which is limited to linearly separable problems, neural networks are ”capable of approximating any Borel measurable function from one finite dimensional space to another” [9] which is comparable to the non-linear attributes of RBF. 6
  • 13. Chapter 4 Final Design and Justification 4.1 Technologies and Methods 4.1.1 Code Libraries and Project Setup This project is written in Java 7 using the OpenIMAJ library described previously for the reasons of prior experience with the language and object-orientated finesse. Maven is used for dependency resolution. However, in knowledge of some of the flaws with OpenIMAJ, choosing C or C++ would have been beneficial for both performance and utility reasons. 4.1.2 Subject Location Initially, the Viola-Jones algorithm seemed to be the ideal choice. It is well-used in computer vision and trusted. However, preliminary tests indicated several issues: 1. The algorithm runs slowly on the large sized images in the training data (approximately 1900ms per full-scale image and 350ms per image when scaled to 800x600 pixels) 2. There were up to 15 false positives for each true positive (an example of which is shown in figure 4.1) 3. For each true positive, only 60% of the bounding boxes contained the entire body (example shown in figure 4.2) Efforts pursued to redeem the algorithm detailed in section 5.2.2 on page 13 were not produc- ing results to a high enough standard, so a decision was made to revert to background subtrac- tion and silhouetting. Basic subtraction proved to be fast, but with too much noise to per- form any cropping. The Horprasert algorithm worked much better, but was slow and volu- minous in code. Better performance was achieved by using the ”robust” algorithm in Kim et Figure 4.1: A false positive Figure 4.2: A true positive that has not been bounded correctly 7
  • 14. al. [13] up until the labelling phase. This provides the same functionality as the Horprasert al- gorithm, but working in HSI colour space rather than RGB gives a significant efficiency boost: calculating the changes in luminance and the changes in saturation become much more intuitive. After background subtraction, assuming zero noise around the subject, the black-and-white mask can be cropped to the smallest bounding box containing all the white pixels in the image. This exclusively contains the subject and provides satisfactory alignment for Principal Component Anal- ysis to work correctly. The mask can then be multiplied onto the image to extract the subject onto a fully-black background before normalising the image. 4.1.3 Feature Extraction Subtract Background Largest Connected Component Trim Apply Mask Figure 4.3: Action Diagram for back- ground removal. Images credit: mzacha; RGBstock.com Principal Component Analysis is used to extract feature vectors from each image using OpenIMAJ’s EigenImages class — an implementation of PCA for image sets. 4.1.4 Labelling Each feature (e.g. height, arm thickness) of each subject is categorised by an enumerated type which represents a number in the range [0, 6] — this is necessary to allow for some of the more diverse categories such as age which re- quires 7 categories. This is replaceable by an enumerated category type for each class to resolve the issue of some feature requiring more categories than others. An added benefit is improved comprehensibility by using labels such as Age.Category.YOUNG rather than Category.LOW. In the implementation, only generic categories are utilised, but these class-based categories are able to be applied to final results before outputting to the terminal. 4.1.5 Weighting Function During preliminary training, benchmarking tests (de- scribed in section 5.4 on page 17) were performed on each weighting function to discern the best performing algorithm. Due to the lack of a reliable neural networks framework, there is no implementation for feed-foward neural networks. In further work, it may be worthwhile writing the code to explore this option. 8
  • 15. 4.2 Processing Pipeline Overall, the application will be trained and queried as follows: Load Training Dataset Split Dataset into Training and Testing Subsets Preprocess Images Extract Subjects Crop and Normalise Train PCA Algorithm Analyse Training Set Learn Weighting Function Analyse Testing Set Check Weighting Function Figure 4.4: Action Diagram for training the system 9
Load Footage → Preprocess Images → Extract Suspect → Crop and Normalise → Analyse Footage → Apply Weighting Function → Run Query Against Database → Return Closest Match(es)

Figure 4.5: Action Diagram for querying the system

10
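The querying pipeline of figure 4.5 can be read as a simple chain of calls. The sketch below is illustrative only; every class, field, and method name in it is a hypothetical placeholder for the components described in the following chapter.

    // Hypothetical outline of the query pipeline (all names are placeholders).
    List<Match> query(MBFImage footage) {
        MBFImage processed = preprocess(footage);          // resize and filter
        MBFImage subject = extractAndNormalise(processed); // crop to the suspect
        double[] components = pca.analyse(subject);        // project onto eigenspace
        int[] categories = weighting.apply(components);    // estimate feature labels
        return database.closestMatches(categories, 5);     // top-5 candidate subjects
    }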
Chapter 5 Implementation

5.1 Loading Training Data

Using OpenIMAJ's dataset library, all training and test data is grouped by the subject in the image. For example, a dataset can contain 50 subjects but multiple images per subject. This way, it is possible to train with both side-view and frontal-view images from the gait database without using complex iterators. Furthermore, OpenIMAJ keeps all images in datasets on disk until they are needed; if every image were loaded at startup, memory would become an issue. To make training valid, it is important to choose random splits each time the system is tested, which is done directly with OpenIMAJ's group splitting class. Typically, the system is trained with N − 20 subjects and tested with 20.

5.1.1 Interpreting the GaitAnnotate Database

Since the demographics associated with each subject were kept in a separate MySQL database with a table layout that was not ideally suited to this project, it was decided to migrate the database into a new structure. Initially, a database was set up using JavaDB/Derby to store the learned weightings. However, it became clear that this was a heavy-weight solution to a light-weight problem, so weights were stored as XML instead via the JAXB framework for simplicity and ease of manual tweaking.

With the use of a PHP script, an XML file was created for each subject in the training images containing the human-estimated categories for each feature. In a retrospective decision further into the development timeline, since the querying engine requires an element of speed and efficient use of memory (R3), XML could not be used to match subjects on the trained system. Instead, the JavaDB solution was re-implemented using the same data as in the XML files, with the added advantage of being able to use primary keys for searching, limiting the amount of data required in memory at the time of execution. XML is still in use for the training data due to its simplicity.

5.2 Image Processing

5.2.1 Limiting the Size of the Input

The larger an image is, the more pixels must be processed for subject extraction and training. If an input image is very large (> 1000px, for example), then a large amount of processing time is wasted on insignificant details such as the buttons on a subject's shirt, when all that is really needed is enough resolution to identify demographics. The first processing filter therefore limits the size of the image to a constant. In the default case, all images with a height or width greater than 800px are resized so the longest side is exactly 800px, preserving the aspect ratio, as sketched below.

12
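A minimal sketch of this size cap follows, assuming OpenIMAJ's ResizeProcessor can be constructed with a scale factor; the 800px constant is taken from the text.

    // Cap the longest side of an image at 800px, preserving aspect ratio.
    // Assumes ResizeProcessor accepts a scale factor; returns the input
    // unchanged when it is already small enough.
    MBFImage limitSize(MBFImage image) {
        final int MAX_DIMENSION = 800;
        int longest = Math.max(image.getWidth(), image.getHeight());
        if (longest <= MAX_DIMENSION)
            return image;
        float scale = (float) MAX_DIMENSION / longest;
        return image.process(new ResizeProcessor(scale));
    }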
5.2.2 Background Removal

As described in section 3.3.3 on page 5, several algorithms were shortlisted to further remove unnecessary details from the input. Work first started on implementing the Viola-Jones algorithm to remove the bulk of the background. OpenIMAJ has built-in methods to run the algorithm, and contains many Haar cascades for detecting different parts of the body.

The first runs of the algorithm picked up many false positives, as described in section 4.1.2 on page 7. To remedy this, the images were run through basic background subtraction to remove as much of the background as possible. The still-image training data does not contain an image of just the background itself, so one was derived using image editing software. Background removal resulted in fewer erroneous detections, but there were still as many as 15 false positives per subject. Knowing that a voting variant of Viola-Jones exists, the following basic voting algorithm was implemented:

    detected ← ViolaJones(image)
    votes ← ∅
    seen ← ∅
    for all rectangle ∈ detected do
        if rectangle ∉ seen then
            if Area(rectangle) > 10000 then
                seen ← seen ∪ {rectangle}
                overlaps ← ∅
                for all other ∈ detected do
                    if other ∉ seen then
                        if Overlapping(rectangle, other) then
                            if Area(other) > 10000 then
                                seen ← seen ∪ {other}
                                overlaps ← overlaps ∪ {other}
                            end if
                        end if
                    end if
                end for
                votes ← votes ∪ {overlaps}
            end if
        end if
    end for
    regions ← ∅
    for all candidates ∈ votes do
        if |candidates| ≥ 3 then
            regions ← regions ∪ {MeanAverage(candidates)}
        end if
    end for
    return regions

This dramatically decreased the rate of false positives, and gave a more accurate subject boundary. Figure 5.1 on the following page shows the result of running the voting algorithm. Taking the median of the voted regions was also attempted, but with inferior results to taking the mean.

From here, edge detection was performed using the Canny operator to assist in isolating the boundary of the subject: the top of their head, the bottom of their feet (although these are mostly cropped out of the training images), and the sides of their shoulders. However, after running the application multiple times it soon became apparent that using the full-body cascades was too slow, much slower than using face and upper-body cascades. In some cases, a single image sized 579x1423px would take 20 seconds to process. Upon speaking with OpenIMAJ's author, Jonathon Hare, he advised that while the cascades were taken straight from OpenCV, some cascades perform quite poorly compared to others in the library and "probably need improving". In order to avoid the lengthy process of creating new cascades, methods such as

13
(a) The candidate images produced by the Viola-Jones algorithm (b) The returned result
Figure 5.1: An example result of using the above voting algorithm

subtraction and silhouetting became favourable.

Second Approach at Background Removal

The Horprasert algorithm immediately showed better results at background subtraction, but still with a large amount of background being erroneously detected. The false positives arose from the positioning of the camera in the training images: although care had been taken to minimise the variance of the images, occasionally the camera may have been kicked or the treadmill moved, the repercussions of which are demonstrated in figure 5.2 on the next page. Compared with video imagery, it is exceptionally difficult to work out the background of a still image. Methods such as the temporal median exist for video, but the backgrounds of still images must be computed manually.

To rectify the background issue, the training images were imported into image editing software, and a script was run to automatically align and crop them. The images were cropped so as not to contain the treadmill, but as some subjects were standing behind the handles, not all subjects fit fully into the image bounds (further experiments will need to be conducted to examine whether this has adverse effects on the results). From this new aligned dataset, a suitable background was derived. Figure 5.3 on the facing page shows the result of the Horprasert algorithm using the new dataset.

The second background algorithm, by Kim et al., begins with a method very similar to Horprasert's and proved to be faster at the same job. However, the remainder of the algorithm, which includes labelling and silhouette extraction, could not be implemented due to the large amount of time OpenIMAJ's default connected component labeller takes to execute on the training images. Further work must be undertaken to make this possible.

Since the goal of background removal was to reveal a single silhouette mask encompassing most of the subject's outline, a cropping algorithm was designed to remove all black areas, leaving

14
(a) A correctly aligned input image (b) Result of Horprasert algorithm (c) An incorrectly aligned input image (d) Result of Horprasert algorithm
Figure 5.2: Results of using background subtraction on two different training images, one of which is misaligned with the assumed background. The thresholds are manually guessed.

(a) A correctly aligned and cropped input image (b) Result of Horprasert algorithm
Figure 5.3: Results of using background subtraction on the cropped and aligned dataset.

15
the subject's silhouette in full-frame:

    bounds ← ∅
    for all pixel ∈ image do
        if pixel.value > 0.5 then
            if bounds.x = 0 ∨ pixel.x < bounds.x then
                bounds.x ← pixel.x
            else if pixel.x − bounds.x > bounds.width then
                bounds.width ← pixel.x − bounds.x
            end if
            if bounds.y = 0 then
                bounds.y ← pixel.y
            else if pixel.y − bounds.y > bounds.height then
                bounds.height ← pixel.y − bounds.y
            end if
        end if
    end for

The Deprecation of Background Removal

While both the Horprasert [10] and Kim et al. [13] algorithms seemed to effectively remove the background in the test examples, Horprasert was not robust enough to remove all non-subject areas, which prevented the cropping algorithm from working as planned. The Kim et al. algorithm incorporates a stage of connected component labelling which mitigates this issue, but sadly the performance of OpenIMAJ's labeller was far too slow for realistic use. The feasibility of this algorithm was further reduced when it was noted that its novel 'elastic' borders would be almost impossible to process at an acceptable speed in Java; only a natively compiled C library would be realistic.

In order to remain on track with the crucial research-based components of the project, background removal was deprecated in favour of manually cropping the images and keeping the background as the solid green screen from the laboratory. While this means the system will not work in non-controlled conditions, it still serves as a convincing proof-of-concept with the potential to work on more ambitious footage given a fast and optimal background removal implementation.

5.2.3 Processing Video

It was initially intended that the project could take videos as input rather than singular still images. Using the temporal median (or temporal mode, to eliminate the need for expensive sorting operations), a representative background could be generated autonomously for use in image segmentation, provided there is enough movement in the scene. Another algorithm was discovered offering a fast implementation of the temporal median [11], but since this method relies on pixel values lying in the relatively small greyscale range [0, 255], the 16 million possible values of an RGB image made it unlikely to show any benefit (although this theory has not been tested). Instead, the next best alternative was to use the Quickselect method for finding the median [21]. To further reduce processing time, the background image could be used to discard all frames with no subject present. This is simple to achieve by setting a threshold of foreground-to-background pixels in the subtraction mask, or by requiring that the primary connected component has a sufficiently large area.

Unfortunately, certain problems arose when attempting to load video footage. The Southampton Gait Database contains a very large repository of videos for each subject in the GaitAnnotate database. All videos are encoded as raw DV footage, which should be simple enough to process; since OpenIMAJ uses a library that relies on FFMPEG1, a well-known and respected video codec suite, there should not have been any issues. Despite this, there appears to be either some corruption in the file headers or a bug that causes FFMPEG to load the frames but skip all metadata required for seeking, which is essential for preparing video filters.
The only option left was to load the frames as images and work with them as such, but training the system this way is impossible without an enormous amount of memory to hold each image and the principal component data that results. It remains possible, however, to query the trained system with image sequences.

1https://www.ffmpeg.org/ffplay.html

16
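For reference, the per-pixel temporal median described above is only a few lines once the frames are in memory. This is an illustrative sketch (a full sort is used where Quickselect [21] would be the optimisation); it assumes same-sized greyscale frames held as OpenIMAJ FImage objects, whose pixels are exposed as float[y][x].

    // Derive a background as the per-pixel temporal median of greyscale frames.
    FImage temporalMedianBackground(List<FImage> frames) {
        int w = frames.get(0).getWidth(), h = frames.get(0).getHeight();
        FImage background = new FImage(w, h);
        float[] samples = new float[frames.size()];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                for (int i = 0; i < frames.size(); i++)
                    samples[i] = frames.get(i).pixels[y][x];
                java.util.Arrays.sort(samples); // Quickselect would avoid the full sort
                background.pixels[y][x] = samples[samples.length / 2];
            }
        }
        return background;
    }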
5.3 Component Analysis

Despite not having completed a robust method of background subtraction, Principal Component Analysis could still be performed on the raw cropped images of the training set, on the basis that the background should be represented by the least significant components. As PCA requires the inputs to have the same number of rows and columns, a normalising class was written to resize each input image to exactly the same size without losing any of the original image (padding is added to the outside if a dimension is too short). The PCA interfacing class directly invokes OpenIMAJ's EigenImages class, but also allows for Java serialization so it can be stored for later use: an important time saving for rapid training and a necessity for performing PCA on any successive queries.

5.4 Training Algorithms

To gauge the success of a training algorithm, an error function was devised as the total 'distance' between the features of each subject (f) and the features of the estimated subject (g):

$\sum_{i=1}^{n_{\text{features}}} |f_i - g_i|$

When choosing the best algorithm, a set of tests was run for each algorithm using a 50/50 split of training to testing, with 58 subjects in each group, using both frontal and side-view images.

Testing began with a perceptron learning algorithm to iteratively converge upon ideal weights for each principal component using gradient descent. The weights are initialized as a uniform random guess, and updated using the gradient $\nabla E = 2X^T(Xw - y)$. After adjusting the learning rate and iteration count, an overall error of ≈ 80 was achieved at $\eta = 10^{-5}$ with 1000 iterations. By comparison, a random guess produced an error of ≈ 115. A bias term was then added to the data (an extra input with a constant value of 1) to minimize bias error, but this made no significant difference to gradient descent. This means that using perceptron-trained weights is not much better than random guessing, but it is most certainly an improvement. Higher iteration counts up to 10,000 were also tried, but these appeared to overfit the data: error rates increased. In hindsight, using a validation set in addition to training and testing could have prevented this and attained better results at higher iteration counts.

Linear regression was implemented next using the formula described earlier. The average error was slightly higher, ≈ 83, but the algorithm took much less time to train than the perceptron (4ms on average for each feature of 58 subjects, compared to 80ms). After further optimization by adding a regularization term λI for variance error and a bias term as described above, the error decreased to ≈ 45, which is much more realistic but still not very useful.

The final algorithm tested was the Radial Basis Function (RBF). After multiple failures and several test cases that were statistically worse than random (error ≈ 140), results were achieved in the range (30, 40) with the a priori variable α = 5. By manually adjusting α, it was found that a value of 20 produces the most accurate and consistent results.

During development, it was conceived that there should be separate weightings for frontal-view and side-view images to classify either style more accurately. However, this proved to be a bad idea, as real CCTV footage will not guarantee a frontal or side view, but rather a range of oblique angles. Training is therefore best done with multiple angles in an effort to reduce generalization error.
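To make the gradient-descent update above concrete, here is a self-contained sketch using plain arrays rather than the project's actual classes (which are not reproduced in this report). It minimises $E = \|Xw - y\|^2$ with the stated gradient $2X^T(Xw - y)$ and a uniform random initial guess.

    // X is n samples by d principal components; y holds the target categories.
    double[] trainWeights(double[][] X, double[] y, double eta, int iterations) {
        int n = X.length, d = X[0].length;
        double[] w = new double[d];
        java.util.Random rng = new java.util.Random();
        for (int j = 0; j < d; j++)
            w[j] = rng.nextDouble(); // uniform random initial guess
        for (int it = 0; it < iterations; it++) {
            double[] residual = new double[n]; // Xw - y
            for (int i = 0; i < n; i++) {
                double dot = 0;
                for (int j = 0; j < d; j++)
                    dot += X[i][j] * w[j];
                residual[i] = dot - y[i];
            }
            for (int j = 0; j < d; j++) { // w <- w - eta * 2 X^T residual
                double gradient = 0;
                for (int i = 0; i < n; i++)
                    gradient += 2 * X[i][j] * residual[i];
                w[j] -= eta * gradient;
            }
        }
        return w;
    }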
5.4.1 Further Improvements

Shortly after the initial algorithm tests, a bug was discovered in the implementation of finding the total distance between subjects which meant scores were more than double what they should have

17
been. After fixing this issue, along with some smaller discrepancies, gradient descent reduced to an average distance of ≈ 36, linear regression to ≈ 8, and RBF to ≈ 7 using 50/50 training-to-testing splits. It was noted at this point that linear regression is definitely a contender for the final solution due to its speed and accuracy. Despite this, since RBF adds little extra training time for a small decrease in distance, it is still preferred. For larger training sets in more extreme conditions, it is quite possible that the non-linear training algorithm will have distinct advantages over linear regression due to the flexibility of the model.

5.5 Storing Training Data

5.5.1 Heuristics

Since the JAXB library was already being used for loading demographic data, using XML to store trained heuristics seemed a logical solution. The implementation is a trainable Heuristic class with a subclass for each body feature. A storable JAXB version of a Heuristic was then created with containers for the class name, the weightings map, and any serialized data that the training algorithm may need to set up again, such as centroid data for the RadialBasisFunctionTrainer. This method proved effective when debugging, as individual weights could be manually tweaked with a standard text editor and changes are easier to notice.

5.5.2 Principal Component Data and Training Sets

Since Principal Component Analysis is a costly procedure but its results are reusable, heuristic training times could be reduced by caching both the principal component data and any generated training sets (containing mappings of component data to categories) to disk. Since this data is not editable, it is simply serialized using Java serialization.

5.5.3 Query Engine

With a trained set of heuristics, one or more images should produce an estimation for each body feature and, ultimately, a guess of whom the subject may be. To achieve the latter, a database needed to be re-implemented to avoid loading every single subject's XML file separately for each query, as the engine needs to find the subjects with the closest matching features.

The database is SQL-based, which makes querying relatively simple. Stored procedures and functions were considered to calculate the distance between a subject in the database and a suspect probe, but were rejected in favour of dynamically building a SQL statement of the following form (where question marks are replaced with the respective values of the probe's features):

    SELECT id, SUM(ABS(age - ?) + ... + ABS(weight - ?)) AS distance
    FROM subjects
    GROUP BY id
    ORDER BY distance ASC
    FETCH FIRST 5 ROWS ONLY

This produces a list of the top 5 matching subjects with their total distance in ascending order. This can be used directly to identify the suspect, or to narrow down the possibilities in a wider search.

18
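Binding a probe's estimated categories into that statement is straightforward with JDBC. In this sketch, QUERY_SQL, the probe array, and estimateCategories are hypothetical placeholders; only the query shape and the top-5 result set come from the text above.

    // Bind the probe's estimated feature categories into the distance query.
    // QUERY_SQL is the statement above with one '?' per feature, in order.
    PreparedStatement statement = connection.prepareStatement(QUERY_SQL);
    int[] probe = estimateCategories(probeImage); // hypothetical helper
    for (int i = 0; i < probe.length; i++)
        statement.setInt(i + 1, probe[i]);        // JDBC parameters are 1-indexed
    ResultSet top5 = statement.executeQuery();
    while (top5.next())
        System.out.printf("subject %d: distance %d%n",
                top5.getInt("id"), top5.getInt("distance"));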
Chapter 6 Results and Evaluation

6.1 Ability to Estimate Body Demographics

6.1.1 How to Measure Success

In this project, the success of recognition is measured by three metrics.

Firstly, the percentage accuracy for a particular feature. An example is age, which has seven categories. The accuracy of the system for a feature is determined by the number of correct categorisations divided by the total number of subjects used for testing. Note that the system can only guess right or wrong; the distance from the correct category is not taken into account. For an entirely random guessing approach on a sufficiently large dataset of test subjects, the accuracy for the seven-category age feature will be 100/7 = 14.29%.

The second metric, as described previously, is the total 'distance' between the guessed demographics (g) and the actual demographics (f) over all body features:

$\text{distance} = \sum_{i=1}^{n_{\text{features}}} |f_i - g_i|$

This metric does allow for guesses to be one or two categories out, which gives a clearer overall picture of how close the match is (a direct translation to code is sketched at the end of this section).

The final metric is the index of the actual subject when querying the database for the most probable subjects. For example, if the system guesses the suspect's body demographics slightly wrong, the closest match in the database may not be the correct person; instead, the correct subject may be ranked as the 6th most probable.

6.1.2 Results on Test Subjects

As mentioned in the discussion of the different training algorithms, the best results were obtained using a Radial Basis Function φ(α) with α = 20. The results in table B.1 on page 40 show the performance of the RBF trainer as a function of the average percentage of correct estimations, in comparison to the 'expected' accuracy and two random techniques: the median category, and a random category between 1 and the maximum value within the training set. For this set of tests, the training set was kept the same.

Typically, the correct classification rate for a particular body feature is 72 ± 3.8%. Some features are particularly accurate (e.g. 90% for Proportions), and some are close to useless (namely 55.25% for Skin Colour and 10.5% for Ethnicity).

It is safe to assume that human demographics approximate a Gaussian distribution per continuously measured biometric, so it is not surprising that choosing the median category yields better results than choosing a random category between sensible limits. However, both random methods produced results significantly worse than the informed method, which validates that it is possible to use principal component analysis to estimate body demographics.

19
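The distance metric defined above translates directly to code; this sketch assumes the per-feature categories are already held as integer arrays in the same order.

    // Total category 'distance' between actual (f) and guessed (g) demographics.
    int distance(int[] actual, int[] guessed) {
        int total = 0;
        for (int i = 0; i < actual.length; i++)
            total += Math.abs(actual[i] - guessed[i]);
        return total;
    }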
[Bar chart: percentage accuracy per biometric feature, from Age through Weight]
Figure 6.1: The correct classification rates of each biometric for the RBF method

Algorithm   Average Distance   % Recalls within top 5
RBF         7.2                51.75%
Random      26.0               5.0%
Median      10.3               5.0%

It is clear that while the difference in average distance between the RBF and Median algorithms is relatively small, the slight advantage of RBF yields a large increase in the percentage of correct recalls (defined as the proportion of test subjects that were recalled as one of the top 5 most likely from the database). The relative success of guessing with the median indicates the category setup is not unique enough to separate individuals; this is explored in further detail in section 6.3 on page 22. Comparison box-and-whisker plots for the accuracy of each algorithm are shown in full detail in figure B.1 on page 42.

The effects of occluded body features

Surprisingly, it seems that the difference in accuracy between frontal and side views when estimating categories is minimal. The expectation was for features such as hips, chest, and shoulder shape to become less accurate as they can no longer be easily determined. However, it seems that the proposed method infers information from alternate variances that are not affected by the view angle. In the example shown in figure 6.2 on page 26, a subject (not used in training) has been queried against the known information in the database for both frontal and side views. The differences are minor and unrelated to the aforementioned categories that should be affected in the side image. This phenomenon may be of great help when identifying suspects at odd angles where humans are unable to determine certain traits.

Difficulties and Limitations with Training Data

It is apparent from the results in table B.1 on page 40 that certain body features are more easily detectable than others. The likely explanations are the following:

1. Features such as ethnicity do not have a natural order or scale. Far-Eastern and Black, for example, are next to each other in categorical order, yet they share few similarities. This effect can reduce the accuracy of classification.

20
2. Subjects in the training set are predominantly male with normal proportions, which makes guessing by the median more accurate than informed estimation for certain characteristics.

3. The dataset is limited in size, with just 198 images to learn from.

4. The training data is written by humans in a consensus approach [19], which means that while an individual's error is overruled, the entire group of annotators can be biased towards a particular category based on their experience, especially if the annotators are mostly white British males in their early twenties.

In addition to these, there are issues with the images themselves, such as body parts being obstructed by a treadmill, and poor exposure and white balance leading to skin colour looking darker than normal. It is entirely plausible that a more comprehensive, higher quality training set could greatly augment the trained system, a suggestion that should be taken into account in future research.

6.1.3 Robustness

The most surprising results came with robustness testing, which surpassed expectations. In particular, the system appears to be strong against resolution, quality, and noise constraints, as the following tests show:

Test                         Total Distance   Recall Rate
Normal image (subject 010)   5.0              10/116
+10% uniform noise           5.0              10/116
+20% uniform noise           5.0              10/116
+30% uniform noise           5.0              10/116
+40% uniform noise           6.0              21/116
+50% uniform noise           7.0              48/116
+60% uniform noise           7.0              48/116
+70% uniform noise           8.0              56/116
+80% uniform noise           10.0             67/116
+90% uniform noise           11.0             69/116
+100% uniform noise          11.0             69/116
(a) Response in accuracy as noise levels increase (100% noise = every pixel differs from its original)

Test                         Total Distance   Recall Rate
Normal image (subject 053)   5.0              1/116
50% resolution               5.0              1/116
25% resolution               6.0              1/116
12.5% resolution             6.0              1/116
6.25% resolution             6.0              1/116
3.13% resolution             11.0             64/116
1.57% resolution             15.0             107/116
(b) Response in accuracy as resolution decreases (original size = 579x1423px; 72dpi)

Test                         Total Distance   Recall Rate
Normal image (subject 053)   5.0              1/116
Lowest JPEG quality          5.0              1/116
100-colour GIF               5.0              1/116
20-colour GIF                6.0              1/116
(c) Response in accuracy as a result of image compression

Table 6.1: Results of degraded image quality tests

Though these tests are limited to a select few subjects (as each image requires manual editing), repeating the tests with different subjects and training sets yields similar results. These are crucially important findings, as they show a strong invariance to factors commonly associated with CCTV footage, which would aid deployment using existing inexpensive camera equipment. By these findings, even a cheap webcam could be used as a state-of-the-art demographic estimator, whereas other methods such as gait recognition are more susceptible to resolution and frame rate [5]. To showcase the importance of these results, figure 6.3 on page 27 shows the input images for 6.25% resolution and the 20-colour GIF.

21
Subroutine                Execution Time
Image Loading             2.565s
PCA Analysis              170.9s
Training Set Generation   81.99s
Training (RBF)            15.48s
Total (Complete)          271.1s
Total (Cached)            18.22s
Table 6.2: Performance of Training Engine

Subroutine         Execution Time
Image Analysis     0.367s
Categorisation     0.072s
Subject Matching   0.150s
Total              1.237s
Table 6.3: Performance of Query Engine on Desktop Computer

6.2 Performance

6.2.1 Training

Training performance depends on whether prior training has taken place. Since gathering principal component data and creating training sets takes a long time and both are reusable, they are cached for future use. On a desktop computer (Intel Core i7 2600K @ 3.6GHz; 8GB 1600MHz RAM; 7200RPM HDD; Java 8), table 6.2 shows the mean execution time of each subroutine on the GaitAnnotate database with 198 images.

6.2.2 Querying

For the same desktop computer, the mean execution times for taking a single image and identifying a list of 10 potential matches are shown in table 6.3. The code was also run on a Raspberry Pi 2 Model B (ARM Cortex-A7 quad core @ 900MHz; 1GB RAM; microSD; Java 8), the results of which are shown in table 6.4. Although subject matching is relatively fast for individual queries, multiple queries made less than one second apart can form a queue on the database, which can delay queries by up to 20 seconds each if many thousands of requests are made.

6.3 Viability of Use as a Human Identification System

While the estimation of demographics is certainly an effective breakthrough, there appears to be little hope of the current system becoming a way to identify masked criminals. With the small population of 116 subjects in the database, there is already a significant overlap of 'average' people. This has meant that the fully trained system can only manage to retrieve a subject in the top 5 some 50–60% of the time, as opposed to retrieving the correct subject 80–100% of the time, which would be more realistic. Notwithstanding, random estimation has an average top-5 recall rate of 5%, so the result is certainly significant, if not ideal.

Subroutine         Execution Time
Image Analysis     9.489s
Categorisation     1.876s
Subject Matching   7.216s
Total              12.60s
Table 6.4: Performance of Query Engine on Raspberry Pi

22
Despite having a low recall rate, the proposed system could decrease the amount of searching required by narrowing down the possible candidates in a man-hunt from thousands to just a few hundred. The system can also be used simply to aid with witness descriptions. The labels in table 6.4b on page 27 were generated from a single image of 'Jihadi John', the masked murderer of the Islamic State, and seem to identify features with respectable accuracy despite the system not being trained for backgrounds other than green baize. While some features are misguided (e.g. skin colour), the majority of estimations are certainly fitting. The unusual proportions, short legs, and small figure are explained by the legs being cropped out.

While this system was trained to use discrete categoric labels for identification, it may be possible to use the comparative labels described in section 3.3.2 on page 4, whereby metrics are given relative to other subjects (e.g. taller, fatter), with the intention of reducing the number of conflicts by increasing the uniqueness of each database entry, ultimately leading to a system that can more accurately identify a single person.

6.4 Background Invariance

Though the training data is limited to laboratory conditions, a limited set of tests, including the annotation of 'Jihadi John', indicates a slight invariance to background images that may render background subtraction obsolete if results improve when using training data with non-uniform, indoor, and outdoor backgrounds. See figure B.2 on page 43 for the results of attempting to classify an individual standing outside, which show mostly correct or reasonable categories. It is interesting to note that both frontal and side-on views under the same lighting produce more-or-less the same results. This further validates the decision in section 5.4 on page 17 to use a single set of heuristics for all view angles.

6.5 Evaluation against Requirements

ID    Pass/Fail   Comments
FR1   Pass        The system resizes colour images to a constant dimension before processing
FR2   Fail        Video footage could not be loaded successfully using OpenIMAJ
FR3   Pass        Both the training and querying engines require no user input other than the number of matches to retrieve and the images to use
FR4   Fail*       *While background removal was not successfully implemented, tests in table 6.4b on page 27 and figure B.2 on page 43 demonstrate the possibility that the background could be rendered insignificant given a comprehensive training set
FR5   Pass        Heuristics are stored as XML; principal component data is stored in serialized form
FR6   Pass        XML files for each subject contain training inputs, and the query database contains all required information to find matches derived from the GaitAnnotate project
FR7   Pass        Example: table 6.4b on page 27
FR8   Pass        The system produces the n top matching subjects from the database when using the query engine
R1    Pass        The system uses statistical analysis for estimations
R2    Pass        See section 6.1.3 on page 21
R3    Pass        A single query, with the overhead of loading principal component data and heuristics, takes an average of 5.9 seconds on a standard desktop computer
R4    Pass        See table B.1 on page 40
R5    Pass        Average recall within top 5 for the Radial Basis Function is between 50% and 60%; average recall for random or median guessing is less than 10%

23
6.6 Evaluation against Other Known Techniques

6.6.1 As Demographic Estimation

There has been previous research in this domain of soft-biometrics which offers some comparison to the performance of this project. The comparisons will be made mostly against research conducted by Hu Han et al. [7] and their "biologically inspired" framework, which includes the performance of human estimation for the demographics of age, gender, and race. However, this is as far as evaluation for demographic estimation can extend: nobody has yet published a generalised form that is not limited to a select few categories. This project is not only able to identify twenty-three different traits, but it may use images of whole bodies, which is more suitable for low quality CCTV, a domain where research of this manner is most important.

In terms of simplicity, this project also takes a more general and malleable approach compared to facial processing and biologically inspired features [7], facial surface normals [24], and Active Appearance Models (AAM) [4], which all require human faces as training images. By contrast, the proposed method makes use of natural variances in any image (not necessarily that of a human) and classifies with any regression- or classification-based machine learning algorithm (RBF; linear regression; neural networks), resulting in a more generic framework which can be built on and improved for greater accuracy across a limitless number of demographic features.

Age Estimation

Because age was estimated with an absolute figure rather than in categories in [7], it is not directly comparable to the technique presented in this project. Han et al. achieved a mean absolute error of 3.8 ± 3.3 years with the MORPH II facial image database (without Quality Assessment), compared to 6.2 ± 4.9 years with human estimation. By comparison, assuming each of the 7 categories represents a range of 10 years, this solution obtains a mean absolute error of 4.3 ± 0.9 years, which by the research presented above is potentially better than human estimation. The measurement is flawed, however, as it assumes correct categorisation has an error of exactly 0 years (and 10 years for a one-category difference).

Gender Classification

For the gender demographic, Han et al. achieved an accuracy of 97.6%, while Guodong Guo and Guowang Mu's method [6] achieves 98.5% on the MORPH II dataset. These are both higher than the human performance on this dataset, which is 96.9%. These results exceed the accuracy of this project, which has a mean accuracy of 85%. The decrease in accuracy could be a result of using a relatively small dataset for training: just 198 training images, compared to the 2,000 images in the MORPH II dataset used by the majority of papers on demographic estimation in facial images.

Race Classification

Han et al. classify races as black or white only, with an accuracy of 99.1% compared to the human performance of 97.8% on the MORPH II dataset. Guodong Guo and Guowang Mu achieve an accuracy of 99.0%. In contrast, this project performs poorly when judging both skin colour and race: 55.25% for the former and 10.5% for the latter. The most likely reason is a mix of incorrect data in the database (some samples are clearly European, yet labelled as 'Far-Eastern' or 'Other') and poor colour balance in the photos, which can make some subjects appear over-saturated and thus more tanned.
In modern society, race and skin colour are also increasingly ambiguous, if not entirely subjective, judgements: a white annotator may consider an Indian subject to be black, while a black annotator might consider them white. Given such conflicts, it is understandable why some of the training data is inconsistent.

6.6.2 As Human Identification

Gait biometrics has been an active area of research for human identification since the 1990s and has proved to be a viable and robust method. Goffredo et al. have compiled a report on the performance of gait recognition with the techniques in use at the University of Southampton [5].

24
In the report, the recognition rates for gait techniques peak at 96% on a database of 12 subjects over 275 video sequences, tailing off to circa 52% for more acute view angles.

The proposed method is evidently not suitable for identification in its current state, with a mean recall rate of 51.75% for matches within the top-5 retrievals. If recall is restricted to the top-ranking match from the database exclusively, the rate decreases to 29.75%; this implies that most of the top-5 recalls are in fact the top-scoring result.

25
(a) (b)

Feature              Image A   Image B   Target
Age                  4         5*        5
Arm Length           3*        3*        3
Arm Thickness        2*        2*        2
Chest                3*        3*        3
Ethnicity            3         3         4
Facial Hair Colour   1*        1*        1
Facial Hair Length   1*        1*        1
Figure               3*        3*        3
Hair Colour          2         2         1
Hair Length          3*        4         3
Height               4*        3         4
Hips                 2*        2*        2
Leg Length           3*        3*        3
Leg Shape            2*        2*        2
Leg Thickness        2         3*        3
Muscle Build         2*        2*        2
Neck Length          3*        3*        3
Neck Thickness       3*        3*        3
Proportions          1*        1*        1
Sex                  2*        2*        2
Shoulder Shape       3         3         4
Skin Colour          3*        2         3
Weight               3*        3*        3

(c) Per-feature categories for each of the above images, and the actual category targets. Asterisks mark correct estimations.

Figure 6.2: The effects of occluded body parts on categoric estimations

26
(a) Subject 053 at 6.25% resolution (b) Subject 053 as a 20-colour GIF
Figure 6.3: Reduced quality images used for queries

(a) A freeze-frame of 'Jihadi John'

Feature            Category
age                MIDDLE AGED
armlength          LONG
armthickness       THICK
chest              VERY SLIM
ethnicity          OTHER
facialhaircolour   NONE
facialhairlength   NONE
figure             SMALL
haircolour         DYED
hairlength         MEDIUM
height             TALL
hips               NARROW
leglength          SHORT
legshape           VERY STRAIGHT
legthickness       AVERAGE
musclebuild        AVERAGE
necklength         VERY LONG
neckthickness      THICK
proportions        UNUSUAL
sex                MALE
shouldershape      ROUNDED
skincolour         WHITE
weight             THIN

(b) 'Jihadi John's estimated labels

27
Chapter 7 Summary and Conclusion

This project details a novel, effective, and noise-invariant way of estimating body demographics through computer vision. While the demographics themselves cannot be used as an identification process in their current form, the estimation process can greatly assist in a broad range of applications such as witness descriptions at crime scenes, targeted advertising for passers-by, and even keeping track of wildlife (there is no fathomable reason why the system cannot be trained with images of animals).

Further research in this area may allow for multiple subjects in an image, improved accuracy, improved speed, and greater uniqueness and differentiability in the demographics used, perhaps via comparative labels, in order to potentially use this as an identification mechanism to automatically find people across an array of CCTV cameras in real time.

Since the system produced respectable results despite the background being present in the training images, there is the possibility of utilising pedestrian detection techniques such as patterns of motion and appearance [23], given a suitably fast implementation (which may require native C code), to train on extracted, low-resolution subjects without requiring background removal. If training is successful across a wide range of camera angles, then in theory any CCTV camera could be used to estimate body demographics.

28
  • 35. requiring at least as many samples as the number of clusters therefore limiting the number of principal components to the quantity of images in the database. This project uses a constant 100 principal components; improvements may be observed when using more. 7.1.5 Weighting Functions As discussed in section 4.1.5 on page 8, neural networks were not implemented due to their com- plexity in setting up. There is a realistic probability that neural networks are even better than RBF at increasing noise, lighting and background invariance due to the inherent power of the models when a single hidden layer is used in a feed-forward network. 29
Bibliography

[1] Niyati Chhaya and Tim Oates. Joint inference of soft biometric features. In Biometrics (ICB), 2012 5th IAPR International Conference on, pages 466–471. IEEE, 2012.
[2] Ralph B. D'Agostino and Heidy K. Russell. Scree Test. John Wiley & Sons, Ltd, 2005.
[3] S. Denman, C. Fookes, A. Bialkowski, and S. Sridharan. Soft-biometrics: Unconstrained authentication in a surveillance environment. In Digital Image Computing: Techniques and Applications, 2009. DICTA '09., pages 196–203, Dec 2009.
[4] Xin Geng, Zhi-Hua Zhou, and Kate Smith-Miles. Automatic age estimation based on facial aging patterns. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 29(12):2234–2240, 2007.
[5] Michela Goffredo, Imed Bouchrika, John N. Carter, and Mark S. Nixon. Performance analysis for gait in camera networks. In Proceedings of the 1st ACM workshop on Analysis and retrieval of events/actions and workflows in video streams, pages 73–80. ACM, 2008.
[6] Guodong Guo and Guowang Mu. Joint estimation of age, gender and ethnicity: CCA vs. PLS. In Automatic Face and Gesture Recognition (FG), 2013 10th IEEE International Conference and Workshops on, pages 1–6. IEEE, 2013.
[7] H. Han, C. Otto, X. Liu, and A. Jain. Demographic estimation from face images: Human vs. machine performance. Pattern Analysis and Machine Intelligence, IEEE Transactions on, PP(99):1–1, 2014.
[8] Jonathon S. Hare, Sina Samangooei, Paul H. Lewis, and Mark S. Nixon. Semantic spaces revisited: Investigating the performance of auto-annotation and semantic retrieval using semantic spaces. In Proceedings of the 2008 International Conference on Content-based Image and Video Retrieval, CIVR '08, pages 359–368, New York, NY, USA, 2008. ACM.
[9] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.
[10] T. Horprasert, D. Harwood, and L. S. Davis. A statistical approach for real-time robust background subtraction and shadow detection. In Proc. IEEE ICCV, volume 99, pages 1–19, 1999.
[11] Mao-Hsiung Hung, Jeng-Shyang Pan, and Chaur-Heh Hsieh. Speed up temporal median filter for background subtraction. In Pervasive Computing Signal Processing and Applications (PCSPA), 2010 First International Conference on, pages 297–300, Sept 2010.
[12] Donald A. Jackson. Stopping rules in principal components analysis: a comparison of heuristical and statistical approaches. Ecology, pages 2204–2214, 1993.
[13] Hansung Kim, Ryuuki Sakamoto, Itaru Kitahara, Tomoji Toriyama, and Kiyoshi Kogure. Robust foreground extraction technique using gaussian family model and multiple thresholds. In Yasushi Yagi, SingBing Kang, InSo Kweon, and Hongbin Zha, editors, Computer Vision – ACCV 2007, volume 4843 of Lecture Notes in Computer Science, pages 758–768. Springer Berlin Heidelberg, 2007.
[14] B.F. Klare, M.J. Burge, J.C. Klontz, R.W. Vorder Bruegge, and A.K. Jain. Face recognition performance: Role of demographic information. Information Forensics and Security, IEEE Transactions on, 7(6):1789–1801, Dec 2012.

30
[15] C. Neil Macrae and Galen V. Bodenhausen. Social cognition: Thinking categorically about others. Annual Review of Psychology, 51(1):93–120, 2000.
[16] Mark Nixon and Alberto S. Aguado. Feature Extraction & Image Processing for Computer Vision, Third Edition. Academic Press, 3rd edition, 2012.
[17] D.A. Reid, M.S. Nixon, and S.V. Stevenage. Soft biometrics; human identification using comparative descriptions. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 36(6):1216–1228, June 2014.
[18] Daniel A. Reid. Human Identification Using Soft Biometrics. PhD thesis, University of Southampton, Apr 2013.
[19] S. Samangooei, Baofeng Guo, and M.S. Nixon. The use of semantic human description as a soft biometric. In Biometrics: Theory, Applications and Systems, 2008. BTAS 2008. 2nd IEEE International Conference on, pages 1–7, Sept 2008.
[20] Sina Samangooei and Mark S. Nixon. Performing content-based retrieval of humans using gait biometrics. In David Duke, Lynda Hardman, Alex Hauptmann, Dietrich Paulus, and Steffen Staab, editors, Semantic Multimedia, volume 5392 of Lecture Notes in Computer Science, pages 105–120. Springer Berlin Heidelberg, 2008.
[21] Ryan J. Tibshirani. Fast computation of the median by successive binning. Unpublished manuscript, http://stat.stanford.edu/ryantibs/median, 2008.
[22] Paul Viola and Michael J. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154, 2004.
[23] Paul Viola, Michael J. Jones, and Daniel Snow. Detecting pedestrians using patterns of motion and appearance. International Journal of Computer Vision, 63(2):153–161, 2005.
[24] Jing Wu, William A.P. Smith, and Edwin R. Hancock. Facial gender classification using shape-from-shading. Image and Vision Computing, 28(6):1039–1048, 2010.
[25] Zhiguang Yang and Haizhou Ai. Demographic classification with local binary patterns. In Advances in Biometrics, pages 464–473. Springer, 2007.

31
Figure A.1: Task list of proposed work
34
Figure A.2: Task list of proposed work
35
Figure A.3: Gantt chart of proposed work
36
Figure A.4: Task list of actual work
37
Figure A.5: Task list of actual work
38
Figure A.6: Gantt chart of actual work
39
Appendix B Results

Average Accuracy Per Algorithm

Feature              Num. Categories   Expectation   RBF      Random   Median
Age                  7                 14.25%        70.75%   12%      35%
Arm Length           5                 20%           79.25%   21%      55%
Arm Thickness        5                 20%           62%      23.75%   55%
Chest                5                 20%           65%      22%      55%
Ethnicity            6                 16.5%         10.5%    26%      10%
Facial Hair Colour   6                 16.5%         87%      60.25%   90%
Facial Hair Length   5                 20%           82.5%    36%      85%
Figure               5                 20%           86.25%   26%      75%
Hair Colour          6                 16.5%         61.75%   23.5%    60%
Hair Length          5                 20%           67.25%   20.75%   60%
Height               5                 20%           75.75%   15%      45%
Hips                 5                 20%           66%      24.25%   55%
Leg Length           5                 20%           66.75%   20.75%   45%
Leg Shape            5                 20%           59.5%    24.75%   50%
Leg Thickness        5                 20%           68.5%    25.75%   60%
Muscle Build         5                 20%           72.25%   22.25%   50%
Neck Length          5                 20%           75.5%    19.5%    45%
Neck Thickness       5                 20%           78.5%    22.5%    65%
Proportions          2                 50%           85%      95%      95%
Sex                  2                 50%           80%      15%      85%
Shoulder Shape       5                 20%           72.5%    20%      70%
Skin Colour          4                 25%           55.25%   47%      65%
Weight               5                 20%           75%      23.75%   59.25%

Table B.1: Per-algorithm accuracy for correctly estimating each feature on a human body. Expectation is the expected accuracy for random guessing.

40
[Bar chart: feature recognition rates for RBF training, percentage correct per feature]
(a) Accuracy of RBF on 116 subjects

[Bar chart: feature recognition rates for random guessing, percentage correct per feature]
(b) Accuracy of Random Guessing on 116 subjects

41
[Bar chart: feature recognition rates for median guessing, percentage correct per feature]
(c) Accuracy of Median Guessing on 116 subjects

Figure B.1: Correct classification percentages for each biometric feature for the preferred method (RBF), and two guessing algorithms for comparison.

42
(a) (b) (c) (d)

Feature              Image A         Image B         Image C        Image D    Target
Age                  Middle Aged     Middle Aged     Young Adult*   Adult      Young Adult
Arm Length           Long            Long            Long           Long       Average
Arm Thickness        Thick           Thick           Average*       Thick      Average
Chest                Very Slim       Slim*           Slim*          Slim*      Slim
Ethnicity            Other           European*       European*      European*  European
Facial Hair Colour   None            None            None           None       Brown
Facial Hair Length   Stubble*        None            None           None       Stubble
Figure               Average         Average         Small*         Small*     Small
Hair Colour          Grey            Grey            Blond*         Grey       Blond
Hair Length          Medium*         Medium*         Short          Short      Medium
Height               Tall            Tall            Tall           Tall       Average
Hips                 Average*        Average*        Average*       Average*   Average
Leg Length           Average         Short*          Average        Average    Short
Leg Shape            Very Straight   Very Straight   Straight       Straight   Average
Leg Thickness        Average*        Average*        Average*       Average*   Average
Muscle Build         Muscly          Muscly          Average*       Average*   Average
Neck Length          Long*           Long*           Long*          Long*      Long
Neck Thickness       Thick           Thick           Average*       Average*   Average
Proportions          Average*        Average*        Average*       Average*   Average
Sex                  Male*           Male*           Male*          Male*      Male
Shoulder Shape       Average         Average         Average        Average    Rounded
Skin Colour          Tanned          Tanned          Tanned         Tanned     White
Weight               Average*        Fat             Average*       Average*   Average

(e) Per-feature estimations for each of the above images, and a self-estimated target. Asterisks mark correct estimations

Figure B.2: Estimating demographics with images taken outdoors on an unseen subject

43
Appendix C Design Archive Table of Contents

cache: Pre-built cache files for the PrincipleComponentExtractor class and TrainingSet class.
db: Populated GaitAnnotate database.
heuristics*: An exemplary set of training weights for the RadialBasisFunctionTrainer. Heuristics marked with numerical ranges indicate the particular queries they were used for in the results presented in this document, for the purpose of repeatability.
queries: Images used to test the robustness of the solution.
scripts: Database table generation code.
src: Full source code in Java.
tests: Windows Batch files for invoking the QueryEngine class and other testing classes for quick performance analysis.
trainingdata: Images used for training and testing.

44
Wally - A System to Identify Criminal Suspects by Generating Labels from Video Footage and Still Images

Christopher Watts - cw17g12
Supervised by Mark Nixon
October 10, 2014

1 Problem Overview

Traditionally in law enforcement, an image of a criminal suspect is cross-referenced against databases such as the passport database for information. However, it is becoming increasingly common for well-organised criminals to use fake identification or, for foreign criminals, no identification at all. The proposition of this project is to create a system that, given an image or a video of a criminal suspect, will identify metrics unique to the person from sets of comparative and categorical labels, for example: height, length of forearms, width of shoulders. Comparative labels will be used over absolute labels (e.g. 'taller than' rather than roughly '5'9"') because of observed accuracy benefits. From this information, it should be possible to scan CCTV footage for pedestrians whose labels match (within a certain error margin) those of the suspect, so law enforcement can track the movements of the suspect and potentially reveal who they really are.

2 Goals

The aim of the project is to assist law enforcement in finding fugitives and criminals in video footage who are not initially identifiable by traditional techniques such as face and voice recognition. Some current examples include finding "Jihadi John," responsible for the murders of, at the time of writing, four British and American nationals in Syria, and Op Trebia, wanted for terrorism by the Metropolitan Police since 2012.

3 Scope

The application domain of this project is rather large. Ideally, it would work on any CCTV footage from any angle. However, due to time constraints on the project, some limitations will be imposed:

• The system will use computer vision to generate comparative labels for subjects whose full body is visible in either still or video imagery.
• Initially, all footage must be front body view, front facial view or side body view at a constant elevation and angle with the same lighting.
• When analysing footage of the full body, the subject may be masked.
• The system will use machine learning to train the label generator on a limited set of subjects from the Soton HiD gait database with known comparative and categorical labels.
• The system will match subsequent footage of a suspect to the most likely subjects known by the system, ranked by certainty.

1