Semantic Indexing of Wearable Camera Images: Kids’Cam Concepts
Alan F. Smeaton
(Dublin City University)
… and …
Kevin McGuinness, Cathal Gurrin, Jiang Zhou, Noel E. O’Connor, Peng Wang, Brian Davis, Lucas Azevedo, Andre Freitas, Louise Signal, Moira Smith, James Stanley, Michelle Barr, Tim Chambers and Cliona Ní …
• Automatic assignment of one-per-class concept detectors is now commonplace.
• We’re interested in the challenging case of processing images from wearable cameras, where improvement is needed.
• We try to exploit some limited manual annotations to improve the accuracy of automatic concept weights.
• This work is not complete, it is ongoing, but the story so far is worth telling.
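The one-per-class setup can be sketched as follows; the concept names, scores, and threshold below are invented for illustration, not taken from our system:

```python
# Sketch of one-per-class concept detection: each concept has an independent
# detector that emits a confidence score, and an image's tags are the concepts
# whose score clears a threshold. Concepts, scores and threshold are made up.
CONCEPTS = ["food", "signage", "tree", "person"]

def tag_image(scores, threshold=0.5):
    """Keep the concepts whose independent detector scored above threshold."""
    return [c for c, s in zip(CONCEPTS, scores) if s >= threshold]

print(tag_image([0.91, 0.12, 0.66, 0.40]))  # -> ['food', 'tree']
```

Note that each decision is made independently per concept, which is exactly the limitation the rest of this talk addresses.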
Analysis of Visual Media
• More progress has been made in the last few years than in the previous decade
• Incorporation of deep learning, plus the availability of huge searchable image resources and training data
• Automatic image tagging is now hosted and offered by services like Aylien, Imagga, Clarifai, and others.
Analysis of Visual Media
• These developments are welcome … but … tagging vocabularies are restrictive
• How do these tags map to the vocabulary users draw on when formulating queries?
• An alternative approach is tagging at query time, but it’s expensive and not scalable to huge collections.
• Almost all work on concept detection is based on one concept at a time.
• TRECVid tried simultaneous detection of concept pairs like “computer screen with telephone” and “airplane with clouds”.
• Limited success, but “Government Leader with Flag” was OK!
• Detection of concepts independently needs a course correction
– It doesn’t avail of all available information sources
– It doesn’t map to a user’s search vocabulary
Long-term approach …
[Diagram: mapping from Images to a Concept Set]
How can a single image be mapped to two different vocabularies?
Using NL for image search … tagging
• NL is fraught with complexities and ambiguities at all levels:
– Lexical level: polysemy
– Syntactic level: structural ambiguity
– Semantic level: multiple interpretations
– Discourse level: pronoun resolution
• Plus vocabulary limitations when finding a word or phrase to describe an image
• When using computers to help search for image data, these language challenges are exacerbated, yet we assume a “simplistic” approach of tagging by a set of concepts, notwithstanding what we’re seeing with captioning here today
• Tagging is very useful for smaller, niche applications in restricted domains with manual tagging, but we see scalability problems
– Addressed by progress in automatic tagging, but we’re tolerant of errors in the automatic tags
In this paper …
• We are interested in images from wearable cameras, with lots of juicy content
• Notoriously difficult to process automatically because …
– Blurring caused by wearer moving at image capture
– Occlusions from wearer’s hands
– Lighting conditions
– Fisheye lens for wider perspective causing distortion
– First-person viewpoint, but not necessarily what the wearer is looking at
– Content varies hugely across subjects
• Applications in memory support, behaviour recording and analysis, security, other work-related areas, and the quantified self (QS).
• In this paper we work with wearable camera data from school children,
for analysis of their environments
The Kids’Cam Project
• Child obesity is a significant public health concern, worldwide.
• Unequivocal evidence that marketing of energy-dense and nutrient-
poor foods and beverages is a causal factor in child obesity.
• Evidence of children’s total exposure to advertising of poor foodstuffs
is not quantified.
• Kids’Cam study aimed to determine the frequency, nature and duration
of children’s exposure to such marketing.
• 169 randomly selected children aged 11 to 13, from 16 schools in Wellington, NZ, each wore an Autographer and carried a GPS logger for 4 days … images every 7 seconds, GPS every 5 seconds.
– 1.5M images, 2.5M GPS datapoints
• Manual annotation for food / beverage marketing using a 3-level, 53-concept ontology … inter-annotator reliability of 90%.
Training-Free Refinement
• Current concept-at-a-time classifiers do not consider inter-concept relationships or dependencies, yet these do exist
• To improve one-per-class detectors, we post-process detection scores
– We take advantage of concept co-occurrence and re-occurrence, which depend on the particular collection
– We take advantage of local (temporal) neighbourhood information, where concepts are likely to re-occur close in time
– We use GPS location information: concepts identified by a person at a location may re-occur subsequently at that location
• Training-Free Refinement (TFR) is based on non-negative matrix factorisation, described in the paper
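As a rough illustration of the matrix-factorisation idea behind TFR (not the paper’s exact formulation), one can factorise the images-by-concepts confidence matrix into a low-rank product and use the reconstruction as a smoothed version of the scores; the rank, update rule, iteration count, and toy data here are all assumptions for illustration:

```python
import numpy as np

# Illustrative sketch: factorise the non-negative images-by-concepts matrix V
# into low-rank W @ H (classic multiplicative-update NMF), then use the
# reconstruction to smooth outliers and fill gaps in detector confidences.
def nmf_smooth(V, rank=1, iters=200, eps=1e-9):
    rng = np.random.default_rng(0)
    n, m = V.shape
    W = rng.random((n, rank))
    H = rng.random((rank, m))
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # multiplicative updates keep
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # all entries non-negative
    return W @ H  # smoothed confidences

# Toy confidence matrix: rows = images in temporal order, cols = concepts.
V = np.array([[0.9, 0.1],
              [0.0, 0.1],   # likely a missed detection of concept 0
              [0.8, 0.2]])
V_smooth = nmf_smooth(V)
```

With a rank-1 factorisation, the zero entry in the middle row is pulled toward the dominant pattern shared by neighbouring images, which is the intuition behind smoothing, not a claim about the paper’s exact algorithm.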
• We do not know the accuracy of assignment of the 1,000 automatic concepts, but we do know the accuracy of assignment of the 53 manual concepts … and we have 1.5M images, each mapped into 2 concept spaces
• Can we adjust values in (b), anchored and pivoting around (a), in addition to having already used local, within-collection distributions?
[Figure: (a) manual, correct annotations; (b) automatic detections]
Cross-mapping concept spaces
• Distributional semantics – corpus-driven approach – based
on hypothesis that co-occurring words in similar contexts
have similar meaning
• Using word2vec in DINFRA, we can map all words in a vocabulary to an n-dimensional vector space, where we can obtain relatedness scores among the words
• Figure illustrates an example
• For each image in Kids’Cam we can
evaluate relatedness between human
annotation and automatic concepts
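The relatedness computation can be sketched with cosine similarity over word vectors; the 3-d vectors below are invented for illustration (real word2vec embeddings, as served via DINFRA, have hundreds of dimensions):

```python
import numpy as np

# Toy illustration of distributional relatedness: words occurring in similar
# contexts get similar vectors, and cosine similarity scores their relatedness.
# These vectors are made up; they are not real word2vec embeddings.
vectors = {
    "burger": np.array([0.9, 0.1, 0.0]),
    "pizza":  np.array([0.8, 0.2, 0.1]),
    "tree":   np.array([0.0, 0.9, 0.4]),
}

def relatedness(a, b):
    """Cosine similarity between the vectors for words a and b."""
    va, vb = vectors[a], vectors[b]
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))

# "burger" should score as more related to "pizza" than to "tree"
```

This is the score we compute between each image’s human annotation and each of its automatic concepts.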
• We have top-ranked relatedness to the manual tags …
• Our first effort is to simply multiply, as in the Table, but it’s hard to see the impact of this
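That “simply multiply” first effort amounts to scaling each automatic concept’s confidence by its relatedness to the image’s trusted manual tag; the concepts, confidences, and relatedness scores below are invented for illustration:

```python
# Sketch of the "simply multiply" reweighting: scale each automatic concept's
# confidence by its word2vec relatedness to the image's manual annotation.
# All numbers here are made up for illustration.
auto_concepts = {"hamburger": 0.6, "tree": 0.7, "signage": 0.5}
relatedness_to_manual_tag = {"hamburger": 0.9, "tree": 0.1, "signage": 0.6}

reweighted = {c: conf * relatedness_to_manual_tag[c]
              for c, conf in auto_concepts.items()}
# Concepts related to the trusted manual tag keep most of their weight;
# unrelated ones are suppressed.
```

As the Table suggests, a uniform multiplication preserves relative order within related concepts, which may be why its impact is hard to see.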
And the result is …
• … and that’s where we currently are!
Conclusions and Future Work
• Since automatic concept-detection using pre-defined models has made so much progress recently, we’re seeing a vocabulary / concept mismatch between detector outputs and users’ queries
• Using 1.5M Kids’Cam images from wearable cameras, we have used
within-collection distributions to “smooth” concept weights (outliers and
gaps) in TFR
• We are trying to pivot around some manual annotations in order to
improve concept accuracies
• But, we need …
– More concepts – a richer vocabulary of them
– More varied manual annotations, not just fast food adverts
– A more global or collection-wide way to combine concept confidences and relatedness to known manual annotations
– Some validation of accuracy of automatic concepts to measure
accuracy of our post-processing
Finally, a plug …
• TRECVid Video Captioning Pilot task, 2016
• 2,000 Vine videos, manually annotated with captions
• 8 participating groups (CMU, CUHK, DCU, GMU,
NII, UvA, Sheffield)
• Two tasks …
– For each video, rank the 2,000 captions –
metric is MRR
– For each video, generate your own caption –
metrics are BLEU, METEOR, and the UMBC STS
(Semantic Textual Similarity) service
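The MRR metric for the ranking task is simple enough to sketch in a few lines; the example ranks are invented:

```python
# Sketch of Mean Reciprocal Rank (MRR), the metric for the caption-ranking
# task: for each video, take 1/rank of the correct caption in the submitted
# ranking, then average over all videos.
def mean_reciprocal_rank(ranks_of_correct_caption):
    """ranks are 1-based positions of the true caption in each ranked list."""
    return sum(1.0 / r for r in ranks_of_correct_caption) / len(ranks_of_correct_caption)

# e.g. correct caption ranked 1st, 2nd and 4th for three videos:
print(mean_reciprocal_rank([1, 2, 4]))  # -> 0.5833...
```

A perfect system, ranking the true caption first for every video, scores 1.0.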
• Lots of lessons learned, which we will build upon for a full task in 2017, probably again using Vine videos