Bertini - Automatic Metadata Extraction in VidiVideo & im3i @EUscreen Mykonos

Automatic Metadata Extraction

Marco Bertini
Università di Firenze - MICC
www.micc.unifi.it

Thursday 24 June 2010

The problem

The massive increase in digital audio-visual information
poses high demands on advanced storage and search
engines for consumers and professional archives.
Video is now a natural form of communication
for the Internet and mobile devices.
Video search engines are the product of progress in many
technologies: visual and audio analysis, machine learning
techniques, as well as visualization and interaction.


Two solutions

www.vidivideo.info www.im3i.eu


VidiVideo: project overview
The VidiVideo project addressed the
challenge of creating a substantially
enhanced semantic access to video,
implemented in a search engine.
The outcome of the project is an audio-visual search
engine composed of two parts: an automatic annotation
part, which runs off-line and uses a thesaurus of
detectors for more than 1000 semantic concepts to
process and automatically annotate the video, and an
interactive part that provides a video search engine
for both technical and non-technical users.


VidiVideo: project results
The automatic annotation part of the system performs audio
and video segmentation, speech recognition,
speaker clustering and semantic concept detection.
The VidiVideo system has achieved the highest
performance in the most important object and concept
recognition international contests (PASCAL VOC and
TRECVID).
The interactive part provides desktop-based and
web-based search engines. The system permits different
query modalities (free text, natural language, graphical
composition of concepts using boolean and temporal relations
and query by visual example) and visualizations for video
retrieval and browsing.

Call Identifier FP7-SME-2010-1
Submitted 03 December 2009

VidiVideo: project partners
Name of the co-ordinating person Dr.-Ing. Georgios Ioannidis
E-Mail gi@in-two.com
Fax +49-179-33-2286677

No.  Participant Name                                    Type  Short Name  Country
1    IN2 search interfaces development Ltd               SME   IN2         UK
2    spring techno GmbH                                  SME   SPRING      DE
3    VISup Srl                                           SME   VISUP       IT
4    Hogeschool voor de Kunsten Utrecht                  RTDP  HKU         NL
5    University Firenze                                  RTDP  UNIFI       IT
6    Instituto de Engenharia de Sistemas e Computadores  RTDP  INESC-ID    PT


IM3I: project overview
IM3I aims to provide the creative media sector with new
ways of searching, summarising and visualising large
multimedia archives.
IM3I will provide a service-oriented architecture
that allows multiple viewpoints upon multimedia data that
are available in a repository, and provides better ways to
interact with and share rich media. This paves the way for a
multimedia information management
platform which is more flexible, adaptable and
customisable than current repository software.
This in turn enables new opportunities for content
owners to exploit their digital assets.


IM3I: project results
Developed a set of tools for automatic audio-visual
annotation and search
Developed a set of web services to manage, create and
orchestrate the indexing services
Developed a set of specialized search and
management interfaces
IM3I authoring platform: allows professional users to
import and publish repositories of digital media, author
web-based environments for end-users, and create
elaborate workflow patterns and search & retrieval interfaces
that support a diversity of end-user interactions and scenarios


IM3I: project partners


The VidiVideo backend


Video and scene segmentation
•Developed a new gradual transition detection algorithm
•Uses novel individual criteria that exhibit less sensitivity to local or global motion:
•Color Coherence Change
•Macbeth Color Histogram Change
•Luminance Center of Gravity Change
•Combines these criteria (and their multi-scale extensions) using a machine learning
technique
•Advantages:
•Significantly improved performance
•Lack of need for any threshold selection
Scene or story unit: collection of temporally
consecutive shots which are about the
same topic or event
•Developed a multimodal scene
segmentation based on Scene Transition
Graph
• Significantly improved performance
over visual-only STG
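
The slide names "Luminance Center of Gravity Change" as one of the transition criteria but does not give its formula. The following is a minimal sketch under an assumed definition: the luminance-weighted centroid of each frame, and the Euclidean distance between the centroids of consecutive frames. The function names and the toy frames are hypothetical. In the system described above, such values feed a learned classifier together with the other criteria; they are not compared with a hand-picked threshold.

```python
import math


def luminance_center(frame):
    """Return the (x, y) luminance-weighted centroid of one frame.

    `frame` is a list of rows, each a list of luminance values.
    """
    total = sum_x = sum_y = 0.0
    for y, row in enumerate(frame):
        for x, value in enumerate(row):
            total += value
            sum_x += x * value
            sum_y += y * value
    if total == 0:
        # An all-black frame has no defined centroid; use the image centre.
        return ((len(frame[0]) - 1) / 2, (len(frame) - 1) / 2)
    return (sum_x / total, sum_y / total)


def center_of_gravity_change(frames):
    """Distance between the luminance centroids of consecutive frames."""
    centers = [luminance_center(f) for f in frames]
    return [math.dist(a, b) for a, b in zip(centers, centers[1:])]


# Toy 2x3 frames: brightness sits on the left, then moves to the right.
bright_left = [[200, 0, 0], [200, 0, 0]]
bright_right = [[0, 0, 200], [0, 0, 200]]
changes = center_of_gravity_change([bright_left, bright_left, bright_right])
```

Identical frames give a change of zero, while the frame where the bright area jumps sides gives a large value, which is the kind of signal a transition detector can learn from.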


Audio analysis in VidiVideo
• Audio segmentation / audio diarization
• Audio events detection (AED)
• Automatic speech recognition (ASR)
• Language identification (LID)


Block diagram of audio processing
[Figure: audio event detection framework and its current concept detectors. Audio segmentation splits the audio data into non-speech, speech and music.
Non-speech: feature extraction, feature reduction and SVM classification; 61 audio event (AE) detectors, plus 10 sports detectors in testing.
Speech: speaker ID and reasoning; 6 speech detectors (narrator, anchor, ...) and 3 detectors (monologue, dialogue, ...).
Music: music detector with 3 classes (base) and 4 new classes in testing.
Telephone: low-frequency detector; 1 telephone detector.
Total: 74 detectors, plus 10 in testing; changing the music detectors alters the count by -3 +4. Audio processing is combined with video processing (audio + video).]


Audio events corpora
• Sound effect corpus: 18,700 short files (290 hrs.),
provided by B&G. Intrinsically labelled corpus.

• Selection of a subset for training: the 61 semantic concepts
with the most examples.

• Extended feature set: MFCCs, ZCR, Brightness / Audio
spectrum centroid, Bandwidth / Audio spectrum
spread, Audio spectrum envelope, Audio spectrum
flatness, Pitch, Harmonicity

• Tested on Movies, Documentaries, Broadcast News,
and Talk Shows (TS).

• Mean Average Precision=0.459 (6 test concepts)
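
Two of the features listed above, the zero-crossing rate (ZCR) and the spectrum centroid (brightness), are simple enough to sketch directly. The code below is an illustration using only the Python standard library and a naive DFT; it is not the feature extractor used in the project, and the frame length, sample rate and function names are assumptions chosen for the example.

```python
import cmath
import math


def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(samples) - 1)


def spectral_centroid(samples, sample_rate):
    """Magnitude-weighted mean frequency (Hz) of one frame, via a naive DFT."""
    n = len(samples)
    weighted = total = 0.0
    for k in range(n // 2 + 1):  # non-negative frequencies only
        coeff = sum(
            s * cmath.exp(-2j * math.pi * k * t / n)
            for t, s in enumerate(samples)
        )
        magnitude = abs(coeff)
        weighted += (k * sample_rate / n) * magnitude
        total += magnitude
    return weighted / total if total else 0.0


# Toy frame: a pure 1 kHz tone sampled at 8 kHz (exactly 8 periods).
sample_rate = 8000
frame = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(64)]
zcr = zero_crossing_rate(frame)
centroid = spectral_centroid(frame, sample_rate)
```

For a pure tone the centroid sits at the tone frequency and the ZCR is about two crossings per period; noisy or percussive sounds push both values up, which is why they help separate audio event classes.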

Machine learning
• Learning of many independent binary classification
tasks is computationally expensive

• KDA using Spectral Regression to solve this problem:

• The time complexity scales linearly with respect to
number of labels (i.e. concepts)

• Training in just 1.3 hours compared to 30.2 hours
using SVM, over 20 times faster! (MAP ~ the same)

• Tested on Pascal VOC 2008 (20 Concepts)

• Best Method in Pascal VOC 2008

• Ranked First in 9 out of 20 concepts
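
The comparison above is stated in terms of mean average precision (MAP) and training time. As a reminder of what those numbers mean, here is a small sketch of the usual non-interpolated average precision for a ranked list, averaged over concepts. It assumes every relevant item appears somewhere in the ranking; the concept names and rankings are made up for illustration, and only the 1.3 h and 30.2 h figures come from the slide.

```python
def average_precision(ranked_relevance):
    """AP for one concept: mean precision at the ranks of the relevant items."""
    hits = 0
    precisions = []
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(rankings):
    """MAP: the mean of the per-concept average precisions."""
    return sum(average_precision(r) for r in rankings.values()) / len(rankings)


# Hypothetical rankings: True marks a relevant shot at that rank.
rankings = {
    "concept_a": [True, False, True, False],
    "concept_b": [False, True],
}
map_score = mean_average_precision(rankings)

# Training-time ratio quoted on the slide: SVM hours / KDA hours.
speedup = 30.2 / 1.3
```

The ratio works out to a little over 23, consistent with the "over 20 times faster" claim.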

Color Features

Point sampling
• Harris-Laplace
• Dense sampling

Color Descriptor
• SIFT
• OpponentSIFT
• WSIFT
• rgSIFT
• Transformed color SIFT

Spatial Pyramid
• 1x1
• 2x2
• 1x3
Results
MediaMill Semantic Video Search Engine at TRECVID 2009
[Chart: TRECVid 2009 Concept Detection Task Submissions, comparing the MediaMill concept detection method (our results) with 216 other concept detection methods]
[Chart: Interactive Search Task Submissions, comparing 2 users of the MediaMill video search engine with 22 users of other video retrieval systems]

•Good local descriptors: SIFT, OpponentSIFT, rgSIFT/WSIFT, Transformed color SIFT
•Combining these color features gives state-of-the-art performance
•Drawback: computational cost, reduced by adopting GPU implementations (codebook creation is 80% of CPU time!) for a 17x speed-up

The IM3I backend


Visual annotation
• Split a video by detecting shots and large content changes
with a very fast algorithm
• Use different annotation strategies and types of
detectors:
• low level (color, B/W, motion)
• Haar-based boosted classifiers
• HOG + SVMs
• Bag-of-words
• k-NN + voting
• simple MPEG-7 XML format (full and fragment)

Baseline: typical BoW
[Figure: feature extraction, hierarchical clustering, visual words histogram, learning]
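
The histogram step of the baseline pipeline can be sketched as follows. The vocabulary of visual words is assumed to be given already (on the slide it comes from hierarchical clustering, which is not reproduced here), and the two-dimensional descriptors are toy stand-ins for real SIFT or SURF vectors. All names are hypothetical.

```python
def nearest_word(descriptor, vocabulary):
    """Index of the visual word (cluster centre) closest to a descriptor."""
    def squared_distance(word):
        return sum((d - w) ** 2 for d, w in zip(descriptor, word))

    return min(range(len(vocabulary)), key=lambda i: squared_distance(vocabulary[i]))


def bow_histogram(descriptors, vocabulary):
    """Normalised count of descriptors assigned to each visual word."""
    counts = [0] * len(vocabulary)
    for descriptor in descriptors:
        counts[nearest_word(descriptor, vocabulary)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts


# Toy vocabulary of three words and four local descriptors from one image.
vocabulary = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
descriptors = [[0.1, 0.0], [0.9, 1.1], [1.0, 0.8], [0.2, 0.1]]
histogram = bow_histogram(descriptors, vocabulary)
```

The resulting fixed-length histogram is what the learning stage receives, regardless of how many keypoints were sampled in the image.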


Fusion schemes

• Early fusion: integrates unimodal features before learning concepts.

• Late fusion: first reduces unimodal features to separately learned concept
scores, then integrates these scores to learn concepts.
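
The difference between the two schemes is mainly one of data flow, which the sketch below tries to make concrete. Plain linear scorers stand in for the learned models, and the feature values, weights and function names are all invented for the example.

```python
def linear_score(features, weights):
    """Stand-in for a learned model: a plain weighted sum."""
    return sum(f * w for f, w in zip(features, weights))


def early_fusion_score(visual, audio, joint_weights):
    """Concatenate unimodal features, then apply one model to the joint vector."""
    return linear_score(visual + audio, joint_weights)


def late_fusion_score(visual, audio, visual_weights, audio_weights, fusion_weights):
    """Score each modality separately, then combine the two scores."""
    scores = [
        linear_score(visual, visual_weights),
        linear_score(audio, audio_weights),
    ]
    return linear_score(scores, fusion_weights)


visual, audio = [0.2, 0.8], [0.5]
early = early_fusion_score(visual, audio, joint_weights=[1.0, 0.5, 2.0])
late = late_fusion_score(
    visual, audio, [1.0, 0.5], [2.0], fusion_weights=[0.5, 0.5]
)
```

Early fusion trains a single model on a longer vector; late fusion trains one model per feature type plus a small combiner, so each model can be tuned or replaced independently.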

Early fusion approach

[Figure: early fusion pipeline with hierarchical clustering]

• Hypothesis: MSERs isolate semantically relevant information.

• Idea: represent points by their spatial relation to the regions: inside, outside, or just
on the border

• Sampling: SIFT-SURF, dense.
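
One way to read the idea above is as a labelling of each sampled point by its relation to a detected region. The sketch below assumes a region is available as a set of pixel coordinates (a stand-in for one MSER) and uses 4-connectivity to decide what counts as the border; both choices are assumptions made for illustration, not details taken from the slides.

```python
def point_relation(point, region):
    """Classify a pixel as 'inside', 'border' or 'outside' a region.

    `region` is a set of (x, y) pixels standing in for one detected MSER.
    """
    if point not in region:
        return "outside"
    x, y = point
    neighbours = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    if all(n in region for n in neighbours):
        return "inside"
    return "border"


# Toy region: a 3x3 block of pixels with corners (1, 1) and (3, 3).
region = {(x, y) for x in range(1, 4) for y in range(1, 4)}
relations = [point_relation(p, region) for p in [(2, 2), (1, 2), (0, 0)]]
```

The resulting label can then be attached to the point's SIFT or SURF descriptor before clustering, so that the fusion happens at feature level.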


Late fusion approach
[Figure: two feature pipelines, each with its own hierarchical clustering, whose outputs are combined]

• Use SURF/SIFT + MSER

• Use geometric descriptors for MSERs


Test: baseline
Method | Sampling | # points | Time | Time | Avg. accuracy | Max accuracy

• Best: SURF 64 Grid 10 (accuracy, computational cost)
• SURF 64 Grid 5: +7-8% accuracy, +300% time
• the number of points influences accuracy


Test: early fusion
Method | Sampling | # points | Time | Time | Avg. accuracy | Max accuracy

• Best: EF SURF 64 Grid 10 (accuracy, computational cost)

• EF SURF 64 Borders: many points, accuracy ~ that of Grid 10 but higher
computational costs

• EF SURF 64 Grid 10 is worse than SURF 64 Grid 10, but much faster (50% of
execution time)


Test: late fusion
Method 1 | Method 2 | Accuracy

• weights of 0.6 (best method) and 0.4 (worst method) lead to good results
• best performance: dense sampling + sparse sampling
• best combination: SURF 64 + EF SURF 64 Grid 10 (improved accuracy, modest
computational cost increase)
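
The 0.6 / 0.4 weighting reported above can be illustrated with a small score-level combination. Only the two weights come from the slide; the class names, scores and validation accuracies below are invented, and the rule "the more accurate method gets the larger weight" is the assumed reading of "best" and "worst".

```python
def weighted_late_fusion(scores_a, scores_b, accuracy_a, accuracy_b,
                         best_weight=0.6, worst_weight=0.4):
    """Combine per-class scores of two methods, favouring the more accurate one."""
    if accuracy_a >= accuracy_b:
        w_a, w_b = best_weight, worst_weight
    else:
        w_a, w_b = worst_weight, best_weight
    return {c: w_a * scores_a[c] + w_b * scores_b[c] for c in scores_a}


# Hypothetical scores for one image whose true class is "car".
dense = {"car": 0.45, "boat": 0.55}   # better method overall, wrong here
sparse = {"car": 0.90, "boat": 0.10}  # weaker method overall, right here
fused = weighted_late_fusion(dense, sparse, accuracy_a=0.70, accuracy_b=0.65)
predicted = max(fused, key=fused.get)
```

In this toy case the weaker method is confident enough to overturn the stronger method's mistake, which is one way two methods can correct each other's errors.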

Conclusions
• Early fusion strategies:
• ~ baseline accuracy
• faster
• Late fusion strategies:
• better accuracy than baseline
• each method corrects some errors made by the other
• fuse keypoints/regions (SURF, fusion of SURF and
MSER)

• IM3I users will be able to choose what’s best for them


The users


Video search engine
Our goal is to provide a search engine for videos
for both technical and non-technical users.
Provide different interfaces that permit different query
modalities: free-text, natural language,
graphical composition of concepts using boolean and
temporal relations and query by visual example.
In addition, exploit ontologies and their structure
to encode semantic relations between
concepts, which permits, for example, expanding queries to
synonyms and concept specializations.
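
Query expansion over an ontology can be sketched with two small lookup tables. The concepts, synonyms and specializations below are hypothetical, and a real system would query an ontology store and a reasoner instead of Python dictionaries.

```python
# Hypothetical ontology fragments, for illustration only.
SYNONYMS = {"car": ["automobile"], "boat": ["ship"]}
SPECIALIZATIONS = {"vehicle": ["car", "boat"], "car": ["taxi"]}


def expand_query(concept):
    """Return the concept plus its synonyms and all transitive specializations."""
    expanded, pending = [], [concept]
    while pending:
        current = pending.pop(0)
        if current in expanded:
            continue  # already visited; also guards against cycles
        expanded.append(current)
        pending.extend(SYNONYMS.get(current, []))
        pending.extend(SPECIALIZATIONS.get(current, []))
    return expanded


terms = expand_query("vehicle")
```

A search for "vehicle" then also retrieves shots annotated only with a more specific concept such as "taxi".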


Sirio and Orione
• Design goals/assumptions:

• semantic content-based retrieval

• efficient web-based interface

• System features:

• Sirio is a Rich Internet Application (in Adobe Flex) front end.

• Orione is a web service search engine

• Support for multiple ontologies and ontology reasoning

• Results are in Media RSS format (queries treated as RSS feeds)

• New search engine able to scale to large number of instances of ontology concepts

• System interface query options:

• ontology exploration using a graph-based view

• compact keyframe-based results presentation / streaming videos

• concept drag&drop facility (to build complex queries)

• natural language query (with Boolean/temporal ops.)

• free text query (for Google-like search)
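
Returning results as Media RSS means each hit becomes an RSS item carrying elements from the Media RSS namespace. The sketch below builds such a feed with the standard library; the result fields, URLs and channel text are hypothetical, and the actual feed layout produced by Orione is not documented in these slides.

```python
import xml.etree.ElementTree as ET

MEDIA_NS = "http://search.yahoo.com/mrss/"  # Media RSS namespace
ET.register_namespace("media", MEDIA_NS)


def results_to_media_rss(query, results):
    """Serialise search results as an RSS 2.0 feed with Media RSS elements."""
    rss = ET.Element("rss", {"version": "2.0"})
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = f"Results for: {query}"
    ET.SubElement(channel, "link").text = "http://example.org/search"
    ET.SubElement(channel, "description").text = "Video search results"
    for result in results:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = result["title"]
        ET.SubElement(
            item,
            f"{{{MEDIA_NS}}}content",
            {"url": result["video_url"], "medium": "video"},
        )
        ET.SubElement(
            item, f"{{{MEDIA_NS}}}thumbnail", {"url": result["keyframe_url"]}
        )
    return ET.tostring(rss, encoding="unicode")


# Hypothetical single search hit.
feed = results_to_media_rss(
    "boat",
    [{
        "title": "Shot 12",
        "video_url": "http://example.org/v/12.mp4",
        "keyframe_url": "http://example.org/k/12.jpg",
    }],
)
```

Because the output is an ordinary feed, any client that can subscribe to RSS can treat a saved query as a feed of matching videos.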


Sirio and Orione


Andromeda
• semantic content-based browsing

• efficient web-based interface using RIA

• System features:

• Query manager as a Rich Internet Application (in Adobe Flex). Connects to web service (search engine)

• Support for multiple ontologies and ontology reasoning

• System interface query options:

• Shows the concepts with more instances in a concept cloud view

• Graph representation of semantic data structure

• Multiple automatic layout algorithms for spatial positioning and manual drag & drop

• Thumbnails view of the instances of each concept

• Access to video metadata and video streaming

• Access to social content related to ontology concepts (Flickr, YouTube, and real time tweets from Twitter)


Andromeda


Pan

• complete/correct automatic annotations

• help in training new automatic concept detectors

• System features:

• Rich Internet Application (in Adobe Flex).

• video streaming using the same system as Sirio and Andromeda

• new backend

• geotagging using Google Maps

• System interface options

• Integrated with web-based search engine and automatic video annotation

• Multiple user profiles: a simple user may change his own annotations, while a super user can import the annotations of other users, e.g. to supervise the annotation process within an organization.


Pan


Pan


Daphnis

• build on image tagging made popular by Flickr and tag clouds

• connect to social web sites

• allow CBIR

• System features:

• Rich Internet Application (in Adobe Flex).

• Connects to Flickr (and also Facebook, if needed)

• Approximate nearest neighbour search using MPEG-7 descriptors, to scale to large number of images

• System interface options

• users can tag images and retrieve images based on tags, or use tags to filter the results of similarity based retrieval.

• Ongoing work:

• merging with automatic video annotation for automatic tagging

• adoption of mechanisms for tag suggestion, based on recent research work in this field (use content, tags and geolocalization)


Daphnis



Daphnis


IM3I: authoring platform
A CMS approach to repository
analysis, authoring and publication


IM3I: authoring platform
Authoring IM3I end-user functionality typically covers 5
distinct stages:

• Importing an existing repository from RSS and various
XML streams

• Extending the associated datamodel

• Editing layout and editing features

• Editing Search and Retrieval interfaces

• Embedding the IM3I end-user interfaces in a (corporate)
website


Editing workflow demo
•Step 1: Importing a video-repository
•Step 2: Enhancing the datamodel
•Step 3: Authoring layouts
•Step 4: Publishing the repository


I: Importing a repository

•Importing an existing repository to an internal and
flexible datamodel
•Aggregating and harmonizing multiple repositories
•Visualisation of markup and preview of contents
•Flexible mapping by drag-and-drop


I: Importing a repository

Mapping the
contents of video
RSS to an IM3I
Datamodel
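
The mapping step can be pictured as a table from source tags to datamodel fields, applied to every item of the feed. The sketch below uses a tiny inline RSS document and an invented mapping; the field names (`name`, `media_uri`, `date`) are hypothetical and do not reflect the actual IM3I datamodel, where the mapping is defined by drag-and-drop in the authoring interface.

```python
import xml.etree.ElementTree as ET

# Minimal RSS sample standing in for an imported video repository.
RSS_SAMPLE = """<rss version="2.0"><channel>
  <item><title>Harbour at dawn</title><link>http://example.org/v/1</link>
        <pubDate>Thu, 24 Jun 2010 09:00:00 GMT</pubDate></item>
</channel></rss>"""

# Hypothetical mapping: RSS tag -> datamodel field.
FIELD_MAPPING = {"title": "name", "link": "media_uri", "pubDate": "date"}


def import_rss(xml_text, mapping):
    """Turn each RSS item into a record keyed by datamodel field names."""
    records = []
    for item in ET.fromstring(xml_text).iter("item"):
        record = {}
        for source_tag, target_field in mapping.items():
            element = item.find(source_tag)
            if element is not None and element.text:
                record[target_field] = element.text.strip()
        records.append(record)
    return records


records = import_rss(RSS_SAMPLE, FIELD_MAPPING)
```

Keeping the mapping as data rather than code is what makes it possible to harmonize several differently structured feeds into one datamodel.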


II: Enhancing the Datamodel
•Datamodels contain the descriptions of your
repository and in this way stipulate what can be
shown to, or retrieved by, an end-user.

•Datamodels can reference each other
•Datamodels can be extended over time by adding
elements

•Elements are based on types: media files, URIs, date,
string, etc.

•Elements can be shared across datamodels to allow
search & retrieval across multiple collections
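
The properties listed above (typed elements, extension over time, sharing across datamodels) can be summarised in a small data-structure sketch. The class names, fields and example collections are hypothetical and only illustrate the idea; they are not the IM3I schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Element:
    name: str
    type: str  # e.g. "media file", "URI", "date", "string"


@dataclass
class Datamodel:
    name: str
    elements: list = field(default_factory=list)

    def add(self, element):
        """Extend the datamodel with a new element, ignoring duplicates."""
        if element not in self.elements:
            self.elements.append(element)


def shared_elements(a, b):
    """Elements present in both datamodels, usable for cross-collection search."""
    return [e for e in a.elements if e in b.elements]


title = Element("title", "string")
news = Datamodel("news", [title, Element("broadcast_date", "date")])
films = Datamodel("films", [title, Element("trailer", "media file")])
films.add(Element("translation", "string"))  # extending a datamodel later
common = shared_elements(news, films)
```

A query on a shared element such as `title` can then run over both collections, while collection-specific elements remain local to their datamodel.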


II: Enhancing the Datamodel

Adding a ‘translation’ element to the datamodel

III: Layout and Functionality
Easy manipulation of a repository's layout by:

•Table metaphor (easy editing of table
characteristics)

•Drag and drop graphical elements
•Drag and drop contents of repository in cells
•Easy manipulation of look and feel
•Easy addition of editing functionalities to a layout
•Easy preview and markup functionalities


Defining a layout table


Dragging repository contents to layout


Previewing layout

IV: Embedding in website

Easy blend-in of layouts in corporate websites

•By means of plugins for CMSs (e.g. Webmanager,
WordPress, Typo3)

•By <embed> </embed>
•Allowing for elaborate workflow patterns in
combining multiple layouts


IV: Embedding in website

Original
contents Added
Translation
Functionality


The super users


Atlante - process manager
• Web application that is used for creation, technical administration and monitoring of the IM3I processing pipeline (e.g. automatic annotation process, media transcoding, etc.)

• This web application has multiple user profiles:

• managers

• administrators

• Main functions of this application are:

• creation of new type of (distributed) process

• params setting for new type of process

• creation of “Multiprocess” composed of sets of single (distributed) Processes

• starting/pausing/stopping a process

• monitoring running processes


Atlante



Gaia - media manager

• Web application that will be used for technical
administration and monitoring of the database

• Main functions of this application are:

• media management

• configuration of metadata, broadcasters,
annotation types, concept types and media types

• media annotations monitoring by technical backend


Gaia


One more thing...


ACM MM 2010 Workshop
3rd International Workshop on Automated Information Extraction in Media Production
AIEMPro'10

Organizers:
Dr. Robbie De Sutter
Vlaamse Radio- en Televisieomroep - Medialab
Jean-Pierre Evain
European Broadcasting Union . Union Européenne de Radiotélévision
Dr. Gerald Friedland
ICSI (International Computer Science Institute)
Dr. Alberto Messina
RAI Radiotelevisione Italiana, Centre for Research and Technological Innovation
Dr. Masanori Sano
NHK (Japan Broadcasting Corporation) Science and Technology Research Laboratories


“Sirio” R.I.A. search engine demo


Web-based R.I.A. archive browsing

