1. Introduction to MPEG-7
Guest lecture for ECE417 TSH
Charlie Dagli
[dagli@illinois.edu]
April 7, 2009
2. Contents
This lecture: a general overview of MPEG-7
MPEG-7
–Background
–Introduction
–Components of MPEG-7
Description Definition Language (DDL)
Multimedia Description Scheme (MDS)
Video Descriptors
Audio Descriptors
–References
3. Background
Search and Retrieval of Multimedia data
– In recent years, the amount of audiovisual (AV) data becoming available
has increased enormously
– Applications
Large-scale multimedia search engines on the Web
Media asset management systems in corporations
AV broadcast servers
Personal media servers…
– Need: retrieval, search, and storage of AV data using higher-level concepts
– A solution:
Efficient processing tools to create descriptions of AV material or to support the
identification or retrieval of AV documents.
– Alongside the research activity on processing tools, the need for interoperability
between devices has been recognized, and standardization activities have
been launched.
MPEG-7, “MULTIMEDIA CONTENT DESCRIPTION INTERFACE”,
standardizes the description of multimedia content, supporting a wide range of
applications.
MPEG stands for Moving Picture Experts Group (1988)
4. Introduction : What is MPEG-7?
“Multimedia Content Description Interface”
–Intuition:
NOT focus so much on processing tools
Concentrate more on the selection of features that have to be described
Find a way to structure and instantiate the selected features with a
common language
–Efficient representation of audio-visual (AV) meta-data
–Goal: allow interoperable searching, indexing, filtering and
access of multimedia content by enabling interoperability
among devices that deal with multimedia content description.
5. MPEG-7 Main Elements
Descriptor (D) – standardized “audio
only” and “visual only” descriptors. <ex>
a time code for duration, color histograms
for color.
Multimedia Description Scheme (MDS)
– standardized description schemes for
audio and visual descriptors. <ex> video:
temporally structured scenes and shots,
including textual descriptors at the scene
level and color, motion, audio amplitude
descriptors at the shot level.
Description Definition Language
(DDL) – provides a standardized
language to express description schemes,
– based on XML (eXtensible Markup
Language) – a language that allows the
creation of new description schemes, and
possibly, descriptors. Also allows the
extension and modification of existing
description schemes.
6. What can MPEG-7 do?
Increasing availability of potentially interesting audiovisual
materials makes search more difficult.
The goal is a search system in which any type of AV material may be
retrieved by means of any type of query material, such as video,
music, speech, etc.
– Some query examples
Music : Play a few notes on a keyboard and get in return a list of musical pieces
containing the required tune or images somehow matching the notes.
Image : Define objects, including color patches or textures and get in return
examples among which you select the interesting objects to compose your image
Voice : Using an excerpt of Pavarotti’s voice, and getting a list of Pavarotti’s
records, video clips where Pavarotti is singing or video clips where Pavarotti is
present.
Sports video analysis: can be done much more easily and with better results
7. Application Areas
Application domains listed in the MPEG-7 Applications document:
– Education
– Journalism (e.g. searching for speeches of a person using his or her name,
voice, or face)
– Tourist information
– Cultural services (museum, art gallery, digital library)
– Entertainment (searching a game, karaoke)
– Investigation services (human characteristics recognition)
– Geographical information systems
– Remote sensing
– Surveillance (traffic control, surface transportation)
– Shopping
– Architecture, real estate, and interior design
– Social (Dating Service)
– Film, video, and radio archives
– Audiovisual content production
8. MPEG-7 vs. previous MPEG activities
MPEG-1,2, & 4 are designed to represent the information itself,
while MPEG-7 is meant to represent information about the
information.
MPEG-1,2, & 4 made content available, MPEG-7 allows you to
find the content you need.
Also, MPEG-7 can be used independently of the other MPEG
standards – the description might even be attached to an analog
movie.
9. MPEG-7 Parts
ISO/IEC 15938-1 (Systems)
– The binary format for encoding MPEG-7 descriptions and the terminal architecture.
ISO/IEC 15938-2 (Description Definition Language)
– The language for defining the syntax of the MPEG-7 Description Tools and for
defining new Description Schemes.
ISO/IEC 15938-3 (Visual)
– The Description Tools dealing with Visual descriptions.
ISO/IEC 15938-4 (Audio)
– The Description Tools dealing with Audio descriptions.
ISO/IEC 15938-5 (Multimedia Description Schemes)
– The Description Tools dealing with generic features and multimedia descriptions.
ISO/IEC 15938-6 (Reference Software)
– A software implementation of relevant parts of the MPEG-7 Standard with
normative status.
ISO/IEC 15938-7 (Conformance Testing)
– Guidelines and procedures for testing conformance of MPEG-7 implementations.
ISO/IEC TR 15938-8 (Extraction and use of descriptions)
– Informative material (in the form of a Technical Report) about the extraction and use
of some of the Description Tools.
11. Description Definition Language (DDL)
The foundation of the MPEG-7 standard: provides the language for
defining the structure and content of multimedia information.
A schema language to represent the results of modeling
audiovisual data (i.e. descriptors and description schemes) as
a set of syntactic, structural, and value constraints to which
valid MPEG-7 descriptors, description schemes, and
descriptions must conform.
Also provides the rules by which users can combine, extend, and
refine existing description schemes and descriptors.
XML example:
<PersonName>
<Title> Prof. </Title>
<Firstname>Thomas </Firstname>
<Lastname>Huang</Lastname>
<Nickname>Tom</Nickname>
</PersonName>
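As a rough illustration of how an application might consume such a description, the sketch below parses the PersonName fragment with Python's standard xml.etree.ElementTree module. The element names are those of the slide example, not necessarily those of the normative MPEG-7 schema, and the helper name parse_person_name is hypothetical.

```python
import xml.etree.ElementTree as ET

# The fragment from the slide example (element names are illustrative).
PERSON_XML = """
<PersonName>
  <Title> Prof. </Title>
  <Firstname>Thomas </Firstname>
  <Lastname>Huang</Lastname>
  <Nickname>Tom</Nickname>
</PersonName>
"""


def parse_person_name(xml_text):
    """Return a dict mapping each child tag to its whitespace-stripped text."""
    root = ET.fromstring(xml_text.strip())
    if root.tag != "PersonName":
        raise ValueError("expected a <PersonName> root element")
    return {child.tag: (child.text or "").strip() for child in root}


person = parse_person_name(PERSON_XML)
print(person)
```

A real DDL-based workflow would also validate the description against a schema; that step is left out here.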
13. Multimedia Description Schemes (MDS)
Overview of the organization of the MPEG-7 MDS: organized
in 6 areas: Basic Elements, Content Description, Content
Organization, Content Management, Navigation and Access, and
User Interaction
14. MDS: Basic Elements
Basic Elements – fundamental constructs of the
definition of MPEG-7 description schemes
–Schema Tools :
facilitate the creation and packaging of valid MPEG-7 descriptions
–Basic Data types :
Integer & Real – represent constrained integer and real value
Vectors & Matrix – represent arbitrary sized vectors and matrices of
integer or real values
Probability Vectors & Matrices – represent probability distribution
described using vectors/matrices
String – represents codes identifying content type, countries, regions,
currencies, and character sets
–Linking, Identification and Localization Tools :
tools for referencing MPEG-7 descriptions, for linking descriptions to
multimedia content and for describing time in multimedia content
15. MDS: Basic Elements
–Example: Three kinds of media time representation:
(Figure: timeline marking the time points t1 and t2, the Duration between them, the TimeBase, and the RelTimePoint)
A) Simple time: Specify a time point and a duration
B) Relative time: Specify a media time point relative to a time base, and a
duration
C) Incremental time: Specification of time using a predefined interval
called Time Unit and counting the number of intervals (efficient for
periodic signals)
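The three representations above can be sketched as small data types that all reduce to the same (t1, t2) interval. This is only an illustration: the class and field names are hypothetical, times are plain seconds, and none of this follows the normative MPEG-7 time syntax.

```python
from dataclasses import dataclass


@dataclass
class SimpleTime:
    start: float      # time point, in seconds
    duration: float   # in seconds


@dataclass
class RelativeTime:
    time_base: float  # reference point, in seconds
    offset: float     # media time point relative to the time base
    duration: float   # in seconds


@dataclass
class IncrementalTime:
    time_unit: float   # length of one predefined interval, in seconds
    start_count: int   # number of intervals before the start
    length_count: int  # number of intervals covered


def to_interval(t):
    """Convert any of the three representations to (t1, t2) in seconds."""
    if isinstance(t, SimpleTime):
        t1 = t.start
        return (t1, t1 + t.duration)
    if isinstance(t, RelativeTime):
        t1 = t.time_base + t.offset
        return (t1, t1 + t.duration)
    if isinstance(t, IncrementalTime):
        t1 = t.start_count * t.time_unit
        return (t1, t1 + t.length_count * t.time_unit)
    raise TypeError("unsupported time representation")


print(to_interval(IncrementalTime(time_unit=0.5, start_count=4, length_count=6)))
```

The incremental form stores only two integers per segment once the time unit is fixed, which is why it suits periodic signals such as video frames.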
16. MDS: Basic Elements
– Basic Description Tools : A library of description schemes and data types, which
are used as primitive components for building more complex and functionality-
specific description tools found in the rest of MPEG-7.
Graph and relation tools: weave together complex multimedia description
structures
– Ex. (Figure: graph with nodes A, B, C, D, E connected by relations r1, r2, r3, r4)
<Graph>
<Node id="A"/> <Node id="B"/> <Node id="C"/> <Node id="D"/> <Node id="E"/>
<Relation type="#r1" source="#A" target="#B"/>
<Relation type="#r2" source="#A" target="#C"/>
…………..
</Graph>
Textual annotations: represent textual descriptions
– Free text annotation : Spain scores a goal against Sweden.
– Keyword annotation : score, Sweden, Spain
Classification schemes and terms: define and reference vocabularies for
multimedia content descriptors.
– Ex. Part of a ClassificationScheme for sports:
sports
soccer basketball baseball tennis
17. MDS: Basic Elements
People and locations: represent people and places related to
multimedia content
– Agent: persons, organizations, groups of persons,…
Ex. <PersonGroup>
<Name>Spanish National Soccer Team </Name>
<Kind><Name>Soccer Team </Name></Kind>
<Member>
<Name> Fernando </Name>
</Member>
<Member>
….
</PersonGroup>
– Places: existing, historical, and fictional places.
Affective description: describe emotional response to
multimedia content
– Ex. Recording an audience’s excitement while watching an action movie
Ordering tools:
– Provides a hint for ordering descriptions for presentation based on
information contained in those descriptions
– Ex. Order a set of video segments in a soccer game by the amount of
camera zoom within each segment.
19. MDS: Content Management
Content management: the description of the life cycle of the
content, from creation to consumption
– Creation and Production Description,
Including title, textual annotation, creators, creation locations, dates, how the data
is classified, review and guidance information, and related multimedia material.
– Usage Description
Describes information related to the usage rights, usage record, and financial
information.
Rights information is not explicitly included in the description but links are
provided to the rights holders or right management.
Usage record description provides information related to the use of the content,
such as broadcasting or on-demand delivery.
Financial information provides information related to the cost of production and
the income resulting from content use.
Usage description is dynamic and subject to change during the lifetime of the
multimedia content.
– Media Description
Describes the storage media, in particular the compression, coding, and storage
format of the multimedia content. It describes the master media, that is, the original
source from which different instances of the multimedia content are produced.
21. MDS: Structural Content Description
Content Description: structural and conceptual aspects
– Structure Description: describes the structure of multimedia, built around the
notion of the Segment Description Scheme, which represents a spatial, temporal, or
spatiotemporal portion of the multimedia content
Segment DSs (the core element)
– Example: Mosaic DS – panoramic view of video segment constructed by
aligning together and warping the frames of a Video Segment upon each other
22. MDS: Structural Content Description
Specific features for structural data description
Feature | Video Segment | Still Region | Moving Region | Audio Segment
Time X X X
Shape X X
Color X X X
Texture X
Motion X X
Camera motion X
Audio features X X X
24. MDS: Conceptual Content Description
Conceptual aspects: describes the multimedia content from
the viewpoint of real-world semantics and conceptual
notations.
– Involve entities such as objects, events, abstract concepts and relationships.
– Segment description schemes and semantic description schemes are related
by a set of links that allows the multimedia content to be described on the
basis of both content structure and semantics together.
25. MDS: Conceptual Content Description
(Figure: example of video segments and regions, with the corresponding SegmentRelationship graph)
27. MDS: Navigation and Access
Facilitating browsing and retrieval by defining summaries,
views, and variations of the multimedia content.
Summaries: provide compact highlights of the multimedia
content to enable discovering, browsing, navigation, and
visualization of multimedia content.
– Hierarchical navigation mode
– Sequential navigation mode
28. MDS: Navigation and Access
View: based on partitions and decompositions, which describe
different decompositions of the multimedia signals in space,
time, and frequency. The partitions and decompositions can be
used as different views of the multimedia content, which is
important for multi-resolution access and progressive retrieval.
Variations: provide different variations of multimedia
programs, such as summaries and abstracts; scaled,
compressed, and low-resolution versions; and versions with
different languages and modalities (audio, video, image, text,
and so forth). They allow the selection of the most suitable
variation of a multimedia program.
30. MDS: Content Organization
Content Organization – tools describe collections and models
– Collection: unordered sets of multimedia content, segments, descriptor
instances, concepts or mixed sets of the above
(Example of collections of AV content including the relationships (i.e.
RAB,RBC,RAC) within and across Collection Clusters)
(Figure: collection structure. The abstract Collection type has the subtypes Content collection, Segment collection, Descriptor collection, Concept collection, and Mixed collection.)
31. MDS: Content Organization
– Model tools: parameterized representations of an instance or class of
multimedia content, descriptors, or collections, as follows:
Probability model : Associates statistics or probabilities with the attributes of
multimedia content, descriptors or collections
Analytic model: Associates labels or semantics with multimedia content or
collections
Cluster model: Associates labels or semantics and statistics or probabilities with
multimedia content collections
Classification model: Describes information about known collections of
multimedia content in terms of labels, semantics, and models that can be used to
classify unknown multimedia content
(Figure: hierarchy of model tools derived from the abstract Model. Names shown: Probability Model, Analytic Model, Cluster Model, Classification Model, Collection Model, Probability Model class, ClusterClassification Model, ProbabilityClassification Model, discrete distribution, continuous distribution, Finite State Model.)
32. MDS: Content Organization
– Clusters of positive and negative examples of images are described using the
Cluster Model tool.
– Soccer video sequence modeled using the State Transition Model tool.
34. MDS: User Interaction
User interaction describes user preferences and usage history.
Allows matching between user preferences and MPEG-7
content descriptions, which facilitates personalization of multimedia
content access, presentation, and consumption.
36. Introduction : What is MPEG-7?
“Multimedia Content Description Interface”
–Intuition:
NOT focus so much on processing tools
Concentrate more on the selection of features that have to be described
Find a way to structure and instantiate the selected features with a
common language
–Provide a way to get information about the audiovisual (AV)
data without the need of performing the actual decoding of these
data.
–Goal: allow interoperable searching, indexing, filtering and
access of multimedia content by enabling interoperability
among devices that deal with multimedia content description.
37. MPEG-7 Main Elements
Descriptor (D) – provides standardized “audio only” and “visual only”
descriptors. <ex> a time code for duration, color histograms for color.
Multimedia Description Scheme (MDS) – provides standardized description
schemes involving both audio and visual descriptors. <ex> a movie,
temporally structured as scenes and shots, including textual descriptors at the
scene level and color, motion, audio amplitude descriptors at the shot level.
Description Definition Language (DDL) – provides a standardized language
to express description schemes,
– based on XML (eXtensible Markup Language) – a language that allows the creation
of new description schemes, and possibly, descriptors. Also allows the extension and
modification of existing description schemes.
Coding Schemes – compressing MPEG-7 textual XML descriptions into
Binary format (BiM) to satisfy application requirements for compression
efficiency, error resilience, ...
38. Visual Descriptors
Cover 6 basic visual features as
–Color
–Texture
–Shape
–Motion
–Localization
–Face Recognition
39. Color descriptors
Color Descriptors
– Color Space : defines the color components as continuous-value entities
R, G, B
Y, Cr, Cb
– Y = 0.299R + 0.587G + 0.114B
– Cb = –0.169R – 0.331G + 0.500B
– Cr = 0.500R – 0.419G – 0.081B
H, S, V (Hue, Saturation, Value)
– A nonlinear transform of RGB
– Quantized into 16, 32, 64, 128, or 256 bins for the
scalable color descriptor and the group-of-frames
histogram descriptor
HMMD (Hue, Max, Min, Diff, Sum)
– Max = max (R, G, B)
– Min = min (R, G, B)
– Diff = Max – Min
– Sum = (Max + Min) / 2
(Figure: HMMD color space, with the Min axis labeled whiteness and the Max axis labeled blackness)
Linear transformation matrix with reference to R, G, B
– Any 3 x 3 color transform matrix that specifies the linear
transformation between RGB and the respective color space.
Monochrome: Y component alone in YCrCb is used
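The linear and max/min-based conversions listed above can be written directly as code. The sketch below uses the coefficients from this slide; the function names are hypothetical, and the hue component of HMMD is left out because the slide does not give its formula.

```python
def rgb_to_ycbcr(r, g, b):
    """Apply the Y, Cb, Cr equations from the slide to one RGB triple."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.169 * r - 0.331 * g + 0.500 * b
    cr = 0.500 * r - 0.419 * g - 0.081 * b
    return y, cb, cr


def rgb_to_hmmd_components(r, g, b):
    """Return (Max, Min, Diff, Sum) as defined on the slide (hue omitted)."""
    mx = max(r, g, b)
    mn = min(r, g, b)
    return mx, mn, mx - mn, (mx + mn) / 2


# A mid-gray pixel carries no chroma, so Cb and Cr come out (numerically) zero.
print(rgb_to_ycbcr(128, 128, 128))
print(rgb_to_hmmd_components(255, 128, 0))
```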
40. Color Descriptors
–Color Quantization Descriptor : specifies the partitioning of the
given color space into discrete bins.
–Dominant Color Descriptor (DCD): allows specification of a small
number of dominant color values as well as their statistical properties, such as
distribution and variance. It provides an effective and compact representation
of the colors present in a region or an image.
DCD is defined to be
F = {(c_i, p_i, v_i), s}, (i = 1, 2, ..., N), where N is the number of dominant colors
c_i: dominant color value, a vector of the corresponding color space component
values
p_i: the fraction of pixels in the image corresponding to c_i
v_i: the variation of the color values of the pixels in a cluster around the
corresponding representative color
s: the spatial coherency, which represents the overall spatial homogeneity
(Examples of low and high spatial coherency of color)
41. Color Descriptors
–Scalable Color Descriptor : a Haar transform-based encoding
scheme applied across values of a color histogram in the HSV
color space
– Useful for image-to-image matching and retrieval based on color feature. Its
binary representation is scalable in terms of bin numbers and bit
representation accuracy over a broad range of data rate.
–Group-of-Frame or Group-of-Picture Descriptor :
For joint representation of color-based features for multiple images or multiple
frames in a video segment
Traditionally, for a group of frames or pictures, a key frame or image is
selected and the color-related features of the entire collection are represented by
the chosen sample, which is unreliable.
The GoF and GoP descriptors are histogram-based descriptors that reliably capture the color
content of multiple images or video frames.
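The two ideas on this slide, a Haar transform applied across histogram bins and a joint histogram for a group of frames, can be sketched as follows. This is a simplified illustration, not the normative extraction: the unnormalized sum/difference form of the Haar step and the use of a plain average across frames are assumptions, and the function names are hypothetical.

```python
def haar_step(values):
    """One Haar level: pairwise sums (coarse part) and differences (detail)."""
    pairs = [(values[i], values[i + 1]) for i in range(0, len(values), 2)]
    sums = [a + b for a, b in pairs]
    diffs = [a - b for a, b in pairs]
    return sums, diffs


def haar_transform(histogram):
    """Full unnormalized Haar decomposition of a power-of-two-length histogram.

    The result starts with the overall sum, followed by detail coefficients
    from coarse to fine, so truncating it yields a coarser description.
    """
    n = len(histogram)
    if n == 0 or n & (n - 1):
        raise ValueError("histogram length must be a power of two")
    current = list(histogram)
    details = []
    while len(current) > 1:
        current, diffs = haar_step(current)
        details = diffs + details  # coarser details go in front
    return current + details


def gof_average(histograms):
    """Bin-wise average of per-frame histograms (one possible aggregation)."""
    count = len(histograms)
    return [sum(column) / count for column in zip(*histograms)]


print(haar_transform([4, 2, 1, 3]))
print(gof_average([[2, 0], [4, 2]]))
```

Keeping only the leading coefficients is what makes the representation scalable: the first level of sums is exactly the histogram with adjacent bins merged.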
42. Color Descriptors
– Color Layout Descriptor (CLD) : represents the spatial distribution of
representative colors on a grid superimposed on a region or image. Representation is
based on coefficients of Discrete Cosine Transform. This is a very compact
descriptor being highly efficient in fast browsing and search applications.
– Color Structure Descriptor (CSD): based on color histogram, but aims at
identifying localized color distributions using a small structuring window. To
guarantee interoperability, the CSD is bound to the HMMD color space.
– Color structure: the degree to which the pixels of a color are clumped together relative to the scale of an
associated structuring element.
Examples of structured and unstructured color.
43. Texture Descriptors
Homogeneous Texture Descriptor (HTD):
– provides a quantitative representation using 62 numbers, consisting of the
mean energy and energy deviation from a set of frequency channels
– Useful for similarity retrieval
– Effective in characterizing homogeneous texture regions
Texture Browsing Descriptor (TBD):
– Defined for coarse level texture browsing
– Provides a perceptual characterization of texture, similar to human
characterization, in terms of regularity, coarseness and directionality of the
texture pattern.
Edge Histogram Descriptor (EHD):
– Capture spatial distribution of edges in an image
– Useful in matching regions with partially varying, non-uniform texture.
44. Homogeneous Texture Descriptor
• Texture Descriptor
– Homogeneous Texture Descriptor (HTD): characterizes the region
texture using the mean energy and the energy deviation from a set of
frequency channels. The 2D frequency plane is partitioned into 30
channels.
(Figure: frequency layout for feature extraction)
The syntax of the HTD is as follows:
HTD = [fDC, fSD, e1, e2, ..., e30, d1, d2, ..., d30]
where fDC and fSD are the mean and standard deviation of the input image, and ei
and di are the nonlinearly scaled and quantized mean energy and energy
deviation of the i-th channel.
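The 62-element layout of the HTD vector can be made concrete with a small pack/unpack sketch. The function names are hypothetical, and the L1 distance is only one simple way to compare two such vectors, not necessarily the matching measure used in practice.

```python
NUM_CHANNELS = 30  # frequency channels, as stated on the slide


def pack_htd(f_dc, f_sd, energies, deviations):
    """Build [fDC, fSD, e1..e30, d1..d30] from its parts."""
    if len(energies) != NUM_CHANNELS or len(deviations) != NUM_CHANNELS:
        raise ValueError("expected 30 energies and 30 deviations")
    return [f_dc, f_sd] + list(energies) + list(deviations)


def unpack_htd(htd):
    """Split a 62-element HTD vector back into (fDC, fSD, energies, deviations)."""
    if len(htd) != 2 + 2 * NUM_CHANNELS:
        raise ValueError("expected 62 values")
    return htd[0], htd[1], htd[2:2 + NUM_CHANNELS], htd[2 + NUM_CHANNELS:]


def l1_distance(htd_a, htd_b):
    """Sum of absolute differences (an assumed, simple similarity measure)."""
    return sum(abs(a - b) for a, b in zip(htd_a, htd_b))


htd = pack_htd(120, 30, list(range(30)), list(range(30, 60)))
print(len(htd))
```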
45. Texture Browsing Descriptor
– Texture Browsing : Perceptual characterization of a texture, similar to a human
characterization, in terms of regularity, coarseness and directionality
– TBD = [v1,v2,v3,v4,v5]
v1 ∈ {1, 2, 3, 4} or {00,01,10,11}: represents the regularity
v2,v3 ∈ {1, 2, 3, 4, 5, 6} : capture the directionality of the texture
v4, v5 ∈ {1, 2, 3, 4}: capture the coarseness of the texture
Regularity Semantics
00 irregular
01 slightly regular
10 regular
11 highly regular
(Table: semantics of regularity)
(Figure: examples of regularity, with sample textures labeled 11, 10, 01, and 00)
46. Edge Histogram Descriptor
– Edge Histogram: represents local edge distribution in the image
Five types of edges: 5 histogram bins per each sub-image
BinCounts[k] Semantics
BinCounts[0] Vertical edges in sub-image (0,0)
BinCounts[1] Horizontal edges in sub-image (0,0)
BinCounts[2] 45 degree edges in sub-image (0,0)
BinCounts[3] 135 degree edges in sub-image (0,0)
BinCounts[4] Non-directional edges in sub-image (0,0)
BinCounts[5] Vertical edges in sub-image (0,1)
…
BinCounts[74] Non-directional edges in sub-image (3,2)
BinCounts[75] Vertical edges in sub-image (3,3)
BinCounts[76] Horizontal edges in sub-image (3,3)
BinCounts[77] 45 degree edges in sub-image (3,3)
BinCounts[78] 135 degree edges in sub-image (3,3)
BinCounts[79] Non-directional edges in sub-image (3,3)
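The table implies a regular indexing rule: a 4 × 4 grid of sub-images in row-major order, with five edge types per sub-image, giving 80 bins. A minimal sketch of that mapping, with hypothetical function names, is:

```python
EDGE_TYPES = ("vertical", "horizontal", "45 degree", "135 degree", "non-directional")
GRID = 4  # sub-images per row and per column


def bin_index(row, col, edge_type):
    """Index k of BinCounts[k] for a sub-image (row, col) and an edge type."""
    return len(EDGE_TYPES) * (GRID * row + col) + EDGE_TYPES.index(edge_type)


def bin_semantics(k):
    """Inverse mapping: (row, col, edge type) for BinCounts[k]."""
    block, type_index = divmod(k, len(EDGE_TYPES))
    row, col = divmod(block, GRID)
    return row, col, EDGE_TYPES[type_index]


print(bin_index(3, 2, "non-directional"))
print(bin_semantics(78))
```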
47. Shape Descriptors
Shape Descriptors
– Region-based Shape Descriptor
Expresses pixel distribution within a 2-D object or region.
Based on both boundary and internal pixels and can describe complex objects
consisting of multiple disconnected regions as well as simple objects with or
without holes.
– Contour-based Shape Descriptor
Based on CSS representation of the contour
– 3-D Spectrum Descriptor
Expresses characteristic features of objects represented as discrete polygonal 3-D
meshes.
Based on the histogram of local geometrical properties of the 3-D surfaces of the
object.
48. Shape Descriptors
– Region-based shape descriptor utilizes a set of ART(Angular Radial
Transform) coefficients. Twelve angular and three radial functions are used
(n < 3, m < 12).
F_nm is the ART coefficient of order n and m, V_nm is the ART basis function, and f is the image function.
V_nm is separable along the angular and radial directions.
(Figure: real part of the ART basis functions)
ART coefficients are divided by the magnitude of the ART coefficient of order n = 0, m = 0, which is not used
as a descriptor element.
Quantization is applied to each coefficient, using 4 bits per coefficient, to minimize the size of the descriptor.
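For reference, the ART coefficient and its separable basis function are commonly written as below, in polar coordinates on the unit disk. The equations are given here as the usual textbook form of the transform, with the same symbols as the text above (F_nm, V_nm, f); the names A_m and R_n for the angular and radial parts are introduced only for this sketch.

```latex
F_{nm} = \langle V_{nm}(\rho,\theta),\, f(\rho,\theta) \rangle
       = \int_{0}^{2\pi}\!\!\int_{0}^{1} V_{nm}^{*}(\rho,\theta)\, f(\rho,\theta)\, \rho \, d\rho \, d\theta

V_{nm}(\rho,\theta) = A_m(\theta)\, R_n(\rho), \qquad
A_m(\theta) = \frac{1}{2\pi}\, e^{jm\theta}, \qquad
R_n(\rho) =
\begin{cases}
1, & n = 0 \\
2\cos(\pi n \rho), & n \neq 0
\end{cases}
```

With n < 3 and m < 12 this gives 36 coefficients, of which the n = 0, m = 0 term is used only for normalization.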
49. Shape Descriptors
– Contour-based Shape Descriptor : describes a closed contour of a 2D object or
region in image or video sequence. Based on the Curvature Scale Space (CSS)
representation of the contour
(A 2D visual object (region) and its corresponding shape)
Field | No. of bits | Meaning
No. of peaks | 6 | No. of peaks in CSS image
GlobalCurvature | 2×6 | Circularity and eccentricity of the contour
PrototypeCurvature | 2×6 | Circularity and eccentricity of the smoothed contour
HighestPeakY | 7 | Absolute height of the highest peak (quantized)
PeakX[] | 6 | X-position on the contour of a peak (quantized)
PeakY[] | 3 | Height of the peak (quantized)
(Figure: CSS image formation, showing contour smoothing and the evolution of the curvature zero-crossings)
50. Shape Descriptors
Contour-based Shape Descriptor has the following properties
– It can distinguish between shapes that have similar region-shape properties but
different contour-shape properties.
– It supports search for shapes that are semantically similar for humans.
– It is robust to significant non-rigid deformations.
– It is robust to distortions in the contour due to perspective transformations, which are
common in images and video.
– It is robust to noise present on the contour.
– It is very compact (14 bytes per contour on average).
– The descriptor is easy to implement and offers fast extraction and matching.
51. Shape Descriptors
(3-Dimensional Class)
– 3-D Shape spectrum descriptor : This descriptor specifies an intrinsic shape
description for 3D mesh models. It exploits some local attributes of the 3D surface.
The shape index, introduced by Koenderink, is defined as a function of the two principal
curvatures and is associated with a point p on the 3D surface S.
By definition, the shape index value is in the interval [0,1].
The shape spectrum of the 3D mesh (3D-SSD) is the histogram of the shape indices (the I_p values)
calculated over the entire mesh.
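A minimal sketch of the shape index and its histogram follows. It assumes the commonly cited form I_p = 1/2 − (1/π) arctan((k1 + k2) / (k1 − k2)) with principal curvatures k1 ≥ k2, which does map into [0,1]. The handling of planar and umbilic points, the unweighted histogram, and the default bin count are choices made for this illustration only.

```python
import math


def shape_index(k1, k2):
    """Shape index in [0, 1] from two principal curvatures.

    Returns None for planar points (k1 == k2 == 0), where it is undefined.
    """
    if k1 < k2:
        k1, k2 = k2, k1  # enforce k1 >= k2
    if k1 == k2:
        if k1 == 0:
            return None
        # Umbilic point: arctan tends to +pi/2 or -pi/2 in the limit.
        return 0.0 if k1 > 0 else 1.0
    return 0.5 - math.atan((k1 + k2) / (k1 - k2)) / math.pi


def shape_spectrum(curvature_pairs, num_bins=100):
    """Normalized histogram of shape indices over all (k1, k2) pairs."""
    histogram = [0] * num_bins
    count = 0
    for k1, k2 in curvature_pairs:
        index = shape_index(k1, k2)
        if index is None:
            continue  # planar points are skipped in this sketch
        histogram[min(int(index * num_bins), num_bins - 1)] += 1
        count += 1
    return [h / count for h in histogram] if count else histogram


print(shape_index(1.0, -1.0))
```

In a real extractor the curvature pairs would be estimated per mesh face or vertex; here they are simply passed in.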
52. Motion Descriptors
Camera Motion Descriptor
Motion Trajectory Descriptor
Parametric Motion Descriptor
Motion Activity Descriptor
(Figure: relationship of the motion descriptors (camera motion, motion trajectory, parametric motion, motion activity, and warping parameters) to video segments, moving regions, and mosaics)
53. Motion Descriptors
Motion Descriptors
– Camera Motions: pan, track, tilt, boom, zoom, dolly, roll, absence
(Figure: perspective projection and camera motion parameters)
54. Motion Descriptors
– Motion Trajectory : describes the displacements of objects in time. A high
level feature associated to a moving region, defined as the spatiotemporal
localization of one of its representative points (such as its center) as a list of key
points (x, y, z, t)
– Parametric Motion : describing the motion of objects in video sequences as a 2D
parametric model.
Affine Models (6): translations, rotations, scaling and combination of these.
Planar Perspective Models (8) : Global deformations with perspective projections
Quadratic Models (12) : describes more complex movements
– Motion Activity : Intuitive notion of ‘intensity of action’ or ‘pace of action’ in a
video segment.
Example of high “activity”: Goal scoring in a soccer match
Can be used in diverse applications such as content repurposing, video summarization,
surveillance, content-based querying, etc.
Four attributes:
– Intensity of activity: indicates high or low activity by an integer in the range 1 to 5
– Direction of activity: expresses the dominant direction of the activity if any
– Spatial distribution of activity: the number and size of active regions in a frame
– Temporal distribution of activity: expresses the variation of activity over the duration
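To make the parametric motion models above more tangible, here is a sketch of the 6-parameter affine case, where the velocity at each image position is a linear function of the coordinates. The parameter ordering (a1..a6) and the function name are assumptions chosen for this example, not the normative syntax.

```python
def affine_velocity(params, x, y):
    """Velocity (vx, vy) at position (x, y) under a 6-parameter affine model.

    Assumed ordering: vx = a1 + a2*x + a3*y, vy = a4 + a5*x + a6*y.
    """
    a1, a2, a3, a4, a5, a6 = params
    return (a1 + a2 * x + a3 * y, a4 + a5 * x + a6 * y)


# Pure translation: the same velocity everywhere.
print(affine_velocity((1.0, 0.0, 0.0, 2.0, 0.0, 0.0), 5.0, 7.0))
# Uniform scaling about the origin: velocity grows with distance.
print(affine_velocity((0.0, 0.1, 0.0, 0.0, 0.0, 0.1), 10.0, 20.0))
```

The perspective (8-parameter) and quadratic (12-parameter) models extend the same idea with additional terms.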
55. Localization Descriptors
Localization Descriptors
– Region Locator : Localization of regions within images or frames by specifying
them with a brief and scalable representation of a Box or a Polygon. Procedure
consists of the following 2 steps
Extraction of vertices of the region to be localized
Localization of the region within the image or frame
(localization using a polygonal and Box element of the RegionLocator)
– Spatio Temporal Locator: describes spatial-temporal regions in a video
sequence, such as moving object regions, and provides localization
functionality.
56. Face Recognition Descriptor
FaceRecognition Descriptor : Used to retrieve face images which match a query
face image.
–Face Recognition : The projection of a face vector onto a set of 48 basis
eigenvectors U (‘eigenfaces’) which span the space of possible face vectors.
–Feature Extraction : The FaceRecognition feature set is extracted from a
normalized face image. This normalized face image contains 56 lines with 46
intensity values in each line. The centres of the two eyes in each face image are
located on the 24th row and on the 16th and 31st columns for the right and left eye,
respectively.
Features are given by the vector W, the projection of the normalized face vector
onto U after subtraction of the mean face vector.
The features are normalized and clipped using Z=2048.
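The projection step can be sketched as follows, assuming the usual eigenface formulation in which the mean face is subtracted before taking the inner product with each basis vector. The function name project_face and the toy 2-dimensional data are illustrative only (the descriptor itself uses 56 × 46 = 2576 intensity values and 48 basis vectors), and the normalization and clipping step is omitted.

```python
def project_face(face, mean_face, basis):
    """Project a face vector onto each basis vector after mean subtraction.

    `basis` is a list of basis vectors, each the same length as `face`.
    """
    if len(face) != len(mean_face):
        raise ValueError("face and mean face must have the same length")
    centered = [f - m for f, m in zip(face, mean_face)]
    return [sum(u_i * c for u_i, c in zip(u, centered)) for u in basis]


# Toy example with two "pixels" and an identity basis.
weights = project_face([3.0, 4.0], [1.0, 1.0], [[1.0, 0.0], [0.0, 1.0]])
print(weights)
```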
57. Face descriptor
– Automatic Face Image Localization
(Block Diagram of the Automatic face Image Localization algorithm)
Color Segmentation
(A color segmentation example: a) the skin color region in the Cb-Cr plane
b) original image c) results of the color segmentation algorithm)
59. Audio Descriptors
Basic Descriptors: temporally sampled scalar values for general use,
applicable to all kinds of signals
– AudioWaveform Descriptor : Audio waveform envelope (minimum and
maximum), typically for display purposes
– AudioPower Descriptor : the temporally smoothed instantaneous power,
which is useful as a quick summary of a signal, and in conjunction with the
power spectrum.
Basic Spectral Descriptors: all deriving from a single time-frequency
analysis of an audio signal
– AudioSpectrumEnvelope Descriptor : a logarithmic-frequency spectrum,
spaced by a power-of-two divider (multiple of an octave)
– AudioSpectrumCentroid Descriptor : the center of gravity of the log-
frequency power spectrum, which describes the shape of the power
spectrum
60. Audio Descriptors
– AudioSpectrumSpread Descriptor : complements the AudioSpectrumCentroid Descriptor
by describing the second moment of the log-frequency power spectrum. This may
help distinguish between pure-tone and noise-like sounds
– AudioSpectrumFlatness Descriptor : the flatness properties of the spectrum of
an audio signal for each of a number of frequency bands. When this indicates a high
deviation from a flat spectral shape for a given band, it may signal the presence of
tonal components
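The centroid and spread descriptors above are the first and second moments of a power spectrum on a log-frequency axis. A minimal sketch, assuming the axis is log2 of frequency relative to a 1 kHz reference (an assumption of this example, as is the function name), is:

```python
import math


def log_centroid_and_spread(freqs_hz, powers, ref_hz=1000.0):
    """Power-weighted mean and standard deviation of log2(f / ref_hz)."""
    total = sum(powers)
    if total <= 0:
        raise ValueError("spectrum has no power")
    logs = [math.log2(f / ref_hz) for f in freqs_hz]
    centroid = sum(l * p for l, p in zip(logs, powers)) / total
    variance = sum((l - centroid) ** 2 * p for l, p in zip(logs, powers)) / total
    return centroid, math.sqrt(variance)


freqs = [500.0, 1000.0, 2000.0, 4000.0]
print(log_centroid_and_spread(freqs, [0.0, 0.0, 1.0, 0.0]))  # pure tone
print(log_centroid_and_spread(freqs, [1.0, 1.0, 1.0, 1.0]))  # flat, noise-like
```

A pure tone has zero spread, while a flat spectrum has a large one, which is the distinction the spread descriptor is meant to capture.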
(Example of AudioSpectrumEnvelope description of a pop song)
Visualized using a spectrogram.
Required data storage is NM values
where N is the no. of spectrum bins
and M is the no. of time points
61. Audio Descriptors
Spectral Basis Descriptor: low-dimensional projections of a high-
dimensional spectral space to aid compactness and recognition, which are
used primarily with the Sound Classification and Indexing Description Tools
– AudioSpectrumBasis : a series of basis functions that are derived from the
singular value decomposition of a normalized power spectrum
– AudioSpectrumProjection : Used with above descriptor, and represents low-
dimensional features of a spectrum after projection upon a reduced rank basis.
(Example: A 10-basis component reconstruction showing most of the detail of the
original spectrogram including guitar, bass guitar, etc.)
The left vectors are an AudioSpectrumBasis
Descriptor and the top vectors are the
corresponding AudioSpectrumProjection
Descriptor. The required data storage is
10(M+N) values
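The storage figures quoted for the two representations (NM values for the full envelope, 10(M+N) for a 10-component basis plus projection) are easy to compare numerically. The bin and frame counts below are made-up illustrative values, and the function names are hypothetical.

```python
def envelope_storage(num_bins, num_frames):
    """Values needed for a full spectrogram: N * M."""
    return num_bins * num_frames


def basis_projection_storage(num_bins, num_frames, rank=10):
    """Values needed for `rank` basis vectors plus their projections: rank * (M + N)."""
    return rank * (num_bins + num_frames)


n_bins, n_frames = 64, 1000  # hypothetical sizes
print(envelope_storage(n_bins, n_frames))
print(basis_projection_storage(n_bins, n_frames))
```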
62. Audio Descriptors
Signal Parameters : apply chiefly to periodic or quasi-periodic
signals
– AudioFundamentalFrequency Descriptor : the fundamental frequency of an
audio signal. The representation allows for a confidence measure, in recognition of
the fact that the various extraction methods, commonly called “pitch-tracking”,
are not perfectly accurate.
– AudioHarmonicity Descriptor : the harmonicity of a signal, allowing
distinction between sounds with a harmonic spectrum (e.g., musical tones
or voiced speech [vowels like ‘a’]), sounds with an inharmonic spectrum
(e.g., metallic or bell-like sounds) and sounds with a non-harmonic
spectrum (e.g., noise, unvoiced speech [fricatives like ‘f’], or dense
mixtures of instruments).
63. Audio Descriptors
Timbral Temporal Descriptor : temporal characteristics of segments
of sounds, useful for the description of musical timbre (characteristic tone
quality independent of pitch and loudness).
– LogAttackTime Descriptor : the ‘attack’ of a sound, the time it takes for the signal
to rise from silence to the maximum amplitude. It tells the difference between a
sudden and a smooth sound
– TemporalCentroid Descriptor : also characterizes the signal envelope, representing where in time the
energy of a signal is focused. It can be used, for example, to distinguish between a decaying piano
note and a sustained organ note when the lengths and the attacks of the two notes
are identical.
Timbral Spectral Descriptor : spectral features in a linear-frequency
space especially applicable to the perception of musical timbre.
– SpectralCentroid Descriptor : the power-weighted average of the frequency of the
bins in the linear power spectrum. Very similar to the AudioSpectrumCentroid, but
specialized for use in distinguishing musical instrument timbres. It tells the
“sharpness” of a sound.
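The two timbral temporal descriptors above can be sketched on a sampled amplitude envelope. The details are assumptions of this example: the attack is taken to start at the first sample exceeding a small fraction of the peak, the base-10 logarithm is used, and the function names are hypothetical.

```python
import math


def log_attack_time(envelope, sample_rate, threshold=0.02):
    """log10 of the time from (near) silence to the envelope maximum, in seconds."""
    peak = max(envelope)
    if peak <= 0:
        raise ValueError("envelope has no energy")
    start = next(i for i, v in enumerate(envelope) if v > threshold * peak)
    stop = envelope.index(peak)
    duration = (stop - start) / sample_rate
    if duration <= 0:
        raise ValueError("attack is shorter than one sample")
    return math.log10(duration)


def temporal_centroid(envelope, sample_rate):
    """Envelope-weighted mean time, in seconds."""
    total = sum(envelope)
    if total <= 0:
        raise ValueError("envelope has no energy")
    return sum(i * v for i, v in enumerate(envelope)) / (total * sample_rate)


env = [0.0, 0.5, 1.0, 0.5, 0.0]  # toy envelope sampled at 10 Hz
print(log_attack_time(env, 10))
print(temporal_centroid(env, 10))
```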
64. Audio Descriptors
– HarmonicSpectralCentroid Descriptor : the amplitude-weighted mean of the
harmonic peaks of the spectrum. It has a similar semantic to the other centroid
descriptors, but applies only to the harmonic parts of the musical tone.
– HarmonicSpectralDeviation Descriptor : the spectral deviation of log-amplitude
components from a global spectral envelope.
– HarmonicSpectralSpread Descriptor : the amplitude-weighted standard deviation
of the harmonic peaks of the spectrum, normalized by the instantaneous
HarmonicSpectralCentroid.
– HarmonicSpectralVariation Descriptor : the normalized correlation between the
amplitude of the harmonic peaks between two subsequent time-slices of the signal.
Silence Segment : attaches the simple semantic of “silence” (i.e. no
significant signal) to an Audio Segment. It may be used to aid further
segmentation of the audio stream, or as a hint not to process a segment.
65. Audio Descriptors
High-level Audio Description Tools (Ds and DSs)
– Audio Signature DS : A condensed representation of an audio signal designed to
provide a unique content identifier for the purpose of robust automatic identification of audio
signals. Applications include audio fingerprinting and identification of audio based on a
database of known works
– Musical Instrument Timbre Description Tools
HarmonicInstrumentTimbre Descriptor : Four harmonic timbral spectral
Descriptors with the LogAttackTime Descriptor
PercussiveInstrumentTimbre Descriptor : The timbral temporal Descriptors
with a SpectralCentroid Descriptor
– Melody Description Tools
Include a rich representation for monophonic melodic information to
facilitate efficient, robust, and expressive melodic similarity matching.
MelodyContour DS: terse, efficient melody contour representation
MelodySequence DS: a more verbose, complete, expressive melody
representation
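A terse melody contour can be illustrated by quantizing the interval between consecutive notes into a few steps. The sketch below uses five levels (−2 to +2); the thresholds of 50 and 250 cents are assumptions of this example, exposed as parameters, and the function names are hypothetical.

```python
import math


def interval_cents(f_prev, f_next):
    """Pitch interval between two frequencies, in cents (1200 per octave)."""
    return 1200.0 * math.log2(f_next / f_prev)


def contour_value(cents, small=50.0, large=250.0):
    """Quantize an interval to -2, -1, 0, 1, or 2 (thresholds are assumed)."""
    magnitude = abs(cents)
    if magnitude < small:
        step = 0
    elif magnitude < large:
        step = 1
    else:
        step = 2
    return step if cents >= 0 else -step


def melody_contour(freqs_hz):
    """Contour values for each pair of consecutive note frequencies."""
    return [contour_value(interval_cents(a, b)) for a, b in zip(freqs_hz, freqs_hz[1:])]


# Repeat, octave up, octave down, then roughly a semitone up.
print(melody_contour([440.0, 440.0, 880.0, 440.0, 466.16]))
```

Because only the direction and rough size of each step are kept, two performances of the same tune in different keys produce the same contour.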
66. Audio Descriptors
– General Sound Recognition and Indexing Description Tools
A collection of tools for indexing and categorization of sound (effects) in
general
SoundModelStatePath Descriptor: states generated by a sound model
SoundModelStateHistogram Descriptor: normalized histogram of the state
sequence generated by a sound model
– Spoken Content Description Tools
Consists of combined word and phone lattices for each speaker in an audio
stream. Use phone lattices to alleviate out-of-vocabulary problem (OOV)
SpokenContentLattice Description Scheme : the actual decoding produced by
an ASR(Automatic Speech Recognition) engine.
SpokenContentHeader : information about the speakers being recognized and
the recognizer itself.
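The relationship between a state path and its state histogram, as described for the sound recognition tools above, can be sketched in a few lines. The state path here is a made-up list of integer state indices, and the function name is hypothetical.

```python
from collections import Counter


def state_histogram(state_path, num_states):
    """Normalized histogram of how often each state occurs in the path."""
    if not state_path:
        raise ValueError("state path is empty")
    counts = Counter(state_path)
    return [counts.get(state, 0) / len(state_path) for state in range(num_states)]


path = [0, 1, 1, 2]  # hypothetical state sequence from a 4-state sound model
print(state_histogram(path, 4))
```

The histogram discards the ordering of states, so it is a more compact but less specific description than the path itself.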
67. References
Book – Introduction to MPEG-7: Multimedia Content
Description Interface
B. S. Manjunath (Editor), Philippe Salembier (Editor), Thomas
Sikora (Editor)
ISBN: 0-471-48678-7
http://www.wiley.com/WileyCDA/WileyTitle/productCd-0471486787.html
MPEG-7
http://www.chiariglione.org/mpeg/standards/mpeg-7/mpeg-7.htm
MPEG-7 DDL Homepage
http://archive.dstc.edu.au/mpeg7-ddl/