Analog Digital Video


By:
Yossi Cohen / DSP-IP

Copyright © 2008 LOGTEL

Course Content

Introduction to Video
• Basic Concepts & Formats
• Introduction to Multimedia coding
• Lossy Compression
• Basic Video CODEC
• Standardization Landscape
• Components
• File Formats
• AVI, MPEG4 FF, MKV
• Codecs
• H264, VP6, WMV / VC-1, VP8

Course Content

• Delivery methods
• RTP Streaming
• Progressive Download
• HTML5 Video
• HTTP Streaming


Introduction to Video
By:
Yossi Cohen / DSP-IP


Agenda

Basic Video Concepts
Color Spaces
Interlacing
Video Connections (Component, Composite, S-Video)
Image compression
Introduction to video compression


4.2 Color Models in Images
Color models and spaces are used to store, display,
and print images.
RGB Color Model for CRT Displays
We might expect 8 bits per color channel to give
color that is accurate enough.
However, in fact we have to use about 12 bits per
channel to avoid an aliasing effect in dark image areas:
contour bands that result from gamma correction.
For images produced from computer graphics, we
store integers proportional to intensity in the frame
buffer. So we should have a gamma correction LUT
between the frame buffer and the CRT.
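A minimal sketch of such a gamma-correction LUT, assuming a display gamma of 2.2 (an assumed typical CRT value; the slide gives none) and 8-bit codes in and out. It also measures the jumps between consecutive output codes: the large jumps in the dark range are what show up as contour bands.

```python
GAMMA = 2.2  # assumption: typical CRT gamma, not stated in the text


def build_gamma_lut(gamma: float = GAMMA, bits: int = 8) -> list[int]:
    """Map each linear-intensity code 0..2^bits-1 to a gamma-corrected code."""
    max_code = (1 << bits) - 1
    return [round(max_code * (v / max_code) ** (1.0 / gamma))
            for v in range(max_code + 1)]


def largest_step(lut: list[int], start: int, stop: int) -> int:
    """Largest jump between consecutive LUT outputs for inputs in [start, stop)."""
    return max(lut[i + 1] - lut[i] for i in range(start, stop - 1))


lut8 = build_gamma_lut()
print(lut8[:4])                      # dark inputs are spread far apart
print(largest_step(lut8, 0, 16))     # big jumps in the dark range -> visible bands
print(largest_step(lut8, 200, 256))  # tiny steps in the bright range
```

With 8-bit linear input, neighbouring dark codes land many output levels apart. Finer linear input (e.g. 12 bits) fills those gaps.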


Color matching
How can we compare
colors so that the
content creators and
consumers know what
they are seeing?
Many different ways,
including the CIE
chromaticity diagram



Color Models in Video
• Largely derived from older analog methods of coding
color for TV. Luminance is separated from color
information.
• A matrix transform, YIQ, is used to transmit TV signals
in North America and Japan (NTSC). This coding also
makes its way into VHS video tape coding in these
countries since video tape technologies also use YIQ.
• In Europe, video tape uses the PAL or SECAM
codings, which are based on TV that uses a matrix
transform called YUV.
• Finally, digital video mostly uses a matrix transform
called YCbCr that is closely related to YUV.


YUV Separation


YUV Color Model

• YUV codes a luminance signal (for gamma-corrected
signals) equal to Y, the "luma".
• Chrominance refers to the difference between a color
and a reference white at the same luminance (U and V).

The transform is:
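The transform matrix did not survive extraction. Here is a sketch assuming the common analog YUV definition: Rec. 601 luma weights, U = 0.492(B' - Y'), V = 0.877(R' - Y'), with gamma-corrected inputs in [0, 1]. It shows the conversion and the property that gray pixels give zero chrominance.

```python
def rgb_to_yuv(r: float, g: float, b: float) -> tuple[float, float, float]:
    """Gamma-corrected R'G'B' in [0, 1] -> (Y', U, V).

    Assumed coefficients: Rec. 601 luma weights plus the usual analog
    scale factors 0.492 (U) and 0.877 (V).
    """
    y = 0.299 * r + 0.587 * g + 0.114 * b  # luma
    u = 0.492 * (b - y)                    # scaled blue color difference
    v = 0.877 * (r - y)                    # scaled red color difference
    return y, u, v


print(rgb_to_yuv(0.5, 0.5, 0.5))  # gray: U and V are (numerically) zero
print(rgb_to_yuv(1.0, 0.0, 0.0))  # pure red: large positive V
```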


RGB->YUV Color Transform



YIQ Color Model

YIQ is used in NTSC color TV broadcasting.
Again, gray pixels generate a zero (I, Q)
chrominance signal.
I and Q are a rotated version of U and V.

The transform is:
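The YIQ matrix is also missing from the extraction. A sketch using the commonly published NTSC coefficients (assumed here, not taken from the slide) follows. Each chroma row sums to zero, which is why gray pixels produce I = Q = 0. I and Q are often described as U and V rotated by about 33 degrees.

```python
# Commonly published NTSC YIQ coefficients (assumed; the slide's matrix is missing).
YIQ_MATRIX = (
    (0.299,     0.587,     0.114),     # Y
    (0.595879, -0.274133, -0.321746),  # I
    (0.211205, -0.523083,  0.311878),  # Q
)


def rgb_to_yiq(r: float, g: float, b: float) -> tuple[float, ...]:
    """Gamma-corrected R'G'B' in [0, 1] -> (Y, I, Q) via a 3x3 matrix product."""
    return tuple(cr * r + cg * g + cb * b for cr, cg, cb in YIQ_MATRIX)


print(rgb_to_yiq(0.3, 0.3, 0.3))  # gray: I and Q are (numerically) zero
```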


YCbCr Color Model

1. The Rec. 601 standard for digital video uses
another color space, YCbCr, which is closely
related to the YUV transform.
2. The YCbCr transform is used in JPEG image
compression and MPEG video compression.

For 8-bit coding:
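The 8-bit equations were lost in extraction. As a sketch, assuming the widely quoted Rec. 601 studio-range form (R'G'B' in [0, 1], Y in 16..235, Cb/Cr centred on 128):

```python
def rgb_to_ycbcr_601(r: float, g: float, b: float) -> tuple[int, int, int]:
    """R'G'B' in [0, 1] -> 8-bit Rec. 601 Y'CbCr (studio range, assumed form)."""
    y = 16 + 65.481 * r + 128.553 * g + 24.966 * b     # 16 (black) .. 235 (white)
    cb = 128 - 37.797 * r - 74.203 * g + 112.000 * b   # chroma offset by 128
    cr = 128 + 112.000 * r - 93.786 * g - 18.214 * b
    return round(y), round(cb), round(cr)


print(rgb_to_ycbcr_601(0, 0, 0))  # black
print(rgb_to_ycbcr_601(1, 1, 1))  # white
print(rgb_to_ycbcr_601(1, 0, 0))  # red
```

The chroma rows sum to zero, so any gray maps to Cb = Cr = 128.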


VIDEO CONNECTION TYPES
• Component Video
• Composite Video
• S-Video


Component Video
High-end solution: three separate video signals are
used, one for each of the R, G, B planes.
Each color channel is sent as a separate video signal.
(a) Most computer systems use Component Video, with
separate signals for R, G, and B.
(b) Provides the best color reproduction since there is
no "crosstalk" between the three channels.
(c) Component video requires more bandwidth and
good synchronization of the three components compared
with composite or S-Video.


Composite Video
• Color ("chrominance") and intensity ("luminance")
signals are mixed into a single carrier wave.
a) Chrominance is a composition of two color
components (I and Q, or U and V).
b) In NTSC TV, e.g., I and Q are combined into a
chroma signal, and a color subcarrier is then
employed to put the chroma signal at the high-
frequency end of the signal shared with the
luminance signal.
c) The chrominance and luminance components can
be separated at the receiver end and then the two
color components can be further recovered.


Composite Video
d) When connecting to TVs or VCRs, Composite
Video uses only one wire and video color signals
are mixed, not sent separately. The audio and
sync signals are additions to this one signal.
Since color and intensity are wrapped into the
same signal, some interference between the
luminance and chrominance signals is inevitable.


S-Video
Uses two wires, one for luminance and another for a
composite chrominance signal.
There is less crosstalk between the color information and the
grayscale information.
In fact, humans are able to differentiate spatial resolution
in grayscale images with a much higher acuity than for the
color part of color images.
As a result, we can reduce color
information since we can only see
fairly large blobs of color, so it
makes sense to send less color
detail.


VIDEO SCANNING
•Interlacing
•De-Interlacing


Analog Video Scanning Process
An analog signal f(t) samples a time-varying image. So-called
"progressive" scanning traces through a complete
picture (a frame) row-wise for each time interval.
In TV, and in some monitors and multimedia standards as
well, another system, called "interlaced" scanning, is used:
a) The odd-numbered lines are traced first, and then the
even-numbered lines are traced. This results in "odd" and
"even" fields; two fields make up one frame.
b) In fact, the odd lines (starting from 1) end up at the
middle of a line at the end of the odd field, and the even
scan starts at a half-way point.


Figure: interlaced raster scan. Q to R: horizontal retrace; V to P: vertical retrace.


Interlacing effects

• Because of interlacing, the odd and even lines
are displaced in time from each other. This is
generally not noticeable except when very fast
action is taking place on screen, when blurring
may occur.
• For example, in the video in Fig. 5.2, the
moving helicopter is more blurred than the
still background.


Interlaced and de-Interlace images


de-Interlace
Since it is sometimes necessary to change the frame rate,
resize, or even produce stills from an interlaced source
video, various schemes are used to "de-interlace" it.
a) The simplest de-interlacing method consists of
discarding one field and duplicating the scan lines of
the other field. The information in one field is lost
completely using this simple technique.
b) Other, more complicated methods that retain
information from both fields are also possible.

Analog video uses a small voltage offset from zero to indicate
"black", and another value, such as zero, to indicate the
start of a line. For example, we could use a "blacker-than-black"
zero signal to indicate the beginning of a line.
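The simplest de-interlacing method in (a) can be sketched as follows, assuming a frame is stored as a list of rows (a hypothetical representation). The odd field (lines 1, 3, 5, ... counting from 1) is kept and each of its lines is written twice.

```python
Frame = list[list[int]]  # hypothetical representation: rows of luma samples


def split_fields(frame: Frame) -> tuple[Frame, Frame]:
    """Return (odd field: lines 1, 3, 5, ...; even field: lines 2, 4, 6, ...)."""
    return frame[0::2], frame[1::2]


def line_double(field: Frame) -> Frame:
    """Simplest de-interlace: keep one field and repeat each of its lines."""
    out: Frame = []
    for row in field:
        out.append(list(row))
        out.append(list(row))  # duplicate replaces the discarded field's line
    return out


frame = [[1, 1], [2, 2], [3, 3], [4, 4]]
odd, even = split_fields(frame)
print(line_double(odd))  # the even field's content (2s and 4s) is lost
```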

NTSC Video
NTSC
The NTSC (National Television System Committee) TV
standard is mostly used in North America and Japan. It
uses the familiar 4:3 aspect ratio (i.e., the ratio of picture
width to its height) and uses 525 scan lines per frame at 30
frames per second (fps); for color NTSC the exact rate is 29.97 fps.
a) NTSC follows the interlaced scanning system, and each
frame is divided into two fields, with 262.5 lines/field.
b) Thus the horizontal sweep frequency is 525 × 29.97
≈ 15,734 lines/sec, so that each line is swept out in about 63.6 µs.
c) Since the horizontal retrace takes 10.9 µs, this leaves
52.7 µs for the active line signal during which image data
is displayed (see Fig. 5.3).
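The timing arithmetic in (b) and (c) can be checked directly. All values below come from the text:

```python
LINES_PER_FRAME = 525
FRAME_RATE = 29.97    # color NTSC frame rate (nominally 30)
H_RETRACE_US = 10.9   # horizontal retrace time from the text

line_rate = LINES_PER_FRAME * FRAME_RATE   # lines per second
line_period_us = 1e6 / line_rate           # microseconds per line
active_us = line_period_us - H_RETRACE_US  # time left for picture data

print(f"{line_rate:.0f} lines/s, {line_period_us:.1f} us/line, "
      f"{active_us:.1f} us active")
# prints: 15734 lines/s, 63.6 us/line, 52.7 us active
```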

NTSC
NTSC video is an analog signal with no fixed horizontal
resolution. Therefore one must decide how many times to
sample the signal for display: each sample corresponds
to one pixel output.
A “pixel clock" is used to divide each horizontal line of
video into samples. The higher the frequency of the pixel
clock, the more samples per line there are.
Different video formats provide different numbers of
samples per line, as listed in Table 5.1.
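As an illustration, the samples per line equal the pixel clock divided by the line rate. Two pixel clocks are assumed here because this slide does not list any: 13.5 MHz (the Rec. 601 sampling rate) and about 12.27 MHz (a commonly quoted square-pixel NTSC rate). The count covers the whole line, active part plus blanking.

```python
def samples_per_line(pixel_clock_hz: float, line_rate_hz: float) -> float:
    """Samples taken during one full line period (active + blanking)."""
    return pixel_clock_hz / line_rate_hz


ntsc_line_rate = 525 * 29.97  # lines per second, as computed for NTSC
for clock_mhz in (12.27, 13.5):  # assumed example pixel clocks
    n = samples_per_line(clock_mhz * 1e6, ntsc_line_rate)
    print(f"{clock_mhz} MHz -> {n:.0f} samples per line")
```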


NTSC


NTSC Color Modulation

NTSC uses the YIQ color model, and the technique of quadrature
modulation is employed to combine (the spectrally overlapped part of) I (in-phase)
and Q (quadrature) signals into a single chroma signal C:

$$C = I\cos(2\pi F_{sc} t) + Q\sin(2\pi F_{sc} t) \tag{5.1}$$

This modulated chroma signal is also known as the color subcarrier, whose
magnitude is $\sqrt{I^2 + Q^2}$ and phase is $\arctan(Q/I)$. The frequency of C is $F_{sc} \approx 3.58$ MHz.
The NTSC composite signal is a further composition of the luminance signal Y
and the chroma signal, as defined below:

$$\text{composite} = Y + C = Y + I\cos(2\pi F_{sc} t) + Q\sin(2\pi F_{sc} t) \tag{5.2}$$
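A sketch of how Eq. (5.2) is built and then undone by synchronous detection. It assumes Y, I and Q stay constant over one subcarrier cycle; real receivers use filters rather than a one-cycle average. With θ = 2πF_sc·t sampled uniformly over one cycle, averaging the composite recovers Y. Multiplying by 2cos θ or 2sin θ and averaging recovers I or Q, because the cross terms average to zero.

```python
import math


def composite_samples(y: float, i: float, q: float, n: int = 64):
    """n samples over one subcarrier cycle of Y + I cos(theta) + Q sin(theta)."""
    thetas = [2 * math.pi * k / n for k in range(n)]
    return thetas, [y + i * math.cos(t) + q * math.sin(t) for t in thetas]


def demodulate(thetas, samples):
    """Recover (Y, I, Q) by averaging over one full cycle."""
    n = len(samples)
    y = sum(samples) / n
    i = 2 / n * sum(s * math.cos(t) for s, t in zip(samples, thetas))
    q = 2 / n * sum(s * math.sin(t) for s, t in zip(samples, thetas))
    return y, i, q


thetas, c = composite_samples(0.5, 0.2, -0.1)
y, i, q = demodulate(thetas, c)
magnitude, phase = math.hypot(i, q), math.atan2(q, i)  # atan2 keeps the quadrant
print(round(y, 6), round(i, 6), round(q, 6), round(magnitude, 4))
```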


PAL
PAL (Phase Alternating Line) is a TV standard widely
used in Western Europe, China, India, and many other
parts of the world.
PAL uses 625 scan lines per frame, at 25
frames/second, with a 4:3 aspect ratio and interlaced
fields.
(a) PAL uses the YUV color model. It uses an 8 MHz
channel and allocates a bandwidth of 5.5 MHz to Y, and
1.8 MHz each to U and V. The color subcarrier
frequency is fsc ≈ 4.43 MHz.
(b) In order to improve picture quality, chroma signals
have alternate signs in successive scan lines (in PAL the
V component is inverted: +V and -V), hence the name "Phase Alternating Line".

PAL
(c) This facilitates the use of a (line-rate) comb filter at the
receiver: the signals in consecutive lines are averaged so
as to cancel the chroma signals (that always carry
opposite signs) for separating Y and C and obtaining high-quality
Y signals.
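The comb filter idea in (c) can be sketched under a simplified model, which is an assumption for illustration only: line A carries Y + C, line B carries Y - C, and luma is identical on both lines (a vertically smooth picture). The average of the two lines gives Y, and half their difference gives C.

```python
def comb_separate(line_a: list[float], line_b: list[float]):
    """Separate Y and C from two consecutive lines with opposite-sign chroma.

    Simplified model (assumption): line_a = Y + C, line_b = Y - C,
    with the same Y on both lines.
    """
    y = [(a + b) / 2 for a, b in zip(line_a, line_b)]  # chroma cancels
    c = [(a - b) / 2 for a, b in zip(line_a, line_b)]  # luma cancels
    return y, c


Y = [0.5, 0.6, 0.7]
C = [0.1, -0.2, 0.05]
y_out, c_out = comb_separate([y + c for y, c in zip(Y, C)],
                             [y - c for y, c in zip(Y, C)])
print(y_out, c_out)
```

When luma changes from one line to the next, some of it leaks into C. This is one reason real receivers use more elaborate filters.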


Video Worlds

Intro to Media Coding
Image and Video
Speech
Audio


Compression

Compression – Representing information using
fewer bits than the original information.
Lossless Compression – The original and the decompressed
information are identical.
Examples: LZ-based techniques such as ZIP and gzip.
Lossy Compression – The decompressed info is not
the same as the uncompressed info. Examples:
MP3, JPEG, etc.
Lossy compression is often MODEL-based
compression.
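The difference between the two kinds can be shown in a few lines. Python's standard `zlib` (DEFLATE, an LZ77-based lossless method) returns the exact input bytes. The toy "lossy" step is hypothetical: it quantizes every byte to a step of 16, and the original can no longer be recovered.

```python
import zlib

data = bytes(range(256)) * 16  # sample input, 4096 bytes

# Lossless: decompressing gives back exactly the original bytes.
packed = zlib.compress(data)
restored = zlib.decompress(packed)

# Lossy (toy, hypothetical): keep only a multiple of STEP for each byte.
STEP = 16
quantized = bytes((b // STEP) * STEP for b in data)
max_error = max(abs(a - b) for a, b in zip(data, quantized))

print(restored == data, quantized == data, max_error)
```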

Compression terms
Encoder – Module which compresses the
information
Decoder – Module which decompresses the
information
CODEC – COder / DECoder
Channel – the medium through which the information is
passed, for example an ADSL line or a disk
Encoder → Channel (e.g., disk) → Decoder

Model Based Compression

Pre-processing → Model-based transform → Quantize / Prioritize → Reorder → Entropy coding
Lossless compression: the Reorder and Entropy coding stages
Bit rate control


Human Visual System

The human eye has two basic light receptors:
Rods – Light intensity receptors
Cones – Colored light receptors


The Human Eye

Rods concentration >> Cones concentration
Green discrimination << Red, Blue
discrimination
Sensitivity to low spatial frequencies > sensitivity to high spatial frequencies


Image Coding Model Based transformations

RGB (3 equally quantized colors) ->
YUV (Light Intensity + two color channels)
Pixel based domain -> Frequency domain


Speech coding

In speech coding, the vocal tract is used as a
model:


Audio / Music Coding

In general Audio Coding, the ear is used as a
model:
Frequencies -> Frequency bands
Frequency masking and temporal masking are used


Basic Image and Video coding
Definitions

Where to lose information: color & frequency


What is a digital image?

Audio PCM
One 1-D array of
samples
BMP Image
Three 2-D arrays of
numbers representing
Red, Green and Blue
values


Image Compression? Why?

Image size = 720*580
3 image layers (RGB) = 720*580*3
8 bits per sample: 720*580*3*8
= 10,022,400 bits
Lots of bits for one Lena


IMAGE COMPRESSION


Color based decimation

Our eyes have better resolution and acuity
for luminance than for color.
Compress color by using the 4:2:0 method


Counting the bits

How much can we save by color
compression?
RGB 24-bit color representation: 3 × image size.
4:2:0 YUV representation: (1 + 2 × 1/4) = 1.5 × image size.
Compression ratio is 2!
Actual saving is bigger due to different Y and
UV quantization.
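A sketch of 4:2:0 subsampling and of the bit count above. It assumes each chroma sample is the average of a 2x2 block; this is one common choice, and real systems differ in filtering and chroma siting. For the 720x580 image from the earlier slide it reproduces the 10,022,400 RGB bits and the ratio of 2.

```python
def subsample_420(plane: list[list[float]]) -> list[list[float]]:
    """Average each 2x2 block of a chroma plane (even width and height assumed)."""
    h, w = len(plane), len(plane[0])
    return [[(plane[r][c] + plane[r][c + 1] +
              plane[r + 1][c] + plane[r + 1][c + 1]) / 4
             for c in range(0, w, 2)]
            for r in range(0, h, 2)]


def frame_bits(width: int, height: int, bits: int = 8) -> tuple[int, int]:
    """(RGB 4:4:4 bits, YUV 4:2:0 bits) for one frame."""
    rgb = width * height * 3 * bits
    yuv420 = width * height * bits + 2 * (width // 2) * (height // 2) * bits
    return rgb, yuv420


rgb, yuv = frame_bits(720, 580)
print(rgb, yuv, rgb / yuv)  # prints: 10022400 5011200 2.0
```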


Linear Transform

If the signal is formatted as a vector, a linear transform can
be formulated as a matrix-vector product that transforms
the signal into a different domain.
Examples:
K-L Transform
Discrete Fourier Transform
Discrete cosine transform
Discrete wavelet transform
Energy compaction property: the transformed signal vector
has few, large coefficients and many small, nearly zero
coefficients. These few large coefficients can be encoded
efficiently with few bits while retaining the majority of
the energy of the original signal.
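A small sketch of energy compaction, using an orthonormal 8-point DCT-II matrix applied as a matrix-vector product. The input is an arbitrary smooth ramp chosen for illustration. Because the matrix is orthonormal, total energy is preserved, and almost all of it ends up in the first two coefficients.

```python
import math


def dct_matrix(n: int = 8) -> list[list[float]]:
    """Orthonormal DCT-II matrix: row k holds the k-th cosine basis vector."""
    return [[math.sqrt((1 if k == 0 else 2) / n) *
             math.cos((2 * i + 1) * k * math.pi / (2 * n))
             for i in range(n)]
            for k in range(n)]


def transform(matrix, x):
    """Linear transform as a matrix-vector product: X = T x."""
    return [sum(t * v for t, v in zip(row, x)) for row in matrix]


x = [10, 11, 12, 13, 14, 15, 16, 17]  # a smooth (ramp) signal
X = transform(dct_matrix(), x)
energy = [c * c for c in X]
print([round(c, 2) for c in X])
print(f"first two coefficients hold {sum(energy[:2]) / sum(energy):.2%} of the energy")
```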


Block-based Image Coding

Block-based image coding scheme:
partitions the entire image into 8 by 8 or
16 by 16 (or other size) blocks.
The coding algorithm is applied to individual
blocks independently.
Advantages:
Parallel processing can be applied to
process individual blocks in parallel.
Redundant information in close proximity
(like cache)


Transform - DCT

The DCT transforms the data from pixel
intensity to frequency intensity.
Low frequencies are important; high frequencies
less so.

$$f(m,n) = \sum_{u=0}^{7}\sum_{v=0}^{7} \frac{C(u)\,C(v)}{4}\,F(u,v)\cos\frac{(2m+1)u\pi}{16}\cos\frac{(2n+1)v\pi}{16}, \qquad 0 \le m,n \le 7,$$

where $C(0) = 1/\sqrt{2}$ and $C(k) = 1$ for $k > 0$.
(You'll get lunch even if you don't remember
the IDCT formula above)
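A direct (unoptimized) sketch of the 8x8 forward and inverse DCT. It assumes the JPEG-style normalization C(0) = 1/√2, C(k) = 1 for k > 0, with a C(u)C(v)/4 factor in both directions. This makes the transform orthonormal, so a round trip returns the block. A flat block produces only a DC coefficient, equal to 8 × the pixel value.

```python
import math

N = 8


def c(k: int) -> float:
    """Normalization factor: 1/sqrt(2) for k = 0, else 1."""
    return 1 / math.sqrt(2) if k == 0 else 1.0


def basis(m: int, u: int) -> float:
    return math.cos((2 * m + 1) * u * math.pi / 16)


def dct2(block):
    """Forward 8x8 DCT: F(u,v) = C(u)C(v)/4 * sum f(m,n) cos(...) cos(...)."""
    return [[c(u) * c(v) / 4 * sum(block[m][n] * basis(m, u) * basis(n, v)
                                   for m in range(N) for n in range(N))
             for v in range(N)] for u in range(N)]


def idct2(coeffs):
    """Inverse 8x8 DCT with the same normalization."""
    return [[sum(c(u) * c(v) / 4 * coeffs[u][v] * basis(m, u) * basis(n, v)
                 for u in range(N) for v in range(N))
             for n in range(N)] for m in range(N)]


flat = dct2([[100] * N for _ in range(N)])
print(round(flat[0][0], 6))  # DC of a flat block of 100s
```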


DCT Coefficients Quantization


AC Coefficients
AC coefficients are first
weighted with a quantization
matrix and then quantized:
Cq(i,j) = round(C(i,j) / q(i,j))
Then they are scanned in a
zig-zag order into a 1D
sequence to be subject to AC
Huffman encoding.

Zig-Zag scan order:
 1  2  6  7 15 16 28 29
 3  5  8 14 17 27 30 43
 4  9 13 18 26 31 42 44
10 12 19 25 32 41 45 54
11 20 24 33 40 46 53 55
21 23 34 39 47 52 56 61
22 35 38 48 51 57 60 62
36 37 49 50 58 59 63 64

Question: Given an 8 by 8
array, how do we convert it into a
vector according to the zig-zag
scan order? What is the
algorithm?
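One possible answer to the question, as a sketch. Walk the anti-diagonals s = row + col in order. On odd diagonals the row increases; on even diagonals the column increases. Sorting all (row, col) pairs by that key reproduces the table above. A quantization helper implementing Cq = round(C/q) is included. Note that Python's `round` uses round-half-to-even, while JPEG implementations typically round halves away from zero.

```python
def zigzag_order(n: int = 8) -> list[tuple[int, int]]:
    """(row, col) positions in JPEG zig-zag order for an n x n block."""
    cells = [(r, c) for r in range(n) for c in range(n)]
    return sorted(cells, key=lambda rc: (rc[0] + rc[1],
                                         rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))


def zigzag_scan(block) -> list:
    """Flatten an n x n block into a 1-D list following the zig-zag order."""
    return [block[r][c] for r, c in zigzag_order(len(block))]


def quantize(coeffs, qtable):
    """Cq(i,j) = round(C(i,j) / q(i,j)) (half-to-even rounding, see above)."""
    return [[round(cv / qv) for cv, qv in zip(crow, qrow)]
            for crow, qrow in zip(coeffs, qtable)]


print(zigzag_order()[:6])  # [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2)]
```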

DCT Basis Functions


DCT compression Example

Original Image


DCT 1 coefficient


DCT 6 coefficients


DCT 20 coefficients


JPEG Image Coding Algorithms

8x8 block → DCT → Q (quantization matrix)
DC: DPCM → Huffman coding
AC: Zig-zag scan → Huffman coding
(Huffman code books)
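The DC path (DPCM before Huffman coding) sends each block's DC value as a difference from the previous block's DC. The sketch below assumes the predictor starts at 0, as JPEG does at the start of a scan. The DC values are made up for illustration.

```python
def dpcm_encode(dc_values: list[int]) -> list[int]:
    """Code each block's DC as the difference from the previous block's DC."""
    prev, diffs = 0, []  # predictor starts at 0
    for dc in dc_values:
        diffs.append(dc - prev)
        prev = dc
    return diffs


def dpcm_decode(diffs: list[int]) -> list[int]:
    """Rebuild DC values by accumulating the differences."""
    prev, out = 0, []
    for d in diffs:
        prev += d
        out.append(prev)
    return out


dcs = [52, 55, 54, 60]   # hypothetical DC values of neighbouring blocks
print(dpcm_encode(dcs))  # small differences are cheaper to Huffman-code
```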

JPEG Encoding Process


Generalization of JPEG Coding

Transform (color, frequency) → Quantize → Reorder → Entropy coding

JPEG Encoding Process


Video Coding Basics
By:
Yossi Cohen


Video Coding
Video coding is often implemented as encoding
a sequence of images. Motion compensation
is used to exploit temporal redundancy
between successive frames.
Examples: MPEG-1, MPEG-2, MPEG-4,
H.263, H.263+, H.264
Existing video coding standards are based on
JPEG-like (DCT-based) image compression as well as motion
compensation.


Video Coding Standardization Scope

Only restrictions on the Bitstream, Syntax, and
Decoder are standardized:
Permits the optimization of encoding
Permits complexity reduction
Provides no guarantees on quality


Video Encoding

[Block diagram: hybrid video encoder]
Current frame x(t) − predicted frame xp(t) → residue r → DCT → Q → VLC → Buffer → bit stream
Buffer control: Buffer → Q
Q → Q^-1 → IDCT → r̂(t): reconstructed residue
r̂(t) + xp(t) → x̂(t): reconstructed current frame → Frame Buffer
Frame Buffer → x̂(t-1) → Motion Estimation & Compensation (with x(t)) → xp(t); motion vectors
This is a simplified block diagram where the
encoding of intra coded frames is not shown.
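Why does the encoder contain Q^-1 and IDCT? It predicts from the *reconstructed* previous frame, the same one the decoder will have, so the two never drift apart. A toy sketch of that loop, under stated assumptions: 1-D "frames", zero motion (prediction = previous reconstructed frame), no transform, and a uniform quantizer with a hypothetical step of 4. The first frame is predicted from an all-zero frame as a stand-in for intra coding.

```python
STEP = 4  # hypothetical quantizer step size


def quantize(residue):
    return [round(r / STEP) for r in residue]


def dequantize(levels):
    return [level * STEP for level in levels]


def encode(frames):
    """Closed-loop predictive coding: predict from the reconstructed previous frame."""
    recon_prev = [0] * len(frames[0])  # stand-in for intra coding
    bitstream, recon_frames = [], []
    for x in frames:
        residue = [a - p for a, p in zip(x, recon_prev)]  # r = x - xp
        levels = quantize(residue)                        # Q (sent to VLC)
        bitstream.append(levels)
        recon = [p + d for p, d in zip(recon_prev, dequantize(levels))]  # Q^-1, + xp
        recon_frames.append(recon)
        recon_prev = recon                                # frame buffer
    return bitstream, recon_frames


def decode(bitstream, width):
    """Decoder repeats the reconstruction half of the encoder loop."""
    recon_prev, out = [0] * width, []
    for levels in bitstream:
        recon_prev = [p + d for p, d in zip(recon_prev, dequantize(levels))]
        out.append(recon_prev)
    return out
```

Because the encoder predicts from its own reconstruction, the decoder output matches it exactly, and the error per sample stays within half a quantizer step.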


Video Encoding

[Block diagram: generalized video encoder]
Current frame x(t) − predicted frame xp(t) → Transform (color, frequency) → Q → Reorder → Entropy coding
Buffer control → Q
Q → Q^-1 → Tf^-1 → r̂(t): reconstructed residue
r̂(t) + xp(t) → x̂(t): reconstructed current frame → Frame Buffer
Frame Buffer → x̂(t-1) → Motion Estimation & Compensation (with x(t)) → xp(t); motion vectors
This is a simplified block diagram where the
encoding of intra coded frames is not shown.


Forward Motion Estimation

[Figure: the current frame (blocks 1 to 16) is constructed from different parts of the reference frame]
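A minimal full-search block-matching sketch. For one block of the current frame, it tries every displacement within ±search pixels in the reference frame and keeps the one with the smallest sum of absolute differences (SAD). Block size 4 and search range 2 are illustrative choices, and the motion vector (dx, dy) points from the current block to its match in the reference.

```python
def sad(cur, ref, bx, by, dx, dy, size):
    """Sum of absolute differences between a current block and a shifted reference block."""
    return sum(abs(cur[by + j][bx + i] - ref[by + dy + j][bx + dx + i])
               for j in range(size) for i in range(size))


def best_motion_vector(cur, ref, bx, by, size=4, search=2):
    """Full search over [-search, search]^2; returns (dx, dy) with minimum SAD."""
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # skip candidates that fall outside the reference frame
            if not (0 <= bx + dx and bx + dx + size <= w and
                    0 <= by + dy and by + dy + size <= h):
                continue
            cost = sad(cur, ref, bx, by, dx, dy, size)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best[1], best[2]


# Synthetic test pattern; the "current" frame is the reference shifted by (2, 1).
ref = [[(r * r * 7 + c * c * 13 + r * c * 5) % 97 for c in range(12)] for r in range(12)]
cur = [[ref[min(r + 1, 11)][min(c + 2, 11)] for c in range(12)] for r in range(12)]
print(best_motion_vector(cur, ref, 4, 4))
```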


Video sequence: Tennis, frames 0 and 1
[Figure: previous frame (frame 0) and current frame (frame 1)]


Frame Difference

[Figure: frame difference between frames 0 and 1]


What is motion estimation?

[Figure: motion vector field of frame 1]


What is motion compensation ?

[Figure: motion-compensated frame]


Motion Compensated Frame Difference

[Figure: motion-compensated frame difference and plain frame difference, frames 0 and 1]


Video Worlds

Video Structures


Frame Types

Three types of frames:
Intra (I): the frame is coded as if it is an image
Predicted (P): predicted from an I or P frame
Bi-directional (B): forward and backward predicted
from a pair of I or P frames.
A typical frame arrangement is:
I1 B1 B2 P1 B3 B4 P2 B5 B6 I2
P1 is forward-predicted from I1 and P2 from P1. B1, B2 are
interpolated from I1 and P1, B3, B4 are interpolated
from P1 and P2, and B5, B6 are interpolated from P2 and I2.
Other frame types also exist: D (MPEG-1), and
SP and SI (newer standards such as H.264).
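B frames need the *following* anchor frame, so frames are transmitted in a different order from the display order. A sketch of that reordering follows: each anchor (I or P) is sent before the B frames that sit between it and the previous anchor. Frame labels follow the example sequence above.

```python
def coding_order(display_order: list[str]) -> list[str]:
    """Reorder frames so each anchor (I/P) precedes the B frames that depend on it."""
    out, pending_b = [], []
    for frame in display_order:
        if frame.startswith("B"):
            pending_b.append(frame)  # wait until the next anchor is sent
        else:
            out.append(frame)        # anchor first...
            out.extend(pending_b)    # ...then the B frames it closes
            pending_b = []
    return out + pending_b


print(coding_order(["I1", "B1", "B2", "P1", "B3", "B4", "P2", "B5", "B6", "I2"]))
```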


Macro-blocks and Blocks

A 16x16 RGB macroblock (16x16x3 samples) is coded as:
Y (16x16, i.e. four 8x8 luma blocks)
Cb (8x8)
Cr (8x8)


VIDEO CODING STANDARDS


Chronological evolution of Video Coding Standards

ITU-T VCEG: H.261 (1990), H.263 (1995/96), H.263+ (1997/98), H.263++ (2000)
Joint ITU-T / ISO/IEC: MPEG-2 / H.262 (1994/95), H.264 / MPEG-4 Part 10 (2002)
ISO/IEC MPEG: MPEG-1 (1993), MPEG-4 v1 (1998/99), MPEG-4 v2 (1999/00), MPEG-4 v3 (2001)
(Timeline: 1990 to 2003)

ITU Standards

H261
Early standard
Compressed data rate, n*64 Kbps (was created for ISDN
connections, remember it’s an ITU standard)
Resolution
QCIF 176x144,CIF 352x288
H263
Supports a wider range of bit-rates: below 64 kbps and up
Error recovery and performance improvements over H.261
Resolution
SQCIF, QCIF, CIF, 4CIF 704x576, 16CIF 1408x1152

www.dsp-ip.com


ITU Standards

H264
Improved H.263
Arithmetic coding
Dynamic block size (not only 8x8)
(Much) better results than MPEG-4 Part 2
Tradeoff – computational overhead.



ITU Standards
ITU standard evolution over the years
H.261 → H.262 (MPEG-2) → H.263 → H.264 → What's next?



ISO MPEG Standards

MPEG-1: CD compression (1x CD-ROM data rate)
MPEG-2: Television Broadcast quality
MPEG-4: Multimedia & Systems standard
MPEG-7: Meta-Data description
MPEG-21: Standard for the creation,
distribution and consumption of Multimedia
(mainly DRM, IPMP).



Data virtualization in ISO standards
The evolution of standards from pixel description to
object description, manipulation and rights in ISO
standards
MPEG-21: Object rights
MPEG-7: Object descriptors
MPEG-4: Object coding
MPEG-1/2: Image coding



MPEG-1

A standard for storage and retrieval of audio and
video, (1992)
Up to 1.5 Mbps
P-frame, Predictive-coded frames
requires info from previous I or P frames
B-frames, Bi-directionally predictive coded frames
requires previous and following frames
D-frame, DC-coded frames
Consists of lowest frequency of an image
Used for fast forward and fast reverse modes


MPEG-2

A standard for high-quality video and digital
television, (1994)
2-100 Mbps
Coding similar to MPEG-1
Several profiles and levels for different
resolutions and qualities
Enhanced audio, (multiple channels)


MPEG-4

Designed for multimedia, (v1 Oct.1998)
Coding of both natural and synthetic audio-
visual data
Improved efficiency, (object based)
Error robustness
Many more MM features


Why ISO adopted ITU technology
Comparison of compression formats
[Figure: quality as Y-PSNR [dB] (25 to 38) vs. bit-rate [kbit/s] (0 to 3500), CIF at 30 Hz, for JVT/H.26L, MPEG-4, MPEG-2 and H.263]
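The quality axis in such comparisons is luma PSNR. A minimal sketch for 8-bit samples (peak 255), with planes stored as lists of rows (a hypothetical representation):

```python
import math


def psnr(original, decoded, peak: int = 255) -> float:
    """PSNR in dB between two equal-size 8-bit luma planes (lists of rows)."""
    diffs = [(a - b) ** 2 for ra, rb in zip(original, decoded) for a, b in zip(ra, rb)]
    mse = sum(diffs) / len(diffs)
    return math.inf if mse == 0 else 10 * math.log10(peak * peak / mse)


print(round(psnr([[10, 10], [10, 10]], [[12, 8], [12, 8]]), 2))  # MSE = 4
```

Each point on such a curve is an average PSNR over a sequence, encoded at one bit-rate.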


MPEG-2 STANDARD


MPEG History

Moving Picture Experts Group was founded in
January 1988 by Leonardo Chiariglione together with
around 15 experts in compression technology
Creator of numerous standards like MPEG-1, MPEG-
2, MPEG-4, MPEG-7, MPEG-21 etc.
The Group has not limited its scope to only "pictures";
sound wasn't forgotten (e.g. MPEG-1 Layer 3)
The industry quickly adopted the MPEG standards
(Philips, Samsung, Intel, Sony etc.)
MPEG has given birth to a number of technologies
we now take for granted: DVD and Digital TV
(MPEG-2), MP3 (MPEG-1 Layer 3)


MPEG-2

In 1994, MPEG published ISO/IEC
13818, also known as MPEG-2
MPEG-2 was the standard adopted by DVD
and Digital TV
MPEG-2 is designed for video compression
between 1.5 and 15 Mbps for SD
MPEG-2 streams come in 2 forms: Program
Stream and Transport Stream


The MPEG Standard


MPEG2- Systems

Define
Storage
Transport
Control
of MPEG2 streams


Model for MPEG-2 Systems


MPEG-2 Program Stream
Similar to MPEG-1 Systems Multiplex
Combines one or more Packetised Elementary
Streams (PES), which have a common time-
base, into a single stream
Designed for use in relatively error-free
environments and suitable for applications
which may involve software processing
Program stream packets may be of variable and
relatively great length
Variable length / Error free what's the
connection?

MPEG-2 Transport Stream
Combines one or more Packetized Elementary
Streams (PES) with one or more independent
time bases into a single stream (sometimes
called multiplex)
Elementary streams sharing a common time-
base form a program
Designed for use in environments where errors
are likely, such as storage or transmission in
lossy or noisy media
The transport stream is made of packets with
fixed length of 188 bytes – Why?
What is the header overhead in 188 bytes
packet?
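A sketch that decodes the fixed 4-byte transport stream packet header. Field positions follow the MPEG-2 Systems layout as commonly documented; please check against ISO/IEC 13818-1 before relying on it. With no adaptation field, the header overhead is 4 of 188 bytes, about 2.1%. The packet in the usage line is synthetic.

```python
TS_PACKET_SIZE = 188
SYNC_BYTE = 0x47


def parse_ts_header(packet: bytes) -> dict:
    """Decode the fixed 4-byte header of one 188-byte TS packet."""
    if len(packet) != TS_PACKET_SIZE or packet[0] != SYNC_BYTE:
        raise ValueError("not a 188-byte TS packet starting with 0x47")
    b1, b2, b3 = packet[1], packet[2], packet[3]
    return {
        "transport_error": bool(b1 & 0x80),
        "payload_unit_start": bool(b1 & 0x40),
        "transport_priority": bool(b1 & 0x20),
        "pid": ((b1 & 0x1F) << 8) | b2,  # 13-bit packet identifier
        "scrambling": (b3 >> 6) & 0x03,
        "adaptation_field_control": (b3 >> 4) & 0x03,
        "continuity_counter": b3 & 0x0F,
    }


pkt = bytes([0x47, 0x40, 0x11, 0x15]) + bytes(184)  # synthetic packet
print(parse_ts_header(pkt))
print(f"minimum header overhead: {4 / TS_PACKET_SIZE:.1%}")
```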

MPEG2 AAC


MPEG2 Audio (AAC)


MPEG-2 Audio

Backwards compatible - defines extensions:
MultiChannel coding
5 channel audio (L, R, C, LS, RS)
Multilingual coding
7 multilingual channels
Lower sampling frequencies (LSF)
Optional Low Frequency Enhancement (LFE) -
Bass


Media Delivery Components
File Format / Container
Codec
Delivery Protocols


File Formats

moov: Movie (meta-data)
  trak: Video track
  trak: Audio track
mdat: Media Data
  sample sample sample sample (the frames of each track)


Agenda

Intro to file formats
Second Generation formats
RIFF: AVI, WAV
Third Generation Containers
MPEG4 FF
MKV


File Format Segmentation

File Formats
1st Generation: Raw / Proprietary
2nd Generation: Media Muxer
3rd Generation: Object Based, XML Based


2ND GENERATION FILE FORMATS


2ND Generation Files features
Multiple media tracks in the same file
Identification of codec
Usually by FourCC
Interleaving


2nd Generation File Formats

2nd Generation FF
RIFF: WAV, AVI
ASF: WMA, WMV
MPEG2: MP2PS (VOB), MP2TS
FLV


AVI FILE FORMAT


AVI Overview
AVI files use the RIFF format (like WAV)
Introduced by Microsoft in 1992
File is divided into:
Streams – Audio, Video, Subtitles
Blocks – "Chunks"


Blocks / Chunks
A RIFF file's logical unit
Chunks are identified by four letters (FourCC)
An AVI RIFF file has two mandatory sub-chunks and
one optional sub-chunk
Mandatory chunks:
hdrl – File header
movi – Media data
Optional chunk:
idx1 – Index
*This order is fixed

RIFF ('AVI '
      LIST ('hdrl'
            'avih'(<Main AVI Header>)
            LIST ('strl' ... )
            . . . )
      LIST ('movi' . . . )
      ['idx1'<AVI Index>]
)
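A sketch of walking RIFF chunks. Each chunk is a 4-byte FourCC, a 4-byte little-endian size, then the payload, padded to an even length. 'RIFF' and 'LIST' payloads start with a 4-byte form/list type. The demo file is a tiny synthetic AVI skeleton built in memory, and `make_chunk` is a hypothetical helper.

```python
import struct
from typing import Iterator, Optional, Tuple


def iter_chunks(data: bytes, offset: int = 0,
                end: Optional[int] = None) -> Iterator[Tuple[str, int, int]]:
    """Yield (fourcc, payload_offset, payload_size) for chunks in data[offset:end]."""
    end = len(data) if end is None else end
    while offset + 8 <= end:
        fourcc = data[offset:offset + 4].decode("ascii")
        (size,) = struct.unpack_from("<I", data, offset + 4)
        yield fourcc, offset + 8, size
        offset += 8 + size + (size & 1)  # skip payload and pad byte


def make_chunk(fourcc: bytes, payload: bytes) -> bytes:
    """Hypothetical helper: build one chunk, padded to an even length."""
    pad = b"\x00" if len(payload) % 2 else b""
    return fourcc + struct.pack("<I", len(payload)) + payload + pad


# Synthetic skeleton: RIFF('AVI ' LIST('hdrl' avih) LIST('movi' 00dc))
avih = make_chunk(b"avih", bytes(56))
hdrl = make_chunk(b"LIST", b"hdrl" + avih)
movi = make_chunk(b"LIST", b"movi" + make_chunk(b"00dc", b"\x01\x02\x03"))
riff = make_chunk(b"RIFF", b"AVI " + hdrl + movi)

# Top-level chunks start after 'RIFF', the size field and 'AVI ' (12 bytes).
for fourcc, pos, size in iter_chunks(riff, 12):
    list_type = riff[pos:pos + 4].decode("ascii") if fourcc == "LIST" else ""
    print(fourcc, list_type, size)
```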


AVI main header
RIFF 'AVI ' - Identifies the file as RIFF file.
LIST 'hdrl' - Identifies a chunk containing sub-
chunks that define the format of the data.
'avih' - Identifies a chunk containing general
information about the file. Includes:
dwMicroSecPerFrame - Time between frames (in microseconds)
dwMaxBytesPerSec – number of bytes per second
the player should handle
dwReserved1 - Reserved
dwFlags - Contains any flags for the file.
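A sketch of reading the main header fields from an 'avih' payload. It assumes the field order of Microsoft's AVIMAINHEADER as we understand it, where the third DWORD (listed above as dwReserved1) is called dwPaddingGranularity in newer documentation. The header values used here are hypothetical.

```python
import struct

# Assumed field order of the first ten DWORDs of the 'avih' payload.
AVIH_FIELDS = ("dwMicroSecPerFrame", "dwMaxBytesPerSec", "dwPaddingGranularity",
               "dwFlags", "dwTotalFrames", "dwInitialFrames", "dwStreams",
               "dwSuggestedBufferSize", "dwWidth", "dwHeight")


def parse_avih(payload: bytes) -> dict:
    """Unpack the first ten little-endian DWORDs of an 'avih' chunk payload."""
    return dict(zip(AVIH_FIELDS, struct.unpack_from("<10I", payload, 0)))


# Hypothetical header: 40,000 us per frame, 250 frames, 2 streams, 320x240,
# flags 0x10 (AVIF_HASINDEX); the last four DWORDs are reserved.
payload = struct.pack("<14I", 40000, 0, 0, 0x10, 250, 0, 2, 0, 320, 240, 0, 0, 0, 0)
hdr = parse_avih(payload)
fps = 1_000_000 / hdr["dwMicroSecPerFrame"]
print(hdr["dwWidth"], hdr["dwHeight"], fps)  # prints: 320 240 25.0
```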


Example - headers
[Figure: hex dump of an AVI file header, annotated with: chunk ID, chunk size, format, AVI file header, time between frames, data rate, flags, total no. of frames, initial frame, streams, frame width (320), frame height, reserved, stream header, junk chunk identifier and size of padding]


Example – data chunks

Audio data chunk
(stream 01)
video data chunk
(stream 00)


AVI Summary
Advantages
Includes both audio and video
Index-able
Disadvantage
Not suited for progressive download
Very rigid format
Insufficient support for: seeking, metadata, multiple
reference frames


3RD GENERATION FILE FORMATS


Why “Fix it”?
2nd Generation Formats are missing:
Metadata
Separate from Media
Info on angle, language, Synchronization
Versioning
Better Streaming Support
Reduce CPU per stream
Better seeking support
Better parsing
XML
Atom Based


Main Attributes
File format is not just a Video / Audio multiplexer
Separation between
Media – Audio, Video, Images, Subtitles
Metadata – Indexing, frame length, Tags


3rd Generation File Formats

3rd Generation
XML Based: Matroska (MKV)
Object Based: MOV, MPEG4 FF
MPEG4 FF derivatives: 3GPP FF, Fragmented MPEG4 FF


MPEG4 FILE FORMAT


MP4 File Format
File Structuring Concepts
Separate the media data from descriptive (meta)
data.
Support the use of multiple files.
Support for hint tracks:
support of real time streaming over any protocol


Separate Metadata and Media
Key meta-information is compact
The type of media present
Time-scales
Timing
Synchronization points etc.
Enables
Random access
Inspection, composition, editing etc.
Simplified update


Multiple file support
Use URLs to ‘point to’ media
Distinct from URLs in MPEG-4 Systems
URLs use file-access service
e.g. file://, http://, ftp:// etc.
Permits assembly of composition without
requiring data-copy
Referenced files contain only media
Meta-data all in ‘main’ file


Logical File Structure
Presentation (‘movie’) contains
Tracks which contain
Samples


Physical Structure—File
Succession of objects (atoms, boxes)
Exactly one Meta-data object
Zero or more media data object(s)
Free space etc.


Example Layout

moov: Movie (meta-data)
  trak: Video track
  trak: Audio track
mdat: Media Data
  frame frame ...
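A sketch of walking the top-level boxes (atoms) of an MP4 file. Each box starts with a 32-bit big-endian size that includes the header, followed by a 4-character type. A size of 1 means a 64-bit "largesize" follows, and a size of 0 means the box runs to the end. The file here is synthetic, and `make_box` is a hypothetical helper.

```python
import struct
from typing import Iterator, Optional, Tuple


def iter_boxes(data: bytes, offset: int = 0,
               end: Optional[int] = None) -> Iterator[Tuple[str, int, int]]:
    """Yield (type, payload_offset, payload_size) for boxes in data[offset:end]."""
    end = len(data) if end is None else end
    while offset + 8 <= end:
        size, box_type = struct.unpack_from(">I4s", data, offset)
        header = 8
        if size == 1:    # 64-bit largesize follows the type
            (size,) = struct.unpack_from(">Q", data, offset + 8)
            header = 16
        elif size == 0:  # box extends to the end of the data
            size = end - offset
        if size < header:
            raise ValueError(f"bad box size {size} at offset {offset}")
        yield box_type.decode("ascii"), offset + header, size - header
        offset += size


def make_box(box_type: bytes, payload: bytes) -> bytes:
    """Hypothetical helper: build a box with a 32-bit size."""
    return struct.pack(">I", 8 + len(payload)) + box_type + payload


movie = (make_box(b"moov", make_box(b"trak", b"vid") + make_box(b"trak", b"aud"))
         + make_box(b"mdat", bytes(10)))
for kind, pos, size in iter_boxes(movie):
    print(kind, size)
    if kind == "moov":  # container box: walk its children
        for child, _, child_size in iter_boxes(movie, pos, pos + size):
            print("  ", child, child_size)
```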


Meta-data tables
Sample Timing
Sample Size and position
Synchronization (random access) points, priority
etc.
Temporal/physical order de-coupled
May be aligned for optimization
Permits composition, editing, re-use etc. without re-
write
Tables are compacted
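As one example of a compacted table, the MP4 time-to-sample table ('stts') stores runs of (sample_count, sample_delta) instead of one timestamp per sample. The sketch below expands such runs into per-sample decode times in media-timescale ticks. The entries are illustrative.

```python
def sample_times(stts_entries: list[tuple[int, int]]) -> list[int]:
    """Expand compact (sample_count, sample_delta) runs into per-sample decode times."""
    times, t = [], 0
    for count, delta in stts_entries:
        for _ in range(count):
            times.append(t)
            t += delta
    return times


print(sample_times([(3, 1000), (2, 500)]))
# A constant-rate track needs a single entry, e.g. 300 samples of 1001 ticks each:
print(len(sample_times([(300, 1001)])))
```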


Multi-protocol Streaming support
Two kinds of track
Media (Elementary Stream) Tracks
Sample is Access Unit
Protocol ‘hint’ tracks
Sample tells server how to build protocol transmission
unit (packet, protocol data unit etc.)


Track types
Visual—’description’ formats
MPEG4
JPEG2000
Audio—’description’ formats
MPEG4 compressed tracks
‘Raw’ (DV) audio
Other MPEG-4 tracks
Hint Tracks (streaming)


Track Structure
Sample pointers (time, position)
Sample description(s)
Track references
Dependencies, hint-media links
Edit lists
Re-use, time-shifting, ‘silent’ intervals etc.


Hint Tracks
May include media (ES) data by ref.
Only ‘extra’ protocol headers etc. added to hint
tracks — compact
Make SL, RTP headers as needed
May multiplex data from several tracks
Packetization/fragmentation/multiplex through
hint structures
Timing is derived from media timing


Hint track structure

moov: Movie (meta-data)
  trak: Video track
  trak: Hint track
mdat: Sample Data
  video samples: frame, frame
  hint samples: header + pointer, header + pointer (the pointers reference the video frames)


Extensibility
Other media types.
Non-SC29 sample descriptions (e.g. other video).
Non-SC29 track types (e.g. laboratory instrument
trace).
Copyright notice (file or track level) etc.
General object extensions (GUIDs).


Advantages
Compatibility
Files can be played by other companies' players:
RealPlayer with the Envivio plug-in
Windows Media Player etc.
Files can be streamed by other companies' streaming
servers:
Darwin Streaming Server
QuickTime Streaming Server


Single File-Multiple data types
No need to do an export process for files: one
file type is used for storage of video, audio,
events, continuous telemetry data from sensors
and JPEG images in one file.
[Figure: a single file holding audio, metadata, video, JPEG images, continuous sensor data and events]


Single file playback
All video tracks of a site could be stored in one
file. In order to view many cameras in a
synchronized manner, the MPEG-4 file format
can hold all the views of multiple cameras in one
file.
[Figure: a single file holding audio, metadata and video tracks from cameras 1 to N]


Skimming
Skimming – shortening a long movie to its
interesting points, much like creating a “promo”.
For example skimming a surveillance movie of
two hours to 2 minutes where there is movement
and people are entering the building.
MPEG-4 FF enables the creation of skims within
the file through the use of edit-list (part of the
standard) without overhead.


MKV FILE FORMAT

XML Based File-Format


MKV - File Format
Container file format for videos, audio tracks,
pictures and subtitles all in one file.

Announced in December 2002 by Steve Lhomme.

Based on a binary XML-like format called EBML
(Extensible Binary Meta Language)

Complete open-standard format. (Free for
personal use).

Source is licensed under the GNU LGPL.


MKV - Specifications
Can contain chapter entries of video streams

Allows fast in-file seeking.

Metadata tags are fully supported.

Multiple streams container in a single file.

Modular – Can be expanded to company special
needs.

Can be streamed over HTTP, FTP, etc.


MKV Support software & hardware
Players:
All Player, BS.Player, DivX Player, GStreamer-based
players, VLC media player, xine, Zoom Player, MPlayer,
Media Player Classic, ShowTime and many more
Media Centers:
Boxee, DivX connected, Media Portal, PS3 Media
Server, Moovida, XBMC etc.
Blu-Ray Players:
Samsung, LG and Oppo.
Mobile Players:
Archos 5 android device, Cowon A3 and O2.


MKV - EBML in details
A binary format for representing data in XML-like
format.
Using specific XML tags to define stream
properties and data.
MKV conforms to the rules of EBML by defining
a set of tags.
Segment, Info, Seek, Block, Slices etc.
Uses 3 lacing mechanisms to reduce the overhead of
small data blocks (usually frames):
Xiph, EBML or fixed-size lacing.
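EBML stores element IDs and sizes as variable-length integers. The number of leading zero bits in the first byte, plus one, gives the total length. Element IDs conventionally keep the length-marker bit; data sizes mask it off. A minimal decoder sketch follows. It ignores the special all-ones "unknown size" value.

```python
def read_vint(data: bytes, pos: int = 0, keep_marker: bool = False) -> tuple[int, int]:
    """Decode an EBML variable-length integer at data[pos]; returns (value, length)."""
    first = data[pos]
    if first == 0:
        raise ValueError("invalid VINT (longer than 8 bytes)")
    length, mask = 1, 0x80
    while not first & mask:  # count leading zero bits
        mask >>= 1
        length += 1
    value = first if keep_marker else first & (mask - 1)
    for b in data[pos + 1:pos + length]:
        value = (value << 8) | b
    return value, length


print(read_vint(bytes([0x81])))                                      # size 1
print(hex(read_vint(bytes([0x1A, 0x45, 0xDF, 0xA3]), keep_marker=True)[0]))  # EBML header ID
```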


MKV – Simple representation
Type – Description
Header – Version info, EBML type (matroska in our case).
Meta Seek Information – Optional; allows fast seeking of other level 1 elements in the file.
Segment Information – File information: title, unique file ID, part number, next file ID.
Track – Basic information about the track: resolution, sample rate, codec info.
Chapters – Predefined seek points in the media.
Clusters – Video and audio frames for each track.
Cueing Data – Stores cue points for each track. Allows fast in-track seeking.
Attachment – Any other file related to this one (subtitles, album covers, etc.).
Tagging – Tags that relate to the file and to each track (similar to MP3 ID3 tags).


MKV – Streaming
Matroska supports two types of streaming.
File Access
Used for reading file locally or from remote web
server.
Prone to reading and seeking errors.
Causes buffering issues on slow servers.

Live Streaming
Usually over HTTP or other TCP based protocol.
Special streaming structure – no Meta seek, Cues,
Chapters or attachments are allowed.


File Format Summary - Trends
Metadata is important
Simple metadata or XML
Separated from media
Forward compatibility
Do not crash on a data entry that is not understood
Progressive download oriented
Multi-bitrate oriented
Fragmentation -> Lower granularity
Self contained File fragments
CDN-ability


Video Codecs

[Figure: file layout: moov (movie meta-data with video and audio trak) and mdat (media data frames)]

Why advance? MPEG-2 works.
Coding efficiency
Packetization
Robustness
Scalable profiles
Internet requires Interaction
Scalable & On demand
Fast-Forward / Fast Rewind / Random Access
Stream switching
Multi-bitrate
Multi-resolution / multi-screen

Coding efficiency Motivation


Codec discussion

Internet and video codec
Standard codecs – MPEG4-2 and H.264
Non standard codecs
Sorenson Spark
VP6
WMV9
VC-1
VP8


H.264


H.264 Terminology
The following terms are used interchangeably:
H.26L
“JVT CODEC”
The "AVC" or Advanced Video Coding
Proper Terminology going forward:
MPEG-4 Part 10 (Official MPEG Term)
ISO/IEC 14496-10 AVC
H.264 (Official ITU Term)


H264 Standard ideas
“Blocks” size fixed ->Variable
Slice
Block
Block Size order/scanning –> different orders
Zig-zag, Flexible Macroblock Order
Additional spatial prediction - >Intra prediction
Inter prediction 1 frame only ->Multiple frames
P and B picture
Multiple reference frame


H264 Standard Ideas

Pixel interpolation
Motion vectors
In-loop Deblocking filter
Improved Entropy coding


New Features of H.264 - summarized

SP, SI - additional picture types
NAL (Network Abstraction Layer) - coding and transport layers separation
CABAC - additional entropy coding mode
¼-pixel luma and 1/8-pixel chroma motion vector precision
In-loop de-blocking filter
B-frame prediction weighting
4×4 integer transform
Multi-mode intra-prediction
FMO - Flexible MacroBlock Ordering

Block diagram


Profiles and Levels
Profiles: Baseline, Main, and Extended (X)
Baseline: Progressive, Videoconferencing &
Wireless
Main: especially Broadcast
Extended: Mobile network
Wireless <> Mobile


Baseline Profile
Baseline profile is the minimum
implementation
No CABAC, 1/8-sample MC, B-frames or SP slices
15 levels
Levels constrain resolution, capability, bit rate, buffer size and number of reference frames
Built to match popular international production
and emission formats
From QCIF to D-Cinema
Progressive (not interlaced)
I and P slice types


Baseline Profile
1/4-sample Inter prediction
Deblocking filter, Redundant slices
VLC-based entropy coding (no CABAC)
4:2:0 chroma format
Flexible Macroblock Ordering (FMO)
Arbitrary Slice Order (ASO)
The decoder processes slices in an arbitrary order as they
arrive at the decoder.
The decoder does not have to wait for all slices to be
properly arranged before it starts processing them.
Reduces the processing delay at the decoder.


Baseline Profile
FMO: Flexible Macroblock Ordering
With FMO, macroblocks are coded according to a
macroblock allocation map that groups macroblocks from
spatially different locations in the frame into the same
slice group.
Enhances error resilience
Redundant slices:
Allow the transmission of duplicate slices.
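To make the allocation map concrete, here is a minimal sketch of one FMO pattern, the "dispersed" map, which spreads macroblocks of each slice group across the frame in a checkerboard-like layout. The formula follows my reading of the dispersed map type (slice_group_map_type 1) in the H.264 specification and should be checked against the spec before relying on it; the function name is hypothetical. With two slice groups, losing one group still leaves every lost macroblock surrounded by correctly received neighbours, which is what makes concealment easier.

```python
def dispersed_map(width_mbs: int, height_mbs: int, num_groups: int) -> list[list[int]]:
    """Slice-group id of every macroblock for a dispersed FMO map (assumed formula)."""
    groups = []
    for y in range(height_mbs):
        row = []
        for x in range(width_mbs):
            # (x + (y * num_groups) // 2) % num_groups: shifts the pattern on each row
            row.append((x + (y * num_groups) // 2) % num_groups)
        groups.append(row)
    return groups


for row in dispersed_map(4, 3, 2):
    print(row)
# [0, 1, 0, 1]
# [1, 0, 1, 0]
# [0, 1, 0, 1]
```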


H.264 Profiles & Levels - Main
All Baseline features except:
Arbitrary Slice Order (ASO)
Flexible Macroblock Order (FMO)
Redundant Slices
Plus:
Interlace
B slice types (bi-directional reference)
CABAC
Weighted prediction


Main Profile
CABAC
Good performance (bit rate reduction) by
Selecting models by context
Adapting estimates by local statistics
Arithmetic coding increases computational complexity:
more than 10%~20% of the total decoder execution time at
medium bitrate
Average bit-rate saving over CAVLC: 10-15%


Extended Profile
All Baseline features plus
Interlace
B slice types
Weighted prediction


Frame structure
Slices:
A picture is split into 1 or several slices.
Slices are self-contained.
Slices are a sequence of MBs.
MacroBlocks [MB]
Basic syntax & processing unit.
Contains 16x16 luma samples and
two 8x8 blocks of chroma samples (4:2:0).
MBs within a slice depend on each other.
MBs can be further partitioned.
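The macroblock definition above fixes how many MBs a frame contains and how many samples each one carries. The sketch below assumes 4:2:0 progressive video and pads partial macroblocks up to a multiple of 16 (H.264 codes such a frame at the padded size and signals cropping for display); the function names are illustrative only.

```python
MB_SIZE = 16  # luma samples per macroblock side


def macroblock_grid(width: int, height: int) -> tuple[int, int, int]:
    """Return (MBs per row, MB rows, total MBs) for a progressive frame."""
    mbs_w = (width + MB_SIZE - 1) // MB_SIZE   # round up: partial MBs are padded
    mbs_h = (height + MB_SIZE - 1) // MB_SIZE
    return mbs_w, mbs_h, mbs_w * mbs_h


def samples_per_mb_420() -> int:
    """16x16 luma samples plus two 8x8 chroma blocks (Cb and Cr) in 4:2:0."""
    return 16 * 16 + 2 * (8 * 8)


print(macroblock_grid(1920, 1080))  # (120, 68, 8160): 1080 lines coded as 1088
print(samples_per_mb_420())         # 384
```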


Macroblock scanning


Scanning order of residual blocks
For an Intra 16x16 MB, the block labeled -1, containing the luma DC coefficients, is transmitted first.
Luma residual blocks 0-15 are transmitted next.
Blocks 16 & 17 each contain a 2x2 array of chroma DC coefficients.
Chroma residual blocks 18-25 are sent last.


Variable block size
Slices
A picture is split into 1 or several slices
Slices are a sequence of macroblocks
Macroblock
Contains 16x16 luminance samples and
two 8x8 blocks of chrominance samples
Macroblocks within a slice depend on
each other
Macroblocks can be further partitioned

[Figure: a picture divided into Slice 0, Slice 1 and Slice 2]

Basic Macroblock Coding Structure
[Figure: H.264 macroblock encoder block diagram. Blocks: input video signal split into macroblocks of 16x16 pixels; coder control (control data); transform/scaling/quantization (quantized transform coefficients); entropy coding; decoder with scaling & inverse transform; de-blocking filter; output video signal; intra-frame prediction; motion compensation; intra/inter switch; motion estimation (motion data).]


Motion Compensation
[Figure: the macroblock encoder block diagram with motion compensation highlighted. Various block sizes and shapes: MB types 16x16, 16x8, 8x16 and 8x8; 8x8 types 8x8, 8x4, 4x8 and 4x4.]


Tree structured Motion Compensation
[Figure: the macroblock encoder block diagram with tree-structured motion compensation. MB types 16x16, 16x8, 8x16 and 8x8; each 8x8 can be split further into 8x8, 8x4, 4x8 or 4x4.]
Motion vector accuracy: 1/4 sample (6-tap filter)


Variable block size
Block sizes of 16x16, 16x8, 8x16, 8x8, 8x4, 4x8 and 4x4 are available:
Mode 1: 1 16x16 block
Mode 2: 2 16x8 blocks
Mode 3: 2 8x16 blocks
Mode 4: 4 8x8 blocks
Mode 5: 8 8x4 blocks
Mode 6: 8 4x8 blocks
Mode 7: 16 4x4 blocks
Using seven different block sizes can translate into bit rate savings of more than 15% compared to using only a 16x16 block size.


How to select the partition size?

Choose the partition size that minimizes the total cost of
the coded residual plus the motion vectors


The Trade-off
A large partition size (e.g. 16x16, 16x8, 8x16) requires a small
number of bits to signal the choice of motion vector and the
partition type.
However, the motion-compensated residual may contain a
significant amount of energy in frame areas with high detail.
A small partition size (e.g. 8x4, 4x4) may give a lower-energy
residual after motion compensation but requires a large number
of bits to signal the motion vectors and the choice of partition.
The choice of partition size therefore has a significant impact on
compression performance.
In general, a large partition size is appropriate for
homogeneous areas of the frame, and a small partition size may
be beneficial for detailed areas.
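H.264 does not specify how an encoder picks a partition; the standard only defines what is signalled. One common encoder approach to the trade-off above is a Lagrangian cost J = D + λ·R, where D measures the residual (e.g. SAD) and R counts the side-information bits for partition type and motion vectors. The sketch below assumes that cost model; the candidate numbers are made up for illustration, not measurements.

```python
def choose_partition(candidates: dict[str, tuple[int, int]], lam: float) -> str:
    """Pick the partition minimising J = residual_cost + lam * side_info_bits.

    candidates maps a partition name to (residual_cost, side_info_bits).
    """
    return min(candidates, key=lambda name: candidates[name][0] + lam * candidates[name][1])


# Hypothetical costs for one macroblock in a detailed area
mb = {
    "16x16": (900, 10),   # few bits to signal, but a large residual
    "8x8": (600, 60),
    "4x4": (500, 200),    # small residual, many motion vectors to send
}
print(choose_partition(mb, lam=0))   # 4x4: bits are free, smallest residual wins
print(choose_partition(mb, lam=4))   # 8x8: balanced
print(choose_partition(mb, lam=20))  # 16x16: bits are expensive (low bitrate)
```

Larger λ corresponds to lower target bitrates, which is why coarse partitions tend to win there even in detailed areas.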


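Interpolation
Quarter-sample luma interpolation in 2 steps:
Step 1: half-sample positions are obtained by applying a 6-tap filter
with tap values (1, -5, 20, 20, -5, 1).
Step 2: quarter-sample positions are obtained by averaging samples at
integer and half-sample positions.
Example: the half-sample b between integer samples G and H in the row E, F, G, H, I, J:
b = round((E - 5F + 20G + 20H - 5I + J) / 32), clipped to the valid sample range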
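The two interpolation steps above can be written with integer arithmetic. This sketch assumes 8-bit samples and uses the H.264 form of the rounding: the 6-tap sum is divided by 32 as (sum + 16) >> 5 and clipped to 0..255, and a quarter sample is the rounded average (x + y + 1) >> 1. The sample values and function names are illustrative.

```python
def clip1(x: int) -> int:
    """Clip to the 8-bit sample range."""
    return max(0, min(255, x))


def half_sample(e: int, f: int, g: int, h: int, i: int, j: int) -> int:
    """Half-sample value between G and H using taps (1, -5, 20, 20, -5, 1)."""
    b1 = e - 5 * f + 20 * g + 20 * h - 5 * i + j
    return clip1((b1 + 16) >> 5)  # divide by 32 with rounding, then clip


def quarter_sample(full: int, half: int) -> int:
    """Quarter-sample value: rounded average of an integer and a half sample."""
    return (full + half + 1) >> 1


row = [10, 12, 40, 80, 90, 95]  # hypothetical samples E, F, G, H, I, J
b = half_sample(*row)           # half-sample between G and H
a = quarter_sample(row[2], b)   # quarter-sample between G and b
print(b, a)                     # 62 51
```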


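Chroma Interpolation
Chroma interpolation is 1/8-sample accurate: in 4:2:0 video the chroma
planes have half the luma resolution, so ¼-sample luma motion
corresponds to 1/8-sample chroma motion.

Fractional chroma sample positions are obtained by a weighted
(bilinear) average of the four surrounding integer chroma samples.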


Inter prediction modes
MVs for neighboring partitions are often highly
correlated.
So we encode MVDs instead of MVs:
MVD = MV - MVp, where MVp is the predicted MV
¼-pixel accurate motion compensation
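In the common case, H.264 forms MVp as the component-wise median of the motion vectors of three neighbouring partitions: left (A), above (B) and above-right (C). This sketch shows only that case; it ignores the special rules for unavailable neighbours and the directional prediction used for 16x8 and 8x16 partitions. Vectors are in quarter-sample units and the function names are illustrative.

```python
def median3(a: int, b: int, c: int) -> int:
    """Median of three integers."""
    return max(min(a, b), min(max(a, b), c))


def predict_mv(left, above, above_right):
    """Component-wise median of the neighbouring MVs (A = left, B = above, C = above-right)."""
    return (median3(left[0], above[0], above_right[0]),
            median3(left[1], above[1], above_right[1]))


def mvd(mv, mvp):
    """Motion vector difference written to the bitstream: MVD = MV - MVp."""
    return (mv[0] - mvp[0], mv[1] - mvp[1])


mvp = predict_mv((4, 2), (6, -1), (5, 3))
print(mvp)               # (5, 2)
print(mvd((7, 1), mvp))  # (2, -1): small values that cost few bits
```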


Multiple Reference Frames
[Figure: the macroblock encoder block diagram with multiple reference frames for motion compensation highlighted.]


Multiple Reference Frames


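Intra prediction modes
4x4 luminance prediction modes:
0 (Vertical), 1 (Horizontal), 2 (DC), 3 (Diagonal down-left), 4 (Diagonal down-right),
5 (Vertical-right), 6 (Horizontal-down), 7 (Vertical-left), 8 (Horizontal-up)

Mode 2 (DC)
Predict all pixels from
(A+B+C+D+I+J+K+L+4)/8 if the upper (A-D) and left (I-L) neighbors are available, or
(A+B+C+D+2)/4 if only the upper neighbors are available, or
(I+J+K+L+2)/4 if only the left neighbors are available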
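The DC mode rules above translate directly into integer code. This sketch assumes 8-bit video, where a block with no available neighbours is predicted as 128 (mid-grey); the divisions by 8 and 4 are written as shifts, which give the same result for non-negative sums. The function names are illustrative.

```python
def intra4x4_dc(above=None, left=None) -> int:
    """DC prediction value for a 4x4 luma block.

    above: samples A..D from the row above, or None if unavailable.
    left:  samples I..L from the column to the left, or None if unavailable.
    """
    if above is not None and left is not None:
        return (sum(above) + sum(left) + 4) >> 3
    if above is not None:
        return (sum(above) + 2) >> 2
    if left is not None:
        return (sum(left) + 2) >> 2
    return 128  # no neighbours available (8-bit video)


def predict_block(dc: int) -> list[list[int]]:
    """Mode 2 fills all 16 samples of the block with the same DC value."""
    return [[dc] * 4 for _ in range(4)]


above, left = [10, 20, 30, 40], [50, 60, 70, 80]
print(intra4x4_dc(above, left))  # 45
print(intra4x4_dc(above=above))  # 25
print(intra4x4_dc(left=left))    # 65
```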

4x4 luminance prediction modes


Intra 16x16 luminance and 8x8 chrominance prediction modes


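Inter prediction modes
Chrominance pixel interpolation

Fractional chrominance pixels are interpolated by taking weighted averages of the four surrounding original pixels A, B (upper pair) and C, D (lower pair), with weights set by the distances d_x, d_y from the new pixel:

$$V = \frac{(s-d_x)(s-d_y)A + d_x(s-d_y)B + (s-d_x)d_y C + d_x d_y D + s^2/2}{s^2}$$

where s = 8 for 1/8-sample chroma accuracy (integer division).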
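The bilinear chroma formula above with s = 8 can be implemented with integers only; dividing by s² = 64 after adding s²/2 = 32 is the same as (total + 32) >> 6. A is the top-left, B the top-right, C the bottom-left and D the bottom-right integer sample, and dx, dy are the 1/8-sample offsets. The function name is illustrative.

```python
def chroma_sample(a: int, b: int, c: int, d: int, dx: int, dy: int) -> int:
    """Bilinear chroma interpolation at 1/8-sample offset (dx, dy), 0 <= dx, dy < 8."""
    s = 8
    if not (0 <= dx < s and 0 <= dy < s):
        raise ValueError("dx and dy must be in the range 0..7")
    total = ((s - dx) * (s - dy) * a + dx * (s - dy) * b
             + (s - dx) * dy * c + dx * dy * d)
    return (total + s * s // 2) // (s * s)  # rounded division by 64


print(chroma_sample(50, 60, 70, 80, 0, 0))  # 50: integer position returns A
print(chroma_sample(0, 8, 16, 24, 4, 4))    # 12: centre is the average of the four
print(chroma_sample(0, 80, 0, 0, 2, 0))     # 20: 2/8 of the way from A to B
```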


Deblocking filter
Deblocking filter:
Improves subjective visual and objective quality of
the decoded picture
Significantly superior to post filtering
Filtering affects the edges of the 4x4 block structure
Highly content-adaptive filtering procedure mainly
removes blocking artifacts and does not
unnecessarily blur the visual content
Filtering strength depends on inter/intra coding, motion
and coded residuals.
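How filtering strength "depends on inter/intra coding, motion and coded residuals" is captured in H.264 by a boundary strength (Bs) value from 0 (no filtering) to 4 (strongest) for each edge. The sketch below is a simplified version for frame (progressive) coding: it ignores field/mixed-edge cases and the check on the number of motion vectors, and the `BlockInfo` type is a hypothetical stand-in for real decoder state.

```python
from dataclasses import dataclass


@dataclass
class BlockInfo:
    intra: bool             # block is intra coded
    has_coeffs: bool        # non-zero transform coefficients were coded
    ref: int                # reference picture id (hypothetical field)
    mv: tuple[int, int]     # motion vector in quarter-sample units


def boundary_strength(p: BlockInfo, q: BlockInfo, mb_edge: bool) -> int:
    """Simplified boundary strength for the edge between blocks p and q."""
    if p.intra or q.intra:
        return 4 if mb_edge else 3   # strongest filtering on intra MB edges
    if p.has_coeffs or q.has_coeffs:
        return 2                      # coded residual on either side
    if p.ref != q.ref:
        return 1                      # different reference pictures
    if abs(p.mv[0] - q.mv[0]) >= 4 or abs(p.mv[1] - q.mv[1]) >= 4:
        return 1                      # motion differs by one integer luma sample or more
    return 0                          # edge is not filtered


intra = BlockInfo(True, False, 0, (0, 0))
inter = BlockInfo(False, False, 0, (0, 0))
print(boundary_strength(intra, inter, mb_edge=True))  # 4
print(boundary_strength(inter, inter, mb_edge=True))  # 0
```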


Deblocking filter

Principle:


Deblocking filter
Deblocking filter: Highly compressed decoded inter picture

1) Without Filter 2) with H264/AVC Deblocking


Entropy coding


Entropy coding
Entropy coding methods:
CABAC - discussed above (Main Profile)
UVLC
H.264 offers a single Universal VLC (UVLC) table for
all symbols
CAVLC
CAVLC (Context-based Adaptive Variable Length Coding)
Limitations of VLC:
Probability distribution is static
Code words must have an integer number of bits (low
coding efficiency for highly peaked pdfs)
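In the final H.264 standard the single universal VLC is the Exp-Golomb code, used for header syntax elements as ue(v) (unsigned) and se(v) (signed). A codeword is M zeros followed by the (M+1)-bit binary form of codeNum + 1, so no table is needed. The sketch below encodes and decodes it; the function names are illustrative.

```python
def ue(code_num: int) -> str:
    """Unsigned Exp-Golomb codeword: M zeros, then (code_num + 1) in binary."""
    if code_num < 0:
        raise ValueError("code_num must be non-negative")
    info = bin(code_num + 1)[2:]          # binary of code_num + 1, starts with '1'
    return "0" * (len(info) - 1) + info   # prefix of M = len(info) - 1 zeros


def se(value: int) -> str:
    """Signed mapping for se(v): 0, 1, -1, 2, -2, ... -> codeNum 0, 1, 2, 3, 4, ..."""
    code_num = 2 * value - 1 if value > 0 else -2 * value
    return ue(code_num)


def decode_ue(bits: str) -> tuple[int, str]:
    """Decode one ue(v) codeword from the front of a bit string; return (value, rest)."""
    zeros = 0
    while bits[zeros] == "0":
        zeros += 1
    end = 2 * zeros + 1
    return int(bits[zeros:end], 2) - 1, bits[end:]


print([ue(n) for n in range(5)])  # ['1', '010', '011', '00100', '00101']
print(decode_ue("00100" + "1"))   # (3, '1')
```

Every codeword is at least one bit long, which is exactly the integer-bit limitation listed above.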


CABAC: Technical Overview

Context modeling: chooses a model conditioned on past observations
Binarization: maps non-binary symbols to a binary sequence
Probability estimation and coding engine (adaptive binary arithmetic coder): uses the provided model for the actual encoding and updates the probability estimate
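A toy model shows why an adaptive binary arithmetic coder beats a VLC on highly peaked sources: an ideal arithmetic coder spends -log2(p) bits per binary symbol, which can be far below one bit, and the estimate p is updated after every symbol from local statistics. This is NOT CABAC itself (CABAC uses a 64-state table-driven estimator per context and a multiplication-free range coder); the simple exponential update and the function name are assumptions for illustration.

```python
import math


def adaptive_cost(bits, rate=1 / 16, p0=0.5):
    """Ideal arithmetic-coding cost (in bits) of a binary sequence.

    p0 is the running estimate of P(bit == 0). Each bit costs -log2(P(bit));
    the estimate is then nudged toward the observed value (exponential update).
    """
    total = 0.0
    for bit in bits:
        p = p0 if bit == 0 else 1.0 - p0
        total += -math.log2(p)
        target = 1.0 if bit == 0 else 0.0
        p0 += rate * (target - p0)
    return total


peaked = [0] * 100        # a highly peaked source: almost always 0
alternating = [0, 1] * 50  # no useful skew
print(adaptive_cost(peaked))       # far fewer than 100 bits
print(adaptive_cost(alternating))  # about 1 bit per symbol or slightly more
```

Any VLC would need at least 100 bits for either sequence, since each codeword costs at least one bit.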


Complexity of codec design
H.264 codec design has much higher complexity (memory &
computation): a rough guess is a 2-3x increase in decoding power
relative to MPEG-4, and 4-5x for encoding
Problem areas:
Smaller block sizes for motion compensation (cache access
issues)
Longer filters for motion compensation (more memory
access)
Multi-frame motion compensation (more memory for
reference frame storage)
More segmentations of macroblock to choose from (more
searching in the encoder)
More methods of predicting intra data (more searching)
Arithmetic coding (adaptivity, computation on output bits)


Comparison


Summary

New key features are:
Enhanced motion compensation
Small blocks for transform coding
Improved de-blocking filter
Enhanced entropy coding
Substantial bit-rate savings (up to 50%) relative to
other standards for the same quality
Encoder complexity is roughly three times that of
prior standards
Decoder complexity is roughly twice that of
prior standards

Sorenson Spark video Codec
H.263 variant
Low footprint (code size): ~100K
Good performance for 2002
Quality: Spark vs. optimal MPEG (H.263+)
20-30% less efficient
Spark quality: RT vs. offline
RT has considerably lower quality due to processing
power and RT (delay) constraints


Sorenson Spark - 2
Does Not support:
Arithmetic coding
Advanced prediction
B-frames
Features
De-blocking filter mode
UMV - Unrestricted Motion Vector mode
Arbitrary frame dimensions
Supported by FFmpeg
D-frames


D-Frames
D (Disposable) frames
One way prediction
Provides flexible bit-rate: I-D-P-D-P-D-P
D-frames are used only when feeding a Flash
Communication Server
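The "flexible bit-rate" point works because no other frame predicts from a D-frame, so a server can drop D-frames without breaking decoding. The sketch assumes, following the "one way prediction" bullet, that each P or D frame predicts from the most recent I or P frame; it then checks that every remaining frame still has its reference after the D-frames are dropped. Function names are hypothetical.

```python
def references(types: str) -> dict[int, int]:
    """Map each predicted frame index to the index of its reference frame.

    Assumption: P and D frames predict from the most recent I or P frame,
    and D frames are never used as references (they are disposable).
    """
    refs = {}
    last_anchor = None
    for i, t in enumerate(types):
        if t == "I":
            last_anchor = i
        elif t in ("P", "D"):
            if last_anchor is None:
                raise ValueError("predicted frame before the first I frame")
            refs[i] = last_anchor
            if t == "P":
                last_anchor = i  # only I and P frames become references
        else:
            raise ValueError(f"unknown frame type {t!r}")
    return refs


def drop_disposable(types: str) -> list[int]:
    """Indices kept after dropping D frames, verifying all references survive."""
    kept = [i for i, t in enumerate(types) if t != "D"]
    refs = references(types)
    if any(refs[i] not in kept for i in kept if i in refs):
        raise RuntimeError("a kept frame lost its reference")
    return kept


gop = "IDPDPDP"
print(references(gop))       # {1: 0, 2: 0, 3: 2, 4: 2, 5: 4, 6: 4}
print(drop_disposable(gop))  # [0, 2, 4, 6]: half the frames, still decodable
```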


On2 TrueMotion VP6
Features
Compressed I-frames (Intra-compression makes use
of spatial predictors)
Unidirectional predicted frames (P-frames)
Multiple reference P-frames
8x8 iDCT-class transform (4x4 in VP7)
Improved quantization strategy (preserves image
details)
Advanced Entropy Coding


VP6 Features
Entropy Coding
various techniques are used based on complexity and
frame size including:
VLC
Context modeled binary coding (like H264 CABAC)
Bit Rate Control
To reach the requested data rate, VP6 adjusts
Quantization levels
Encoded frame dimensions
Entropy Coding
Drop frames


VP6 motion prediction
Motion Vectors
One vector per MacroBlock (16x16)
or
4 vectors, one for each 8x8 block
Quarter pel motion compensation support
Unrestricted motion compensation support
Two reference frames:
The previous frame
or
Previously bookmarked frame


VP6 vs H264

VP6 is much simpler than H.264
Requires fewer CPU resources for decoding & encoding
Code size is considerably smaller.
Simpler means less efficient? NO! Techniques
used:
Mix of adaptive sub-pixel motion estimation
Better prediction of low-order frequency coefficients
Improved quantization strategy
de-blocking and de-ringing filters
Enhanced context-based entropy coding


[Figure: PSNR graph "720p High Profile H.264 vs VP7", Alexander Trailer clip. Y axis: PSNR (quality; higher is better), 42.5 to 47. X axis: datarate in kilobits per second, 1400 to 4400 kbps.]
PSNR graphs are used for comparative analysis of compression quality. Each line represents the encode quality on a given clip at multiple datarates. The highest line represents the codec with the best quality. In this case VP7 clearly is better than x264.
Tips for reading this kind of graph (a PSNR graph):
Pick any point on the top line (in this case, VP7). Draw a line straight down from that point to the datarate axis; the crossing point tells you the datarate at that point.
Draw a line straight across until you intersect the lower line (in this case, x264), i.e. keep the quality/PSNR constant.
Draw a line straight down from that point to the datarate axis; the crossing point tells you the datarate at that point.
What this means: on this clip, VP7 at 2750 kbps has the same quality/PSNR as x264 High Profile at 3620 kbps, i.e. you'd need about 30% higher datarate to get the same quality out of x264 that you got from VP7.
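The graph-reading procedure (fix the PSNR, read off each codec's datarate) can be automated by linearly interpolating each rate/PSNR curve. The sketch below assumes curves given as (kbps, PSNR) points sorted by bitrate; the curve data in the usage example is hypothetical. The Bjøntegaard delta-rate metric is a more formal version of the same idea, averaged over a PSNR range.

```python
def rate_at_psnr(curve, target_psnr):
    """Linearly interpolate the bitrate at which an RD curve reaches target_psnr.

    curve: list of (kbps, psnr_db) points sorted by increasing bitrate.
    """
    for (r0, q0), (r1, q1) in zip(curve, curve[1:]):
        if q0 <= target_psnr <= q1:
            if q1 == q0:
                return r0
            return r0 + (target_psnr - q0) * (r1 - r0) / (q1 - q0)
    raise ValueError("target PSNR outside the measured range")


def extra_rate_needed(better, worse, target_psnr):
    """Fractional extra bitrate the 'worse' codec needs for the same PSNR."""
    return rate_at_psnr(worse, target_psnr) / rate_at_psnr(better, target_psnr) - 1


# Hypothetical curves, for illustration only
codec_a = [(1000, 40.0), (2000, 44.0)]
codec_b = [(1500, 40.0), (3000, 44.0)]
print(extra_rate_needed(codec_a, codec_b, 42.0))  # 0.5: codec B needs 50% more

# The graph's reading: 3620 kbps vs 2750 kbps at equal PSNR
print(f"{3620 / 2750 - 1:.1%}")  # 31.6%
```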


VP6 vs. H264

There is a difference between the codec
technology and a codec implementation.


On2 VP7
Not open source
Non-standard royalties model
Better video quality than H264
Used by:
Part of EVD – China standard for HD-DVD
Skype Beta (V 2.0)
Flash Player


Windows Media
Windows Media is a format used by Microsoft for
encoding and distributing audio and video.
Windows Media has two types of media:
Windows Media Audio (WMA)
Windows Media Video (WMV)
Windows Media Video
A modified version of MPEG-4
Codec versions started with version 7 for
Windows Media Player 7 and then evolved through versions
8-10


Windows Media 9 - VC1 Format
Microsoft submitted the version 9 codec to the Society
of Motion Picture and Television Engineers (SMPTE) for
approval as an international standard. SMPTE reviewed
the submission under the draft name "VC-1".

This codec is also used to distribute high definition video
on standard DVDs in a format Microsoft has branded as
WMV HD. This WMV HD content can be played back on
computers or compatible DVD players.

The trial version of the standard was published by
SMPTE in September 2005.

WMV9 was approved by SMPTE in April 2006.


GOOGLE VP8


Before we start
VP8's goal is NOT to deliver the best video
quality at any given bitrate
VP8 was designed as a mobile video decoder
and should be examined in this context:
VP8 vs H.264 base profile


Google VP8
In May 2010, at Google I/O (its developer
conference), Google released VP8 as open
source
VP8 is a lightweight video codec developed by
On2.
VP8 provides quality which is the same as or higher
than H.264 baseline profile
VP8 memory requirements are lower than H.264
base profile
After optimization, VP8 might have better MIPS
performance than H.264 base profile


Genealogy
VP8 is part of a well-known codec family
VP3 was released to open source to become
Xiph Theora
VP6 is used in Flash video
VP7 is used in Skype
Motivation:
“No Royalties” CODEC
[Figure: codec family tree: VP3 (Theora), VP6, VP7, VP8]


ADOPTION – WHO USES IT?

Software
Hardware
Platform & Publishers


Software Adoption
Android, Anystream, Collabora
Corecodec, Firefox, Adobe Flash
Google Chrome, iLinc,
Inlet, Opera, ooVoo
Skype, Sorenson Media
Theora.org, Telestream, Wildform.


Hardware Adoption
AMD, ARM, Broadcom
Digital Rapids, Freescale
Harmonic, Logitech, ViewCast
Imagination Technologies, Marvell
NVIDIA, Qualcomm, Texas Instruments
VeriSilicon, MIPS


Platforms and Publishers
Brightcove
Encoding.com
HD Cloud
Kaltura
Ooyala
YouTube
Zencoder


VP8 MAIN FEATURES


Adaptive Loop Filter
Improved loop filter provides better quality &
performance in comparison to H.264

Source: On2


Golden Frames
Golden frames enable better decoding of
background which is used for prediction in later
frames
Could be used as resync-point:
Golden frame can reference an I frame
Could be hidden (not for display)

Source: On2

Decoding efficiency
CABAC is an H.264 feature which improves
coding efficiency but consumes many CPU
cycles
VP8 has better entropy coding than H.264; this
leads to relatively lower CPU consumption under
the same conditions
• Decoding efficiency is
important for smooth
operation and long battery
life in netbooks and mobile
devices

Source: On2

Resolution up-scaling & downscaling
Supported by the decoder
Encoder could decide dynamically (RT
applications) to lower resolution in case of low
bit rate and let the decoder scale.
Removes the decision from the application
No need for an I frame


VP8 BASICS
Definitions
Bitstream structure
Frame structure


Definitions
Frame – same as H.264
Segment – Parallel to slice in H.264. Macroblocks in the
same segment share settings such as:
Probabilistic encoder/decoder settings
De-blocking filter settings
Partition – block of byte-aligned compressed
video bits.

