White paper


SCALABLE MEDIA PERSONALIZATION

Amos Kohn
September 2007

ABSTRACT
User expectations, competition and sheer revenue pressure are driving rapid development, and rapid operator acquisition, of highly complex media processing technologies.
Historically, cable operators provided "one stream for all" service in both the analog and digital domains. At most, they provided two to three streams for East and West Coast delivery. Video on Demand (VOD) represented a first step toward personalization, using personalized delivery, in the form of "pumping" and network QAM routing, in lieu of personalization of the media playout itself. In some cases, personalized advertisement play-lists were also created. This resulted in massive deployments of VOD servers and edge QAMs.
The second step in this evolution is the introduction of switched digital video, which takes linear delivery one step further to deliver a hybrid VOD/linear experience without applying any personal media processing. As with previous personalization approaches, user-based processing is limited to network pumping and routing, with no access to the actual media and no ability to manipulate it for true personalization.
True user personalization requires the generic ability to perform intensive media processing on a per-user basis. Today, an STB-based approach to media personalization appears dominant. It requires the future deployment of more capable, and therefore more expensive, STBs. Although straightforward, this approach conflicts with operators' needs to lower costs, unify the user experience and retain customers. The network approach, in which per-user personalization is completely or partially accomplished BEFORE the video reaches the STB (or any other user device), delivers the same experience but has been explored only in a very limited fashion. Yet this approach has the most potential to benefit operators, as it addresses most of the current and future challenges that operators face.




NETWORK-BASED PROCESSING TOOLKIT

The following defines a set of coding properties that are used as part of the media personalization solution. As indicated below, one of the advantages of this solution is that it is standards-based, as are the tools. The properties defined here are a combination of MPEG-4 (mostly H.264) and MPEG-2; the combination provides a solution for both coding schemes.
MPEG-4 is composed of a collection of "tools" built to support and enhance scalable composition applications. Among the tools discussed here are shape coding, motion estimation and compensation, texture coding, error resilience, sprite coding and scalability.
Unlike MPEG-4, MPEG-2 provides a very limited set of functionality for scalable personalization. The tools defined in this document are nevertheless sufficient to provide personalization in the MPEG-2 domain.


Object-Based Structure and Syntax

Content-based interactivity: the MPEG-4 standard extends traditional frame-based processing towards the composition of several video objects superimposed on a background image. For proper rendering of the scene, without disturbing artifacts at the borders of video objects (VOs), the compressed stream contains the encoded shape of each VO. Representing video as objects rather than as video frames enables content-based applications. This, in turn, provides new levels of content interactivity based on efficient representation of objects, object manipulation, bit stream editing and object-based scalability.

An MPEG-4 visual scene may consist of one or more video objects. Each video object is characterized by temporal and spatial information in the form of shape, motion and texture. The visual bit stream provides a hierarchical description of a visual scene; start codes, which are special code values, provide access to each level of the hierarchy in the bitstream. The ability to process objects, layers and sequences selectively is a significant enabler for scalable personalization. The hierarchical levels, modeled in the data-structure sketch after this list, include:

      Visual Object Sequence (VS): An MPEG-4 scene may include any 2-D or 3-D natural or synthetic objects. These objects and sequences can be addressed individually based on the targeted user.

      Video Object (VO): A video object is linked to a certain 2-D element in the scene. The simplest example is a rectangular frame; alternatively, it can be an arbitrarily shaped object that corresponds to an object or to the background of the scene.

      Video Object Layer (VOL): Video object encoding takes place in one of two modes, scalable or non-scalable, depending on the application; this choice is represented in the video object layer (VOL). The VOL provides the support for scalable coding.

      Group of Video Object Planes (GOV): Optional in nature, GOVs enable random access into the bitstream by providing points where video object planes are independently encoded. MPEG-4 video consists of various video objects, rather than frames, allowing true interactivity and manipulation of separate, arbitrarily shaped objects, with an efficient scheduling scheme to speed up real-time computation.

      Video Object Plane (VOP): VOPs are video objects sampled in time. They can be sampled either independently or dependently by using motion compensation. A rectangular shape can represent a conventional video frame. A motion estimation and compensation technique is provided for interlaced digital video such as video object planes (VOPs). Predictor motion vectors, used to differentially encode a current field-coded macroblock, are obtained from the median of the motion vectors of surrounding blocks or macroblocks, which supports high system scalability.
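
To make the hierarchy above concrete, the following sketch models it as nested data structures. This is purely illustrative: the class and field names are our own simplifications and do not correspond to the standard's syntax element names.

```python
# Illustrative sketch (not the standard's actual syntax): a minimal data model
# of the MPEG-4 visual hierarchy, useful for reasoning about selective access.
from dataclasses import dataclass, field
from typing import List

@dataclass
class VideoObjectPlane:          # VOP: a video object sampled at one time instant
    coding_type: str             # "I", "P" or "B"
    timestamp_ms: int

@dataclass
class GroupOfVOPs:               # GOV: optional random-access point
    vops: List[VideoObjectPlane] = field(default_factory=list)

@dataclass
class VideoObjectLayer:          # VOL: scalable or non-scalable coding of one object
    scalable: bool
    govs: List[GroupOfVOPs] = field(default_factory=list)

@dataclass
class VideoObject:               # VO: one 2-D element of the scene
    shape: str                   # "rectangular" or "arbitrary"
    layers: List[VideoObjectLayer] = field(default_factory=list)

@dataclass
class VisualObjectSequence:      # VS: the complete scene
    objects: List[VideoObject] = field(default_factory=list)
```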

   Figure 1 below illustrates an object-based visual bitstream.

   A visual elementary stream compresses visual data of just one layer of one visual object. There is
   only one elementary stream (ES) per visual bitstream. Visual configuration information includes the
   visual object sequence (VOS), visual object (VO) and visual object layer (VOL). Visual configuration
   information must be associated with each ES.




   Figure 1: The visual bitstream format



Compression Tools

Intra-coded VOPs (I-VOPs): VOPs that are coded using only information within the VOP, removing some of the spatial redundancy. Inter coding exploits temporal redundancy between frames through motion estimation and compensation; two modes of inter coding are provided: prediction based on a previous VOP (P-VOPs) and prediction based on both a previous VOP and a future VOP (B-VOPs). These tools are used in the content preparation stage to increase compression efficiency, improve error resilience and code different types of video objects.

Shape coding tools: MPEG-4 provides tools for encoding arbitrarily shaped objects. Binary shape information defines which portions (pixels) of the scene belong to the video object at a given time, and is encoded by a motion-compensated, block-based technique that allows both lossless and lossy coding. The technique allows accurate representation of objects, which in turn improves the accuracy and quality of the final composition and assists in differentiating between video and non-video objects within the stream.

Sprite coding: A sprite is an image composed of pixels belonging to a video object that is visible throughout a video sequence. It is an efficient and concise method for representing a background video object, which is typically compressed with the object-based coding technique. Sprite coding achieves high compression efficiency when the whole background is visible at least once over a video sequence.
MPEG-4 H.264/AVC Scalable Video Coding (SVC): One method of achieving high video compression efficiency is the scalable extension of H.264/AVC, known as scalable video coding or SVC.
A scalable video bitstream contains a non-scalable base layer and one or more enhancement layers. (The term "layer" in Video Coding Layer (VCL) refers to syntax layers such as block, macroblock and slice.) The basic SVC design can be classified as a layered video codec. In general, the coder structure as well as the coding efficiency depends on the scalability space required by an application. An enhancement layer may enhance the temporal resolution (i.e. the frame rate), the spatial resolution, or the quality of the video content represented by the lower layer or part of it. The scalable layers can be aggregated into a single transport stream or transported independently.
Scalability is provided at the bitstream level, allowing for reduced complexity. Reduced spatial and/or temporal resolution can be obtained by discarding from a global SVC bitstream the NAL units (or network packets) that are not required for decoding the target resolution. NAL units contain motion information and texture data. NAL units of Progressive Refinement (PR) slices can additionally be truncated to further reduce the bit rate, with a corresponding reduction in reconstruction quality.
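
The practical consequence of bitstream-level scalability is that a mid-network element can produce a lower-rate stream without re-encoding. The sketch below illustrates the idea under simplifying assumptions: it presumes the NAL units have already been parsed into their layer identifiers, and the field names are illustrative rather than the exact SVC header syntax.

```python
# Minimal sketch of SVC sub-bitstream extraction by discarding NAL units.
from typing import NamedTuple, List

class NalUnit(NamedTuple):
    temporal_id: int      # temporal layer (frame-rate scalability)
    dependency_id: int    # spatial layer (resolution scalability)
    quality_id: int       # quality / SNR layer
    payload: bytes

def extract_sub_bitstream(nals: List[NalUnit],
                          max_temporal: int,
                          max_dependency: int,
                          max_quality: int) -> List[NalUnit]:
    """Keep only the NAL units needed for the target operating point."""
    return [n for n in nals
            if n.temporal_id <= max_temporal
            and n.dependency_id <= max_dependency
            and n.quality_id <= max_quality]
```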




NETWORK BASED PERSONALIZATION CONCEPT
Network-based personalization represents an evolution of the network infrastructure. The solution includes devices that allow multi-point media processing, enabling the network to target any user, on any device, with any content. In this paper we focus primarily on the cable market and TV services; however, the concept is not confined to these areas.
The existing content flow remains intact regardless of how processing functionality is extended within each of the network components, including the user device. This approach can accommodate the range of available STBs, employ modifications based on user profiles, and support a variety of sources.
The methodology behind the concept anticipates that the ingress and egress points of the system must support a variety of containers, formats, profiles, rates and so forth. Within the system, however, the manipulation flow is unified for simplification and scalability. Network-based personalization can provide service to incoming baseline (low-resolution), Standard Definition (SD) and High Definition (HD) formats and support multiple containers (such as Flash, Windows Media, QuickTime, MPEG Transport Stream and Real).
Network personalization requires an edge processing point and, optionally, an ingest point and the user premises as content manipulation locations. The conceptual flow of the solution is defined in Figure 2 below.




[Figure 2 diagram: Prepare, Integrate, Create and Present blocks in sequence, moving from asset focus to session focus, with user interaction feeding back into the flow]
Figure 2: Virtual flow: network-based personalization


The virtual flow and building blocks defined here are generic and can be placed at different locations in the network, co-located or remote. Specific architecture examples are reviewed later in this paper.


At the "preparation" point, media content is ingested and manipulated in several respects: 1) analysis of the content and creation of relevant information (metadata), which then accompanies it across the flow; 2) processing of the content for integration and creation, which includes manipulation such as changing format, structure, resolution and rate. The outcome of the preparation stage is a single copy of the incoming media, but in a form that includes data allowing the other blocks to create multiple personalized streams from it.


The "integration" point is the transition point from asset focus to session focus. This block connects and synchronizes prepared media streams with instructions and other data to create a complete, session-specific media and data flow, which is later provided to the "create" block.
The "create" and "present" blocks are the final content processing steps: for a given session, each media stream is crafted according to the user, device and medium (in the "create" block), then joined into a visual experience at the "present" block. The "create" and "present" blocks are intentionally defined separately in order to accommodate end-user devices of different types and processing power. Further discussion of this subject appears in the "Power Shifting to the User" section below.
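
As a way of summarizing the virtual flow, the sketch below chains the four blocks as simple functions. The interfaces and data shapes are assumptions made for illustration; the paper defines the blocks conceptually, not as a specific API.

```python
def prepare(asset):
    """Ingest: analyze the content, emit one normalized copy plus metadata."""
    return {"media": asset, "metadata": {"format": "mpeg4", "objects": []}}

def integrate(prepared, session):
    """Transition from asset focus to session focus: attach session data."""
    return {**prepared, "session": session}

def create(integrated, device_profile):
    """Craft each stream for the specific user, device and medium."""
    return {**integrated, "target": device_profile}

def present(created):
    """Join the crafted streams into the final visual experience."""
    return f"render {created['media']} for {created['target']['type']}"

stream = present(
    create(
        integrate(prepare("movie_asset"), session={"user": "u42"}),
        device_profile={"type": "legacy_mpeg2_stb"},
    )
)
print(stream)   # render movie_asset for legacy_mpeg2_stb
```

In practice each block may run on a different network element, so the function boundaries in the sketch correspond to points where media and metadata cross the network.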




PUTTING IT ALL TOGETHER


The proposed implementation of network-based personalization combines the set of tools and the virtual building blocks defined above to create the required end result.
To support high-level, personal, session-based services, we propose to utilize the MPEG-4 toolkit, which enables scene-related information to be transmitted together with video, audio and data to a processor-based network element in which an object-based scene is composed according to the rendering capabilities of the user device. Using MPEG-4 authoring tools and applying BIFS (Binary Format for Scenes) encoding at the content preparation stage, the system supports more efficient processing of personalized streams, specifically at the "create" and "present" stages. Different encoding levels are required to support the same bitstream; for example, varying network computational power will be required to process the foreground, background and other data (such as 2D/3D) carried in the same bitstream. Moreover, some of the video rendering can be passed directly to the user reception device (STB), reducing the network image processing requirements.
The solution described in this paper utilizes a set of tools that allow the content creator to build multimedia applications without any knowledge of the internal representation structure of an MPEG-4 scene. Using the MPEG-4 toolkit, the multimedia content is object-oriented, with spatial and temporal attributes that can be attached to it, including the BIFS encoding scheme. The MPEG-4 encoded objects address video, audio and multimedia presentations such as 3D, as defined by the authoring tools.
The solution is built on four network elements: prepare, integrate, create and present. All four elements work together to ensure the highest processing efficiency and to accommodate different service scenarios: legacy MPEG-2 set-top boxes; H.264 set-top boxes with no object-based rendering capabilities; and finally, STBs with full MPEG-4 object-based processing capabilities. Two-way feedback between the STB, the edge network and the network-based stream processor is established in order to define what will be processed at each stage of the network.


PREPARE
At the prepare stage, the assumption is that incoming content is received in, or converted to, MPEG-4 toolkit encoding, generating media in object-based form. Using authoring tools to upload content and create scene-related object information supports improved media compression for transmission and processing by the network. The object-based scene is created using MPEG-4 authoring tools and applying BIFS (Binary Format for Scenes) encoding, to support the seamless integration and control of different audio/visual and synthetic objects in a scene.
Compression and manipulation of visual content using the MPEG-4 toolkit introduces the novel concepts of the Video Object Plane (VOP) and the sprite. Using video segmentation, each frame of an input video sequence can be segmented into a number of VOPs, each of which may describe a physical object within the scene. A sprite coding technique may be used to support a mosaic layout. It is based on a large image composed of pixels belonging to a video object visible throughout a video segment, and it captures spatio-temporal information in a very compact way.

Other tools might also be applied at the prepare stage to improve network processing and reduce bandwidth. These include I-VOPs (intra-coded Video Object Planes), which allow an object to be encoded and decoded based on its shape, motion and texture, and Bidirectional Video Object Planes (B-VOPs), which may be predicted from a past and a future reference VOP for each object, with shape motion vectors built from neighbouring motion vectors that have already been encoded.
The output of the prepare stage is, per asset, a set of object-based information, coded as elementary streams, packetized elementary streams and metadata. The different object layers and data can in turn be transported as independent IP flows, over UDP or RTP, to the integrate stage.
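
As an illustration of this output stage, the sketch below sends each prepared object layer as an independent UDP flow toward the integrate stage. The host, ports, packet size and the omission of real RTP framing are simplifying assumptions.

```python
import socket

MTU_PAYLOAD = 1400            # keep datagrams below a typical Ethernet MTU
INTEGRATE_HOST = "127.0.0.1"  # stands in for the integrate-stage address

def send_layer(layer_bytes, port):
    """Send one object layer as its own UDP flow, fragment by fragment."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        for offset in range(0, len(layer_bytes), MTU_PAYLOAD):
            sock.sendto(layer_bytes[offset:offset + MTU_PAYLOAD],
                        (INTEGRATE_HOST, port))
    finally:
        sock.close()

# One independent flow per prepared layer (illustrative payloads).
layers = {"background_vol": b"\x00" * 4000,
          "foreground_vol": b"\x01" * 2500,
          "object_metadata": b"\x02" * 300}
for i, (name, payload) in enumerate(layers.items()):
    send_layer(payload, 5000 + i)
```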
INTEGRATE
The session with the preparation stage is an "object-based" session, embodied mainly in its visualization of several visual object types. The scalable core profile is required mostly because it supports arbitrarily shaped coding, temporal/spatial scalability and so on. At the same time, the scalable core profile needs to support computer graphics, such as 2-D meshes and synthetic objects, as part of the range of scalable objects in the integration stage.

MPEG-4 object-based coding allows separate encoding of foreground figures and background scenes. Arbitrarily shaped coding needs to be supported to maintain the quality of the input elements; it includes shape information in the compressed stream.
In order to apply stream adaptation to support different delivery environments and available bandwidths, temporal and spatial scalability are included in the system. Spatial scalability allows the addition of one or more enhancement video object layers (VOLs) to the base VOL to achieve different video scenes.

To summarize: at the integrate stage, a user session is composed out of multiple incoming object-based assets to create the final, synchronized video object layers and object planes. The output of the integrate stage includes all the information and media required for the session; at this point, however, the media is still not tuned to the specifics of the network, device and user, but is a superset of them. The streams are then transported to the "create" and "present" stages, where the final manipulation is done, as sketched below.
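
The sketch below illustrates one aspect of the integrate-stage decision under the assumptions stated in the comments: selecting the base VOL plus as many spatial-enhancement VOLs as the session's bandwidth allows. The bitrate figures are placeholders, not measured values.

```python
def select_layers(session_bandwidth_kbps, base_kbps, enhancement_kbps):
    """Pick the base VOL plus as many enhancement VOLs as the budget allows."""
    chosen = ["base_vol"]
    budget = session_bandwidth_kbps - base_kbps
    for i, rate in enumerate(enhancement_kbps):
        if budget >= rate:
            chosen.append(f"enhancement_vol_{i + 1}")
            budget -= rate
        else:
            break
    return chosen

# e.g. a 3 Mbps session: a 1.5 Mbps base plus one 1 Mbps enhancement layer fits.
print(select_layers(3000, 1500, [1000, 2000]))   # ['base_vol', 'enhancement_vol_1']
```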

CREATE
The systems part of MPEG-4 allows creation or viewing of a multimedia sequence with hybrid elementary streams, each of which can be encoded and decoded with the codec best suited to that stream. However, manipulating those streams synchronously and composing them onto a screen in real time is computationally demanding. A temporal cache is therefore used in the "create" stage to store the encoded media streams. The elementary streams (ES) arrive either as a multiplexed stream (using the MPEG-4 defined FlexMux) or as single streams, but all of them have been packetized by the MPEG-4 sync layer (SL). The use of FlexMux and the sync layer allows grouping of the elementary streams with low multiplexing overhead at the "prepare" and "integrate" stages, while the SL is used to synchronize bitstream delivery information from the previous stage to the "create" stage.

In order to generate the relevant session (stream), the "create" stage uses an HTTP submission to request a desired media presentation. The submission contains only the index of the preformatted BIFS (Binary Format for Scenes) for a pre-created and stored presentation, or a text-based description of the user's authored presentation. BIFS coding also allows seamless integration and control of different audio/video objects in a scene. The "integrate" stage receives the request and sends the media to the "create" stage, i.e. the BIFS stream together with the object descriptor in the form of an initial object descriptor stream.

If the client side can satisfy the decoding requirements, it sends a confirmation to the "create" stage to start the presentation; otherwise, the client sends its decoding and resolution capabilities to the "create" stage. The "create" stage then repeatedly downgrades to a lower profile until the client's decoding capabilities are met, or informs the "present" stage to compose a stream that the client decoding device can handle (i.e. H.264 or MPEG-2). A sketch of this negotiation follows.
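
The following sketch illustrates that negotiation loop. The ordered profile ladder and the capability representation are assumptions made for illustration only.

```python
# Ordered from most demanding to least demanding client profile (illustrative).
PROFILE_LADDER = ["mpeg4_core_scalable", "mpeg4_simple", "h264_baseline", "mpeg2_mp_ml"]

def negotiate(client_supported):
    """Walk down the ladder until a profile the client can decode is found."""
    for profile in PROFILE_LADDER:
        if profile in client_supported:
            return profile                   # client decodes this directly
    return "network_composed_stream"         # fall back: 'present' composes for the client

print(negotiate({"h264_baseline"}))          # -> h264_baseline
print(negotiate(set()))                      # -> network_composed_stream
```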

The "create" stage initiates the establishment of the necessary sessions for the SD (scene description) stream (in BIFS format) and the OD (object descriptor) stream referenced by the user device. This allows the user device to retrieve the compressed media stream in real time, using the URL contained in the ES descriptor stream. BIFS is used to lay out the media elementary streams in the presentation, as it provides the spatial and temporal relationships of those objects by referencing their ES_IDs.

If the "create" stage needs to modify the received scene, such as by adding an enhancement layer to the current scene based on user device or network capabilities, it can send a BIFS update command to the "integrate" stage and obtain a reference to the new media elementary stream.

The "create" stage can handle multiple streams and synchronize between different objects and between the different elementary streams of a single object (e.g., base layer and enhancement layer). The synchronization layer is responsible for synchronizing the elementary streams. Each SL-packet consists of an Access Unit (AU) or a fragment of an AU. An AU carries the time stamps needed for synchronization and constitutes the data unit consumed by the decoder at the "create" stage or by the user device decoder. An AU consists of a Video Object Plane (VOP). Each AU is received by the decoder at the time instance specified by its Decoding Time Stamp (DTS), as sketched below.
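
As an illustration of the sync-layer handling described above, the sketch below reassembles SL packet fragments into access units and pairs each AU with its DTS. The packet fields are simplified assumptions, not the exact MPEG-4 Systems syntax.

```python
from collections import defaultdict

def reassemble_aus(sl_packets):
    """Group SL packet fragments that share an AU identifier and keep its DTS."""
    aus = defaultdict(bytearray)
    dts = {}
    for pkt in sl_packets:          # pkt: dict with 'au_id', 'dts_ms', 'fragment'
        aus[pkt["au_id"]] += pkt["fragment"]
        dts[pkt["au_id"]] = pkt["dts_ms"]
    return [(dts[au_id], bytes(data)) for au_id, data in sorted(aus.items())]

packets = [
    {"au_id": 0, "dts_ms": 0,  "fragment": b"vop0-part1"},
    {"au_id": 0, "dts_ms": 0,  "fragment": b"vop0-part2"},
    {"au_id": 1, "dts_ms": 40, "fragment": b"vop1"},
]
for dts_ms, au in reassemble_aus(packets):
    print(f"decode AU of {len(au)} bytes at DTS {dts_ms} ms")
```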

The media is processed by the "present" stage in such a way that MPEG objects are transcoded to either an H.264 or an MPEG-2 transport stream, utilizing stored motion vector information and macroblock mode decisions. The applicable process is determined by the rendering capabilities of the user device. When the target is an advanced user device with MPEG-4 object layer decoding capabilities, the "present" processor acts as a stream adaptor, resizing the streams while composition is performed by the client device (the advanced STB).

PRESENT

The modularity of the coding tools, expressed as well-known MPEG profiles and levels, allows easy customization of the "present" stage for a selected segment: for example, legacy MPEG-2 STB markets, where full stream composition needs to be applied in the network, versus advanced set-top boxes with full MPEG-4 object-based scene capability, where minimal stream preparation needs to be applied by the network "present" stage.
Two extreme service scenarios might be applied as follows:
Network-based "present": The "present" function applies stream adaptation and resizing, composes the network object elements, and applies transcoding functions to convert the MPEG-4 file-based format to either MPEG-2 stream-based format or MPEG-4/AVC stream-based format.

STB-based "present": The "present" function might pass the object elements through the network, after rate adaptation and resizing, to be composed and presented by the advanced user device.
The "present" functionality is based on client/network awareness. In general, media provisioning is based on metadata generated by the client device and the network manager. The metadata, sketched after this list, includes the following information:
      Video format, i.e. MPEG-2, H.264, VC-1, MPEG-4, QT, etc.
      User device rendering capabilities
      User device resolution format, i.e. SQCIF, QCIF, CIF, 4CIF, 16CIF
      Network bandwidth allocation for the session
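
A minimal sketch of such session metadata, assuming a simple dictionary representation (the field names are ours, not a defined schema):

```python
session_metadata = {
    "video_format": "H.264",            # MPEG-2, H.264, VC-1, MPEG-4, QT, ...
    "rendering": {
        "object_based": False,          # can the STB compose MPEG-4 objects itself?
        "max_profile": "h264_baseline",
    },
    "resolution": "CIF",                # SQCIF, QCIF, CIF, 4CIF, 16CIF
    "bandwidth_kbps": 3500,             # network allocation for this session
}
```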


"Present" stage performance
It is essential that the "present" function be composed of object-based elements that use the defined set of tools, which provide a binary coded representation of individual audiovisual objects, text, graphics and synthetic objects. It composes the Visual Object Sequence (VS), Video Object Layer (VOL) or any other defined tool into a valid H.264 or MPEG-2 stream at the resolution and bandwidth defined by the client device and the network metadata feedback.
The elementary streams (scene data, visual data, etc.) are received at the "present" stage from the "create" system element, which allows scalable representations and alternate codings (bitrate, resolution, etc.), enhanced with metadata and protection information. An object described by an ObjectDescriptor is sent from the content originator, i.e. the "prepare" stage, and provides simple metadata related to the object, such as content creation information or chapter time layout. This descriptor also contains all information related to stream setup, including synchronization information and initialization data for decoders.
The BIFS (Binary Format for Scenes) at the "present" stage is used to place each object, with various effects potentially applied to it, in a display that is then transcoded to an MPEG-2 or H.264 stream.
STB-based "present": Object reconstruction
The essence of MPEG-4 lies in its object-oriented structure. Each object forms an independent entity that may or may not be linked to other objects, spatially and temporally. This approach gives the end user at the client side tremendous flexibility to interact with the multimedia presentation and manipulate the different media objects: end users can change the spatial-temporal relationships among media objects and turn media objects on or off. However, it requires a complicated session management and control architecture.
A remote client retrieves information regarding the media objects of interest and composes a presentation based on what is available and desired. The following communication messages occur between the client device and the "present" stage:
      The client requests a service by submitting the description of the presentation to the data controller (DC) on the "present" stage side.
      The DC on the "present" stage side controls the encoder/producer module to generate the corresponding scene descriptor, object descriptors, command descriptors and other media streams, based on the presentation description information submitted by the end user at the client side.
      Session control on the "create" stage side controls session initiation, control and termination.
      Actual stream delivery commences after the client indicates that it is ready to receive, and streams flow from the "create" stage to the "present" client. After the decoding and composition procedures, the MPEG-4 presentation authored by the end user is rendered on his or her display.
The set-top box client is required to support the architectural design of the MPEG-4 System Decoder Model (SDM), which is defined to achieve media synchronization, buffer management and timing when reconstructing the compressed media data.
The session controller at the client side communicates with the session controller at the server ("create" stage) side to exchange session status information and session control data. The session controller translates user actions into the appropriate session control commands.


Network-based MPEG-4 to H.264/AVC baseline profile transcoding
Transcoding from MPEG-4 to H.264/AVC can be done in the spatial domain or in the compressed domain. The most straightforward method is to fully decode each video frame and then completely re-encode it with H.264. This approach is known as spatial-domain video transcoding; it involves full decoding and re-encoding and is therefore very computationally intensive.
Motion vector refinement and an efficient transcoding algorithm are used for transcoding the MPEG-4 object-based scene to an H.264 stream. The algorithm exploits side information from the decoding stage to predict the coding modes and motion vectors for H.264 encoding. Both INTRA macroblock (MB) transcoding and INTER macroblock transcoding are exploited by the transcoding algorithm at the "present" stage.
During the decoding stage, the incoming bitstream is parsed in order to reconstruct the spatial video signal, and the prediction directions of INTRA-coded macroblocks and the motion vectors are stored for use in the coding process.
To obtain the highest transcoding efficiency at the "present" stage, this side information is retained: a lot of side information (such as motion vectors) is obtained during the MPEG-4 decoding process, and the "present" stage reuses it, which reduces the transcoding complexity compared with a full decode/re-encode scenario. In the transcoding process, both motion vector estimation and coding mode decisions are addressed by reusing the side information, reducing complexity and computation power. A sketch of the motion-vector reuse idea follows.
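
The sketch below illustrates the motion-vector reuse idea: the position implied by the decoded MPEG-4 motion vector seeds a small refinement search for re-encoding, instead of a full search. The SAD cost over NumPy blocks and the search window size are illustrative assumptions.

```python
import numpy as np

def sad(block, ref_frame, x, y):
    """Sum of absolute differences between the block and a reference window."""
    h, w = block.shape
    window = ref_frame[y:y + h, x:x + w].astype(int)
    return int(np.abs(block.astype(int) - window).sum())

def refine_mv(block, ref_frame, seed_xy, search=2):
    """Refine the position implied by the decoded MV within +/-search pixels."""
    best_cost, best_xy = None, seed_xy
    h, w = block.shape
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            x, y = seed_xy[0] + dx, seed_xy[1] + dy
            if 0 <= x <= ref_frame.shape[1] - w and 0 <= y <= ref_frame.shape[0] - h:
                cost = sad(block, ref_frame, x, y)
                if best_cost is None or cost < best_cost:
                    best_cost, best_xy = cost, (x, y)
    return best_xy

ref = np.random.randint(0, 255, (64, 64), dtype=np.uint8)
blk = ref[10:26, 12:28].copy()               # 16x16 block taken from the reference
print(refine_mv(blk, ref, seed_xy=(11, 9)))  # converges to (12, 10)
```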
Network-based MPEG-4 to MPEG-2 transcoding
To support legacy STBs that have limited local processing capabilities and support only MPEG-2 transport streams, a full decode and re-encode is performed by the "present" stage. However, the "present" stage utilizes the tools used for the MPEG-4 to H.264 conversion in order to reduce complexity. Stored motion vector information and machine-learning-based macroblock mode decision algorithms for inter-frame prediction are used as part of the MPEG-4 to MPEG-2 transcoding process. Since coding mode decisions consume most of the resources in video transcoding, fast macroblock (MB) mode estimation leads to reduced complexity; a sketch of this idea follows.
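
The sketch below illustrates the fast mode-estimation idea with a small decision tree. The features, training data and mode labels are invented for illustration; only the general approach of learning the MPEG-2 macroblock mode from information available after the MPEG-4 decode follows the text above.

```python
from sklearn.tree import DecisionTreeClassifier

# features: [mean |MV|, residual energy, 1 if the source MB was intra-coded else 0]
X_train = [[0.2, 12.0, 0], [5.1, 240.0, 0], [0.0, 900.0, 1], [3.3, 80.0, 0]]
y_train = ["skip", "inter_16x16", "intra", "inter_16x16"]   # target MPEG-2 MB modes

mode_model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

def predict_mb_mode(mv_magnitude, residual_energy, was_intra):
    """Estimate the MPEG-2 MB mode instead of an exhaustive mode search."""
    return mode_model.predict([[mv_magnitude, residual_energy, was_intra]])[0]

print(predict_mb_mode(0.1, 10.0, 0))   # likely "skip" with this toy training set
```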

The implementation presented above can be incorporated into both offline and real-time environments. See Appendix 2 for elaboration on real-time implementation.




BENEFITS OF NETWORK-BASED PERSONALIZATION
Deploying network-based processing, whether complete or hybrid, has significant benefits:
      A unified user experience is delivered across the various STBs in the field.
      It presents a future-proof cost model for low-end to high-end STBs.
      It utilizes the existing VOD environment, servers and infrastructure. Network-based processing accommodates low-end and future high-end systems, all under operators' existing managed on-demand systems. Legacy servers require more back-office preparation, with minimal server processing power overhead, while newer servers can provide additional per-user processing and thus more personalization features.
      Rate utilization is optimized. Instead of consuming the sum of all the streams that make up the user experience, network-optimized processing reduces overhead significantly. In the extreme case, a single stream with no overhead is delivered instead of 4-5 times the available bandwidth; in the common case, the overhead is approximately 20%.
      Best quality of service for connected-home optimization. By performing most or all of the processing before the content reaches the home, the operator optimizes the bandwidth and experience across the user's end devices, delivering the best quality of service.
      Prevention of subscriber churn in favour of direct over-the-top (OTT) services. The operator has control over the edge network; over-the-top providers do not. Media manipulation in the network can and will be done by OTT operators, but unlike cable operators they do not control the edge network, which limits the effectiveness of their actions unless there is a QoS agreement with the operator, in which case control stays in the operator's hands.
      Maintaining the operator's position as the current and future "smart pipe". Being aware of the end-user device and processing for it is critical for the operator to maintain processing capabilities that will allow migration to other areas such as mobile and 3-D streaming.




IMPLEMENTING NETWORK-BASED PERSONALIZATION
As indicated earlier in the document, the solution can be implemented in a variety of ways. In this
section, we present three of the options, all under a generic North America on-demand architecture. The
three options are: Hybrid network edge and back office; Network edge; and Hybrid home network.


Hybrid network edge and back office
As the user device powers up, or when the user starts using personalization features, the user client connects with the session manager, identifies the user, the device type and the personalization requirements, and, once resources are identified, starts a session. In this implementation the "prepare" function is physically separated from the other building blocks, and the user STB is not capable of the relevant video processing/rendering. Each incoming media asset is processed and its data extracted, as part of the standard ingest process, to prepare it for downstream personalization. Once a session is initiated and the edge processing resources are found, sets of media and metadata flows are propagated across the internal CDN to the "integrate" step at the network edge. The set of flows includes the different media flows; related metadata (which covers target STB-based processing, source media characteristics, target content insertion information, interactivity support and so forth, and which must be available for the edge to start processing the session); objects; data from the content provider/advertiser; and so on.
After arrival at the edge, the "integrate" function aligns the flow and passes it to the "create" and "present" functions, which in this case generate a single, personally composed stream, accompanied by relevant metadata, directed at a specific user.


[Figure 3 diagram: back office (Prepare, App Servers, Session Manager, AMS/CDN, UERM) connected over IP to the network edge (Edge QAM; Integrate, Compose, Present), delivering over HFC and media over broadband to legacy STBs, wired and wireless devices]
Figure 3: Hybrid back office and network edge



As can be seen in Figure 3 above, the SMP (Scalable Media Personalization) session manager connects the user device and the network, influencing in real time the "integrate", "create" and "compose" edge functions.
Network edge only
In this application case, all processing is done on demand, in real time. It is similar to the hybrid case; however, instead of the "prepare" function being located at the back office and working offline, all functions are hosted on the same platform. As can be expected, this option has significant processing power requirements for the "prepare" function, since content needs to be "prepared" in real time. In this example, the existing flow is almost seamless, as the resource manager simply identifies the platform as another network resource and manages it accordingly.

[Figure 4 diagram: generic ingest in the back office; Prepare, Integrate, Compose and Present all located at the network edge alongside the Edge QAM, delivering over HFC and media over broadband to legacy STBs, wired and wireless devices]
Figure 4: Network edge




Hybrid Home and Network
In this hybrid implementation, the end-user device (an STB in our case) has been identified as capable of hosting the "present" function. As a result, as can be seen in Figure 5, the "present" function is relocated to the user home, and the system demarcation point is the "create" function. During the session, multiple "prepared" flows of data and media arrive at the STB, consuming significantly less bandwidth than the non-prepared options and requiring less CPU horsepower for the "present" function.

[Figure 5 diagram: Prepare in the back office; Integrate and Compose at the network edge; the Present function hosted on an advanced STB in the home, with delivery over HFC and media over broadband to wired and wireless devices]
Figure 5: Hybrid home and network




POWER SHIFTING TO THE USER
Although legacy STBs are indeed present in many homes, the overall processing horsepower in the home is growing and will continue to grow. This means that the user device will be able to do more processing at home and will theoretically have less need of network-based assistance. At first glance this is indeed the case. However, when the subject is examined further, two main challenges reveal themselves.
   1. The increase in user device capabilities, and in actual user expectations, comes back to the network as a direct increase in bandwidth utilization, which in turn affects users' experience and their ability to run enhanced applications such as multi-view.
      For example, today's next-generation STBs support 800 to 16,000 MIPS, versus the legacy 20 to 1,000 MIPS, with dedicated dual 400 MHz video graphics processors and dual 250 MHz audio processors (S-A/Cisco's next-generation Zeus silicon platform).
      As Figure 6 below shows, the expected migration of media services into other home devices, such as media centres and game consoles, significantly increases the available home processing power.

[Figure 6 chart: home processing power roadmap in TMIPS, growing over the years 2007 to 2010]
Figure 6: Home processing power roadmap


   2. No matter how "fast and furious" processing power in the home becomes, users will always want more. Having home devices perform ALL the video processing increases CPU and memory utilization and directly diminishes the performance of other applications.
In addition, as discussed earlier in the document, the increase in open, standards-based home capabilities substantially strengthens the threat of customer churn for cable operators.
Network-based personalization is targeted at providing solutions to the above challenges. The approach is to use network processing to assist the user device, improving the user experience.



By performing the "prepare", "integrate" and "create" functions in the network, and leaving only the "present" function to the user home, several key benefits are delivered that effectively address the above challenges.
Network bandwidth utilization: The "create" function drives down network bandwidth consumption. The streams delivered to the user are no longer the complete, original media as before, but only what is needed. For example, when looking at 1 HD and 2 SD windows on the same multi-view screen, each of the three streams has exactly the resolution and frame rate required at each given moment, resulting in significant bandwidth savings, as can be seen in Figure 7.

[Figure 7 chart: bandwidth to the home for a 1 HD + 2 SD multi-view example, comparing STB-only, hybrid and network-only delivery in Mbps for MPEG-2 and H.264]
Figure 7: 2 SD, 1 HD bandwidth to the home
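
To make the arithmetic behind Figure 7 concrete, the back-of-the-envelope sketch below uses assumed per-stream bitrates (the paper does not give exact figures) and the roughly 20% overhead quoted earlier for the common hybrid case.

```python
# Assumed typical bitrates; real values depend on encoder and content.
RATES_MBPS = {"mpeg2": {"hd": 15.0, "sd": 3.75}, "h264": {"hd": 8.0, "sd": 2.0}}

def stb_only(codec):
    """All three full-rate streams (1 HD + 2 SD) are sent to the home."""
    r = RATES_MBPS[codec]
    return r["hd"] + 2 * r["sd"]

def network_only(codec):
    """A single composed multi-view stream at roughly the HD rate."""
    return RATES_MBPS[codec]["hd"]

def hybrid(codec, overhead=0.20):
    """Composed stream plus the ~20% overhead quoted for the common case."""
    return network_only(codec) * (1 + overhead)

for codec in RATES_MBPS:
    print(f"{codec}: STB-only {stb_only(codec):.1f} Mbps, "
          f"hybrid {hybrid(codec):.1f} Mbps, network-only {network_only(codec):.1f} Mbps")
```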
CPU processing power: As indicated in the "Putting It All Together" section, our solution allows for selective composition of object layers. Also, the actual multi-view is created out of multiple resolutions, so there is no need for render-resize-compose functions at the user device, which in turn reduces overall CPU utilization.
Finally, the fact that the network can deliver the above benefits inherently shifts power back into the hands of the operator, who can deliver the best user experience.




SUMMARY
Exceeding user expectations while maintaining a viable business case is becoming more challenging than ever for the cable operator. As the weight shifts to the home and to broadband streaming, the operator is forced to find new solutions to maintain leadership in the era of personalization and interactivity.
Network-based personalization provides a balanced solution. The ability to maintain an open, standards-based solution, while dynamically shifting the processing balance based on user, device, network and time, can provide the user and the operator with a "golden" solution.







ABOUT THE AUTHOR
Amos Kohn is Vice President of Business Development at Scopus Video Networks. He has more than 20 years of multinational executive management experience in convergence technology development, marketing, business strategy and solutions engineering at telecom and emerging multimedia organizations. Prior to joining Scopus, Amos Kohn held senior positions at ICTV, Liberate Technologies and Golden Channels.




APPENDIX 1: STB BASED ADDRESSABLE ADVERTISING


In the home addressable advertising model, multiple user profiles in the same household are offered to advertisers within the same ad slot. For example, within the same slot, multiple targeted ads may replace the same program feed, targeted at different youth age groups, while other advertisements target the adults in the house (male, female) based on specific profiles. During the slot, youth will see one ad while adults see another. Addressable advertising requires more bandwidth to the home than traditional zone-based advertising. The granularity might step one level up, where the targeted advertisement addresses the household rather than an individual user within it. In that case, less bandwidth is required in a given serving area than with user-based targeted advertising. The impact of home addressability on the infrastructure of channels that are already in the digital tier and enabled for local ad insertion will be similar to unicast VOD bandwidth requirements.
In a four-demographic scenario, for each ad zone, four times the bandwidth that has been allocated for a linear ad will need to be added.


APPENDIX 2: REAL-TIME IMPLEMENTATION
Processing in real time is determined by stream provisioning (fast motion estimation), stream complexity and the size of the buffer at each stage.
Scenes built as compositions of audiovisual objects (AVOs), support for hybrid coding of natural video and 2-D/3-D graphics, and the provision of advanced system and interoperability capabilities all support real-time processing.
MPEG-4 real-time software encoding of arbitrarily shaped video objects (VOs) is a key element of the solution. The MPEG-4 toolkit unites the advantages of block-based and pixel-recursive motion estimation methods in one common scheme, leading to a fast hybrid recursive motion estimation that supports MPEG-4 processing.





[IJET-V1I2P1] Authors :Imran Ullah Khan ,Mohd. Javed Khan ,S.Hasan Saeed ,Nup...
 
Efficient video indexing for fast motion video
Efficient video indexing for fast motion videoEfficient video indexing for fast motion video
Efficient video indexing for fast motion video
 
Partial encryption of compresed video
Partial encryption of compresed videoPartial encryption of compresed video
Partial encryption of compresed video
 
Partial encryption of compressed video
Partial encryption of compressed videoPartial encryption of compressed video
Partial encryption of compressed video
 
IRJET - Information Hiding in H.264/AVC using Digital Watermarking
IRJET -  	  Information Hiding in H.264/AVC using Digital WatermarkingIRJET -  	  Information Hiding in H.264/AVC using Digital Watermarking
IRJET - Information Hiding in H.264/AVC using Digital Watermarking
 
Paper id 2120148
Paper id 2120148Paper id 2120148
Paper id 2120148
 
A FRAMEWORK FOR MOBILE VIDEO STREAMING AND VIDEO SHARING IN CLOUD
A FRAMEWORK FOR MOBILE VIDEO STREAMING AND VIDEO SHARING IN CLOUDA FRAMEWORK FOR MOBILE VIDEO STREAMING AND VIDEO SHARING IN CLOUD
A FRAMEWORK FOR MOBILE VIDEO STREAMING AND VIDEO SHARING IN CLOUD
 
Radvision scalable video coding whitepaper by face to face live
Radvision scalable video coding whitepaper by face to face liveRadvision scalable video coding whitepaper by face to face live
Radvision scalable video coding whitepaper by face to face live
 
Publications
PublicationsPublications
Publications
 
Cg25492495
Cg25492495Cg25492495
Cg25492495
 

compressed stream contains the encoded shape of the VO. Representing video as objects rather than as frames enables content-based applications and, in turn, new levels of content interactivity based on efficient object representation, object manipulation, bitstream editing and object-based scalability.

An MPEG-4 visual scene may consist of one or more video objects. Each video object is characterized by temporal and spatial information in the form of shape, motion and texture. The visual bitstream provides a hierarchical description of the scene, and each level of the hierarchy can be accessed through start codes, which are special code values embedded in the bitstream. The ability to process objects, layers and sequences selectively is a significant enabler for scalable personalization. The hierarchical levels include:

• Visual Object Sequence (VS): An MPEG-4 scene may include any number of 2-D or 3-D natural or synthetic objects. These objects and sequences can be addressed individually, based on the targeted user.
• Video Object (VO): A video object corresponds to a particular 2-D element in the scene. The simplest example is a rectangular frame; alternatively, it can be an arbitrarily shaped region corresponding to an object or to the background of the scene.
• Video Object Layer (VOL): Each video object is encoded in one of two modes, scalable or non-scalable, depending on the application, and this is represented in the video object layer (VOL). The VOL provides support for scalable coding.
• Group of Video Object Planes (GOV): Optional in nature, GOVs provide random access points into the bitstream, i.e. points at which video object planes can be decoded independently.
• Video Object Plane (VOP): VOPs are video objects sampled in time. They can be sampled either independently or dependently, using motion compensation. A rectangular VOP can represent a conventional video frame. A motion estimation and compensation technique is also provided for interlaced digital video: predictor motion vectors used to differentially encode a current field-coded macroblock are obtained from the median of the motion vectors of surrounding blocks or macroblocks, which supports high system scalability.

MPEG-4 video therefore consists of video objects rather than frames, allowing true interactivity and manipulation of separately coded, arbitrarily shaped objects, with an efficient scheduling scheme to speed up real-time computation.

Figure 1 below illustrates an object-based visual bitstream. A visual elementary stream carries the compressed visual data of exactly one layer of one visual object; there is only one elementary stream (ES) per visual bitstream. Visual configuration information, comprising the visual object sequence (VOS), visual object (VO) and visual object layer (VOL), must be associated with each ES.

[Figure 1: The visual bitstream format]

Compression Tools

Intra-coded VOPs (I-VOPs): VOPs coded using only information within the VOP itself, removing some of the spatial redundancy. Inter coding exploits temporal redundancy between frames through motion estimation and compensation; two inter-coding modes are provided: prediction from a previous VOP (P-VOPs) and prediction from both a previous and a future VOP (B-VOPs). These tools are used in the content preparation stage to increase compression efficiency and error resilience and to code different types of video objects.

Shape coding tools: MPEG-4 provides tools for encoding arbitrarily shaped objects. Binary shape information defines which portions (pixels) of the scene belong to the video object at a given time; it is encoded with a motion-compensated, block-based technique that allows both lossless and lossy coding. This allows accurate representation of objects, which in turn improves the quality of the final composition and helps differentiate between video and non-video objects within the stream.
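To make the hierarchy and the role of start codes concrete, the sketch below scans a byte buffer for MPEG-4 Visual start codes and labels the corresponding levels (VS, VO, VOL, GOV, VOP). The start-code values follow the published MPEG-4 Part 2 syntax, but the scanner itself is a minimal illustration of selective, level-by-level access, not a production demultiplexer.

```python
# Minimal MPEG-4 Part 2 (Visual) start-code scanner: indexes the hierarchy
# levels (VS, VO, VOL, GOV, VOP) that selective object-based processing
# relies on. Illustrative only; real streams need full header parsing.

def classify_start_code(code: int) -> str:
    """Map the byte following the 0x000001 prefix to a hierarchy level."""
    if 0x00 <= code <= 0x1F:
        return "video_object_start (VO)"
    if 0x20 <= code <= 0x2F:
        return "video_object_layer_start (VOL)"
    if code == 0xB0:
        return "visual_object_sequence_start (VS)"
    if code == 0xB3:
        return "group_of_vop_start (GOV)"
    if code == 0xB5:
        return "visual_object_start"
    if code == 0xB6:
        return "vop_start (VOP)"
    return f"other (0x{code:02X})"

def scan_start_codes(bitstream: bytes):
    """Yield (byte_offset, level_name) for every 0x000001xx start code."""
    i = 0
    while i + 3 < len(bitstream):
        if bitstream[i] == 0 and bitstream[i + 1] == 0 and bitstream[i + 2] == 1:
            yield i, classify_start_code(bitstream[i + 3])
            i += 4
        else:
            i += 1

if __name__ == "__main__":
    # Synthetic buffer: VS, VO, VOL, GOV, then two VOPs with dummy payloads.
    demo = (b"\x00\x00\x01\xB0" + b"\x00\x00\x01\x05" + b"\x00\x00\x01\x20"
            + b"\x00\x00\x01\xB3" + b"\x00\x00\x01\xB6" + b"\xAA" * 8
            + b"\x00\x00\x01\xB6" + b"\xBB" * 8)
    for offset, level in scan_start_codes(demo):
        print(f"offset {offset:3d}: {level}")
```

Because every level is delimited by its own start code, a network element can locate and process individual objects or layers without decoding the whole stream, which is the property scalable personalization exploits.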
Sprite coding: A sprite is an image composed of the pixels belonging to a video object that is visible throughout a video sequence. It is an efficient, concise representation of a background video object and is typically compressed with the object-based coding technique. Sprite coding achieves high compression efficiency when a video frame contains a background that is visible in its entirety at least once over the course of the sequence.

MPEG-4 H.264/AVC Scalable Video Coding (SVC): One method of achieving high video compression efficiency is the scalable extension of H.264/AVC, known as scalable video coding (SVC). A scalable video bitstream contains a non-scalable base layer and one or more enhancement layers. (The term "layer" in the Video Coding Layer (VCL) refers to syntax layers such as block, macroblock and slice.) The basic SVC design can be classified as a layered video codec. In general, the coder structure, as well as the coding efficiency, depends on the scalability space required by the application. An enhancement layer may enhance the temporal resolution (i.e. the frame rate), the spatial resolution, or the quality of the video content represented by the lower layer or part of it. The scalable layers can be aggregated into a single transport stream or transported independently.

Scalability is provided at the bitstream level, allowing for reduced complexity. Reduced spatial and/or temporal resolution can be obtained by discarding from a global SVC bitstream the NAL units (or network packets) that are not required for decoding the target resolution. NAL units contain motion information and texture data. NAL units of Progressive Refinement (PR) slices can additionally be truncated in order to further reduce the bit rate and the associated reconstruction quality.
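The bitstream-level scalability described above can be illustrated with a toy extractor that drops enhancement-layer NAL units above a target operating point. The record fields mirror the SVC layer identifiers (dependency_id, temporal_id, quality_id), but the stream model here is synthetic; it is not a real H.264/SVC parser.

```python
# Toy SVC sub-stream extractor: keeps only the NAL units needed for a target
# operating point by discarding higher spatial (dependency_id), temporal
# (temporal_id) and quality (quality_id) layers, as SVC allows at the
# bitstream level. Synthetic records stand in for real NAL unit headers.
from dataclasses import dataclass

@dataclass
class NalUnit:
    dependency_id: int   # spatial layer
    temporal_id: int     # frame-rate layer
    quality_id: int      # SNR/quality refinement layer
    size_bytes: int

def extract_operating_point(nals, max_dep, max_temp, max_qual):
    """Return the sub-stream for the requested resolution/rate/quality."""
    return [n for n in nals
            if n.dependency_id <= max_dep
            and n.temporal_id <= max_temp
            and n.quality_id <= max_qual]

if __name__ == "__main__":
    stream = [
        NalUnit(0, 0, 0, 1200), NalUnit(0, 1, 0, 400), NalUnit(0, 1, 1, 250),
        NalUnit(1, 0, 0, 2600), NalUnit(1, 1, 0, 900), NalUnit(1, 1, 1, 600),
    ]
    # Base spatial layer, full frame rate, no quality refinements.
    sub = extract_operating_point(stream, max_dep=0, max_temp=1, max_qual=0)
    saved = 1 - sum(n.size_bytes for n in sub) / sum(n.size_bytes for n in stream)
    print(f"kept {len(sub)} of {len(stream)} NAL units, ~{saved:.0%} bit-rate saved")
```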
NETWORK-BASED PERSONALIZATION CONCEPT

Network-based personalization represents an evolution of the network infrastructure. The solution introduces devices that allow multi-point media processing, enabling the network to target any user, on any device, with any content. This paper focuses primarily on the cable market and TV services; however, the concept is not confined to these areas.

The existing content flow remains intact regardless of how processing functionality is extended within each of the network components, including the user device. This approach can accommodate the range of available STBs, apply modifications based on user profiles, and support a variety of sources. The methodology anticipates that the entry and exit points of the system must support a variety of containers, formats, profiles, rates and so forth; within the system, however, the manipulation flow is unified for simplicity and scalability. Network-based personalization can serve incoming baseline (low-resolution), Standard Definition (SD) and High Definition (HD) formats, and can support multiple containers (such as Flash, Windows Media, QuickTime, MPEG Transport Stream and Real).

Network personalization requires an edge processing point and, optionally, ingest and user-premises points as content manipulation locations. The conceptual flow of the solution is shown in Figure 2 below.

[Figure 2: Virtual flow of network-based personalization: Prepare, Integrate, Create, Present, spanning the asset and session domains]

The virtual flow and building blocks defined here are generic and can be placed at different locations in the network, co-located or remote. Specific examples of architecture are reviewed later in this paper.
At the "preparation" point, media content is ingested and manipulated in several respects:

1) Analysis of the content and creation of relevant information (metadata), which then accompanies it across the flow.
2) Processing of the content for integration and creation, which includes manipulation such as changing format, structure, resolution and rate.

The outcome of the preparation stage is a single copy of the incoming media, but in a form that includes the data that will allow the other blocks to create multiple personalized streams from it.

The "integration" point is the transition point from asset focus to session focus. This block connects and synchronizes prepared media streams with instructions and other data to create a complete session-specific media and data flow, which is then provided to the "create" block.

The "create" and "present" blocks are the final content processing steps: for a given session, each media stream is crafted according to the user, device and medium (in the "create" block), then joined into a visual experience at the "present" block. The "create" and "present" blocks are intentionally defined separately in order to accommodate end-user devices of different types and processing power. Further discussion of this subject appears in the "Power Shifting to the User" section below.
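A minimal sketch of the virtual flow is given below, modelling each block as a function over a session context. The stage names follow Figure 2; the asset, session and device fields are placeholders invented purely for illustration, not a defined interface.

```python
# Sketch of the prepare -> integrate -> create -> present virtual flow.
# Stage names follow Figure 2; the asset/session/device fields are illustrative.

def prepare(asset):
    """Ingest: analyse the asset, attach metadata, normalise its structure."""
    return {"asset": asset, "metadata": {"format": "MPEG-4", "objects": 3},
            "layers": ["background_sprite", "foreground_vop", "overlay_text"]}

def integrate(prepared_assets, session):
    """Asset-to-session transition: synchronise streams, instructions and data."""
    return {"session": session, "flows": [a["layers"] for a in prepared_assets]}

def create(integrated, device):
    """Craft each media stream for the user, device and medium."""
    keep_layers = 2 if device["profile"] == "legacy" else 3
    return {"session": integrated["session"],
            "streams": [f[:keep_layers] for f in integrated["flows"]]}

def present(created, device):
    """Final composition / adaptation step, in the network or on the STB."""
    target = "MPEG-2 TS" if device["profile"] == "legacy" else "object layers"
    return {"deliver_as": target, "streams": created["streams"]}

if __name__ == "__main__":
    session = {"user": "subscriber-42", "service": "multi-view"}
    device = {"profile": "legacy"}   # e.g. an MPEG-2-only STB
    out = present(create(integrate([prepare("movie"), prepare("ad")], session),
                         device), device)
    print(out["deliver_as"], len(out["streams"]), "streams")
```

Because each block only consumes the output of the previous one, the blocks can be co-located or split between the back office, the network edge and the home, as the architecture examples later in the paper show.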
PUTTING IT ALL TOGETHER

The proposed implementation of network-based personalization uses the set of tools and the virtual building blocks defined above to create the required end result.

To support highly personalized, session-based services, we propose to use the MPEG-4 toolkit, which enables scene-related information to be transmitted together with video, audio and data to a processor-based network element, in which an object-based scene is composed according to the rendering capabilities of the user device. Using MPEG-4 authoring tools and applying BIFS (Binary Format for Scenes) encoding at the content preparation stage, the system improves the efficiency of personalization stream processing, specifically at the "create" and "present" stages. Different encoding levels are required to support the same bitstream; for example, varying amounts of network computational power will be required to process the foreground, background and other data (such as 2-D/3-D elements) within the same bitstream. Moreover, some of the video rendering can be passed directly to the user reception device (STB), reducing network image processing requirements.

The solution described in this paper uses a set of tools that allow the content creator to build multimedia applications without any knowledge of the internal representation structure of an MPEG-4 scene. With the MPEG-4 toolkit, the multimedia content is object-oriented, with spatial and temporal attributes that can be attached to it, including the BIFS encoding scheme. The MPEG-4 encoded objects address video, audio and multimedia presentations such as 3-D, as defined by the authoring tools.

The solution is built on four network elements: prepare, integrate, create and present. All four work together to ensure the highest processing efficiency and to accommodate different service scenarios: legacy MPEG-2 set-top boxes; H.264 set-top boxes with no object-based rendering capabilities; and, finally, STBs with full MPEG-4 object-based processing capabilities. Two-way feedback between the STB, the edge network and the network-based stream processor is established in order to define what is processed at each of the network stages.

PREPARE

At the prepare stage, the assumption is that incoming content is received in, or converted to, MPEG-4 toolkit encoding, generating media in object-based form. Using authoring tools to ingest content and create scene-related object information supports improved media compression for transmission and processing by the network. The object-based scene is created with MPEG-4 authoring tools, applying BIFS (Binary Format for Scenes) encoding to support the seamless integration and control of different audio/visual and synthetic objects in a scene.

Compression and manipulation of visual content with the MPEG-4 toolkit introduces the concepts of the Video Object Plane (VOP) and the sprite. Using video segmentation, each frame of an input video sequence can be segmented into a number of VOPs, each of which may describe a physical object within the scene. A sprite coding technique may be used to support a mosaic layout: it is based on a large image composed of the pixels belonging to a video object that is visible throughout a video segment, and it captures spatio-temporal information in a very compact way. Other tools may also be applied at the prepare stage to improve network processing and reduce bandwidth. These include I-VOPs (intra-coded video object planes), which allow an object to be encoded and decoded on the basis of its shape, motion and texture.
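Sprite-based background handling can be illustrated with a short sketch: the background of each frame is reconstructed as a window into one large static sprite, and an arbitrarily shaped foreground object (binary shape mask plus texture) is composited on top. The array sizes, offsets and mask below are invented for the example; real MPEG-4 sprite coding additionally signals warping parameters, which are omitted here.

```python
# Illustrative sprite-based reconstruction: each frame's background is a
# window into one large static sprite image, and an arbitrarily shaped
# foreground object (binary mask + texture) is composited on top.
import numpy as np

def reconstruct_frame(sprite, window_xy, frame_hw, fg_texture, fg_mask, fg_xy):
    """Crop the sprite at window_xy, then paste the masked foreground at fg_xy."""
    x, y = window_xy
    h, w = frame_hw
    frame = sprite[y:y + h, x:x + w].copy()        # background comes from the sprite
    fx, fy = fg_xy
    fh, fw = fg_texture.shape
    region = frame[fy:fy + fh, fx:fx + fw]
    region[fg_mask] = fg_texture[fg_mask]          # binary shape decides which pixels
    return frame

if __name__ == "__main__":
    sprite = (np.arange(200 * 320, dtype=np.uint32).reshape(200, 320) % 255).astype(np.uint8)
    fg_tex = np.full((20, 30), 240, dtype=np.uint8)
    fg_mask = np.zeros((20, 30), dtype=bool)
    fg_mask[5:15, 5:25] = True                     # simple arbitrary shape
    # Camera pans right: only the window offset changes per frame.
    for t, pan in enumerate([0, 4, 8]):
        frame = reconstruct_frame(sprite, (pan, 10), (120, 160),
                                  fg_tex, fg_mask, (30, 40))
        print(f"frame {t}: background window x={pan}, mean={frame.mean():.1f}")
```

Only the sprite, the per-frame window offsets and the foreground object need to be transmitted, which is why sprite coding is so compact for panning backgrounds.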
A Bidirectional Video Object Plane (B-VOP) may be used to predict from a past and a future reference VOP for each object, or a shape motion vector may be built from neighbouring motion vectors that have already been encoded.

The output of the prepare stage is, per asset, a set of object-based information coded as elementary streams, packetized elementary streams and metadata. The different object layers and data can in turn be transported as independent IP flows, over UDP/RTP, to the integrate stage.

INTEGRATE

The session with the preparation stage is an "object-based" session, embodied mainly in its visualization of several visual object types. The scalable core profile is required mostly because it supports arbitrarily shaped coding, temporal/spatial scalability and related features. At the same time, the scalable core profile must support computer graphics, such as 2-D meshes and synthetic objects, as part of the range of scalable objects in the integration stage.

MPEG-4 object-based coding allows separate encoding of foreground figures and background scenes. Arbitrarily shaped coding needs to be supported to maintain the quality of the input elements; it includes shape information in the compressed stream. In order to adapt streams to different delivery environments and available bandwidths, temporal and spatial scalability are included in the system. Spatial scalability allows one or more enhancement video object layers (VOLs) to be added to the base VOL to achieve different video scenes.

To summarize, at the integrate stage a user session is composed out of multiple incoming object-based assets to create the final, synchronized set of video object layers and object planes. The output of the integrate stage includes all the information and media required for the session; at this point, however, the media is still not tuned to the specifics of the network, device and user: it is a superset of them. The streams are then transported to the "create" and "present" stages, where the final manipulation is done.

CREATE

The systems part of MPEG-4 allows creation or viewing of a multimedia sequence with hybrid elementary streams, each of which can be encoded and decoded with the codec best suited to it. However, manipulating those streams synchronously and composing them onto a screen in real time is computationally demanding. A temporal cache is therefore used in the "create" stage to store the encoded media streams. The elementary streams (ES) arrive either as a multiplexed stream (using the MPEG-4-defined FlexMux) or as single streams, but all of them have been packetized by the MPEG-4 sync layer (SL). The use of FlexMux and the sync layer allows the elementary streams to be grouped with a low multiplexing overhead at the "prepare" and "integrate" stages, while the SL is used to synchronize bitstream delivery information from the previous stage to the "create" stage.

In order to generate the relevant session (stream), the "create" stage uses an HTTP submission to request the desired media presentation. The submission contains only the index of the preformatted BIFS (Binary Format for Scenes) for a pre-created and stored presentation, or a text-based description of the user's authored presentation. BIFS coding also allows the seamless integration and control of different audio/video objects in a scene. The "integrate" stage receives the request and sends the media to the "create" stage, i.e. the BIFS stream together with the object descriptor, in the form of an initial object descriptor stream.
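The multiplexing step can be illustrated with a simplified sketch: access units from several elementary streams are wrapped with an ES identifier and a decoding time stamp, then interleaved into one flow ordered by time stamp, which is roughly the grouping role FlexMux and the sync layer play here. The packet fields below are simplified stand-ins for the real MPEG-4 Systems syntax.

```python
# Simplified sketch of SL packetization plus FlexMux-style interleaving:
# access units (AUs) from several elementary streams are wrapped with an
# ES id and a decoding time stamp, then merged into one low-overhead flow.
# Field names are stand-ins for the real MPEG-4 Systems syntax.
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class SLPacket:
    dts: int                              # decoding time stamp, ms
    es_id: int = field(compare=False)
    payload: bytes = field(compare=False)

def flexmux(elementary_streams):
    """Merge per-ES packet lists (each already sorted by DTS) into one flow."""
    return list(heapq.merge(*elementary_streams))

if __name__ == "__main__":
    video_base = [SLPacket(dts=t, es_id=1, payload=b"V" * 800) for t in range(0, 120, 40)]
    video_enh  = [SLPacket(dts=t, es_id=2, payload=b"E" * 300) for t in range(0, 120, 40)]
    bifs       = [SLPacket(dts=0, es_id=3, payload=b"scene")]
    for pkt in flexmux([video_base, video_enh, bifs]):
        print(f"dts={pkt.dts:3d}  es={pkt.es_id}  {len(pkt.payload)} bytes")
```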
If the client side can satisfy the decoding requirements, it sends a confirmation to the "create" stage to start the presentation; otherwise, the client sends its decoding and resolution capabilities to the "create" stage. The "create" stage then repeatedly downgrades to a lower profile until it meets the client's decoding capabilities, or it informs the "present" stage to compose a stream that the client device can decode (i.e. H.264 or MPEG-2).

The "create" stage initiates the establishment of the necessary sessions for the scene description (SD) stream, in BIFS format, and the object description (OD) stream referenced by the user device. This allows the user device to retrieve the compressed media stream in real time, using the URL contained in the ES descriptor stream. BIFS is used to lay out the media elementary streams in the presentation, as it provides the spatial and temporal relationships of the objects by referencing their ES_IDs. If the "create" stage needs to modify the received scene, for example by adding an enhancement layer based on user device or network capabilities, it can send a BIFS update command to the "integrate" stage and obtain a reference to the new media elementary stream.

The "create" stage can handle multiple streams and synchronize between different objects, as well as between the different elementary streams of a single object (e.g., base layer and enhancement layer). The synchronization layer is responsible for synchronizing the elementary streams. Each SL packet consists of an access unit (AU) or a fragment of an AU. An AU carries the time stamps needed for synchronization and constitutes the data unit consumed by the decoder at the "create" stage or in the user device; an AU consists of a Video Object Plane (VOP). Each AU is received by the decoder at the time instance specified by its Decoding Time Stamp (DTS).

The media is processed by the "present" stage in such a way that MPEG-4 objects are transcoded to either an H.264 or an MPEG-2 transport stream, utilizing stored motion vector information and macroblock mode decisions. The applicable process is selected on the basis of the user device's rendering capabilities. When the target is an advanced user device with MPEG-4 object-layer decoding capabilities, the "present" processor acts only as a stream adaptor and resizer, and composition is performed by the client device (the advanced STB).
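The downgrade negotiation described above can be sketched as a simple control loop: offer the richest presentation profile first, step down until the client's capabilities are met, and otherwise hand composition over to the network "present" stage. The profile list and capability fields below are hypothetical, chosen only to show the control flow.

```python
# Sketch of the "create"-stage negotiation: try the richest profile first,
# downgrade until the client's decoding/resolution capabilities are met,
# and fall back to full network composition if nothing fits.
# Profiles and capability fields are hypothetical.

PROFILES = [  # ordered from richest to simplest
    {"name": "mpeg4-objects", "needs_object_decode": True,  "min_height": 720},
    {"name": "h264-stream",   "needs_object_decode": False, "min_height": 480},
    {"name": "mpeg2-stream",  "needs_object_decode": False, "min_height": 240},
]

def negotiate(client_caps):
    """Return (profile_name, where_to_compose) for a client capability set."""
    for profile in PROFILES:
        if profile["needs_object_decode"] and not client_caps["object_decode"]:
            continue                       # client cannot render MPEG-4 objects
        if client_caps["max_height"] >= profile["min_height"]:
            where = "client" if profile["needs_object_decode"] else "network"
            return profile["name"], where
    return "mpeg2-stream", "network"       # full network composition as last resort

if __name__ == "__main__":
    legacy_stb   = {"object_decode": False, "max_height": 480}
    advanced_stb = {"object_decode": True,  "max_height": 1080}
    print(negotiate(legacy_stb))    # ('h264-stream', 'network')
    print(negotiate(advanced_stb))  # ('mpeg4-objects', 'client')
```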
PRESENT

The modularity of the coding tools, expressed as well-known MPEG profiles and levels, allows the "present" stage to be customized easily for a selected market segment: for example, the legacy MPEG-2 STB market, where full stream composition must be applied in the network, versus advanced set-top boxes with full MPEG-4 scene and object-based capability, where only minimal stream preparation is needed from the network "present" stage. Two extreme service scenarios can be applied:

Network-based "present": The "present" function applies stream adaptation and resizing, composes the network object elements, and applies transcoding functions to convert the MPEG-4 file-based format to either an MPEG-2 stream-based format or an MPEG-4/AVC (H.264) stream-based format.

STB-based "present": The "present" function may pass the object elements through to the network after rate adaptation and resizing, to be composed and presented by the advanced user device.

The "present" functionality is based on client/network awareness. In general, media provisioning is driven by metadata generated by the client device and the network manager. The metadata includes the following information:

• Video format, i.e. MPEG-2, H.264, VC-1, MPEG-4, QuickTime, etc.
• User device rendering capabilities
• User device resolution format, i.e. SQCIF, QCIF, CIF, 4CIF, 16CIF
• Network bandwidth allocation for the session

"Present" stage performance

It is essential that the "present" function compose object-based elements, using the defined set of tools, that carry a binary coded representation of individual audiovisual objects, text, graphics and synthetic objects. It composes the Visual Object Sequence (VS), Video Object Layer (VOL) or any other defined tool into a valid H.264 or MPEG-2 stream at the resolution and bandwidth defined by the client device and the network metadata feedback. The elementary streams (scene data, visual data, etc.) are received at the "present" stage from the "create" system element, which allows scalable representations and alternate codings (bit rate, resolution, etc.), enhanced with metadata and protection information. An object described by an ObjectDescriptor is sent from the content originator, i.e. the "prepare" stage, and provides simple metadata related to the object, such as content creation information or chapter time layout. This descriptor also contains all information related to stream setup, including synchronization information and initialization data for decoders. At the "present" stage, BIFS (Binary Format for Scenes) is used to place each object, with various effects potentially applied to it, in a display that is then transcoded to an MPEG-2 or H.264 stream.

STB-based "present": Object reconstruction

The essence of MPEG-4 lies in its object-oriented structure. Each object forms an independent entity that may or may not be linked to other objects, spatially and temporally. This approach gives the end user at the client side tremendous flexibility to interact with the multimedia presentation and manipulate the different media objects: end users can change the spatial-temporal relationships among media objects and turn media objects on or off. However, it requires a fairly complex session management and control architecture. A remote client retrieves information about the media objects of interest and composes a presentation based on what is available and desired. The following communication messages occur between the client device and the "present" stage:

• The client requests a service by submitting the description of the presentation to the data controller (DC) on the "present" stage side.
• The DC on the "present" stage side controls the encoder/producer module to generate the corresponding scene descriptor, object descriptors, command descriptors and other media streams, based upon the presentation description information submitted by the end user at the client side.
• Session control on the "create" stage side manages session initiation, control and termination.
• Actual stream delivery commences after the client indicates that it is ready to receive; streams then flow from the "create" stage to the "present" client.

After the decoding and composition procedures, the MPEG-4 presentation authored by the end user is rendered on his or her display. The set-top box client is required to support the architectural design of the MPEG-4 System Decoder Model (SDM), which is defined to achieve media synchronization, buffer management and timing when reconstructing the compressed media data. The session controller at the client side communicates with the session controller at the server ("create" stage) side to exchange session status information and session control data. The session controller translates user actions into the appropriate session control commands.

Network-based MPEG-4 to H.264/AVC baseline profile transcoding

Transcoding from MPEG-4 to H.264/AVC can be done in the spatial domain or in the compressed domain. The most straightforward method is to fully decode each video frame and then completely re-encode it with H.264. This approach is known as spatial-domain video transcoding; because it involves full decoding and re-encoding, it is very computationally intensive. Instead, motion vector refinement and an efficient transcoding algorithm are used to transcode the MPEG-4 object-based scene to an H.264 stream. The algorithm exploits side information from the decoding stage to predict the coding modes and motion vectors for the H.264 encode. Both INTRA macroblock (MB) transcoding and INTER macroblock transcoding are exploited by the transcoding algorithm at the "present" stage. During the decoding stage, the incoming bitstream is parsed in order to reconstruct the spatial video signal, and the prediction directions for INTRA-coded macroblocks and the motion vectors are stored and then reused in the coding process.

To obtain the highest transcoding efficiency at the "present" stage, side information is stored. During the decoding of MPEG-4, a large amount of side information (such as motion vectors) is obtained. The "present" stage reuses this side information, which reduces transcoding complexity compared with a full decode/re-encode scenario. In the process of decoding the MPEG-4 bitstream, the side information is stored and used to facilitate the re-encoding process; both motion vector estimation and coding mode decisions reuse it, reducing complexity and computation power.

Network-based MPEG-4 to MPEG-2 transcoding

To support legacy STBs that have limited local processing capabilities and accept only MPEG-2 transport streams, a full decode/encode is performed by the "present" stage. However, the "present" stage utilizes the same tools used for the MPEG-4 to H.264 conversion in order to remove complexity. Stored motion vector information and macroblock mode decision algorithms for inter-frame prediction, based on machine-learning techniques, are used as part of the MPEG-4 to MPEG-2 transcode process. Since coding mode decisions consume most of the resources in video transcoding, fast macroblock (MB) mode estimation leads to reduced complexity.
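The motion-vector reuse idea can be illustrated with a toy refinement step: the vector decoded from the incoming stream seeds a small local search in the re-encoder instead of a full-range search, which is where the complexity saving comes from. The block size, SAD metric and search window below are simplified; this is not an H.264 encoder.

```python
# Toy motion-vector reuse for transcoding: instead of a full-range search,
# the re-encoding step refines the motion vector decoded from the incoming
# stream within a small window. Illustration of the complexity saving only.
import numpy as np

def sad(block, ref, x, y):
    """Sum of absolute differences between a block and a reference patch."""
    h, w = block.shape
    patch = ref[y:y + h, x:x + w]
    return int(np.abs(block.astype(int) - patch.astype(int)).sum())

def refine_mv(block, ref, x0, y0, decoded_mv, radius=2):
    """Search only +/- radius pixels around the reused (decoded) vector."""
    best = (decoded_mv, sad(block, ref, x0 + decoded_mv[0], y0 + decoded_mv[1]))
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            mv = (decoded_mv[0] + dx, decoded_mv[1] + dy)
            cost = sad(block, ref, x0 + mv[0], y0 + mv[1])
            if cost < best[1]:
                best = (mv, cost)
    return best

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ref = rng.integers(0, 255, size=(64, 64), dtype=np.uint8)
    true_mv = (3, -2)                                    # simulated real motion
    x0, y0 = 24, 24
    block = ref[y0 + true_mv[1]:y0 + true_mv[1] + 16,
                x0 + true_mv[0]:x0 + true_mv[0] + 16]
    decoded_mv = (2, -1)                                 # vector taken from the incoming stream
    print(refine_mv(block, ref, x0, y0, decoded_mv))     # ((3, -2), 0)
```

A 5 x 5 refinement window tests 25 candidates per block, versus thousands for an exhaustive search over a typical range, which is the order of saving that motivates reusing decoded side information.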
The implementation presented above can operate in both offline and real-time environments. See Appendix 2 for elaboration on the real-time implementation.
BENEFITS OF NETWORK-BASED PERSONALIZATION

Deploying network-based processing, whether complete or hybrid, has significant benefits:

• A unified user experience is delivered across the various STBs in the field.
• It provides a present- and future-proof cost model for low-end to high-end STBs.
• It utilizes the existing VOD environment, servers and infrastructure. Network-based processing accommodates low-end and future high-end systems, all under the operators' existing, managed on-demand systems. Legacy servers require more back-office preparation, with minimal server processing power overhead, while newer servers can provide additional per-user processing and thus more personalization features.
• Rate utilization is optimized. Instead of consuming the sum of all the streams that make up the user experience, network-optimized processing reduces overhead significantly. In the extreme case, a single stream with no overhead is delivered instead of four to five times the available bandwidth; in the common case, the overhead is approximately 20%.
• Best quality of service for connected-home optimization. By performing most or all of the processing before the content reaches the home, the operator optimizes the bandwidth and the experience across the user's end devices, delivering the best quality of service.
• Prevention of subscriber churn in favour of direct over-the-top (OTT) services. The operator has control over the edge network; over-the-top providers do not. Media manipulation in the network can and will be done by OTT operators, but unlike cable operators they do not control the edge network, which limits the effectiveness of their actions unless they have a QoS agreement with the operator, in which case control stays in the operator's hands.
• Maintaining the position of the current and future "smart pipe". Awareness of the end-user device, and processing for it, is critical for the operator to maintain processing capabilities that will allow migration to other areas such as mobile and 3-D streaming.
IMPLEMENTING NETWORK-BASED PERSONALIZATION

As indicated earlier in the document, the solution can be implemented in a variety of ways. In this section we present three of the options, all within a generic North American on-demand architecture: hybrid network edge and back office; network edge; and hybrid home and network.

Hybrid network edge and back office

As the user device powers up, or as the user starts using personalization features, the user client connects to the session manager, identifies the user, the device type and the personalization requirements, and, once resources are identified, starts a session. In this implementation the "prepare" function is physically separated from the other building blocks, and the user STB is not capable of the relevant video processing and rendering. Each incoming media asset is processed and extracted as part of the standard ingest process, readying it for downstream personalization.

Once a session is initiated and the edge processing resources are found, sets of media and metadata flows are propagated across the internal CDN to the "integrate" step at the network edge. The set of flows includes the different media flows; related metadata (target STB-based processing, source media characteristics, target content insertion information, interactivity support and so forth, which must be available for the edge to start processing the session); objects; and data from the content provider or advertiser.

After arrival at the edge, the "integrate" function aligns the flows and passes them to the "create" and "present" functions, which in this case generate a single, personally composed stream, accompanied by relevant metadata and directed at a specific user.

[Figure 3: Hybrid back office and network edge]
As can be seen in Figure 3 above, the SMP (Scalable Media Personalization) session manager connects the user device and the network, influencing the "integrate", "create" and "compose" edge functions in real time.

Network edge only

This application case performs all the processing on demand, in real time. It is similar to the hybrid case; however, instead of the "prepare" function being located at the back office and working offline, all functions reside on the same platform. As can be expected, this option places significant horsepower requirements on the "prepare" function, since content must be "prepared" in real time. In this example the existing flow is almost seamless, as the resource manager simply identifies the platform as another network resource and manages it accordingly.

[Figure 4: Network edge]
Hybrid home and network

In this hybrid implementation, the end-user device (an STB in our case) has been identified as one capable of hosting the "present" function. As a result, as can be seen in Figure 5, the "present" function is moved into the user's home, and the system demarcation point is the "create" function. During the session, multiple "prepared" flows of data and media arrive at the STB, consuming significantly less bandwidth than the non-prepared options and requiring less CPU horsepower for the "present" function.

[Figure 5: Hybrid home and network]
POWER SHIFTING TO THE USER

Although legacy STBs are still present in many homes, the overall processing horsepower in the home is growing and will continue to grow. This means that the user device will be able to do more processing at home and will, in theory, need less network-based assistance. At first glance this is indeed the case. However, on closer examination two main challenges emerge.

1. The increase in user device capabilities, and in actual user expectations, comes back to the network as a direct increase in bandwidth utilization, which in turn affects the user's experience and the ability to run enhanced applications such as multi-view. For example, today's next-generation STBs support 800 to 16,000 MIPS, versus the legacy 20 to 1,000 MIPS, with dedicated dual 400 MHz video graphics processors and dual 250 MHz audio processors (S-A/Cisco's next-generation Zeus silicon platform). As Figure 6 below indicates, the expected migration of media services onto other home devices, such as media centres and game consoles, significantly increases the available home processing power.

[Figure 6: Home processing power roadmap (TMIPS), 2007-2010]

2. No matter how "fast and furious" the processing power in the home becomes, users will always want more. Having home devices perform ALL of the video processing increases CPU and memory utilization and directly diminishes the performance of other applications. In addition, as discussed earlier in the document, the increase in open-standard home capabilities substantially strengthens the threat of customer churn for cable operators.

Network-based personalization is targeted at providing solutions to these challenges. The approach is to use network processing to help the user and improve the user experience.
By performing the "prepare", "integrate" and "create" functions in the network, and leaving only the "present" function to the user's home, several key benefits are delivered that effectively address the above challenges.

Network bandwidth utilization: The "create" function drives down network bandwidth consumption. The streams delivered to the user are no longer the complete, original media, but only what is needed. For example, when viewing one HD and two SD programs in the same multi-view window, each of the three streams is delivered at the resolution and frame rate required at each given moment, resulting in significant bandwidth savings, as can be seen in Figure 7.

[Figure 7: Bandwidth to the home for the 1 HD + 2 SD example (Mbps), comparing STB-only, hybrid and network-only delivery for MPEG-2 and H.264]

CPU processing power: As indicated in the "Putting It All Together" section, our solution allows selective composition of object layers. In addition, the multi-view itself is created from multiple resolutions, so there is no need for render-resize-compose functions at the user device, which further reduces overall CPU utilization.

Finally, the fact that the network can deliver the above benefits inherently puts power back into the hands of the operator, who can deliver the best user experience.
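The saving can be approximated with back-of-the-envelope numbers. The stream bitrates below are typical published figures for MPEG-2 and H.264, and the hybrid tile fraction is an assumption; only the roughly 20% overhead figure comes from this paper. These are not the measurements behind Figure 7.

```python
# Back-of-the-envelope estimate of the 1 HD + 2 SD multi-view example.
# "STB only" ships every stream at full rate; "network only" ships a single
# composed mosaic at roughly the rate of one full-screen stream; "hybrid"
# ships reduced-resolution tiles plus ~20% overhead. Bitrates and the tile
# fraction are assumed values, not the measurements behind Figure 7.

RATES_MBPS = {            # assumed full-resolution stream rates
    "MPEG2": {"HD": 15.0, "SD": 3.75},
    "H.264": {"HD": 8.0,  "SD": 2.0},
}

def stb_only(codec):
    r = RATES_MBPS[codec]
    return r["HD"] + 2 * r["SD"]                      # all streams sent whole

def network_only(codec):
    return RATES_MBPS[codec]["HD"]                    # one composed full-screen stream

def hybrid(codec, tile_fraction=0.35, overhead=0.20):
    r = RATES_MBPS[codec]
    tiles = (r["HD"] + 2 * r["SD"]) * tile_fraction   # downscaled per-window tiles
    return tiles * (1 + overhead)

if __name__ == "__main__":
    for codec in RATES_MBPS:
        print(f"{codec}: STB only {stb_only(codec):.1f} Mbps, "
              f"hybrid {hybrid(codec):.1f} Mbps, "
              f"network only {network_only(codec):.1f} Mbps")
```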
SUMMARY

Exceeding user expectations while maintaining a viable business case is becoming more challenging than ever for the cable operator. As the weight shifts to the home and to broadband streaming, the operator is forced to find new solutions in order to maintain leadership in the era of personalization and interactivity. Network-based personalization provides a balanced solution: the ability to maintain an open, standards-based solution, while dynamically shifting the processing balance based on user, device, network and time, can give both the user and the operator a "golden" solution.

ABOUT THE AUTHOR

Amos Kohn is Vice President of Business Development at Scopus Video Networks. He has more than 20 years of multinational executive management experience in convergence technology development, marketing, business strategy and solutions engineering at telecom and emerging multimedia organizations. Prior to joining Scopus, Amos Kohn held senior positions at ICTV, Liberate Technologies and Golden Channels.
APPENDIX 1: STB-BASED ADDRESSABLE ADVERTISING

In the home addressable advertising model, multiple user profiles in the same household are offered to advertisers within the same ad slot. For example, within the same slot, multiple targeted ads replace the same program feed: some targeted at different youth age groups, while another targets the adults in the house (male, female) based on specific profiles. During the slot, a young viewer sees one ad while an adult sees another. Addressable advertising requires more bandwidth to the home than traditional zone-based advertising. Granularity may also step one level up, where the targeted advertisement targets the household rather than the individual user within it; in this case, less bandwidth is required in a given serving area than with user-based targeted advertising. For channels that are already in the digital tier and enabled for local ad insertion, the impact of home addressability on the infrastructure is similar to the bandwidth requirements of a unicast VOD service. In a four-demographics scenario, for each ad zone, four times the bandwidth allocated for a linear ad needs to be added.

APPENDIX 2: REAL-TIME IMPLEMENTATION

Processing in real time is determined by stream provisioning (fast motion estimation), stream complexity and the size of the buffer at each stage. Scenes composed of audiovisual objects (AVOs), support for hybrid coding of natural video and 2-D/3-D graphics, and the provision of advanced system and interoperability capabilities all support real-time processing. MPEG-4 real-time software encoding of arbitrarily shaped video objects (VOs) is a key element of the solution. The MPEG-4 toolkit unites the advantages of block-based and pixel-recursive motion estimation methods in one common scheme, leading to a fast hybrid recursive motion estimation that supports MPEG-4 processing.
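The "fast hybrid recursive" idea can be pictured as a predictive search: candidate vectors taken from already-estimated neighbouring blocks (plus the zero vector) are tested first, and only the best candidate is refined locally. The sketch below is a generic predictive-search illustration under assumed block sizes and a synthetic frame pair; it is not the toolkit's actual algorithm.

```python
# Generic predictive (recursive-style) motion estimation sketch: test a small
# set of candidate vectors taken from already-estimated neighbours plus the
# zero vector, then refine the best candidate in a +/-1 pixel window.
# Illustrates why predictive search is fast; not the toolkit's algorithm.
import numpy as np

BLOCK = 8

def sad(cur, ref, bx, by, mv):
    """SAD between the current block at (bx, by) and the reference patch at (bx+dx, by+dy)."""
    blk = cur[by:by + BLOCK, bx:bx + BLOCK].astype(int)
    patch = ref[by + mv[1]:by + mv[1] + BLOCK, bx + mv[0]:bx + mv[0] + BLOCK].astype(int)
    return int(np.abs(blk - patch).sum())

def predictive_me(cur, ref, bx, by, candidates):
    """Pick the best candidate MV, then refine it by +/-1 pixel."""
    best_mv = min(candidates, key=lambda mv: sad(cur, ref, bx, by, mv))
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            mv = (best_mv[0] + dx, best_mv[1] + dy)
            if sad(cur, ref, bx, by, mv) < sad(cur, ref, bx, by, best_mv):
                best_mv = mv
    return best_mv

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    ref = rng.integers(0, 255, (48, 48), dtype=np.uint8)
    # Frame content shifted right by 3 and down by 2, so the reference offset is (-3, -2).
    cur = np.roll(ref, shift=(2, 3), axis=(0, 1))
    left_neighbour_mv, above_neighbour_mv = (-3, -2), (0, 0)   # already-estimated predictors
    candidates = [(0, 0), left_neighbour_mv, above_neighbour_mv]
    print(predictive_me(cur, ref, bx=16, by=16, candidates=candidates))   # -> (-3, -2)
```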