Introduction to Video Compression Techniques - Anurag Jain
5. Video Compression Standards

STANDARD   BIT RATE       APPLICATION
JPEG       Variable       Continuous-tone still-image compression
H.261      p x 64 kb/s    Video telephony and teleconferencing over ISDN
MPEG-1     1.5 Mb/s       Video on digital storage media (CD-ROM)
MPEG-2     > 2 Mb/s       Digital television
H.263      < 33.6 kb/s    Video telephony over PSTN
H.264      Variable       From low-bit-rate coding to HD encoding; HD-DVD, surveillance, video conferencing
MPEG-4     Variable       Object-based coding, synthetic content, interactivity
Digital video compression is the enabling technology behind many multimedia applications. Compression algorithms reduce the bit-rate requirements for transmitting digital video and thereby reduce delivery costs. With these appealing properties, digital video is rapidly becoming part of everyday life. For example, video telephony assists corporate and research users in a variety of collaborations over the Public Switched Telephone Network or the Internet. DVD players, high-definition television devices, digital camcorders, digital VCRs, and time-shifting products provide consumers with enhanced entertainment environments and novel methods for accessing media content. In the near future, wireless videophones promise untethered video communication between users. Several compression standards are key to the success of digital video applications.
In this presentation we examine the basic ideas and techniques behind video compression, as well as various techniques used to reduce the complexity of video processing. This lecture provides an overview of current and emerging video compression standards, which have been crucial for overcoming bottlenecks such as limited bandwidth and storage capacity. We briefly discuss the motivation for creating standards, and the aspects of a video compression system that the standards actually specify. We then review the basics of video compression and highlight some of the important features of the H.261, H.263, MPEG-1, MPEG-2, and MPEG-4 standards. We conclude by briefly mentioning two recent MPEG standards that are not targeted at compression, to clarify their scope.
Before discussing video compression standards, let us examine why video compression is important. Video compression is necessary to make many video applications practical, because raw video contains an immense amount of data. Most video applications require the communication or storage of video, and these capabilities are usually limited and/or very expensive. Consider, for example, video transmission within high-definition television, or HDTV. A popular HDTV video format is progressively scanned at 720x1280 pixels/frame and 60 frames/s, with 24 bits/pixel (8 bits each for red, green, and blue), which corresponds to a raw data rate of about 1.3 Gb/s. With modern digital communications, we can only transmit about 20 Mb/s in the 6 MHz bandwidth allocated per television channel. Therefore, powerful video compression techniques are applied to compress the video by about a factor of 70 to send the 1.3 Gb/s raw video through the available 20 Mb/s channel. As will be discussed later in this presentation, the MPEG-2 video compression standard is used to compress HDTV video.
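The HDTV numbers quoted above are easy to verify with a back-of-the-envelope calculation; the helper function below is just for illustration.

```python
def raw_bitrate(width, height, fps, bits_per_pixel):
    """Raw (uncompressed) video bit rate in bits per second."""
    return width * height * fps * bits_per_pixel

raw = raw_bitrate(1280, 720, 60, 24)        # 720p60, 24-bit RGB
channel = 20e6                              # ~20 Mb/s in a 6 MHz channel
print(f"raw rate: {raw / 1e9:.2f} Gb/s")    # → raw rate: 1.33 Gb/s
print(f"required compression: {raw / channel:.0f}:1")
```

The required ratio works out to roughly 66:1, consistent with the "factor of 70" quoted above once audio and overhead are included.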
Currently there are two families of video compression standards, determined under the auspices of the International Telecommunications Union-Telecommunications, or ITU-T, formerly the International Telegraph and Telephone Consultative Committee, or CCITT, and the International Standards Organization, or ISO. The first video compression standard to gain widespread acceptance was the ITU H.261, which was designed for videoconferencing over the integrated services digital network, or ISDN. H.261 was adopted as a standard in 1990. It was designed to operate at p=1,2, ..., 30 multiples of the baseline ISDN data rate, or p x 64 kb/s. In 1993, the ITU-T initiated a standardization effort with the primary goal of video telephony over the public switched telephone network, or PSTN, conventional analog telephone lines, where the total available data rate is only about 33.6 kb/s. The video compression portion of the standard is H.263. Its first phase was adopted in 1996. An enhanced H.263, known as H.263+, was finalized in 1997. A new long-term compression standard, known as H.26L, is currently under development. The moving pictures expert group, or MPEG, was established by the ISO in 1988 to develop a standard for compressing motion pictures and associated audio on digital storage media such as CD-ROM. The resulting standard, commonly known as MPEG-1, was finalized in 1991. It achieves approximately VHS quality video and audio at about 1.5 Mb/s. A second phase of their work, commonly known as MPEG-2, was an extension of MPEG-1 developed for digital television and higher bit rates. Currently, the video portion of digital television, or DTV, and high definition television, or HDTV, standards for large portions of North America, Europe, and Asia are based on MPEG-2. 
A third phase of their work, known as MPEG-4, had as its primary goal increased functionality, including content-based processing, integration of both natural and synthetic (computer-generated) material, and interactivity with the scene.
Video compression standards provide a number of benefits, foremost of which is facilitating interoperability. By ensuring interoperability, standards lower the risk for both consumers and manufacturers. This results in quicker acceptance and widespread use. In addition, these standards are designed for a wide variety of applications, and the resulting economies of scale lead to reduced cost and greater use.
An important question is what does a video compression standard actually specify? A video compression system is composed of an encoder, compressed bit-streams, and a decoder. The encoder takes original video and compresses it to a bit-stream. The bit-stream is passed to the decoder which decodes it to produce the reconstructed video. One possibility is that the standard would specify both the encoder and decoder, but this approach would have a number of disadvantages. Instead, the standards have a limited scope to ensure interoperability while enabling as much differentiation as possible.
The standards do not specify the encoder or the decoder. Instead, they specify the bit-stream syntax and the decoding process. The bit-stream syntax is the format for representing the compressed data; the decoding process is the set of rules for interpreting the bit-stream. Note that specifying the decoding process is different from specifying a particular decoder implementation. For example, the standard may specify that the decoder uses an IDCT, but not how to implement the IDCT. The IDCT may be implemented in a direct form, or by a fast algorithm similar to the FFT, and may be optimized for specific targets such as DSPs. The specific implementation is not standardized, which allows different designers and manufacturers to differentiate their work. The encoding process is also not standardized. For example, more sophisticated encoders can be designed that provide improved performance over baseline encoders. In addition, improvements can be incorporated even after a standard is finalized; for instance, improved algorithms for motion estimation or bit allocation may be adopted in a standard-compatible manner. The only constraint is that the encoder produce a syntactically correct bit-stream that can be properly decoded by a standard-compatible decoder. Because of these issues, it is important to remember that not all encoders are created equal.
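To make the "direct form versus fast algorithm" point concrete, here is a minimal direct-form 2-D DCT/IDCT pair; the function names and the 8x8 test block are invented for this sketch, and a real decoder would use a fast factored algorithm that produces the same numerical result.

```python
import math

N = 8

def c(k):
    # Orthonormal DCT scale factor.
    return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

def dct2(block):
    """Direct-form 2-D DCT-II of an 8x8 block (list of lists)."""
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            s = 0.0
            for x in range(N):
                for y in range(N):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = c(u) * c(v) * s
    return out

def idct2(coef):
    """Direct-form 2-D inverse DCT; a fast implementation must match this."""
    out = [[0.0] * N for _ in range(N)]
    for x in range(N):
        for y in range(N):
            s = 0.0
            for u in range(N):
                for v in range(N):
                    s += (c(u) * c(v) * coef[u][v]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[x][y] = s
    return out

# Round-trip an arbitrary 8x8 block and measure the error.
block = [[(x * 8 + y) % 256 for y in range(8)] for x in range(8)]
rec = idct2(dct2(block))
err = max(abs(rec[x][y] - block[x][y]) for x in range(8) for y in range(8))
print(f"max round-trip error: {err:.2e}")
```

Any conforming implementation, however it is factored or pipelined, must produce (to within the standard's accuracy requirements) the same output as this direct form.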
On the next few pages, we present a very brief review of video compression. Compression is achieved by exploiting the similarities or correlations that exist in a typical video signal. This can be viewed as reducing the redundancy in the video data. For example, consecutive frames in a video sequence are often highly correlated in that they contain the same objects, perhaps undergoing some movement between the frames. We refer to this as temporal redundancy. Also within a single frame there is spatial redundancy as the amplitudes of nearby pixels are often correlated. Similarly, the red, green, and blue color components of a given pixel are often correlated. The redundancy in a video signal generally can be identified and exploited. Another goal of video compression is to reduce the irrelevancy in the video signal; that is, to reduce the information that is not perceptually important. For example, it would be wasteful to spend valuable bits coding video features that cannot be seen or perceived. Unfortunately, human visual perception is very difficult to model, so determining which data is perceptually irrelevant is a difficult task and therefore irrelevancy is difficult to exploit.
Current video compression standards achieve compression by applying the same basic principles. The temporal redundancy is exploited by applying MC-prediction. The spatial redundancy is exploited by applying the DCT. The color space redundancy is exploited by a color space conversion. The resulting DCT coefficients are quantized and the non-zero quantized DCT coefficients are run-length and Huffman coded to produce the compressed bit-stream.
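The quantize, zigzag-scan, and run-length stage of that pipeline can be sketched in a few lines; the step size and input coefficients below are invented for illustration, and real coders use per-coefficient quantizer matrices and Huffman-code the (run, level) pairs afterwards.

```python
def zigzag_order(n=8):
    """(row, col) pairs of an n x n block in zigzag scan order."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))

def quantize(coef, q):
    """Uniform scalar quantization with step size q."""
    return [[int(round(v / q)) for v in row] for row in coef]

def run_length(block_q):
    """(run-of-zeros, level) pairs over the zigzag scan, then an EOB marker."""
    pairs, run = [], 0
    for r, c in zigzag_order():
        v = block_q[r][c]
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    pairs.append("EOB")
    return pairs

# A typical post-DCT block: energy packed into the low frequencies.
coef = [[0.0] * 8 for _ in range(8)]
coef[0][0], coef[0][1], coef[1][0] = 160.0, -48.0, 32.0
print(run_length(quantize(coef, 16)))  # → [(0, 10), (0, -3), (0, 2), 'EOB']
```

The zigzag scan orders coefficients from low to high frequency so that the many zero-valued high-frequency coefficients collapse into a single end-of-block symbol.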
The MPEG standards code video in a hierarchy of units: sequences, groups of pictures, pictures, slices, macro-blocks, and DCT blocks. MC-prediction is performed on 16x16-pixel blocks. A 16x16-pixel block is called a macro-block and is coded using 8x8-pixel block DCTs (typically four 8x8 blocks for luminance and two for chrominance), plus possibly a forward and/or backward motion vector. The macro-blocks are scanned in a left-to-right, top-to-bottom fashion, and a series of these macro-blocks forms a slice. All the slices in a frame comprise a picture, contiguous pictures form a group of pictures (GOP), and the GOPs form the entire sequence.
A video sequence consists of a sequence of video frames or images. Each frame may be coded as a separate image, for example by independently applying JPEG-like coding to each frame. However, video has the property that neighboring video frames are typically very similar. Video compression can achieve much higher compression ratios than image compression by exploiting this temporal redundancy or similarity between frames. The fact that neighboring frames are highly similar can be exploited by coding a given frame by first predicting it based on a previously coded frame and then coding the prediction error. There are three basic types of coded frames: I-frames are intra-coded frames; that is, frames that are coded independently of all other frames; predictively coded, or P-frames, where the frame is coded based on a previously coded frame; and bi-directionally predicted frames, or B-frames, where the frame is coded using both previous and future coded frames.
Consecutive video frames typically contain the same imagery, although possibly at different spatial locations. To exploit the predictability among neighboring frames, it is important to estimate the motion between the frames and then form an appropriate prediction while compensating for the motion. The process of estimating the motion between frames is known as motion estimation. The process of predicting a given frame based on a previously coded reference frame, while compensating for the relative motion between the two frames, is referred to as motion-compensated prediction. Block-based motion-compensated prediction is often used because it achieves good performance and has a basic, periodic structure that simplifies implementations. Examples of block-based forward and bi-directional motion-compensated prediction are illustrated on the left and right, respectively. The current frame to be coded is partitioned into 16x16-pixel blocks. For each block in the current frame, a prediction is formed by finding the best-matching block in a previously coded reference frame. The displacement, or relative motion, of the best-matching block is referred to as the motion vector.
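The best-match search just described can be sketched as an exhaustive (full-search) block matcher; the frame data is invented, and 4x4 blocks with a small search range stand in for the 16x16 blocks and +/-15-pixel ranges used in practice, to keep the example tiny.

```python
def sad(cur, ref, cx, cy, rx, ry, bs):
    """Sum of absolute differences between a current block and a candidate."""
    return sum(abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
               for j in range(bs) for i in range(bs))

def motion_search(cur, ref, cx, cy, bs=4, rng=2):
    """Full search over +/-rng; returns (dx, dy, distortion) minimizing SAD."""
    h, w = len(ref), len(ref[0])
    best = (0, 0, float("inf"))
    for dy in range(-rng, rng + 1):
        for dx in range(-rng, rng + 1):
            rx, ry = cx + dx, cy + dy
            if 0 <= rx <= w - bs and 0 <= ry <= h - bs:
                d = sad(cur, ref, cx, cy, rx, ry, bs)
                if d < best[2]:
                    best = (dx, dy, d)
    return best

# Reference frame has a bright 4x4 patch at (2, 3); in the current frame
# the patch has moved one pixel right and one pixel down, to (3, 4).
ref = [[0] * 12 for _ in range(12)]
cur = [[0] * 12 for _ in range(12)]
for j in range(4):
    for i in range(4):
        ref[3 + j][2 + i] = 200
        cur[4 + j][3 + i] = 200

dx, dy, dist = motion_search(cur, ref, 3, 4)
print(dx, dy, dist)  # → -1 -1 0
```

The recovered motion vector (-1, -1) points from the current block back to its best match in the reference frame with zero residual, which is exactly what the encoder would transmit.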
This page and the next illustrate high-level views of a typical video encoder and decoder. As previously discussed, the various standards specify the bit-stream syntax and the decoding process, but not the encoder processing or the specific decoder implementation. Therefore, these figures should be viewed only as examples of typical encoders and decoders in a video compression system. In the encoder, the input RGB video signal is first transformed into a luminance/chrominance color space, such as YUV, to exploit the color-space redundancy. To exploit the temporal redundancy, motion estimation and motion-compensated prediction are used to form a prediction of the current frame from the previously encoded frame. The prediction error, or residual, is partitioned into 8x8 blocks and the 2-D DCT is computed for each block. The DCT coefficients are adaptively quantized to exploit the local video characteristics and human perception, and to meet any bit-rate targets. The quantized coefficients and other information are Huffman coded for increased efficiency. Often a buffer is used to couple the variable bit-rate output of the video encoder to the desired channel. This is accomplished via a buffer-control mechanism whereby the buffer fullness is used to regulate the coarseness versus fineness of the coefficient quantization, and thereby the video bit-rate.
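The first stage above, the luminance/chrominance conversion, is a simple linear transform. As one common choice of "YUV", here is the ITU-R BT.601 full-range RGB-to-YCbCr conversion; the function is a sketch, not any standard's reference code.

```python
def rgb_to_ycbcr(r, g, b):
    """ITU-R BT.601 full-range RGB -> YCbCr (one common 'YUV' variant)."""
    y  =  0.299    * r + 0.587    * g + 0.114    * b
    cb = -0.168736 * r - 0.331264 * g + 0.5      * b + 128
    cr =  0.5      * r - 0.418688 * g - 0.081312 * b + 128
    return y, cb, cr

# White maps to maximum luminance and neutral chrominance.
y, cb, cr = rgb_to_ycbcr(255, 255, 255)
print(round(y, 3), round(cb, 3), round(cr, 3))  # → 255.0 128.0 128.0
```

The point of the transform is that most of the signal energy ends up in the luminance channel, so the chrominance channels can then be subsampled and quantized more coarsely with little visible loss.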
The video decoding process is the inverse of the encoding process. The bit-stream is parsed and Huffman decoded. The non-zero DCT coefficients are identified and then inverse quantized. An inverse block DCT operation produces the residual signal, which is combined in a spatially adaptive manner with the previously reconstructed frame to reconstruct the current frame. Finally, the reconstructed frame is converted back to the RGB color space to produce the output video signal.
This page and the next highlight the basic structures used in the MPEG standards. The MPEG standards group video frames into coding units called groups of pictures, or GOPs. GOPs have the property of re-initializing the temporal prediction used during encoding, which is important to enable random access into a coded video stream. Specifically, the first frame of a GOP is an I-frame, and the other frames may be I, P, or B frames. In this example, the GOP contains nine video frames, I0 through B8, where the subscript indicates the frame number. Frame I9 is the first frame of the next GOP. The arrows indicate the prediction dependencies: the frame at the base of each arrow, the anchor frame, is used to predict the frame at the tip of the arrow, the predicted frame. I frames are coded independently of other frames. P frames are predicted from the preceding I or P frame. B frames are predicted from both the preceding and following I or P frames. Notice that each B frame depends on data from a future frame, which means that the future frame must be decoded before the B frame itself can be decoded. Also note that the use of B frames adds additional delay. Therefore, while B frames are fine for broadcast or storage applications, they are often not appropriate for real-time, two-way communications or other applications where low delay is a requirement.
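Because each B frame needs its following anchor (I or P) decoded first, the bitstream carries frames in decode order rather than display order. The reordering for the example GOP above can be sketched as follows (the frame labels are just strings for illustration):

```python
def decode_order(display):
    """Reorder display-order frame labels into decode/transmission order."""
    out, pending_b = [], []
    for frame in display:
        if frame[0] == "B":
            pending_b.append(frame)   # hold until the next anchor is sent
        else:
            out.append(frame)         # anchor goes out first...
            out.extend(pending_b)     # ...then the B frames that use it
            pending_b = []
    return out + pending_b

# Display order: I0 B1 B2 P3 B4 B5 P6 B7 B8, then I9 of the next GOP.
gop = ["I0", "B1", "B2", "P3", "B4", "B5", "P6", "B7", "B8", "I9"]
print(decode_order(gop))
# → ['I0', 'P3', 'B1', 'B2', 'P6', 'B4', 'B5', 'I9', 'B7', 'B8']
```

The gap between transmission order and display order is exactly the extra delay that B frames introduce, which is why low-delay applications avoid them.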
The current video compression standards are based on the same basic building blocks, which include motion-compensated prediction, DCT, scalar quantization, run-length, and Huffman coding. Additional features added for particular applications include the capability to code interlaced video and error resilience or scalability tools. A major distinction between video compression standards is that the early standards – including H.261, H.263, MPEG-1, and MPEG-2 – used frame-based coding. Specifically, they viewed each frame as a rectangular group of pixels and attempted to code these pixels using block-based motion-compensated prediction and block DCT. In effect, these standards modeled the video as being composed of moving square blocks. In contrast, MPEG-4 provides the capability to model the video as being composed of a number of separate objects, such as a person, car, or background, and each object can have an arbitrary, non-square shape. MPEG-4 uses the same basic building blocks, but applies them to objects with arbitrary shapes. On the following pages, we briefly highlight the salient features of the different video compression standards. We first examine the MPEG-1 and MPEG-2 standards since they are the most popular, then the H.261 and H.263 standards. We end with the MPEG-4 standard since it is the newest and in many ways the most revolutionary.
The H.261 video compression standard was designed for real-time, two-way communication. Short delay was a critical feature, thus a maximum allowable delay of 150 ms was specified. H.261 was designed to operate over ISDN, at p=1,2,…,30 multiples of the baseline ISDN data rate, or p x 64 kb/s. H.261 uses only I and P frames. It does not use B frames in order to minimize the delay. H.261 employs 16x16-pixel ME/MC-P and 8x8-pixel Block DCT. The motion estimation is computed to full-pixel accuracy. The search range is +/-15 pixels. An interesting note is that H.261 provides the option of applying a 3x3 low-pass filter within the MC-P feedback loop to smooth the previous reconstructed frame as part of the prediction process. A loop filter is not used in MPEG-1, or MPEG-2, or H.263 since they use motion estimation with half-pixel accuracy and the resulting spatial interpolation has a similar effect as the loop filter. H.261 was standardized in 1990.
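The optional loop filter mentioned above is a separable low-pass with (1, 2, 1)/4 taps in each direction. The sketch below illustrates the idea on invented data; its border handling (leaving edge pixels unfiltered) and rounding are simplifications, not a transcription of the H.261 specification.

```python
def loop_filter(frame):
    """Apply a separable (1,2,1)/4 low-pass to interior pixels."""
    h, w = len(frame), len(frame[0])
    out = [row[:] for row in frame]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            acc = 0
            for dy, wy in ((-1, 1), (0, 2), (1, 1)):
                for dx, wx in ((-1, 1), (0, 2), (1, 1)):
                    acc += wy * wx * frame[y + dy][x + dx]
            out[y][x] = (acc + 8) // 16   # 2-D kernel weights sum to 16
    return out

# A single bright pixel is spread over its 3x3 neighbourhood.
f = [[0] * 5 for _ in range(5)]
f[2][2] = 160
print(loop_filter(f)[2])  # → [0, 20, 40, 20, 0]
```

Smoothing the reconstructed reference in this way reduces blocking artifacts in the prediction, much as the half-pixel interpolation in the later standards does.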
The H.263 video compression standard was designed with the primary goal of communication over conventional analog telephone lines. Transmitting video, speech, and control data over a 33.6 kb/s modem means that there is typically only about 20 to 24 kb/s available for the video. The H.263 coder has a structure similar to H.261 and was designed to facilitate interoperability between H.261 and H.263 coders. A number of enhancements over H.261 were introduced: reduced overhead information, improved error resilience, enhancements to some of the baseline coding techniques (including half-pixel MC-prediction), and improved compression efficiency via four advanced coding options. The advanced coding options are negotiated, in that the encoder and decoder communicate to determine which options can be used before compression begins. When all the coding options are used, H.263 provides a significant quality improvement over H.261, particularly at very low bit rates. For example, at rates below 64 kb/s, H.263 typically achieves approximately a 3 dB improvement over H.261 at the same bit rate, or a 50% reduction in bit rate at the same SNR quality. The first version of H.263 was adopted in 1996.
The moving pictures expert group, or MPEG, was originally established by ISO to develop a standard for compression of moving pictures, video, and associated audio on digital storage media such as CD-ROM. The resulting standard, commonly known as MPEG-1, was finalized in 1991 and achieves approximately VHS quality video and audio at about 1.5 Mb/s. A second phase of their work, commonly known as MPEG-2, was originally intended as an extension of MPEG-1 and was developed for application toward interlaced video from conventional television and for bit rates up to 10 Mb/s. A third phase was envisioned for higher-bit-rate applications such as HDTV, but it was recognized that those applications could also be addressed within the context of MPEG-2. Hence, the third phase was wrapped back into MPEG-2 and, as a result, there is no MPEG-3 standard. Both MPEG-1 and MPEG-2 are actually composed of a number of parts, including video, audio, systems, compliance testing, etc. The video compression parts of these standards are often referred to as MPEG-1 video and MPEG-2 video, or MPEG-1 and MPEG-2 for brevity. Currently, MPEG-2 video has been adopted as the video portion of the digital television and HDTV standards for large portions of North America, Europe, and Asia. MPEG-2 video is also the basis for the digital video disk, or DVD, standard. MPEG-2 is a superset of MPEG-1, supporting higher bit rates, higher resolutions, and interlaced pictures for television. For interlaced video, the even and odd fields may be coded separately or a pair of even and odd fields can be combined and coded as a frame. For field-based coding, MPEG-2 provides field-based methods for MC-prediction, Block-DCT, and alternate zigzag scanning. In addition, MPEG-2 provides a number of enhancements, including scalable extensions.
The MPEG standards were designed to address a large number of diverse applications, each of which requires a different set of tools or functionalities. Encoders and decoders that support all the functionalities would be very complex and expensive. However, a typical application is likely to use only a small subset of the MPEG functionalities. Therefore, to enable more efficient implementations for different applications, MPEG grouped appropriate subsets of functionalities together and defined a set of profiles and levels. A profile corresponds to a set of functionalities that are useful for a particular range of applications. Within a profile, a level defines upper bounds on some of the parameters, such as resolution, frame rate, bit rate, and buffer size. This figure illustrates a simplified version of the 2-D matrix of profiles and levels in MPEG-2. A decoder is specified by the profile and level that it conforms to, for example main profile at main level, or MP@ML. In general, a more complex profile/level is a superset of a less complex profile/level. Two widely used profile/levels are MP@ML, which is used to compress conventional television, as on DVDs and in standard-definition digital television, or SD-DTV, and main profile at high level, which can be used to compress HDTV.
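A profile/level point is, in effect, a table of upper bounds that a stream must respect. The limits below are the commonly quoted ones for two MPEG-2 points; treat them as illustrative values for this sketch rather than a transcription of the standard's conformance tables.

```python
# Commonly quoted parameter ceilings for two MPEG-2 profile/level points.
LIMITS = {
    "MP@ML": {"width": 720,  "height": 576,  "fps": 30, "mbps": 15},
    "MP@HL": {"width": 1920, "height": 1152, "fps": 60, "mbps": 80},
}

def conforms(point, width, height, fps, mbps):
    """True if a stream's parameters fit within the given profile/level."""
    lim = LIMITS[point]
    return (width <= lim["width"] and height <= lim["height"]
            and fps <= lim["fps"] and mbps <= lim["mbps"])

print(conforms("MP@ML", 720, 480, 30, 6))    # SD-DTV stream → True
print(conforms("MP@ML", 1280, 720, 60, 19))  # 720p60 HDTV → False
print(conforms("MP@HL", 1280, 720, 60, 19))  # → True
```

This is why an MP@ML decoder, such as one in a DVD player, can reject an HDTV stream by inspecting the sequence header alone, before decoding any pictures.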
MPEG-4 is quite different from MPEG-1 and MPEG-2 in that its primary goals are to enable new functionalities, not just to provide better compression. MPEG-4 supports an object-based, or content-based, representation. This enables separate coding of different video objects in a video scene and, furthermore, allows individual access to and manipulation of the different objects in a video. Note that MPEG-4 does not specify how to identify or segment the objects in a video; that operation is performed at the encoder, which is not specified by the standard. However, once the individual objects are known, MPEG-4 provides a method to compress them. MPEG-4 also supports compression of synthetic, or computer-generated, video objects, as well as the integration of natural and synthetic objects within a single video. MPEG-4 also enables interaction with the individual video objects. In addition, MPEG-4 supports error-resilient communication over error-prone channels such as the Internet and third-generation wireless systems. MPEG-4 also includes most of the coding techniques developed in the earlier standards. As a result, MPEG-4 supports both frame-based and object-based video coding. The first version of MPEG-4 was finalized in 1999. A second, superset version, referred to as MPEG-4 Version 2, was finalized in 2000. A third version is currently being finalized.
A number of important differences become evident when comparing the various compression standards. MPEG-1, MPEG-2, H.261, and H.263 were primarily designed to compress video. They provide a pipe for storing or transmitting the video, use frame-based methods for coding, and are primarily designed for hardware implementations. In contrast, MPEG-4 is designed as a large set of tools for a variety of applications. These tools support both object-based and frame-based coding, and they also support the coding of synthetic video. MPEG-4 has a software emphasis and provides the capability to download certain types of algorithms that may be used at the decoder to support a rich variety of applications, such as interacting with the video or managing decoder client resources. Note, however, that MPEG-4 does not permit the downloading of the encoding or decoding algorithms themselves.
MPEG-2 provides several field-based motion-compensation modes for interlaced video.

Field prediction: the top and bottom fields of the reference frame predict the first field of the current frame; the bottom field of the previous frame and the already-decoded top field of the current frame predict the bottom field of the current frame.

16x8 motion compensation mode: each field-coded macroblock is split into an upper and a lower 16x8 region, each with its own motion vector, so a macroblock may have two motion vectors, and a B-picture macroblock may have four.

Dual prime motion compensation: a field of the current frame is predicted by averaging two predictions derived from motion vectors referring to the top and bottom fields of the reference frame; this mode is available only in P-pictures.