  Internet Engineering Task Force                              G. Liebl
  Internet Draft                                  LNT, Munich Univ. of
  Document: draft-ietf-avt-uxp-04.txt
  November 2002                          M. Wagner, J. Pandel, W. Weng
  Expires: May 2003                                 Siemens AG, Munich

       An RTP Payload Format for Erasure-Resilient Transmission of
                     Progressive Multimedia Streams
  Status of this Memo
     This document is an Internet-Draft and is in full conformance
        with all provisions of Section 10 of RFC2026.
     Internet-Drafts are working documents of the Internet Engineering
     Task Force (IETF), its areas, and its working groups. Note that
     other groups may also distribute working documents as Internet-
     Drafts. Internet-Drafts are draft documents valid for a maximum
     of six months and may be updated, replaced, or obsoleted by other
     documents at any time. It is inappropriate to use Internet-
     Drafts as reference material or to cite them other than as "work
     in progress."
     The list of current Internet-Drafts can be accessed at
     The list of Internet-Draft Shadow Directories can be accessed at

     This document specifies an efficient way to ensure erasure-
     resilient transmission of progressively encoded multimedia
     sources via RTP using Reed-Solomon (RS) codes together with
     interleaving. The level of erasure protection can be explicitly
     adapted to the importance of the respective parts in the source
     stream, thus allowing a graceful degradation of application
     quality with increasing packet loss rate on the network. Hence,
     this type of unequal erasure protection (UXP) scheme is intended
     to cope with the rapidly varying channel conditions on wireless
     access links to the Internet backbone. Nevertheless, backward
     compatibility to currently standardized non-progressive
     multimedia codecs is ensured, since equal erasure protection
     (EXP) represents a subset of generic UXP. By applying
     interleaving and RS codes, a payload format is defined that can
     be easily integrated into the existing framework for RTP.
  1. Introduction

     Due to the increasing popularity of high-quality multimedia
     applications over the Internet and the high level of public
     acceptance of existing mobile communication systems, there is a
     strong demand for a future combination of these two techniques:
     One possible scenario consists of an integrated communication
     environment, where users can set up multimedia connections
     anytime and anywhere via radio access links to the Internet.
     For this reason, several packet-oriented transmission modes have
     been proposed for next generation wireless standards like EGPRS
     (Enhanced General Packet Radio Service) or UMTS (Universal Mobile
     Telecommunications System), which are mostly based on the same
     principle: Long message blocks, i.e. IP packets, that enter the
     wireless part of the network are split up into segments of
     desired length, which can be multiplexed onto link layer packets
     of fixed size. The latter are then transmitted sequentially over
     the wireless link, reassembled, and passed on to the next network
     However, compared to the rather benign channel characteristics on
     today's fixed networks, wireless links suffer from severe fading,
     noise, and interference conditions in general, thus resulting in
     a comparably high residual bit error rate after detection and
     decoding. By use of efficient CRC-mechanisms, these bit errors
     are usually detected with very high probability, and every
     corrupted segment, i.e. one that contains at least one erroneous
     bit, is discarded to prevent error propagation through the
     network. But if even a single segment is missing at the
     reassembly stage, the upper layer IP packet can no longer be
     reconstructed. The result is a significant increase in
     packet loss rate at IP level.
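
     The size of this effect can be illustrated with a short
     calculation. The sketch below assumes that segments are lost
     independently with a common probability; this is a simplifying
     assumption made here for illustration only and does not model
     bursty fading channels. The function name and the example values
     are hypothetical.

```python
def ip_packet_loss_rate(segment_loss_rate: float, segments_per_packet: int) -> float:
    """Loss rate of an IP packet that is dropped when any segment is lost.

    Assumption (not from the draft): segment losses are independent and
    identically distributed.
    """
    if not 0.0 <= segment_loss_rate <= 1.0:
        raise ValueError("segment_loss_rate must lie in [0, 1]")
    if segments_per_packet < 1:
        raise ValueError("segments_per_packet must be at least 1")
    # The packet survives only if every one of its segments survives.
    return 1.0 - (1.0 - segment_loss_rate) ** segments_per_packet
```

     Under this assumption, the loss rate at IP level grows with the
     number of segments per packet for any fixed segment loss rate.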
     Since most multimedia applications can only recover from a very
     limited number of lost message blocks, it is vitally necessary to
     keep packet loss at IP level within a certain acceptable range
     depending on the individual quality-of-service requirements.
     However, due to the delay constraints typically imposed by most
     audio or video codecs, the use of ARQ-schemes is often prohibited
     both at link level and at transport level. In addition,
     retransmission strategies cannot be applied to any broadcast or
     multicast scenarios. Thus, forward erasure correction strategies
     have to be considered, which provide a simple means to
     reconstruct the content of lost packets at the receiver from the
     redundancy that has been spread out over a certain number of
     subsequent packets.
     There already exist some previous studies and proposals regarding
     erasure-resilient packet transmission [1,8]. Since most of them
     are based on the assumption that all parts in a message block are
     equally important to the receiver, i.e. the respective
     application cannot operate on partly complete blocks, they were
     optimized with respect to assigning equal erasure protection over
     the whole message block. However, recent developments both in
     audio and video coding have introduced the notion of
     progressively encoded media streams, for which unequal erasure
     protection strategies seem to be more promising, as will be
     explained in more detail below. Although the scheme defined in
     [1] is in principle capable of supporting some kind of unequal
     erasure protection, possible implementations seem to be quite
     complex with respect to the gain in performance. Finally, in [1]
     it is assumed that consecutive RTP packets can have variable
     length, which would cause significant segmentation overhead at
     the link layer of almost all wireless systems.

     This document defines a payload format for RTP, such that
     different elements in a progressively encoded multimedia stream
     can be protected against packet erasures according to their
     respective quality-of-service requirement. The general principle,
     including the use of Reed-Solomon codes together with an
     appropriate interleaving scheme for adding redundancy, follows
     the ideas already presented in [2], but allows for finer
     granularity in the structure of the progressive media stream. The
     proposed scheme is generic in the way that it (1) is independent
     of the type of media stream, be it audio or video, and (2) can be
     adapted to varying transmission quality very quickly by use of
  2. Conventions used in this Document

     The following terms are used throughout this document:
     1.)  Message block: a higher layer transport unit (e.g. an IP
          packet), that enters/leaves the segmentation/reassembly
          stage at the interface to wireless data link layers.
     2.)  Segment: denotes a link layer transport unit.
     3.)  CRC: Cyclic Redundancy Check, usually added to transport
          units at the sender to detect the existence of erroneous
          bits in a transport unit at the receiver.
     4.)  Segmentation/Reassembly Process: If the size of the
          transport units at the link layer is smaller than that at
          the upper layers, message blocks have to be split up into
          several parts, i.e. segments, which are then transmitted
          consecutively over the link. If nothing is lost, the
          original message block can be restored at the receiving
          entity.
     5.)  Quality-of-service: application-dependent criterion to
          define a certain desired operation point.
     6.)  Codec: denotes a functional pair consisting of a source
          encoding unit at the sender and a corresponding source
          decoding unit at the receiver; usually standardized for
          different media applications like audio or video.
     7.)  Media stream: A bitstream which results at the output of an
          encoder for a specific media type, e.g. H.263, MPEG-4-video.
     8.)  Progressive media stream: A media stream which can be
          divided into successive elements. The distinct elements are
          of different importance to the decoding process at the
          decoder and are commonly ordered from highest to least
          importance, where the later elements depend on the previous
          ones.
     9.)  Progressive source coding: results in a progressive media
          stream.
     10.) Reed-Solomon (RS) code: belongs to the class of linear
          nonbinary block codes, and is uniquely specified by the
          block length n, the number of parity symbols t, and the
          symbol alphabet.
     11.) n: is a variable, which denotes both the block length of an
          RS codeword, and the number of columns in a TB (see 19).
     12.) k: is a variable, which denotes the number of information
          symbols in an RS codeword.
     13.) t: is a variable, which denotes the number of parity symbols
          in an RS codeword.
     14.) Erasure: When a packet is lost during transmission, an
          erasure is said to have happened. Since the position of the
          erased packet in a sequence is usually known, a
          corresponding erasure marker can be set at the receiving
          entity.
     15.) Base layer: comprises the first and most important elements
          of the progressive media stream, without which all
          subsequent information is useless.
     16.) Enhancement layer: comprises one or more sets of the less
          important subsequent elements of the progressive media
          stream. A specific enhancement layer can be decoded if and
          only if the base layer and all previous enhancement layer
          data (of higher importance) are available.
     17.) Info stream: denotes the bitstream which has to be
          protected by the UXP scheme. It usually consists of the
          media stream (progressively source encoded or not), which is
          arranged according to a desired syntax (e.g. to achieve an
          appropriate framing, see Sect. 6.3). In any case, it is
          assumed that every info stream is already octet-aligned
          according to the standard procedures defined in the context
          of the used syntax specifications.
     18.) Info octet: Denotes one element of the info stream.
     19.) Transmission block (TB): denotes a memory array of L rows
          and n columns. Each row of a TB represents an RS codeword,
          whereas each column, together with the respective UXP header
          (see 36) in front, forms the payload of a single RTP packet.
          Each TB consists of at least two distinct transmission sub
          blocks (TSB, see 20): The first L_s rows belong to the
          signaling TSB, whereas the last L_d=(L-L_s) rows belong to
          one or more data TSBs.
     20.) Transmission sub block (TSB): denotes a memory array of
          0<l<L rows and n columns, which is a horizontal slice of a
          TB. Depending on whether the info octet positions are filled
          with descriptors (see 31) or media data, the TSB is of type
          signaling or data, respectively.
     21.) L: is a variable, which denotes both the number of rows in a
          TB and the payload length (without UXP header, see 36) of an
          RTP packet in octets.
     22.) Unequal erasure protection (UXP): denotes a specific
          strategy which varies the level of erasure protection across
          a TB according to a given redundancy profile.
     23.) Equal erasure protection (EXP): is a subset of UXP, for
          which the level of erasure protection is kept constant
          across a TB.
     24.) Redundancy profile: describes the size of the different
          erasure protection classes in a TB, i.e. the number of rows
          (codewords) per class.
     25.) Erasure protection class: contains a set of rows (codewords)
          of the TB with the same erasure correction capability.
     26.) i: is a variable, which denotes the number of parity
          symbols for each row in erasure protection class i.
     27.) EPC_i: is a variable, which denotes the set of rows
          contained in erasure protection class i.
     28.) R_i: is a variable, which denotes the total number of rows
          contained in erasure protection class i, i.e. the
          cardinality of EPC_i.
     29.) T: is a variable, which denotes the number of parity
          symbols for each row in the highest erasure protection class
          (with respect to application data) in a TB.
     30.) EPV: denotes the erasure protection vector of length (T+1)
          used to describe a certain redundancy profile.
     31.) DP: descriptor used for in-band signaling of the erasure
          protection vector.
     32.) SI: stuffing indicator, which contains the number of media
          stuffing symbols at the end of a data TSB (see 34).
     33.) Descriptor Stuffing: insertion of otherwise unused
          descriptor values (i.e. 0x00) at the end of the signaling
          TSB. Descriptor stuffing is performed, if the final sequence
          of descriptors and stuffing indicators for a valid
          redundancy profile is shorter than the space initially
          reserved for it in the signaling TSB.
     34.) Media Stuffing: insertion of additional symbols at the end
          of a data TSB. Media stuffing is performed, if the info
          stream (see 17) is shorter than the space reserved for it in
          the data TSB for a desired redundancy profile. Since the
          number of stuffing symbols is signaled in the respective SI,
          any octet value may be used (e.g. 0x00).
     35.) Interleaver: performs the spreading of a codeword, i.e. a
          row in the TB, over n successive packets, such that the
          probability of an erasure burst in a codeword is kept small.
     36.) UXP header: is the additional header information contained
          in each RTP packet after UXP has been applied. It is always
          present at the start of the payload section of an RTP
          packet.
     37.) X: denotes a currently not used extension field of 1 bit in
          the UXP header.
     38.) P: is a variable which denotes the number of parity symbols
          per row used to protect the in-band signaling of the
          redundancy profile.
     39.) ceil(.): denotes the ceiling function, i.e. rounding up to
          the next integer.

     The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
     NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
     "OPTIONAL" in this document are to be interpreted as described in
     RFC 2119.

  3. Reed-Solomon Codes
     Reed-Solomon (RS) codes are a special class of linear nonbinary
     block codes, which are known to offer maximum erasure correction
     capability with minimum amount of redundancy.
     An arbitrary t-erasure-correcting (n,k) RS code defined over
     Galois field GF(q) has the following parameters [3]:
     - Block length:                                      n=q-1
     - No. of information symbols in a codeword:          k
     - No. of parity-check symbols in a codeword:         n-k=t
     - Minimum distance:                                  d=t+1

     In what follows, only systematic RS codes over GF(2^8) shall be
     considered, i.e. the symbols of interest can be directly related
     to a tuple of eight bits, which is commonly called an octet in
     packet transmission. The basic structure of a codeword is
     shown in Fig. 1.
     By shortening the initial (n=255,n-t) RS code, any desired
     (n',n'-t) RS code for a given erasure correction capability t may
     be obtained.

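       |<------ block of n octets ------>|
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |&|&|&|&|&|&|&|&|&|&|&|*|*|*|*|*|*|
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |<------ k=n-t ------>|<--- t --->|
              (&: info)       (*: parity)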

     Fig. 1: Structure of a systematic RS codeword
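
     The parameter relations listed above can be captured in a few
     lines. The following is only a sketch of the bookkeeping for a
     (possibly shortened) RS code over GF(2^8); it performs no actual
     encoding or decoding, and the class and method names are
     hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RSCodeParams:
    """Parameters of a (shortened) systematic RS code over GF(2^8)."""
    n: int  # block length in octets, at most 255 for GF(2^8)
    t: int  # number of parity octets per codeword

    def __post_init__(self):
        if not 1 <= self.n <= 255:
            raise ValueError("block length n must be in 1..255")
        if not 0 <= self.t < self.n:
            raise ValueError("t must satisfy 0 <= t < n")

    @property
    def k(self) -> int:
        # Number of information octets: k = n - t
        return self.n - self.t

    @property
    def min_distance(self) -> int:
        # Minimum distance: d = t + 1
        return self.t + 1

    def can_correct(self, erasures: int) -> bool:
        # A code with t parity octets recovers up to t erased octets.
        return 0 <= erasures <= self.t
```

     For instance, RSCodeParams(n=20, t=4) describes a shortened code
     with 16 information octets per codeword that tolerates up to four
     erased octets.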

  4. Progressive Source Coding

     The output of an encoder for a specific media type, e.g. H.263 or
     MPEG-4 Visual, is said to be a media stream. If the media stream
     consists of several distinct elements, which are of different
     importance with respect to the quality of the decoding process at
     the receiver, then the media stream is progressive. The
     progressive media stream is often organized in separate layers.
     Hence, there exists at least one layer, often called base layer,
     without which decoding fails altogether, whereas all the other
     layers, often called enhancement layers, just help to continually
     improve the quality. Consequently, the different layers are
     usually contained in the (source-)encoded media stream in
     decreasing order of importance, i.e. the base layer data is
     followed by the various enhancement layers.
     An example can be found in the fine granular scalability modes
     which have been proposed to various standardization bodies like
     MPEG, where the resolution of the scaling process in the
     progressive source encoder is as low as one symbol in the
     enhancement layer [4]. Another example is given by data
     partitioning which can be applied to the ITU/MPEG H.26L standard
     [5], MPEG-4, and H.263++. Also, the existence of I, P, and B
     frames in streams which comply with standards like MPEG-2 can be
     interpreted as progressive.
     From the above definition, it is quite obvious that the most
     important base layer data must be protected as strongly as
     possible against packet loss during transmission. However, the
     protection of the enhancement layers can be continually lowered,
     since a loss at these stages has only minor consequences for the
     decoding process. Thus, by using a suitable unequal erasure
     protection strategy across a progressive media stream, the
     overhead due to redundancy is reduced. Furthermore, if channel
     conditions get worse during transmission, only more and more
     enhancement layers are lost, i.e. a graceful degradation in
     application quality at the receiver is achieved [6].
     Nevertheless, it should be mentioned that the specific structure
     of the media stream strongly depends on the actual media codec in
     use and does not always provide suitable mechanisms for transport
     over data networks, like framing (see also Sect. 6.3). In order
     to keep the description of the unequal erasure protection
     strategy in Sect. 5 as general as possible, the final bitstream
     which has to be protected by the proposed UXP scheme will be
     called "info stream" in the following. Furthermore, it is assumed
     that every info stream is already octet-aligned according to the
     standard procedures defined in the context of the used syntax
     specifications.

  5. General Structure of UXP Schemes

     In this section, the principal features of the proposed UXP
     scheme are described with a special focus on the protection and
     reconstruction procedure which is applied to the info stream. In
     addition, the behavior of the sender and receiver is specified as
     far as it concerns the reconstruction of the info stream.
     However, the complete UXP payload structure, including the
     additional UXP header, is described in Sect. 6.
     The reason for using the term "info stream" as well as the
     details of the construction are described in Sect. 6.3. For now,
     we assume that we have an info stream which has to be protected.

     Fig. 1 already illustrated the structure of a systematic RS
     codeword, which shall be represented by a single row with n
     successive symbols that contain the information and the parity
     octets. This structure shall now be extended by forming a
     transmission block (TB) consisting of L codewords of length n
     octets each, which amounts to a total of L rows and n columns
     [7]: Each column, together with the respective UXP header in
     front, shall represent the payload of an RTP packet, i.e. the
     whole data of a TB is transmitted via a sequence of n RTP packets
     all carrying a payload of length (L+2) octets (UXP header plus
     one column of the TB).
     Each TB usually consists of two or more horizontal sub blocks,
     the so-called transmission sub blocks (TSB), as can be seen in
     Fig. 2: The first L_s rows always belong to the signaling TSB,
     which is used to convey the actual redundancy profile in the data
     part to the receiver (see 6.4.). The following L_d=(L-L_s) rows
     belong to one or more data TSBs, which contain the interleaved
     and RS encoded info stream, as will be described below.

     Transmission Block (TB)

                  /\ +-+-+-+-+-+-+-+-+-+ /\
                  |  |  signaling TSB  |  |  L_s octets
                  |  +-+-+-+-+-+-+-+-+-+ \/
                  |  |                 | /\               /\
                  |  +   data TSB #1   +  |  L_d(1) octets |
                  |  |                 |  |                |
                  |  +-+-+-+-+-+-+-+-+-+ \/                |
     L octets     |  |                 | /\                |
     payload      |  +   data TSB #2   +  |  L_d(2) octets |
     per packet   |  |                 |  |                |  L_d oct.
                  |  +-+-+-+-+-+-+-+-+-+ \/                |
                  |  |        .        |  .                |
                  |  +        .        +  .                |
                  |  |        .        |  .                |
                  |  +-+-+-+-+-+-+-+-+-+ /\                |
                  |  |   data TSB #z   |  |  L_d(z) octets |
                  \/ +-+-+-+-+-+-+-+-+-+ \/               \/
                           n packets
     Fig. 2: General structure of a TB
     Since the UXP procedure is mainly applied to the data TSBs, it
     will be described next, whereas the content and syntax of the
     signaling TSB will be defined in section 6.4.
     For simplicity, only a single data TSB will be
     assumed throughout the following explanation of the encoding and
     decoding procedure. However, an extension to more than one data
     TSB per TB is straightforward, and will be shown in section 6.5.
     As depicted in Fig. 3, the rows of a transmission sub block shall
     be partitioned into T+1 different classes EPC_i, where i=0...T,
     such that each class contains exactly R_i=|EPC_i| consecutive
     rows of the matrix, where the R_i have to satisfy the following
     relationship:

          R_0 + R_1 + ... + R_T = L_d

     Data Transmission Sub Block (data TSB)

                  /\ +-+-+-+-+-+-+-+-+-+ /\
                  |  |&|&|&|&|&|*|*|*|*|  |
                  |  +-+-+-+-+-+-+-+-+-+  |  R_T=3
                  |  |&|&|&|&|&|*|*|*|*|  |
                  |  +-+-+-+-+-+-+-+-+-+  |
     L_d octets   |  |&|&|&|&|&|*|*|*|*| \/
     per packet   |  +-+-+-+-+-+-+-+-+-+ /\
                  |  |%|%|%|%|%|%|*|*|*|  |  R_(T-1)=1
                  |  +-+-+-+-+-+-+-+-+-+ \/
                  |  |$|$|$|$|$|$|$|*|*|  .
                  |  +-+-+-+-+-+-+-+-+-+  .
                  |  |!|!|!|!|!|!|!|!|*|  .
                  |  +-+-+-+-+-+-+-+-+-+ /\
                  |  |#|#|#|#|#|#|#|#|#|  |  R_0=1
                  \/ +-+-+-+-+-+-+-+-+-+ \/
                           n packets

     &,%,$,!,# : info octets belonging to a certain info stream in
                 decreasing order of importance
     * :         parity octets gained from Reed-Solomon coding

     Fig. 3: General structure for coding with unequal erasure
             protection

     Furthermore, all rows in a particular class EPC_i shall contain
     exactly the same number of parity octets, which is equal to the
     index i of the class. For each row in a certain class EPC_i, the
     same (n,n-i) RS code shall be applied.
     As can be observed from Fig. 3, class EPC_T contains the largest
     number of parity octets per row, i.e. offers the highest erasure
     protection capability in the block. Consequently, the most
     important element in the info stream must be assigned to class
     EPC_T, where the value of T should be chosen according to the
     desired outage threshold of the application given a certain
     packet erasure rate on the link.
     All other classes EPC_(T-1)...EPC_0 shall be sequentially filled
     with the remaining elements of the info stream in decreasing
     order of importance, where the optimal choice for the size of
     each class (0 or more rows), i.e. the structure of the redundancy
     profile, should depend on the quality-of-service requirements for
     the various (progressively-encoded) layers.
     The following set of rules contains a compact description of all
     the operations that must be performed for each transmission
     block:
     1.) The total number of columns n of the TB shall be chosen
     according to the actual delay constraints of the application.
     2.) Next, the expected number of rows reserved for the signaling
     TSB has to be selected, which limits the data TSB to L_d=(L-L_s)
     rows.
     3.) The maximum erasure correction capability T in the data TSB
     should be chosen according to the desired outage threshold of the
     application given the actual packet erasure rate on the link.
     4.) The redundancy profile for the rest of the data TSB should
     depend on the size and number of the various layers in the info
     stream, as well as the desired probability of successful decoding
     for each of them (quality-of-service requirement).
     5.) Any suitable optimization algorithm may be used for deriving
     an adequate redundancy profile. However, the result has to
     satisfy the following constraints:
     a) All available info octet positions in the data TSB have to be
     completely filled. If the info stream is too short for a desired
     profile, media stuffing may be applied to the empty info octet
     positions at the end of the data TSB by appending a sufficient
     number of octets (with arbitrary value, e.g. 0x00). The actual
     number of stuffing symbols per data TSB is then signaled via the
     respective stuffing indicator (see Sect. 6.4.). However, before
     resorting to any stuffing, it should be checked whether it is
     possible to strengthen the protection of certain rows instead,
     thus improving the overall robustness of the decoding process.
     b) The info stream should be fully contained within the data TSB
     (unless cutting it off at a specific point is explicitly allowed
     by the properties of the used media codec).
     c) The number of required descriptors and stuffing indicators
     (see section 6.4.) to signal the profile shall not exceed the
     space initially reserved for them in the signaling TSB.

     Constraints a) and b) should be already incorporated in the
     optimization algorithm. However, if constraint c) is not met, the
     data TSB has to be reduced by one row in favor of the signaling
     TSB to accommodate more space for the descriptors and stuffing
     indicators, i.e. steps 2-5 have to be repeated until a valid
     redundancy profile has been obtained.
     6.) For each nonempty class EPC_i, i=T...0, in the data TSB, the
     following steps have to be performed:
     a) All rows of this specific class shall be filled from left to
     right and top to bottom with data octets of the info stream in
     decreasing order of importance (i.e. starting with the most
     important element).
     b) For each row in the class, the required i parity-check octets
     are computed with the same (n,n-i) RS code and filled into the
     empty positions at the end of the row. Thus, every row in the
     class constitutes a valid codeword of the chosen RS code.

     7.) After having filled the whole data TSB with information and
     parity octets, the redundancy profile is mapped to the signaling
     TSB as described in section 6.4.
     8.) Each column of the resulting TB is now read out octet-wise
     from top to bottom and, together with the respective UXP header
     (see section 6.2.) in front, is mapped onto the payload section
     of one and only one RTP packet.
     9.) The n resulting RTP packets shall be transmitted
     consecutively to the remote host, starting with the leftmost one.
     10.) At the corresponding protocol entity at the remote host, the
     payload (without the UXP header) of all successfully received RTP
     packets belonging to the same sending TB shall be filled into a
     similar receiving TB column-wise from top to bottom and left to
     right.
     11.) For every erased packet of a received TB, the respective
     column in the TB shall be filled with a suitable erasure marker.
     12.) Before any other operations can be performed, the redundancy
     profile has to be restored from the signaling TSB according to
     the procedure defined in Sect. 6.4. If the attempt fails because
     of too many lost packets, the whole TB shall be discarded and the
     receiving entity should wait for the next incoming TB.
     13.) If the attempt to recover the redundancy profile has been
     successful, a decoding operation shall be performed for each row
     of the data TSB by applying any suitable algorithm for erasure
     decoding.
     14.) For all rows of the data TSB for which the decoding
     operation has been successful, the reconstructed data octets are
     read out from left to right and top to bottom, and appended to
     the reconstructed version of the info stream.
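
     The layout and interleaving steps of the above rules (6a, 8, 10,
     and 11) can be sketched as follows. This is a minimal
     illustration under stated assumptions and not a complete
     implementation: parity positions are filled with zero
     placeholders instead of real RS parity octets (rule 6b), the
     signaling TSB and the UXP header are omitted, and all function
     names as well as the use of None as erasure marker are
     hypothetical. The list epv holds R_i at index i.

```python
ERASED = None  # erasure marker for octets of lost packets (illustrative)


def fill_data_tsb(info, n, epv):
    """Rule 6a: lay out info octets row by row, class EPC_T first.

    Parity positions are zero placeholders; a real sender would compute
    them with an (n, n-i) RS code, which this sketch does not do.
    """
    if not epv or sum(epv) == 0 or len(epv) > n or min(epv) < 0:
        raise ValueError("invalid redundancy profile")
    capacity = sum(r * (n - i) for i, r in enumerate(epv))
    if len(info) > capacity:
        raise ValueError("info stream does not fit the redundancy profile")
    # Media stuffing with 0x00 up to the full capacity of the data TSB.
    padded = bytes(info) + bytes(capacity - len(info))
    rows, pos = [], 0
    for i in range(len(epv) - 1, -1, -1):  # i = T ... 0
        for _ in range(epv[i]):
            k = n - i
            rows.append(list(padded[pos:pos + k]) + [0] * i)
            pos += k
    return rows


def read_columns(rows):
    """Rule 8: each column, read top to bottom, is one packet payload."""
    n = len(rows[0])
    return [bytes(row[c] for row in rows) for c in range(n)]


def rebuild_rows(payloads, num_rows):
    """Rules 10 and 11: refill column-wise; lost packets become erasures."""
    return [[ERASED if p is None else p[r] for p in payloads]
            for r in range(num_rows)]
```

     Passing None in place of a lost payload to rebuild_rows() marks
     the same column position as erased in every row, which is the
     interleaving property discussed next.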

     One can easily realize that the above rules describe an
     interleaver, i.e. at the sender a single codeword of a TB is
     spread out over n successive packets. Thus, each codeword of a
     transmitted TB experiences the same number of erasures at exactly
     the same positions.
     Two important conclusions can be drawn from this:

     a) Since the same RS code is applied to all rows contained in a
     specific class, either all of them can be correctly decoded or
     none. Hence, there exist no partly decodable classes at the
     receiver.
     b) If decoding is successful for a certain class EPC_i, all the
     classes EPC_(i+1)...EPC_T can also be decoded, since they are
     protected by at least one more parity octet per row. Together
     with rule 6, it is therefore always ensured that, if a
     decodable enhancement layer exists, all other layers it depends
     on can also be reconstructed.
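
     Both conclusions can be expressed as a simple threshold test: with
     e erased packets, every row has exactly e erasures, so a class
     EPC_i is decodable if and only if i >= e. The sketch below only
     counts decodable classes and recoverable info octets; it does not
     decode anything, and the function names are hypothetical. The
     list R holds R_i at index i.

```python
def decodable_classes(T, erased_packets):
    """Indices i of classes EPC_i that survive the given number of erasures.

    A row with i parity octets can be decoded iff erased_packets <= i.
    Returned in order of decreasing protection (T first).
    """
    if T < 0 or erased_packets < 0:
        raise ValueError("T and erased_packets must be non-negative")
    return [i for i in range(T, -1, -1) if i >= erased_packets]


def recoverable_info_octets(R, n, erased_packets):
    """Info octet positions (including any stuffing) in decodable rows."""
    T = len(R) - 1
    return sum(R[i] * (n - i) for i in decodable_classes(T, erased_packets))
```

     For the profile drawn in Fig. 3 (n=9, T=4, R=[1, 1, 1, 1, 3]),
     the number of recoverable info octets shrinks step by step as more
     packets are erased, which is the graceful degradation intended by
     UXP.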

     Given the maximum erasure protection value T, the redundancy
     profile for a data TSB of size (L_d x n) shall be denoted by a
     so-called erasure protection vector EPV of length (T+1), where

          EPV=(R_0,R_1,...,R_T).

     From the above definition, it is easy to see that the trivial
     cases of no erasure protection and EXP are subsets of UXP:
     a) no erasure protection at all: all application data is mapped
        onto class EPC_0, i.e. EPV=(L_d,0,0,...,0).
     b) EXP: all application data is mapped onto class EPC_T, i.e.
        EPV=(0,0,...,0,L_d).
     Hence, backward compatibility to currently standardized non-
     progressive multimedia codecs is achieved.
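
     The two trivial profiles can be written down directly. The
     helpers below are a sketch with hypothetical names; they
     represent an EPV as a list whose index i holds the number of rows
     in class EPC_i, and they assume that the row counts of a valid
     profile add up to L_d.

```python
def is_valid_epv(epv, L_d):
    """A profile is consistent if the class sizes add up to L_d rows."""
    return len(epv) >= 1 and all(r >= 0 for r in epv) and sum(epv) == L_d


def no_protection_epv(T, L_d):
    """All rows in class EPC_0 (zero parity octets per row)."""
    return [L_d] + [0] * T


def exp_epv(T, L_d):
    """Equal erasure protection: all rows in class EPC_T."""
    return [0] * T + [L_d]


def parity_octets(epv):
    """Total number of parity octets in the data TSB for this profile."""
    return sum(i * r for i, r in enumerate(epv))
```

     With T=3 and L_d=10, no_protection_epv() yields [10, 0, 0, 0]
     with no parity octets at all, whereas exp_epv() yields
     [0, 0, 0, 10] and spends three parity octets on each of the ten
     rows.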

  6. RTP payload structure

     This section is organized as follows. First, the specific
     settings in the RTP header are shown. Next, the RTP payload
     header for UXP (the so-called UXP header) is specified. After
     that, the structure of the bitstream which is protected by UXP,
     the so-called info stream, is discussed. Finally, the in-band
     signaling of the erasure protection vector is introduced.
     For every packet, the UXP payload is formed by reading out a
     column of the TB and prefixing it with the UXP header. Thus, a
     UXP-compliant RTP packet looks as follows:

     |RTP Header| UXP Header| one column of the TB        |

  6.1 Specific Settings in the RTP Header

     The timestamp of each RTP packet is set to the sampling time of
     the first octet of the progressive media stream in the
     corresponding TB. If several data TSBs are included in one TB,
     the sampling time of data TSB #1 is relevant. This results in the
     TS value being the same for all RTP packets belonging to a
     specific TB.

     The payload type is of dynamic type, and obtained through out-of-
     band signaling similar to [1]. End systems, which cannot
     recognize a payload type, must discard it.
     The marker bit is set to 1 in the last packet of a TB; otherwise,
     its value is 0.
     All other fields in the RTP header are set to those values
     proposed for regular multimedia transmission using the RTP format
     of the media stream which is protected by UXP, e.g. for MPEG-4
     Visual as specified in RFC 3016.

  6.2. Structure of the UXP Header

     The UXP header shall consist of 2 octets, and is shown in Fig. 4:

      0                   1 1 1 1 1 1
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
     |X|  block PT   | block length n|

     Fig. 4: Proposed UXP header
     The fields in the UXP header are defined as follows:
     - X (bit 0): extension bit, reserved for future enhancements,
       currently not in use -> default value: 0
     - block PT (bits 1-7): regular RTP payload type to indicate the
       media type contained in the info stream
     - block length n (bits 8-15): indicates total number of RTP
       packets resulting from one TB (which equals the number of
       columns of the TB)
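
     A minimal sketch of packing and parsing these two octets is
     given below. It assumes the usual network bit numbering, i.e.
     bit 0 is the most significant bit of the first octet; the
     function names and the sample payload type value 96 are
     hypothetical:

```python
def pack_uxp_header(block_pt, n, x=0):
    """Build the 2-octet UXP header from its three fields."""
    if x not in (0, 1):
        raise ValueError("X is a single bit")
    if not 0 <= block_pt <= 0x7F:
        raise ValueError("block PT must fit into 7 bits")
    if not 0 <= n <= 0xFF:
        raise ValueError("block length n must fit into 8 bits")
    # First octet: X in the top bit, block PT in the lower seven bits.
    return bytes([(x << 7) | block_pt, n])


def parse_uxp_header(payload):
    """Split the first two payload octets into X, block PT and n."""
    if len(payload) < 2:
        raise ValueError("UXP header needs 2 octets")
    return {
        "x": payload[0] >> 7,
        "block_pt": payload[0] & 0x7F,
        "n": payload[1],
    }


header = pack_uxp_header(block_pt=96, n=20)
fields = parse_uxp_header(header + b"column octets follow here")
```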
     The syntax of the info stream which is protected by UXP is
     specified by the RTP payload type field contained in the UXP
     header. The details of the info stream are described in Sect. 6.3.
     For example, payload type H.263 means that the info stream
     conforms to the specifications of the RTP profile for H.263 and
     does not represent the "raw" H.263 media stream produced by an
     H.263 encoder.
     However, UXP can also be applied to the "raw" media stream (in
     case it is already octet-aligned), if this can be signaled to the
     receiver via other means, e.g. by use of H.245 or SDP.
     Based on the RTP sequence number, the marker bit, and the
     repetition of the block length n in each UXP header, the
     receiving entity is able to recognize both TB boundaries and the
     actual position of packets (both received and lost ones) in the
     TB.
  6.3 Framing and Timing Mechanism in UXP: The Info Stream
     As described in Sect. 5, UXP creates its own packetization scheme
     by interleaving. The regular framing and timing structure of RTP
     is therefore destroyed. This section describes the problems that
     arise with interleaving and how they can be solved. This finally
     leads to the specification of the info stream.
     The timestamp of an RTP packet usually describes the sampling
     time of the first octet included in the RTP data packet. This is
     in principle also true for UXP RTP packets. According to the
     timestamp definition in Sect. 6.1, every packet contains the
     timestamp of the sampling time of the first octet in the
     corresponding TB. Therefore, all packets which belong to one TB
     contain the same timestamp. This can lead to problems, since due
     to the theoretical size limit of a TB (the limit for the number
     of columns is 256, and the limit for the number of rows is the
     maximum packet size), it can contain data from different sampling
     time instances, e.g. several video frames. Then the timing
     information of the later frames has to be determined from the
     media stream itself and not from the RTP timestamp.
     A second problem arising with interleaving is that the framing
     mechanism of RTP is not supported. Consider a media encoder
     which does not create a fully decodable bitstream, e.g. H.26L
     with the video coding layer (VCL) and network adaptation layer
     (NAL) concept [9]. In this concept the VCL creates slices which
     are prepared for transmission over several networks at the NAL.
     Consequently, in case of RTP transmission, header information
     which is needed to decode the slices is included only in the RTP
     packets. Thus, filling a UXP TB with the "raw" media stream from
     the VCL can lead, even without packet losses, to a non-decodable
     bitstream.
     The framing problem can be solved in two ways:
     One solution could be to use the RTP payload specification of a
     given media stream to create a bitstream with an appropriate
     framing, resulting in the so-called info stream. For example, to
     create an H.263 info stream, the following steps are necessary:
     1.)  Generate an H.263-compliant media stream, i.e. take a slice
          or a video frame directly from the H.263 encoder.
     2.)  Apply the H.263 payload specification (e.g. RFC 2429) to
          create the RTP payload for only one packet.
     3.)  Insert the latter row by row into one data TSB.
     It is possible to apply the procedure mentioned above several
     times for different data TSBs (see Sect. 6.5.). Due to the
     in-band signaling, it is possible to determine the beginning and
     end of every TSB without parsing the whole TB. This allows a fast
     decomposition of the TB into the different TSBs.
     Another solution of the framing problem would be to rely on the
     framing mechanism of the media stream. This is, for example,
     possible for media streams which contain start codes.
     The timing problem can be solved in two ways.
     One solution is to comply with the RTP payload specification of
     the media stream. If the specification allows octets which belong
     to different sampling times to be put into one packet, this
     should also be allowed for a TB.
     The second solution for the timing problem is to rely on the
     timing information contained in the media stream itself, if
     available.
     Therefore, there are two different modes for framing:
     1.)  RTP payload framing (if an RTP payload specification exists
          for the media stream),
     2.)  pure media stream framing (if framing is contained in the
          media stream),

     and two different modes for timing:
     1.)  timing rules of the RTP payload specification for the media
          stream,
     2.)  timing information within the media stream.

     All combinations of timing and framing modes are possible, but
     framing mode 1 and timing mode 1 represent the default mode of
     operation for UXP. The use of other timing and framing modes has
     to be signaled by non-RTP means.
     The info stream is thus defined by the media stream together with
     framing and timing rules.
     In the following, some examples will be given:
     1.)  The info stream for MPEG-4 Visual according to RFC 3016 is
          the pure MPEG-4 compliant media stream, since RFC 3016
          specifies (in case of video) to take the MPEG-4 compliant
          video stream as payload.
     2.)  The info stream for H.263+ can be created according to RFC
          2429 as follows:
     |H.263+ payload| H.263+ compliant stream (possibly changed with|
     |header        | respect to RFC 2429) containing a slice/frame |

     This info stream is inserted into one single data TSB.
     If necessary, for example, if the slices are too short to achieve
     a reasonable TB size, several info streams can be inserted in one
     TB by concatenating several data TSBs to a single TB (see Sect.
     6.5.).

  6.4. In-band Signaling of the Structure of the Redundancy Profile

     To enable a dynamic adaptation to varying link conditions, the
     actual redundancy profile used in the data TSB as well as the
     beginning and end of a TSB must be signaled to the receiving
     entity. Since out-of-band signaling either results in excessive
     additional control traffic, or prevents quick changes of the
     profile between successive TBs, an in-band signaling procedure is
     desirable.
     Since, without knowledge of the correct redundancy profile, the
     decoding process cannot be applied to any of the erasure
     protection classes, the redundancy profile has to be protected at
     least as strongly as the most important element in the info
     stream. Therefore, an additional class EPC_P is used in the
     signaling TSB, where the number of parity symbols is by default
     set to the following value:
     Hence, up to 50% of the RTP packets can be lost before the
     redundancy profile can no longer be recovered. This seems to be
     a reasonable value for the lowest point of operation over a lossy
     link. Alternatively, P may be explicitly signaled during session
     setup by means of SDP or the H.245 protocol.
     Consequently, since all other classes must have equal or less
     erasure protection capability, the maximum allowable value for
     class EPC_T in the data TSB is now limited to T<=P.
     The signaling of the erasure protection vector is accomplished by
     means of descriptors. In the following we describe an efficient
     encoding scheme for the descriptors.
     For each class EPC_i with R_i>0, there is a descriptor DP_i
     providing information about the size of class EPC_i (i.e. the
     value of R_i) and establishing a relationship between the erasure
     protection of class EPC_i and that of the preceding class
     EPC_(i+j), where j>0 and j is the smallest value for which
     R_(i+j)>0 is true. A descriptor DP_i is mapped onto one octet,
     which is sub-divided into two half-octets (i.e. the higher and
     the lower four bits). The first half-octet is of type unsigned
     and contains the 4-bit representation of the decimal value R_i.
     The second half-octet is of type signed and contains the
     difference in erasure protection between class EPC_i and class
     EPC_(i+j), i.e. the signed 4-bit representation of the decimal
     value (-j) (where the MSB denotes the sign, and the lower three
     bits the absolute value). Note that the erasure protection P of
     class EPC_P is fixed, whereas the size R_P may vary.
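
     The descriptor layout can be sketched as follows. The sketch
     assumes a sign-and-magnitude reading of the second half-octet as
     described above, and that the first descriptor of a data TSB is
     taken relative to the protection P of class EPC_P. The function
     names are hypothetical; the sample octets 0xAC, 0x39, 0x2A and
     0x29 are the ones that appear in the example of Sect. 6.5, where
     P=10:

```python
def encode_descriptor(r_i, delta):
    """Pack the class size R_i and the signed difference into one octet."""
    if not 0 <= r_i <= 15:
        raise ValueError("R_i must fit into 4 bits")
    if not -7 <= delta <= 7:
        raise ValueError("difference must fit into sign plus 3 bits")
    sign = 0x8 if delta < 0 else 0x0
    return (r_i << 4) | sign | abs(delta)


def decode_descriptor(octet):
    """Return (R_i, signed difference) for one descriptor octet."""
    magnitude = octet & 0x7
    delta = -magnitude if octet & 0x8 else magnitude
    return octet >> 4, delta


def walk_descriptors(descriptors, p):
    """List (class index i, R_i) pairs, starting from protection P."""
    level = p
    classes = []
    for octet in descriptors:
        r_i, delta = decode_descriptor(octet)
        level += delta
        classes.append((level, r_i))
    return classes


profile = walk_descriptors([0xAC, 0x39, 0x2A, 0x29], p=10)
```

     Under these assumptions, the four sample descriptors describe 10
     rows in class EPC_6, 3 rows in EPC_5, 2 rows in EPC_3, and 2
     rows in EPC_2.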
     Thus, the data to be filled into class EPC_P shall consist of a
     sequence of descriptors separated by stuffing indicators (see
     below), where the number of descriptors is primarily given by the
     number of protection classes EPC_i, 0<=i<=T, in the data TSB with
     R_i>0.
     Without a-priori knowledge, the initial value for the size of the
     signaling TSB, R_P, should be set to one (row). When the number
     of necessary descriptors and stuffing indicators exceeds the
     (n-P) information positions, one or more additional rows have to
     be reserved. This is usually done by increasing the value for L_s
     to R_P>1, i.e. the data TSB is reduced to (L-R_P) rows. Hence, in
     order to indicate the actual size of the signaling TSB, an
     additional descriptor is inserted at the very beginning, which
     takes on the value 0xq0, where q denotes the hexadecimal digit
     (i.e. the four-bit representation) of the decimal value R_P.

     Furthermore, the end of each data TSB is signaled by the
     otherwise unused descriptor value 0x00, followed by exactly one
     stuffing indicator (SI). The latter is mapped onto an octet,
     which is of type unsigned and contains the 8-bit representation
     of the decimal value of the number of media stuffing symbols used
     at the end of the respective data TSB.
     The (extended) sequence of descriptors and stuffing indicators is
     then mapped to the octet positions in the R_P rows of the
     signaling TSB from left to right and top to bottom. Each row is
     then encoded with the same (n,n-P) RS code.
     If the number of descriptors and stuffing indicators is less than
     the number of available octet positions, the empty positions in
     class EPC_P may be filled up with the otherwise unused descriptor
     0x00.
     At the receiving entity, the sequence of descriptors shall be
     recovered by performing erasure decoding on the first row of the
     TB (which definitely belongs to the signaling TSB) using the same
     algorithm as later for the data TSB. If successful, the very
     first descriptor now indicates the number of rows of the
     signaling TSB, and the next (R_P-1) rows are decoded to
     reconstruct the redundancy profile for the data TSB(s), together
     with the number of media stuffing symbols denoted by the
     respective SI(s).
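
     The sender-side assembly of the info octets of the signaling TSB
     for a single data TSB can be sketched as follows. The sketch
     assumes that the size descriptor 0xq0 is always placed first,
     that the 0x00 end marker and the SI follow the class
     descriptors, and that the remaining positions are padded with
     0x00. The function name and the `capacity` parameter (the number
     of info positions in the R_P rows) are hypothetical:

```python
def build_signaling_octets(descriptors, stuffing, r_p, capacity):
    """Return the info octets of the signaling TSB for one data TSB.

    descriptors: class descriptors DP_i, already encoded as octets
    stuffing:    number of media stuffing symbols (value of the SI)
    r_p:         number of rows reserved for the signaling TSB
    capacity:    info octet positions available in those r_p rows
    """
    if not 1 <= r_p <= 15:
        raise ValueError("R_P must fit into the first half-octet")
    if not 0 <= stuffing <= 255:
        raise ValueError("the SI is a single unsigned octet")
    if 0x00 in descriptors:
        raise ValueError("0x00 is reserved as end marker and padding")
    # Size descriptor 0xq0, class descriptors, end marker, then the SI.
    octets = [r_p << 4] + list(descriptors) + [0x00, stuffing]
    if len(octets) > capacity:
        raise ValueError("more rows are needed for the signaling TSB")
    # Unused positions are filled with the descriptor value 0x00.
    return bytes(octets + [0x00] * (capacity - len(octets)))


info = build_signaling_octets([0xAC, 0x39, 0x2A, 0x29],
                              stuffing=3, r_p=1, capacity=10)
```

     Each of the R_P rows filled in this way would then be extended
     by P parity octets, as described above.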
     The complete structure of the TB is now depicted in Fig. 5.

     Transmission Block (TB)
                  /\ +-+-+-+-+-+-+-+-+-+ /\
                  |  |?|?|?|?|*|*|*|*|*|  |  R_P=1
                  |  +-+-+-+-+-+-+-+-+-+ \/
                  |  |&|&|&|&|&|*|*|*|*| /\
                  |  +-+-+-+-+-+-+-+-+-+  |  R_T=3
                  |  |&|&|&|&|&|*|*|*|*|  |
                  |  +-+-+-+-+-+-+-+-+-+  |
     L octets     |  |&|&|&|&|&|*|*|*|*| \/
     payload      |  +-+-+-+-+-+-+-+-+-+ /\
     per packet   |  |%|%|%|%|%|%|*|*|*|  |  R_(T-1)=1
                  |  +-+-+-+-+-+-+-+-+-+ \/
                  |  |$|$|$|$|$|$|$|*|*|  .
                  |  +-+-+-+-+-+-+-+-+-+  .
                  |  |!|!|!|!|!|!|!|!|*|  .
                  |  +-+-+-+-+-+-+-+-+-+ /\
                  |  |#|#|#|#|#|#|#|#|#|  |  R_0=1
                  \/ +-+-+-+-+-+-+-+-+-+ \/
                           n packets
     ? :          descriptors and stuffing indicators for in-band
                  signaling of the redundancy profile

     &,%,$,!,# :  info octets belonging to a certain element of the
                  info stream in decreasing order of importance

     * :          parity octets gained from Reed-Solomon coding

     Fig. 5: General structure for UXP with in-band signaling of the
     redundancy profile
     The following simple example is meant to illustrate the idea
     behind using descriptors: Let an erasure protection vector of
     length T+1=7 be given as follows:
     Hence, the length L of the TB (including one row for the
     signaling TSB) is equal to 7+2+2+3+10+1=25 (rows/octets). If the
     width is assumed to be equal to 20 (columns/packets), then the
     erasure protection of the descriptors is P=10.
     The corresponding sequence of descriptors can be written as
     where the values of the descriptors are given in hexadecimal
     notation. Next, the descriptor indicating the length of the
     signaling TSB has to be inserted, the end of the data TSB has to
     be marked by 0x00, and the SI has to be appended. If the number
     of media stuffing symbols is assumed to be 3, the 10 info octets
     in the signaling TSB take on the following values (descriptor
     stuffing included):
  6.5. Optional Concatenation of Transmission Sub Blocks

     The following procedure may be applied if a single info stream
     would be too short to achieve an efficient mapping to a
     transmission block with respect to the fixed payload length L and
     the desired number of packets n. For example, intra-coded video
     frames (I-frames) are usually much larger than the following
     predicted ones (P-frames). In this case, a certain number z of
     successive small info streams should be each mapped to a
     transmission sub block with length L_d(y) and width n, such that
     The resulting transmission sub blocks can then be easily
     concatenated to form a TB of size L x n having one common
     signaling TSB (see Fig. 2). Since the second half-octet of the
     descriptors is of type signed (cf. Sect. 6.4.), we are able to
     signal both decreasing and increasing erasure protection
     profiles within one single signaling TSB.
     Note that once the lengths L_d(y) of the individual blocks have
     been fixed, the respective redundancy profiles can be determined
     independently of each other. However, the space initially
     reserved for the signaling TSB should be already large enough to
     avoid profile recalculation for each of the data TSBs in case the
     sequence of descriptors gets too long!
     Again, we will give a simple example to illustrate this idea: Let
     the erasure protection vectors for two concatenated data TSBs be
     given as follows:

     Hence, two single identical data TSBs will be concatenated to
     form a TB of length L=2*(2+2+3+10)+2=36 (rows/octets). If the
     width is again assumed to be equal to 20 (columns/packets), then
     the erasure protection of the descriptors is P=10. We reserve a
     total of two rows for the signaling TSB. The corresponding
     sequence of descriptors can now be written as
     DP=(0xAC,0x39,0x2A,0x29,0xA4,0x39,0x2A,0x29), where the values of
     the descriptors are given in hexadecimal notation. The values of
     the first four descriptors are taken from the descriptor of EPV1
     as described in Sect. 6.4. (without the SI). The last four
     descriptors are taken from the descriptor of EPV2 (without SI)
     with one exception. The fifth descriptor of DP (i.e. 0xA4) is
     created as follows: The first half-octet is created according to
     Sect. 6.4. However, the second half-octet no longer describes the
     difference between R_P and R2_6. It rather describes the
     difference between R1_2 and R2_6, i.e. R1_2-R2_6, which can be a
     positive or negative number. If the number of media stuffing
     symbols is assumed to be 3 for each data TSB, the 20 info octet
     positions in the signaling TSB are filled with the following
     values (descriptor stuffing included):

     From the example above, the following general rule MUST be used
     to create the resulting descriptors for concatenated data TSB #u
     and data TSB #v, where v=u+1:
     Let EPVu=(Ru_0,Ru_1,...) and EPVv=(Rv_0,Rv_1,...) be the
     corresponding erasure protection vectors and DPu and DPv the
     corresponding descriptors created according to Sect. 6.4. (with
     stuffing). Let w be the smallest index for which Ru_w>0. Let x
     be the largest index for which Rv_x>0. The resulting descriptor
     can be created by concatenation of DPu and DPv, where the first
     descriptor of DPv should be changed as follows:
     The second half-octet is defined by Ru_w-Rv_x.

  7. Security Considerations
     The payload of the RTP packets consists of an interleaved media
     and parity stream. Therefore, it is reasonable to encrypt the
     resulting stream with one key rather than using different keys
     for media and parity data. It should also be noted that
     encryption of the media data without encryption of the parity
     data could enable known-plaintext attacks.
     The overall proportion between parity octets and info octets
     should be chosen carefully if the packet loss is due to network
     congestion. If the proportion of parity octets per TB is
     increased in this case, it could lead to increasing network
     congestion. Therefore, the proportion between parity octets and
     info octets per TB MUST NOT be increased as packet loss increases
     due to network congestion.
     The overall ratio between parity and info octets MUST NOT be
     higher than 1:1, i.e. the absolute bitrate spent for redundancy
     must not be larger than the bitrate required for transmission of
     multimedia data itself.
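
     A sender could check the 1:1 bound before transmitting a TB, for
     example as in the following sketch. It assumes that entry i of
     the EPV is the number of rows in class EPC_i (i parity octets
     and n-i info octets per row) and, as a further assumption, that
     the rows of the signaling TSB are counted in the same way with
     P parity octets each. The function name is hypothetical:

```python
from fractions import Fraction


def parity_to_info_ratio(epv, n, r_p=0, p=0):
    """Ratio of parity octets to info octets for one TB.

    epv[i]: number of rows in class EPC_i of the data TSB
    r_p, p: rows and parity octets per row of the signaling TSB
    """
    parity = sum(rows * i for i, rows in enumerate(epv)) + r_p * p
    info = sum(rows * (n - i) for i, rows in enumerate(epv)) + r_p * (n - p)
    return Fraction(parity, info)


# Sample profile (an assumption for illustration) with n = 20 packets.
ratio = parity_to_info_ratio([0, 0, 2, 2, 0, 3, 10], n=20)
within_bound = ratio <= 1
```

     Using exact fractions avoids rounding effects when the ratio is
     close to the bound.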


  8. Application Statement
     There are currently two different schemes proposed for unequal
     error protection in the IETF-AVT: Unequal Level Protection (ULP)
     and Unequal Erasure Protection (UXP).
     Although both methods seem to address the same problem, the
     proposed solutions differ in many respects. This section tries to
     describe possible application scenarios and to show the strengths
     and weaknesses of both approaches.
     The main difference between both approaches is that while ULP
     preserves the structure of the packets which have to be protected
     and provides the redundancy in extra packets, UXP interleaves the
     info stream which has to be protected, inserts the redundancy
     information, and thus creates a totally new packet structure.
     Another difference concerns multicast compatibility: It cannot be
     assumed that all future terminals will be able to apply UXP/ULP.
     Therefore, backward compatibility could be an issue in some
     cases. Since ULP does not change the original packet structure,
     but only adds some extra packets, it is possible for terminals
     which do not support ULP to discard the extra packets. In case of
     UXP, however, two separate streams with and without erasure
     protection have to be sent, which increases the overall data
     rate.
     Next, both approaches offer different mechanisms to adjust packet
     sizes, if necessary: UXP allows the packet sizes to be adjusted
     arbitrarily. This is an advantage in case the loss probability is
     dependent on the packet length, which happens, for example, if
     the end-to-end connection contains wireless links. In this case
     proper adjustment of the packet size is one essential network
     adaptation technique. In addition, if a preencoded stream is sent
     over the network, the packet size can be adjusted independently
     of slice structures.
     Since ULP does not change the existing packetization scheme, this
     flexibility does not exist.
     The ability of UXP to adjust the packet size arbitrarily can be
     especially exploited in a streaming scenario, if a delay of
     several hundred milliseconds is acceptable. It is then possible
     to fill several video frames into a single TB of desired size,
     e.g. a group of pictures consisting of I-frame, P-frames and B-
     frames. The redundancy scheme can thus be selected in such a way
     as to guarantee the following property: In case of packet loss,
     the streams for P-frames are only recoverable if the I-frame on
     which the decoding of P-frames depends is recoverable. The same
     is true for B-frames, which can only be decoded if the
     respective P-frames are recoverable. This prevents situations in
     which, for example, the B-frames have been received correctly,
     but the P-frames have been lost, i.e. assures a gradual decrease
     in application quality also on the frame level. Of course, a
     similar encoding is possible with ULP. But in this case one
     might have to send several frames within one packet, which leads
     to large packet sizes.
     Furthermore, decoding delay is also a crucial issue in
     communications. Again, both approaches have different delay
     properties: UXP introduces a decoding delay because a sufficient
     number of correctly received packets is necessary to start
     decoding of a TB. The delay in general depends on the dimensions
     of the interleaver. This should be considered for any system
     design which includes UXP.
     With ULP, every correctly received media packet can be decoded
     right away. However, a significant delay is introduced, if
     packets are corrupted, because in this case one has to wait for
     several redundancy packets. Thus, the delay is in general
     dependent on the actual ULP-FEC-packet scheme and cannot be
     considered in advance during the system design phase.
     Finally, we want to point out that UXP uses RS codes, which are
     known to be the most efficient type of block codes in terms of
     erasure correction capability.


  9. Intellectual Property Considerations
     Siemens AG has filed patent applications that might possibly have
     technical relations to this contribution.
     On IPR related issues, Siemens AG refers to the Siemens Statement
     on Patent Licensing, see


  10. References
     [1] J. Rosenberg and H. Schulzrinne, "An RTP Payload Format for
     Generic Forward Error Correction", Request for Comments 2733,
     Internet Engineering Task Force, Dec. 1999.
     [2] A. Albanese, J. Bloemer, J. Edmonds, M. Luby, and M. Sudan,
     "Priority encoding transmission", IEEE Trans. Inform. Theory,
     vol. 42, no. 6, pp. 1737-1744, Nov. 1996.
     [3] Shu Lin and Daniel J. Costello, Error Control Coding:
     Fundamentals and Applications, Prentice-Hall, Inc., Englewood
     Cliffs, N.J., 1983.
     [4] W. Li: "Streaming video profile in MPEG-4", IEEE Trans. on
     Circuits and Systems for Video Technology, Vol. 11, no. 3,
     301-317, March 2001.
     [5] G. Blaettermann, G. Heising, and D. Marpe: "A Quality
     Scalable Mode for H.26L", ITU-T SG16, Q.15, Q15-J24, Osaka, May

     [6] F. Burkert, T. Stockhammer, and J. Pandel, "Progressive A/V
     coding for lossy packet networks - a principle approach", Tech.
     Rep., ITU-T SG16, Q.15, Q15-I36, Red Bank, N.J., Oct. 1999.
     [7] Guenther Liebl, "Modeling, theoretical analysis, and coding
     for wireless packet erasure channels", Diploma Thesis, Inst. for
     Communications Engineering, Munich University of Technology,
     [8] U. Horn, K. Stuhlmuller, M. Link, and B. Girod, "Robust
     Internet video transmission based on scalable coding and unequal
     error protection", Image Com., vol. 15, no. 1-2, pp. 77-94, Sep.
     [9] S. Wenger, "H.26L over IP: The IP-Network Adaptation Layer",
     Packet Video 2002, Pittsburgh, Pennsylvania, USA, April 24-

  11. Acknowledgments
     Many thanks to Philippe Gentric, Stephen Casner, and Hermann
     Hellwagner for helpful comments and improvements. The authors
     would like to thank Thomas Stockhammer who came up with the
     original idea of UXP. Also, the help of Gero Baese, Frank
     Burkert, and Minh Ha Nguyen for the development of UXP is well
     acknowledged.

  12. Author's Addresses
     Guenther Liebl
     Institute for Communications Engineering (LNT)
     Munich University of Technology
     D-80290 Munich
     Email: {liebl}

     Marcel Wagner, Juergen Pandel, Wenrong Weng
     Siemens AG - Corporate Technology CT IC 2
     D-81730 Munich

  Full Copyright Statement
     "Copyright (C) The Internet Society (date). All Rights Reserved.
     This document and translations of it may be copied and furnished
     to others, and derivative works that comment on or otherwise
     explain it or assist in its implementation may be prepared,
     copied, published and distributed, in whole or in part, without
     restriction of any kind, provided that the above copyright notice
     and this paragraph are included on all such copies and derivative
     works. However, this document itself may not be modified in any
     way, such as by removing the copyright notice or references to
     the Internet Society or other Internet organizations, except as
     needed for the purpose of developing Internet standards in which
     case the procedures for copyrights defined in the Internet
     Standards process must be followed, or as required to translate
     it into languages other than English.
     The limited permissions granted above are perpetual and will not
     be revoked by the Internet Society or its successors or assigns.
     This document and the information contained herein is provided on