      Internet Draft                                               Adam H. Li
      draft-ietf-avt-evrc-smv-01.txt                                     UCLA
      May 16, 2002                                                     Editor
      Expires: November 16, 2002

                    RTP Payload Format for EVRC and SMV Vocoders

      STATUS OF THIS MEMO

         This document is an Internet-Draft and is in full conformance with
         all provisions of Section 10 of RFC 2026.

         Internet-Drafts are working documents of the Internet Engineering
         Task Force (IETF), its areas, and its working groups. Note that other
         groups may also distribute working documents as Internet-Drafts.

         Internet-Drafts are draft documents valid for a maximum of six months
         and may be updated, replaced, or obsoleted by other documents at any
         time. It is inappropriate to use Internet-Drafts as reference
         material or to cite them other than as work in progress.

         The list of current Internet-Drafts can be accessed at
         http://www.ietf.org/ietf/1id-abstracts.txt

         The list of Internet-Draft Shadow Directories can be accessed at
         http://www.ietf.org/shadow.html.

      ABSTRACT

         This document describes the RTP payload format for Enhanced Variable
         Rate Codec (EVRC) Speech and Selectable Mode Vocoder (SMV) Speech.
         Two sub-formats are specified for different application scenarios. A
         bundled/interleaved format is included to reduce the effect of packet
         loss on speech quality and amortize the overhead of the RTP header
         over more than one speech frame. A non-bundled format is also
         supported for conversational applications.

      Table of Contents

         1. Introduction
         2. Background
         3. The Codecs Supported
         3.1. EVRC
         3.2. SMV
         3.3. Other Frame-Based Vocoders
         4. RTP/Vocoder Packet Format
         4.1. Type 1 Interleaved/Bundled Packet Format
         4.2. Type 2 Header-Free Packet Format
         4.3. Determining the Format of Packets
         5. Packet Table of Contents Entries and Codec Data Frame Format
         5.1. Packet Table of Contents entries
         5.2. Codec Data Frames
         6. Interleaving Codec Data Frames in Type 1 Packets
         6.1. Finding Interleave Group Boundaries
         6.2. Reconstructing Interleaved Speech
         6.3. Receiving Invalid Interleaving Values
         6.4. Additional Receiver Responsibilities
         7. Bundling Codec Data Frames in Type 1 Packets
         8. Handling Missing Codec Data Frames
         9. Implementation Issues
         9.1. Interleaving Length
         9.2. Validation of Received Packets
         10. Mode Request
         11. Storage Mode
         12. IANA Considerations
         12.1. Registration of Media Type EVRC
         12.2. Registration of Media Type EVRC0
         12.3. Registration of Media Type SMV
         12.4. Registration of Media Type SMV0
         13. Mapping to SDP Parameters
         14. Security Considerations
         15. Adding Support of Other Frame-Based Vocoders
         16. Acknowledgements
         17. References
         18. Authors' Address

      1. Introduction

         This document describes how speech compressed with EVRC [1] or SMV
         [2] may be formatted for use as an RTP payload type.  The format is
         also extensible to other codecs that generate a similar set of frame
         types. Two methods are provided to packetize the codec data frames
         into RTP packets: an interleaved/bundled format and a zero-header
         format. The sender may choose the best format for each application
         scenario, based on network conditions, bandwidth availability, delay
         requirements, and packet-loss tolerance.

         The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
         "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
         document are to be interpreted as described in RFC 2119 [3].

      2. Background

         The 3rd Generation Partnership Project 2 (3GPP2) has published two
         standards which define speech compression algorithms for CDMA
         applications: EVRC [1] and SMV [2]. EVRC is currently deployed in
         millions of first and second generation CDMA handsets. SMV is the
         preferred speech codec standard for CDMA2000, and will be deployed in
         third generation handsets in addition to EVRC. Improvements and new
         codecs will keep emerging as technology improves, and future handsets
         will likely support multiple codecs.

         The formats of the EVRC and SMV codec frames are very similar. Many
         other vocoders also share common characteristics, and have many
         similar application scenarios. This parallelism enables an RTP
         payload format to be designed for EVRC and SMV that may also support
         other, similar vocoders with minimal additional specification work.
         This can simplify the protocol for transporting vocoder data frames
         through RTP and reduce the complexity of implementations.

      3. The Codecs Supported

      3.1. EVRC

         The Enhanced Variable Rate Codec (EVRC) [1] compresses each 20
         milliseconds of 8000 Hz, 16-bit sampled speech input into output
         frames in one of the three different sizes: Rate 1 (171 bits), Rate
         1/2 (80 bits), or Rate 1/8 (16 bits). In addition, there are two zero
         bit codec frame types: null frames and erasure frames. Null frames
         are produced as a result of the vocoder running at rate 0. Null
         frames are zero bits long and are normally not transmitted. Erasure
         frames are the frames substituted by the receiver to the codec for
         the lost or damaged frames. Erasure frames are also zero bits long
         and are normally not transmitted.

         The codec chooses the output frame rate based on analysis of the
         input speech and the current operating mode (either normal or one of
         several reduced rate modes). For typical speech patterns, this
         results in an average output of 4.2 kilobits/second for normal mode
         and a lower average output for reduced rate modes.

      3.2. SMV

         The Selectable Mode Vocoder (SMV) [2] compresses each 20 milliseconds
         of 8000 Hz, 16-bit sampled speech input into output frames of one of
         the four different sizes: Rate 1 (171 bits), Rate 1/2 (80 bits), Rate
         1/4 (40 bits), or Rate 1/8 (16 bits). In addition, there are two zero
         bit codec frame types: null frames and erasure frames. Null frames
         are produced as a result of the vocoder running at rate 0. Null
         frames are zero bits long and are normally not transmitted. Erasure
         frames are the frames substituted by the receiver to the codec for
         the lost or damaged frames. Erasure frames are also zero bits long
         and are normally not transmitted.

         The SMV codec can operate in four modes. Each mode may produce frames
         of any of the rates (full rate to 1/8 rate) for varying percentages
         of time, based on the characteristics of the speech samples and the
         selected mode. The SMV mode can change on a frame-by-frame basis. The
         SMV codec does not need additional information other than the codec
         data frames to correctly decode the data of various modes; therefore,
         the mode of the encoder does not need to be transmitted with the
         encoded frames.

         The percentages of different frame rates and the average data rate
         (ADR) for the four SMV modes are shown in the table below.

                           Mode 0       Mode 1       Mode 2        Mode 3
             -------------------------------------------------------------
             Rate 1        68.90%       38.14%       15.43%        07.49%
             Rate 1/2      06.03%       15.82%       38.34%        46.28%
             Rate 1/4      00.00%       17.37%       16.38%        16.38%
             Rate 1/8      25.07%       28.67%       29.85%        29.85%
             -------------------------------------------------------------
             ADR          7205 bps     5182 bps     4073 bps      3692 bps

         The SMV codec chooses the output frame rate based on an analysis of
         the input speech and the current operating mode. For typical speech
         patterns, this results in an average output of 4.2 kilobits/second
         for Mode 0 in two-way conversation (assuming 50% active speech time
         and 50% in eighth rate while listening) and lower for other reduced
         rate modes.

         SMV is more bandwidth efficient than EVRC. EVRC is equivalent in
         performance to SMV mode 1.
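
         The ADR row and the 4.2 kilobits/second figure above can be
         cross-checked with a short calculation. The sketch below is
         illustrative only. It assumes per-frame channel rates of 9600,
         4800, 2400 and 1200 bits/second for Rate 1, 1/2, 1/4 and 1/8;
         these values are not stated in this document and are an
         assumption chosen because they reproduce the table. The names
         CHANNEL_BPS, MODE_PERCENT, average_data_rate and two_way_average
         are hypothetical.

```python
# Assumed channel rates in bits/second per frame type (NOT from this document).
CHANNEL_BPS = {"1": 9600, "1/2": 4800, "1/4": 2400, "1/8": 1200}

# Frame-rate percentages from the table, indexed by SMV mode 0..3.
MODE_PERCENT = {
    "1":   (68.90, 38.14, 15.43, 7.49),
    "1/2": (6.03, 15.82, 38.34, 46.28),
    "1/4": (0.00, 17.37, 16.38, 16.38),
    "1/8": (25.07, 28.67, 29.85, 29.85),
}


def average_data_rate(mode):
    """Weighted average of the assumed channel rates for one SMV mode."""
    return sum(CHANNEL_BPS[rate] * MODE_PERCENT[rate][mode] / 100.0
               for rate in CHANNEL_BPS)


def two_way_average(mode, active=0.5):
    """Average rate when a fraction `active` of the time is active speech
    and the remainder is spent at eighth rate while listening."""
    return active * average_data_rate(mode) + (1.0 - active) * CHANNEL_BPS["1/8"]
```

         Under these assumptions, average_data_rate() rounds to the ADR
         values in the table for all four modes, and two_way_average(0)
         is approximately 4.2 kilobits/second.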

      3.3. Other Frame-Based Vocoders

         Other frame-based vocoders can be carried in the packet format
         defined in this document, as long as they possess the following
         properties:

          o The codec is frame-based;
          o blank and erasure frames are supported;
          o the total number of rates is less than 17;
          o the maximum full rate frame can be transported in a single RTP
            packet using this specific format.

         Vocoders with the characteristics listed above can be transported
         using the packet format specified in this document with some
         additional specification work; the pieces that must be defined are
         listed in Section 15.

      4. RTP/Vocoder Packet Format

         In the packet format diagrams shown in this document, bit 0 is the
         most significant bit. The vocoder speech data MUST be transmitted in
         RTP packets of one of the following two types.

      4.1. Type 1 Interleaved/Bundled Packet Format

         This format is used to send one or more vocoder frames per packet.
         Interleaving or bundling MAY be used. The RTP packet for this format
         is as follows:

          0                   1                   2                   3
          0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
         +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
         |                      RTP Header [4]                           |
         +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
         |R|R| LLL | NNN | FFF |  Count  |  TOC  |  ...  |  TOC  |padding|
         +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
         |        one or more codec data frames, one per TOC entry       |
         |                             ....                              |
         +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

         The RTP header has the expected values as described in the RTP
         specification [4]. The RTP timestamp is in 1/8000 of a second units
         for EVRC and SMV. For any other vocoders that use this packet format,
         the timestamp unit needs to be defined explicitly. The M bit should
         be set as specified in the applicable RTP profile, for example, RFC
         1890 [5]. Note that RFC 1890 [5] specifies that if the sender does
         not suppress silence, the M bit will always be zero. When multiple
         codec data frames are present in a single RTP packet, the timestamp
         is that of the oldest data represented in the RTP packet. The
         assignment of an RTP payload type for this new packet format is
         outside the scope of this document; it is specified by the RTP
         profile under which this payload format is used.

         The first octet of a Type 1 Interleaved/Bundled format packet is the
         Interleave Octet. The second octet contains the Mode Request and
         Frame Count fields. The Table of Contents (ToC) field then follows.
         The fields are specified as follows:

         Reserved (RR): 2 bits
            Reserved bits. MUST be set to zero by sender, SHOULD be ignored
            by receiver.

         Interleave Length (LLL): 3 bits
            Indicates the length of interleave; a value of 0 indicates
            bundling, a special case of interleaving. See Section 6 and
            Section 7 for more detailed discussion.

         Interleave Index (NNN): 3 bits
            Indicates the index within an interleave group. MUST have a value
            less than or equal to the value of LLL. Values of NNN greater
            than the value of LLL are invalid. Packets with invalid NNN
            values SHOULD be ignored by the receiver.

         Mode Request (FFF): 3 bits
            The Mode Request field is used to signal Mode Request
            information. See Section 10 for details.

         Frame Count (Count): 5 bits
            The number of ToC fields (and therefore vocoder frames) present in
            the packet is the value of the frame count field plus one. A value
            of zero indicates that the packet contains one ToC field, while a
            value of 31 indicates that the packet contains 32 ToC fields.

         Padding (padding): 0 or 4 bits
            This padding ensures that codec data frames start on an octet
            boundary. When the number of ToC entries (the value of the Count
            field plus one) is odd, the sender MUST add 4 bits of padding
            following the last ToC entry. When the number of ToC entries is
            even, the sender MUST NOT add padding bits. If padding is present,
            the padding bits MUST be set to zero by the sender, and SHOULD be
            ignored by the receiver.

         The Table of Contents field (ToC) provides information on the codec
         data frame(s) in the packet. There is one ToC entry for each codec
         data frame. The detailed formats of the ToC field and codec data
         frames are specified in Section 5.

         Multiple data frames may be included within a Type 1
         Interleaved/Bundled packet using interleaving or bundling as
         described in Section 6 and Section 7.
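
         The field layout above can be illustrated with a small parser.
         This is a minimal sketch, not a reference implementation. It
         assumes that the 4 bits of padding are present exactly when the
         number of ToC entries is odd, so that the codec data frames start
         on an octet boundary. The function name parse_type1_header and
         the keys of the returned dictionary are hypothetical.

```python
def parse_type1_header(payload: bytes) -> dict:
    """Parse the payload header and ToC of a Type 1 packet (RTP header removed)."""
    if len(payload) < 2:
        raise ValueError("payload too short for the two header octets")
    lll = (payload[0] >> 3) & 0x07      # Interleave Length: bits 2-4 of octet 1
    nnn = payload[0] & 0x07             # Interleave Index: bits 5-7 of octet 1
    fff = (payload[1] >> 5) & 0x07      # Mode Request: bits 0-2 of octet 2
    num_toc = (payload[1] & 0x1F) + 1   # Count field plus one
    if nnn > lll:
        # Invalid NNN; the caller would normally ignore such a packet.
        raise ValueError("NNN greater than LLL")
    toc_octets = (num_toc + 1) // 2     # two 4-bit entries per octet, padded
    if len(payload) < 2 + toc_octets:
        raise ValueError("payload too short for the ToC")
    toc = []
    for i in range(num_toc):
        octet = payload[2 + i // 2]
        # Even-numbered entries sit in the high nibble, odd ones in the low.
        toc.append(octet >> 4 if i % 2 == 0 else octet & 0x0F)
    return {"LLL": lll, "NNN": nnn, "FFF": fff,
            "toc": toc, "data_offset": 2 + toc_octets}
```

         For example, the four octets 0x09 0x42 0x43 0x10 describe LLL=1,
         NNN=1, FFF=2 and three ToC entries (4, 3, 1) followed by 4 bits
         of padding, so the codec data frames begin at offset 4.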

      4.2. Type 2 Header-Free Packet Format

         The Type 2 Header-Free Packet Format is designed for maximum
         bandwidth efficiency and low latency. Only one codec data frame can
         be sent in each Type 2 Header-Free format packet. None of the payload
         header fields (LLL, NNN, FFF, Count) nor ToC entries are present. The
         codec rate for the data frame can be determined from the length of
         the codec data frame, since there is only one codec data frame in
         each Type 2 Header-Free packet.

         Use of the RTP header fields for Type 2 Header-Free RTP/Vocoder
         Packet Format is the same as described in Section 4.1 for Type 1
         Interleaved/Bundled RTP/Vocoder Packet Format. The detailed format of
         the codec data frame is specified in Section 5.

          0                   1                   2                   3
          0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
         +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
         |                      RTP Header [4]                           |
         +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
         |                                                               |
         +          ONLY one codec data frame            +-+-+-+-+-+-+-+-+
         |                                               |
         +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
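
         Because a Type 2 Header-Free packet carries exactly one codec
         data frame, a receiver can recover the rate from the payload
         length alone. The sketch below illustrates this using the EVRC
         and SMV frame sizes given in Section 5.1. The names
         TYPE2_RATE_BY_LENGTH and type2_frame_rate are hypothetical, and
         rejecting all other lengths is a choice made for this example.

```python
# Codec data frame sizes in octets for EVRC/SMV, mapped to their rates.
TYPE2_RATE_BY_LENGTH = {22: "1", 10: "1/2", 5: "1/4", 2: "1/8"}


def type2_frame_rate(payload: bytes, codec: str = "SMV") -> str:
    """Return the rate of the single frame in a Type 2 payload."""
    rate = TYPE2_RATE_BY_LENGTH.get(len(payload))
    # EVRC has no Rate 1/4 frames, so a 5-octet payload is rejected for it.
    if rate is None or (codec == "EVRC" and rate == "1/4"):
        raise ValueError("unexpected payload length %d" % len(payload))
    return rate
```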

      4.3. Determining the Format of Packets

         All receivers SHOULD be able to process both types of packets. The
         sender MAY choose to use one or both types of packets.

         A receiver MUST have prior knowledge of the packet type to correctly
         decode the RTP packets. The packet types used in an RTP session MUST
         be specified by the sender, and signaled through out-of-band means,
         for example by SDP during the setup of a session.

         When packets of both formats are used within the same session,
         different RTP payload type values MUST be used for each format to
         distinguish the packet formats. The association of payload type
         number with the packet format is done out-of-band, for example by SDP
         during the setup of a session.

      5. Packet Table of Contents Entries and Codec Data Frame Format

      5.1. Packet Table of Contents entries

         Each codec data frame in a Type 1 Interleaved/Bundled packet has a
         corresponding Table of Contents (ToC) entry. The ToC entry indicates
         the rate of the codec frame. (Type 2 Header-Free packets MUST NOT
         have a ToC field, and there is always only one codec data frame in
         each Type 2 Header-Free packet.)

         Each ToC entry occupies four bits. The format of the bits is
         indicated below:

             0 1 2 3
            +-+-+-+-+
            |fr type|
            +-+-+-+-+

         Frame Type: 4 bits
            The frame type indicates the type of the corresponding codec data
            frame in the RTP packet.

         For EVRC and SMV codecs, the frame type values and size of the
         associated codec data frame are described in the table below:

         Value   Rate      Total codec data frame size (in octets)
         ---------------------------------------------------------
           0     Blank      0    (0 bits)
           1     1/8        2    (16 bits)
           2     1/4        5    (40 bits; not valid for EVRC)
           3     1/2       10    (80 bits)
           4     1         22    (171 bits; padded at end with 5 zero bits)
           5     Erasure    0    (SHOULD NOT be transmitted by sender)

         All values not listed in the above table MUST be considered reserved.
         A ToC entry with a reserved Frame Type value SHOULD be considered
         invalid. Note that the EVRC codec does not have 1/4 rate frames, thus
         frame type value 2 MUST be considered a reserved value when the EVRC
         codec is in use.

         Other vocoders that use this packet format need to specify their own
         table of frame types and corresponding codec data frames.
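
         As an illustration, the frame type table can be used to split the
         codec data portion of a Type 1 packet into individual frames. The
         sketch below assumes the ToC has already been parsed into a list
         of frame type values. The names FRAME_OCTETS and
         split_codec_frames are hypothetical, and raising an error on a
         reserved value is only one possible way of treating the entry as
         invalid.

```python
# Frame type value -> codec data frame size in octets (EVRC/SMV).
FRAME_OCTETS = {0: 0, 1: 2, 2: 5, 3: 10, 4: 22, 5: 0}


def split_codec_frames(toc, data: bytes, codec: str = "SMV"):
    """Return a list of (frame_type, frame_bytes), one per ToC entry."""
    frames = []
    pos = 0
    for frame_type in toc:
        reserved = frame_type not in FRAME_OCTETS
        # Frame type 2 (Rate 1/4) is reserved when EVRC is in use.
        if reserved or (codec == "EVRC" and frame_type == 2):
            raise ValueError("reserved frame type %d" % frame_type)
        size = FRAME_OCTETS[frame_type]
        if pos + size > len(data):
            raise ValueError("codec data shorter than the ToC indicates")
        frames.append((frame_type, data[pos:pos + size]))
        pos += size
    return frames
```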

      5.2. Codec Data Frames

         The output of the vocoder MUST be converted into codec data frames
         for inclusion in the RTP payload. The conversions for EVRC and SMV
         codecs are specified below. (Note: Because the EVRC codec does not
         have Rate 1/4 frames, the specifications for Rate 1/4 frames do not
         apply to EVRC codec data frames.) Other vocoders that use this packet
         format need to specify how to convert vocoder output data into
         frames.

         The codec output data bits as numbered in EVRC and SMV are packed
         into octets. The lowest numbered bit (bit 1 for Rate 1, Rate 1/2,
         Rate 1/4 and Rate 1/8) is placed in the most significant bit
         (internet bit 0) of octet 1 of the codec data frame, the second
         lowest bit is placed in the second most significant bit of the first
         octet, the third lowest in the third most significant bit of the
         first octet, and so on. This continues until all of the bits have
         been placed in the codec data frame.

         The remaining unused bits of the last octet of the codec data frame
         MUST be set to zero. Note that in EVRC and SMV this is only
         applicable to Rate 1 frames (171 bits) as the Rate 1/2 (80 bits),
         Rate 1/4 (40 bits, SMV only) and Rate 1/8 frames (16 bits) fit
         exactly into a whole number of octets.
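
         The packing rule above can be sketched as follows. The example
         assumes the codec output is available as a sequence of 0/1 values
         in codec bit order (codec bit 1 first). The function name
         pack_codec_bits is hypothetical.

```python
def pack_codec_bits(bits) -> bytes:
    """Pack codec output bits into octets, most significant bit first.

    Unused bits of the last octet are left as zero.
    """
    out = bytearray((len(bits) + 7) // 8)
    for i, bit in enumerate(bits):
        if bit:
            out[i // 8] |= 0x80 >> (i % 8)
    return bytes(out)
```

         A 171-bit Rate 1 frame therefore occupies 22 octets, with the
         last 5 bits of the final octet set to zero.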

         Following is a detailed listing showing a Rate 1 EVRC/SMV codec
         output frame converted into a codec data frame:

         The codec data frame for an EVRC/SMV codec Rate 1 frame is 22 octets
         long. Bits 1 through 171 from the EVRC/SMV codec Rate 1 frame are
         placed as indicated, with bits marked with "Z" set to zero. EVRC/SMV
         codec Rate 1/8, Rate 1/4 and Rate 1/2 frames are converted similarly,
         but do not require zero padding because they align on octet
         boundaries.

                              Rate 1 codec data frame
          0                   1                   2                   3
          0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
         +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
         |0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|
         |0|0|0|0|0|0|0|0|0|1|1|1|1|1|1|1|1|1|1|2|2|2|2|2|2|2|2|2|2|3|3|3|
         |1|2|3|4|5|6|7|8|9|0|1|2|3|4|5|6|7|8|9|0|1|2|3|4|5|6|7|8|9|0|1|2|
         +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
         :                                                               :
         +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
         |1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1| | | | | |
         |4|4|4|4|4|5|5|5|5|5|5|5|5|5|5|6|6|6|6|6|6|6|6|6|6|7|7|Z|Z|Z|Z|Z|
         |5|6|7|8|9|0|1|2|3|4|5|6|7|8|9|0|1|2|3|4|5|6|7|8|9|0|1| | | | | |
         +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

      6. Interleaving Codec Data Frames in Type 1 Packets

         As indicated in Section 4.1, more than one codec data frame MAY be
         included in a single Type 1 Interleaved/Bundled packet by a sender.
         This is accomplished by interleaving or bundling.

         Bundling is used to spread the transmission overhead of the RTP and
         payload header over multiple vocoder frames. Interleaving
         additionally reduces the listener's perception of data loss by
         spreading such loss over non-consecutive vocoder frames. EVRC, SMV,
         and similar vocoders are able to compensate for an occasional lost
         frame, but speech quality degrades exponentially with consecutive
         frame loss.

         Bundling is signaled by setting the LLL field to zero and the Count
         field to greater than zero. Interleaving is indicated by setting the
         LLL field to a value greater than zero.

         The discussion of general interleaving also applies to bundling,
         which can be viewed as a reduced case of interleaving with reduced
         complexity. The bundling case is discussed in detail in Section 7.

         Senders MAY support interleaving and/or bundling. All receivers MUST
         support interleaving and bundling.

         Given a time-ordered sequence of output frames from the EVRC codec
         numbered 0..n, a bundling value B (the value in the Count field plus
         one), and an interleave length L where n = B * (L+1) - 1, the output
         frames are placed into RTP packets as follows (the values of the
         fields LLL and NNN are indicated for each RTP packet):

         First RTP Packet in Interleave group:
            LLL=L, NNN=0
            Frame 0, Frame L+1, Frame 2(L+1), Frame 3(L+1), ... for a total of
            B frames

         Second RTP Packet in Interleave group:
            LLL=L, NNN=1
            Frame 1, Frame 1+L+1, Frame 1+2(L+1), Frame 1+3(L+1), ... for a
            total of B frames

         This continues to the last RTP packet in the interleave group:

         (L+1)th RTP Packet in Interleave group:
            LLL=L, NNN=L
            Frame L, Frame L+L+1, Frame L+2(L+1), Frame L+3(L+1), ... for a
            total of B frames
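
         The placement rule above can be sketched in a few lines. The
         example assumes the frames of one interleave group are held in a
         list in time order. The function name interleave_group and the
         dictionary keys are hypothetical.

```python
def interleave_group(frames, L, B):
    """Distribute B*(L+1) time-ordered frames over L+1 packets.

    Packet n (NNN=n) carries frames n, n+(L+1), n+2(L+1), ... for a
    total of B frames.
    """
    if len(frames) != B * (L + 1):
        raise ValueError("an interleave group needs exactly B*(L+1) frames")
    return [{"LLL": L, "NNN": n, "frames": frames[n::L + 1]}
            for n in range(L + 1)]
```

         With L=2 and B=2, frames 0..5 are sent as three packets carrying
         frames (0, 3), (1, 4) and (2, 5), so the loss of any single
         packet never removes two consecutive frames.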

         Within each interleave group, the RTP packets making up the
         interleave group MUST be transmitted in value-increasing order of the
         NNN field. While this does not guarantee reduced end-to-end delay on
         the receiving end, when packets are delivered in order by the
         underlying transport, delay will be reduced to the minimum possible.

         Receivers MAY signal the maximum number of codec data frames (i.e.,
         the maximum acceptable bundling value B) they can handle in a single
         RTP packet using the OPTIONAL maxptime RTP mode parameter identified
         in Section 12.

         Receivers MAY signal the maximum interleave length (i.e., the maximum
         acceptable LLL value in the Interleaving Octet) they will accept
         using the OPTIONAL maxinterleave RTP mode parameter identified in
         Section 12.

         The parameters maxptime and maxinterleave are exchanged at the
         initial setup of the session. In one-to-one sessions, the sender MUST
         respect these values set by the receiver, and MUST NOT
         interleave/bundle more packets than what the receiver signals that it
         can handle. This ensures that the receiver can allocate a known
         amount of buffer space that will be sufficient for all
         interleaving/bundling used in that session. During the session, the
         sender may decrease the bundling value or interleaving length (so
         that less buffer space is required at the receiver), but never exceed
         the maximum value set by the receiver. This prevents the situation
         where a receiver needs to allocate more buffer space in the middle of
         a session but is unable to do so.

         Additionally, senders have the following restrictions:

         o  MUST NOT bundle more codec data frames in a single RTP packet than
            indicated by maxptime (see Section 12) if it is signaled.

         o  SHOULD NOT bundle more codec data frames in a single RTP packet
            than will fit in the MTU of the underlying network.

         o  Once beginning a session with a given maximum interleaving value
            set by maxinterleave in Section 12, MUST NOT increase the
            interleaving value (LLL) to exceed the maximum interleaving value
            that is signaled.

         o  MAY change the interleaving value, but MUST do so only between
            interleave groups.

         o  Silence suppression MAY only be used between interleave groups. A
            ToC with Frame Type 0 (Blank Frame, Section 5.1) MUST be used
            within interleaving groups if the codec outputs a blank frame.
            The M bit in the RTP header is not set for these blank frames,
            as the stream is continuous in time. Because there is only one
            time stamp for each RTP packet, silence suppression used within
            an interleave group would cause ambiguities when reconstructing
            the speech at the receiver side, and thus is prohibited.

      6.1. Finding Interleave Group Boundaries

         Given an RTP packet with sequence number S, interleave length (field
         LLL) L, interleave index value (field NNN) N, and bundling value B,
         the interleave group consists of this RTP packet and other RTP
         packets with sequence numbers from (S-N) mod 65536 to (S-N+L) mod
         65536 inclusive. In other words, the interleave group always consists
         of L+1 RTP packets with sequential sequence numbers. The bundling
         value for all RTP packets in an interleave group MUST be the same.

         The receiver determines the expected bundling value for all RTP
         packets in an interleave group by the number of codec data frames
         bundled in the first RTP packet of the interleave group received.
         Note that this may not be the first RTP packet of the interleave
         group if packets are delivered out of order by the underlying
         transport.
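
         The sequence numbers that make up an interleave group can be
         computed directly from any one packet of the group, including
         across the 16-bit wrap-around. A minimal sketch follows; the
         function name interleave_group_seqs is hypothetical.

```python
def interleave_group_seqs(S, L, N):
    """Return the L+1 RTP sequence numbers of the interleave group that
    contains the packet with sequence number S and interleave index N."""
    first = (S - N) % 65536          # Python's % always yields 0..65535 here
    return [(first + i) % 65536 for i in range(L + 1)]
```

         For example, a packet with S=1, L=3 and N=2 belongs to the group
         with sequence numbers 65535, 0, 1 and 2.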

   On receipt of an RTP packet in an interleave group with other than
   the expected bundling value,

      6.2. Additional Receiver Responsibilities

         Assume that the receiver MAY discard codec data
   frames off the end of the RTP packet or add erasure codec data frames
   to the end of the packet in order to manufacture a substitute packet
   with the expected bundling value.  The receiver MAY instead choose to
   discard the whole interleave group.

      6.2. Reconstructing Interleaved Speech

         Given an RTP sequence number ordered set of RTP packets in an
         interleave group numbered 0..L, where L is the interleave length and
         B is the bundling value, and codec data frames within each RTP packet
         that are numbered in order from first to last with the numbers
         0..B-1, the original, time-ordered sequence of output frames from
         the codec may be reconstructed as follows:

         First L+1 frames:
            Frame 0 from packet 0 of interleave group
            Frame 0 from packet 1 of interleave group
            And so on up to...
            Frame 0 from packet L of interleave group

         Second L+1 frames:
            Frame 1 from packet 0 of interleave group
            Frame 1 from packet 1 of interleave group
            And so on up to...
            Frame 1 from packet L of interleave group

         And so on up to...

         Bth L+1 frames:
            Frame B-1 from packet 0 of interleave group
            Frame B-1 from packet 1 of interleave group
            And so on up to...
            Frame B-1 from packet L of interleave group
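
         The reconstruction order described above can be sketched in Python
         as follows. This is an illustration only: the function name, the
         list-based packet representation, and the use of None as an
         erasure placeholder are assumptions, and frames are indexed from
         zero within each packet.

```python
ERASURE = None  # hypothetical stand-in for an erasure frame


def deinterleave(group, bundling):
    """Rebuild the time-ordered frame sequence of one interleave group.

    `group` lists the L+1 packets in RTP sequence number order; each
    entry is a list of `bundling` frames, or None if that packet was
    not received.
    """
    output = []
    for frame_index in range(bundling):      # one pass per set of L+1 frames
        for packet in group:                 # packet 0 .. packet L
            if packet is None:
                output.append(ERASURE)       # lost packet: substitute erasure
            else:
                output.append(packet[frame_index])
    return output
```

         With L = 2 and B = 2, the packets [f0, f3], [f1, f4], [f2, f5]
         yield the frames f0 through f5 in order. A lost packet produces
         isolated erasures rather than consecutive ones.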

      6.3. Receiving Invalid Interleaving Values

         On receipt of an RTP packet with an invalid value of the LLL or NNN
         fields, the RTP packet SHOULD be treated as lost by the receiver for
         the purpose of generating erasure frames as described in Section 8.

      6.4. Additional Receiver Responsibilities

         Assume that the receiver has begun playing frames from an interleave
         group. The time has come to play frame x from packet n of the
         interleave group. Further assume that packet n of the interleave
         group has not been received. As described in Section 8, an erasure
         frame will be sent to the receiving vocoder.

         Now, assume that packet n of the interleave group arrives before
         frame x+1 of that packet is needed. Receivers SHOULD use frame x+1 of
         the newly received packet n rather than substituting an erasure
         frame. In other words, just because packet n was not available the
         first time it was needed to reconstruct the interleaved speech, the
         receiver SHOULD NOT assume it is not available when it is
         subsequently needed for interleaved speech reconstruction.
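
         One way to satisfy this is to look up the packet each time a frame
         is needed, instead of remembering that it was missing earlier. The
         Python sketch below is a minimal illustration; the class and method
         names are hypothetical, None stands in for an erasure frame, and
         frames are indexed from zero.

```python
ERASURE = None  # hypothetical stand-in for an erasure frame


class InterleaveGroupBuffer:
    """Holds the packets of one interleave group as they arrive."""

    def __init__(self, interleave_length, bundling):
        self.packets = [None] * (interleave_length + 1)
        self.bundling = bundling

    def store(self, n, frames):
        """Record packet n of the group, whenever it arrives."""
        if len(frames) != self.bundling:
            raise ValueError("unexpected bundling value")
        self.packets[n] = list(frames)

    def frame(self, x, n):
        """Return frame x of packet n, or an erasure if packet n is
        not available at the moment the frame is needed."""
        packet = self.packets[n]
        return ERASURE if packet is None else packet[x]
```

         Because availability is checked at play-out time, a packet that
         arrives late still contributes its remaining frames.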

      7. Bundling Codec Data Frames in Type 1 Packets

         As discussed in Section 6, the bundling of codec data frames is a
         special reduced case of interleaving with LLL value in the Interleave
         Octet set to 0.

         Bundling codec data frames means that multiple codec data frames are
         included consecutively in a packet, because the interleaving length
         (LLL) is 0. The interleaving group is thus reduced to a single RTP
         packet, and the reconstruction of the codec data frames from RTP
         packets becomes a much simpler process.

         Furthermore, the additional restrictions on senders are reduced to:

         o  MUST NOT bundle more codec data frames in a single RTP packet than
            indicated by maxptime (see Section 12) if it is signaled.

         o  SHOULD NOT bundle more codec data frames in a single RTP packet
            than will fit in the MTU of the underlying network.
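
         As a rough illustration of the first restriction, the frame limit
         implied by maxptime follows from the 20 msec frame duration and the
         200 msec default given in Section 12. The Python helper below is a
         hypothetical sketch and does not address the MTU restriction.

```python
FRAME_MS = 20             # duration of a single codec data frame
DEFAULT_MAXPTIME_MS = 200  # default when maxptime is not signaled


def max_bundled_frames(maxptime_ms=None):
    """Largest number of codec data frames a sender may bundle in one
    RTP packet for the given maxptime (in milliseconds)."""
    if maxptime_ms is None:
        maxptime_ms = DEFAULT_MAXPTIME_MS
    if maxptime_ms < FRAME_MS:
        raise ValueError("maxptime is shorter than one codec data frame")
    # Round down so the packet never represents more than maxptime.
    return maxptime_ms // FRAME_MS
```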

      8. Handling Missing Codec Data Frames

         The vocoders covered by this payload format support erasure frames
         as an indication that frames are not available. The erasure frames
         are normally used internally by a receiver to advance the state of
         the voice decoder by exactly one frame time for each missing frame.
         Using the information from the packet sequence number, time stamp,
         and the M bit, the receiver can detect missing codec data frames
         from RTP packet loss and/or silence suppression, and generate
         corresponding erasure frames. Erasure frames MUST also be used in
         storage mode to record missing frames.
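
         A simplified Python sketch of loss detection from sequence numbers
         follows. It is an illustration under stated assumptions, not a
         complete receiver: packets are processed in sequence order, no
         interleaving is in use (LLL is 0), every lost packet is assumed to
         have carried the same number of frames, and silence suppression
         (time stamp and M bit) is ignored. All names are hypothetical.

```python
ERASURE = None  # hypothetical stand-in for an erasure frame


def lost_packets(prev_seq, seq):
    """Number of packets missing between two consecutively processed
    RTP sequence numbers, allowing for 16-bit wraparound."""
    return (seq - prev_seq - 1) % 65536


def erasure_frames_for_gap(prev_seq, seq, bundling):
    """Erasure frames to feed the decoder so that its state advances
    by one frame time per missing frame."""
    return [ERASURE] * (lost_packets(prev_seq, seq) * bundling)
```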

      9. Implementation Issues

      9.1. Interleaving Length

         The vocoder interpolates the missing speech content when given an
         erasure frame. However, the best quality is perceived by the listener
         when erasure frames are not consecutive. This makes interleaving
         desirable as it increases speech quality when packet loss occurs.

         On the other hand, interleaving can greatly increase the end-to-end
         delay. Where an interactive session is desired, either Type 1
         Interleaved/Bundled with interleaving length (field LLL) 0 or Type 2
         Header-Free RTP payload types are RECOMMENDED.

         When end-to-end delay is not a primary concern, an interleaving
         length (field LLL) of 4 or 5 is RECOMMENDED as it offers a reasonable
         compromise between robustness and latency.

         The parameters maxptime and maxinterleave are exchanged at the
         initial setup of the session so that the receiver can allocate a
         known amount of buffer space that will be sufficient for all future
         reception in that session. During the session, the sender may
         decrease the bundling value or interleaving length (so that less
         buffer space is required at the receiver), but never require more
         buffer space. This prevents the situation where the receiver needs to
         allocate more buffer space in the middle of a session but is unable
         to do so.

      9.2. Validation of Received Packets

         When receiving an RTP packet, the receiver SHOULD check the validity
         of the ToC fields and match the length of the packet with what is
         indicated by the ToC fields. If any invalidity or mismatch is
         detected, it is RECOMMENDED to discard the received packet to avoid
         potential severe degradation of the speech quality. The discarded
         packet is treated following the same procedure as a lost packet, and
         the discarded data will be replaced with erasure frames.

         On receipt of an RTP packet with an invalid value of the LLL or NNN
         fields, the RTP packet SHOULD be treated as lost by the receiver for
         the purpose of generating erasure frames as described in Section 8.

         On receipt of an RTP packet in an interleave group with other than
         the expected frame count value, the receiver MAY discard codec data
         frames off the end of the RTP packet or add erasure codec data frames
         to the end of the packet in order to manufacture a substitute packet
         with the expected bundling value. The receiver MAY instead choose to
         discard the whole interleave group.
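
         The option of manufacturing a substitute packet with the expected
         bundling value can be sketched in Python as below. The function
         name and the use of None as an erasure placeholder are assumptions
         made for illustration; discarding the whole interleave group
         remains an equally permitted choice.

```python
ERASURE = None  # hypothetical stand-in for an erasure frame


def normalize_bundle(frames, expected):
    """Return a frame list with exactly `expected` entries.

    Extra frames are discarded off the end of the packet; missing
    frames are filled in with erasure frames at the end.
    """
    frames = list(frames)
    if len(frames) > expected:
        return frames[:expected]
    return frames + [ERASURE] * (expected - len(frames))
```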

      10. Mode Request

         The Mode Request signal requests a particular encoding mode for the
         speech encoding in the reverse direction. All implementations are
         RECOMMENDED to honor the Mode Request signal. The Mode Request signal
         SHOULD only be used in one-to-one sessions. In multiparty sessions,
         any received Mode Request signals SHOULD be ignored.

         In addition, the Mode Request signal MAY also be sent through non-RTP
         means, which is out of the scope of this specification.

         The three-bit Mode Request field is used to signal the receiver to
         set a particular encoding mode to its audio encoder. If the Mode
         Request field is set to a non-zero value in RTP packets from node A
         to node B, it is a request for node B to change to the requested
         encoding mode for its audio encoder and therefore the bit rate of the
         RTP stream from node B to node A. Once a node sets this field to a
         non-zero value it SHOULD continue to set the field to the same value
         in subsequent packets until the requested mode has changed. This
         design helps to eliminate the scenario of getting the codec stuck in
         an unintended state if one of the packets that carries the Mode
         Request is lost. An otherwise silent node MAY send an RTP packet
         containing a blank frame in order to send a Mode Request.

         Each codec type using this format SHOULD define its own
         interpretation of the Mode Request field. Codecs SHOULD follow the
         convention that higher values of the three-bit field correspond to an
         equal or lower average output bit rate.

         For the EVRC codec, the Mode Request field MUST be interpreted
         according to Tables 2.2.1.2-1 and 2.2.1.2-2 of the EVRC codec
         specifications [1].  Values above '100' (4) are currently reserved.
         If an unknown value above '100' (4) is received, it MUST be handled
         as if '100' (4) were received, for interoperability with potential
         future revisions.

         For the SMV codec, the Mode Request field MUST be interpreted
         according to Table 2.2-2 of the SMV codec specifications [2]. Values
         above '101' (5) are currently reserved. If an unknown value above
         '101' (5) is received, it MUST be handled as if '101' (5) were
         received, also for interoperability with potential future revisions.
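
         The handling of reserved Mode Request values for the two codecs can
         be sketched in Python as follows. The function and table names are
         hypothetical; the limits 4 (EVRC) and 5 (SMV) are the ones stated
         above.

```python
# Highest defined Mode Request value per codec.
MAX_DEFINED_MODE = {"EVRC": 4, "SMV": 5}


def effective_mode_request(codec, value):
    """Map a received three-bit Mode Request value to the value the
    receiver acts on; reserved values are handled as the highest
    defined value."""
    if not 0 <= value <= 7:
        raise ValueError("Mode Request is a three-bit field")
    return min(value, MAX_DEFINED_MODE[codec])
```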

      11. Storage Mode

         The storage mode is used for storing speech frames, e.g., as a file
         or e-mail attachment.

         The file begins with a magic number to identify the vocoder that is
         used. The magic number for EVRC corresponds to the ASCII character
         string "#!EVRC\n", i.e., "0x23 0x21 0x45 0x56 0x52 0x43 0x0A" in
         network byte order. The magic number for SMV corresponds to the ASCII
         character string "#!SMV\n", i.e., "0x23 0x21 0x53 0x4d 0x56 0x0a" in
         network byte order.

         The codec data frames are stored in consecutive order, with a single
         ToC entry field, extended to one octet, prefixing each codec data
         frame. The ToC field is extended to one octet by setting the four
         most significant bits of the octet to zero. For example, a ToC value
         of 4 (a full-rate frame) is stored as 0x04.

         Speech frames lost in transmission and non-received frames MUST be
         stored as erasure frames (frame type 5, see definition in Section
         5.1) to maintain synchronization with the original media.
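
         A minimal Python sketch of a storage mode writer is shown below for
         illustration. The function name and the (ToC, data) tuple
         representation are hypothetical, and the sketch assumes that an
         erasure frame carries no codec data octets. It does not check that
         the data length matches the frame type.

```python
import io

MAGIC = {"EVRC": b"#!EVRC\n", "SMV": b"#!SMV\n"}
ERASURE_TOC = 5  # frame type 5 marks an erasure frame


def write_storage(stream, codec, frames):
    """Write frames in storage mode.

    `frames` is an iterable of (toc, data) tuples; a lost or
    non-received frame is passed as None and stored as an erasure.
    """
    stream.write(MAGIC[codec])
    for frame in frames:
        if frame is None:
            toc, data = ERASURE_TOC, b""   # assumption: no data octets
        else:
            toc, data = frame
        if not 0 <= toc <= 0x0F:
            raise ValueError("ToC value must fit in four bits")
        # One octet per ToC entry, four most significant bits zero.
        stream.write(bytes([toc]))
        stream.write(data)
```

         Writing to an io.BytesIO object is a convenient way to inspect the
         resulting octets before committing them to a file.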

      12. IANA Considerations

         Two new MIME sub-types as described in this section are to be
         registered.

         The MIME-names for the EVRC and SMV codec are allocated from the IETF
         tree since all the vocoders covered are expected to be widely used
         for Voice-over-IP applications.

      12.1. Registration of Media Type EVRC

         Media Type Name:           audio

         Media Subtype Name:        EVRC
            Type 1 Interleaved/Bundled packet format for EVRC

         Required Parameter:        none

         Optional parameters:
            The following parameters apply to RTP mode only.

            ptime:    Defined as usual for RTP audio [6].

            maxptime: The maximum amount of media which can be encapsulated
               in each packet, expressed as time in milliseconds. The time
               SHALL be calculated as the sum of the time the media present
               in the packet represents. The time SHOULD be a multiple of the
               duration of a single codec data frame (20 msec). If not
               signaled, the default maxptime value SHALL be 200
               milliseconds.

            maxinterleave: Maximum number for interleaving length (field LLL
               in the Interleaving Octet). The interleaving lengths used in
               the entire session MUST NOT exceed this maximum value. If not
               signaled, the maxinterleave length SHALL be 5.

         Encoding considerations:
            For RTP mode, see Section 6 and Section 7 of RFC xxxx.
            For storage mode, see Section 11 of RFC xxxx.

         Security considerations:
            See Section 14 "Security Considerations" of RFC xxxx.

         Public specification:
            RFC xxxx.

         Additional information:
            The following information applies to storage mode only.

            Magic number: #!EVRC\n
            File extensions: evc, EVC
            Macintosh file type code: none
            Object identifier or OID: none

         Intended usage:
            COMMON. It is expected that many VoIP applications (as well as
            mobile applications) will use this type.

         Person & email address to contact for further information:
            Adam Li
            adamli@icsl.ucla.edu

         Author/Change controller:
            Adam Li
            adamli@icsl.ucla.edu
            IETF Audio/Video Transport Working Group

      12.2. Registration of Media Type EVRC0

         Media Type Name:           audio

         Media Subtype Name:        EVRC0
            Type 2 Header-Free packet format for EVRC

         Required Parameter:        none

         Optional parameters:       none

         Encoding considerations:  none

         Security considerations:
            See Section 14 "Security Considerations" of RFC xxxx.

         Public specification:
            RFC xxxx.

         Additional information:   none

         Intended usage:
            COMMON. It is expected that many VoIP applications (as well as
            mobile applications) will use this type.

         Person & email address to contact for further information:
            Adam Li
            adamli@icsl.ucla.edu

         Author/Change controller:
            Adam Li
            adamli@icsl.ucla.edu
            IETF Audio/Video Transport Working Group

      12.3. Registration of Media Type SMV

         Media Type Name:           audio

         Media Subtype Name:        SMV
            Type 1 Interleaved/Bundled packet format for SMV

         Required Parameter:        none

         Optional parameters:
            The following parameters apply to RTP mode only.

            ptime:    Defined as usual for RTP audio [6].

            maxptime: The maximum amount of media which can be encapsulated
               in each packet, expressed as time in milliseconds. The time
               SHALL be calculated as the sum of the time the media present
               in the packet represents. The time SHOULD be a multiple of the
               duration of a single codec data frame (20 msec). If not
               signaled, the default maxptime value SHALL be 200
               milliseconds.

            maxinterleave: Maximum number for interleaving length (field LLL
               in the Interleaving Octet). The interleaving lengths used in
               the entire session MUST NOT exceed this maximum value. If not
               signaled, the maxinterleave length SHALL be 5.

         Encoding considerations:
            For RTP mode, see Section 6 and Section 7 of RFC xxxx.
            For storage mode, see Section 11 of RFC xxxx.

         Security considerations:
            See Section 14 "Security Considerations" of RFC xxxx.

         Public specification:
            RFC xxxx.

         Additional information:
            The following information applies to storage mode only.

            Magic number: #!SMV\n
            File extensions: smv, SMV
            Macintosh file type code: none
            Object identifier or OID: none

         Intended usage:
            COMMON. It is expected that many VoIP applications (as well as
            mobile applications) will use this type.

         Person & email address to contact for further information:
            Adam Li
            adamli@icsl.ucla.edu

         Author/Change controller:
            Adam Li
            adamli@icsl.ucla.edu
            IETF Audio/Video Transport Working Group

      12.4. Registration of Media Type SMV0

         Media Type Name:           audio

         Media Subtype Name:        SMV0
            Type 2 Header-Free packet format for SMV

         Required Parameter:        none

         Optional parameters:       none

         Encoding considerations:   none

         Security considerations:
            See Section 14 "Security Considerations" of RFC xxxx.

         Public specification:
            RFC xxxx.

         Additional information:    none

         Intended usage:
            COMMON. It is expected that many VoIP applications (as well as
            mobile applications) will use this type.

         Person & email address to contact for further information:
            Adam Li
            adamli@icsl.ucla.edu

         Author/Change controller:
            Adam Li
            adamli@icsl.ucla.edu
            IETF Audio/Video Transport Working Group

      13. Mapping to SDP Parameters

         Please note that this section applies to the RTP mode only.

         The information carried in the MIME media type specification has a
         specific mapping to fields in the Session Description Protocol (SDP)
         [6], which is commonly used to describe RTP sessions. When SDP is
         used to specify sessions employing the EVRC or SMV codec, the mapping
         is as follows:

            o The MIME type ("audio") goes in SDP "m=" as the media name.

            o The MIME subtype (payload format name) goes in SDP "a=rtpmap"
              as the encoding name.

            o The parameters "ptime" and "maxptime" go in the SDP "a=ptime"
              and "a=maxptime" attributes, respectively.

            o Any remaining parameters go in the SDP "a=fmtp" attribute by
              copying them directly from the MIME media type string as a
              semicolon separated list of parameter=value pairs.

         Some examples of SDP session descriptions for EVRC and SMV encodings
         follow below.

         Example of usage of EVRC:

           m=audio 49120 RTP/AVP 97
           a=rtpmap:97 EVRC/8000
           a=fmtp:97 maxinterleave=2
           a=maxptime:80

         Example of usage of SMV:

           m=audio 49122 RTP/AVP 99
           a=rtpmap:99 SMV0/8000

         Note that the payload format (encoding) names are commonly shown in
         upper case. MIME subtypes are commonly shown in lower case. These
         names are case-insensitive in both places. Similarly, parameter names
         are case-insensitive both in MIME types and in the default mapping to
         the SDP a=fmtp attribute.
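
         The mapping rules above can be sketched as a small Python helper
         that emits the SDP lines for one media stream. This is an
         illustration only: the function name and argument names are
         hypothetical, and the 8000 Hz clock rate in the rtpmap line is an
         assumption of this sketch rather than something stated in this
         section.

```python
def sdp_lines(port, payload_type, subtype, clock_rate=8000,
              ptime=None, maxptime=None, fmtp=None):
    """Build SDP lines for an audio stream using a dynamic payload type.

    `fmtp` is an optional dict of the remaining MIME parameters,
    for example {"maxinterleave": 2}.
    """
    lines = [
        f"m=audio {port} RTP/AVP {payload_type}",                # media name
        f"a=rtpmap:{payload_type} {subtype}/{clock_rate}",       # encoding name
    ]
    if fmtp:
        # Remaining parameters: semicolon separated parameter=value pairs.
        params = "; ".join(f"{name}={value}" for name, value in fmtp.items())
        lines.append(f"a=fmtp:{payload_type} {params}")
    if ptime is not None:
        lines.append(f"a=ptime:{ptime}")
    if maxptime is not None:
        lines.append(f"a=maxptime:{maxptime}")
    return lines
```

         Calling sdp_lines(49120, 97, "EVRC", maxptime=80,
         fmtp={"maxinterleave": 2}) produces an "m=" line, an "a=rtpmap"
         line, an "a=fmtp" line, and an "a=maxptime" line, in that order.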

      14. Security Considerations

         RTP packets using the payload format defined in this specification
         are subject to the security considerations discussed in the RTP
         specification [4], and any appropriate profile (for example [5]).
         This implies that confidentiality of the media streams is achieved by
         encryption. Because the data compression used with this payload
         format is applied end-to-end, encryption may be performed after
         compression so there is no conflict between the two operations.

         A potential denial-of-service threat exists for data encoding using
         compression techniques that have non-uniform receiver-end
         computational load. The attacker can inject pathological datagrams
         into the stream which are complex to decode and cause the receiver to
         become overloaded. However, the encodings covered in this document do
         not exhibit any significant non-uniformity.

         As with any IP-based protocol, in some circumstances, a receiver may
         be overloaded simply by the receipt of too many packets, either
         desired or undesired. Network-layer authentication may be used to
         discard packets from undesired sources, but the processing cost of
         the authentication itself may be too high. In a multicast
         environment, pruning of specific sources may be implemented in
         future versions of IGMP [7] and in multicast routing protocols to
         allow a receiver to select which sources are allowed to reach it.

         Interleaving MAY affect encryption. Depending on the encryption
         scheme used, there MAY be restrictions on, for example, the time
         when keys can be changed. Specifically, the key change may need to
         occur at the boundary between interleave groups.

      15. Adding Support of Other Frame-Based Vocoders

         As described above, the RTP packet format defined in this document is
         very flexible and designed to be usable by other frame-based
         vocoders.

         Additional vocoders using this format MUST have properties as
         described in Section 3.3.

         For an eligible vocoder to use the payload format mechanisms defined
         in this document, a new RTP payload format document needs to be
         published as an RFC. That document can simply refer to this document
         and then specify the following parameters:

          o Define the unit used for RTP time stamp;
          o Define the meaning of the Mode Request bits;
          o Define corresponding codec data frame type values for ToC;
          o Define the conversion procedure for the vocoder's output data
            frames;
          o Define a magic number for storage mode, and complete the
            corresponding MIME registration.

      16. Acknowledgements

         The following authors have made significant contributions to this
         document: Adam H. Li, John D. Villasenor, Dong-Seek Park, Jeong-Hoon
         Park, Keith Miller, S. Craig Greer, David Leon, Nikolai Leung,
         Marcello Lioy, Kyle J. McKay, Magdalena L. Espelien, Randall Gellens,
         Tom Hiller, Peter J. McCann, Stinson S. Mathai, Michael D. Turner,
         Ajay Rajkumar, Dan Gal, Magnus Westerlund, Lars-Erik Jonsson, Greg
         Sherwood, and Thomas Zeng.

      17. References

         [1]  3GPP2 C.S0014, "Enhanced Variable Rate Codec, Speech Service
              Option 3 for Wideband Spread Spectrum Digital Systems", January
              1997.

         [2]  3GPP2 C.S0030-0 v2.0, "Selectable Mode Vocoder, Service Option
              for Wideband Spread Spectrum Communication Systems", May 2002.

         [3]  Bradner, S., "Key words for use in RFCs to Indicate Requirement
              Levels", BCP 14, RFC 2119, March 1997.

         [4]  Schulzrinne, H., Casner, S., Frederick, R. and V. Jacobson,
              "RTP:  A Transport Protocol for Real-Time Applications", RFC
              1889, January 1996.

         [5]  Schulzrinne, H., "RTP Profile for Audio and Video Conferences
              with Minimal Control", RFC 1890, January 1996.

         [6]  Handley, M. and V. Jacobson, "SDP: Session Description Protocol",
              RFC 2327, April 1998.

         [7]  Deering, S., "Host Extensions for IP Multicasting", STD 5, RFC
              1112, August 1989.

      18. Authors' Address

         The editor will serve as the point of contact for technical issues.

         Adam H. Li
         Image Communication Lab
         Electrical Engineering Department
         University of California
         Los Angeles, CA 90095
         USA
         Phone: +1 310 825 5178
         Email: adamli@icsl.ucla.edu