draft-ietf-ecm-cm-01.txt   draft-ietf-ecm-cm-02.txt 
Internet Engineering Task Force Hari Balakrishnan Internet Engineering Task Force Hari Balakrishnan
INTERNET DRAFT MIT LCS INTERNET DRAFT MIT LCS
Document: draft-ietf-ecm-cm-01.txt Srinivasan Seshan Document: draft-ietf-ecm-cm-02.txt Srinivasan Seshan
CMU CMU
July, 2000 October, 2000
Expires: January 2001
The Congestion Manager The Congestion Manager
Status of this Memo Status of this Memo
This document is an Internet-Draft and is in full conformance with This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC-2026 [Bradner96]. all provisions of Section 10 of RFC-2026 [Bradner96].
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet- other groups may also distribute working documents as Internet-
skipping to change at line 32 skipping to change at line 29
as reference material or to cite them other than as "work in as reference material or to cite them other than as "work in
progress." progress."
The list of current Internet-Drafts can be accessed at The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html. http://www.ietf.org/shadow.html.
1. Abstract 1. Abstract
This document describes the Congestion Manager (CM), an end-system This document describes the Congestion Manager (CM), an end-system
module that (i) enables an ensemble of multiple concurrent streams module that:
from a sender destined to the same receiver and sharing the same
(i) Enables an ensemble of multiple concurrent streams from a
sender destined to the same receiver and sharing the same
congestion properties to perform proper congestion avoidance and congestion properties to perform proper congestion avoidance and
control, and (ii) allows applications to easily adapt to network control, and
congestion. This CM framework integrates congestion management
across all applications and transport protocols. The CM maintains (ii) Allows applications to easily adapt to network congestion.
congestion parameters (available aggregate and per-stream bandwidth,
per-receiver round-trip times, etc.) and exports an API that The framework described in this document integrates congestion
enables applications to learn about network characteristics, pass management across all applications and transport protocols. The CM
information to the CM, share congestion information with each maintains congestion parameters (available aggregate and per-stream
bandwidth, per-receiver round-trip times, etc.) and exports an API
that enables applications to learn about network characteristics,
pass information to the CM, share congestion information with each
other, and schedule data transmissions. This document focuses on other, and schedule data transmissions. This document focuses on
applications and transport protocols with their own independent applications and transport protocols with their own independent
per-byte or per-packet sequence number information, and does not per-byte or per-packet sequence number information, and does not
require modifications to the receiver protocol stack. The require modifications to the receiver protocol stack. However, the
receiving application must provide feedback to the sending receiving application must provide feedback to the sending
application about received packets and losses, and the latter uses application about received packets and losses, and the latter is
the CM API to update CM state. This document does not address expected to use the CM API to update CM state. This document does
networks with reservations or service discrimination. not address networks with reservations or service differentiation.
2. Conventions used in this document: 2. Conventions used in this document:
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
this document are to be interpreted as described in RFC-2119 this document are to be interpreted as described in RFC-2119
[Bradner97]. [Bradner97].
STREAM STREAM
A group of packets that all share the same source and A group of packets that all share the same source and
destination IP address, IP type-of-service, transport destination IP address, IP type-of-service, transport
protocol, and source and destination transport port numbers. protocol, and source and destination transport-layer port
numbers.
FLOW
Identical to STREAM.
MACROFLOW MACROFLOW
A group of streams that all use the same congestion management A group of streams that all use the same congestion management
and scheduling algorithms, and share congestion state and scheduling algorithms, and share congestion state
information. Currently, streams destined to different information. Currently, streams destined to different
receivers belong to different macroflows. Streams destined to receivers belong to different macroflows. Streams destined to
the same receiver MAY belong to different macroflows. Streams the same receiver MAY belong to different macroflows. Streams
that experience identical congestion behavior in the Internet that experience identical congestion behavior in the Internet
and use the same congestion control algorithm SHOULD belong to and use the same congestion control algorithm SHOULD belong to
the same macroflow. the same macroflow.
skipping to change at line 120 skipping to change at line 120
return value is expected from a call. Pointers are referred to return value is expected from a call. Pointers are referred to
using "*" syntax, following C language convention. using "*" syntax, following C language convention.
We emphasize that all the API functions described in this We emphasize that all the API functions described in this
document are "abstract" calls and that conformant CM document are "abstract" calls and that conformant CM
implementations may differ in specific implementation details. implementations may differ in specific implementation details.
3. Introduction 3. Introduction
The CM is an end-system module that enables an ensemble of multiple The CM is an end-system module that enables an ensemble of multiple
concurrent streams to perform proper congestion avoidance and concurrent streams to perform stable congestion avoidance and
control, and allows applications to easily adapt their control, and allows applications to easily adapt their
transmissions to prevailing network conditions. It integrates transmissions to prevailing network conditions. It integrates
congestion management across all applications and transport congestion management across all applications and transport
protocols. It maintains congestion parameters (available aggregate protocols. It maintains congestion parameters (available aggregate
and per-stream bandwidth, per-receiver round-trip times, etc.) and and per-stream bandwidth, per-receiver round-trip times, etc.) and
exports an API that enables applications to learn about network exports an API that enables applications to learn about network
characteristics, pass information to the CM, share congestion characteristics, pass information to the CM, share congestion
information with each other, and schedule data transmissions. All information with each other, and schedule data transmissions. All
data transmissions MUST be done with the explicit consent of the CM data transmissions MUST be done with the explicit consent of the CM
via this API to ensure proper congestion behavior. via this API to ensure proper congestion behavior.
skipping to change at line 146 skipping to change at line 146
per-byte or per-packet sequence number information, and use the per-byte or per-packet sequence number information, and use the
CM API to update internal state in the CM. CM API to update internal state in the CM.
2. Networks are best-effort without service discrimination or 2. Networks are best-effort without service discrimination or
reservations. In particular, it does not address situations reservations. In particular, it does not address situations
where different streams between the same pair of hosts traverse where different streams between the same pair of hosts traverse
paths with differing characteristics. paths with differing characteristics.
The Congestion Manager framework can be extended to support The Congestion Manager framework can be extended to support
applications that do not provide their own feedback and to applications that do not provide their own feedback and to
differentially served networks. These extensions will be addressed differentially-served networks. These extensions will be addressed
in later documents. in later documents.
The CM is motivated by two main goals: The CM is motivated by two main goals:
(i) Enable efficient multiplexing. Increasingly, the trend on the (i) Enable efficient multiplexing. Increasingly, the trend on the
Internet is for unicast data senders (e.g., Web servers) to Internet is for unicast data senders (e.g., Web servers) to
transmit heterogeneous types of data to receivers, ranging from transmit heterogeneous types of data to receivers, ranging from
unreliable real-time streaming content to reliable Web pages and unreliable real-time streaming content to reliable Web pages and
applets. As a result, many logically different streams share the applets. As a result, many logically different streams share the
same path between sender and receiver. For the Internet to remain same path between sender and receiver. For the Internet to remain
skipping to change at line 238 skipping to change at line 238
| | | | I | | | | | | | I | | |
v v v | | | Controller | v v v | | | Controller |
|-----------------------------------| | | | | |-----------------------------------| | | | |
| IP |-->| | | | | IP |-->| | | |
|-----------------------------------| | | |--------------| |-----------------------------------| | | |--------------|
|---| |---|
Figure 1 Figure 1
The key components of the CM framework are (i) the API, (ii) the The key components of the CM framework are (i) the API, (ii) the
congestion controller, (iii) the scheduler. The API is (in part) congestion controller, and (iii) the scheduler. The API is (in
motivated by the ideas of application-level framing (ALF) [Clark90] part) motivated by the requirements of application-level framing
and is described in Section 4. The CM internals (Section 5) (ALF) [Clark90], and is described in Section 4. The CM internals
include a congestion controller (Section 5.1) and a scheduler to (Section 5) include a congestion controller (Section 5.1) and a
orchestrate data transmissions between concurrent streams in a scheduler to orchestrate data transmissions between concurrent
macroflow (Section 5.2). The congestion controller adjusts the streams in a macroflow (Section 5.2). The congestion controller
aggregate transmission rate between sender and receiver based on adjusts the aggregate transmission rate between sender and receiver
its estimate of congestion in the network. It obtains feedback based on its estimate of congestion in the network. It obtains
about its past transmissions from applications themselves via the feedback about its past transmissions from applications themselves
API. The scheduler apportions available bandwidth amongst the via the API. The scheduler apportions available bandwidth amongst
different streams within each macroflow and notifies applications the different streams within each macroflow and notifies
when they are permitted to send data. This document focuses on applications when they are permitted to send data. This document
well-behaved applications; a future one will describe the focuses on well-behaved applications; a future one will describe
sender-receiver protocol and header formats that will handle the sender-receiver protocol and header formats that will handle
applications that do not incorporate their own feedback to the CM. applications that do not incorporate their own feedback to the CM.
4. CM API 4. CM API
Using the CM API, streams can determine their share of the available Using the CM API, streams can determine their share of the available
bandwidth, request and have their data transmissions scheduled, bandwidth, request and have their data transmissions scheduled,
inform the CM about successful transmissions, and be informed when inform the CM about successful transmissions, and be informed when
the CM's estimate of path bandwidth changes. Thus, the CM frees the CM's estimate of path bandwidth changes. Thus, the CM frees
applications from having to maintain information about the state of applications from having to maintain information about the state of
congestion and available bandwidth along any path. congestion and available bandwidth along any path.
skipping to change at line 280 skipping to change at line 280
Currently, stream_info consists of the following information: (i) Currently, stream_info consists of the following information: (i)
the source IP address, (ii) the source port, (iii) the destination the source IP address, (ii) the source port, (iii) the destination
IP address, (iv) the destination port, and (v) the IP protocol IP address, (iv) the destination port, and (v) the IP protocol
number. number.
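As an illustration only, the five fields listed above could be grouped as in the following C sketch. The field names, field widths, the IPv4-only layout, and the stream_info_equal() helper are assumptions made for this example; the draft names the fields but does not define a concrete structure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout for stream_info. The draft lists the fields only;
 * the types, the order, and the IPv4-only addresses are assumptions. */
struct stream_info {
    uint32_t src_addr;  /* (i)   source IP address */
    uint16_t src_port;  /* (ii)  source port */
    uint32_t dst_addr;  /* (iii) destination IP address */
    uint16_t dst_port;  /* (iv)  destination port */
    uint8_t  ip_proto;  /* (v)   IP protocol number, e.g. 6 for TCP */
};

/* Hypothetical helper: returns 1 if both descriptors name the same
 * stream (all five fields match), 0 otherwise. */
static int stream_info_equal(const struct stream_info *a,
                             const struct stream_info *b)
{
    assert(a != NULL && b != NULL);
    return a->src_addr == b->src_addr &&
           a->src_port == b->src_port &&
           a->dst_addr == b->dst_addr &&
           a->dst_port == b->dst_port &&
           a->ip_proto == b->ip_proto;
}
```

A CM implementation might use such a comparison to decide whether a stream_info descriptor refers to a stream it already tracks.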
4.1 State maintenance 4.1 State maintenance
1. Open: All applications MUST call cm_open(stream_info) before 1. Open: All applications MUST call cm_open(stream_info) before
using the CM API. This returns a handle, cm_streamid, for the using the CM API. This returns a handle, cm_streamid, for the
application to use for all further CM API invocations for that application to use for all further CM API invocations for that
stream. If cm_streamid is -1, then the cm_open() failed and that stream. If the returned cm_streamid is -1, then the cm_open()
stream cannot use the CM. failed and that stream cannot use the CM.
All other calls to the CM for a stream use the cm_streamid All other calls to the CM for a stream use the cm_streamid
returned from the cm_open() call. returned from the cm_open() call.
2. Close: When a stream terminates, the application SHOULD invoke 2. Close: When a stream terminates, the application SHOULD invoke
cm_close(cm_streamid) to inform the CM about the termination cm_close(cm_streamid) to inform the CM about the termination
of the stream. of the stream.
3. Packet size: cm_mtu(cm_streamid) returns the estimated PMTU of 3. Packet size: cm_mtu(cm_streamid) returns the estimated PMTU of
the path between sender and receiver. Internally, this the path between sender and receiver. Internally, this
skipping to change at line 310 skipping to change at line 310
prevailing network conditions, and supporting ALF-based prevailing network conditions, and supporting ALF-based
applications. applications.
1. Callback-based transmission. The callback-based transmission API 1. Callback-based transmission. The callback-based transmission API
puts the stream in firm control of deciding what to transmit at puts the stream in firm control of deciding what to transmit at
each point in time. To achieve this, the CM does not buffer any each point in time. To achieve this, the CM does not buffer any
data; instead, it allows streams the opportunity to adapt to data; instead, it allows streams the opportunity to adapt to
unexpected network changes at the last possible instant. Thus, unexpected network changes at the last possible instant. Thus,
this enables streams to "pull out" and repacketize data upon this enables streams to "pull out" and repacketize data upon
learning about any rate change, which is hard to do once the data learning about any rate change, which is hard to do once the data
has been buffered. A stream wishing to send data in this style has been buffered. The CM must implement a cm_request(i32
MUST call cm_request(i32 cm_streamid). After some time, depending cm_streamid) call for streams wishing to send data in this style.
on the rate, the CM invokes a callback using cmapp_send(), which is After some time, depending on the rate, the CM MUST
invoke a callback using cmapp_send(), which is
a grant for the stream to send up to PMTU bytes. The a grant for the stream to send up to PMTU bytes. The
callback-style API is the recommended choice for ALF-based streams. callback-style API is the recommended choice for ALF-based streams.
Note that cm_request() does not take the number of bytes or Note that cm_request() does not take the number of bytes or
MTU-sized units as an argument; each call to cm_request() is an MTU-sized units as an argument; each call to cm_request() is an
implicit request for sending up to PMTU bytes. Section 4.3 implicit request for sending up to PMTU bytes. The CM MAY provide
discusses the time duration for which the transmission grant is an alternate interface, cm_request(int k). The cmapp_send callback
valid, while Section 5.2 describes how these requests are scheduled for this request is granted the right to send up to k PMTU sized
and callbacks made. segments. Section 4.3 discusses the time duration for which the
transmission grant is valid, while Section 5.2 describes how these
requests are scheduled and callbacks made.
2. Synchronous-style. The above callback-based API accommodates a 2. Synchronous-style. The above callback-based API accommodates a
class of ALF streams that are "asynchronous." Asynchronous class of ALF streams that are "asynchronous." Asynchronous
transmitters do not transmit based on a periodic clock, but do so transmitters do not transmit based on a periodic clock, but do so
triggered by asynchronous events like file reads or captured triggered by asynchronous events like file reads or captured
frames. On the other hand, there are many streams that are frames. On the other hand, there are many streams that are
"synchronous" transmitters, which transmit periodically based on "synchronous" transmitters, which transmit periodically based on
their own internal timers (e.g., an audio sender that sends at a their own internal timers (e.g., an audio sender that sends at a
constant sampling rate). While CM callbacks could be configured to constant sampling rate). While CM callbacks could be configured to
periodically interrupt such transmitters, the transmit loop of such periodically interrupt such transmitters, the transmit loop of such
applications is less affected if they retain their original applications is less affected if they retain their original
timer-based loop. In addition, it complicates the CM API to have a timer-based loop. In addition, it complicates the CM API to have a
stream express the periodicity and granularity of its callbacks. stream express the periodicity and granularity of its callbacks.
Thus, the CM exports an API that allows such streams to be informed Thus, the CM MUST export an API that allows such streams to be informed
of changes in rates using the cmapp_update(u64 newrate, u32 srtt, of changes in rates using the cmapp_update(u64 newrate, u32 srtt,
u32 rttdev) callback function, where newrate is the new rate in u32 rttdev) callback function, where newrate is the new rate in
bits per second for this stream, srtt is the current smoothed round bits per second for this stream, srtt is the current smoothed round
trip time estimate in microseconds, and rttdev is the smoothed trip time estimate in microseconds, and rttdev is the smoothed
linear deviation in the round-trip time estimate. The newrate linear deviation in the round-trip time estimate calculated using
value reports an instantaneous rate calculated, for example, by the same algorithm as in TCP [Paxson00]. The newrate value reports
taking the ratio of cwnd and srtt, and dividing by the fraction of an instantaneous rate calculated, for example, by taking the ratio
that ratio allocated to the stream. In response, the stream MUST of cwnd and srtt, and dividing by the fraction of that ratio
adapt its packet size or change its timer interval to conform to allocated to the stream. In response, the stream MUST adapt its
(i.e., not exceed) the allowed rate. Of course, it may choose not packet size or change its timer interval to conform to (i.e., not
to use all of this rate. Note that the CM is not on the data path exceed) the allowed rate. Of course, it may choose not to use all
of the actual transmission. of this rate. Note that the CM is not on the data path of the
actual transmission.
To avoid unnecessary cmapp_update() callbacks that the application To avoid unnecessary cmapp_update() callbacks that the application
will only ignore, the stream can use the cm_thresh(float will only ignore, the CM MUST provide a cm_thresh(float
rate_downthresh, float rate_upthresh, float rtt_downthresh, float rate_downthresh, float rate_upthresh, float rtt_downthresh, float
rtt_upthresh) function at any stage in its execution. In response, rtt_upthresh) function that a stream can use at any stage in its execution.
the CM will invoke the callback only when the rate decreases to In response, the CM SHOULD invoke the callback only when the rate decreases
less than (rate_downthresh * lastrate) or increases to more than to less than (rate_downthresh * lastrate) or increases to more than
(rate_upthresh * lastrate), where lastrate is the rate last (rate_upthresh * lastrate), where lastrate is the rate last
notified to the stream, or when the round-trip time changes notified to the stream, or when the round-trip time changes
correspondingly by the requisite thresholds. This information is correspondingly by the requisite thresholds. This information is
used as a hint by the CM, in the sense that cmapp_update() can be used as a hint by the CM, in the sense that cmapp_update() can be
called even if these conditions are not met. called even if these conditions are not met.
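The rate half of the threshold test described above can be sketched in C as follows. The helper name rate_crosses_thresh() is hypothetical, and the analogous round-trip time check is omitted for brevity. Because the thresholds are only a hint, a CM may still invoke cmapp_update() when this test returns 0.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 when newrate falls below
 * (rate_downthresh * lastrate) or rises above
 * (rate_upthresh * lastrate), and 0 otherwise. */
static int rate_crosses_thresh(uint64_t lastrate, uint64_t newrate,
                               float rate_downthresh, float rate_upthresh)
{
    assert(rate_downthresh <= rate_upthresh);
    double low  = (double)rate_downthresh * (double)lastrate;
    double high = (double)rate_upthresh * (double)lastrate;
    return (double)newrate < low || (double)newrate > high;
}
```

For example, with thresholds of 0.5 and 2.0 and a last notified rate of 1 Mbit/s, a new rate of 900 kbit/s would not require a callback, while 400 kbit/s or 2.5 Mbit/s would.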
An application can query the current CM state by using cm_query(i32 The CM MUST implement a cm_query(i32 cm_streamid, u64* rate,
cm_streamid, u64* rate, u32* srtt, u32* rttdev). This sets the u32* srtt, u32* rttdev) to allow an application to query
rate variable to the current rate estimate in bits per second, the the current CM state. This sets the rate variable to
the current rate estimate in bits per second, the
srtt variable to the current smoothed round-trip time estimate in srtt variable to the current smoothed round-trip time estimate in
microseconds, and rttdev to the mean linear deviation. If the CM microseconds, and rttdev to the mean linear deviation. If the CM
does not have valid estimates for the macroflow, it fills in does not have valid estimates for the macroflow, it fills in
negative values for the rate, srtt, and rttdev. negative values for the rate, srtt, and rttdev.
Note that a stream can use more than one of the above transmission Note that a stream can use more than one of the above transmission
APIs at the same time. In particular, the knowledge of sustainable APIs at the same time. In particular, the knowledge of sustainable
rate is useful for asynchronous streams as well as synchronous rate is useful for asynchronous streams as well as synchronous
ones; e.g., an asynchronous Web server disseminating images using ones; e.g., an asynchronous Web server disseminating images using
TCP may use cmapp_send() to schedule its transmissions and TCP may use cmapp_send() to schedule its transmissions and
skipping to change at line 400 skipping to change at line 405
successful receptions, type of loss (timeout event, Explicit successful receptions, type of loss (timeout event, Explicit
Congestion Notification [Ramakrishnan98], etc.) and round-trip time Congestion Notification [Ramakrishnan98], etc.) and round-trip time
samples. The nrecd parameter indicates how many bytes were samples. The nrecd parameter indicates how many bytes were
successfully received by the receiver since the last cm_update successfully received by the receiver since the last cm_update
call, while the nlost parameter identifies how many bytes were call, while the nlost parameter identifies how many bytes were
lost during the same time period. The rtt value lost during the same time period. The rtt value
indicates the round-trip time measured during the transmission of indicates the round-trip time measured during the transmission of
these bytes. The rtt value must be set to -1 if no valid these bytes. The rtt value must be set to -1 if no valid
round-trip sample was obtained by the application. The lossmode round-trip sample was obtained by the application. The lossmode
parameter provides an indicator of how a loss was detected. A parameter provides an indicator of how a loss was detected. A
value of CM_PERSISTENT indicates that the application believes value of CM_NO_FEEDBACK indicates that the application has received
congestion to be severe, e.g., a TCP that has experienced a no feedback for all its outstanding data, and is reporting this to
timeout. A value of CM_TRANSIENT indicates that the application the CM. For example, a TCP that has experienced a timeout would
believes that the congestion is not severe, e.g., a TCP loss use this parameter to inform the CM of this. A value of
CM_LOSS_FEEDBACK indicates that the application has experienced
some loss, which it believes to be due to congestion, but not all
outstanding data has been lost. For example, a TCP segment loss
detected using duplicate (selective) acknowledgements or other detected using duplicate (selective) acknowledgements or other
data-driven techniques. A value of CM_ECN indicates that the data-driven techniques fits this category. A value of
receiver echoed an explicit congestion notification message. CM_EXPLICIT_CONGESTION indicates that the receiver echoed an
Finally, a value of CM_NOLOSS indicates that no congestion-related explicit congestion notification message. Finally, a value of
loss has occurred. The lossmode parameter MUST be reported as a CM_NO_CONGESTION indicates that no congestion-related loss has
bit-vector where the bits correspond to CM_PERSISTENT, occurred. The lossmode parameter MUST be reported as a bit-vector
CM_TRANSIENT, and CM_ECN. where the bits correspond to CM_NO_FEEDBACK, CM_LOSS_FEEDBACK,
CM_EXPLICIT_CONGESTION, and CM_NO_CONGESTION. Note that over links
(paths) that experience losses for reasons other than congestion,
an application SHOULD inform the CM of losses, with the
CM_NO_CONGESTION field set.
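One possible encoding of the lossmode bit-vector, using the flag names from the -02 text, is sketched below in C. The bit positions, the 8-bit width, and the tcp_lossmode() helper are assumptions for illustration; the draft names the flags but does not assign values.

```c
#include <assert.h>
#include <stdint.h>

/* Bit assignments are assumptions; the draft only names the flags. */
enum {
    CM_NO_FEEDBACK         = 1 << 0, /* no feedback for outstanding data */
    CM_LOSS_FEEDBACK       = 1 << 1, /* some loss, believed congestive */
    CM_EXPLICIT_CONGESTION = 1 << 2, /* receiver echoed an ECN message */
    CM_NO_CONGESTION       = 1 << 3  /* no congestion-related loss */
};

/* Hypothetical helper: builds the lossmode value a TCP sender might
 * report, given which events it observed since the last cm_update. */
static uint8_t tcp_lossmode(int timed_out, int dupack_loss, int ecn_echo)
{
    uint8_t mode = 0;
    if (timed_out)
        mode |= CM_NO_FEEDBACK;
    if (dupack_loss)
        mode |= CM_LOSS_FEEDBACK;
    if (ecn_echo)
        mode |= CM_EXPLICIT_CONGESTION;
    if (mode == 0)
        mode = CM_NO_CONGESTION; /* nothing congestion-related observed */
    return mode;
}
```

Representing the flags as distinct bits lets a single report carry, for instance, both a data-driven loss and an ECN echo.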
cm_notify(i32 cm_streamid, u32 nsent) MUST be called when data is cm_notify(i32 cm_streamid, u32 nsent) MUST be called when data is
transmitted from the host (e.g., in the IP output routine) to transmitted from the host (e.g., in the IP output routine) to
inform the CM that nsent bytes were just transmitted on a given inform the CM that nsent bytes were just transmitted on a given
stream. This allows the CM to update its estimate of the number of stream. This allows the CM to update its estimate of the number of
outstanding bytes for the macroflow and for the stream. outstanding bytes for the macroflow and for the stream.
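The bookkeeping that cm_notify() enables might look like the following C sketch. The structure and function names are hypothetical, and a real CM would also look up the stream from cm_streamid; here the stream state is passed in directly to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-macroflow and per-stream state. */
struct macroflow_state {
    uint64_t outstanding;          /* bytes sent but not yet reported on */
};

struct stream_state {
    struct macroflow_state *mflow; /* macroflow this stream belongs to */
    uint64_t outstanding;          /* bytes outstanding on this stream */
};

/* Hypothetical helper: records that nsent bytes just left the host on
 * stream s, updating both the stream and its macroflow. */
static void account_sent(struct stream_state *s, uint32_t nsent)
{
    assert(s != NULL && s->mflow != NULL);
    s->outstanding += nsent;
    s->mflow->outstanding += nsent;
}
```

Two streams that share a macroflow would each keep their own count while the macroflow total reflects the sum of both.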
A cmapp_send() grant from the CM to an application is valid only A cmapp_send() grant from the CM to an application is valid only
for an expiration time, equal to the larger of the round-trip time for an expiration time, equal to the larger of the round-trip time
and an implementation-dependent threshold communicated as an and an implementation-dependent threshold communicated as an
skipping to change at line 460 skipping to change at line 472
sets the macroflow of the stream cm_streamid to cm_macroflowid. If the sets the macroflow of the stream cm_streamid to cm_macroflowid. If the
cm_macroflowid that is passed to cm_setmacroflow() is -1, then a cm_macroflowid that is passed to cm_setmacroflow() is -1, then a
new macroflow is constructed and this is returned to the caller. new macroflow is constructed and this is returned to the caller.
Each call to cm_setmacroflow() overrides the previous macroflow Each call to cm_setmacroflow() overrides the previous macroflow
association for the stream, should one exist. association for the stream, should one exist.
The default suggested aggregation method is to aggregate by The default suggested aggregation method is to aggregate by
destination IP address; i.e., all streams to the same destination destination IP address; i.e., all streams to the same destination
address are aggregated to a single macroflow by default. The address are aggregated to a single macroflow by default. The
cm_getmacroflow() and cm_setmacroflow() calls can then be used to cm_getmacroflow() and cm_setmacroflow() calls can then be used to
change this as needed. change this as needed. We do note that there are some cases where
this may not be optimal, even over best-effort networks. For
example, when a group of receivers are behind a NAT device, the
sender will see them all as one address. If the hosts behind the
NAT are in fact connected over different bottleneck links, some of
those hosts could see worse performance than before. It is
possible to detect such hosts when using delay and loss estimates,
although the specific mechanisms for doing so are beyond the scope
of this document.
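The default aggregation rule can be illustrated with the C sketch below, which maps each destination address to one macroflow. The table layout, the fixed capacity, the IPv4-only addresses, and the default_macroflow() name are all assumptions for this example and are not part of the CM API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_MACROFLOWS 16 /* arbitrary capacity for the sketch */

/* Hypothetical table: entry i holds the destination of macroflow i. */
struct macroflow_table {
    uint32_t dst_addr[MAX_MACROFLOWS];
    int      count;
};

/* Hypothetical helper: returns the macroflow id for dst_addr, creating
 * a new macroflow if none exists yet, or -1 if the table is full. */
static int32_t default_macroflow(struct macroflow_table *t, uint32_t dst_addr)
{
    assert(t != NULL && t->count >= 0 && t->count <= MAX_MACROFLOWS);
    for (int i = 0; i < t->count; i++) {
        if (t->dst_addr[i] == dst_addr)
            return i;              /* same destination, same macroflow */
    }
    if (t->count == MAX_MACROFLOWS)
        return -1;
    t->dst_addr[t->count] = dst_addr;
    return t->count++;
}
```

Under this rule, receivers behind a single NAT address would all land in one macroflow, which is the case the paragraph above cautions about.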
The objective of this interface is to set up sharing of groups, not The objective of this interface is to set up sharing of groups, not
the sharing policy of relative weights of streams in a macroflow. The the sharing policy of relative weights of streams in a macroflow. The
latter requires the scheduler to provide an interface to set latter requires the scheduler to provide an interface to set
sharing policy. However, because we want to support many different sharing policy. However, because we want to support many different
schedulers (each of which may need different information to set schedulers (each of which may need different information to set
policy), we do not specify a complete API to the scheduler (but see policy), we do not specify a complete API to the scheduler (but see
Section 5.2). A later guideline document intends to describe a few Section 5.2). A later guideline document is expected to describe a
simple schedulers (e.g., weighted round-robin, hierarchical few simple schedulers (e.g., weighted round-robin, hierarchical
scheduling) and the API they export to provide relative scheduling) and the API they export to provide relative
prioritization. prioritization.
5. CM internals 5. CM internals
This section describes the internal components of the CM. It This section describes the internal components of the CM. It
includes a Congestion Controller and a Scheduler, with includes a Congestion Controller and a Scheduler, with
well-defined, abstract interfaces exported by them. well-defined, abstract interfaces exported by them.
5.1 Congestion controller 5.1 Congestion controller
skipping to change at line 555 skipping to change at line 575
described stream's share of the total bandwidth available to the described stream's share of the total bandwidth available to the
macroflow. This call combined with the query call of the macroflow. This call combined with the query call of the
congestion controller provides the information to satisfy an congestion controller provides the information to satisfy an
application's cm_query() request. application's cm_query() request.
- void notify(i32 cm_streamid, u32 nsent): This interface is used - void notify(i32 cm_streamid, u32 nsent): This interface is used
to notify the scheduler module whenever data is sent by a CM to notify the scheduler module whenever data is sent by a CM
application. The nsent parameter indicates the number of bytes application. The nsent parameter indicates the number of bytes
just sent by the application. just sent by the application.
The Scheduler MAY implement many additional interfaces. As
experience with CM schedulers increases, future documents may
make additions and/or changes to some parts of the scheduler
API.
6. Examples 6. Examples
6.1 Example applications 6.1 Example applications
The following describes the possible use of the CM API by an The following describes the possible use of the CM API by an
asynchronous application (an implementation of a TCP sender) and a asynchronous application (an implementation of a TCP sender) and a
synchronous application (an audio server). More details of these synchronous application (an audio server). More details of these
applications and CM implementation optimizations for efficient applications and CM implementation optimizations for efficient
operation are described in [Andersen00]. We emphasize that the operation are described in [Andersen00]. We emphasize that the
protocols in this section are examples and suggestions for protocols in this section are examples and suggestions for
skipping to change at line 639 skipping to change at line 664
Karn's algorithm. Any rtt estimate that is generated is passed Karn's algorithm. Any rtt estimate that is generated is passed
to CM via the cm_update call. to CM via the cm_update call.
2. All cwnd and slow start threshold (ssthresh) updates are 2. All cwnd and slow start threshold (ssthresh) updates are
removed. removed.
3. Upon the arrival of an ack for new data, TCP computes the 3. Upon the arrival of an ack for new data, TCP computes the
value of in_flight (the amount of data in flight) as value of in_flight (the amount of data in flight) as
snd_max-ack-1 (i.e. MAX Sequence Sent - Current Ack - 1). TCP snd_max-ack-1 (i.e. MAX Sequence Sent - Current Ack - 1). TCP
then calls cm_update(streamid, tcp_ownd - in_flight, 0, then calls cm_update(streamid, tcp_ownd - in_flight, 0,
CM_NOLOSS, rtt). CM_NO_CONGESTION, rtt).
4. Upon the arrival of a duplicate acknowledgement, TCP must 4. Upon the arrival of a duplicate acknowledgement, TCP must
check its dupack count (dup_acks) to determine its action. If check its dupack count (dup_acks) to determine its action. If
dup_acks < 3, the TCP does nothing. If dup_acks == 3, TCP dup_acks < 3, the TCP does nothing. If dup_acks == 3, TCP
assumes that a packet was lost and that at least 3 packets assumes that a packet was lost and that at least 3 packets
arrived to generate these duplicate acks. Therefore, it calls arrived to generate these duplicate acks. Therefore, it calls
cm_update(streamid, 4 * avg_pkt_size, 3 * avg_pkt_size, cm_update(streamid, 4 * avg_pkt_size, 3 * avg_pkt_size,
CM_TRANSIENT, rtt). The average packet size is used since the CM_LOSS_FEEDBACK, rtt). The average packet size is used since the
acknowledgements do not indicate exactly how much data has acknowledgements do not indicate exactly how much data has
reached the other end. Most TCP implementations interpret a reached the other end. Most TCP implementations interpret a
duplicate ACK as an indication that a full MSS has reached its duplicate ACK as an indication that a full MSS has reached its
destination. Once a new ACK is received, these TCP sender destination. Once a new ACK is received, these TCP sender
implementations may resynchronize with TCP receiver. The CM API implementations may resynchronize with TCP receiver. The CM API
does not provide a mechanism for TCP to pass information from does not provide a mechanism for TCP to pass information from
this resynchronization. Therefore, TCP can only infer the this resynchronization. Therefore, TCP can only infer the
arrival of an avg_pkt_size amount of data from each duplicate arrival of an avg_pkt_size amount of data from each duplicate
ack. TCP also enqueues a retransmission of the lost segment and ack. TCP also enqueues a retransmission of the lost segment and
calls cm_request(). If dup_acks > 3, TCP assumes that a packet calls cm_request(). If dup_acks > 3, TCP assumes that a packet
has reached the other end and caused this ack to be sent. As a has reached the other end and caused this ack to be sent. As a
result, it calls cm_update(streamid, avg_pkt_size, avg_pkt_size, result, it calls cm_update(streamid, avg_pkt_size, avg_pkt_size,
CM_NOLOSS, rtt). CM_NO_CONGESTION, rtt).
5. Upon the arrival of a partial acknowledgment (one that does 5. Upon the arrival of a partial acknowledgment (one that does
not exceed the highest segment transmitted at the time the loss not exceed the highest segment transmitted at the time the loss
occurred, as defined in [Floyd99b]), TCP assumes that a packet occurred, as defined in [Floyd99b]), TCP assumes that a packet
was lost and that the retransmitted packet has reached the was lost and that the retransmitted packet has reached the
recipient. Therefore, it calls cm_update(streamid, 2 * recipient. Therefore, it calls cm_update(streamid, 2 *
avg_pkt_size, avg_pkt_size, CM_NOLOSS, rtt). CM_NOLOSS is used avg_pkt_size, avg_pkt_size, CM_NO_CONGESTION,
since the loss period has already been reported. TCP also rtt). CM_NO_CONGESTION is used since the loss period has already
enqueues a retransmission of the lost segment and calls been reported. TCP also enqueues a retransmission of the lost
cm_request(). segment and calls cm_request().
When the TCP retransmission timer expires, the sender identifies When the TCP retransmission timer expires, the sender identifies
that a segment has been lost and calls cm_update(streamid, that a segment has been lost and calls cm_update(streamid,
avg_pkt_size, 0, CM_PERSISTENT, 0) to signify the occurrence of avg_pkt_size, 0, CM_NO_FEEDBACK, 0) to signify the occurrence of
persistent congestion to the CM. TCP also enqueues a persistent congestion to the CM. TCP also enqueues a
retransmission of the lost segment and calls cm_request(). retransmission of the lost segment and calls cm_request().
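The duplicate-ack, partial-ack, and retransmission-timer cases above can be sketched in C as follows. This is an illustration only, not the authors' implementation. It uses the loss-mode names from the -02 text; the numeric values of those constants, the argument names nsent and nrecd, the cm_update() stub that merely records its arguments, and the tcp_on_*() helper names are all assumptions made for the example. Each helper returns 1 when the caller would also enqueue a retransmission and call cm_request().

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t  i32;
typedef uint32_t u32;

/* Loss modes named in the -02 text; the numeric values are placeholders. */
enum cm_lossmode {
    CM_NO_CONGESTION = 0,
    CM_LOSS_FEEDBACK = 1,
    CM_NO_FEEDBACK   = 2
};

/* Record of the most recent cm_update() call, so the sketch can be
 * exercised without a real CM. */
struct cm_update_args {
    i32 streamid;
    u32 nsent;
    u32 nrecd;
    int lossmode;
    i32 rtt;
    int calls;
};
struct cm_update_args last_update;

/* Hypothetical stand-in for the CM's cm_update(); the argument names
 * are assumed, only their order is taken from the calls in the text. */
void cm_update(i32 streamid, u32 nsent, u32 nrecd, int lossmode, i32 rtt)
{
    last_update.streamid = streamid;
    last_update.nsent    = nsent;
    last_update.nrecd    = nrecd;
    last_update.lossmode = lossmode;
    last_update.rtt      = rtt;
    last_update.calls++;
}

/* Step 4: a duplicate ack arrived; dup_acks counts it. */
int tcp_on_dupack(i32 streamid, u32 dup_acks, u32 avg_pkt_size, i32 rtt)
{
    assert(dup_acks >= 1);
    if (dup_acks < 3)
        return 0;                       /* below the threshold: do nothing */
    if (dup_acks == 3) {
        /* One packet presumed lost, at least three presumed delivered. */
        cm_update(streamid, 4 * avg_pkt_size, 3 * avg_pkt_size,
                  CM_LOSS_FEEDBACK, rtt);
        return 1;                       /* retransmit, then cm_request() */
    }
    /* dup_acks > 3: one more packet reached the other end. */
    cm_update(streamid, avg_pkt_size, avg_pkt_size, CM_NO_CONGESTION, rtt);
    return 0;
}

/* Step 5: a partial ack arrived; the loss period was already reported. */
int tcp_on_partial_ack(i32 streamid, u32 avg_pkt_size, i32 rtt)
{
    cm_update(streamid, 2 * avg_pkt_size, avg_pkt_size,
              CM_NO_CONGESTION, rtt);
    return 1;
}

/* Retransmission timer expiry: no feedback, so no rtt sample (rtt = 0). */
int tcp_on_rto(i32 streamid, u32 avg_pkt_size)
{
    cm_update(streamid, avg_pkt_size, 0, CM_NO_FEEDBACK, 0);
    return 1;
}
```

With avg_pkt_size = 1000, the third duplicate ack produces cm_update(streamid, 4000, 3000, CM_LOSS_FEEDBACK, rtt), and every later duplicate ack produces cm_update(streamid, 1000, 1000, CM_NO_CONGESTION, rtt).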
6.1.2 Congestion-controlled UDP 6.1.2 Congestion-controlled UDP
Congestion-controlled UDP is a useful CM application, which we Congestion-controlled UDP is a useful CM application, which we
describe in the context of Berkeley sockets [Stevens94]. They describe in the context of Berkeley sockets [Stevens94]. They
provide the same functionality as standard Berkeley UDP sockets, provide the same functionality as standard Berkeley UDP sockets,
but instead of immediately sending the data from the kernel packet but instead of immediately sending the data from the kernel packet
queue to lower layers for transmission, the buffered socket queue to lower layers for transmission, the buffered socket
skipping to change at line 760 skipping to change at line 785
- query(): AIMD_CC returns the current congestion window (cwnd) - query(): AIMD_CC returns the current congestion window (cwnd)
divided by the smoothed rtt (srtt) as its bandwidth estimate. It divided by the smoothed rtt (srtt) as its bandwidth estimate. It
returns the smoothed rtt estimate as srtt. returns the smoothed rtt estimate as srtt.
- notify(): AIMD_CC adds the number of bytes sent to its - notify(): AIMD_CC adds the number of bytes sent to its
outstanding data window (ownd). outstanding data window (ownd).
- update(): AIMD_CC subtracts nsent from ownd. If the value of rtt - update(): AIMD_CC subtracts nsent from ownd. If the value of rtt
is non-zero, AIMD_CC updates srtt using the TCP srtt calculation. is non-zero, AIMD_CC updates srtt using the TCP srtt calculation.
If the update indicates that data has been lost, AIMD_CC sets If the update indicates that data has been lost, AIMD_CC sets
cwnd to 1 MTU if the loss_mode is CM_PERSISTENT and to cwnd/2 cwnd to 1 MTU if the loss_mode is CM_NO_FEEDBACK and to cwnd/2
(with a minimum of 1 MTU) if the loss_mode is CM_TRANSIENT or (with a minimum of 1 MTU) if the loss_mode is CM_LOSS_FEEDBACK or
CM_ECN. AIMD_CC also sets its internal ssthresh variable to CM_EXPLICIT_CONGESTION. AIMD_CC also sets its internal ssthresh variable to
cwnd/2. If no loss had occurred, AIMD_CC mimics TCP slow start cwnd/2. If no loss had occurred, AIMD_CC mimics TCP slow start
and linear growth modes. It increments cwnd by nsent when cwnd < and linear growth modes. It increments cwnd by nsent when cwnd <
ssthresh (bounded by a maximum of ssthresh-cwnd) and by nsent * ssthresh (bounded by a maximum of ssthresh-cwnd) and by nsent *
MTU/cwnd when cwnd > ssthresh. MTU/cwnd when cwnd > ssthresh.
- When cwnd or ownd are updated and indicate that at least one MTU - When cwnd or ownd are updated and indicate that at least one MTU
may be transmitted, AIMD_CC calls the CM to schedule a may be transmitted, AIMD_CC calls the CM to schedule a
transmission. transmission.
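The query(), notify(), and update() behaviour described above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the AIMD_CC module itself. The struct layout, the aimd_*() function names, the argument list of aimd_update(), and the enum values are hypothetical. The sketch treats any lossmode other than CM_NO_CONGESTION as a loss indication, reads "cwnd/2" for ssthresh as half of the pre-loss window, uses a gain of 1/8 for the srtt calculation, and applies linear growth when cwnd equals ssthresh, a case the text leaves open. The final step, calling the CM to schedule a transmission, is omitted.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t  i32;
typedef uint32_t u32;

/* Loss modes named in the -02 text; the numeric values are placeholders. */
enum cm_lossmode {
    CM_NO_CONGESTION,
    CM_LOSS_FEEDBACK,
    CM_EXPLICIT_CONGESTION,
    CM_NO_FEEDBACK
};

/* Hypothetical per-macroflow state; all sizes are in bytes. */
struct aimd_cc {
    u32 mtu;
    u32 cwnd;      /* congestion window */
    u32 ssthresh;  /* slow start threshold */
    u32 ownd;      /* outstanding data window */
    i32 srtt;      /* smoothed rtt, 0 until the first sample */
};

/* notify(): add the bytes just sent to the outstanding window. */
void aimd_notify(struct aimd_cc *cc, u32 nsent)
{
    cc->ownd += nsent;
}

/* query(): bandwidth estimate is cwnd / srtt (0 if no rtt sample yet). */
u32 aimd_query_bandwidth(const struct aimd_cc *cc)
{
    return cc->srtt > 0 ? cc->cwnd / (u32)cc->srtt : 0;
}

/* update(): account for resolved data, then shrink or grow cwnd. */
void aimd_update(struct aimd_cc *cc, u32 nsent, int lossmode, i32 rtt)
{
    assert(cc->mtu > 0 && cc->cwnd >= cc->mtu);

    /* Subtract nsent from ownd, clamping at zero. */
    cc->ownd = (nsent < cc->ownd) ? cc->ownd - nsent : 0;

    /* TCP-style smoothing: srtt += (rtt - srtt) / 8. */
    if (rtt != 0)
        cc->srtt = (cc->srtt == 0) ? rtt : cc->srtt + (rtt - cc->srtt) / 8;

    if (lossmode != CM_NO_CONGESTION) {
        u32 half = cc->cwnd / 2;
        cc->ssthresh = half;
        if (lossmode == CM_NO_FEEDBACK)
            cc->cwnd = cc->mtu;                         /* persistent */
        else
            cc->cwnd = (half < cc->mtu) ? cc->mtu : half;
        return;
    }

    if (cc->cwnd < cc->ssthresh) {
        /* Slow start: grow by nsent, but not past ssthresh. */
        u32 room = cc->ssthresh - cc->cwnd;
        cc->cwnd += (nsent < room) ? nsent : room;
    } else {
        /* Linear growth: nsent * MTU / cwnd. */
        cc->cwnd += (u32)(((uint64_t)nsent * cc->mtu) / cc->cwnd);
    }
}
```

The 64-bit intermediate in the linear-growth branch is a defensive choice to keep nsent * MTU from overflowing 32 bits; the text does not discuss arithmetic width.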
6.4 Example Scheduler Module 6.4 Example Scheduler Module
skipping to change at line 810 skipping to change at line 836
8. References 8. References
[Allman99] Allman, M. and Paxson, V., TCP Congestion Control, [Allman99] Allman, M. and Paxson, V., TCP Congestion Control,
RFC-2581, April 1999. RFC-2581, April 1999.
[Andersen00] Andersen, D., Bansal, D., Curtis, D., Seshan, S., and [Andersen00] Andersen, D., Bansal, D., Curtis, D., Seshan, S., and
Balakrishnan, H., System Support for Bandwidth Management and Balakrishnan, H., System Support for Bandwidth Management and
Content Adaptation in Internet Applications, Proc. 4th Symp. on Content Adaptation in Internet Applications, Proc. 4th Symp. on
Operating Systems Design and Implementation, San Diego, CA, Operating Systems Design and Implementation, San Diego, CA,
October 2000. October 2000. Available from
http://nms.lcs.mit.edu/papers/cm-osdi2000.html
[Balakrishnan98] Balakrishnan, H., Padmanabhan, V., Seshan, S., [Balakrishnan98] Balakrishnan, H., Padmanabhan, V., Seshan, S.,
Stemm, M., and Katz, R., "TCP Behavior of a Busy Web Server: Stemm, M., and Katz, R., "TCP Behavior of a Busy Web Server:
Analysis and Improvements," Proc. IEEE INFOCOM, San Francisco, Analysis and Improvements," Proc. IEEE INFOCOM, San Francisco,
CA, March 1998. CA, March 1998.
[Balakrishnan99] Balakrishnan, H., Rahul, H., and Seshan, S., "An [Balakrishnan99] Balakrishnan, H., Rahul, H., and Seshan, S., "An
Integrated Congestion Management Architecture for Internet Integrated Congestion Management Architecture for Internet
Hosts," Proc. ACM SIGCOMM, Cambridge, MA, September 1999. Hosts," Proc. ACM SIGCOMM, Cambridge, MA, September 1999.
skipping to change at line 855 skipping to change at line 882
[Mahdavi98] Mahdavi, J. and Floyd, S., "The TCP Friendly Website," [Mahdavi98] Mahdavi, J. and Floyd, S., "The TCP Friendly Website,"
http://www.psc.edu/networking/tcp_friendly.html http://www.psc.edu/networking/tcp_friendly.html
[Mogul90] Mogul, J. and Deering, S., "Path MTU Discovery," [Mogul90] Mogul, J. and Deering, S., "Path MTU Discovery,"
RFC-1191, November 1990. RFC-1191, November 1990.
[Padmanabhan98] Padmanabhan, V., "Addressing the Challenges of Web [Padmanabhan98] Padmanabhan, V., "Addressing the Challenges of Web
Data Transport," PhD thesis, Univ. of California, Berkeley, Data Transport," PhD thesis, Univ. of California, Berkeley,
December 1998. December 1998.
[Postel81] Postel, J. (ed.), "Transmission Control Protocol", [Paxson00] Paxson, V. and Allman, M., "Computing TCP's
Retransmission Timer," Internet Draft
draft-paxson-tcp-rto-01.txt, April 2000. (Expires October
2000.)
[Postel81] Postel, J. (ed.), "Transmission Control Protocol,"
RFC-793, September 1981. RFC-793, September 1981.
[Ramakrishnan98] Ramakrishnan, K. and Floyd, S., "A Proposal to Add [Ramakrishnan98] Ramakrishnan, K. and Floyd, S., "A Proposal to Add
Explicit Congestion Notification (ECN) to IP," RFC-2481. Explicit Congestion Notification (ECN) to IP," RFC-2481.
(Experimental.) (Experimental.)
[Stevens94] Stevens, W., TCP/IP Illustrated, Volume 1. [Stevens94] Stevens, W., TCP/IP Illustrated, Volume 1.
Addison-Wesley, Reading, MA, 1994. Addison-Wesley, Reading, MA, 1994.
[Touch97] Touch, J., "TCP Control Block Interdependence," RFC-2140, [Touch97] Touch, J., "TCP Control Block Interdependence," RFC-2140,
skipping to change at line 891 skipping to change at line 923
Massachusetts Institute of Technology Massachusetts Institute of Technology
Cambridge, MA 02139 Cambridge, MA 02139
Email: hari@lcs.mit.edu Email: hari@lcs.mit.edu
Web: http://nms.lcs.mit.edu/~hari/ Web: http://nms.lcs.mit.edu/~hari/
Srinivasan Seshan Srinivasan Seshan
School of Computer Science School of Computer Science
Carnegie Mellon University Carnegie Mellon University
5000 Forbes Ave. 5000 Forbes Ave.
Pittsburgh, PA 15213 Pittsburgh, PA 15213
Email: srini@seshan.org Email: srini@cmu.edu
Web: http://www.seshan.org/ Web: http://www.cs.cmu.edu/~srini/
Full Copyright Statement Full Copyright Statement
"Copyright (C) The Internet Society (date). All Rights Reserved. "Copyright (C) The Internet Society (date). All Rights Reserved.
This document and translations of it may be copied and furnished to This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain others, and derivative works that comment on or otherwise explain
it or assist in its implementation may be prepared, copied, it or assist in its implementation may be prepared, copied,
published and distributed, in whole or in part, without restriction published and distributed, in whole or in part, without restriction
of any kind, provided that the above copyright notice and this of any kind, provided that the above copyright notice and this
paragraph are included on all such copies and derivative works. paragraph are included on all such copies and derivative works.
 End of changes. 33 change blocks. 
91 lines changed or deleted 123 lines changed or added
