draft-ietf-ippm-btc-framework-00.txt   draft-ietf-ippm-btc-framework-01.txt 
INTERNET-DRAFT Expires Jan. 2000 INTERNET-DRAFT
INTERNET-DRAFT Expires June 1999 INTERNET-DRAFT
Network Working Group Matt Mathis Network Working Group Matt Mathis
INTERNET-DRAFT Pittsburgh Supercomputing Center INTERNET-DRAFT Pittsburgh Supercomputing Center
Expiration Date: June 1999 Mark Allman Expiration Date: Jan. 2000 Mark Allman
NASA Lewis NASA Glenn
June, 1999
Empirical Bulk Transfer Capacity Empirical Bulk Transfer Capacity
< draft-ietf-ippm-btc-framework-00.txt > < draft-ietf-ippm-btc-framework-01.txt >
Status of this Document Status of this Document
This document is an Internet-Draft. Internet-Drafts are working This document is an Internet-Draft and is in full conformance with
documents of the Internet Engineering Task Force (IETF), its areas, all provisions of Section 10 of RFC2026.
and its working groups. Note that other groups may also distribute
working documents as Internet-Drafts. Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as
Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six Internet-Drafts are draft documents valid for a maximum of six
months, and may be updated, replaced, or obsoleted by other documents months and may be updated, replaced, or obsoleted by other documents
at any time. It is inappropriate to use Internet-Drafts as reference at any time. It is inappropriate to use Internet-Drafts as
material or to cite them other than as "work in progress." reference material or to cite them other than as ``work in
progress.''
To view the entire list of current Internet-Drafts, please check the The list of current Internet-Drafts can be accessed at
"1id-abstracts.txt" listing contained in the Internet-Drafts shadow http://www.ietf.org/ietf/1id-abstracts.txt
directories on ftp.is.co.za (Africa), nic.nordu.net (Northern
Europe), ftp.nis.garr.it (Southern Europe), munnari.oz.au (Pacific
Rim), ftp.ietf.org (US East Coast), or ftp.isi.edu (US West Coast).
This memo provides information for the Internet community. This memo The list of Internet-Draft Shadow Directories can be accessed at
does not specify an Internet standard of any kind. Distribution of http://www.ietf.org/shadow.html.
this memo is unlimited.
Abstract: Abstract
Bulk Transport Capacity (BTC) is a measure of a network's ability Bulk Transport Capacity (BTC) is a measure of a network's ability to
to transfer significant quantities of data with a single transfer significant quantities of data with a single
congestion-aware transport connection (e.g., TCP). The intuitive congestion-aware transport connection (e.g., TCP). The intuitive
definition of BTC is the expected long term average data rate definition of BTC is the expected long term average data rate (bits
(bits per second) of a single ideal TCP implementation over the per second) of a single ideal TCP implementation over the path in
path in question. However, there are many congestion control question. However, there are many congestion control algorithms
algorithms (and hence transport implementations) permitted by (and hence transport implementations) permitted by IETF standards.
IETF standards. This diversity in transport algorithms creates a This diversity in transport algorithms creates a difficulty for
difficulty for standardizing BTC metrics because the allowed standardizing BTC metrics because the allowed diversity is
diversity is sufficient to lead to situations where different sufficient to lead to situations where different implementations
implementations will yield non-comparable measures -- and will yield non-comparable measures -- and potentially fail the
potentially fail the formal tests for being a metric. formal tests for being a metric.
This document defines a framework for standardizing multiple BTC This document defines a framework for standardizing multiple BTC
metrics that parallel the permitted transport diversity. Two metrics that parallel the permitted transport diversity. Two
approaches are used. First, each BTC metric must be much more approaches are used. First, each BTC metric must be much more
tightly specified than the typical IETF protocol. Pseudo-code or tightly specified than the typical IETF protocol. Pseudo-code or
reference implementations are expected to be the norm. Second, reference implementations are expected to be the norm. Second, each
each BTC methodology is expected to collect some ancillary metrics BTC methodology is expected to collect some ancillary metrics which
which are potentially useful to support analytical models of BTC. are potentially useful to support analytical models of BTC.
1. Introduction 1. Introduction
Bulk Transport Capacity (BTC) is a measure of a network's ability Bulk Transport Capacity (BTC) is a measure of a network's ability to
to transfer significant quantities of data with a single transfer significant quantities of data with a single
congestion-aware transport connection (e.g., TCP). For many congestion-aware transport connection (e.g., TCP). For many
applications the BTC of the underlying network dominates the applications the BTC of the underlying network dominates the overall
overall elapsed time for the application and thus dominates the elapsed time for the application to run and thus dominates the
performance as perceived by a user. Examples of such performance as perceived by a user. Examples of such applications
applications include FTP, and the world wide web when delivering include FTP, and the world wide web when delivering large images or
large images or documents. documents.
The intuitive definition of BTC is the expected long term average The intuitive definition of BTC is the expected long term average
data rate (bits per second) of a single ideal TCP implementation data rate (bits per second) of a single ideal TCP implementation
over the path in question. over the path in question.
Central to the notion of bulk transport capacity is the idea that Central to the notion of bulk transport capacity is the idea that
all transport protocols should have similar responses to all transport protocols should have similar responses to congestion
congestion in the Internet. Indeed the only form of equity in the Internet. Indeed the only form of equity significantly
significantly deployed in the Internet today is that the vast deployed in the Internet today is that the vast majority of all
majority of all traffic is carried by TCP implementations sharing traffic is carried by TCP implementations sharing common congestion
common congestion control algorithms largely due to a shared control algorithms largely due to a shared developmental heritage.
developmental heritage.
[RFC2001.bis] specifies the standard congestion control algorithms [RFC2581] specifies the standard congestion control algorithms used
used by these TCP implementations. Even though this document is a by these TCP implementations. Even though this document is a
(proposed) standard, it permits considerable latitude in (proposed) standard, it permits considerable latitude in
implementation. This latitude is by design, to encourage ongoing implementation. This latitude is by design, to encourage ongoing
evolution in congestion control algorithms. evolution in congestion control algorithms.
This legal diversity in transport algorithms creates a This legal diversity in transport algorithms creates a difficulty
difficulty for standardizing BTC metrics because the allowed for standardizing BTC metrics because the allowed diversity is
diversity is sufficient to lead to situations where different sufficient to lead to situations where different implementations
implementations will yield non-comparable measures -- and will yield non-comparable measures -- and potentially fail the
potentially fail the formal tests for being a metric. formal tests for being a metric.
@@@ A more serious problem is that most of the existing CC algorithms There is also evidence that most TCP implementations exhibit
@ do not assure that improving the properties of a path improves the non-linear performance over some portion of their operating region.
@ measure of that path. That is existing TCP implementations do not It is possible to construct simple simulation examples where
@ always have performance that monotonically increases with true path incremental improvements to a path (such as raising the link
@ capacity. data rate) result in lower overall TCP throughput [MathisIPPM1998?].
#
# OK. I'll leave that to you... I think it needs said and supported
# with some explanation. --allman
@ Next pass --MM--
Furthermore congestion control and related areas, including We believe that such non-linearity reflects weakness in our current
Integrated services[@@], differentiated services[@@] and Internet understanding of congestion control and is present to some extent in
traffic analysis[@@] are all currently receiving a lot of all TCP implementations and BTC metrics. Note that such
attention from the research community. It is very likely that we non-linearity (in either TCP or a BTC metric) is potentially
will see new experimental congestion control algorithms in the near problematic in the market because investment in capacity might
future. In addition, explicit congestion notification (ECN) actually reduce the perceived quality of the network. Ongoing
[RFC98] is being tested for Internet deployment. We do not yet research in congestion dynamics has some hope of mitigating or
know how any of these developments might affect BTC metrics. modeling these non-linearities.
Furthermore related areas, including Integrated services[@@],
differentiated services[@@] and Internet traffic analysis[@@] are
all currently receiving significant attention from the research
community. It is likely that we will see new experimental
congestion control algorithms in the near future. In addition,
Explicit Congestion Notification (ECN) [RFC2481] is being tested for
Internet deployment. We do not yet know how any of these
developments might affect BTC metrics.
This document defines a framework for standardizing multiple BTC This document defines a framework for standardizing multiple BTC
metrics that parallel the permitted transport diversity. Two metrics that parallel the permitted transport diversity. Two
approaches are used. First, each BTC metric must be much more approaches are used. First, each BTC metric must be much more
tightly specified than the typical IETF protocol. Pseudo-code or tightly specified than the typical IETF transport protocol.
reference implementations are expected to be the norm. Second, Pseudo-code or reference implementations are expected to be the
each BTC methodology is expected to collect some ancillary metrics norm. Second, each BTC methodology is expected to collect some
which are potentially useful to support analytical models of BTC. ancillary metrics which are potentially useful to support analytical
models of BTC. If a BTC methodology does not collect these
ancillary metrics, it should collect enough information such that
these metrics can be derived (for instance a segment trace file).
For example, the models in [PFTK98, MSMO97, OKM96a, Lak94] all For example, the models in [PFTK98, MSMO97, OKM96a, Lak94] all
predict bulk performance based on path properties such as loss predict bulk transfer performance based on path properties such as
rate, round trip time, etc. A BTC methodology which also provides loss rate and round trip time. A BTC methodology that also provides
ancillary measures of these properties is stronger because ancillary measures of these properties is stronger because agreement
agreement with the analytical models can be used to corroborate with the analytical models can be used to corroborate the direct BTC
the direct BTC measurement results. measurement results.
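As an illustration of how such a model can corroborate a direct BTC
measurement, the sketch below evaluates the commonly cited form of the
[MSMO97] model, rate ~ (MSS/RTT) * C / sqrt(p). The function name, the
parameter names and the default constant C = sqrt(3/2) (periodic loss,
one ACK per segment) are assumptions made for this example and are not
taken from this document.

```python
import math


def model_rate_bps(mss_bytes, rtt_s, loss_prob, c=math.sqrt(1.5)):
    """Predicted steady-state rate in bits per second.

    Assumed model form: rate = (MSS / RTT) * C / sqrt(p).
    The default C = sqrt(3/2) is an assumption for this sketch.
    """
    if rtt_s <= 0:
        raise ValueError("rtt_s must be positive")
    if not 0 < loss_prob <= 1:
        raise ValueError("loss_prob must be in (0, 1]")
    return 8 * mss_bytes / rtt_s * c / math.sqrt(loss_prob)


def relative_disagreement(measured_bps, predicted_bps):
    """Fractional difference between a measured BTC and the model."""
    return abs(measured_bps - predicted_bps) / predicted_bps
```

A large disagreement between the measured value and the prediction
suggests that either the ancillary measurements or the model's
assumptions do not hold for the path in question.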
More importantly these ancillary metrics are expected to be useful More importantly the ancillary metrics are expected to be useful for
for resolving disparity between different BTC metrics. For resolving disparity between different BTC methodologies. For
example, a path that predominantly experiences clustered packet example, a path that predominantly experiences clustered packet
losses is likely to exhibit vastly different measures from BTC losses is likely to exhibit vastly different measures from BTC
metrics that mimic Tahoe, Reno, NewReno, and SACK TCP metrics that mimic Tahoe, Reno, NewReno, and SACK TCP algorithms
algorithms [FF96]. The differences in the BTC metrics over [FF96]. The differences in the BTC metrics over such a path might
such a path might be diagnosed by an ancillary measure of loss be diagnosed by an ancillary measure of loss clustering.
clustering.
Furthermore there are some path properties which are best measured There are some path properties which are best measured as ancillary
as ancillary metrics to a transport protocol. Examples of such metrics to a transport protocol. Examples of such properties
properties include bottleneck queue limits or the tendency to include bottleneck queue limits or the tendency to reorder packets.
reorder packets. These are difficult or impossible to measure at These are difficult or impossible to measure at low rates and unsafe
low rates and unsafe to measure at rates higher than the bulk to measure at rates higher than the bulk transport capacity of the
transport capacity of the path. path.
It is expected that at some point in the future there will exist It is expected that at some point in the future there will exist an
an A-frame [RFC2330] which will unify all simple path metrics A-frame [RFC2330] which will unify all simple path metrics (e.g.,
(e.g., segment loss rates, round trip time) and BTC ancillary segment loss rates, round trip time) and BTC ancillary metrics
metrics (e.g. queue size and packet reordering) with different (e.g., queue size and packet reordering) with different versions of
versions of BTC metrics (e.g., that parallel Reno or SACK TCP). BTC metrics (e.g., that parallel Reno or SACK TCP).
2. Congestion Control Algorithms 2. Congestion Control Algorithms
Nearly all TCP implementations in use today are based on Nearly all TCP implementations in use today utilize the congestion
congestion control algorithms published in [Jac88] and further control algorithms published in [Jac88] and further refined in
refined in [RFC2001,RFC2001.bis]. In addition to the basic notion [RFC2581]. In addition to the basic notion of using an ACK clock,
of using an ACK clock, TCP (and therefore BTC) implements five TCP (and therefore BTC) implements five standard congestion control
standard congestion control algorithms: Congestion Avoidance, algorithms: Congestion Avoidance, Retransmission timeouts,
Retransmission timeouts, Slow-start, Fast Retransmit and Fast Slow-start, Fast Retransmit and Fast Recovery. All BTC
Recovery. All BTC implementations must use these algorithms as implementations must use these algorithms as they are defined in
they are defined in [RFC2001.bis]. However, in all cases a BTC [RFC2581] (which the reader is assumed to be familiar with).
metric must more tightly specify these algorithms, as discussed However, in all cases a BTC metric must more tightly specify these
below. algorithms, as discussed below.
2.1 Congestion Avoidance 2.1 Congestion Avoidance
The Congestion Avoidance algorithm drives the steady-state bulk The Congestion Avoidance algorithm drives the steady-state bulk
transfer behavior of TCP. It calls for opening the congestion transfer behavior of TCP. It calls for opening the congestion
window (cwnd) by a constant additive amount during each round trip window (cwnd) by a constant additive amount during each round trip
time (RTT), and closing it by a constant multiplicative fraction time (RTT), and closing cwnd by a constant multiplicative fraction
on congestion, as indicated by lost segments. The window closing on congestion, as indicated by lost segments or Explicit Congestion
is specified to be half the number of outstanding data segments in Notification messages [RFC2481]. The window closes by half the
flight when loss is detected. A BTC metric must specify the number of outstanding data segments in flight when loss is detected.
following Congestion Avoidance details: A BTC metric must specify the following Congestion Avoidance
details:
The exact algorithm for incrementing cwnd is left to the The exact algorithm for incrementing cwnd in TCP is left to the
implementer. Several candidate algorithms are outlined in implementer. Several candidate algorithms are outlined in
[RFC2001.bis]. In addition, some of these algorithms include some [RFC2581]. In addition, some of these algorithms include some
rounding. For these reasons, the exact algorithm for increasing rounding. For these reasons, the exact algorithm for increasing
cwnd during congestion avoidance must be fully specified for cwnd during congestion avoidance must be fully specified for
each BTC metric defined. each BTC metric defined.
[RFC2001.bis] permits an extra plus one segment window [RFC2581] permits, but does not require, an extra plus one
adjustment following the multiplicative closing of cwnd. This segment cwnd adjustment following the multiplicative decrease of
is because [RFC2001.bis] allows a single invocation of the Slow-Start cwnd. This is because [RFC2581] allows a single invocation of
algorithm when cwnd equals ssthresh at the end of the Slow-Start algorithm when cwnd equals ssthresh at the
recovery. end of recovery.
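The two details above can be made concrete with a minimal sketch. It
assumes byte-based windows, the per-ACK increase cwnd += SMSS*SMSS/cwnd
and the decrease ssthresh = max(FlightSize/2, 2*SMSS) from [RFC2581].
The function names and the extra_segment flag are hypothetical; a real
BTC metric would have to state its own rounding rule and whether the
extra segment is used.

```python
def ca_increment(cwnd, smss):
    """Per-ACK congestion avoidance increase, in bytes.

    Integer division is the rounding choice made in this sketch; a
    result of 0 is rounded up to 1 byte so the window always grows.
    """
    return cwnd + max(1, smss * smss // cwnd)


def on_loss(flight_size, smss, extra_segment=False):
    """Multiplicative decrease when loss is detected.

    Returns (cwnd, ssthresh) as they stand at the end of recovery.
    extra_segment models the optional plus-one-segment adjustment.
    """
    ssthresh = max(flight_size // 2, 2 * smss)
    cwnd = ssthresh + (smss if extra_segment else 0)
    return cwnd, ssthresh
```

Two BTC metrics that differ only in extra_segment, or only in the
rounding inside ca_increment, are different metrics under this
framework and should be named as such.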
2.2 Retransmission Timeouts 2.2 Retransmission Timeouts
In order to provide reliable data delivery, TCP resends a segment if In order to provide reliable data delivery, TCP resends a segment if
the ACK for the given segment does not arrive before the the ACK for the given segment does not arrive before the
retransmission timer (RTO) fires. A BTC metric must implement an retransmission timer (RTO) expires. A BTC metric must implement an
RTO timer to trigger retransmissions not handled by the fast RTO timer to trigger retransmissions not handled by the fast
retransmit algorithm. Such retransmissions can have a large impact retransmit algorithm. Such retransmissions can have a large impact
on the measured capacity. Calculating the RTO is subject to a on the measured BTC of the path. Calculating the RTO is subject to
number of details that are not standardized. When implementing a a number of details that are not standardized (however, [WS95]
BTC metric the details of the RTO calculation, how and when the outlines a popular implementation). When implementing a BTC metric
clock is set, as well as the clock granularity must be fully the details of the RTO calculation, how and when the clock is set,
documented. as well as the clock granularity must be fully documented.
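To show which RTO details need documenting, the sketch below uses a
Jacobson-style smoothed estimator. The gains (1/8 and 1/4), the factor
of 4 on the variance term, the initial RTO, the minimum RTO and the
rounding up to the clock granularity are all assumptions of this
example, not requirements of this document; they are exactly the kind
of parameters a BTC metric would have to pin down.

```python
import math


class RtoEstimator:
    """Hypothetical RTO calculator with an explicit clock granularity."""

    def __init__(self, granularity_s=0.5, min_rto_s=1.0, initial_rto_s=3.0):
        self.granularity_s = granularity_s
        self.min_rto_s = min_rto_s
        self.initial_rto_s = initial_rto_s
        self.srtt = None    # smoothed round trip time
        self.rttvar = None  # smoothed mean deviation

    def sample(self, rtt_s):
        """Fold in one RTT measurement and return the new RTO."""
        if self.srtt is None:
            self.srtt = rtt_s
            self.rttvar = rtt_s / 2
        else:
            self.rttvar = 0.75 * self.rttvar + 0.25 * abs(self.srtt - rtt_s)
            self.srtt = 0.875 * self.srtt + 0.125 * rtt_s
        return self.rto()

    def rto(self):
        """Current RTO, rounded up to a whole number of clock ticks."""
        if self.srtt is None:
            return self.initial_rto_s
        raw = self.srtt + 4 * self.rttvar
        ticks = math.ceil(raw / self.granularity_s)
        return max(self.min_rto_s, ticks * self.granularity_s)
```

With a coarse clock the rounding step can dominate the result on short
paths, which is why the granularity must be reported along with the
calculation itself.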
2.3 Slow Start 2.3 Slow Start
Slow start is part of TCP's transient behavior. It is used to Slow start is part of TCP's transient behavior. It is used to
quickly bring new or recently restarted connections up to an quickly increase the congestion window for new or recently restarted
appropriate congestion window. In addition, slow start is used to connections up to an appropriate level for the network path. In
restart the ACK clock after a retransmission timeout. A BTC addition, slow start is used to restart the ACK clock after a
implementation must use the slow start algorithm, as specified by retransmission timeout. A BTC implementation must use the slow
[RFC2001.bis]. The slow start algorithm is used while the congestion start algorithm, as specified by [RFC2581]. The slow start
window (cwnd) is less than the slow start threshold (ssthresh). algorithm is used while the congestion window (cwnd) is less than
However, whether to use slow start or congestion avoidance when cwnd the slow start threshold (ssthresh). However, whether to use slow
equals ssthresh is left to the implementer by [RFC2001.bis]. This start or congestion avoidance when cwnd equals ssthresh is left to
detail must be specified in every specific BTC metric definition. the implementer by [RFC2581]. This detail must be specified in
every specific BTC metric definition.
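The choice at cwnd == ssthresh can be expressed as a single parameter.
The following sketch assumes byte-based windows and an increase of one
SMSS per ACK during slow start; the function name and the
slow_start_on_tie flag are hypothetical and serve only to show where
the implementer's choice enters.

```python
def on_ack(cwnd, ssthresh, smss, slow_start_on_tie=True):
    """Return the new cwnd after an ACK that acknowledges new data."""
    in_slow_start = cwnd < ssthresh or (
        cwnd == ssthresh and slow_start_on_tie
    )
    if in_slow_start:
        # Slow start: grow by one segment per ACK.
        return cwnd + smss
    # Congestion avoidance: roughly one segment per round trip.
    return cwnd + max(1, smss * smss // cwnd)
```

For example, with cwnd == ssthresh == 5840 and SMSS == 1460, the two
settings of the flag give new windows of 7300 and 6205 bytes.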
2.4 Fast Retransmit/Fast Recovery 2.4 Fast Retransmit/Fast Recovery
The Fast Retransmit/Fast Recovery algorithms are used to infer The Fast Retransmit/Fast Recovery algorithms are used to infer
segment loss before the RTO expires. A BTC implementation must segment loss before the RTO expires. A BTC implementation must
implement the algorithms as defined in [RFC2001.bis]. implement the algorithms as defined in [RFC2581].
In Reno TCP, Fast Retransmit and Fast Recovery are used to support In Reno TCP, Fast Retransmit and Fast Recovery are used to support
the Congestion Avoidance algorithm during recovery from lost the Congestion Avoidance algorithm during loss recovery. During
segments. During Fast Recovery, the data receiver sends duplicated Fast Recovery, the data receiver sends duplicated acknowledgments,
acknowledgments. The data sender uses these duplicate ACKs to per the TCP specification [RFC793]. The data sender uses these
detect loss, to estimate the quantity of data in the network still duplicate ACKs to detect loss, to estimate the quantity of
pending delivery and to clock out new data in an effort to keep the outstanding data in the network and to clock out new data in an
ACK clock running. effort to keep the ACK clock running.
The Fast Retransmit/Fast Recovery algorithms should be implemented
in all BTC methodologies as specified in [RFC2581].
2.5 Advanced Recovery Algorithms 2.5 Advanced Recovery Algorithms
It has been observed that under some conditions the Fast It has been observed that under some conditions the Fast Retransmit
Retransmit and Fast Recovery algorithms do not reliably preserve and Fast Recovery algorithms do not reliably preserve TCP's
TCP's Self-Clock, causing unpredictable or unstable TCP Self-Clock, causing unpredictable or unstable TCP performance
performance [Lak94@@@check, Flo95]. Simulations of reference TCP [Lak94@@@check, Flo95]. Simulations of reference TCP
implementations have uncovered situations where incidental changes implementations have uncovered situations where incidental changes
in other parts of the network have a large effect on performance in the network path have a large effect on performance [MM96a].
[MM96a]. Other simulations have shown that under some Additional simulations have shown that under some conditions,
conditions, slightly better networks (higher bandwidth, lower slightly better networks (higher bandwidth, lower delay or less
delay or less load from other connections) yield lower throughput. competing traffic) yield lower throughput [MathisIPPMDec1998?].
@@@ This is pretty easy to construct, but has it been published?
# Not that I can think of off the top of my head... Maybe a concrete
# example to back up the claim? --allman
[RFC2001.bis] allows a TCP implementation to use more robust loss
recovery algorithms, such as NewReno type algorithms
[FH98,FF96,Hoe96] and SACK-based algorithms [FF96,MM96a,MM96b].
While allowing these algorithms, [RFC2001.bis] does not define any
such algorithm and therefore, a BTC metric that implements
advanced recovery algorithms must fully specify the details.
Note that since TCP based on standard Fast Retransmit and Fast [RFC2581] allows a TCP implementation to use more robust loss
Recovery sometimes exhibits erratic performance [MM96a], these recovery algorithms, such as NewReno [RFC2582,FF96,Hoe96] and
algorithms may prove to be unsuitable for use in a metric. SACK-based algorithms [FF96,MM96a,MM96b]. While allowing these
# Ouch... I know what you're saying, but... If the goal is to see what algorithms, [RFC2581] does not define any such algorithm and
# congestion-aware transport connection yields, I think the above is a therefore, a BTC metric that implements advanced loss recovery
# little harsh given the current standardized CC algorithms. algorithms must fully specify the details.
2.6 Segment Size 2.6 Segment Size
The actual segment size, or method of choosing a segment size The actual segment size, or method of choosing a segment size (e.g.,
(e.g., path MTU discovery [RFC1191]) and the number of header path MTU discovery [RFC1191]) and the number of header bytes assumed
bytes assumed to be prepended to each segment must be specified. to be prepended to each segment must be specified. In addition if
In addition if the segment size is artificially limited to less the segment size is artificially limited to less than the path MTU
than the path MTU this must be indicated. this must be indicated (if known).
3 Ancillary Metrics 3 Ancillary Metrics
The following ancillary metrics should be implemented in every BTC The following ancillary metrics can provide additional information
that can exhibit the relevant behaviors. Alternatively, the BTC about the network and the behavior of the implemented congestion
implementation should provide enough information that the following control algorithm in response to the behavior of the network path.
information can be gathered in post-processing (e.g., by providing a It is recommended that these metrics be built into each BTC
segment trace of the connection). methodology. Alternatively, the BTC implementation should provide
enough information such that the ancillary metrics can be derived
via post-processing (e.g., by providing a segment trace of the
connection).
3.1 Congestion Avoidance Capacity 3.1 Congestion Avoidance Capacity
Define a pure "Congestion Avoidance Capacity" (CAC) metric to be The "Congestion Avoidance Capacity" (CAC) metric is the data rate
the data rate (bits per second) of a fully specified (bits per second) of a fully specified implementation of the
implementation of the Congestion Avoidance algorithm, subject to Congestion Avoidance algorithm, subject to the restriction that the
the restriction that the Retransmission Timeout and Slow-Start Retransmission Timeout and Slow-Start algorithms are not invoked.
algorithms are not invoked. The CAC metric is defined to have no The CAC metric is defined to have no meaning across Retransmission
meaning across Retransmission Timeouts or Slow-Start (except the Timeouts or Slow-Start periods (except the single segment Slow-Start
single segment Slow-Start that is permitted to follow recovery). that is permitted to follow recovery, as discussed in section 2.3).
In principle a CAC metric would be an ideal BTC metric. But there In principle a CAC metric would be an ideal BTC metric, as it
is a rather substantial difficulty with using it as such. The captures what should be TCP's steady state behavior. But, there is
a rather substantial difficulty with using it as such. The
Self-Clocking of the Congestion Avoidance algorithm can be very Self-Clocking of the Congestion Avoidance algorithm can be very
fragile, depending on the specific details of the Fast Retransmit, fragile, depending on the specific details of the Fast Retransmit,
Fast Recovery or advanced recovery algorithms above. Fast Recovery or advanced recovery algorithms above. It has been
found that timeouts and periods of slow start loss recovery are
prevalent in traffic on the Internet [LK98] and therefore these
should be included in the BTC metric.
When TCP loses Self-Clock it is reestablished through a When TCP loses Self-Clock it is reestablished through a
retransmission timeout and Slow-Start. These algorithms nearly retransmission timeout and Slow-Start. These algorithms nearly
always take more time than Congestion Avoidance would have taken. always require more time than Congestion Avoidance would have taken.
It is easily observed that unless the network loses an entire window
It is easily observed that unless the network loses an entire of data (which would clearly require a retransmit timeout) TCP
window of data (which would clearly require a retransmit timeout) missed some opportunity to safely transmit data. That is, if TCP
TCP missed some opportunity to send data. That is, if TCP experiences a timeout after losing a partial window of data, it must
experiences a timeout after losing any partial window of data, it have received at least one ACK that was generated after some of the
must have received at least one ACK that was generated after some partial data was delivered, but did not trigger the transmission of
of the partial data was delivered, but did not trigger new data. Recent research in congestion control (e.g., FACK
transmitting any new data. Much recent research in congestion [MM96a], NewReno [FF96,RFC2582]) can be characterized as making
control (e.g., FACK[MM96a], NewReno[FH98], [LowWindow]) can be TCP's Self-Clock more tenacious, while preserving fairness under
characterized as making TCP's Self-Clock more tenacious, while adverse conditions. This work is often motivated by how poorly
preserving fairness under adverse conditions. This work is often current TCP implementations perform under some conditions, often due
motivated by how poorly current TCP implementations perform under to repeated clock loss. Since this is an active research area,
some conditions, often due to repeated clock loss. Since this is different TCP implementations have rather considerable differences
an active research area, different TCP implementations have rather in their ability to preserve Self-Clock.
considerable differences in their ability to preserve Self-Clock.
3.2 Ancillary metrics relating to the preservation of Self-Clock 3.2 Preservation of Self-Clock
Since losing the clock can have a large effect on the overall BTC, Losing the ACK clock can have a large effect on the overall BTC, and
and the clock is itself fragile in ways that are very dependent on the clock is itself fragile in ways that are dependent on the loss
the recovery algorithm, it is important that the transitions between recovery algorithm. Therefore, it is important that the transition
timer driven and Self-Clocked operation be instrumented. between timer driven and Self-Clocked operation be instrumented.
3.2.1 Lost transmission opportunities 3.2.1 Lost Transmission Opportunities
If the last event before a timeout was the receipt of an ACK that If the last event before a timeout was the receipt of an ACK that
did not trigger a retransmission, the possibility exists that did not trigger a retransmission, the possibility exists that an
some other congestion control algorithm would have successfully alternate congestion control algorithm would have successfully
preserved the Self-Clock. In this event, instrumenting key parts preserved the Self-Clock. In this event, instrumenting key parts of
of the BTC state (e.g., cwnd) may lead to further improvements in the BTC state (such as the congestion window) may lead to further
congestion control algorithms. improvements in congestion control algorithms.
Note that in the absence of knowledge about the future, it is not Note that in the absence of knowledge about the future, it is not
possible to design an algorithm that never misses transmission possible to design an algorithm that never misses transmission
opportunities. However, there are ever more subtle ways to gauge opportunities. However, there are ever more subtle ways to gauge
network state, and to estimate if a given ACK is likely to be the network state, and to estimate if a given ACK is likely to be the
last. last.
3.2.2 Losing an entire window 3.2.2 Losing an Entire Window
If an entire window of data (or ACKs) is lost, there will be no If an entire window of data (or ACKs) is lost, there will be no
returning ACKs to clock out additional data. This condition can returning ACKs to clock out additional data. This condition can
be detected if the last event before a timeout was a data be detected if the last event before a timeout was a data
transmission triggered by an ACK. The loss of an entire window transmission triggered by an ACK. The loss of an entire window
of data/ACKs forces recovery to be via a Retransmission Timeout and of data/ACKs forces recovery to be via a Retransmission Timeout and
Slow-Start. Slow-Start.
Losing an entire window of data implies an outage with a duration Losing an entire window of data implies an outage with a duration
at least as long as a round trip time. Such an outage can not be at least as long as a round trip time. Such an outage can not be
diagnosed with low rate metrics and is unsafe to diagnose at diagnosed with low rate metrics and is unsafe to diagnose at
higher rates than the BTC. Therefore all BTC metrics should higher rates than the BTC. Therefore all BTC metrics should
instrument and report losses of an entire window of data. instrument and report losses of an entire window of data.
There are some conditions, such as at very small window, in which Note that there are some conditions, such as when operating with a
there is a significant probability that an entire window can be very small window, in which there is a significant probability that
legitimately lost through individual random losses. an entire window can be lost through individual random losses.
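As an illustration of the two timeout cases described in 3.2.1 and
3.2.2, the following is a minimal sketch, not part of either draft,
of how a BTC implementation might label a timeout from the last
event recorded before the timer expired. The names LastEvent and
classify_timeout, and the returned labels, are hypothetical.

```python
from enum import Enum


class LastEvent(Enum):
    """Last event seen by the sender before the retransmission timer fired.

    These names are illustrative assumptions, not terms from the draft.
    """
    ACK_WITHOUT_RETRANSMIT = 1   # an ACK arrived and triggered no retransmission
    ACK_TRIGGERED_SEND = 2       # a data transmission was clocked out by an ACK
    OTHER = 3                    # anything else (e.g. a timer-driven send)


def classify_timeout(last_event):
    """Return an instrumentation label for a timeout.

    3.2.1: last event was an ACK that did not trigger a retransmission,
           so some other algorithm might have preserved the Self-Clock.
    3.2.2: last event was an ACK-triggered data transmission, so an
           entire window of data (or ACKs) appears to have been lost.
    """
    if last_event is LastEvent.ACK_WITHOUT_RETRANSMIT:
        return "lost_transmission_opportunity"
    if last_event is LastEvent.ACK_TRIGGERED_SEND:
        return "entire_window_lost"
    return "unclassified"
```

A metric could increment one counter per label and report the
counters alongside the BTC result.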
3.2.3 Heroic clock preservation 3.2.3 Heroic Clock Preservation
All algorithms that permit a given BTC to sustain Self-Clock when All algorithms that permit a given BTC to sustain Self-Clock when
other algorithms might not, should be instrumented. Furthermore, other algorithms might not, should be instrumented. Furthermore,
the details of the algorithms used must be fully documented. the details of the algorithms used must be fully documented.
BTC metrics that can sustain Self-Clock in the presence of BTC metrics that can sustain Self-Clock in the presence of multiple
multiple losses within one round trip should instrument the losses within one round trip should instrument the loss
loss distribution, such that the performance of Reno style distribution, such that the performance of Reno style bulk transport
bulk transport can be estimated. can be estimated.
BTC algorithms that can trigger fast retransmits earlier than
following three duplicate acknowledgments (e.g. at small
window [LowWindow]), should instrument and fully document
these events as well.
3.2.4 False timeouts 3.2.4 False Timeouts
All false timeouts (where the transmission timer expires before All false timeouts (where the retransmission timer expires before
the ACK for some previously transmitted data arrives) should be the ACK for some previously transmitted data arrives) should be
instrumented when possible. Note that depending upon how the BTC instrumented when possible. Note that depending upon how the BTC
metric implements sequence numbers, this may be difficult to metric implements sequence numbers, this may be difficult to detect.
detect.
3.3 Ancillary metrics relating to flow based path properties 3.3 Ancillary Metrics Relating to Flow Based Path Properties
All BTC metrics provide unique vantage points for instrumenting All BTC metrics provide unique vantage points for instrumenting
certain path properties relating to closely spaced packets. As in certain path properties relating to closely spaced packets. As in
the case of RTT duration outages, these can be impossible to the case of RTT duration outages, these can be impossible to
diagnose at low rates (less than 1 packet per RTT) and diagnose at low rates (less than 1 packet per RTT) and inappropriate
inappropriate to test at rates above the BTC. to test at rates above the BTC.
All BTC metrics should instrument packet reordering. The severity All BTC metrics should instrument packet reordering. The frequency
of the reordering can be classified as one of three different and distance out of sequence must be instrumented for all
cases, each of which should be instrumented. out-of-order packets. The severity of the reordering can be
classified as one of three different cases, each of which should be
reported.
Packets that are only slightly out of order should not trigger Packets that are only slightly out of order should not trigger
retransmission, but they may affect the window calculation. retransmission (via fast retransmit), but they may affect the
BTC metrics must document how slightly out-of-order packets window calculation. BTC metrics must document how slightly
affect the congestion window calculation. The frequency and out-of-order packets affect the congestion window calculation.
distance out of sequence must be instrumented for all
out-of-order packets.
If packets are sufficiently out-of-order, the Fast Retransmit If packets are sufficiently out-of-order, the Fast Retransmit
algorithm will be invoked in advance of the delayed packet's algorithm will be invoked in advance of the delayed packet's
late arrival. These events must be instrumented. late arrival. These events must be instrumented. Even though
Even though the late arriving packet will complete the the late arriving packet will complete recovery, the
recovery, the window must still be reduced by half. window will still be reduced by half.
Under some rare conditions packets have been observed that are Under some rare conditions packets have been observed that are
far out of order - sometimes many seconds late [Pax97b]. far out of order - sometimes many seconds late [Pax97b]. These
These should always be instrumented. should always be instrumented.
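The three reordering cases above could be separated along the
following lines. This is a sketch under stated assumptions, not a
definition from either draft: it assumes that each later-sequenced
packet arriving ahead of the delayed one produces one duplicate ACK,
that Fast Retransmit fires on three duplicate ACKs, and that "far out
of order" means at least FAR_LATE_SECONDS late. FAR_LATE_SECONDS and
all function and label names are hypothetical.

```python
DUPACK_THRESHOLD = 3      # duplicate ACKs that invoke Fast Retransmit
FAR_LATE_SECONDS = 1.0    # assumed cutoff for "far out of order"


def classify_reordering(distance, lateness_s):
    """Classify one received packet for reordering instrumentation.

    distance   -- number of later-sequenced packets that arrived first
    lateness_s -- how long after its in-order arrival time it arrived
    """
    if distance <= 0:
        return "in_order"
    if lateness_s >= FAR_LATE_SECONDS:
        # Rare case: packets arriving seconds late.
        return "far_out_of_order"
    if distance >= DUPACK_THRESHOLD:
        # Enough duplicate ACKs to retransmit before the packet shows up.
        return "triggers_fast_retransmit"
    # Does not trigger retransmission but may affect the window calculation.
    return "slightly_out_of_order"
```

Recording the distance for every out-of-order packet, together with
the label, would give both the frequency and the distance out of
sequence.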
The BTC should instrument the maximum cwnd observed during The BTC should instrument the maximum cwnd observed during
congestion avoidance and slow start. A TCP running over the same congestion avoidance and slow start. A TCP running over the same
path must have sufficient sender buffer space and receiver window path as the BTC must have sufficient sender buffer space and
(and window shift [RFC1323]) to cover this cwnd. receiver window (and window shift [RFC1323]) to cover this cwnd.
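To make the window requirement concrete, the sketch below estimates
the smallest RFC 1323 window shift that lets a TCP advertise a window
covering a given maximum cwnd, and checks the buffer sizes against
it. It relies on the 16-bit TCP window field (65535 bytes unscaled)
and the maximum shift of 14 from RFC 1323. The function names are
hypothetical.

```python
MAX_UNSCALED_WINDOW = 65535   # largest value of the 16-bit window field
MAX_WINDOW_SHIFT = 14         # largest shift count allowed by RFC 1323


def required_window_shift(max_cwnd_bytes):
    """Smallest shift such that the scaled window covers max_cwnd_bytes."""
    shift = 0
    while (MAX_UNSCALED_WINDOW << shift) < max_cwnd_bytes:
        shift += 1
        if shift > MAX_WINDOW_SHIFT:
            raise ValueError("cwnd exceeds the largest scalable window")
    return shift


def tcp_can_cover(max_cwnd_bytes, sender_buffer_bytes, receiver_window_bytes):
    """True if both sender buffer and receiver window cover the cwnd."""
    return (sender_buffer_bytes >= max_cwnd_bytes
            and receiver_window_bytes >= max_cwnd_bytes)
```

For example, a maximum cwnd of 1,000,000 bytes needs a shift of 4,
since 65535 << 3 is 524,280 and 65535 << 4 is 1,048,560.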
There are several other path properties that one might measure There are several other path properties that one might measure
within a BTC metric. For example, with an embedded one-way delay within a BTC metric. For example, with an embedded one-way delay
metric it may be possible to measure how queueing delay and metric it may be possible to measure how queueing delay and
(RED) drop probabilities are correlated to window size. (RED) drop probabilities are correlated to window size. These are
These are all open research questions. open research questions.
3.4 Ancillary metrics pertaining to MTU discovery 3.4 Ancillary Metrics Pertaining to MTU Discovery
Under some conditions, BTC can be very sensitive to segment size. Under some conditions, BTC can be very sensitive to segment size.
In addition to instrumenting the segment size, a BTC metric should In addition to instrumenting the segment size, a BTC metric should
indicate how it was selected: by path MTU discovery [RFC1191], a indicate how it was selected: by path MTU discovery [RFC1191], a
manual control, system default, or the maximum MTU for the manual configuration, system default, or the maximum MTU for the
interface. interface.
Note that the most popular LAN technologies have smaller MTUs Note that the most popular LAN technologies have smaller MTUs than
than nearly all WAN technologies. As a consequence, it is nearly all WAN technologies. As a consequence, it is difficult to
difficult to measure the true performance of a wide area path measure the true performance of a wide area path without subjecting
without subjecting it to the smaller MTU of the LAN. it to the smaller MTU of the LAN.
3.4 Ancillary metrics as calibration checks 3.4 Ancillary Metrics as Calibration Checks
Unlike low rate metrics, BTC must have explicit checks that the Unlike low rate metrics, BTC must have explicit checks that the
test platform is not the bottleneck, either due to insufficient test platform is not the bottleneck.
tester data rate or buffer space.
Ideally all queues within the tester should be instrumented. All Ideally all queues within the tester should be instrumented. All
packets dropped within the tester should be instrumented as tester packets dropped within the tester should be instrumented as tester
failures, invalidating a measurement. failures, invalidating a measurement.
The maximum queue lengths should be instrumented. Any significant The maximum queue lengths should be instrumented. Any significant
queue may indicate that the tester itself has insufficient burst queue may indicate that the tester itself has insufficient burst
data rate, and is slightly smoothing the data into the network. data rate, and is slightly smoothing the data into the network.
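A calibration check of this kind might look like the following
sketch. The threshold for a "significant" queue is an assumption
chosen only for illustration, and the function name and return shape
are hypothetical.

```python
SIGNIFICANT_QUEUE_PACKETS = 2   # assumed threshold, not from the draft


def check_tester(tester_drops, max_queue_len):
    """Return (valid, warnings) for one BTC measurement.

    tester_drops  -- packets dropped inside the tester itself
    max_queue_len -- largest queue length observed inside the tester
    """
    warnings = []
    # Any drop inside the tester is a tester failure and
    # invalidates the measurement.
    valid = tester_drops == 0
    if not valid:
        warnings.append("tester dropped packets: measurement invalid")
    # A significant queue suggests insufficient burst data rate.
    if max_queue_len >= SIGNIFICANT_QUEUE_PACKETS:
        warnings.append("significant tester queue: possible smoothing")
    return valid, warnings
```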
3.4.3 Validate Reverse path load 3.4.3 Validate Reverse path load
skipping to change at line 452 skipping to change at line 452
instrumented? instrumented?
# #
# Some implementations (mine!) have an annoying feature whereby ACK loss # Some implementations (mine!) have an annoying feature whereby ACK loss
# looks just like data loss. This should be documented. If ACK loss # looks just like data loss. This should be documented. If ACK loss
# and data loss can be detected separately, I think ACK loss rate should # and data loss can be detected separately, I think ACK loss rate should
# be reported, as it slightly changes the ACK clock (can impact # be reported, as it slightly changes the ACK clock (can impact
# algorithms like slow start that work on a per ACK basis and can make # algorithms like slow start that work on a per ACK basis and can make
# the sender more bursty, which could cause more loss). # the sender more bursty, which could cause more loss).
@ and mine --MM-- @ and mine --MM--
3.5 Ancillary metrics relating to the need for advanced TCP features 3.5 Ancillary Metrics Relating to the Need for Advanced TCP Features
If TCP would require RFC1323 features (window scaling, timestamp If TCP would require advanced TCP extensions to match BTC
based round trip time measurement, protection from wrapped performance (such as RFC 1323 or RFC 2018 features), it should be
sequences, etc) to match the BTC performance, it should be
reported. reported.
4 Acknowledgments 4 Acknowledgments
Jeff Semke, for numerous clarifications. Thanks to Jeff Semke for numerous clarifications.
5 References 5 References
[LowWindow] @@@@@ Current work
[FF96] Fall, K., Floyd, S.. "Simulation-based Comparisons of Tahoe, [FF96] Fall, K., Floyd, S.. "Simulation-based Comparisons of Tahoe,
Reno and SACK TCP". Computer Communication Review, July 1996. Reno and SACK TCP". Computer Communication Review, July 1996.
ftp://ftp.ee.lbl.gov/papers/sacks.ps.Z. ftp://ftp.ee.lbl.gov/papers/sacks.ps.Z.
[FH98] Floyd, S., Henderson, T., "The NewReno Modification to [Flo95] Floyd, S., "TCP and successive fast retransmits", March
TCP's Fast Recovery Algorithm", Work in progress 1995, Obtain via ftp://ftp.ee.lbl.gov/papers/fastretrans.ps.
draft-ietf-tcpimpl-newreno-00.txt
[Flo95] Floyd, S., "TCP and successive fast retransmits",
March 1995, Obtain via ftp://ftp.ee.lbl.gov/papers/fastretrans.ps.
[RF98] K. Ramakrishnan, S. Floyd, "A Proposal to add Explicit
Congestion Notification (ECN) to IP", Work in progress
draft-kksjf-ecn-03.txt
[Hoe96] Hoe, J., "Improving the start-up behavior of a congestion [Hoe96] Hoe, J., "Improving the start-up behavior of a congestion
control scheme for TCP", Proceedings of ACM SIGCOMM '96, control scheme for TCP", Proceedings of ACM SIGCOMM '96, August
August 1996. 1996.
[Hoe95] Hoe, J., "Startup dynamics of TCP's congestion control [Hoe95] Hoe, J., "Startup dynamics of TCP's congestion control and
and avoidance schemes". Master's thesis, Massachusetts Institute avoidance schemes". Master's thesis, Massachusetts Institute of
of Technology, June 1995. Technology, June 1995.
[Jac88] Jacobson, V., "Congestion Avoidance and Control", [Jac88] Jacobson, V., "Congestion Avoidance and Control",
Proceedings of SIGCOMM '88, Stanford, CA., August 1988. Proceedings of SIGCOMM '88, Stanford, CA., August 1988.
[Lak94] Lakshman, Effects of random loss [Lak94] Lakshman, Effects of random loss
[MM96a] Mathis, M. and Mahdavi, J. "Forward acknowledgment: [LK98] Lin, D. and Kung, H.T., "TCP Fast Recovery Strategies:
Refining TCP congestion control", Proceedings of ACM SIGCOMM '96, Analysis and Improvements", Proceedings of InfoCom, March 1998.
[MM96a] Mathis, M. and Mahdavi, J. "Forward acknowledgment: Refining
TCP congestion control", Proceedings of ACM SIGCOMM '96,
Stanford, CA., August 1996. Stanford, CA., August 1996.
[MM96b] M. Mathis, J. Mahdavi, "TCP Rate-Halving with Bounding [MM96b] M. Mathis, J. Mahdavi, "TCP Rate-Halving with Bounding
Parameters" Available from Parameters" Available from
http://www.psc.edu/networking/papers/FACKnotes/current. http://www.psc.edu/networking/papers/FACKnotes/current.
[MSMO97] Mathis, M., Semke, J., Mahdavi, J., Ott, T., [MSMO97] Mathis, M., Semke, J., Mahdavi, J., Ott, T., "The
"The Macroscopic Behavior of the TCP Congestion Avoidance Macroscopic Behavior of the TCP Congestion Avoidance Algorithm",
Algorithm", Computer Communications Review, 27(3), July 1997. Computer Communications Review, 27(3), July 1997.
[OKM96a], Ott, T., Kemperman, J., Mathis, M., "The Stationary [OKM96a], Ott, T., Kemperman, J., Mathis, M., "The Stationary
Behavior of Ideal TCP Congestion Avoidance", In progress, August Behavior of Ideal TCP Congestion Avoidance", In progress, August
1996. Obtain via pub/tjo/TCPwindow.ps using anonymous ftp to 1996. Obtain via pub/tjo/TCPwindow.ps using anonymous ftp to
ftp.bellcore.com ftp.bellcore.com
[OKM96b], Ott, T., Kemperman, J., Mathis, M., "Window Size [OKM96b], Ott, T., Kemperman, J., Mathis, M., "Window Size Behavior
Behavior in TCP/IP with Constant Loss Probability", DIMACS in TCP/IP with Constant Loss Probability", DIMACS Special Year
Special Year on Networks, Workshop on Performance of Real-Time on Networks, Workshop on Performance of Real-Time Applications
Applications on the Internet, Nov 1996. on the Internet, Nov 1996.
[Pax97a] Paxson, V., "Automated Packet Trace Analysis of TCP [Pax97a] Paxson, V., "Automated Packet Trace Analysis of TCP
Implementations", Proceedings of ACM SIGCOMM '97, August 1997. Implementations", Proceedings of ACM SIGCOMM '97, August 1997.
[Pax97b] Paxson, V., "End-to-End Internet Packet Dynamics," [Pax97b] Paxson, V., "End-to-End Internet Packet Dynamics,"
Proceedings of SIGCOMM '97, Cannes, France, Sep. 1997. Proceedings of SIGCOMM '97, Cannes, France, Sep. 1997.
[Pax97c] Paxson, V, editor "Known TCP Implementation Problems",
Work in progress: http://reality.sgi.com/sca/tcp-impl/prob-01.txt
[PFTK98] Padhye, J., Firoiu. V., Towsley, D., and Kurose, J., "TCP [PFTK98] Padhye, J., Firoiu. V., Towsley, D., and Kurose, J., "TCP
Throughput: A Simple Model and its Empirical Validation", Throughput: A Simple Model and its Empirical Validation",
Proceedings of ACM SIGCOMM '98, August 1998. Proceedings of ACM SIGCOMM '98, August 1998.
[RFC1191] Mogul, J., Deering, S., "Path MTU Discovery", [RFC793] Postel, J., "Transmission Control Protocol", 1981, Obtain
November 1990, Obtain via: via: ftp://ds.internic.net/rfc/rfc793.txt
ftp://ds.internic.net/rfc/rfc1191.txt
[RFC1323] Jacobson, V., Braden, R., Borman, D., "TCP Extensions [RFC1191] Mogul, J., Deering, S., "Path MTU Discovery", November
for High Performance", May 1992, Obtain via: 1990, Obtain via: ftp://ds.internic.net/rfc/rfc1191.txt
[RFC1323] Jacobson, V., Braden, R., Borman, D., "TCP Extensions for
High Performance", May 1992, Obtain via:
ftp://ds.internic.net/rfc/rfc1323.txt ftp://ds.internic.net/rfc/rfc1323.txt
[RFC2001] Stevens, W., "TCP Slow Start, Congestion Avoidance, [RFC2001] Stevens, W., "TCP Slow Start, Congestion Avoidance, Fast
Fast Retransmit, and Fast Recovery Algorithms", Retransmit, and Fast Recovery Algorithms", 1997, Obtain via:
ftp://ds.internic.net/rfc/rfc2001.txt ftp://ds.internic.net/rfc/rfc2001.txt
[RFC2001.bis] Allman, M., Paxson, V., Stevens, W., "TCP Congestion
Control". Work in progress draft-ietf-cong-control-01.txt, to
update RFC2001.
[RFC2018] Mathis, M., Mahdavi, J. Floyd, S., Romanow, A., "TCP [RFC2018] Mathis, M., Mahdavi, J. Floyd, S., Romanow, A., "TCP
Selective Acknowledgment Options", 1996, Obtain via: Selective Acknowledgment Options", 1996, Obtain via:
ftp://ds.internic.net/rfc/rfc2018.txt ftp://ds.internic.net/rfc/rfc2018.txt
[RFC2330] Paxson, V., Almes, G., Mahdavi, J., Mathis, M., [RFC2330] Paxson, V., Almes, G., Mahdavi, J., Mathis, M., "Framework
"Framework for IP Performance Metrics" , 1998, Obtain via: for IP Performance Metrics" , 1998, Obtain via:
ftp://ds.internic.net/rfc/rfc2330.txt ftp://ds.internic.net/rfc/rfc2330.txt
[Ste94] Stevens, W., "TCP/IP Illustrated, Volume 1: The [RFC2481] K. Ramakrishnan, S. Floyd, "A Proposal to add Explicit
Protocols", Addison-Wesley, 1994. Congestion Notification (ECN) to IP", 1999, Obtain via:
ftp://ds.internic.net/rfc/rfc2481.txt
[RFC2525] V. Paxson, M. Allman, S. Dawson, W. Fenner, J. Griner,
I. Heavens, K. Lahey, J. Semke, B. Volz, "Known TCP
Implementation Problems", 1999, Obtain via:
ftp://ds.internic.net/rfc/rfc2525.txt
[RFC2581] Allman, M., Paxson, V., Stevens, W., "TCP Congestion
Control"., 1999, Obtain via:
ftp://ds.internic.net/rfc/rfc2581.txt
[RFC2582] Floyd, S., Henderson, T., "The NewReno Modification to
TCP's Fast Recovery Algorithm", 1999, Obtain via:
ftp://ds.internic.net/rfc/rfc2582.txt
[Ste94] Stevens, W., "TCP/IP Illustrated, Volume 1: The Protocols",
Addison-Wesley, 1994.
[WS95] Wright, G., Stevens, W., "TCP/IP Illustrated Volume II: The
Implementation", Addison-Wesley, 1995.
Authors' Addresses Authors' Addresses
Matt Mathis Matt Mathis
Pittsburgh Supercomputing Center Pittsburgh Supercomputing Center
4400 Fifth Ave. 4400 Fifth Ave.
Pittsburgh PA 15213 Pittsburgh PA 15213
mathis@psc.edu mathis@psc.edu
http://www.psc.edu/~mathis http://www.psc.edu/~mathis
Mark Allman Mark Allman
NASA Lewis Research Center/Sterling Software NASA Glenn Research Center/GTE Internetworking
Lewis Field
21000 Brookpark Rd. MS 54-2 21000 Brookpark Rd. MS 54-2
Cleveland, OH 44135 Cleveland, OH 44135
216-433-6586 216-433-6586
mallman@lerc.nasa.gov mallman@grc.nasa.gov
http://gigahertz.lerc.nasa.gov/~mallman http://roland.grc.nasa.gov/~mallman
