draft-ietf-ippm-btc-framework-01.txt   draft-ietf-ippm-btc-framework-02.txt 
INTERNET-DRAFT Expires Jan. 2000 INTERNET-DRAFT INTERNET-DRAFT Expires Apr. 2000 INTERNET-DRAFT
Network Working Group Matt Mathis Network Working Group Matt Mathis
INTERNET-DRAFT Pittsburgh Supercomputing Center INTERNET-DRAFT Pittsburgh Supercomputing Center
Expiration Date: Jan. 2000 Mark Allman Expiration Date: Apr. 2000 Mark Allman
NASA Glenn NASA Glenn/BBN
June, 1999 October, 1999
Empirical Bulk Transfer Capacity Empirical Bulk Transfer Capacity
< draft-ietf-ippm-btc-framework-01.txt > < draft-ietf-ippm-btc-framework-02.txt >
Status of this Document Status of this Document
This document is an Internet-Draft and is in full conformance with This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026. all provisions of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as other groups may also distribute working documents as
Internet-Drafts. Internet-Drafts.
skipping to change at line 60 skipping to change at line 60
This document defines a framework for standardizing multiple BTC This document defines a framework for standardizing multiple BTC
metrics that parallel the permitted transport diversity. Two metrics that parallel the permitted transport diversity. Two
approaches are used. First, each BTC metric must be much more approaches are used. First, each BTC metric must be much more
tightly specified than the typical IETF protocol. Pseudo-code or tightly specified than the typical IETF protocol. Pseudo-code or
reference implementations are expected to be the norm. Second, each reference implementations are expected to be the norm. Second, each
BTC methodology is expected to collect some ancillary metrics which BTC methodology is expected to collect some ancillary metrics which
are potentially useful to support analytical models of BTC. are potentially useful to support analytical models of BTC.
1. Introduction 1. Introduction
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC2119]. Although
[RFC2119] was written with protocols in mind, the key words are used
in this document for similar reasons. They are used to ensure that
each BTC methodology defined contains specific pieces of
information.
Bulk Transport Capacity (BTC) is a measure of a network's ability to Bulk Transport Capacity (BTC) is a measure of a network's ability to
transfer significant quantities of data with a single transfer significant quantities of data with a single
congestion-aware transport connection (e.g., TCP). For many congestion-aware transport connection (e.g., TCP). For many
applications the BTC of the underlying network dominates the overall applications the BTC of the underlying network dominates the overall
elapsed time for the application to run and thus dominates the elapsed time for the application to run and thus dominates the
performance as perceived by a user. Examples of such applications performance as perceived by a user. Examples of such applications
include FTP, and the world wide web when delivering large images or include FTP, and the world wide web when delivering large images or
documents. documents. The intuitive definition of BTC is the expected long
term average data rate (bits per second) of a single ideal TCP
The intuitive definition of BTC is the expected long term average implementation over the path in question.
data rate (bits per second) of a single ideal TCP implementation
over the path in question.
Central to the notion of bulk transport capacity is the idea that Central to the notion of bulk transport capacity is the idea that
all transport protocols should have similar responses to congestion all transport protocols should have similar responses to congestion
in the Internet. Indeed the only form of equity significantly in the Internet. Indeed the only form of equity significantly
deployed in the Internet today is that the vast majority of all deployed in the Internet today is that the vast majority of all
traffic is carried by TCP implementations sharing common congestion traffic is carried by TCP implementations sharing common congestion
control algorithms largely due to a shared developmental heritage. control algorithms largely due to a shared developmental heritage.
[RFC2581] specifies the standard congestion control algorithms used [RFC2581] specifies the standard congestion control algorithms used
by these TCP implementations. Even though this document is a by TCP implementations. Even though this document is a (proposed)
(proposed) standard, it permits considerable latitude in standard, it permits considerable latitude in implementation. This
implementation. This latitude is by design, to encourage ongoing latitude is by design, to encourage ongoing evolution in congestion
evolution in congestion control algorithms. control algorithms.
This legal diversity in transport algorithms creates a difficulty This legal diversity in congestion control algorithms creates a
for standardizing BTC metrics because the allowed diversity is difficulty for standardizing BTC metrics because the allowed
sufficient to lead to situations where different implementations diversity is sufficient to lead to situations where different
will yield non-comparable measures -- and potentially fail the implementations will yield non-comparable measures -- and
formal tests for being a metric. potentially fail the formal tests for being a metric.
There is also evidence that most TCP implementations exhibit There is also evidence that most TCP implementations exhibit
non-linear performance over some portion of their operating region. non-linear performance over some portion of their operating region.
It is possible to construct simple simulation examples where It is possible to construct simple simulation examples where
incremental improvements to a path (such as raising the link incremental improvements to a path (such as raising the link data
data rate) results in lower overall TCP throughput [MathisIPPM1998?]. rate) results in lower overall TCP throughput (or BTC) [Mat98].
We beleive that such non-linearity reflects weakness in our current We believe that such non-linearity reflects weakness in our current
understanding of congestion control and is present to some extent in understanding of congestion control and is present to some extent in
all TCP implementations and BTC metrics. Note that such all TCP implementations and BTC metrics. Note that such
non-linearity (in either TCP or a BTC metric) is potentially non-linearity (in either TCP or a BTC metric) is potentially
problematic in the market because investment in capacity might problematic in the market because investment in capacity might
actually reduce the preceived quality of the network. Ongoing actually reduce the perceived quality of the network. Ongoing
research in congestion dynamics has some hope of mitigating or research in congestion dynamics has some hope of mitigating or
modeling these non-linearities. modeling these non-linearities.
Furthermore related areas, including Integrated services[@@], Furthermore related areas, including Integrated services
differentiated services[@@] and Internet traffic analysis[@@] are [RFC1633,RFC2216], differentiated services [RFC2475] and Internet
all currently receiving significant attention from the research traffic analysis [MSMO97,PFTK98,Pax97b,LM97] are all currently
community. It is likely that we will see new experimental receiving significant attention from the research community. It is
congestion control algorithms in the near future. In addition, likely that we will see new experimental congestion control
Explicit Congestion Notification (ECN) [RFC2481] is being tested for algorithms in the near future. In addition, Explicit Congestion
Internet deployment. We do not yet know how any of these Notification (ECN) [RFC2481] is being tested for Internet
developments might affect BTC metrics. deployment. We do not yet know how any of these developments might
affect BTC metrics.
This document defines a framework for standardizing multiple BTC This document defines a framework for standardizing multiple BTC
metrics that parallel the permitted transport diversity. Two metrics that parallel the permitted transport diversity. Two
approaches are used. First, each BTC metric must be much more approaches are used. First, each BTC metric must be much more
tightly specified than the typical IETF transport protocol. tightly specified than the typical IETF transport protocol.
Pseudo-code or reference implementations are expected to be the Pseudo-code or reference implementations are expected to be the
norm. Second, each BTC methodology is expected to collect some norm. Second, each BTC methodology is expected to collect some
ancillary metrics which are potentially useful to support analytical ancillary metrics which are potentially useful to support analytical
models of BTC. If a BTC methodology does not collect these models of BTC. If a BTC methodology does not collect these
ancillary metrics, it should collect enough information such that ancillary metrics, it should collect enough information such that
these metrics can be derived (for instance a segment trace file). these metrics can be derived (for instance a segment trace file).
For example, the models in [PFTK98, MSMO97, OKM96a, Lak94] all As an example, the models in [PFTK98, MSMO97, OKM96a, Lak94] all
predict bulk transfer performance based on path properties such as predict bulk transfer performance based on path properties such as
loss rate and round trip time. A BTC methodology that also provides loss rate and round trip time. A BTC methodology that also provides
ancillary measures of these properties is stronger because agreement ancillary measures of these properties is stronger because agreement
with the analytical models can be used to corroborate the direct BTC with the analytical models can be used to corroborate the direct BTC
measurement results. measurement results.
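As a hedged illustration of how ancillary measures could corroborate a
direct BTC measurement, the sketch below evaluates the commonly cited
square-root form of the model in [MSMO97]: rate = (MSS/RTT) * C/sqrt(p).
The function name, the parameter names and the default constant
C = sqrt(3/2) are assumptions made for this example; the cited papers
give the exact conditions under which each constant applies.

```python
import math


def model_btc_bps(mss_bytes: float, rtt_s: float, loss_rate: float,
                  c: float = math.sqrt(1.5)) -> float:
    """Predicted bulk transfer rate in bits per second.

    Square-root model: (MSS / RTT) * C / sqrt(p).
    The default C is an assumption (periodic loss, one ACK per segment).
    """
    if rtt_s <= 0 or not (0 < loss_rate <= 1):
        raise ValueError("rtt_s must be > 0 and loss_rate in (0, 1]")
    return (mss_bytes * 8.0 / rtt_s) * c / math.sqrt(loss_rate)


if __name__ == "__main__":
    # Hypothetical ancillary measurements: 1460 byte segments,
    # 100 ms round trip time, 1% segment loss.
    print(model_btc_bps(1460, 0.100, 0.01))
```

A measured BTC that differs greatly from such a prediction suggests
that either the measurement or the model's assumptions deserve a
closer look.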
More importantly the ancillary metrics are expected to be useful for More importantly the ancillary metrics are expected to be useful for
resolving disparity between different BTC methodologies. For resolving disparity between different BTC methodologies. For
example, a path that predominantly experiences clustered packet example, a path that predominantly experiences clustered packet
losses is likely to exhibit vastly different measures from BTC losses is likely to exhibit vastly different measures from BTC
skipping to change at line 159 skipping to change at line 166
It is expected that at some point in the future there will exist an It is expected that at some point in the future there will exist an
A-frame [RFC2330] which will unify all simple path metrics (e.g., A-frame [RFC2330] which will unify all simple path metrics (e.g.,
segment loss rates, round trip time) and BTC ancillary metrics segment loss rates, round trip time) and BTC ancillary metrics
(e.g., queue size and packet reordering) with different versions of (e.g., queue size and packet reordering) with different versions of
BTC metrics (e.g., that parallel Reno or SACK TCP). BTC metrics (e.g., that parallel Reno or SACK TCP).
2. Congestion Control Algorithms 2. Congestion Control Algorithms
Nearly all TCP implementations in use today utilize the congestion Nearly all TCP implementations in use today utilize the congestion
control algorithms published in [Jac88] and further refined in control algorithms published in [Jac88] and further refined in
[RFC2581]. In addition to the basic notion of using an ACK clock, [RFC2581]. In addition to using the basic notion of using an ACK
TCP (and therefore BTC) implements five standard congestion control clock, TCP (and therefore BTC) implements five standard congestion
algorithms: Congestion Avoidance, Retransmission timeouts, control algorithms: Congestion Avoidance, Retransmission timeouts,
Slow-start, Fast Retransmit and Fast Recovery. All BTC Slow-start, Fast Retransmit and Fast Recovery. All BTC
implementations must use these algorithms as they are defined in implementations MUST implement slow start and congestion avoidance,
[RFC2581] (which the reader is assumed to be familiar with). as specified in [RFC2581] (with extra details also specified, as
However, in all cases a BTC metric must more tightly specify these outlined below). All BTC methodologies SHOULD implement fast
algorithms, as discussed below. retransmit and fast recovery as outlined in [RFC2581]. Finally, all
BTC methodologies MUST implement a retransmission timeout.
2.1 Congestion Avoidance
The Congestion Avoidance algorithm drives the steady-state bulk
transfer behavior of TCP. It calls for opening the congestion
window (cwnd) by a constant additive amount during each round trip
time (RTT), and closing cwnd by a constant multiplicative fraction
on congestion, as indicated by lost segments or Explicit Congestion
Notification messages [RFC2481]. The window closes by half the
number of outstanding data segments in flight when loss is detected.
A BTC metric must specify the following Congestion Avoidance
details:
The exact algorithm for incrementing cwnd in TCP is left to the
implementer. Several candidate algorithms are outlined in
[RFC2581]. In addition, some of these algorithms include some
rounding. For these reasons, the exact algorithm for increasing
cwnd during congestion avoidance must be fully specified for
each BTC metric defined.
[RFC2581] permits, but does not require, an extra plus one
segment cwnd adjustment following the multiplicative decrease of
cwnd. This is because [RFC2581] allows a single invocation of
the Slow-Start algorithm when when cwnd equals ssthresh at the
end of recovery.
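The following is a minimal sketch of one way a BTC metric might pin
down these details. It uses the per-ACK increase formula and the
ssthresh reduction given in [RFC2581], with integer byte arithmetic and
the round-up-to-one-byte rule. The function names are hypothetical, and
a real BTC metric would have to state its own choices explicitly.

```python
def increase_cwnd_on_ack(cwnd: int, smss: int) -> int:
    """Per-ACK congestion avoidance increase, all values in bytes.

    Uses cwnd += SMSS*SMSS/cwnd with integer division. If the
    increment rounds down to 0 it is raised to 1 byte so that very
    large windows still grow.
    """
    increment = (smss * smss) // cwnd
    return cwnd + max(increment, 1)


def ssthresh_after_loss(flight_size: int, smss: int) -> int:
    """Multiplicative decrease: half the data in flight, but never
    less than two segments."""
    return max(flight_size // 2, 2 * smss)


if __name__ == "__main__":
    smss = 1460
    cwnd = 10 * smss
    cwnd = increase_cwnd_on_ack(cwnd, smss)
    print(cwnd, ssthresh_after_loss(cwnd, smss))
```

Whether cwnd is set to ssthresh or to ssthresh plus one segment after
the decrease is exactly the kind of detail that still has to be
documented separately.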
2.2 Retransmission Timeouts
In order to provide reliable data delivery, TCP resends a segment if
the ACK for the given segment does not arrive before the
retransmission timer (RTO) expires. A BTC metric must implement an
RTO timer to trigger retransmissions not handled by the fast
retransmit algorithm. Such retransmissions can have a large impact
on the measured BTC of the path. Calculating the RTO is subject to
a number of details that are not standardized (however, [WS95]
outlines a popular implementation). When implementing a BTC metric
the details of the RTO calculation, how and when the clock is set,
as well as the clock granularity must be fully documented.
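To make the list of RTO details concrete, the sketch below shows one
possible calculation in the style of [Jac88]: a smoothed RTT, a
smoothed deviation, and a result rounded up to the clock granularity.
The gains (1/8 and 1/4), the factor of 4, the 3 second initial value
and the 500 ms granularity are assumptions chosen for illustration,
not values required by this document.

```python
import math


class RtoEstimator:
    """Hypothetical RTO calculation for a BTC implementation."""

    def __init__(self, granularity_s: float = 0.5,
                 initial_rto_s: float = 3.0) -> None:
        self.granularity_s = granularity_s
        self.srtt = None      # smoothed round trip time (seconds)
        self.rttvar = None    # smoothed RTT deviation (seconds)
        self.rto = initial_rto_s

    def on_rtt_sample(self, rtt_s: float) -> float:
        """Fold one RTT measurement into the estimate; return the RTO."""
        if self.srtt is None:
            # First sample initializes both estimators.
            self.srtt = rtt_s
            self.rttvar = rtt_s / 2.0
        else:
            self.rttvar = 0.75 * self.rttvar + 0.25 * abs(self.srtt - rtt_s)
            self.srtt = 0.875 * self.srtt + 0.125 * rtt_s
        raw = self.srtt + 4.0 * self.rttvar
        # Round up to a whole number of clock ticks.
        ticks = math.ceil(raw / self.granularity_s)
        self.rto = ticks * self.granularity_s
        return self.rto


if __name__ == "__main__":
    est = RtoEstimator()
    for sample in (1.0, 1.0, 1.2):
        print(est.on_rtt_sample(sample))
```

Every constant in the sketch, as well as the rule for when the timer
is started and restarted, would need to appear in the documentation of
a specific BTC metric.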
2.3 Slow Start
Slow start is part of TCP's transient behavior. It is used to
quickly increase the congestion window for new or recently restarted
connections up to an appropriate level for the network path. In
addition, slow start is used to restart the ACK clock after a
retransmission timeout. A BTC implementation must use the slow
start algorithm, as specified by [RFC2581]. The slow start
algorithm is used while the congestion window (cwnd) is less than
the slow start threshold (ssthresh). However, whether to use slow
start or congestion avoidance when cwnd equals ssthresh is left to
the implementer by [RFC2581]. This detail must be specified in
every specific BTC metric definition.
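The choice left open when cwnd equals ssthresh can be expressed as a
single explicit parameter. The sketch below is illustrative only; the
function name, the flag name and the returned labels are hypothetical.

```python
def cwnd_phase(cwnd: int, ssthresh: int, slow_start_on_tie: bool) -> str:
    """Select the cwnd increase algorithm for the next ACK.

    slow_start_on_tie records the metric's documented choice for the
    case cwnd == ssthresh, which [RFC2581] leaves to the implementer.
    """
    if cwnd < ssthresh:
        return "slow_start"
    if cwnd > ssthresh:
        return "congestion_avoidance"
    return "slow_start" if slow_start_on_tie else "congestion_avoidance"


if __name__ == "__main__":
    print(cwnd_phase(4380, 4380, slow_start_on_tie=True))
    print(cwnd_phase(4380, 4380, slow_start_on_tie=False))
```

Making the tie-break a named parameter forces each BTC metric
definition to state the value it uses.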
2.4 Fast Retransmit/Fast Recovery
The Fast Retransmit/Fast Recovery algorithms are used to infer
segment loss before the RTO expires. A BTC implementation must
implement the algorithms as defined in [RFC2581].
In Reno TCP, Fast Retransmit and Fast Recovery are used to support The algorithms specified in [RFC2581] give implementers some choices
the Congestion Avoidance algorithm during loss recovery. During in the details of the implementation. The following is a list of
Fast Recovery, the data receiver sends duplicated acknowledgments, details about the congestion control algorithms that are either
per the TCP specification [RFC793]. The data sender uses these underspecified in [RFC2581] or very important to define when
duplicate ACKs to detect loss, to estimate the quantity of constructing a BTC methodology. These details MUST be specifically
outstanding data in the network and to clock out new data in an defined in each BTC methodology.
effort to keep the ACK clock running.
The Fast Retransmit/Fast Recovery algorithms should be implemented * [RFC2581] does not standardize a specific algorithm for
in all BTC methodologies as specified in [RFC2581]. increasing cwnd during congestion avoidance. Several candidate
algorithms are given in [RFC2581].
2.5 Advanced Recovery Algorithms * [RFC2581] does not specify which cwnd increase algorithm (slow
start or congestion avoidance) should be used when cwnd equals
ssthresh.
It has been observed that under some conditions the Fast Retransmit * [RFC2581] allows TCPs to use advanced loss recovery mechanism
and Fast Recovery algorithms do not reliably preserve TCP's such as NewReno [RFC2582,FF96,Hoe96] and SACK-based algorithms
Self-Clock, causing unpredictable or unstable TCP performance [FF96,MM96a,MM96b]. If used in a BTC implementation, such an
[Lak94@@@check, Flo95]. Simulations of reference TCP algorithm MUST be fully defined.
implementations have uncovered situations where incidental changes
in the network path have a large effect on performance [MM96a].
Additional simulations have shown that under some conditions,
slightly better networks (higher bandwidth, lower delay or less
competing traffic) yield lower throughput [MathisIPPMDec1998?].
[RFC2581] allows a TCP implementation to use more robust loss * The actual segment size, or method of choosing a segment size
recovery algorithms, such as NewReno [RFC2582,FF96,Hoe96] and (e.g., path MTU discovery [RFC1191]) and the number of header
SACK-based algorithms [FF96,MM96a,MM96b]. While allowing these bytes assumed to be prepended to each segment MUST be specified.
algorithms, [RFC2581] does not define any such algorithm and In addition, if the segment size is artificially limited to less
therefore, a BTC metric that implements advanced loss recovery than the path MTU this MUST be indicated (if known).
algorithms must fully specify the details.
2.6 Segment Size * TCP includes a retransmission timeout (RTO) to trigger
retransmissions of segments that have not been acknowledged
within an appropriate amount of time and have not been
retransmitted via some more advanced loss recovery algorithm. A
BTC implementation MUST include a retransmission timer.
Calculating the RTO is subject to a number of details that MUST
be defined for each BTC metric. In addition, a BTC metric MUST
define when the clock is set and the granularity of the clock.
The actual segment size, or method of choosing a segment size (e.g., Note [WS95] outlines a popular implementation of the
path MTU discovery [RFC1191]) and the number of header bytes assumed retransmission timer. Also, a specification for the behavior of
to be prepended to each segment must be specified. In addition if the retransmission timer is currently being written for TCP
the segment size is artificially limited to less than the path MTU [PA99]. If adopted this specification would apply to BTC
this must be indicated (if known). implementation, as well.
3 Ancillary Metrics 3 Ancillary Metrics
The following ancillary metrics can provide additional information The following ancillary metrics can provide additional information
about the network and the behavior of the implemented congestion about the network and the behavior of the implemented congestion
control algorithm in response to the behavior of the network path. control algorithms in response to the behavior of the network path.
It is recommended that these metrics be built into each BTC It is RECOMMENDED that these metrics be built into each BTC
methodology. Alternatively, the BTC implementation should provide methodology. Alternatively, it is RECOMMENDED that the BTC
enough information such that the ancillary metrics can be derived implementation provide enough information such that the ancillary
via post-processing (e.g., by providing a segment trace of the metrics can be derived via post-processing (e.g., by providing a
connection). segment trace of the connection).
3.1 Congestion Avoidance Capacity 3.1 Congestion Avoidance Capacity
The "Congestion Avoidance Capacity" (CAC) metric is the data rate The "Congestion Avoidance Capacity" (CAC) metric is the data rate
(bits per second) of a fully specified implementation of the (bits per second) of a fully specified implementation of the
Congestion Avoidance algorithm, subject to the restriction that the Congestion Avoidance algorithm, subject to the restriction that the
Retransmission Timeout and Slow-Start algorithms are not invoked. Retransmission Timeout and Slow-Start algorithms are not invoked.
The CAC metric is defined to have no meaning across Retransmission The CAC metric is defined to have no meaning across Retransmission
Timeouts or Slow-Start periods (except the single segment Slow-Start Timeouts or Slow-Start periods (except the single segment Slow-Start
that is permitted to follow recovery, as discussed in section 2.3). that is permitted to follow recovery, as discussed in section 2.3).
In principle a CAC metric would be an ideal BTC metric, as it In principle a CAC metric would be an ideal BTC metric, as it
captures what should be TCP's steady state behavior. But, there is captures what should be TCP's steady state behavior. But, there is
a rather substantial difficulty with using it as such. The a rather substantial difficulty with using it as such. The
Self-Clocking of the Congestion Avoidance algorithm can be very Self-Clocking of the Congestion Avoidance algorithm can be very
fragile, depending on the specific details of the Fast Retransmit, fragile, depending on the specific details of the Fast Retransmit,
Fast Recovery or advanced recovery algorithms above. It has been Fast Recovery or advanced recovery algorithms above. It has been
found that timeouts and periods of slow start loss recovery are found that timeouts and periods of slow start loss recovery are
prevalent in traffic on the Internet [LK98] and therefore these prevalent in traffic on the Internet [LK98] and therefore these
should be included in the BTC metric. should be captured by the BTC metric.
When TCP looses Self-Clock it is reestablished through a When TCP loses Self-Clock it is re-established through a
retransmission timeout and Slow-Start. These algorithms nearly retransmission timeout and Slow-Start. These algorithms nearly
always require more time than Congestion Avoidance would have taken. always require more time than Congestion Avoidance would have taken.
It is easily observed that unless the network loses an entire window It is easily observed that unless the network loses an entire window
of data (which would clearly require a retransmit timeout) TCP of data (which would clearly require a retransmit timeout) TCP
missed some opportunity to safely transmit data. That is, if TCP missed some opportunity to safely transmit data. That is, if TCP
experiences a timeout after losing a partial window of data, it must experiences a timeout after losing a partial window of data, it must
have received at least one ACK that was generated after some of the have received at least one ACK that was generated after some of the
partial data was delivered, but did not trigger the transmission of partial data was delivered, but did not trigger the transmission of
new data. Recent research in congestion control (e.g., FACK new data. Recent research in congestion control (e.g., FACK
[MM96a], NewReno [FF96,RFC2582]) can be characterized as making [MM96a], NewReno [FF96,RFC2582], rate-halving [MSML99]) can be
TCP's Self-Clock more tenacious, while preserving fairness under characterized as making TCP's Self-Clock more tenacious, while
adverse conditions. This work is often motivated by how poorly preserving fairness under adverse conditions. This work is
current TCP implementations perform under some conditions, often due motivated by how poorly current TCP implementations perform under
to repeated clock loss. Since this is an active research area, some conditions, often due to repeated clock loss. Since this is an
different TCP implementations have rather considerable differences active research area, different TCP implementations have rather
in their ability to preserve Self-Clock. considerable differences in their ability to preserve Self-Clock.
3.2 Preservation of Self-Clock 3.2 Preservation of Self-Clock
Losing the ACK clock can have a large effect on the overall BTC, and Losing the ACK clock can have a large effect on the overall BTC, and
the clock is itself fragile in ways that are dependent on the loss the clock is itself fragile in ways that are dependent on the loss
recovery algorithm. Therefore, it is important that the transition recovery algorithm. Therefore, the transition between timer driven
between timer driven and Self-Clocked operation be instrumented. and Self-Clocked operation SHOULD be instrumented.
3.2.1 Lost Transmission Opportunities 3.2.1 Lost Transmission Opportunities
If the last event before a timeout was the receipt of an ACK that If the last event before a timeout was the receipt of an ACK that
did not trigger a retransmission, the possibility exists that an did not trigger a transmission, the possibility exists that an
alternate congestion control algorithm would have successfully alternate congestion control algorithm would have successfully
preserved the Self-Clock. In this event, instrumenting key parts of preserved the Self-Clock. A BTC SHOULD instrument key items in the
the BTC state (such as the congestion window) may lead to further BTC state (such as the congestion window) in the hopes that this may
improvements in congestion control algorithms. lead to further improvements in congestion control algorithms.
Note that in the absence of knowledge about the future, it is not Note that in the absence of knowledge about the future, it is not
possible to design an algorithm that never misses transmission possible to design an algorithm that never misses transmission
opportunities. However, there are ever more subtle ways to gauge opportunities. However, there are ever more subtle ways to gauge
network state, and to estimate if a given ACK is likely to be the network state, and to estimate if a given ACK is likely to be the
last. last.
3.2.2 Losing an Entire Window 3.2.2 Losing an Entire Window
If an entire window of data (or ACKs) is lost, there will be no If an entire window of data (or ACKs) is lost, there will be no
returning ACKs to clock out additional data. This condition can returning ACKs to clock out additional data. This condition can
be detected if the last event before a timeout was a data be detected if the last event before a timeout was a data
transmission triggered by an ACK. The loss of an entire window transmission triggered by an ACK. The loss of an entire window
of data/ACKs forces recovery to be via a Retransmission Timeout and of data/ACKs forces recovery to be via a Retransmission Timeout and
Slow-Start. Slow-Start.
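The detection rules in sections 3.2.1 and 3.2.2 both depend on the
last event seen before a timeout, so they can be instrumented with one
small classifier. The sketch below is a hedged illustration; the event
names and result labels are hypothetical and not part of any BTC
definition.

```python
from enum import Enum


class LastEvent(Enum):
    """Last event observed by the sender before the RTO expired."""
    ACK_WITHOUT_TRANSMISSION = 1    # an ACK arrived but sent no data
    ACK_TRIGGERED_TRANSMISSION = 2  # an ACK arrived and clocked out data
    OTHER = 3


def classify_timeout(last_event: LastEvent) -> str:
    """Label a timeout for the Self-Clock instrumentation."""
    if last_event is LastEvent.ACK_WITHOUT_TRANSMISSION:
        # Another algorithm might have kept the Self-Clock running.
        return "lost_transmission_opportunity"
    if last_event is LastEvent.ACK_TRIGGERED_TRANSMISSION:
        # No ACKs returned after the last window was sent.
        return "entire_window_lost"
    return "unclassified"


if __name__ == "__main__":
    print(classify_timeout(LastEvent.ACK_TRIGGERED_TRANSMISSION))
```

Recording cwnd alongside each label would show whether whole-window
losses cluster at very small windows.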
Losing an entire window of data implies an outage with a duration Losing an entire window of data implies an outage with a duration at
at least as long as a round trip time. Such an outage can not be least as long as a round trip time. Such an outage can not be
diagnosed with low rate metrics and is unsafe to diagnose at diagnosed with low rate metrics and is unsafe to diagnose at higher
higher rates than the BTC. Therefore all BTC metrics at should rates than the BTC. Therefore all BTC metrics SHOULD instrument and
instrument and report losses of an entire window of data. report losses of an entire window of data.
Note that there are some conditions, such as when operating with a Note that there are some conditions, such as when operating with a
very small window, in which there is a significant probability that very small window, in which there is a significant probability that
an entire window can be lost through individual random losses. an entire window can be lost through individual random losses (again
highlighting the importance of instrumenting cwnd).
3.2.3 Heroic Clock Preservation 3.2.3 Heroic Clock Preservation
All algorithms that permit a given BTC to sustain Self-Clock when All algorithms that permit a given BTC to sustain Self-Clock when
other algorithms might not, should be instrumented. Furthermore, other algorithms might not, SHOULD be instrumented. Furthermore,
the details of the algorithms used must be fully documented. the details of the algorithms used MUST be fully documented (as
discussed in section 2).
BTC metrics that can sustain Self-Clock in the presence of multiple BTC metrics that can sustain Self-Clock in the presence of multiple
losses within one round trip should instrument the loss losses within one round trip SHOULD instrument the loss
distribution, such that the performance of Reno style bulk transport distribution, such that the performance of alternate congestion
can be estimated. control algorithms may be estimated (e.g., Reno style).
3.2.4 False Timeouts 3.2.4 False Timeouts
All false timeouts (where the retransmission timer expires before All false timeouts (where the retransmission timer expires before
the ACK for some previously transmitted data arrives) should be the ACK for some previously transmitted data arrives) SHOULD be
instrumented when possible. Note that depending upon how the BTC instrumented when possible. Note that depending upon how the BTC
metric implements sequence numbers, this may be difficult to detect. metric implements sequence numbers, this may be difficult to detect.
3.3 Ancillary Metrics Relating to Flow Based Path Properties 3.3 Ancillary Metrics Relating to Flow Based Path Properties
All BTC metrics provide unique vantage points for instrumenting All BTC metrics provide unique vantage points for observing certain
certain path properties relating to closely spaced packets. As in path properties relating to closely spaced packets. As in the case
the case of RTT duration outages, these can be impossible to of RTT duration outages, these can be impossible to diagnose at low
diagnose at low rates (less than 1 packet per RTT) and inappropriate rates (less than 1 packet per RTT) and inappropriate to test at
to test at rates above the BTC. rates above the BTC of the network path.
All BTC metrics should instrument packet reordering. The frequency All BTC metrics SHOULD instrument packet reordering. The frequency
and distance out of sequence must be instrumented for all and distance out-of-sequence SHOULD be instrumented for all
out-of-order packets. The severity of the reordering can be out-of-order packets. The severity of the reordering can be
classified as one of three different cases, each of which should be classified as one of three different cases, each of which SHOULD be
reported. reported.
Packets that are only slightly out of order should not trigger Packets that are only slightly out-of-order should not trigger
retransmission (via fast retransmit), but they may affect the the fast retransmit algorithm, but they may affect the window
window calculation. BTC metrics must document how slightly calculation. BTC metrics SHOULD document how slightly
out-of-order packets affect the congestion window calculation. out-of-order packets affect the congestion window calculation.
If packets are sufficiently out-of-order, the Fast Retransmit If packets are sufficiently out-of-order, the Fast Retransmit
algorithm will be invoked in advance of the delayed packet's algorithm will be invoked in advance of the delayed packet's
late arrival. These events must be instrumented. Even though late arrival. These events SHOULD be instrumented. Even though
the late arriving packet will complete recovery, the the late arriving packet will complete recovery, the
window will still be reduced by half. window will still be reduced by half.
Under some rare conditions packets have been observed that are Under some rare conditions packets have been observed that are
far out of order - sometimes many seconds late [Pax97b]. These far out of order - sometimes many seconds late [Pax97b]. These
should always be instrumented. SHOULD always be instrumented.
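The three reordering cases above can be illustrated with a minimal sketch. It assumes the receiver logs (arrival time, sequence number) pairs. The names classify_reordering, ReorderEvent, and FAR_LATE_SECONDS are hypothetical and appear in neither draft; the one-second cutoff for "far out of order" is an arbitrary illustrative value. Using three higher-numbered arrivals as the fast retransmit boundary approximates the standard duplicate ACK threshold.

```python
from dataclasses import dataclass

DUPACK_THRESHOLD = 3     # standard fast retransmit duplicate ACK threshold
FAR_LATE_SECONDS = 1.0   # assumed cutoff for "far out of order" (illustrative)


@dataclass
class ReorderEvent:
    seq: int
    distance: int     # higher-numbered packets that arrived before this one
    lateness: float   # seconds since the first higher-numbered packet arrived
    severity: str     # "slight", "fast_retransmit", or "far"


def classify_reordering(arrivals):
    """Classify out-of-order packets.

    arrivals: list of (arrival_time_seconds, seq) in arrival order.
    Returns one ReorderEvent per out-of-order packet.
    """
    events = []
    seen = []  # earlier arrivals as (time, seq)
    for t, seq in arrivals:
        # Packets with a higher sequence number that already arrived.
        overtakers = [(pt, ps) for pt, ps in seen if ps > seq]
        if overtakers:
            distance = len(overtakers)
            lateness = t - min(pt for pt, _ in overtakers)
            if lateness >= FAR_LATE_SECONDS:
                severity = "far"
            elif distance >= DUPACK_THRESHOLD:
                severity = "fast_retransmit"
            else:
                severity = "slight"
            events.append(ReorderEvent(seq, distance, lateness, severity))
        seen.append((t, seq))
    return events


def reordering_frequency(arrivals):
    """Fraction of received packets that arrived out of order."""
    if not arrivals:
        return 0.0
    return len(classify_reordering(arrivals)) / len(arrivals)
```

A metric would report the frequency together with the per-event distance and severity. The quadratic scan keeps the sketch short; a real implementation would track only the highest sequence number seen and a bounded history.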
The BTC should instrument the maximum cwnd observed during BTC implementations SHOULD instrument the maximum cwnd observed
congestion avoidance and slow start. A TCP running over the same during congestion avoidance and slow start. A TCP running over the
path as the BTC must have sufficient sender buffer space and same path as the BTC metric must have sufficient sender buffer space
receiver window (and window shift [RFC1323]) to cover this cwnd. and receiver window (and window shift [RFC1323]) to cover this cwnd
in order to expect the same performance.
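As a rough illustration of the buffer requirement above, the following sketch derives the smallest window shift [RFC1323] needed for a receiver window to cover an observed maximum cwnd. The function name required_window_shift is hypothetical. The 16-bit window field and the maximum shift of 14 come from RFC 1323; treating cwnd as a byte count is an assumption of this sketch.

```python
MAX_UNSCALED_WINDOW = 65535   # largest value of the 16-bit TCP window field
MAX_WINDOW_SHIFT = 14         # largest shift permitted by RFC 1323


def required_window_shift(max_cwnd_bytes):
    """Return the smallest window shift whose scaled window covers max_cwnd_bytes."""
    shift = 0
    while (MAX_UNSCALED_WINDOW << shift) < max_cwnd_bytes:
        shift += 1
        if shift > MAX_WINDOW_SHIFT:
            raise ValueError("cwnd exceeds the largest scalable window")
    return shift
```

A result greater than zero indicates that a TCP without window scaling could not keep this cwnd in flight on the same path. Sender and receiver buffers would need to be at least max_cwnd_bytes as well.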
There are several other path properties that one might measure There are several other path properties that one might measure
within a BTC metric. For example, with an embedded one-way delay within a BTC metric. For example, with an embedded one-way delay
metric it may be possible to measure how queueing delay and metric it may be possible to measure how queueing delay and
(RED) drop probabilities are correlated to window size. These are (RED) drop probabilities are correlated to window size. These are
open research questions. open research questions.
3.4 Ancillary Metrics Pertaining to MTU Discovery
Under some conditions, BTC can be very sensitive to segment size.
In addition to instrumenting the segment size, a BTC metric should
indicate how it was selected: by path MTU discovery [RFC1191], a
manual configuration, system default, or the maximum MTU for the
interface.
Note that the most popular LAN technologies have smaller MTUs than
nearly all WAN technologies. As a consequence, it is difficult to
measure the true performance of a wide area path without subjecting
it to the smaller MTU of the LAN.
3.4 Ancillary Metrics as Calibration Checks 3.4 Ancillary Metrics as Calibration Checks
Unlike low rate metrics, BTC must have explicit checks that the Unlike low rate metrics, BTC SHOULD include explicit checks that the
test platform is not the bottleneck. test platform is not the bottleneck.
Ideally all queues within the tester should be instrumented. All Any detected dropped packets within the sending host MUST be reported.
packets dropped within the tester should be instrumented as tester Unless the sending interface is the path bottleneck, any dropped
failures, invalidating a measurement. packets probably indicate a measurement failure.
The maximum queue lengths should be instrumented. Any significant
queue may indicate that the tester itself has insufficient burst
data rate, and is slightly smoothing the data into the network.
3.4.3 Validate Reverse path load
@@@@ What happens to a BTC when the reverse path is congested? Is The maximum queue lengths within the sending host SHOULD be
this identical to TCP? What should happen? How should it be instrumented. Any significant queue may indicate that the sending
instrumented? host has insufficient burst data rate, and is smoothing the data
# being transmitted into the network.
# Some implementations (mine!) have an annoying feature whereby ACK loss
# looks just like data loss. This should be documented. If ACK loss
# and data loss can be detected separately, I think ACK loss rate should
# be reported, as it slightly changes the ACK clock (can impact
# algorithms like slow start that work on a per ACK basis and can make
# the sender more bursty, which could cause more loss).
@ and mine --MM--
3.5 Ancillary Metrics Relating to the Need for Advanced TCP Features 3.5 Ancillary Metrics Relating to the Need for Advanced TCP Features
If TCP would require advanced TCP extensions to match BTC If TCP would require advanced TCP extensions to match BTC
performance (such as RFC 1323 or RFC 2018 features), it should be performance (such as RFC 1323 or RFC 2018 features), it SHOULD be
reported. reported.
3.6 Validate Reverse Path Load
To the extent possible, the BTC metric SHOULD distinguish between
the properties of the forward and reverse paths.
BTC methodologies which rely on non-cooperating receivers may only
be able to measure round trip path properties and may not be able to
independently differentiate between the properties of the forward
and reverse paths. In this case the load on the reverse path
contributed by the BTC metric SHOULD be instrumented (or computed)
to permit other means of gauging the proportion of the round trip path
properties attributed to the forward and reverse paths.
To the extent possible, BTC methodologies that rely on cooperating
receivers SHOULD support separate ancillary metrics for the forward
and reverse paths.
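Where the reverse path load is computed rather than instrumented, one possible estimate is sketched below. It assumes the only reverse traffic generated by the metric is acknowledgments, sent at a fixed ratio to data segments. The function name reverse_path_load_bps and its defaults (one ACK per two segments, 40-byte ACK packets) are illustrative assumptions, not values taken from either draft.

```python
def reverse_path_load_bps(data_pkts_per_sec, segments_per_ack=2, ack_bytes=40):
    """Estimate the reverse path load, in bits per second, caused by ACKs.

    data_pkts_per_sec: forward data packet rate observed by the sender.
    segments_per_ack:  assumed data segments covered by each ACK.
    ack_bytes:         assumed size of one ACK packet, headers included.
    """
    if segments_per_ack <= 0:
        raise ValueError("segments_per_ack must be positive")
    acks_per_sec = data_pkts_per_sec / segments_per_ack
    return acks_per_sec * ack_bytes * 8
```

For example, 1000 data packets per second under these defaults corresponds to 500 ACKs per second, or 160000 bits per second on the reverse path.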
4 Acknowledgments 4 Acknowledgments
Thanks to Jeff Semke for numerous clarifications. Thanks to Jeff Semke for numerous clarifications.
5 References 5 References
[FF96] Fall, K., Floyd, S., "Simulation-based Comparisons of Tahoe, [FF96] Fall, K., Floyd, S., "Simulation-based Comparisons of Tahoe,
Reno and SACK TCP". Computer Communication Review, July 1996. Reno and SACK TCP". Computer Communication Review, July 1996.
ftp://ftp.ee.lbl.gov/papers/sacks.ps.Z. ftp://ftp.ee.lbl.gov/papers/sacks.ps.Z.
skipping to change at line 487 skipping to change at line 434
Technology, June 1995. Technology, June 1995.
[Jac88] Jacobson, V., "Congestion Avoidance and Control", [Jac88] Jacobson, V., "Congestion Avoidance and Control",
Proceedings of SIGCOMM '88, Stanford, CA., August 1988. Proceedings of SIGCOMM '88, Stanford, CA., August 1988.
[Lak94] Lakshman, Effects of random loss [Lak94] Lakshman, Effects of random loss
[LK98] Lin, D. and Kung, H.T., "TCP Fast Recovery Strategies: [LK98] Lin, D. and Kung, H.T., "TCP Fast Recovery Strategies:
Analysis and Improvements", Proceedings of InfoCom, March 1998. Analysis and Improvements", Proceedings of InfoCom, March 1998.
[LM97] T.V.Lakshman and U.Madhow. "The Performance of TCP/IP for
Networks with High Bandwidth-Delay Products and Random Loss".
IEEE/ACM Transactions on Networking, Vol. 5, No. 3, June 1997,
pp.336-350.
[Mat98] Mathis, M., "Empirical Bulk Transfer Capacity", IP
Performance Metrics Working Group report in Proceedings of the
Forty Third Internet Engineering Task Force, Orlando, FL,
December 1998. Available from
http://www.ietf.org/proceedings/98dec/43rd-ietf-98dec-122.html
and
http://www.ietf.org/proceedings/98dec/slides/ippm-mathis-98dec.pdf.
[MM96a] Mathis, M. and Mahdavi, J. "Forward acknowledgment: Refining [MM96a] Mathis, M. and Mahdavi, J. "Forward acknowledgment: Refining
TCP congestion control", Proceedings of ACM SIGCOMM '96, TCP congestion control", Proceedings of ACM SIGCOMM '96,
Stanford, CA., August 1996. Stanford, CA., August 1996.
[MM96b] M. Mathis, J. Mahdavi, "TCP Rate-Halving with Bounding [MM96b] M. Mathis, J. Mahdavi, "TCP Rate-Halving with Bounding
Parameters" Available from Parameters" Available from
http://www.psc.edu/networking/papers/FACKnotes/current. http://www.psc.edu/networking/papers/FACKnotes/current.
[MSML99] Mathis, M., Semke, J., Mahdavi, J., Lahey, K., "The
Rate-Halving Algorithm for TCP Congestion Control", June 1999.
Internet-Draft draft-mathis-tcp-ratehalving-00.txt (work in
progress).
[MSMO97] Mathis, M., Semke, J., Mahdavi, J., Ott, T., "The [MSMO97] Mathis, M., Semke, J., Mahdavi, J., Ott, T., "The
Macroscopic Behavior of the TCP Congestion Avoidance Algorithm", Macroscopic Behavior of the TCP Congestion Avoidance Algorithm",
Computer Communications Review, 27(3), July 1997. Computer Communications Review, 27(3), July 1997.
[OKM96a], Ott, T., Kemperman, J., Mathis, M., "The Stationary [OKM96a], Ott, T., Kemperman, J., Mathis, M., "The Stationary
Behavior of Ideal TCP Congestion Avoidance", In progress, August Behavior of Ideal TCP Congestion Avoidance", In progress, August
1996. Obtain via pub/tjo/TCPwindow.ps using anonymous ftp to 1996. Obtain via pub/tjo/TCPwindow.ps using anonymous ftp to
ftp.bellcore.com ftp.bellcore.com
[OKM96b], Ott, T., Kemperman, J., Mathis, M., "Window Size Behavior [OKM96b], Ott, T., Kemperman, J., Mathis, M., "Window Size Behavior
in TCP/IP with Constant Loss Probability", DIMACS Special Year in TCP/IP with Constant Loss Probability", DIMACS Special Year
on Networks, Workshop on Performance of Real-Time Applications on Networks, Workshop on Performance of Real-Time Applications
on the Internet, Nov 1996. on the Internet, Nov 1996.
[PA99] Paxson, V., Allman, M., "Computing TCP's Retransmission
Timer", October 1999. Internet-Draft draft-paxson-tcp-rto-00.txt
(work in progress).
[Pax97a] Paxson, V., "Automated Packet Trace Analysis of TCP [Pax97a] Paxson, V., "Automated Packet Trace Analysis of TCP
Implementations", Proceedings of ACM SIGCOMM '97, August 1997. Implementations", Proceedings of ACM SIGCOMM '97, August 1997.
[Pax97b] Paxson, V., "End-to-End Internet Packet Dynamics," [Pax97b] Paxson, V., "End-to-End Internet Packet Dynamics,"
Proceedings of SIGCOMM '97, Cannes, France, Sep. 1997. Proceedings of SIGCOMM '97, Cannes, France, Sep. 1997.
[PFTK98] Padhye, J., Firoiu, V., Towsley, D., and Kurose, J., "TCP [PFTK98] Padhye, J., Firoiu, V., Towsley, D., and Kurose, J., "TCP
Throughput: A Simple Model and its Empirical Validation", Throughput: A Simple Model and its Empirical Validation",
Proceedings of ACM SIGCOMM '98, August 1998. Proceedings of ACM SIGCOMM '98, August 1998.
[RFC793] Postel, J., "Transmission Control Protocol", 1981, Obtain [RFC793] Postel, J., "Transmission Control Protocol", 1981, Obtain
via: ftp://ds.internic.net/rfc/rfc793.txt via: ftp://ds.internic.net/rfc/rfc793.txt
[RFC1191] Mogul, J., Deering, S., "Path MTU Discovery", November [RFC1191] Mogul, J., Deering, S., "Path MTU Discovery", November
1990, Obtain via: ftp://ds.internic.net/rfc/rfc1191.txt 1990, Obtain via: ftp://ds.internic.net/rfc/rfc1191.txt
[RFC1323] Jacobson, V., Braden, R., Borman, D., "TCP Extensions for [RFC1323] Jacobson, V., Braden, R., Borman, D., "TCP Extensions for
High Performance", May 1992, Obtain via: High Performance", May 1992, Obtain via:
ftp://ds.internic.net/rfc/rfc1323.txt ftp://ds.internic.net/rfc/rfc1323.txt
[RFC1633] Braden R., Clark D., Shenker S., "Integrated Services in
the Internet Architecture: an Overview"., 1994.
[RFC2001] Stevens, W., "TCP Slow Start, Congestion Avoidance, Fast [RFC2001] Stevens, W., "TCP Slow Start, Congestion Avoidance, Fast
Retransmit, and Fast Recovery Algorithms", 1997, Obtain via: Retransmit, and Fast Recovery Algorithms", 1997, Obtain via:
ftp://ds.internic.net/rfc/rfc2001.txt ftp://ds.internic.net/rfc/rfc2001.txt
[RFC2018] Mathis, M., Mahdavi, J. Floyd, S., Romanow, A., "TCP [RFC2018] Mathis, M., Mahdavi, J. Floyd, S., Romanow, A., "TCP
Selective Acknowledgment Options", 1996, Obtain via: Selective Acknowledgment Options", 1996, Obtain via:
ftp://ds.internic.net/rfc/rfc2018.txt ftp://ds.internic.net/rfc/rfc2018.txt
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", 1997, Obtain via:
ftp://ds.internic.net/rfc/rfc2119.txt
[RFC2216] Shenker, S., Wroclawski, J., "Network Element Service
Specification Template", 1997, Obtain via:
ftp://ds.internic.net/rfc/rfc2216.txt
[RFC2330] Paxson, V., Almes, G., Mahdavi, J., Mathis, M., "Framework [RFC2330] Paxson, V., Almes, G., Mahdavi, J., Mathis, M., "Framework
for IP Performance Metrics" , 1998, Obtain via: for IP Performance Metrics" , 1998, Obtain via:
ftp://ds.internic.net/rfc/rfc2330.txt ftp://ds.internic.net/rfc/rfc2330.txt
[RFC2475] Black D., Blake S., Carlson M., Davies E., Wang Z., Weiss
W., "An Architecture for Differentiated Services"., 1998.
[RFC2481] K. Ramakrishnan, S. Floyd, "A Proposal to add Explicit [RFC2481] K. Ramakrishnan, S. Floyd, "A Proposal to add Explicit
Congestion Notification (ECN) to IP", 1999, Obtain via: Congestion Notification (ECN) to IP", 1999, Obtain via:
ftp://ds.internic.net/rfc/rfc2481.txt ftp://ds.internic.net/rfc/rfc2481.txt
[RFC2525] V. Paxson, M. Allman, S. Dawson, W. Fenner, J. Griner, [RFC2525] V. Paxson, M. Allman, S. Dawson, W. Fenner, J. Griner,
I. Heavens, K. Lahey, J. Semke, B. Volz, "Known TCP I. Heavens, K. Lahey, J. Semke, B. Volz, "Known TCP
Implementation Problems", 1999, Obtain via: Implementation Problems", 1999, Obtain via:
ftp://ds.internic.net/rfc/rfc2525.txt ftp://ds.internic.net/rfc/rfc2525.txt
[RFC2581] Allman, M., Paxson, V., Stevens, W., "TCP Congestion [RFC2581] Allman, M., Paxson, V., Stevens, W., "TCP Congestion
skipping to change at line 574 skipping to change at line 557
Author's Addresses Author's Addresses
Matt Mathis Matt Mathis
Pittsburgh Supercomputing Center Pittsburgh Supercomputing Center
4400 Fifth Ave. 4400 Fifth Ave.
Pittsburgh PA 15213 Pittsburgh PA 15213
mathis@psc.edu mathis@psc.edu
http://www.psc.edu/~mathis http://www.psc.edu/~mathis
Mark Allman Mark Allman
NASA Glenn Research Center/GTE Internetworking NASA Glenn Research Center/BBN Technologies
Lewis Field Lewis Field
21000 Brookpark Rd. MS 54-2 21000 Brookpark Rd. MS 54-2
Cleveland, OH 44135 Cleveland, OH 44135
216-433-6586 216-433-6586
mallman@grc.nasa.gov mallman@grc.nasa.gov
http://roland.grc.nasa.gov/~mallman http://roland.grc.nasa.gov/~mallman
 End of changes. 
