draft-ietf-tcpm-1323bis-04.txt   draft-ietf-tcpm-1323bis-05.txt 
TCP Maintenance (TCPM) D. Borman TCP Maintenance (TCPM) D. Borman
Internet-Draft Quantum Corporation Internet-Draft Quantum Corporation
Intended status: Standards Track B. Braden Intended status: Standards Track B. Braden
Expires: February 3, 2013 University of Southern Expires: August 17, 2013 University of Southern
California California
V. Jacobson V. Jacobson
Packet Design Packet Design
R. Scheffenegger, Ed. R. Scheffenegger, Ed.
NetApp, Inc. NetApp, Inc.
August 2, 2012 February 13, 2013
TCP Extensions for High Performance TCP Extensions for High Performance
draft-ietf-tcpm-1323bis-04 draft-ietf-tcpm-1323bis-05
Abstract Abstract
This memo presents a set of TCP extensions to improve performance This document specifies a set of TCP extensions to improve
over large bandwidth*delay product paths and to provide reliable performance over paths with a large bandwidth*delay product and to
operation over very high-speed paths. It defines TCP options for provide reliable operation over very high-speed paths. It defines
scaled windows and timestamps, which are designed to provide TCP options for scaled windows and timestamps. The timestamps are
compatible interworking with TCP's that do not implement the used for two distinct mechanisms, RTTM (Round Trip Time Measurement)
extensions. The timestamps are used for two distinct mechanisms: and PAWS (Protection Against Wrapped Sequences).
RTTM (Round Trip Time Measurement) and PAWS (Protection Against
Wrapped Sequences). Selective acknowledgments are not included in
this memo.
This memo updates and obsoletes RFC 1323. This document updates and obsoletes RFC 1323.
Status of this Memo Status of this Memo
This Internet-Draft is submitted in full conformance with the This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79. provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet- working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/. Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
This Internet-Draft will expire on February 3, 2013. This Internet-Draft will expire on August 17, 2013.
Copyright Notice Copyright Notice
Copyright (c) 2012 IETF Trust and the persons identified as the
Copyright (c) 2013 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of (http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License. described in the Simplified BSD License.
Table of Contents Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1. TCP Performance . . . . . . . . . . . . . . . . . . . . . 4 1.1. TCP Performance . . . . . . . . . . . . . . . . . . . . . 4
1.2. TCP Reliability . . . . . . . . . . . . . . . . . . . . . 6 1.2. TCP Reliability . . . . . . . . . . . . . . . . . . . . . 5
1.3. Using TCP options . . . . . . . . . . . . . . . . . . . . 9 1.3. Using TCP options . . . . . . . . . . . . . . . . . . . . 6
2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 10 2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 7
3. TCP Window Scale Option . . . . . . . . . . . . . . . . . . . 10 3. TCP Window Scale Option . . . . . . . . . . . . . . . . . . . 7
3.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 10 3.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 7
3.2. Window Scale Option . . . . . . . . . . . . . . . . . . . 10 3.2. Window Scale Option . . . . . . . . . . . . . . . . . . . 7
3.3. Using the Window Scale Option . . . . . . . . . . . . . . 11 3.3. Using the Window Scale Option . . . . . . . . . . . . . . 8
3.4. Addressing Window Retraction . . . . . . . . . . . . . . . 13 3.4. Addressing Window Retraction . . . . . . . . . . . . . . . 10
4. RTTM -- Round-Trip Time Measurement . . . . . . . . . . . . . 14 4. RTTM -- Round-Trip Time Measurement . . . . . . . . . . . . . 11
4.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 14 4.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 11
4.2. TCP Timestamps Option . . . . . . . . . . . . . . . . . . 15 4.2. TCP Timestamps Option . . . . . . . . . . . . . . . . . . 12
4.3. The RTTM Mechanism . . . . . . . . . . . . . . . . . . . . 16 4.3. The RTTM Mechanism . . . . . . . . . . . . . . . . . . . . 13
4.4. Which Timestamp to Echo . . . . . . . . . . . . . . . . . 17 4.4. Which Timestamp to Echo . . . . . . . . . . . . . . . . . 14
5. PAWS -- Protection Against Wrapped Sequence Numbers . . . . . 20 5. PAWS -- Protection Against Wrapped Sequence Numbers . . . . . 17
5.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 20 5.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 17
5.2. The PAWS Mechanism . . . . . . . . . . . . . . . . . . . . 20 5.2. The PAWS Mechanism . . . . . . . . . . . . . . . . . . . . 17
5.2.1. Basic PAWS Algorithm . . . . . . . . . . . . . . . . . 21 5.2.1. Basic PAWS Algorithm . . . . . . . . . . . . . . . . . 18
5.2.2. Timestamp Clock . . . . . . . . . . . . . . . . . . . 23 5.2.2. Timestamp Clock . . . . . . . . . . . . . . . . . . . 20
5.2.3. Outdated Timestamps . . . . . . . . . . . . . . . . . 25 5.2.3. Outdated Timestamps . . . . . . . . . . . . . . . . . 22
5.2.4. Header Prediction . . . . . . . . . . . . . . . . . . 25 5.2.4. Header Prediction . . . . . . . . . . . . . . . . . . 22
5.2.5. IP Fragmentation . . . . . . . . . . . . . . . . . . . 27 5.2.5. IP Fragmentation . . . . . . . . . . . . . . . . . . . 24
5.3. Duplicates from Earlier Incarnations of Connection . . . . 27 5.3. Duplicates from Earlier Incarnations of Connection . . . . 24
6. Conclusions and Acknowledgements . . . . . . . . . . . . . . . 27 6. Conclusions and Acknowledgements . . . . . . . . . . . . . . . 24
7. Security Considerations . . . . . . . . . . . . . . . . . . . 28 7. Security Considerations . . . . . . . . . . . . . . . . . . . 25
8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 29 8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 26
9. References . . . . . . . . . . . . . . . . . . . . . . . . . . 29 9. References . . . . . . . . . . . . . . . . . . . . . . . . . . 26
9.1. Normative References . . . . . . . . . . . . . . . . . . . 29 9.1. Normative References . . . . . . . . . . . . . . . . . . . 26
9.2. Informative References . . . . . . . . . . . . . . . . . . 29 9.2. Informative References . . . . . . . . . . . . . . . . . . 26
Appendix A. Implementation Suggestions . . . . . . . . . . . . . 31 Appendix A. Implementation Suggestions . . . . . . . . . . . . . 28
Appendix B. Duplicates from Earlier Connection Incarnations . . . 32 Appendix B. Duplicates from Earlier Connection Incarnations . . . 29
B.1. System Crash with Loss of State . . . . . . . . . . . . . 32 B.1. System Crash with Loss of State . . . . . . . . . . . . . 29
B.2. Closing and Reopening a Connection . . . . . . . . . . . . 33 B.2. Closing and Reopening a Connection . . . . . . . . . . . . 30
Appendix C. Changes from RFC 1072, RFC 1185, and RFC 1323 . . . . 34 Appendix C. Summary of Notation . . . . . . . . . . . . . . . . . 31
Appendix D. Summary of Notation . . . . . . . . . . . . . . . . . 36 Appendix D. Pseudo-code Summary . . . . . . . . . . . . . . . . . 32
Appendix E. Pseudo-code Summary . . . . . . . . . . . . . . . . . 37 Appendix E. Event Processing Summary . . . . . . . . . . . . . . 34
Appendix F. Event Processing Summary . . . . . . . . . . . . . . 39 Appendix F. Timestamps Edge Cases . . . . . . . . . . . . . . . . 39
Appendix G. Timestamps Edge Cases . . . . . . . . . . . . . . . . 45 Appendix G. Changes from RFC 1072, RFC 1185, and RFC 1323 . . . . 40
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 45 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 42
1. Introduction 1. Introduction
The TCP protocol [RFC0793] was designed to operate reliably over The TCP protocol [RFC0793] was designed to operate reliably over
almost any transmission medium regardless of transmission rate, almost any transmission medium regardless of transmission rate,
delay, corruption, duplication, or reordering of segments. delay, corruption, duplication, or reordering of segments. Over the
Production TCP implementations currently adapt to transfer rates in years, advances in networking technology have resulted in ever-higher
the range of 100 bps to 10^10 bps and round-trip delays in the range transmission speeds, and the fastest paths are well beyond the domain
1 ms to 100 seconds. Work on TCP performance has shown that TCP for which TCP was originally engineered.
without the extensions described in this memo can work well over a
variety of Internet paths, ranging from 800 Mbit/sec I/O channels to
300 bit/sec dial-up modems.
Over the years, advances in networking technology have resulted in This document defines a set of modest extensions to TCP to extend the
ever-higher transmission speeds, and the fastest paths are well domain of its application to match the increasing network capability.
beyond the domain for which TCP was originally engineered. This memo It is an update to and obsoletes [RFC1323], which in turn is based
defines a set of modest extensions to TCP to extend the domain of its upon and obsoletes [RFC1072] and [RFC1185].
application to match this increasing network capability. It is an
update to and obsoletes [RFC1323], which in turn is based upon and
obsoletes [RFC1072] and [RFC1185].
There is no one-line answer to the question: "How fast can TCP go?". For brevity, the full discussions of the merits and history behind
There are two separate kinds of issues, performance and reliability, the TCP options defined within this document have been omitted.
and each depends upon different parameters. We discuss each in turn. [RFC1323] should be consulted for reference. A modern TCP
implementation SHOULD implement and make use of the extensions
described in this document.
1.1. TCP Performance 1.1. TCP Performance
TCP performance depends not upon the transfer rate itself, but rather TCP performance problems arise when the bandwidth*delay product is
upon the product of the transfer rate and the round-trip delay. This large.
"bandwidth*delay product" measures the amount of data that would
"fill the pipe"; it is the buffer space required at sender and
receiver to obtain maximum throughput on the TCP connection over the
path, i.e., the amount of unacknowledged data that TCP must handle in
order to keep the pipeline full. TCP performance problems arise when
the bandwidth*delay product is large. We refer to an Internet path
operating in this region as a "long, fat pipe", and a network
containing this path as an "LFN" (pronounced "elephan(t)").
High-capacity packet satellite channels are LFN's. For example, a
DS1-speed satellite channel has a bandwidth*delay product of 10^6
bits or more; this corresponds to 100 outstanding TCP segments of
1200 bytes each. Terrestrial fiber-optical paths will also fall into
the LFN class; for example, a cross-country delay of 30 ms at a DS3
bandwidth (45Mbps) also exceeds 10^6 bits.
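The figures above can be checked with a short calculation. The following is a minimal sketch, not part of the specification; the function names are hypothetical, and the DS1 rate of 1.544 Mbps and the 0.65 second satellite round trip are assumptions chosen only to illustrate the order of magnitude.

```python
def bdp_bits(bandwidth_bps: float, delay_seconds: float) -> float:
    """Bandwidth*delay product in bits: the data needed to fill the pipe."""
    return bandwidth_bps * delay_seconds


def segments_for(bits: float, segment_bytes: int) -> float:
    """Number of segments of the given size needed to carry `bits` bits."""
    return bits / (8 * segment_bytes)


# DS3 (45 Mbps) with a 30 ms cross-country delay, as in the text.
ds3_bits = bdp_bits(45e6, 0.030)

# DS1 satellite channel (assumed values: 1.544 Mbps, 0.65 s round trip).
ds1_bits = bdp_bits(1.544e6, 0.65)

print(f"DS3 terrestrial: {ds3_bits:.3g} bits")
print(f"DS1 satellite:   {ds1_bits:.3g} bits")
print(f"10^6 bits is about {segments_for(1e6, 1200):.0f} segments of 1200 bytes")
```

Both paths come out at or above 10^6 bits, which is why they are treated as LFNs.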
There are three fundamental performance problems with the current TCP There are three fundamental performance problems with the current TCP
over LFN paths: over LFN paths:
(1) Window Size Limit (1) Window Size Limit
The TCP header uses a 16 bit field to report the receive window The TCP header uses a 16 bit field to report the receive window
size to the sender. Therefore, the largest window that can be size to the sender. Therefore, the largest window that can be
used is 2^16 = 65K bytes. used is 2^16 = 65K bytes.
To circumvent this problem, Section 3 of this memo defines a new To circumvent this problem, Section 3 of this memo defines a new
TCP option, "Window Scale", to allow windows larger than 2^16. TCP option, "Window Scale", to allow windows larger than 2^16.
This option defines an implicit scale factor, which is used to This option defines an implicit scale factor, which is used to
multiply the window size value found in a TCP header to obtain multiply the window size value found in a TCP header to obtain
the true window size. the true window size.
(2) Recovery from Losses (2) Recovery from Losses
Packet losses in an LFN can have a catastrophic effect on Packet losses in an LFN can have a catastrophic effect on
throughput. In the past, properly-operating TCP implementations throughput.
would cause the data pipeline to drain with every packet loss,
and require a slow-start action to recover. The Fast Retransmit
and Fast Recovery algorithms [Jacobson90c], [RFC2581] and
[RFC5681] were introduced, and their combined effect was to
recover from one packet loss per window, without draining the
pipeline. However, more than one packet loss per window
typically resulted in a retransmission timeout and the resulting
pipeline drain and slow start.
Expanding the window size to match the capacity of an LFN
results in a corresponding increase of the probability of more
than one packet per window being dropped. This could have a
devastating effect upon the throughput of TCP over an LFN. In
addition, since the publication of RFC 1323, congestion control
mechanisms based upon some form of random dropping have been
introduced into gateways, and randomly spaced packet drops have
become common; this increases the probability of dropping more
than one packet per window.
To generalize the Fast Retransmit/Fast Recovery mechanism to To generalize the Fast Retransmit/Fast Recovery mechanism to
handle multiple packets dropped per window, selective handle multiple packets dropped per window, selective
acknowledgments are required. Unlike the normal cumulative acknowledgments are required. Unlike the normal cumulative
acknowledgments of TCP, selective acknowledgments give the acknowledgments of TCP, selective acknowledgments give the
sender a complete picture of which segments are queued at the sender a complete picture of which segments are queued at the
receiver and which have not yet arrived. receiver and which have not yet arrived.
Since the publication of RFC1323 [RFC1323], selective Selective acknowledgements are specified in a separate document,
acknowledgments (SACK) have become important in the LFN regime. "A Conservative Selective Acknowledgment (SACK)-based Loss
SACK has been published as "TCP Selective Acknowledgment Recovery Algorithm for TCP" [RFC6675], and are not further discussed
Options" [RFC2018]. Additional information about SACK can be in this document.
found in "An Extension to the Selective Acknowledgement (SACK)
option for TCP" [RFC2883], and "A Conservative Selective
Acknowledgment (SACK)-based Loss Recovery Algorithm for TCP"
[RFC3517].
(3) Round-Trip Measurement (3) Round-Trip Measurement
TCP implements reliable data delivery by retransmitting segments TCP implements reliable data delivery by retransmitting segments
that are not acknowledged within some retransmission timeout that are not acknowledged within some retransmission timeout
(RTO) interval. Accurate dynamic determination of an (RTO) interval. Accurate dynamic determination of an
appropriate RTO is essential to TCP performance. RTO is appropriate RTO is essential to TCP performance. RTO is
determined by estimating the mean and variance of the measured determined by estimating the mean and variance of the measured
round-trip time (RTT), i.e., the time interval between sending a round-trip time (RTT), i.e., the time interval between sending a
segment and receiving an acknowledgment for it [Jacobson88a]. segment and receiving an acknowledgment for it [Jacobson88a].
Section 4.2 introduces a new TCP option, "Timestamps", and then Section 4.2 introduces a new TCP option, "Timestamps", and then
defines a mechanism using this option that allows nearly every defines a mechanism using this option that allows nearly every
segment, including retransmissions, to be timed at negligible segment, including retransmissions, to be timed at negligible
computational cost. We use the mnemonic RTTM (Round Trip Time computational cost. We use the mnemonic RTTM (Round Trip Time
Measurement) for this mechanism, to distinguish it from other Measurement) for this mechanism, to distinguish it from other
uses of the Timestamps option. uses of the Timestamps option.
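As an illustration of item (3), the sketch below shows one common way to derive an RTO from the mean and variance of RTT samples. It is an assumption-laden example rather than the method of this document: the smoothing constants (1/8, 1/4), the factor of 4, and the class and parameter names follow the widely used formulation in RFC 6298 and are not defined in this memo.

```python
class RttEstimator:
    """Smoothed RTT / RTT variance estimator in the style of RFC 6298."""

    ALPHA = 1 / 8   # gain for the smoothed RTT (assumed, per RFC 6298)
    BETA = 1 / 4    # gain for the RTT variance (assumed, per RFC 6298)
    K = 4           # variance multiplier (assumed, per RFC 6298)

    def __init__(self, clock_granularity: float = 0.001, min_rto: float = 1.0):
        self.srtt = None          # smoothed round-trip time, seconds
        self.rttvar = None        # smoothed mean deviation, seconds
        self.granularity = clock_granularity
        self.min_rto = min_rto
        self.rto = min_rto

    def update(self, sample: float) -> float:
        """Fold one RTT sample (seconds) into the estimate; return the RTO."""
        if self.srtt is None:
            # First measurement initializes both estimators.
            self.srtt = sample
            self.rttvar = sample / 2
        else:
            # Variance is updated before the mean, using the old SRTT.
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - sample)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * sample
        self.rto = max(self.min_rto,
                       self.srtt + max(self.granularity, self.K * self.rttvar))
        return self.rto


est = RttEstimator(min_rto=0.0)
for rtt in (0.100, 0.100, 0.250, 0.100):
    print(f"sample={rtt:.3f}s rto={est.update(rtt):.3f}s")
```

With the Timestamps option, a sample can be taken on nearly every ACK, so an estimator of this kind is fed far more often than one sample per window.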
1.2. TCP Reliability 1.2. TCP Reliability
Now we turn from performance to reliability. High transfer rate
enters TCP performance through the bandwidth*delay product. However,
high transfer rate alone can threaten TCP reliability by violating
the assumptions behind the TCP mechanism for duplicate detection and
sequencing.
An especially serious kind of error may result from an accidental An especially serious kind of error may result from an accidental
reuse of TCP sequence numbers in data segments. Suppose that an "old reuse of TCP sequence numbers in data segments. TCP reliability
duplicate segment", e.g., a duplicate data segment that was delayed depends upon the existence of a bound on the lifetime of a segment:
in Internet queues, is delivered to the receiver at the wrong moment, the "Maximum Segment Lifetime" or MSL.
so that its sequence numbers fall somewhere within the current
window. There would be no checksum failure to warn of the error, and
the result could be an undetected corruption of the data. Reception
of an old duplicate ACK segment at the transmitter could be only
slightly less serious: it is likely to lock up the connection so that
no further progress can be made, forcing an RST on the connection.
TCP reliability depends upon the existence of a bound on the lifetime
of a segment: the "Maximum Segment Lifetime" or MSL. An MSL is
generally required by any reliable transport protocol, since every
sequence number field must be finite, and therefore any sequence
number may eventually be reused. In the Internet protocol suite, the
MSL bound is loosely enforced by an IP-layer mechanism, the "Time-to-
Live" (TTL) field, or "Hop Limit" field.
Duplication of sequence numbers might happen in either of two ways: Duplication of sequence numbers might happen in either of two ways:
(1) Sequence number wrap-around on the current connection (1) Sequence number wrap-around on the current connection
A TCP sequence number contains 32 bits. At a high enough A TCP sequence number contains 32 bits. At a high enough
transfer rate, the 32-bit sequence space may be "wrapped" transfer rate, the 32-bit sequence space may be "wrapped"
(cycled) within the time that a segment is delayed in queues. (cycled) within the time that a segment is delayed in queues.
(2) Earlier incarnation of the connection (2) Earlier incarnation of the connection
skipping to change at page 7, line 27 skipping to change at page 6, line 13
the current window for the new incarnation and be accepted as the current window for the new incarnation and be accepted as
valid. valid.
Duplicates from earlier incarnations, Case (2), are avoided by Duplicates from earlier incarnations, Case (2), are avoided by
enforcing the current fixed MSL of the TCP spec, as explained in enforcing the current fixed MSL of the TCP spec, as explained in
Section 5.3 and Appendix B. However, case (1), avoiding the reuse of Section 5.3 and Appendix B. However, case (1), avoiding the reuse of
sequence numbers within the same connection, requires an MSL bound sequence numbers within the same connection, requires an MSL bound
that depends upon the transfer rate, and at high enough rates, a new that depends upon the transfer rate, and at high enough rates, a new
mechanism is required. mechanism is required.
More specifically, if the maximum effective bandwidth at which TCP is
able to transmit over a particular path is B bytes per second, then
the following constraint must be satisfied for error-free operation:
2^31 / B > MSL (secs) [1]
The following table shows the value for Twrap = 2^31/B in seconds,
for some important values of the bandwidth B:
+------------------+----------+-------------+--------------------+
| Network | bits/sec | B bytes/sec | Twrap secs |
+------------------+----------+-------------+--------------------+
| Dialup | 56kbps | 7kBps | 3*10^5 (~3.6 days) |
| DS1 | 1.5Mbps | 190kBps | 10^4 (~3 hours) |
| 10MBit Ethernet | 10Mbps | 1.25MBps | 1700 (~0.5 hours) |
| DS3 | 45Mbps | 5.6MBps | 380 |
| 100MBit Ethernet | 100Mbps | 12.5MBps | 170 |
| Gigabit Ethernet | 1Gbps | 125MBps | 17 |
| 10Gig Ethernet | 10Gbps | 1.25GBps | 1.7 |
+------------------+----------+-------------+--------------------+
It is clear that wrap-around of the sequence space is not a problem
for 56kbps packet switching or even 10Mbps Ethernets. On the other
hand, at DS3 and 100Mbit speeds, Twrap is comparable to the 2 minute
MSL assumed by the TCP specification [RFC0793]. Moving towards and
beyond gigabit speeds, Twrap becomes too small for reliable
enforcement by the Internet TTL mechanism.
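The Twrap column can be reproduced directly from constraint [1]. The following is a minimal sketch; the function names are hypothetical, and the 120 second default is the MSL value quoted in the text.

```python
SEQ_HALF_SPACE = 2 ** 31  # bytes; half of the 32-bit sequence space


def twrap_seconds(bytes_per_sec: float) -> float:
    """Time to send 2^31 bytes at the given rate (Twrap in the table)."""
    return SEQ_HALF_SPACE / bytes_per_sec


def satisfies_constraint(bytes_per_sec: float, msl_seconds: float = 120.0) -> bool:
    """Constraint [1]: 2^31 / B must exceed the MSL."""
    return twrap_seconds(bytes_per_sec) > msl_seconds


# B in bytes/sec, taken from the table above.
rates = {
    "Dialup": 7e3,
    "DS1": 190e3,
    "10MBit Ethernet": 1.25e6,
    "DS3": 5.6e6,
    "100MBit Ethernet": 12.5e6,
    "Gigabit Ethernet": 125e6,
    "10Gig Ethernet": 1.25e9,
}

for name, b in rates.items():
    print(f"{name:18s} Twrap={twrap_seconds(b):10.1f} s  "
          f"meets [1] with MSL=120s: {satisfies_constraint(b)}")
```

Note that DS3 and 100MBit Ethernet still satisfy the inequality, but only by a small factor, which is the sense in which Twrap is "comparable" to the MSL.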
The 16-bit window field of TCP limits the effective bandwidth B to
2^16/RTT, where RTT is the round-trip time in seconds [RFC1110]. If
the RTT is large enough, this limits B to a value that meets the
constraint [1] for a large MSL value. For example, consider a
transcontinental backbone with an RTT of 60ms (set by the laws of
physics). With the bandwidth*delay product limited to 64KB by the
TCP window size, B is then limited to 1.1MBps, no matter how high the
theoretical transfer rate of the path. This corresponds to cycling
the sequence number space in Twrap = 2000 secs, which is safe in
today's Internet.
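The arithmetic of this example can be sketched as follows. The function name is hypothetical; the inputs (a 2^16 byte window and a 60 ms RTT) are the ones used in the paragraph above.

```python
def window_limited_rate(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on throughput (bytes/sec) with one window per round trip."""
    return window_bytes / rtt_seconds


rate = window_limited_rate(2 ** 16, 0.060)   # unscaled 16-bit window, 60 ms RTT
twrap = 2 ** 31 / rate                       # seconds to cycle 2^31 bytes

print(f"B     = {rate / 1e6:.2f} MBps")
print(f"Twrap = {twrap:.0f} s")
```

The result is roughly 1.1 MBps and about 2000 seconds, matching the text.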
It is important to understand that the culprit is not the larger
window but rather the high bandwidth. For example, consider a (very
large) FDDI LAN with a diameter of 10km. Using the speed of light,
we can compute the RTT across the ring as (2*10^4)/(3*10^8) = 67
microseconds, and the delay*bandwidth product is then 833 bytes. A
TCP connection across this LAN using a window of only 833 bytes will
run at the full 100Mbps and can wrap the sequence space in about 3
minutes, very close to the MSL of TCP. Thus, high speed alone can
cause a reliability problem with sequence number wrap-around, even
without extended windows.
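The LAN example can be checked the same way. This is a sketch under the text's own simplifications (propagation at 3*10^8 m/s, a 100 Mbps link); the names are hypothetical.

```python
PROPAGATION_SPEED = 3e8      # metres per second, as assumed in the text
LINK_RATE_BPS = 100e6        # 100 Mbps LAN


def round_trip_seconds(diameter_m: float) -> float:
    """Round-trip propagation time across a LAN of the given diameter."""
    return 2 * diameter_m / PROPAGATION_SPEED


rtt = round_trip_seconds(10_000)             # 10 km diameter
bytes_per_sec = LINK_RATE_BPS / 8
bdp_bytes = bytes_per_sec * rtt              # window needed to fill the path
wrap_seconds = 2 ** 31 / bytes_per_sec       # time to cycle 2^31 bytes

print(f"RTT  = {rtt * 1e6:.0f} microseconds")
print(f"BDP  = {bdp_bytes:.0f} bytes")
print(f"wrap = {wrap_seconds:.0f} s ({wrap_seconds / 60:.1f} minutes)")
```

The window needed is under 1 KB, yet the wrap time is just under 3 minutes, which shows that the wrap hazard is driven by bandwidth and not by window size.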
Watson's Delta-T protocol [Watson81] includes network-layer
mechanisms for precise enforcement of an MSL. In contrast, the IP
mechanism for MSL enforcement is loosely defined and even more
loosely implemented in the Internet. Therefore, it is unwise to
depend upon active enforcement of MSL for TCP connections, and it is
unrealistic to imagine setting MSL's smaller than the current values
(e.g., 120 seconds specified for TCP).
A possible fix for the problem of cycling the sequence space would be A possible fix for the problem of cycling the sequence space would be
to increase the size of the TCP sequence number field. For example, to increase the size of the TCP sequence number field. For example,
the sequence number field (and also the acknowledgment field) could the sequence number field (and also the acknowledgment field) could
be expanded to 64 bits. This could be done either by changing the be expanded to 64 bits. This could be done either by changing the
TCP header or by means of an additional option. TCP header or by means of an additional option.
Section 5 presents a different mechanism, which we call PAWS Section 5 presents a different mechanism, which we call PAWS
(Protection Against Wrapped Sequence numbers), to extend TCP (Protection Against Wrapped Sequence numbers), to extend TCP
reliability to transfer rates well beyond the foreseeable upper limit reliability to transfer rates well beyond the foreseeable upper limit
of network bandwidths. PAWS uses the TCP Timestamps option defined of network bandwidths. PAWS uses the TCP Timestamps option defined
in Section 4.2 to protect against old duplicates from the same in Section 4.2 to protect against old duplicates from the same
connection. connection.
1.3. Using TCP options 1.3. Using TCP options
The extensions defined in this memo all use new TCP options. We must The extensions defined in this document all use new TCP options.
address two possible issues concerning the use of TCP options: (1)
compatibility and (2) overhead.
We must pay careful attention to compatibility, i.e., to When RFC 1323 was published, there was concern that some buggy TCP
interoperation with existing implementations. The only TCP option implementation might be crashed by the first appearance of an option
defined previously, MSS, may appear only on a SYN segment. Every on a non-SYN segment. However, bugs like that can lead to DoS
implementation should (and we expect that most will) ignore unknown attacks against a TCP, so it is now expected that most TCP
options on SYN segments. When RFC 1323 was published, there was implementations will properly handle unknown options on non-SYN
concern that some buggy TCP implementation might be crashed by the segments. But it is still prudent to be conservative in what you
first appearance of an option on a non-SYN segment. However, bugs send, and avoiding buggy TCP implementations is not the only reason
like that can lead to DoS attacks against a TCP, so it is now for negotiating TCP options on SYN segments. Therefore, for each of
expected that most TCP implementations will properly handle unknown the extensions defined below, it is recommended that TCP options will
options on non-SYN segments. But it is still prudent to be
conservative in what you send, and avoiding buggy TCP implementations
is not the only reason for negotiating TCP options on SYN segments.
Therefore, for each of the extensions defined below, TCP options will
be sent on non-SYN segments only after an exchange of options on the be sent on non-SYN segments only after an exchange of options on the
SYN segments has indicated that both sides understand the extension. SYN segments has indicated that both sides understand the extension.
Furthermore, an extension option will be sent in a <SYN,ACK> segment Furthermore, an extension option will be sent in a <SYN,ACK> segment
only if the corresponding option was received in the initial <SYN> only if the corresponding option was received in the initial <SYN>
segment. segment.
A question may be raised about the bandwidth and processing overhead The timestamps option may appear in any data or ACK segment, adding
for TCP options. Those options that occur on SYN segments are not 12 bytes to the 20-byte TCP header. We believe that the bandwidth
likely to cause a performance concern. Opening a TCP connection saved by reducing unnecessary retransmission timeouts will more than
requires execution of significant special-case code, and the pay for the extra header bandwidth.
processing of options is unlikely to increase that cost
significantly.
On the other hand, a Timestamps option may appear in any data or ACK
segment, adding 12 bytes to the 20-byte TCP header. We believe that
the bandwidth saved by reducing unnecessary retransmissions will more
than pay for the extra header bandwidth.
There is also an issue about the processing overhead for parsing the Appendix A contains a recommended layout of the options in TCP
variable byte-aligned format of options, particularly with a RISC- headers to achieve reasonable data field alignment.
architecture CPU. Appendix A contains a recommended layout of the
options in TCP headers to achieve reasonable data field alignment.
In the spirit of Header Prediction, a TCP can quickly test for this
layout and if it is verified then use a fast path. Hosts that use
this canonical layout will effectively use the options as a set of
fixed-format fields appended to the TCP header. However, to retain
the philosophical and protocol framework of TCP options, a TCP must
be prepared to parse an arbitrary options field, albeit with less
efficiency.
Finally, we observe that most of the mechanisms defined in this memo Finally, we observe that most of the mechanisms defined in this memo
are important for LFN's and/or very high-speed networks. For low- are important for LFN's and/or very high-speed networks. For low-
speed networks, it might be a performance optimization to NOT use speed networks, it might be a performance optimization to NOT use
these mechanisms. A TCP vendor concerned about optimal performance these mechanisms. A TCP vendor concerned about optimal performance
over low-speed paths might consider turning these extensions off for over low-speed paths might consider turning these extensions off for
low-speed paths, or allow a user or installation manager to disable low-speed paths, or allow a user or installation manager to disable
them. them.
2. Terminology 2. Terminology
skipping to change at page 10, line 29 skipping to change at page 7, line 29
3. TCP Window Scale Option 3. TCP Window Scale Option
3.1. Introduction 3.1. Introduction
The window scale extension expands the definition of the TCP window The window scale extension expands the definition of the TCP window
to 32 bits and then uses a scale factor to carry this 32-bit value in to 32 bits and then uses a scale factor to carry this 32-bit value in
the 16-bit Window field of the TCP header (SEG.WND in RFC 793). The the 16-bit Window field of the TCP header (SEG.WND in RFC 793). The
scale factor is carried in a new TCP option, Window Scale. This scale factor is carried in a new TCP option, Window Scale. This
option is sent only in a SYN segment (a segment with the SYN bit on), option is sent only in a SYN segment (a segment with the SYN bit on),
hence the window scale is fixed in each direction when a connection hence the window scale is fixed in each direction when a connection
is opened. (Another design choice would be to specify the window is opened.
scale in every TCP segment. It would be incorrect to send a window
scale option only when the scale factor changed, since a TCP option
in an acknowledgement segment will not be delivered reliably (unless
the ACK happens to be piggy-backed on data in the other direction).
Fixing the scale when the connection is opened has the advantage of
lower overhead but the disadvantage that the scale factor cannot be
changed during the connection.)
The maximum receive window, and therefore the scale factor, is The maximum receive window, and therefore the scale factor, is
determined by the maximum receive buffer space. In a typical modern determined by the maximum receive buffer space. In a typical modern
implementation, this maximum buffer space is set by default but can implementation, this maximum buffer space is set by default but can
be overridden by a user program before a TCP connection is opened. be overridden by a user program before a TCP connection is opened.
This determines the scale factor, and therefore no new user interface This determines the scale factor, and therefore no new user interface
is needed for window scaling. is needed for window scaling.
3.2. Window Scale Option 3.2. Window Scale Option
skipping to change at page 12, line 41 skipping to change at page 9, line 34
TCP determines if a data segment is "old" or "new" by testing whether TCP determines if a data segment is "old" or "new" by testing whether
its sequence number is within 2^31 bytes of the left edge of the its sequence number is within 2^31 bytes of the left edge of the
window, and if it is not, discarding the data as "old". To ensure window, and if it is not, discarding the data as "old". To ensure
that new data is never mistakenly considered old and vice versa, the that new data is never mistakenly considered old and vice versa, the
left edge of the sender's window has to be at most 2^31 away from the left edge of the sender's window has to be at most 2^31 away from the
right edge of the receiver's window. Similarly with the sender's right edge of the receiver's window. Similarly with the sender's
right edge and receiver's left edge. Since the right and left edges right edge and receiver's left edge. Since the right and left edges
of either the sender's or receiver's window differ by the window of either the sender's or receiver's window differ by the window
size, and since the sender and receiver windows can be out of phase size, and since the sender and receiver windows can be out of phase
by at most the window size, the above constraints imply that 2 * the by at most the window size, the above constraints imply that two
max window size must be less than 2^31, or times the max window size must be less than 2^31, or
max window < 2^30 max window < 2^30
Since the max window is 2^S (where S is the scaling shift count) Since the max window is 2^S (where S is the scaling shift count)
times at most 2^16 - 1 (the maximum unscaled window), the maximum times at most 2^16 - 1 (the maximum unscaled window), the maximum
window is guaranteed to be < 2^30 if S <= 14. Thus, the shift count window is guaranteed to be < 2^30 if S <= 14. Thus, the shift count
MUST be limited to 14 (which allows windows of 2^30 = 1 Gbyte). If a MUST be limited to 14 (which allows windows of 2^30 = 1 Gbyte). If a
Window Scale option is received with a shift.cnt value exceeding 14, Window Scale option is received with a shift.cnt value exceeding 14,
the TCP SHOULD log the error but use 14 instead of the specified the TCP SHOULD log the error but use 14 instead of the specified
value. value.
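The shift-count limit above can be sketched in a few lines. This is a minimal illustration in Python and is not part of either draft revision. The shift formula is the one given in the pseudo-code summary of this draft (MIN(14, MAX(0, floor(log2(receive buffer space)) - 15))), and the function names choose_shift, clamp_received_shift, and advertised_window are hypothetical.

```python
MAX_SHIFT = 14          # upper limit on the shift count required by the text
MAX_UNSCALED = 0xFFFF   # largest value that fits the 16-bit Window field


def choose_shift(recv_buffer: int) -> int:
    """Shift count for a receive buffer of the given size in bytes."""
    if recv_buffer <= 0:
        return 0
    floor_log2 = recv_buffer.bit_length() - 1
    return min(MAX_SHIFT, max(0, floor_log2 - 15))


def clamp_received_shift(shift_cnt: int) -> int:
    """Use 14 when the peer sends a larger shift.cnt (logging is omitted here)."""
    return min(shift_cnt, MAX_SHIFT)


def advertised_window(recv_window: int, shift: int) -> int:
    """Value carried in the 16-bit Window field for a given true window."""
    return min(recv_window >> shift, MAX_UNSCALED)
```

Under these assumptions a 1 MiB buffer yields a shift count of 5, and the largest representable window, 65535 * 2^14, stays below 2^30.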
skipping to change at page 15, line 33 skipping to change at page 12, line 33
+-------+-------+---------------------+---------------------+ +-------+-------+---------------------+---------------------+
|Kind=8 | 10 | TS Value (TSval) |TS Echo Reply (TSecr)| |Kind=8 | 10 | TS Value (TSval) |TS Echo Reply (TSecr)|
+-------+-------+---------------------+---------------------+ +-------+-------+---------------------+---------------------+
1 1 4 4 1 1 4 4
The Timestamps option carries two four-byte timestamp fields. The The Timestamps option carries two four-byte timestamp fields. The
Timestamp Value field (TSval) contains the current value of the Timestamp Value field (TSval) contains the current value of the
timestamp clock of the TCP sending the option. timestamp clock of the TCP sending the option.
The Timestamp Echo Reply field (TSecr) is valid if the ACK bit is set The Timestamp Echo Reply field (TSecr) is valid if the ACK bit is set
in the TCP header; if it is valid, it echos a timestamp value that in the TCP header; if it is valid, it echoes a timestamp value that
was sent by the remote TCP in the TSval field of a Timestamp option. was sent by the remote TCP in the TSval field of a Timestamp option.
When TSecr is not valid, its value MUST be zero. However, a value of When TSecr is not valid, its value MUST be zero. However, a value of
zero does not imply TSecr being invalid. The TSecr value will zero does not imply TSecr being invalid. The TSecr value will
generally be from the most recent Timestamps Option that was generally be from the most recent Timestamps Option that was
received; however, there are exceptions that are explained below. received; however, there are exceptions that are explained below.
A TCP MAY send the Timestamps option (TSopt) in an initial <SYN> A TCP MAY send the Timestamps option (TSopt) in an initial <SYN>
segment (i.e., a segment containing a SYN bit and no ACK bit). Once segment (i.e., a segment containing a SYN bit and no ACK bit). Once
a TSopt has been sent or received in a non <SYN> segment, it MUST be a TSopt has been sent or received in a non <SYN> segment, it MUST be
sent in all segments. Once a TSopt has been received in a non <SYN> sent in all segments. Once a TSopt has been received in a non <SYN>
skipping to change at page 16, line 17 skipping to change at page 13, line 17
RTTM places a Timestamps option in every segment, with a TSval that RTTM places a Timestamps option in every segment, with a TSval that
is obtained from a (virtual) "timestamp clock". Values of this clock is obtained from a (virtual) "timestamp clock". Values of this clock
MUST be at least approximately proportional to real time, in MUST be at least approximately proportional to real time, in
order to measure actual RTT. order to measure actual RTT.
These TSval values are echoed in TSecr values in the reverse These TSval values are echoed in TSecr values in the reverse
direction. The difference between a received TSecr value and the direction. The difference between a received TSecr value and the
current timestamp clock value provides a RTT measurement. current timestamp clock value provides a RTT measurement.
When timestamps are used, every segment that is received will contain When timestamps are used, every segment that is received will contain
a TSecr value; however, these values cannot all be used to update the a TSecr value. However, these values cannot all be used to update
measured RTT. The following example illustrates why. It shows a the measured RTT. The following example illustrates why. It shows a
one-way data flow with segments arriving in sequence without loss. one-way data flow with segments arriving in sequence without loss.
Here A, B, C... represent data blocks occupying successive blocks of Here A, B, C... represent data blocks occupying successive blocks of
sequence numbers, and ACK(A),... represent the corresponding sequence numbers, and ACK(A),... represent the corresponding
cumulative acknowledgments. The two timestamp fields of the cumulative acknowledgments. The two timestamp fields of the
Timestamps option are shown symbolically as <TSval=x,TSecr=y>. Each Timestamps option are shown symbolically as <TSval=x,TSecr=y>. Each
TSecr field contains the value most recently received in a TSval TSecr field contains the value most recently received in a TSval
field. field.
TCP A TCP B TCP A TCP B
skipping to change at page 16, line 43 skipping to change at page 13, line 43
<B,TSval=5,TSecr=127> -----> <B,TSval=5,TSecr=127> ----->
<---- <ACK(B),TSval=131,TSecr=5> <---- <ACK(B),TSval=131,TSecr=5>
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
<C,TSval=65,TSecr=131> ----> <C,TSval=65,TSecr=131> ---->
<---- <ACK(C),TSval=191,TSecr=65> <---- <ACK(C),TSval=191,TSecr=65>
(etc) (etc.)
The dotted line marks a pause (60 time units long) in which A had The dotted line marks a pause (60 time units long) in which A had
nothing to send. Note that this pause inflates the RTT which B could nothing to send. Note that this pause inflates the RTT which B could
infer from receiving TSecr=131 in data segment C. Thus, in one-way infer from receiving TSecr=131 in data segment C. Thus, in one-way
data flows, RTTM in the reverse direction measures a value that is data flows, RTTM in the reverse direction measures a value that is
inflated by gaps in sending data. However, the following rule inflated by gaps in sending data. However, the following rule
prevents a resulting inflation of the measured RTT: prevents a resulting inflation of the measured RTT:
RTTM Rule: A TSecr value received in a segment is used to update RTTM Rule: A TSecr value received in a segment is used to update
the averaged RTT measurement only if the averaged RTT measurement only if
skipping to change at page 17, line 18 skipping to change at page 14, line 18
a) the segment acknowledges some new data, i.e., only if it a) the segment acknowledges some new data, i.e., only if it
advances the left edge of the send window, and advances the left edge of the send window, and
b) the segment does not indicate any loss or reordering, i.e., it b) the segment does not indicate any loss or reordering, i.e., it
contains no SACK options. contains no SACK options.
Since TCP B is not sending data, the data segment C does not Since TCP B is not sending data, the data segment C does not
acknowledge any new data when it arrives at B. Thus, the inflated acknowledge any new data when it arrives at B. Thus, the inflated
RTTM measurement is not used to update B's RTTM measurement. RTTM measurement is not used to update B's RTTM measurement.
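The RTTM Rule can be sketched as a small filter that decides whether a received TSecr yields an RTT sample. This is an illustrative Python sketch, not text from either draft revision. The helper names (seq_gt, rttm_sample) and the clock value used in the usage note are assumptions.

```python
MASK32 = 0xFFFFFFFF


def seq_gt(a: int, b: int) -> bool:
    """True if 32-bit sequence number a is after b, modulo 2^32."""
    return a != b and ((a - b) & MASK32) < 0x80000000


def rttm_sample(seg_ack, snd_una, has_sack, seg_tsecr, now):
    """Return an RTT sample in timestamp clock ticks, or None if the rule forbids one.

    seg_ack   : acknowledgment number of the received segment
    snd_una   : left edge of the send window before processing the segment
    has_sack  : True if the segment carries SACK options
    seg_tsecr : TSecr field of the received segment
    now       : current value of the local timestamp clock
    """
    if not seq_gt(seg_ack, snd_una):   # condition (a): must advance the left edge
        return None
    if has_sack:                       # condition (b): no sign of loss or reordering
        return None
    return (now - seg_tsecr) & MASK32
```

In the example above, segment C arrives at B with TSecr=131 but does not advance B's send window, so the sketch returns None. When ACK(C) with TSecr=65 reaches A at an assumed clock value of 68, it returns a sample of 3 ticks.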
Implementors should note that with Timestamps multiple RTTMs can be Implementers should note that with Timestamps multiple RTTMs can be
taken per RTT. Many RTO estimators have a weighting factor based on taken per RTT. Many RTO estimators have a weighting factor based on
an implicit assumption that at most one RTTM will be gotten per RTT. an implicit assumption that at most one RTTM will be sampled per RTT.
When using multiple RTTMs per RTT to update the RTO estimator, the When using multiple RTTMs per RTT to update the RTO estimator, the
weighting factor needs to be decreased to take into account the more weighting factor needs to be decreased to take into account the more
frequent RTTMs. For example, an implementation could choose to just frequent RTTMs. For example, an implementation could choose to just
use one sample per RTT to update the RTO estimator, or vary the gain use one sample per RTT to update the RTO estimator, or vary the gain
based on the congestion window, or take an average of all the RTTM based on the congestion window, or take an average of all the RTTM
measurements received over one RTT, and then use that value to update measurements received over one RTT, and then use that value to update
the RTO estimator. This document does not prescribe any particular the RTO estimator. This document does not prescribe any particular
method for modifying the RTO estimator, the important point is that method for modifying the RTO estimator.
the implementation should do something more than just feeding
additional RTTM samples from one RTT into the RTO estimator.
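One of the options mentioned above, averaging all RTTM samples taken during one RTT before updating the estimator, might look like the following Python sketch. It is not part of either draft revision. The smoothing gains (1/8 and 1/4) and the RTO formula follow the standard estimator of RFC 6298 and are an assumption here, as are the class and method names. The minimum RTO and clock granularity terms of RFC 6298 are left out for brevity.

```python
class RtoEstimator:
    """Feeds at most one (averaged) RTT measurement per RTT into the estimator."""

    def __init__(self):
        self.srtt = None      # smoothed RTT
        self.rttvar = None    # RTT variance estimate
        self._pending = []    # RTTM samples collected during the current RTT

    def add_sample(self, rtt):
        """Record one RTTM sample without touching the estimator yet."""
        self._pending.append(rtt)

    def end_of_rtt(self):
        """Once per RTT: fold the mean of the pending samples into SRTT and RTTVAR."""
        if not self._pending:
            return
        r = sum(self._pending) / len(self._pending)
        self._pending.clear()
        if self.srtt is None:
            self.srtt = r
            self.rttvar = r / 2
        else:
            # RTTVAR is updated before SRTT, using the old SRTT value.
            self.rttvar = 0.75 * self.rttvar + 0.25 * abs(self.srtt - r)
            self.srtt = 0.875 * self.srtt + 0.125 * r

    def rto(self):
        """Retransmission timeout derived from the current estimates."""
        return self.srtt + 4 * self.rttvar
```

Because the gains are applied once per RTT regardless of how many samples arrive, the estimator keeps the memory it was tuned for even when every ACK carries a usable TSecr.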
4.4. Which Timestamp to Echo 4.4. Which Timestamp to Echo
If more than one Timestamps option is received before a reply segment If more than one Timestamps option is received before a reply segment
is sent, the TCP must choose only one of the TSvals to echo, ignoring is sent, the TCP must choose only one of the TSvals to echo, ignoring
the others. To minimize the state kept in the receiver (i.e., the the others. To minimize the state kept in the receiver (i.e., the
number of unprocessed TSvals), the receiver should be required to number of unprocessed TSvals), the receiver should be required to
retain at most one timestamp in the connection control block. retain at most one timestamp in the connection control block.
There are three situations to consider: There are three situations to consider:
skipping to change at page 20, line 13 skipping to change at page 17, line 13
(etc) (etc)
5. PAWS -- Protection Against Wrapped Sequence Numbers 5. PAWS -- Protection Against Wrapped Sequence Numbers
5.1. Introduction 5.1. Introduction
Section 5.2 describes a simple mechanism to reject old duplicate Section 5.2 describes a simple mechanism to reject old duplicate
segments that might corrupt an open TCP connection; we call this segments that might corrupt an open TCP connection; we call this
mechanism PAWS (Protection Against Wrapped Sequence numbers). PAWS mechanism PAWS (Protection Against Wrapped Sequence numbers). PAWS
operates within a single TCP connection, using state that is saved in operates within a single TCP connection, using state that is saved in
the connection control block. Section 5.3 and Appendix C discuss the the connection control block. Section 5.3 and Appendix G discuss the
implications of the PAWS mechanism for avoiding old duplicates from implications of the PAWS mechanism for avoiding old duplicates from
previous incarnations of the same connection. previous incarnations of the same connection.
5.2. The PAWS Mechanism 5.2. The PAWS Mechanism
PAWS uses the same TCP Timestamps option as the RTTM mechanism PAWS uses the same TCP Timestamps option as the RTTM mechanism
described earlier, and assumes that every received TCP segment described earlier, and assumes that every received TCP segment
(including data and ACK segments) contains a timestamp SEG.TSval (including data and ACK segments) contains a timestamp SEG.TSval
whose values are monotonically non-decreasing in time. The basic whose values are monotonically non-decreasing in time. The basic
idea is that a segment can be discarded as an old duplicate if it is idea is that a segment can be discarded as an old duplicate if it is
skipping to change at page 24, line 10 skipping to change at page 21, line 10
bytes must be at least two windows. bytes must be at least two windows.
To make this more quantitative, any clock faster than 1 tick/sec To make this more quantitative, any clock faster than 1 tick/sec
will reject old duplicate segments for link speeds of ~8 Gbps. will reject old duplicate segments for link speeds of ~8 Gbps.
A 1 ms timestamp clock will work at link speeds up to 8 Tbps A 1 ms timestamp clock will work at link speeds up to 8 Tbps
(8*10^12 bps)! (8*10^12 bps)!
(b) The timestamp clock must not be "too fast". (b) The timestamp clock must not be "too fast".
Its recycling time MUST be greater than MSL seconds. Since the The recycling time of the timestamp clock MUST be greater than
clock (timestamp) is 32 bits and the worst-case MSL is 255 MSL seconds. Since the clock (timestamp) is 32 bits and the
seconds, the maximum acceptable clock frequency is one tick worst-case MSL is 255 seconds, the maximum acceptable clock
every 59 ns. frequency is one tick every 59 ns.
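The arithmetic behind the clock-rate figures in this item can be checked with a short calculation. The following Python sketch is illustrative only, and the variable names are hypothetical. It derives the fastest tick period that keeps the 32-bit recycling time above a 255-second MSL, and the time a 1 ms clock takes to wrap the sign bit of the timestamp.

```python
MSL_SECONDS = 255            # worst-case MSL used in the text
TIMESTAMP_BITS = 32

# Fastest allowed clock: all 2^32 values must take longer than MSL to recycle.
fastest_tick_ns = MSL_SECONDS / 2**TIMESTAMP_BITS * 1e9

# With a 1 ms tick, the sign bit wraps after 2^31 ticks.
sign_wrap_days = 2**(TIMESTAMP_BITS - 1) * 1e-3 / 86400

print(f"fastest tick period: {fastest_tick_ns:.1f} ns")
print(f"sign-bit wrap at 1 ms per tick: {sign_wrap_days:.1f} days")
```

Both results agree with the figures quoted in this item: a tick period of roughly 59 ns and a sign-bit wrap time of roughly 24.8 days.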
However, it is desirable to establish a much longer recycle However, it is desirable to establish a much longer recycle
period, in order to handle outdated timestamps on idle period, in order to handle outdated timestamps on idle
connections (see Section 5.2.3), and to relax the MSL connections (see Section 5.2.3), and to relax the MSL
requirement for preventing sequence number wrap-around. With a requirement for preventing sequence number wrap-around. With a
1 ms timestamp clock, the 32-bit timestamp will wrap its sign 1 ms timestamp clock, the 32-bit timestamp will wrap its sign
bit in 24.8 days. Thus, it will reject old duplicates on the bit in 24.8 days. Thus, it will reject old duplicates on the
same connection if MSL is 24.8 days or less. This appears to be same connection if MSL is 24.8 days or less. This appears to be
a very safe figure; an MSL of 24.8 days or longer can probably a very safe figure; an MSL of 24.8 days or longer can probably
be assumed in the internet without requiring precise MSL be assumed in the internet without requiring precise MSL
skipping to change at page 26, line 14 skipping to change at page 23, line 14
segment: segment:
H1) Check timestamp (same as step R1 above) H1) Check timestamp (same as step R1 above)
H2) Do header prediction: if segment is next in sequence and if H2) Do header prediction: if segment is next in sequence and if
there are no special conditions requiring additional processing, there are no special conditions requiring additional processing,
accept the segment, record its timestamp, and skip H3. accept the segment, record its timestamp, and skip H3.
H3) Process the segment normally, as specified in RFC 793. This H3) Process the segment normally, as specified in RFC 793. This
includes dropping segments that are outside the window and includes dropping segments that are outside the window and
possibly sending acknowledgments, and queueing in-window, out- possibly sending acknowledgments, and queuing in-window, out-of-
of-sequence segments. sequence segments.
Another possibility would be to interchange steps H1 and H2, i.e., to Another possibility would be to interchange steps H1 and H2, i.e., to
perform the header prediction step H2 FIRST, and perform H1 and H3 perform the header prediction step H2 FIRST, and perform H1 and H3
only when header prediction fails. This could be a performance only when header prediction fails. This could be a performance
improvement, since the timestamp check in step H1 is very unlikely to improvement, since the timestamp check in step H1 is very unlikely to
fail, and it requires unsigned modulo arithmetic. To perform this fail, and it requires unsigned modulo arithmetic. To perform this
check on every single segment is contrary to the philosophy of header check on every single segment is contrary to the philosophy of header
prediction. We believe that this change might produce a measurable prediction. We believe that this change might produce a measurable
reduction in CPU time for TCP protocol processing on high-speed reduction in CPU time for TCP protocol processing on high-speed
networks. networks.
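The unsigned modulo arithmetic that the timestamp check requires can be sketched as follows. This Python fragment is an illustration, not part of either draft revision. It assumes the PAWS test rejects a segment whose SEG.TSval is earlier than TS.Recent in 32-bit modular order, and the function names are hypothetical. Handling of RST segments and of outdated timestamps on idle connections is not shown.

```python
MASK32 = 0xFFFFFFFF


def ts_before(a: int, b: int) -> bool:
    """True if timestamp a is earlier than b, comparing modulo 2^32.

    The difference is reduced to 32 bits; a value with the top bit set
    means a lies in the half of the number space that precedes b.
    """
    return ((a - b) & MASK32) >= 0x80000000


def paws_reject(seg_tsval: int, ts_recent: int) -> bool:
    """Assumed form of the check: reject when SEG.TSval < TS.Recent."""
    return ts_before(seg_tsval, ts_recent)
```

The comparison stays correct across a wrap of the timestamp clock: a TSval of 0x10 is treated as newer than a TS.Recent of 0xFFFFFFF0.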
skipping to change at page 28, line 34 skipping to change at page 25, line 34
7. Security Considerations 7. Security Considerations
The TCP sequence space is a fixed size, and as the window becomes The TCP sequence space is a fixed size, and as the window becomes
larger it becomes easier for an attacker to generate forged packets larger it becomes easier for an attacker to generate forged packets
that can fall within the TCP window, and be accepted as valid that can fall within the TCP window, and be accepted as valid
packets. While use of Timestamps and PAWS can help to mitigate this, packets. While use of Timestamps and PAWS can help to mitigate this,
when using PAWS, if an attacker is able to forge a packet that is when using PAWS, if an attacker is able to forge a packet that is
acceptable to the TCP connection, a timestamp that is in the future acceptable to the TCP connection, a timestamp that is in the future
would cause valid packets to be dropped due to PAWS checks. Hence, would cause valid packets to be dropped due to PAWS checks. Hence,
implementors should take care to not open the TCP window drastically implementers should take care to not open the TCP window drastically
beyond the requirements of the connection. beyond the requirements of the connection.
Middle boxes and options: If a middle box removes TCP options from Middle boxes and options: If a middle box removes TCP options from
the SYN, such as TSopt, a high speed connection that needs PAWS would the SYN, such as TSopt, a high speed connection that needs PAWS would
not have that protection. In this situation, an implementor could not have that protection. In this situation, an implementer could
provide a mechanism for the application to determine whether or not provide a mechanism for the application to determine whether or not
PAWS is in use on the connection, and choose to terminate the PAWS is in use on the connection, and choose to terminate the
connection if that protection doesn't exist. connection if that protection doesn't exist.
Mechanisms to protect the TCP header from modification should also Mechanisms to protect the TCP header from modification should also
protect the TCP options. protect the TCP options.
Expanding the TCP window beyond 64K for IPv6 allows Jumbograms Expanding the TCP window beyond 64K for IPv6 allows Jumbograms
[RFC2675] to be used when the local network supports packets larger [RFC2675] to be used when the local network supports packets larger
than 64K. When larger TCP packets are used, the TCP checksum becomes than 64K. When larger TCP packets are used, the TCP checksum becomes
skipping to change at page 31, line 7 skipping to change at page 28, line 7
[RFC2581] Allman, M., Paxson, V., and W. Stevens, "TCP Congestion [RFC2581] Allman, M., Paxson, V., and W. Stevens, "TCP Congestion
Control", RFC 2581, April 1999. Control", RFC 2581, April 1999.
[RFC2675] Borman, D., Deering, S., and R. Hinden, "IPv6 Jumbograms", [RFC2675] Borman, D., Deering, S., and R. Hinden, "IPv6 Jumbograms",
RFC 2675, August 1999. RFC 2675, August 1999.
[RFC2883] Floyd, S., Mahdavi, J., Mathis, M., and M. Podolsky, "An [RFC2883] Floyd, S., Mahdavi, J., Mathis, M., and M. Podolsky, "An
Extension to the Selective Acknowledgement (SACK) Option Extension to the Selective Acknowledgement (SACK) Option
for TCP", RFC 2883, July 2000. for TCP", RFC 2883, July 2000.
[RFC3517] Blanton, E., Allman, M., Fall, K., and L. Wang, "A
Conservative Selective Acknowledgment (SACK)-based Loss
Recovery Algorithm for TCP", RFC 3517, April 2003.
[RFC4821] Mathis, M. and J. Heffner, "Packetization Layer Path MTU [RFC4821] Mathis, M. and J. Heffner, "Packetization Layer Path MTU
Discovery", RFC 4821, March 2007. Discovery", RFC 4821, March 2007.
[RFC4963] Heffner, J., Mathis, M., and B. Chandler, "IPv4 Reassembly [RFC4963] Heffner, J., Mathis, M., and B. Chandler, "IPv4 Reassembly
Errors at High Data Rates", RFC 4963, July 2007. Errors at High Data Rates", RFC 4963, July 2007.
[RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
Control", RFC 5681, September 2009. Control", RFC 5681, September 2009.
[RFC6675] Blanton, E., Allman, M., Wang, L., Jarvinen, I., Kojo, M.,
and Y. Nishida, "A Conservative Loss Recovery Algorithm
Based on Selective Acknowledgment (SACK) for TCP",
RFC 6675, August 2012.
[Watson81] [Watson81]
Watson, R., "Timer-based Mechanisms in Reliable Transport Watson, R., "Timer-based Mechanisms in Reliable Transport
Protocol Connection Management", Computer Networks, Vol. Protocol Connection Management", Computer Networks, Vol.
5, 1981. 5, 1981.
[Zhang86] Zhang, L., "Why TCP Timers Don't Work Well", Proc. SIGCOMM [Zhang86] Zhang, L., "Why TCP Timers Don't Work Well", Proc. SIGCOMM
'86, Stowe, VT, August 1986. '86, Stowe, VT, August 1986.
Appendix A. Implementation Suggestions Appendix A. Implementation Suggestions
TCP Option Layout TCP Option Layout
The following layouts are recommended for sending options on non- The following layouts are recommended for sending options on non-
SYN segments, to achieve maximum feasible alignment of 32-bit and SYN segments, to achieve maximum feasible alignment of 32-bit and
64-bit machines. 64-bit machines.
+--------+--------+--------+--------+ +--------+--------+--------+--------+
| NOP | NOP | TSopt | 10 | | NOP | NOP | TSopt | 10 |
+--------+--------+--------+--------+ +--------+--------+--------+--------+
| TSval timestamp | | TSval timestamp |
+--------+--------+--------+--------+ +--------+--------+--------+--------+
| TSecr timestamp | | TSecr timestamp |
+--------+--------+--------+--------+ +--------+--------+--------+--------+
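The recommended layout can be produced and parsed with fixed-format packing. The following Python sketch is illustrative and not part of either draft revision. The function names are hypothetical; the option kind (8), option length (10), and NOP value (1) are the standard TCP option codes.

```python
import struct

NOP = 1           # TCP No-Operation option kind
TSOPT_KIND = 8    # Timestamps option kind
TSOPT_LEN = 10    # Timestamps option length in bytes
MASK32 = 0xFFFFFFFF

# Two NOPs, kind, length, then two 32-bit fields in network byte order.
_LAYOUT = struct.Struct("!BBBBII")


def pack_tsopt(tsval: int, tsecr: int) -> bytes:
    """Build the 12-byte NOP, NOP, TSopt block shown above."""
    return _LAYOUT.pack(NOP, NOP, TSOPT_KIND, TSOPT_LEN,
                        tsval & MASK32, tsecr & MASK32)


def unpack_tsopt(data: bytes):
    """Return (TSval, TSecr) if data starts with the recommended layout, else None."""
    if len(data) < _LAYOUT.size:
        return None
    nop1, nop2, kind, length, tsval, tsecr = _LAYOUT.unpack_from(data)
    if (nop1, nop2, kind, length) != (NOP, NOP, TSOPT_KIND, TSOPT_LEN):
        return None
    return tsval, tsecr
```

The two leading NOPs pad the 10-byte option to 12 bytes, so TSval and TSecr each begin on a 4-byte boundary within the options area.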
Interaction with the TCP Urgent Pointer Interaction with the TCP Urgent Pointer
The TCP Urgent pointer, like the TCP window, is a 16 bit value. The TCP Urgent pointer, like the TCP window, is a 16 bit value.
Some of the original discussion for the TCP Window Scale option Some of the original discussion for the TCP Window Scale option
included proposals to increase the Urgent pointer to 32 bits. As included proposals to increase the Urgent pointer to 32 bits. As
it turns out, this is unnecessary. There are two observations it turns out, this is unnecessary. There are two observations
that should be made: that should be made:
skipping to change at page 34, line 30 skipping to change at page 31, line 31
tick of the sender's timestamp clock. Such an extension is not tick of the sender's timestamp clock. Such an extension is not
part of the proposal of this RFC. part of the proposal of this RFC.
Note that this is a variant on the mechanism proposed by Note that this is a variant on the mechanism proposed by
Garlick, Rom, and Postel [Garlick77], which required each host Garlick, Rom, and Postel [Garlick77], which required each host
to maintain connection records containing the highest sequence to maintain connection records containing the highest sequence
numbers on every connection. Using timestamps instead, it is numbers on every connection. Using timestamps instead, it is
only necessary to keep one quantity per remote host, regardless only necessary to keep one quantity per remote host, regardless
of the number of simultaneous connections to that host. of the number of simultaneous connections to that host.
Appendix C. Changes from RFC 1072, RFC 1185, and RFC 1323 Appendix C. Summary of Notation
The protocol extensions defined in RFC 1323 document differ in
several important ways from those defined in RFC 1072 and RFC 1185.
(a) SACK has been split off into a separate document, [RFC2018].
(b) The detailed rules for sending timestamp replies (see
Section 4.4) differ in important ways. The earlier rules could
result in an under-estimate of the RTT in certain cases (packets
dropped or out of order).
(c) The same value TS.Recent is now shared by the two distinct
mechanisms RTTM and PAWS. This simplification became possible
because of change (b).
(d) An ambiguity in RFC 1185 was resolved in favor of putting
timestamps on ACK as well as data segments. This supports the
symmetry of the underlying TCP protocol.
(e) The echo and echo reply options of RFC 1072 were combined into a
single Timestamps option, to reflect the symmetry and to
simplify processing.
(f) The problem of outdated timestamps on long-idle connections,
discussed in Section 5.2.2, was realized and resolved.
(g) RFC 1185 recommended that header prediction take precedence over
the timestamp check. Based upon some skepticism about the
probabilistic arguments given in Section 5.2.4, it was decided
to recommend that the timestamp check be performed first.
(h) The spec was modified so that the extended options will be sent
on <SYN,ACK> segments only when they are received in the
corresponding <SYN> segments. This provides the most
conservative possible conditions for interoperation with
implementations without the extensions.
In addition to these substantive changes, the present RFC attempts to
specify the algorithms unambiguously by presenting modifications to
the Event Processing rules of RFC 793; see Appendix F.
There are additional changes in this document from RFC 1323. These
changes are:
(a) The description of which TSecr values can be used to update the
measured RTT has been clarified. Specifically, with Timestamps,
the Karn algorithm [Karn87] is disabled. The Karn algorithm
disables all RTT measurements during retransmission, since it is
ambiguous whether the ACK is for the original packet, or the
retransmitted packet. With Timestamps, that ambiguity is
removed since the TSecr in the ACK will contain the TSval from
whichever data packet made it to the destination.
(b) In RFC1323, section 3.4, step (2) of the algorithm to control
which timestamp is echoed was incorrect in two regards:
(1) It failed to update TS.recent for a retransmitted segment
that resulted from a lost ACK.
(2) It failed if SEG.LEN = 0.
In the new algorithm, the case of SEG.TSval >= TS.recent is
included for consistency with the PAWS test.
(c) One correction was made to the Event Processing Summary in
Appendix F. In SEND CALL/ESTABLISHED STATE, RCV.WND is used to
fill in the SEG.WND value, not SND.WND.
(d) New pseudo-code summary has been added in Appendix E.
(e) Appendix A has been expanded with information about the TCP MSS
option and the TCP Urgent Pointer.
(f) It is now recommended that Timestamps options be included in RST
packets if the incoming packet contained a Timestamps option.
(g) RST packets are explicitly excluded from PAWS processing.
(h) Snd.TSoffset and Snd.TSclock variables have been added.
Snd.TSclock is the sum of my.TSclock and Snd.TSoffset. This
allows the starting points for timestamps to be randomized on a
per-connection basis. Setting Snd.TSoffset to zero yields the
same results as [RFC1323].
(i) RTTM update processing explicitly excludes packets containing
SACK options. This addresses inflation of the RTT during
episodes of packet loss in both directions.
(j) In Section 4.2 the if-clause allowing sending of timestamps only
when received in a <SYN> or <SYN,ACK> was removed, to allow for
late timestamp negotiation.
(k) Section 3.4 was added describing the unavoidable window
retraction issue, and explicitly describing the mitigation steps
necessary.
(l) Section 2 was added for RFC2119 wording. Normative text was
updated with the appropriate phrases.
Appendix D. Summary of Notation
The following notation has been used in this document. The following notation has been used in this document.
Options Options
WSopt: TCP Window Scale Option WSopt: TCP Window Scale Option
TSopt: TCP Timestamps Option TSopt: TCP Timestamps Option
Option Fields Option Fields
shift.cnt: Window scale byte in WSopt shift.cnt: Window scale byte in WSopt
TSval: 32-bit Timestamp Value field in TSopt TSval: 32-bit Timestamp Value field in TSopt
TSecr: 32-bit Timestamp Reply field in TSopt TSecr: 32-bit Timestamp Reply field in TSopt
Option Fields in Current Segment Option Fields in Current Segment
SEG.TSval: TSval field from TSopt in current segment SEG.TSval: TSval field from TSopt in current segment
SEG.TSecr: TSecr field from TSopt in current segment SEG.TSecr: TSecr field from TSopt in current segment
SEG.WSopt: 8-bit value in WSopt SEG.WSopt: 8-bit value in WSopt
Clock Values Clock Values
my.TSclock: System wide source of 32-bit timestamp values my.TSclock: System wide source of 32-bit timestamp values
my.TSclock.rate: Period of my.TSclock (1 ms to 1 sec) my.TSclock.rate: Period of my.TSclock (1 ms to 1 sec)
Snd.TSoffset: An offset for randomizing Snd.TSclock Snd.TSoffset: An offset for randomizing Snd.TSclock
Snd.TSclock: my.TSclock + Snd.TSoffset Snd.TSclock: my.TSclock + Snd.TSoffset
skipping to change at page 37, line 38 skipping to change at page 32, line 32
Snd.Wind.Scale: Send window scale power Snd.Wind.Scale: Send window scale power
Start.Time: Snd.TSclock value when segment being timed was Start.Time: Snd.TSclock value when segment being timed was
sent (used by pre-1323 code). sent (used by pre-1323 code).
Procedure Procedure
Update_SRTT(m) Procedure to update the smoothed RTT and RTT Update_SRTT(m) Procedure to update the smoothed RTT and RTT
variance estimates, using the rules of variance estimates, using the rules of
[Jacobson88a], given m, a new RTT measurement [Jacobson88a], given m, a new RTT measurement
Appendix E. Pseudo-code Summary Appendix D. Pseudo-code Summary
Create new TCB => { Create new TCB => {
Rcv.wind.scale = Rcv.wind.scale =
MIN( 14, MAX(0, floor(log2(receive buffer space)) - 15) ); MIN( 14, MAX(0, floor(log2(receive buffer space)) - 15) );
Snd.wind.scale = 0; Snd.wind.scale = 0;
Last.ACK.sent = 0; Last.ACK.sent = 0;
Snd.TS.OK = Snd.WS.OK = FALSE; Snd.TS.OK = Snd.WS.OK = FALSE;
Snd.TSoffset = random 32 bit value Snd.TSoffset = random 32 bit value
} }
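The Rcv.wind.scale expression above can be sketched in C as follows. This is only an illustration: the function name rcv_wind_scale and the choice of a 64-bit byte count are assumptions, not part of the draft. It computes floor(log2(n)) by repeated shifting and then clamps the result to the range 0..14.

```c
#include <stdint.h>

/* Hypothetical helper (name is an assumption, not from the draft).
 * Returns MIN(14, MAX(0, floor(log2(recv_buf_bytes)) - 15)),
 * i.e. the shift count to offer in WSopt for a given buffer size. */
int rcv_wind_scale(uint64_t recv_buf_bytes)
{
    int log2_floor = 0;

    /* floor(log2(n)) for n >= 1; a zero-sized buffer yields 0. */
    while (recv_buf_bytes > 1) {
        recv_buf_bytes >>= 1;
        log2_floor++;
    }

    int shift = log2_floor - 15;
    if (shift < 0)
        shift = 0;      /* MAX(0, ...) */
    if (shift > 14)
        shift = 14;     /* MIN(14, ...) */
    return shift;
}
```

Under this sketch, a 65535-byte buffer needs no scaling (shift 0), while a 1 MiB buffer gives a shift of 5.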
Send initial <SYN> segment => { Send initial <SYN> segment => {
SEG.WND = MIN( RCV.WND, 65535 ); SEG.WND = MIN( RCV.WND, 65535 );
Include in segment: TSopt(TSval=Snd.TSclock, TCecr=0); Include in segment: TSopt(TSval=Snd.TSclock, TSecr=0);
Include in segment: WSopt = Rcv.wind.scale; Include in segment: WSopt = Rcv.wind.scale;
} }
Send <SYN,ACK> segment => { Send <SYN,ACK> segment => {
SEG.ACK = Last.ACK.sent = RCV.NXT; SEG.ACK = Last.ACK.sent = RCV.NXT;
SEG.WND = MIN( RCV.WND, 65535 ); SEG.WND = MIN( RCV.WND, 65535 );
if (Snd.TS.OK) then if (Snd.TS.OK) then
Include in segment: Include in segment:
TSopt(TSval=Snd.TSclock, TSecr=TS.Recent); TSopt(TSval=Snd.TSclock, TSecr=TS.Recent);
if (Snd.WS.OK) then if (Snd.WS.OK) then
skipping to change at page 39, line 22 skipping to change at page 34, line 14
Discard segment and return; Discard segment and return;
} }
} }
else { else {
if (SEG.SEQ <= Last.ACK.sent) then if (SEG.SEQ <= Last.ACK.sent) then
TS.Recent = SEG.TSval; TS.Recent = SEG.TSval;
} }
} }
if (SEG.ACK > SND.UNA) then { if (SEG.ACK > SND.UNA) then {
/* (At least part of) first segment in /* (At least part of) first segment in
* retransmission queue has been ACKd * retransmission queue has been ACKed
*/ */
if (Segment contains TSopt) then if (Segment contains TSopt) then
Update_SRTT( Update_SRTT(
(Snd.TSclock - SEG.TSecr)/my.TSclock.rate); (Snd.TSclock - SEG.TSecr)/my.TSclock.rate);
else else
Update_SRTT( /* for compatibility */ Update_SRTT( /* for compatibility */
(Snd.TSclock - Start.Time)/my.TSclock.rate); (Snd.TSclock - Start.Time)/my.TSclock.rate);
} }
} }
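The subtraction Snd.TSclock - SEG.TSecr in the pseudo-code above operates on 32-bit timestamp values, so it has to stay correct when the clock wraps. A minimal sketch, assuming unsigned 32-bit arithmetic and a hypothetical function name rttm_ticks, is shown below. It returns the elapsed clock ticks only; converting ticks to time depends on my.TSclock.rate and is deliberately left out.

```c
#include <stdint.h>

/* Hypothetical helper (name is an assumption, not from the draft).
 * Elapsed timestamp-clock ticks between sending a segment and
 * receiving the ACK that echoes its TSval in TSecr.
 * Unsigned subtraction is performed modulo 2^32, so the result is
 * still correct if the timestamp clock wrapped in between. */
uint32_t rttm_ticks(uint32_t snd_tsclock_now, uint32_t seg_tsecr)
{
    return (uint32_t)(snd_tsclock_now - seg_tsecr);
}
```

For example, an echo of 0xFFFFFFFB received when the clock reads 5 yields 10 ticks, not a huge or negative value.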
Appendix F. Event Processing Summary Appendix E. Event Processing Summary
OPEN Call OPEN Call
... ...
An initial send sequence number (ISS) is selected. Send a SYN An initial send sequence number (ISS) is selected. Send a SYN
segment of the form: segment of the form:
<SEQ=ISS><CTL=SYN><TSval=Snd.TSclock><WSopt=Rcv.Wind.Scale> <SEQ=ISS><CTL=SYN><TSval=Snd.TSclock><WSopt=Rcv.Wind.Scale>
skipping to change at page 45, line 7 skipping to change at page 39, line 45
<SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK> <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>
If the Snd.TS.OK bit is on, include Timestamps option If the Snd.TS.OK bit is on, include Timestamps option
<TSval=Snd.TSclock,TSecr=TS.Recent> in this ACK segment. <TSval=Snd.TSclock,TSecr=TS.Recent> in this ACK segment.
Set Last.ACK.sent to SEG.ACK of the acknowledgment, and send Set Last.ACK.sent to SEG.ACK of the acknowledgment, and send
it. This acknowledgment should be piggy-backed on a segment it. This acknowledgment should be piggy-backed on a segment
being transmitted if possible without incurring undue delay. being transmitted if possible without incurring undue delay.
... ...
Appendix G. Timestamps Edge Cases Appendix F. Timestamps Edge Cases
While the rules laid out for when to calculate RTTM produce the While the rules laid out for when to calculate RTTM produce the
correct results most of the time, there are some edge cases where an correct results most of the time, there are some edge cases where an
incorrect RTTM can be calculated. All of these situations involve incorrect RTTM can be calculated. All of these situations involve
the loss of packets. It is felt that these scenarios are rare, and the loss of packets. It is felt that these scenarios are rare, and
that if they should happen, they will cause a single RTTM measurement that if they should happen, they will cause a single RTTM measurement
to be inflated, which mitigates its effects on RTO calculations. to be inflated, which mitigates its effects on RTO calculations.
[Martin03] cites two similar cases when the returning ACK is lost, [Martin03] cites two similar cases when the returning ACK is lost,
and before the retransmission timer fires, another returning packet and before the retransmission timer fires, another returning packet
skipping to change at page 45, line 39 skipping to change at page 40, line 29
(RTTM is calculated at 4) (RTTM is calculated at 4)
One thing to note about this situation is that it is somewhat bounded One thing to note about this situation is that it is somewhat bounded
by RTO + RTT, limiting how far off the RTTM calculation will be. by RTO + RTT, limiting how far off the RTTM calculation will be.
While more complex scenarios can be constructed that produce larger While more complex scenarios can be constructed that produce larger
inflations (e.g., retransmissions are lost), those scenarios involve inflations (e.g., retransmissions are lost), those scenarios involve
multiple packet losses, and the connection will have other more multiple packet losses, and the connection will have other more
serious operational problems than using an inflated RTTM in the RTO serious operational problems than using an inflated RTTM in the RTO
calculation. calculation.
Appendix G. Changes from RFC 1072, RFC 1185, and RFC 1323
The protocol extensions defined in RFC 1323 differ in
several important ways from those defined in RFC 1072 and RFC 1185.
(a) SACK has been split off into a separate document, [RFC2018].
(b) The detailed rules for sending timestamp replies (see
Section 4.4) differ in important ways. The earlier rules could
result in an under-estimate of the RTT in certain cases (packets
dropped or out of order).
(c) The same value TS.Recent is now shared by the two distinct
mechanisms RTTM and PAWS. This simplification became possible
because of change (b).
(d) An ambiguity in RFC 1185 was resolved in favor of putting
timestamps on ACK as well as data segments. This supports the
symmetry of the underlying TCP protocol.
(e) The echo and echo reply options of RFC 1072 were combined into a
single Timestamps option, to reflect the symmetry and to
simplify processing.
(f) The problem of outdated timestamps on long-idle connections,
discussed in Section 5.2.2, was realized and resolved.
(g) RFC 1185 recommended that header prediction take precedence over
the timestamp check. Based upon some skepticism about the
probabilistic arguments given in Section 5.2.4, it was decided
to recommend that the timestamp check be performed first.
(h) The spec was modified so that the extended options will be sent
on <SYN,ACK> segments only when they are received in the
corresponding <SYN> segments. This provides the most
conservative possible conditions for interoperation with
implementations without the extensions.
In addition to these substantive changes, the present RFC attempts to
specify the algorithms unambiguously by presenting modifications to
the Event Processing rules of RFC 793; see Appendix E.
There are additional changes in this document from RFC 1323. These
changes are:
(a) The description of which TSecr values can be used to update the
measured RTT has been clarified. Specifically, with Timestamps,
the Karn algorithm [Karn87] is disabled. The Karn algorithm
disables all RTT measurements during retransmission, since it is
ambiguous whether the ACK is for the original packet, or the
retransmitted packet. With Timestamps, that ambiguity is
removed since the TSecr in the ACK will contain the TSval from
whichever data packet made it to the destination.
(b) In RFC 1323, Section 3.4, step (2) of the algorithm to control
which timestamp is echoed was incorrect in two regards:
(1) It failed to update TS.recent for a retransmitted segment
that resulted from a lost ACK.
(2) It failed if SEG.LEN = 0.
In the new algorithm, the case of SEG.TSval >= TS.recent is
included for consistency with the PAWS test.
(c) One correction was made to the Event Processing Summary in
Appendix E. In SEND CALL/ESTABLISHED STATE, RCV.WND is used to
fill in the SEG.WND value, not SND.WND.
(d) A new pseudo-code summary has been added in Appendix D.
(e) Appendix A has been expanded with information about the TCP MSS
option and the TCP Urgent Pointer.
(f) It is now recommended that Timestamps options be included in RST
packets if the incoming packet contained a Timestamps option.
(g) RST packets are explicitly excluded from PAWS processing.
(h) Snd.TSoffset and Snd.TSclock variables have been added.
Snd.TSclock is the sum of my.TSclock and Snd.TSoffset. This
allows the starting points for timestamps to be randomized on a
per-connection basis. Setting Snd.TSoffset to zero yields the
same results as [RFC1323].
(i) RTTM update processing explicitly excludes packets containing
SACK options. This addresses inflation of the RTT during
episodes of packet loss in both directions.
(j) In Section 4.2 the if-clause allowing sending of timestamps only
when received in a <SYN> or <SYN,ACK> was removed, to allow for
late timestamp negotiation.
(k) Section 3.4 was added describing the unavoidable window
retraction issue, and explicitly describing the mitigation steps
necessary.
(l) Section 2 was added for RFC2119 wording. Normative text was
updated with the appropriate phrases.
(m) Removed much of the discussion in Section 1 to streamline the
document. However, detailed examples and discussions in
Section 3, Section 4 and Section 5 are kept as guidelines for
implementers.
(n) Moved Appendix "Changes" to the end of the appendices for easier
lookup.
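Item (h) above defines Snd.TSclock as the sum of my.TSclock and Snd.TSoffset. A minimal C sketch of that relationship follows. The function name snd_tsclock is an assumption, and the addition is assumed to wrap modulo 2^32, matching the 32-bit timestamp fields.

```c
#include <stdint.h>

/* Hypothetical helper (name is an assumption, not from the draft).
 * Per-connection timestamp clock: the system-wide clock plus a
 * per-connection offset, wrapping modulo 2^32. */
uint32_t snd_tsclock(uint32_t my_tsclock, uint32_t snd_tsoffset)
{
    return (uint32_t)(my_tsclock + snd_tsoffset);
}
```

With an offset of zero the function returns my.TSclock unchanged, which corresponds to the RFC 1323 behavior noted in item (h). Because the same offset is added to every timestamp on a connection, the difference between two Snd.TSclock readings does not depend on the offset.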
Authors' Addresses Authors' Addresses
David Borman David Borman
Quantum Corporation Quantum Corporation
Mendota Heights MN 55120 Mendota Heights MN 55120
USA USA
Email: david.borman@quantum.com Email: david.borman@quantum.com
Bob Braden Bob Braden
University of Southern California University of Southern California
4676 Admiralty Way 4676 Admiralty Way
Marina del Rey CA 90292 Marina del Rey CA 90292
USA USA
Email: braden@isi.edu Email: braden@isi.edu
Van Jacobson Van Jacobson
Packet Design Packet Design
 End of changes. 49 change blocks. 
374 lines changed or deleted 233 lines changed or added
