draft-ietf-tcpm-1323bis-02.txt   draft-ietf-tcpm-1323bis-03.txt 
TCP Maintenance (TCPM) D. Borman TCP Maintenance (TCPM) D. Borman
Internet-Draft Quantum Corporation Internet-Draft Quantum Corporation
Intended status: Standards Track B. Braden Intended status: Standards Track B. Braden
Expires: November 19, 2012 University of Southern Expires: January 12, 2013 University of Southern
California California
V. Jacobson V. Jacobson
Packet Design Packet Design
R. Scheffenegger, Ed. R. Scheffenegger, Ed.
NetApp, Inc. NetApp, Inc.
May 18, 2012 July 11, 2012
TCP Extensions for High Performance TCP Extensions for High Performance
draft-ietf-tcpm-1323bis-02 draft-ietf-tcpm-1323bis-03
Abstract Abstract
This memo presents a set of TCP extensions to improve performance This memo presents a set of TCP extensions to improve performance
over large bandwidth*delay product paths and to provide reliable over large bandwidth*delay product paths and to provide reliable
operation over very high-speed paths. It defines TCP options for operation over very high-speed paths. It defines TCP options for
scaled windows and timestamps, which are designed to provide scaled windows and timestamps, which are designed to provide
compatible interworking with TCP's that do not implement the compatible interworking with TCP's that do not implement the
extensions. The timestamps are used for two distinct mechanisms: extensions. The timestamps are used for two distinct mechanisms:
RTTM (Round Trip Time Measurement) and PAWS (Protection Against RTTM (Round Trip Time Measurement) and PAWS (Protection Against
skipping to change at page 1, line 46 skipping to change at page 1, line 46
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet- working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/. Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
This Internet-Draft will expire on November 19, 2012. This Internet-Draft will expire on January 12, 2013.
Copyright Notice Copyright Notice
Copyright (c) 2012 IETF Trust and the persons identified as the Copyright (c) 2012 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of (http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect carefully, as they describe your rights and restrictions with respect
skipping to change at page 3, line 33 skipping to change at page 3, line 33
4.2. The PAWS Mechanism . . . . . . . . . . . . . . . . . . . . 20 4.2. The PAWS Mechanism . . . . . . . . . . . . . . . . . . . . 20
4.2.1. Basic PAWS Algorithm . . . . . . . . . . . . . . . . . 21 4.2.1. Basic PAWS Algorithm . . . . . . . . . . . . . . . . . 21
4.2.2. Timestamp Clock . . . . . . . . . . . . . . . . . . . 23 4.2.2. Timestamp Clock . . . . . . . . . . . . . . . . . . . 23
4.2.3. Outdated Timestamps . . . . . . . . . . . . . . . . . 24 4.2.3. Outdated Timestamps . . . . . . . . . . . . . . . . . 24
4.2.4. Header Prediction . . . . . . . . . . . . . . . . . . 25 4.2.4. Header Prediction . . . . . . . . . . . . . . . . . . 25
4.2.5. IP Fragmentation . . . . . . . . . . . . . . . . . . . 26 4.2.5. IP Fragmentation . . . . . . . . . . . . . . . . . . . 26
4.3. Duplicates from Earlier Incarnations of Connection . . . . 27 4.3. Duplicates from Earlier Incarnations of Connection . . . . 27
5. Conclusions and Acknowledgements . . . . . . . . . . . . . . . 27 5. Conclusions and Acknowledgements . . . . . . . . . . . . . . . 27
6. Security Considerations . . . . . . . . . . . . . . . . . . . 28 6. Security Considerations . . . . . . . . . . . . . . . . . . . 28
7. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 28 7. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 28
8. References . . . . . . . . . . . . . . . . . . . . . . . . . . 28 8. References . . . . . . . . . . . . . . . . . . . . . . . . . . 29
8.1. Normative References . . . . . . . . . . . . . . . . . . . 28 8.1. Normative References . . . . . . . . . . . . . . . . . . . 29
8.2. Informative References . . . . . . . . . . . . . . . . . . 29 8.2. Informative References . . . . . . . . . . . . . . . . . . 29
Appendix A. Implementation Suggestions . . . . . . . . . . . . . 31 Appendix A. Implementation Suggestions . . . . . . . . . . . . . 31
Appendix B. Duplicates from Earlier Connection Incarnations . . . 32 Appendix B. Duplicates from Earlier Connection Incarnations . . . 32
B.1. System Crash with Loss of State . . . . . . . . . . . . . 32 B.1. System Crash with Loss of State . . . . . . . . . . . . . 32
B.2. Closing and Reopening a Connection . . . . . . . . . . . . 32 B.2. Closing and Reopening a Connection . . . . . . . . . . . . 33
Appendix C. Changes from RFC 1072, RFC 1185, and RFC 1323 . . . . 34 Appendix C. Changes from RFC 1072, RFC 1185, and RFC 1323 . . . . 34
Appendix D. Summary of Notation . . . . . . . . . . . . . . . . . 36 Appendix D. Summary of Notation . . . . . . . . . . . . . . . . . 36
Appendix E. Pseudo-code Summary . . . . . . . . . . . . . . . . . 37 Appendix E. Pseudo-code Summary . . . . . . . . . . . . . . . . . 37
Appendix F. Event Processing Summary . . . . . . . . . . . . . . 39 Appendix F. Event Processing Summary . . . . . . . . . . . . . . 39
Appendix G. Timestamps Edge Cases . . . . . . . . . . . . . . . . 44 Appendix G. Timestamps Edge Cases . . . . . . . . . . . . . . . . 44
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 45 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 45
1. Introduction 1. Introduction
The TCP protocol [RFC0793] was designed to operate reliably over The TCP protocol [RFC0793] was designed to operate reliably over
skipping to change at page 5, line 47 skipping to change at page 5, line 47
become common; this increases the probability of dropping more become common; this increases the probability of dropping more
than one packet per window. than one packet per window.
To generalize the Fast Retransmit/Fast Recovery mechanism to To generalize the Fast Retransmit/Fast Recovery mechanism to
handle multiple packets dropped per window, selective handle multiple packets dropped per window, selective
acknowledgments are required. Unlike the normal cumulative acknowledgments are required. Unlike the normal cumulative
acknowledgments of TCP, selective acknowledgments give the acknowledgments of TCP, selective acknowledgments give the
sender a complete picture of which segments are queued at the sender a complete picture of which segments are queued at the
receiver and which have not yet arrived. receiver and which have not yet arrived.
Since the publication of [RFC1323], selective acknowledgments Since the publication of RFC1323 [RFC1323], selective
(SACK) have become important in the LFN regime. SACK has been acknowledgments (SACK) have become important in the LFN regime.
published as a [RFC2018], "TCP Selective Acknowledgment SACK has been published as "TCP Selective Acknowledgment
Options".. Additional information about SACK can be found in Options" [RFC2018]. Additional information about SACK can be
[RFC2883], "An Extension to the Selective Acknowledgement (SACK) found in "An Extension to the Selective Acknowledgement (SACK)
option for TCP" and [RFC3517], "A Conservative Selective option for TCP" [RFC2883], and , "A Conservative Selective
Acknowledgment (SACK)-based Loss Recovery Algorithm for TCP". Acknowledgment (SACK)-based Loss Recovery Algorithm for TCP"
[RFC3517].
(3) Round-Trip Measurement (3) Round-Trip Measurement
TCP implements reliable data delivery by retransmitting segments TCP implements reliable data delivery by retransmitting segments
that are not acknowledged within some retransmission timeout that are not acknowledged within some retransmission timeout
(RTO) interval. Accurate dynamic determination of an (RTO) interval. Accurate dynamic determination of an
appropriate RTO is essential to TCP performance. RTO is appropriate RTO is essential to TCP performance. RTO is
determined by estimating the mean and variance of the measured determined by estimating the mean and variance of the measured
round-trip time (RTT), i.e., the time interval between sending a round-trip time (RTT), i.e., the time interval between sending a
segment and receiving an acknowledgment for it [Jacobson88a]. segment and receiving an acknowledgment for it [Jacobson88a].
skipping to change at page 6, line 35 skipping to change at page 6, line 36
Now we turn from performance to reliability. High transfer rate Now we turn from performance to reliability. High transfer rate
enters TCP performance through the bandwidth*delay product. However, enters TCP performance through the bandwidth*delay product. However,
high transfer rate alone can threaten TCP reliability by violating high transfer rate alone can threaten TCP reliability by violating
the assumptions behind the TCP mechanism for duplicate detection and the assumptions behind the TCP mechanism for duplicate detection and
sequencing. sequencing.
An especially serious kind of error may result from an accidental An especially serious kind of error may result from an accidental
reuse of TCP sequence numbers in data segments. Suppose that an "old reuse of TCP sequence numbers in data segments. Suppose that an "old
duplicate segment", e.g., a duplicate data segment that was delayed duplicate segment", e.g., a duplicate data segment that was delayed
in Internet queues, is delivered to the receiver at the wrong moment, in Internet queues, is delivered to the receiver at the wrong moment,
so that its sequence numbers falls somewhere within the current so that its sequence numbers fall somewhere within the current
window. There would be no checksum failure to warn of the error, and window. There would be no checksum failure to warn of the error, and
the result could be an undetected corruption of the data. Reception the result could be an undetected corruption of the data. Reception
of an old duplicate ACK segment at the transmitter could be only of an old duplicate ACK segment at the transmitter could be only
slightly less serious: it is likely to lock up the connection so that slightly less serious: it is likely to lock up the connection so that
no further progress can be made, forcing an RST on the connection. no further progress can be made, forcing an RST on the connection.
TCP reliability depends upon the existence of a bound on the lifetime TCP reliability depends upon the existence of a bound on the lifetime
of a segment: the "Maximum Segment Lifetime" or MSL. An MSL is of a segment: the "Maximum Segment Lifetime" or MSL. An MSL is
generally required by any reliable transport protocol, since every generally required by any reliable transport protocol, since every
sequence number field must be finite, and therefore any sequence sequence number field must be finite, and therefore any sequence
number may eventually be reused. In the Internet protocol suite, the number may eventually be reused. In the Internet protocol suite, the
MSL bound is enforced by an IP-layer mechanism, the "Time-to-Live" or MSL bound is loosely enforced by an IP-layer mechanism, the "Time-to-
TTL field. Live" (TTL) field, or "Hop Limit" field.
Duplication of sequence numbers might happen in either of two ways: Duplication of sequence numbers might happen in either of two ways:
(1) Sequence number wrap-around on the current connection (1) Sequence number wrap-around on the current connection
A TCP sequence number contains 32 bits. At a high enough A TCP sequence number contains 32 bits. At a high enough
transfer rate, the 32-bit sequence space may be "wrapped" transfer rate, the 32-bit sequence space may be "wrapped"
(cycled) within the time that a segment is delayed in queues. (cycled) within the time that a segment is delayed in queues.
(2) Earlier incarnation of the connection (2) Earlier incarnation of the connection
Suppose that a connection terminates, either by a proper close Suppose that a connection terminates, either by a proper close
sequence or due to a host crash, and the same connection (i.e., sequence or due to a host crash, and the same connection (i.e.,
using the same pair of sockets) is immediately reopened. A using the same pair of port numbers) is immediately reopened. A
delayed segment from the terminated connection could fall within delayed segment from the terminated connection could fall within
the current window for the new incarnation and be accepted as the current window for the new incarnation and be accepted as
valid. valid.
Duplicates from earlier incarnations, Case (2), are avoided by Duplicates from earlier incarnations, Case (2), are avoided by
enforcing the current fixed MSL of the TCP spec, as explained in enforcing the current fixed MSL of the TCP spec, as explained in
Section 4.3 and Appendix B. However, case (1), avoiding the reuse of Section 4.3 and Appendix B. However, case (1), avoiding the reuse of
sequence numbers within the same connection, requires an MSL bound sequence numbers within the same connection, requires an MSL bound
that depends upon the transfer rate, and at high enough rates, a new that depends upon the transfer rate, and at high enough rates, a new
mechanism is required. mechanism is required.
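As a rough illustration of case (1), the following sketch estimates how long a sender needs to cycle the full 32-bit sequence space at a given rate and compares that with an MSL. The function names and the 120-second default are illustrative assumptions, not part of this memo's specification.

```python
# Sketch: time to cycle the 32-bit TCP sequence space at a given rate.
# All names here are hypothetical; 120 s is used only as an example MSL.

SEQ_SPACE_BYTES = 2 ** 32  # one sequence number per byte of data


def seconds_to_wrap(rate_bits_per_second):
    """Seconds needed to send SEQ_SPACE_BYTES bytes at the given bit rate."""
    return SEQ_SPACE_BYTES * 8 / rate_bits_per_second


def can_wrap_within_msl(rate_bits_per_second, msl_seconds=120):
    """True if the sequence space can be cycled before the MSL expires."""
    return seconds_to_wrap(rate_bits_per_second) < msl_seconds


if __name__ == "__main__":
    for label, rate in [("10 Mbit/s", 10e6), ("1 Gbit/s", 1e9), ("10 Gbit/s", 10e9)]:
        print(label, round(seconds_to_wrap(rate), 1), can_wrap_within_msl(rate))
```

Under these assumptions, a 10 Mbit/s path takes close to an hour to cycle the space, while a 1 Gbit/s path does so in well under 120 seconds, which is the regime where an additional mechanism is needed.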
skipping to change at page 8, line 42 skipping to change at page 8, line 42
depend upon active enforcement of MSL for TCP connections, and it is depend upon active enforcement of MSL for TCP connections, and it is
unrealistic to imagine setting MSL's smaller than the current values unrealistic to imagine setting MSL's smaller than the current values
(e.g., 120 seconds specified for TCP). (e.g., 120 seconds specified for TCP).
A possible fix for the problem of cycling the sequence space would be A possible fix for the problem of cycling the sequence space would be
to increase the size of the TCP sequence number field. For example, to increase the size of the TCP sequence number field. For example,
the sequence number field (and also the acknowledgment field) could the sequence number field (and also the acknowledgment field) could
be expanded to 64 bits. This could be done either by changing the be expanded to 64 bits. This could be done either by changing the
TCP header or by means of an additional option. TCP header or by means of an additional option.
Section 4 presents a different mechanism, which we call PAWS (Protect Section 4 presents a different mechanism, which we call PAWS
Against Wrapped Sequence numbers), to extend TCP reliability to (Protection Against Wrapped Sequence numbers), to extend TCP
transfer rates well beyond the foreseeable upper limit of network reliability to transfer rates well beyond the foreseeable upper limit
bandwidths. PAWS uses the TCP Timestamps option defined in of network bandwidths. PAWS uses the TCP Timestamps option defined
Section 3.2 to protect against old duplicates from the same in Section 3.2 to protect against old duplicates from the same
connection. connection.
1.3. Using TCP options 1.3. Using TCP options
The extensions defined in this memo all use new TCP options. We must The extensions defined in this memo all use new TCP options. We must
address two possible issues concerning the use of TCP options: (1) address two possible issues concerning the use of TCP options: (1)
compatibility and (2) overhead. compatibility and (2) overhead.
We must pay careful attention to compatibility, i.e., to We must pay careful attention to compatibility, i.e., to
interoperation with existing implementations. The only TCP option interoperation with existing implementations. The only TCP option
skipping to change at page 9, line 25 skipping to change at page 9, line 25
options on SYN segments. When RFC 1323 was published, there was options on SYN segments. When RFC 1323 was published, there was
concern that some buggy TCP implementation might be crashed by the concern that some buggy TCP implementation might be crashed by the
first appearance of an option on a non-SYN segment. However, bugs first appearance of an option on a non-SYN segment. However, bugs
like that can lead to DOS attacks against a TCP, so it is now like that can lead to DOS attacks against a TCP, so it is now
expected that most TCP implementations will properly handle unknown expected that most TCP implementations will properly handle unknown
options on non-SYN segments. But it is still prudent to be options on non-SYN segments. But it is still prudent to be
conservative in what you send, and avoiding buggy TCP implementation conservative in what you send, and avoiding buggy TCP implementation
is not the only reason for negotiating TCP options on SYN segments. is not the only reason for negotiating TCP options on SYN segments.
Therefore, for each of the extensions defined below, TCP options will Therefore, for each of the extensions defined below, TCP options will
be sent on non-SYN segments only after an exchange of options on the be sent on non-SYN segments only after an exchange of options on the
the SYN segments has indicated that both sides understand the SYN segments has indicated that both sides understand the extension.
extension. Furthermore, an extension option will be sent in a Furthermore, an extension option will be sent in a <SYN,ACK> segment
<SYN,ACK> segment only if the corresponding option was received in only if the corresponding option was received in the initial <SYN>
the initial <SYN> segment. segment.
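The negotiation rule above can be sketched as set operations. The option labels and function names below are illustrative assumptions. The sketch only models which options may be sent, not their encoding.

```python
# Sketch of the SYN-based option negotiation rule described above.
# Option names are plain strings used as labels; all names are hypothetical.


def options_for_synack(received_in_syn, locally_supported):
    """An option goes into the <SYN,ACK> only if it arrived in the <SYN>."""
    return set(received_in_syn) & set(locally_supported)


def options_for_non_syn_segments(sent_on_syn, received_on_syn):
    """An extension is used on non-SYN segments only if both SYNs carried it."""
    return set(sent_on_syn) & set(received_on_syn)


if __name__ == "__main__":
    peer_syn = {"TSopt"}                 # peer offered Timestamps only
    supported = {"TSopt", "WSopt"}       # we implement both extensions
    synack = options_for_synack(peer_syn, supported)
    print(sorted(synack))
    print(sorted(options_for_non_syn_segments(synack, peer_syn)))
```

In this example the passive side supports Window Scale but does not send it, because the initial SYN did not carry it.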
A question may be raised about the bandwidth and processing overhead A question may be raised about the bandwidth and processing overhead
for TCP options. Those options that occur on SYN segments are not for TCP options. Those options that occur on SYN segments are not
likely to cause a performance concern. Opening a TCP connection likely to cause a performance concern. Opening a TCP connection
requires execution of significant special-case code, and the requires execution of significant special-case code, and the
processing of options is unlikely to increase that cost processing of options is unlikely to increase that cost
significantly. significantly.
On the other hand, a Timestamps option may appear in any data or ACK On the other hand, a Timestamps option may appear in any data or ACK
segment, adding 12 bytes to the 20-byte TCP header. We believe that segment, adding 12 bytes to the 20-byte TCP header. We believe that
skipping to change at page 11, line 37 skipping to change at page 11, line 37
The Window field in a SYN (i.e., a <SYN> or <SYN,ACK>) segment itself The Window field in a SYN (i.e., a <SYN> or <SYN,ACK>) segment itself
is never scaled. is never scaled.
2.3. Using the Window Scale Option 2.3. Using the Window Scale Option
A model implementation of window scaling is as follows, using the A model implementation of window scaling is as follows, using the
notation of [RFC0793]: notation of [RFC0793]:
o All windows are treated as 32-bit quantities for storage in the o All windows are treated as 32-bit quantities for storage in the
connection control block and for local calculations. This connection control block and for local calculations. This
includes the send-window (SND.WND) and the receive- window includes the send-window (SND.WND) and the receive-window
(RCV.WND) values, as well as the congestion window. (RCV.WND) values, as well as the congestion window.
o The connection state is augmented by two window shift counts, o The connection state is augmented by two window shift counts,
Snd.Wind.Scale and Rcv.Wind.Scale, to be applied to the incoming Snd.Wind.Scale and Rcv.Wind.Scale, to be applied to the incoming
and outgoing window fields, respectively. and outgoing window fields, respectively.
o If a TCP receives a <SYN> segment containing a Window Scale o If a TCP receives a <SYN> segment containing a Window Scale
option, it sends its own Window Scale option in the <SYN,ACK> option, it sends its own Window Scale option in the <SYN,ACK>
segment. segment.
skipping to change at page 12, line 25 skipping to change at page 12, line 25
o The window field (SEG.WND) of every outgoing segment, with the o The window field (SEG.WND) of every outgoing segment, with the
exception of SYN segments, is right-shifted by Rcv.Wind.Scale exception of SYN segments, is right-shifted by Rcv.Wind.Scale
bits: bits:
SEG.WND = RCV.WND >> Rcv.Wind.Scale SEG.WND = RCV.WND >> Rcv.Wind.Scale
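A minimal sketch of the outgoing shift follows. The function names are hypothetical. The clamp to 16 bits and the peer-side left shift are assumptions added to make the round trip visible.

```python
# Sketch: computing the 16-bit window field from a 32-bit receive window.
# Hypothetical helper names; the 0xFFFF clamp is an added assumption.

MAX_WINDOW_FIELD = 0xFFFF  # the TCP header window field is 16 bits wide


def window_field(rcv_wnd, rcv_wind_scale, is_syn=False):
    """Value to place in the window field of an outgoing segment."""
    if is_syn:
        # The window field in a SYN segment itself is never scaled.
        return min(rcv_wnd, MAX_WINDOW_FIELD)
    return min(rcv_wnd >> rcv_wind_scale, MAX_WINDOW_FIELD)


def window_seen_by_peer(seg_wnd, snd_wind_scale):
    """Assumed inverse on the peer: left-shift the received field."""
    return seg_wnd << snd_wind_scale


if __name__ == "__main__":
    field = window_field(1_000_000, 7)
    print(field, window_seen_by_peer(field, 7))
```

Because the right shift discards the low-order bits, the peer reconstructs a window that is rounded down to a multiple of 2^Rcv.Wind.Scale (here 999936 rather than 1000000).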
TCP determines if a data segment is "old" or "new" by testing whether TCP determines if a data segment is "old" or "new" by testing whether
its sequence number is within 2^31 bytes of the left edge of the its sequence number is within 2^31 bytes of the left edge of the
window, and if it is not, discarding the data as "old". To insure window, and if it is not, discarding the data as "old". To insure
that new data is never mistakenly considered old and vice- versa, the that new data is never mistakenly considered old and vice versa, the
left edge of the sender's window has to be at most 2^31 away from the left edge of the sender's window has to be at most 2^31 away from the
right edge of the receiver's window. Similarly with the sender's right edge of the receiver's window. Similarly with the sender's
right edge and receiver's left edge. Since the right and left edges right edge and receiver's left edge. Since the right and left edges
of either the sender's or receiver's window differ by the window of either the sender's or receiver's window differ by the window
size, and since the sender and receiver windows can be out of phase size, and since the sender and receiver windows can be out of phase
by at most the window size, the above constraints imply that 2 * the by at most the window size, the above constraints imply that 2 * the
max window size must be less than 2^31, or max window size must be less than 2^31, or
max window < 2^30 max window < 2^30
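The bound above limits how large a shift count can be while the largest 16-bit window field still scales to less than 2^30. The following sketch derives that limit by direct search; the names are illustrative assumptions.

```python
# Sketch: largest shift count for which the scaled window stays below 2^30.
# Hypothetical names; the search simply applies the bound stated above.

MAX_WINDOW_FIELD = 0xFFFF      # largest value of the 16-bit window field
MAX_WINDOW_BOUND = 2 ** 30     # "max window < 2^30"


def largest_safe_shift():
    """Largest shift with (MAX_WINDOW_FIELD << shift) < MAX_WINDOW_BOUND."""
    shift = 0
    while (MAX_WINDOW_FIELD << (shift + 1)) < MAX_WINDOW_BOUND:
        shift += 1
    return shift


if __name__ == "__main__":
    s = largest_safe_shift()
    print(s, MAX_WINDOW_FIELD << s)
```

With a 16-bit field, the search stops at a shift of 14, since 65535 * 2^14 is just under 2^30 while 65535 * 2^15 exceeds it.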
skipping to change at page 13, line 33 skipping to change at page 13, line 33
2) When window scaling is in effect, the receiver SHOULD track the 2) When window scaling is in effect, the receiver SHOULD track the
actual maximum window sequence number (which is likely to be actual maximum window sequence number (which is likely to be
greater than the window announced by the most recent ACK, if more greater than the window announced by the most recent ACK, if more
than one segment has arrived since the application consumed any than one segment has arrived since the application consumed any
data in the receive buffer). data in the receive buffer).
On the sender side: On the sender side:
3) The initial transmission MUST honor window on most recent ACK. 3) The initial transmission MUST honor window on most recent ACK.
4) On first retransmission, or if it is out-of-window by less than 4) On first retransmission, or if the sequence number is out-of-
(2^Rcv.Wind.Scale) then do normal retransmission(s) without window by less than (2^Rcv.Wind.Scale) then do normal
regard to receiver window as long as the original segment was in retransmission(s) without regard to receiver window as long as
window when it was sent. the original segment was in window when it was sent.
5) On subsequent retransmissions, treat it as zero window probes. 5) On subsequent retransmissions, treat such ACKs as zero window
probes.
3. RTTM -- Round-Trip Time Measurement 3. RTTM -- Round-Trip Time Measurement
3.1. Introduction 3.1. Introduction
Accurate and current RTT estimates are necessary to adapt to changing Accurate and current RTT estimates are necessary to adapt to changing
traffic conditions and to avoid an instability known as "congestion traffic conditions and to avoid an instability known as "congestion
collapse" [RFC0896] in a busy network. However, accurate measurement collapse" [RFC0896] in a busy network. However, accurate measurement
of RTT may be difficult both in theory and in implementation. of RTT may be difficult both in theory and in implementation.
skipping to change at page 17, line 9 skipping to change at page 17, line 11
Since TCP B is not sending data, the data segment C does not Since TCP B is not sending data, the data segment C does not
acknowledge any new data when it arrives at B. Thus, the inflated acknowledge any new data when it arrives at B. Thus, the inflated
RTTM measurement is not used to update B's RTTM measurement. RTTM measurement is not used to update B's RTTM measurement.
Implementors should note that with Timestamps multiple RTTMs can be Implementors should note that with Timestamps multiple RTTMs can be
taken per RTT. Many RTO estimators have a weighting factor based on taken per RTT. Many RTO estimators have a weighting factor based on
an implicit assumption that at most one RTTM will be gotten per RTT. an implicit assumption that at most one RTTM will be gotten per RTT.
When using multiple RTTMs per RTT to update the RTO estimator, the When using multiple RTTMs per RTT to update the RTO estimator, the
weighting factor needs to be decreased to take into account the more weighting factor needs to be decreased to take into account the more
frequent RTTMs. For example, an implementation could choose to just frequent RTTMs. For example, an implementation could choose to just
use one sample per RTT to update the RTO estimator, or or vary the use one sample per RTT to update the RTO estimator, or vary the gain
gain based on the congestion window, or take an average of all the based on the congestion window, or take an average of all the RTTM
RTTM measurements received over one RTT, and then use that value to measurements received over one RTT, and then use that value to update
update the RTO estimator. This document does not prescribe any the RTO estimator. This document does not prescribe any particular
particular method for modifying the RTO estimator, the important method for modifying the RTO estimator, the important point is that
point is that the implementation should do something more than just the implementation should do something more than just feeding
feeding additional RTTM samples from one RTT into the RTO estimator. additional RTTM samples from one RTT into the RTO estimator.
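One of the options mentioned above, averaging the RTTM samples of one RTT before updating the estimator, might look like the sketch below. The estimator uses the commonly cited gains of 1/8 and 1/4 as an assumption; this document does not mandate them, and all class and function names are hypothetical.

```python
# Sketch: feed one averaged RTTM sample per RTT into an RTO estimator.
# The gains (1/8, 1/4), the 4*RTTVAR term, and the 1 s floor are assumptions
# borrowed from common practice, not requirements of this memo.


class RtoEstimator:
    ALPHA = 1 / 8  # gain for the smoothed RTT
    BETA = 1 / 4   # gain for the RTT variation

    def __init__(self):
        self.srtt = None
        self.rttvar = None

    def update(self, rtt):
        """Fold a single RTT measurement (seconds) into the estimator."""
        if self.srtt is None:
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt

    def rto(self):
        """Retransmission timeout in seconds, floored at one second."""
        return max(1.0, self.srtt + 4 * self.rttvar)


def update_once_per_rtt(estimator, samples_in_one_rtt):
    """Average all RTTMs collected during one RTT, then update once."""
    if samples_in_one_rtt:
        estimator.update(sum(samples_in_one_rtt) / len(samples_in_one_rtt))


if __name__ == "__main__":
    est = RtoEstimator()
    update_once_per_rtt(est, [0.100, 0.120, 0.110])
    update_once_per_rtt(est, [0.130, 0.125])
    print(est.srtt, est.rttvar, est.rto())
```

Averaging keeps the estimator's effective gain per RTT unchanged no matter how many timestamped ACKs arrive. Using only one sample per RTT, or scaling the gains by the congestion window, are the other approaches mentioned above and are not shown.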
3.4. Which Timestamp to Echo 3.4. Which Timestamp to Echo
If more than one Timestamps option is received before a reply segment If more than one Timestamps option is received before a reply segment
is sent, the TCP must choose only one of the TSvals to echo, ignoring is sent, the TCP must choose only one of the TSvals to echo, ignoring
the others. To minimize the state kept in the receiver (i.e., the the others. To minimize the state kept in the receiver (i.e., the
number of unprocessed TSvals), the receiver should be required to number of unprocessed TSvals), the receiver should be required to
retain at most one timestamp in the connection control block. retain at most one timestamp in the connection control block.
There are three situations to consider: There are three situations to consider:
skipping to change at page 18, line 27 skipping to change at page 18, line 28
(1) The connection state is augmented with two 32-bit slots: (1) The connection state is augmented with two 32-bit slots:
TS.Recent holds a timestamp to be echoed in TSecr whenever a TS.Recent holds a timestamp to be echoed in TSecr whenever a
segment is sent, and Last.ACK.sent holds the ACK field from the segment is sent, and Last.ACK.sent holds the ACK field from the
last segment sent. Last.ACK.sent will equal RCV.NXT except when last segment sent. Last.ACK.sent will equal RCV.NXT except when
ACKs have been delayed. ACKs have been delayed.
(2) If: (2) If:
SEG.TSval >= TSrecent and SEG.SEQ <= Last.ACK.sent SEG.TSval >= TS.recent and SEG.SEQ <= Last.ACK.sent
then SEG.TSval is copied to TS.Recent; otherwise, it is ignored. then SEG.TSval is copied to TS.Recent; otherwise, it is ignored.
(3) When a TSopt is sent, its TSecr field is set to the current (3) When a TSopt is sent, its TSecr field is set to the current
TS.Recent value. TS.Recent value.
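Rules (1) through (3) can be sketched as a small state object. The class and method names are hypothetical. The modulo-2^32 comparisons and the initial TS.Recent of 0 are assumptions made so the sketch is self-contained.

```python
# Sketch of the TS.Recent / Last.ACK.sent rules above.
# Names are hypothetical; modular comparison and the initial TS.Recent value
# are assumptions (a real TCP would initialize from the SYN exchange).

MOD = 2 ** 32
HALF = 2 ** 31


def geq_mod32(a, b):
    """a >= b in 32-bit modular arithmetic."""
    return ((a - b) % MOD) < HALF


class TimestampEchoState:
    def __init__(self, rcv_nxt, ts_recent=0):
        self.ts_recent = ts_recent       # TS.Recent
        self.last_ack_sent = rcv_nxt     # Last.ACK.sent

    def on_segment(self, seg_seq, seg_tsval):
        """Rule (2): copy SEG.TSval to TS.Recent only if both tests pass."""
        if geq_mod32(seg_tsval, self.ts_recent) and geq_mod32(self.last_ack_sent, seg_seq):
            self.ts_recent = seg_tsval

    def on_send(self, ack_field):
        """Rule (3): record the ACK field sent and return the TSecr to echo."""
        self.last_ack_sent = ack_field
        return self.ts_recent


if __name__ == "__main__":
    # Three 100-byte segments arrive before a single delayed ACK is sent.
    state = TimestampEchoState(rcv_nxt=1000)
    state.on_segment(1000, 1)   # A: SEG.SEQ == Last.ACK.sent, so TS.Recent = 1
    state.on_segment(1100, 2)   # B: SEG.SEQ > Last.ACK.sent, ignored
    state.on_segment(1200, 3)   # C: ignored for the same reason
    print(state.on_send(1300))  # TSecr carried by the delayed ACK
```

In this trace the delayed ACK echoes the timestamp of the oldest unacknowledged segment, so the sender's RTT measurement includes the time the receiver held back the ACK.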
The following examples illustrate these rules. Here A, B, C... The following examples illustrate these rules. Here A, B, C...
represent data segments occupying successive blocks of sequence represent data segments occupying successive blocks of sequence
numbers, and ACK(A),... represent the corresponding acknowledgment numbers, and ACK(A),... represent the corresponding acknowledgment
segments. Note that ACK(A) has the same sequence number as B. We segments. Note that ACK(A) has the same sequence number as B. We
skipping to change at page 19, line 4 skipping to change at page 19, line 12
By Case (A), the timestamp from the oldest unacknowledged segment By Case (A), the timestamp from the oldest unacknowledged segment
is echoed. is echoed.
TS.Recent TS.Recent
<A, TSval=1> -------------------> <A, TSval=1> ------------------->
1 1
<B, TSval=2> -------------------> <B, TSval=2> ------------------->
1 1
<C, TSval=3> -------------------> <C, TSval=3> ------------------->
1 1
<---- <ACK(C), TSecr=1> <---- <ACK(C), TSecr=1>
(etc) (etc)
o Packets arrive out of order, and every packet is acknowledged. o Packets arrive out of order, and every packet is acknowledged.
By Case (B), the timestamp from the last segment that advanced the By Case (B), the timestamp from the last segment that advanced the
left window edge is echoed, until the missing segment arrives; it left window edge is echoed, until the missing segment arrives; it
is echoed according to Case (C). The same sequence would occur if is echoed according to Case (C). The same sequence would occur if
segments B and D were lost and retransmitted.. segments B and D were lost and retransmitted.
TS.Recent TS.Recent
<A, TSval=1> -------------------> <A, TSval=1> ------------------->
1 1
<---- <ACK(A), TSecr=1> <---- <ACK(A), TSecr=1>
1 1
<C, TSval=3> -------------------> <C, TSval=3> ------------------->
1 1
<---- <ACK(A), TSecr=1> <---- <ACK(A), TSecr=1>
1 1
skipping to change at page 19, line 41 skipping to change at page 19, line 48
2 2
<D, TSval=4> -------------------> <D, TSval=4> ------------------->
4 4
<---- <ACK(E), TSecr=4> <---- <ACK(E), TSecr=4>
(etc) (etc)
4. PAWS -- Protection Against Wrapped Sequence Numbers 4. PAWS -- Protection Against Wrapped Sequence Numbers
4.1. Introduction 4.1. Introduction
Section 4.2describes a simple mechanism to reject old duplicate Section 4.2 describes a simple mechanism to reject old duplicate
segments that might corrupt an open TCP connection; we call this segments that might corrupt an open TCP connection; we call this
mechanism PAWS (Protection Against Wrapped Sequence numbers). PAWS mechanism PAWS (Protection Against Wrapped Sequence numbers). PAWS
operates within a single TCP connection, using state that is saved in operates within a single TCP connection, using state that is saved in
the connection control block. Section 4.3 and Appendix C discuss the the connection control block. Section 4.3 and Appendix C discuss the
implications of the PAWS mechanism for avoiding old duplicates from implications of the PAWS mechanism for avoiding old duplicates from
previous incarnations of the same connection. previous incarnations of the same connection.
4.2. The PAWS Mechanism 4.2. The PAWS Mechanism
PAWS uses the same TCP Timestamps option as the RTTM mechanism PAWS uses the same TCP Timestamps option as the RTTM mechanism
skipping to change at page 20, line 47 skipping to change at page 20, line 50
are carried in both data and ACK segments and are echoed in TSecr are carried in both data and ACK segments and are echoed in TSecr
fields carried in returning ACK or data segments. PAWS submits all fields carried in returning ACK or data segments. PAWS submits all
incoming segments to the same test, and therefore protects against incoming segments to the same test, and therefore protects against
duplicate ACK segments as well as data segments. (An alternative duplicate ACK segments as well as data segments. (An alternative
non-symmetric algorithm would protect against old duplicate ACKs: the non-symmetric algorithm would protect against old duplicate ACKs: the
sender of data would reject incoming ACK segments whose TSecr values sender of data would reject incoming ACK segments whose TSecr values
were less than the TSecr saved from the last segment whose ACK field were less than the TSecr saved from the last segment whose ACK field
advanced the left edge of the send window. This algorithm was deemed advanced the left edge of the send window. This algorithm was deemed
to lack economy of mechanism and symmetry.) to lack economy of mechanism and symmetry.)
TSval timestamps sent on >SYN< and >SYN,ACK< segments are used to TSval timestamps sent on <SYN> and <SYN,ACK> segments are used to
initialize PAWS. PAWS protects against old duplicate non-SYN initialize PAWS. PAWS protects against old duplicate non-SYN
segments, and duplicate SYN segments received while there is a segments, and duplicate SYN segments received while there is a
synchronized connection. Duplicate >SYN< and >SYN,ACK< segments synchronized connection. Duplicate <SYN> and <SYN,ACK> segments
received when there is no connection will be discarded by the normal received when there is no connection will be discarded by the normal
3-way handshake and sequence number checks of TCP. 3-way handshake and sequence number checks of TCP.
RFC 1323 recommended that RST segments NOT carry timestamps, and that RFC 1323 recommended that RST segments NOT carry timestamps, and that
they be acceptable regardless of their timestamp. At that time, the they be acceptable regardless of their timestamp. At that time, the
thinking was that old duplicate RST segments should be exceedingly thinking was that old duplicate RST segments should be exceedingly
unlikely, and their cleanup function should take precedence over unlikely, and their cleanup function should take precedence over
timestamps. More recently, discussion about various blind attacks on timestamps. More recently, discussions about various blind attacks
TCP connections have raised the suggestion that if the Timestamps on TCP connections have raised the suggestion that if the Timestamps
option is present, SEG.TSecr could be used to provide stricter option is present, SEG.TSecr could be used to provide stricter
acceptance tests for RST packets. While still under discussion, to acceptance tests for RST packets. While still under discussion, to
enable research into this area it is now recommended that when enable research into this area it is now recommended that when
generating a RST, that if the packet causing the RST to be generated generating a RST, that if the packet causing the RST to be generated
contained a Timestamps option that the RST also contain a Timestamps contained a Timestamps option that the RST also contain a Timestamps
option. In the RST segment, SEG.TSecr should be set to SEG.TSval option. In the RST segment, SEG.TSecr should be set to SEG.TSval
from the incoming packet and SEG.TSval should be set to zero. If a from the incoming packet and SEG.TSval should be set to zero. If a
RST is being generated because of a user abort, and Snd.TS.OK is set, RST is being generated because of a user abort, and Snd.TS.OK is set,
then a Timestamps option should be included in the RST. When a RST then a Timestamps option should be included in the RST. When a RST
packet is received, it must not be subjected to PAWS checks, and packet is received, it must not be subjected to PAWS checks, and
skipping to change at page 22, line 5 skipping to change at page 22, line 11
R2) If the segment is outside the window, reject it (normal TCP R2) If the segment is outside the window, reject it (normal TCP
processing) processing)
R3) If an arriving segment satisfies: SEG.SEQ <= Last.ACK.sent (see R3) If an arriving segment satisfies: SEG.SEQ <= Last.ACK.sent (see
Section 3.4), then record its timestamp in TS.Recent. Section 3.4), then record its timestamp in TS.Recent.
R4) If an arriving segment is in-sequence (i.e., at the left window R4) If an arriving segment is in-sequence (i.e., at the left window
edge), then accept it normally. edge), then accept it normally.
R5) Otherwise, treat the segment as a normal in-window, out- of- R5) Otherwise, treat the segment as a normal in-window, out-of-
sequence TCP segment (e.g., queue it for later delivery to the sequence TCP segment (e.g., queue it for later delivery to the
user). user).
Steps R2, R4, and R5 are the normal TCP processing steps specified by Steps R2, R4, and R5 are the normal TCP processing steps specified by
RFC 793. RFC 793.
It is important to note that the timestamp is checked only when a It is important to note that the timestamp is checked only when a
segment first arrives at the receiver, regardless of whether it is segment first arrives at the receiver, regardless of whether it is
in-sequence or it must be queued for later delivery. in-sequence or it must be queued for later delivery.
skipping to change at page 23, line 20 skipping to change at page 23, line 27
We know of no case with a significant probability of occurrence in We know of no case with a significant probability of occurrence in
which timestamps will cause performance degradation by unnecessarily which timestamps will cause performance degradation by unnecessarily
discarding segments. discarding segments.
4.2.2. Timestamp Clock 4.2.2. Timestamp Clock
It is important to understand that the PAWS algorithm does not It is important to understand that the PAWS algorithm does not
require clock synchronization between sender and receiver. The require clock synchronization between sender and receiver. The
sender's timestamp clock is used to stamp the segments, and the sender's timestamp clock is used to stamp the segments, and the
sender uses the echoed timestamp to measure RTT's. However, the sender uses the echoed timestamp to measure RTTs. However, the
receiver treats the timestamp as simply a monotonically increasing receiver treats the timestamp as simply a monotonically increasing
serial number, without any necessary connection to its clock. From serial number, without any necessary connection to its clock. From
the receiver's viewpoint, the timestamp is acting as a logical the receiver's viewpoint, the timestamp is acting as a logical
extension of the high-order bits of the sequence number. extension of the high-order bits of the sequence number.
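The "serial number" view of the timestamp described above can be sketched as a comparison modulo 2^32. This is only an illustration: the function names are hypothetical, and the rejection rule (a segment is unacceptable when SEG.TSval < TS.Recent) is taken from the PAWS test in RFC 1323, which is not quoted in this excerpt.

```python
MOD32 = 1 << 32  # timestamps are 32-bit unsigned values


def ts_lt(a: int, b: int) -> bool:
    """Return True if timestamp a is older than b in modulo-2^32 order.

    Equivalent to the C idiom (int32_t)(a - b) < 0: the difference is
    interpreted as a signed 32-bit number.
    """
    return (a - b) % MOD32 >= (1 << 31)


def paws_reject(seg_tsval: int, ts_recent: int) -> bool:
    """Hypothetical helper: True if the segment would fail the PAWS test."""
    return ts_lt(seg_tsval, ts_recent)
```

Note that the receiver never consults its own clock here; it only orders the sender's values, which is why no clock synchronization is needed.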
The receiver algorithm does place some requirements on the frequency The receiver algorithm does place some requirements on the frequency
of the timestamp clock. of the timestamp clock.
(a) The timestamp clock must not be "too slow". (a) The timestamp clock must not be "too slow".
skipping to change at page 24, line 9 skipping to change at page 24, line 15
every 59 ns. every 59 ns.
However, it is desirable to establish a much longer recycle However, it is desirable to establish a much longer recycle
period, in order to handle outdated timestamps on idle period, in order to handle outdated timestamps on idle
connections (see Section 4.2.3), and to relax the MSL connections (see Section 4.2.3), and to relax the MSL
requirement for preventing sequence number wrap-around. With a requirement for preventing sequence number wrap-around. With a
1 ms timestamp clock, the 32-bit timestamp will wrap its sign 1 ms timestamp clock, the 32-bit timestamp will wrap its sign
bit in 24.8 days. Thus, it will reject old duplicates on the bit in 24.8 days. Thus, it will reject old duplicates on the
same connection if MSL is 24.8 days or less. This appears to be same connection if MSL is 24.8 days or less. This appears to be
a very safe figure; an MSL of 24.8 days or longer can probably a very safe figure; an MSL of 24.8 days or longer can probably
be assumed by the gateway system without requiring precise MSL be assumed in the internet without requiring precise MSL
enforcement by the TTL value in the IP layer. enforcement.
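The two figures quoted above can be checked with a few lines of arithmetic. The sketch below assumes that the sign bit wraps after 2^31 ticks and that the 59 ns bound corresponds to a worst-case MSL of 255 seconds; the latter value is an assumption here, since the text that states it is not part of this excerpt. The function names are illustrative only.

```python
SECONDS_PER_DAY = 86_400


def sign_wrap_days(tick_seconds: float) -> float:
    """Days until a 32-bit timestamp wraps its sign bit (2^31 ticks)."""
    return (2 ** 31) * tick_seconds / SECONDS_PER_DAY


def min_tick_ns(msl_seconds: float) -> float:
    """Shortest tick (in ns) for which 2^32 ticks still exceed the MSL."""
    return msl_seconds / (2 ** 32) * 1e9


days_at_1ms = sign_wrap_days(0.001)   # roughly 24.8 days
fastest_tick = min_tick_ns(255)       # roughly 59 ns (assumed MSL of 255 s)
```

With a 1 second tick the same calculation gives about 24855 days, so the whole recommended range of 1 ms to 1 sec per tick stays well above any realistic MSL.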
Based upon these considerations, we choose a timestamp clock Based upon these considerations, we choose a timestamp clock
frequency in the range 1 ms to 1 sec per tick. This range also frequency in the range 1 ms to 1 sec per tick. This range also
matches the requirements of the RTTM mechanism, which does not need matches the requirements of the RTTM mechanism, which does not need
much more resolution than the granularity of the retransmit timer, much more resolution than the granularity of the retransmit timer,
e.g., tens or hundreds of milliseconds. e.g., tens or hundreds of milliseconds.
The PAWS mechanism also puts a strong monotonicity requirement on the The PAWS mechanism also puts a strong monotonicity requirement on the
sender's timestamp clock. The method of implementation of the sender's timestamp clock. The method of implementation of the
timestamp clock to meet this requirement depends upon the system timestamp clock to meet this requirement depends upon the system
skipping to change at page 26, line 5 skipping to change at page 26, line 14
H3) Process the segment normally, as specified in RFC 793. This H3) Process the segment normally, as specified in RFC 793. This
includes dropping segments that are outside the window and includes dropping segments that are outside the window and
possibly sending acknowledgments, and queueing in-window, out- possibly sending acknowledgments, and queueing in-window, out-
of-sequence segments. of-sequence segments.
Another possibility would be to interchange steps H1 and H2, i.e., to Another possibility would be to interchange steps H1 and H2, i.e., to
perform the header prediction step H2 FIRST, and perform H1 and H3 perform the header prediction step H2 FIRST, and perform H1 and H3
only when header prediction fails. This could be a performance only when header prediction fails. This could be a performance
improvement, since the timestamp check in step H1 is very unlikely to improvement, since the timestamp check in step H1 is very unlikely to
fail, and it requires unsigned modulo arithmetic, a relatively fail, and it requires unsigned modulo arithmetic. To perform this
expensive operation. To perform this check on every single segment check on every single segment is contrary to the philosophy of header
is contrary to the philosophy of header prediction. We believe that prediction. We believe that this change might produce a measurable
this change might produce a measurable reduction in CPU time for TCP reduction in CPU time for TCP protocol processing on high-speed
protocol processing on high-speed networks. networks.
However, putting H2 first would create a hazard: a segment from 2^32 However, putting H2 first would create a hazard: a segment from 2^32
bytes in the past might arrive at exactly the wrong time and be bytes in the past might arrive at exactly the wrong time and be
accepted mistakenly by the header-prediction step. The following accepted mistakenly by the header-prediction step. The following
reasoning has been introduced in [RFC1185] to show that the reasoning has been introduced in [RFC1185] to show that the
probability of this failure is negligible. probability of this failure is negligible.
If all segments are equally likely to show up as old duplicates, If all segments are equally likely to show up as old duplicates,
then the probability of an old duplicate exactly matching the left then the probability of an old duplicate exactly matching the left
window edge is the maximum segment size (MSS) divided by the size window edge is the maximum segment size (MSS) divided by the size
skipping to change at page 26, line 44 skipping to change at page 27, line 4
However, this probabilistic argument is not universally accepted, and However, this probabilistic argument is not universally accepted, and
the consensus at present is that the performance gain does not the consensus at present is that the performance gain does not
justify the hazard in the general case. It is therefore recommended justify the hazard in the general case. It is therefore recommended
that H2 follow H1. that H2 follow H1.
4.2.5. IP Fragmentation 4.2.5. IP Fragmentation
At high data rates, the protection against old packets provided by At high data rates, the protection against old packets provided by
PAWS can be circumvented by errors in IP fragment reassembly (see PAWS can be circumvented by errors in IP fragment reassembly (see
[RFC4963]). The only way to protect against incorrect IP fragment [RFC4963]). The only way to protect against incorrect IP fragment
reassembly is to not allow the packets to be fragmented. This is reassembly is to not allow the packets to be fragmented. This is
done by setting the Don't Fragment (DF) bit in the IP header. done by setting the Don't Fragment (DF) bit in the IP header.
Setting the DF bit implies the use of Path MTU Discovery as described Setting the DF bit implies the use of Path MTU Discovery as described
in [RFC1191], thus any TCP implementation that implements PAWS must in [RFC1191], [RFC1981], and [RFC4821], thus any TCP implementation
also implement Path MTU Discovery. that implements PAWS must also implement Path MTU Discovery.
4.3. Duplicates from Earlier Incarnations of Connection 4.3. Duplicates from Earlier Incarnations of Connection
The PAWS mechanism protects against errors due to sequence number The PAWS mechanism protects against errors due to sequence number
wrap-around on high-speed connection. Segments from an earlier wrap-around on high-speed connections. Segments from an earlier
incarnation of the same connection are also a potential cause of old incarnation of the same connection are also a potential cause of old
duplicate errors. In both cases, the TCP mechanisms to prevent such duplicate errors. In both cases, the TCP mechanisms to prevent such
errors depend upon the enforcement of a maximum segment lifetime errors depend upon the enforcement of a maximum segment lifetime
(MSL) by the Internet (IP) layer (see Appendix of RFC 1185 for a (MSL) by the Internet (IP) layer (see Appendix of RFC 1185 for a
detailed discussion). Unlike the case of sequence space wrap-around, detailed discussion). Unlike the case of sequence space wrap-around,
the MSL required to prevent old duplicate errors from earlier the MSL required to prevent old duplicate errors from earlier
incarnations does not depend upon the transfer rate. If the IP layer incarnations does not depend upon the transfer rate. If the IP layer
enforces the recommended 2 minute MSL of TCP, and if the TCP rules enforces the recommended 2 minute MSL of TCP, and if the TCP rules
are followed, TCP connections will be safe from earlier incarnations, are followed, TCP connections will be safe from earlier incarnations,
no matter how high the network speed. Thus, the PAWS mechanism is no matter how high the network speed. Thus, the PAWS mechanism is
skipping to change at page 27, line 38 skipping to change at page 27, line 45
5. Conclusions and Acknowledgements 5. Conclusions and Acknowledgements
This memo presented a set of extensions to TCP to provide efficient This memo presented a set of extensions to TCP to provide efficient
operation over large-bandwidth*delay-product paths and reliable operation over large-bandwidth*delay-product paths and reliable
operation over very high-speed paths. These extensions are designed operation over very high-speed paths. These extensions are designed
to provide compatible interworking with TCP's that do not implement to provide compatible interworking with TCP's that do not implement
the extensions. the extensions.
These mechanisms are implemented using new TCP options for scaled These mechanisms are implemented using new TCP options for scaled
windows and timestamps. The timestamps are used for two distinct windows and timestamps. The timestamps are used for two distinct
mechanisms: RTTM (Round Trip Time Measurement) and PAWS (Protect mechanisms: RTTM (Round Trip Time Measurement) and PAWS (Protection
Against Wrapped Sequences). Against Wrapped Sequences).
The Window Scale option was originally suggested by Mike St. Johns of The Window Scale option was originally suggested by Mike St. Johns of
USAF/DCA. The present form of the option was suggested by Mike USAF/DCA. The present form of the option was suggested by Mike
Karels of UC Berkeley in response to a more cumbersome scheme defined Karels of UC Berkeley in response to a more cumbersome scheme defined
by Van Jacobson. Lixia Zhang helped formulate the PAWS mechanism by Van Jacobson. Lixia Zhang helped formulate the PAWS mechanism
description in RFC 1185. description in RFC 1185.
Finally, much of this work originated as the result of discussions Finally, much of this work originated as the result of discussions
within the End-to-End Task Force on the theoretical limitations of within the End-to-End Task Force on the theoretical limitations of
skipping to change at page 28, line 22 skipping to change at page 28, line 29
The TCP sequence space is a fixed size, and as the window becomes The TCP sequence space is a fixed size, and as the window becomes
larger it becomes easier for an attacker to generate forged packets larger it becomes easier for an attacker to generate forged packets
that can fall within the TCP window, and be accepted as valid that can fall within the TCP window, and be accepted as valid
packets. While use of Timestamps and PAWS can help to mitigate this, packets. While use of Timestamps and PAWS can help to mitigate this,
when using PAWS, if an attacker is able to forge a packet that is when using PAWS, if an attacker is able to forge a packet that is
acceptable to the TCP connection, a timestamp that is in the future acceptable to the TCP connection, a timestamp that is in the future
would cause valid packets to be dropped due to PAWS checks. Hence, would cause valid packets to be dropped due to PAWS checks. Hence,
implementors should take care to not open the TCP window drastically implementors should take care to not open the TCP window drastically
beyond the requirements of the connection. beyond the requirements of the connection.
Middle boxes and options If a middle box removes TCP options from the Middle boxes and options: If a middle box removes TCP options from
SYN, such as TSopt, a high speed connection that needs PAWS would not the SYN, such as TSopt, a high speed connection that needs PAWS would
have that protection. In this situation, an implementor could not have that protection. In this situation, an implementor could
provide a mechanism for the application to determine whether or not provide a mechanism for the application to determine whether or not
PAWS is in use on the connection, and choose to terminate the PAWS is in use on the connection, and choose to terminate the
connection if that protection doesn't exist. connection if that protection doesn't exist.
Mechanisms to protect the TCP header from modification should also Mechanisms to protect the TCP header from modification should also
protect the TCP options. protect the TCP options.
Expanding the TCP window beyond 64K for IPv6 allows Jumbograms Expanding the TCP window beyond 64K for IPv6 allows Jumbograms
[RFC2675] to be used when the local network supports packets larger [RFC2675] to be used when the local network supports packets larger
than 64K. When larger TCP packets are used, the TCP checksum becomes than 64K. When larger TCP packets are used, the TCP checksum becomes
skipping to change at page 29, line 11 skipping to change at page 29, line 17
RFC 793, September 1981. RFC 793, September 1981.
[RFC1191] Mogul, J. and S. Deering, "Path MTU discovery", RFC 1191, [RFC1191] Mogul, J. and S. Deering, "Path MTU discovery", RFC 1191,
November 1990. November 1990.
8.2. Informative References 8.2. Informative References
[Garlick77] [Garlick77]
Garlick, L., Rom, R., and J. Postel, "Issues in Reliable Garlick, L., Rom, R., and J. Postel, "Issues in Reliable
Host-to-Host Protocols", Proc. Second Berkeley Workshop on Host-to-Host Protocols", Proc. Second Berkeley Workshop on
Distributed Data Management and Computer Networks , Distributed Data Management and Computer Networks,
May 1977, <http://www.rfc-editor.org/ien/ien12.txt>. May 1977, <http://www.rfc-editor.org/ien/ien12.txt>.
[Hamming77] [Hamming77]
Hamming, R., "Digital Filters", Prentice Hall, Englewood Hamming, R., "Digital Filters", Prentice Hall, Englewood
Cliffs, N.J. ISBN 0-13-212571-4, 1977. Cliffs, N.J. ISBN 0-13-212571-4, 1977.
[Jacobson88a] [Jacobson88a]
Jacobson, V., "Congestion Avoidance and Control", SIGCOMM Jacobson, V., "Congestion Avoidance and Control", SIGCOMM
'88, Stanford, CA. , August 1988, '88, Stanford, CA., August 1988,
<http://ee.lbl.gov/papers/congavoid.pdf>. <http://ee.lbl.gov/papers/congavoid.pdf>.
[Jacobson90a] [Jacobson90a]
Jacobson, V., "4BSD Header Prediction", ACM Computer Jacobson, V., "4BSD Header Prediction", ACM Computer
Communication Review , April 1990. Communication Review, April 1990.
[Jacobson90c] [Jacobson90c]
Jacobson, V., "Modified TCP congestion avoidance Jacobson, V., "Modified TCP congestion avoidance
algorithm", Message to end2end-interest mailing list , algorithm", Message to the end2end-interest mailing list,
April 1990, April 1990,
<ftp://ftp.isi.edu/end2end/end2end-interest-1990.mail>. <ftp://ftp.isi.edu/end2end/end2end-interest-1990.mail>.
[Jain86] Jain, R., "Divergence of Timeout Algorithms for Packet [Jain86] Jain, R., "Divergence of Timeout Algorithms for Packet
Retransmissions", Proc. Fifth Phoenix Conf. on Comp. and Retransmissions", Proc. Fifth Phoenix Conf. on Comp. and
Comm., Scottsdale, Arizona , March 1986, Comm., Scottsdale, Arizona, March 1986,
<http://arxiv.org/ftp/cs/papers/9809/9809097.pdf>. <http://arxiv.org/ftp/cs/papers/9809/9809097.pdf>.
[Karn87] Karn, P. and C. Partridge, "Estimating Round-Trip Times in [Karn87] Karn, P. and C. Partridge, "Estimating Round-Trip Times in
Reliable Transport Protocols", Proc. SIGCOMM '87 , Reliable Transport Protocols", Proc. SIGCOMM '87,
August 1987. August 1987.
[Martin03] [Martin03]
Martin, D., "[Tsvwg] RFC 1323.bis", Message to the tsvwg Martin, D., "[Tsvwg] RFC 1323.bis", Message to the tsvwg
mailing list , September 2003, <http://www.ietf.org/ mailing list, September 2003, <http://www.ietf.org/
mail-archive/web/tsvwg/current/msg04435.html>. mail-archive/web/tsvwg/current/msg04435.html>.
[Mathis08] [Mathis08]
Mathis, M., "[tcpm] Example of 1323 window retraction Mathis, M., "[tcpm] Example of 1323 window retraction
problemPer my comments at the microphone at TCPM...", problem", Message to the tcpm mailing list, March 2008,
Message to the tcpm mailing list , March 2008, <http:// <http://www.ietf.org/mail-archive/web/tcpm/current/
www.ietf.org/mail-archive/web/tcpm/current/msg03564.html>. msg03564.html>.
[RFC0896] Nagle, J., "Congestion control in IP/TCP internetworks", [RFC0896] Nagle, J., "Congestion control in IP/TCP internetworks",
RFC 896, January 1984. RFC 896, January 1984.
[RFC1072] Jacobson, V. and R. Braden, "TCP extensions for long-delay [RFC1072] Jacobson, V. and R. Braden, "TCP extensions for long-delay
paths", RFC 1072, October 1988. paths", RFC 1072, October 1988.
[RFC1110] McKenzie, A., "Problem with the TCP big window option", [RFC1110] McKenzie, A., "Problem with the TCP big window option",
RFC 1110, August 1989. RFC 1110, August 1989.
[RFC1122] Braden, R., "Requirements for Internet Hosts - [RFC1122] Braden, R., "Requirements for Internet Hosts -
Communication Layers", STD 3, RFC 1122, October 1989. Communication Layers", STD 3, RFC 1122, October 1989.
[RFC1185] Jacobson, V., Braden, B., and L. Zhang, "TCP Extension for [RFC1185] Jacobson, V., Braden, B., and L. Zhang, "TCP Extension for
High-Speed Paths", RFC 1185, October 1990. High-Speed Paths", RFC 1185, October 1990.
[RFC1323] Jacobson, V., Braden, B., and D. Borman, "TCP Extensions [RFC1323] Jacobson, V., Braden, B., and D. Borman, "TCP Extensions
for High Performance", RFC 1323, May 1992. for High Performance", RFC 1323, May 1992.
[RFC1981] McCann, J., Deering, S., and J. Mogul, "Path MTU Discovery
for IP version 6", RFC 1981, August 1996.
[RFC2018] Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP [RFC2018] Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP
Selective Acknowledgment Options", RFC 2018, October 1996. Selective Acknowledgment Options", RFC 2018, October 1996.
[RFC2581] Allman, M., Paxson, V., and W. Stevens, "TCP Congestion [RFC2581] Allman, M., Paxson, V., and W. Stevens, "TCP Congestion
Control", RFC 2581, April 1999. Control", RFC 2581, April 1999.
[RFC2675] Borman, D., Deering, S., and R. Hinden, "IPv6 Jumbograms", [RFC2675] Borman, D., Deering, S., and R. Hinden, "IPv6 Jumbograms",
RFC 2675, August 1999. RFC 2675, August 1999.
[RFC2883] Floyd, S., Mahdavi, J., Mathis, M., and M. Podolsky, "An [RFC2883] Floyd, S., Mahdavi, J., Mathis, M., and M. Podolsky, "An
Extension to the Selective Acknowledgement (SACK) Option Extension to the Selective Acknowledgement (SACK) Option
for TCP", RFC 2883, July 2000. for TCP", RFC 2883, July 2000.
[RFC3517] Blanton, E., Allman, M., Fall, K., and L. Wang, "A [RFC3517] Blanton, E., Allman, M., Fall, K., and L. Wang, "A
Conservative Selective Acknowledgment (SACK)-based Loss Conservative Selective Acknowledgment (SACK)-based Loss
Recovery Algorithm for TCP", RFC 3517, April 2003. Recovery Algorithm for TCP", RFC 3517, April 2003.
[RFC4821] Mathis, M. and J. Heffner, "Packetization Layer Path MTU
Discovery", RFC 4821, March 2007.
[RFC4963] Heffner, J., Mathis, M., and B. Chandler, "IPv4 Reassembly [RFC4963] Heffner, J., Mathis, M., and B. Chandler, "IPv4 Reassembly
Errors at High Data Rates", RFC 4963, July 2007. Errors at High Data Rates", RFC 4963, July 2007.
[RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
Control", RFC 5681, September 2009. Control", RFC 5681, September 2009.
[Watson81] [Watson81]
Watson, R., "Timer-based Mechanisms in Reliable Transport Watson, R., "Timer-based Mechanisms in Reliable Transport
Protocol Connection Management", Computer Networks, Vol. Protocol Connection Management", Computer Networks, Vol.
5 , 1981. 5, 1981.
[Zhang86] Zhang, L., "Why TCP Timers Don't Work Well", Proc. SIGCOMM [Zhang86] Zhang, L., "Why TCP Timers Don't Work Well", Proc. SIGCOMM
'86, Stowe, VT , August 1986. '86, Stowe, VT, August 1986.
Appendix A. Implementation Suggestions Appendix A. Implementation Suggestions
TCP Option Layout TCP Option Layout
The following layouts are recommended for sending options on non- The following layouts are recommended for sending options on non-
SYN segments, to achieve maximum feasible alignment of 32-bit and SYN segments, to achieve maximum feasible alignment of 32-bit and
64-bit machines. 64-bit machines.
+--------+--------+--------+--------+ +--------+--------+--------+--------+
skipping to change at page 31, line 30 skipping to change at page 31, line 42
Interaction with the TCP Urgent Pointer Interaction with the TCP Urgent Pointer
The TCP Urgent pointer, like the TCP window, is a 16 bit value. The TCP Urgent pointer, like the TCP window, is a 16 bit value.
Some of the original discussion for the TCP Window Scale option Some of the original discussion for the TCP Window Scale option
included proposals to increase the Urgent pointer to 32 bits. As included proposals to increase the Urgent pointer to 32 bits. As
it turns out, this is unnecessary. There are two observations it turns out, this is unnecessary. There are two observations
that should be made: that should be made:
(1) With IP Version 4, the largest amount of TCP data that can be (1) With IP Version 4, the largest amount of TCP data that can be
sent in a single packet is 65495 bytes (64K - 1 - size of sent in a single packet is 65495 bytes (64K - 1 -- size of
fixed IP and TCP headers). fixed IP and TCP headers).
(2) Updates to the urgent pointer while the user is in "urgent (2) Updates to the urgent pointer while the user is in "urgent
mode" are invisible to the user. mode" are invisible to the user.
This means that if the Urgent Pointer points beyond the end of the This means that if the Urgent Pointer points beyond the end of the
TCP data in the current packet, then the user will remain in TCP data in the current packet, then the user will remain in
urgent mode until the next TCP packet arrives. That packet will urgent mode until the next TCP packet arrives. That packet will
update the urgent pointer to a new offset, and the user will never update the urgent pointer to a new offset, and the user will never
have left urgent mode. have left urgent mode.
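The 65495-byte figure in observation (1) follows directly from the IPv4 length field. A minimal check, assuming the fixed (option-free) IPv4 and TCP headers of 20 bytes each; the constant and function names are illustrative:

```python
IPV4_MAX_TOTAL_LENGTH = 2 ** 16 - 1   # 16-bit Total Length field: 64K - 1
IPV4_FIXED_HEADER = 20                # IPv4 header without options
TCP_FIXED_HEADER = 20                 # TCP header without options
URGENT_POINTER_MAX = 2 ** 16 - 1      # 16-bit Urgent pointer


def max_tcp_payload_ipv4() -> int:
    """Largest amount of TCP data that fits in one IPv4 packet."""
    return IPV4_MAX_TOTAL_LENGTH - IPV4_FIXED_HEADER - TCP_FIXED_HEADER


payload = max_tcp_payload_ipv4()
# A 16-bit Urgent pointer can always reach the end of a single packet's data.
fits = payload <= URGENT_POINTER_MAX
```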
skipping to change at page 32, line 47 skipping to change at page 33, line 12
than N ticks. This will guarantee monotonicity of the timestamps, than N ticks. This will guarantee monotonicity of the timestamps,
which can then be used to reject old duplicates even without an which can then be used to reject old duplicates even without an
enforced MSL. enforced MSL.
B.2. Closing and Reopening a Connection B.2. Closing and Reopening a Connection
When a TCP connection is closed, a delay of 2*MSL in TIME-WAIT state When a TCP connection is closed, a delay of 2*MSL in TIME-WAIT state
ties up the socket pair for 4 minutes (see Section 3.5 of [RFC0793]). ties up the socket pair for 4 minutes (see Section 3.5 of [RFC0793]).
Applications built upon TCP that close one connection and open a new Applications built upon TCP that close one connection and open a new
one (e.g., an FTP data transfer connection using Stream mode) must one (e.g., an FTP data transfer connection using Stream mode) must
choose a new socket pair each time. The TIME- WAIT delay serves two choose a new socket pair each time. The TIME-WAIT delay serves two
different purposes: different purposes:
(a) Implement the full-duplex reliable close handshake of TCP. (a) Implement the full-duplex reliable close handshake of TCP.
The proper time to delay the final close step is not really The proper time to delay the final close step is not really
related to the MSL; it depends instead upon the RTO for the FIN related to the MSL; it depends instead upon the RTO for the FIN
segments and therefore upon the RTT of the path. (It could be segments and therefore upon the RTT of the path. (It could be
argued that the side that is sending a FIN knows what degree of argued that the side that is sending a FIN knows what degree of
reliability it needs, and therefore it should be able to reliability it needs, and therefore it should be able to
determine the length of the TIME-WAIT delay for the FIN's determine the length of the TIME-WAIT delay for the FIN's
skipping to change at page 35, line 18 skipping to change at page 35, line 30
the Karn algorithm [Karn87] is disabled. The Karn algorithm the Karn algorithm [Karn87] is disabled. The Karn algorithm
disables all RTT measurements during retransmission, since it is disables all RTT measurements during retransmission, since it is
ambiguous whether the ACK is for the original packet, or the ambiguous whether the ACK is for the original packet, or the
retransmitted packet. With Timestamps, that ambiguity is retransmitted packet. With Timestamps, that ambiguity is
removed since the TSecr in the ACK will contain the TSval from removed since the TSecr in the ACK will contain the TSval from
whichever data packet made it to the destination. whichever data packet made it to the destination.
(b) In RFC1323, section 3.4, step (2) of the algorithm to control (b) In RFC1323, section 3.4, step (2) of the algorithm to control
which timestamp is echoed was incorrect in two regards: which timestamp is echoed was incorrect in two regards:
(1) It failed to update TSrecent for a retransmitted segment (1) It failed to update TS.recent for a retransmitted segment
that resulted from a lost ACK. that resulted from a lost ACK.
(2) It failed if SEG.LEN = 0. (2) It failed if SEG.LEN = 0.
In the new algorithm, the case of SEG.TSval = TSrecent is In the new algorithm, the case of SEG.TSval >= TS.recent is
included for consistency with the PAWS test. included for consistency with the PAWS test.
(c) One correction was made to the Event Processing Summary in (c) One correction was made to the Event Processing Summary in
Appendix F. In SEND CALL/ESTABLISHED STATE, RCV.WND is used to Appendix F. In SEND CALL/ESTABLISHED STATE, RCV.WND is used to
fill in the SEG.WND value, not SND.WND. fill in the SEG.WND value, not SND.WND.
(d) New pseudo-code summary has been added in Appendix E. (d) New pseudo-code summary has been added in Appendix E.
(e) Appendix A has been expanded with information about the TCP MSS (e) Appendix A has been expanded with information about the TCP MSS
option and the TCP Urgent Pointer. option and the TCP Urgent Pointer.
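As an illustration of item (a), the following is a minimal sketch (not text from either draft) of how an RTT sample might be derived from an echoed timestamp. The function names rtt_sample_ticks and ticks_to_ms, and the ticks-per-second rate parameter, are hypothetical. The sketch assumes a 32-bit timestamp clock, so that unsigned subtraction gives the correct elapsed tick count even when the clock has wrapped between sending the segment and receiving its ACK.

```c
#include <stdint.h>

/*
 * Hypothetical helper: elapsed timestamp-clock ticks between the
 * current clock value and the TSecr echoed in an ACK.  Unsigned
 * subtraction is performed modulo 2^32, so the result stays correct
 * across a wrap of the timestamp clock.
 */
static uint32_t rtt_sample_ticks(uint32_t now_tsclock, uint32_t seg_tsecr)
{
    return now_tsclock - seg_tsecr;
}

/*
 * Hypothetical helper: convert a tick count to milliseconds, given an
 * assumed clock rate in ticks per second.  The intermediate product
 * is widened to 64 bits to avoid overflow.
 */
static uint32_t ticks_to_ms(uint32_t ticks, uint32_t ticks_per_sec)
{
    return (uint32_t)(((uint64_t)ticks * 1000u) / ticks_per_sec);
}
```

Because the TSecr identifies which transmission the ACK answers, a sample computed this way can be taken for a retransmitted segment as well, which is the ambiguity the Karn algorithm otherwise avoids by suppressing measurements.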
skipping to change at page 38, line 38 skipping to change at page 38, line 50
if (SEG.TSval < TS.Recent && Idle less than 24 days) then { if (SEG.TSval < TS.Recent && Idle less than 24 days) then {
if (Send.TS.OK AND (NOT RST) ) then { if (Send.TS.OK AND (NOT RST) ) then {
/* Timestamp too old => /* Timestamp too old =>
* segment is unacceptable. * segment is unacceptable.
*/ */
Send ACK segment; Send ACK segment;
Discard segment and return; Discard segment and return;
} }
} }
else { else {
if (SEG.SEQ =< Last.ACK.sent) then if (SEG.SEQ <= Last.ACK.sent) then
TS.Recent = SEG.TSval; TS.Recent = SEG.TSval;
} }
} }
if (SEG.ACK > SND.UNA) then { if (SEG.ACK > SND.UNA) then {
/* (At least part of) first segment in /* (At least part of) first segment in
* retransmission queue has been ACKd * retransmission queue has been ACKd
*/ */
if (Segment contains TSopt) then if (Segment contains TSopt) then
Update_SRTT( Update_SRTT(
(Snd.TSclock - SEG.TSecr)/my.TSclock.rate); (Snd.TSclock - SEG.TSecr)/my.TSclock.rate);
else else
skipping to change at page 43, line 13 skipping to change at page 43, line 25
steps. steps.
Check whether the segment contains a Timestamps option and Check whether the segment contains a Timestamps option and
bit Snd.TS.OK is on. If so: bit Snd.TS.OK is on. If so:
If SEG.TSval < TS.Recent and the RST bit is off, then If SEG.TSval < TS.Recent and the RST bit is off, then
test whether connection has been idle less than 24 days; test whether connection has been idle less than 24 days;
if all are true, then the segment is not acceptable; if all are true, then the segment is not acceptable;
follow steps below for an unacceptable segment. follow steps below for an unacceptable segment.
If SEG.SEQ is equal to Last.ACK.sent, then save SEG.TSval If SEG.SEQ is less than or equal to Last.ACK.sent, then
in variable TS.Recent. save SEG.TSval in variable TS.Recent.
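The timestamp check and the TS.Recent update described above can be sketched in C as follows. This is an illustrative sketch only, following the branch structure of the pseudo-code summary; the names ts_state, ts_lt, seq_leq and paws_check are hypothetical, the idle-time test is passed in as a precomputed flag, and the comparisons assume 32-bit modular arithmetic on a two's-complement machine.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if timestamp a is older than b in modulo 2^32 arithmetic. */
static bool ts_lt(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}

/* True if sequence number a is at or before b, modulo 2^32. */
static bool seq_leq(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) <= 0;
}

/* Hypothetical per-connection state used by the sketch. */
struct ts_state {
    uint32_t ts_recent;     /* TS.Recent */
    uint32_t last_ack_sent; /* Last.ACK.sent */
};

/*
 * Returns false if the segment is unacceptable under the timestamp
 * test (the caller would send an ACK and discard it); otherwise
 * returns true.  TS.Recent is updated only when the timestamp is not
 * old (or the connection has been idle 24 days or more) and
 * SEG.SEQ <= Last.ACK.sent.
 */
static bool paws_check(struct ts_state *st, uint32_t seg_tsval,
                       uint32_t seg_seq, bool rst, bool idle_lt_24_days)
{
    if (ts_lt(seg_tsval, st->ts_recent) && idle_lt_24_days) {
        if (!rst)
            return false;   /* timestamp too old */
        return true;        /* RST is exempt; TS.Recent unchanged */
    }
    if (seq_leq(seg_seq, st->last_ack_sent))
        st->ts_recent = seg_tsval;
    return true;
}
```

The "less than or equal" comparison on SEG.SEQ is the draft-03 wording; with the earlier "equal" wording, a retransmitted segment that starts before Last.ACK.sent would not refresh TS.Recent.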
There are four cases for the acceptability test for an There are four cases for the acceptability test for an
incoming segment: incoming segment:
... ...
If an incoming segment is not acceptable, an acknowledgment If an incoming segment is not acceptable, an acknowledgment
should be sent in reply (unless the RST bit is set, if so should be sent in reply (unless the RST bit is set, if so
drop the segment and return): drop the segment and return):
 End of changes. 52 change blocks. 
84 lines changed or deleted 93 lines changed or added
