draft-ietf-tcpm-1323bis-09.txt   draft-ietf-tcpm-1323bis-10.txt 
TCP Maintenance (TCPM) D. Borman TCP Maintenance (TCPM) D. Borman
Internet-Draft Quantum Corporation Internet-Draft Quantum Corporation
Intended status: Standards Track B. Braden Intended status: Standards Track B. Braden
Expires: October 14, 2013 University of Southern Expires: October 18, 2013 University of Southern
California California
V. Jacobson V. Jacobson
Packet Design Packet Design
R. Scheffenegger, Ed. R. Scheffenegger, Ed.
NetApp, Inc. NetApp, Inc.
April 12, 2013 April 16, 2013
TCP Extensions for High Performance TCP Extensions for High Performance
draft-ietf-tcpm-1323bis-09 draft-ietf-tcpm-1323bis-10
Abstract Abstract
This document specifies a set of TCP extensions to improve This document specifies a set of TCP extensions to improve
performance over paths with a large bandwidth * delay product and to performance over paths with a large bandwidth * delay product and to
provide reliable operation over very high-speed paths. It defines provide reliable operation over very high-speed paths. It defines
TCP options for scaled windows and timestamps. The timestamps are TCP options for scaled windows and timestamps. The timestamps are
used for two distinct mechanisms, RTTM (Round Trip Time Measurement) used for two distinct mechanisms, RTTM (Round Trip Time Measurement)
and PAWS (Protection Against Wrapped Sequences). and PAWS (Protection Against Wrapped Sequences).
skipping to change at page 1, line 43 skipping to change at page 1, line 43
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet- working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/. Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
This Internet-Draft will expire on October 14, 2013. This Internet-Draft will expire on October 18, 2013.
Copyright Notice Copyright Notice
Copyright (c) 2013 IETF Trust and the persons identified as the Copyright (c) 2013 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of (http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents publication of this document. Please review these documents
skipping to change at page 4, line 41 skipping to change at page 4, line 41
large. A network having such paths is referred to as "long, fat large. A network having such paths is referred to as "long, fat
network" (LFN). network" (LFN).
There are three fundamental performance problems with basic TCP over There are three fundamental performance problems with basic TCP over
LFN paths: LFN paths:
(1) Window Size Limit (1) Window Size Limit
The TCP header uses a 16 bit field to report the receive window The TCP header uses a 16 bit field to report the receive window
size to the sender. Therefore, the largest window that can be size to the sender. Therefore, the largest window that can be
used is 2^16 = 65K bytes. used is 2^16 = 64 KiB.
To circumvent this problem, Section 2 of this memo defines a TCP To circumvent this problem, Section 2 of this memo defines a TCP
option, "Window Scale", to allow windows larger than 2^16. This option, "Window Scale", to allow windows larger than 2^16. This
option defines an implicit scale factor, which is used to option defines an implicit scale factor, which is used to
multiply the window size value found in a TCP header to obtain multiply the window size value found in a TCP header to obtain
the true window size. the true window size.
(2) Recovery from Losses (2) Recovery from Losses
Packet losses in an LFN can have a catastrophic effect on Packet losses in an LFN can have a catastrophic effect on
skipping to change at page 7, line 11 skipping to change at page 7, line 11
12 bytes to the 20-byte TCP header. We recognize there is a trade- 12 bytes to the 20-byte TCP header. We recognize there is a trade-
off between the bandwidth saved by reducing unnecessary off between the bandwidth saved by reducing unnecessary
retransmission timeouts, and the extra header bandwidth used by this retransmission timeouts, and the extra header bandwidth used by this
option. It is required that this TCP option will be sent on non- option. It is required that this TCP option will be sent on non-
<SYN> segments only after an exchange of options on the <SYN> <SYN> segments only after an exchange of options on the <SYN>
segments has indicated that both sides understand this extension. segments has indicated that both sides understand this extension.
Appendix A contains a recommended layout of the options in TCP Appendix A contains a recommended layout of the options in TCP
headers to achieve reasonable data field alignment. headers to achieve reasonable data field alignment.
Finally, we observe that most of the mechanisms defined in this memo Finally, we observe that most of the mechanisms defined in this
are important for LFN's and/or very high-speed networks. For low- document are important for LFN's and/or very high-speed networks.
speed networks, it might be a performance optimization to NOT use For low-speed networks, it might be a performance optimization to NOT
these mechanisms. A TCP vendor concerned about optimal performance use these mechanisms. A TCP vendor concerned about optimal
over low-speed paths might consider turning these extensions off for performance over low-speed paths might consider turning these
low-speed paths, or allow a user or installation manager to disable extensions off for low-speed paths, or allow a user or installation
them. manager to disable them.
1.4. Terminology 1.4. Terminology
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC2119]. document are to be interpreted as described in [RFC2119].
In this document, these words will appear with that interpretation In this document, these words will appear with that interpretation
only when in UPPER CASE. Lower case uses of these words are not to only when in UPPER CASE. Lower case uses of these words are not to
be interpreted as carrying [RFC2119] significance. be interpreted as carrying [RFC2119] significance.
skipping to change at page 10, line 21 skipping to change at page 10, line 17
TCP determines if a data segment is "old" or "new" by testing whether TCP determines if a data segment is "old" or "new" by testing whether
its sequence number is within 2^31 bytes of the left edge of the its sequence number is within 2^31 bytes of the left edge of the
window, and if it is not, discarding the data as "old". To ensure window, and if it is not, discarding the data as "old". To ensure
that new data is never mistakenly considered old and vice versa, the that new data is never mistakenly considered old and vice versa, the
left edge of the sender's window has to be at most 2^31 away from the left edge of the sender's window has to be at most 2^31 away from the
right edge of the receiver's window. Similarly with the sender's right edge of the receiver's window. Similarly with the sender's
right edge and receiver's left edge. Since the right and left edges right edge and receiver's left edge. Since the right and left edges
of either the sender's or receiver's window differ by the window of either the sender's or receiver's window differ by the window
size, and since the sender and receiver windows can be out of phase size, and since the sender and receiver windows can be out of phase
by at most the window size, the above constraints imply that two by at most the window size, the above constraints imply that two
times the max window size must be less than 2^31, or times the maximum window size must be less than 2^31, or
max window < 2^30 max window < 2^30
Since the max window is 2^S (where S is the scaling shift count) Since the max window is 2^S (where S is the scaling shift count)
times at most 2^16 - 1 (the maximum unscaled window), the maximum times at most 2^16 - 1 (the maximum unscaled window), the maximum
window is guaranteed to be < 2^30 if S <= 14. Thus, the shift count window is guaranteed to be < 2^30 if S <= 14. Thus, the shift count
MUST be limited to 14 (which allows windows of 2^30 = 1 Gbyte). If a MUST be limited to 14 (which allows windows of 2^30 = 1 GiB). If a
Window Scale option is received with a shift.cnt value exceeding 14, Window Scale option is received with a shift.cnt value exceeding 14,
the TCP SHOULD log the error but use 14 instead of the specified the TCP SHOULD log the error but use 14 instead of the specified
value. value.
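The shift-count limit and the scaling itself can be sketched in C as follows. This is only an illustration under stated assumptions, not text from either draft revision: the names TCP_MAX_WSCALE, clamp_shift_cnt and scaled_window are hypothetical, and the logging that the text asks for is left out.

```c
#include <stdint.h>

/* Hypothetical name for the shift count limit of 14 stated in the text. */
#define TCP_MAX_WSCALE 14u

/* Use 14 in place of any received shift.cnt that exceeds 14.
 * (The text also asks for the error to be logged; that is omitted here.) */
uint8_t clamp_shift_cnt(uint8_t shift_cnt)
{
    return shift_cnt > TCP_MAX_WSCALE ? (uint8_t)TCP_MAX_WSCALE : shift_cnt;
}

/* True window = 16-bit Window field from the header, shifted left by the
 * (clamped) shift count and kept locally as a 32-bit number. */
uint32_t scaled_window(uint16_t seg_wnd, uint8_t shift_cnt)
{
    return (uint32_t)seg_wnd << clamp_shift_cnt(shift_cnt);
}
```

With the largest unscaled window (2^16 - 1) and S = 14 the result is 1073725440, which stays below 2^30 as required.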
The scale factor applies only to the Window field as transmitted in The scale factor applies only to the Window field as transmitted in
the TCP header; each TCP using extended windows will maintain the the TCP header; each TCP using extended windows will maintain the
window values locally as 32-bit numbers. For example, the window values locally as 32-bit numbers. For example, the
"congestion window" computed by Slow Start and Congestion Avoidance "congestion window" computed by Slow Start and Congestion Avoidance
is not affected by the scale factor, so window scaling will not is not affected by the scale factor, so window scaling will not
introduce quantization into the congestion window. introduce quantization into the congestion window.
skipping to change at page 15, line 30 skipping to change at page 15, line 30
(etc.) (etc.)
The dotted line marks a pause (60 time units long) in which A had The dotted line marks a pause (60 time units long) in which A had
nothing to send. Note that this pause inflates the RTT which B could nothing to send. Note that this pause inflates the RTT which B could
infer from receiving TSecr=131 in data segment C. Thus, in one-way infer from receiving TSecr=131 in data segment C. Thus, in one-way
data flows, RTTM in the reverse direction measures a value that is data flows, RTTM in the reverse direction measures a value that is
inflated by gaps in sending data. However, the following rule inflated by gaps in sending data. However, the following rule
prevents a resulting inflation of the measured RTT: prevents a resulting inflation of the measured RTT:
RTTM Rule: A TSecr value received in a segment MAY be used to RTTM Rule: A TSecr value received in a segment MAY be used to update
update the averaged RTT measurement only if the segment advances the averaged RTT measurement only if the segment advances
the left edge of the send window (e.g. SND.UNA is increased). the left edge of the send window, i.e. SND.UNA is
increased.
Since TCP B is not sending data, the data segment C does not Since TCP B is not sending data, the data segment C does not
acknowledge any new data when it arrives at B. Thus, the inflated acknowledge any new data when it arrives at B. Thus, the inflated
RTTM measurement is not used to update B's RTTM measurement. RTTM measurement is not used to update B's RTTM measurement.
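A minimal C sketch of the RTTM Rule follows, assuming the sender keeps SND.UNA and reads its timestamp clock as 32-bit values. The structure rttm_state and the functions seq_gt and rttm_on_ack are hypothetical names introduced for illustration. The sketch only records the raw sample; feeding it into an RTO estimator and validating the ACK against SND.NXT are left out.

```c
#include <stdint.h>
#include <stdbool.h>

/* Modulo-2^32 comparison: true if sequence number a is after b.
 * Relies on the usual two's-complement conversion to int32_t. */
bool seq_gt(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) > 0;
}

/* Hypothetical per-connection state for this sketch. */
struct rttm_state {
    uint32_t snd_una;          /* SND.UNA: left edge of the send window */
    uint32_t last_rtt_sample;  /* most recent sample, in timestamp ticks */
    bool     has_sample;
};

/* Apply the RTTM Rule to an arriving segment. TSecr is used only if the
 * segment advances SND.UNA. Returns true when a sample was taken. */
bool rttm_on_ack(struct rttm_state *s, uint32_t seg_ack,
                 uint32_t seg_tsecr, uint32_t now)
{
    if (!seq_gt(seg_ack, s->snd_una))
        return false;                      /* no new data acked: ignore TSecr */

    s->snd_una = seg_ack;
    s->last_rtt_sample = now - seg_tsecr;  /* modulo-2^32 difference */
    s->has_sample = true;
    return true;
}
```

In the scenario above, data segment C does not advance SND.UNA at B, so rttm_on_ack would return false and the inflated value would be ignored.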
Implementers should note that with timestamps multiple RTTMs can be Implementers should note that with timestamps multiple RTTMs can be
taken per RTT. Many RTO estimators have a weighting factor based on taken per RTT. Many RTO estimators have a weighting factor based on
an implicit assumption that at most one RTTM will be sampled per RTT. an implicit assumption that at most one RTTM will be sampled per RTT.
When using multiple RTTMs per RTT to update the RTO estimator, the When using multiple RTTMs per RTT to update the RTO estimator, the
weighting factor needs to be decreased to take into account the more weighting factor needs to be decreased to take into account the more
skipping to change at page 21, line 31 skipping to change at page 21, line 31
one segment will show up out-of-sequence to be queued at the receiver one segment will show up out-of-sequence to be queued at the receiver
(e.g., up to ~2^30 bytes of data); the timestamp option must not (e.g., up to ~2^30 bytes of data); the timestamp option must not
result in discarding this data. result in discarding this data.
In certain unlikely circumstances, the algorithm of rules R1-R5 could In certain unlikely circumstances, the algorithm of rules R1-R5 could
lead to discarding some segments unnecessarily, as shown in the lead to discarding some segments unnecessarily, as shown in the
following example: following example:
Suppose again that segments: A.1, B.1, C.1, ..., Z.1 have been Suppose again that segments: A.1, B.1, C.1, ..., Z.1 have been
sent in sequence and that segment B.1 has been lost. Furthermore, sent in sequence and that segment B.1 has been lost. Furthermore,
suppose delivery of some of C.1, ... Z.1 is delayed until AFTER suppose delivery of some of C.1, ... Z.1 is delayed until *after*
the retransmission B.2 arrives at the receiver. These delayed the retransmission B.2 arrives at the receiver. These delayed
segments will be discarded unnecessarily when they do arrive, segments will be discarded unnecessarily when they do arrive,
since their timestamps are now out of date. since their timestamps are now out of date.
This case is very unlikely to occur. If the retransmission was This case is very unlikely to occur. If the retransmission was
triggered by a timeout, some of the segments C.1, ... Z.1 must have triggered by a timeout, some of the segments C.1, ... Z.1 must have
been delayed longer than the RTO time. This is presumably an been delayed longer than the RTO time. This is presumably an
unlikely event, or there would be many spurious timeouts and unlikely event, or there would be many spurious timeouts and
retransmissions. If B's retransmission was triggered by the "fast retransmissions. If B's retransmission was triggered by the "fast
retransmit" algorithm, i.e., by duplicate <ACK>s, then the queued retransmit" algorithm, i.e., by duplicate <ACK>s, then the queued
skipping to change at page 24, line 39 skipping to change at page 24, line 39
H2) Do header prediction: if segment is next in sequence and if H2) Do header prediction: if segment is next in sequence and if
there are no special conditions requiring additional processing, there are no special conditions requiring additional processing,
accept the segment, record its timestamp, and skip H3. accept the segment, record its timestamp, and skip H3.
H3) Process the segment normally, as specified in RFC 793. This H3) Process the segment normally, as specified in RFC 793. This
includes dropping segments that are outside the window and includes dropping segments that are outside the window and
possibly sending acknowledgments, and queuing in-window, out-of- possibly sending acknowledgments, and queuing in-window, out-of-
sequence segments. sequence segments.
Another possibility would be to interchange steps H1 and H2, i.e., to Another possibility would be to interchange steps H1 and H2, i.e., to
perform the header prediction step H2 first, and perform H1 and H3 perform the header prediction step H2 *first*, and perform H1 and H3
only when header prediction fails. This could be a performance only when header prediction fails. This could be a performance
improvement, since the timestamp check in step H1 is very unlikely to improvement, since the timestamp check in step H1 is very unlikely to
fail, and it requires unsigned modulo arithmetic. To perform this fail, and it requires unsigned modulo arithmetic. To perform this
check on every single segment is contrary to the philosophy of header check on every single segment is contrary to the philosophy of header
prediction. We believe that this change might produce a measurable prediction. We believe that this change might produce a measurable
reduction in CPU time for TCP protocol processing on high-speed reduction in CPU time for TCP protocol processing on high-speed
networks. networks.
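The "unsigned modulo arithmetic" mentioned above can be illustrated with a short C sketch. It assumes that the timestamp check in step H1 compares the TSval of the arriving segment against TS.Recent; the function names ts_lt and h1_timestamp_check_fails are hypothetical, and other conditions of the check (for example, whether TS.Recent is still valid) are not modelled.

```c
#include <stdint.h>
#include <stdbool.h>

/* True if timestamp a is older than timestamp b in modulo-2^32 arithmetic.
 * The unsigned difference is reinterpreted as a signed 32-bit value, so the
 * comparison stays correct when the timestamp clock wraps. */
bool ts_lt(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}

/* Assumed shape of the H1 test: the segment fails the check when its
 * TSval is older than TS.Recent. */
bool h1_timestamp_check_fails(uint32_t seg_tsval, uint32_t ts_recent)
{
    return ts_lt(seg_tsval, ts_recent);
}
```

The comparison costs one subtraction and one sign test per segment, which is the per-segment work that performing H2 first would avoid on the fast path.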
However, putting H2 first would create a hazard: a segment from 2^32 However, putting H2 first would create a hazard: a segment from 2^32
bytes in the past might arrive at exactly the wrong time and be bytes in the past might arrive at exactly the wrong time and be
skipping to change at page 27, line 31 skipping to change at page 27, line 31
A naive implementation that derives the timestamp clock value A naive implementation that derives the timestamp clock value
directly from a system uptime clock may unintentionally leak this directly from a system uptime clock may unintentionally leak this
information to an attacker. This does not directly compromise any of information to an attacker. This does not directly compromise any of
the mechanisms described in this document. However, this may be the mechanisms described in this document. However, this may be
valuable information to a potential attacker. An implementer should valuable information to a potential attacker. An implementer should
evaluate the potential impact and mitigate this accordingly (i.e. by evaluate the potential impact and mitigate this accordingly (i.e. by
using a random offset for the timestamp clock on each connection, or using a random offset for the timestamp clock on each connection, or
using an external, real-time derived timestamp clock source). using an external, real-time derived timestamp clock source).
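One of the mitigations named above, a per-connection random offset, might look like the following C sketch. The type ts_clock and both functions are hypothetical. The random value is assumed to come from a suitable random source supplied by the caller, and the uptime is assumed to be already expressed in timestamp clock ticks.

```c
#include <stdint.h>

/* Hypothetical per-connection timestamp clock state. */
struct ts_clock {
    uint32_t offset;  /* random value fixed when the connection is created */
};

/* random32 is assumed to come from a good random source chosen by the
 * caller; this sketch deliberately does not pick one. */
void ts_clock_init(struct ts_clock *c, uint32_t random32)
{
    c->offset = random32;
}

/* TSval to send: uptime plus the per-connection offset, modulo 2^32.
 * Differences between two values on one connection are unchanged, so
 * RTTM still works, but the absolute uptime is not exposed directly. */
uint32_t ts_clock_now(const struct ts_clock *c, uint32_t uptime_ticks)
{
    return uptime_ticks + c->offset;
}
```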
Expanding the TCP window beyond 64K for IPv6 allows Jumbograms Expanding the TCP window beyond 64 KiB for IPv6 allows Jumbograms
[RFC2675] to be used when the local network supports packets larger [RFC2675] to be used when the local network supports packets larger
than 64K. When larger TCP segments are used, the TCP checksum becomes than 64 KiB. When larger TCP segments are used, the TCP checksum
weaker. becomes weaker.
7. IANA Considerations 7. IANA Considerations
This document has no actions for IANA. This document has no actions for IANA.
8. References 8. References
8.1. Normative References 8.1. Normative References
[RFC0793] Postel, J., "Transmission Control Protocol", STD 7, [RFC0793] Postel, J., "Transmission Control Protocol", STD 7,
skipping to change at page 30, line 46 skipping to change at page 30, line 46
Interaction with the TCP Urgent Pointer Interaction with the TCP Urgent Pointer
The TCP Urgent pointer, like the TCP window, is a 16 bit value. The TCP Urgent pointer, like the TCP window, is a 16 bit value.
Some of the original discussion for the TCP Window Scale option Some of the original discussion for the TCP Window Scale option
included proposals to increase the Urgent pointer to 32 bits. As included proposals to increase the Urgent pointer to 32 bits. As
it turns out, this is unnecessary. There are two observations it turns out, this is unnecessary. There are two observations
that should be made: that should be made:
(1) With IP Version 4, the largest amount of TCP data that can be (1) With IP Version 4, the largest amount of TCP data that can be
sent in a single packet is 65495 bytes (64K - 1 -- size of sent in a single packet is 65495 bytes (64 KiB - 1 -- size of
fixed IP and TCP headers). fixed IP and TCP headers).
(2) Updates to the urgent pointer while the user is in "urgent (2) Updates to the urgent pointer while the user is in "urgent
mode" are invisible to the user. mode" are invisible to the user.
This means that if the Urgent Pointer points beyond the end of the This means that if the Urgent Pointer points beyond the end of the
TCP data in the current segment, then the user will remain in TCP data in the current segment, then the user will remain in
urgent mode until the next TCP segment arrives. That segment will urgent mode until the next TCP segment arrives. That segment will
update the urgent pointer to a new offset, and the user will never update the urgent pointer to a new offset, and the user will never
have left urgent mode. have left urgent mode.
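The figure in observation (1) can be checked with a few constants. This C sketch assumes the fixed IPv4 header is 20 bytes, like the 20-byte TCP header; all names are hypothetical.

```c
#include <stdint.h>

enum {
    IPV4_MAX_TOTAL_LEN = 65535,  /* 64 KiB - 1: largest IPv4 packet */
    IPV4_FIXED_HDR     = 20,     /* fixed IPv4 header, no options */
    TCP_FIXED_HDR      = 20,     /* fixed TCP header, no options */
    IPV4_MAX_TCP_DATA  = IPV4_MAX_TOTAL_LEN - IPV4_FIXED_HDR - TCP_FIXED_HDR
};

/* Returns 1 if a 16-bit Urgent Pointer can reach any offset within the
 * largest possible IPv4 TCP segment, 0 otherwise. */
int urgent_pointer_covers_ipv4_segment(void)
{
    return IPV4_MAX_TCP_DATA <= UINT16_MAX;
}
```

IPV4_MAX_TCP_DATA evaluates to 65495, matching the value quoted in observation (1).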
skipping to change at page 36, line 25 skipping to change at page 36, line 25
the variable TS.Recent and turn on the Snd.TS.OK bit. the variable TS.Recent and turn on the Snd.TS.OK bit.
Set RCV.NXT to SEG.SEQ+1, IRS is set to SEG.SEQ and any Set RCV.NXT to SEG.SEQ+1, IRS is set to SEG.SEQ and any
other control or text should be queued for processing later. other control or text should be queued for processing later.
ISS should be selected and a <SYN> segment sent of the form: ISS should be selected and a <SYN> segment sent of the form:
<SEQ=ISS><ACK=RCV.NXT><CTL=SYN,ACK> <SEQ=ISS><ACK=RCV.NXT><CTL=SYN,ACK>
If the Snd.WS.OK bit is on, include a WSopt option If the Snd.WS.OK bit is on, include a WSopt option
<WSopt=Rcv.Wind.Scale> in this segment. If the Snd.TS.OK <WSopt=Rcv.Wind.Scale> in this segment. If the Snd.TS.OK
bit is on, include a TSopt bit is on, include a TSopt <TSval=Snd.TSclock,
<TSval=Snd.TSclock,TSecr=TS.Recent> in this segment. TSecr=TS.Recent> in this segment. Last.ACK.sent is set to
Last.ACK.sent is set to RCV.NXT. RCV.NXT.
SND.NXT is set to ISS+1 and SND.UNA to ISS. The connection SND.NXT is set to ISS+1 and SND.UNA to ISS. The connection
state should be changed to SYN-RECEIVED. Note that any state should be changed to SYN-RECEIVED. Note that any
other incoming control or data (combined with SYN) will be other incoming control or data (combined with SYN) will be
processed in the SYN-RECEIVED state, but processing of SYN processed in the SYN-RECEIVED state, but processing of SYN
and ACK should not be repeated. If the listen was not fully and ACK should not be repeated. If the listen was not fully
specified (i.e., the foreign socket was not fully specified (i.e., the foreign socket was not fully
specified), then the unspecified fields should be filled in specified), then the unspecified fields should be filled in
now. now.
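The LISTEN-state processing of an incoming <SYN> described above can be sketched in C. This is an illustration under assumptions rather than the draft's own pseudocode: the structures tcb, seg_in and synack_out and the function listen_on_syn are hypothetical, the ISS and the Snd.TSclock value are passed in by the caller, and Snd.WS.OK is assumed to have been set earlier when a WSopt was seen.

```c
#include <stdint.h>
#include <stdbool.h>

enum tcp_state { TCP_LISTEN, TCP_SYN_RECEIVED };

/* Hypothetical connection record holding the variables named in the text. */
struct tcb {
    enum tcp_state state;
    uint32_t iss, irs, snd_una, snd_nxt, rcv_nxt;
    uint32_t ts_recent, last_ack_sent;
    uint8_t  rcv_wind_scale;
    bool     snd_ws_ok, snd_ts_ok;
};

/* Hypothetical view of the arriving <SYN>. */
struct seg_in {
    uint32_t seq;
    bool     has_tsopt;
    uint32_t tsval;
};

/* Hypothetical description of the <SYN,ACK> to be sent. */
struct synack_out {
    uint32_t seq, ack;
    bool     has_wsopt;
    uint8_t  wsopt;
    bool     has_tsopt;
    uint32_t tsval, tsecr;
};

struct synack_out listen_on_syn(struct tcb *t, const struct seg_in *seg,
                                uint32_t iss, uint32_t snd_tsclock)
{
    struct synack_out out = {0};

    if (seg->has_tsopt) {            /* save SEG.TSval, turn on Snd.TS.OK */
        t->ts_recent = seg->tsval;
        t->snd_ts_ok = true;
    }
    t->rcv_nxt = seg->seq + 1;       /* RCV.NXT = SEG.SEQ+1 */
    t->irs = seg->seq;               /* IRS = SEG.SEQ */
    t->iss = iss;

    out.seq = t->iss;                /* <SEQ=ISS><ACK=RCV.NXT><CTL=SYN,ACK> */
    out.ack = t->rcv_nxt;
    if (t->snd_ws_ok) {              /* <WSopt=Rcv.Wind.Scale> */
        out.has_wsopt = true;
        out.wsopt = t->rcv_wind_scale;
    }
    if (t->snd_ts_ok) {              /* <TSval=Snd.TSclock,TSecr=TS.Recent> */
        out.has_tsopt = true;
        out.tsval = snd_tsclock;
        out.tsecr = t->ts_recent;
    }
    t->last_ack_sent = t->rcv_nxt;   /* Last.ACK.sent = RCV.NXT */
    t->snd_nxt = t->iss + 1;         /* SND.NXT = ISS+1 */
    t->snd_una = t->iss;             /* SND.UNA = ISS */
    t->state = TCP_SYN_RECEIVED;
    return out;
}
```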
skipping to change at page 43, line 22 skipping to change at page 43, line 22
Section 1.3 to specifically address TS and WS options. Section 1.3 to specifically address TS and WS options.
(c) Section 1.4 was added for RFC2119 wording. Normative text was (c) Section 1.4 was added for RFC2119 wording. Normative text was
updated with the appropriate phrases. updated with the appropriate phrases.
(d) Added < > brackets to mark specific types of segments, and (d) Added < > brackets to mark specific types of segments, and
replaced most occurrences of "packet" with "segment", where TCP replaced most occurrences of "packet" with "segment", where TCP
segments are referred to. segments are referred to.
(e) Removed the list of changes between RFC 1323 and prior versions. (e) Removed the list of changes between RFC 1323 and prior versions.
These changes are mentioned in appendix C of RFC 1323. These changes are mentioned in Appendix C of RFC 1323.
(f) Moved Appendix "Changes" to the end of the appendices for easier (f) Moved Appendix "Changes" to the end of the appendices for easier
lookup. In addition, the entries were split into a technical lookup. In addition, the entries were split into a technical
and an editorial part, and sorted to roughly correspond with the and an editorial part, and sorted to roughly correspond with the
sections in the text where they apply. sections in the text where they apply.
Authors' Addresses Authors' Addresses
David Borman David Borman
Quantum Corporation Quantum Corporation
 End of changes. 16 change blocks. 
27 lines changed or deleted 28 lines changed or added

This html diff was produced by rfcdiff 1.41. The latest version is available from http://tools.ietf.org/tools/rfcdiff/