draft-ietf-tcpm-1323bis-05.txt   draft-ietf-tcpm-1323bis-06.txt 
TCP Maintenance (TCPM) D. Borman TCP Maintenance (TCPM) D. Borman
Internet-Draft Quantum Corporation Internet-Draft Quantum Corporation
Intended status: Standards Track B. Braden Intended status: Standards Track B. Braden
Expires: August 17, 2013 University of Southern Expires: August 29, 2013 University of Southern
California California
V. Jacobson V. Jacobson
Packet Design Packet Design
R. Scheffenegger, Ed. R. Scheffenegger, Ed.
NetApp, Inc. NetApp, Inc.
February 13, 2013 February 25, 2013
TCP Extensions for High Performance TCP Extensions for High Performance
draft-ietf-tcpm-1323bis-05 draft-ietf-tcpm-1323bis-06
Abstract Abstract
This document specifies a set of TCP extensions to improve This document specifies a set of TCP extensions to improve
performance over paths with a large bandwidth*delay product and to performance over paths with a large bandwidth*delay product and to
provide reliable operation over very high-speed paths. It defines provide reliable operation over very high-speed paths. It defines
TCP options for scaled windows and timestamps. The timestamps are TCP options for scaled windows and timestamps. The timestamps are
used for two distinct mechanisms, RTTM (Round Trip Time Measurement) used for two distinct mechanisms, RTTM (Round Trip Time Measurement)
and PAWS (Protection Against Wrapped Sequences). and PAWS (Protection Against Wrapped Sequences).
skipping to change at page 1, line 43 skipping to change at page 1, line 43
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet- working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/. Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
This Internet-Draft will expire on August 17, 2013. This Internet-Draft will expire on August 29, 2013.
Copyright Notice Copyright Notice
Copyright (c) 2013 IETF Trust and the persons identified as the Copyright (c) 2013 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of (http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents publication of this document. Please review these documents
skipping to change at page 3, line 39 skipping to change at page 3, line 39
5.2.5. IP Fragmentation . . . . . . . . . . . . . . . . . . . 24 5.2.5. IP Fragmentation . . . . . . . . . . . . . . . . . . . 24
5.3. Duplicates from Earlier Incarnations of Connection . . . . 24 5.3. Duplicates from Earlier Incarnations of Connection . . . . 24
6. Conclusions and Acknowledgements . . . . . . . . . . . . . . . 24 6. Conclusions and Acknowledgements . . . . . . . . . . . . . . . 24
7. Security Considerations . . . . . . . . . . . . . . . . . . . 25 7. Security Considerations . . . . . . . . . . . . . . . . . . . 25
8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 26 8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 26
9. References . . . . . . . . . . . . . . . . . . . . . . . . . . 26 9. References . . . . . . . . . . . . . . . . . . . . . . . . . . 26
9.1. Normative References . . . . . . . . . . . . . . . . . . . 26 9.1. Normative References . . . . . . . . . . . . . . . . . . . 26
9.2. Informative References . . . . . . . . . . . . . . . . . . 26 9.2. Informative References . . . . . . . . . . . . . . . . . . 26
Appendix A. Implementation Suggestions . . . . . . . . . . . . . 28 Appendix A. Implementation Suggestions . . . . . . . . . . . . . 28
Appendix B. Duplicates from Earlier Connection Incarnations . . . 29 Appendix B. Duplicates from Earlier Connection Incarnations . . . 29
B.1. System Crash with Loss of State . . . . . . . . . . . . . 29 B.1. System Crash with Loss of State . . . . . . . . . . . . . 30
B.2. Closing and Reopening a Connection . . . . . . . . . . . . 30 B.2. Closing and Reopening a Connection . . . . . . . . . . . . 30
Appendix C. Summary of Notation . . . . . . . . . . . . . . . . . 31 Appendix C. Summary of Notation . . . . . . . . . . . . . . . . . 31
Appendix D. Pseudo-code Summary . . . . . . . . . . . . . . . . . 32 Appendix D. Pseudo-code Summary . . . . . . . . . . . . . . . . . 32
Appendix E. Event Processing Summary . . . . . . . . . . . . . . 34 Appendix E. Event Processing Summary . . . . . . . . . . . . . . 34
Appendix F. Timestamps Edge Cases . . . . . . . . . . . . . . . . 39 Appendix F. Timestamps Edge Cases . . . . . . . . . . . . . . . . 40
Appendix G. Changes from RFC 1072, RFC 1185, and RFC 1323 . . . . 40 Appendix G. Changes from RFC 1072, RFC 1185, and RFC 1323 . . . . 40
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 42 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 43
1. Introduction 1. Introduction
The TCP protocol [RFC0793] was designed to operate reliably over The TCP protocol [RFC0793] was designed to operate reliably over
almost any transmission medium regardless of transmission rate, almost any transmission medium regardless of transmission rate,
delay, corruption, duplication, or reordering of segments. Over the delay, corruption, duplication, or reordering of segments. Over the
years, advances in networking technology have resulted in ever-higher years, advances in networking technology have resulted in ever-higher
transmission speeds, and the fastest paths are well beyond the domain transmission speeds, and the fastest paths are well beyond the domain
for which TCP was originally engineered. for which TCP was originally engineered.
skipping to change at page 4, line 28 skipping to change at page 4, line 28
For brevity, the full discussions of the merits and history behind For brevity, the full discussions of the merits and history behind
the TCP options defined within this document have been omitted. the TCP options defined within this document have been omitted.
[RFC1323] should be consulted for reference. A modern TCP [RFC1323] should be consulted for reference. A modern TCP
implementation SHOULD implement and make use of the extensions implementation SHOULD implement and make use of the extensions
described in this document. described in this document.
1.1. TCP Performance 1.1. TCP Performance
TCP performance problems arise when the bandwidth*delay product is TCP performance problems arise when the bandwidth*delay product is
large. large. A network having such paths is referred to as a "long, fat
network" (LFN).
There are three fundamental performance problems with the current TCP There are three fundamental performance problems with the current TCP
over LFN paths: over LFN paths:
(1) Window Size Limit (1) Window Size Limit
The TCP header uses a 16 bit field to report the receive window The TCP header uses a 16 bit field to report the receive window
size to the sender. Therefore, the largest window that can be size to the sender. Therefore, the largest window that can be
used is 2^16 = 65K bytes. used is 2^16 = 65K bytes.
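The effect of this limit can be sketched numerically. The example below is illustrative only and assumes the usual bound that at most one window of data can be in flight per round trip; the function names and the sample path (1 Gbit/s, 100 ms RTT) are hypothetical and do not come from the draft.

```python
# Illustrative sketch: throughput ceiling imposed by an unscaled 16-bit window.
# Names and sample values are assumptions made for this example.

MAX_UNSCALED_WINDOW = 2**16 - 1  # largest value a 16-bit field can carry, in bytes


def max_throughput_bps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound in bits/s: one full window delivered per round trip."""
    return window_bytes * 8 / rtt_seconds


def bandwidth_delay_product_bytes(bandwidth_bps: float, rtt_seconds: float) -> float:
    """Bytes that must be in flight to keep a path of this bandwidth full."""
    return bandwidth_bps * rtt_seconds / 8


if __name__ == "__main__":
    rtt = 0.1  # hypothetical 100 ms round-trip time
    print(max_throughput_bps(MAX_UNSCALED_WINDOW, rtt))
    print(bandwidth_delay_product_bytes(1e9, rtt))
```

With these assumed values, the unscaled window caps throughput at roughly 5.2 Mbit/s, while filling the 1 Gbit/s path would need about 12.5 MB in flight.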
skipping to change at page 13, line 9 skipping to change at page 13, line 9
and an ACK of the current SND.UNA generated. and an ACK of the current SND.UNA generated.
In the case of crossing SYN packets where one SYN contains a TSopt In the case of crossing SYN packets where one SYN contains a TSopt
and the other doesn't, both sides SHOULD put a TSopt in the <SYN,ACK> and the other doesn't, both sides SHOULD put a TSopt in the <SYN,ACK>
segment. segment.
4.3. The RTTM Mechanism 4.3. The RTTM Mechanism
RTTM places a Timestamps option in every segment, with a TSval that RTTM places a Timestamps option in every segment, with a TSval that
is obtained from a (virtual) "timestamp clock". Values of this clock is obtained from a (virtual) "timestamp clock". Values of this clock
values MUST be at least approximately proportional to real time, in MUST be at least approximately proportional to real time, in order to
order to measure actual RTT. measure actual RTT.
These TSval values are echoed in TSecr values in the reverse These TSval values are echoed in TSecr values in the reverse
direction. The difference between a received TSecr value and the direction. The difference between a received TSecr value and the
current timestamp clock value provides a RTT measurement. current timestamp clock value provides a RTT measurement.
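The subtraction described above can be sketched as follows. This is a minimal illustration, not the draft's algorithm: it assumes 32-bit timestamp values that wrap, and the function names and the tick_seconds parameter are hypothetical. It only shows the arithmetic; which received TSecr values may be used to update the RTT estimate is a separate question.

```python
# Minimal sketch of an RTT sample from an echoed timestamp.
# Assumption: TSval/TSecr are 32-bit counters, so subtraction is modulo 2**32.

TS_MODULUS = 2**32


def rtt_ticks(current_clock: int, tsecr: int) -> int:
    """Timestamp-clock ticks elapsed between sending TSval and receiving its echo."""
    return (current_clock - tsecr) % TS_MODULUS


def rtt_seconds(current_clock: int, tsecr: int, tick_seconds: float) -> float:
    """Convert the tick difference to seconds; tick_seconds is the clock period."""
    return rtt_ticks(current_clock, tsecr) * tick_seconds
```

The modulo keeps the result correct when the timestamp clock wraps between the time the segment was sent and the time its echo arrives.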
When timestamps are used, every segment that is received will contain When timestamps are used, every segment that is received will contain
a TSecr value. However, these values cannot all be used to update a TSecr value. However, these values cannot all be used to update
the measured RTT. The following example illustrates why. It shows a the measured RTT. The following example illustrates why. It shows a
one-way data flow with segments arriving in sequence without loss. one-way data flow with segments arriving in sequence without loss.
Here A, B, C... represent data blocks occupying successive blocks of Here A, B, C... represent data blocks occupying successive blocks of
skipping to change at page 15, line 24 skipping to change at page 15, line 24
underestimate the RTT. An ACK for an out-of-order segment underestimate the RTT. An ACK for an out-of-order segment
SHOULD therefore contain the timestamp from the most recent SHOULD therefore contain the timestamp from the most recent
segment that advanced the window. segment that advanced the window.
The same situation occurs if segments are re-ordered by the The same situation occurs if segments are re-ordered by the
network. network.
(C) A filled hole in the sequence space. (C) A filled hole in the sequence space.
The segment that fills the hole represents the most recent The segment that fills the hole represents the most recent
measurement of the network characteristics. On the other hand, measurement of the network characteristics. A RTT computed from
an RTT computed from an earlier segment would probably include an earlier segment would probably include the sender's
the sender's retransmit time-out, badly biasing the sender's retransmit time-out, badly biasing the sender's average RTT
average RTT estimate. Thus, the timestamp from the latest estimate. Thus, the timestamp from the latest segment (which
segment (which filled the hole) MUST be echoed. filled the hole) MUST be echoed.
An algorithm that covers all three cases is described in the An algorithm that covers all three cases is described in the
following rules for Timestamps option processing on a synchronized following rules for Timestamps option processing on a synchronized
connection: connection:
(1) The connection state is augmented with two 32-bit slots: (1) The connection state is augmented with two 32-bit slots:
TS.Recent holds a timestamp to be echoed in TSecr whenever a TS.Recent holds a timestamp to be echoed in TSecr whenever a
segment is sent, and Last.ACK.sent holds the ACK field from the segment is sent, and Last.ACK.sent holds the ACK field from the
last segment sent. Last.ACK.sent will equal RCV.NXT except when last segment sent. Last.ACK.sent will equal RCV.NXT except when
skipping to change at page 25, line 47 skipping to change at page 25, line 47
Middle boxes and options: If a middle box removes TCP options from Middle boxes and options: If a middle box removes TCP options from
the SYN, such as TSopt, a high speed connection that needs PAWS would the SYN, such as TSopt, a high speed connection that needs PAWS would
not have that protection. In this situation, an implementer could not have that protection. In this situation, an implementer could
provide a mechanism for the application to determine whether or not provide a mechanism for the application to determine whether or not
PAWS is in use on the connection, and choose to terminate the PAWS is in use on the connection, and choose to terminate the
connection if that protection doesn't exist. connection if that protection doesn't exist.
Mechanisms to protect the TCP header from modification should also Mechanisms to protect the TCP header from modification should also
protect the TCP options. protect the TCP options.
A naive implementation that derives the timestamp clock value
directly from a system uptime clock may unintentionally leak this
information to an attacker. This does not directly compromise any of
the mechanisms described in this document. However, this may be
valuable information to a potential attacker. An implementer should
evaluate the potential impact and mitigate this accordingly (e.g., by
using a random offset for the timestamp clock on each connection, or
using an external, real-time derived timestamp clock source).
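One way to picture the random-offset mitigation is sketched below. It follows the Snd.TSoffset and Snd.TSclock variables named in the change summary (Snd.TSclock is the sum of my.TSclock and Snd.TSoffset). Drawing the offset with Python's secrets module and reducing modulo 2**32 are assumptions made for this example, and the function names are hypothetical.

```python
# Sketch of a per-connection timestamp offset, so that TSval does not
# directly expose a system uptime clock.
# Assumptions: 32-bit timestamp values; offset chosen once per connection.
import secrets

TS_MODULUS = 2**32


def new_ts_offset() -> int:
    """Pick a random 32-bit offset when the connection is created."""
    return secrets.randbits(32)


def snd_tsclock(my_tsclock: int, snd_tsoffset: int) -> int:
    """Value to place in TSval: local clock plus the connection's offset."""
    return (my_tsclock + snd_tsoffset) % TS_MODULUS
```

Because the offset is constant for the lifetime of a connection, differences between timestamps on that connection are unchanged, so RTT measurement is unaffected.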
Expanding the TCP window beyond 64K for IPv6 allows Jumbograms Expanding the TCP window beyond 64K for IPv6 allows Jumbograms
[RFC2675] to be used when the local network supports packets larger [RFC2675] to be used when the local network supports packets larger
than 64K. When larger TCP packets are used, the TCP checksum becomes than 64K. When larger TCP packets are used, the TCP checksum becomes
weaker. weaker.
8. IANA Considerations 8. IANA Considerations
This document has no actions for IANA. This document has no actions for IANA.
9. References 9. References
skipping to change at page 28, line 21 skipping to change at page 28, line 32
Errors at High Data Rates", RFC 4963, July 2007. Errors at High Data Rates", RFC 4963, July 2007.
[RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
Control", RFC 5681, September 2009. Control", RFC 5681, September 2009.
[RFC6675] Blanton, E., Allman, M., Wang, L., Jarvinen, I., Kojo, M., [RFC6675] Blanton, E., Allman, M., Wang, L., Jarvinen, I., Kojo, M.,
and Y. Nishida, "A Conservative Loss Recovery Algorithm and Y. Nishida, "A Conservative Loss Recovery Algorithm
Based on Selective Acknowledgment (SACK) for TCP", Based on Selective Acknowledgment (SACK) for TCP",
RFC 6675, August 2012. RFC 6675, August 2012.
[RFC6691] Borman, D., "TCP Options and Maximum Segment Size (MSS)",
RFC 6691, July 2012.
[Watson81] [Watson81]
Watson, R., "Timer-based Mechanisms in Reliable Transport Watson, R., "Timer-based Mechanisms in Reliable Transport
Protocol Connection Management", Computer Networks, Vol. Protocol Connection Management", Computer Networks, Vol.
5, 1981. 5, 1981.
[Zhang86] Zhang, L., "Why TCP Timers Don't Work Well", Proc. SIGCOMM [Zhang86] Zhang, L., "Why TCP Timers Don't Work Well", Proc. SIGCOMM
'86, Stowe, VT, August 1986. '86, Stowe, VT, August 1986.
Appendix A. Implementation Suggestions Appendix A. Implementation Suggestions
skipping to change at page 29, line 42 skipping to change at page 30, line 11
losing connection state) and restarting, and (2) the same connection losing connection state) and restarting, and (2) the same connection
being closed and reopened without a loss of host state. These will being closed and reopened without a loss of host state. These will
be described in the following two sections. be described in the following two sections.
B.1. System Crash with Loss of State B.1. System Crash with Loss of State
TCP's quiet time of one MSL upon system startup handles the loss of TCP's quiet time of one MSL upon system startup handles the loss of
connection state in a system crash/restart. For an explanation, see connection state in a system crash/restart. For an explanation, see
for example "When to Keep Quiet" in the TCP protocol specification for example "When to Keep Quiet" in the TCP protocol specification
[RFC0793]. The MSL that is required here does not depend upon the [RFC0793]. The MSL that is required here does not depend upon the
transfer speed. The current TCP MSL of 2 minutes seems acceptable as transfer speed. The current TCP MSL of 2 minutes seemed acceptable
an operational compromise, as many host systems take this long to as an operational compromise, when many host systems used to take
boot after a crash. this long to boot after a crash. Current host systems can boot
considerably faster.
However, the timestamp option may be used to ease the MSL The timestamp option may be used to ease the MSL requirements (or to
requirements (or to provide additional security against data provide additional security against data corruption). If timestamps
corruption). If timestamps are being used and if the timestamp clock are being used and if the timestamp clock can be guaranteed to be
can be guaranteed to be monotonic over a system crash/restart, i.e., monotonic over a system crash/restart, i.e., if the first value of
if the first value of the sender's timestamp clock after a crash/ the sender's timestamp clock after a crash/restart can be guaranteed
restart can be guaranteed to be greater than the last value before to be greater than the last value before the restart, then a quiet
the restart, then a quiet time will be unnecessary. time is unnecessary.
To dispense totally with the quiet time would require that the host To dispense totally with the quiet time would require that the host
clock be synchronized to a time source that is stable over the crash/ clock be synchronized to a time source that is stable over the crash/
restart period, with an accuracy of one timestamp clock tick or restart period, with an accuracy of one timestamp clock tick or
better. We can back off from this strict requirement to take better. We can back off from this strict requirement to take
advantage of approximate clock synchronization. Suppose that the advantage of approximate clock synchronization. Suppose that the
clock is always re-synchronized to within N timestamp clock ticks and clock is always re-synchronized to within N timestamp clock ticks and
that booting (extended with a quiet time, if necessary) takes more that booting (extended with a quiet time, if necessary) takes more
than N ticks. This will guarantee monotonicity of the timestamps, than N ticks. This will guarantee monotonicity of the timestamps,
which can then be used to reject old duplicates even without an which can then be used to reject old duplicates even without an
skipping to change at page 42, line 11 skipping to change at page 42, line 19
In the new algorithm, the case of SEG.TSval >= TS.recent is In the new algorithm, the case of SEG.TSval >= TS.recent is
included for consistency with the PAWS test. included for consistency with the PAWS test.
(c) One correction was made to the Event Processing Summary in (c) One correction was made to the Event Processing Summary in
Appendix E. In SEND CALL/ESTABLISHED STATE, RCV.WND is used to Appendix E. In SEND CALL/ESTABLISHED STATE, RCV.WND is used to
fill in the SEG.WND value, not SND.WND. fill in the SEG.WND value, not SND.WND.
(d) New pseudo-code summary has been added in Appendix D. (d) New pseudo-code summary has been added in Appendix D.
(e) Appendix A has been expanded with information about the TCP MSS (e) Appendix A has been expanded with information about the TCP
option and the TCP Urgent Pointer. Urgent Pointer. An earlier revision contained text around the
TCP MSS option, which was split off into [RFC6691].
(f) It is now recommended that Timestamps options be included in RST (f) It is now recommended that Timestamps options be included in RST
packets if the incoming packet contained a Timestamps option. packets if the incoming packet contained a Timestamps option.
(g) RST packets are explicitly excluded from PAWS processing. (g) RST packets are explicitly excluded from PAWS processing.
(h) Snd.TSoffset and Snd.TSclock variables have been added. (h) Snd.TSoffset and Snd.TSclock variables have been added.
Snd.TSclock is the sum of my.TSclock and Snd.TSoffset. This Snd.TSclock is the sum of my.TSclock and Snd.TSoffset. This
allows the starting points for timestamps to be randomized on a allows the starting points for timestamps to be randomized on a
per-connection basis. Setting Snd.TSoffset to zero yields the per-connection basis. Setting Snd.TSoffset to zero yields the
 End of changes. 15 change blocks. 
27 lines changed or deleted 42 lines changed or added
