draft-ietf-tcpm-1323bis-00.txt   draft-ietf-tcpm-1323bis-01.txt 
Network Working Group Network Working Group
Internet-Draft D. Borman Internet-Draft D. Borman
Obsoletes: 1323 Wind River Systems Obsoletes: 1323 Wind River Systems
File: draft-ietf-tcpm-1323bis-00.txt R. Braden Intended Status: Standards Track R. Braden
ISI File: draft-ietf-tcpm-1323bis-01.txt ISI
V. Jacobson V. Jacobson
Packet Design Packet Design
January 29, 2008 March 4, 2009
TCP Extensions for High Performance TCP Extensions for High Performance
Status of This Memo Status of This Memo
By submitting this Internet-Draft, each author represents that This Internet-Draft is submitted to IETF in full conformance with the
any applicable patent or other IPR claims of which he or she is provisions of BCP 78 and BCP 79.
aware have been or will be disclosed, and any of which he or she
becomes aware will be disclosed, in accordance with Section 6 of This document may contain material from IETF Documents or IETF
BCP 79. Contributions published or made publicly available before November
10, 2008. The person(s) controlling the copyright in some of this
material may not have granted the IETF Trust the right to allow
modifications of such material outside the IETF Standards Process.
Without obtaining an adequate license from the person(s) controlling
the copyright in such materials, this document may not be modified
outside the IETF Standards Process, and derivative works of it may
not be created outside the IETF Standards Process, except to format
it for publication as an RFC or to translate it into languages other
than English.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet- other groups may also distribute working documents as Internet-
Drafts. Drafts.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at The list of current Internet-Drafts can be accessed at
http://www.ietf.org/1id-abstracts.html http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html http://www.ietf.org/shadow.html.
This Internet-Draftw will expire on July 29, 2008. This Internet-Draft will expire on September 4, 2009.
Copyright Copyright
Copyright (C) The IETF Trust (2008). Copyright (c) 2009 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents in effect on the date of
publication of this document (http://trustee.ietf.org/license-info).
Please review these documents carefully, as they describe your rights
and restrictions with respect to this document.
Abstract Abstract
This memo presents a set of TCP extensions to improve performance This memo presents a set of TCP extensions to improve performance
over large bandwidth*delay product paths and to provide reliable over large bandwidth*delay product paths and to provide reliable
operation over very high-speed paths. It defines new TCP options for operation over very high-speed paths. It defines TCP options for
scaled windows and timestamps, which are designed to provide scaled windows and timestamps, which are designed to provide
compatible interworking with TCP's that do not implement the compatible interworking with TCP's that do not implement the
extensions. The timestamps are used for two distinct mechanisms: extensions. The timestamps are used for two distinct mechanisms:
RTTM (Round Trip Time Measurement) and PAWS (Protect Against Wrapped RTTM (Round Trip Time Measurement) and PAWS (Protection Against
Sequences). Selective acknowledgments are not included in this memo. Wrapped Sequences). Selective acknowledgments are not included in
this memo.
This memo updates and obsoletes RFC-1323 [Jacobson92d]. This memo updates and obsoletes RFC 1323.
TABLE OF CONTENTS TABLE OF CONTENTS
1. Introduction 2 1. Introduction 2
2. TCP Window Scale Option 8 2. TCP Window Scale Option 9
3. RTTM -- Round-Trip Time Measurement 11 3. RTTM -- Round-Trip Time Measurement 12
4. PAWS -- Protect Against Wrapped Sequence Numbers 17 4. PAWS -- Protection Against Wrapped Sequence Numbers 18
5. Conclusions and Acknowledgments 25 5. Conclusions and Acknowledgments 26
6. Security Considerations 26 6. Security Considerations 27
7. References 26 7. IANA Considerations 27
APPENDIX A: Implementation Suggestions 29 8. References 27
APPENDIX B: Duplicates from Earlier Connection Incarnations 30 APPENDIX A: Implementation Suggestions 30
APPENDIX C: Changes from RFC-1072, RFC-1185, RFC-1323 33 APPENDIX B: Duplicates from Earlier Connection Incarnations 31
APPENDIX D: Summary of Notation 35 APPENDIX C: Changes from RFC 1072, RFC 1185, RFC 1323 34
APPENDIX E: Pseudo-code Summary 36 APPENDIX D: Summary of Notation 36
APPENDIX F: Event Processing 38 APPENDIX E: Pseudo-code Summary 37
APPENDIX G: Timestamps Edge Cases 44 APPENDIX F: Event Processing 40
Authors' Addresses 44 APPENDIX G: Timestamps Edge Cases 46
Authors' Addresses 47
1. INTRODUCTION 1. INTRODUCTION
The TCP protocol [Postel81] was designed to operate reliably over The TCP protocol [Postel81] was designed to operate reliably over
almost any transmission medium regardless of transmission rate, almost any transmission medium regardless of transmission rate,
delay, corruption, duplication, or reordering of segments. delay, corruption, duplication, or reordering of segments.
Production TCP implementations currently adapt to transfer rates in Production TCP implementations currently adapt to transfer rates in
the range of 100 bps to 10**10 bps and round-trip delays in the range the range of 100 bps to 10**10 bps and round-trip delays in the range
1 ms to 100 seconds. Work on TCP performance has shown that TCP can 1 ms to 100 seconds. Work on TCP performance has shown that TCP
work well over a variety of Internet paths, ranging from 800 Mbit/sec without the extensions described in this memo can work well over a
I/O channels to 300 bit/sec dial-up modems [Jacobson88a]. variety of Internet paths, ranging from 800 Mbit/sec I/O channels to
300 bit/sec dial-up modems [Jacobson88a].
Over the years, advances in networking technology have resulted in Over the years, advances in networking technology have resulted in
ever-higher transmission speeds, and the fastest paths are well ever-higher transmission speeds, and the fastest paths are well
beyond the domain for which TCP was originally engineered. This memo beyond the domain for which TCP was originally engineered. This memo
defines a set of modest extensions to TCP to extend the domain of its defines a set of modest extensions to TCP to extend the domain of its
application to match this increasing network capability. It is an application to match this increasing network capability. It is an
update to and obsoletes RFC-1323 [Jacobson92d], which in turn is update to and obsoletes RFC 1323 [Jacobson92d], which in turn is
based upon and obsoletes RFC-1072 [Jacobson88b] and RFC-1185 based upon and obsoletes RFC 1072 [Jacobson88b] and RFC 1185
[Jacobson90b]. [Jacobson90b].
There is no one-line answer to the question: "How fast can TCP go?". There is no one-line answer to the question: "How fast can TCP go?".
There are two separate kinds of issues, performance and reliability, There are two separate kinds of issues, performance and reliability,
and each depends upon different parameters. We discuss each in turn. and each depends upon different parameters. We discuss each in turn.
1.1 TCP Performance 1.1 TCP Performance
TCP performance depends not upon the transfer rate itself, but TCP performance depends not upon the transfer rate itself, but
rather upon the product of the transfer rate and the round-trip rather upon the product of the transfer rate and the round-trip
delay. This "bandwidth*delay product" measures the amount of data delay. This "bandwidth*delay product" measures the amount of data
that would "fill the pipe"; it is the buffer space required at that would "fill the pipe"; it is the buffer space required at
sender and receiver to obtain maximum throughput on the TCP sender and receiver to obtain maximum throughput on the TCP
skipping to change at page 3, line 22 skipping to change at page 3, line 40
connection over the path, i.e., the amount of unacknowledged data connection over the path, i.e., the amount of unacknowledged data
that TCP must handle in order to keep the pipeline full. TCP that TCP must handle in order to keep the pipeline full. TCP
performance problems arise when the bandwidth*delay product is performance problems arise when the bandwidth*delay product is
large. We refer to an Internet path operating in this region as a large. We refer to an Internet path operating in this region as a
"long, fat pipe", and a network containing this path as an "LFN" "long, fat pipe", and a network containing this path as an "LFN"
(pronounced "elephan(t)"). (pronounced "elephan(t)").
High-capacity packet satellite channels (e.g., DARPA's Wideband High-capacity packet satellite channels are LFN's. For example, a
Net) are LFN's. For example, a DS1-speed satellite channel has a DS1-speed satellite channel has a bandwidth*delay product of 10**6
bandwidth*delay product of 10**6 bits or more; this corresponds to bits or more; this corresponds to 100 outstanding TCP segments of
100 outstanding TCP segments of 1200 bytes each. Terrestrial 1200 bytes each. Terrestrial fiber-optical paths will also fall
fiber-optical paths will also fall into the LFN class; for into the LFN class; for example, a cross-country delay of 30 ms at
example, a cross-country delay of 30 ms at a DS3 bandwidth a DS3 bandwidth (45Mbps) also exceeds 10**6 bits.
(45Mbps) also exceeds 10**6 bits.
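The arithmetic behind these examples can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical and do not come from either draft, and the 1200-byte segment size is the one used in the text.

```python
def bdp_bits(bandwidth_bps: int, delay_ms: int) -> int:
    """Bandwidth*delay product in bits: the data needed to 'fill the pipe'."""
    return bandwidth_bps * delay_ms // 1000


def outstanding_segments(bdp: int, segment_bytes: int = 1200) -> float:
    """How many segments of the given size a bandwidth*delay product holds."""
    return bdp / (segment_bytes * 8)


# DS3 example from the text: 45 Mbps with a 30 ms cross-country delay.
ds3_bdp = bdp_bits(45_000_000, 30)

# A product of 10**6 bits expressed as 1200-byte segments (roughly 100).
segments = outstanding_segments(10**6)
```

With these inputs the DS3 product comes out above 10**6 bits, and 10**6 bits corresponds to a little over 100 segments of 1200 bytes, in line with the figures quoted above.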
There are three fundamental performance problems with the current There are three fundamental performance problems with the current
TCP over LFN paths: TCP over LFN paths:
(1) Window Size Limit (1) Window Size Limit
The TCP header uses a 16 bit field to report the receive The TCP header uses a 16 bit field to report the receive
window size to the sender. Therefore, the largest window window size to the sender. Therefore, the largest window
that can be used is 2**16 = 65K bytes. that can be used is 2**16 = 65K bytes.
skipping to change at page 4, line 11 skipping to change at page 4, line 29
[Jacobson90c] [Allman99] were introduced, and their combined [Jacobson90c] [Allman99] were introduced, and their combined
effect was to recover from one packet loss per window, effect was to recover from one packet loss per window,
without draining the pipeline. However, more than one packet without draining the pipeline. However, more than one packet
loss per window typically resulted in a retransmission loss per window typically resulted in a retransmission
timeout and the resulting pipeline drain and slow start. timeout and the resulting pipeline drain and slow start.
Expanding the window size to match the capacity of an LFN Expanding the window size to match the capacity of an LFN
results in a corresponding increase of the probability of results in a corresponding increase of the probability of
more than one packet per window being dropped. This could more than one packet per window being dropped. This could
have a devastating effect upon the throughput of TCP over an have a devastating effect upon the throughput of TCP over an
LFN. In addition, if a congestion control mechanism based LFN. In addition, since the publication of RFC 1323,
upon some form of random dropping were introduced into congestion control mechanisms based upon some form of random
gateways, randomly spaced packet drops would become common, dropping have been introduced into gateways, and randomly
possible increasing the probability of dropping more than one spaced packet drops have become common; this increases the
packet per window. probability of dropping more than one packet per window.
To generalize the Fast Retransmit/Fast Recovery mechanism to To generalize the Fast Retransmit/Fast Recovery mechanism to
handle multiple packets dropped per window, selective handle multiple packets dropped per window, selective
acknowledgments are required. Unlike the normal cumulative acknowledgments are required. Unlike the normal cumulative
acknowledgments of TCP, selective acknowledgments give the acknowledgments of TCP, selective acknowledgments give the
sender a complete picture of which segments are queued at the sender a complete picture of which segments are queued at the
receiver and which have not yet arrived. receiver and which have not yet arrived.
Since the publication of RFC-1323, selective acknowledgments Since the publication of RFC 1323, selective acknowledgments
have become important in the LFN regime. RFC-1072 defined a have become important in the LFN regime. RFC 1072 defined a
new TCP "SACK" option to send a selective acknowledgment, but new TCP "SACK" option to send a selective acknowledgment, but
at the time that RFC-1323 was published, important technical at the time that RFC 1323 was published, important technical
issues still had to be worked out concerning both the format issues still had to be worked out concerning both the format
and semantics of the SACK option, so it was split off from and semantics of the SACK option, so it was split off from
RFC-1323. SACK has now been published as a separate RFC 1323. SACK has now been published as a separate
document, RFC-2018 [Mathis96]. Additional information about document, RFC 2018 [Mathis96]. Additional information about
SACK can be found in RFC-2883, "An Extension to the Selective SACK can be found in RFC 2883, "An Extension to the Selective
Acknowledgement (SACK) option for TCP" [Floyd00] and Acknowledgement (SACK) option for TCP" [Floyd00] and RFC
RFC-3517, "A Conservative Selective Acknowledgment 3517, "A Conservative Selective Acknowledgment (SACK)-based
(SACK)-based Loss Recovery Algorithm for TCP" [Blanton03]. Loss Recovery Algorithm for TCP" [Blanton03].
(3) Round-Trip Measurement (3) Round-Trip Measurement
TCP implements reliable data delivery by retransmitting TCP implements reliable data delivery by retransmitting
segments that are not acknowledged within some retransmission segments that are not acknowledged within some retransmission
timeout (RTO) interval. Accurate dynamic determination of an timeout (RTO) interval. Accurate dynamic determination of an
appropriate RTO is essential to TCP performance. RTO is appropriate RTO is essential to TCP performance. RTO is
determined by estimating the mean and variance of the determined by estimating the mean and variance of the
measured round-trip time (RTT), i.e., the time interval measured round-trip time (RTT), i.e., the time interval
between sending a segment and receiving an acknowledgment for between sending a segment and receiving an acknowledgment for
skipping to change at page 6, line 22 skipping to change at page 7, line 9
2**31 / B > MSL (secs) [1] 2**31 / B > MSL (secs) [1]
The following table shows the value for Twrap = 2**31/B in The following table shows the value for Twrap = 2**31/B in
seconds, for some important values of the bandwidth B: seconds, for some important values of the bandwidth B:
Network B*8 B Twrap Network B*8 B Twrap
bits/sec bytes/sec secs bits/sec bytes/sec secs
_______ _______ ______ ______ _______ _______ ______ ______
ARPANET 56kbps 7KBps 3*10**5 (~3.6 days) Dialup 56kbps 7KBps 3*10**5 (~3.6 days)
DS1 1.5Mbps 190KBps 10**4 (~3 hours) DS1 1.5Mbps 190KBps 10**4 (~3 hours)
10mbit
Ethernet 10Mbps 1.25MBps 1700 (~30 mins) Ethernet 10Mbps 1.25MBps 1700 (~30 mins)
DS3 45Mbps 5.6MBps 380 DS3 45Mbps 5.6MBps 380
FDDI 100Mbps 12.5MBps 170 100mbit
Ethernet 100Mbps 12.5MBps 170
Gigabit 1Gbps 125MBps 17 Gigabit
Ethernet 1Gbps 125MBps 17
10GigE 10Gbps 1.25GBps 1.7 10GigE 10Gbps 1.25GBps 1.7
It is clear that wrap-around of the sequence space is not a It is clear that wrap-around of the sequence space is not a
problem for 56kbps packet switching or even 10Mbps Ethernets. On problem for 56kbps packet switching or even 10Mbps Ethernets. On
the other hand, at DS3 and FDDI speeds, Twrap is comparable to the the other hand, at DS3 and 100mbit speeds, Twrap is comparable to
2 minute MSL assumed by the TCP specification [Postel81]. Moving the 2 minute MSL assumed by the TCP specification [Postel81].
towards and beyond gigabit speeds, Twrap becomes too small for Moving towards and beyond gigabit speeds, Twrap becomes too small
reliable enforcement by the Internet TTL mechanism. for reliable enforcement by the Internet TTL mechanism.
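The Twrap column can be reproduced with a short Python sketch of Twrap = 2**31/B and constraint [1]. The names twrap_seconds, satisfies_constraint and MSL_SECONDS are illustrative assumptions; the rates are the byte rates listed in the table, and the 120-second MSL is the 2 minute value mentioned in the text.

```python
MSL_SECONDS = 120  # the 2 minute MSL assumed by the TCP specification


def twrap_seconds(bytes_per_sec: float) -> float:
    """Time to cycle through 2**31 bytes of sequence space at rate B."""
    return 2**31 / bytes_per_sec


def satisfies_constraint(bytes_per_sec: float, msl: float = MSL_SECONDS) -> bool:
    """Constraint [1]: 2**31 / B > MSL."""
    return twrap_seconds(bytes_per_sec) > msl


# Byte rates taken from the B column of the table.
rates = {
    "56kbps": 7e3,
    "DS1": 190e3,
    "10Mbps Ethernet": 1.25e6,
    "DS3": 5.6e6,
    "100Mbps": 12.5e6,
    "Gigabit": 125e6,
    "10GigE": 1.25e9,
}

for name, rate in rates.items():
    print(f"{name:16s} Twrap = {twrap_seconds(rate):10.1f} s  "
          f"constraint [1] holds: {satisfies_constraint(rate)}")
```

Under these assumptions the constraint still holds at DS3 and 100Mbps speeds, though with little margin, and fails at gigabit speeds and above.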
The 16-bit window field of TCP limits the effective bandwidth B to The 16-bit window field of TCP limits the effective bandwidth B to
2**16/RTT, where RTT is the round-trip time in seconds 2**16/RTT, where RTT is the round-trip time in seconds
[McKenzie89]. If the RTT is large enough, this limits B to a [McKenzie89]. If the RTT is large enough, this limits B to a
value that meets the constraint [1] for a large MSL value. For value that meets the constraint [1] for a large MSL value. For
example, consider a transcontinental backbone with an RTT of 60ms example, consider a transcontinental backbone with an RTT of 60ms
(set by the laws of physics). With the bandwidth*delay product (set by the laws of physics). With the bandwidth*delay product
limited to 64KB by the TCP window size, B is then limited to limited to 64KB by the TCP window size, B is then limited to
1.1MBps, no matter how high the theoretical transfer rate of the 1.1MBps, no matter how high the theoretical transfer rate of the
path. This corresponds to cycling the sequence number space in path. This corresponds to cycling the sequence number space in
skipping to change at page 7, line 47 skipping to change at page 8, line 36
1.3 Using TCP options 1.3 Using TCP options
The extensions defined in this memo all use new TCP options. We The extensions defined in this memo all use new TCP options. We
must address two possible issues concerning the use of TCP must address two possible issues concerning the use of TCP
options: (1) compatibility and (2) overhead. options: (1) compatibility and (2) overhead.
We must pay careful attention to compatibility, i.e., to We must pay careful attention to compatibility, i.e., to
interoperation with existing implementations. The only TCP option interoperation with existing implementations. The only TCP option
defined previously, MSS, may appear only on a SYN segment. Every defined previously, MSS, may appear only on a SYN segment. Every
implementation should (and we expect that most will) ignore implementation should (and we expect that most will) ignore
unknown options on SYN segments. When RFC-1323 was published, unknown options on SYN segments. When RFC 1323 was published,
there was concern that some buggy TCP implementation might be there was concern that some buggy TCP implementation might be
crashed by the first appearance of an option on a non-SYN segment. crashed by the first appearance of an option on a non-SYN segment.
However, bugs like that can lead to DOS attacks against a TCP, so However, bugs like that can lead to DOS attacks against a TCP, so
it is now expected that most TCP implementations will properly it is now expected that most TCP implementations will properly
handle unknown options on non-SYN segments. But it is still handle unknown options on non-SYN segments. But it is still
prudent to be conservative in what you send, and avoiding buggy prudent to be conservative in what you send, and avoiding buggy
TCP implementation is not the only reason for negotiating TCP TCP implementation is not the only reason for negotiating TCP
options on SYN segments. Therefore, for each of the extensions options on SYN segments. Therefore, for each of the extensions
defined below, TCP options will be sent on non-SYN segments only defined below, TCP options will be sent on non-SYN segments only
after an exchange of options on the SYN segments has indicated after an exchange of options on the SYN segments has indicated
skipping to change at page 8, line 50 skipping to change at page 9, line 39
extensions off for low-speed paths, or allow a user or extensions off for low-speed paths, or allow a user or
installation manager to disable them. installation manager to disable them.
2. TCP WINDOW SCALE OPTION 2. TCP WINDOW SCALE OPTION
2.1 Introduction 2.1 Introduction
The window scale extension expands the definition of the TCP The window scale extension expands the definition of the TCP
window to 32 bits and then uses a scale factor to carry this window to 32 bits and then uses a scale factor to carry this
32-bit value in the 16-bit Window field of the TCP header (SEG.WND 32-bit value in the 16-bit Window field of the TCP header (SEG.WND
in RFC-793). The scale factor is carried in a new TCP option, in RFC 793). The scale factor is carried in a new TCP option,
Window Scale. This option is sent only in a SYN segment (a Window Scale. This option is sent only in a SYN segment (a
segment with the SYN bit on), hence the window scale is fixed in segment with the SYN bit on), hence the window scale is fixed in
each direction when a connection is opened. (Another design each direction when a connection is opened. (Another design
choice would be to specify the window scale in every TCP segment. choice would be to specify the window scale in every TCP segment.
It would be incorrect to send a window scale option only when the It would be incorrect to send a window scale option only when the
scale factor changed, since a TCP option in an acknowledgement scale factor changed, since a TCP option in an acknowledgement
segment will not be delivered reliably (unless the ACK happens to segment will not be delivered reliably (unless the ACK happens to
be piggy-backed on data in the other direction). Fixing the scale be piggy-backed on data in the other direction). Fixing the scale
when the connection is opened has the advantage of lower overhead when the connection is opened has the advantage of lower overhead
but the disadvantage that the scale factor cannot be changed but the disadvantage that the scale factor cannot be changed
skipping to change at page 10, line 14 skipping to change at page 10, line 51
be sent in a <SYN,ACK> segment, but only if a Window Scale be sent in a <SYN,ACK> segment, but only if a Window Scale
option was received in the initial <SYN> segment. A Window option was received in the initial <SYN> segment. A Window
Scale option in a segment without a SYN bit should be ignored. Scale option in a segment without a SYN bit should be ignored.
The Window field in a SYN (i.e., a <SYN> or <SYN,ACK>) segment The Window field in a SYN (i.e., a <SYN> or <SYN,ACK>) segment
itself is never scaled. itself is never scaled.
2.3 Using the Window Scale Option 2.3 Using the Window Scale Option
A model implementation of window scaling is as follows, using the A model implementation of window scaling is as follows, using the
notation of RFC-793 [Postel81]: notation of RFC 793 [Postel81]:
* All windows are treated as 32-bit quantities for storage in * All windows are treated as 32-bit quantities for storage in
the connection control block and for local calculations. the connection control block and for local calculations.
This includes the send-window (SND.WND) and the receive- This includes the send-window (SND.WND) and the receive-
window (RCV.WND) values, as well as the congestion window. window (RCV.WND) values, as well as the congestion window.
* The connection state is augmented by two window shift counts, * The connection state is augmented by two window shift counts,
Snd.Wind.Scale and Rcv.Wind.Scale, to be applied to the Snd.Wind.Scale and Rcv.Wind.Scale, to be applied to the
incoming and outgoing window fields, respectively. incoming and outgoing window fields, respectively.
skipping to change at page 10, line 44 skipping to change at page 11, line 33
containing shift.cnt = S, a TCP sets Snd.Wind.Scale to S and containing shift.cnt = S, a TCP sets Snd.Wind.Scale to S and
sets Rcv.Wind.Scale to R; otherwise, it sets both sets Rcv.Wind.Scale to R; otherwise, it sets both
Snd.Wind.Scale and Rcv.Wind.Scale to zero. Snd.Wind.Scale and Rcv.Wind.Scale to zero.
* The window field (SEG.WND) in the header of every incoming * The window field (SEG.WND) in the header of every incoming
segment, with the exception of SYN segments, is left-shifted segment, with the exception of SYN segments, is left-shifted
by Snd.Wind.Scale bits before updating SND.WND: by Snd.Wind.Scale bits before updating SND.WND:
SND.WND = SEG.WND << Snd.Wind.Scale SND.WND = SEG.WND << Snd.Wind.Scale
(assuming the other conditions of RFC-793 are met, and using (assuming the other conditions of RFC 793 are met, and using
the "C" notation "<<" for left-shift). the "C" notation "<<" for left-shift).
* The window field (SEG.WND) of every outgoing segment, with * The window field (SEG.WND) of every outgoing segment, with
the exception of SYN segments, is right-shifted by the exception of SYN segments, is right-shifted by
Rcv.Wind.Scale bits: Rcv.Wind.Scale bits:
SEG.WND = RCV.WND >> Rcv.Wind.Scale. SEG.WND = RCV.WND >> Rcv.Wind.Scale.
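The two shift rules in the model above can be sketched in Python as follows. This is a minimal illustration and not text from either draft: the WindowScaleState class and the function names are hypothetical, and clamping the outgoing field to 16 bits is an added assumption.

```python
from dataclasses import dataclass


@dataclass
class WindowScaleState:
    """Per-connection window state, kept locally as 32-bit quantities."""
    snd_wind_scale: int = 0  # shift applied to incoming window fields
    rcv_wind_scale: int = 0  # shift applied to outgoing window fields
    snd_wnd: int = 0         # SND.WND
    rcv_wnd: int = 0         # RCV.WND


def update_send_window(state: WindowScaleState, seg_wnd: int, is_syn: bool) -> None:
    """SND.WND = SEG.WND << Snd.Wind.Scale, except on SYN segments."""
    if is_syn:
        state.snd_wnd = seg_wnd  # the Window field in a SYN is never scaled
    else:
        state.snd_wnd = seg_wnd << state.snd_wind_scale


def outgoing_window_field(state: WindowScaleState, is_syn: bool) -> int:
    """SEG.WND = RCV.WND >> Rcv.Wind.Scale, except on SYN segments."""
    shift = 0 if is_syn else state.rcv_wind_scale
    # Assumption: clamp to the 16-bit header field rather than overflow.
    return min(state.rcv_wnd >> shift, 0xFFFF)
```

For example, with Snd.Wind.Scale = 3 an incoming SEG.WND of 0x1000 yields SND.WND = 0x8000, and with Rcv.Wind.Scale = 4 an RCV.WND of 1,000,000 bytes is sent as 62500.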
TCP determines if a data segment is "old" or "new" by testing TCP determines if a data segment is "old" or "new" by testing
whether its sequence number is within 2**31 bytes of the left edge whether its sequence number is within 2**31 bytes of the left edge
skipping to change at page 11, line 35 skipping to change at page 12, line 21
value exceeding 14, the TCP should log the error but use 14 value exceeding 14, the TCP should log the error but use 14
instead of the specified value. instead of the specified value.
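The handling of an oversized shift count can be sketched as below. The function name and the use of Python's logging module are illustrative assumptions; the limit of 14 is the value given in the text.

```python
import logging

MAX_SHIFT_CNT = 14  # largest shift count the text allows

log = logging.getLogger("tcp.wscale")


def accept_shift_cnt(received: int) -> int:
    """Return the shift count to use for a received Window Scale option."""
    if received > MAX_SHIFT_CNT:
        # Log the error but carry on with the maximum permitted value.
        log.warning("Window Scale shift.cnt %d exceeds %d; using %d",
                    received, MAX_SHIFT_CNT, MAX_SHIFT_CNT)
        return MAX_SHIFT_CNT
    return received
```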
The scale factor applies only to the Window field as transmitted The scale factor applies only to the Window field as transmitted
in the TCP header; each TCP using extended windows will maintain in the TCP header; each TCP using extended windows will maintain
the window values locally as 32-bit numbers. For example, the the window values locally as 32-bit numbers. For example, the
"congestion window" computed by Slow Start and Congestion "congestion window" computed by Slow Start and Congestion
Avoidance is not affected by the scale factor, so window scaling Avoidance is not affected by the scale factor, so window scaling
will not introduce quantization into the congestion window. will not introduce quantization into the congestion window.
When a non-zero scale factor is in use, there are instances when a
retracted window can be offered [Mathis08]. The end of the window
will be on a boundary based on the granularity of the scale factor
being used. If the sequence number is then updated by a number of
bytes smaller than that granularity, the TCP will have to either
advertise a new window that is beyond what it previously advertised
(and perhaps beyond the buffer), or will have to advertise a
smaller window, which will cause the TCP window to shrink.
Implementations should ensure that they handle a shrinking window,
as specified in section 4.2.2.16 of RFC 1122 [Braden89].
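A small worked example may make the retraction concrete. The sketch below is an assumption-laden illustration, not part of either draft: the function name, the round_up flag, and the numbers (a shift of 7, hence a 128-byte granularity, and a buffer ending at sequence offset 1280) are all hypothetical.

```python
def advertised_right_edge(rcv_nxt: int, buffer_end: int, shift: int,
                          round_up: bool = False) -> int:
    """Right edge of the window implied by the scaled Window field."""
    window = buffer_end - rcv_nxt      # true receive window in bytes
    field = window >> shift            # value carried in the 16-bit field
    if round_up and (field << shift) != window:
        field += 1                     # advertise more rather than less
    return rcv_nxt + (field << shift)  # edge as the sender will compute it


# Initially the window ends exactly on a 128-byte boundary.
before = advertised_right_edge(rcv_nxt=0, buffer_end=1280, shift=7)

# 100 bytes arrive and the application has not yet read any data.
rounded_down = advertised_right_edge(rcv_nxt=100, buffer_end=1280, shift=7)
rounded_up = advertised_right_edge(rcv_nxt=100, buffer_end=1280, shift=7,
                                   round_up=True)
```

Rounding down puts the advertised right edge at 1252, which is 28 bytes short of the 1280 previously offered, so the window shrinks. Rounding up puts it at 1380, beyond the end of the buffer.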
3. RTTM: ROUND-TRIP TIME MEASUREMENT 3. RTTM: ROUND-TRIP TIME MEASUREMENT
3.1 Introduction 3.1 Introduction
Accurate and current RTT estimates are necessary to adapt to Accurate and current RTT estimates are necessary to adapt to
changing traffic conditions and to avoid an instability known as changing traffic conditions and to avoid an instability known as
"congestion collapse" [Nagle84] in a busy network. However, "congestion collapse" [Nagle84] in a busy network. However,
accurate measurement of RTT may be difficult both in theory and in accurate measurement of RTT may be difficult both in theory and in
implementation. implementation.
skipping to change at page 13, line 28 skipping to change at page 14, line 26
exceptions that are explained below. exceptions that are explained below.
A TCP may send the Timestamps option (TSopt) in an initial A TCP may send the Timestamps option (TSopt) in an initial
<SYN> segment (i.e., a segment containing a SYN bit and no ACK <SYN> segment (i.e., a segment containing a SYN bit and no ACK
bit), and may send a TSopt in other segments only if it bit), and may send a TSopt in other segments only if it
received a TSopt in the initial <SYN> or <SYN,ACK> segment for received a TSopt in the initial <SYN> or <SYN,ACK> segment for
the connection. Once a TSopt has been sent or received in a the connection. Once a TSopt has been sent or received in a
non <SYN> segment, it must be sent in all segments. Once a non <SYN> segment, it must be sent in all segments. Once a
TSopt has been received in a non <SYN> segment, then any TSopt has been received in a non <SYN> segment, then any
successive segment that is received without the RST bit and successive segment that is received without the RST bit and
without a TSopt may be ACKed and dropped without further without a TSopt may be dropped without further processing, and an
processing. ACK of the current SND.UNA generated.
In the case of crossing SYN packets where one SYN contains a
TSopt and the other doesn't, both sides should put a TSopt in
the <SYN,ACK> segment.
3.3 The RTTM Mechanism 3.3 The RTTM Mechanism
RTTM places a Timestamps option in every segment, with a TSval RTTM places a Timestamps option in every segment, with a TSval
that is obtained from a (virtual) "timestamp clock". Values of that is obtained from a (virtual) "timestamp clock". Values of
this clock must be at least approximately proportional to this clock must be at least approximately proportional to
real time, in order to measure actual RTT. real time, in order to measure actual RTT.
These TSval values are echoed in TSecr values in the reverse These TSval values are echoed in TSecr values in the reverse
direction. The difference between a received TSecr value and the direction. The difference between a received TSecr value and the
skipping to change at page 13, line 51 skipping to change at page 14, line 53
When timestamps are used, every segment that is received will When timestamps are used, every segment that is received will
contain a TSecr value; however, these values cannot all be used to contain a TSecr value; however, these values cannot all be used to
update the measured RTT. The following example illustrates why. update the measured RTT. The following example illustrates why.
It shows a one-way data flow with segments arriving in sequence It shows a one-way data flow with segments arriving in sequence
without loss. Here A, B, C... represent data blocks occupying without loss. Here A, B, C... represent data blocks occupying
successive blocks of sequence numbers, and ACK(A),... represent successive blocks of sequence numbers, and ACK(A),... represent
the corresponding cumulative acknowledgments. The two timestamp the corresponding cumulative acknowledgments. The two timestamp
fields of the Timestamps option are shown symbolically as <TSval= fields of the Timestamps option are shown symbolically as <TSval=
x,TSecr=y>. Each TSecr field contains the value most recently x,TSecr=y>. Each TSecr field contains the value most recently
received in a TSval field; these echoed values. labelled received in a TSval field.
"TS.Recent", are shown in parentheses.
TCP A TCP B
(TS.Recent) (TS.Recent)
1. (120) <A,TSval=1,TSecr=120> ---> (1)
2. (125) <--- <ACK(A),TSval=125,TSecr=1> (1)
3. (125) <B,TSval=6,TSecr=125> ---> (6)
4. (130) <--- <ACK(B),TSval=130,TSecr=6> (6)
. . . ( Pause for 60 timestamp clock ticks ) . . . .
5. (130) <C,TSval=1,TSecr=120> ---> (1)
6. (125) <--- <ACK(A),TSval=125,TSecr=1> (1)
4. (127) <b,ACK(x),TSval=65,TSecr=127> ---> ...
5. ... <--- <y,ACK(A),TSval=191,TSecr=5> (5)
TCP A TCP B TCP A TCP B
<A,TSval=1,TSecr=120> ------> <A,TSval=1,TSecr=120> ------>
<---- <ACK(A),TSval=127,TSecr=1> <---- <ACK(A),TSval=127,TSecr=1>
<B,TSval=5,TSecr=127> ------> <B,TSval=5,TSecr=127> ------>
<---- <ACK(B),TSval=131,TSecr=5> <---- <ACK(B),TSval=131,TSecr=5>
skipping to change at page 15, line 19 skipping to change at page 15, line 44
Since TCP B is not sending data, the data segment C does not Since TCP B is not sending data, the data segment C does not
acknowledge any new data when it arrives at B. Thus, the inflated acknowledge any new data when it arrives at B. Thus, the inflated
RTTM measurement is not used to update B's RTTM measurement. RTTM measurement is not used to update B's RTTM measurement.
Implementors should note that with Timestamps multiple RTTMs can Implementors should note that with Timestamps multiple RTTMs can
be taken per RTT. Many RTO estimators have a weighting factor be taken per RTT. Many RTO estimators have a weighting factor
based on an implicit assumption that at most one RTTM will be based on an implicit assumption that at most one RTTM will be
gotten per RTT. When using multiple RTTMs per RTT to update the gotten per RTT. When using multiple RTTMs per RTT to update the
RTO estimator, the weighting factor needs to be decreased to take RTO estimator, the weighting factor needs to be decreased to take
into account the more frequent RTTMs. For example, into account the more frequent RTTMs. For example, an
implementation could choose to just use one sample per RTT to
update the RTO estimator, or vary the gain based on the
congestion window, or take an average of all the RTTM measurements
received over one RTT, and then use that value to update the RTO
estimator. This document does not prescribe any particular method
for modifying the RTO estimator; the important point is that the
implementation should do something more than just feeding
additional RTTM samples from one RTT into the RTO estimator.
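As an illustration of one of the options above, the following is a minimal sketch that averages all RTTM samples taken during one RTT and feeds only that average into the estimator. The structure and function names are hypothetical, the smoothing gains (1/8 for the smoothed RTT, 1/4 for the variation) are the conventional values and are an assumption rather than something this document prescribes, and the caller is assumed to decide when one RTT has elapsed.

```c
#include <stdint.h>

/* Hypothetical per-connection RTT state; all times are in clock ticks. */
struct rtt_state {
    double   srtt;       /* smoothed RTT */
    double   rttvar;     /* smoothed RTT variation */
    int      have_srtt;  /* nonzero once the first update has been made */
    double   sum;        /* sum of RTTM samples collected in this RTT */
    unsigned count;      /* number of RTTM samples collected in this RTT */
};

/* Collect one RTTM sample without touching the estimator. */
static void rtt_add_sample(struct rtt_state *st, double rttm)
{
    st->sum += rttm;
    st->count++;
}

/* Called once per RTT (how the caller detects this is an assumption):
 * update the estimator with the average of the collected samples. */
static void rtt_end_of_round(struct rtt_state *st)
{
    if (st->count == 0)
        return;                          /* nothing measured this RTT */

    double r = st->sum / st->count;      /* one value per RTT */

    if (!st->have_srtt) {
        st->srtt = r;
        st->rttvar = r / 2.0;
        st->have_srtt = 1;
    } else {
        double err = r > st->srtt ? r - st->srtt : st->srtt - r;
        st->rttvar = 0.75 * st->rttvar + 0.25 * err;
        st->srtt   = 0.875 * st->srtt + 0.125 * r;
    }

    st->sum = 0.0;                       /* start a new collection round */
    st->count = 0;
}
```

With this arrangement the estimator sees at most one update per RTT, so its weighting factors keep the meaning they had before Timestamps were in use.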
3.4 Which Timestamp to Echo 3.4 Which Timestamp to Echo
If more than one Timestamps option is received before a reply If more than one Timestamps option is received before a reply
segment is sent, the TCP must choose only one of the TSvals to segment is sent, the TCP must choose only one of the TSvals to
echo, ignoring the others. To minimize the state kept in the echo, ignoring the others. To minimize the state kept in the
receiver (i.e., the number of unprocessed TSvals), the receiver receiver (i.e., the number of unprocessed TSvals), the receiver
should be required to retain at most one timestamp in the should be required to retain at most one timestamp in the
connection control block. connection control block.
There are three situations to consider: There are three situations to consider:
(A) Delayed ACKs. (A) Delayed ACKs.
skipping to change at page 17, line 46 skipping to change at page 18, line 27
2 2
<E, TSval=5> -------------------> <E, TSval=5> ------------------->
2 2
<---- <ACK(C), TSecr=2> <---- <ACK(C), TSecr=2>
2 2
<D, TSval=4> -------------------> <D, TSval=4> ------------------->
4 4
<---- <ACK(E), TSecr=4> <---- <ACK(E), TSecr=4>
(etc) (etc)
4. PAWS: PROTECT AGAINST WRAPPED SEQUENCE NUMBERS 4. PAWS: PROTECTION AGAINST WRAPPED SEQUENCE NUMBERS
4.1 Introduction 4.1 Introduction
Section 4.2 describes a simple mechanism to reject old duplicate Section 4.2 describes a simple mechanism to reject old duplicate
segments that might corrupt an open TCP connection; we call this segments that might corrupt an open TCP connection; we call this
mechanism PAWS (Protect Against Wrapped Sequence numbers). PAWS mechanism PAWS (Protection Against Wrapped Sequence numbers).
operates within a single TCP connection, using state that is saved PAWS operates within a single TCP connection, using state that is
in the connection control block. Section 4.3 and Appendix C saved in the connection control block. Section 4.3 and Appendix C
discuss the implications of the PAWS mechanism for avoiding old discuss the implications of the PAWS mechanism for avoiding old
duplicates from previous incarnations of the same connection. duplicates from previous incarnations of the same connection.
4.2 The PAWS Mechanism 4.2 The PAWS Mechanism
PAWS uses the same TCP Timestamps option as the RTTM mechanism PAWS uses the same TCP Timestamps option as the RTTM mechanism
described earlier, and assumes that every received TCP segment described earlier, and assumes that every received TCP segment
(including data and ACK segments) contains a timestamp SEG.TSval (including data and ACK segments) contains a timestamp SEG.TSval
whose values are monotone non-decreasing in time. The basic idea whose values are monotonically non-decreasing in time. The basic
is that a segment can be discarded as an old duplicate if it is idea is that a segment can be discarded as an old duplicate if it
received with a timestamp SEG.TSval less than some timestamp is received with a timestamp SEG.TSval less than some timestamp
recently received on this connection. recently received on this connection.
In both the PAWS and the RTTM mechanism, the "timestamps" are In both the PAWS and the RTTM mechanism, the "timestamps" are
32-bit unsigned integers in a modular 32-bit space. Thus, "less 32-bit unsigned integers in a modular 32-bit space. Thus, "less
than" is defined the same way it is for TCP sequence numbers, and than" is defined the same way it is for TCP sequence numbers, and
the same implementation techniques apply. If s and t are the same implementation techniques apply. If s and t are
timestamp values, s < t if 0 < (t - s) < 2**31, computed in timestamp values, s < t if 0 < (t - s) < 2**31, computed in
unsigned 32-bit arithmetic. unsigned 32-bit arithmetic.
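The comparison just defined can be sketched in C as follows. The function name is hypothetical; the logic is a direct transcription of "s < t if 0 < (t - s) < 2**31" and relies on the wrap-around behavior of unsigned 32-bit subtraction.

```c
#include <stdint.h>
#include <stdbool.h>

/* True if timestamp s is "less than" timestamp t in the modular
 * 32-bit space, i.e. 0 < (t - s) < 2**31 in unsigned arithmetic. */
static bool ts_before(uint32_t s, uint32_t t)
{
    uint32_t diff = t - s;   /* wraps modulo 2**32 */
    return diff != 0 && diff < UINT32_C(0x80000000);
}
```

For example, ts_before(0xFFFFFFF0, 0x00000010) is true, because the second value is 0x20 ticks ahead of the first once wrap-around is taken into account.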
The choice of incoming timestamps to be saved for this comparison The choice of incoming timestamps to be saved for this comparison
must guarantee a value that is monotone increasing. For example, must guarantee a value that is monotonically increasing. For
we might save the timestamp from the segment that last advanced example, we might save the timestamp from the segment that last
the left edge of the receive window, i.e., the most recent in- advanced the left edge of the receive window, i.e., the most
sequence segment. Instead, we choose the value TS.Recent recent in-sequence segment. Instead, we choose the value
introduced in Section 3.4 for the RTTM mechanism, since using a TS.Recent introduced in Section 3.4 for the RTTM mechanism, since
common value for both PAWS and RTTM simplifies the implementation using a common value for both PAWS and RTTM simplifies the
of both. As Section 3.4 explained, TS.Recent differs from the implementation of both. As Section 3.4 explained, TS.Recent
timestamp from the last in-sequence segment only in the case of differs from the timestamp from the last in-sequence segment only
delayed ACKs, and therefore by less than one window. Either in the case of delayed ACKs, and therefore by less than one
choice will therefore protect against sequence number wrap-around. window. Either choice will therefore protect against sequence
number wrap-around.
RTTM was specified in a symmetrical manner, so that TSval RTTM was specified in a symmetrical manner, so that TSval
timestamps are carried in both data and ACK segments and are timestamps are carried in both data and ACK segments and are
echoed in TSecr fields carried in returning ACK or data segments. echoed in TSecr fields carried in returning ACK or data segments.
PAWS submits all incoming segments to the same test, and therefore PAWS submits all incoming segments to the same test, and therefore
protects against duplicate ACK segments as well as data segments. protects against duplicate ACK segments as well as data segments.
(An alternative un-symmetric algorithm would protect against old (An alternative non-symmetric algorithm would protect against old
duplicate ACKs: the sender of data would reject incoming ACK duplicate ACKs: the sender of data would reject incoming ACK
segments whose TSecr values were less than the TSecr saved from segments whose TSecr values were less than the TSecr saved from
the last segment whose ACK field advanced the left edge of the the last segment whose ACK field advanced the left edge of the
send window. This algorithm was deemed to lack economy of send window. This algorithm was deemed to lack economy of
mechanism and symmetry.) mechanism and symmetry.)
TSval timestamps sent on {SYN} and {SYN,ACK} segments are used to TSval timestamps sent on {SYN} and {SYN,ACK} segments are used to
initialize PAWS. PAWS protects against old duplicate non-SYN initialize PAWS. PAWS protects against old duplicate non-SYN
segments, and duplicate SYN segments received while there is a segments, and duplicate SYN segments received while there is a
synchronized connection. Duplicate {SYN} and {SYN,ACK} segments synchronized connection. Duplicate {SYN} and {SYN,ACK} segments
received when there is no connection will be discarded by the received when there is no connection will be discarded by the
normal 3-way handshake and sequence number checks of TCP. normal 3-way handshake and sequence number checks of TCP.
RFC-1323 recommended that RST segments NOT carry timestamps, and RFC 1323 recommended that RST segments NOT carry timestamps, and
that they be accetable regardless of their timestamp. At that that they be acceptable regardless of their timestamp. At that
time, the thinking was that old duplicate RST segments should be time, the thinking was that old duplicate RST segments should be
exceedingly unlikely, and their cleanup function should take exceedingly unlikely, and their cleanup function should take
precedence over timestamps. More recently, discussions about precedence over timestamps. More recently, discussions about
various blind attacks on TCP connections have raised the various blind attacks on TCP connections have raised the
suggestion that if the Timestamps option is present, SEG.TSecr suggestion that if the Timestamps option is present, SEG.TSecr
could be used to provide stricter acceptance tests for RST could be used to provide stricter acceptance tests for RST
packets. While still under discussion, to enable research into packets. While still under discussion, to enable research into
this area it is now recommended that, when generating a RST, this area it is now recommended that, when generating a RST,
if the packet causing the RST to be generated contained a if the packet causing the RST to be generated contained a
Timestamps option, the RST also contain a Timestamps option. Timestamps option, the RST also contain a Timestamps option.
In the RST segment, SEG.TSecr should be set to SEG.TSval from the In the RST segment, SEG.TSecr should be set to SEG.TSval from the
incoming packet and SEG.TSval should be set to zero. If a RST is incoming packet and SEG.TSval should be set to zero. If a RST is
being generated because of a user abort, and Snd.TS.OK is set, being generated because of a user abort, and Snd.TS.OK is set,
then a Timestamps option should be included in the RST. When a then a Timestamps option should be included in the RST. When a
RST packet is received, it must not be subjected to PAWS checks, RST packet is received, it must not be subjected to PAWS checks,
and information from Timestamps option must not be used to update and information from the Timestamps option must not be used to
connection state information. SEG.TSecr may be used to provide update connection state information. SEG.TSecr may be used to
stricter RST acceptance checks. provide stricter RST acceptance checks.
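The rule for a RST generated in response to an incoming packet can be sketched as below. The structure and function names are hypothetical. The user-abort case is not covered, since the text above does not state which field values to use there.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical parsed form of a Timestamps option. */
struct ts_opt {
    bool     present;
    uint32_t tsval;
    uint32_t tsecr;
};

/* Build the Timestamps option for a RST sent in reply to an incoming
 * packet: include it only if the incoming packet carried one, echo the
 * incoming TSval in TSecr, and set TSval to zero. */
static struct ts_opt rst_reply_ts(const struct ts_opt *in)
{
    struct ts_opt out = { false, 0, 0 };

    if (in->present) {
        out.present = true;
        out.tsval = 0;
        out.tsecr = in->tsval;
    }
    return out;
}
```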
4.2.1 Basic PAWS Algorithm 4.2.1 Basic PAWS Algorithm
The PAWS algorithm requires the following processing to be The PAWS algorithm requires the following processing to be
performed on all incoming segments for a synchronized performed on all incoming segments for a synchronized
connection: connection:
R1) If there is a Timestamps option in the arriving segment, R1) If there is a Timestamps option in the arriving segment,
SEG.TSval < TS.Recent, TS.Recent is valid (see later SEG.TSval < TS.Recent, TS.Recent is valid (see later
discussion) and the RST bit is not set, then treat the discussion) and the RST bit is not set, then treat the
arriving segment as not acceptable: arriving segment as not acceptable:
Send an acknowledgement in reply as specified in Send an acknowledgement in reply as specified in RFC
RFC-793 page 69 and drop the segment. 793 page 69 and drop the segment.
Note: it is necessary to send an ACK segment in order Note: it is necessary to send an ACK segment in order
to retain TCP's mechanisms for detecting and to retain TCP's mechanisms for detecting and
recovering from half-open connections. For example, recovering from half-open connections. For example,
see Figure 10 of RFC-793. see Figure 10 of RFC 793.
R2) If the segment is outside the window, reject it (normal R2) If the segment is outside the window, reject it (normal
TCP processing) TCP processing)
R3) If an arriving segment satisfies: SEG.SEQ <= Last.ACK.sent R3) If an arriving segment satisfies: SEG.SEQ <= Last.ACK.sent
(see Section 3.4), then record its timestamp in TS.Recent. (see Section 3.4), then record its timestamp in TS.Recent.
R4) If an arriving segment is in-sequence (i.e., at the left R4) If an arriving segment is in-sequence (i.e., at the left
window edge), then accept it normally. window edge), then accept it normally.
R5) Otherwise, treat the segment as a normal in-window, out- R5) Otherwise, treat the segment as a normal in-window, out-
of-sequence TCP segment (e.g., queue it for later delivery of-sequence TCP segment (e.g., queue it for later delivery
to the user). to the user).
Steps R2, R4, and R5 are the normal TCP processing steps Steps R2, R4, and R5 are the normal TCP processing steps
specified by RFC-793. specified by RFC 793.
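The two steps that PAWS adds to normal processing, R1 and R3, can be sketched as follows. This is an illustration under stated assumptions, not a complete receive path: the structure and function names are hypothetical, the validity of TS.Recent is reduced to a flag, and steps R2, R4, and R5 are left to the normal TCP code.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical subset of the connection control block. */
struct paws_state {
    uint32_t ts_recent;        /* TS.Recent */
    bool     ts_recent_valid;  /* false if TS.Recent has been invalidated */
    uint32_t last_ack_sent;    /* Last.ACK.sent */
};

/* Hypothetical summary of an arriving segment. */
struct seg_info {
    bool     has_ts;           /* Timestamps option present */
    bool     rst;              /* RST bit set */
    uint32_t tsval;            /* SEG.TSval */
    uint32_t seq;              /* SEG.SEQ */
};

/* Modular "less than" for 32-bit timestamps and sequence numbers. */
static bool mod_lt(uint32_t s, uint32_t t)
{
    uint32_t diff = t - s;
    return diff != 0 && diff < UINT32_C(0x80000000);
}

/* R1: true if the segment is not acceptable. The caller then sends
 * an ACK and drops the segment. */
static bool paws_r1_reject(const struct paws_state *c,
                           const struct seg_info *seg)
{
    return seg->has_ts && !seg->rst && c->ts_recent_valid &&
           mod_lt(seg->tsval, c->ts_recent);
}

/* R3: record the timestamp if SEG.SEQ <= Last.ACK.sent. RST segments
 * are skipped because they must not update connection state. */
static void paws_r3_update(struct paws_state *c,
                           const struct seg_info *seg)
{
    bool seq_ok = seg->seq == c->last_ack_sent ||
                  mod_lt(seg->seq, c->last_ack_sent);

    if (seg->has_ts && !seg->rst && seq_ok) {
        c->ts_recent = seg->tsval;
        c->ts_recent_valid = true;
    }
}
```

A caller would invoke paws_r1_reject() first, apply the window check of R2, and only then call paws_r3_update() on segments that survived both.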
It is important to note that the timestamp is checked only when It is important to note that the timestamp is checked only when
a segment first arrives at the receiver, regardless of whether a segment first arrives at the receiver, regardless of whether
it is in-sequence or it must be queued for later delivery. it is in-sequence or it must be queued for later delivery.
Consider the following example. Consider the following example.
Suppose the segment sequence: A.1, B.1, C.1, ..., Z.1 has Suppose the segment sequence: A.1, B.1, C.1, ..., Z.1 has
been sent, where the letter indicates the sequence number been sent, where the letter indicates the sequence number
and the digit represents the timestamp. Suppose also that and the digit represents the timestamp. Suppose also that
segment B.1 has been lost. The timestamp in TS.Recent is segment B.1 has been lost. The timestamp in TS.Recent is
1 (from A.1), so C.1, ..., Z.1 are considered acceptable 1 (from A.1), so C.1, ..., Z.1 are considered acceptable
and are queued. When B is retransmitted as segment B.2 and are queued. When B is retransmitted as segment B.2
(using the latest timestamp), it fills the hole and causes (using the latest timestamp), it fills the hole and causes
all the segments through Z to be acknowledged and passed all the segments through Z to be acknowledged and passed
skipping to change at page 20, line 40 skipping to change at page 21, line 26
*not* inspected again at this time, since they have *not* inspected again at this time, since they have
already been accepted. When B.2 is accepted, TS.Recent is already been accepted. When B.2 is accepted, TS.Recent is
set to 2. set to 2.
This rule allows reasonable performance under loss. A full This rule allows reasonable performance under loss. A full
window of data is in transit at all times, and after a loss a window of data is in transit at all times, and after a loss a
full window less one packet will show up out-of-sequence to be full window less one packet will show up out-of-sequence to be
queued at the receiver (e.g., up to ~2**30 bytes of data); the queued at the receiver (e.g., up to ~2**30 bytes of data); the
timestamp option must not result in discarding this data. timestamp option must not result in discarding this data.
In certain unlikely circumstances, the algorithm of rules R1-R4 In certain unlikely circumstances, the algorithm of rules R1-R5
could lead to discarding some segments unnecessarily, as shown could lead to discarding some segments unnecessarily, as shown
in the following example: in the following example:
Suppose again that segments: A.1, B.1, C.1, ..., Z.1 have Suppose again that segments: A.1, B.1, C.1, ..., Z.1 have
been sent in sequence and that segment B.1 has been lost. been sent in sequence and that segment B.1 has been lost.
Furthermore, suppose delivery of some of C.1, ... Z.1 is Furthermore, suppose delivery of some of C.1, ... Z.1 is
delayed until AFTER the retransmission B.2 arrives at the delayed until AFTER the retransmission B.2 arrives at the
receiver. These delayed segments will be discarded receiver. These delayed segments will be discarded
unnecessarily when they do arrive, since their timestamps unnecessarily when they do arrive, since their timestamps
are now out of date. are now out of date.
skipping to change at page 21, line 27 skipping to change at page 22, line 12
We know of no case with a significant probability of occurrence We know of no case with a significant probability of occurrence
in which timestamps will cause performance degradation by in which timestamps will cause performance degradation by
unnecessarily discarding segments. unnecessarily discarding segments.
4.2.2 Timestamp Clock 4.2.2 Timestamp Clock
It is important to understand that the PAWS algorithm does not It is important to understand that the PAWS algorithm does not
require clock synchronization between sender and receiver. The require clock synchronization between sender and receiver. The
sender's timestamp clock is used to stamp the segments, and the sender's timestamp clock is used to stamp the segments, and the
sender uses the echoed timestamp to measure RTTs. However, sender uses the echoed timestamp to measure RTTs. However,
the receiver treats the timestamp as simply a monotone- the receiver treats the timestamp as simply a monotonically
increasing serial number, without any necessary connection to increasing serial number, without any necessary connection to
its clock. From the receiver's viewpoint, the timestamp is its clock. From the receiver's viewpoint, the timestamp is
acting as a logical extension of the high-order bits of the acting as a logical extension of the high-order bits of the
sequence number. sequence number.
The receiver algorithm does place some requirements on the The receiver algorithm does place some requirements on the
frequency of the timestamp clock. frequency of the timestamp clock.
(a) The timestamp clock must not be "too slow". (a) The timestamp clock must not be "too slow".
It must tick at least once for each 2**31 bytes sent. In It must tick at least once for each 2**31 bytes sent. In
fact, in order to be useful to the sender for round trip fact, in order to be useful to the sender for round trip
timing, the clock should tick at least once per window's timing, the clock should tick at least once per window's
worth of data, and even with the RFC-1072 window worth of data, and even with the window extension defined
extension, 2**31 bytes must be at least two windows. in Section 2.2, 2**31 bytes must be at least two windows.
To make this more quantitative, any clock faster than 1 To make this more quantitative, any clock faster than 1
tick/sec will reject old duplicate segments for link tick/sec will reject old duplicate segments for link
speeds of ~8 Gbps. A 1ms timestamp clock will work at speeds of ~8 Gbps. A 1ms timestamp clock will work at
link speeds up to 8 Tbps (8*10**12 bps)! link speeds up to 8 Tbps (8*10**12 bps)!
(b) The timestamp clock must not be "too fast". (b) The timestamp clock must not be "too fast".
Its recycling time must be greater than MSL seconds. Its recycling time must be greater than MSL seconds.
Since the clock (timestamp) is 32 bits and the worst-case Since the clock (timestamp) is 32 bits and the worst-case
skipping to change at page 23, line 13 skipping to change at page 23, line 47
With the chosen range of timestamp clock frequencies (1 sec to With the chosen range of timestamp clock frequencies (1 sec to
1 ms), the time to wrap the sign bit will be between 24.8 days 1 ms), the time to wrap the sign bit will be between 24.8 days
and 24800 days. A TCP connection that is idle for more than 24 and 24800 days. A TCP connection that is idle for more than 24
days and then comes to life is exceedingly unusual. However, days and then comes to life is exceedingly unusual. However,
it is undesirable in principle to place any limitation on TCP it is undesirable in principle to place any limitation on TCP
connection lifetimes. connection lifetimes.
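The 24.8-day and 24800-day figures follow from the time needed for the timestamp clock to advance by 2**31 ticks. A small sketch of the arithmetic, with a hypothetical function name:

```c
#include <stdint.h>

/* Days until the sign bit of a 32-bit timestamp wraps, i.e. until the
 * clock has advanced by 2**31 ticks, for a given tick period in seconds. */
static double sign_wrap_days(double tick_seconds)
{
    const double ticks = 2147483648.0;          /* 2**31 */
    const double seconds_per_day = 86400.0;

    return ticks * tick_seconds / seconds_per_day;
}
```

With a 1 ms tick this gives roughly 24.86 days, and with a 1 second tick roughly 24855 days, matching the rounded values in the text.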
We therefore require that an implementation of PAWS include a We therefore require that an implementation of PAWS include a
mechanism to "invalidate" the TS.Recent value when a connection mechanism to "invalidate" the TS.Recent value when a connection
is idle for more than 24 days. (An alternative solution to the is idle for more than 24 days. (An alternative solution to the
problem of outdated timestamps would be to send keepalive problem of outdated timestamps would be to send keep-alive
segments at a very low rate, but still more often than the segments at a very low rate, but still more often than the
wrap-around time for timestamps, e.g., once a day. This would wrap-around time for timestamps, e.g., once a day. This would
impose negligible overhead. However, the TCP specification has impose negligible overhead. However, the TCP specification has
never included keepalives, so the solution based upon never included keep-alives, so the solution based upon
invalidation was chosen.) invalidation was chosen.)
Note that a TCP does not know the frequency, and therefore, the Note that a TCP does not know the frequency, and therefore, the
wraparound time, of the other TCP, so it must assume the worst. wraparound time, of the other TCP, so it must assume the worst.
The validity of TS.Recent needs to be checked only if the basic The validity of TS.Recent needs to be checked only if the basic
PAWS timestamp check fails, i.e., only if SEG.TSval < PAWS timestamp check fails, i.e., only if SEG.TSval <
TS.Recent. If TS.Recent is found to be invalid, then the TS.Recent. If TS.Recent is found to be invalid, then the
segment is accepted, regardless of the failure of the timestamp segment is accepted, regardless of the failure of the timestamp
check, and rule R3 updates TS.Recent with the TSval from the check, and rule R3 updates TS.Recent with the TSval from the
new segment. new segment.
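The order of checks described above can be sketched as follows: the idle-time test is consulted only after the basic timestamp comparison has failed. The names are hypothetical, and the sketch assumes the implementation records a local time, in seconds, each time TS.Recent is updated.

```c
#include <stdint.h>
#include <stdbool.h>

/* 24 days, expressed in seconds. */
#define PAWS_IDLE_LIMIT_SEC (24L * 24L * 60L * 60L)

/* Hypothetical record kept alongside TS.Recent. */
struct ts_recent_rec {
    uint32_t ts_recent;       /* TS.Recent */
    int64_t  updated_at_sec;  /* local time of the last update (assumed) */
};

/* Modular comparison: true if seg_tsval < ts_recent. */
static bool seg_ts_is_old(uint32_t seg_tsval, uint32_t ts_recent)
{
    uint32_t diff = ts_recent - seg_tsval;
    return diff != 0 && diff < UINT32_C(0x80000000);
}

/* True if the segment should be dropped by the timestamp check. */
static bool paws_should_drop(const struct ts_recent_rec *r,
                             uint32_t seg_tsval, int64_t now_sec)
{
    if (!seg_ts_is_old(seg_tsval, r->ts_recent))
        return false;   /* basic check passed; validity not examined */

    if (now_sec - r->updated_at_sec > PAWS_IDLE_LIMIT_SEC)
        return false;   /* TS.Recent is invalid: accept the segment */

    return true;        /* old timestamp and TS.Recent still valid */
}
```

Because the idle-time test sits behind the comparison, the common case of an in-order segment with a current timestamp costs only the single modular comparison.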
To detect how long the connection has been idle, the TCP may To detect how long the connection has been idle, the TCP may
skipping to change at page 24, line 6 skipping to change at page 24, line 40
the following recommended sequence for processing an arriving the following recommended sequence for processing an arriving
TCP segment: TCP segment:
H1) Check timestamp (same as step R1 above) H1) Check timestamp (same as step R1 above)
H2) Do header prediction: if segment is next in sequence and H2) Do header prediction: if segment is next in sequence and
if there are no special conditions requiring additional if there are no special conditions requiring additional
processing, accept the segment, record its timestamp, and processing, accept the segment, record its timestamp, and
skip H3. skip H3.
H3) Process the segment normally, as specified in RFC-793. H3) Process the segment normally, as specified in RFC 793.
This includes dropping segments that are outside the This includes dropping segments that are outside the
window and possibly sending acknowledgments, and queueing window and possibly sending acknowledgments, and queueing
in-window, out-of-sequence segments. in-window, out-of-sequence segments.
Another possibility would be to interchange steps H1 and H2, Another possibility would be to interchange steps H1 and H2,
i.e., to perform the header prediction step H2 FIRST, and i.e., to perform the header prediction step H2 FIRST, and
perform H1 and H3 only when header prediction fails. This perform H1 and H3 only when header prediction fails. This
could be a performance improvement, since the timestamp check could be a performance improvement, since the timestamp check
in step H1 is very unlikely to fail, and it requires interval in step H1 is very unlikely to fail, and it requires unsigned
arithmetic on a finite field, a relatively expensive operation. modulo arithmetic, a relatively expensive operation. To
To perform this check on every single segment is contrary to perform this check on every single segment is contrary to the
the philosophy of header prediction. We believe that this philosophy of header prediction. We believe that this change
change might reduce CPU time for TCP protocol processing by up might produce a measurable reduction in CPU time for TCP
to 5-10% on high-speed networks. protocol processing on high-speed networks.
However, putting H2 first would create a hazard: a segment from However, putting H2 first would create a hazard: a segment from
2**32 bytes in the past might arrive at exactly the wrong time 2**32 bytes in the past might arrive at exactly the wrong time
and be accepted mistakenly by the header-prediction step. The and be accepted mistakenly by the header-prediction step. The
following reasoning has been introduced [Jacobson90b] to show following reasoning has been introduced [Jacobson90b] to show
that the probability of this failure is negligible. that the probability of this failure is negligible.
If all segments are equally likely to show up as old If all segments are equally likely to show up as old
duplicates, then the probability of an old duplicate duplicates, then the probability of an old duplicate
exactly matching the left window edge is the maximum exactly matching the left window edge is the maximum
skipping to change at page 25, line 12 skipping to change at page 25, line 46
gain does not justify the hazard in the general case. It is gain does not justify the hazard in the general case. It is
therefore recommended that H2 follow H1. therefore recommended that H2 follow H1.
4.2.5 IP Fragmentation 4.2.5 IP Fragmentation
At high data rates, the protection against old packets provided At high data rates, the protection against old packets provided
by PAWS can be circumvented by errors in IP fragment reassembly by PAWS can be circumvented by errors in IP fragment reassembly
[Heffner07]. The only way to protect against incorrect IP [Heffner07]. The only way to protect against incorrect IP
fragment reassembly is to not allow the packets to be fragment reassembly is to not allow the packets to be
fragmented. This is done by setting the Don't Fragment (DF) fragmented. This is done by setting the Don't Fragment (DF)
bit in the IP header. Setting the DF bit implies that Path MTU bit in the IP header. Setting the DF bit implies the use of
Discovery as described in RFC-1191 [Mogul90]. Thus any TCP Path MTU Discovery as described in RFC 1191 [Mogul90], thus any
implementation that implements PAWS must also implement Path TCP implementation that implements PAWS must also implement
MTU Discovery. Path MTU Discovery.
4.3. Duplicates from Earlier Incarnations of Connection 4.3. Duplicates from Earlier Incarnations of Connection
The PAWS mechanism protects against errors due to sequence number The PAWS mechanism protects against errors due to sequence number
wrap-around on high-speed connections. Segments from an earlier wrap-around on high-speed connections. Segments from an earlier
incarnation of the same connection are also a potential cause of incarnation of the same connection are also a potential cause of
old duplicate errors. In both cases, the TCP mechanisms to old duplicate errors. In both cases, the TCP mechanisms to
prevent such errors depend upon the enforcement of a maximum prevent such errors depend upon the enforcement of a maximum
segment lifetime (MSL) by the Internet (IP) layer (see Appendix of segment lifetime (MSL) by the Internet (IP) layer (see Appendix of
RFC-1185 for a detailed discussion). Unlike the case of sequence RFC 1185 for a detailed discussion). Unlike the case of sequence
space wrap-around, the MSL required to prevent old duplicate space wrap-around, the MSL required to prevent old duplicate
errors from earlier incarnations does not depend upon the transfer errors from earlier incarnations does not depend upon the transfer
rate. If the IP layer enforces the recommended 2 minute MSL of rate. If the IP layer enforces the recommended 2 minute MSL of
TCP, and if the TCP rules are followed, TCP connections will be TCP, and if the TCP rules are followed, TCP connections will be
safe from earlier incarnations, no matter how high the network safe from earlier incarnations, no matter how high the network
speed. Thus, the PAWS mechanism is not required for this case. speed. Thus, the PAWS mechanism is not required for this case.
We may still ask whether the PAWS mechanism can provide additional We may still ask whether the PAWS mechanism can provide additional
security against old duplicates from earlier connections, allowing security against old duplicates from earlier connections, allowing
us to relax the enforcement of MSL by the IP layer. Appendix B us to relax the enforcement of MSL by the IP layer. Appendix B
skipping to change at page 26, line 9 skipping to change at page 26, line 42
These mechanisms are implemented using new TCP options for scaled These mechanisms are implemented using new TCP options for scaled
windows and timestamps. The timestamps are used for two distinct windows and timestamps. The timestamps are used for two distinct
mechanisms: RTTM (Round Trip Time Measurement) and PAWS (Protect mechanisms: RTTM (Round Trip Time Measurement) and PAWS (Protect
Against Wrapped Sequences). Against Wrapped Sequences).
The Window Scale option was originally suggested by Mike St. Johns of The Window Scale option was originally suggested by Mike St. Johns of
USAF/DCA. The present form of the option was suggested by Mike USAF/DCA. The present form of the option was suggested by Mike
Karels of UC Berkeley in response to a more cumbersome scheme defined Karels of UC Berkeley in response to a more cumbersome scheme defined
by Van Jacobson. Lixia Zhang helped formulate the PAWS mechanism by Van Jacobson. Lixia Zhang helped formulate the PAWS mechanism
description in RFC-1185. description in RFC 1185.
Finally, much of this work originated as the result of discussions Finally, much of this work originated as the result of discussions
within the End-to-End Task Force on the theoretical limitations of within the End-to-End Task Force on the theoretical limitations of
transport protocols in general and TCP in particular. Task force transport protocols in general and TCP in particular. Task force
members and others on the end2end-interest list have made valuable members and others on the end2end-interest list have made valuable
contributions by pointing out flaws in the algorithms and the contributions by pointing out flaws in the algorithms and the
documentation. Continued discussion and development since the documentation. Continued discussion and development since the
publication of RFC-1323 originally occurred in the IETF TCP Large publication of RFC 1323 originally occurred in the IETF TCP Large
Windows Working Group, later on in the End-to-End Taks Force, and Windows Working Group, later on in the End-to-End Task Force, and
most recently in the IETF TCP Maintance Working Group. The authors most recently in the IETF TCP Maintenance Working Group. The authors
are grateful for all these contributions. are grateful for all these contributions.
6. SECURITY CONSIDERATIONS 6. SECURITY CONSIDERATIONS
Security issues are not discussed in this memo. The TCP sequence space is a fixed size, and as the window becomes
larger it becomes easier for an attacker to generate forged packets
that can fall within the TCP window, and be accepted as valid
packets. While use of Timestamps and PAWS can help to mitigate this,
when using PAWS, if an attacker is able to forge a packet that is
acceptable to the TCP connection, a timestamp that is in the future
would cause valid packets to be dropped due to PAWS checks. Hence,
implementors should take care to not open the TCP window drastically
beyond the requirements of the connection.
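The interaction described above can be illustrated with a minimal sketch of the PAWS comparison. The struct and function names below are hypothetical, and the Last.ACK.sent condition for updating TS.Recent is deliberately omitted; the sketch only shows how one accepted segment carrying a forged future timestamp would cause later valid segments to fail the check.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-connection state; only the field needed here. */
struct paws_state {
    uint32_t ts_recent;   /* TS.Recent: latest accepted timestamp */
};

/* True if timestamp a is earlier than b in 32-bit modular space
 * (relies on the usual two's-complement conversion to int32_t). */
static bool ts_before(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}

/* Simplified PAWS test: reject a segment whose TSval is older than
 * TS.Recent, otherwise accept it and advance TS.Recent. */
static bool paws_accept(struct paws_state *st, uint32_t seg_tsval)
{
    if (ts_before(seg_tsval, st->ts_recent))
        return false;            /* treated as an old duplicate */
    st->ts_recent = seg_tsval;   /* Last.ACK.sent condition omitted */
    return true;
}
```

Once a forged segment with a far-future TSval passes the other acceptability tests, TS.Recent jumps ahead, and genuine segments carrying the sender's real clock value are rejected until that clock catches up.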
7. REFERENCES Middle boxes and options If a middle box removes TCP options from the
SYN, such as TSopt, a high speed connection that needs PAWS would not
have that protection. In this situation, an implementor could
provide a mechanism for the application to determine whether or not
PAWS is in use on the connection, and choose to terminate the
connection if that protection doesn't exist.
Mechanisms to protect the TCP header from modification should also
protect the TCP options.
Expanding the TCP window beyond 64K for IPv6 allows Jumbograms
[Borman99] to be used when the local network supports packets larger
than 64K. When larger TCP packets are used, the TCP checksum becomes
weaker.
7. IANA CONSIDERATIONS
This document has no actions for IANA.
8. REFERENCES
Normative References Normative References
[Mogul90] Mogul, J. and Deering, S., "Path MTU Discovery", [Mogul90] Mogul, J. and Deering, S., "Path MTU Discovery", RFC
RFC-1191, November 1990. 1191, November 1990.
[Postel81] Postel, J., "Transmission Control Protocol - DARPA [Postel81] Postel, J., "Transmission Control Protocol - DARPA
Internet Program Protocol Specification", RFC-793, DARPA, Internet Program Protocol Specification", RFC 793, DARPA,
September 1981. September 1981.
Informative References Informative References
[Allman99] Allman, M., Paxson, V., Stevens, W., "TCP Congestion [Allman99] Allman, M., Paxson, V., Stevens, W., "TCP Congestion
Control", RFC-2581, NASA Glenn/Sterling Software, ACIRI / ICSI, Control", RFC 2581, NASA Glenn/Sterling Software, ACIRI / ICSI,
April 1999. April 1999.
[Borman99] Borman, D., Deering, S., and Hinden, R, "IPv6 [Borman99] Borman, D., Deering, S., and Hinden, R, "IPv6
Jumbograms" RFC-2675, August 1999. Jumbograms" RFC 2675, August 1999.
[Braden89] Braden, R., editor, "Requirements for Internet Hosts -- [Braden89] Braden, R., editor, "Requirements for Internet Hosts --
Communication Layers", RFC-1122, October, 1989 Communication Layers", RFC 1122, October, 1989
[Floyd00] Floyd, S., Mahdavi, J., Mathis, M., Podolsky, M., "An [Floyd00] Floyd, S., Mahdavi, J., Mathis, M., Podolsky, M., "An
Extension to the Selective Acknowledgement (SACK) Option for TCP", Extension to the Selective Acknowledgement (SACK) Option for TCP",
RFC-2883, July 2000. RFC 2883, July 2000.
[Blanton03] Blanton, E., Allman, M., Fall, K., Wang, L., "A [Blanton03] Blanton, E., Allman, M., Fall, K., Wang, L., "A
Conservative Selective Acknowledgment (SACK)-based Loss Recovery Conservative Selective Acknowledgment (SACK)-based Loss Recovery
Algorithm for TCP", RFC-3517, April 2003. Algorithm for TCP", RFC 3517, April 2003.
[Garlick77] Garlick, L., R. Rom, and J. Postel, "Issues in [Garlick77] Garlick, L., R. Rom, and J. Postel, "Issues in
Reliable Host-to-Host Protocols", Proc. Second Berkeley Workshop Reliable Host-to-Host Protocols", Proc. Second Berkeley Workshop
on Distributed Data Management and Computer Networks, May 1977. on Distributed Data Management and Computer Networks, May 1977.
[Hamming77] Hamming, R., "Digital Filters", ISBN 0-13-212571-4, [Hamming77] Hamming, R., "Digital Filters", ISBN 0-13-212571-4,
Prentice Hall, Englewood Cliffs, N.J., 1977. Prentice Hall, Englewood Cliffs, N.J., 1977.
[Heffner07] Heffner, J., Mathis, M., and Chandler, B., "IPv4 [Heffner07] Heffner, J., Mathis, M., and Chandler, B., "IPv4
Reassembly Errors at High Data Rates" RFC-4963, PSC, July 2007. Reassembly Errors at High Data Rates" RFC 4963, PSC, July 2007.
[Jacobson88a] Jacobson, V., "Congestion Avoidance and Control", [Jacobson88a] Jacobson, V., "Congestion Avoidance and Control",
SIGCOMM '88, Stanford, CA., August 1988. SIGCOMM '88, Stanford, CA., August 1988.
[Jacobson88b] Jacobson, V., and R. Braden, "TCP Extensions for [Jacobson88b] Jacobson, V., and R. Braden, "TCP Extensions for
Long-Delay Paths", RFC-1072, LBL and USC/Information Sciences Long-Delay Paths", RFC 1072, LBL and USC/Information Sciences
Institute, October 1988. Institute, October 1988.
[Jacobson90a] Jacobson, V., "4BSD Header Prediction", ACM [Jacobson90a] Jacobson, V., "4BSD Header Prediction", ACM
Computer Communication Review, April 1990. Computer Communication Review, April 1990.
[Jacobson90b] Jacobson, V., Braden, R., and Zhang, L., "TCP [Jacobson90b] Jacobson, V., Braden, R., and Zhang, L., "TCP
Extension for High-Speed Paths", RFC-1185, LBL and USC/Information Extension for High-Speed Paths", RFC 1185, LBL and USC/Information
Sciences Institute, October 1990. Sciences Institute, October 1990.
[Jacobson90c] Jacobson, V., "Modified TCP congestion avoidance [Jacobson90c] Jacobson, V., "Modified TCP congestion avoidance
algorithm", Message to end2end-interest mailing list, April 1990. algorithm", Message to end2end-interest mailing list, April 1990.
[Jacobson92d] Jacobson, V., Braden, R., and Borman, D., "TCP [Jacobson92d] Jacobson, V., Braden, R., and Borman, D., "TCP
Extension for High Performance", RFC-1323, LBL, USC/Information Extension for High Performance", RFC 1323, LBL, USC/Information
Sciences Institute and Cray Research, May 1992. Sciences Institute and Cray Research, May 1992.
[Jain86] Jain, R., "Divergence of Timeout Algorithms for Packet [Jain86] Jain, R., "Divergence of Timeout Algorithms for Packet
Retransmissions", Proc. Fifth Phoenix Conf. on Comp. and Comm., Retransmissions", Proc. Fifth Phoenix Conf. on Comp. and Comm.,
Scottsdale, Arizona, March 1986. Scottsdale, Arizona, March 1986.
[Karn87] Karn, P. and C. Partridge, "Estimating Round-Trip Times [Karn87] Karn, P. and C. Partridge, "Estimating Round-Trip Times
in Reliable Transport Protocols", Proc. SIGCOMM '87, Stowe, VT, in Reliable Transport Protocols", Proc. SIGCOMM '87, Stowe, VT,
August 1987. August 1987.
[Martin03] Martin, D., "[Tsvwg] RFC 1323.bis" Message to tsvwg [Martin03] Martin, D., "[Tsvwg] RFC 1323.bis" Message to tsvwg
mailing list, September 30, 2003. mailing list, September 30, 2003.
[Mathis96] Mathis, M., Mahdavi, J., Floyd, S., and Romanow, A., [Mathis96] Mathis, M., Mahdavi, J., Floyd, S., and Romanow, A.,
"TCP Selective Acknowledgment Options", RFC-2018, October, 1996. "TCP Selective Acknowledgment Options", RFC 2018, October, 1996.
[Mathis08] Mathis, M., "[tcpm] Example of 1323 window retraction
problemPer my comments at the microphone at TCPM...", Message to
the tcpm mailing list, March 2008.
[McKenzie89] McKenzie, A., "A Problem with the TCP Big Window [McKenzie89] McKenzie, A., "A Problem with the TCP Big Window
Option", RFC-1110, BBN STC, August 1989. Option", RFC 1110, BBN STC, August 1989.
[Nagle84] Nagle, J., "Congestion Control in IP/TCP [Nagle84] Nagle, J., "Congestion Control in IP/TCP
Internetworks", RFC-896, FACC, January 1984. Internetworks", RFC 896, FACC, January 1984.
[Postel83] Postel, J., "The TCP Maximum Segment Size and Related
Topics", RFC-879, ISI, November 1983.
[Watson81] Watson, R., "Timer-based Mechanisms in Reliable [Watson81] Watson, R., "Timer-based Mechanisms in Reliable
Transport Protocol Connection Management", Computer Networks, Vol. Transport Protocol Connection Management", Computer Networks, Vol.
5, 1981. 5, 1981.
[Zhang86] Zhang, L., "Why TCP Timers Don't Work Well", Proc. [Zhang86] Zhang, L., "Why TCP Timers Don't Work Well", Proc.
SIGCOMM '86, Stowe, Vt., August 1986. SIGCOMM '86, Stowe, Vt., August 1986.
APPENDIX A: IMPLEMENTATION SUGGESTIONS APPENDIX A: IMPLEMENTATION SUGGESTIONS
skipping to change at page 29, line 48 skipping to change at page 30, line 48
urgent mode until the next TCP packet arrives. That packet will urgent mode until the next TCP packet arrives. That packet will
update the urgent pointer to a new offset, and the user will update the urgent pointer to a new offset, and the user will
never have left urgent mode. never have left urgent mode.
Thus, to properly implement the Urgent Pointer, the sending TCP Thus, to properly implement the Urgent Pointer, the sending TCP
only has to check for overflow of the 16 bit Urgent Pointer only has to check for overflow of the 16 bit Urgent Pointer
field before filling it in. If it does overflow, then a value field before filling it in. If it does overflow, then a value
of 65535 should be inserted into the Urgent Pointer. of 65535 should be inserted into the Urgent Pointer.
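The overflow check described above amounts to a saturating conversion. A minimal sketch follows, assuming the sender tracks the urgent offset as a 32-bit quantity; the function name is hypothetical.

```c
#include <stdint.h>

/* Returns the value to place in the 16-bit Urgent Pointer field.
 * urgent_offset is the sender's 32-bit offset to the urgent data;
 * offsets that do not fit in 16 bits are clamped to 65535. */
static uint16_t urgent_pointer_field(uint32_t urgent_offset)
{
    return urgent_offset > 65535u ? 65535u : (uint16_t)urgent_offset;
}
```

With this clamp, an offset of 70000 is sent as 65535, and the receiver stays in urgent mode until a later segment carries an updated pointer.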
The same technique applies to IP Version 6, except in the case The same technique applies to IP Version 6, except in the case
of IPv6 Jumbograms. When IPv6 Jumbograms are supported, of IPv6 Jumbograms. When IPv6 Jumbograms are supported, RFC
RFC-2675 [Borman99] requires additional steps for dealing with 2675 [Borman99] requires additional steps for dealing with the
the Urgent Pointer, these are described in section 5.2 of Urgent Pointer, these are described in section 5.2 of RFC 2675.
RFC-2675.
TCP Options and MSS
There has been some confusion as to what value should be filled
in the TCP MSS option when using TCP options. RFC-879
[Postel83] stated:
The MSS counts only data octets in the segment, it does not
count the TCP header or the IP header.
which is unclear about what to do about TCP options. RFC-1122
[Braden89] attempted to clarify this in section 4.2.2.6, but
there still seems to be confusion.
So, the MSS value to be sent in an MSS option should be equal to
the effective MTU minus the fixed IP and TCP headers. Since
both IP and TCP options are ignored when calculating the value
for the MSS option, if there are any IP or TCP options to be
sent in a packet, then the sender must decrease the size of the
TCP data accordingly. The reason for this can be seen in the
following table:
+--------------------+--------------------+
| MSS is adjusted | MSS isn't adjusted |
| to include options | to include options |
+----------------+--------------------+--------------------+
| Sender adjusts | Packets are too | Packets are the |
| length for | short | correct length |
| options | | |
+----------------+--------------------+--------------------+
| Sender doesn't | Packets are the | Packets are too |
| adjust length | correct length | long. |
| for options | | |
+----------------+--------------------+--------------------+
Since the goal is to not send IP datagrams that have to be
fragmented, and packets sent with the constraints in the lower
right of this grid will cause IP fragmentation, the only way to
guarantee that this doesn't happen is for the data sender to
decrease the TCP data length by the size of the IP and TCP
options. And since the sender will be adjusting the TCP data
length when sending IP and TCP options, there is no need to
include the IP and TCP option lengths in the MSS value.
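The arithmetic behind the upper-right cell of the table can be sketched as follows. This is an illustration only: it assumes IPv4 with a 20-byte fixed IP header and a 20-byte fixed TCP header, and the function names are hypothetical.

```c
#include <stdint.h>

enum {
    IPV4_FIXED_HDR = 20,  /* assumed: IPv4 header without options */
    TCP_FIXED_HDR  = 20   /* TCP header without options */
};

/* Value to advertise in the MSS option: effective MTU minus the
 * fixed headers only; options are not counted. */
static uint32_t mss_option_value(uint32_t effective_mtu)
{
    return effective_mtu - IPV4_FIXED_HDR - TCP_FIXED_HDR;
}

/* Data octets the sender may place in one segment: the peer's MSS
 * reduced by whatever IP and TCP options this packet carries. */
static uint32_t max_segment_data(uint32_t peer_mss,
                                 uint32_t ip_opt_len,
                                 uint32_t tcp_opt_len)
{
    return peer_mss - ip_opt_len - tcp_opt_len;
}
```

For a 1500-byte MTU the advertised MSS is 1460; a sender that includes a Timestamps option padded to 12 bytes would then carry at most 1448 data octets per segment.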
APPENDIX B: DUPLICATES FROM EARLIER CONNECTION INCARNATIONS APPENDIX B: DUPLICATES FROM EARLIER CONNECTION INCARNATIONS
There are two cases to be considered: (1) a system crashing (and There are two cases to be considered: (1) a system crashing (and
losing connection state) and restarting, and (2) the same connection losing connection state) and restarting, and (2) the same connection
being closed and reopened without a loss of host state. These will being closed and reopened without a loss of host state. These will
be described in the following two sections. be described in the following two sections.
B.1 System Crash with Loss of State B.1 System Crash with Loss of State
skipping to change at page 32, line 29 skipping to change at page 32, line 35
2**16. However, high network speeds are not the major 2**16. However, high network speeds are not the major
contributor to this problem; the RTT is the limiting factor contributor to this problem; the RTT is the limiting factor
in how quickly connections can be opened and closed. in how quickly connections can be opened and closed.
Therefore, this problem will be no worse at high transfer Therefore, this problem will be no worse at high transfer
speeds. speeds.
(b) Allow old duplicate segments to expire. (b) Allow old duplicate segments to expire.
To replace this function of TIME-WAIT state, a mechanism To replace this function of TIME-WAIT state, a mechanism
would have to operate across connections. PAWS is defined would have to operate across connections. PAWS is defined
strictly within a single connection; the last timestamp is strictly within a single connection; the last timestamp
TS.Recent is kept in the connection control block, and (TS.Recent) is kept in the connection control block, and
discarded when a connection is closed. discarded when a connection is closed.
An additional mechanism could be added to the TCP, a per-host An additional mechanism could be added to the TCP, a per-host
cache of the last timestamp received from any connection. cache of the last timestamp received from any connection.
This value could then be used in the PAWS mechanism to reject This value could then be used in the PAWS mechanism to reject
old duplicate segments from earlier incarnations of the old duplicate segments from earlier incarnations of the
connection, if the timestamp clock can be guaranteed to have connection, if the timestamp clock can be guaranteed to have
ticked at least once since the old connection was open. This ticked at least once since the old connection was open. This
would require that the TIME-WAIT delay plus the RTT together would require that the TIME-WAIT delay plus the RTT together
must be at least one tick of the sender's timestamp clock. must be at least one tick of the sender's timestamp clock.
Such an extension is not part of the proposal of this RFC. Such an extension is not part of the proposal of this RFC.
Note that this is a variant on the mechanism proposed by Note that this is a variant on the mechanism proposed by
Garlick, Rom, and Postel [Garlick77], which required each Garlick, Rom, and Postel [Garlick77], which required each
host to maintain connection records containing the highest host to maintain connection records containing the highest
sequence numbers on every connection. Using timestamps sequence numbers on every connection. Using timestamps
instead, it is only necessary to keep one quantity per remote instead, it is only necessary to keep one quantity per remote
host, regardless of the number of simultaneous connections to host, regardless of the number of simultaneous connections to
that host. that host.
APPENDIX C: CHANGES FROM RFC-1072, RFC-1185, RFC-1323 APPENDIX C: CHANGES FROM RFC 1072, RFC 1185, RFC 1323
The protocol extensions defined in RFC-1323 document differ in The protocol extensions defined in RFC 1323 document differ in
several important ways from those defined in RFC-1072 and RFC-1185. several important ways from those defined in RFC 1072 and RFC 1185.
(a) SACK has been split off into a separate document, RFC-2018 (a) SACK has been split off into a separate document, RFC 2018
[Mathis96]. [Mathis96].
(b) The detailed rules for sending timestamp replies (see Section (b) The detailed rules for sending timestamp replies (see Section
3.4) differ in important ways. The earlier rules could result 3.4) differ in important ways. The earlier rules could result
in an under-estimate of the RTT in certain cases (packets in an under-estimate of the RTT in certain cases (packets
dropped or out of order). dropped or out of order).
(c) The same value TS.Recent is now shared by the two distinct (c) The same value TS.Recent is now shared by the two distinct
mechanisms RTTM and PAWS. This simplification became possible mechanisms RTTM and PAWS. This simplification became possible
because of change (b). because of change (b).
(d) An ambiguity in RFC-1185 was resolved in favor of putting (d) An ambiguity in RFC 1185 was resolved in favor of putting
timestamps on ACK as well as data segments. This supports the timestamps on ACK as well as data segments. This supports the
symmetry of the underlying TCP protocol. symmetry of the underlying TCP protocol.
(e) The echo and echo reply options of RFC-1072 were combined into a (e) The echo and echo reply options of RFC 1072 were combined into a
single Timestamps option, to reflect the symmetry and to single Timestamps option, to reflect the symmetry and to
simplify processing. simplify processing.
(f) The problem of outdated timestamps on long-idle connections, (f) The problem of outdated timestamps on long-idle connections,
discussed in Section 4.2.2, was realized and resolved. discussed in Section 4.2.2, was realized and resolved.
(g) RFC-1185 recommended that header prediction take precedence over (g) RFC 1185 recommended that header prediction take precedence over
the timestamp check. Based upon some scepticism about the the timestamp check. Based upon some skepticism about the
probabilistic arguments given in Section 4.2.4, it was decided probabilistic arguments given in Section 4.2.4, it was decided
to recommend that the timestamp check be performed first. to recommend that the timestamp check be performed first.
(h) The spec was modified so that the extended options will be sent (h) The spec was modified so that the extended options will be sent
on <SYN,ACK> segments only when they are received in the on <SYN,ACK> segments only when they are received in the
corresponding <SYN> segments. This provides the most corresponding <SYN> segments. This provides the most
conservative possible conditions for interoperation with conservative possible conditions for interoperation with
implementations without the extensions. implementations without the extensions.
In addition to these substantive changes, the present RFC attempts to In addition to these substantive changes, the present RFC attempts to
specify the algorithms unambiguously by presenting modifications to specify the algorithms unambiguously by presenting modifications to
the Event Processing rules of RFC-793; see Appendix F. the Event Processing rules of RFC 793; see Appendix F.
There are additional changes in this document from RFC-1323. These There are additional changes in this document from RFC 1323. These
changes are: changes are:
(a) The description of which TSecr values can be used to update the (a) The description of which TSecr values can be used to update the
measured RTT has been clarified. Specifically, with Timestamps, measured RTT has been clarified. Specifically, with Timestamps,
the Karn algorithm [Karn87] is disabled. The Karn algorithm the Karn algorithm [Karn87] is disabled. The Karn algorithm
disables all RTT measurements during retransmission, since it is disables all RTT measurements during retransmission, since it is
ambiguous whether the ACK is for the original packet, or the ambiguous whether the ACK is for the original packet, or the
retransmitted packet. With Timestamps, that ambiguity is retransmitted packet. With Timestamps, that ambiguity is
removed since the TSecr in the ACK will contain the TSval from removed since the TSecr in the ACK will contain the TSval from
whichever data packet made it to the destination. whichever data packet made it to the destination.
(b) In RFC-1323, section 3.4, step (2) of the algorithm to control (b) In RFC 1323, section 3.4, step (2) of the algorithm to control
which timestamp is echoed was incorrect in two regards: which timestamp is echoed was incorrect in two regards:
(1) It failed to update TSrecent for a retransmitted segment (1) It failed to update TSrecent for a retransmitted segment
that resulted from a lost ACK. that resulted from a lost ACK.
(2) It failed if SEG.LEN = 0. (2) It failed if SEG.LEN = 0.
In the new algorithm, the case of SEG.TSval = TSrecent is In the new algorithm, the case of SEG.TSval = TSrecent is
included for consistency with the PAWS test. included for consistency with the PAWS test.
(c) One correction was made to the Event Processing Summary in (c) One correction was made to the Event Processing Summary in
Appendix F. In SEND CALL/ESTABLISHED STATE, RCV.WND is used to Appendix F. In SEND CALL/ESTABLISHED STATE, RCV.WND is used to
fill in the SEG.WND value, not SND.WND. fill in the SEG.WND value, not SND.WND.
(d) New pseudo-code summary has been added in Appendix E. (d) New pseudo-code summary has been added in Appendix E.
(e) Appendix A has been expanded with information about the TCP MSS (e) Appendix A has been expanded with information about the TCP MSS
option and the TCP Urgent Pointer. option and the TCP Urgent Pointer.
(f) It is now recommended that Timestamps options be included RST (f) It is now recommended that Timestamps options be included in RST
packets if the incoming packet contained a Timestamps option. packets if the incoming packet contained a Timestamps option.
(g) RST packets are explicitly excluded from PAWS processing. (g) RST packets are explicitly excluded from PAWS processing.
(h) Snd.TSoffset and Snd.TSclock variables have been added. (h) Snd.TSoffset and Snd.TSclock variables have been added.
Snd.TSclock is the sum of my.TSclock and Snd.TSoffset. This Snd.TSclock is the sum of my.TSclock and Snd.TSoffset. This
allows the starting points for timestamps to be randomized on a allows the starting points for timestamps to be randomized on a
per-connection basis. Setting Snd.TSoffset to zero yields the per-connection basis. Setting Snd.TSoffset to zero yields the
same results as RFC-1323. same results as RFC 1323.
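The relationship between the two variables in item (h) follows the definition given in Appendix D (Snd.TSclock: my.TSclock + Snd.TSoffset). A minimal sketch, with a hypothetical struct name, and with the choice of random offset left to the implementation:

```c
#include <stdint.h>

/* Hypothetical per-connection state holding the random offset. */
struct ts_conn {
    uint32_t snd_ts_offset;   /* Snd.TSoffset, fixed for the connection */
};

/* Snd.TSclock = my.TSclock + Snd.TSoffset, wrapping modulo 2**32. */
static uint32_t snd_tsclock(const struct ts_conn *c, uint32_t my_tsclock)
{
    return my_tsclock + c->snd_ts_offset;
}
```

An offset of zero returns my.TSclock unchanged, which matches the RFC 1323 behavior noted in item (h).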
APPENDIX D: SUMMARY OF NOTATION APPENDIX D: SUMMARY OF NOTATION
The following notation has been used in this document. The following notation has been used in this document.
Options Options
WSopt: TCP Window Scale Option WSopt: TCP Window Scale Option
TSopt: TCP Timestamps Option TSopt: TCP Timestamps Option
skipping to change at page 35, line 28 skipping to change at page 36, line 28
TSecr: 32-bit Timestamp Reply field in TSopt. TSecr: 32-bit Timestamp Reply field in TSopt.
Option Fields in Current Segment Option Fields in Current Segment
SEG.TSval: TSval field from TSopt in current segment. SEG.TSval: TSval field from TSopt in current segment.
SEG.TSecr: TSecr field from TSopt in current segment. SEG.TSecr: TSecr field from TSopt in current segment.
SEG.WSopt: 8-bit value in WSopt SEG.WSopt: 8-bit value in WSopt
Clock Values Clock Values
my.TSclock: System Wide Local source of 32-bit timestamp values my.TSclock: System wide source of 32-bit timestamp values
my.TSclock.rate: Period of my.TSclock (1 ms to 1 sec). my.TSclock.rate: Period of my.TSclock (1 ms to 1 sec).
Snd.TSoffset: An offset for randomizing Snd.TSclock Snd.TSoffset: An offset for randomizing Snd.TSclock
Snd.TSclock: my.TSclock + Snd.TSoffset Snd.TSclock: my.TSclock + Snd.TSoffset
Per-Connection State Variables Per-Connection State Variables
TS.Recent: Latest received Timestamp TS.Recent: Latest received Timestamp
Last.ACK.sent: Last ACK field sent Last.ACK.sent: Last ACK field sent
Snd.TS.OK: 1-bit flag Snd.TS.OK: 1-bit flag
skipping to change at page 36, line 28 skipping to change at page 37, line 28
SEG.WND = MIN( RCV.WND, 65535 ); SEG.WND = MIN( RCV.WND, 65535 );
Include in segment: TSopt(TSval=Snd.TSclock, TSecr=0); Include in segment: TSopt(TSval=Snd.TSclock, TSecr=0);
Include in segment: WSopt = Rcv.wind.scale; Include in segment: WSopt = Rcv.wind.scale;
} }
Send {SYN, ACK} segment => { Send {SYN, ACK} segment => {
SEG.ACK = Last.ACK.sent = RCV.NXT; SEG.ACK = Last.ACK.sent = RCV.NXT;
SEG.WND = MIN( RCV.WND, 65535 ); SEG.WND = MIN( RCV.WND, 65535 );
if (Snd.TS.OK) then if (Snd.TS.OK) then
Include in segment: TSopt(TSval=Snd.TSclock, TSecr=TS.Recent); Include in segment:
TSopt(TSval=Snd.TSclock, TSecr=TS.Recent);
if (Snd.WS.OK) then if (Snd.WS.OK) then
Include in segment: WSopt = Rcv.wind.scale; Include in segment: WSopt = Rcv.wind.scale;
} }
Receive {SYN} or {SYN,ACK} segment => { Receive {SYN} or {SYN,ACK} segment => {
if (Segment contains TSopt) then { if (Segment contains TSopt) then {
TS.Recent = SEG.TSval; TS.Recent = SEG.TSval;
Snd.TS.OK = TRUE; Snd.TS.OK = TRUE;
if (is {SYN,ACK} segment) then if (is {SYN,ACK} segment) then
skipping to change at page 37, line 11 skipping to change at page 38, line 12
} }
else else
Rcv.wind.scale = Snd.wind.scale = 0; Rcv.wind.scale = Snd.wind.scale = 0;
} }
Send non-SYN segment => { Send non-SYN segment => {
SEG.ACK = Last.ACK.sent = RCV.NXT; SEG.ACK = Last.ACK.sent = RCV.NXT;
SEG.WND = MIN( RCV.WND >> Rcv.wind.scale, 65535 ); SEG.WND = MIN( RCV.WND >> Rcv.wind.scale, 65535 );
if (Snd.TS.OK) then if (Snd.TS.OK) then
Include in segment: TSopt(TSval=Snd.TSclock, TSecr=TS.Recent); Include in segment:
TSopt(TSval=Snd.TSclock, TSecr=TS.Recent);
} }
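The window scaling steps in this pseudo-code (right shift when sending, left shift when receiving) can be written out in C as a rough sketch. The function names are hypothetical, and the shift counts are assumed to have been negotiated on the SYN segments and to be no larger than 14.

```c
#include <stdint.h>

/* Sending side: SEG.WND = MIN(RCV.WND >> Rcv.wind.scale, 65535). */
static uint16_t seg_wnd_to_send(uint32_t rcv_wnd, unsigned rcv_wind_scale)
{
    uint32_t scaled = rcv_wnd >> rcv_wind_scale;
    return scaled > 65535u ? 65535u : (uint16_t)scaled;
}

/* Receiving side: Window = SEG.WND << Snd.wind.scale, kept in
 * 32 bits for the rest of segment processing. */
static uint32_t window_from_seg(uint16_t seg_wnd, unsigned snd_wind_scale)
{
    return (uint32_t)seg_wnd << snd_wind_scale;
}
```

A 1 MiB receive window with a shift of 7 is advertised as 8192 and recovered exactly by the peer; window sizes that are not a multiple of 2**shift are rounded down by the right shift.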
Receive non-SYN segment in (state >= ESTABLISHED) => { Receive non-SYN segment in (state >= ESTABLISHED) => {
Window = (SEG.WND << Snd.wind.scale); Window = (SEG.WND << Snd.wind.scale);
/* Use 32-bit 'Window' instead of 16-bit 'SEG.WND' /* Use 32-bit 'Window' instead of 16-bit 'SEG.WND'
* in rest of processing. * in rest of processing.
*/ */
if (Segment contains TSopt) then { if (Segment contains TSopt) then {
skipping to change at page 44, line 7 skipping to change at page 46, line 7
If the Snd.TS.OK bit is on, include Timestamps option If the Snd.TS.OK bit is on, include Timestamps option
<TSval=Snd.TSclock,TSecr=TS.Recent> in this ACK segment. Set <TSval=Snd.TSclock,TSecr=TS.Recent> in this ACK segment. Set
Last.ACK.sent to SEG.ACK of the acknowledgment, and send it. Last.ACK.sent to SEG.ACK of the acknowledgment, and send it.
This acknowledgment should be piggy-backed on a segment being This acknowledgment should be piggy-backed on a segment being
transmitted if possible without incurring undue delay. transmitted if possible without incurring undue delay.
... ...
APPENDIX G: Timestamps Edge Cases APPENDIX G: Timestamps Edge Cases
While the rules layed out for when to calculate RTTM produce the While the rules laid out for when to calculate RTTM produce the
correct results most of the time, there are some edge cases where an correct results most of the time, there are some edge cases where an
incorrect RTTM can be calculated. All of these situations involve incorrect RTTM can be calculated. All of these situations involve
the loss of packets. It is felt that these scenarios are rare, and the loss of packets. It is felt that these scenarios are rare, and
that if they should happen, they will cause a single RTTM measurement that if they should happen, they will cause a single RTTM measurement
to be inflated, which mitigates its effects on RTO calculations. to be inflated, which mitigates its effects on RTO calculations.
[Martin03] cites two similar cases when the returning ACK is lost, [Martin03] cites two similar cases when the returning ACK is lost,
and before the retransmission timer fires, another returning packet and before the retransmission timer fires, another returning packet
arrives, which ACKs the data. In this case, the RTTM calculated will arrives, which ACKs the data. In this case, the RTTM calculated will
be inflated: be inflated:
skipping to change at page 45, line 19 skipping to change at line 2067
Phone: (310) 448-9173 Phone: (310) 448-9173
EMail: Braden@ISI.EDU EMail: Braden@ISI.EDU
Van Jacobson Van Jacobson
Packet Design Packet Design
2465 Latham Street 2465 Latham Street
Mountain View, CA 94040 Mountain View, CA 94040
EMail: van@packetdesign.com EMail: van@packetdesign.com
Full Copyright Statement
Copyright (C) The IETF Trust (2008).
This document is subject to the rights, licenses and restrictions
contained in BCP 78, and except as set forth therein, the authors
retain all their rights.
This document and the information contained herein are provided on an
"AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Intellectual Property
The IETF takes no position regarding the validity or scope of any
Intellectual Property Rights or other rights that might be claimed to
pertain to the implementation or use of the technology described in
this document or the extent to which any license under such rights
might or might not be available; nor does it represent that it has
made any independent effort to identify any such rights. Information
on the procedures with respect to rights in RFC documents can be
found in BCP 78 and BCP 79.
Copies of IPR disclosures made to the IETF Secretariat and any
assurances of licenses to be made available, or the result of an
attempt made to obtain a general license or permission for the use of
such proprietary rights by implementers or users of this
specification can be obtained from the IETF on-line IPR repository at
http://www.ietf.org/ipr.
The IETF invites any interested party to bring to its attention any
copyrights, patents or patent applications, or other proprietary
rights that may cover technology that may be required to implement
this standard. Please address the information to the IETF at ietf-
ipr@ietf.org.
 End of changes. 92 change blocks. 
234 lines changed or deleted 238 lines changed or added
