draft-ietf-tsvwg-byte-pkt-congest-02.txt | draft-ietf-tsvwg-byte-pkt-congest-03.txt | |||
---|---|---|---|---|
Transport Area Working Group B. Briscoe | Transport Area Working Group B. Briscoe | |||
Internet-Draft BT | Internet-Draft BT | |||
Updates: 2309 (if approved) J. Manner | Updates: 2309 (if approved) J. Manner | |||
Intended status: Informational Aalto University | Intended status: Informational Aalto University | |||
Expires: January 13, 2011 July 12, 2010 | Expires: April 27, 2011 October 24, 2010 | |||
Byte and Packet Congestion Notification | Byte and Packet Congestion Notification | |||
draft-ietf-tsvwg-byte-pkt-congest-02 | draft-ietf-tsvwg-byte-pkt-congest-03 | |||
Abstract | Abstract | |||
This memo concerns dropping or marking packets using active queue | This memo concerns dropping or marking packets using active queue | |||
management (AQM) such as random early detection (RED) or pre- | management (AQM) such as random early detection (RED) or pre- | |||
congestion notification (PCN). We give two strong recommendations: | congestion notification (PCN). We give three strong recommendations: | |||
(1) packet size should be taken into account when transports read | (1) packet size should be taken into account when transports read | |||
congestion indications, not when network equipment writes them, and | congestion indications, (2) packet size should not be taken into | |||
(2) byte-mode packet drop variant of AQM algorithms, such as RED, | account when network equipment creates congestion signals (marking, | |||
should not be used to drop fewer small packets. | dropping), and therefore (3) the byte-mode packet drop variant of the | |||
RED AQM algorithm that drops fewer small packets should not be used. | ||||
Status of This Memo | Status of This Memo | |||
This Internet-Draft is submitted in full conformance with the | This Internet-Draft is submitted in full conformance with the | |||
provisions of BCP 78 and BCP 79. | provisions of BCP 78 and BCP 79. | |||
Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
Task Force (IETF). Note that other groups may also distribute | Task Force (IETF). Note that other groups may also distribute | |||
working documents as Internet-Drafts. The list of current Internet- | working documents as Internet-Drafts. The list of current Internet- | |||
Drafts is at http://datatracker.ietf.org/drafts/current/. | Drafts is at http://datatracker.ietf.org/drafts/current/. | |||
Internet-Drafts are draft documents valid for a maximum of six months | Internet-Drafts are draft documents valid for a maximum of six months | |||
and may be updated, replaced, or obsoleted by other documents at any | and may be updated, replaced, or obsoleted by other documents at any | |||
time. It is inappropriate to use Internet-Drafts as reference | time. It is inappropriate to use Internet-Drafts as reference | |||
material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
This Internet-Draft will expire on January 13, 2011. | This Internet-Draft will expire on April 27, 2011. | |||
Copyright Notice | Copyright Notice | |||
Copyright (c) 2010 IETF Trust and the persons identified as the | Copyright (c) 2010 IETF Trust and the persons identified as the | |||
document authors. All rights reserved. | document authors. All rights reserved. | |||
This document is subject to BCP 78 and the IETF Trust's Legal | This document is subject to BCP 78 and the IETF Trust's Legal | |||
Provisions Relating to IETF Documents | Provisions Relating to IETF Documents | |||
(http://trustee.ietf.org/license-info) in effect on the date of | (http://trustee.ietf.org/license-info) in effect on the date of | |||
publication of this document. Please review these documents | publication of this document. Please review these documents | |||
carefully, as they describe your rights and restrictions with respect | carefully, as they describe your rights and restrictions with respect | |||
to this document. Code Components extracted from this document must | to this document. Code Components extracted from this document must | |||
include Simplified BSD License text as described in Section 4.e of | include Simplified BSD License text as described in Section 4.e of | |||
the Trust Legal Provisions and are provided without warranty as | the Trust Legal Provisions and are provided without warranty as | |||
described in the Simplified BSD License. | described in the Simplified BSD License. | |||
Table of Contents | Table of Contents | |||
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 | 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 5 | |||
1.1. Terminology and Scoping . . . . . . . . . . . . . . . . . 6 | 1.1. Terminology and Scoping . . . . . . . . . . . . . . . . . 7 | |||
1.2. Why now? . . . . . . . . . . . . . . . . . . . . . . . . . 7 | 1.2. Why now? . . . . . . . . . . . . . . . . . . . . . . . . . 8 | |||
2. Motivating Arguments . . . . . . . . . . . . . . . . . . . . . 8 | 2. Motivating Arguments . . . . . . . . . . . . . . . . . . . . . 10 | |||
2.1. Scaling Congestion Control with Packet Size . . . . . . . 8 | 2.1. Scaling Congestion Control with Packet Size . . . . . . . 10 | |||
2.2. Avoiding Perverse Incentives to (ab)use Smaller Packets . 10 | 2.2. Transport-Independent Network . . . . . . . . . . . . . . 10 | |||
2.3. Small != Control . . . . . . . . . . . . . . . . . . . . . 11 | 2.3. Avoiding Perverse Incentives to (Ab)use Smaller Packets . 11 | |||
2.4. Implementation Efficiency . . . . . . . . . . . . . . . . 11 | 2.4. Small != Control . . . . . . . . . . . . . . . . . . . . . 12 | |||
3. The State of the Art . . . . . . . . . . . . . . . . . . . . . 11 | 2.5. Implementation Efficiency . . . . . . . . . . . . . . . . 13 | |||
3.1. Congestion Measurement: Status . . . . . . . . . . . . . . 12 | 3. Recommendations . . . . . . . . . . . . . . . . . . . . . . . 13 | |||
3.1.1. Fixed Size Packet Buffers . . . . . . . . . . . . . . 13 | 3.1. Recommendation on Queue Measurement . . . . . . . . . . . 13 | |||
3.1.2. Congestion Measurement without a Queue . . . . . . . . 14 | 3.2. Recommendation on Notifying Congestion . . . . . . . . . . 13 | |||
3.2. Congestion Coding: Status . . . . . . . . . . . . . . . . 14 | 3.3. Recommendation on Responding to Congestion . . . . . . . . 14 | |||
3.2.1. Network Bias when Encoding . . . . . . . . . . . . . . 14 | 3.4. Recommended Future Research . . . . . . . . . . . . . . . 15 | |||
3.2.2. Transport Bias when Decoding . . . . . . . . . . . . . 16 | 4. A Survey and Critique of Past Advice . . . . . . . . . . . . . 15 | |||
3.2.3. Making Transports Robust against Control Packet | 4.1. Congestion Measurement Advice . . . . . . . . . . . . . . 16 | |||
Losses . . . . . . . . . . . . . . . . . . . . . . . . 17 | 4.1.1. Fixed Size Packet Buffers . . . . . . . . . . . . . . 16 | |||
3.2.4. Congestion Coding: Summary of Status . . . . . . . . . 18 | 4.1.2. Congestion Measurement without a Queue . . . . . . . . 17 | |||
4. Outstanding Issues and Next Steps . . . . . . . . . . . . . . 20 | 4.2. Congestion Notification Advice . . . . . . . . . . . . . . 18 | |||
4.1. Bit-congestible World . . . . . . . . . . . . . . . . . . 20 | 4.2.1. Network Bias when Encoding . . . . . . . . . . . . . . 18 | |||
4.2. Bit- & Packet-congestible World . . . . . . . . . . . . . 21 | 4.2.2. Transport Bias when Decoding . . . . . . . . . . . . . 20 | |||
5. Recommendation and Conclusions . . . . . . . . . . . . . . . . 22 | 4.2.3. Making Transports Robust against Control Packet | |||
5.1. Recommendation on Queue Measurement . . . . . . . . . . . 22 | Losses . . . . . . . . . . . . . . . . . . . . . . . . 21 | |||
5.2. Recommendation on Notifying Congestion . . . . . . . . . . 23 | 4.2.4. Congestion Notification: Summary of Conflicting | |||
5.3. Recommendation on Responding to Congestion . . . . . . . . 24 | Advice . . . . . . . . . . . . . . . . . . . . . . . . 22 | |||
5.4. Recommended Future Research . . . . . . . . . . . . . . . 24 | 4.2.5. RED Implementation Status . . . . . . . . . . . . . . 23 | |||
6. Security Considerations . . . . . . . . . . . . . . . . . . . 24 | 5. Outstanding Issues and Next Steps . . . . . . . . . . . . . . 24 | |||
7. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 25 | 5.1. Bit-congestible World . . . . . . . . . . . . . . . . . . 24 | |||
8. Comments Solicited . . . . . . . . . . . . . . . . . . . . . . 25 | 5.2. Bit- & Packet-congestible World . . . . . . . . . . . . . 25 | |||
9. References . . . . . . . . . . . . . . . . . . . . . . . . . . 25 | 6. Security Considerations . . . . . . . . . . . . . . . . . . . 26 | |||
9.1. Normative References . . . . . . . . . . . . . . . . . . . 25 | 7. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . 27 | |||
9.2. Informative References . . . . . . . . . . . . . . . . . . 26 | 8. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 27 | |||
Appendix A. Congestion Notification Definition: Further | 9. Comments Solicited . . . . . . . . . . . . . . . . . . . . . . 28 | |||
Justification . . . . . . . . . . . . . . . . . . . . 30 | 10. References . . . . . . . . . . . . . . . . . . . . . . . . . . 28 | |||
Appendix B. Idealised Wire Protocol . . . . . . . . . . . . . . . 30 | 10.1. Normative References . . . . . . . . . . . . . . . . . . . 28 | |||
B.1. Protocol Coding . . . . . . . . . . . . . . . . . . . . . 30 | 10.2. Informative References . . . . . . . . . . . . . . . . . . 29 | |||
B.2. Example Scenarios . . . . . . . . . . . . . . . . . . . . 32 | Appendix A. Idealised Wire Protocol . . . . . . . . . . . . . . . 32 | |||
B.2.1. Notation . . . . . . . . . . . . . . . . . . . . . . . 32 | A.1. Protocol Coding . . . . . . . . . . . . . . . . . . . . . 32 | |||
B.2.2. Bit-congestible resource, equal bit rates (Ai) . . . . 32 | A.2. Example Scenarios . . . . . . . . . . . . . . . . . . . . 34 | |||
B.2.3. Bit-congestible resource, equal packet rates (Bi) . . 33 | A.2.1. Notation . . . . . . . . . . . . . . . . . . . . . . . 34 | |||
B.2.4. Pkt-congestible resource, equal bit rates (Aii) . . . 34 | A.2.2. Bit-congestible resource, equal bit rates (Ai) . . . . 34 | |||
B.2.5. Pkt-congestible resource, equal packet rates (Bii) . . 35 | A.2.3. Bit-congestible resource, equal packet rates (Bi) . . 35 | |||
A.2.4. Pkt-congestible resource, equal bit rates (Aii) . . . 36 | ||||
A.2.5. Pkt-congestible resource, equal packet rates (Bii) . . 37 | ||||
Appendix B. Byte-mode Drop Complicates Policing Congestion | ||||
Response . . . . . . . . . . . . . . . . . . . . . . 37 | ||||
Appendix C. Byte-mode Drop Complicates Policing Congestion | Appendix C. Changes from Previous Versions . . . . . . . . . . . 38 | |||
Response . . . . . . . . . . . . . . . . . . . . . . 35 | ||||
Appendix D. Changes from Previous Versions . . . . . . . . . . . 36 | ||||
1. Introduction | 1. Introduction | |||
This memo is initially concerned with how we should correctly scale | ||||
congestion control functions with packet size for the long term. But | ||||
it also recognises that expediency may be necessary to deal with | ||||
existing widely deployed protocols that don't live up to the long | ||||
term goal. | ||||
When notifying congestion, the problem of how (and whether) to take | When notifying congestion, the problem of how (and whether) to take | |||
packet sizes into account has exercised the minds of researchers and | packet sizes into account has exercised the minds of researchers and | |||
practitioners for as long as active queue management (AQM) has been | practitioners for as long as active queue management (AQM) has been | |||
discussed. Indeed, one reason AQM was originally introduced was to | discussed. Indeed, one reason AQM was originally introduced was to | |||
reduce the lock-out effects that small packets can have on large | reduce the lock-out effects that small packets can have on large | |||
packets in drop-tail queues. This memo aims to state the principles | packets in drop-tail queues. This memo aims to state the principles | |||
we should be using and to come to conclusions on what these | we should be using and to come to conclusions on what these | |||
principles will mean for future protocol design, taking into account | principles will mean for future protocol design, taking into account | |||
the deployments we have already. | the deployments we have already. | |||
The byte vs. packet dilemma arises at three stages in the congestion | The byte vs. packet dilemma arises at three stages in the congestion | |||
notification process: | notification process: | |||
Measuring congestion: When the congested resource decides locally to | Measuring congestion: When the congested resource decides locally to | |||
measure how congested it is. (Should the queue measure its length | measure how congested it is, should the queue measure its length | |||
in bytes or packets?); | in bytes or packets? | |||
Coding congestion notification into the wire protocol: When the | Encoding congestion notification into the wire protocol: When the | |||
congested resource decides whether to notify the level of | congested network resource decides whether to notify the level of | |||
congestion on each particular packet. (When a queue considers | congestion by dropping or marking a particular packet, should its | |||
whether to notify congestion by dropping or marking a particular | decision depend on the byte-size of the particular packet being | |||
packet, should its decision depend on the byte-size of the | dropped or marked? | |||
particular packet being dropped or marked?); | ||||
Decoding congestion notification from the wire protocol: When the | Decoding congestion notification from the wire protocol: When the | |||
transport interprets the notification in order to decide how much | transport interprets the notification in order to decide how much | |||
to respond to congestion. (Should the transport take into account | to respond to congestion, should it take into account the byte- | |||
the byte-size of each missing or marked packet?). | size of each missing or marked packet? | |||
Consensus has emerged over the years concerning the first stage: | Consensus has emerged over the years concerning the first stage: | |||
whether queues are measured in bytes or packets, termed byte-mode | whether queues are measured in bytes or packets, termed byte-mode | |||
queue measurement or packet-mode queue measurement. This memo | queue measurement or packet-mode queue measurement. This memo | |||
records this consensus in the RFC Series. In summary the choice | records this consensus in the RFC Series. In summary the choice | |||
solely depends on whether the resource is congested by bytes or | solely depends on whether the resource is congested by bytes or | |||
packets. | packets. | |||
The controversy is mainly around the last two stages to do with | The controversy is mainly around the last two stages: whether to | |||
encoding congestion notification into packets: whether to allow for | allow for the size of the specific packet notifying congestion i) | |||
the size of the specific packet notifying congestion i) when the | when the network encodes or ii) when the transport decodes the | |||
network encodes or ii) when the transport decodes the congestion | congestion notification. | |||
notification. | ||||
Currently, the RFC series is silent on this matter other than a paper | Currently, the RFC series is silent on this matter other than a paper | |||
trail of advice referenced from [RFC2309], which conditionally | trail of advice referenced from [RFC2309], which conditionally | |||
recommends byte-mode (packet-size dependent) drop [pktByteEmail]. | recommends byte-mode (packet-size dependent) drop [pktByteEmail]. | |||
Reducing drop of small packets certainly has some tempting | ||||
advantages: i) it drops fewer control packets, which tend to be small | ||||
and ii) it makes TCP's bit-rate less dependent on packet size. | ||||
However, there are ways of addressing these issues at the transport | ||||
layer, rather than reverse engineering network forwarding to fix the | ||||
problems of one specific transport. | ||||
The primary purpose of this memo is to build a definitive consensus | The primary purpose of this memo is to build a definitive consensus | |||
against such deliberate preferential treatment for small packets in | against deliberate preferential treatment for small packets in AQM | |||
AQM algorithms and to record this advice within the RFC series. | algorithms and to record this advice within the RFC series. It | |||
Fortunately all the implementers who responded to our survey | recommends that (1) packet size should be taken into account when | |||
(Section 3.2.4) have not followed the earlier advice, so the | transports read congestion indications, (2) not when network | |||
equipment writes them. | ||||
In particular this means that the byte-mode packet drop variant of | ||||
RED should not be used to drop fewer small packets, because that | ||||
creates a perverse incentive for transports to use tiny segments, | ||||
consequently also opening up a DoS vulnerability. Fortunately all | ||||
the RED implementers who responded to our survey (Section 4.2.4) have | ||||
not followed the earlier advice to use byte-mode drop, so the | ||||
consensus this memo argues for seems to already exist in | consensus this memo argues for seems to already exist in | |||
implementations. | implementations. | |||
The primary conclusion of this memo is that packet size should be | However, at the transport layer, TCP congestion control is a widely | |||
taken into account when transports read congestion indications, not | deployed protocol that we argue doesn't scale correctly with packet | |||
when network equipment writes them. Reducing drop of small packets | size. To date this hasn't been a significant problem because most | |||
has some tempting advantages: i) it drops fewer control packets, which | TCPs have been used with similar packet sizes. But, as we design new | |||
tend to be small and ii) it makes TCP's bit-rate less dependent on | congestion controls, we should build in scaling with packet size | |||
packet size. However, there are ways of addressing these issues at | rather than assuming we should follow TCP's example. | |||
the transport layer, rather than reverse engineering network | ||||
forwarding to fix specific transport problems. | ||||
The second conclusion is that network layer algorithms like the byte- | ||||
mode packet drop variant of RED should not be used to drop fewer | ||||
small packets, because that creates a perverse incentive for | ||||
transports to use tiny segments, consequently also opening up a DoS | ||||
vulnerability. | ||||
This memo is initially concerned with how we should correctly scale | ||||
congestion control functions with packet size for the long term. But | ||||
it also recognises that expediency may be necessary to deal with | ||||
existing widely deployed protocols that don't live up to the long | ||||
term goal. It turns out that the 'correct' variant of RED to deploy | ||||
seems to be the one everyone has deployed, and no-one who responded | ||||
to our survey has implemented the other variant. However, at the | ||||
transport layer, TCP congestion control is a widely deployed protocol | ||||
that we argue doesn't scale correctly with packet size. To date this | ||||
hasn't been a significant problem because most TCPs have been used | ||||
with similar packet sizes. But, as we design new congestion | ||||
controls, we should build in scaling with packet size rather than | ||||
assuming we should follow TCP's example. | ||||
This memo continues as follows. Terminology and scoping are | This memo continues as follows. First it discusses terminology and | |||
discussed next, and the reasons to make the recommendations presented | scoping and why it is relevant to publish this memo now. Section 2 | |||
in this memo now are given in Section 1.2. Motivating arguments for | gives motivating arguments for the recommendations that are formally | |||
our advice are given in Section 2. We then survey the advice given | stated in Section 3, which follows. We then critically survey the | |||
previously in the RFC series, the research literature and the | advice given previously in the RFC series and the research literature | |||
deployed legacy (Section 3) before listing outstanding issues | (Section 4), followed by an assessment of whether or not this advice | |||
(Section 4) that will need resolution both to inform future protocol | has been followed in production networks (Section 4.2.5). To wrap | |||
designs and to handle legacy. We then give concrete recommendations | up, outstanding issues are discussed that will need resolution both | |||
for the way forward in Section 5. We finally give security | to inform future protocol designs and to handle legacy (Section 5). | |||
considerations in Section 6. The interested reader can also find | Then security issues are collected together in Section 6 before | |||
further discussions about the theme of byte vs. packet in the | conclusions are drawn in Section 7. The interested reader can find | |||
appendices. | discussion of more detailed issues on the theme of byte vs. packet in | |||
the appendices. | ||||
This memo intentionally includes a non-negligible amount of material | This memo intentionally includes a non-negligible amount of material | |||
on the subject. A busy reader can jump right into Section 5 to read | on the subject. A busy reader can jump right into Section 3 to read | |||
a summary of the recommendations for the Internet community. | a summary of the recommendations for the Internet community. | |||
1.1. Terminology and Scoping | 1.1. Terminology and Scoping | |||
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | |||
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this | "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this | |||
document are to be interpreted as described in [RFC2119]. | document are to be interpreted as described in [RFC2119]. | |||
Congestion Notification: Rather than aim to achieve what many have | Congestion Notification: Rather than aim to achieve what many have | |||
tried and failed, this memo will not try to define congestion. It | tried and failed, this memo will not try to define congestion. It | |||
skipping to change at page 6, line 32 | skipping to change at page 7, line 30 | |||
serve. L is the instantaneous offered load. | serve. L is the instantaneous offered load. | |||
The phrase `unwilling to serve' is added, because AQM systems | The phrase `unwilling to serve' is added, because AQM systems | |||
(e.g. RED, PCN [RFC5670]) set a virtual limit smaller than the | (e.g. RED, PCN [RFC5670]) set a virtual limit smaller than the | |||
actual limit to the resource, then notify when this virtual limit | actual limit to the resource, then notify when this virtual limit | |||
is exceeded in order to avoid congestion of the actual capacity. | is exceeded in order to avoid congestion of the actual capacity. | |||
Note that the denominator is offered load, not capacity. | Note that the denominator is offered load, not capacity. | |||
Therefore congestion notification is a real number bounded by the | Therefore congestion notification is a real number bounded by the | |||
range [0,1]. This ties in with the most well-understood measure | range [0,1]. This ties in with the most well-understood measure | |||
of congestion notification: drop fraction (often loosely called | of congestion notification: drop probability (often loosely called | |||
loss rate). It also means that congestion has a natural | loss rate). It also means that congestion has a natural | |||
interpretation as a probability; the probability of offered | interpretation as a probability; the probability of offered | |||
traffic not being served (or being marked as at risk of not being | traffic not being served (or being marked as at risk of not being | |||
served). Appendix A describes a further incidental benefit that | served). | |||
arises from using load as the denominator of congestion | ||||
notification. | ||||
Explicit and Implicit Notification: The byte vs. packet dilemma | Explicit and Implicit Notification: The byte vs. packet dilemma | |||
concerns congestion notification irrespective of whether it is | concerns congestion notification irrespective of whether it is | |||
signalled implicitly by drop or using explicit congestion | signalled implicitly by drop or using explicit congestion | |||
notification (ECN [RFC3168] or PCN [RFC5670]). Throughout this | notification (ECN [RFC3168] or PCN [RFC5670]). Throughout this | |||
document, unless clear from the context, the term marking will be | document, unless clear from the context, the term marking will be | |||
used to mean notifying congestion explicitly, while congestion | used to mean notifying congestion explicitly, while congestion | |||
notification will be used to mean notifying congestion either | notification will be used to mean notifying congestion either | |||
implicitly by drop or explicitly by marking. | implicitly by drop or explicitly by marking. | |||
skipping to change at page 7, line 12 | skipping to change at page 8, line 9 | |||
congestible. If the load depends on the rate at which bits arrive | congestible. If the load depends on the rate at which bits arrive | |||
it is called bit-congestible. | it is called bit-congestible. | |||
Examples of packet-congestible resources are route look-up engines | Examples of packet-congestible resources are route look-up engines | |||
and firewalls, because load depends on how many packet headers | and firewalls, because load depends on how many packet headers | |||
they have to process. Examples of bit-congestible resources are | they have to process. Examples of bit-congestible resources are | |||
transmission links, radio power and most buffer memory, because | transmission links, radio power and most buffer memory, because | |||
the load depends on how many bits they have to transmit or store. | the load depends on how many bits they have to transmit or store. | |||
Some machine architectures use fixed size packet buffers, so | Some machine architectures use fixed size packet buffers, so | |||
buffer memory in these cases is packet-congestible (see | buffer memory in these cases is packet-congestible (see | |||
Section 3.1.1). | Section 4.1.1). | |||
Currently a design goal of network processing equipment such as | Currently a design goal of network processing equipment such as | |||
routers and firewalls is to keep packet processing uncongested | routers and firewalls is to keep packet processing uncongested | |||
even under worst case bit rates with minimum packet sizes. | even under worst case bit rates with minimum packet sizes. | |||
Therefore, packet-congestion is currently rare, but there is no | Therefore, packet-congestion is currently rare | |||
guarantee that it will not become common with future technology | [I-D.irtf-iccrg-welzl; S.3.3], but there is no guarantee that it | |||
trends. | will not become common with future technology trends. | |||
Note that information is generally processed or transmitted with a | Note that information is generally processed or transmitted with a | |||
minimum granularity greater than a bit (e.g. octets). The | minimum granularity greater than a bit (e.g. octets). The | |||
appropriate granularity for the resource in question should be | appropriate granularity for the resource in question should be | |||
used, but for the sake of brevity we will talk in terms of bytes | used, but for the sake of brevity we will talk in terms of bytes | |||
in this memo. | in this memo. | |||
Coarser granularity: Resources may be congestible at higher levels | Coarser Granularity: Resources may be congestible at higher levels | |||
of granularity than packets, for instance stateful firewalls are | of granularity than bits or packets, for instance stateful | |||
flow-congestible and call-servers are session-congestible. This | firewalls are flow-congestible and call-servers are session- | |||
memo focuses on congestion of connectionless resources, but the | congestible. This memo focuses on congestion of connectionless | |||
same principles may be applicable for congestion notification | resources, but the same principles may be applicable for | |||
protocols controlling per-flow and per-session processing or | congestion notification protocols controlling per-flow and per- | |||
state. | session processing or state. | |||
RED Terminology: In RED, whether to use packets or bytes when | RED Terminology: In RED, whether to use packets or bytes when | |||
measuring queues is respectively called packet-mode or byte-mode | measuring queues is called respectively packet-mode queue | |||
queue measurement. And if the probability of dropping a packet | measurement or byte-mode queue measurement. And whether the | |||
depends on its byte-size it is called byte-mode drop, whereas if | probability of dropping a packet is independent of or dependent on | |||
the drop probability is independent of a packet's byte-size it is | its byte-size is called respectively packet-mode drop or byte-mode | |||
called packet-mode drop. | drop. The terms byte-mode and packet-mode should not be used | |||
without specifying whether they apply to queue measurement or to | ||||
drop. | ||||
1.2. Why now? | 1.2. Why now? | |||
Now is a good time to discuss whether fairness between different | Now is a good time to discuss whether fairness between different | |||
sized packets would best be implemented in the network layer, or at | sized packets would best be implemented in network equipment, or at | |||
the transport, for a number of reasons: | the transport, for a number of reasons: | |||
1. The packet vs. byte issue requires speedy resolution because the | 1. The IETF pre-congestion notification (PCN) working group is | |||
IETF pre-congestion notification (PCN) working group is | ||||
standardising the external behaviour of a PCN congestion | standardising the external behaviour of a PCN congestion | |||
notification (AQM) algorithm [RFC5670]; | notification (AQM) algorithm [RFC5670]; | |||
2. [RFC2309] says RED may either take account of packet size or not | 2. [RFC2309] says RED may either take account of packet size or not | |||
when dropping, but gives no recommendation between the two, | when dropping, but gives no recommendation between the two, | |||
referring instead to advice on the performance implications in an | referring instead to advice on the performance implications in an | |||
email [pktByteEmail], which recommends byte-mode drop. Further, | email [pktByteEmail], which recommends byte-mode drop. Further, | |||
just before RFC2309 was issued, an addendum was added to the | just before RFC2309 was issued, an addendum was added to the | |||
archived email that revisited the issue of packet vs. byte-mode | archived email that revisited the issue of packet vs. byte-mode | |||
drop in its last paragraph, making the recommendation less clear- | drop in its last paragraph, making the recommendation less clear- | |||
skipping to change at page 8, line 32 | skipping to change at page 9, line 32 | |||
forwarding functions in future [I-D.irtf-iccrg-welzl]. The wider | forwarding functions in future [I-D.irtf-iccrg-welzl]. The wider | |||
Internet community needs to discuss whether the complexity of | Internet community needs to discuss whether the complexity of | |||
adjusting for packet size should be in the network or in | adjusting for packet size should be in the network or in | |||
transports; | transports; | |||
5. Given there are many good reasons why larger path max | 5. Given there are many good reasons why larger path max | |||
transmission units (PMTUs) would help solve a number of scaling | transmission units (PMTUs) would help solve a number of scaling | |||
issues, we don't want to create any bias against large packets | issues, we don't want to create any bias against large packets | |||
that is greater than their true cost; | that is greater than their true cost; | |||
6. The IETF has started to consider the question of fairness between | 6. The IETF audio/video transport (AVT) working group is | |||
standardising how the Real-time Transport Protocol (RTP) should feed back | ||||
and respond to explicit congestion notification (ECN) | ||||
[I-D.ietf-avt-ecn-for-rtp]. | ||||
7. The IETF has started to consider the question of fairness between | ||||
flows that use different packet sizes (e.g. in the small-packet | flows that use different packet sizes (e.g. in the small-packet | |||
variant of TCP-friendly rate control, TFRC-SP [RFC4828]). Given | variant of TCP-friendly rate control, TFRC-SP [RFC4828]). Given | |||
transports with different packet sizes, if we don't decide | transports with different packet sizes, if we don't decide | |||
whether the network or the transport should allow for packet | whether the network or the transport should allow for packet | |||
size, it will be hard if not impossible to design any transport | size, it will be hard if not impossible to design any transport | |||
protocol so that its bit-rate relative to other transports meets | protocol so that its bit-rate relative to other transports meets | |||
design guidelines [RFC5033] (Note however that, if the concern | design guidelines [RFC5033] (Note however that, if the concern | |||
were fairness between users, rather than between flows | were fairness between users, rather than between flows | |||
[Rate_fair_Dis], relative rates between flows would have to come | [Rate_fair_Dis], relative rates between flows would have to come | |||
under run-time control rather than being embedded in protocol | under run-time control rather than being embedded in protocol | |||
designs). | designs). | |||
2. Motivating Arguments | 2. Motivating Arguments | |||
In this section, we evaluate the topic of packet vs. byte based | ||||
congestion notifications and motivate the recommendations given in | ||||
this document. | ||||
2.1. Scaling Congestion Control with Packet Size | 2.1. Scaling Congestion Control with Packet Size | |||
There are two ways of interpreting a dropped or marked packet. It | There are two ways of interpreting a dropped or marked packet. It | |||
can either be considered as a single loss event or as loss/marking of | can either be considered as a single loss event or as loss/marking of | |||
the bytes in the packet. Here we try to design a test to see which | the bytes in the packet. | |||
approach scales with packet size. | ||||
Given bit-congestible is the more common case (see Section 1.1), | Consider a bit-congestible link shared by many flows (bit-congestible | |||
consider a bit-congestible link shared by many flows, so that each | is the more common case, see Section 1.1), so that each busy period | |||
busy period tends to cause packets to be lost from different flows. | tends to cause packets to be lost from different flows. Consider | |||
The test compares two identical scenarios with the same applications, | further two sources that have the same data rate but break the load | |||
the same numbers of sources and the same load. But the sources break | into large packets in one application (A) and small packets in the | |||
the load into large packets in one scenario and small packets in the | other (B). Of course, because the load is the same, there will be | |||
other. Of course, because the load is the same, there will be | proportionately more packets in the small packet flow (B). | |||
proportionately more packets in the small packet case. | ||||
The test of whether a congestion control scales with packet size is | If a congestion control scales with packet size it should respond in | |||
that it should respond in the same way to the same congestion | the same way to the same congestion excursion, irrespective of the | |||
excursion, irrespective of the size of the packets that the bytes | size of the packets that the bytes causing congestion happen to be | |||
causing congestion happen to be broken down into. | broken down into. | |||
A bit-congestible queue suffering a congestion excursion has to drop | A bit-congestible queue suffering a congestion excursion has to drop | |||
or mark the same excess bytes whether they are in a few large packets | or mark the same excess bytes whether they are in a few large packets | |||
or many small packets. So for the same congestion excursion, the | (A) or many small packets (B). So for the same congestion excursion, | |||
same number of bytes have to be shed to get the load back to its | the same number of bytes have to be shed to get the load back to its | |||
operating point. But, of course, for smaller packets more packets | operating point. But, of course, for smaller packets (B) more | |||
will have to be discarded to shed the same bytes. | packets will have to be discarded to shed the same bytes. | |||
If all the transports interpret each drop/mark as a single loss event | If all the transports interpret each drop/mark as a single loss event | |||
irrespective of the size of the packet dropped, those with smaller | irrespective of the size of the packet dropped, those with smaller | |||
packets will respond more to the same congestion excursion, failing | packets (B) will respond more to the same congestion excursion. On | |||
our test. On the other hand, if they respond proportionately less | the other hand, if they respond proportionately less when smaller | |||
when smaller packets are dropped/marked, overall they will be able to | packets are dropped/marked, overall they will be able to respond the | |||
respond the same to the same congestion excursion. | same to the same congestion excursion. | |||
Therefore, for a congestion control to scale with packet size it | Therefore, for a congestion control to scale with packet size it | |||
should respond to dropped or marked bytes (as TFRC-SP [RFC4828] | should respond to dropped or marked bytes (as TFRC-SP [RFC4828] | |||
effectively does), not just to dropped or marked packets irrespective | effectively does), instead of dropped or marked packets (as TCP | |||
of packet size (as TCP does). | does). | |||
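The scaling argument above can be checked with a little arithmetic. The following is a minimal sketch, not part of either draft: the 30,000 byte excursion is a hypothetical figure, the 1500B and 60B packet sizes are borrowed from the RED example later in this memo, and the "response" is simply the count of congestion signals or signalled bytes, not a model of any real transport.

```python
def packets_to_shed(excess_bytes, packet_size):
    """Packets a bit-congestible queue must drop or mark to shed excess_bytes."""
    return -(-excess_bytes // packet_size)  # ceiling division on integers

EXCESS_BYTES = 30_000   # hypothetical size of the congestion excursion
SIZE_A = 1500           # application A sends large packets (bytes)
SIZE_B = 60             # application B sends small packets (bytes)

drops_a = packets_to_shed(EXCESS_BYTES, SIZE_A)
drops_b = packets_to_shed(EXCESS_BYTES, SIZE_B)

# Interpretation 1: every drop/mark counts as one congestion signal,
# so the small-packet case sees many more signals for the same excursion.
signals_per_packet = {"A": drops_a, "B": drops_b}

# Interpretation 2: every drop/mark counts as the bytes it carried,
# so both cases see the same amount of congestion.
signals_per_byte = {"A": drops_a * SIZE_A, "B": drops_b * SIZE_B}

print(signals_per_packet)
print(signals_per_byte)
```

Under these assumptions only the byte-weighted count is the same in both cases, which is the sense in which a congestion control that responds to bytes scales with packet size.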
The email [pktByteEmail] referred to by RFC2309 says the question of | 2.2. Transport-Independent Network | |||
whether a packet's own size should affect its drop probability | ||||
"depends on the dominant end-to-end congestion control mechanisms". | ||||
But we argue the network layer should not be optimised for whatever | ||||
transport is predominant. | ||||
TCP congestion control ensures that flows competing for the same | TCP congestion control ensures that flows competing for the same | |||
resource each maintain the same number of segments in flight, | resource each maintain the same number of segments in flight, | |||
irrespective of segment size. So under similar conditions, flows | irrespective of segment size. So under similar conditions, flows | |||
with different segment sizes will get different bit rates. But even | with different segment sizes will get different bit rates. | |||
though reducing the drop probability of small packets helps ensure | ||||
TCPs with different packet sizes will achieve similar bit rates, we | Even though reducing the drop probability of small packets (e.g. | |||
argue this correction should be made to TCP itself, not to the | RED's byte-mode drop) helps ensure TCPs with different packet sizes | |||
will achieve similar bit rates, we argue this correction should be | ||||
made to any future transport protocols based on TCP, not to the | ||||
network in order to fix one transport, no matter how prominent it is. | network in order to fix one transport, no matter how prominent it is. | |||
Effectively, favouring small packets is reverse engineering of | ||||
network equipment around one particular transport protocol (TCP), | ||||
contrary to the excellent advice in [RFC3426], which asks designers | ||||
to question "Why are you proposing a solution at this layer of the | ||||
protocol stack, rather than at another layer?" | ||||
Effectively, favouring small packets is reverse engineering of the | RFC2309 refers to an email [pktByteEmail] for advice on how RED | |||
network layer around TCP, contrary to the excellent advice in | should allow for different packet sizes. The email says the question | |||
[RFC3426], which asks designers to question "Why are you proposing a | of whether a packet's own size should affect its drop probability | |||
solution at this layer of the protocol stack, rather than at another | "depends on the dominant end-to-end congestion control mechanisms". | |||
layer?" | But we argue network equipment should not be specialised for whatever | |||
transport is predominant. No matter how convenient it is, we SHOULD | ||||
NOT hack the network solely to allow for omissions from the design of | ||||
one transport protocol, even if it is as predominant as TCP. | ||||
2.2. Avoiding Perverse Incentives to (ab)use Smaller Packets | 2.3. Avoiding Perverse Incentives to (Ab)use Smaller Packets | |||
Increasingly, it is being recognised that a protocol design must take | Increasingly, it is being recognised that a protocol design must take | |||
care not to cause unintended consequences by giving the parties in | care not to cause unintended consequences by giving the parties in | |||
the protocol exchange perverse incentives [Evol_cc][RFC3426]. Again, | the protocol exchange perverse incentives [Evol_cc][RFC3426]. Again, | |||
imagine a scenario where the same bit rate of packets will contribute | imagine a scenario where the same bit rate of packets will contribute | |||
the same to bit-congestion of a link irrespective of whether it is | the same to bit-congestion of a link irrespective of whether it is | |||
sent as fewer larger packets or more smaller packets. A protocol | sent as fewer larger packets or more smaller packets. A protocol | |||
design that caused larger packets to be more likely to be dropped | design that caused larger packets to be more likely to be dropped | |||
than smaller ones would be dangerous in this case: | than smaller ones would be dangerous in this case: | |||
Normal transports: Even if a transport is not actually malicious, if | ||||
it finds small packets go faster, over time it will tend to act in | ||||
its own interest and use them. Queues that give advantage to | ||||
small packets create an evolutionary pressure for transports to | ||||
send at the same bit-rate but break their data stream down into | ||||
tiny segments to reduce their drop rate. Encouraging a high | ||||
volume of tiny packets might in turn unnecessarily overload a | ||||
completely unrelated part of the system, perhaps more limited by | ||||
header-processing than bandwidth. | ||||
Malicious transports: A queue that gives an advantage to small | Malicious transports: A queue that gives an advantage to small | |||
packets can be used to amplify the force of a flooding attack. By | packets can be used to amplify the force of a flooding attack. By | |||
sending a flood of small packets, the attacker can get the queue | sending a flood of small packets, the attacker can get the queue | |||
to discard more traffic in large packets, allowing more attack | to discard more traffic in large packets, allowing more attack | |||
traffic to get through to cause further damage. Such a queue | traffic to get through to cause further damage. Such a queue | |||
allows attack traffic to have a disproportionately large effect on | allows attack traffic to have a disproportionately large effect on | |||
regular traffic without the attacker having to do much work. | regular traffic without the attacker having to do much work. | |||
Note that, although the byte-mode drop variant of RED amplifies | Non-malicious transports: Even if a transport is not actually | |||
small packet attacks, drop-tail queues amplify small packet | malicious, if it finds small packets go faster, over time it will | |||
attacks even more (see Security Considerations in Section 6). | tend to act in its own interest and use them. Queues that give | |||
Wherever possible neither should be used. | advantage to small packets create an evolutionary pressure for | |||
transports to send at the same bit-rate but break their data | ||||
stream down into tiny segments to reduce their drop rate. | ||||
Encouraging a high volume of tiny packets might in turn | ||||
unnecessarily overload a completely unrelated part of the system, | ||||
perhaps more limited by header-processing than bandwidth. | ||||
Imagine two unresponsive flows arrive at a bit-congestible | Imagine two unresponsive flows arrive at a bit-congestible | |||
transmission link each with the same bit rate, say 1Mbps, but one | transmission link each with the same bit rate, say 1Mbps, but one | |||
consists of 1500B and the other 60B packets, which are 25x smaller. | consists of 1500B and the other 60B packets, which are 25x smaller. | |||
Consider a scenario where gentle RED [gentle_RED] is used, along with | Consider a scenario where gentle RED [gentle_RED] is used, along with | |||
the variant of RED we advise against, i.e. where the RED algorithm is | the variant of RED we advise against, i.e. where the RED algorithm is | |||
configured to adjust the drop probability of packets in proportion to | configured to adjust the drop probability of packets in proportion to | |||
each packet's size (byte mode packet drop). In this case, if RED | each packet's size (byte mode packet drop). In this case, if RED | |||
drops 25% of the larger packets, it will aim to drop 1% of the | drops 25% of the larger packets, it will aim to drop 1% of the | |||
smaller packets (but in practice it may drop more as congestion | smaller packets (but in practice it may drop more as congestion | |||
increases [RFC4828](S.B.4)). Even though both flows arrive with the | increases [RFC4828; S.B.4]). Even though both flows arrive with the | |||
same bit rate, the bit rate the RED queue aims to pass to the line | same bit rate, the bit rate the RED queue aims to pass to the line | |||
will be 750kbps for the flow of larger packets but 990kbps for the smaller | will be 750kbps for the flow of larger packets but 990kbps for the smaller | |||
packets (but because of rate variation it will be less than this | packets (but because of rate variation it will be less than this | |||
target). | target). | |||
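The figures in this example can be reproduced with a short calculation. This is an illustrative sketch only: it assumes byte-mode drop scales a base drop probability linearly by packet size relative to a 1500B maximum packet size, and the function names are invented for the illustration rather than taken from any RED implementation.

```python
MAX_PACKET_SIZE = 1500  # bytes; assumed maximum packet size for scaling

def byte_mode_drop_prob(base_prob, packet_size, max_size=MAX_PACKET_SIZE):
    """Drop probability scaled in proportion to the packet's own size."""
    return base_prob * packet_size / max_size

def target_passed_rate(arrival_bps, drop_prob):
    """Bit rate the queue aims to pass for an unresponsive flow."""
    return arrival_bps * (1 - drop_prob)

ARRIVAL_BPS = 1_000_000   # both flows arrive at 1Mbps
BASE_PROB = 0.25          # drop probability applied to 1500B packets

p_large = byte_mode_drop_prob(BASE_PROB, 1500)
p_small = byte_mode_drop_prob(BASE_PROB, 60)

rate_large = target_passed_rate(ARRIVAL_BPS, p_large)
rate_small = target_passed_rate(ARRIVAL_BPS, p_small)

print(f"large packets: p={p_large:.2f}, target rate about {round(rate_large)} b/s")
print(f"small packets: p={p_small:.2f}, target rate about {round(rate_small)} b/s")
```

The calculation ignores the rate variation and the rise in small-packet drop under heavier congestion that the text mentions, so it gives the queue's targets rather than measured throughput.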
It can be seen that this behaviour reopens the same denial of service | Note that, although the byte-mode drop variant of RED amplifies small | |||
vulnerability that drop tail queues offer to floods of small packets, | packet attacks, drop-tail queues amplify small packet attacks even | |||
though not necessarily as strongly (see Section 6). | more (see Security Considerations in Section 6). Wherever possible | |||
neither should be used. | ||||
2.3. Small != Control | 2.4. Small != Control | |||
It is tempting to drop small packets with lower probability to | It is tempting to drop small packets with lower probability to | |||
improve performance, because many control packets are small (TCP SYNs | improve performance, because many control packets are small (TCP SYNs | |||
& ACKs, DNS queries & responses, SIP messages, HTTP GETs, etc) and | & ACKs, DNS queries & responses, SIP messages, HTTP GETs, etc) and | |||
dropping fewer control packets considerably improves performance. | dropping fewer control packets considerably improves performance. | |||
However, we must not give control packets preference purely by virtue | However, we must not give control packets preference purely by virtue | |||
of their smallness, otherwise it is too easy for any data source to | of their smallness, otherwise it is too easy for any data source to | |||
get the same preferential treatment simply by sending data in smaller | get the same preferential treatment simply by sending data in smaller | |||
packets. Again we should not create perverse incentives to favour | packets. Again we should not create perverse incentives to favour | |||
small packets rather than to favour control packets, which is what we | small packets rather than to favour control packets, which is what we | |||
intend. | intend. | |||
Just because many control packets are small does not mean all small | Just because many control packets are small does not mean all small | |||
packets are control packets. | packets are control packets. | |||
So again, rather than fix these problems in the network layer, we | So again, rather than fix these problems in the network, we argue | |||
argue that the transport should be made more robust against losses of | that the transport should be made more robust against losses of | |||
control packets (see 'Making Transports Robust against Control Packet | control packets (see 'Making Transports Robust against Control Packet | |||
Losses' in Section 3.2.3). | Losses' in Section 4.2.3). | |||
2.4. Implementation Efficiency | 2.5. Implementation Efficiency | |||
Allowing for packet size at the transport rather than in the network | Allowing for packet size at the transport rather than in the network | |||
ensures that neither the network nor the transport needs to do a | ensures that neither the network nor the transport needs to do a | |||
multiply operation--multiplication by packet size is effectively | multiply operation--multiplication by packet size is effectively | |||
achieved as a repeated add when the transport adds to its count of | achieved as a repeated add when the transport adds to its count of | |||
marked bytes as each congestion event is fed to it. This isn't a | marked bytes as each congestion event is fed to it. This isn't a | |||
principled reason in itself, but it is a happy consequence of the | principled reason in itself, but it is a happy consequence of the | |||
other principled reasons. | other principled reasons. | |||
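The repeated add can be pictured as follows. This is a hypothetical sketch of transport-side bookkeeping, assuming the transport learns the size of each packet and whether it was dropped or marked; the class and method names are illustrative and do not come from any specified protocol.

```python
class MarkedByteCounter:
    """Accumulates congestion feedback in bytes using only additions."""

    def __init__(self):
        self.marked_bytes = 0   # bytes carried by dropped or marked packets
        self.total_bytes = 0    # all bytes reported by feedback

    def on_feedback(self, packet_size, marked):
        # One addition per packet: the packet's size weights the signal,
        # so no explicit multiplication by packet size is needed anywhere.
        self.total_bytes += packet_size
        if marked:
            self.marked_bytes += packet_size


counter = MarkedByteCounter()
for size, marked in [(1500, True), (60, False), (60, True), (1500, False)]:
    counter.on_feedback(size, marked)

print(counter.marked_bytes, counter.total_bytes)
```

The network side in this picture does nothing size-dependent at all: it drops or marks packets without regard to their size, and the weighting by size emerges from the transport's running totals.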
3. The State of the Art | 3. Recommendations | |||
3.1. Recommendation on Queue Measurement | ||||
Queue length is usually the most correct and simplest way to measure | ||||
congestion of a resource. To avoid the pathological effects of drop | ||||
tail, an AQM function can then be used to transform queue length into | ||||
the probability of dropping or marking a packet (e.g. RED's | ||||
piecewise linear function between thresholds). | ||||
If the resource is bit-congestible, the implementation SHOULD measure | ||||
the length of the queue in bytes. If the resource is packet- | ||||
congestible, the implementation SHOULD measure the length of the | ||||
queue in packets. No other choice makes sense, because the number of | ||||
packets waiting in the queue isn't relevant if the resource gets | ||||
congested by bytes and vice versa. | ||||
Corollaries: | ||||
1. Whether a resource is bit-congestible or packet-congestible is a | ||||
property of the resource, so an admin should not ever need to, or | ||||
be able to, configure the way a queue measures itself. | ||||
2. If RED is used, the implementation SHOULD use byte mode queue | ||||
measurement for measuring the congestion of bit-congestible | ||||
resources and packet mode queue measurement for packet- | ||||
congestible resources. | ||||
The recommended approach in less straightforward scenarios, such as | ||||
fixed size buffers, and resources without a queue, is discussed in | ||||
Section 4.1. | ||||
3.2. Recommendation on Notifying Congestion | ||||
The Internet's congestion notification protocols (drop, ECN & PCN) | ||||
SHOULD NOT take account of packet size when congestion is notified by | ||||
network equipment. Allowance for packet size is only appropriate | ||||
when the transport responds to congestion (See Recommendation 3.3). | ||||
This approach offers sufficient and correct congestion information | ||||
for all known and future transport protocols and also ensures no | ||||
perverse incentives are created that would encourage transports to | ||||
use inappropriately small packet sizes. | ||||
Corollaries: | ||||
1. AQM algorithms such as RED SHOULD NOT use byte-mode drop, which | ||||
deflates RED's drop probability for smaller packet sizes. RED's | ||||
byte-mode drop has no enduring advantages. It is more complex, | ||||
it creates the perverse incentive to fragment segments into tiny | ||||
pieces and it reopens the vulnerability to floods of small- | ||||
packets that drop-tail queues suffered from and AQM was designed | ||||
to remove. | ||||
2. If a vendor has implemented byte-mode drop, and an operator has | ||||
turned it on, it is strongly RECOMMENDED that it be turned | ||||
off. Note that RED as a whole SHOULD NOT be turned off, as | ||||
without it, a drop tail queue also biases against large packets. | ||||
But note also that turning off byte-mode drop may alter the | ||||
relative performance of applications using different packet | ||||
sizes, so it would be advisable to establish the implications | ||||
before turning it off. | ||||
NOTE WELL that RED's byte-mode queue drop is completely | ||||
orthogonal to byte-mode queue measurement and should not be | ||||
confused with it. If a RED implementation has a byte-mode but | ||||
does not specify what sort of byte-mode, it is most probably | ||||
byte-mode queue measurement, which is fine. However, if in | ||||
doubt, the vendor should be consulted. | ||||
The byte mode packet drop variant of RED was recommended in the past | ||||
(see Section 4.2.1 for how thinking evolved). However, our survey of | ||||
84 vendors across the industry (Section 4.2.5) has found that none of | ||||
the 19% who responded have implemented byte mode drop in RED. Given | ||||
there appears to be little, if any, installed base it seems we can | ||||
deprecate byte-mode drop in RED with little, if any, incremental | ||||
deployment impact. | ||||
3.3. Recommendation on Responding to Congestion | ||||
Instead of network equipment biasing its congestion notification in | ||||
favour of small packets, the IETF transport area should continue its | ||||
programme of: | ||||
o updating host-based congestion control protocols to take account | ||||
of packet size | ||||
o making transports less sensitive to losing control packets like | ||||
SYNs and pure ACKs. | ||||
Corollaries: | ||||
1. If two TCPs with different packet sizes are required to run at | ||||
equal bit rates under the same path conditions, this SHOULD be | ||||
done by altering TCP (Section 4.2.2), not network equipment, | ||||
which would otherwise affect other transports besides TCP. | ||||
2. If it is desired to improve TCP performance by reducing the | ||||
chance that a SYN or a pure ACK will be dropped, this should be | ||||
done by modifying TCP (Section 4.2.3), not network equipment. | ||||
3.4. Recommended Future Research | ||||
The above conclusions cater for the Internet as it is today with most | ||||
resources being primarily bit-congestible. A secondary conclusion of | ||||
this memo is that research is needed to determine whether there might | ||||
be more packet-congestible resources in the future. Then further | ||||
research would be needed to extend the Internet's congestion | ||||
notification (drop or ECN) so that it would be able to handle a more | ||||
even mix of bit-congestible and packet-congestible resources. | ||||
4. A Survey and Critique of Past Advice | ||||
The original 1993 paper on RED [RED93] proposed two options for the | The original 1993 paper on RED [RED93] proposed two options for the | |||
RED active queue management algorithm: packet mode and byte mode. | RED active queue management algorithm: packet mode and byte mode. | |||
Packet mode measured the queue length in packets and dropped (or | Packet mode measured the queue length in packets and dropped (or | |||
marked) individual packets with a probability independent of their | marked) individual packets with a probability independent of their | |||
size. Byte mode measured the queue length in bytes and marked an | size. Byte mode measured the queue length in bytes and marked an | |||
individual packet with probability in proportion to its size | individual packet with probability in proportion to its size | |||
(relative to the maximum packet size). In the paper's outline of | (relative to the maximum packet size). In the paper's outline of | |||
further work, it was stated that no recommendation had been made on | further work, it was stated that no recommendation had been made on | |||
whether the queue size should be measured in bytes or packets, but | whether the queue size should be measured in bytes or packets, but | |||
noted that the difference could be significant. | noted that the difference could be significant. | |||
When RED was recommended for general deployment in 1998 [RFC2309], | When RED was recommended for general deployment in 1998 [RFC2309], | |||
the two modes were mentioned implying the choice between them was a | the two modes were mentioned implying the choice between them was a | |||
question of performance, referring to a 1997 email [pktByteEmail] for | question of performance, referring to a 1997 email [pktByteEmail] for | |||
advice on tuning. This email clarified that there were in fact two | advice on tuning. A later addendum to this email introduced the | |||
orthogonal choices: whether to measure queue length in bytes or | insight that there are in fact two orthogonal choices: | |||
packets (Section 3.1 below) and whether the drop probability of an | ||||
individual packet should depend on its own size (Section 3.2 below). | ||||
3.1. Congestion Measurement: Status | o whether to measure queue length in bytes or packets (Section 4.1) | |||
o whether the drop probability of an individual packet should depend | ||||
on its own size (Section 4.2). | ||||
The rest of this section is structured accordingly. | ||||
4.1. Congestion Measurement Advice | ||||
The choice of which metric to use to measure queue length was left | The choice of which metric to use to measure queue length was left | |||
open in RFC2309. It is now well understood that queues for bit- | open in RFC2309. It is now well understood that queues for bit- | |||
congestible resources should be measured in bytes, and queues for | congestible resources should be measured in bytes, and queues for | |||
packet-congestible resources should be measured in packets. | packet-congestible resources should be measured in packets. | |||
Where buffers are not configured or legacy buffers cannot be | ||||
configured to the above guideline, we do not have to make allowances | ||||
for such legacy in future protocol design. If a bit-congestible | ||||
buffer is measured in packets, the operator will have set the | ||||
thresholds mindful of a typical mix of packet sizes. Any AQM | ||||
algorithm on such a buffer will be oversensitive to high proportions | ||||
of small packets, e.g. a DoS attack, and undersensitive to high | ||||
proportions of large packets. But an operator can safely keep such a | ||||
legacy buffer because any undersensitivity during unusual traffic | ||||
mixes cannot lead to congestion collapse given the buffer will | ||||
eventually revert to tail drop, discarding proportionately more large | ||||
packets. | ||||
Some modern queue implementations give a choice for setting RED's | Some modern queue implementations give a choice for setting RED's | |||
thresholds in byte-mode or packet-mode. This may merely be an | thresholds in byte-mode or packet-mode. This may merely be an | |||
administrator-interface preference, not altering how the queue itself | administrator-interface preference, not altering how the queue itself | |||
is measured but on some hardware it does actually change the way it | is measured but on some hardware it does actually change the way it | |||
measures its queue. Whether a resource is bit-congestible or packet- | measures its queue. Whether a resource is bit-congestible or packet- | |||
congestible is a property of the resource, so an admin should not | congestible is a property of the resource, so an admin should not | |||
ever need to, or be able to, configure the way a queue measures | ever need to, or be able to, configure the way a queue measures | |||
itself. | itself. | |||
We believe the question of whether to measure queues in bytes or | NOTE: Congestion in some legacy bit-congestible buffers is only | |||
packets is fairly well understood these days. The only outstanding | measured in packets not bytes. In such cases, the operator has to | |||
issues concern how to measure congestion when the queue is bit | set the thresholds mindful of a typical mix of packet sizes. Any | |||
congestible but the resource is packet congestible or vice versa. | AQM algorithm on such a buffer will be oversensitive to high | |||
proportions of small packets, e.g. a DoS attack, and undersensitive | ||||
to high proportions of large packets. However, there is no need to | ||||
make allowances for the possibility of such legacy in future protocol | ||||
design. This is safe because any undersensitivity during unusual | ||||
traffic mixes cannot lead to congestion collapse given the buffer | ||||
will eventually revert to tail drop, discarding proportionately more | ||||
large packets. | ||||
There is no controversy over what should be done. It's just you have | 4.1.1. Fixed Size Packet Buffers | |||
to be an expert in probability to work out what should be done | ||||
(summarised in the following section) and, even if you have, it's not | ||||
always easy to find a practical algorithm to implement it. | ||||
3.1.1. Fixed Size Packet Buffers | Although the question of whether to measure queues in bytes or | |||
packets is fairly well understood these days, measuring congestion is | ||||
not straightforward when the resource is bit congestible but the | ||||
queue is packet congestible or vice versa. This section outlines the | ||||
approach to take. There is no controversy over what should be done, | ||||
you just need to be expert in probability to work it out. And, even | ||||
if you know what should be done, it's not always easy to find a | ||||
practical algorithm to implement it. | ||||
Some, mostly older, queuing hardware sets aside fixed sized buffers | Some, mostly older, queuing hardware sets aside fixed sized buffers | |||
in which to store each packet in the queue. Also, with some | in which to store each packet in the queue. Also, with some | |||
hardware, any fixed sized buffers not completely filled by a packet | hardware, any fixed sized buffers not completely filled by a packet | |||
are padded when transmitted to the wire. If we imagine a theoretical | are padded when transmitted to the wire. If we imagine a theoretical | |||
forwarding system with both queuing and transmission in fixed, MTU- | forwarding system with both queuing and transmission in fixed, MTU- | |||
sized units, it should clearly be treated as packet-congestible, | sized units, it should clearly be treated as packet-congestible, | |||
because the queue length in packets would be a good model of | because the queue length in packets would be a good model of | |||
congestion of the lower layer link. | congestion of the lower layer link. | |||
skipping to change at page 14, line 8 | skipping to change at page 17, line 42 | |||
We now return to the issue we temporarily set aside: small packets | We now return to the issue we temporarily set aside: small packets | |||
borrowing space in larger buffers. In this case, the only difference | borrowing space in larger buffers. In this case, the only difference | |||
is that the pools for smaller packets have a maximum queue size that | is that the pools for smaller packets have a maximum queue size that | |||
includes all the pools for larger packets. And every time a packet | includes all the pools for larger packets. And every time a packet | |||
takes a larger buffer, the current queue size has to be incremented | takes a larger buffer, the current queue size has to be incremented | |||
for all queues in the pools of buffers less than or equal to the | for all queues in the pools of buffers less than or equal to the | |||
buffer size used. | buffer size used. | |||
We will return to borrowing of fixed sized buffers when we discuss | We will return to borrowing of fixed sized buffers when we discuss | |||
biasing the drop/marking probability of a specific packet because of | biasing the drop/marking probability of a specific packet because of | |||
its size in Section 3.2.1. But here we can give a simple summary of | its size in Section 4.2.1. But here we can give at least one | |||
the present discussion on how to measure the length of queues of | simple rule for how to measure the length of queues of fixed buffers: | |||
fixed buffers: no matter how complicated the scheme is, ultimately | no matter how complicated the scheme is, ultimately any fixed buffer | |||
any fixed buffer system will need to measure its queue length in | system will need to measure its queue length in packets not bytes. | |||
packets not bytes. | ||||
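The borrowing rule described above can be sketched as follows. This is a minimal illustrative model, not taken from any real forwarding implementation: the class name, method names and pool sizes are all hypothetical. It shows queue length being counted in packets, with each stored packet counted against every pool whose buffers are no larger than the buffer actually used.

```python
from bisect import bisect_left


class FixedBufferPools:
    """Hypothetical model of fixed-size buffer pools with borrowing."""

    def __init__(self, pools):
        # pools: {buffer_size_in_bytes: number_of_buffers} (illustrative values)
        self.sizes = sorted(pools)
        self.free = dict(pools)
        # The maximum queue size of a pool, in packets, includes all larger pools.
        self.max_qlen = {
            s: sum(n for t, n in pools.items() if t >= s) for s in self.sizes
        }
        # Current queue length per pool, measured in packets, not bytes.
        self.qlen = {s: 0 for s in self.sizes}

    def enqueue(self, pkt_bytes):
        """Use the smallest free buffer that fits; return its size, or None (drop)."""
        start = bisect_left(self.sizes, pkt_bytes)
        for size in self.sizes[start:]:
            if self.free[size] > 0:
                self.free[size] -= 1
                # Count the packet against every pool whose buffer size is
                # less than or equal to the buffer size used.
                for s in self.sizes:
                    if s <= size:
                        self.qlen[s] += 1
                return size
        return None
```

With two 128-byte and two 1500-byte buffers, a third small packet borrows a 1500-byte buffer, and the queue length seen by the small-packet pool rises to 3 while that of the large-packet pool rises to 1.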
3.1.2. Congestion Measurement without a Queue | 4.1.2. Congestion Measurement without a Queue | |||
AQM algorithms are nearly always described assuming there is a queue | AQM algorithms are nearly always described assuming there is a queue | |||
for a congested resource and the algorithm can use the queue length | for a congested resource and the algorithm can use the queue length | |||
to determine the probability that it will drop or mark each packet. | to determine the probability that it will drop or mark each packet. | |||
But not all congested resources lead to queues. For instance, | But not all congested resources lead to queues. For instance, | |||
wireless spectrum is bit-congestible (for a given coding scheme), | wireless spectrum is bit-congestible (for a given coding scheme), | |||
because interference increases with the rate at which bits are | because interference increases with the rate at which bits are | |||
transmitted. But wireless link protocols do not always maintain a | transmitted. But wireless link protocols do not always maintain a | |||
queue that depends on spectrum interference. Similarly, power | queue that depends on spectrum interference. Similarly, power | |||
limited resources are also usually bit-congestible if energy is | limited resources are also usually bit-congestible if energy is | |||
primarily required for transmission rather than header processing, | primarily required for transmission rather than header processing, | |||
but it is rare for a link protocol to build a queue as it approaches | but it is rare for a link protocol to build a queue as it approaches | |||
maximum power. | maximum power. | |||
skipping to change at page 14, line 39 | skipping to change at page 18, line 25 | |||
Nonetheless, AQM algorithms do not require a queue in order to work. | Nonetheless, AQM algorithms do not require a queue in order to work. | |||
For instance spectrum congestion can be modelled by signal quality | For instance spectrum congestion can be modelled by signal quality | |||
using target bit-energy-to-noise-density ratio. And, to model radio | using target bit-energy-to-noise-density ratio. And, to model radio | |||
power exhaustion, transmission power levels can be measured and | power exhaustion, transmission power levels can be measured and | |||
compared to the maximum power available. [ECNFixedWireless] proposes | compared to the maximum power available. [ECNFixedWireless] proposes | |||
a practical and theoretically sound way to combine congestion | a practical and theoretically sound way to combine congestion | |||
notification for different bit-congestible resources at different | notification for different bit-congestible resources at different | |||
layers along an end to end path, whether wireless or wired, and | layers along an end to end path, whether wireless or wired, and | |||
whether with or without queues. | whether with or without queues. | |||
3.2. Congestion Coding: Status | 4.2. Congestion Notification Advice | |||
3.2.1. Network Bias when Encoding | 4.2.1. Network Bias when Encoding | |||
The previously mentioned email [pktByteEmail] referred to by | The previously mentioned email [pktByteEmail] referred to by | |||
[RFC2309] gave advice we now disagree with. It said that drop | [RFC2309] advised that most scarce resources in the Internet were | |||
probability should depend on the size of the packet being considered | bit-congestible, which is still believed to be true (Section 1.1). | |||
for drop if the resource is bit-congestible, but not if it is packet- | But it went on to give advice we now disagree with. It said that | |||
congestible, but advised that most scarce resources in the Internet | drop probability should depend on the size of the packet being | |||
were currently bit-congestible. The argument continued that if | considered for drop if the resource is bit-congestible, but not if it | |||
packet drops were inflated by packet size (byte-mode dropping), "a | is packet-congestible. The argument continued that if packet drops | |||
flow's fraction of the packet drops is then a good indication of that | were inflated by packet size (byte-mode dropping), "a flow's fraction | |||
flow's fraction of the link bandwidth in bits per second". This was | of the packet drops is then a good indication of that flow's fraction | |||
consistent with a referenced policing mechanism being worked on at | of the link bandwidth in bits per second". This was consistent with | |||
the time for detecting unusually high bandwidth flows, eventually | a referenced policing mechanism being worked on at the time for | |||
published in 1999 [pBox]. [The problem could and should have been | detecting unusually high bandwidth flows, eventually published in | |||
solved by making the policing mechanism count the volume of bytes | 1999 [pBox]. However, the problem could and should have been solved | |||
randomly dropped, not the number of packets.] | by making the policing mechanism count the volume of bytes randomly | |||
dropped, not the number of packets. | ||||
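The point about counting bytes rather than packets can be illustrated with expected values. The sketch below assumes a size-independent (packet-mode) drop probability p and steady flows; the function name and the example flows are hypothetical. Under these assumptions each flow's share of dropped bytes equals its share of the link bit-rate, whereas its share of dropped packets does not.

```python
def drop_shares(flows, p):
    """Expected shares per flow under packet-mode drop.

    flows: {name: (packets_per_second, bytes_per_packet)} (illustrative input)
    p:     drop probability applied to every packet regardless of size (p > 0)
    """
    total_pkts = sum(rate for rate, _ in flows.values())
    total_bytes = sum(rate * size for rate, size in flows.values())
    shares = {}
    for name, (rate, size) in flows.items():
        shares[name] = {
            # Fraction of the link bit-rate used by this flow.
            "bit_rate_share": rate * size / total_bytes,
            # What a policer counting dropped packets would see.
            "dropped_packet_share": (p * rate) / (p * total_pkts),
            # What a policer counting dropped bytes would see.
            "dropped_byte_share": (p * rate * size) / (p * total_bytes),
        }
    return shares


example = drop_shares({"small": (1000, 60), "large": (100, 1500)}, p=0.01)
```

In this example the small-packet flow uses 2/7 of the bit-rate and accounts for 2/7 of the dropped bytes, but for 10/11 of the dropped packets.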
A few months before RFC2309 was published, an addendum was added to | A few months before RFC2309 was published, an addendum was added to | |||
the above archived email referenced from the RFC, in which the final | the above archived email referenced from the RFC, in which the final | |||
paragraph seemed to partially retract what had previously been said. | paragraph seemed to partially retract what had previously been said. | |||
It clarified that the question of whether the probability of | It clarified that the question of whether the probability of | |||
dropping/marking a packet should depend on its size was not related | dropping/marking a packet should depend on its size was not related | |||
to whether the resource itself was bit congestible, but a completely | to whether the resource itself was bit congestible, but a completely | |||
orthogonal question. However the only example given had the queue | orthogonal question. However the only example given had the queue | |||
measured in packets but packet drop depended on the byte-size of the | measured in packets but packet drop depended on the byte-size of the | |||
packet in question. No example was given the other way round. | packet in question. No example was given the other way round. | |||
In 2000, Cnodder et al [REDbyte] pointed out that there was an error | In 2000, Cnodder et al [REDbyte] pointed out that there was an error | |||
in the part of the original 1993 RED algorithm that aimed to | in the part of the original 1993 RED algorithm that aimed to | |||
distribute drops uniformly, because it didn't correctly take into | distribute drops uniformly, because it didn't correctly take into | |||
account the adjustment for packet size. They recommended an | account the adjustment for packet size. They recommended an | |||
algorithm called RED_4 to fix this. But they also recommended a | algorithm called RED_4 to fix this. But they also recommended a | |||
further change, RED_5, to adjust drop rate dependent on the square of | further change, RED_5, to adjust drop rate dependent on the square of | |||
relative packet size. This was indeed consistent with one implied | relative packet size. This was indeed consistent with one implied | |||
motivation behind RED's byte mode drop--that we should reverse | motivation behind RED's byte mode drop--that we should reverse | |||
engineer the network to improve the performance of dominant end-to- | engineer the network to improve the performance of dominant end-to- | |||
end congestion control mechanisms. | end congestion control mechanisms. But it is not consistent with the | |||
present recommendations of Section 3. | ||||
By 2003, a further change had been made to the adjustment for packet | By 2003, a further change had been made to the adjustment for packet | |||
size, this time in the RED algorithm of the ns2 simulator. Instead | size, this time in the RED algorithm of the ns2 simulator. Instead | |||
of taking each packet's size relative to a `maximum packet size' it | of taking each packet's size relative to a `maximum packet size' it | |||
was taken relative to a `mean packet size', intended to be a static | was taken relative to a `mean packet size', intended to be a static | |||
value representative of the `typical' packet size on the link. We | value representative of the `typical' packet size on the link. We | |||
have not been able to find a justification for this change in the | have not been able to find a justification in the literature for this | |||
literature, however Eddy and Allman conducted experiments [REDbias] | change, however Eddy and Allman conducted experiments [REDbias] that | |||
that assessed how sensitive RED was to this parameter, amongst other | assessed how sensitive RED was to this parameter, amongst other | |||
things. No-one seems to have pointed out that this changed algorithm | things. No-one seems to have pointed out that this changed algorithm | |||
can often lead to drop probabilities of greater than 1 [which should | can often lead to drop probabilities of greater than 1 (which should | |||
ring alarm bells hinting that there's a mistake in the theory | ring alarm bells hinting that there's a mistake in the theory | |||
somewhere]. On 10-Nov-2004, this variant of byte-mode packet drop | somewhere). | |||
was made the default in the ns2 simulator. | ||||
On 10-Nov-2004, this variant of byte-mode packet drop was made the | ||||
default in the ns2 simulator. None of the respondents to our | ||||
admittedly limited survey of implementers (Section 4.2.5) reported | ||||
that any variant of byte-mode drop had been implemented. Therefore any | ||||
conclusions based on ns2 simulations that use RED without disabling | ||||
byte-mode drop are likely to be highly questionable. | ||||
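The observation that scaling by a `mean packet size' can yield a drop probability greater than 1 can be checked with simple arithmetic. The sketch below assumes a plain linear scaling of the packet-mode probability by relative packet size; it is not the ns2 code, and the function name and numbers are hypothetical.

```python
def byte_mode_drop_prob(p, pkt_size, ref_size):
    """Linear byte-mode scaling (assumed form): p * pkt_size / ref_size."""
    return p * pkt_size / ref_size


# Relative to a maximum packet size, the result can never exceed p.
relative_to_max = byte_mode_drop_prob(0.5, pkt_size=1500, ref_size=1500)

# Relative to a smaller 'mean' packet size, a large packet can exceed 1.
relative_to_mean = byte_mode_drop_prob(0.5, pkt_size=1500, ref_size=500)
```

With these illustrative values the first call gives 0.5 and the second gives 1.5, which is not a valid probability.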
The byte-mode drop variant of RED is, of course, not the only | The byte-mode drop variant of RED is, of course, not the only | |||
possible bias towards small packets in queueing algorithms. We have | possible bias towards small packets in queueing systems. We have | |||
already mentioned that tail-drop queues naturally tend to lock-out | already mentioned that tail-drop queues naturally tend to lock-out | |||
large packets once they are full. But also queues with fixed sized | large packets once they are full. But also queues with fixed sized | |||
buffers reduce the probability that small packets will be dropped if | buffers reduce the probability that small packets will be dropped if | |||
(and only if) they allow small packets to borrow buffers from the | (and only if) they allow small packets to borrow buffers from the | |||
pools for larger packets. As was explained in Section 3.1.1 on fixed | pools for larger packets. As was explained in Section 4.1.1 on fixed | |||
size buffer carving, borrowing effectively makes the maximum queue | size buffer carving, borrowing effectively makes the maximum queue | |||
size for small packets greater than that for large packets, because | size for small packets greater than that for large packets, because | |||
more buffers can be used by small packets while fewer will fit large | more buffers can be used by small packets while fewer will fit large | |||
packets. | packets. | |||
In itself, the bias towards small packets caused by buffer borrowing | In itself, the bias towards small packets caused by buffer borrowing | |||
is perfectly correct. Lower drop probability for small packets is | is perfectly correct. Lower drop probability for small packets is | |||
legitimate in buffer borrowing schemes, because small packets | legitimate in buffer borrowing schemes, because small packets | |||
genuinely congest the machine's buffer memory less than large | genuinely congest the machine's buffer memory less than large | |||
packets, given they can fit in more spaces. The bias towards small | packets, given they can fit in more spaces. The bias towards small | |||
skipping to change at page 16, line 28 | skipping to change at page 20, line 21 | |||
mode drop. | mode drop. | |||
Nonetheless, fixed-buffer memory with tail drop is still prone to | Nonetheless, fixed-buffer memory with tail drop is still prone to | |||
lock-out large packets, purely because of the tail-drop aspect. So a | lock-out large packets, purely because of the tail-drop aspect. So a | |||
good AQM algorithm like RED with packet-mode drop should be used with | good AQM algorithm like RED with packet-mode drop should be used with | |||
fixed buffer memories where possible. If RED is too complicated to | fixed buffer memories where possible. If RED is too complicated to | |||
implement with multiple fixed buffer pools, the minimum necessary to | implement with multiple fixed buffer pools, the minimum necessary to | |||
prevent large packet lock-out is to ensure smaller packets never use | prevent large packet lock-out is to ensure smaller packets never use | |||
the last available buffer in any of the pools for larger packets. | the last available buffer in any of the pools for larger packets. | |||
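The minimum safeguard just described can be sketched as a buffer selection rule. This is an illustrative sketch under assumed data structures (a dictionary of free buffer counts keyed by buffer size); the function name is hypothetical. A packet may take any free buffer in its own pool, but may borrow from a larger pool only if at least one buffer would remain there afterwards.

```python
def pick_buffer(pkt_bytes, free):
    """Choose a buffer for a packet, or return None to drop it.

    free: {buffer_size_in_bytes: free_buffer_count}, updated in place.
    """
    fitting = sorted(size for size in free if size >= pkt_bytes)
    for i, size in enumerate(fitting):
        # Any pool beyond the smallest one that fits is a borrowed, larger pool.
        borrowing = i > 0
        # A borrower must leave the last buffer for packets that need that size.
        needed = 2 if borrowing else 1
        if free[size] >= needed:
            free[size] -= 1
            return size
    return None
```

For example, with no 128-byte buffers free and two 1500-byte buffers free, one small packet can borrow a 1500-byte buffer, a second small packet is dropped, and a large packet can still be stored.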
3.2.2. Transport Bias when Decoding | 4.2.2. Transport Bias when Decoding | |||
The above proposals to alter the network equipment to bias towards | The above proposals to alter the network equipment to bias towards | |||
smaller packets have largely carried on outside the IETF process | smaller packets have largely carried on outside the IETF process | |||
(unless one counts a reference in an informational RFC to an archived | (unless one counts a reference in an informational RFC to an archived | |||
email!). Whereas, within the IETF, there are many different | email!). Whereas, within the IETF, there are many different | |||
proposals to alter transport protocols to achieve the same goals, | proposals to alter transport protocols to achieve the same goals, | |||
i.e. either to make the flow bit-rate take account of packet size, or | i.e. either to make the flow bit-rate take account of packet size, or | |||
to protect control packets from loss. This memo argues that altering | to protect control packets from loss. This memo argues that altering | |||
transport protocols is the more principled approach. | transport protocols is the more principled approach. | |||
skipping to change at page 17, line 27 | skipping to change at page 21, line 20 | |||
conclusive, instead reporting simulations of many of the | conclusive, instead reporting simulations of many of the | |||
possibilities in order to assess performance but not recommending any | possibilities in order to assess performance but not recommending any | |||
particular course of action. | particular course of action. | |||
The paper originally proposing TFRC with virtual packets (VP-TFRC) | The paper originally proposing TFRC with virtual packets (VP-TFRC) | |||
[CCvarPktSize] proposed that there should perhaps be two variants to | [CCvarPktSize] proposed that there should perhaps be two variants to | |||
cater for the different variants of RED. However, as the TFRC-SP | cater for the different variants of RED. However, as the TFRC-SP | |||
authors point out, there is no way for a transport to know whether | authors point out, there is no way for a transport to know whether | |||
some queues on its path have deployed RED with byte-mode packet drop | some queues on its path have deployed RED with byte-mode packet drop | |||
(except if an exhaustive survey found that no-one has deployed it!-- | (except if an exhaustive survey found that no-one has deployed it!-- | |||
see Section 3.2.4). Incidentally, VP-TFRC also proposed that byte- | see Section 4.2.5). Incidentally, VP-TFRC also proposed that byte- | |||
mode RED dropping should really square the packet size compensation | mode RED dropping should really square the packet size compensation | |||
factor (like that of RED_5, but apparently unaware of it). | factor (like that of Cnodder's RED_5, but apparently unaware of it). | |||
Pre-congestion notification [I-D.ietf-pcn] is a proposal to use a | Pre-congestion notification [RFC5670] is a proposal to use a virtual | |||
virtual queue for AQM marking for packets within one Diffserv class | queue for AQM marking for packets within one Diffserv class in order | |||
in order to give early warning prior to any real queuing. The | to give early warning prior to any real queuing. The proposed PCN | |||
proposed PCN marking algorithms have been designed not to take | marking algorithms have been designed not to take account of packet | |||
account of packet size when forwarding through queues. Instead the | size when forwarding through queues. Instead the general principle | |||
general principle has been to take account of the sizes of marked | has been to take account of the sizes of marked packets when | |||
packets when monitoring the fraction of marking at the edge of the | monitoring the fraction of marking at the edge of the network, as | |||
network. | recommended here. | |||
3.2.3. Making Transports Robust against Control Packet Losses | 4.2.3. Making Transports Robust against Control Packet Losses | |||
Recently, two RFCs have defined changes to TCP that make it more | Recently, two RFCs have defined changes to TCP that make it more | |||
robust against losing small control packets [RFC5562] [RFC5690]. In | robust against losing small control packets [RFC5562] [RFC5690]. In | |||
both cases they note that the case for these TCP changes would be | both cases they note that the case for these two TCP changes would be | |||
weaker if RED were biased against dropping small packets. We argue | weaker if RED were biased against dropping small packets. We argue | |||
here that these two proposals are a safer and more principled way to | here that these two proposals are a safer and more principled way to | |||
achieve TCP performance improvements than reverse engineering RED to | achieve TCP performance improvements than reverse engineering RED to | |||
benefit TCP. | benefit TCP. | |||
Although no proposals exist as far as we know, it would also be | Although no proposals exist as far as we know, it would also be | |||
possible and perfectly valid to make control packets robust against | possible and perfectly valid to make control packets robust against | |||
drop by explicitly requesting a lower drop probability using their | drop by explicitly requesting a lower drop probability using their | |||
Diffserv code point [RFC2474] to request a scheduling class with | Diffserv code point [RFC2474] to request a scheduling class with | |||
lower drop. | lower drop. | |||
The re-ECN protocol proposal [I-D.briscoe-tsvwg-re-ecn-tcp] is | ||||
designed so that transports can be made more robust against losing | ||||
control packets. It gives queues an incentive to optionally give | ||||
preference against drop to packets with the 'feedback not | ||||
established' codepoint in the proposed 'extended ECN' field. Senders | ||||
have incentives to use this codepoint sparingly, but they can use it | ||||
on control packets to reduce their chance of being dropped. For | ||||
instance, the proposed modification to TCP for re-ECN uses this | ||||
codepoint on the SYN and SYN-ACK. | ||||
Although not brought to the IETF, a simple proposal from Wischik | Although not brought to the IETF, a simple proposal from Wischik | |||
[DupTCP] suggests that the first three packets of every TCP flow | [DupTCP] suggests that the first three packets of every TCP flow | |||
should be routinely duplicated after a short delay. It shows that | should be routinely duplicated after a short delay. It shows that | |||
this would greatly improve the chances of short flows completing | this would greatly improve the chances of short flows completing | |||
quickly, but it would hardly increase traffic levels on the Internet, | quickly, but it would hardly increase traffic levels on the Internet, | |||
because Internet bytes have always been concentrated in the large | because Internet bytes have always been concentrated in the large | |||
flows. It further shows that the performance of many typical | flows. It further shows that the performance of many typical | |||
applications depends on completion of long serial chains of short | applications depends on completion of long serial chains of short | |||
messages. It argues that, given most of the value people get from | messages. It argues that, given most of the value people get from | |||
the Internet is concentrated within short flows, this simple | the Internet is concentrated within short flows, this simple | |||
expedient would greatly increase the value of the best efforts | expedient would greatly increase the value of the best efforts | |||
Internet at minimal cost. | Internet at minimal cost. | |||
3.2.4. Congestion Coding: Summary of Status | 4.2.4. Congestion Notification: Summary of Conflicting Advice | |||
+-----------+----------------+-----------------+--------------------+ | +-----------+----------------+-----------------+--------------------+ | |||
| transport | RED_1 (packet | RED_4 (linear | RED_5 (square byte | | | transport | RED_1 (packet | RED_4 (linear | RED_5 (square byte | | |||
| cc | mode drop) | byte mode drop) | mode drop) | | | cc | mode drop) | byte mode drop) | mode drop) | | |||
+-----------+----------------+-----------------+--------------------+ | +-----------+----------------+-----------------+--------------------+ | |||
| TCP or | s/sqrt(p) | sqrt(s/p) | 1/sqrt(p) | | | TCP or | s/sqrt(p) | sqrt(s/p) | 1/sqrt(p) | | |||
| TFRC | | | | | | TFRC | | | | | |||
| TFRC-SP | 1/sqrt(p) | 1/sqrt(sp) | 1/(s.sqrt(p)) | | | TFRC-SP | 1/sqrt(p) | 1/sqrt(sp) | 1/(s.sqrt(p)) | | |||
+-----------+----------------+-----------------+--------------------+ | +-----------+----------------+-----------------+--------------------+ | |||
Table 1: Dependence of flow bit-rate per RTT on packet size s and | Table 1: Dependence of flow bit-rate per RTT on packet size s and | |||
drop rate p when network and/or transport bias towards small packets | drop rate p when network and/or transport bias towards small packets | |||
to varying degrees | to varying degrees | |||
Table 1 aims to summarise the positions we may now be in. Each | Table 1 aims to summarise the potential effects of all the advice | |||
column shows a different possible AQM behaviour in different queues | from different sources. Each column shows a different possible AQM | |||
in the network, using the terminology of Cnodder et al outlined | behaviour in different queues in the network, using the terminology | |||
earlier (RED_1 is basic RED with packet-mode drop). Each row shows a | of Cnodder et al outlined earlier (RED_1 is basic RED with packet- | |||
different transport behaviour: TCP [RFC5681] and TFRC [RFC3448] on | mode drop). Each row shows a different transport behaviour: TCP | |||
the top row with TFRC-SP [RFC4828] below. Suppressing all | [RFC5681] and TFRC [RFC3448] on the top row with TFRC-SP [RFC4828] | |||
inessential details the table shows that independence from packet | below. | |||
size should either be achievable by not altering the TCP transport in | ||||
a RED_5 network, or using the small packet TFRC-SP transport in a | Let us assume that the goal is for the bit-rate of a flow to be | |||
network without any byte-mode dropping RED (top right and bottom | independent of packet size. Suppressing all inessential details, the | |||
left). Top left is the `do nothing' scenario, while bottom right is | table shows that this should either be achievable by not altering the | |||
the `do-both' scenario in which bit-rate would become far too biased | TCP transport in a RED_5 network, or using the small packet TFRC-SP | |||
towards small packets. Of course, if any form of byte-mode dropping | transport (or similar) in a network without any byte-mode dropping | |||
RED has been deployed on a selection of congested queues, each path | RED (top right and bottom left). Top left is the `do nothing' | |||
will present a different hybrid scenario to its transport. | scenario, while bottom right is the `do-both' scenario in which bit- | |||
rate would become far too biased towards small packets. Of course, | ||||
if any form of byte-mode dropping RED has been deployed on a subset | ||||
of queues that congest, each path through the network will present a | ||||
different hybrid scenario to its transport. | ||||
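The entries of Table 1 can be reproduced by tracking exponents. The sketch below is a reconstruction under stated assumptions, not a derivation given in the source: the bit-rate of TCP or TFRC is taken as proportional to s/sqrt(q) and that of TFRC-SP as proportional to 1/sqrt(q), where q is the per-packet drop probability seen by the flow, with q proportional to p for RED_1, p.s for RED_4 and p.s^2 for RED_5. All names are hypothetical.

```python
from fractions import Fraction

# Exponent of s in the per-packet drop probability for each AQM (assumed model).
AQM_S_EXPONENT = {"RED_1": 0, "RED_4": 1, "RED_5": 2}

# Exponent of s in the transport's bit-rate before the drop term is applied.
TRANSPORT_S_EXPONENT = {"TCP or TFRC": 1, "TFRC-SP": 0}


def rate_exponents(transport, aqm):
    """Return (exponent of s, exponent of p) in the flow's bit-rate."""
    # The drop probability enters under a square root in the denominator,
    # so its exponent of s is halved and subtracted.
    s_exp = Fraction(TRANSPORT_S_EXPONENT[transport]) - Fraction(AQM_S_EXPONENT[aqm], 2)
    p_exp = Fraction(-1, 2)
    return s_exp, p_exp


table = {
    (t, a): rate_exponents(t, a)
    for t in TRANSPORT_S_EXPONENT
    for a in AQM_S_EXPONENT
}
```

An exponent of 0 for s means the bit-rate is independent of packet size, which occurs only for TCP or TFRC with RED_5 and for TFRC-SP with RED_1 (top right and bottom left of Table 1).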
Whatever, we can see that the linear byte-mode drop column in the | Whatever, we can see that the linear byte-mode drop column in the | |||
middle considerably complicates the Internet. It's a half-way house | middle considerably complicates the Internet. It's a half-way house | |||
that doesn't bias enough towards small packets even if one believes | that doesn't bias enough towards small packets even if one believes | |||
the network should be doing the biasing. We argue below that _all_ | the network should be doing the biasing. Section 3 recommends that | |||
network layer bias towards small packets should be turned off--if | _all_ bias in network equipment towards small packets should be | |||
indeed any equipment vendors have implemented it--leaving packet size | turned off--if indeed any equipment vendors have implemented it-- | |||
bias solely as the preserve of the transport layer (solely the | leaving packet size bias solely as the preserve of the transport | |||
leftmost, packet-mode drop column). | layer (solely the leftmost, packet-mode drop column). | |||
4.2.5. RED Implementation Status | ||||
A survey has been conducted of 84 vendors to assess how widely drop | A survey has been conducted of 84 vendors to assess how widely drop | |||
probability based on packet size has been implemented in RED. Prior | probability based on packet size has been implemented in RED. Prior | |||
to the survey, an individual approach to Cisco received confirmation | to the survey, an individual approach to Cisco received confirmation | |||
that, having checked the code-base for each of the product ranges, | that, having checked the code-base for each of the product ranges, | |||
Cisco has not implemented any discrimination based on packet size in | Cisco has not implemented any discrimination based on packet size in | |||
any AQM algorithm in any of its products. Also an individual | any AQM algorithm in any of its products. Also an individual | |||
approach to Alcatel-Lucent drew a confirmation that it was very | approach to Alcatel-Lucent drew a confirmation that it was very | |||
likely that none of their products contained RED code that | likely that none of their products contained RED code that | |||
implemented any packet-size bias. | implemented any packet-size bias. | |||
skipping to change at page 19, line 43 | skipping to change at page 23, line 31 | |||
Turning to our more formal survey (Table 2), about 19% of those | Turning to our more formal survey (Table 2), about 19% of those | |||
surveyed have replied so far, giving a sample size of 16. Although | surveyed have replied so far, giving a sample size of 16. Although | |||
we do not have permission to identify the respondents, we can say | we do not have permission to identify the respondents, we can say | |||
that those that have responded include most of the larger vendors, | that those that have responded include most of the larger vendors, | |||
covering a large fraction of the market. They range across the large | covering a large fraction of the market. They range across the large | |||
network equipment vendors at L3 & L2, firewall vendors, wireless | network equipment vendors at L3 & L2, firewall vendors, wireless | |||
equipment vendors, as well as large software businesses with a small | equipment vendors, as well as large software businesses with a small | |||
selection of networking products. So far, all those who have | selection of networking products. So far, all those who have | |||
responded have confirmed that they have not implemented the variant | responded have confirmed that they have not implemented the variant | |||
of RED with drop dependent on packet size (2 were fairly sure they | of RED with drop dependent on packet size (2 were fairly sure they | |||
had not but needed to check more thoroughly). | had not but needed to check more thoroughly). We have established | |||
that Linux does not implement RED with packet size drop bias, | ||||
although we have not investigated a wider range of open source code. | ||||
+-------------------------------+----------------+-----------------+ | +-------------------------------+----------------+-----------------+ | |||
| Response | No. of vendors | %age of vendors | | | Response | No. of vendors | %age of vendors | | |||
+-------------------------------+----------------+-----------------+ | +-------------------------------+----------------+-----------------+ | |||
| Not implemented | 14 | 17% | | | Not implemented | 14 | 17% | | |||
| Not implemented (probably) | 2 | 2% | | | Not implemented (probably) | 2 | 2% | | |||
| Implemented | 0 | 0% | | | Implemented | 0 | 0% | | |||
| No response | 68 | 81% | | | No response | 68 | 81% | | |||
| Total companies/orgs surveyed | 84 | 100% | | | Total companies/orgs surveyed | 84 | 100% | | |||
+-------------------------------+----------------+-----------------+ | +-------------------------------+----------------+-----------------+ | |||
Table 2: Vendor Survey on byte-mode drop variant of RED (lower drop | Table 2: Vendor Survey on byte-mode drop variant of RED (lower drop | |||
probability for small packets) | probability for small packets) | |||
Where reasons have been given, the extra complexity of packet bias | Where reasons have been given, the extra complexity of packet bias | |||
code has been most prevalent, though one vendor had a more principled | code has been most prevalent, though one vendor had a more principled | |||
reason for avoiding it--similar to the argument of this document. We | reason for avoiding it--similar to the argument of this document. | |||
have established that Linux does not implement RED with packet size | ||||
drop bias, although we have not investigated a wider range of open | ||||
source code. | ||||
Finally, we repeat that RED's byte mode drop is not the only way to | Finally, we repeat that RED's byte mode drop SHOULD be disabled, but | |||
bias towards small packets--tail-drop tends to lock-out large packets | active queue management such as RED SHOULD be enabled wherever | |||
very effectively. Our survey was of vendor implementations, so we | possible if we are to eradicate bias towards small packets--without | |||
cannot be certain about operator deployment. But we believe many | any AQM at all, tail-drop tends to lock-out large packets very | |||
queues in the Internet are still tail-drop. The company of one of | effectively. | |||
the co-authors (BT) has widely deployed RED, but there are bound to | ||||
be many tail-drop queues, particularly in access network equipment | Our survey was of vendor implementations, so we cannot be certain | |||
and on middleboxes like firewalls, where RED is not always available. | about operator deployment. But we believe many queues in the | |||
Internet are still tail-drop. The company of one of the co-authors | ||||
(BT) has widely deployed RED, but many tail-drop queues are bound to | ||||
still exist, particularly in access network equipment and on | ||||
middleboxes like firewalls, where RED is not always available. | ||||
Routers using a memory architecture based on fixed size buffers with | Routers using a memory architecture based on fixed size buffers with | |||
borrowing may also still be prevalent in the Internet. As explained | borrowing may also still be prevalent in the Internet. As explained | |||
in Section 3.2.1, these also provide a marginal (but legitimate) bias | in Section 4.2.1, these also provide a marginal (but legitimate) bias | |||
towards small packets. So even though RED byte-mode drop is not | towards small packets. So even though RED byte-mode drop is not | |||
prevalent, it is likely there is still some bias towards small | prevalent, it is likely there is still some bias towards small | |||
packets in the Internet due to tail drop and fixed buffer borrowing. | packets in the Internet due to tail drop and fixed buffer borrowing. | |||
4. Outstanding Issues and Next Steps | 5. Outstanding Issues and Next Steps | |||
4.1. Bit-congestible World | 5.1. Bit-congestible World | |||
For a connectionless network with nearly all resources being bit- | For a connectionless network with nearly all resources being bit- | |||
congestible we believe the recommended position is now unarguably | congestible we believe the recommended position is now unarguably | |||
clear--that the network should not make allowance for packet sizes | clear--that the network should not make allowance for packet sizes | |||
and the transport should. This leaves two outstanding issues: | and the transport should. This leaves two outstanding issues: | |||
o How to handle any legacy of AQM with byte-mode drop already | o How to handle any legacy of AQM with byte-mode drop already | |||
deployed; | deployed; | |||
o The need to start a programme to update transport congestion | o The need to start a programme to update transport congestion | |||
control protocol standards to take account of packet size. | control protocol standards to take account of packet size. | |||
The sample of returns from our vendor survey (Section 3.2.4) suggests | The sample of returns from our vendor survey (Section 4.2.4) suggests | |||
that byte-mode packet drop seems not to be implemented at all let | that byte-mode packet drop seems not to be implemented at all let | |||
alone deployed, or if it is, it is likely to be very sparse. | alone deployed, or if it is, it is likely to be very sparse. | |||
Therefore, we do not really need a migration strategy from all but | Therefore, we do not really need a migration strategy from all but | |||
nothing to nothing. | nothing to nothing. | |||
A programme of standards updates to take account of packet size in | A programme of standards updates to take account of packet size in | |||
transport congestion control protocols has started with TFRC-SP | transport congestion control protocols has started with TFRC-SP | |||
[RFC4828], while weighted TCPs implemented in the research community | [RFC4828], while weighted TCPs implemented in the research community | |||
[WindowPropFair] could form the basis of a future change to TCP | [WindowPropFair] could form the basis of a future change to TCP | |||
congestion control [RFC5681] itself. | congestion control [RFC5681] itself. | |||
4.2. Bit- & Packet-congestible World | 5.2. Bit- & Packet-congestible World | |||
Nonetheless, a connectionless network with both bit-congestible and | Nonetheless, the position is much less clear-cut if the Internet | |||
packet-congestible resources is a different matter. If we believe we | becomes populated by a more even mix of both packet-congestible and | |||
should allow for this possibility in the future, this space contains | bit-congestible resources. If we believe we should allow for this | |||
a truly open research issue. | possibility in the future, this space contains a truly open research | |||
issue. | ||||
We develop the concept of an idealised congestion notification | We develop the concept of an idealised congestion notification | |||
protocol that supports both bit-congestible and packet-congestible | protocol that supports both bit-congestible and packet-congestible | |||
resources in Appendix B. The congestion notification requires at | resources in Appendix A. This congestion notification requires at | |||
least two flags for congestion of bit-congestible and packet- | least two flags for congestion of bit-congestible and packet- | |||
congestible resources. This hides a fundamental problem--much more | congestible resources. This hides a fundamental problem--much more | |||
fundamental than whether we can magically create header space for yet | fundamental than whether we can magically create header space for yet | |||
another ECN flag in IPv4, or whether it would work while being | another ECN flag in IPv4, or whether it would work while being | |||
deployed incrementally. A congestion notification protocol must | deployed incrementally. Distinguishing drop from delivery naturally | |||
survive a transition from low levels of congestion to high. Marking | provides just one congestion flag--it is hard to drop a packet in two | |||
two states is feasible with explicit marking, but much harder if | ways that are distinguishable remotely. This is a similar problem to | |||
packets are dropped. Also, it will not always be cost-effective to | that of distinguishing wireless transmission losses from congestive | |||
implement AQM at every low level resource, so drop will often have to | losses. | |||
suffice. Distinguishing drop from delivery naturally provides just | ||||
one congestion flag--it is hard to drop a packet in two ways that are | This problem would not be solved even if ECN were universally | |||
distinguishable remotely. This is a similar problem to that of | deployed. A congestion notification protocol must survive a | |||
distinguishing wireless transmission losses from congestive losses. | transition from low levels of congestion to high. Marking two states | |||
is feasible with explicit marking, but much harder if packets are | ||||
dropped. Also, it will not always be cost-effective to implement AQM | ||||
at every low level resource, so drop will often have to suffice. | ||||
We should also note that, strictly, packet-congestible resources are | We should also note that, strictly, packet-congestible resources are | |||
actually cycle-congestible because load also depends on the | actually cycle-congestible because load also depends on the | |||
complexity of each look-up and whether the pattern of arrivals is | complexity of each look-up and whether the pattern of arrivals is | |||
amenable to caching or not. Further, this reminds us that any | amenable to caching or not. Further, this reminds us that any | |||
solution must not require a forwarding engine to use excessive | solution must not require a forwarding engine to use excessive | |||
processor cycles in order to decide how to say it has no spare | processor cycles in order to decide how to say it has no spare | |||
processor cycles. | processor cycles. | |||
Recently, the dual resource queue (DRQ) proposal [DRQ] has been made | Recently, the dual resource queue (DRQ) proposal [DRQ] has been made | |||
on the premise that, as network processors become more cost | on the premise that, as network processors become more cost | |||
effective, per packet operations will become more complex | effective, per packet operations will become more complex | |||
(irrespective of whether more function in the network layer is | (irrespective of whether more function in the network is desirable). | |||
desirable). Consequently the premise is that CPU congestion will | Consequently the premise is that CPU congestion will become more | |||
become more common. DRQ is a proposed modification to the RED | common. DRQ is a proposed modification to the RED algorithm that | |||
algorithm that folds both bit congestion and packet congestion into | folds both bit congestion and packet congestion into one signal | |||
one signal (either loss or ECN). | (either loss or ECN). | |||
The problem of signalling packet processing congestion is not | The problem of signalling packet processing congestion is not | |||
pressing, as most Internet resources are designed to be bit- | pressing, as most Internet resources are designed to be bit- | |||
congestible before packet processing starts to congest (see | congestible before packet processing starts to congest (see | |||
Section 1.1). However, the IRTF Internet congestion control research | Section 1.1). However, the IRTF Internet congestion control research | |||
group (ICCRG) has set itself the task of reaching consensus on | group (ICCRG) has set itself the task of reaching consensus on | |||
generic forwarding mechanisms that are necessary and sufficient to | generic forwarding mechanisms that are necessary and sufficient to | |||
support the Internet's future congestion control requirements (the | support the Internet's future congestion control requirements (the | |||
first challenge in [I-D.irtf-iccrg-welzl]). Therefore, rather than | first challenge in [I-D.irtf-iccrg-welzl]). Therefore, rather than | |||
not giving this problem any thought at all, just because it is hard | not giving this problem any thought at all, just because it is hard | |||
and currently hypothetical, we defer the question of whether packet | and currently hypothetical, we defer the question of whether packet | |||
congestion might become common and what to do if it does to the IRTF | congestion might become common and what to do if it does to the IRTF | |||
(the 'Small Packets' challenge in [I-D.irtf-iccrg-welzl]). | (the 'Small Packets' challenge in [I-D.irtf-iccrg-welzl]). | |||
5. Recommendation and Conclusions | ||||
5.1. Recommendation on Queue Measurement | ||||
Queue length is usually the most correct and simplest way to measure | ||||
congestion of a resource. To avoid the pathological effects of drop | ||||
tail, an AQM function can then be used to transform queue length into | ||||
the probability of dropping or marking a packet (e.g. RED's | ||||
piecewise linear function between thresholds). | ||||
If the resource is bit-congestible, the length of the queue SHOULD be | ||||
measured in bytes. If the resource is packet-congestible, the length | ||||
of the queue SHOULD be measured in packets. No other choice makes | ||||
sense, because the number of packets waiting in the queue isn't | ||||
relevant if the resource gets congested by bytes and vice versa. We | ||||
discuss the implications on RED's byte mode and packet mode for | ||||
measuring queue length in Section 3. | ||||
NOTE WELL that RED's byte-mode queue measurement is fine, being | ||||
completely orthogonal to byte-mode drop. If a RED implementation has | ||||
a byte-mode but does not specify what sort of byte-mode, it is most | ||||
probably byte-mode queue measurement, which is fine. However, if in | ||||
doubt, the vendor should be consulted. | ||||
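As an illustration only, the following Python sketch separates the two choices discussed above: the unit in which the queue is measured, and a RED-style piecewise linear ramp that maps the measured length to a drop or marking probability. The function names, thresholds and max_p value are assumptions made for this example and are not taken from any implementation; the averaging of the queue length that RED performs is omitted for brevity.

```python
def queue_length(packet_sizes, bit_congestible):
    """Measure the queue in bytes if the resource is bit-congestible,
    otherwise in packets."""
    return sum(packet_sizes) if bit_congestible else len(packet_sizes)


def red_probability(qlen, min_th, max_th, max_p):
    """Piecewise linear ramp in the style of RED: zero below min_th,
    rising linearly towards max_p at max_th, and 1 at or above max_th."""
    if qlen < min_th:
        return 0.0
    if qlen >= max_th:
        return 1.0
    return max_p * (qlen - min_th) / (max_th - min_th)


# Hypothetical queue holding one large and two small packets (sizes in bytes).
queue = [1500, 40, 40]
qlen_bytes = queue_length(queue, bit_congestible=True)
# Thresholds are expressed in the same unit as the measurement (bytes here).
p = red_probability(qlen_bytes, min_th=1000, max_th=3000, max_p=0.1)
```

In this sketch the probability p would then be applied to each arriving packet regardless of its size. Byte-mode queue measurement therefore does not imply byte-mode drop.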
5.2. Recommendation on Notifying Congestion | ||||
The strong recommendation is that AQM algorithms such as RED SHOULD | ||||
NOT use byte-mode drop. More generally, the Internet's congestion | ||||
notification protocols (drop, ECN & PCN) SHOULD take account of | ||||
packet size when the notification is read by the transport layer, NOT | ||||
when it is written by the network layer. This approach offers | ||||
sufficient and correct congestion information for all known and | ||||
future transport protocols and also ensures no perverse incentives | ||||
are created that would encourage transports to use inappropriately | ||||
small packet sizes. | ||||
The alternative of deflating RED's drop probability for smaller | ||||
packet sizes (byte-mode drop) has no enduring advantages. It is more | ||||
complex, it creates the perverse incentive to fragment segments into | ||||
tiny pieces and it reopens the vulnerability to floods of small- | ||||
packets that drop-tail queues suffered from and AQM was designed to | ||||
remove. | ||||
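To make the incentive concrete, the sketch below compares the per-packet drop probability under packet-mode and byte-mode drop. It assumes one common formulation of byte-mode drop, in which the probability is scaled by the ratio of the packet size to a maximum packet size. The function name, the 1500 byte maximum and the value of p are illustrative assumptions.

```python
def drop_probability(p, size, byte_mode_drop, max_size=1500):
    """Per-packet drop probability. Byte-mode drop (assumed formulation)
    deflates p in proportion to packet size; packet-mode drop ignores size."""
    return p * size / max_size if byte_mode_drop else p


p = 0.1  # illustrative probability produced by the AQM ramp
for size in (1500, 60):
    pkt_mode = drop_probability(p, size, byte_mode_drop=False)
    byte_mode = drop_probability(p, size, byte_mode_drop=True)
    # All packets of a flow have the same size here, so the expected
    # fraction of its bytes lost equals the per-packet drop probability.
    print(size, pkt_mode, byte_mode)
```

Under packet-mode drop both flows lose the same fraction of their bytes. Under the assumed byte-mode scaling, a flow of 60 byte packets loses a fraction 25 times smaller than a flow of 1500 byte packets at the same bit rate, which is the incentive to fragment described above.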
Byte-mode drop is a change to the network layer that makes allowance | ||||
for an omission from the design of TCP, effectively reverse | ||||
engineering the network layer to contrive to make two TCPs with | ||||
different packet sizes run at equal bit rates (rather than packet | ||||
rates) under the same path conditions. | ||||
It also improves TCP performance by reducing the chance that a SYN or | ||||
a pure ACK will be dropped, because they are small. But we SHOULD | ||||
NOT hack the network layer to improve or fix certain transport | ||||
protocols. No matter how predominant a transport protocol is (even | ||||
if it's TCP), trying to correct for its failings by biasing towards | ||||
small packets in the network layer creates a perverse incentive to | ||||
break down all flows from all transports into tiny segments. | ||||
So far, our survey of 84 vendors across the industry has drawn | ||||
responses from about 19%, none of whom have implemented the byte mode | ||||
packet drop variant of RED. Given there appears to be little, if | ||||
any, installed base it seems we can recommend removal of byte-mode | ||||
drop from RED with little, if any, incremental deployment impact. | ||||
If a vendor has implemented byte-mode drop, and an operator has | ||||
turned it on, it is strongly RECOMMENDED that it be turned
off. Note that RED as a whole SHOULD NOT be turned off, as without | ||||
it, a drop tail queue also biases against large packets. But note | ||||
also that turning off byte-mode may alter the relative performance of | ||||
applications using different packet sizes, so it would be advisable | ||||
to establish the implications before turning it off. | ||||
5.3. Recommendation on Responding to Congestion | ||||
Instead of network equipment biasing its congestion notification for | ||||
small packets, the IETF transport area should continue its programme | ||||
of updating congestion control protocols to take account of packet | ||||
size and to make transports less sensitive to losing control packets | ||||
like SYNs and pure ACKs.
5.4. Recommended Future Research | ||||
The above conclusions cater for the Internet as it is today with | ||||
most, if not all, resources being primarily bit-congestible. A | ||||
secondary conclusion of this memo is that we may see more packet- | ||||
congestible resources in the future, so research may be needed to | ||||
extend the Internet's congestion notification (drop or ECN) so that | ||||
it can handle a mix of bit-congestible and packet-congestible | ||||
resources. | ||||
6. Security Considerations | 6. Security Considerations | |||
This draft recommends that queues do not bias drop probability | This draft recommends that queues do not bias drop probability | |||
towards small packets as this creates a perverse incentive for | towards small packets as this creates a perverse incentive for | |||
transports to break down their flows into tiny segments. One of the | transports to break down their flows into tiny segments. One of the | |||
benefits of implementing AQM was meant to be to remove this perverse | benefits of implementing AQM was meant to be to remove this perverse | |||
incentive that drop-tail queues gave to small packets. Of course, if | incentive that drop-tail queues gave to small packets. Of course, if | |||
transports really want to make the greatest gains, they don't have to | transports really want to make the greatest gains, they don't have to | |||
respond to congestion anyway. But we don't want applications that | respond to congestion anyway. But we don't want applications that | |||
are trying to behave to discover that they can go faster by using | are trying to behave to discover that they can go faster by using | |||
skipping to change at page 24, line 52 | skipping to change at page 26, line 43 | |||
If most queues implemented AQM with byte-mode drop, the resulting | If most queues implemented AQM with byte-mode drop, the resulting | |||
network would amplify the potency of a small packet DDoS attack. At | network would amplify the potency of a small packet DDoS attack. At | |||
the first queue the stream of packets would push aside a greater | the first queue the stream of packets would push aside a greater | |||
proportion of large packets, so more of the small packets would | proportion of large packets, so more of the small packets would | |||
survive to attack the next queue. Thus a flood of small packets | survive to attack the next queue. Thus a flood of small packets | |||
would continue on towards the destination, pushing regular traffic | would continue on towards the destination, pushing regular traffic | |||
with large packets out of the way in one queue after the next, but | with large packets out of the way in one queue after the next, but | |||
suffering much less drop itself. | suffering much less drop itself. | |||
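The compounding effect can be sketched numerically. The example below assumes independent drops at each queue, the same underlying probability p at every hop, and byte-mode drop that scales p by packet size over a 1500 byte maximum. All of these, and the function name, are simplifying assumptions for illustration only.

```python
def surviving_fraction(p, size, hops, byte_mode_drop, max_size=1500):
    """Fraction of packets of a given size expected to survive `hops`
    congested queues, assuming independent drops at each queue."""
    per_hop = p * size / max_size if byte_mode_drop else p
    return (1.0 - per_hop) ** hops


# Illustrative values: 20% underlying drop probability, 5 congested queues.
small = surviving_fraction(0.2, 60, hops=5, byte_mode_drop=True)
large = surviving_fraction(0.2, 1500, hops=5, byte_mode_drop=True)
```

With these numbers roughly 96% of the 60 byte packets survive all five queues, but only about 33% of the 1500 byte packets do. With packet-mode drop the two fractions would be equal.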
Appendix C explains why the ability of networks to police the | Appendix B explains why the ability of networks to police the | |||
response of _any_ transport to congestion depends on bit-congestible | response of _any_ transport to congestion depends on bit-congestible | |||
network resources only doing packet-mode not byte-mode drop. In | network resources only doing packet-mode not byte-mode drop. In | |||
summary, it says that making drop probability depend on the size of | summary, it says that making drop probability depend on the size of | |||
the packets that bits happen to be divided into simply encourages the | the packets that bits happen to be divided into simply encourages the | |||
bits to be divided into smaller packets. Byte-mode drop would | bits to be divided into smaller packets. Byte-mode drop would | |||
therefore irreversibly complicate any attempt to fix the Internet's | therefore irreversibly complicate any attempt to fix the Internet's | |||
incentive structures. | incentive structures. | |||
7. Acknowledgements | 7. Conclusions | |||
This memo strongly recommends that the size of an individual packet | ||||
that is dropped or marked should only be taken into account when a | ||||
transport reads this as a congestion indication, not when network | ||||
equipment writes it. The memo therefore strongly deprecates using | ||||
RED's byte-mode of packet drop in network equipment. | ||||
Whether network equipment should measure the length of a queue by | ||||
counting bytes or counting packets is a different question to whether | ||||
it should take into account the size of each packet being dropped or | ||||
marked. The answer depends on whether the network resource is | ||||
congested respectively by bytes or by packets. This means that RED's | ||||
byte-mode queue measurement will often be appropriate even though | ||||
byte-mode drop is strongly deprecated. | ||||
At the transport layer the IETF should continue updating congestion | ||||
control protocols to take account of the size of each packet that | ||||
indicates congestion. Also the IETF should continue to make | ||||
transports less sensitive to losing control packets like SYNs, pure | ||||
ACKs and DNS exchanges. Although many control packets happen to be | ||||
small, the alternative of network equipment favouring all small | ||||
packets would be dangerous. That would create perverse incentives to | ||||
split data transfers into smaller packets. | ||||
The memo develops these recommendations from principled arguments | ||||
concerning scaling, layering, incentives, inherent efficiency, | ||||
security and policability. But it also addresses practical issues | ||||
such as specific buffer architectures and incremental deployment. | ||||
Indeed a limited survey of RED implementations is included, which | ||||
shows there appears to be little, if any, installed base of RED's | ||||
byte-mode drop. Therefore it can be deprecated with little, if any, | ||||
incremental deployment complications. | ||||
The recommendations have been developed on the well-founded basis | ||||
that most Internet resources are bit-congestible not packet- | ||||
congestible. We need to know the likelihood that this assumption | ||||
will prevail longer term and, if it might not, what protocol changes | ||||
will be needed to cater for a mix of the two. These questions have | ||||
been delegated to the IRTF. | ||||
8. Acknowledgements | ||||
Thank you to Sally Floyd, who gave extensive and useful review | Thank you to Sally Floyd, who gave extensive and useful review | |||
comments. Also thanks for the reviews from Philip Eardley, Toby | comments. Also thanks for the reviews from Philip Eardley, Toby | |||
Moncaster and Arnaud Jacquet as well as helpful explanations of | Moncaster and Arnaud Jacquet as well as helpful explanations of | |||
different hardware approaches from Larry Dunn and Fred Baker. I am | different hardware approaches from Larry Dunn and Fred Baker. I am | |||
grateful to Bruce Davie and his colleagues for providing a timely and | grateful to Bruce Davie and his colleagues for providing a timely and | |||
efficient survey of RED implementation in Cisco's product range. | efficient survey of RED implementation in Cisco's product range. | |||
Also grateful thanks to Toby Moncaster, Will Dormann, John Regnault, | Also grateful thanks to Toby Moncaster, Will Dormann, John Regnault, | |||
Simon Carter and Stefaan De Cnodder who further helped survey the | Simon Carter and Stefaan De Cnodder who further helped survey the | |||
current status of RED implementation and deployment and, finally, | current status of RED implementation and deployment and, finally, | |||
thanks to the anonymous individuals who responded. | thanks to the anonymous individuals who responded. | |||
Bob Briscoe and Jukka Manner are partly funded by Trilogy, a research | Bob Briscoe and Jukka Manner are partly funded by Trilogy, a research | |||
project (ICT-216372) supported by the European Community under its | project (ICT-216372) supported by the European Community under its | |||
Seventh Framework Programme. The views expressed here are those of | Seventh Framework Programme. The views expressed here are those of | |||
the authors only. | the authors only. | |||
8. Comments Solicited | 9. Comments Solicited | |||
Comments and questions are encouraged and very welcome. They can be | Comments and questions are encouraged and very welcome. They can be | |||
addressed to the IETF Transport Area working group mailing list | addressed to the IETF Transport Area working group mailing list | |||
<tsvwg@ietf.org>, and/or to the authors. | <tsvwg@ietf.org>, and/or to the authors. | |||
9. References | 10. References | |||
9.1. Normative References | ||||
[RFC2119] Bradner, S., "Key words for use in | ||||
RFCs to Indicate Requirement Levels", | ||||
BCP 14, RFC 2119, March 1997. | ||||
[RFC2309] Braden, B., Clark, D., Crowcroft, J., | ||||
Davie, B., Deering, S., Estrin, D., | ||||
Floyd, S., Jacobson, V., Minshall, | ||||
G., Partridge, C., Peterson, L., | ||||
Ramakrishnan, K., Shenker, S., | ||||
Wroclawski, J., and L. Zhang, | ||||
"Recommendations on Queue Management | ||||
and Congestion Avoidance in the | ||||
Internet", RFC 2309, April 1998. | ||||
[RFC3168] Ramakrishnan, K., Floyd, S., and D. | ||||
Black, "The Addition of Explicit | ||||
Congestion Notification (ECN) to IP", | ||||
RFC 3168, September 2001. | ||||
[RFC3426] Floyd, S., "General Architectural and | ||||
Policy Considerations", RFC 3426, | ||||
November 2002. | ||||
[RFC5033] Floyd, S. and M. Allman, "Specifying | 10.1. Normative References | |||
New Congestion Control Algorithms", | ||||
BCP 133, RFC 5033, August 2007. | ||||
9.2. Informative References | [RFC2119] Bradner, S., "Key words for use in RFCs | |||
to Indicate Requirement Levels", BCP 14, | ||||
RFC 2119, March 1997. | ||||
[CCvarPktSize] Widmer, J., Boutremans, C., and J-Y. | [RFC2309] Braden, B., Clark, D., Crowcroft, J., | |||
Le Boudec, "Congestion Control for | Davie, B., Deering, S., Estrin, D., | |||
Flows with Variable Packet Size", ACM | Floyd, S., Jacobson, V., Minshall, G., | |||
CCR 34(2) 137--151, 2004, <http:// | Partridge, C., Peterson, L., | |||
doi.acm.org/10.1145/997150.997162>. | Ramakrishnan, K., Shenker, S., | |||
Wroclawski, J., and L. Zhang, | ||||
"Recommendations on Queue Management and | ||||
Congestion Avoidance in the Internet", | ||||
RFC 2309, April 1998. | ||||
[DRQ] Shin, M., Chong, S., and I. Rhee, | [RFC3168] Ramakrishnan, K., Floyd, S., and D. | |||
"Dual-Resource TCP/AQM for | Black, "The Addition of Explicit | |||
Processing-Constrained Networks", | Congestion Notification (ECN) to IP", | |||
IEEE/ACM Transactions on | RFC 3168, September 2001. | |||
Networking Vol 16, issue 2, | ||||
April 2008, <http://dx.doi.org/ | ||||
10.1109/TNET.2007.900415>. | ||||
[DupTCP] Wischik, D., "Short messages", Royal | [RFC3426] Floyd, S., "General Architectural and | |||
Society workshop on networks: | Policy Considerations", RFC 3426, | |||
modelling and control , | November 2002. | |||
September 2007, <http:// | ||||
www.cs.ucl.ac.uk/staff/ucacdjw/ | ||||
Research/shortmsg.html>. | ||||
[ECNFixedWireless] Siris, V., "Resource Control for | [RFC5033] Floyd, S. and M. Allman, "Specifying New | |||
Elastic Traffic in CDMA Networks", | Congestion Control Algorithms", BCP 133, | |||
Proc. ACM MOBICOM'02 , | RFC 5033, August 2007. | |||
September 2002, <http:// | ||||
www.ics.forth.gr/netlab/publications/ | ||||
resource_control_elastic_cdma.html>. | ||||
[Evol_cc] Gibbens, R. and F. Kelly, "Resource | 10.2. Informative References | |||
pricing and the evolution of | ||||
congestion control", | ||||
Automatica 35(12)1969--1985, | ||||
December 1999, <http:// | ||||
www.statslab.cam.ac.uk/~frank/ | ||||
evol.html>. | ||||
[I-D.briscoe-tsvwg-re-ecn-tcp] Briscoe, B., Jacquet, A., Moncaster, | [CCvarPktSize] Widmer, J., Boutremans, C., and J-Y. Le | |||
T., and A. Smith, "Re-ECN: Adding | Boudec, "Congestion Control for Flows | |||
Accountability for Causing Congestion | with Variable Packet Size", ACM CCR 34(2) | |||
to TCP/IP", | 137--151, 2004, <http://doi.acm.org/ | |||
draft-briscoe-tsvwg-re-ecn-tcp-08 | 10.1145/997150.997162>. | |||
(work in progress), September 2009. | ||||
[I-D.ietf-pcn] Eardley, P., "Metering and marking | [DRQ] Shin, M., Chong, S., and I. Rhee, "Dual- | |||
behaviour of PCN-nodes", | Resource TCP/AQM for Processing- | |||
draft-ietf-pcn-marking-behaviour-05 | Constrained Networks", IEEE/ACM | |||
(work in progress), August 2009. | Transactions on Networking Vol 16, issue | |||
2, April 2008, <http://dx.doi.org/ | ||||
10.1109/TNET.2007.900415>. | ||||
[I-D.irtf-iccrg-welzl] Welzl, M., Scharf, M., Briscoe, B., | [DupTCP] Wischik, D., "Short messages", Royal | |||
and D. Papadimitriou, "Open Research | Society workshop on networks: modelling | |||
Issues in Internet Congestion | and control , September 2007, <http:// | |||
Control", draft-irtf-iccrg-welzl- | www.cs.ucl.ac.uk/staff/ucacdjw/Research/ | |||
congestion-control-open-research-07 | shortmsg.html>. | |||
(work in progress), June 2010. | ||||
[IOSArch] Bollapragada, V., White, R., and C. | [ECNFixedWireless] Siris, V., "Resource Control for Elastic | |||
Murphy, "Inside Cisco IOS Software | Traffic in CDMA Networks", Proc. ACM | |||
Architecture", Cisco Press: CCIE | MOBICOM'02 , September 2002, <http:// | |||
Professional Development ISBN13: 978- | www.ics.forth.gr/netlab/publications/ | |||
1-57870-181-0, July 2000. | resource_control_elastic_cdma.html>. | |||
[MulTCP] Crowcroft, J. and Ph. Oechslin, | [Evol_cc] Gibbens, R. and F. Kelly, "Resource | |||
"Differentiated End to End Internet | pricing and the evolution of congestion | |||
Services using a Weighted | control", Automatica 35(12)1969--1985, | |||
Proportional Fair Sharing TCP", | December 1999, <http:// | |||
CCR 28(3) 53--69, July 1998, <http:// | www.statslab.cam.ac.uk/~frank/evol.html>. | |||
www.cs.ucl.ac.uk/staff/J.Crowcroft/ | ||||
hipparch/pricing.html>. | ||||
[PktSizeEquCC] Vasallo, P., "Variable Packet Size | [I-D.conex-concepts-uses] Briscoe, B., Woundy, R., Moncaster, T., | |||
Equation-Based Congestion Control", | and J. Leslie, "ConEx Concepts and Use | |||
ICSI Technical Report tr-00-008, | Cases", | |||
2000, <http://http.icsi.berkeley.edu/ | draft-moncaster-conex-concepts-uses-01 | |||
ftp/global/pub/techreports/2000/ | (work in progress), July 2010. | |||
tr-00-008.pdf>. | ||||
[RED93] Floyd, S. and V. Jacobson, "Random | [I-D.ietf-avt-ecn-for-rtp] Westerlund, M., Johansson, I., Perkins, | |||
Early Detection (RED) gateways for | C., and K. Carlberg, "Explicit Congestion | |||
Congestion Avoidance", IEEE/ACM | Notification (ECN) for RTP over UDP", | |||
Transactions on Networking 1(4) 397-- | draft-ietf-avt-ecn-for-rtp-02 (work in | |||
413, August 1993, <http:// | progress), July 2010. | |||
www.icir.org/floyd/papers/red/ | ||||
red.html>. | ||||
[REDbias] Eddy, W. and M. Allman, "A Comparison | [I-D.irtf-iccrg-welzl] Welzl, M., Scharf, M., Briscoe, B., and | |||
of RED's Byte and Packet Modes", | D. Papadimitriou, "Open Research Issues | |||
Computer Networks 42(3) 261--280, | in Internet Congestion Control", draft- | |||
June 2003, <http://www.ir.bbn.com/ | irtf-iccrg-welzl-congestion-control-open- | |||
documents/articles/redbias.ps>. | research-08 (work in progress), | |||
September 2010. | ||||
[REDbyte] De Cnodder, S., Elloumi, O., and K. | [IOSArch] Bollapragada, V., White, R., and C. | |||
Pauwels, "RED behavior with different | Murphy, "Inside Cisco IOS Software | |||
packet sizes", Proc. 5th IEEE | Architecture", Cisco Press: CCIE | |||
Symposium on Computers and | Professional Development ISBN13: 978-1- | |||
Communications (ISCC) 793--799, | 57870-181-0, July 2000. | |||
July 2000, <http://www.icir.org/ | ||||
floyd/red/Elloumi99.pdf>. | ||||
[RFC2474] Nichols, K., Blake, S., Baker, F., | [MulTCP] Crowcroft, J. and Ph. Oechslin, | |||
and D. Black, "Definition of the | "Differentiated End to End Internet | |||
Differentiated Services Field (DS | Services using a Weighted Proportional | |||
Field) in the IPv4 and IPv6 Headers", | Fair Sharing TCP", CCR 28(3) 53--69, | |||
RFC 2474, December 1998. | July 1998, <http://www.cs.ucl.ac.uk/ | |||
staff/J.Crowcroft/hipparch/pricing.html>. | ||||
[RFC3448] Handley, M., Floyd, S., Padhye, J., | [PktSizeEquCC] Vasallo, P., "Variable Packet Size | |||
and J. Widmer, "TCP Friendly Rate | Equation-Based Congestion Control", ICSI | |||
Control (TFRC): Protocol | Technical Report tr-00-008, 2000, <http:/ | |||
Specification", RFC 3448, | /http.icsi.berkeley.edu/ftp/global/pub/ | |||
January 2003. | techreports/2000/tr-00-008.pdf>. | |||
[RFC3714] Floyd, S. and J. Kempf, "IAB Concerns | [RED93] Floyd, S. and V. Jacobson, "Random Early | |||
Regarding Congestion Control for | Detection (RED) gateways for Congestion | |||
Voice Traffic in the Internet", | Avoidance", IEEE/ACM Transactions on | |||
RFC 3714, March 2004. | Networking 1(4) 397--413, August 1993, | |||
<http://www.icir.org/floyd/papers/red/ | ||||
red.html>. | ||||
[RFC4782] Floyd, S., Allman, M., Jain, A., and | [REDbias] Eddy, W. and M. Allman, "A Comparison of | |||
P. Sarolahti, "Quick-Start for TCP | RED's Byte and Packet Modes", Computer | |||
and IP", RFC 4782, January 2007. | Networks 42(3) 261--280, June 2003, | |||
<http://www.ir.bbn.com/documents/articles/ | ||||
redbias.ps>. | ||||
[RFC4828] Floyd, S. and E. Kohler, "TCP | [REDbyte] De Cnodder, S., Elloumi, O., and K. | |||
Friendly Rate Control (TFRC): The | Pauwels, "RED behavior with different | |||
Small-Packet (SP) Variant", RFC 4828, | packet sizes", Proc. 5th IEEE Symposium | |||
April 2007. | on Computers and Communications | |||
(ISCC) 793--799, July 2000, <http:// | ||||
www.icir.org/floyd/red/Elloumi99.pdf>. | ||||
[RFC5562] Kuzmanovic, A., Mondal, A., Floyd, | [RFC2474] Nichols, K., Blake, S., Baker, F., and D. | |||
S., and K. Ramakrishnan, "Adding | Black, "Definition of the Differentiated | |||
Explicit Congestion Notification | Services Field (DS Field) in the IPv4 and | |||
(ECN) Capability to TCP's SYN/ACK | IPv6 Headers", RFC 2474, December 1998. | |||
Packets", RFC 5562, June 2009. | ||||
[RFC5670] Eardley, P., "Metering and Marking | [RFC3448] Handley, M., Floyd, S., Padhye, J., and | |||
Behaviour of PCN-Nodes", RFC 5670, | J. Widmer, "TCP Friendly Rate Control | |||
November 2009. | (TFRC): Protocol Specification", | |||
RFC 3448, January 2003. | ||||
[RFC5681] Allman, M., Paxson, V., and E. | [RFC3714] Floyd, S. and J. Kempf, "IAB Concerns | |||
Blanton, "TCP Congestion Control", | Regarding Congestion Control for Voice | |||
RFC 5681, September 2009. | Traffic in the Internet", RFC 3714, | |||
March 2004. | ||||
[RFC5690] Floyd, S., Arcia, A., Ros, D., and J. | [RFC4828] Floyd, S. and E. Kohler, "TCP Friendly | |||
Iyengar, "Adding Acknowledgement | Rate Control (TFRC): The Small-Packet | |||
Congestion Control to TCP", RFC 5690, | (SP) Variant", RFC 4828, April 2007. | |||
February 2010. | ||||
[Rate_fair_Dis] Briscoe, B., "Flow Rate Fairness: | [RFC5562] Kuzmanovic, A., Mondal, A., Floyd, S., | |||
Dismantling a Religion", ACM | and K. Ramakrishnan, "Adding Explicit | |||
CCR 37(2) 63--74, April 2007, <http:// | Congestion Notification (ECN) Capability | |||
portal.acm.org/ | to TCP's SYN/ACK Packets", RFC 5562, | |||
citation.cfm?id=1232926>. | June 2009. | |||
[WindowPropFair] Siris, V., "Service Differentiation | [RFC5670] Eardley, P., "Metering and Marking | |||
and Performance of Weighted Window- | Behaviour of PCN-Nodes", RFC 5670, | |||
Based Congestion Control and Packet | November 2009. | |||
Marking Algorithms in ECN Networks", | ||||
Computer Communications 26(4) 314-- | ||||
326, 2002, <http://www.ics.forth.gr/ | ||||
netgroup/publications/ | ||||
weighted_window_control.html>. | ||||
[gentle_RED] Floyd, S., "Recommendation on using | [RFC5681] Allman, M., Paxson, V., and E. Blanton, | |||
the "gentle_" variant of RED", Web | "TCP Congestion Control", RFC 5681, | |||
page , March 2000, <http:// | September 2009. | |||
www.icir.org/floyd/red/gentle.html>. | ||||
[pBox] Floyd, S. and K. Fall, "Promoting the | [RFC5690] Floyd, S., Arcia, A., Ros, D., and J. | |||
Use of End-to-End Congestion Control | Iyengar, "Adding Acknowledgement | |||
in the Internet", IEEE/ACM | Congestion Control to TCP", RFC 5690, | |||
Transactions on Networking 7(4) 458-- | February 2010. | |||
472, August 1999, <http:// | ||||
www.aciri.org/floyd/ | ||||
end2end-paper.html>. | ||||
[pktByteEmail] Yes and J. Doe, "Missing for now", | [Rate_fair_Dis] Briscoe, B., "Flow Rate Fairness: | |||
RFC 0000, May 2006. | Dismantling a Religion", ACM | |||
CCR 37(2) 63--74, April 2007, <http:// | ||||
portal.acm.org/citation.cfm?id=1232926>. | ||||
[xcp-spec] Falk, A., "Specification for the | [WindowPropFair] Siris, V., "Service Differentiation and | |||
Explicit Control Protocol (XCP)", | Performance of Weighted Window-Based | |||
draft-falk-xcp-spec-03 (work in | Congestion Control and Packet Marking | |||
progress), July 2007. | Algorithms in ECN Networks", Computer | |||
Communications 26(4) 314--326, 2002, | ||||
<http://www.ics.forth.gr/netgroup/ | ||||
publications/ | ||||
weighted_window_control.html>. | ||||
Appendix A. Congestion Notification Definition: Further Justification | [gentle_RED] Floyd, S., "Recommendation on using the | |||
"gentle_" variant of RED", Web page , | ||||
March 2000, <http://www.icir.org/floyd/ | ||||
red/gentle.html>. | ||||
In Section 1.1 on the definition of congestion notification, load not | [pBox] Floyd, S. and K. Fall, "Promoting the Use | |||
capacity was used as the denominator. This also has a subtle | of End-to-End Congestion Control in the | |||
significance in the related debate over the design of new transport | Internet", IEEE/ACM Transactions on | |||
protocols--typical new protocol designs (e.g. in XCP [xcp-spec] & | Networking 7(4) 458--472, August 1999, | |||
Quickstart [RFC4782]) expect the sending transport to communicate its | <http://www.aciri.org/floyd/ | |||
desired flow rate to the network and network elements to | end2end-paper.html>. | |||
progressively subtract from this so that the achievable flow rate | ||||
emerges at the receiving transport. | ||||
Congestion notification with total load in the denominator can serve | [pktByteEmail] Floyd, S., "RED: Discussions of Byte and | |||
a similar purpose (though in retrospect not in advance like XCP & | Packet Modes", email , March 1997, <http: | |||
QuickStart). Congestion notification is a dimensionless fraction but | Packet Modes", email, March 1997, | |||
each source can extract necessary rate information from it because it | <http://www-nrg.ee.lbl.gov/floyd/ | |||
already knows what its own rate is. Even though congestion | ||||
notification doesn't communicate a rate explicitly, from each | ||||
source's point of view congestion notification represents the | ||||
fraction of the rate it was sending a round trip ago that couldn't | ||||
(or wouldn't) be served by available resources. | ||||
Appendix B. Idealised Wire Protocol | Appendix A. Idealised Wire Protocol | |||
We will start by inventing an idealised congestion notification | We will start by inventing an idealised congestion notification | |||
protocol before discussing how to make it practical. The idealised | protocol before discussing how to make it practical. The idealised | |||
protocol is shown to be correct using examples later in this | protocol is shown to be correct using examples later in this | |||
appendix. | appendix. | |||
B.1. Protocol Coding | A.1. Protocol Coding | |||
Congestion notification involves the congested resource coding a | Congestion notification involves the congested resource coding a | |||
congestion notification signal into the packet stream and the | congestion notification signal into the packet stream and the | |||
transports decoding it. The idealised protocol uses two different | transports decoding it. The idealised protocol uses two different | |||
(imaginary) fields in each datagram to signal congestion: one for | (imaginary) fields in each datagram to signal congestion: one for | |||
byte congestion and one for packet congestion. | byte congestion and one for packet congestion. | |||
We are not saying two ECN fields will be needed (and we are not | We are not saying two ECN fields will be needed (and we are not | |||
saying that somehow a resource should be able to drop a packet in one | saying that somehow a resource should be able to drop a packet in one | |||
of two different ways so that the transport can distinguish which | of two different ways so that the transport can distinguish which | |||
skipping to change at page 31, line 18 | skipping to change at page 33, line 9 | |||
distinguish between bit and packet congestion [RFC3714]. Currently, | distinguish between bit and packet congestion [RFC3714]. Currently, | |||
packet-congestion is not the common case, but there is no guarantee | packet-congestion is not the common case, but there is no guarantee | |||
that it will not become common with future technology trends. | that it will not become common with future technology trends. | |||
The idealised wire protocol is given below. It accounts for packet | The idealised wire protocol is given below. It accounts for packet | |||
sizes at the transport layer, not in the network, and then only in | sizes at the transport layer, not in the network, and then only in | |||
the case of bit-congestible resources. This avoids the perverse | the case of bit-congestible resources. This avoids the perverse | |||
incentive to send smaller packets and the DoS vulnerability that | incentive to send smaller packets and the DoS vulnerability that | |||
would otherwise result if the network were to bias towards them (see | would otherwise result if the network were to bias towards them (see | |||
the motivating argument about avoiding perverse incentives in | the motivating argument about avoiding perverse incentives in | |||
Section 2.2): | Section 2.3): | |||
1. A packet-congestible resource trying to code congestion level p_p | 1. A packet-congestible resource trying to code congestion level p_p | |||
into a packet stream should mark the idealised `packet | into a packet stream should mark the idealised `packet | |||
congestion' field in each packet with probability p_p | congestion' field in each packet with probability p_p | |||
irrespective of the packet's size. The transport should then | irrespective of the packet's size. The transport should then | |||
take a packet with the packet congestion field marked to mean | take a packet with the packet congestion field marked to mean | |||
just one mark, irrespective of the packet size. | just one mark, irrespective of the packet size. | |||
2. A bit-congestible resource trying to code time-varying byte- | 2. A bit-congestible resource trying to code time-varying byte- | |||
congestion level p_b into a packet stream should mark the `byte | congestion level p_b into a packet stream should mark the `byte | |||
congestion' field in each packet with probability p_b, again | congestion' field in each packet with probability p_b, again | |||
irrespective of the packet's size. Unlike before, the transport | irrespective of the packet's size. Unlike before, the transport | |||
should take a packet with the byte congestion field marked to | should take a packet with the byte congestion field marked to | |||
count as a mark on each byte in the packet. | count as a mark on each byte in the packet. | |||
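The two decoding rules above can be sketched from the transport's point of view. This is only an illustration under stated assumptions: the Packet record, its two marking fields and the decode() helper are hypothetical names invented here, mirroring the two imaginary fields of the idealised protocol. The estimates are computed as marked over total, in bytes for byte congestion and in packets for packet congestion.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    size: int           # packet size in bytes
    byte_marked: bool   # idealised 'byte congestion' field (hypothetical)
    pkt_marked: bool    # idealised 'packet congestion' field (hypothetical)


def decode(packets):
    """Decode congestion as seen by a transport over a batch of packets."""
    total_bytes = sum(p.size for p in packets)
    # Rule 2: a byte-congestion mark counts as a mark on every byte.
    marked_bytes = sum(p.size for p in packets if p.byte_marked)
    # Rule 1: a packet-congestion mark counts as one mark, whatever the size.
    marked_pkts = sum(1 for p in packets if p.pkt_marked)
    return {
        "byte_congestion_volume": marked_bytes,   # unit: byte
        "pkt_congestion_volume": marked_pkts,     # unit: packet
        "p_b_estimate": marked_bytes / total_bytes if total_bytes else 0.0,
        "p_p_estimate": marked_pkts / len(packets) if packets else 0.0,
    }
```

An absolute-target-based transport would use the two volume entries, while a ratio-based transport would use the two estimates.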
The worked examples in Appendix B.2 show that transports can extract | The worked examples in Appendix A.2 show that transports can extract | |||
sufficient and correct congestion notification from these protocols | sufficient and correct congestion notification from these protocols | |||
for cases when two flows with different packet sizes have matching | for cases when two flows with different packet sizes have matching | |||
bit rates or matching packet rates. Examples are also given that mix | bit rates or matching packet rates. Examples are also given that mix | |||
these two flows into one to show that a flow with mixed packet sizes | these two flows into one to show that a flow with mixed packet sizes | |||
would still be able to extract sufficient and correct information. | would still be able to extract sufficient and correct information. | |||
Sufficient and correct congestion information means that there is | Sufficient and correct congestion information means that there is | |||
sufficient information for the two different types of transport | sufficient information for the two different types of transport | |||
requirements: | requirements: | |||
skipping to change at page 32, line 16 | skipping to change at page 34, line 8 | |||
Absolute-target-based: Other congestion controls proposed in the | Absolute-target-based: Other congestion controls proposed in the | |||
research community aim to limit the volume of congestion caused to | research community aim to limit the volume of congestion caused to | |||
a constant weight parameter. [MulTCP][WindowPropFair] are | a constant weight parameter. [MulTCP][WindowPropFair] are | |||
examples of weighted proportionally fair transports designed for | examples of weighted proportionally fair transports designed for | |||
cost-fair environments [Rate_fair_Dis]. In this case, the | cost-fair environments [Rate_fair_Dis]. In this case, the | |||
transport requires a count (not a ratio) of dropped/marked bytes | transport requires a count (not a ratio) of dropped/marked bytes | |||
in the bit-congestible case and of dropped/marked packets in the | in the bit-congestible case and of dropped/marked packets in the | |||
packet congestible case. | packet congestible case. | |||
B.2. Example Scenarios | A.2. Example Scenarios | |||
B.2.1. Notation | A.2.1. Notation | |||
To prove our idealised wire protocol (Appendix B.1) is correct, we | To prove our idealised wire protocol (Appendix A.1) is correct, we | |||
will compare two flows with different packet sizes, s_1 and s_2 | will compare two flows with different packet sizes, s_1 and s_2 | |||
[bit/pkt], to make sure their transports each see the correct congestion | [bit/pkt], to make sure their transports each see the correct congestion | |||
notification. Initially, within each flow we will take all packets | notification. Initially, within each flow we will take all packets | |||
as having equal sizes, but later we will generalise to flows within | as having equal sizes, but later we will generalise to flows within | |||
which packet sizes vary. A flow's bit rate, x [bit/s], is related to | which packet sizes vary. A flow's bit rate, x [bit/s], is related to | |||
its packet rate, u [pkt/s], by | its packet rate, u [pkt/s], by | |||
x(t) = s.u(t). | x(t) = s.u(t). | |||
We will consider a 2x2 matrix of four scenarios: | We will consider a 2x2 matrix of four scenarios: | |||
skipping to change at page 32, line 42 | skipping to change at page 34, line 34 | |||
+-----------------------------+------------------+------------------+ | +-----------------------------+------------------+------------------+ | |||
| resource type and | A) Equal bit | B) Equal pkt | | | resource type and | A) Equal bit | B) Equal pkt | | |||
| congestion level | rates | rates | | | congestion level | rates | rates | | |||
+-----------------------------+------------------+------------------+ | +-----------------------------+------------------+------------------+ | |||
| i) bit-congestible, p_b | (Ai) | (Bi) | | | i) bit-congestible, p_b | (Ai) | (Bi) | | |||
| ii) pkt-congestible, p_p | (Aii) | (Bii) | | | ii) pkt-congestible, p_p | (Aii) | (Bii) | | |||
+-----------------------------+------------------+------------------+ | +-----------------------------+------------------+------------------+ | |||
Table 3 | Table 3 | |||
B.2.2. Bit-congestible resource, equal bit rates (Ai) | A.2.2. Bit-congestible resource, equal bit rates (Ai) | |||
Starting with the bit-congestible scenario, for two flows to maintain | Starting with the bit-congestible scenario, for two flows to maintain | |||
equal bit rates (Ai) the ratio of the packet rates must be the | equal bit rates (Ai) the ratio of the packet rates must be the | |||
inverse of the ratio of packet sizes: u_2/u_1 = s_1/s_2. So, for | inverse of the ratio of packet sizes: u_2/u_1 = s_1/s_2. So, for | |||
instance, a flow of 60B packets would have to send 25x more packets | instance, a flow of 60B packets would have to send 25x more packets | |||
to achieve the same bit rate as a flow of 1500B packets. If a | to achieve the same bit rate as a flow of 1500B packets. If a | |||
congested resource marks proportion p_b of packets irrespective of | congested resource marks proportion p_b of packets irrespective of | |||
size, the ratio of marked packets received by each transport will | size, the ratio of marked packets received by each transport will | |||
still be the same as the ratio of their packet rates, p_b.u_2/p_b.u_1 | still be the same as the ratio of their packet rates, p_b.u_2/p_b.u_1 | |||
= s_1/s_2. So of the 25x more 60B packets sent, 25x more will be | = s_1/s_2. So of the 25x more 60B packets sent, 25x more will be | |||
marked than in the 1500B packet flow, but 25x more won't be marked | marked than in the 1500B packet flow, but 25x more won't be marked | |||
too. | too. | |||
In this scenario, the resource is bit-congestible, so it always uses | In this scenario, the resource is bit-congestible, so it always uses | |||
our idealised bit-congestion field when it marks packets. Therefore | our idealised bit-congestion field when it marks packets. Therefore | |||
the transport should count marked bytes not packets. But it doesn't | the transport should count marked bytes not packets. But it doesn't | |||
actually matter for ratio-based transports like TCP (Appendix B.1). | actually matter for ratio-based transports like TCP (Appendix A.1). | |||
The ratio of marked to unmarked bytes seen by each flow will be p_b, | The ratio of marked to unmarked bytes seen by each flow will be p_b, | |||
as will the ratio of marked to unmarked packets. Because they are | as will the ratio of marked to unmarked packets. Because they are | |||
ratios, the units cancel out. | ratios, the units cancel out. | |||
If a flow sent an inconsistent mixture of packet sizes, we have said | If a flow sent an inconsistent mixture of packet sizes, we have said | |||
it should count the ratio of marked and unmarked bytes not packets in | it should count the ratio of marked and unmarked bytes not packets in | |||
order to correctly decode the level of congestion. But actually, if | order to correctly decode the level of congestion. But actually, if | |||
all it is trying to do is decode p_b, it still doesn't matter. For | all it is trying to do is decode p_b, it still doesn't matter. For | |||
instance, imagine the two equal bit rate flows were actually one flow | instance, imagine the two equal bit rate flows were actually one flow | |||
at twice the bit rate sending a mixture of one 1500B packet for every | at twice the bit rate sending a mixture of one 1500B packet for every | |||
twenty-five 60B packets. 25x more small packets will be marked and 25x | twenty-five 60B packets. 25x more small packets will be marked and 25x | |||
more will be unmarked. The transport can still calculate p_b whether | more will be unmarked. The transport can still calculate p_b whether | |||
it uses bytes or packets for the ratio. In general, for any | it uses bytes or packets for the ratio. In general, for any | |||
algorithm which works on a ratio of marks to non-marks, either bytes | algorithm which works on a ratio of marks to non-marks, either bytes | |||
or packets can be counted interchangeably, because the choice cancels | or packets can be counted interchangeably, because the choice cancels | |||
out in the ratio calculation. | out in the ratio calculation. | |||
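The claim that bytes and packets are interchangeable for a ratio can be checked with expected values. The sketch below is an illustration only: the function name and the chosen p_b are assumptions, and the mix uses 25 packets of 60B per 1500B packet, which is the mix that gives each size an equal share of the bit rate (25 x 60B = 1500B). Marking is taken to be independent of packet size, as the idealised protocol requires.

```python
from fractions import Fraction


def expected_fractions(mix, p_b):
    """Return (marked packet fraction, marked byte fraction).

    mix maps packet size in bytes to packets sent per interval.
    Every packet is marked with probability p_b regardless of size.
    """
    total_pkts = sum(mix.values())
    total_bytes = sum(size * n for size, n in mix.items())
    marked_pkts = sum(p_b * n for n in mix.values())
    marked_bytes = sum(p_b * size * n for size, n in mix.items())
    return marked_pkts / total_pkts, marked_bytes / total_bytes


mix = {1500: 1, 60: 25}      # one large packet per 25 small ones
p_b = Fraction(1, 100)       # assumed congestion level, for illustration
pkt_fraction, byte_fraction = expected_fractions(mix, p_b)
```

Both fractions come out equal to p_b, because p_b factors out of numerator and denominator whichever unit is counted.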
However, where an absolute target rather than relative volume of | However, where an absolute target rather than relative volume of | |||
congestion caused is important (Appendix B.1), as it is for | congestion caused is important (Appendix A.1), as it is for | |||
congestion accountability [Rate_fair_Dis], the transport must count | congestion accountability [Rate_fair_Dis], the transport must count | |||
marked bytes not packets, in this bit-congestible case. Aside from | marked bytes not packets, in this bit-congestible case. Aside from | |||
the goal of congestion accountability, this is how the bit rate of a | the goal of congestion accountability, this is how the bit rate of a | |||
transport can be made independent of packet size; by ensuring the | transport can be made independent of packet size; by ensuring the | |||
rate of congestion caused is kept to a constant weight | rate of congestion caused is kept to a constant weight | |||
[WindowPropFair], rather than merely responding to the ratio of | [WindowPropFair], rather than merely responding to the ratio of | |||
marked and unmarked bytes. | marked and unmarked bytes. | |||
Note the unit of byte-congestion-volume is the byte. | Note the unit of byte-congestion-volume is the byte. | |||
B.2.3. Bit-congestible resource, equal packet rates (Bi) | A.2.3. Bit-congestible resource, equal packet rates (Bi) | |||
If two flows send different packet sizes but at the same packet rate, | If two flows send different packet sizes but at the same packet rate, | |||
their bit rates will be in the same ratio as their packet sizes, | their bit rates will be in the same ratio as their packet sizes, | |||
x_2/x_1 = s_2/s_1. For instance, a flow sending 1500B packets at the | x_2/x_1 = s_2/s_1. For instance, a flow sending 1500B packets at the | |||
same packet rate as another sending 60B packets will be sending at | same packet rate as another sending 60B packets will be sending at | |||
25x greater bit rate. In this case, if a congested resource marks | 25x greater bit rate. In this case, if a congested resource marks | |||
proportion p_b of packets irrespective of size, the ratio of packets | proportion p_b of packets irrespective of size, the ratio of packets | |||
received with the byte-congestion field marked by each transport will | received with the byte-congestion field marked by each transport will | |||
be the same, p_b.u_2/p_b.u_1 = 1. | be the same, p_b.u_2/p_b.u_1 = 1. | |||
skipping to change at page 34, line 29 | skipping to change at page 36, line 20 | |||
If the two flows are mixed into one, of bit rate x_1+x_2, with equal | If the two flows are mixed into one, of bit rate x_1+x_2, with equal | |||
packet rates of each size packet, the ratio p_b will still be | packet rates of each size packet, the ratio p_b will still be | |||
measurable by counting the ratio of marked to unmarked bytes (or | measurable by counting the ratio of marked to unmarked bytes (or | |||
packets because the ratio cancels out the units). However, if the | packets because the ratio cancels out the units). However, if the | |||
absolute volume of congestion is required, the transport must count | absolute volume of congestion is required, the transport must count | |||
the sum of congestion marked bytes, which indeed gives a correct | the sum of congestion marked bytes, which indeed gives a correct | |||
measure of the rate of byte-congestion p_b(x_1 + x_2) caused by the | measure of the rate of byte-congestion p_b(x_1 + x_2) caused by the | |||
combined bit rate. | combined bit rate. | |||
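As a worked check of the absolute measure for the mixed flow, the sketch below sums the expected marked bytes per second and compares the result with p_b(x_1 + x_2). The packet rate, the value of p_b and the function name are assumptions chosen for illustration, and rates are expressed in bytes per second so that the result is in the unit of byte-congestion-volume.

```python
from fractions import Fraction


def byte_congestion_rate(sizes, u, p_b):
    """Expected marked bytes per second.

    Each packet size in `sizes` (bytes) is sent at packet rate u (pkt/s)
    and every packet is marked with probability p_b regardless of size.
    """
    return sum(p_b * u * s for s in sizes)


u = 100                        # assumed packet rate of each size [pkt/s]
p_b = Fraction(1, 100)         # assumed byte-congestion level
s_1, s_2 = 60, 1500            # packet sizes [byte]
x_1, x_2 = s_1 * u, s_2 * u    # rates of each size [byte/s], x = s.u

mixed = byte_congestion_rate([s_1, s_2], u, p_b)
```

With these numbers x_2/x_1 = 25 and the mixed flow accumulates 1560 marked bytes per second, which equals p_b(x_1 + x_2).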
B.2.4. Pkt-congestible resource, equal bit rates (Aii) | A.2.4. Pkt-congestible resource, equal bit rates (Aii) | |||
Moving to the case of packet-congestible resources, we now take two | Moving to the case of packet-congestible resources, we now take two | |||
flows that send different packet sizes at the same bit rate, but this | flows that send different packet sizes at the same bit rate, but this | |||
time the pkt-congestion field is marked by the resource with | time the pkt-congestion field is marked by the resource with | |||
probability p_p. As in scenario Ai with the same bit rates but a | probability p_p. As in scenario Ai with the same bit rates but a | |||
bit-congestible resource, the flow with smaller packets will have a | bit-congestible resource, the flow with smaller packets will have a | |||
higher packet rate, so more packets will be both marked and unmarked, | higher packet rate, so more packets will be both marked and unmarked, | |||
but in the same proportion. | but in the same proportion. | |||
This time, the transport should only count marks without taking into | This time, the transport should only count marks without taking into | |||
skipping to change at page 35, line 10 | skipping to change at page 37, line 5 | |||
flow of our example, as required. | flow of our example, as required. | |||
But if the transport is interested in the absolute amount of packet | But if the transport is interested in the absolute amount of packet | |||
congestion, it should just count how many marked packets arrive. For | congestion, it should just count how many marked packets arrive. For | |||
instance, a flow sending 60B packets will see 25x more marked packets | instance, a flow sending 60B packets will see 25x more marked packets | |||
than one sending 1500B packets at the same bit rate, because it is | than one sending 1500B packets at the same bit rate, because it is | |||
sending more packets through a packet-congestible resource. | sending more packets through a packet-congestible resource. | |||
Note the unit of packet congestion is a packet. | Note the unit of packet congestion is a packet. | |||
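The 25x figure for scenario Aii follows directly from u = x/s. The sketch below is an illustration under assumed values: the common rate, p_p and the function name are not taken from this memo.

```python
from fractions import Fraction


def marked_packet_rate(byte_rate, size, p_p):
    """Expected marked packets per second at a packet-congestible resource.

    byte_rate is the flow's rate [byte/s] and size its packet size [byte],
    so the packet rate is u = byte_rate / size. Every packet is marked
    with probability p_p regardless of size.
    """
    u = Fraction(byte_rate, size)
    return p_p * u


x = 150000                 # assumed common rate of both flows [byte/s]
p_p = Fraction(1, 100)     # assumed packet-congestion level

small = marked_packet_rate(x, 60, p_p)     # flow of 60B packets
large = marked_packet_rate(x, 1500, p_p)   # flow of 1500B packets
```

Here the 60B flow sees 25 marked packets per second and the 1500B flow sees one, while the marked fraction of each flow's packets is still p_p.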
B.2.5. Pkt-congestible resource, equal packet rates (Bii) | A.2.5. Pkt-congestible resource, equal packet rates (Bii) | |||
Finally, if two flows with the same packet rate pass through a | Finally, if two flows with the same packet rate pass through a | |||
packet-congestible resource, they will both suffer the same | packet-congestible resource, they will both suffer the same | |||
proportion of marking, p_p, irrespective of their packet sizes. On | proportion of marking, p_p, irrespective of their packet sizes. On | |||
detecting that the pkt-congestion field is marked, the transport | detecting that the pkt-congestion field is marked, the transport | |||
should count packets, and it will be able to extract the ratio p_p of | should count packets, and it will be able to extract the ratio p_p of | |||
marked to unmarked packets from both flows, irrespective of packet | marked to unmarked packets from both flows, irrespective of packet | |||
sizes. | sizes. | |||
Even if the transport is monitoring the absolute amount of packet | Even if the transport is monitoring the absolute amount of packet | |||
congestion over a period, still it will see the same amount of packet | congestion over a period, still it will see the same amount of packet | |||
congestion from either flow. | congestion from either flow. | |||
And if the two equal packet rates of different size packets are mixed | And if the two equal packet rates of different size packets are mixed | |||
together in one flow, the packet rate will double, so the absolute | together in one flow, the packet rate will double, so the absolute | |||
volume of packet-congestion will accumulate at twice the rate of | volume of packet-congestion will accumulate at twice the rate of | |||
either flow, 2p_p.u_1 = p_p(u_1+u_2). | either flow, 2p_p.u_1 = p_p(u_1+u_2). | |||
Appendix C. Byte-mode Drop Complicates Policing Congestion Response | Appendix B. Byte-mode Drop Complicates Policing Congestion Response | |||
This appendix explains why the ability of networks to police the | This appendix explains why the ability of networks to police the | |||
response of _any_ transport to congestion depends on bit-congestible | response of _any_ transport to congestion depends on bit-congestible | |||
network resources only doing packet-mode not byte-mode drop. | network resources only doing packet-mode not byte-mode drop. | |||
To be able to police a transport's response to congestion when | To be able to police a transport's response to congestion when | |||
fairness can only be judged over time and over all an individual's | fairness can only be judged over time and over all an individual's | |||
flows, the policer has to have an integrated view of all the | flows, the policer has to have an integrated view of all the | |||
congestion an individual (not just one flow) has caused due to all | congestion an individual (not just one flow) has caused due to all | |||
traffic entering the Internet from that individual. This is termed | traffic entering the Internet from that individual. This is termed | |||
congestion accountability. | congestion accountability. | |||
But a byte-mode drop algorithm has to depend on the local MTU of the | But a byte-mode drop algorithm has to depend on the local MTU of the | |||
line - an algorithm needs to use some concept of a 'normal' packet | line - an algorithm needs to use some concept of a 'normal' packet | |||
size. Therefore, one dropped or marked packet is not necessarily | size. Therefore, one dropped or marked packet is not necessarily | |||
equivalent to another unless you know the MTU at the queue where it | equivalent to another unless you know the MTU at the queue where it | |||
was dropped/marked. To have an integrated view of a user, we believe | was dropped/marked. To have an integrated view of a user, we believe | |||
congestion policing has to be located at an individual's attachment | congestion policing has to be located at an individual's attachment | |||
point to the Internet [I-D.briscoe-tsvwg-re-ecn-tcp]. But from there | point to the Internet [I-D.conex-concepts-uses]. But from there it | |||
it cannot know the MTU of each remote queue that caused each drop/ | cannot know the MTU of each remote queue that caused each drop/mark. | |||
mark. Therefore it cannot take an integrated approach to policing | Therefore it cannot take an integrated approach to policing all the | |||
all the responses to congestion of all the transports of one | responses to congestion of all the transports of one individual. | |||
individual. Therefore it cannot police anything. | Therefore it cannot police anything. | |||
The security/incentive argument _for_ packet-mode drop is similar. | The security/incentive argument _for_ packet-mode drop is similar. | |||
Firstly, confining RED to packet-mode drop would not preclude | Firstly, confining RED to packet-mode drop would not preclude | |||
bottleneck policing approaches such as [pBox] as it seems likely they | bottleneck policing approaches such as [pBox] as it seems likely they | |||
could work just as well by monitoring the volume of dropped bytes | could work just as well by monitoring the volume of dropped bytes | |||
rather than packets. Secondly packet-mode dropping/marking naturally | rather than packets. Secondly packet-mode dropping/marking naturally | |||
allows the congestion notification of packets to be globally | allows the congestion notification of packets to be globally | |||
meaningful without relying on MTU information held elsewhere. | meaningful without relying on MTU information held elsewhere. | |||
Because we recommend that a dropped/marked packet should be taken to | Because we recommend that a dropped/marked packet should be taken to | |||
skipping to change at page 36, line 27 | skipping to change at page 38, line 21 | |||
packets or across different size flows [Rate_fair_Dis]. Therefore | packets or across different size flows [Rate_fair_Dis]. Therefore | |||
policing would work naturally with just simple packet-mode drop in | policing would work naturally with just simple packet-mode drop in | |||
RED. | RED. | |||
In summary, making drop probability depend on the size of the packets | In summary, making drop probability depend on the size of the packets | |||
that bits happen to be divided into simply encourages the bits to be | that bits happen to be divided into simply encourages the bits to be | |||
divided into smaller packets. Byte-mode drop would therefore | divided into smaller packets. Byte-mode drop would therefore | |||
irreversibly complicate any attempt to fix the Internet's incentive | irreversibly complicate any attempt to fix the Internet's incentive | |||
structures. | structures. | |||
Appendix D. Changes from Previous Versions | Appendix C. Changes from Previous Versions | |||
To be removed by the RFC Editor on publication. | To be removed by the RFC Editor on publication. | |||
Full incremental diffs between each version are available at | Full incremental diffs between each version are available at | |||
<http://www.cs.ucl.ac.uk/staff/B.Briscoe/pubs.html#byte-pkt-congest> | <http://www.cs.ucl.ac.uk/staff/B.Briscoe/pubs.html#byte-pkt-congest> | |||
or | or | |||
<http://tools.ietf.org/wg/tsvwg/draft-ietf-tsvwg-byte-pkt-congest/> | <http://tools.ietf.org/wg/tsvwg/draft-ietf-tsvwg-byte-pkt-congest/> | |||
(courtesy of the rfcdiff tool): | (courtesy of the rfcdiff tool): | |||
From -01 to -02 (this version): | From -02 to -03 (this version) | |||
* Structural changes: | ||||
+ Split off text at end of "Scaling Congestion Control with | ||||
Packet Size" into new section "Transport-Independent | ||||
Network" | ||||
+ Shifted "Recommendations" straight after "Motivating | ||||
Arguments" and added "Conclusions" at end to reinforce | ||||
Recommendations | ||||
+ Added more internal structure to Recommendations, so that | ||||
recommendations specific to RED or to TCP are just | ||||
corollaries of a more general recommendation, rather than | ||||
being listed as a separate recommendation. | ||||
+ Renamed "State of the Art" as "Critical Survey of Existing | ||||
Advice" and retitled a number of subsections with more | ||||
descriptive titles. | ||||
+ Split end of "Congestion Coding: Summary of Status" into a | ||||
new subsection called "RED Implementation Status". | ||||
+ Removed text that had been in the Appendix "Congestion | ||||
Notification Definition: Further Justification". | ||||
* Reordered the intro text a little. | ||||
* Made it clearer when advice being reported is deprecated and | ||||
when it is not. | ||||
* Described AQM as in network equipment, rather than saying "at | ||||
the network layer" (to side-step controversy over whether | ||||
functions like AQM are in the transport layer but in network | ||||
equipment). | ||||
* Minor improvements to clarity throughout | ||||
From -01 to -02: | ||||
* Restructured the whole document for (hopefully) easier reading | * Restructured the whole document for (hopefully) easier reading | |||
and clarity. The concrete recommendation, in RFC2119 language, | and clarity. The concrete recommendation, in RFC2119 language, | |||
is now in Section 5. | is now in Section 7. | |||
From -00 to -01: | From -00 to -01: | |||
* Minor clarifications throughout and updated references | * Minor clarifications throughout and updated references | |||
From briscoe-byte-pkt-mark-02 to ietf-byte-pkt-congest-00: | From briscoe-byte-pkt-mark-02 to ietf-byte-pkt-congest-00: | |||
* Added note on relationship to existing RFCs | * Added note on relationship to existing RFCs | |||
* Posed the question of whether packet-congestion could become | * Posed the question of whether packet-congestion could become | |||
common and deferred it to the IRTF ICCRG. Added ref to the | common and deferred it to the IRTF ICCRG. Added ref to the | |||
dual-resource queue (DRQ) proposal. | dual-resource queue (DRQ) proposal. | |||
* Changed PCN references from the PCN charter & architecture to | * Changed PCN references from the PCN charter & architecture to | |||
the PCN marking behaviour draft most likely to imminently | the PCN marking behaviour draft most likely to imminently | |||
become the standards track WG item. | become the standards track WG item. | |||
From -01 to -02: | From -01 to -02: | |||
skipping to change at page 37, line 41 | skipping to change at page 40, line 26 | |||
new transports unless we decide whether the network or the | new transports unless we decide whether the network or the | |||
transport is allowing for packet size. | transport is allowing for packet size. | |||
* Added statement explaining the horizon of the memo is long | * Added statement explaining the horizon of the memo is long | |||
term, but with short term expediency in mind. | term, but with short term expediency in mind. | |||
* Added material on scaling congestion control with packet size | * Added material on scaling congestion control with packet size | |||
(Section 2.1). | (Section 2.1). | |||
* Separated out issue of normalising TCP's bit rate from issue of | * Separated out issue of normalising TCP's bit rate from issue of | |||
preference to control packets (Section 2.3). | preference to control packets (Section 2.4). | |||
* Divided up Congestion Measurement section for clarity, | * Divided up Congestion Measurement section for clarity, | |||
including new material on fixed size packet buffers and buffer | including new material on fixed size packet buffers and buffer | |||
carving (Section 3.1.1 & Section 3.2.1) and on congestion | carving (Section 4.1.1 & Section 4.2.1) and on congestion | |||
measurement in wireless link technologies without queues | measurement in wireless link technologies without queues | |||
(Section 3.1.2). | (Section 4.1.2). | |||
* Added section on 'Making Transports Robust against Control | * Added section on 'Making Transports Robust against Control | |||
Packet Losses' (Section 3.2.3) with existing & new material | Packet Losses' (Section 4.2.3) with existing & new material | |||
included. | included. | |||
* Added tabulated results of vendor survey on byte-mode drop | * Added tabulated results of vendor survey on byte-mode drop | |||
variant of RED (Table 2). | variant of RED (Table 2). | |||
From -00 to -01: | From -00 to -01: | |||
* Clarified applicability to drop as well as ECN. | * Clarified applicability to drop as well as ECN. | |||
* Highlighted DoS vulnerability. | * Highlighted DoS vulnerability. | |||
End of changes. 148 change blocks. | ||||
654 lines changed or deleted | 740 lines changed or added | |||