draft-ietf-tsvwg-aqm-dualq-coupled-08.txt | draft-ietf-tsvwg-aqm-dualq-coupled-09.txt | |||
---|---|---|---|---|
Transport Area working group (tsvwg) K. De Schepper | Transport Area working group (tsvwg) K. De Schepper | |||
Internet-Draft Nokia Bell Labs | Internet-Draft Nokia Bell Labs | |||
Intended status: Experimental B. Briscoe, Ed. | Intended status: Experimental B. Briscoe, Ed. | |||
Expires: May 8, 2019 CableLabs | Expires: January 6, 2020 G. White | |||
O. Bondarenko | CableLabs | |||
Simula Research Lab | July 05, 2019 | |||
I. Tsang | ||||
Nokia | ||||
November 04, 2018 | ||||
DualQ Coupled AQMs for Low Latency, Low Loss and Scalable Throughput | DualQ Coupled AQMs for Low Latency, Low Loss and Scalable Throughput | |||
(L4S) | (L4S) | |||
draft-ietf-tsvwg-aqm-dualq-coupled-08 | draft-ietf-tsvwg-aqm-dualq-coupled-09 | |||
Abstract | Abstract | |||
The Low Latency Low Loss Scalable Throughput (L4S) architecture | The Low Latency Low Loss Scalable Throughput (L4S) architecture | |||
allows data flows over the public Internet to predictably achieve | allows data flows over the public Internet to predictably achieve | |||
ultra-low queuing latency, generally zero congestion loss and scaling | ultra-low queuing latency, generally zero congestion loss and scaling | |||
of per-flow throughput without the problems of traditional TCP. To | of per-flow throughput without the problems of traditional TCP. To | |||
achieve this, L4S data flows use a 'scalable' congestion control | achieve this, L4S data flows have to use one of the family of | |||
similar to Data Centre TCP (DCTCP) and a form of Explicit Congestion | 'Scalable' congestion controls (Data Centre TCP and TCP Prague are | |||
Notification (ECN) with modified behaviour. However, until now, | examples) and a form of Explicit Congestion Notification (ECN) with | |||
scalable congestion controls did not co-exist with existing TCP Reno/ | modified behaviour. However, until now, Scalable congestion controls | |||
Cubic traffic---scalable controls are so aggressive that 'Classic' | did not co-exist with existing TCP Reno/Cubic traffic --- Scalable | |||
TCP algorithms drive themselves to starvation. Therefore, until now, | controls are so aggressive that 'Classic' TCP algorithms drive | |||
L4S controls could only be deployed where a clean-slate environment | themselves to a small capacity share. Therefore, until now, L4S | |||
could be arranged, such as in private data centres (hence the name | controls could only be deployed where a clean-slate environment could | |||
DCTCP). This specification defines `DualQ Coupled Active Queue | be arranged, such as in private data centres (hence the name DCTCP). | |||
Management (AQM)', which enables these scalable congestion controls | This specification defines `DualQ Coupled Active Queue Management | |||
to safely co-exist with Classic Internet traffic. | (AQM)', which enables these Scalable congestion controls to safely | |||
co-exist with Classic Internet traffic. | ||||
The Coupled AQM ensures that a flow runs at about the same rate | The Coupled AQM ensures that competing Scalable and Classic flows run | |||
whether it uses DCTCP or TCP Reno/Cubic. It achieves this | at about the same rate. It achieves this indirectly, without having | |||
indirectly, without having to inspect transport layer flow | to inspect transport layer flow identifiers, When tested in a | |||
identifiers, When tested in a residential broadband setting, DCTCP | residential broadband setting, DCTCP also achieves sub-millisecond | |||
also achieves sub-millisecond average queuing delay and zero | average queuing delay and zero congestion loss under a wide range of | |||
congestion loss under a wide range of mixes of DCTCP and `Classic' | mixes of DCTCP and `Classic' broadband Internet traffic, without | |||
broadband Internet traffic, without compromising the performance of | compromising the performance of the Classic traffic. The solution | |||
the Classic traffic. The solution also reduces network complexity | also reduces network complexity and requires no configuration for the | |||
and eliminates network configuration. | public Internet. | |||
Status of This Memo | Status of This Memo | |||
This Internet-Draft is submitted in full conformance with the | This Internet-Draft is submitted in full conformance with the | |||
provisions of BCP 78 and BCP 79. | provisions of BCP 78 and BCP 79. | |||
Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
Task Force (IETF). Note that other groups may also distribute | Task Force (IETF). Note that other groups may also distribute | |||
working documents as Internet-Drafts. The list of current Internet- | working documents as Internet-Drafts. The list of current Internet- | |||
Drafts is at https://datatracker.ietf.org/drafts/current/. | Drafts is at https://datatracker.ietf.org/drafts/current/. | |||
Internet-Drafts are draft documents valid for a maximum of six months | Internet-Drafts are draft documents valid for a maximum of six months | |||
and may be updated, replaced, or obsoleted by other documents at any | and may be updated, replaced, or obsoleted by other documents at any | |||
time. It is inappropriate to use Internet-Drafts as reference | time. It is inappropriate to use Internet-Drafts as reference | |||
material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
This Internet-Draft will expire on May 8, 2019. | This Internet-Draft will expire on January 6, 2020. | |||
Copyright Notice | Copyright Notice | |||
Copyright (c) 2018 IETF Trust and the persons identified as the | Copyright (c) 2019 IETF Trust and the persons identified as the | |||
document authors. All rights reserved. | document authors. All rights reserved. | |||
This document is subject to BCP 78 and the IETF Trust's Legal | This document is subject to BCP 78 and the IETF Trust's Legal | |||
Provisions Relating to IETF Documents | Provisions Relating to IETF Documents | |||
(https://trustee.ietf.org/license-info) in effect on the date of | (https://trustee.ietf.org/license-info) in effect on the date of | |||
publication of this document. Please review these documents | publication of this document. Please review these documents | |||
carefully, as they describe your rights and restrictions with respect | carefully, as they describe your rights and restrictions with respect | |||
to this document. Code Components extracted from this document must | to this document. Code Components extracted from this document must | |||
include Simplified BSD License text as described in Section 4.e of | include Simplified BSD License text as described in Section 4.e of | |||
the Trust Legal Provisions and are provided without warranty as | the Trust Legal Provisions and are provided without warranty as | |||
described in the Simplified BSD License. | described in the Simplified BSD License. | |||
Table of Contents | Table of Contents | |||
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 | 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 | |||
1.1. Problem and Scope . . . . . . . . . . . . . . . . . . . . 3 | 1.1. Outline of the Problem . . . . . . . . . . . . . . . . . 3 | |||
1.2. Terminology . . . . . . . . . . . . . . . . . . . . . . . 5 | 1.2. Scope . . . . . . . . . . . . . . . . . . . . . . . . . . 5 | |||
1.3. Features . . . . . . . . . . . . . . . . . . . . . . . . 6 | 1.3. Terminology . . . . . . . . . . . . . . . . . . . . . . . 7 | |||
2. DualQ Coupled AQM . . . . . . . . . . . . . . . . . . . . . . 7 | 1.4. Features . . . . . . . . . . . . . . . . . . . . . . . . 8 | |||
2.1. Coupled AQM . . . . . . . . . . . . . . . . . . . . . . . 8 | 2. DualQ Coupled AQM . . . . . . . . . . . . . . . . . . . . . . 9 | |||
2.2. Dual Queue . . . . . . . . . . . . . . . . . . . . . . . 9 | 2.1. Coupled AQM . . . . . . . . . . . . . . . . . . . . . . . 9 | |||
2.3. Traffic Classification . . . . . . . . . . . . . . . . . 9 | 2.2. Dual Queue . . . . . . . . . . . . . . . . . . . . . . . 10 | |||
2.4. Overall DualQ Coupled AQM Structure . . . . . . . . . . . 10 | 2.3. Traffic Classification . . . . . . . . . . . . . . . . . 11 | |||
2.5. Normative Requirements for a DualQ Coupled AQM . . . . . 12 | 2.4. Overall DualQ Coupled AQM Structure . . . . . . . . . . . 11 | |||
2.5.1. Functional Requirements . . . . . . . . . . . . . . . 12 | 2.5. Normative Requirements for a DualQ Coupled AQM . . . . . 14 | |||
2.5.1.1. Requirements in Unexpected Cases . . . . . . . . 13 | 2.5.1. Functional Requirements . . . . . . . . . . . . . . . 14 | |||
2.5.2. Management Requirements . . . . . . . . . . . . . . . 15 | 2.5.1.1. Requirements in Unexpected Cases . . . . . . . . 15 | |||
3. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 16 | 2.5.2. Management Requirements . . . . . . . . . . . . . . . 16 | |||
4. Security Considerations . . . . . . . . . . . . . . . . . . . 16 | 2.5.2.1. Configuration . . . . . . . . . . . . . . . . . . 16 | |||
4.1. Overload Handling . . . . . . . . . . . . . . . . . . . . 16 | 2.5.2.2. Monitoring . . . . . . . . . . . . . . . . . . . 18 | |||
2.5.2.3. Anomaly Detection . . . . . . . . . . . . . . . . 18 | ||||
2.5.2.4. Deployment, Coexistence and Scaling . . . . . . . 19 | ||||
3. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 19 | ||||
4. Security Considerations . . . . . . . . . . . . . . . . . . . 19 | ||||
4.1. Overload Handling . . . . . . . . . . . . . . . . . . . . 19 | ||||
4.1.1. Avoiding Classic Starvation: Sacrifice L4S Throughput | 4.1.1. Avoiding Classic Starvation: Sacrifice L4S Throughput | |||
or Delay? . . . . . . . . . . . . . . . . . . . . . . 17 | or Delay? . . . . . . . . . . . . . . . . . . . . . . 20 | |||
4.1.2. Congestion Signal Saturation: Introduce L4S Drop or | 4.1.2. Congestion Signal Saturation: Introduce L4S Drop or | |||
Delay? . . . . . . . . . . . . . . . . . . . . . . . 18 | Delay? . . . . . . . . . . . . . . . . . . . . . . . 21 | |||
4.1.3. Protecting against Unresponsive ECN-Capable Traffic . 19 | 4.1.3. Protecting against Unresponsive ECN-Capable Traffic . 22 | |||
5. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 19 | 5. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 22 | |||
6. References . . . . . . . . . . . . . . . . . . . . . . . . . 20 | 6. Contributors . . . . . . . . . . . . . . . . . . . . . . . . 23 | |||
6.1. Normative References . . . . . . . . . . . . . . . . . . 20 | 7. References . . . . . . . . . . . . . . . . . . . . . . . . . 23 | |||
6.2. Informative References . . . . . . . . . . . . . . . . . 20 | 7.1. Normative References . . . . . . . . . . . . . . . . . . 23 | |||
Appendix A. Example DualQ Coupled PI2 Algorithm . . . . . . . . 23 | 7.2. Informative References . . . . . . . . . . . . . . . . . 24 | |||
A.1. Pass #1: Core Concepts . . . . . . . . . . . . . . . . . 23 | Appendix A. Example DualQ Coupled PI2 Algorithm . . . . . . . . 27 | |||
A.2. Pass #2: Overload Details . . . . . . . . . . . . . . . . 30 | A.1. Pass #1: Core Concepts . . . . . . . . . . . . . . . . . 28 | |||
Appendix B. Example DualQ Coupled Curvy RED Algorithm . . . . . 33 | A.2. Pass #2: Overload Details . . . . . . . . . . . . . . . . 36 | |||
Appendix C. Guidance on Controlling Throughput Equivalence . . . 39 | Appendix B. Example DualQ Coupled Curvy RED Algorithm . . . . . 40 | |||
Appendix D. Open Issues . . . . . . . . . . . . . . . . . . . . 40 | B.1. Curvy RED in Pseudocode . . . . . . . . . . . . . . . . . 40 | |||
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 41 | B.2. Efficient Implementation of Curvy RED . . . . . . . . . . 46 | |||
Appendix C. Guidance on Controlling Throughput Equivalence . . . 48 | ||||
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 49 | ||||
1. Introduction | 1. Introduction | |||
1.1. Problem and Scope | This document specifies a framework for DualQ Coupled AQMs, which is | |||
the network part of the L4S architecture [I-D.ietf-tsvwg-l4s-arch]. | ||||
L4S enables both ultra-low queuing latency and high throughput at the | ||||
same time, for ad hoc numbers of capacity-seeking applications all | ||||
sharing the same capacity. | ||||
1.1. Outline of the Problem | ||||
Latency is becoming the critical performance factor for many (most?) | Latency is becoming the critical performance factor for many (most?) | |||
applications on the public Internet, e.g. interactive Web, Web | applications on the public Internet, e.g. interactive Web, Web | |||
services, voice, conversational video, interactive video, interactive | services, voice, conversational video, interactive video, interactive | |||
remote presence, instant messaging, online gaming, remote desktop, | remote presence, instant messaging, online gaming, remote desktop, | |||
cloud-based applications, and video-assisted remote control of | cloud-based applications, and video-assisted remote control of | |||
machinery and industrial processes. In the developed world, further | machinery and industrial processes. In the developed world, further | |||
increases in access network bit-rate offer diminishing returns, | increases in access network bit-rate offer diminishing returns, | |||
whereas latency is still a multi-faceted problem. In the last decade | whereas latency is still a multi-faceted problem. In the last decade | |||
or so, much has been done to reduce propagation time by placing | or so, much has been done to reduce propagation time by placing | |||
caches or servers closer to users. However, queuing remains a major | caches or servers closer to users. However, queuing remains a major | |||
intermittent component of latency. | intermittent component of latency. | |||
The Diffserv architecture provides Expedited Forwarding [RFC3246], so | Traditionally ultra-low latency has only been available for a few | |||
that low latency traffic can jump the queue of other traffic. | selected low rate applications, that confine their sending rate | |||
However, on access links dedicated to individual sites (homes, small | within a specially carved-off portion of capacity, which is | |||
enterprises or mobile devices), often all traffic at any one time | prioritized over other traffic, e.g. Diffserv EF [RFC3246]. Up to | |||
will be latency-sensitive and, if all the traffic on a link is marked | now it has not been possible to allow any number of low latency, high | |||
as EF, Diffserv cannot reduce the delay of any of it. In contrast, | throughput applications to seek to fully utilize available capacity, | |||
the Low Latency Low Loss Scalable throughput (L4S) approach removes | because the capacity-seeking process itself causes too much queuing | |||
the causes of any unnecessary queuing delay. | delay. | |||
The bufferbloat project has shown that excessively-large buffering | To reduce this queuing delay caused by the capacity seeking process, | |||
(`bufferbloat') has been introducing significantly more delay than | changes either to the network alone or to end-systems alone are in | |||
the underlying propagation time. These delays appear only | progress. L4S involves a recognition that both approaches are | |||
intermittently--only when a capacity-seeking (e.g. TCP) flow is long | yielding diminishing returns: | |||
enough for the queue to fill the buffer, making every packet in other | ||||
flows sharing the buffer sit through the queue. | ||||
Active queue management (AQM) was originally developed to solve this | o Recent state-of-the-art active queue management (AQM) in the | |||
problem (and others). Unlike Diffserv, which gives low latency to | network, e.g. fq_CoDel [RFC8290], PIE [RFC8033], Adaptive | |||
some traffic at the expense of others, AQM controls latency for _all_ | RED [ARED01] ) has reduced queuing delay for all traffic, not just | |||
traffic in a class. In general, AQMs introduce an increasing level | a select few applications. However, no matter how good the AQM, | |||
of discard from the buffer the longer the queue persists above a | the capacity-seeking (sawtoothing) rate of TCP-like congestion | |||
shallow threshold. This gives sufficient signals to capacity-seeking | controls represents a lower limit that will either cause queuing | |||
(aka. greedy) flows to keep the buffer empty for its intended | delay to vary or cause the link to be under-utilized. These AQMs | |||
purpose: absorbing bursts. However, RED [RFC2309] and other | are tuned to allow a typical capacity-seeking TCP-Friendly flow to | |||
algorithms from the 1990s were sensitive to their configuration and | induce an average queue that roughly doubles the base RTT, adding | |||
hard to set correctly. So, AQM was not widely deployed in the 1990s. | 5-15 ms of queuing on average (cf. 500 microseconds with L4S for | |||
the same mix of long-running and web traffic). However, for many | ||||
applications low delay is not useful unless it is consistently | ||||
low. With these AQMs, 99th percentile queuing delay is 20-30 ms | ||||
(cf. 2 ms with the same traffic over L4S). | ||||
More recent state-of-the-art AQMs, e.g. fq_CoDel [RFC8290], | o Similarly, recent research into using e2e congestion control | |||
PIE [RFC8033], Adaptive RED [ARED01], are easier to configure, | without needing an AQM in the network (e.g.BBRv1 [BBRv1]) seems to | |||
because they define the queuing threshold in time not bytes, so it is | have hit a similar lower limit to queuing delay of about 20ms on | |||
invariant for different link rates. However, no matter how good the | average (and any additional BBRv1 flow adds another 20ms of | |||
AQM, the sawtoothing rate of TCP will either cause queuing delay to | queuing) but there are also regular 25ms delay spikes due to | |||
vary or cause the link to be under-utilized. Even with a perfectly | bandwidth probes and 60ms spikes due to flow-starts. | |||
tuned AQM, the additional queuing delay will be of the same order as | ||||
the underlying speed-of-light delay across the network. Flow-queuing | ||||
can isolate one flow from another, but it cannot isolate a TCP flow | ||||
from the delay variations it inflicts on itself, and it has other | ||||
problems - it overrides the flow rate decisions of variable rate | ||||
video applications, it does not recognise the flows within IPSec VPN | ||||
tunnels and it is relatively expensive to implement. | ||||
It seems that further changes to the network alone will now yield | L4S learns from the experience of Data Center TCP [RFC8257], which | |||
diminishing returns. Data Centre TCP (DCTCP [RFC8257]) teaches us | shows the power of complementary changes both in the network and on | |||
that a small but radical change to TCP is needed to cut two major | end-systems. DCTCP teaches us that two small but radical changes to | |||
outstanding causes of queuing delay variability: | congestion control are needed to cut the two major outstanding causes | |||
of queuing delay variability: | ||||
1. the `sawtooth' varying rate of TCP itself; | 1. Far smaller rate variations (sawteeth) than TCP-Friendly | |||
congestion controls; | ||||
2. the smoothing delay deliberately introduced into AQMs to permit | 2. A shift of smoothing and hence smoothing delay from network to | |||
bursts without triggering losses. | sender. | |||
The former causes a flow's round trip time (RTT) to vary from about 1 | Without the former, a 'Classic' flow's round trip time (RTT) varies | |||
to 2 times the base RTT between the machines in question. The latter | between roughly 1 and 2 times the base RTT between the machines in | |||
delays the system's response to change by a worst-case | question. Without the latter a 'Classic' flow's response to changing | |||
(transcontinental) RTT, which could be hundreds of times the actual | events is delayed by a worst-case (transcontinental) RTT, which could | |||
RTT of typical traffic from localized CDNs. | be hundreds of times the actual smoothing delay needed for the RTT of | |||
typical traffic from localized CDNs. | ||||
Latency is not our only concern: | These changes are the two main features of the family of so-called | |||
'Scalable' congestion controls (which includes DCTCP). Both these | ||||
changes only reduce delay in combination with a complementary change | ||||
in the network and they are both only feasible with ECN, not drop, | ||||
for the signalling: | ||||
3. It was known when TCP was first developed that it would not scale | 1. The smaller sawteeth need an extremely shallow ECN packet-marking | |||
to high bandwidth-delay products [TCP-CA]. | threshold in the queue. | |||
Given regular broadband bit-rates over WAN distances are | 2. And no smoothing in the network means that every fluctuation of | |||
already [RFC3649] beyond the scaling range of `classic' TCP Reno, | the queue is signalled immediately. | |||
`less unscalable' Cubic [RFC8312] and | ||||
Compound [I-D.sridharan-tcpm-ctcp] variants of TCP have been | Without ECN, either of these would lead to very high loss levels. | |||
successfully deployed. However, these are now approaching their | But, with ECN, the resulting high marking levels are fine. | |||
scaling limits. Unfortunately, fully scalable TCPs such as DCTCP | ||||
cause `classic' TCP to starve itself, which is why they have been | However, until now, Scalable congestion controls (like DCTCP) did not | |||
confined to private data centres or research testbeds (until now). | co-exist with existing ECN-capable TCP Reno [RFC5681] or Cubic | |||
[RFC8312] traffic --- Scalable controls are so aggressive that these | ||||
'Classic' TCP algorithms drive themselves to a small capacity share. | ||||
Therefore, until now, L4S controls could only be deployed where a | ||||
clean-slate environment could be arranged, such as in private data | ||||
centres (hence the name DCTCP). | ||||
This document specifies a `DualQ Coupled AQM' extension that solves | This document specifies a `DualQ Coupled AQM' extension that solves | |||
the problem of coexistence between scalable and classic flows, | the problem of coexistence between Scalable and Classic flows, | |||
without having to inspect flow identifiers. The AQM is not like | without having to inspect flow identifiers. It is not like flow- | |||
flow-queuing approaches [RFC8290] that classify packets by flow | queuing approaches [RFC8290] that classify packets by flow identifier | |||
identifier into numerous separate queues in order to isolate sparse | into separate queues in order to isolate sparse flows from the higher | |||
flows from the higher latency in the queues assigned to heavier | latency in the queues assigned to heavier flows. If a flow needs | |||
flows. In contrast, the AQM exploits the behaviour of scalable | both low delay and high throughput, having a queue to itself does not | |||
congestion controls like DCTCP so that every packet in every flow | isolate it from the harm it causes to itself. In contrast, L4S | |||
sharing the queue for DCTCP-like traffic can be served with very low | addresses the root cause of the latency problem --- it is an enabler | |||
latency. | for the smooth low latency scalable behaviour of Scalable congestion | |||
controls, so that every packet in every flow can enjoy very low | ||||
latency, then there is no need to isolate each flow into a separate | ||||
queue. | ||||
This AQM extension can be combined with any AQM designed for a single | 1.2. Scope | |||
queue that generates a statistical or deterministic mark/drop | ||||
probability driven by the queue dynamics. In many cases it | L4S involves complementary changes in the network and on end-systems: | |||
simplifies the basic control algorithm, and requires little extra | ||||
processing. Therefore it is believed the Coupled AQM would be | Network: A DualQ Coupled AQM (defined in the present document); | |||
applicable and easy to deploy in all types of buffers; buffers in | ||||
cost-reduced mass-market residential equipment; buffers in end-system | End-system: A Scalable congestion control (defined in Section 2.1. | |||
stacks; buffers in carrier-scale equipment including remote access | ||||
servers, routers, firewalls and Ethernet switches; buffers in network | Packet identifier: The network and end-system parts of L4S can be | |||
interface cards, buffers in virtualized network appliances, | deployed incrementally, because they both identify L4S packets | |||
hypervisors, and so on. | using the experimentally assigned explicit congestion notification | |||
(ECN) codepoints in the IP header: ECT(1) and CE [RFC8311] | ||||
[I-D.ietf-tsvwg-ecn-l4s-id]. | ||||
Data Center TCP (DCTCP [RFC8257]) is an example of a Scalable | ||||
congestion control that has been deployed for some time in Linux, | ||||
Windows and FreeBSD operating systems and Relentless TCP [Mathis09] | ||||
is another example. During the progress of this document through the | ||||
IETF a number of other Scalable congestion controls were implemented, | ||||
e.g. TCP Prague [PragueLinux], QUIC Prague and the L4S variant of | ||||
SCREAM for real-time media [RFC8298]. (Note: after the v3.19 Linux | ||||
kernel, bugs were introduced into DCTCP's scalable behaviour and not | ||||
all the patches applied for L4S evaluation had been applied to the | ||||
mainline Linux kernel, which was at v5.2 at the time of writing). | ||||
The focus of this specification is to get the network part of the L4S | ||||
service in place. Then, without any management intervention, | ||||
applications can exploit this new network capability as their | ||||
operating systems migrate to Scalable congestion controls, which can | ||||
then evolve _while_ their benefits are being enjoyed by everyone on | ||||
the Internet. | ||||
The DualQ Coupled AQM framework can incorporate any AQM designed for | ||||
a single queue that generates a statistical or deterministic mark/ | ||||
drop probability driven by the queue dynamics. Pseudocode examples | ||||
of two different DualQ Coupled AQMs are given the appendices. In | ||||
many cases the framework simplifies the basic control algorithm, and | ||||
requires little extra processing. Therefore it is believed the | ||||
Coupled AQM would be applicable and easy to deploy in all types of | ||||
buffers; buffers in cost-reduced mass-market residential equipment; | ||||
buffers in end-system stacks; buffers in carrier-scale equipment | ||||
including remote access servers, routers, firewalls and Ethernet | ||||
switches; buffers in network interface cards, buffers in virtualized | ||||
network appliances, hypervisors, and so on. | ||||
For the public Internet, nearly all the benefit will typically be | For the public Internet, nearly all the benefit will typically be | |||
achieved by deploying the Coupled AQM into either end of the access | achieved by deploying the Coupled AQM into either end of the access | |||
link between a 'site' and the Internet, which is invariably the | link between a 'site' and the Internet, which is invariably the | |||
bottleneck. Here, the term 'site' is used loosely to mean a home, an | bottleneck. Here, the term 'site' is used loosely to mean a home, an | |||
office, a campus or mobile user equipment. | office, a campus or mobile user equipment. | |||
Latency is not the only concern of L4S: | ||||
o The 'Low Loss" part of the name denotes that L4S generally | ||||
achieves zero congestion loss (which would otherwise cause | ||||
retransmission delays), due to its use of ECN. | ||||
o The "Scalable throughput" part of the name denotes that the per- | ||||
flow throughput of Scalable congestion controls should scale | ||||
indefinitely, avoiding the imminent scaling problems with TCP- | ||||
Friendly congestion control algorithms [RFC3649]. | ||||
The former is clearly in scope of this AQM document. However, the | ||||
latter is an outcome of the end-system behaviour, and therefore | ||||
outside the scope of this AQM document, even though the AQM is an | ||||
enabler. | ||||
The overall L4S architecture [I-D.ietf-tsvwg-l4s-arch] gives more | The overall L4S architecture [I-D.ietf-tsvwg-l4s-arch] gives more | |||
detail, including on wider deployment aspects such as coexistence in | detail, including on wider deployment aspects such as backwards | |||
bottlenecks where a DualQ Coupled AQM has not been deployed. The | compatibility of Scalable congestion controls in bottlenecks where a | |||
supporting papers [PI2] and [DCttH15] give the full rationale for the | DualQ Coupled AQM has not been deployed. The supporting papers [PI2] | |||
AQM's design, both discursively and in more precise mathematical | and [DCttH15] give the full rationale for the AQM's design, both | |||
form. | discursively and in more precise mathematical form. | |||
1.2. Terminology | 1.3. Terminology | |||
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | |||
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this | "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this | |||
document are to be interpreted as described in [RFC2119] when, and | document are to be interpreted as described in [RFC2119] when, and | |||
only when, they appear in all capitals, as shown here. | only when, they appear in all capitals, as shown here. | |||
The DualQ Coupled AQM uses two queues for two services. Each of the | The DualQ Coupled AQM uses two queues for two services. Each of the | |||
following terms identifies both the service and the queue that | following terms identifies both the service and the queue that | |||
provides the service: | provides the service: | |||
Classic (denoted by subscript C): The `Classic' service is intended | Classic (denoted by subscript C): The `Classic' service is intended | |||
for all the behaviours that currently co-exist with TCP Reno (TCP | for all the behaviours that currently co-exist with TCP Reno (TCP | |||
Cubic, Compound, SCTP, etc). | Cubic, Compound, SCTP, etc). | |||
Low-Latency, Low-Loss and Scalable (L4S, denoted by subscript L): | Low-Latency, Low-Loss and Scalable (L4S, denoted by subscript L): | |||
The `L4S' service is intended for a set of congestion controls | The `L4S' service is intended for a set of congestion controls | |||
with scalable properties (e.g. DCTCP [RFC8257], Relentless | with scalable properties, such as TCP Prague and DCTCP. For the | |||
TCP [Mathis09], the L4S variant of SCREAM for real-time | public Internet an L4S transport has to comply with the | |||
media {ToDo: ref}). For the public Internet a scalable control | requirements in Section 4 of [I-D.ietf-tsvwg-ecn-l4s-id] (aka. | |||
has to comply with the requirements in [I-D.ietf-tsvwg-ecn-l4s-id] | the 'Prague L4S requirements'). | |||
(aka. the 'TCP Prague requirements'). | ||||
Either service can cope with a proportion of unresponsive or less- | Either service can cope with a proportion of unresponsive or less- | |||
responsive traffic as well, as long (e.g. DNS, VoIP, game sync | responsive traffic as well, as long (e.g. DNS, VoIP, game sync | |||
datagrams, etc), just as a single queue AQM can if this traffic makes | datagrams, etc), just as a single queue AQM can if this traffic makes | |||
minimal contribution to queuing. The DualQ Coupled AQM behaviour | minimal contribution to queuing. The DualQ Coupled AQM behaviour | |||
below is defined to be similar to a single FIFO queue with respect to | below is defined to be similar to a single FIFO queue with respect to | |||
unresponsive and overload traffic. | unresponsive and overload traffic. | |||
1.3. Features | 1.4. Features | |||
The AQM couples marking and/or dropping across the two queues such | The AQM couples marking and/or dropping from the Classic queue to the | |||
that a flow will get roughly the same throughput whichever it uses. | L4S queue in such a way that a flow will get roughly the same | |||
Therefore both queues can feed into the full capacity of a link and | throughput whichever it uses. Therefore both queues can feed into | |||
no rates need to be configured for the queues. The L4S queue enables | the full capacity of a link and no rates need to be configured for | |||
scalable congestion controls like DCTCP to give stunningly low and | the queues. The L4S queue enables Scalable congestion controls like | |||
predictably low latency, without compromising the performance of | DCTCP or TCP Prague to give stunningly low and predictably low | |||
competing 'Classic' Internet traffic. Thousands of tests have been | latency, without compromising the performance of competing 'Classic' | |||
conducted in a typical fixed residential broadband setting. Typical | Internet traffic. | |||
experiments used base round trip delays up to 100ms between the data | ||||
centre and home network, and large amounts of background traffic in | Thousands of tests have been conducted in a typical fixed residential | |||
both queues. For every L4S packet, the AQM kept the average queuing | broadband setting. Experiments used a range of base round trip | |||
delay below 1ms (or 2 packets if serialization delay is bigger for | delays up to 100ms and link rates up to 200 Mb/s between the data | |||
slow links), and no losses at all were introduced by the AQM. | centre and home network, with varying amounts of background traffic | |||
Details of the extensive experiments are available [PI2] [DCttH15]. | in both queues. For every L4S packet, the AQM kept the average | |||
queuing delay below 1ms (or 2 packets where serialization delay | ||||
exceeded 1ms on slower links), with 99th percentile no worse than | ||||
2ms. No losses at all were introduced by the L4S AQM. Details of | ||||
the extensive experiments are available [PI2] [DCttH15]. | ||||
Subjective testing was also conducted by multiple people all | Subjective testing was also conducted by multiple people all | |||
simultaneously using very demanding high bandwidth low latency | simultaneously using very demanding high bandwidth low latency | |||
applications over a single shared access link [L4Sdemo16]. In one | applications over a single shared access link [L4Sdemo16]. In one | |||
application, each user could use finger gestures to pan or zoom their | application, each user could use finger gestures to pan or zoom their | |||
own high definition (HD) sub-window of a larger video scene generated | own high definition (HD) sub-window of a larger video scene generated | |||
on the fly in 'the cloud' from a football match. Another user | on the fly in 'the cloud' from a football match. Another user | |||
wearing VR goggles was remotely receiving a feed from a 360-degree | wearing VR goggles was remotely receiving a feed from a 360-degree | |||
camera in a racing car, again with the sub-window in their field of | camera in a racing car, again with the sub-window in their field of | |||
vision generated on the fly in 'the cloud' dependent on their head | vision generated on the fly in 'the cloud' dependent on their head | |||
skipping to change at page 7, line 18 ¶ | skipping to change at page 8, line 48 ¶ | |||
latency was so low that the football picture appeared to stick to the | latency was so low that the football picture appeared to stick to the | |||
user's finger on the touchpad and the experience fed from the remote | user's finger on the touchpad and the experience fed from the remote | |||
camera did not noticeably lag head movements. All the L4S data (even | camera did not noticeably lag head movements. All the L4S data (even | |||
including the downloads) achieved the same ultra-low latency. With | including the downloads) achieved the same ultra-low latency. With | |||
an alternative AQM, the video noticeably lagged behind the finger | an alternative AQM, the video noticeably lagged behind the finger | |||
gestures and head movements. | gestures and head movements. | |||
Unlike Diffserv Expedited Forwarding, the L4S queue does not have to | Unlike Diffserv Expedited Forwarding, the L4S queue does not have to | |||
be limited to a small proportion of the link capacity in order to | be limited to a small proportion of the link capacity in order to | |||
achieve low delay. The L4S queue can be filled with a heavy load of | achieve low delay. The L4S queue can be filled with a heavy load of | |||
capacity-seeking flows like DCTCP and still achieve low delay. The | capacity-seeking flows (TCP Prague etc.) and still achieve low delay. | |||
L4S queue does not rely on the presence of other traffic in the | The L4S queue does not rely on the presence of other traffic in the | |||
Classic queue that can be 'overtaken'. It gives low latency to L4S | Classic queue that can be 'overtaken'. It gives low latency to L4S | |||
traffic whether or not there is Classic traffic, and the latency of | traffic whether or not there is Classic traffic, and the latency of | |||
Classic traffic does not suffer when a proportion of the traffic is | Classic traffic does not suffer when a proportion of the traffic is | |||
L4S. The two queues are only necessary because DCTCP-like flows | L4S. | |||
cannot keep latency predictably low and keep utilization high if they | ||||
are mixed with legacy TCP flows, | ||||
The experiments used the Linux implementation of DCTCP that is | The two queues are only necessary because: | |||
deployed in private data centres, without any modification despite | ||||
its known deficiencies. Nonetheless, certain modifications will be | o the large variations (sawteeth) of Classic flows need roughly a | |||
necessary before DCTCP is safe to use on the Internet, which are | base RTT of queuing delay to ensure full utilization | |||
recorded in Appendix A of [I-D.ietf-tsvwg-ecn-l4s-id]. However, the | ||||
focus of this specification is to get the network service in place. | o while Scalable flows do not need a queue to keep utilization high, | |||
Then, without any management intervention, applications can exploit | but they cannot keep latency predictably low if they are mixed | |||
it by migrating to scalable controls like DCTCP, which can then | with legacy TCP flows, | |||
evolve _while_ their benefits are being enjoyed by everyone on the | ||||
Internet. | The L4S queue has latency priority, but the coupling from the Classic | |||
to the L4S AQM (explained below) ensures that it does not have | ||||
bandwidth priority over the Classic queue. | ||||
2. DualQ Coupled AQM | 2. DualQ Coupled AQM | |||
There are two main aspects to the approach: | There are two main aspects to the approach: | |||
o the Coupled AQM that addresses throughput equivalence between | o the Coupled AQM that addresses throughput equivalence between | |||
Classic (e.g. Reno, Cubic) flows and L4S flows (that satisfy the | Classic (e.g. Reno, Cubic) flows and L4S flows (that satisfy the | |||
TCP Prague requirements). | Prague L4S requirements). | |||
o the Dual Queue structure that provides latency separation for L4S | o the Dual Queue structure that provides latency separation for L4S | |||
flows to isolate them from the typically large Classic queue. | flows to isolate them from the typically large Classic queue. | |||
2.1. Coupled AQM | 2.1. Coupled AQM | |||
In the 1990s, the `TCP formula' was derived for the relationship | In the 1990s, the `TCP formula' was derived for the relationship | |||
between TCP's congestion window, cwnd, and its drop probability, p. | between TCP's congestion window, cwnd, and its drop probability, p. | |||
To a first order approximation, cwnd of TCP Reno is inversely | To a first order approximation, cwnd of TCP Reno is inversely | |||
proportional to the square root of p. | proportional to the square root of p. | |||
We focus on Reno as the worst case, because if we do not harm Reno, | The design focuses on Reno as the worst case, because if it does no | |||
we will not harm Cubic. Nonetheless, TCP Cubic implements a Reno- | harm to Reno, it will not harm Cubic or any traffic designed to be | |||
compatibility mode, which is the only relevant mode for typical RTTs | friendly to Reno. TCP Cubic implements a Reno-compatibility mode, | |||
under 20ms as long as the throughput of a single flow is less than | which is relevant for typical RTTs under 20ms as long as the | |||
about 500Mb/s. Therefore it can be assumed that Cubic traffic | throughput of a single flow is less than about 700Mb/s. In such | |||
behaves similarly to Reno (but with a slightly different constant of | cases it can be assumed that Cubic traffic behaves similarly to Reno | |||
proportionality). The term 'Classic' will be used for the collection | (but with a slightly different constant of proportionality). The | |||
of Reno-friendly traffic including Cubic in Reno mode. | term 'Classic' will be used for the collection of Reno-friendly | |||
traffic including Cubic in Reno mode. | ||||
The supporting paper [PI2] includes the derivation of the equivalent | The supporting paper [PI2] includes the derivation of the equivalent | |||
rate equation for DCTCP, for which cwnd is inversely proportional to | rate equation for DCTCP, for which cwnd is inversely proportional to | |||
p (not the square root), where in this case p is the ECN marking | p (not the square root), where in this case p is the ECN marking | |||
probability. DCTCP is not the only congestion control that behaves | probability. DCTCP is not the only congestion control that behaves | |||
like this, so the term 'L4S' traffic will be used for all similar | like this, so the term 'Scalable' will be used for all similar | |||
behaviour. | congestion control behaviours (see examples in Section 1.2). The | |||
term 'L4S' is also used for traffic driven by a Scalable congestion | ||||
control that also complies with the additional 'Prague L4S' | ||||
requirements [I-D.ietf-tsvwg-ecn-l4s-id]. | ||||
For safe co-existence, under stationary conditions, a DCTCP flow has | For safe co-existence, under stationary conditions, a Scalable flow | |||
to run at roughly the same rate as a Reno TCP flow (all other factors | has to run at roughly the same rate as a Reno TCP flow (all other | |||
being equal). So the drop or marking probability for Classic | factors being equal). So the drop or marking probability for Classic | |||
traffic, p_C has to be distinct from the marking probability for L4S | traffic, p_C has to be distinct from the marking probability for L4S | |||
traffic, p_L. [RFC8311] updates the original ECN specification | traffic, p_L. [RFC8311] updates the original ECN specification | |||
[RFC3168] to allow these probabilities to be distinct, because RFC | [RFC3168] to allow these probabilities to be distinct, because RFC | |||
3168 required them to be the same. | 3168 required them to be the same. | |||
Also, to remain stable, Classic sources need the network to smooth | Also, to remain stable, Classic sources need the network to smooth | |||
p_C so it changes relatively slowly. In contrast, L4S avoids | p_C so it changes relatively slowly. It is hard for a network node | |||
smoothing in the network, because it delays all signals for a worst- | to know the RTTs of all the flows, so a Classic AQM adds a _worst- | |||
case RTT. So instead, L4S sources smooth the ECN marking probability | case_ RTT of smoothing delay (about 100-200 ms). In contrast, L4S | |||
themselves, so they expect the network to generate ECN marks with a | shifts responsibility for smoothing ECN feedback to the sender, which | |||
probability p_L that tracks the instantaneous unsmoothed queue. | only delays its response by its _own_ RTT, and allows a more | |||
immediate response if necessary. | ||||
The Coupled AQM achieves safe coexistence by making the Classic drop | The Coupled AQM achieves safe coexistence by making the Classic drop | |||
probability p_C proportional to the square of the coupled L4S | probability p_C proportional to the square of the coupled L4S | |||
probability p_CL. p_CL is an input to the instantaneous L4S marking | probability p_CL. p_CL is an input to the instantaneous L4S marking | |||
probability p_L but it changes as slowly as p_C. This makes the Reno | probability p_L but it changes as slowly as p_C. This makes the Reno | |||
flow rate roughly equal the DCTCP flow rate, because the squaring of | flow rate roughly equal the DCTCP flow rate, because the squaring of | |||
p_CL counterbalances the square root of p_C in the Classic 'TCP | p_CL counterbalances the square root of p_C in the Classic 'TCP | |||
formula'. | formula'. | |||
Stating this as a formula, the relation between Classic drop | Stating this as a formula, the relation between Classic drop | |||
probability, p_C, and the coupled L4S probability p_CL needs to take | probability, p_C, and the coupled L4S probability p_CL needs to take | |||
the form: | the form: | |||
p_C = ( p_CL / k )^2 (1) | p_C = ( p_CL / k )^2 (1) | |||
where k is the constant of proportionality, which we shall call the | where k is the constant of proportionality, which is termed the | |||
coupling factor. | coupling factor. | |||
2.2. Dual Queue | 2.2. Dual Queue | |||
Classic traffic typically builds a large queue to prevent under- | Classic traffic needs to build a large queue to prevent under- | |||
utilization. Therefore a separate queue is provided for L4S traffic, | utilization. Therefore a separate queue is provided for L4S traffic, | |||
and it is scheduled with priority over Classic. Priority is | and it is scheduled with priority over the Classic queue. Priority | |||
conditional to prevent starvation of Classic traffic. | is conditional to prevent starvation of Classic traffic. | |||
Nonetheless, coupled marking ensures that giving priority to L4S | Nonetheless, coupled marking ensures that giving priority to L4S | |||
traffic still leaves the right amount of spare scheduling time for | traffic still leaves the right amount of spare scheduling time for | |||
Classic flows to each get equivalent throughput to DCTCP flows (all | Classic flows to each get equivalent throughput to DCTCP flows (all | |||
other factors such as RTT being equal). | other factors such as RTT being equal). | |||
2.3. Traffic Classification | 2.3. Traffic Classification | |||
Both the Coupled AQM and DualQ mechanisms need an identifier to | Both the Coupled AQM and DualQ mechanisms need an identifier to | |||
distinguish L and C packets. Then the coupling algorithm can achieve | distinguish L and C packets. Then the coupling algorithm can achieve | |||
coexistence without having to inspect flow identifiers, because it | coexistence without having to inspect flow identifiers, because it | |||
can apply the appropriate marking or dropping probability to all | can apply the appropriate marking or dropping probability to all | |||
flows of each type. A separate | flows of each type. A separate | |||
specification [I-D.ietf-tsvwg-ecn-l4s-id] requires the sender to use | specification [I-D.ietf-tsvwg-ecn-l4s-id] requires the sender to use | |||
the ECT(1) codepoint of the ECN field as this identifier, having | the ECT(1) and CE codepoints of the ECN field as this identifier, | |||
assessed various alternatives. An additional process document has | having assessed various alternatives. An additional process document | |||
proved necessary to make the ECT(1) codepoint available for | has proved necessary to make the ECT(1) codepoint available for | |||
experimentation [RFC8311]. | experimentation [RFC8311]. | |||
For policy reasons, an operator might choose to steer certain packets | For policy reasons, an operator might choose to steer certain packets | |||
(e.g. from certain flows or with certain addresses) out of the L | (e.g. from certain flows or with certain addresses) out of the L | |||
queue, even though they identify themselves as L4S by their ECN | queue, even though they identify themselves as L4S by their ECN | |||
codepoints. In such cases, the device MUST NOT alter the ECN field, | codepoints. In such cases, [I-D.ietf-tsvwg-ecn-l4s-id] says that the | |||
so that it is preserved end-to-end. The aim is that each operator | device "MUST NOT alter the end-to-end L4S ECN identifier", so that it | |||
can choose how it treats L4S traffic locally, but an individual | is preserved end-to-end. The aim is that each operator can choose | |||
operator does not alter the identification of L4S packets, which | how it treats L4S traffic locally, but an individual operator does | |||
would prevent other operators downstream from making their own | not alter the identification of L4S packets, which would prevent | |||
choices on how to treat L4S traffic. | other operators downstream from making their own choices on how to | |||
treat L4S traffic. | ||||
In addition, other identifiers could be used to classify certain | In addition, an operator could use other identifiers to classify | |||
additional packet types into the L queue, that are deemed not to risk | certain additional packet types into the L queue that it deems will | |||
harming the L4S service. For instance addresses of specific | not risk harm to the L4S service. For instance addresses of specific | |||
applications or hosts (see [I-D.ietf-tsvwg-ecn-l4s-id]), specific | applications or hosts (see [I-D.ietf-tsvwg-ecn-l4s-id]), specific | |||
Diffserv codepoints such as EF (Expedited Forwarding) and Voice-Admit | Diffserv codepoints such as EF (Expedited Forwarding) and Voice-Admit | |||
service classes (see [I-D.briscoe-tsvwg-l4s-diffserv]) or certain | service classes (see [I-D.briscoe-tsvwg-l4s-diffserv]) or certain | |||
protocols (e.g. ARP, DNS). | protocols (e.g. ARP, DNS). Note that the mechanism only reads these | |||
identifiers. [I-D.ietf-tsvwg-ecn-l4s-id] says it "MUST NOT alter | ||||
Note that the mechanism only reads these classifiers, it MUST NOT re- | these non-ECN identifiers". | |||
mark or alter these identifiers (except for marking the ECN field | ||||
with the CE codepoint - with increasing frequency to indicate | ||||
increasing congestion). | ||||
2.4. Overall DualQ Coupled AQM Structure | 2.4. Overall DualQ Coupled AQM Structure | |||
Figure 1 shows the overall structure that any DualQ Coupled AQM is | Figure 1 shows the overall structure that any DualQ Coupled AQM is | |||
likely to have. This schematic is intended to aid understanding of | likely to have. This schematic is intended to aid understanding of | |||
the current designs of DualQ Coupled AQMs. However, it is not | the current designs of DualQ Coupled AQMs. However, it is not | |||
intended to preclude other innovative ways of satisfying the | intended to preclude other innovative ways of satisfying the | |||
normative requirements in Section 2.5 that minimally define a DualQ | normative requirements in Section 2.5 that minimally define a DualQ | |||
Coupled AQM. | Coupled AQM. | |||
skipping to change at page 10, line 49 ¶ | skipping to change at page 12, line 29 ¶ | |||
So the slow-moving input to ECN marking in the L queue (the coupled | So the slow-moving input to ECN marking in the L queue (the coupled | |||
L4S probability) is: | L4S probability) is: | |||
p_CL = k*p', (3) | p_CL = k*p', (3) | |||
where k is the constant coupling factor (see Appendix C). | where k is the constant coupling factor (see Appendix C). | |||
It can be seen that these two transformations of p' implement the | It can be seen that these two transformations of p' implement the | |||
required coupling given in equation (1) earlier. | required coupling given in equation (1) earlier. | |||
The actual probability p_L that we apply to the L queue needs to | The actual ECN marking probability p_L that is applied to the L queue | |||
track the immediate L queue delay, as well as track p_CL under | needs to track the immediate L queue delay under L-only congestion | |||
stationary conditions. So we use a native AQM in the L queue that | conditions, as well as track p_CL under coupled congestion | |||
calculates a probability p'_L as a function of the instantaneous L | conditions. So the L queue uses a native AQM that calculates a | |||
queue. And, given the L queue has conditional strict priority over | probability p'_L as a function of the instantaneous L queue delay. | |||
the C queue, whenever the L queue grows, we should apply marking | And, given the L queue has conditional strict priority over the C | |||
queue, whenever the L queue grows, the AQM should apply marking | ||||
probability p'_L, but p_L should not fall below p_CL. This suggests: | probability p'_L, but p_L should not fall below p_CL. This suggests: | |||
p_L = max(p'_L, p_CL), (4) | p_L = max(p'_L, p_CL), (4) | |||
which has also been found to work very well in practice. | which has also been found to work very well in practice. | |||
_________ | _________ | |||
| | ,------. | | | ,------. | |||
L4S queue | |===>| ECN | | L4S queue | |===>| ECN | | |||
,'| _______|_| |marker|\ | ,'| _______|_| |marker|\ | |||
skipping to change at page 12, line 8 ¶ | skipping to change at page 13, line 47 ¶ | |||
where a continually busy L4S queue blocks a DNS request in the | where a continually busy L4S queue blocks a DNS request in the | |||
Classic queue, arbitrarily delaying the start of a new Classic flow. | Classic queue, arbitrarily delaying the start of a new Classic flow. | |||
Example DualQ Coupled AQM algorithms called DualPI2 and Curvy RED are | Example DualQ Coupled AQM algorithms called DualPI2 and Curvy RED are | |||
given in Appendix A and Appendix B. Either example AQM can be used | given in Appendix A and Appendix B. Either example AQM can be used | |||
to couple packet marking and dropping across a dual Q. | to couple packet marking and dropping across a dual Q. | |||
DualPI2 uses a Proportional-Integral (PI) controller as the Base AQM. | DualPI2 uses a Proportional-Integral (PI) controller as the Base AQM. | |||
Indeed, this Base AQM with just the squared output and no L4S queue | Indeed, this Base AQM with just the squared output and no L4S queue | |||
can be used as a drop-in replacement for PIE [RFC8033], in which case | can be used as a drop-in replacement for PIE [RFC8033], in which case | |||
we call it just PI2 [PI2]. PI2 is a principled simplification of PIE | it is just called PI2 [PI2]. PI2 is a principled simplification of | |||
that is both more responsive and more stable in the face of | PIE that is both more responsive and more stable in the face of | |||
dynamically varying load. | dynamically varying load. | |||
Curvy RED is derived from RED [RFC2309], but its configuration | Curvy RED is derived from RED [RFC2309], but its configuration | |||
parameters are insensitive to link rate and it requires less | parameters are insensitive to link rate and it requires less | |||
operations per packet. However, DualPI2 is more responsive and | operations per packet. However, DualPI2 is more responsive and | |||
stable over a wider range of RTTs than Curvy RED. As a consequence, | stable over a wider range of RTTs than Curvy RED. As a consequence, | |||
DualPI2 has attracted more development attention than Curvy RED, | DualPI2 has attracted more development and evaluation attention than | |||
leaving the Curvy RED design incomplete and not so fully evaluated. | Curvy RED, leaving the Curvy RED design incomplete and not so fully | |||
evaluated. | ||||
Both AQMs regulate their queue in units of time not bytes. As | Both AQMs regulate their queue in units of time rather than bytes. | |||
already explained, this ensures configuration can be invariant for | As already explained, this ensures configuration can be invariant for | |||
different drain rates. With AQMs in a dualQ structure this is | different drain rates. With AQMs in a dualQ structure this is | |||
particularly important because the drain rate of each queue can vary | particularly important because the drain rate of each queue can vary | |||
rapidly as flows for the two queues arrive and depart, even if the | rapidly as flows for the two queues arrive and depart, even if the | |||
combined link rate is constant. | combined link rate is constant. | |||
It would be possible to control the queues with other alternative | It would be possible to control the queues with other alternative | |||
AQMs, as long as the normative requirements (those expressed in | AQMs, as long as the normative requirements (those expressed in | |||
capitals) in Section 2.5 are observed. | capitals) in Section 2.5 are observed. | |||
2.5. Normative Requirements for a DualQ Coupled AQM | 2.5. Normative Requirements for a DualQ Coupled AQM | |||
skipping to change at page 12, line 42 ¶ | skipping to change at page 14, line 36 ¶ | |||
The following requirements are intended to capture only the essential | The following requirements are intended to capture only the essential | |||
aspects of a DualQ Coupled AQM. They are intended to be independent | aspects of a DualQ Coupled AQM. They are intended to be independent | |||
of the particular AQMs used for each queue. | of the particular AQMs used for each queue. | |||
2.5.1. Functional Requirements | 2.5.1. Functional Requirements | |||
A Dual Queue Coupled AQM implementation MUST utilize two queues, each | A Dual Queue Coupled AQM implementation MUST utilize two queues, each | |||
with an AQM algorithm. The two queues can be part of a larger | with an AQM algorithm. The two queues can be part of a larger | |||
queuing hierarchy [I-D.briscoe-tsvwg-l4s-diffserv]. | queuing hierarchy [I-D.briscoe-tsvwg-l4s-diffserv]. | |||
The AQM algorithm for the low latency (L) queue MUST apply ECN | The AQM algorithm for the low latency (L) queue MUST be able to apply | |||
marking. | ECN marking to ECN-capable packets. | |||
The scheduler draining the two queues MUST give L4S packets priority | The scheduler draining the two queues MUST give L4S packets priority | |||
over Classic, although priority MUST be bounded in order not to | over Classic, although priority MUST be bounded in order not to | |||
starve Classic traffic. | starve Classic traffic. | |||
[I-D.ietf-tsvwg-ecn-l4s-id] defines the meaning of an ECN marking on | [I-D.ietf-tsvwg-ecn-l4s-id] defines the meaning of an ECN marking on | |||
L4S traffic, relative to drop of Classic traffic. In order to | L4S traffic, relative to drop of Classic traffic. In order to ensure | |||
prevent starvation of Classic traffic by scalable L4S traffic, it | coexistence of Classic and Scalable L4S traffic, it says, "The | |||
says, "The likelihood that an AQM drops a Not-ECT Classic packet | likelihood that an AQM drops a Not-ECT Classic packet (p_C) MUST be | |||
(p_C) MUST be roughly proportional to the square of the likelihood | roughly proportional to the square of the likelihood that it would | |||
that it would have marked it if it had been an L4S packet (p_L)." | have marked it if it had been an L4S packet (p_L)." The term | |||
The term 'likelihood' is used to allow for marking and dropping to be | 'likelihood' is used to allow for marking and dropping to be either | |||
either probabilistic or deterministic. | probabilistic or deterministic. | |||
For the current specification, this translates into the following | For the current specification, this translates into the following | |||
requirement. A DualQ Coupled AQM MUST apply ECN marking to traffic | requirement. A DualQ Coupled AQM MUST apply ECN marking to traffic | |||
in the L queue that is no lower than that derived from the likelihood | in the L queue that is no lower than that derived from the likelihood | |||
of drop (or ECN marking) in the Classic queue using Eqn. (1). | of drop (or ECN marking) in the Classic queue using Eqn. (1). | |||
The constant of proportionality, k, in Eqn (1) determines the | The constant of proportionality, k, in Eqn (1) determines the | |||
relative flow rates of Classic and L4S flows when the AQM concerned | relative flow rates of Classic and L4S flows when the AQM concerned | |||
is the bottleneck (all other factors being equal). | is the bottleneck (all other factors being equal). | |||
[I-D.ietf-tsvwg-ecn-l4s-id] says, "The constant of proportionality | [I-D.ietf-tsvwg-ecn-l4s-id] says, "The constant of proportionality | |||
(k) does not have to be standardised for interoperability, but a | (k) does not have to be standardised for interoperability, but a | |||
value of 2 is RECOMMENDED." | value of 2 is RECOMMENDED." | |||
Assuming scalable congestion controls for the Internet will be as | Assuming Scalable congestion controls for the Internet will be as | |||
aggressive as DCTCP, this will ensure their congestion window will be | aggressive as DCTCP, this will ensure their congestion window will be | |||
roughly the same as that of a standards track TCP congestion control | roughly the same as that of a standards track TCP congestion control | |||
(Reno) [RFC5681] and other so-called TCP-friendly controls, such as | (Reno) [RFC5681] and other so-called TCP-friendly controls, such as | |||
TCP Cubic in its TCP-friendly mode. | TCP Cubic in its TCP-friendly mode. | |||
The choice of k is a matter of operator policy, and operators MAY | The choice of k is a matter of operator policy, and operators MAY | |||
choose a different value using Table 1 and the guidelines in | choose a different value using Table 1 and the guidelines in | |||
Appendix C. | Appendix C. | |||
If multiple users share capacity at a bottleneck (e.g. in the | If multiple customers or users share capacity at a bottleneck (e.g. | |||
Internet access link of a campus network), the operator's choice of k | in the Internet access link of a campus network), the operator's | |||
will determine capacity sharing between the flows of different users. | choice of k will determine capacity sharing between the flows of | |||
However, on the public Internet, access network operators typically | different customers. However, on the public Internet, access network | |||
isolate customers from each other with some form of layer-2 | operators typically isolate customers from each other with some form | |||
multiplexing (OFDM(A) in DOCSIS3.1, CDMA in 3G, SC-FDMA in LTE) or L3 | of layer-2 multiplexing (OFDM(A) in DOCSIS3.1, CDMA in 3G, SC-FDMA in | |||
scheduling (WRR in DSL), rather than relying on TCP to share capacity | LTE) or L3 scheduling (WRR in DSL), rather than relying on TCP to | |||
between customers [RFC0970]. In such cases, the choice of k will | share capacity between customers [RFC0970]. In such cases, the | |||
solely affect relative flow rates within each customer's access | choice of k will solely affect relative flow rates within each | |||
capacity, not between customers. Also, k will not affect relative | customer's access capacity, not between customers. Also, k will not | |||
flow rates at any times when all flows are Classic or all L4S, and it | affect relative flow rates at any times when all flows are Classic or | |||
will not affect the relative throughput of small flows. | all flows are L4S, and it will not affect the relative throughput of | |||
small flows. | ||||
2.5.1.1. Requirements in Unexpected Cases | 2.5.1.1. Requirements in Unexpected Cases | |||
The flexibility to allow operator-specific classifiers (Section 2.3) | The flexibility to allow operator-specific classifiers (Section 2.3) | |||
leads to the need to specify what the AQM in each queue ought to do | leads to the need to specify what the AQM in each queue ought to do | |||
with packets that do not carry the ECN field expected for that queue. | with packets that do not carry the ECN field expected for that queue. | |||
It is recommended that the AQM in each queue inspects the ECN field | It is recommended that the AQM in each queue inspects the ECN field | |||
to determine what sort of congestion notification to signal, then | to determine what sort of congestion notification to signal, then | |||
decides whether to apply congestion notification to this particular | decides whether to apply congestion notification to this particular | |||
packet, as follows: | packet, as follows: | |||
skipping to change at page 14, line 38 ¶ | skipping to change at page 16, line 33 ¶ | |||
SHOULD apply drop using a drop probability appropriate to | SHOULD apply drop using a drop probability appropriate to | |||
Classic congestion control and appropriate to the target | Classic congestion control and appropriate to the target | |||
delay in the L queue | delay in the L queue | |||
o If a packet that carries an ECT(1) codepoint is classified into | o If a packet that carries an ECT(1) codepoint is classified into | |||
the C queue: | the C queue: | |||
* the C AQM SHOULD apply CE-marking using the coupled AQM | * the C AQM SHOULD apply CE-marking using the coupled AQM | |||
probability p_CL (= k*p'). | probability p_CL (= k*p'). | |||
If the DualQ Coupled AQM has detected overload, it will signal | ||||
congestion solely using drop, irrespective of the ECN field. | ||||
The above requirements are worded as "SHOULDs", because operator- | The above requirements are worded as "SHOULDs", because operator- | |||
specific classifiers are for flexibility, by definition. Therefore, | specific classifiers are for flexibility, by definition. Therefore, | |||
alternative actions might be appropriate in the operator's specific | alternative actions might be appropriate in the operator's specific | |||
circumstances. An example would be where the operator knows that | circumstances. An example would be where the operator knows that | |||
certain legacy traffic marked with one codepoint actually has a | certain legacy traffic marked with one codepoint actually has a | |||
congestion response associated with another codepoint. | congestion response associated with another codepoint. | |||
If the DualQ Coupled AQM has detected overload, it MUST signal | ||||
congestion solely using drop, irrespective of the ECN field. | ||||
Switching to drop if ECN marking is persistently high is required by | ||||
Section 7 of [RFC3168] and Section 4.2.1 of [RFC7567]. | ||||
2.5.2. Management Requirements | 2.5.2. Management Requirements | |||
2.5.2.1. Configuration | ||||
By default, a DualQ Coupled AQM SHOULD NOT need any configuration for | By default, a DualQ Coupled AQM SHOULD NOT need any configuration for | |||
use at a bottleneck on the public Internet [RFC7567]. The following | use at a bottleneck on the public Internet [RFC7567]. The following | |||
parameters MAY be operator-configurable, e.g. to tune for non- | parameters MAY be operator-configurable, e.g. to tune for non- | |||
Internet settings: | Internet settings: | |||
o Optional packet classifier(s) to use in addition to the ECN field | o Optional packet classifier(s) to use in addition to the ECN field | |||
(see Section 2.3); | (see Section 2.3); | |||
o Expected typical RTT (a parameter for typical or target queuing | o Expected typical RTT, which can be used to determine the queuing | |||
delay in each queue might be configurable instead; if so it MUST | delay of the Classic AQM at its operating point, in order to | |||
be expressed in units of time); | prevent typical lone TCP flows from under-utilizing capacity. For | |||
example: | ||||
o Expected maximum RTT (a stability parameter that depends on | * for the PI2 algorithm (Appendix A) the queuing delay target is | |||
maximum RTT might be configurable instead); | set to the typical RTT; | |||
* for the Curvy RED algorithm (Appendix B) the queuing delay at | ||||
the desired operating point of the curvy ramp is configured to | ||||
encompass a typical RTT; | ||||
* if another Classic AQM was used, it would be likely to need an | ||||
operating point for the queue based on the typical RTT, and if | ||||
so it SHOULD be expressed in units of time. | ||||
An operating point that is manually calculated might be directly | ||||
configurable instead, e.g. for links with large numbers of flows | ||||
where under-utilization by a single TCP flow would be unlikely. | ||||
o Expected maximum RTT, which can be used to set the stability | ||||
parameter(s) of the Classic AQM. For example: | ||||
* for the PI2 algorithm (Appendix A), the gain parameters of the | ||||
PI algorithm depend on the maximum RTT. | ||||
* for the Curvy RED algorithm (Appendix B) the smoothing | ||||
parameter is chosen to filter out transients in the queue | ||||
within a maximum RTT. | ||||
Stability parameter(s) that are manually calculated assuming a | ||||
maximum RTT might be directly configurable instead. | ||||
o Coupling factor, k; | o Coupling factor, k; | |||
o The limit to the conditional priority of L4S (scheduler-dependent, | o A limit to the conditional priority of L4S. This is scheduler- | |||
e.g. the scheduler weight for WRR, or the time-shift for time- | dependent, but it SHOULD be expressed as a relation between the | |||
shifted FIFO); | max delay of a C packet and an L packet. For example: | |||
* for a WRR scheduler a weight ratio between L and C of w:1 means | ||||
that the maximum delay to a C packet is w times that of an L | ||||
packet. | ||||
* for a time-shifted FIFO (TS-FIFO) scheduler (see Section 4.1.1) | ||||
a time-shift of tshift means that the maximum delay to a C | ||||
packet is tshift greater than that of an L packet. tshift could | ||||
be expressed as a multiple of the typical RTT rather than as an | ||||
absolute delay. | ||||
o The maximum Classic ECN marking probability, p_Cmax, before | o The maximum Classic ECN marking probability, p_Cmax, before | |||
switching over to drop. | switching over to drop. | |||
2.5.2.2. Monitoring | ||||
An experimental DualQ Coupled AQM SHOULD allow the operator to | An experimental DualQ Coupled AQM SHOULD allow the operator to | |||
monitor each of the following operational statistics on demand, per | monitor each of the following operational statistics on demand, per | |||
queue and per configurable sample interval, for performance | queue and per configurable sample interval, for performance | |||
monitoring and perhaps also for accounting in some cases: | monitoring and perhaps also for accounting in some cases: | |||
o Bits forwarded, from which utilization can be calculated; | o Bits forwarded, from which utilization can be calculated; | |||
o Total packets arriving, enqueued and dequeued to distinguish tail | o Total packets in the three categories: arrived, presented to the | |||
discard from proactive AQM discard; | AQM, and forwarded. The difference between the first two will | |||
measure any non-AQM tail discard. The difference between the last | ||||
two will measure proactive AQM discard; | ||||
o ECN packets marked, non-ECN packets dropped, ECN packets dropped, | o ECN packets marked, non-ECN packets dropped, ECN packets dropped, | |||
from which marking and dropping probabilities can be calculated; | which can be combined with the three total packet counts above to | |||
calculate marking and dropping probabilities; | ||||
o Queue delay (not including serialization delay of the head packet | o Queue delay (not including serialization delay of the head packet | |||
or medium acquisition delay) - see further notes below. | or medium acquisition delay) - see further notes below. | |||
Unlike the other statistics, queue delay cannot be captured in a | Unlike the other statistics, queue delay cannot be captured in a | |||
simple accumulating counter. Therefore the type of queue delay | simple accumulating counter. Therefore the type of queue delay | |||
statistics produced (mean, percentiles, etc.) will depend on | statistics produced (mean, percentiles, etc.) will depend on | |||
implementation constraints. To facilitate comparative evaluation | implementation constraints. To facilitate comparative evaluation | |||
of different implementations and approaches, an implementation | of different implementations and approaches, an implementation | |||
SHOULD allow mean and 99th percentile queue delay to be derived | SHOULD allow mean and 99th percentile queue delay to be derived | |||
(per queue per sample interval). A relatively simple way to do | (per queue per sample interval). A relatively simple way to do | |||
this would be to store a coarse-grained histogram of queue delay. | this would be to store a coarse-grained histogram of queue delay. | |||
This could be done with a small number of bins with configurable | This could be done with a small number of bins with configurable | |||
edges that represent contiguous ranges of queue delay. Then, over | edges that represent contiguous ranges of queue delay. Then, over | |||
a sample interval, each bin would accumulate a count of the number | a sample interval, each bin would accumulate a count of the number | |||
of packets that had fallen within each range. The maximum queue | of packets that had fallen within each range. The maximum queue | |||
delay per queue per interval MAY also be recorded. | delay per queue per interval MAY also be recorded. | |||
2.5.2.3. Anomaly Detection | ||||
An experimental DualQ Coupled AQM SHOULD asynchronously report the | An experimental DualQ Coupled AQM SHOULD asynchronously report the | |||
following data about anomalous conditions: | following data about anomalous conditions: | |||
o Start-time and duration of overload state. | o Start-time and duration of overload state. | |||
A hysteresis mechanism SHOULD be used to prevent flapping in and | A hysteresis mechanism SHOULD be used to prevent flapping in and | |||
out of overload causing an event storm. For instance, exit from | out of overload causing an event storm. For instance, exit from | |||
overload state could trigger one report, but also latch a timer. | overload state could trigger one report, but also latch a timer. | |||
Then, during that time, if the AQM enters and exits overload state | Then, during that time, if the AQM enters and exits overload state | |||
any number of times, the duration in overload state is accumulated | any number of times, the duration in overload state is accumulated | |||
but no new report is generated until the first time the AQM is out | but no new report is generated until the first time the AQM is out | |||
of overload once the timer has expired. | of overload once the timer has expired. | |||
2.5.2.4. Deployment, Coexistence and Scaling | ||||
[RFC5706] suggests that deployment, coexistence and scaling should | [RFC5706] suggests that deployment, coexistence and scaling should | |||
also be covered as management requirements. The raison d'etre of the | also be covered as management requirements. The raison d'etre of the | |||
DualQ Couple AQM is to enable deployment and coexistence of scalable | DualQ Coupled AQM is to enable deployment and coexistence of Scalable | |||
congestion controls - as incremental replacements for today's TCP- | congestion controls - as incremental replacements for today's TCP- | |||
friendly controls that do not scale with bandwidth-delay product. | friendly controls that do not scale with bandwidth-delay product. | |||
Therefore, these motivating issues are explained in the Introduction | Therefore there is no need to repeat these motivating issues here | |||
and detailed in the L4S architecture [I-D.ietf-tsvwg-l4s-arch]. | given they are already explained in the Introduction and detailed in | |||
Also, the descriptions of specific DualQ Coupled AQM algorithms in | the L4S architecture [I-D.ietf-tsvwg-l4s-arch]. | |||
the appendices cover scaling of their configuration parameters, e.g. | ||||
with respect to RTT and sampling frequency. | The descriptions of specific DualQ Coupled AQM algorithms in the | |||
appendices cover scaling of their configuration parameters, e.g. with | ||||
respect to RTT and sampling frequency. | ||||
3. IANA Considerations | 3. IANA Considerations | |||
This specification contains no IANA considerations. | This specification contains no IANA considerations. | |||
4. Security Considerations | 4. Security Considerations | |||
4.1. Overload Handling | 4.1. Overload Handling | |||
Where the interests of users or flows might conflict, it could be | Where the interests of users or flows might conflict, it could be | |||
skipping to change at page 17, line 22 ¶ | skipping to change at page 20, line 11 ¶ | |||
Under overload the higher priority L4S service will have to sacrifice | Under overload the higher priority L4S service will have to sacrifice | |||
some aspect of its performance. Alternative solutions are provided | some aspect of its performance. Alternative solutions are provided | |||
below that each relax a different factor: e.g. throughput, delay, | below that each relax a different factor: e.g. throughput, delay, | |||
drop. These choices need to be made either by the developer or by | drop. These choices need to be made either by the developer or by | |||
operator policy, rather than by the IETF. | operator policy, rather than by the IETF. | |||
4.1.1. Avoiding Classic Starvation: Sacrifice L4S Throughput or Delay? | 4.1.1. Avoiding Classic Starvation: Sacrifice L4S Throughput or Delay? | |||
Priority of L4S is required to be conditional to avoid total | Priority of L4S is required to be conditional to avoid total | |||
throughput starvation of Classic by heavy L4S traffic. This raises | starvation of Classic by heavy L4S traffic. This raises the question | |||
the question of whether to sacrifice L4S throughput or L4S delay (or | of whether to sacrifice L4S throughput or L4S delay (or some other | |||
some other policy) to mitigate starvation of Classic: | policy) to mitigate starvation of Classic: | |||
Sacrifice L4S throughput: By using weighted round robin as the | Sacrifice L4S throughput: By using weighted round robin as the | |||
conditional priority scheduler, the L4S service can sacrifice some | conditional priority scheduler, the L4S service can sacrifice some | |||
throughput during overload to guarantee a minimum throughput | throughput during overload. This can either be thought of as | |||
service for Classic traffic. The scheduling weight of the Classic | guaranteeing a minimum throughput service for Classic traffic, or | |||
queue should be small (e.g. 1/16). Then, in most traffic | as guaranteeing a maximum delay for a packet at the head of the | |||
scenarios the scheduler will not interfere and it will not need to | Classic queue. | |||
- the coupling mechanism and the end-systems will share out the | ||||
capacity across both queues as if it were a single pool. However, | The scheduling weight of the Classic queue should be small (e.g. | |||
because the congestion coupling only applies in one direction | 1/16). Then, in most traffic scenarios the scheduler will not | |||
(from C to L), if L4S traffic is over-aggressive or unresponsive, | interfere and it will not need to - the coupling mechanism and the | |||
the scheduler weight for Classic traffic will at least be large | end-systems will share out the capacity across both queues as if | |||
enough to ensure it does not starve. | it were a single pool. However, because the congestion coupling | |||
only applies in one direction (from C to L), if L4S traffic is | ||||
over-aggressive or unresponsive, the scheduler weight for Classic | ||||
traffic will at least be large enough to ensure it does not | ||||
starve. | ||||
In cases where the ratio of L4S to Classic flows (e.g. 19:1) is | In cases where the ratio of L4S to Classic flows (e.g. 19:1) is | |||
greater than the ratio of their scheduler weights (e.g. 15:1), the | greater than the ratio of their scheduler weights (e.g. 15:1), the | |||
L4S flows will get less than an equal share of the capacity, but | L4S flows will get less than an equal share of the capacity, but | |||
only slightly. For instance, with the example numbers given, each | only slightly. For instance, with the example numbers given, each | |||
L4S flow will get (15/16)/19 = 4.9% when ideally each would get | L4S flow will get (15/16)/19 = 4.9% when ideally each would get | |||
1/20=5%. In the rather specific case of an unresponsive flow | 1/20=5%. In the rather specific case of an unresponsive flow | |||
taking up a large part of the capacity set aside for L4S, using | taking up just less than the capacity set aside for L4S (e.g. | |||
WRR could significantly reduce the capacity left for any | 14/16 in the above example), using WRR could significantly reduce | |||
responsive L4S flows. | the capacity left for any responsive L4S flows. | |||
The scheduling weight of the Classic queue should not be too | ||||
small, otherwise a C packet at the head of the queue could be | ||||
excessively delayed by a continually busy L queue. For instance | ||||
if the Classic weight is 1/16, the maximum that a Classic packet | ||||
at the head of the queue can be delayed by L traffic is the | ||||
serialization delay of 15 MTU-sized packets. | ||||
Sacrifice L4S Delay: To control milder overload of responsive | Sacrifice L4S Delay: To control milder overload of responsive | |||
traffic, particularly when close to the maximum congestion signal, | traffic, particularly when close to the maximum congestion signal, | |||
the operator could choose to control overload of the Classic queue | the operator could choose to control overload of the Classic queue | |||
by allowing some delay to 'leak' across to the L4S queue. The | by allowing some delay to 'leak' across to the L4S queue. The | |||
scheduler can be made to behave like a single First-In First-Out | scheduler can be made to behave like a single First-In First-Out | |||
(FIFO) queue with different service times by implementing a very | (FIFO) queue with different service times by implementing a very | |||
simple conditional priority scheduler that could be called a | simple conditional priority scheduler that could be called a | |||
"time-shifted FIFO" (see the Modifier Earliest Deadline First | "time-shifted FIFO" (see the Modifier Earliest Deadline First | |||
(MEDF) scheduler of [MEDF]). This scheduler adds tshift to the | (MEDF) scheduler of [MEDF]). This scheduler adds tshift to the | |||
queue delay of the next L4S packet, before comparing it with the | queue delay of the next L4S packet, before comparing it with the | |||
queue delay of the next Classic packet, then it selects the packet | queue delay of the next Classic packet, then it selects the packet | |||
with the greater adjusted queue delay. Under regular conditions, | with the greater adjusted queue delay. Under regular conditions, | |||
this time-shifted FIFO scheduler behaves just like a strict | this time-shifted FIFO scheduler behaves just like a strict | |||
priority scheduler. But under moderate or high overload it | priority scheduler. But under moderate or high overload it | |||
prevents starvation of the Classic queue, because the time-shift | prevents starvation of the Classic queue, because the time-shift | |||
(tshift) defines the maximum extra queuing delay of Classic | (tshift) defines the maximum extra queuing delay of Classic | |||
packets relative to L4S. | packets relative to L4S. | |||
The example implementation in Appendix A can implement either policy. | The example implementations in Appendix A and Appendix B could both | |||
be implemented with either policy. | ||||
4.1.2. Congestion Signal Saturation: Introduce L4S Drop or Delay? | 4.1.2. Congestion Signal Saturation: Introduce L4S Drop or Delay? | |||
To keep the throughput of both L4S and Classic flows roughly equal | To keep the throughput of both L4S and Classic flows roughly equal | |||
over the full load range, a different control strategy needs to be | over the full load range, a different control strategy needs to be | |||
defined above the point where one AQM first saturates to a | defined above the point where one AQM first saturates to a | |||
probability of 100% leaving no room to push back the load any harder. | probability of 100% leaving no room to push back the load any harder. | |||
If k>1, L4S will saturate first, even though saturation could be | If k>1, L4S will saturate first, even though saturation could be | |||
caused by unresponsive traffic in either queue. | caused by unresponsive traffic in either queue. | |||
skipping to change at page 19, line 38 ¶ | skipping to change at page 22, line 38 ¶ | |||
Experiments with the DualPI2 AQM (Appendix A) have shown that | Experiments with the DualPI2 AQM (Appendix A) have shown that | |||
introducing 'drop on saturation' at 100% L4S marking addresses this | introducing 'drop on saturation' at 100% L4S marking addresses this | |||
problem with unresponsive ECN as well as addressing the saturation | problem with unresponsive ECN as well as addressing the saturation | |||
problem. It leaves only a small range of congestion levels where | problem. It leaves only a small range of congestion levels where | |||
unresponsive traffic gains any advantage from using the ECN | unresponsive traffic gains any advantage from using the ECN | |||
capability, and the advantage is hardly detectable [DualQ-Test]. | capability, and the advantage is hardly detectable [DualQ-Test]. | |||
5. Acknowledgements | 5. Acknowledgements | |||
Thanks to Anil Agarwal, Sowmini Varadhan's and Gabi Bracha for | Thanks to Anil Agarwal, Sowmini Varadhan's, Gabi Bracha, Nicolas | |||
detailed review comments particularly of the appendices and | Kuhn, Tom Henderson and David Pullen for detailed review comments | |||
suggestions on how to make our explanation clearer. Thanks also to | particularly of the appendices and suggestions on how to make the | |||
Greg White for improving the normative requirements and both Greg and | explanations clearer. Thanks also to Tom Henderson for insights on | |||
Tom Henderson for insights on the choice of schedulers, queue delay | the choice of schedulers and queue delay measurement techniques. | |||
measurement techniques. | ||||
The authors' contributions were originally part-funded by the | The early contributions of Koen De Schepper, Bob Briscoe, Olga | |||
European Community under its Seventh Framework Programme through the | Bondarenko and Inton Tsang were part-funded by the European Community | |||
Reducing Internet Transport Latency (RITE) project (ICT-317700). Bob | under its Seventh Framework Programme through the Reducing Internet | |||
Briscoe's contribution was also part-funded by the Research Council | Transport Latency (RITE) project (ICT-317700). Bob Briscoe's | |||
of Norway through the TimeIn project. The views expressed here are | contribution was also part-funded by the Research Council of Norway | |||
solely those of the authors. | through the TimeIn project. The views expressed here are solely | |||
those of the authors. | ||||
6. References | 6. Contributors | |||
6.1. Normative References | The following contributed implementations and evaluations that | |||
validated and helped to improve this specification: | ||||
Olga Albisser <olga@albisser.org> of Simula Research Lab, Norway | ||||
(Olga Bondarenko during early drafts) implemented the prototype | ||||
DualPI2 AQM for Linux with Koen De Schepper and conducted | ||||
extensive evaluations as well as implementing the live performance | ||||
visualization GUI [L4Sdemo16]. | ||||
Olivier Tilmans <olivier.tilmans@nokia-bell-labs.com> of Nokia | ||||
Bell Labs, Belgium prepared and maintains the Linux implementation | ||||
of DualPI2 for upstreaming. | ||||
Tom Henderson <tomh@tomh.org> of the University of Washington, WA, | ||||
US implemented various Coupled DualQ AQMs for ns3, including | ||||
DualPI2 and DualPIE over point to point and DOCSIS 3.1 link models | ||||
and conducted extensive evaluations. | ||||
Ing Jyh (Inton) Tsang of Nokia, Belgium built the End-to-End Data | ||||
Centre to the Home broadband testbed on which Coupled DualQ | ||||
implementations were tested. | ||||
7. References | ||||
7.1. Normative References | ||||
[I-D.ietf-tsvwg-ecn-l4s-id] | ||||
Schepper, K. and B. Briscoe, "Identifying Modified | ||||
Explicit Congestion Notification (ECN) Semantics for | ||||
Ultra-Low Queuing Delay (L4S)", draft-ietf-tsvwg-ecn-l4s- | ||||
id-06 (work in progress), March 2019. | ||||
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate | [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate | |||
Requirement Levels", BCP 14, RFC 2119, | Requirement Levels", BCP 14, RFC 2119, | |||
DOI 10.17487/RFC2119, March 1997, | DOI 10.17487/RFC2119, March 1997, | |||
<https://www.rfc-editor.org/info/rfc2119>. | <https://www.rfc-editor.org/info/rfc2119>. | |||
6.2. Informative References | [RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition | |||
of Explicit Congestion Notification (ECN) to IP", | ||||
RFC 3168, DOI 10.17487/RFC3168, September 2001, | ||||
<https://www.rfc-editor.org/info/rfc3168>. | ||||
[RFC8311] Black, D., "Relaxing Restrictions on Explicit Congestion | ||||
Notification (ECN) Experimentation", RFC 8311, | ||||
DOI 10.17487/RFC8311, January 2018, | ||||
<https://www.rfc-editor.org/info/rfc8311>. | ||||
7.2. Informative References | ||||
[Alizadeh-stability] | ||||
Alizadeh, M., Javanmard, A., and B. Prabhakar, "Analysis | ||||
of DCTCP: Stability, Convergence, and Fairness", ACM | ||||
SIGMETRICS 2011 , June 2011, | ||||
<https://dl.acm.org/citation.cfm?id=1993753>. | ||||
[AQMmetrics] | ||||
Kwon, M. and S. Fahmy, "A Comparison of Load-based and | ||||
Queue- based Active Queue Management Algorithms", Proc. | ||||
Int'l Soc. for Optical Engineering (SPIE) 4866:35--46 DOI: | ||||
10.1117/12.473021, 2002, | ||||
<https://www.cs.purdue.edu/homes/fahmy/papers/ldc.pdf>. | ||||
[ARED01] Floyd, S., Gummadi, R., and S. Shenker, "Adaptive RED: An | [ARED01] Floyd, S., Gummadi, R., and S. Shenker, "Adaptive RED: An | |||
Algorithm for Increasing the Robustness of RED's Active | Algorithm for Increasing the Robustness of RED's Active | |||
Queue Management", ACIRI Technical Report , August 2001, | Queue Management", ACIRI Technical Report , August 2001, | |||
<http://www.icir.org/floyd/red.html>. | <http://www.icir.org/floyd/red.html>. | |||
[BBRv1] Cardwell, N., Cheng, Y., Hassas Yeganeh, S., and V. | ||||
Jacobson, "BBR Congestion Control", Internet Draft draft- | ||||
cardwell-iccrg-bbr-congestion-control-00, July 2017, | ||||
<https://tools.ietf.org/html/ | ||||
draft-cardwell-iccrg-bbr-congestion-control-00>. | ||||
[CoDel] Nichols, K. and V. Jacobson, "Controlling Queue Delay", | [CoDel] Nichols, K. and V. Jacobson, "Controlling Queue Delay", | |||
ACM Queue 10(5), May 2012, | ACM Queue 10(5), May 2012, | |||
<http://queue.acm.org/issuedetail.cfm?issue=2208917>. | <http://queue.acm.org/issuedetail.cfm?issue=2208917>. | |||
[CRED_Insights] | [CRED_Insights] | |||
Briscoe, B., "Insights from Curvy RED (Random Early | Briscoe, B., "Insights from Curvy RED (Random Early | |||
Detection)", BT Technical Report TR-TUB8-2015-003, July | Detection)", BT Technical Report TR-TUB8-2015-003 | |||
2015, | arXiv:1904.07339 [cs.NI], July 2015, | |||
<http://www.bobbriscoe.net/projects/latency/credi_tr.pdf>. | <https://arxiv.org/abs/1904.07339>. | |||
[DCttH15] De Schepper, K., Bondarenko, O., Briscoe, B., and I. | [DCttH15] De Schepper, K., Bondarenko, O., Briscoe, B., and I. | |||
Tsang, "`Data Centre to the Home': Ultra-Low Latency for | Tsang, "`Data Centre to the Home': Ultra-Low Latency for | |||
All", 2015, <http://www.bobbriscoe.net/projects/latency/ | All", RITE project Technical Report , 2015, | |||
dctth_preprint.pdf>. | <http://riteproject.eu/publications/>. | |||
(Under submission) | [DOCSIS3.1] | |||
CableLabs, "MAC and Upper Layer Protocols Interface | ||||
(MULPI) Specification, CM-SP-MULPIv3.1", Data-Over-Cable | ||||
Service Interface Specifications DOCSIS(R) 3.1 Version i17 | ||||
or later, January 2019, <https://specification- | ||||
search.cablelabs.com/CM-SP-MULPIv3.1>. | ||||
[DualPI2Linux] | ||||
Albisser, O., De Schepper, K., Briscoe, B., Tilmans, O., | ||||
and H. Steen, "DUALPI2 - Low Latency, Low Loss and | ||||
Scalable (L4S) AQM", Proc. Linux Netdev 0x13 , March 2019, | ||||
<https://www.netdevconf.org/0x13/ | ||||
session.html?talk-DUALPI2-AQM>. | ||||
[DualQ-Test] | [DualQ-Test] | |||
Steen, H., "Destruction Testing: Ultra-Low Delay using | Steen, H., "Destruction Testing: Ultra-Low Delay using | |||
Dual Queue Coupled Active Queue Management", Masters | Dual Queue Coupled Active Queue Management", Masters | |||
Thesis, Dept of Informatics, Uni Oslo , May 2017. | Thesis, Dept of Informatics, Uni Oslo , May 2017. | |||
[I-D.briscoe-tsvwg-l4s-diffserv] | [I-D.briscoe-tsvwg-l4s-diffserv] | |||
Briscoe, B., "Interactions between Low Latency, Low Loss, | Briscoe, B., "Interactions between Low Latency, Low Loss, | |||
Scalable Throughput (L4S) and Differentiated Services", | Scalable Throughput (L4S) and Differentiated Services", | |||
draft-briscoe-tsvwg-l4s-diffserv-00 (work in progress), | draft-briscoe-tsvwg-l4s-diffserv-02 (work in progress), | |||
March 2018. | November 2018. | |||
[I-D.ietf-tsvwg-ecn-l4s-id] | ||||
Schepper, K., Briscoe, B., and I. Tsang, "Identifying | ||||
Modified Explicit Congestion Notification (ECN) Semantics | ||||
for Ultra-Low Queuing Delay", draft-ietf-tsvwg-ecn-l4s- | ||||
id-02 (work in progress), March 2018. | ||||
[I-D.ietf-tsvwg-l4s-arch] | [I-D.ietf-tsvwg-l4s-arch] | |||
Briscoe, B., Schepper, K., and M. Bagnulo, "Low Latency, | Briscoe, B., Schepper, K., and M. Bagnulo, "Low Latency, | |||
Low Loss, Scalable Throughput (L4S) Internet Service: | Low Loss, Scalable Throughput (L4S) Internet Service: | |||
Architecture", draft-ietf-tsvwg-l4s-arch-02 (work in | Architecture", draft-ietf-tsvwg-l4s-arch-03 (work in | |||
progress), March 2018. | progress), October 2018. | |||
[I-D.sridharan-tcpm-ctcp] | ||||
Sridharan, M., Tan, K., Bansal, D., and D. Thaler, | ||||
"Compound TCP: A New TCP Congestion Control for High-Speed | ||||
and Long Distance Networks", draft-sridharan-tcpm-ctcp-02 | ||||
(work in progress), November 2008. | ||||
[L4Sdemo16] | [L4Sdemo16] | |||
Bondarenko, O., De Schepper, K., Tsang, I., and B. | Bondarenko, O., De Schepper, K., Tsang, I., and B. | |||
Briscoe, "Ultra-Low Delay for All: Live Experience, Live | Briscoe, "Ultra-Low Delay for All: Live Experience, Live | |||
Analysis", Proc. MMSYS'16 pp33:1--33:4, May 2016, | Analysis", Proc. MMSYS'16 pp33:1--33:4, May 2016, | |||
<http://dl.acm.org/citation.cfm?doid=2910017.2910633 | <http://dl.acm.org/citation.cfm?doid=2910017.2910633 | |||
(videos of demos: https://riteproject.eu/ | (videos of demos: https://riteproject.eu/ | |||
dctth/#1511dispatchwg )>. | dctth/#1511dispatchwg )>. | |||
[LLD] White, G., Sundaresan, K., and B. Briscoe, "Low Latency | ||||
DOCSIS: Technology Overview", CableLabs White Paper , | ||||
February 2019, <https://cablela.bs/ | ||||
low-latency-docsis-technology-overview-february-2019>. | ||||
[Mathis09] | [Mathis09] | |||
Mathis, M., "Relentless Congestion Control", PFLDNeT'09 , | Mathis, M., "Relentless Congestion Control", PFLDNeT'09 , | |||
May 2009, <http://www.hpcc.jp/pfldnet2009/ | May 2009, <http://www.hpcc.jp/pfldnet2009/ | |||
Program_files/1569198525.pdf>. | Program_files/1569198525.pdf>. | |||
[MEDF] Menth, M., Schmid, M., Heiss, H., and T. Reim, "MEDF - a | [MEDF] Menth, M., Schmid, M., Heiss, H., and T. Reim, "MEDF - a | |||
simple scheduling algorithm for two real-time transport | simple scheduling algorithm for two real-time transport | |||
service classes with application in the UTRAN", Proc. IEEE | service classes with application in the UTRAN", Proc. IEEE | |||
Conference on Computer Communications (INFOCOM'03) Vol.2 | Conference on Computer Communications (INFOCOM'03) Vol.2 | |||
pp.1116-1122, March 2003. | pp.1116-1122, March 2003. | |||
[PI2] De Schepper, K., Bondarenko, O., Briscoe, B., and I. | [PI2] De Schepper, K., Bondarenko, O., Briscoe, B., and I. | |||
Tsang, "PI2: A Linearized AQM for both Classic and | Tsang, "PI2: A Linearized AQM for both Classic and | |||
Scalable TCP", ACM CoNEXT'16 , December 2016, | Scalable TCP", ACM CoNEXT'16 , December 2016, | |||
<https://riteproject.files.wordpress.com/2015/10/ | <https://riteproject.files.wordpress.com/2015/10/ | |||
pi2_conext.pdf>. | pi2_conext.pdf>. | |||
(To appear) | [PragueLinux] | |||
Briscoe, B., De Schepper, K., Albisser, O., Misund, J., | ||||
Tilmans, O., Kuehlewind, M., and A. Ahmed, "Implementing | ||||
the `TCP Prague' Requirements for Low Latency Low Loss | ||||
Scalable Throughput (L4S)", Proc. Linux Netdev 0x13 , | ||||
March 2019, <https://www.netdevconf.org/0x13/ | ||||
session.html?talk-tcp-prague-l4s>. | ||||
[RFC0970] Nagle, J., "On Packet Switches With Infinite Storage", | [RFC0970] Nagle, J., "On Packet Switches With Infinite Storage", | |||
RFC 970, DOI 10.17487/RFC0970, December 1985, | RFC 970, DOI 10.17487/RFC0970, December 1985, | |||
<https://www.rfc-editor.org/info/rfc970>. | <https://www.rfc-editor.org/info/rfc970>. | |||
[RFC2309] Braden, B., Clark, D., Crowcroft, J., Davie, B., Deering, | [RFC2309] Braden, B., Clark, D., Crowcroft, J., Davie, B., Deering, | |||
S., Estrin, D., Floyd, S., Jacobson, V., Minshall, G., | S., Estrin, D., Floyd, S., Jacobson, V., Minshall, G., | |||
Partridge, C., Peterson, L., Ramakrishnan, K., Shenker, | Partridge, C., Peterson, L., Ramakrishnan, K., Shenker, | |||
S., Wroclawski, J., and L. Zhang, "Recommendations on | S., Wroclawski, J., and L. Zhang, "Recommendations on | |||
Queue Management and Congestion Avoidance in the | Queue Management and Congestion Avoidance in the | |||
Internet", RFC 2309, DOI 10.17487/RFC2309, April 1998, | Internet", RFC 2309, DOI 10.17487/RFC2309, April 1998, | |||
<https://www.rfc-editor.org/info/rfc2309>. | <https://www.rfc-editor.org/info/rfc2309>. | |||
[RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition | ||||
of Explicit Congestion Notification (ECN) to IP", | ||||
RFC 3168, DOI 10.17487/RFC3168, September 2001, | ||||
<https://www.rfc-editor.org/info/rfc3168>. | ||||
[RFC3246] Davie, B., Charny, A., Bennet, J., Benson, K., Le Boudec, | [RFC3246] Davie, B., Charny, A., Bennet, J., Benson, K., Le Boudec, | |||
J., Courtney, W., Davari, S., Firoiu, V., and D. | J., Courtney, W., Davari, S., Firoiu, V., and D. | |||
Stiliadis, "An Expedited Forwarding PHB (Per-Hop | Stiliadis, "An Expedited Forwarding PHB (Per-Hop | |||
Behavior)", RFC 3246, DOI 10.17487/RFC3246, March 2002, | Behavior)", RFC 3246, DOI 10.17487/RFC3246, March 2002, | |||
<https://www.rfc-editor.org/info/rfc3246>. | <https://www.rfc-editor.org/info/rfc3246>. | |||
[RFC3649] Floyd, S., "HighSpeed TCP for Large Congestion Windows", | [RFC3649] Floyd, S., "HighSpeed TCP for Large Congestion Windows", | |||
RFC 3649, DOI 10.17487/RFC3649, December 2003, | RFC 3649, DOI 10.17487/RFC3649, December 2003, | |||
<https://www.rfc-editor.org/info/rfc3649>. | <https://www.rfc-editor.org/info/rfc3649>. | |||
skipping to change at page 23, line 16 ¶ | skipping to change at page 27, line 33 ¶ | |||
and G. Judd, "Data Center TCP (DCTCP): TCP Congestion | and G. Judd, "Data Center TCP (DCTCP): TCP Congestion | |||
Control for Data Centers", RFC 8257, DOI 10.17487/RFC8257, | Control for Data Centers", RFC 8257, DOI 10.17487/RFC8257, | |||
October 2017, <https://www.rfc-editor.org/info/rfc8257>. | October 2017, <https://www.rfc-editor.org/info/rfc8257>. | |||
[RFC8290] Hoeiland-Joergensen, T., McKenney, P., Taht, D., Gettys, | [RFC8290] Hoeiland-Joergensen, T., McKenney, P., Taht, D., Gettys, | |||
J., and E. Dumazet, "The Flow Queue CoDel Packet Scheduler | J., and E. Dumazet, "The Flow Queue CoDel Packet Scheduler | |||
and Active Queue Management Algorithm", RFC 8290, | and Active Queue Management Algorithm", RFC 8290, | |||
DOI 10.17487/RFC8290, January 2018, | DOI 10.17487/RFC8290, January 2018, | |||
<https://www.rfc-editor.org/info/rfc8290>. | <https://www.rfc-editor.org/info/rfc8290>. | |||
[RFC8311] Black, D., "Relaxing Restrictions on Explicit Congestion | [RFC8298] Johansson, I. and Z. Sarker, "Self-Clocked Rate Adaptation | |||
Notification (ECN) Experimentation", RFC 8311, | for Multimedia", RFC 8298, DOI 10.17487/RFC8298, December | |||
DOI 10.17487/RFC8311, January 2018, | 2017, <https://www.rfc-editor.org/info/rfc8298>. | |||
<https://www.rfc-editor.org/info/rfc8311>. | ||||
[RFC8312] Rhee, I., Xu, L., Ha, S., Zimmermann, A., Eggert, L., and | [RFC8312] Rhee, I., Xu, L., Ha, S., Zimmermann, A., Eggert, L., and | |||
R. Scheffenegger, "CUBIC for Fast Long-Distance Networks", | R. Scheffenegger, "CUBIC for Fast Long-Distance Networks", | |||
RFC 8312, DOI 10.17487/RFC8312, February 2018, | RFC 8312, DOI 10.17487/RFC8312, February 2018, | |||
<https://www.rfc-editor.org/info/rfc8312>. | <https://www.rfc-editor.org/info/rfc8312>. | |||
[TCP-CA] Jacobson, V. and M. Karels, "Congestion Avoidance and | [SigQ-Dyn] | |||
Control", Laurence Berkeley Labs Technical Report , | Briscoe, B., "Rapid Signalling of Queue Dynamics", | |||
November 1988, <http://ee.lbl.gov/papers/congavoid.pdf>. | Technical Report TR-BB-2017-001 arXiv:1904.07044 [cs.NI], | |||
September 2017, <https://arxiv.org/abs/1904.07044>. | ||||
Appendix A. Example DualQ Coupled PI2 Algorithm | Appendix A. Example DualQ Coupled PI2 Algorithm | |||
As a first concrete example, the pseudocode below gives the DualPI2 | As a first concrete example, the pseudocode below gives the DualPI2 | |||
algorithm. DualPI2 follows the structure of the DualQ Coupled AQM | algorithm. DualPI2 follows the structure of the DualQ Coupled AQM | |||
framework in Figure 1. A simple step threshold (in units of queuing | framework in Figure 1. A simple ramp function (configured in units | |||
time) is used for the Native L4S AQM, but a ramp is also described as | of queuing time) with unsmoothed ECN marking is used for the Native | |||
an alternative. And the PI2 algorithm [PI2] is used for the Classic | L4S AQM. The ramp can also be configured as a step function. The | |||
AQM. PI2 is an improved variant of the PIE AQM [RFC8033]. | PI2 algorithm [PI2] is used for the Classic AQM. PI2 is an improved | |||
variant of the PIE AQM [RFC8033]. | ||||
We will introduce the pseudocode in two passes. The first pass | The pseudocode will be introduced in two passes. The first pass | |||
explains the core concepts, deferring handling of overload to the | explains the core concepts, deferring handling of overload to the | |||
second pass. To aid comparison, line numbers are kept in step | second pass. To aid comparison, line numbers are kept in step | |||
between the two passes by using letter suffixes where the longer code | between the two passes by using letter suffixes where the longer code | |||
needs extra lines. | needs extra lines. | |||
All variables are assumed to be floating point in their basic units | ||||
(size in bytes, time in seconds, rates in bytes/second, alpha and | ||||
beta in Hz, and probabilities from 0 to 1. Constants expressed in k, | ||||
M, G, u, m, %, ... are assumed to be converted to their appropriate | ||||
multiple or fraction. A real implementation that wants to use | ||||
integer values needs to handle appropriate scaling factors and allow | ||||
accordingly appropriate resolution of its integer types (including | ||||
temporary internal values during calculations). | ||||
A full open source implementation for Linux is available at: | A full open source implementation for Linux is available at: | |||
https://github.com/olgabo/dualpi2. | https://github.com/L4STeam/sch_dualpi2_upstream and explained in | |||
[DualPI2Linux]. The specification of the DualQ Coupled AQM for | ||||
DOCSIS cable modems and CMTSs is available in [DOCSIS3.1] and | ||||
explained in [LLD]. | ||||
A.1. Pass #1: Core Concepts | A.1. Pass #1: Core Concepts | |||
The pseudocode manipulates three main structures of variables: the | The pseudocode manipulates three main structures of variables: the | |||
packet (pkt), the L4S queue (lq) and the Classic queue (cq). The | packet (pkt), the L4S queue (lq) and the Classic queue (cq). The | |||
pseudocode consists of the following five functions: | pseudocode consists of the following six functions: | |||
o initialization code (Figure 2) that sets parameter defaults (the | o the initialization function dualpi2_params_init(...) (Figure 2) | |||
API for setting non-default values is omitted for brevity) | that sets parameter defaults (the API for setting non-default | |||
values is omitted for brevity) | ||||
o enqueue code (Figure 3) | o the enqueue function dualpi2_enqueue(lq, cq, pkt) (Figure 3) | |||
o dequeue code (Figure 4) | o the dequeue function dualpi2_dequeue(lq, cq, pkt) (Figure 4) | |||
o a ramp function (Figure 5) used to calculate the ECN-marking | o recur(likelihood) for de-randomized ECN marking (shown at the end | |||
probability for the L4S queue | of Figure 4). | |||
o code to regularly update the base probability (p) used in the | o the L4S AQM function laqm(qdelay) (Figure 5) used to calculate the | |||
dequeue code (Figure 6). | ECN-marking probability for the L4S queue | |||
o the base AQM function that implements the PI algorithm | ||||
dualpi2_update(lq, cq) (Figure 6) used to regularly update the | ||||
base probability (p'), which is squared for the Classic AQM as | ||||
well as being coupled across to the L4S queue. | ||||
It also uses the following functions that are not shown in full here: | It also uses the following functions that are not shown in full here: | |||
o scheduler(), which selects between the head packets of the two | o scheduler(), which selects between the head packets of the two | |||
queues; the choice of scheduler technology is discussed later; | queues; the choice of scheduler technology is discussed later; | |||
o cq.len() or lq.len() returns the current length (aka. backlog) of | o cq.len() or lq.len() returns the current length (aka. backlog) of | |||
the relevant queue in bytes; | the relevant queue in bytes; | |||
o cq.time() or lq.time() returns the current queuing delay (aka. | o cq.time() or lq.time() returns the current queuing delay (aka. | |||
sojourn time or service time) of the relevant queue in units of | sojourn time or service time) of the relevant queue in units of | |||
time; | time (see Note a); | |||
Queuing delay could be measured directly by storing a per-packet | o mark(pkt) and drop(pkt) for ECN-marking and dropping a packet; | |||
time-stamp as each packet is enqueued, and subtracting this from the | ||||
system time when the packet is dequeued. If time-stamping is not | ||||
easy to introduce with certain hardware, queuing delay could be | ||||
predicted indirectly by dividing the size of the queue by the | ||||
predicted departure rate, which might be known precisely for some | ||||
link technologies (see for example [RFC8034]). | ||||
In our experiments so far (building on experiments with PIE) on | In experiments so far (building on experiments with PIE) on broadband | |||
broadband access links ranging from 4 Mb/s to 200 Mb/s with base RTTs | access links ranging from 4 Mb/s to 200 Mb/s with base RTTs from 5 ms | |||
from 5 ms to 100 ms, DualPI2 achieves good results with the default | to 100 ms, DualPI2 achieves good results with the default parameters | |||
parameters in Figure 2. The parameters are categorised by whether | in Figure 2. The parameters are categorised by whether they relate | |||
they relate to the Base PI2 AQM, the L4S AQM or the framework | to the Base PI2 AQM, the L4S AQM or the framework coupling them | |||
coupling them together. Variables derived from these parameters are | together. Constants and variables derived from these parameters are | |||
also included at the end of each category. Each parameter is | also included at the end of each category. Each parameter is | |||
explained as it is encountered in the walk-through of the pseudocode | explained as it is encountered in the walk-through of the pseudocode | |||
below. | below. | |||
1: dualpi2_params_init(...) { % Set input parameter defaults | 1: dualpi2_params_init(...) { % Set input parameter defaults | |||
2: % PI2 AQM parameters | 2: % DualQ Coupled framework parameters | |||
3: target = 15 ms % PI AQM Classic queue delay target | 5: limit = MAX_LINK_RATE * 250 ms % Dual buffer size | |||
4: Tupdate = 16 ms % PI Classic queue sampling interval | 3: k = 2 % Coupling factor | |||
5: alpha = 10 Hz^2 % PI integral gain | 4: % NOT SHOWN % scheduler-dependent weight or equival't parameter | |||
6: beta = 100 Hz^2 % PI proportional gain | 6: | |||
7: p_Cmax = 1/4 % Max Classic drop/mark prob | 7: % PI2 AQM parameters | |||
8: % Constants derived from PI2 AQM parameters | 8: RTT_max = 100 ms % Worst case RTT expected | |||
9: alpha_U = alpha *Tupdate % PI integral gain per update interval | 9: RTT_typ = 15 ms % Typical RTT | |||
10: beta_U = beta * Tupdate % PI prop'nal gain per update interval | 11: % PI2 constants derived from above PI2 parameters | |||
11: | 10: p_Cmax = min(1/k^2, 1) % Max Classic drop/mark prob | |||
12: % DualQ Coupled framework parameters | 12: target = RTT_typ % PI AQM Classic queue delay target | |||
13: k = 2 % Coupling factor | 13: Tupdate = min(RTT_typ, RTT_max/3) % PI sampling interval | |||
14: % scheduler weight or equival't parameter (scheduler-dependent) | 14: alpha = 0.1 * Tupdate / RTT_max^2 % PI integral gain in Hz | |||
15: limit = MAX_LINK_RATE * 250 ms % Dual buffer size | 15: beta = 0.3 / RTT_max % PI proportional gain in Hz | |||
16: | 16: | |||
17: % L4S ramp AQM parameters | 17: % L4S ramp AQM parameters | |||
18: minTh = 475 us % L4S min marking threshold in time units | 18: minTh = 475 us % L4S min marking threshold in time units | |||
19: range = 525 us % Range of L4S ramp in time units | 19: range = 525 us % Range of L4S ramp in time units | |||
20: Th_len = 2 * MTU % Min L4S marking threshold in bytes | 20: Th_len = 2 * MTU % Min L4S marking threshold in bytes | |||
21: % Constants derived from L4S AQM parameters | 21: % L4S constants incl. those derived from other parameters | |||
22: p_Lmax = min(k*sqrt(p_Cmax), 1) % Max L4S marking prob | 22: p_Lmax = 1 % Max L4S marking prob | |||
23: floor = Th_len * 8 / MIN_LINK_RATE % MIN_LINK_RATE is in Mb/s | 23: floor = Th_len / MIN_LINK_RATE | |||
24: if (minTh < floor) { | 24: if (minTh < floor) { | |||
25: % Adjust ramp to exceed serialization time of 2 MTU | 25: % Shift ramp so minTh >= serialization time of 2 MTU | |||
26: range = max(range - (floor-minTh), 1) % 1us avoids /0 error | 26: minTh = floor | |||
27: minTh = floor | 27: } | |||
28: } | 28: maxTh = minTh+range % L4S max marking threshold in time units | |||
29: maxTh = minTh+range % L4S min marking threshold in time units | 29: } | |||
30: } | ||||
Figure 2: Example Header Pseudocode for DualQ Coupled PI2 AQM | Figure 2: Example Header Pseudocode for DualQ Coupled PI2 AQM | |||
For brevity the pseudocode shows some parameters in units of | The overall goal of the code is to maintain the base probability (p', | |||
microseconds (us), but a real implementation would probably use | p-prime as in Section 2.4), which is an internal variable from which | |||
nanoseconds. | the marking and dropping probabilities for L4S and Classic traffic | |||
(p_L and p_C) are derived, with p_L in turn being derived from p_CL. | ||||
The overall goal of the code is to maintain the base probability (p), | The probabilities p_CL and p_C are derived in lines 4 and 5 of the | |||
which is an internal variable from which the marking and dropping | dualpi2_update() function (Figure 6) then used in the | |||
probabilities for L4S and Classic traffic (p_L and p_C) are derived. | dualpi2_dequeue() function where p_L is also derived from p_CL at | |||
The variable named p in the pseudocode and in this walk-through is | line 6 (Figure 4). The code walk-through below builds up to | |||
the same as p' (p-prime) in Section 2.4. The probabilities p_L and | explaining that part of the code eventually, but it starts from | |||
p_C are derived in lines 3, 4 and 5 of the dualpi2_update() function | packet arrival. | |||
(Figure 6) then used in the dualpi2_dequeue() function (Figure 4). | ||||
The code walk-through below builds up to explaining that part of the | ||||
code eventually, but it starts from packet arrival. | ||||
1: dualpi2_enqueue(lq, cq, pkt) { % Test limit and classify lq or cq | 1: dualpi2_enqueue(lq, cq, pkt) { % Test limit and classify lq or cq | |||
2: if ( lq.len() + cq.len() > limit ) | 2: if ( lq.len() + cq.len() + MTU > limit) | |||
3: drop(pkt) % drop packet if buffer is full | 3: drop(pkt) % drop packet if buffer is full | |||
4: else { % Packet classifier | 4: timestamp(pkt) % attach arrival time to packet | |||
5: if ( ecn(pkt) modulo 2 == 1 ) % ECN bits = ECT(1) or CE | 5: % Packet classifier | |||
6: lq.enqueue(pkt) | 6: if ( ecn(pkt) modulo 2 == 1 ) % ECN bits = ECT(1) or CE | |||
7: else % ECN bits = not-ECT or ECT(0) | 7: lq.enqueue(pkt) | |||
8: cq.enqueue(pkt) | 8: else % ECN bits = not-ECT or ECT(0) | |||
9: } | 9: cq.enqueue(pkt) | |||
10: } | 10: } | |||
Figure 3: Example Enqueue Pseudocode for DualQ Coupled PI2 AQM | Figure 3: Example Enqueue Pseudocode for DualQ Coupled PI2 AQM | |||
1: dualpi2_dequeue(lq, cq, pkt) { % Couples L4S & Classic queues | 1: dualpi2_dequeue(lq, cq, pkt) { % Couples L4S & Classic queues | |||
2: while ( lq.len() + cq.len() > 0 ) | 2: while ( lq.len() + cq.len() > 0 ) | |||
3: if ( scheduler() == lq ) { | 3: if ( scheduler() == lq ) { | |||
4: lq.dequeue(pkt) % Scheduler chooses lq | 4: lq.dequeue(pkt) % Scheduler chooses lq | |||
5: p'_L = laqm(lq.time()) % Native L4S AQM | 5: p'_L = laqm(lq.time()) % Native L4S AQM | |||
6: p_L = max(p'_L, p_CL) % Combining function | 6: p_L = max(p'_L, p_CL) % Combining function | |||
7: if ( p_L > rand() ) % Linear marking | 7: if ( recur(p_L) ) % Linear marking | |||
8: mark(pkt) | 8: mark(pkt) | |||
9: } else { | 9: } else { | |||
10: cq.dequeue(pkt) % Scheduler chooses cq | 10: cq.dequeue(pkt) % Scheduler chooses cq | |||
11: if ( p_C > rand() ) { % probability p_C = p^2 | 11: if ( p_C > rand() ) { % probability p_C = p'^2 | |||
12: if ( ecn(pkt) == 0 ) { % if ECN field = not-ECT | 12: if ( ecn(pkt) == 0 ) { % if ECN field = not-ECT | |||
13: drop(pkt) % squared drop | 13: drop(pkt) % squared drop | |||
14: continue % continue to the top of the while loop | 14: continue % continue to the top of the while loop | |||
15: } | 15: } | |||
16: mark(pkt) % squared mark | 16: mark(pkt) % squared mark | |||
17: } | 17: } | |||
18: } | 18: } | |||
19: return(pkt) % return the packet and stop | 19: return(pkt) % return the packet and stop | |||
20: } | 20: } | |||
21: return(NULL) % no packet to dequeue | 21: return(NULL) % no packet to dequeue | |||
22: } | 22: } | |||
23: recur(likelihood) { % Returns TRUE with a certain likelihood | ||||
24: count += likelihood | ||||
25: if (count > 1) { | ||||
26: count -= 1 | ||||
27: return TRUE | ||||
28: } | ||||
29: return FALSE | ||||
30: } | ||||
Figure 4: Example Dequeue Pseudocode for DualQ Coupled PI2 AQM | Figure 4: Example Dequeue Pseudocode for DualQ Coupled PI2 AQM | |||
When packets arrive, first a common queue limit is checked as shown | When packets arrive, first a common queue limit is checked as shown | |||
in line 2 of the enqueuing pseudocode in Figure 3. Note that the | in line 2 of the enqueuing pseudocode in Figure 3. This assumes a | |||
limit is deliberately tested before enqueue to avoid any bias against | shared buffer for the two queues (Note b discusses the merits of | |||
larger packets (so depending whether the implementation stores a | separate buffers). In order to avoid any bias against larger | |||
packet while testing whether to drop it from the tail, it might be | packets, 1 MTU of space is always allowed and the limit is | |||
necessary for the actual buffer memory to be one MTU larger than | deliberately tested before enqueue. | |||
limit). | ||||
Line 2 assumes an implementation where lq and cq share common buffer | If limit is not exceeded, the packet is timestamped in line 4. This | |||
memory. An alternative implementation could use separate buffers for | assumes that queue delay is measured using the sojourn time technique | |||
each queue, in which case the arriving packet would have to be | (see Note a for alternatives). | |||
classified first to determine which buffer to check for available | ||||
space. The choice is a trade off; a shared buffer can use less | ||||
memory whereas separate buffers isolate the L4S queue from tail-drop | ||||
due to large bursts of Classic traffic (e.g. a Classic TCP during | ||||
slow-start over a long RTT). | ||||
Returning to the shared buffer case, if limit is not exceeded, the | At lines 5-9, the packet is classified and enqueued to the Classic or | |||
packet will be classified and enqueued to the Classic or L4S queue | L4S queue dependent on the least significant bit of the ECN field in | |||
dependent on the least significant bit of the ECN field in the IP | the IP header (line 6). Packets with a codepoint having an LSB of 0 | |||
header (line 5). Packets with a codepoint having an LSB of 0 (Not- | (Not-ECT and ECT(0)) will be enqueued in the Classic queue. | |||
ECT and ECT(0)) will be enqueued in the Classic queue. Otherwise, | Otherwise, ECT(1) and CE packets will be enqueued in the L4S queue. | |||
ECT(1) and CE packets will be enqueued in the L4S queue. Optional | Optional additional packet classification flexibility is omitted for | |||
additional packet classification flexibility is omitted for brevity | brevity (see [I-D.ietf-tsvwg-ecn-l4s-id]). | |||
(see [I-D.ietf-tsvwg-ecn-l4s-id]). | ||||
The dequeue pseudocode (Figure 4) is repeatedly called whenever the | The dequeue pseudocode (Figure 4) is repeatedly called whenever the | |||
lower layer is ready to forward a packet. It schedules one packet | lower layer is ready to forward a packet. It schedules one packet | |||
for dequeuing (or zero if the queue is empty) then returns control to | for dequeuing (or zero if the queue is empty) then returns control to | |||
the caller, so that it does not block while that packet is being | the caller, so that it does not block while that packet is being | |||
forwarded. While making this dequeue decision, it also makes the | forwarded. While making this dequeue decision, it also makes the | |||
necessary AQM decisions on dropping or marking. The alternative of | necessary AQM decisions on dropping or marking. The alternative of | |||
applying the AQMs at enqueue would shift some processing from the | applying the AQMs at enqueue would shift some processing from the | |||
critical time when each packet is dequeued. However, it would also | critical time when each packet is dequeued. However, it would also | |||
add a whole queue of delay to the control signals, making the control | add a whole queue of delay to the control signals, making the control | |||
loop very sloppy. | loop sloppier (for a typical RTT it would double the Classic queue's | |||
feedback delay). | ||||
All the dequeue code is contained within a large while loop so that | All the dequeue code is contained within a large while loop so that | |||
if it decides to drop a packet, it will continue until it selects a | if it decides to drop a packet, it will continue until it selects a | |||
packet to schedule. Line 3 of the dequeue pseudocode is where the | packet to schedule. Line 3 of the dequeue pseudocode is where the | |||
scheduler chooses between the L4S queue (lq) and the Classic queue | scheduler chooses between the L4S queue (lq) and the Classic queue | |||
(cq). Detailed implementation of the scheduler is not shown (see | (cq). Detailed implementation of the scheduler is not shown (see | |||
discussion later). | discussion later). | |||
o If an L4S packet is scheduled, lines 7 and 8 ECN-mark the packet | o If an L4S packet is scheduled, in lines 7 and 8 the packet is ECN- | |||
if a random marking decision is drawn according to p_L. Line 6 | marked with likelihood p_L. The recur() function at the end of | |||
calculates p_L as the maximum of the coupled L4S probability p_CL | Figure 4 is used, which is preferred over random marking because | |||
and the probability from the native L4S AQM p'_L. This implements | it avoids delay due to randomization when interpreting congestion | |||
the max() function shown in Figure 1 to couple the outputs of the | signals, but it still desynchronizes the saw-teeth of the flows. | |||
two AQMs together. Of the two probabilities input to p_L in line | Line 6 calculates p_L as the maximum of the coupled L4S | |||
6: | probability p_CL and the probability from the native L4S AQM p'_L. | |||
This implements the max() function shown in Figure 1 to couple the | ||||
outputs of the two AQMs together. Of the two probabilities input | ||||
to p_L in line 6: | ||||
* p'_L is calculated per packet in line 5 by the laqm() function | * p'_L is calculated per packet in line 5 by the laqm() function | |||
(see Figure 5), | (see Figure 5), | |||
* whereas p_CL is maintained by the dualpi2_update() function | * whereas p_CL is maintained by the dualpi2_update() function | |||
which runs every Tupdate (default 16ms) (see Figure 2). | which runs every Tupdate (Tupdate is set in line 13 of | |||
Figure 2. It defaults to 16 ms in the reference Linux | ||||
implementation because it has to be rounded to a multiple of 4 | ||||
ms). | ||||
o If a Classic packet is scheduled, lines 10 to 17 drop or mark the | o If a Classic packet is scheduled, lines 10 to 17 drop or mark the | |||
packet based on the squared probability p_C. | packet with probability p_C. | |||
The Native L4S AQM algorithm (Figure 5) is a ramp function, similar | The Native L4S AQM algorithm (Figure 5) is a ramp function, similar | |||
to the RED algorithm, but simpler due to the following differences: | to the RED algorithm, but simplified as follows: | |||
o The min and max of the ramp are defined in units of queuing delay, | o The extent of the ramp is defined in units of queuing delay, not | |||
not bytes, so that configuration remains invariant as the queue | bytes, so that configuration remains invariant as the queue | |||
departure rate varies. | departure rate varies. | |||
o It uses instantaneous queueing delay to remove smoothing delay | o It uses instantaneous queueing delay, which avoids the complexity | |||
(L4S senders smooth incoming ECN feedback when necessary). | of smoothing, but also avoids embedding a worst-case RTT of | |||
smoothing delay in the network (see Section 2.1). | ||||
o The ramp rises linearly directly from 0 to 1, not to a an | o The ramp rises linearly directly from 0 to 1, not to a an | |||
intermediate value of p'_L as RED would, because there is no need | intermediate value of p'_L as RED would, because there is no need | |||
to keep ECN marking probability low. | to keep ECN marking probability low. | |||
o Marking does not have to be randomized. Determinism is being | o Marking does not have to be randomized. Determinism is used | |||
experimented with instead of randomness; to reduce the delay | instead of randomness; to reduce the delay necessary to smooth out | |||
necessary to smooth out the noise of randomness from the signal. | the noise of randomness from the signal. | |||
In this case, for each packet, the algorithm would accumulate p_L | ||||
in a counter and mark the packet that took the counter over 1, | ||||
then subtract 1 from the counter and continue. | ||||
This ramp function requires two configuration parameters, the minimum | The ramp function requires two configuration parameters, the minimum | |||
threshold (minTh) and the width of the ramp (range), both in units of | threshold (minTh) and the width of the ramp (range), both in units of | |||
queuing time), as shown in the parameter initialization code in | queuing time), as shown in lines 18 & 19 of the initialization | |||
Figure 2. A minimum marking threshold parameter (Th_len) in | function in Figure 2. The ramp function can be configured as a step | |||
transmission units (default 2 MTU) is also necessary to ensure that | (see Note c). | |||
the ramp does not trigger excessive marking on slow links. The code | ||||
in lines 23-28 of Figure 2 converts 2 MTU into time units and adjusts | ||||
the ramp thresholds to be no shallower than this floor. | ||||
An operator can effectively turn the ramp into a step function, as | Although the DCTCP paper [Alizadeh-stability] recommends an ECN | |||
used by DCTCP, by setting the range to its minimum value (e.g. 1 ns). | marking threshold of 0.17*RTT_typ, it also shows that the threshold | |||
Then the condition for the ramp calculation will hardly ever arise. | can be much shallower with hardly any worse under-utilization of the | |||
There is some concern that using the step function of DCTCP for the | link (because the amplitude of DCTCP's sawteeth is so small). Based | |||
Native L4S AQM requires end-systems to smooth the signal for an | on extensive experiments, for the public Internet a default minimum | |||
unnecessarily large number of round trips to ensure sufficient | ECN marking threshold of about RTT_typ/30 is recommended. | |||
fidelity. A ramp seems to be no worse than a step in initial | ||||
experiments with existing DCTCP. Therefore, it is recommended that a | A minimum marking threshold parameter (Th_len) in transmission units | |||
ramp is configured in place of a step, which will allow congestion | (default 2 MTU) is also necessary to ensure that the ramp does not | |||
control algorithms to investigate faster smoothing algorithms. | trigger excessive marking on slow links. The code in lines 24-27 of | |||
the initialization function (Figure 2) converts 2 MTU into time units | ||||
and shifts the ramp so that the min threshold is no shallower than | ||||
this floor. | ||||
1: laqm(qdelay) { % Returns native L4S AQM probability | 1: laqm(qdelay) { % Returns native L4S AQM probability | |||
2: if (qdelay >= maxTh) | 2: if (qdelay >= maxTh) | |||
3: return 1 | 3: return 1 | |||
4: else if (qdelay > minTh) | 4: else if (qdelay > minTh) | |||
5: return (qdelay - minTh)/range % Divide would use a bit-shift | 5: return (qdelay - minTh)/range % Divide could use a bit-shift | |||
6: else | 6: else | |||
7: return 0 | 7: return 0 | |||
8: } | 8: } | |||
Figure 5: Example Pseudocode for the Native L4S AQM | Figure 5: Example Pseudocode for the Native L4S AQM | |||
1: dualpi2_update(lq, cq, target) { % Update p every Tupdate | 1: dualpi2_update(lq, cq) { % Update p' every Tupdate | |||
2: curq = cq.time() % use queuing time of first-in Classic packet | 2: curq = cq.time() % use queuing time of first-in Classic packet | |||
3: p = p + alpha_U * (curq - target) + beta_U * (curq - prevq) | 3: p' = p' + alpha * (curq - target) + beta * (curq - prevq) | |||
4: p_CL = p * k % Coupled L4S prob = base prob * coupling factor | 4: p_CL = k * p' % Coupled L4S prob = base prob * coupling factor | |||
5: p_C = p^2 % Classic prob = (base prob)^2 | 5: p_C = p'^2 % Classic prob = (base prob)^2 | |||
6: prevq = curq | 6: prevq = curq | |||
7: } | 7: } | |||
Figure 6: Example PI-Update Pseudocode for DualQ Coupled PI2 AQM | Figure 6: Example PI-Update Pseudocode for DualQ Coupled PI2 AQM | |||
p_CL depends on the base probability (p), which is kept up to date by | The coupled marking probability, p_CL depends on the base probability | |||
the core PI algorithm in Figure 6 executed every Tupdate. | (p'), which is kept up to date by the core PI algorithm in Figure 6 | |||
executed every Tupdate. | ||||
Note that p solely depends on the queuing time in the Classic queue. | Note that p' solely depends on the queuing time in the Classic queue. | |||
In line 2, the current queuing delay (curq) is evaluated from how | In line 2, the current queuing delay (curq) is evaluated from how | |||
long the head packet was in the Classic queue (cq). The function | long the head packet was in the Classic queue (cq). The function | |||
cq.time() (not shown) subtracts the time stamped at enqueue from the | cq.time() (not shown) subtracts the time stamped at enqueue from the | |||
current time and implicitly takes the current queuing delay as 0 if | current time (see Note a) and implicitly takes the current queuing | |||
the queue is empty. | delay as 0 if the queue is empty. | |||
The algorithm centres on line 3, which is a classical Proportional- | The algorithm centres on line 3, which is a classical Proportional- | |||
Integral (PI) controller that alters p dependent on: a) the error | Integral (PI) controller that alters p' dependent on: a) the error | |||
between the current queuing delay (curq) and the target queuing delay | between the current queuing delay (curq) and the target queuing delay | |||
('target' - see [RFC8033]); and b) the change in queuing delay since | ('target' - see [RFC8033]); and b) the change in queuing delay since | |||
the last sample. The name 'PI' represents the fact that the second | the last sample. The name 'PI' represents the fact that the second | |||
factor (how fast the queue is growing) is _P_roportional to load | factor (how fast the queue is growing) is _P_roportional to load | |||
while the first is the _I_ntegral of the load (so it removes any | while the first is the _I_ntegral of the load (so it removes any | |||
standing queue in excess of the target). | standing queue in excess of the target). | |||
The two 'gain factors' in line 3, alpha_U and beta_U, respectively | The two 'gain factors' in line 3, alpha and beta, respectively weight | |||
weight how strongly each of these elements ((a) and (b)) alters p. | how strongly each of these elements ((a) and (b)) alters p'. They | |||
They are in units of 'per second of delay' or Hz, because they | are in units of 'per second of delay' or Hz, because they transform | |||
transform differences in queueing delay into changes in probability. | differences in queueing delay into changes in probability (assuming | |||
probability has a value from 0 to 1). | ||||
alpha_U and beta_U are derived from the input parameters alpha and | alpha and beta determine how much p' ought to change after each | |||
beta (see lines 5 and 6 of Figure 2). These recommended values of | update interval (Tupdate). For smaller Tupdate, p' should change by | |||
alpha and beta come from the stability analysis in [PI2] so that the | the same amount per second, but in finer more frequent steps. So | |||
AQM can change p as fast as possible in response to changes in load | alpha depends on Tupdate (see line 14 of the initialization function | |||
without over-compensating and therefore causing oscillations in the | in Figure 2). It is best to update p' as frequently as possible, but | |||
queue. | Tupdate will probably be constrained by hardware performance. As | |||
shown in line 13, the update interval should be at least as frequent | ||||
as once per the RTT of a typical flow (RTT_typ) as long as it does | ||||
not exceed roughly RTT_max/3. For link rates from 4 - 200 Mb/s, a | ||||
target RTT of 15ms and a maximum RTT of 100ms, it has been verified | ||||
through extensive testing that Tupdate=16ms (as recommended in | ||||
[RFC8033]) is sufficient. | ||||
alpha and beta determine how much p ought to change if it was updated | The choice of alpha and beta also determines the AQM's stable | |||
every second. It is best to update p as frequently as possible, but | operating range. The AQM ought to change p' as fast as possible in | |||
the update interval (Tupdate) will probably be constrained by | response to changes in load without over-compensating and therefore | |||
hardware performance. For link rates from 4 - 200 Mb/s, we found | causing oscillations in the queue. Therefore, the values of alpha | |||
Tupdate=16ms (as recommended in [RFC8033]) is sufficient. However | and beta also depend on the RTT of the expected worst-case flow | |||
small the chosen value of Tupdate, p should change by the same amount | (RTT_max). | |||
per second, but in finer more frequent steps. So the gain factors | ||||
used for updating p in Figure 6 need to be scaled by (Tupdate/1s), | ||||
which is done in lines 9 and 10 of Figure 2). The suffix '_U' | ||||
represents 'per update time' (Tupdate). | ||||
In corner cases, p can overflow the range [0,1] so the resulting | Recommended derivations of the gain constants alpha and beta can be | |||
value of p has to be bounded (omitted from the pseudocode). Then, as | approximated for Reno over a PI2 AQM as: alpha = 0.1 * Tupdate / | |||
already explained, the coupled and Classic probabilities are derived | RTT_max^2; beta = 0.3 / RTT_max, as shown in lines 14 & 15 of | |||
from the new p in lines 4 and 5 as p_CL = k*p and p_C = p^2. | Figure 2. These are derived from the stability analysis in [PI2]. | |||
For the default values of Tupdate=16 ms and RTT_max = 100 ms, they | ||||
result in alpha = 0.16; beta = 3.2 (discrepancies are due to | ||||
rounding). These defaults have been verified with a wide range of | ||||
link rates, target delays and a range of traffic models with mixed | ||||
and similar RTTs, short and long flows, etc. | ||||
In corner cases, p' can overflow the range [0,1] so the resulting | ||||
value of p' has to be bounded (omitted from the pseudocode). Then, | ||||
as already explained, the coupled and Classic probabilities are | ||||
derived from the new p' in lines 4 and 5 of Figure 6 as p_CL = k*p' | ||||
and p_C = p'^2. | ||||
Because the coupled L4S marking probability (p_CL) is factored up by | Because the coupled L4S marking probability (p_CL) is factored up by | |||
k, the dynamic gain parameters alpha and beta are also inherently | k, the dynamic gain parameters alpha and beta are also inherently | |||
factored up by k for the L4S queue, which is necessary to ensure that | factored up by k for the L4S queue. So, the effective gain factor | |||
Classic TCP and DCTCP controls have the same stability. So, if alpha | for the L4S queue is k*alpha (with defaults alpha = 0.16 Hz and k=2, | |||
is 10 Hz^2, the effective gain factor for the L4S queue is k*alpha, | effective L4S alpha = 0.32 Hz). | |||
which is 20 Hz^2 with the default coupling factor of k=2. | ||||
Unlike in PIE [RFC8033], alpha_U and beta_U do not need to be tuned | Unlike in PIE [RFC8033], alpha and beta do not need to be tuned every | |||
every Tupdate dependent on p. Instead, in PI2, alpha_U and beta_U | Tupdate dependent on p'. Instead, in PI2, alpha and beta are | |||
are independent of p because the squaring applied to Classic traffic | independent of p' because the squaring applied to Classic traffic | |||
tunes them inherently. This is explained in [PI2], which also | tunes them inherently. This is explained in [PI2], which also | |||
explains why this more principled approach removes the need for most | explains why this more principled approach removes the need for most | |||
of the heuristics that had to be added to PIE. | of the heuristics that had to be added to PIE. | |||
{ToDo: Scaling beta with Tupdate and scaling both alpha & beta with | Notes: | |||
RTT} | ||||
a. The drain rate of the queue can vary if it is scheduled relative | ||||
to other queues, or to cater for fluctuations in a wireless | ||||
medium. To auto-adjust to changes in drain rate, the queue must | ||||
be measured in time, not bytes or packets [AQMmetrics] [CoDel]. | ||||
Queuing delay could be measured directly by storing a per-packet | ||||
time-stamp as each packet is enqueued, and subtracting this from | ||||
the system time when the packet is dequeued. If time-stamping is | ||||
not easy to introduce with certain hardware, queuing delay could | ||||
be predicted indirectly by dividing the size of the queue by the | ||||
predicted departure rate, which might be known precisely for some | ||||
link technologies (see for example [RFC8034]). | ||||
b. Line 2 of the dualpi2_enqueue() function (Figure 3) assumes an | ||||
implementation where lq and cq share common buffer memory. An | ||||
alternative implementation could use separate buffers for each | ||||
queue, in which case the arriving packet would have to be | ||||
classified first to determine which buffer to check for available | ||||
space. The choice is a trade off; a shared buffer can use less | ||||
memory whereas separate buffers isolate the L4S queue from tail- | ||||
drop due to large bursts of Classic traffic (e.g. a Classic TCP | ||||
during slow-start over a long RTT). | ||||
c. There has been some concern that using the step function of DCTCP | ||||
for the Native L4S AQM requires end-systems to smooth the signal | ||||
for an unnecessarily large number of round trips to ensure | ||||
sufficient fidelity. A ramp is no worse than a step in initial | ||||
experiments with existing DCTCP. Therefore, it is recommended | ||||
that a ramp is configured in place of a step, which will allow | ||||
congestion control algorithms to investigate faster smoothing | ||||
algorithms. | ||||
A ramp is more general that a step, because an operator can | ||||
effectively turn the ramp into a step function, as used by DCTCP, | ||||
by setting the range to zero. There will not be a divide by zero | ||||
problem at line 4 of Figure 5 because, if minTh is equal to | ||||
maxTh, the condition for this ramp calculation cannot arise. | ||||
A.2. Pass #2: Overload Details | A.2. Pass #2: Overload Details | |||
Figure 7 repeats the dequeue function of Figure 4, but with overload | Figure 7 repeats the dequeue function of Figure 4, but with overload | |||
details added. Similarly Figure 8 repeats the core PI algorithm of | details added. Similarly Figure 8 repeats the core PI algorithm of | |||
Figure 6 with overload details added. The initialization, enqueue | Figure 6 with overload details added. The initialization, enqueue, | |||
and L4S AQM functions are unchanged. | L4S AQM and recur functions are unchanged. | |||
In line 7 of the initialization function (Figure 2), the default | In line 10 of the initialization function (Figure 2), the maximum | |||
maximum Classic drop probability p_Cmax = 1/4 or 25%. This is the | Classic drop probability p_Cmax = min(1/k^2, 1) or 1/4 for the | |||
point at which it is deemed that the Classic queue has become | default coupling factor k=2. p_Cmax is the point at which it is | |||
persistently overloaded, so it switches to using solely drop, even | deemed that the Classic queue has become persistently overloaded, so | |||
for ECN-capable packets. This protects the queue against any | it switches to using drop, even for ECN-capable packets. ECT packets | |||
unresponsive traffic that falsely claims that it is responsive to ECN | that are not dropped can still be ECN-marked. | |||
marking, as required by [RFC3168] and [RFC7567]. | ||||
Line 22 of the initialization function translates this into a maximum | In practice, 25% has been found to be a good threshold to preserve | |||
L4S marking probability (p_Lmax) by rearranging Equation (1). With a | fairness between ECN capable and non ECN capable traffic. This | |||
coupling factor of k=2 (the default) or greater, this translates to a | protects the queues against both temporary overload from responsive | |||
maximum L4S marking probability of 1 (or 100%). This is intended to | flows and more persistent overload from any unresponsive traffic that | |||
ensure that the L4S queue starts to introduce dropping once marking | falsely claims to be responsive to ECN. | |||
saturates and can rise no further. The 'TCP Prague' requirements | ||||
When the Classic ECN marking probability reaches the p_Cmax threshold | ||||
(1/k^2), the marking probability coupled to the L4S queue, p_CL will | ||||
always be 100% for any k (by equation (1) in Section 2). So, for | ||||
readability, the constant p_Lmax is defined as 1 in line 22 of the | ||||
initialization function Figure 2. This is intended to ensure that | ||||
the L4S queue starts to introduce dropping once ECN-marking saturates | ||||
at 100% and can rise no further. The 'Prague L4S' requirements | ||||
[I-D.ietf-tsvwg-ecn-l4s-id] state that, when an L4S congestion | [I-D.ietf-tsvwg-ecn-l4s-id] state that, when an L4S congestion | |||
control detects a drop, it falls back to a response that coexists | control detects a drop, it falls back to a response that coexists | |||
with 'Classic' TCP. So it is correct that the L4S queue drops | with 'Classic' TCP. So it is correct that, when the L4S queue drops | |||
packets proportional to p^2, as if they are Classic packets. | packets, it drops them proportional to p'^2, as if they are Classic | |||
packets. | ||||
Both these switch-overs are triggered by the tests for overload | Both these switch-overs are triggered by the tests for overload | |||
introduced in lines 4b and 12b of the dequeue function (Figure 7). | introduced in lines 4b and 12b of the dequeue function (Figure 7). | |||
Lines 8c to 8g drop L4S packets with probability p^2. Lines 8h to 8i | Lines 8c to 8g drop L4S packets with probability p'^2. Lines 8h to | |||
mark the remaining packets with probability p_CL. If p_Lmax = 1, | 8i mark the remaining packets with probability p_CL. Given p_Lmax = | |||
which is the suggested default configuration, all remaining packets | 1, all remaining packets will be marked because, to have reached the | |||
will be marked because, to have reached the else block at line 8b, | else block at line 8b, p_CL >= 1. | |||
p_CL >= 1. | ||||
Lines 2c to 2d in the core PI algorithm (Figure 8) deal with overload | Lines 2c to 2d in the core PI algorithm (Figure 8) deal with overload | |||
of the L4S queue when there is no Classic traffic. This is | of the L4S queue when there is no Classic traffic. This is | |||
necessary, because the core PI algorithm maintains the appropriate | necessary, because the core PI algorithm maintains the appropriate | |||
drop probability to regulate overload, but it depends on the length | drop probability to regulate overload, but it depends on the length | |||
of the Classic queue. If there is no Classic queue the naive | of the Classic queue. If there is no Classic queue the naive PI | |||
algorithm in Figure 6 drops nothing, even if the L4S queue is | update function in Figure 6 would drop nothing, even if the L4S queue | |||
overloaded - so tail drop would have to take over (lines 3 and 4 of | were overloaded - so tail drop would have to take over (lines 2 and 3 | |||
Figure 3). | of Figure 3). | |||
If the test at line 2a finds that the Classic queue is empty, line 2d | Instead, the test at line 2a of the full PI update function in | |||
measures the current queue delay using the L4S queue instead. While | Figure 8 keeps delay on target using drop. If the test at line 2a of | |||
the L4S queue is not overloaded, its delay will always be tiny | finds that the Classic queue is empty, line 2d measures the current | |||
compared to the target Classic queue delay. So p_L will be driven to | queue delay using the L4S queue instead. While the L4S queue is not | |||
zero, and the L4S queue will naturally be governed solely by | overloaded, its delay will always be tiny compared to the target | |||
threshold marking (lines 5 and 6 of the dequeue algorithm in | Classic queue delay. So p_CL will be driven to zero, and the L4S | |||
Figure 7). But, if unresponsive L4S source(s) cause overload, the | queue will naturally be governed solely by p'_L from the native L4S | |||
DualQ transitions smoothly to L4S marking based on the PI algorithm. | AQM (lines 5 and 6 of the dequeue algorithm in Figure 7). But, if | |||
And as overload increases, it naturally transitions from marking to | unresponsive L4S source(s) cause overload, the DualQ transitions | |||
dropping by the switch-over mechanism already described. | smoothly to L4S marking based on the PI algorithm. If overload | |||
increases further, it naturally transitions from marking to dropping | ||||
by the switch-over mechanism already described. | ||||
1: dualpi2_dequeue(lq, cq) { % Couples L4S & Classic queues, lq & cq | 1: dualpi2_dequeue(lq, cq, pkt) { % Couples L4S & Classic queues | |||
2: while ( lq.len() + cq.len() > 0 ) | 2: while ( lq.len() + cq.len() > 0 ) { | |||
3: if ( scheduler() == lq ) { | 3: if ( scheduler() == lq ) { | |||
4a: lq.dequeue(pkt) | 4a: lq.dequeue(pkt) % L4S scheduled | |||
4b: if ( p_CL < p_Lmax ) { % Check for overload saturation | 4b: if ( p_CL < p_Lmax ) { % Check for overload saturation | |||
5: p'_L = laqm(lq.time()) % Native L4S AQM | 5: p'_L = laqm(lq.time()) % Native L4S AQM | |||
6: p_L = max(p'_L, p_CL) % Combining function | 6: p_L = max(p'_L, p_CL) % Combining function | |||
7: if ( p_L > rand() ) % Linear marking | 7: if ( recur(p_L) ) % Linear marking | |||
8a: mark(pkt) | 8a: mark(pkt) | |||
8b: } else { % overload saturation | 8b: } else { % overload saturation | |||
8c: if ( p_C > rand() ) { % probability p_C = p^2 | 8c: if ( p_C > rand() ) { % probability p_C = p'^2 | |||
8e: drop(pkt) % revert to Classic drop due to overload | 8e: drop(pkt) % revert to Classic drop due to overload | |||
8f: continue % continue to the top of the while loop | 8f: continue % continue to the top of the while loop | |||
8g: } | 8g: } | |||
8h: if ( p_CL > rand() ) % probability p_CL = k * p | 8h: if ( p_CL > rand() ) % probability p_CL = k * p' | |||
8i: mark(pkt) % linear marking of remaining packets | 8i: mark(pkt) % linear marking of remaining packets | |||
8j: } | 8j: } | |||
9: } else { | 9: } else { | |||
10: cq.dequeue(pkt) | 10: cq.dequeue(pkt) % Classic scheduled | |||
11: if ( p_C > rand() ) { % probability p_C = p^2 | 11: if ( p_C > rand() ) { % probability p_C = p'^2 | |||
12a: if ( (ecn(pkt) == 0) % ECN field = not-ECT | 12a: if ( (ecn(pkt) == 0) % ECN field = not-ECT | |||
12b: OR (p_C >= p_Cmax) ) { % Overload disables ECN | 12b: OR (p_C >= p_Cmax) ) { % Overload disables ECN | |||
13: drop(pkt) % squared drop, redo loop | 13: drop(pkt) % squared drop, redo loop | |||
14: continue % continue to the top of the while loop | 14: continue % continue to the top of the while loop | |||
15: } | 15: } | |||
16: mark(pkt) % squared mark | 16: mark(pkt) % squared mark | |||
17: } | 17: } | |||
18: } | 18: } | |||
19: return(pkt) % return the packet and stop | 19: return(pkt) % return the packet and stop | |||
20: } | 20: } | |||
21: return(NULL) % no packet to dequeue | 21: return(NULL) % no packet to dequeue | |||
22: } | 22: } | |||
Figure 7: Example Dequeue Pseudocode for DualQ Coupled PI2 AQM | Figure 7: Example Dequeue Pseudocode for DualQ Coupled PI2 AQM | |||
(Including Integer Arithmetic and Overload Code) | (Including Overload Code) | |||
1: dualpi2_update(lq, cq, target) { % Update p every Tupdate | 1: dualpi2_update(lq, cq) { % Update p' every Tupdate | |||
2a: if ( cq.len() > 0 ) | 2a: if ( cq.len() > 0 ) | |||
2b: curq = cq.time() %use queuing time of first-in Classic packet | 2b: curq = cq.time() %use queuing time of first-in Classic packet | |||
2c: else % Classic queue empty | 2c: else % Classic queue empty | |||
2d: curq = lq.time() % use queuing time of first-in L4S packet | 2d: curq = lq.time() % use queuing time of first-in L4S packet | |||
3: p = p + alpha_U * (curq - target) + beta_U * (curq - prevq) | 3: p' = p' + alpha * (curq - target) + beta * (curq - prevq) | |||
4: p_CL = p * k % Coupled L4S prob = base prob * coupling factor | 4: p_CL = p' * k % Coupled L4S prob = base prob * coupling factor | |||
5: p_C = p^2 % Classic prob = (base prob)^2 | 5: p_C = p'^2 % Classic prob = (base prob)^2 | |||
6: prevq = curq | 6: prevq = curq | |||
7: } | 7: } | |||
Figure 8: Example PI-Update Pseudocode for DualQ Coupled PI2 AQM | Figure 8: Example PI-Update Pseudocode for DualQ Coupled PI2 AQM | |||
(Including Overload Code) | (Including Overload Code) | |||
The choice of scheduler technology is critical to overload protection | The choice of scheduler technology is critical to overload protection | |||
(see Section 4.1). | (see Section 4.1). | |||
o A well-understood weighted scheduler such as weighted round robin | o A well-understood weighted scheduler such as weighted round robin | |||
(WRR) is recommended. The scheduler weight for Classic should be | (WRR) is recommended. As long as the scheduler weight for Classic | |||
low, e.g. 1/16. | is small (e.g. 1/16), its exact value is unimportant because it | |||
does not normally determine capacity shares. The weight is only | ||||
important to prevent unresponsive L4S traffic starving Classic | ||||
traffic. This is because capacity sharing between the queues is | ||||
normally determined by the coupled congestion signal, which | ||||
overrides the scheduler, by making L4S sources leave roughly equal | ||||
per-flow capacity available for Classic flows. | ||||
o Alternatively, a time-shifted FIFO could be used. This is a very | o Alternatively, a time-shifted FIFO (TS-FIFO) could be used. It | |||
simple scheduler, but it does not fully isolate latency in the L4S | works by selecting the head packet that has waited the longest, | |||
queue from uncontrolled bursts in the Classic queue. It works by | biased against the Classic traffic by a time-shift of tshift. To | |||
selecting the head packet that has waited the longest, biased | implement time-shifted FIFO, the scheduler() function in line 3 of | |||
against the Classic traffic by a time-shift of tshift. To | the dequeue code would simply be implemented as the scheduler() | |||
implement time-shifted FIFO, the "if (scheduler() == lq )" test in | function at the bottom of Figure 10 in Appendix B. For the public | |||
line 3 of the dequeue code would simply be replaced by "if ( | Internet a good value for tshift is 50ms. For private networks | |||
lq.time() + tshift >= cq.time() )". For the public Internet a | with smaller diameter, about 4*target would be reasonable. TS- | |||
good value for tshift is 50ms. For private networks with smaller | FIFO is a very simple scheduler, but complexity might need to be | |||
diameter, about 4*target would be reasonable. | added to address some deficiencies (which is why it is not | |||
recommended over WRR): | ||||
* TS-FIFO does not fully isolate latency in the L4S queue from | ||||
uncontrolled bursts in the Classic queue; | ||||
* TS-FIFO is only appropriate if time-stamping of packets is | ||||
feasible; | ||||
* Even if time-stamping is supported, the sojourn time of the | ||||
head packet is always stale. For instance, if a burst arrives | ||||
at an empty queue, the sojourn time will only measure the delay | ||||
of the burst once the burst is over, even though the queue knew | ||||
about it from the start. At the cost of more operations and | ||||
more storage, a 'scaled sojourn time' metric of queue delay can | ||||
be used, which is the sojourn time of a packet scaled by the | ||||
ratio of the queue sizes when the packet departed and arrived | ||||
[SigQ-Dyn]. | ||||
o A strict priority scheduler would be inappropriate, because it | o A strict priority scheduler would be inappropriate, because it | |||
would starve Classic if L4S was overloaded. | would starve Classic if L4S was overloaded. | |||
Appendix B. Example DualQ Coupled Curvy RED Algorithm | Appendix B. Example DualQ Coupled Curvy RED Algorithm | |||
As another example of a DualQ Coupled AQM algorithm, the pseudocode | As another example of a DualQ Coupled AQM algorithm, the pseudocode | |||
below gives the Curvy RED based algorithm we used and tested. | below gives the Curvy RED based algorithm. Although the AQM was | |||
Although we designed the AQM to be efficient in integer arithmetic, | designed to be efficient in integer arithmetic, to aid understanding | |||
to aid understanding it is first given using real-number arithmetic. | it is first given using floating point arithmetic (Figure 10). Then, | |||
Then, one possible optimization for integer arithmetic is given, also | one possible optimization for integer arithmetic is given, also in | |||
in pseudocode. To aid comparison, the line numbers are kept in step | pseudocode (Figure 11). To aid comparison, the line numbers are kept | |||
between the two by using letter suffixes where the longer code needs | in step between the two by using letter suffixes where the longer | |||
extra lines. | code needs extra lines. | |||
1: dualq_dequeue(lq, cq) { % Couples L4S & Classic queues, lq & cq | B.1. Curvy RED in Pseudocode | |||
2: if ( lq.dequeue(pkt) ) { | ||||
3a: p_L = cq.sec() / 2^S_L | ||||
3b: if ( lq.byt() > T ) | ||||
3c: mark(pkt) | ||||
3d: elif ( p_L > maxrand(U) ) | ||||
4: mark(pkt) | ||||
5: return(pkt) % return the packet and stop here | ||||
6: } | ||||
7: while ( cq.dequeue(pkt) ) { | ||||
8a: alpha = 2^(-f_C) | ||||
8b: Q_C = alpha * pkt.sec() + (1-alpha)* Q_C % Classic Q EWMA | ||||
9a: sqrt_p_C = Q_C / 2^S_C | ||||
9b: if ( sqrt_p_C > maxrand(2*U) ) | ||||
10: drop(pkt) % Squared drop, redo loop | ||||
11: else | ||||
12: return(pkt) % return the packet and stop here | ||||
13: } | ||||
14: return(NULL) % no packet to dequeue | ||||
15: } | ||||
16: maxrand(u) { % return the max of u random numbers | The pseudocode manipulates three main structures of variables: the | |||
17: maxr=0 | packet (pkt), the L4S queue (lq) and the Classic queue (cq) and | |||
18: while (u-- > 0) | consists of the following five functions: | |||
19: maxr = max(maxr, rand()) % 0 <= rand() < 1 | ||||
20: return(maxr) | o the initialization function cred_params_init(...) (Figure 2) that | |||
sets parameter defaults (the API for setting non-default values is | ||||
omitted for brevity); | ||||
o the dequeue function cred_dequeue(lq, cq, pkt) (Figure 4); | ||||
o the scheduling function scheduler(), which selects between the | ||||
head packets of the two queues. | ||||
It also uses the following functions that are either shown elsewhere, | ||||
or not shown in full here: | ||||
o the enqueue function, which is identical to that used for DualPI2, | ||||
dualpi2_enqueue(lq, cq, pkt) in Figure 3; | ||||
o mark(pkt) and drop(pkt) for ECN-marking and dropping a packet; | ||||
o cq.len() or lq.len() returns the current length (aka. backlog) of | ||||
the relevant queue in bytes; | ||||
o cq.time() or lq.time() returns the current queuing delay (aka. | ||||
sojourn time or service time) of the relevant queue in units of | ||||
time (see Note a in Appendix A.1). | ||||
Because Curvy RED was evaluated before DualPI2, certain improvements | ||||
introduced for DualPI2 were not evaluated for Curvy RED. In the | ||||
pseudocode below, the straightforward improvements have been added on | ||||
the assumption they will provide similar benefits, but that has not | ||||
been proven experimentally. They are: i) a conditional priority | ||||
scheduler instead of strict priority ii) a time-based threshold for | ||||
the native L4S AQM; iii) ECN support for the Classic AQM. A recent | ||||
evaluation has proved that a minimum ECN-marking threshold (minTh) | ||||
greatly improves performance, so this is also included in the | ||||
pseudocode. | ||||
Overload protection has not been added to the Curvy RED pseudocode | ||||
below so as not to detract from the main features. It would be added | ||||
in exactly the same way as in Appendix A.2 for the DualPI2 | ||||
pseudocode. The native L4S AQM uses a step threshold, but a ramp | ||||
like that described for DualPI2 could be used instead. The scheduler | ||||
uses the simple TS-FIFO algorithm, but it could be replaced with WRR. | ||||
The Curvy RED algorithm has not been maintained or evaluated to the | ||||
same degree as the DualPI2 algorithm. In initial experiments on | ||||
broadband access links ranging from 4 Mb/s to 200 Mb/s with base RTTs | ||||
from 5 ms to 100 ms, Curvy RED achieved good results with the default | ||||
parameters in Figure 9. | ||||
The parameters are categorised by whether they relate to the Classic | ||||
AQM, the L4S AQM or the framework coupling them together. Constants | ||||
and variables derived from these parameters are also included at the | ||||
end of each category. These are the raw input parameters for the | ||||
algorithm. A configuration front-end could accept more meaningful | ||||
parameters (e.g. RTT_max and RTT_typ) and convert them into these | ||||
raw parameters, as has been done for DualPI2 in Appendix A. Where | ||||
necessary, parameters are explained further in the walk-through of | ||||
the pseudocode below. | ||||
1: cred_params_init(...) { % Set input parameter defaults | ||||
2: % DualQ Coupled framework parameters | ||||
3: limit = MAX_LINK_RATE * 250 ms % Dual buffer size | ||||
4: k' = 1 % Coupling factor as a power of 2 | ||||
5: tshift = 50 ms % Time shift of TS-FIFO scheduler | ||||
6: % Constants derived from Classic AQM parameters | ||||
7: k = 2^k' % Coupling factor from Equation (1) | ||||
6: | ||||
7: % Classic AQM parameters | ||||
8: g_C = 5 % EWMA smoothing parameter as a power of 1/2 | ||||
9: S_C = -1 % Classic ramp scaling factor as a power of 2 | ||||
10: minTh = 500 ms % No Classic drop/mark below this queue delay | ||||
11: % Constants derived from Classic AQM parameters | ||||
12: gamma = 2^(-g_C) % EWMA smoothing parameter | ||||
13: range_C = 2^S_C % Range of Classic ramp | ||||
14: | ||||
15: % L4S AQM parameters | ||||
16: T = 1 ms % Queue delay threshold for native L4S AQM | ||||
17: % Constants derived from above parameters | ||||
18: S_L = S_C - k' % L4S ramp scaling factor as a power of 2 | ||||
19: range_L = 2^S_L % Range of L4S ramp | ||||
20: } | ||||
Figure 9: Example Header Pseudocode for DualQ Coupled Curvy RED AQM | ||||
1: cred_dequeue(lq, cq, pkt) { % Couples L4S & Classic queues | ||||
2: while ( lq.len() + cq.len() > 0 ) { | ||||
3: if ( scheduler() == lq ) { | ||||
4: lq.dequeue(pkt) % L4S scheduled | ||||
5a: p_CL = (cq.time() - minTh) / range_L | ||||
5b: if ( ( lq.time() > T ) | ||||
5c: OR ( p_CL > maxrand(U) ) ) | ||||
6: mark(pkt) | ||||
7: } else { | ||||
8: cq.dequeue(pkt) % Classic scheduled | ||||
9a: Q_C = gamma * qc.time() + (1-gamma) * Q_C % Classic Q EWMA | ||||
10a: sqrt_p_C = (Q_C - minTh) / range_C | ||||
10b: if ( sqrt_p_C > maxrand(2*U) ) { | ||||
11: if ( (ecn(pkt) == 0) { % ECN field = not-ECT | ||||
12: drop(pkt) % Squared drop, redo loop | ||||
13: continue % continue to the top of the while loop | ||||
14: } | ||||
15: mark(pkt) | ||||
16: } | ||||
17: } | ||||
18: return(pkt) % return the packet and stop here | ||||
19: } | ||||
20: return(NULL) % no packet to dequeue | ||||
21: } | 21: } | |||
Figure 9: Example Dequeue Pseudocode for DualQ Coupled Curvy RED AQM | 22: maxrand(u) { % return the max of u random numbers | |||
23: maxr=0 | ||||
24: while (u-- > 0) | ||||
25: maxr = max(maxr, rand()) % 0 <= rand() < 1 | ||||
26: return(maxr) | ||||
27: } | ||||
Packet classification code is not shown, as it is no different from | 28: scheduler() { | |||
Figure 3. Potential classification schemes are discussed in | 29: if ( lq.time() + tshift >= cq.time() ) | |||
Section 2.3. The Curvy RED algorithm has not been maintained to the | 30: return lq; | |||
same degree as the DualPI2 algorithm. Some ideas used in DualPI2 | 31: else | |||
would need to be translated into Curvy RED, such as i) the | 32: return cq; | |||
conditional priority scheduler instead of strict priority ii) the | 33: } | |||
time-based L4S threshold; iii) turning off ECN as overload | ||||
protection; iv) Classic ECN support. These are not shown in the | ||||
Curvy RED pseudocode, but would need to be implemented for | ||||
production. {ToDo} | ||||
At the outer level, the structure of dualq_dequeue() implements | Figure 10: Example Dequeue Pseudocode for DualQ Coupled Curvy RED AQM | |||
strict priority scheduling. The code is written assuming the AQM is | ||||
applied on dequeue (Note 1) . Every time dualq_dequeue() is called, | ||||
the if-block in lines 2-6 determines whether there is an L4S packet | ||||
to dequeue by calling lq.dequeue(pkt), and otherwise the while-block | ||||
in lines 7-13 determines whether there is a Classic packet to | ||||
dequeue, by calling cq.dequeue(pkt). (Note 2) | ||||
In the lower priority Classic queue, a while loop is used so that, if | ||||
the AQM determines that a classic packet should be dropped, it | ||||
continues to test for classic packets deciding whether to drop each | ||||
until it actually forwards one. Thus, every call to dualq_dequeue() | ||||
returns one packet if at least one is present in either queue, | ||||
otherwise it returns NULL at line 14. (Note 3) | ||||
Within each queue, the decision whether to drop or mark is taken as | The dequeue pseudocode (Figure 10) is repeatedly called whenever the | |||
follows (to simplify the explanation, it is assumed that U=1): | lower layer is ready to forward a packet. It schedules one packet | |||
for dequeuing (or zero if the queue is empty) then returns control to | ||||
the caller, so that it does not block while that packet is being | ||||
forwarded. While making this dequeue decision, it also makes the | ||||
necessary AQM decisions on dropping or marking. The alternative of | ||||
applying the AQMs at enqueue would shift some processing from the | ||||
critical time when each packet is dequeued. However, it would also | ||||
add a whole queue of delay to the control signals, making the control | ||||
loop very sloppy. | ||||
L4S: If the test at line 2 determines there is an L4S packet to | The code is written assuming the AQMs are applied on dequeue (Note | |||
dequeue, the tests at lines 3a and 3c determine whether to mark | 1). All the dequeue code is contained within a large while loop so | |||
it. The first is a simple test of whether the L4S queue (lq.byt() | that if it decides to drop a packet, it will continue until it | |||
in bytes) is greater than a step threshold T in bytes (Note 4). | selects a packet to schedule. If both queues are empty, the routine | |||
The second test is similar to the random ECN marking in RED, but | returns NULL at line 20. Line 3 of the dequeue pseudocode is where | |||
with the following differences: i) the marking function does not | the conditional priority scheduler chooses between the L4S queue (lq) | |||
start with a plateau of zero marking until a minimum threshold, | and the Classic queue (cq). The time-shifted FIFO scheduler is shown | |||
rather the marking probability starts to increase as soon as the | at lines 28-33, which would be suitable if simplicity is paramount | |||
queue is positive; ii) marking depends on queuing time, not bytes, | (see Note 2). | |||
in order to scale for any link rate without being reconfigured; | ||||
iii) marking of the L4S queue does not depend on itself, it | Within each queue, the decision whether to forward, drop or mark is | |||
depends on the queuing time of the _other_ (Classic) queue, where | taken as follows (to simplify the explanation, it is assumed that | |||
cq.sec() is the queuing time of the packet at the head of the | U=1): | |||
Classic queue (zero if empty); iv) marking depends on the | ||||
instantaneous queuing time (of the other Classic queue), not a | L4S: If the test at line 3 determines there is an L4S packet to | |||
smoothed average; v) the queue is compared with the maximum of U | dequeue, the tests at lines 5b and 5c determine whether to mark | |||
it. The first is a simple test of whether the L4S queue delay | ||||
(lq.time()) is greater than a step threshold T (Note 3). The | ||||
second test is similar to the random ECN marking in RED, but with | ||||
the following differences: ii) marking depends on queuing time, | ||||
not bytes, in order to scale for any link rate without being | ||||
reconfigured; ii) marking of the L4S queue does not depend on | ||||
itself, it depends on the queuing time of the _other_ (Classic) | ||||
queue, where cq.time() is the queuing time of the packet at the | ||||
head of the Classic queue (zero if empty); iii) marking depends on | ||||
the instantaneous queuing time (of the other Classic queue), not a | ||||
smoothed average; iv) the queue is compared with the maximum of U | ||||
random numbers (but if U=1, this is the same as the single random | random numbers (but if U=1, this is the same as the single random | |||
number used in RED). | number used in RED). | |||
Specifically, in line 3a the marking probability p_L is set to the | Specifically, in line 5a the coupled marking probability p_CL is | |||
Classic queueing time qc.sec() in seconds divided by the L4S | set to the excess of the Classic queueing delay qc.time() above | |||
scaling parameter 2^S_L, which represents the queuing time (in | the minimum queuing delay threshold (minTh) all divided by the L4S | |||
seconds) at which marking probability would hit 100%. Then in line | scaling parameter range_L. range_L represents the queuing delay | |||
3d (if U=1) the result is compared with a uniformly distributed | (in seconds) added to minTh at which marking probability would hit | |||
random number between 0 and 1, which ensures that marking | 100%. Then in line 5c (if U=1) the result is compared with a | |||
probability will linearly increase with queueing time. The | uniformly distributed random number between 0 and 1, which ensures | |||
scaling parameter is expressed as a power of 2 so that division | that marking probability will linearly increase with queueing | |||
can be implemented as a right bit-shift (>>) in line 3 of the | time. | |||
integer variant of the pseudocode (Figure 10). | ||||
Classic: If the test at line 7 determines that there is at least one | Classic: If the scheduler at line 3 chooses to dequeue a Classic | |||
Classic packet to dequeue, the test at line 9b determines whether | packet and jumps to line 7, the test at line 10b determines | |||
to drop it. But before that, line 8b updates Q_C, which is an | whether to drop or mark it. But before that, line 9a updates Q_C, | |||
exponentially weighted moving average (Note 5) of the queuing time | which is an exponentially weighted moving average (Note 4) of the | |||
in the Classic queue, where pkt.sec() is the instantaneous | queuing time in the Classic queue, where qc.time() is the current | |||
queueing time of the current Classic packet and alpha is the EWMA | instantaneous queueing time of the Classic queue and gamma is the | |||
constant for the classic queue. In line 8a, alpha is represented | EWMA constant (default 1/32, see line 12 of the initialization | |||
as an integer power of 2, so that in line 8 of the integer code | function). | |||
the division needed to weight the moving average can be | ||||
implemented by a right bit-shift (>> f_C). | ||||
Lines 9a and 9b implement the drop function. In line 9a the | Lines 10a and 10b implement the Classic AQM. In line 10a the | |||
averaged queuing time Q_C is divided by the Classic scaling | averaged queuing time Q_C is divided by the Classic scaling | |||
parameter 2^S_C, in the same way that queuing time was scaled for | parameter range_C, in the same way that queuing time was scaled | |||
L4S marking. This scaled queuing time is given the variable name | for L4S marking. This scaled queuing time will be squared to | |||
sqrt_p_C because it will be squared to compute Classic drop | compute Classic drop probability so, before it is squared, it is | |||
probability, so before it is squared it is effectively the square | effectively the square root of the drop probability, hence it is | |||
root of the drop probability. The squaring is done by comparing | given the variable name sqrt_p_C. The squaring is done by | |||
it with the maximum out of two random numbers (assuming U=1). | comparing it with the maximum out of two random numbers (assuming | |||
Comparing it with the maximum out of two is the same as the | U=1). Comparing it with the maximum out of two is the same as the | |||
logical `AND' of two tests, which ensures drop probability rises | logical `AND' of two tests, which ensures drop probability rises | |||
with the square of queuing time (Note 6). Again, the scaling | with the square of queuing time. | |||
parameter is expressed as a power of 2 so that division can be | ||||
implemented as a right bit-shift in line 9 of the integer | ||||
pseudocode. | ||||
The marking/dropping functions in each queue (lines 3 & 9) are two | The AQM functions in each queue (lines 5c & 10b) are two cases of a | |||
cases of a new generalization of RED called Curvy RED, motivated as | new generalization of RED called Curvy RED, motivated as follows. | |||
follows. When we compared the performance of our AQM with fq_CoDel | When the performance of this AQM was compared with fq_CoDel and PIE, | |||
and PIE, we came to the conclusion that their goal of holding queuing | their goal of holding queuing delay to a fixed target seemed | |||
delay to a fixed target is misguided [CRED_Insights]. As the number | misguided [CRED_Insights]. As the number of flows increases, if the | |||
of flows increases, if the AQM does not allow TCP to increase queuing | AQM does not allow TCP to increase queuing delay, it has to introduce | |||
delay, it has to introduce abnormally high levels of loss. Then loss | abnormally high levels of loss. Then loss rather than queuing | |||
rather than queuing becomes the dominant cause of delay for short | becomes the dominant cause of delay for short flows, due to timeouts | |||
flows, due to timeouts and tail losses. | and tail losses. | |||
Curvy RED constrains delay with a softened target that allows some | Curvy RED constrains delay with a softened target that allows some | |||
increase in delay as load increases. This is achieved by increasing | increase in delay as load increases. This is achieved by increasing | |||
drop probability on a convex curve relative to queue growth (the | drop probability on a convex curve relative to queue growth (the | |||
square curve in the Classic queue, if U=1). Like RED, the curve hugs | square curve in the Classic queue, if U=1). Like RED, the curve hugs | |||
the zero axis while the queue is shallow. Then, as load increases, | the zero axis while the queue is shallow. Then, as load increases, | |||
it introduces a growing barrier to higher delay. But, unlike RED, it | it introduces a growing barrier to higher delay. But, unlike RED, it | |||
requires only one parameter, the scaling, not three. The diadvantage | requires only two parameters, not three. The disadvantage of Curvy | |||
of Curvy RED is that it is not adapted to a wide range of RTTs. | RED is that it is not adapted to a wide range of RTTs. Curvy RED can | |||
Curvy RED can be used as is when the RTT range to support is limited | be used as is when the RTT range to be supported is limited, | |||
otherwise an adaptation mechanism is required. | otherwise an adaptation mechanism is required. | |||
There follows a summary listing of the two parameters used for each | From our limited experiments with Curvy RED so far, recommended | |||
of the two queues: | values of these parameters are: S_C = -1; g_C = 5; T = 5 * MTU at the | |||
link rate (about 1ms at 60Mb/s) for the range of base RTTs typical on | ||||
Classic: | the public Internet. [CRED_Insights] explains why these parameters | |||
are applicable whatever rate link this AQM implementation is deployed | ||||
on and how the parameters would need to be adjusted for a scenario | ||||
with a different range of RTTs (e.g. a data centre). The setting of | ||||
k depends on policy (see Section 2.5 and Appendix C respectively for | ||||
its recommended setting and guidance on alternatives). | ||||
S_C : The scaling factor of the dropping function scales Classic | There is also a cUrviness parameter, U, which is a small positive | |||
queuing times in the range [0, 2^(S_C)] seconds into a dropping | integer. It is likely to take the same hard-coded value for all | |||
probability in the range [0,1]. To make division efficient, it | implementations, once experiments have determined a good value. Only | |||
is constrained to be an integer power of two; | U=1 has been used in experiments so far, but results might be even | |||
better with U=2 or higher. | ||||
f_C : To smooth the queuing time of the Classic queue and make | Notes: | |||
multiplication efficient, we use a negative integer power of | ||||
two for the dimensionless EWMA constant, which we define as | ||||
alpha = 2^(-f_C). | ||||
L4S : | 1. The alternative of applying the AQMs at enqueue would shift some | |||
processing from the critical time when each packet is dequeued. | ||||
However, it would also add a whole queue of delay to the control | ||||
signals, making the control loop sloppier (for a typical RTT it | ||||
would double the Classic queue's feedback delay). On a platform | ||||
where packet timestamping is feasible, e.g. Linux, it is also | ||||
easiest to apply the AQMs at dequeue because that is where | ||||
queuing time is also measured. | ||||
S_L (and k'): As for the Classic queue, the scaling factor of | 2. WRR better isolates the L4S queue from large delay bursts in the | |||
the L4S marking function scales Classic queueing times in the | Classic queue, but it is slightly less simple than TS-FIFO. If | |||
range [0, 2^(S_L)] seconds into a probability in the range | WRR were used, a low default Classic weight (e.g. 1/16) would | |||
[0,1]. Note that S_L = S_C + k', where k' is the coupling | need to be configured in place of the time shift in line 5 of the | |||
between the queues. So S_L and k' count as only one parameter; | initialization function (Figure 9). | |||
k' is related to k in Equation (1) (Section 2.1) by k=2^k', | ||||
where both k and k' are constants. Then implementations can | ||||
avoid costly division by shifting p_L by k' bits to the right. | ||||
T : The queue size in bytes at which step threshold marking | 3. A step function is shown for simplicity. A ramp function (see | |||
starts in the L4S queue. | Figure 5 and the discussion around it in Appendix A.1) is | |||
recommended, because it is more general than a step and has the | ||||
potential to enable L4S congestion controls to converge more | ||||
rapidly. | ||||
{ToDo: These are the raw parameters used within the algorithm. A | 4. An EWMA is only one possible way to filter bursts; other more | |||
configuration front-end could accept more meaningful parameters and | adaptive smoothing methods could be valid and it might be | |||
convert them into these raw parameters.} | appropriate to decrease the EWMA faster than it increases, e.g. | |||
by using the minimum of the smoothed and instantaneous queue | ||||
delays, min(Q_C, qc.time()). | ||||
From our experiments so far, recommended values for these parameters | B.2. Efficient Implementation of Curvy RED | |||
are: S_C = -1; f_C = 5; T = 5 * MTU for the range of base RTTs | ||||
typical on the public Internet. [CRED_Insights] explains why these | ||||
parameters are applicable whatever rate link this AQM implementation | ||||
is deployed on and how the parameters would need to be adjusted for a | ||||
scenario with a different range of RTTs (e.g. a data centre) {ToDo | ||||
incorporate a summary of that report into this draft}. The setting of | ||||
k depends on policy (see Section 2.5 and Appendix C respectively for | ||||
its recommended setting and guidance on alternatives). | ||||
There is also a cUrviness parameter, U, which is a small positive | Although code optimization depends on the platform, the following | |||
integer. It is likely to take the same hard-coded value for all | notes explain where the design of Curvy RED was particularly | |||
implementations, once experiments have determined a good value. We | motivated by efficient implementation. | |||
have solely used U=1 in our experiments so far, but results might be | ||||
even better with U=2 or higher. | ||||
Note that the dropping function at line 9 calls maxrand(2*U), which | The Classic AQM at line 10b calls maxrand(2*U), which gives twice as | |||
gives twice as much curviness as the call to maxrand(U) in the | much curviness as the call to maxrand(U) in the marking function at | |||
marking function at line 3. This is the trick that implements the | line 5c. This is the trick that implements the square rule in | |||
square rule in equation (1) (Section 2.1). This is based on the fact | equation (1) (Section 2.1). This is based on the fact that, given a | |||
that, given a number X from 1 to 6, the probability that two dice | number X from 1 to 6, the probability that two dice throws will both | |||
throws will both be less than X is the square of the probability that | be less than X is the square of the probability that one throw will | |||
one throw will be less than X. So, when U=1, the L4S marking | be less than X. So, when U=1, the L4S marking function is linear and | |||
function is linear and the Classic dropping function is squared. If | the Classic dropping function is squared. If U=2, L4S would be a | |||
U=2, L4S would be a square function and Classic would be quartic. | square function and Classic would be quartic. And so on. | |||
And so on. | ||||
The maxrand(u) function in lines 16-21 simply generates u random | The maxrand(u) function in lines 16-21 simply generates u random | |||
numbers and returns the maximum (Note 7). Typically, maxrand(u) | numbers and returns the maximum. Typically, maxrand(u) could be run | |||
could be run in parallel out of band. For instance, if U=1, the | in parallel out of band. For instance, if U=1, the Classic queue | |||
Classic queue would require the maximum of two random numbers. So, | would require the maximum of two random numbers. So, instead of | |||
instead of calling maxrand(2*U) in-band, the maximum of every pair of | calling maxrand(2*U) in-band, the maximum of every pair of values | |||
values from a pseudorandom number generator could be generated out- | from a pseudorandom number generator could be generated out-of-band, | |||
of-band, and held in a buffer ready for the Classic queue to consume. | and held in a buffer ready for the Classic queue to consume. | |||
1: dualq_dequeue(lq, cq) { % Couples L4S & Classic queues, lq & cq | 1: cred_dequeue(lq, cq, pkt) { % Couples L4S & Classic queues | |||
2: if ( lq.dequeue(pkt) ) { | 2: while ( lq.len() + cq.len() > 0 ) { | |||
3: if ((lq.byt() > T) || ((cq.ns() >> (S_L-2)) > maxrand(U))) | 3: if ( scheduler() == lq ) { | |||
4: mark(pkt) | 4: lq.dequeue(pkt) % L4S scheduled | |||
5: return(pkt) % return the packet and stop here | 5: if ((lq.time() > T) OR (cq.ns() >> (S_L-2) > maxrand(U))) | |||
6: } | 6: mark(pkt) | |||
7: while ( cq.dequeue(pkt) ) { | 7: } else { | |||
8: Q_C += (pkt.ns() - Q_C) >> f_C % Classic Q EWMA | 8: cq.dequeue(pkt) % Classic scheduled | |||
9: if ( (Q_C >> (S_C-2) ) > maxrand(2*U) ) | 9: Q_C += (cq.ns() - Q_C) >> g_C % Classic Q EWMA | |||
10: drop(pkt) % Squared drop, redo loop | 10: if ( (Q_C >> (S_C-2) ) > maxrand(2*U) ) { | |||
11: else | 11: if ( (ecn(pkt) == 0) { % ECN field = not-ECT | |||
12: return(pkt) % return the packet and stop here | 12: drop(pkt) % Squared drop, redo loop | |||
13: } | 13: continue % continue to the top of the while loop | |||
14: return(NULL) % no packet to dequeue | 14: } | |||
15: } | 15: mark(pkt) | |||
16: } | ||||
17: } | ||||
18: return(pkt) % return the packet and stop here | ||||
19: } | ||||
20: return(NULL) % no packet to dequeue | ||||
21: } | ||||
Figure 10: Optimised Example Dequeue Pseudocode for Coupled DualQ AQM | Figure 11: Optimised Example Dequeue Pseudocode for Coupled DualQ AQM | |||
using Integer Arithmetic | using Integer Arithmetic | |||
Notes: | The two ranges, range_L and range_C are expressed as powers of 2 so | |||
that division can be implemented as a right bit-shift (>>) in lines 5 | ||||
1. The drain rate of the queue can vary if it is scheduled relative | and 10 of the integer variant of the pseudocode (Figure 11). | |||
to other queues, or to cater for fluctuations in a wireless | ||||
medium. To auto-adjust to changes in drain rate, the queue must | ||||
be measured in time, not bytes or packets [CoDel]. In our Linux | ||||
implementation, it was easiest to measure queuing time at | ||||
dequeue. Queuing time can be estimated when a packet is enqueued | ||||
by measuring the queue length in bytes and dividing by the recent | ||||
drain rate. | ||||
2. An implementation has to use priority queueing, but it need not | ||||
implement strict priority. | ||||
3. If packets can be enqueued while processing dequeue code, an | ||||
implementer might prefer to place the while loop around both | ||||
queues so that it goes back to test again whether any L4S packets | ||||
arrived while it was dropping a Classic packet. | ||||
4. In order not to change too many factors at once, for now, we keep | ||||
the marking function for DCTCP-only traffic as similar as | ||||
possible to DCTCP. However, unlike DCTCP, all processing is at | ||||
dequeue, so we determine whether to mark a packet at the head of | ||||
the queue by the byte-length of the queue _behind_ it. We plan | ||||
to test whether using queuing time will work in all | ||||
circumstances, and if we find that the step can cause | ||||
oscillations, we will investigate replacing it with a steep | ||||
random marking curve. | ||||
5. An EWMA is only one possible way to filter bursts; other more | For the integer variant of the pseudocode, an integer version of the | |||
adaptive smoothing methods could be valid and it might be | rand() function used at line 25 of the maxrand(function) in Figure 10 | |||
appropriate to decrease the EWMA faster than it increases. | would be arranged to return an integer in the range 0 <= maxrand() < | |||
2^32 (not shown). This would scale up all the floating point | ||||
probabilities in the range [0,1] by 2^32. | ||||
6. In practice at line 10 the Classic queue would probably test for | Queuing delays are also scaled up by 2^32, but in two stages: i) In | |||
ECN capability on the packet to determine whether to drop or mark | lines 5 and 10 queuing times cq.ns() and pkt.ns() are returned in | |||
the packet. However, for brevity such detail is omitted. All | integer nanoseconds, making the values about 2^30 times larger than | |||
packets classified into the L4S queue have to be ECN-capable, so | when the units were seconds, ii) then in lines 3 and 9 an adjustment | |||
no dropping logic is necessary at line 3. Nonetheless, L4S | of -2 to the right bit-shift multiplies the result by 2^2, to | |||
packets could be dropped by overload code (see Section 4.1). | complete the scaling by 2^32. | |||
7. In the integer variant of the pseudocode (Figure 10) real numbers | In line 8 of the initialization function, the EWMA constant gamma is | |||
are all represented as integers scaled up by 2^32. In lines 3 & | represented as an integer power of 2, g_C, so that in line 9 of the | |||
9 the function maxrand() is arranged to return an integer in the | integer code the division needed to weight the moving average can be | |||
range 0 <= maxrand() < 2^32. Queuing times are also scaled up by | implemented by a right bit-shift (>> g_C). | |||
2^32, but in two stages: i) In lines 3 and 8 queuing times | ||||
cq.ns() and pkt.ns() are returned in integer nanoseconds, making | ||||
the values about 2^30 times larger than when the units were | ||||
seconds, ii) then in lines 3 and 9 an adjustment of -2 to the | ||||
right bit-shift multiplies the result by 2^2, to complete the | ||||
scaling by 2^32. | ||||
Appendix C. Guidance on Controlling Throughput Equivalence | Appendix C. Guidance on Controlling Throughput Equivalence | |||
+---------------+------+-------+ | +---------------+------+-------+ | |||
| RTT_C / RTT_L | Reno | Cubic | | | RTT_C / RTT_L | Reno | Cubic | | |||
+---------------+------+-------+ | +---------------+------+-------+ | |||
| 1 | k'=1 | k'=0 | | | 1 | k'=1 | k'=0 | | |||
| 2 | k'=2 | k'=1 | | | 2 | k'=2 | k'=1 | | |||
| 3 | k'=2 | k'=2 | | | 3 | k'=2 | k'=2 | | |||
| 4 | k'=3 | k'=2 | | | 4 | k'=3 | k'=2 | | |||
skipping to change at page 40, line 32 ¶ | skipping to change at page 49, line 11 ¶ | |||
for k', if it wants to slow DCTCP down to roughly the same throughput | for k', if it wants to slow DCTCP down to roughly the same throughput | |||
as Classic flows, to compensate for Classic flows slowing themselves | as Classic flows, to compensate for Classic flows slowing themselves | |||
down by causing themselves extra queuing delay. | down by causing themselves extra queuing delay. | |||
The values for k' in the table are derived from the formulae, which | The values for k' in the table are derived from the formulae, which | |||
was developed in [DCttH15]: | was developed in [DCttH15]: | |||
2^k' = 1.64 (RTT_reno / RTT_dc) (2) | 2^k' = 1.64 (RTT_reno / RTT_dc) (2) | |||
2^k' = 1.19 (RTT_cubic / RTT_dc ) (3) | 2^k' = 1.19 (RTT_cubic / RTT_dc ) (3) | |||
For localized traffic from a particular ISP's data centre, we used | For localized traffic from a particular ISP's data centre, using the | |||
the measured RTTs to calculate that a value of k'=3 (equivalant to | measured RTTs, it was calculated that a value of k'=3 (equivalant to | |||
k=8) would achieve throughput equivalence, and our experiments | k=8) would achieve throughput equivalence, and experiments verified | |||
verified the formula very closely. | the formula very closely. | |||
For a typical mix of RTTs from local data centres and across the | For a typical mix of RTTs from local data centres and across the | |||
general Internet, a value of k'=1 (equivalent to k=2) is recommended | general Internet, a value of k'=1 (equivalent to k=2) is recommended | |||
as a good workable compromise. | as a good workable compromise. | |||
Appendix D. Open Issues | ||||
Most of the following open issues are also tagged '{ToDo}' at the | ||||
appropriate point in the document: | ||||
PI2 appendix: scaling of alpha & beta, esp. dependence of beta_U | ||||
on Tupdate | ||||
Curvy RED appendix: complete the unfinished parts | ||||
Authors' Addresses | Authors' Addresses | |||
Koen De Schepper | Koen De Schepper | |||
Nokia Bell Labs | Nokia Bell Labs | |||
Antwerp | Antwerp | |||
Belgium | Belgium | |||
Email: koen.de_schepper@nokia.com | Email: koen.de_schepper@nokia.com | |||
URI: https://www.bell-labs.com/usr/koen.de_schepper | URI: https://www.bell-labs.com/usr/koen.de_schepper | |||
Bob Briscoe (editor) | Bob Briscoe (editor) | |||
CableLabs | CableLabs | |||
UK | UK | |||
Email: ietf@bobbriscoe.net | Email: ietf@bobbriscoe.net | |||
URI: http://bobbriscoe.net/ | URI: http://bobbriscoe.net/ | |||
Olga Bondarenko | Greg White | |||
Simula Research Lab | CableLabs | |||
Lysaker | Louisville, CO | |||
Norway | US | |||
Email: olgabnd@gmail.com | ||||
URI: https://www.simula.no/people/olgabo | ||||
Ing-jyh Tsang | ||||
Nokia | ||||
Antwerp | ||||
Belgium | ||||
Email: ing-jyh.tsang@nokia.com | Email: G.White@CableLabs.com | |||
End of changes. 189 change blocks. | ||||
844 lines changed or deleted | 1196 lines changed or added | |||
This html diff was produced by rfcdiff 1.47. The latest version is available from http://tools.ietf.org/tools/rfcdiff/ |