--- 1/draft-ietf-tsvwg-aqm-dualq-coupled-21.txt 2022-03-04 16:14:26.053162727 -0800 +++ 2/draft-ietf-tsvwg-aqm-dualq-coupled-22.txt 2022-03-04 16:14:26.185166065 -0800 @@ -1,22 +1,22 @@ Transport Area working group (tsvwg) K. De Schepper Internet-Draft Nokia Bell Labs Intended status: Experimental B. Briscoe, Ed. -Expires: 5 August 2022 Independent +Expires: 5 September 2022 Independent G. White CableLabs - 1 February 2022 + 4 March 2022 DualQ Coupled AQMs for Low Latency, Low Loss and Scalable Throughput (L4S) - draft-ietf-tsvwg-aqm-dualq-coupled-21 + draft-ietf-tsvwg-aqm-dualq-coupled-22 Abstract This specification defines a framework for coupling the Active Queue Management (AQM) algorithms in two queues intended for flows with different responses to congestion. This provides a way for the Internet to transition from the scaling problems of standard TCP Reno-friendly ('Classic') congestion controls to the family of 'Scalable' congestion controls. These are designed for consistently very Low queuing Latency, very Low congestion Loss and Scaling of @@ -42,81 +42,82 @@ Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet- Drafts is at https://datatracker.ietf.org/drafts/current/. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." - This Internet-Draft will expire on 5 August 2022. + This Internet-Draft will expire on 5 September 2022. Copyright Notice Copyright (c) 2022 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/ license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License. Table of Contents 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 1.1. Outline of the Problem . . . . . . . . . . . . . . . . . 3 - 1.2. Scope . . . . . . . . . . . . . . . . . . . . . . . . . . 5 + 1.2. Scope . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.3. Terminology . . . . . . . . . . . . . . . . . . . . . . . 7 1.4. Features . . . . . . . . . . . . . . . . . . . . . . . . 9 2. DualQ Coupled AQM . . . . . . . . . . . . . . . . . . . . . . 11 2.1. Coupled AQM . . . . . . . . . . . . . . . . . . . . . . . 11 2.2. Dual Queue . . . . . . . . . . . . . . . . . . . . . . . 12 - 2.3. Traffic Classification . . . . . . . . . . . . . . . . . 12 + 2.3. Traffic Classification . . . . . . . . . . . . . . . . . 13 2.4. Overall DualQ Coupled AQM Structure . . . . . . . . . . . 13 2.5. Normative Requirements for a DualQ Coupled AQM . . . . . 17 2.5.1. Functional Requirements . . . . . . . . . . . . . . . 17 2.5.1.1. Requirements in Unexpected Cases . . . . . . . . 18 2.5.2. Management Requirements . . . . . . . . . . . . . . . 19 2.5.2.1. Configuration . . . . . . . . . . . . . . . . . . 19 2.5.2.2. Monitoring . . . . . . . . . . . . . . . . . . . 21 2.5.2.3. Anomaly Detection . . . . . . . . . . . . . . . . 22 2.5.2.4. Deployment, Coexistence and Scaling . . . . . . . 22 3. IANA Considerations (to be removed by RFC Editor) . . . . . . 22 4. Security Considerations . . . . . . . . . . . . . . . . . . . 22 - 4.1. Overload Handling . . . . . . . . . . . . . . . . . . . . 22 - 4.1.1. Avoiding Classic Starvation: Sacrifice L4S Throughput - or Delay? . . . . . . . . . . . . . . . . . . . . . . 23 - 4.1.2. Congestion Signal Saturation: Introduce L4S Drop or - Delay? . . . . . . . . . . . . . . . . . . . . . . . 24 + 4.1. Low Delay without Requiring Per-Flow Processing . . . . . 22 + 4.2. Handling Unresponsive Flows and Overload . . . . . . . . 23 + 4.2.1. Unresponsive Traffic without Overload . . . . . . . . 24 + 4.2.2. Avoiding Short-Term Classic Starvation: Sacrifice L4S + Throughput or Delay? . . . . . . . . . . . . . . . . 25 - 4.1.3. Protecting against Unresponsive ECN-Capable - Traffic . . . . . . . . . . . . . . . . . . . . . . . 25 - 5. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 26 - 6. Contributors . . . . . . . . . . . . . . . . . . . . . . . . 26 - 7. References . . . . . . . . . . . . . . . . . . . . . . . . . 27 - 7.1. Normative References . . . . . . . . . . . . . . . . . . 27 - 7.2. Informative References . . . . . . . . . . . . . . . . . 27 - Appendix A. Example DualQ Coupled PI2 Algorithm . . . . . . . . 33 - A.1. Pass #1: Core Concepts . . . . . . . . . . . . . . . . . 33 - A.2. Pass #2: Edge-Case Details . . . . . . . . . . . . . . . 44 - Appendix B. Example DualQ Coupled Curvy RED Algorithm . . . . . 48 - B.1. Curvy RED in Pseudocode . . . . . . . . . . . . . . . . . 48 - B.2. Efficient Implementation of Curvy RED . . . . . . . . . . 54 - Appendix C. Choice of Coupling Factor, k . . . . . . . . . . . . 56 - C.1. RTT-Dependence . . . . . . . . . . . . . . . . . . . . . 56 - C.2. Guidance on Controlling Throughput Equivalence . . . . . 57 - Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 61 + 4.2.3. L4S ECN Saturation: Introduce Drop or Delay? . . . . 26 + 4.2.3.1. Protecting against Overload by Unresponsive + ECN-Capable Traffic . . . . . . . . . . . . . . . . 28 + 5. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 28 + 6. Contributors . . . . . . . . . . . . . . . . . . . . . . . . 29 + 7. References . . . . . . . . . . . . . . . . . . . . . . . . . 29 + 7.1. Normative References . . . . . . . . . . . . . . . . . . 29 + 7.2. Informative References . . . . . . . . . . . . . . . . . 30 + Appendix A. Example DualQ Coupled PI2 Algorithm . . . . . . . . 35 + A.1. Pass #1: Core Concepts . . . . . . . . . . . . . . . . . 36 + A.2. Pass #2: Edge-Case Details . . . . . . . . . . . . . . . 46 + Appendix B. Example DualQ Coupled Curvy RED Algorithm . . . . . 51 + B.1. Curvy RED in Pseudocode . . . . . . . . . . . . . . . . . 51 + B.2. Efficient Implementation of Curvy RED . . . . . . . . . . 57 + Appendix C. Choice of Coupling Factor, k . . . . . . . . . . . . 59 + C.1. RTT-Dependence . . . . . . . . . . . . . . . . . . . . . 59 + C.2. Guidance on Controlling Throughput Equivalence . . . . . 60 + Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 64 1. Introduction This document specifies a framework for DualQ Coupled AQMs, which is the network part of the L4S architecture [I-D.ietf-tsvwg-l4s-arch]. L4S enables both very low queuing latency (sub-millisecond on average) and high throughput at the same time, for ad hoc numbers of capacity-seeking applications all sharing the same capacity. 1.1. Outline of the Problem @@ -156,25 +157,25 @@ delay to vary or cause the link to be under-utilized. These AQMs are tuned to allow a typical capacity-seeking Reno-friendly flow to induce an average queue that roughly doubles the base RTT, adding 5-15 ms of queuing on average (cf. 500 microseconds with L4S for the same mix of long-running and web traffic). However, for many applications low delay is not useful unless it is consistently low. With these AQMs, 99th percentile queuing delay is 20-30 ms (cf. 2 ms with the same traffic over L4S). * Similarly, recent research into using e2e congestion control - without needing an AQM in the network - (e.g.BBR [I-D.cardwell-iccrg-bbr-congestion-control]) seems to - have hit a similar lower limit to queuing delay of about 20ms on - average but there are also regular 25ms delay spikes due to - bandwidth probes and 60ms spikes due to flow-starts. + without needing an AQM in the network (e.g. BBR + [I-D.cardwell-iccrg-bbr-congestion-control]) seems to have hit a + similar lower limit to queuing delay of about 20ms on average but + there are also regular 25ms delay spikes due to bandwidth probes + and 60ms spikes due to flow-starts. L4S learns from the experience of Data Center TCP [RFC8257], which shows the power of complementary changes both in the network and on end-systems. DCTCP teaches us that two small but radical changes to congestion control are needed to cut the two major outstanding causes of queuing delay variability: 1. Far smaller rate variations (sawteeth) than Reno-friendly congestion controls; @@ -202,65 +203,65 @@ the queue is signalled immediately. Without ECN, either of these would lead to very high loss levels. But, with ECN, the resulting high marking levels are just signals, not impairments. BBRv2 combines the best of both worlds - it works as a scalable congestion control when ECN is available, but also aims to minimize delay when it isn't. However, until now, Scalable congestion controls (like DCTCP) did not co-exist well in a shared ECN-capable queue with existing ECN-capable - TCP Reno [RFC5681] or Cubic [RFC8312] congestion controls --- - Scalable controls are so aggressive that these 'Classic' algorithms - would drive themselves to a small capacity share. Therefore, until - now, L4S controls could only be deployed where a clean-slate - environment could be arranged, such as in private data centres (hence - the name DCTCP). + TCP Reno [RFC5681] or Cubic [RFC8312] congestion controls -- Scalable + controls are so aggressive that these 'Classic' algorithms would + drive themselves to a small capacity share. Therefore, until now, + L4S controls could only be deployed where a clean-slate environment + could be arranged, such as in private data centres (hence the name + DCTCP). This document specifies a `DualQ Coupled AQM' extension that solves the problem of coexistence between Scalable and Classic flows, without having to inspect flow identifiers. It is not like flow- queuing approaches [RFC8290] that classify packets by flow identifier into separate queues in order to isolate sparse flows from the higher latency in the queues assigned to heavier flows. If a flow needs both low delay and high throughput, having a queue to itself does not isolate it from the harm it causes to itself. In contrast, DualQ - Coupled AQMs address the root cause of the latency problem --- they + Coupled AQMs address the root cause of the latency problem -- they are an enabler for the smooth low latency scalable behaviour of Scalable congestion controls, so that every packet in every flow can potentially enjoy very low latency, then there would be no need to isolate each flow into a separate queue. 1.2. Scope L4S involves complementary changes in the network and on end-systems: Network: A DualQ Coupled AQM (defined in the present document) or a - modification to flow-queue AQMs (described in section 4.2.b of - [I-D.ietf-tsvwg-l4s-arch]); + modification to flow-queue AQMs (described in section 4.2.b of the + L4S architecture [I-D.ietf-tsvwg-l4s-arch]); End-system: A Scalable congestion control (defined in section 4 of - [I-D.ietf-tsvwg-ecn-l4s-id]). + the L4S ECN protocol [I-D.ietf-tsvwg-ecn-l4s-id]). Packet identifier: The network and end-system parts of L4S can be deployed incrementally, because they both identify L4S packets using the experimentally assigned explicit congestion notification (ECN) codepoints in the IP header: ECT(1) and CE [RFC8311] [I-D.ietf-tsvwg-ecn-l4s-id]. Data Center TCP (DCTCP [RFC8257]) is an example of a Scalable congestion control for controlled environments that has been deployed for some time in Linux, Windows and FreeBSD operating systems. During the progress of this document through the IETF a number of - other Scalable congestion controls were implemented, e.g. TCP - Prague [I-D.briscoe-iccrg-prague-congestion-control] [PragueLinux], - BBRv2 [I-D.cardwell-iccrg-bbr-congestion-control], QUIC Prague and + other Scalable congestion controls were implemented, e.g. TCP Prague + [I-D.briscoe-iccrg-prague-congestion-control] [PragueLinux], BBRv2 + [BBRv2], [I-D.cardwell-iccrg-bbr-congestion-control], QUIC Prague and the L4S variant of SCREAM for real-time media [RFC8298]. The focus of this specification is to enable deployment of the network part of the L4S service. Then, without any management intervention, applications can exploit this new network capability as their operating systems migrate to Scalable congestion controls, which can then evolve _while_ their benefits are being enjoyed by everyone on the Internet. The DualQ Coupled AQM framework can incorporate any AQM designed for @@ -332,41 +333,41 @@ evolve, such as the examples listed earlier (Relentless, SCReAM, etc.). Classic Congestion Control: A congestion control behaviour that can co-exist with standard TCP Reno [RFC5681] without causing significantly negative impact on its flow rate [RFC5033]. With Classic congestion controls, such as Reno or Cubic, because flow rate has scaled since TCP congestion control was first designed in 1988, it now takes hundreds of round trips (and growing) to recover after a congestion signal (whether a loss or an ECN mark) - as shown in the examples in section 5.1 of - [I-D.ietf-tsvwg-l4s-arch] and in [RFC3649]. Therefore control of - queuing and utilization becomes very slack, and the slightest - disturbances (e.g. from new flows starting) prevent a high rate - from being attained. + as shown in the examples in section 5.1 of the L4S + architecture [I-D.ietf-tsvwg-l4s-arch] and in [RFC3649]. + Therefore control of queuing and utilization becomes very slack, + and the slightest disturbances (e.g. from new flows starting) + prevent a high rate from being attained. Scalable Congestion Control: A congestion control where the average time from one congestion signal to the next (the recovery time) remains invariant as the flow rate scales, all other factors being equal. This maintains the same degree of control over queueing and utilization whatever the flow rate, as well as ensuring that high throughput is robust to disturbances. For instance, DCTCP averages 2 congestion signals per round-trip whatever the flow rate, as do other recently developed scalable congestion controls, e.g. Relentless TCP [Mathis09], TCP Prague [I-D.briscoe-iccrg-prague-congestion-control], [PragueLinux], - BBRv2 [I-D.cardwell-iccrg-bbr-congestion-control] and the L4S - variant of SCREAM for real-time media [SCReAM], [RFC8298]). For - the public Internet a Scalable transport has to comply with the - requirements in Section 4 of [I-D.ietf-tsvwg-ecn-l4s-id] (aka. the - 'Prague L4S requirements'). + BBRv2 [BBRv2], [I-D.cardwell-iccrg-bbr-congestion-control] and the + L4S variant of SCREAM for real-time media [SCReAM], [RFC8298]). + For the public Internet a Scalable transport has to comply with + the requirements in Section 4 of [I-D.ietf-tsvwg-ecn-l4s-id] + (aka. the 'Prague L4S requirements'). C: Abbreviation for Classic, e.g. when used as a subscript. L: Abbreviation for L4S, e.g. when used as a subscript. The terms Classic or L4S can also qualify other nouns, such as 'codepoint', 'identifier', 'classification', 'packet', 'flow'. For example: an L4S packet means a packet with an L4S identifier sent from an L4S congestion control. @@ -567,27 +568,27 @@ identifiers, because it can apply the appropriate marking or dropping probability to all flows of each type. A separate specification [I-D.ietf-tsvwg-ecn-l4s-id] requires the network to treat the ECT(1) and CE codepoints of the ECN field as this identifier. An additional process document has proved necessary to make the ECT(1) codepoint available for experimentation [RFC8311]. For policy reasons, an operator might choose to steer certain packets (e.g. from certain flows or with certain addresses) out of the L queue, even though they identify themselves as L4S by their ECN - codepoints. In such cases, [I-D.ietf-tsvwg-ecn-l4s-id] says that the - device "MUST NOT alter the end-to-end L4S ECN identifier", so that it - is preserved end-to-end. The aim is that each operator can choose - how it treats L4S traffic locally, but an individual operator does - not alter the identification of L4S packets, which would prevent - other operators downstream from making their own choices on how to - treat L4S traffic. + codepoints. In such cases, the L4S ECN + protocol [I-D.ietf-tsvwg-ecn-l4s-id] says that the device "MUST NOT + alter the end-to-end L4S ECN identifier", so that it is preserved + end-to-end. The aim is that each operator can choose how it treats + L4S traffic locally, but an individual operator does not alter the + identification of L4S packets, which would prevent other operators + downstream from making their own choices on how to treat L4S traffic. In addition, an operator could use other identifiers to classify certain additional packet types into the L queue that it deems will not risk harm to the L4S service. For instance addresses of specific applications or hosts; specific Diffserv codepoints such as EF (Expedited Forwarding), Voice-Admit or the Non-Queue-Building (NQB) per-hop behaviour; or certain protocols (e.g. ARP, DNS) (see Section 5.4.1 of [I-D.ietf-tsvwg-ecn-l4s-id]). Note that the mechanism only reads these identifiers. [I-D.ietf-tsvwg-ecn-l4s-id] says it "MUST NOT alter these non-ECN identifiers". Thus, the L @@ -685,38 +687,37 @@ C queue. This is because, as the C queue grows, the base AQM applies more congestion signals to L traffic (as well as C). As L flows reduce their rate in response, they use less than the scheduling share for L traffic. So, because the scheduler is work preserving, it schedules any C traffic in the gaps. Giving priority to the L queue has the benefit of very low L queue delay, because the L queue is kept empty whenever L traffic is controlled by the coupling. Also there only has to be a coupling in one direction - from Classic to L4S. Priority has to be conditional - in some way to prevent the C queue being starved by excessive - unresponsive L traffic (see Section 4.1) and to give C traffic a - means to push in, as explained next. With normal responsive L - traffic, the coupled ECN marking gives C traffic the ability to push - back against even strict priority, by congestion marking the L - traffic to make it yield some space. However, if there is just a - small finite set of C packets (e.g. a DNS request or an initial - window of data) some Classic AQMs will not induce enough ECN marking - in the L queue, no matter how long the small set of C packets waits. - Then, if the L queue happens to remain busy, the C traffic would - never get a scheduling opportunity from a strict priority scheduler. - Ideally the Classic AQM would be designed to increase the coupled - marking the longer that C packets have been waiting, but this is not - always practical - hence the need for L priority to be conditional. - Giving a small weight or limited waiting time for C traffic improves - response times for short Classic messages, such as DNS requests and - improves Classic flow startup because immediate capacity is - available. + in some way to prevent the C queue being starved in the short-term + (see Section 4.2.2) to give C traffic a means to push in, as + explained next. With normal responsive L traffic, the coupled ECN + marking gives C traffic the ability to push back against even strict + priority, by congestion marking the L traffic to make it yield some + space. However, if there is just a small finite set of C packets + (e.g. a DNS request or an initial window of data) some Classic AQMs + will not induce enough ECN marking in the L queue, no matter how long + the small set of C packets waits. Then, if the L queue happens to + remain busy, the C traffic would never get a scheduling opportunity + from a strict priority scheduler. Ideally the Classic AQM would be + designed to increase the coupled marking the longer that C packets + have been waiting, but this is not always practical - hence the need + for L priority to be conditional. Giving a small weight or limited + waiting time for C traffic improves response times for short Classic + messages, such as DNS requests, and improves Classic flow startup + because immediate capacity is available. Example DualQ Coupled AQM algorithms called DualPI2 and Curvy RED are given in Appendix A and Appendix B. Either example AQM can be used to couple packet marking and dropping across a dual Q. DualPI2 uses a Proportional-Integral (PI) controller as the Base AQM. Indeed, this Base AQM with just the squared output and no L4S queue can be used as a drop-in replacement for PIE [RFC8033], in which case it is just called PI2 [PI2]. PI2 is a principled simplification of PIE that is both more responsive and more stable in the face of @@ -735,22 +736,22 @@ different drain rates. With AQMs in a dualQ structure this is particularly important because the drain rate of each queue can vary rapidly as flows for the two queues arrive and depart, even if the combined link rate is constant. It would be possible to control the queues with other alternative AQMs, as long as the normative requirements (those expressed in capitals) in Section 2.5 are observed. The two queues could optionally be part of a larger queuing - hierarchy, such as the initial example ideas - in [I-D.briscoe-tsvwg-l4s-diffserv]. + hierarchy, such as the initial example ideas in + [I-D.briscoe-tsvwg-l4s-diffserv]. 2.5. Normative Requirements for a DualQ Coupled AQM The following requirements are intended to capture only the essential aspects of a DualQ Coupled AQM. They are intended to be independent of the particular AQMs used for each queue. 2.5.1. Functional Requirements A Dual Queue Coupled AQM implementation MUST comply with the @@ -762,45 +763,46 @@ technology underlying any L4S AQM. A Dual Queue Coupled AQM implementation MUST utilize two queues, each with an AQM algorithm. The AQM algorithm for the low latency (L) queue MUST be able to apply ECN marking to ECN-capable packets. The scheduler draining the two queues MUST give L4S packets priority over Classic, although priority MUST be bounded in order not to - starve Classic traffic. The scheduler SHOULD be work-conserving, or - otherwise close to work-conserving. This is because Classic traffic - needs to be able to efficiently fill any space left by L4S traffic - even though the scheduler would otherwise allocate it to L4S. + starve Classic traffic (see Section 4.2.2). The scheduler SHOULD be + work-conserving, or otherwise close to work-conserving. This is + because Classic traffic needs to be able to efficiently fill any + space left by L4S traffic even though the scheduler would otherwise + allocate it to L4S. [I-D.ietf-tsvwg-ecn-l4s-id] defines the meaning of an ECN marking on L4S traffic, relative to drop of Classic traffic. In order to ensure coexistence of Classic and Scalable L4S traffic, it says, "The likelihood that an AQM drops a Not-ECT Classic packet (p_C) MUST be roughly proportional to the square of the likelihood that it would have marked it if it had been an L4S packet (p_L)." The term 'likelihood' is used to allow for marking and dropping to be either probabilistic or deterministic. For the current specification, this translates into the following requirement. A DualQ Coupled AQM MUST apply ECN marking to traffic in the L queue that is no lower than that derived from the likelihood of drop (or ECN marking) in the Classic queue using Eqn. (1). The constant of proportionality, k, in Eqn (1) determines the relative flow rates of Classic and L4S flows when the AQM concerned - is the bottleneck (all other factors being equal). - [I-D.ietf-tsvwg-ecn-l4s-id] says, "The constant of proportionality - (k) does not have to be standardised for interoperability, but a - value of 2 is RECOMMENDED." + is the bottleneck (all other factors being equal). The L4S ECN + protocol [I-D.ietf-tsvwg-ecn-l4s-id] says, "The constant of + proportionality (k) does not have to be standardised for + interoperability, but a value of 2 is RECOMMENDED." Assuming Scalable congestion controls for the Internet will be as aggressive as DCTCP, this will ensure their congestion window will be roughly the same as that of a standards track TCP Reno congestion control (Reno) [RFC5681] and other Reno-friendly controls, such as TCP Cubic in its Reno-compatibility mode. The choice of k is a matter of operator policy, and operators MAY choose a different value using the guidelines in Appendix C.2. @@ -830,23 +832,22 @@ * If a packet that does not carry an ECT(1) or CE codepoint is classified into the L queue: - if the packet is ECT(0), the L AQM SHOULD apply CE-marking using a probability appropriate to Classic congestion control and appropriate to the target delay in the L queue - if the packet is Not-ECT, the appropriate action depends on whether some other function is protecting the L queue from - misbehaving flows (e.g. per-flow queue - protection [I-D.briscoe-docsis-q-protection] or latency - policing): + misbehaving flows (e.g. per-flow queue protection + [I-D.briscoe-docsis-q-protection] or latency policing): o If separate queue protection is provided, the L AQM SHOULD ignore the packet and forward it unchanged, meaning it should not calculate whether to apply congestion notification and it should neither drop nor CE-mark the packet (for instance, the operator might classify EF traffic that is unresponsive to drop into the L queue, alongside responsive L4S-ECN traffic) o if separate queue protection is not provided, the L AQM @@ -860,24 +861,26 @@ - the C AQM SHOULD apply CE-marking using the coupled AQM probability p_CL (= k*p'). The above requirements are worded as "SHOULDs", because operator- specific classifiers are for flexibility, by definition. Therefore, alternative actions might be appropriate in the operator's specific circumstances. An example would be where the operator knows that certain legacy traffic marked with one codepoint actually has a congestion response associated with another codepoint. - If the DualQ Coupled AQM has detected overload, it MUST begin using - Classic drop, and continue until the overload episode has subsided. - Introducing drop if ECN marking is persistently high is required by - Section 7 of [RFC3168] and Section 4.2.1 of [RFC7567]. + If the DualQ Coupled AQM has detected overload, it MUST introduce + Classic drop to both types of ECN-capable traffic until the overload + episode has subsided. Introducing drop if ECN marking is + persistently high is recommended by Section 7 of the ECN + specification [RFC3168] and Section 4.2.1 of the AQM + Recommendations [RFC7567]. 2.5.2. Management Requirements 2.5.2.1. Configuration By default, a DualQ Coupled AQM SHOULD NOT need any configuration for use at a bottleneck on the public Internet [RFC7567]. The following parameters MAY be operator-configurable, e.g. to tune for non- Internet settings: @@ -920,21 +923,21 @@ * Coupling factor, k (see Appendix C.2); * A limit to the conditional priority of L4S. This is scheduler- dependent, but it SHOULD be expressed as a relation between the max delay of a C packet and an L packet. For example: - for a WRR scheduler a weight ratio between L and C of w:1 means that the maximum delay to a C packet is w times that of an L packet. - - for a time-shifted FIFO (TS-FIFO) scheduler (see Section 4.1.1) + - for a time-shifted FIFO (TS-FIFO) scheduler (see Section 4.2.2) a time-shift of tshift means that the maximum delay to a C packet is tshift greater than that of an L packet. tshift could be expressed as a multiple of the typical RTT rather than as an absolute delay. * The maximum Classic ECN marking probability, p_Cmax, before introducing drop. 2.5.2.2. Monitoring @@ -1001,137 +1004,241 @@ The descriptions of specific DualQ Coupled AQM algorithms in the appendices cover scaling of their configuration parameters, e.g. with respect to RTT and sampling frequency. 3. IANA Considerations (to be removed by RFC Editor) This specification contains no IANA considerations. 4. Security Considerations -4.1. Overload Handling +4.1. Low Delay without Requiring Per-Flow Processing - Where the interests of users or flows might conflict, it could be - necessary to police traffic to isolate any harm to the performance of - individual flows. However it is hard to avoid unintended side- - effects with policing, and in a trusted environment policing is not - necessary. Therefore per-flow policing - (e.g. [I-D.briscoe-docsis-q-protection]) needs to be separable from a - basic AQM, as an option under policy control. + The L4S architecture [I-D.ietf-tsvwg-l4s-arch] compares the DualQ and + per-flow-queuing (FQ) approaches to L4S. The privacy considerations + section in that document motivates the DualQ on the grounds that + users who want to encrypt application flow identifiers, e.g. in IPSec + or other encrypted VPN tunnels, don't have to sacrifice low delay + ([RFC8404] encourages avoidance of such privacy compromises). - However, a basic DualQ AQM does at least need to handle overload. A - useful objective would be for the overload behaviour of the DualQ AQM - to be at least no worse than a single queue AQM. However, a trade- - off needs to be made between complexity and the risk of either - traffic class harming the other. In each of the following three - subsections, an overload issue specific to the DualQ is described, - followed by proposed solution(s). + The security considerations section of the L4S architecture also + includes subsections on policing of relative flow-rates (section 8.1) + and on policing of flows that cause excessive queuing delay (section + 8.2). It explains that the interests of users do not collide in the + same way for delay as they do for bandwidth. For someone to get more + of the bandwidth of a shared link, someone else necessarily gets less + (a 'zero-sum game'), whereas queuing delay can be reduced for + everyone, without any need for someone else to lose out. It also + explains that, on the current Internet, scheduling usually enforces + separation between 'sites' (e.g. households, businesses or mobile + users), but it is not common to need to schedule or police individual + application flows. - Under overload the higher priority L4S service will have to sacrifice - some aspect of its performance. Alternative solutions are provided - below that each relax a different factor: e.g. throughput, delay, - drop. These choices need to be made either by the developer or by - operator policy, rather than by the IETF. + By the above arguments, per-flow policing might not be necessary and + in trusted environments it is certainly unlikely to be needed. + Therefore, because it is hard to avoid complexity and unintended + side-effects with per-flow policing, it needs to be separable from a + basic AQM, as an option, under policy control. On this basis, the + DualQ Coupled AQM provides low delay without prejudging the question + of per-flow policing. -4.1.1. Avoiding Classic Starvation: Sacrifice L4S Throughput or Delay? + Nonetheless, the interests of users or flows might conflict, e.g. in + case of accident or malice. Then per-flow control could be + necessary. If flow-rate control is needed, it can be provided as a + modular addition to a DualQ. And similarly, if protection against + excessive queue delay is needed, a per-flow queue protection option + can be added to a DualQ (e.g. [I-D.briscoe-docsis-q-protection]). + +4.2. Handling Unresponsive Flows and Overload + + In the absence of any per-flow control, it is important that the + basic DualQ Coupled AQM gives unresponsive flows no more throughput + advantage than a single-queue AQM would, and that it at least handles + overload situations. Overload means that incoming load significantly + or persistently exceeds output capacity, but it is not intended to be + a precise term -- significant and persistent are matters of degree. + + A trade-off needs to be made between complexity and the risk of + either traffic class harming the other. In overloaded conditions the + higher priority L4S service will have to sacrifice some aspect of its + performance. Depending on the degree of overload, alternative + solutions may relax a different factor: e.g. throughput, delay, drop. + These choices need to be made either by the developer or by operator + policy, rather than by the IETF. Subsequent subsections discuss + aspects relating to handling of different degrees of overload: + + * Unresponsive flows (L and/or C) but not overloaded, i.e. the sum + of unresponsive load before adding any responsive traffic is below + capacity; + + This case is handled by the regular Coupled DualQ (Section 2.1) + but not discussed there. So below, Section 4.2.1 explains the + design goal, and how it is achieved in practice; + + * Unresponsive flows (L and/or C) causing persistent overload, + i.e. the sum of unresponsive load even before adding any + responsive traffic persistently exceeds capacity; + + This case is not covered by the regular Coupled DualQ mechanism + (Section 2.1) but the last para in Section 2.5.1.1 sets out a + requirement to handle the case where ECN-capable traffic could + starve non-ECN-capable traffic. Section 4.2.3 below discusses + the general options and gives specific examples. + + * Short-term overload that lies between the 'not overloaded' and + 'persistently overloaded' cases. + + For the period before overload is deemed persistent, + Section 4.2.2 discusses options for more immediate mechanisms + at the scheduler timescale. These prevent short-term + starvation of the C queue by making the priority of the L queue + conditional, as required in Section 2.5.1. + +4.2.1. Unresponsive Traffic without Overload + + When one or more L flows and/or C flows are unresponsive, but their + total load is within the link capacity so that they do not saturate + the coupled marking (below 100%), the goal of a DualQ AQM is to + behave no worse than a single-queue AQM. + + Tests have shown that this is indeed the case with no additional + mechanism beyond the regular Coupled DualQ of Section 2.1 (see the + results of 'overload experiments' in [DCttH19]). Perhaps counter- + intuitively, whether the unresponsive flow classifies itself into the + L or the C queue, the DualQ system behaves as if it has subtracted + from the overall link capacity. Then, the coupling shares out the + remaining capacity between any competing responsive flows (in either + queue). See also Section 4.2.2, which discusses scheduler-specific + details. + +4.2.2. Avoiding Short-Term Classic Starvation: Sacrifice L4S Throughput + or Delay? Priority of L4S is required to be conditional (see Section 2.4 & - Section 2.5.1) to avoid total starvation of Classic by heavy L4S - traffic. This raises the question of whether to sacrifice L4S - throughput or L4S delay (or some other policy) to mitigate starvation - of Classic: + Section 2.5.1) to avoid short-term starvation of Classic. Otherwise, + as explained in Section 2.4, even a lone responsive L4S flow could + temporarily block a small finite set of C packets (e.g. an initial + window or DNS request). The blockage would only be brief, but it + could be longer for certain AQM implementations that can only + increase the congestion signal coupled from the C queue when C + packets are actually being dequeued. There is then the question of + whether to sacrifice L4S throughput or L4S delay (or some other + policy) to make the priority conditional: Sacrifice L4S throughput: By using weighted round robin as the conditional priority scheduler, the L4S service can sacrifice some throughput during overload. This can either be thought of as guaranteeing a minimum throughput service for Classic traffic, or as guaranteeing a maximum delay for a packet at the head of the Classic queue. + Cautionary note: a WRR scheduler can only guarantee Classic + throughput if Classic sources are sending enough to use it -- + congestion signals can undermine scheduling because they determine + how much responsive traffic of each class arrives for scheduling + in the first place. This is why scheduling is only relied on to + handle short-term starvation; until congestion signals build up + and the sources react. Even during long-term overload (discussed + more fully in Section 4.2.3), it's pragmatic to discard packets + from both queues, which again thins the traffic before it reaches + the scheduler. This is because a scheduler cannot be relied on to + handle long-term overload since the right scheduler weight cannot + be known for every scenario. + The scheduling weight of the Classic queue should be small - (e.g. 1/16). Then, in most traffic scenarios the scheduler will - not interfere and it will not need to - the coupling mechanism and - the end-systems will share out the capacity across both queues as - if it were a single pool. However, because the congestion - coupling only applies in one direction (from C to L), if L4S - traffic is over-aggressive or unresponsive, the scheduler weight - for Classic traffic will at least be large enough to ensure it - does not starve. + (e.g. 1/16). In most traffic scenarios the scheduler will not + interfere and it will not need to, because the coupling mechanism + and the end-systems will determine the share of capacity across + both queues as if it were a single pool. However, if L4S traffic + is over-aggressive or unresponsive, the scheduler weight for + Classic traffic will at least be large enough to ensure it does + not starve in the short-term. - In cases where the ratio of L4S to Classic flows (e.g. 19:1) is - greater than the ratio of their scheduler weights (e.g. 15:1), the - L4S flows will get less than an equal share of the capacity, but - only slightly. For instance, with the example numbers given, each - L4S flow will get (15/16)/19 = 4.9% when ideally each would get + Although WRR scheduling is only expected to address short-term + overload, there are (somewhat rare) cases when WRR has an effect + on capacity shares over longer time-scales. But its effect is + minor, and it certainly does no harm. Specifically, in cases + where the ratio of L4S to Classic flows (e.g. 19:1) is greater + than the ratio of their scheduler weights (e.g. 15:1), the L4S + flows will get less than an equal share of the capacity, but only + slightly. For instance, with the example numbers given, each L4S + flow will get (15/16)/19 = 4.9% when ideally each would get 1/20=5%. In the rather specific case of an unresponsive flow taking up just less than the capacity set aside for L4S (e.g. 14/16 in the above example), using WRR could significantly reduce the capacity left for any responsive L4S flows. The scheduling weight of the Classic queue should not be too small, otherwise a C packet at the head of the queue could be excessively delayed by a continually busy L queue. For instance if the Classic weight is 1/16, the maximum that a Classic packet at the head of the queue can be delayed by L traffic is the serialization delay of 15 MTU-sized packets. - Sacrifice L4S Delay: To control milder overload of responsive - traffic, particularly when close to the maximum congestion signal, - the operator could choose to control overload of the Classic queue - by allowing some delay to 'leak' across to the L4S queue. The - scheduler can be made to behave like a single First-In First-Out - (FIFO) queue with different service times by implementing a very - simple conditional priority scheduler that could be called a - "time-shifted FIFO" (see the Modifier Earliest Deadline First - (MEDF) scheduler of [MEDF]). This scheduler adds tshift to the - queue delay of the next L4S packet, before comparing it with the - queue delay of the next Classic packet, then it selects the packet - with the greater adjusted queue delay. Under regular conditions, - this time-shifted FIFO scheduler behaves just like a strict - priority scheduler. But under moderate or high overload it - prevents starvation of the Classic queue, because the time-shift - (tshift) defines the maximum extra queuing delay of Classic - packets relative to L4S. + Sacrifice L4S Delay: The operator could choose to control overload + of the Classic queue by allowing some delay to 'leak' across to + the L4S queue. The scheduler can be made to behave like a single + First-In First-Out (FIFO) queue with different service times by + implementing a very simple conditional priority scheduler that + could be called a "time-shifted FIFO" (see the Modifier Earliest + Deadline First (MEDF) scheduler [MEDF]). This scheduler adds + tshift to the queue delay of the next L4S packet, before comparing + it with the queue delay of the next Classic packet, then it + selects the packet with the greater adjusted queue delay. + + Under regular conditions, this time-shifted FIFO scheduler behaves + just like a strict priority scheduler. But under moderate or high + overload it prevents starvation of the Classic queue, because the + time-shift (tshift) defines the maximum extra queuing delay of + Classic packets relative to L4S. This would control milder + overload of responsive traffic by introducing delay to defer + invoking the overload mechanisms in Section 4.2.3, particularly + when close to the maximum congestion signal. The example implementations in Appendix A and Appendix B could both be implemented with either policy. -4.1.2. Congestion Signal Saturation: Introduce L4S Drop or Delay? +4.2.3. L4S ECN Saturation: Introduce Drop or Delay? - To keep the throughput of both L4S and Classic flows roughly equal - over the full load range, a different control strategy needs to be - defined above the point where one AQM first saturates to a - probability of 100% leaving no room to push back the load any harder. - If k>1, L4S will saturate first, even though saturation could be - caused by unresponsive traffic in either queue. + This section concerns persistent overload caused by unresponsive L + and/or C flows. To keep the throughput of both L4S and Classic flows + roughly equal over the full load range, a different control strategy + needs to be defined above the point where the L4S AQM persistently + saturates to an ECN marking probability of 100% leaving no room to + push back the load any harder. L4S ECN marking will saturate first + (assuming the coupling factor k>1), even though saturation could be + caused by the sum of unresponsive traffic in either or both queues + exceeding the link capacity. The term 'unresponsive' includes cases where a flow becomes temporarily unresponsive, for instance, a real-time flow that takes a while to adapt its rate in response to congestion, or a standard Reno flow that is normally responsive, but above a certain congestion level it will not be able to reduce its congestion window below the allowed minimum of 2 segments [RFC5681], effectively becoming unresponsive. (Note that L4S traffic ought to remain responsive - below a window of 2 segments (see [I-D.ietf-tsvwg-ecn-l4s-id]). + below a window of 2 segments (see the L4S + requirements [I-D.ietf-tsvwg-ecn-l4s-id]). Saturation raises the question of whether to relieve congestion by introducing some drop into the L4S queue or by allowing delay to grow - in both queues (which could eventually lead to tail drop too): + in both queues (which could eventually lead to drop due to buffer + exhaustion anyway): - Drop on Saturation: Saturation can be avoided by setting a maximum - threshold for L4S ECN marking (assuming k>1) before saturation - starts to make the flow rates of the different traffic types - diverge. Above that the drop probability of Classic traffic is - applied to all packets of all traffic types. Then experiments - have shown that queueing delay can be kept at the target in any - overload situation, including with unresponsive traffic, and no - further measures are required [DualQ-Test]. + Drop on Saturation: Persistent saturation can be defined by a + maximum threshold for coupled L4S ECN marking (assuming k>1) + before saturation starts to make the flow rates of the different + traffic types diverge. Above that, the drop probability of + Classic traffic is applied to all packets of all traffic types. + Then experiments have shown that queueing delay can be kept at the + target in any overload situation, including with unresponsive + traffic, and no further measures are required (Section 4.2.3.1). Delay on Saturation: When L4S marking saturates, instead of introducing L4S drop, the drop and marking probabilities of both queues could be capped. Beyond that, delay will grow either solely in the queue with unresponsive traffic (if WRR is used), or in both queues (if time-shifted FIFO is used). In either case, the higher delay ought to control temporary high congestion. If the overload is more persistent, eventually the combined DualQ will overflow and tail drop will control congestion. @@ -1139,57 +1246,67 @@ saturation" policy. The DOCSIS specification of a DualQ Coupled AQM [DOCSIS3.1] also implements the 'drop on saturation' policy with a very shallow L buffer. However, the addition of DOCSIS per-flow Queue Protection [I-D.briscoe-docsis-q-protection] turns this into 'delay on saturation' by redirecting some packets of the flow(s) most responsible for L queue overload into the C queue, which has a higher delay target. If overload continues, this again becomes 'drop on saturation' as the level of drop in the C queue rises to maintain the target delay of the C queue. -4.1.3. Protecting against Unresponsive ECN-Capable Traffic +4.2.3.1. Protecting against Overload by Unresponsive ECN-Capable + Traffic - Unresponsive traffic has a greater advantage if it is also ECN- - capable. The advantage is undetectable at normal low levels of drop/ - marking, but it becomes significant with the higher levels of drop/ - marking typical during overload. This is an issue whether the ECN- - capable traffic is L4S or Classic. + Without a specific overload mechanism, unresponsive traffic would + have a greater advantage if it were also ECN-capable. The advantage + is undetectable at normal low levels of marking. However, it would + become significant with the higher levels of marking typical during + overload, when it could evade a significant degree of drop. This is + an issue whether the ECN-capable traffic is L4S or Classic. This raises the question of whether and when to introduce drop of - ECN-capable traffic, as required by both Section 7 of [RFC3168] and - Section 4.2.1 of [RFC7567]. + ECN-capable traffic, as required by both Section 7 of the ECN + spec [RFC3168] and Section 4.2.1 of the AQM + recommendations [RFC7567]. - Experiments with the DualPI2 AQM (Appendix A) have shown that - introducing 'drop on saturation' at 100% L4S marking addresses this - problem with unresponsive ECN as well as addressing the saturation - problem. It leaves only a small range of congestion levels where - unresponsive traffic gains any advantage from using the ECN - capability (relative to being unresponsive without ECN), and the - advantage is hardly detectable [DualQ-Test]. + As an example, experiments with the DualPI2 AQM (Appendix A) have + shown that introducing 'drop on saturation' at 100% coupled L4S + marking addresses this problem with unresponsive ECN as well as + addressing the saturation problem. At saturation, DualPI2 switches + into overload mode, where the base AQM is driven by the max delay of + both queues and it introduces probabilistic drop to both queues + equally. It leaves only a small range of congestion levels just + below saturation where unresponsive traffic gains any advantage from + using the ECN capability (relative to being unresponsive without + ECN), and the advantage is hardly detectable (see [DualQ-Test] and + section IV-E of [DCttH19]. Also overload with an unresponsive ECT(1) + flow gets no more bandwidth advantage than with ECT(0). 5. Acknowledgements Thanks to Anil Agarwal, Sowmini Varadhan's, Gabi Bracha, Nicolas Kuhn, Greg Skinner, Tom Henderson, David Pullen, Mirja Kuehlewind, Gorry Fairhurst, Pete Heist and Ermin Sakic for detailed review comments particularly of the appendices and suggestions on how to make the explanations clearer. Thanks also to Tom Henderson for insights on the choice of schedulers and queue delay measurement techniques. The early contributions of Koen De Schepper, Bob Briscoe, Olga Bondarenko and Inton Tsang were part-funded by the European Community under its Seventh Framework Programme through the Reducing Internet - Transport Latency (RITE) project (ICT-317700). Bob Briscoe's - contribution was also part-funded by the Comcast Innovation Fund and - the Research Council of Norway through the TimeIn project. The views - expressed here are solely those of the authors. + Transport Latency (RITE) project (ICT-317700). Contributions of Koen + De Schepper and Olivier Tilmans were also part-funded by the 5Growth + and DAEMON EU H2020 projects. Bob Briscoe's contribution was also + part-funded by the Comcast Innovation Fund and the Research Council + of Norway through the TimeIn project. The views expressed here are + solely those of the authors. 6. Contributors The following contributed implementations and evaluations that validated and helped to improve this specification: Olga Albisser of Simula Research Lab, Norway (Olga Bondarenko during early drafts) implemented the prototype DualPI2 AQM for Linux with Koen De Schepper and conducted extensive evaluations as well as implementing the live performance @@ -1210,23 +1327,23 @@ implementations were tested. 7. References 7.1. Normative References [I-D.ietf-tsvwg-ecn-l4s-id] Schepper, K. D. and B. Briscoe, "Explicit Congestion Notification (ECN) Protocol for Very Low Queuing Delay (L4S)", Work in Progress, Internet-Draft, draft-ietf- - tsvwg-ecn-l4s-id-23, 24 December 2021, + tsvwg-ecn-l4s-id-24, 1 February 2022, . + ecn-l4s-id-24>. [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997, . [RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition of Explicit Congestion Notification (ECN) to IP", RFC 3168, DOI 10.17487/RFC3168, September 2001, . @@ -1249,20 +1366,24 @@ Queue- based Active Queue Management Algorithms", Proc. Int'l Soc. for Optical Engineering (SPIE) 4866:35--46 DOI: 10.1117/12.473021, 2002, . [ARED01] Floyd, S., Gummadi, R., and S. Shenker, "Adaptive RED: An Algorithm for Increasing the Robustness of RED's Active Queue Management", ACIRI Technical Report , August 2001, . + [BBRv2] Cardwell, N., "BRTCP BBR v2 Alpha/Preview Release", github + repository; Linux congestion control module, + . + [CCcensus19] Mishra, A., Sun, X., Jain, A., Pande, S., Joshi, R., and B. Leong, "The Great Internet TCP Congestion Control Census", Proc. ACM on Measurement and Analysis of Computing Systems 3(3), December 2019, . [CoDel] Nichols, K. and V. Jacobson, "Controlling Queue Delay", ACM Queue 10(5), May 2012, . @@ -1330,23 +1451,23 @@ Jacobson, "BBR Congestion Control", Work in Progress, Internet-Draft, draft-cardwell-iccrg-bbr-congestion- control-01, 7 November 2021, . [I-D.ietf-tsvwg-l4s-arch] Briscoe, B., Schepper, K. D., Bagnulo, M., and G. White, "Low Latency, Low Loss, Scalable Throughput (L4S) Internet Service: Architecture", Work in Progress, Internet-Draft, - draft-ietf-tsvwg-l4s-arch-15, 24 December 2021, + draft-ietf-tsvwg-l4s-arch-16, 1 February 2022, . + l4s-arch-16>. [L4Sdemo16] Bondarenko, O., De Schepper, K., Tsang, I., and B. Briscoe, "Ultra-Low Delay for All: Live Experience, Live Analysis", Proc. MMSYS'16 pp33:1--33:4, May 2016, . [L4S_5G] Willars, P., Wittenmark, E., Ronkainen, H., Östberg, C., @@ -1472,20 +1593,25 @@ [RFC8298] Johansson, I. and Z. Sarker, "Self-Clocked Rate Adaptation for Multimedia", RFC 8298, DOI 10.17487/RFC8298, December 2017, . [RFC8312] Rhee, I., Xu, L., Ha, S., Zimmermann, A., Eggert, L., and R. Scheffenegger, "CUBIC for Fast Long-Distance Networks", RFC 8312, DOI 10.17487/RFC8312, February 2018, . + [RFC8404] Moriarty, K., Ed. and A. Morton, Ed., "Effects of + Pervasive Encryption on Operators", RFC 8404, + DOI 10.17487/RFC8404, July 2018, + . + [SCReAM] Johansson, I., "SCReAM", github repository; , . [SigQ-Dyn] Briscoe, B., "Rapid Signalling of Queue Dynamics", Technical Report TR-BB-2017-001 arXiv:1904.07044 [cs.NI], September 2017, . Appendix A. Example DualQ Coupled PI2 Algorithm @@ -1671,21 +1797,21 @@ If limit is not exceeded, the packet is timestamped in line 4. This assumes that queue delay is measured using the sojourn time technique (see Note a for alternatives). At lines 5-9, the packet is classified and enqueued to the Classic or L4S queue dependent on the least significant bit of the ECN field in the IP header (line 6). Packets with a codepoint having an LSB of 0 (Not-ECT and ECT(0)) will be enqueued in the Classic queue. Otherwise, ECT(1) and CE packets will be enqueued in the L4S queue. Optional additional packet classification flexibility is omitted for - brevity (see [I-D.ietf-tsvwg-ecn-l4s-id]). + brevity (see the L4S ECN protocol [I-D.ietf-tsvwg-ecn-l4s-id]). The dequeue pseudocode (Figure 4) is repeatedly called whenever the lower layer is ready to forward a packet. It schedules one packet for dequeuing (or zero if the queue is empty) then returns control to the caller, so that it does not block while that packet is being forwarded. While making this dequeue decision, it also makes the necessary AQM decisions on dropping or marking. The alternative of applying the AQMs at enqueue would shift some processing from the critical time when each packet is dequeued. However, it would also add a whole queue of delay to the control signals, making the control @@ -1792,21 +1918,21 @@ Integral (PI) controller that alters p' dependent on: a) the error between the current queuing delay (curq) and the target queuing delay, 'target'; and b) the change in queuing delay since the last sample. The name 'PI' represents the fact that the second factor (how fast the queue is growing) is _P_roportional to load while the first is the _I_ntegral of the load (so it removes any standing queue in excess of the target). The target parameter can be set based on local knowledge, but the aim is for the default to be a good compromise for anywhere in the - intended deployment environment---the public Internet. According to + intended deployment environment -- the public Internet. According to [PI2param], the target queuing delay on line 9 of Figure 2 is related to the typical base RTT worldwide, RTT_typ, by two factors: target = RTT_typ * g * f. Below we summarize the rationale behind these factors and introduce a further adjustment. The two factors ensure that, in a large proportion of cases (say 90%), the sawtooth variations in RTT of a single flow will fit within the buffer without underutilizing the link. Frankly, these factors are educated guesses, but with the emphasis closer to 'educated' than to 'guess' (see [PI2param] for full background): @@ -1822,27 +1948,27 @@ significant outlier and, on reflection, the experimental technique seemed inappropriate to the CDN market in China. * g is taken as 0.38. The factor g is a geometry factor that characterizes the shape of the sawteeth of prevalent Classic congestion controllers. The geometry factor is the fraction of the amplitude of the sawtooth variability in queue delay that lies below the AQM's target. For instance, at low bit rate, the geometry factor of standard Reno is 0.5, but at higher rates it tends to just under 1. According to the census of congestion - controllers conducted by Mishra _et al_ in Jul-Oct 2019 - [CCcensus19], most Classic TCP traffic uses Cubic. And, according - to the analysis in [PI2param], if running over a PI2 AQM, a large - proportion of this Cubic traffic would be in its Reno-Friendly - mode, which has a geometry factor of ~0.39 (all known - implementations). The rest of the Cubic traffic would be in true - Cubic mode, which has a geometry factor of ~0.36. Without + controllers conducted by Mishra _et al_ in Jul-Oct + 2019 [CCcensus19], most Classic TCP traffic uses Cubic. And, + according to the analysis in [PI2param], if running over a PI2 + AQM, a large proportion of this Cubic traffic would be in its + Reno-Friendly mode, which has a geometry factor of ~0.39 (all + known implementations). The rest of the Cubic traffic would be in + true Cubic mode, which has a geometry factor of ~0.36. Without modelling the sawtooth profiles from all the other less prevalent congestion controllers, we estimate a 7:3 weighted average of these two, resulting in an average geometry factor of 0.38. * f is taken as 2. The factor f is a safety factor that increases the target queue to allow for the distribution of RTT_typ around its mean. Otherwise the target queue would only avoid underutilization for those users below the mean. It also provides a safety margin for the proportion of paths in use that span beyond the distance between a user and their local CDN. Currently @@ -1879,21 +2005,21 @@ alpha depends on Tupdate (see line 13 of the initialization function in Figure 2). It is best to update p' as frequently as possible, but Tupdate will probably be constrained by hardware performance. As shown in line 13, the update interval should be frequent enough to update at least once in the time taken for the target queue to drain ('target') as long as it updates at least three times per maximum RTT. Tupdate defaults to 16 ms in the reference Linux implementation because it has to be rounded to a multiple of 4 ms. For link rates from 4 to 200 Mb/s and a maximum RTT of 100ms, it has been verified through extensive testing that Tupdate=16ms (as also recommended in - [RFC8033]) is sufficient. + the PIE spec [RFC8033]) is sufficient. The choice of alpha and beta also determines the AQM's stable operating range. The AQM ought to change p' as fast as possible in response to changes in load without over-compensating and therefore causing oscillations in the queue. Therefore, the values of alpha and beta also depend on the RTT of the expected worst-case flow (RTT_max). The maximum RTT of a PI controller (RTT_max in line 10 of Figure 2) is not an absolute maximum, but more instability (more queue @@ -1962,21 +2088,22 @@ a. The drain rate of the queue can vary if it is scheduled relative to other queues, or to cater for fluctuations in a wireless medium. To auto-adjust to changes in drain rate, the queue needs to be measured in time, not bytes or packets [AQMmetrics], [CoDel]. Queuing delay could be measured directly by storing a per-packet time-stamp as each packet is enqueued, and subtracting this from the system time when the packet is dequeued. If time- stamping is not easy to introduce with certain hardware, queuing delay could be predicted indirectly by dividing the size of the queue by the predicted departure rate, which might be known - precisely for some link technologies (see for example [RFC8034]). + precisely for some link technologies (see for example in DOCSIS + PIE [RFC8034]). b. Line 2 of the dualpi2_enqueue() function (Figure 3) assumes an implementation where lq and cq share common buffer memory. An alternative implementation could use separate buffers for each queue, in which case the arriving packet would have to be classified first to determine which buffer to check for available space. The choice is a trade off; a shared buffer can use less memory whereas separate buffers isolate the L4S queue from tail- drop due to large bursts of Classic traffic (e.g. a Classic Reno TCP during slow-start over a long RTT). @@ -2060,41 +2187,43 @@ coexists with 'Classic' Reno congestion control. So it is correct that, when the L4S queue drops packets, it drops them proportional to p'^2, as if they are Classic packets. The two queues each test for overload in lines 4b and 12b of the dequeue function (Figure 7). Lines 8c to 8g drop L4S packets with probability p'^2. Lines 8h to 8i mark the remaining packets with probability p_CL. Given p_Lmax = 1, all remaining packets will be marked because, to have reached the else block at line 8b, p_CL >= 1. - Lines 2c to 2d in the core PI algorithm (Figure 8) deal with overload - of the L4S queue when there is no Classic traffic. This is + Line 2a in the core PI algorithm (Figure 8) deals with overload of + the L4S queue when there is little or no Classic traffic. This is necessary, because the core PI algorithm maintains the appropriate drop probability to regulate overload, but it depends on the length - of the Classic queue. If there is no Classic queue the naive PI - update function in Figure 6 would drop nothing, even if the L4S queue - were overloaded - so tail drop would have to take over (lines 2 and 3 - of Figure 3). + of the Classic queue. If there is little or no Classic queue the + naive PI update function in Figure 6 would drop nothing, even if the + L4S queue were overloaded - so tail drop would have to take over + (lines 2 and 3 of Figure 3). - Instead, the test at line 2a of the full PI update function in - Figure 8 keeps delay on target using drop. If the test at line 2a of - Figure 8 finds that the Classic queue is empty, line 2d measures the - current queue delay using the L4S queue instead. While the L4S queue - is not overloaded, its delay will always be tiny compared to the - target Classic queue delay. So p_CL will be driven to zero, and the - L4S queue will naturally be governed solely by p'_L from the native - L4S AQM (lines 5 and 6 of the dequeue algorithm in Figure 7). But, - if unresponsive L4S source(s) cause overload, the DualQ transitions - smoothly to L4S marking based on the PI algorithm. If overload - increases further, it naturally transitions from marking to dropping - by the mechanism already described. + Instead, line 2a of the full PI update function in Figure 8 ensures + that the base PI AQM in line 3 is driven by whichever of the two + queue delays is greater, but line 3 still always uses the same + Classic target (default 15 ms). If L queue delay is greater just + because there is little or no Classic traffic, normally it will still + be well below the base AQM target. This is because L4S traffic is + also governed by the shallow threshold of its own native AQM (lines 5 + and 6 of the dequeue algorithm in Figure 7). So the base AQM will be + driven to zero and not contribute. However, if the L queue is + overloaded by traffic that is unresponsive to its marking, the max() + in line 2 enables the L queue to smoothly take over driving the base + AQM into overload mode even if there is little or no Classic traffic. + Then the base AQM will keep the L queue to the Classic target + (default 15 ms) by shedding L packets. 1: dualpi2_dequeue(lq, cq, pkt) { % Couples L4S & Classic queues 2: while ( lq.byt() + cq.byt() > 0 ) { 3: if ( scheduler() == lq ) { 4a: lq.dequeue(pkt) % L4S scheduled 4b: if ( p_CL < p_Lmax ) { % Check for overload saturation 5a: if (lq.len()>Th_len) % >1 packet queued 5b: p'_L = laqm(lq.time()) % Native LAQM 5c: else 5d: p'_L = 0 % Suppress marking 1 pkt queue @@ -2122,45 +2251,42 @@ 18: } 19: return(pkt) % return the packet and stop 20: } 21: return(NULL) % no packet to dequeue 22: } Figure 7: Example Dequeue Pseudocode for DualQ Coupled PI2 AQM (Including Code for Edge-Cases) 1: dualpi2_update(lq, cq) { % Update p' every Tupdate - 2a: if ( cq.byt() > 0 ) - 2b: curq = cq.time() %use queuing time of first-in Classic packet - 2c: else % Classic queue empty - 2d: curq = lq.time() % use queuing time of first-in L4S packet + 2a: curq = max(cq.time(), lq.time()) % use greatest queuing time 3: p' = p' + alpha * (curq - target) + beta * (curq - prevq) 4: p_CL = p' * k % Coupled L4S prob = base prob * coupling factor 5: p_C = p'^2 % Classic prob = (base prob)^2 6: prevq = curq 7: } - Figure 8: Example PI-Update Pseudocode for DualQ Coupled PI2 AQM (Including Overload Code) The choice of scheduler technology is critical to overload protection - (see Section 4.1). + (see Section 4.2.2). * A well-understood weighted scheduler such as weighted round robin (WRR) is recommended. As long as the scheduler weight for Classic is small (e.g. 1/16), its exact value is unimportant because it does not normally determine capacity shares. The weight is only important to prevent unresponsive L4S traffic starving Classic - traffic. This is because capacity sharing between the queues is - normally determined by the coupled congestion signal, which - overrides the scheduler, by making L4S sources leave roughly equal - per-flow capacity available for Classic flows. + traffic in the short term (see Section 4.2.2). This is because + capacity sharing between the queues is normally determined by the + coupled congestion signal, which overrides the scheduler, by + making L4S sources leave roughly equal per-flow capacity available + for Classic flows. * Alternatively, a time-shifted FIFO (TS-FIFO) could be used. It works by selecting the head packet that has waited the longest, biased against the Classic traffic by a time-shift of tshift. To implement time-shifted FIFO, the scheduler() function in line 3 of the dequeue code would simply be implemented as the scheduler() function at the bottom of Figure 10 in Appendix B. For the public Internet a good value for tshift is 50ms. For private networks with smaller diameter, about 4*target would be reasonable. TS- FIFO is a very simple scheduler, but complexity might need to be @@ -2195,22 +2321,22 @@ known; if it is not, a moving average of the delay per unit backlog can be maintained. This delay consists of serialization as well as media acquisition for shared media. So the details will depend strongly on the specific link technology, This approach should be less sensitive to timing errors and cost less in operations and memory than the otherwise equivalent 'scaled sojourn time' metric, which is the sojourn time of a packet scaled by the ratio of the queue sizes when the packet departed and arrived [SigQ-Dyn]. - * A strict priority scheduler would be inappropriate, because it - would starve Classic if L4S was overloaded. + * A strict priority scheduler would be inappropriate as discussed in + Section 4.2.2. Appendix B. Example DualQ Coupled Curvy RED Algorithm As another example of a DualQ Coupled AQM algorithm, the pseudocode below gives the Curvy RED based algorithm. Although the AQM was designed to be efficient in integer arithmetic, to aid understanding it is first given using floating point arithmetic (Figure 10). Then, one possible optimization for integer arithmetic is given, also in pseudocode (Figure 11). To aid comparison, the line numbers are kept in step between the two by using letter suffixes where the longer @@ -2563,26 +2689,27 @@ flow rates depend not only on the congestion probability, but also on their end-to-end RTT (= base RTT + queue delay). The rates of Reno [RFC5681] flows competing over an AQM are roughly inversely proportional to their RTTs. Cubic exhibits similar RTT-dependence when in Reno-compatibility mode, but it is less RTT-dependent otherwise. Until the early experiments with the DualQ Coupled AQM, the importance of the reasonably large Classic queue in mitigating RTT- dependence when the base RTT is low had not been appreciated. - Appendix A.1.6 of [I-D.ietf-tsvwg-ecn-l4s-id] uses numerical examples - to explain why bloated buffers had concealed the RTT-dependence of - Classic congestion controls before that time. Then it explains why, - the more that queuing delays have reduced, the more that RTT- - dependence has surfaced as a potential starvation problem for long - RTT flows, when competing against very short RTT flows. + Appendix A.1.6 of the L4S ECN protocol [I-D.ietf-tsvwg-ecn-l4s-id] + uses numerical examples to explain why bloated buffers had concealed + the RTT-dependence of Classic congestion controls before that time. + Then it explains why, the more that queuing delays have reduced, the + more that RTT-dependence has surfaced as a potential starvation + problem for long RTT flows, when competing against very short RTT + flows. Given that congestion control on end-systems is voluntary, there is no reason why it has to be voluntarily RTT-dependent. The RTT- dependence of existing Classic traffic cannot be 'undeployed'. Therefore, [I-D.ietf-tsvwg-ecn-l4s-id] requires L4S congestion controls to be significantly less RTT-dependent than the standard Reno congestion control [RFC5681], at least at low RTT. Then RTT- dependence ought to be no worse than it is with appropriately sized Classic buffers. Following this approach means there is no need for network devices to address RTT-dependence, although there would be no @@ -2626,29 +2753,29 @@ On the Classic side, we consider Reno as the most sensitive and therefore worst-case Classic congestion control. We will also consider Cubic in its Reno-friendly mode ('CReno'), as the most prevalent congestion control, according to the references and analysis in [PI2param]. In either case, the Classic packet rate in steady state is given by the well-known square root formula for Reno congestion control: r_C = 1.22 / (R_C * p_C^0.5) (5) - On the L4S side, we consider the Prague congestion control - [I-D.briscoe-iccrg-prague-congestion-control] as the reference for - steady-state dependence on congestion. Prague conforms to the same - equation as DCTCP, but we do not use the equation derived in the - DCTCP paper, which is only appropriate for step marking. The coupled - marking, p_CL, is the appropriate one when considering throughput - equivalence with Classic flows. Unlike step marking, coupled - markings are inherently spaced out, so we use the formula for DCTCP - packet rate with probabilistic marking derived in Appendix A of + On the L4S side, we consider the Prague congestion + control [I-D.briscoe-iccrg-prague-congestion-control] as the + reference for steady-state dependence on congestion. Prague conforms + to the same equation as DCTCP, but we do not use the equation derived + in the DCTCP paper, which is only appropriate for step marking. The + coupled marking, p_CL, is the appropriate one when considering + throughput equivalence with Classic flows. Unlike step marking, + coupled markings are inherently spaced out, so we use the formula for + DCTCP packet rate with probabilistic marking derived in Appendix A of [PI2]. We use the equation without RTT-independence enabled, which will be explained later. r_L = 2 / (R_L * p_CL) (6) For packet rate equivalence, we equate the two packet rates and rearrange into the same form as Equation (1), so the two can be equated and simplified to produce a formula for a theoretical coupling factor, which we shall call k*: @@ -2668,23 +2795,23 @@ RTT-dependence is caused by window-based congestion control, so it ought to be reversed there, not in the network. Therefore, we use a fixed coupling factor in the network, and reduce RTT-dependence in L4S senders. We cannot expect Classic senders to all be updated to reduce their RTT-dependence. But solely addressing the problem in L4S senders at least makes RTT-dependence no worse - not just between L4S senders, but also between L4S and Classic senders. Traditionally, throughput equivalence has been defined for flows - under comparable conditions, including with the same base RTT - [RFC2914]. So if we assume the same base RTT, R_b, for comparable - flows, we can put both R_C and R_L in terms of R_b. + under comparable conditions, including with the same base + RTT [RFC2914]. So if we assume the same base RTT, R_b, for + comparable flows, we can put both R_C and R_L in terms of R_b. We can approximate the L4S RTT to be hardly greater than the base RTT, i.e. R_L ~= R_b. And we can replace R_C with (R_b + q_C), where the Classic queue, q_C, depends on the target queue delay that the operator has configured for the Classic AQM. Taking PI2 as an example Classic AQM, it seems that we could just take R_C = R_b + target (recommended 15 ms by default in Appendix A.1). However, target is roughly the queue depth reached by the tips of the sawteeth of a congestion control, not the average @@ -2732,24 +2859,24 @@ Internet: r_L / r_C = 2 (R_C * p_C^0.5) / 1.22 (R_L * p_CL) = (R_C * p_CL) / (1.22 * R_L * p_CL) = R_C / (1.22 * R_L) (10) As an example, we can then consider single competing CReno and Prague flows, by expressing both their RTTs in (10) in terms of their base RTTs, R_bC and R_bL. So R_C is replaced by equation (8) for CReno. And R_L is replaced by the max() function below, which represents the - effective RTT of the current Prague congestion control - [I-D.briscoe-iccrg-prague-congestion-control] in its (default) RTT- - independent mode, because it sets a floor to the effective RTT that - it uses for additive increase: + effective RTT of the current Prague congestion + control [I-D.briscoe-iccrg-prague-congestion-control] in its + (default) RTT-independent mode, because it sets a floor to the + effective RTT that it uses for additive increase: ~= 0.85 * (R_bC + target) / (1.22 * max(R_bL, R_typ)) ~= (R_bC + target) / (1.4 * max(R_bL, R_typ)) It can be seen that, for base RTTs below target (15 ms), both the numerator and the denominator plateau, which has the desired effect of limiting RTT-dependence. At the start of the above derivations, an explanation was promised for why the L4S throughput equation in equation (6) did not need to