--- 1/draft-ietf-tsvwg-aqm-dualq-coupled-16.txt 2021-10-04 07:13:42.125334738 -0700 +++ 2/draft-ietf-tsvwg-aqm-dualq-coupled-17.txt 2021-10-04 07:13:42.241337644 -0700 @@ -1,22 +1,22 @@ Transport Area working group (tsvwg) K. De Schepper Internet-Draft Nokia Bell Labs Intended status: Experimental B. Briscoe, Ed. -Expires: January 7, 2022 Independent +Expires: 7 April 2022 Independent G. White CableLabs - July 6, 2021 + 4 October 2021 DualQ Coupled AQMs for Low Latency, Low Loss and Scalable Throughput (L4S) - draft-ietf-tsvwg-aqm-dualq-coupled-16 + draft-ietf-tsvwg-aqm-dualq-coupled-17 Abstract The Low Latency Low Loss Scalable Throughput (L4S) architecture allows data flows over the public Internet to achieve consistent low queuing latency, generally zero congestion loss and scaling of per- flow throughput without the scaling problems of standard TCP Reno- friendly congestion controls. To achieve this, L4S data flows have to use one of the family of 'Scalable' congestion controls (TCP Prague and Data Center TCP are examples) and a form of Explicit @@ -51,80 +51,80 @@ Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet- Drafts is at https://datatracker.ietf.org/drafts/current/. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." - This Internet-Draft will expire on January 7, 2022. + This Internet-Draft will expire on 7 April 2022. Copyright Notice Copyright (c) 2021 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal - Provisions Relating to IETF Documents - (https://trustee.ietf.org/license-info) in effect on the date of - publication of this document. Please review these documents - carefully, as they describe your rights and restrictions with respect - to this document. Code Components extracted from this document must - include Simplified BSD License text as described in Section 4.e of - the Trust Legal Provisions and are provided without warranty as - described in the Simplified BSD License. + Provisions Relating to IETF Documents (https://trustee.ietf.org/ + license-info) in effect on the date of publication of this document. + Please review these documents carefully, as they describe your rights + and restrictions with respect to this document. Code Components + extracted from this document must include Simplified BSD License text + as described in Section 4.e of the Trust Legal Provisions and are + provided without warranty as described in the Simplified BSD License. Table of Contents 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 - 1.1. Outline of the Problem . . . . . . . . . . . . . . . . . 3 + 1.1. Outline of the Problem . . . . . . . . . . . . . . . . . 4 1.2. Scope . . . . . . . . . . . . . . . . . . . . . . . . . . 6 - 1.3. Terminology . . . . . . . . . . . . . . . . . . . . . . . 7 + 1.3. Terminology . . . . . . . . . . . . . . . . . . . . . . . 8 1.4. Features . . . . . . . . . . . . . . . . . . . . . . . . 9 - 2. DualQ Coupled AQM . . . . . . . . . . . . . . . . . . . . . . 10 + 2. DualQ Coupled AQM . . . . . . . . . . . . . . . . . . . . . . 11 2.1. Coupled AQM . . . . . . . . . . . . . . . . . . . . . . . 11 2.2. Dual Queue . . . . . . . . . . . . . . . . . . . . . . . 12 - 2.3. Traffic Classification . . . . . . . . . . . . . . . . . 12 + 2.3. Traffic Classification . . . . . . . . . . . . . . . . . 13 2.4. Overall DualQ Coupled AQM Structure . . . . . . . . . . . 13 - 2.5. Normative Requirements for a DualQ Coupled AQM . . . . . 16 - 2.5.1. Functional Requirements . . . . . . . . . . . . . . . 16 + 2.5. Normative Requirements for a DualQ Coupled AQM . . . . . 17 + 2.5.1. Functional Requirements . . . . . . . . . . . . . . . 17 2.5.1.1. Requirements in Unexpected Cases . . . . . . . . 18 2.5.2. Management Requirements . . . . . . . . . . . . . . . 19 2.5.2.1. Configuration . . . . . . . . . . . . . . . . . . 19 - 2.5.2.2. Monitoring . . . . . . . . . . . . . . . . . . . 20 + 2.5.2.2. Monitoring . . . . . . . . . . . . . . . . . . . 21 2.5.2.3. Anomaly Detection . . . . . . . . . . . . . . . . 21 - 2.5.2.4. Deployment, Coexistence and Scaling . . . . . . . 21 + 2.5.2.4. Deployment, Coexistence and Scaling . . . . . . . 22 3. IANA Considerations (to be removed by RFC Editor) . . . . . . 22 4. Security Considerations . . . . . . . . . . . . . . . . . . . 22 4.1. Overload Handling . . . . . . . . . . . . . . . . . . . . 22 4.1.1. Avoiding Classic Starvation: Sacrifice L4S Throughput - or Delay? . . . . . . . . . . . . . . . . . . . . . . 22 + or Delay? . . . . . . . . . . . . . . . . . . . . . . 23 4.1.2. Congestion Signal Saturation: Introduce L4S Drop or Delay? . . . . . . . . . . . . . . . . . . . . . . . 24 - 4.1.3. Protecting against Unresponsive ECN-Capable Traffic . 25 - 5. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 25 - 6. Contributors . . . . . . . . . . . . . . . . . . . . . . . . 25 + 4.1.3. Protecting against Unresponsive ECN-Capable + Traffic . . . . . . . . . . . . . . . . . . . . . . . 25 + 5. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 26 + 6. Contributors . . . . . . . . . . . . . . . . . . . . . . . . 26 7. References . . . . . . . . . . . . . . . . . . . . . . . . . 26 7.1. Normative References . . . . . . . . . . . . . . . . . . 26 - 7.2. Informative References . . . . . . . . . . . . . . . . . 26 + 7.2. Informative References . . . . . . . . . . . . . . . . . 27 Appendix A. Example DualQ Coupled PI2 Algorithm . . . . . . . . 32 - A.1. Pass #1: Core Concepts . . . . . . . . . . . . . . . . . 32 - A.2. Pass #2: Overload Details . . . . . . . . . . . . . . . . 42 - Appendix B. Example DualQ Coupled Curvy RED Algorithm . . . . . 46 - B.1. Curvy RED in Pseudocode . . . . . . . . . . . . . . . . . 46 - B.2. Efficient Implementation of Curvy RED . . . . . . . . . . 52 - Appendix C. Choice of Coupling Factor, k . . . . . . . . . . . . 54 - C.1. RTT-Dependence . . . . . . . . . . . . . . . . . . . . . 54 - C.2. Guidance on Controlling Throughput Equivalence . . . . . 55 - Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 56 + A.1. Pass #1: Core Concepts . . . . . . . . . . . . . . . . . 33 + A.2. Pass #2: Overload Details . . . . . . . . . . . . . . . . 43 + Appendix B. Example DualQ Coupled Curvy RED Algorithm . . . . . 47 + B.1. Curvy RED in Pseudocode . . . . . . . . . . . . . . . . . 47 + B.2. Efficient Implementation of Curvy RED . . . . . . . . . . 53 + Appendix C. Choice of Coupling Factor, k . . . . . . . . . . . . 55 + C.1. RTT-Dependence . . . . . . . . . . . . . . . . . . . . . 55 + C.2. Guidance on Controlling Throughput Equivalence . . . . . 56 + Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 57 1. Introduction This document specifies a framework for DualQ Coupled AQMs, which is the network part of the L4S architecture [I-D.ietf-tsvwg-l4s-arch]. L4S enables both very low queuing latency (sub-millisecond on average) and high throughput at the same time, for ad hoc numbers of capacity-seeking applications all sharing the same capacity. 1.1. Outline of the Problem @@ -148,36 +148,36 @@ now it has not been possible to allow any number of low latency, high throughput applications to seek to fully utilize available capacity, because the capacity-seeking process itself causes too much queuing delay. To reduce this queuing delay caused by the capacity seeking process, changes either to the network alone or to end-systems alone are in progress. L4S involves a recognition that both approaches are yielding diminishing returns: - o Recent state-of-the-art active queue management (AQM) in the + * Recent state-of-the-art active queue management (AQM) in the network, e.g. FQ-CoDel [RFC8290], PIE [RFC8033], Adaptive RED [ARED01] ) has reduced queuing delay for all traffic, not just a select few applications. However, no matter how good the AQM, the capacity-seeking (sawtoothing) rate of TCP-like congestion controls represents a lower limit that will either cause queuing delay to vary or cause the link to be under-utilized. These AQMs are tuned to allow a typical capacity-seeking Reno-friendly flow to induce an average queue that roughly doubles the base RTT, adding 5-15 ms of queuing on average (cf. 500 microseconds with L4S for the same mix of long-running and web traffic). However, for many applications low delay is not useful unless it is consistently low. With these AQMs, 99th percentile queuing delay is 20-30 ms (cf. 2 ms with the same traffic over L4S). - o Similarly, recent research into using e2e congestion control + * Similarly, recent research into using e2e congestion control without needing an AQM in the network (e.g.BBR [BBRv1], [I-D.cardwell-iccrg-bbr-congestion-control]) seems to have hit a similar lower limit to queuing delay of about 20ms on average (and any additional BBRv1 flow adds another 20ms of queuing) but there are also regular 25ms delay spikes due to bandwidth probes and 60ms spikes due to flow-starts. L4S learns from the experience of Data Center TCP [RFC8257], which shows the power of complementary changes both in the network and on end-systems. DCTCP teaches us that two small but radical changes to @@ -285,25 +285,25 @@ For the public Internet, nearly all the benefit will typically be achieved by deploying the Coupled AQM into either end of the access link between a 'site' and the Internet, which is invariably the bottleneck (see section 6.4 of[I-D.ietf-tsvwg-l4s-arch] about deployment, which also defines the term 'site' to mean a home, an office, a campus or mobile user equipment). Latency is not the only concern of L4S: - o The 'Low Loss" part of the name denotes that L4S generally + * The 'Low Loss" part of the name denotes that L4S generally achieves zero congestion loss (which would otherwise cause retransmission delays), due to its use of ECN. - o The "Scalable throughput" part of the name denotes that the per- + * The "Scalable throughput" part of the name denotes that the per- flow throughput of Scalable congestion controls should scale indefinitely, avoiding the imminent scaling problems with 'TCP- Friendly' congestion control algorithms [RFC3649]. The former is clearly in scope of this AQM document. However, the latter is an outcome of the end-system behaviour, and therefore outside the scope of this AQM document, even though the AQM is an enabler. The overall L4S architecture [I-D.ietf-tsvwg-l4s-arch] gives more @@ -448,41 +448,41 @@ achieve low delay. The L4S queue can be filled with a heavy load of capacity-seeking flows (TCP Prague etc.) and still achieve low delay. The L4S queue does not rely on the presence of other traffic in the Classic queue that can be 'overtaken'. It gives low latency to L4S traffic whether or not there is Classic traffic, and the latency of Classic traffic does not suffer when a proportion of the traffic is L4S. The two queues are only necessary because: - o the large variations (sawteeth) of Classic flows need roughly a + * the large variations (sawteeth) of Classic flows need roughly a base RTT of queuing delay to ensure full utilization - o Scalable flows do not need a queue to keep utilization high, but + * Scalable flows do not need a queue to keep utilization high, but they cannot keep latency predictably low if they are mixed with Classic traffic, The L4S queue has latency priority within sub-round trip timescales, but over longer periods the coupling from the Classic to the L4S AQM (explained below) ensures that it does not have bandwidth priority over the Classic queue. 2. DualQ Coupled AQM There are two main aspects to the approach: - o The Coupled AQM that addresses throughput equivalence between + * The Coupled AQM that addresses throughput equivalence between Classic (e.g. Reno, Cubic) flows and L4S flows (that satisfy the Prague L4S requirements). - o The Dual Queue structure that provides latency separation for L4S + * The Dual Queue structure that provides latency separation for L4S flows to isolate them from the typically large Classic queue. 2.1. Coupled AQM In the 1990s, the `TCP formula' was derived for the relationship between the steady-state congestion window, cwnd, and the drop probability, p of standard Reno congestion control [RFC5681] . To a first order approximation, the steady-state cwnd of Reno is inversely proportional to the square root of p. @@ -661,24 +661,24 @@ `----------'\\ | AQM |---->: ,'|`-.___.-' \\ | |p' | <' | \\ `-------' (p'^2) //`' \\ ^ | // \\,. | v p_C // < | _________ .------.// `\| | | | Drop |/ Classic |queue |===>|/mark | __|______| `------' - Legend: ===> traffic flow; ---> control dependency. - Figure 1: DualQ Coupled AQM Schematic + Legend: ===> traffic flow; ---> control dependency. + After the AQMs have applied their dropping or marking, the scheduler forwards their packets to the link. Even though the scheduler gives priority to the L queue, it is not as strong as the coupling from the C queue. This is because, as the C queue grows, the base AQM applies more congestion signals to L traffic (as well as C). As L flows reduce their rate in response, they use less than the scheduling share for L traffic. So, because the scheduler is work preserving, it schedules any C traffic in the gaps. Giving priority to the L queue has the benefit of very low L queue @@ -800,50 +799,50 @@ 2.5.1.1. Requirements in Unexpected Cases The flexibility to allow operator-specific classifiers (Section 2.3) leads to the need to specify what the AQM in each queue ought to do with packets that do not carry the ECN field expected for that queue. It is expected that the AQM in each queue will inspect the ECN field to determine what sort of congestion notification to signal, then it will decide whether to apply congestion notification to this particular packet, as follows: - o If a packet that does not carry an ECT(1) or CE codepoint is + * If a packet that does not carry an ECT(1) or CE codepoint is classified into the L queue: - * if the packet is ECT(0), the L AQM SHOULD apply CE-marking + - if the packet is ECT(0), the L AQM SHOULD apply CE-marking using a probability appropriate to Classic congestion control and appropriate to the target delay in the L queue - * if the packet is Not-ECT, the appropriate action depends on + - if the packet is Not-ECT, the appropriate action depends on whether some other function is protecting the L queue from misbehaving flows (e.g. per-flow queue protection [I-D.briscoe-docsis-q-protection] or latency policing): - + If separate queue protection is provided, the L AQM SHOULD + o If separate queue protection is provided, the L AQM SHOULD ignore the packet and forward it unchanged, meaning it should not calculate whether to apply congestion notification and it should neither drop nor CE-mark the packet (for instance, the operator might classify EF traffic that is unresponsive to drop into the L queue, alongside responsive L4S-ECN traffic) - + if separate queue protection is not provided, the L AQM + o if separate queue protection is not provided, the L AQM SHOULD apply drop using a drop probability appropriate to Classic congestion control and appropriate to the target delay in the L queue - o If a packet that carries an ECT(1) codepoint is classified into + * If a packet that carries an ECT(1) codepoint is classified into the C queue: - * the C AQM SHOULD apply CE-marking using the coupled AQM + - the C AQM SHOULD apply CE-marking using the coupled AQM probability p_CL (= k*p'). The above requirements are worded as "SHOULDs", because operator- specific classifiers are for flexibility, by definition. Therefore, alternative actions might be appropriate in the operator's specific circumstances. An example would be where the operator knows that certain legacy traffic marked with one codepoint actually has a congestion response associated with another codepoint. If the DualQ Coupled AQM has detected overload, it MUST begin using @@ -853,94 +852,94 @@ 2.5.2. Management Requirements 2.5.2.1. Configuration By default, a DualQ Coupled AQM SHOULD NOT need any configuration for use at a bottleneck on the public Internet [RFC7567]. The following parameters MAY be operator-configurable, e.g. to tune for non- Internet settings: - o Optional packet classifier(s) to use in addition to the ECN field + * Optional packet classifier(s) to use in addition to the ECN field (see Section 2.3); - o Expected typical RTT, which can be used to determine the queuing + * Expected typical RTT, which can be used to determine the queuing delay of the Classic AQM at its operating point, in order to prevent typical lone flows from under-utilizing capacity. For example: - * for the PI2 algorithm (Appendix A) the queuing delay target is + - for the PI2 algorithm (Appendix A) the queuing delay target is dependent on the typical RTT; - * for the Curvy RED algorithm (Appendix B) the queuing delay at + - for the Curvy RED algorithm (Appendix B) the queuing delay at the desired operating point of the curvy ramp is configured to encompass a typical RTT; - * if another Classic AQM was used, it would be likely to need an + - if another Classic AQM was used, it would be likely to need an operating point for the queue based on the typical RTT, and if so it SHOULD be expressed in units of time. An operating point that is manually calculated might be directly configurable instead, e.g. for links with large numbers of flows where under-utilization by a single flow would be unlikely. - o Expected maximum RTT, which can be used to set the stability + * Expected maximum RTT, which can be used to set the stability parameter(s) of the Classic AQM. For example: - * for the PI2 algorithm (Appendix A), the gain parameters of the + - for the PI2 algorithm (Appendix A), the gain parameters of the PI algorithm depend on the maximum RTT. - * for the Curvy RED algorithm (Appendix B) the smoothing + - for the Curvy RED algorithm (Appendix B) the smoothing parameter is chosen to filter out transients in the queue within a maximum RTT. Stability parameter(s) that are manually calculated assuming a maximum RTT might be directly configurable instead. - o Coupling factor, k (see Appendix C.2); + * Coupling factor, k (see Appendix C.2); - o A limit to the conditional priority of L4S. This is scheduler- + * A limit to the conditional priority of L4S. This is scheduler- dependent, but it SHOULD be expressed as a relation between the max delay of a C packet and an L packet. For example: - * for a WRR scheduler a weight ratio between L and C of w:1 means + - for a WRR scheduler a weight ratio between L and C of w:1 means that the maximum delay to a C packet is w times that of an L packet. - * for a time-shifted FIFO (TS-FIFO) scheduler (see Section 4.1.1) + - for a time-shifted FIFO (TS-FIFO) scheduler (see Section 4.1.1) a time-shift of tshift means that the maximum delay to a C packet is tshift greater than that of an L packet. tshift could be expressed as a multiple of the typical RTT rather than as an absolute delay. - o The maximum Classic ECN marking probability, p_Cmax, before + * The maximum Classic ECN marking probability, p_Cmax, before switching over to drop. 2.5.2.2. Monitoring An experimental DualQ Coupled AQM SHOULD allow the operator to monitor each of the following operational statistics on demand, per queue and per configurable sample interval, for performance monitoring and perhaps also for accounting in some cases: - o Bits forwarded, from which utilization can be calculated; + * Bits forwarded, from which utilization can be calculated; - o Total packets in the three categories: arrived, presented to the + * Total packets in the three categories: arrived, presented to the AQM, and forwarded. The difference between the first two will measure any non-AQM tail discard. The difference between the last two will measure proactive AQM discard; - o ECN packets marked, non-ECN packets dropped, ECN packets dropped, + * ECN packets marked, non-ECN packets dropped, ECN packets dropped, which can be combined with the three total packet counts above to calculate marking and dropping probabilities; - o Queue delay (not including serialization delay of the head packet + * Queue delay (not including serialization delay of the head packet or medium acquisition delay) - see further notes below. Unlike the other statistics, queue delay cannot be captured in a simple accumulating counter. Therefore the type of queue delay statistics produced (mean, percentiles, etc.) will depend on implementation constraints. To facilitate comparative evaluation of different implementations and approaches, an implementation SHOULD allow mean and 99th percentile queue delay to be derived (per queue per sample interval). A relatively simple way to do this would be to store a coarse-grained histogram of queue delay. @@ -948,21 +947,21 @@ edges that represent contiguous ranges of queue delay. Then, over a sample interval, each bin would accumulate a count of the number of packets that had fallen within each range. The maximum queue delay per queue per interval MAY also be recorded. 2.5.2.3. Anomaly Detection An experimental DualQ Coupled AQM SHOULD asynchronously report the following data about anomalous conditions: - o Start-time and duration of overload state. + * Start-time and duration of overload state. A hysteresis mechanism SHOULD be used to prevent flapping in and out of overload causing an event storm. For instance, exit from overload state could trigger one report, but also latch a timer. Then, during that time, if the AQM enters and exits overload state any number of times, the duration in overload state is accumulated but no new report is generated until the first time the AQM is out of overload once the timer has expired. 2.5.2.4. Deployment, Coexistence and Scaling @@ -1183,23 +1182,25 @@ Ing Jyh (Inton) Tsang of Nokia, Belgium built the End-to-End Data Centre to the Home broadband testbed on which DualQ Coupled AQM implementations were tested. 7. References 7.1. Normative References [I-D.ietf-tsvwg-ecn-l4s-id] Schepper, K. D. and B. Briscoe, "Explicit Congestion - Notification (ECN) Protocol for Ultra-Low Queuing Delay - (L4S)", draft-ietf-tsvwg-ecn-l4s-id-14 (work in progress), - March 2021. + Notification (ECN) Protocol for Very Low Queuing Delay + (L4S)", Work in Progress, Internet-Draft, draft-ietf- + tsvwg-ecn-l4s-id-19, 26 July 2021, + . [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997, . [RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition of Explicit Congestion Notification (ECN) to IP", RFC 3168, DOI 10.17487/RFC3168, September 2001, . @@ -1257,131 +1258,144 @@ . [DCttH15] De Schepper, K., Bondarenko, O., Briscoe, B., and I. Tsang, "`Data Centre to the Home': Ultra-Low Latency for All", RITE project Technical Report , 2015, . [DOCSIS3.1] CableLabs, "MAC and Upper Layer Protocols Interface (MULPI) Specification, CM-SP-MULPIv3.1", Data-Over-Cable - Service Interface Specifications DOCSIS(R) 3.1 Version i17 - or later, January 2019, . [DualPI2Linux] Albisser, O., De Schepper, K., Briscoe, B., Tilmans, O., and H. Steen, "DUALPI2 - Low Latency, Low Loss and Scalable (L4S) AQM", Proc. Linux Netdev 0x13 , March 2019, . [DualQ-Test] Steen, H., "Destruction Testing: Ultra-Low Delay using Dual Queue Coupled Active Queue Management", Masters - Thesis, Dept of Informatics, Uni Oslo , May 2017. + Thesis, Dept of Informatics, Uni Oslo , May 2017, + . [I-D.briscoe-docsis-q-protection] Briscoe, B. and G. White, "Queue Protection to Preserve - Low Latency", draft-briscoe-docsis-q-protection-00 (work - in progress), July 2019. + Low Latency", Work in Progress, Internet-Draft, draft- + briscoe-docsis-q-protection-00, 8 July 2019, + . [I-D.briscoe-iccrg-prague-congestion-control] Schepper, K. D., Tilmans, O., and B. Briscoe, "Prague - Congestion Control", draft-briscoe-iccrg-prague- - congestion-control-00 (work in progress), March 2021. + Congestion Control", Work in Progress, Internet-Draft, + draft-briscoe-iccrg-prague-congestion-control-00, 9 March + 2021, . [I-D.briscoe-tsvwg-l4s-diffserv] Briscoe, B., "Interactions between Low Latency, Low Loss, Scalable Throughput (L4S) and Differentiated Services", - draft-briscoe-tsvwg-l4s-diffserv-02 (work in progress), - November 2018. + Work in Progress, Internet-Draft, draft-briscoe-tsvwg-l4s- + diffserv-02, 4 November 2018, + . [I-D.cardwell-iccrg-bbr-congestion-control] Cardwell, N., Cheng, Y., Yeganeh, S. H., and V. Jacobson, - "BBR Congestion Control", draft-cardwell-iccrg-bbr- - congestion-control-00 (work in progress), July 2017. + "BBR Congestion Control", Work in Progress, Internet- + Draft, draft-cardwell-iccrg-bbr-congestion-control-00, 3 + July 2017, . [I-D.ietf-tsvwg-l4s-arch] Briscoe, B., Schepper, K. D., Bagnulo, M., and G. White, "Low Latency, Low Loss, Scalable Throughput (L4S) Internet - Service: Architecture", draft-ietf-tsvwg-l4s-arch-08 (work - in progress), November 2020. + Service: Architecture", Work in Progress, Internet-Draft, + draft-ietf-tsvwg-l4s-arch-10, 1 July 2021, + . [I-D.ietf-tsvwg-nqb] White, G. and T. Fossati, "A Non-Queue-Building Per-Hop - Behavior (NQB PHB) for Differentiated Services", draft- - ietf-tsvwg-nqb-05 (work in progress), March 2021. + Behavior (NQB PHB) for Differentiated Services", Work in + Progress, Internet-Draft, draft-ietf-tsvwg-nqb-07, 28 July + 2021, . [L4Sdemo16] Bondarenko, O., De Schepper, K., Tsang, I., and B. Briscoe, "Ultra-Low Delay for All: Live Experience, Live Analysis", Proc. MMSYS'16 pp33:1--33:4, May 2016, . [Labovitz10] Labovitz, C., Iekel-Johnson, S., McPherson, D., Oberheide, J., and F. Jahanian, "Internet Inter-Domain Traffic", Proc ACM SIGCOMM; ACM CCR 40(4):75--86, August 2010, . [LLD] White, G., Sundaresan, K., and B. Briscoe, "Low Latency DOCSIS: Technology Overview", CableLabs White Paper , February 2019, . - [Mathis09] - Mathis, M., "Relentless Congestion Control", PFLDNeT'09 , + [Mathis09] Mathis, M., "Relentless Congestion Control", PFLDNeT'09 , May 2009, . [MEDF] Menth, M., Schmid, M., Heiss, H., and T. Reim, "MEDF - a simple scheduling algorithm for two real-time transport service classes with application in the UTRAN", Proc. IEEE Conference on Computer Communications (INFOCOM'03) Vol.2 - pp.1116-1122, March 2003. + pp.1116-1122, March 2003, + . [PI2] De Schepper, K., Bondarenko, O., Briscoe, B., and I. Tsang, "PI2: A Linearized AQM for both Classic and Scalable TCP", ACM CoNEXT'16 , December 2016, . - [PI2param] - Briscoe, B., "PI2 Parameters", Technical Report TR-BB- + [PI2param] Briscoe, B., "PI2 Parameters", Technical Report TR-BB- 2021-001 arXiv:2107.01003 [cs.NI], July 2021, . [PragueLinux] Briscoe, B., De Schepper, K., Albisser, O., Misund, J., - Tilmans, O., Kuehlewind, M., and A. Ahmed, "Implementing + Tilmans, O., Kühlewind, M., and A.S. Ahmed, "Implementing the `TCP Prague' Requirements for Low Latency Low Loss Scalable Throughput (L4S)", Proc. Linux Netdev 0x13 , March 2019, . [RFC0970] Nagle, J., "On Packet Switches With Infinite Storage", RFC 970, DOI 10.17487/RFC0970, December 1985, . [RFC2309] Braden, B., Clark, D., Crowcroft, J., Davie, B., Deering, S., Estrin, D., Floyd, S., Jacobson, V., Minshall, G., Partridge, C., Peterson, L., Ramakrishnan, K., Shenker, S., Wroclawski, J., and L. Zhang, "Recommendations on Queue Management and Congestion Avoidance in the Internet", RFC 2309, DOI 10.17487/RFC2309, April 1998, . - [RFC3246] Davie, B., Charny, A., Bennet, J., Benson, K., Le Boudec, - J., Courtney, W., Davari, S., Firoiu, V., and D. + [RFC3246] Davie, B., Charny, A., Bennet, J.C.R., Benson, K., Le + Boudec, J.Y., Courtney, W., Davari, S., Firoiu, V., and D. Stiliadis, "An Expedited Forwarding PHB (Per-Hop Behavior)", RFC 3246, DOI 10.17487/RFC3246, March 2002, . [RFC3649] Floyd, S., "HighSpeed TCP for Large Congestion Windows", RFC 3649, DOI 10.17487/RFC3649, December 2003, . [RFC5033] Floyd, S. and M. Allman, "Specifying New Congestion Control Algorithms", BCP 133, RFC 5033, @@ -1436,22 +1450,21 @@ [RFC8312] Rhee, I., Xu, L., Ha, S., Zimmermann, A., Eggert, L., and R. Scheffenegger, "CUBIC for Fast Long-Distance Networks", RFC 8312, DOI 10.17487/RFC8312, February 2018, . [SCReAM] Johansson, I., "SCReAM", github repository; , . - [SigQ-Dyn] - Briscoe, B., "Rapid Signalling of Queue Dynamics", + [SigQ-Dyn] Briscoe, B., "Rapid Signalling of Queue Dynamics", Technical Report TR-BB-2017-001 arXiv:1904.07044 [cs.NI], September 2017, . Appendix A. Example DualQ Coupled PI2 Algorithm As a first concrete example, the pseudocode below gives the DualPI2 algorithm. DualPI2 follows the structure of the DualQ Coupled AQM framework in Figure 1. A simple ramp function (configured in units of queuing time) with unsmoothed ECN marking is used for the Native L4S AQM. The ramp can also be configured as a step function. The @@ -1479,51 +1492,52 @@ [DualPI2Linux]. The specification of the DualQ Coupled AQM for DOCSIS cable modems and CMTSs is available in [DOCSIS3.1] and explained in [LLD]. A.1. Pass #1: Core Concepts The pseudocode manipulates three main structures of variables: the packet (pkt), the L4S queue (lq) and the Classic queue (cq). The pseudocode consists of the following six functions: - o The initialization function dualpi2_params_init(...) (Figure 2) + * The initialization function dualpi2_params_init(...) (Figure 2) that sets parameter defaults (the API for setting non-default values is omitted for brevity) - o The enqueue function dualpi2_enqueue(lq, cq, pkt) (Figure 3) - o The dequeue function dualpi2_dequeue(lq, cq, pkt) (Figure 4) + * The enqueue function dualpi2_enqueue(lq, cq, pkt) (Figure 3) - o The recurrence function recur(q, likelihood) for de-randomized ECN + * The dequeue function dualpi2_dequeue(lq, cq, pkt) (Figure 4) + + * The recurrence function recur(q, likelihood) for de-randomized ECN marking (shown at the end of Figure 4). - o The L4S AQM function laqm(qdelay) (Figure 5) used to calculate the + * The L4S AQM function laqm(qdelay) (Figure 5) used to calculate the ECN-marking probability for the L4S queue - o The base AQM function that implements the PI algorithm + * The base AQM function that implements the PI algorithm dualpi2_update(lq, cq) (Figure 6) used to regularly update the base probability (p'), which is squared for the Classic AQM as well as being coupled across to the L4S queue. It also uses the following functions that are not shown in full here: - o scheduler(), which selects between the head packets of the two + * scheduler(), which selects between the head packets of the two queues; the choice of scheduler technology is discussed later; - o cq.len() or lq.len() returns the current length (aka. backlog) of + * cq.len() or lq.len() returns the current length (aka. backlog) of the relevant queue in bytes; - o cq.time() or lq.time() returns the current queuing delay + * cq.time() or lq.time() returns the current queuing delay (aka. sojourn time or service time) of the relevant queue in units of time (see Note a); - o mark(pkt) and drop(pkt) for ECN-marking and dropping a packet; + * mark(pkt) and drop(pkt) for ECN-marking and dropping a packet; In experiments so far (building on experiments with PIE) on broadband access links ranging from 4 Mb/s to 200 Mb/s with base RTTs from 5 ms to 100 ms, DualPI2 achieves good results with the default parameters in Figure 2. The parameters are categorised by whether they relate to the Base PI2 AQM, the L4S AQM or the framework coupling them together. Constants and variables derived from these parameters are also included at the end of each category. Each parameter is explained as it is encountered in the walk-through of the pseudocode below, and the rationale for the chosen defaults are given so that @@ -1651,57 +1665,57 @@ loop sloppier (for a typical RTT it would double the Classic queue's feedback delay). All the dequeue code is contained within a large while loop so that if it decides to drop a packet, it will continue until it selects a packet to schedule. Line 3 of the dequeue pseudocode is where the scheduler chooses between the L4S queue (lq) and the Classic queue (cq). Detailed implementation of the scheduler is not shown (see discussion later). - o If an L4S packet is scheduled, in lines 7 and 8 the packet is ECN- + * If an L4S packet is scheduled, in lines 7 and 8 the packet is ECN- marked with likelihood p_L. The recur() function at the end of Figure 4 is used, which is preferred over random marking because it avoids delay due to randomization when interpreting congestion signals, but it still desynchronizes the saw-teeth of the flows. Line 6 calculates p_L as the maximum of the coupled L4S probability p_CL and the probability from the native L4S AQM p'_L. This implements the max() function shown in Figure 1 to couple the outputs of the two AQMs together. Of the two probabilities input to p_L in line 6: - * p'_L is calculated per packet in line 5 by the laqm() function + - p'_L is calculated per packet in line 5 by the laqm() function (see Figure 5), - * Whereas p_CL is maintained by the dualpi2_update() function + - Whereas p_CL is maintained by the dualpi2_update() function which runs every Tupdate (Tupdate is set in line 13 of Figure 2). - o If a Classic packet is scheduled, lines 10 to 17 drop or mark the + * If a Classic packet is scheduled, lines 10 to 17 drop or mark the packet with probability p_C. The Native L4S AQM algorithm (Figure 5) is a ramp function, similar to the RED algorithm, but simplified as follows: - o The extent of the ramp is defined in units of queuing delay, not + * The extent of the ramp is defined in units of queuing delay, not bytes, so that configuration remains invariant as the queue departure rate varies. - o It uses instantaneous queueing delay, which avoids the complexity + * It uses instantaneous queueing delay, which avoids the complexity of smoothing, but also avoids embedding a worst-case RTT of smoothing delay in the network (see Section 2.1). - o The ramp rises linearly directly from 0 to 1, not to an + * The ramp rises linearly directly from 0 to 1, not to an intermediate value of p'_L as RED would, because there is no need to keep ECN marking probability low. - o Marking does not have to be randomized. Determinism is used + * Marking does not have to be randomized. Determinism is used instead of randomness; to reduce the delay necessary to smooth out the noise of randomness from the signal. The ramp function requires two configuration parameters, the minimum threshold (minTh) and the width of the ramp (range), both in units of queuing time), as shown in lines 18 & 19 of the initialization function in Figure 2. The ramp function can be configured as a step (see Note c). Although the DCTCP paper [Alizadeh-stability] recommends an ECN @@ -1731,24 +1745,24 @@ Figure 5: Example Pseudocode for the Native L4S AQM 1: dualpi2_update(lq, cq) { % Update p' every Tupdate 2: curq = cq.time() % use queuing time of first-in Classic packet 3: p' = p' + alpha * (curq - target) + beta * (curq - prevq) 4: p_CL = k * p' % Coupled L4S prob = base prob * coupling factor 5: p_C = p'^2 % Classic prob = (base prob)^2 6: prevq = curq 7: } - (Clamping p' within the range [0,1] omitted for clarity - see text) - Figure 6: Example PI-Update Pseudocode for DualQ Coupled PI2 AQM + (Clamping p' within the range [0,1] omitted for clarity - see text) + The coupled marking probability, p_CL depends on the base probability (p'), which is kept up to date by the core PI algorithm in Figure 6 executed every Tupdate. Note that p' solely depends on the queuing time in the Classic queue. In line 2, the current queuing delay (curq) is evaluated from how long the head packet was in the Classic queue (cq). The function cq.time() (not shown) subtracts the time stamped at enqueue from the current time (see Note a) and implicitly takes the current queuing delay as 0 if the queue is empty. @@ -1766,43 +1780,43 @@ is for the default to be a good compromise for anywhere in the intended deployment environment---the public Internet. The target queuing delay is related to the typical base RTT, RTT_typ, by two factors, shown in the comment on line 9 of Figure 2 as target = RTT_typ * 0.22 * 2. These factors ensure that, in a large proportion of cases (say 90%), the sawtooth variations in RTT will fit within the buffer without underutilizing the link. Frankly, these factors are educated guesses, but with the emphasis closer to 'educated' than to 'guess' (see [PI2param] for background investigations): - o RTT_typ is taken as 34 ms. This is based on an average CDN + * RTT_typ is taken as 34 ms. This is based on an average CDN latency measured in each country weighted by the number of Internet users in that country to produce an overall weighted average for the Internet [PI2param]. - o The factor 0.22 is a geometry factor that characterizes the shape + * The factor 0.22 is a geometry factor that characterizes the shape of the sawteeth of prevalent Classic congestion controllers. The geometry factor is the difference between the minimum and the average queue delays of the sawteeth, relative to the base RTT. For instance, the geometry factor of standard Reno is 0.5. According to the census of congestion controllers conducted by Mishra _et al_ in Jul-Oct 2019 [CCcensus19], most Classic TCP traffic uses Cubic. And, according to the analysis in [PI2param], if running over a PI2 AQM, a large proportion of this Cubic traffic would be in its Reno-Friendly mode, which has a geometry factor of 0.21 (Linux implementation). The rest of the Cubic traffic would be in true Cubic mode, which has a geometry factor of 0.32. Without modelling the sawtooth profiles from all the other less prevalent congestion controllers, we estimate a 9:1 weighted average of these two, resulting in an average geometry factor of 0.22. - o The factor 2, is a safety factor that increases the target queue + * The factor 2, is a safety factor that increases the target queue to allow for the distribution of RTT_typ around its mean. Otherwise the target queue would only avoid underutilization for those users below the mean. It also provides a safety margin for the proportion of paths in use that span beyond the distance between a user and their local CDN. Currently no data is available on the variance of queue delay around the mean in each region, so there is plenty of room for this guess to become more educated. The two 'gain factors' in line 3 of Figure 6, alpha and beta, @@ -1874,31 +1888,31 @@ Tupdate dependent on p'. Instead, in PI2, alpha and beta are independent of p' because the squaring applied to Classic traffic tunes them inherently. This is explained in [PI2], which also explains why this more principled approach removes the need for most of the heuristics that had to be added to PIE. Nonetheless, an implementer might wish to add selected heuristics to either AQM. For instance the Linux reference DualPI2 implementation includes the following: - o Prior to enqueuing an L4S packet, if the L queue contains <2 + * Prior to enqueuing an L4S packet, if the L queue contains <2 packets, the packet is flagged to suppress any native L4S AQM marking at dequeue (which depends on sojourn time); - o Classic and coupled marking or dropping (i.e. based on p_C and + * Classic and coupled marking or dropping (i.e. based on p_C and p_CL from the PI controller) is only applied to a packet if the respective queue length in bytes is > 2 MTU (prior to enqueuing the packet or after dequeuing it, depending on whether the AQM is configured to be applied at enqueue or dequeue); - o In the WRR scheduler, the 'credit' indicating which queue should + * In the WRR scheduler, the 'credit' indicating which queue should transmit is only changed if there are packets in both queues (i.e. if there is actual resource contention). This means that a properly paced L flow might never be delayed by the WRR. The WRR credit is reset in favour of the L queue when the link is idle. An implementer might also wish to add other heuristics, e.g. burst protection [RFC8033] or enhanced burst protection [RFC8034]. Notes: @@ -2047,99 +2061,98 @@ 5: p_C = p'^2 % Classic prob = (base prob)^2 6: prevq = curq 7: } Figure 8: Example PI-Update Pseudocode for DualQ Coupled PI2 AQM (Including Overload Code) The choice of scheduler technology is critical to overload protection (see Section 4.1). - o A well-understood weighted scheduler such as weighted round robin + * A well-understood weighted scheduler such as weighted round robin (WRR) is recommended. As long as the scheduler weight for Classic is small (e.g. 1/16), its exact value is unimportant because it does not normally determine capacity shares. The weight is only important to prevent unresponsive L4S traffic starving Classic traffic. This is because capacity sharing between the queues is normally determined by the coupled congestion signal, which overrides the scheduler, by making L4S sources leave roughly equal per-flow capacity available for Classic flows. - o Alternatively, a time-shifted FIFO (TS-FIFO) could be used. It + * Alternatively, a time-shifted FIFO (TS-FIFO) could be used. It works by selecting the head packet that has waited the longest, biased against the Classic traffic by a time-shift of tshift. To implement time-shifted FIFO, the scheduler() function in line 3 of the dequeue code would simply be implemented as the scheduler() function at the bottom of Figure 10 in Appendix B. For the public Internet a good value for tshift is 50ms. For private networks with smaller diameter, about 4*target would be reasonable. TS- FIFO is a very simple scheduler, but complexity might need to be added to address some deficiencies (which is why it is not recommended over WRR): - * TS-FIFO does not fully isolate latency in the L4S queue from + - TS-FIFO does not fully isolate latency in the L4S queue from uncontrolled bursts in the Classic queue; - * TS-FIFO is only appropriate if time-stamping of packets is + - TS-FIFO is only appropriate if time-stamping of packets is feasible; - * Even if time-stamping is supported, the sojourn time of the + - Even if time-stamping is supported, the sojourn time of the head packet is always stale. For instance, if a burst arrives at an empty queue, the sojourn time will only measure the delay of the burst once the burst is over, even though the queue knew about it from the start. At the cost of more operations and more storage, a 'scaled sojourn time' metric of queue delay can be used, which is the sojourn time of a packet scaled by the ratio of the queue sizes when the packet departed and arrived [SigQ-Dyn]. - o A strict priority scheduler would be inappropriate, because it + * A strict priority scheduler would be inappropriate, because it would starve Classic if L4S was overloaded. Appendix B. Example DualQ Coupled Curvy RED Algorithm As another example of a DualQ Coupled AQM algorithm, the pseudocode below gives the Curvy RED based algorithm. Although the AQM was designed to be efficient in integer arithmetic, to aid understanding it is first given using floating point arithmetic (Figure 10). Then, one possible optimization for integer arithmetic is given, also in pseudocode (Figure 11). To aid comparison, the line numbers are kept in step between the two by using letter suffixes where the longer code needs extra lines. B.1. Curvy RED in Pseudocode The pseudocode manipulates three main structures of variables: the packet (pkt), the L4S queue (lq) and the Classic queue (cq) and consists of the following five functions: - o The initialization function cred_params_init(...) (Figure 2) that + * The initialization function cred_params_init(...) (Figure 2) that sets parameter defaults (the API for setting non-default values is omitted for brevity); - o The dequeue function cred_dequeue(lq, cq, pkt) (Figure 4); + * The dequeue function cred_dequeue(lq, cq, pkt) (Figure 4); - o The scheduling function scheduler(), which selects between the + * The scheduling function scheduler(), which selects between the head packets of the two queues. It also uses the following functions that are either shown elsewhere, or not shown in full here: - o The enqueue function, which is identical to that used for DualPI2, + * The enqueue function, which is identical to that used for DualPI2, dualpi2_enqueue(lq, cq, pkt) in Figure 3; - o mark(pkt) and drop(pkt) for ECN-marking and dropping a packet; - - o cq.len() or lq.len() returns the current length (aka. backlog) of + * mark(pkt) and drop(pkt) for ECN-marking and dropping a packet; + * cq.len() or lq.len() returns the current length (aka. backlog) of the relevant queue in bytes; - o cq.time() or lq.time() returns the current queuing delay + * cq.time() or lq.time() returns the current queuing delay (aka. sojourn time or service time) of the relevant queue in units of time (see Note a in Appendix A.1). Because Curvy RED was evaluated before DualPI2, certain improvements introduced for DualPI2 were not evaluated for Curvy RED. In the pseudocode below, the straightforward improvements have been added on the assumption they will provide similar benefits, but that has not been proven experimentally. They are: i) a conditional priority scheduler instead of strict priority ii) a time-based threshold for the native L4S AQM; iii) ECN support for the Classic AQM. A recent @@ -2413,22 +2426,22 @@ 13: continue % continue to the top of the while loop 14: } 15: mark(pkt) 16: } 17: } 18: return(pkt) % return the packet and stop here 19: } 20: return(NULL) % no packet to dequeue 21: } - Figure 11: Optimised Example Dequeue Pseudocode for Coupled DualQ AQM - using Integer Arithmetic + Figure 11: Optimised Example Dequeue Pseudocode for Coupled DualQ + AQM using Integer Arithmetic The two ranges, range_L and range_C are expressed as powers of 2 so that division can be implemented as a right bit-shift (>>) in lines 5 and 10 of the integer variant of the pseudocode (Figure 11). For the integer variant of the pseudocode, an integer version of the rand() function used at line 25 of the maxrand(function) in Figure 10 would be arranged to return an integer in the range 0 <= maxrand() < 2^32 (not shown). This would scale up all the floating point probabilities in the range [0,1] by 2^32. @@ -2475,32 +2488,39 @@ At the time of writing, the range of approaches to RTT-dependence in L4S congestion controls has not settled. Therefore, the guidance on the choice of the coupling factor in Appendix C.2 is given against DCTCP [RFC8257], which has well-understood RTT-dependence. The guidance is given for various RTT ratios, so that it can be adapted to future circumstances. C.2. Guidance on Controlling Throughput Equivalence - +---------------+------+-------+ + +===============+======+=======+ | RTT_C / RTT_L | Reno | Cubic | - +---------------+------+-------+ + +===============+======+=======+ | 1 | k'=1 | k'=0 | + +---------------+------+-------+ | 2 | k'=2 | k'=1 | + +---------------+------+-------+ | 3 | k'=2 | k'=2 | + +---------------+------+-------+ | 4 | k'=3 | k'=2 | + +---------------+------+-------+ | 5 | k'=3 | k'=3 | +---------------+------+-------+ - Table 1: Value of k' for which DCTCP throughput is roughly the same - as Reno or Cubic, for some example RTT ratios + Table 1: Value of k' for + which DCTCP throughput is + roughly the same as Reno or + Cubic, for some example RTT + ratios In the above appendices that give example DualQ Coupled algorithms, to aid efficient implementation, a coupling factor that is an integer power of 2 is always used. k' is always used to denote the power. k' is related to the coupling factor k in Equation (1) (Section 2.1) by k=2^k'. To determine the appropriate coupling factor policy, the operator first has to judge whether it wants DCTCP flows to have roughly equal throughput with Reno or with Cubic (because, even in its Reno- @@ -2542,21 +2562,20 @@ Koen De Schepper Nokia Bell Labs Antwerp Belgium Email: koen.de_schepper@nokia.com URI: https://www.bell-labs.com/usr/koen.de_schepper Bob Briscoe (editor) Independent - UK + United Kingdom Email: ietf@bobbriscoe.net URI: http://bobbriscoe.net/ Greg White CableLabs - Louisville, CO - US - + Louisville, CO, + United States of America Email: G.White@CableLabs.com