draft-ietf-tsvwg-aqm-dualq-coupled-20.txt   draft-ietf-tsvwg-aqm-dualq-coupled-21.txt 
Transport Area working group (tsvwg) K. De Schepper Transport Area working group (tsvwg) K. De Schepper
Internet-Draft Nokia Bell Labs Internet-Draft Nokia Bell Labs
Intended status: Experimental B. Briscoe, Ed. Intended status: Experimental B. Briscoe, Ed.
Expires: 27 June 2022 Independent Expires: 5 August 2022 Independent
G. White G. White
CableLabs CableLabs
24 December 2021 1 February 2022
DualQ Coupled AQMs for Low Latency, Low Loss and Scalable Throughput DualQ Coupled AQMs for Low Latency, Low Loss and Scalable Throughput
(L4S) (L4S)
draft-ietf-tsvwg-aqm-dualq-coupled-20 draft-ietf-tsvwg-aqm-dualq-coupled-21
Abstract Abstract
This specification defines a framework for coupling the Active Queue This specification defines a framework for coupling the Active Queue
Management (AQM) algorithms in two queues intended for flows with Management (AQM) algorithms in two queues intended for flows with
different responses to congestion. This provides a way for the different responses to congestion. This provides a way for the
Internet to transition from the scaling problems of standard TCP Internet to transition from the scaling problems of standard TCP
Reno-friendly ('Classic') congestion controls to the family of Reno-friendly ('Classic') congestion controls to the family of
'Scalable' congestion controls. These are designed for consistently 'Scalable' congestion controls. These are designed for consistently
very Low queuing Latency, very Low congestion Loss and Scaling of very Low queuing Latency, very Low congestion Loss and Scaling of
skipping to change at page 2, line 10 skipping to change at page 2, line 10
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet- working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/. Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
This Internet-Draft will expire on 27 June 2022. This Internet-Draft will expire on 5 August 2022.
Copyright Notice Copyright Notice
Copyright (c) 2021 IETF Trust and the persons identified as the Copyright (c) 2022 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents (https://trustee.ietf.org/ Provisions Relating to IETF Documents (https://trustee.ietf.org/
license-info) in effect on the date of publication of this document. license-info) in effect on the date of publication of this document.
Please review these documents carefully, as they describe your rights Please review these documents carefully, as they describe your rights
and restrictions with respect to this document. Code Components and restrictions with respect to this document. Code Components
extracted from this document must include Revised BSD License text as extracted from this document must include Revised BSD License text as
described in Section 4.e of the Trust Legal Provisions and are described in Section 4.e of the Trust Legal Provisions and are
provided without warranty as described in the Revised BSD License. provided without warranty as described in the Revised BSD License.
skipping to change at page 3, line 14 skipping to change at page 3, line 14
4.1.3. Protecting against Unresponsive ECN-Capable 4.1.3. Protecting against Unresponsive ECN-Capable
Traffic . . . . . . . . . . . . . . . . . . . . . . . 25 Traffic . . . . . . . . . . . . . . . . . . . . . . . 25
5. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 26 5. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 26
6. Contributors . . . . . . . . . . . . . . . . . . . . . . . . 26 6. Contributors . . . . . . . . . . . . . . . . . . . . . . . . 26
7. References . . . . . . . . . . . . . . . . . . . . . . . . . 27 7. References . . . . . . . . . . . . . . . . . . . . . . . . . 27
7.1. Normative References . . . . . . . . . . . . . . . . . . 27 7.1. Normative References . . . . . . . . . . . . . . . . . . 27
7.2. Informative References . . . . . . . . . . . . . . . . . 27 7.2. Informative References . . . . . . . . . . . . . . . . . 27
Appendix A. Example DualQ Coupled PI2 Algorithm . . . . . . . . 33 Appendix A. Example DualQ Coupled PI2 Algorithm . . . . . . . . 33
A.1. Pass #1: Core Concepts . . . . . . . . . . . . . . . . . 33 A.1. Pass #1: Core Concepts . . . . . . . . . . . . . . . . . 33
A.2. Pass #2: Overload Details . . . . . . . . . . . . . . . . 44 A.2. Pass #2: Edge-Case Details . . . . . . . . . . . . . . . 44
Appendix B. Example DualQ Coupled Curvy RED Algorithm . . . . . 48 Appendix B. Example DualQ Coupled Curvy RED Algorithm . . . . . 48
B.1. Curvy RED in Pseudocode . . . . . . . . . . . . . . . . . 48 B.1. Curvy RED in Pseudocode . . . . . . . . . . . . . . . . . 48
B.2. Efficient Implementation of Curvy RED . . . . . . . . . . 54 B.2. Efficient Implementation of Curvy RED . . . . . . . . . . 54
Appendix C. Choice of Coupling Factor, k . . . . . . . . . . . . 56 Appendix C. Choice of Coupling Factor, k . . . . . . . . . . . . 56
C.1. RTT-Dependence . . . . . . . . . . . . . . . . . . . . . 56 C.1. RTT-Dependence . . . . . . . . . . . . . . . . . . . . . 56
C.2. Guidance on Controlling Throughput Equivalence . . . . . 57 C.2. Guidance on Controlling Throughput Equivalence . . . . . 57
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 61 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 61
1. Introduction 1. Introduction
skipping to change at page 4, line 26 skipping to change at page 4, line 26
delay to vary or cause the link to be under-utilized. These AQMs delay to vary or cause the link to be under-utilized. These AQMs
are tuned to allow a typical capacity-seeking Reno-friendly flow are tuned to allow a typical capacity-seeking Reno-friendly flow
to induce an average queue that roughly doubles the base RTT, to induce an average queue that roughly doubles the base RTT,
adding 5-15 ms of queuing on average (cf. 500 microseconds with adding 5-15 ms of queuing on average (cf. 500 microseconds with
L4S for the same mix of long-running and web traffic). However, L4S for the same mix of long-running and web traffic). However,
for many applications low delay is not useful unless it is for many applications low delay is not useful unless it is
consistently low. With these AQMs, 99th percentile queuing delay consistently low. With these AQMs, 99th percentile queuing delay
is 20-30 ms (cf. 2 ms with the same traffic over L4S). is 20-30 ms (cf. 2 ms with the same traffic over L4S).
* Similarly, recent research into using e2e congestion control * Similarly, recent research into using e2e congestion control
without needing an AQM in the network (e.g.BBR [BBRv1], without needing an AQM in the network
[I-D.cardwell-iccrg-bbr-congestion-control]) seems to have hit a (e.g.BBR [I-D.cardwell-iccrg-bbr-congestion-control]) seems to
similar lower limit to queuing delay of about 20ms on average (and have hit a similar lower limit to queuing delay of about 20ms on
any additional BBRv1 flow adds another 20ms of queuing) but there average but there are also regular 25ms delay spikes due to
are also regular 25ms delay spikes due to bandwidth probes and bandwidth probes and 60ms spikes due to flow-starts.
60ms spikes due to flow-starts.
L4S learns from the experience of Data Center TCP [RFC8257], which L4S learns from the experience of Data Center TCP [RFC8257], which
shows the power of complementary changes both in the network and on shows the power of complementary changes both in the network and on
end-systems. DCTCP teaches us that two small but radical changes to end-systems. DCTCP teaches us that two small but radical changes to
congestion control are needed to cut the two major outstanding causes congestion control are needed to cut the two major outstanding causes
of queuing delay variability: of queuing delay variability:
1. Far smaller rate variations (sawteeth) than Reno-friendly 1. Far smaller rate variations (sawteeth) than Reno-friendly
congestion controls; congestion controls;
skipping to change at page 6, line 20 skipping to change at page 6, line 20
using the experimentally assigned explicit congestion notification using the experimentally assigned explicit congestion notification
(ECN) codepoints in the IP header: ECT(1) and CE [RFC8311] (ECN) codepoints in the IP header: ECT(1) and CE [RFC8311]
[I-D.ietf-tsvwg-ecn-l4s-id]. [I-D.ietf-tsvwg-ecn-l4s-id].
Data Center TCP (DCTCP [RFC8257]) is an example of a Scalable Data Center TCP (DCTCP [RFC8257]) is an example of a Scalable
congestion control for controlled environments that has been deployed congestion control for controlled environments that has been deployed
for some time in Linux, Windows and FreeBSD operating systems. for some time in Linux, Windows and FreeBSD operating systems.
During the progress of this document through the IETF a number of During the progress of this document through the IETF a number of
other Scalable congestion controls were implemented, e.g. TCP other Scalable congestion controls were implemented, e.g. TCP
Prague [I-D.briscoe-iccrg-prague-congestion-control] [PragueLinux], Prague [I-D.briscoe-iccrg-prague-congestion-control] [PragueLinux],
BBRv2 [BBRv2], QUIC Prague and the L4S variant of SCREAM for real- BBRv2 [I-D.cardwell-iccrg-bbr-congestion-control], QUIC Prague and
time media [RFC8298]. the L4S variant of SCREAM for real-time media [RFC8298].
The focus of this specification is to enable deployment of the The focus of this specification is to enable deployment of the
network part of the L4S service. Then, without any management network part of the L4S service. Then, without any management
intervention, applications can exploit this new network capability as intervention, applications can exploit this new network capability as
their operating systems migrate to Scalable congestion controls, their operating systems migrate to Scalable congestion controls,
which can then evolve _while_ their benefits are being enjoyed by which can then evolve _while_ their benefits are being enjoyed by
everyone on the Internet. everyone on the Internet.
The DualQ Coupled AQM framework can incorporate any AQM designed for The DualQ Coupled AQM framework can incorporate any AQM designed for
a single queue that generates a statistical or deterministic mark/ a single queue that generates a statistical or deterministic mark/
skipping to change at page 8, line 26 skipping to change at page 8, line 26
Scalable Congestion Control: A congestion control where the average Scalable Congestion Control: A congestion control where the average
time from one congestion signal to the next (the recovery time) time from one congestion signal to the next (the recovery time)
remains invariant as the flow rate scales, all other factors being remains invariant as the flow rate scales, all other factors being
equal. This maintains the same degree of control over queueing equal. This maintains the same degree of control over queueing
and utilization whatever the flow rate, as well as ensuring that and utilization whatever the flow rate, as well as ensuring that
high throughput is robust to disturbances. For instance, DCTCP high throughput is robust to disturbances. For instance, DCTCP
averages 2 congestion signals per round-trip whatever the flow averages 2 congestion signals per round-trip whatever the flow
rate, as do other recently developed scalable congestion controls, rate, as do other recently developed scalable congestion controls,
e.g. Relentless TCP [Mathis09], TCP Prague e.g. Relentless TCP [Mathis09], TCP Prague
[I-D.briscoe-iccrg-prague-congestion-control], [PragueLinux], [I-D.briscoe-iccrg-prague-congestion-control], [PragueLinux],
BBRv2 [BBRv2] and the L4S variant of SCREAM for real-time BBRv2 [I-D.cardwell-iccrg-bbr-congestion-control] and the L4S
media [SCReAM], [RFC8298]). For the public Internet a Scalable variant of SCREAM for real-time media [SCReAM], [RFC8298]). For
transport has to comply with the requirements in Section 4 of the public Internet a Scalable transport has to comply with the
[I-D.ietf-tsvwg-ecn-l4s-id] (aka. the 'Prague L4S requirements'). requirements in Section 4 of [I-D.ietf-tsvwg-ecn-l4s-id] (aka. the
'Prague L4S requirements').
C: Abbreviation for Classic, e.g. when used as a subscript. C: Abbreviation for Classic, e.g. when used as a subscript.
L: Abbreviation for L4S, e.g. when used as a subscript. L: Abbreviation for L4S, e.g. when used as a subscript.
The terms Classic or L4S can also qualify other nouns, such as The terms Classic or L4S can also qualify other nouns, such as
'codepoint', 'identifier', 'classification', 'packet', 'flow'. 'codepoint', 'identifier', 'classification', 'packet', 'flow'.
For example: an L4S packet means a packet with an L4S identifier For example: an L4S packet means a packet with an L4S identifier
sent from an L4S congestion control. sent from an L4S congestion control.
skipping to change at page 21, line 12 skipping to change at page 21, line 12
that the maximum delay to a C packet is w times that of an L that the maximum delay to a C packet is w times that of an L
packet. packet.
- for a time-shifted FIFO (TS-FIFO) scheduler (see Section 4.1.1) - for a time-shifted FIFO (TS-FIFO) scheduler (see Section 4.1.1)
a time-shift of tshift means that the maximum delay to a C a time-shift of tshift means that the maximum delay to a C
packet is tshift greater than that of an L packet. tshift could packet is tshift greater than that of an L packet. tshift could
be expressed as a multiple of the typical RTT rather than as an be expressed as a multiple of the typical RTT rather than as an
absolute delay. absolute delay.
* The maximum Classic ECN marking probability, p_Cmax, before * The maximum Classic ECN marking probability, p_Cmax, before
switching over to drop. introducing drop.
2.5.2.2. Monitoring 2.5.2.2. Monitoring
An experimental DualQ Coupled AQM SHOULD allow the operator to An experimental DualQ Coupled AQM SHOULD allow the operator to
monitor each of the following operational statistics on demand, per monitor each of the following operational statistics on demand, per
queue and per configurable sample interval, for performance queue and per configurable sample interval, for performance
monitoring and perhaps also for accounting in some cases: monitoring and perhaps also for accounting in some cases:
* Bits forwarded, from which utilization can be calculated; * Bits forwarded, from which utilization can be calculated;
skipping to change at page 25, line 19 skipping to change at page 25, line 19
Drop on Saturation: Saturation can be avoided by setting a maximum Drop on Saturation: Saturation can be avoided by setting a maximum
threshold for L4S ECN marking (assuming k>1) before saturation threshold for L4S ECN marking (assuming k>1) before saturation
starts to make the flow rates of the different traffic types starts to make the flow rates of the different traffic types
diverge. Above that the drop probability of Classic traffic is diverge. Above that the drop probability of Classic traffic is
applied to all packets of all traffic types. Then experiments applied to all packets of all traffic types. Then experiments
have shown that queueing delay can be kept at the target in any have shown that queueing delay can be kept at the target in any
overload situation, including with unresponsive traffic, and no overload situation, including with unresponsive traffic, and no
further measures are required [DualQ-Test]. further measures are required [DualQ-Test].
Delay on Saturation: When L4S marking saturates, instead of Delay on Saturation: When L4S marking saturates, instead of
switching to drop, the drop and marking probabilities could be introducing L4S drop, the drop and marking probabilities of both
capped. Beyond that, delay will grow either solely in the queue queues could be capped. Beyond that, delay will grow either
with unresponsive traffic (if WRR is used), or in both queues (if solely in the queue with unresponsive traffic (if WRR is used), or
time-shifted FIFO is used). In either case, the higher delay in both queues (if time-shifted FIFO is used). In either case,
ought to control temporary high congestion. If the overload is the higher delay ought to control temporary high congestion. If
more persistent, eventually the combined DualQ will overflow and the overload is more persistent, eventually the combined DualQ
tail drop will control congestion. will overflow and tail drop will control congestion.
The example implementation in Appendix A solely applies the "drop on The example implementation in Appendix A solely applies the "drop on
saturation" policy. The DOCSIS specification of a DualQ Coupled saturation" policy. The DOCSIS specification of a DualQ Coupled
AQM [DOCSIS3.1] also implements the 'drop on saturation' policy with AQM [DOCSIS3.1] also implements the 'drop on saturation' policy with
a very shallow L buffer. However, the addition of DOCSIS per-flow a very shallow L buffer. However, the addition of DOCSIS per-flow
Queue Protection [I-D.briscoe-docsis-q-protection] turns this into Queue Protection [I-D.briscoe-docsis-q-protection] turns this into
'delay on saturation' by redirecting some packets of the flow(s) most 'delay on saturation' by redirecting some packets of the flow(s) most
responsible for L queue overload into the C queue, which has a higher responsible for L queue overload into the C queue, which has a higher
delay target. If overload continues, this again becomes 'drop on delay target. If overload continues, this again becomes 'drop on
saturation' as the level of drop in the C queue rises to maintain the saturation' as the level of drop in the C queue rises to maintain the
skipping to change at page 27, line 13 skipping to change at page 27, line 13
implementations were tested. implementations were tested.
7. References 7. References
7.1. Normative References 7.1. Normative References
[I-D.ietf-tsvwg-ecn-l4s-id] [I-D.ietf-tsvwg-ecn-l4s-id]
Schepper, K. D. and B. Briscoe, "Explicit Congestion Schepper, K. D. and B. Briscoe, "Explicit Congestion
Notification (ECN) Protocol for Very Low Queuing Delay Notification (ECN) Protocol for Very Low Queuing Delay
(L4S)", Work in Progress, Internet-Draft, draft-ietf- (L4S)", Work in Progress, Internet-Draft, draft-ietf-
tsvwg-ecn-l4s-id-22, 8 November 2021, tsvwg-ecn-l4s-id-23, 24 December 2021,
<https://datatracker.ietf.org/doc/html/draft-ietf-tsvwg- <https://datatracker.ietf.org/doc/html/draft-ietf-tsvwg-
ecn-l4s-id-22>. ecn-l4s-id-23>.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, Requirement Levels", BCP 14, RFC 2119,
DOI 10.17487/RFC2119, March 1997, DOI 10.17487/RFC2119, March 1997,
<https://www.rfc-editor.org/info/rfc2119>. <https://www.rfc-editor.org/info/rfc2119>.
[RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition [RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
of Explicit Congestion Notification (ECN) to IP", of Explicit Congestion Notification (ECN) to IP",
RFC 3168, DOI 10.17487/RFC3168, September 2001, RFC 3168, DOI 10.17487/RFC3168, September 2001,
<https://www.rfc-editor.org/info/rfc3168>. <https://www.rfc-editor.org/info/rfc3168>.
skipping to change at page 28, line 5 skipping to change at page 28, line 5
Queue- based Active Queue Management Algorithms", Proc. Queue- based Active Queue Management Algorithms", Proc.
Int'l Soc. for Optical Engineering (SPIE) 4866:35--46 DOI: Int'l Soc. for Optical Engineering (SPIE) 4866:35--46 DOI:
10.1117/12.473021, 2002, 10.1117/12.473021, 2002,
<https://www.cs.purdue.edu/homes/fahmy/papers/ldc.pdf>. <https://www.cs.purdue.edu/homes/fahmy/papers/ldc.pdf>.
[ARED01] Floyd, S., Gummadi, R., and S. Shenker, "Adaptive RED: An [ARED01] Floyd, S., Gummadi, R., and S. Shenker, "Adaptive RED: An
Algorithm for Increasing the Robustness of RED's Active Algorithm for Increasing the Robustness of RED's Active
Queue Management", ACIRI Technical Report , August 2001, Queue Management", ACIRI Technical Report , August 2001,
<http://www.icir.org/floyd/red.html>. <http://www.icir.org/floyd/red.html>.
[BBRv1] Cardwell, N., Cheng, Y., Hassas Yeganeh, S., and V.
Jacobson, "BBR Congestion Control", Internet Draft draft-
cardwell-iccrg-bbr-congestion-control-00, July 2017,
<https://tools.ietf.org/html/draft-cardwell-iccrg-bbr-
congestion-control-00>.
[BBRv2] Cardwell, N., "BRTCP BBR v2 Alpha/Preview Release", github
repository; Linux congestion control module,
<https://github.com/google/bbr/blob/v2alpha/README.md>.
[CCcensus19] [CCcensus19]
Mishra, A., Sun, X., Jain, A., Pande, S., Joshi, R., and Mishra, A., Sun, X., Jain, A., Pande, S., Joshi, R., and
B. Leong, "The Great Internet TCP Congestion Control B. Leong, "The Great Internet TCP Congestion Control
Census", Proc. ACM on Measurement and Analysis of Census", Proc. ACM on Measurement and Analysis of
Computing Systems 3(3), December 2019, Computing Systems 3(3), December 2019,
<https://doi.org/10.1145/3366693>. <https://doi.org/10.1145/3366693>.
[CoDel] Nichols, K. and V. Jacobson, "Controlling Queue Delay", [CoDel] Nichols, K. and V. Jacobson, "Controlling Queue Delay",
ACM Queue 10(5), May 2012, ACM Queue 10(5), May 2012,
<http://queue.acm.org/issuedetail.cfm?issue=2208917>. <http://queue.acm.org/issuedetail.cfm?issue=2208917>.
skipping to change at page 29, line 19 skipping to change at page 29, line 8
<https://www.duo.uio.no/bitstream/handle/10852/57424/ <https://www.duo.uio.no/bitstream/handle/10852/57424/
thesis-henrste.pdf?sequence=1>. thesis-henrste.pdf?sequence=1>.
[Heist21] Heist, P. and J. Morton, "L4S Tests", github README, [Heist21] Heist, P. and J. Morton, "L4S Tests", github README,
August 2021, <https://github.com/heistp/l4s- August 2021, <https://github.com/heistp/l4s-
tests/#underutilization-with-bursty-traffic>. tests/#underutilization-with-bursty-traffic>.
[I-D.briscoe-docsis-q-protection] [I-D.briscoe-docsis-q-protection]
Briscoe, B. and G. White, "The DOCSIS(r) Queue Protection Briscoe, B. and G. White, "The DOCSIS(r) Queue Protection
Algorithm to Preserve Low Latency", Work in Progress, Algorithm to Preserve Low Latency", Work in Progress,
Internet-Draft, draft-briscoe-docsis-q-protection-01, 17 Internet-Draft, draft-briscoe-docsis-q-protection-02, 31
December 2021, <https://datatracker.ietf.org/doc/html/ January 2022, <https://datatracker.ietf.org/doc/html/
draft-briscoe-docsis-q-protection-01>. draft-briscoe-docsis-q-protection-02>.
[I-D.briscoe-iccrg-prague-congestion-control] [I-D.briscoe-iccrg-prague-congestion-control]
Schepper, K. D., Tilmans, O., and B. Briscoe, "Prague Schepper, K. D., Tilmans, O., and B. Briscoe, "Prague
Congestion Control", Work in Progress, Internet-Draft, Congestion Control", Work in Progress, Internet-Draft,
draft-briscoe-iccrg-prague-congestion-control-00, 9 March draft-briscoe-iccrg-prague-congestion-control-00, 9 March
2021, <https://datatracker.ietf.org/doc/html/draft- 2021, <https://datatracker.ietf.org/doc/html/draft-
briscoe-iccrg-prague-congestion-control-00>. briscoe-iccrg-prague-congestion-control-00>.
[I-D.briscoe-tsvwg-l4s-diffserv] [I-D.briscoe-tsvwg-l4s-diffserv]
Briscoe, B., "Interactions between Low Latency, Low Loss, Briscoe, B., "Interactions between Low Latency, Low Loss,
skipping to change at page 29, line 50 skipping to change at page 29, line 39
Jacobson, "BBR Congestion Control", Work in Progress, Jacobson, "BBR Congestion Control", Work in Progress,
Internet-Draft, draft-cardwell-iccrg-bbr-congestion- Internet-Draft, draft-cardwell-iccrg-bbr-congestion-
control-01, 7 November 2021, control-01, 7 November 2021,
<https://datatracker.ietf.org/doc/html/draft-cardwell- <https://datatracker.ietf.org/doc/html/draft-cardwell-
iccrg-bbr-congestion-control-01>. iccrg-bbr-congestion-control-01>.
[I-D.ietf-tsvwg-l4s-arch] [I-D.ietf-tsvwg-l4s-arch]
Briscoe, B., Schepper, K. D., Bagnulo, M., and G. White, Briscoe, B., Schepper, K. D., Bagnulo, M., and G. White,
"Low Latency, Low Loss, Scalable Throughput (L4S) Internet "Low Latency, Low Loss, Scalable Throughput (L4S) Internet
Service: Architecture", Work in Progress, Internet-Draft, Service: Architecture", Work in Progress, Internet-Draft,
draft-ietf-tsvwg-l4s-arch-14, 8 November 2021, draft-ietf-tsvwg-l4s-arch-15, 24 December 2021,
<https://datatracker.ietf.org/doc/html/draft-ietf-tsvwg- <https://datatracker.ietf.org/doc/html/draft-ietf-tsvwg-
l4s-arch-14>. l4s-arch-15>.
[L4Sdemo16] [L4Sdemo16]
Bondarenko, O., De Schepper, K., Tsang, I., and B. Bondarenko, O., De Schepper, K., Tsang, I., and B.
Briscoe, "Ultra-Low Delay for All: Live Experience, Live Briscoe, "Ultra-Low Delay for All: Live Experience, Live
Analysis", Proc. MMSYS'16 pp33:1--33:4, May 2016, Analysis", Proc. MMSYS'16 pp33:1--33:4, May 2016,
<http://dl.acm.org/citation.cfm?doid=2910017.2910633 <http://dl.acm.org/citation.cfm?doid=2910017.2910633
(videos of demos: (videos of demos:
https://riteproject.eu/dctth/#1511dispatchwg )>. https://riteproject.eu/dctth/#1511dispatchwg )>.
[L4S_5G] Willars, P., Wittenmark, E., Ronkainen, H., Östberg, C., [L4S_5G] Willars, P., Wittenmark, E., Ronkainen, H., Östberg, C.,
skipping to change at page 33, line 20 skipping to change at page 33, line 16
As a first concrete example, the pseudocode below gives the DualPI2 As a first concrete example, the pseudocode below gives the DualPI2
algorithm. DualPI2 follows the structure of the DualQ Coupled AQM algorithm. DualPI2 follows the structure of the DualQ Coupled AQM
framework in Figure 1. A simple ramp function (configured in units framework in Figure 1. A simple ramp function (configured in units
of queuing time) with unsmoothed ECN marking is used for the Native of queuing time) with unsmoothed ECN marking is used for the Native
L4S AQM. The ramp can also be configured as a step function. The L4S AQM. The ramp can also be configured as a step function. The
PI2 algorithm [PI2] is used for the Classic AQM. PI2 is an improved PI2 algorithm [PI2] is used for the Classic AQM. PI2 is an improved
variant of the PIE AQM [RFC8033]. variant of the PIE AQM [RFC8033].
The pseudocode will be introduced in two passes. The first pass The pseudocode will be introduced in two passes. The first pass
explains the core concepts, deferring handling of overload to the explains the core concepts, deferring handling of edge-cases like
second pass. To aid comparison, line numbers are kept in step overload to the second pass. To aid comparison, line numbers are
between the two passes by using letter suffixes where the longer code kept in step between the two passes by using letter suffixes where
needs extra lines. the longer code needs extra lines.
All variables are assumed to be floating point in their basic units All variables are assumed to be floating point in their basic units
(size in bytes, time in seconds, rates in bytes/second, alpha and (size in bytes, time in seconds, rates in bytes/second, alpha and
beta in Hz, and probabilities from 0 to 1. Constants expressed in k beta in Hz, and probabilities from 0 to 1. Constants expressed in k
(kilo), M (mega), G (giga), u (micro), m (milli) , %, ... are assumed (kilo), M (mega), G (giga), u (micro), m (milli) , %, ... are assumed
to be converted to their appropriate multiple or fraction to to be converted to their appropriate multiple or fraction to
represent the basic units. A real implementation that wants to use represent the basic units. A real implementation that wants to use
integer values needs to handle appropriate scaling factors and allow integer values needs to handle appropriate scaling factors and allow
accordingly appropriate resolution of its integer types (including accordingly appropriate resolution of its integer types (including
temporary internal values during calculations). temporary internal values during calculations).
skipping to change at page 36, line 22 skipping to change at page 36, line 22
8: else % ECN bits = not-ECT or ECT(0) 8: else % ECN bits = not-ECT or ECT(0)
9: cq.enqueue(pkt) 9: cq.enqueue(pkt)
10: } 10: }
Figure 3: Example Enqueue Pseudocode for DualQ Coupled PI2 AQM Figure 3: Example Enqueue Pseudocode for DualQ Coupled PI2 AQM
1: dualpi2_dequeue(lq, cq, pkt) { % Couples L4S & Classic queues 1: dualpi2_dequeue(lq, cq, pkt) { % Couples L4S & Classic queues
2: while ( lq.byt() + cq.byt() > 0 ) { 2: while ( lq.byt() + cq.byt() > 0 ) {
3: if ( scheduler() == lq ) { 3: if ( scheduler() == lq ) {
4: lq.dequeue(pkt) % Scheduler chooses lq 4: lq.dequeue(pkt) % Scheduler chooses lq
5: p'_L = laqm(lq.time()) && (lq.len() > Th_len) % Native LAQM 5: p'_L = laqm(lq.time()) % Native LAQM
6: p_L = max(p'_L, p_CL) % Combining function 6: p_L = max(p'_L, p_CL) % Combining function
7: if ( recur(lq, p_L) ) % Linear marking 7: if ( recur(lq, p_L) ) % Linear marking
8: mark(pkt) 8: mark(pkt)
9: } else { 9: } else {
10: cq.dequeue(pkt) % Scheduler chooses cq 10: cq.dequeue(pkt) % Scheduler chooses cq
11: if ( recur(cq, p_C) ) { % probability p_C = p'^2 11: if ( recur(cq, p_C) ) { % probability p_C = p'^2
12: if ( ecn(pkt) == 0 ) { % if ECN field = not-ECT 12: if ( ecn(pkt) == 0 ) { % if ECN field = not-ECT
13: drop(pkt) % squared drop 13: drop(pkt) % squared drop
14: continue % continue to the top of the while loop 14: continue % continue to the top of the while loop
15: } 15: }
skipping to change at page 38, line 49 skipping to change at page 39, line 5
Although the DCTCP paper [Alizadeh-stability] recommends an ECN Although the DCTCP paper [Alizadeh-stability] recommends an ECN
marking threshold of 0.17*RTT_typ, it also shows that the threshold marking threshold of 0.17*RTT_typ, it also shows that the threshold
can be much shallower with hardly any worse under-utilization of the can be much shallower with hardly any worse under-utilization of the
link (because the amplitude of DCTCP's sawteeth is so small). Based link (because the amplitude of DCTCP's sawteeth is so small). Based
on extensive experiments, for the public Internet the default minimum on extensive experiments, for the public Internet the default minimum
ECN marking threshold (target) in Figure 2 is considered a good ECN marking threshold (target) in Figure 2 is considered a good
compromise, even though it is significantly smaller fraction of compromise, even though it is significantly smaller fraction of
RTT_typ. RTT_typ.
A minimum marking threshold parameter (Th_len, default 1 packet) is
also necessary to ensure that the ramp does not trigger excessive
marking on slow links. Where an implementation knows the link rate,
it can set up this minimum at the time it is configured. For
instance, it would divide 1 MTU by the link rate to convert it into a
serialization time, then if the lower threshold of the Native L AQM
ramp was lower than this serialization time, it could increase the
thresholds to shift the bottom of the ramp to 2 MTU. This is the
approach used in DOCSIS [DOCSIS3.1], because the configured link rate
is dedicated to the DualQ.
In software implementations, as shown in the pseudocode, the link
rate might be shared with other queues. The second part of the
logical AND condition in Line 5 of Figure 4 caters for such cases.
Even if the outcome of the Native L4S AQM function, laqm(), is true,
it does not mark a packet unless the queue also exceeds 1 packet (but
see note later about the Linux implementation).
1: laqm(qdelay) { % Returns native L4S AQM probability 1: laqm(qdelay) { % Returns native L4S AQM probability
2: if (qdelay >= maxTh) 2: if (qdelay >= maxTh)
3: return 1 3: return 1
4: else if (qdelay > minTh) 4: else if (qdelay > minTh)
5: return (qdelay - minTh)/range % Divide could use a bit-shift 5: return (qdelay - minTh)/range % Divide could use a bit-shift
6: else 6: else
7: return 0 7: return 0
8: } 8: }
Figure 5: Example Pseudocode for the Native L4S AQM Figure 5: Example Pseudocode for the Native L4S AQM
skipping to change at page 43, line 16 skipping to change at page 43, line 5
Tupdate dependent on p'. Instead, in PI2, alpha and beta are Tupdate dependent on p'. Instead, in PI2, alpha and beta are
independent of p' because the squaring applied to Classic traffic independent of p' because the squaring applied to Classic traffic
tunes them inherently. This is explained in [PI2], which also tunes them inherently. This is explained in [PI2], which also
explains why this more principled approach removes the need for most explains why this more principled approach removes the need for most
of the heuristics that had to be added to PIE. of the heuristics that had to be added to PIE.
Nonetheless, an implementer might wish to add selected details to Nonetheless, an implementer might wish to add selected details to
either AQM. For instance the Linux reference DualPI2 implementation either AQM. For instance the Linux reference DualPI2 implementation
includes the following (not shown in the pseudocode above): includes the following (not shown in the pseudocode above):
* The check that the queue exceeds Th_len before marking with the
native L4S AQM is actually at enqueue, not dequeue, otherwise it
would exempt the last packet of a burst from being marked. The
result of the check is conveyed from enqueue to the dequeue
function via a boolean in the packet metadata.
* Classic and coupled marking or dropping (i.e. based on p_C and * Classic and coupled marking or dropping (i.e. based on p_C and
p_CL from the PI controller) is not applied to a packet if the p_CL from the PI controller) is not applied to a packet if the
respective queue length in bytes is < 2 MTU (prior to enqueuing respective queue length in bytes is < 2 MTU (prior to enqueuing
the packet or dequeuing it, depending on whether the AQM is the packet or dequeuing it, depending on whether the AQM is
configured to be applied at enqueue or dequeue); configured to be applied at enqueue or dequeue);
* In the WRR scheduler, the 'credit' indicating which queue should * In the WRR scheduler, the 'credit' indicating which queue should
transmit is only changed if there are packets in both queues transmit is only changed if there are packets in both queues
(i.e. if there is actual resource contention). This means that a (i.e. if there is actual resource contention). This means that a
properly paced L flow might never be delayed by the WRR. The WRR properly paced L flow might never be delayed by the WRR. The WRR
skipping to change at page 44, line 30 skipping to change at page 44, line 11
that a ramp is configured in place of a step, which will allow that a ramp is configured in place of a step, which will allow
congestion control algorithms to investigate faster smoothing congestion control algorithms to investigate faster smoothing
algorithms. algorithms.
A ramp is more general that a step, because an operator can A ramp is more general that a step, because an operator can
effectively turn the ramp into a step function, as used by DCTCP, effectively turn the ramp into a step function, as used by DCTCP,
by setting the range to zero. There will not be a divide by zero by setting the range to zero. There will not be a divide by zero
problem at line 5 of Figure 5 because, if minTh is equal to problem at line 5 of Figure 5 because, if minTh is equal to
maxTh, the condition for this ramp calculation cannot arise. maxTh, the condition for this ramp calculation cannot arise.
A.2. Pass #2: Overload Details A.2. Pass #2: Edge-Case Details
Figure 7 repeats the dequeue function of Figure 4, but with overload This section takes a second pass through the pseudocode adding
details added. Similarly Figure 8 repeats the core PI algorithm of details of two edge-cases: low link rate and overload. Figure 7
Figure 6 with overload details added. The initialization, enqueue, repeats the dequeue function of Figure 4, but with details of both
L4S AQM and recur functions are unchanged. edge-cases added. Similarly Figure 8 repeats the core PI algorithm
of Figure 6, but with overload details added. The initialization,
enqueue, L4S AQM and recur functions are unchanged.
The link rate can be so low that it takes a single packet queue
longer to serialize than the threshold delay at which ECN marking
starts to be applied in the L queue. Therefore, a minimum marking
threshold parameter in units of packets rather than time is necessary
(Th_len, default 1 packet in line 19 of Figure 2) to ensure that the
ramp does not trigger excessive marking on slow links. Where an
implementation knows the link rate, it can set up this minimum at the
time it is configured. For instance, it would divide 1 MTU by the
link rate to convert it into a serialization time, then if the lower
threshold of the Native L AQM ramp was lower than this serialization
time, it could increase the thresholds to shift the bottom of the
ramp to 2 MTU. This is the approach used in DOCSIS [DOCSIS3.1],
because the configured link rate is dedicated to the DualQ.
The pseudocode given here applies where the link rate is unknown,
which is more common for software implementations that might be
deployed in scenarios where the link is shared with other queues. In
lines 5a to 5d in Figure 7 the native L4S marking probability, p'_L,
is zeroed if the queue is only 1 packet (in the default
configuration).
Linux implementation note:
* In Linux, the check that the queue exceeds Th_len before marking
with the native L4S AQM is actually at enqueue, not dequeue,
otherwise it would exempt the last packet of a burst from being
marked. The result of the check is conveyed from enqueue to the
dequeue function via a boolean in the packet metadata.
Persistent overload is deemed to have occurred when Classic drop/
marking probability reaches p_Cmax. Above this point, the Classic
drop probability is applied to both L and C queues, irrespective of
whether any packet is ECN-capable. ECT packets that are not dropped
can still be ECN-marked.
In line 10 of the initialization function (Figure 2), the maximum In line 10 of the initialization function (Figure 2), the maximum
Classic drop probability p_Cmax = min(1/k^2, 1) or 1/4 for the Classic drop probability p_Cmax = min(1/k^2, 1) or 1/4 for the
default coupling factor k=2. p_Cmax is the point at which it is default coupling factor k=2. In practice, 25% has been found to be a
deemed that the Classic queue has become persistently overloaded, so good threshold to preserve fairness between ECN capable and non ECN
it switches to using drop, even for ECN-capable packets. ECT packets capable traffic. This protects the queues against both temporary
that are not dropped can still be ECN-marked. overload from responsive flows and more persistent overload from any
unresponsive traffic that falsely claims to be responsive to ECN.
In practice, 25% has been found to be a good threshold to preserve
fairness between ECN capable and non ECN capable traffic. This
protects the queues against both temporary overload from responsive
flows and more persistent overload from any unresponsive traffic that
falsely claims to be responsive to ECN.
When the Classic ECN marking probability reaches the p_Cmax threshold When the Classic ECN marking probability reaches the p_Cmax threshold
(1/k^2), the marking probability coupled to the L4S queue, p_CL will (1/k^2), the marking probability coupled to the L4S queue, p_CL will
always be 100% for any k (by equation (1) in Section 2). So, for always be 100% for any k (by equation (1) in Section 2). So, for
readability, the constant p_Lmax is defined as 1 in line 22 of the readability, the constant p_Lmax is defined as 1 in line 22 of the
initialization function (Figure 2). This is intended to ensure that initialization function (Figure 2). This is intended to ensure that
the L4S queue starts to introduce dropping once ECN-marking saturates the L4S queue starts to introduce dropping once ECN-marking saturates
at 100% and can rise no further. The 'Prague L4S' at 100% and can rise no further. The 'Prague L4S'
requirements [I-D.ietf-tsvwg-ecn-l4s-id] state that, when an L4S requirements [I-D.ietf-tsvwg-ecn-l4s-id] state that, when an L4S
congestion control detects a drop, it falls back to a response that congestion control detects a drop, it falls back to a response that
coexists with 'Classic' Reno congestion control. So it is correct coexists with 'Classic' Reno congestion control. So it is correct
that, when the L4S queue drops packets, it drops them proportional to that, when the L4S queue drops packets, it drops them proportional to
p'^2, as if they are Classic packets. p'^2, as if they are Classic packets.
Both these switch-overs are triggered by the tests for overload The two queues each test for overload in lines 4b and 12b of the
introduced in lines 4b and 12b of the dequeue function (Figure 7). dequeue function (Figure 7). Lines 8c to 8g drop L4S packets with
Lines 8c to 8g drop L4S packets with probability p'^2. Lines 8h to probability p'^2. Lines 8h to 8i mark the remaining packets with
8i mark the remaining packets with probability p_CL. Given p_Lmax = probability p_CL. Given p_Lmax = 1, all remaining packets will be
1, all remaining packets will be marked because, to have reached the marked because, to have reached the else block at line 8b, p_CL >= 1.
else block at line 8b, p_CL >= 1.
Lines 2c to 2d in the core PI algorithm (Figure 8) deal with overload Lines 2c to 2d in the core PI algorithm (Figure 8) deal with overload
of the L4S queue when there is no Classic traffic. This is of the L4S queue when there is no Classic traffic. This is
necessary, because the core PI algorithm maintains the appropriate necessary, because the core PI algorithm maintains the appropriate
drop probability to regulate overload, but it depends on the length drop probability to regulate overload, but it depends on the length
of the Classic queue. If there is no Classic queue the naive PI of the Classic queue. If there is no Classic queue the naive PI
update function in Figure 6 would drop nothing, even if the L4S queue update function in Figure 6 would drop nothing, even if the L4S queue
were overloaded - so tail drop would have to take over (lines 2 and 3 were overloaded - so tail drop would have to take over (lines 2 and 3
of Figure 3). of Figure 3).
skipping to change at page 45, line 41 skipping to change at page 46, line 9
Figure 8 keeps delay on target using drop. If the test at line 2a of Figure 8 keeps delay on target using drop. If the test at line 2a of
Figure 8 finds that the Classic queue is empty, line 2d measures the Figure 8 finds that the Classic queue is empty, line 2d measures the
current queue delay using the L4S queue instead. While the L4S queue current queue delay using the L4S queue instead. While the L4S queue
is not overloaded, its delay will always be tiny compared to the is not overloaded, its delay will always be tiny compared to the
target Classic queue delay. So p_CL will be driven to zero, and the target Classic queue delay. So p_CL will be driven to zero, and the
L4S queue will naturally be governed solely by p'_L from the native L4S queue will naturally be governed solely by p'_L from the native
L4S AQM (lines 5 and 6 of the dequeue algorithm in Figure 7). But, L4S AQM (lines 5 and 6 of the dequeue algorithm in Figure 7). But,
if unresponsive L4S source(s) cause overload, the DualQ transitions if unresponsive L4S source(s) cause overload, the DualQ transitions
smoothly to L4S marking based on the PI algorithm. If overload smoothly to L4S marking based on the PI algorithm. If overload
increases further, it naturally transitions from marking to dropping increases further, it naturally transitions from marking to dropping
by the switch-over mechanism already described. by the mechanism already described.
1: dualpi2_dequeue(lq, cq, pkt) { % Couples L4S & Classic queues 1: dualpi2_dequeue(lq, cq, pkt) { % Couples L4S & Classic queues
2: while ( lq.byt() + cq.byt() > 0 ) { 2: while ( lq.byt() + cq.byt() > 0 ) {
3: if ( scheduler() == lq ) { 3: if ( scheduler() == lq ) {
4a: lq.dequeue(pkt) % L4S scheduled 4a: lq.dequeue(pkt) % L4S scheduled
4b: if ( p_CL < p_Lmax ) { % Check for overload saturation 4b: if ( p_CL < p_Lmax ) { % Check for overload saturation
5: p'_L = laqm(lq.time()) && (lq.len()>Th_len) % Native LAQM 5a: if (lq.len()>Th_len) % >1 packet queued
5b: p'_L = laqm(lq.time()) % Native LAQM
5c: else
5d: p'_L = 0 % Suppress marking 1 pkt queue
6: p_L = max(p'_L, p_CL) % Combining function 6: p_L = max(p'_L, p_CL) % Combining function
7: if ( recur(lq, p_L) %Linear marking 7: if ( recur(lq, p_L) %Linear marking
8a: mark(pkt) 8a: mark(pkt)
8b: } else { % overload saturation 8b: } else { % overload saturation
8c: if ( recur(lq, p_C) ) { % probability p_C = p'^2 8c: if ( recur(lq, p_C) ) { % probability p_C = p'^2
8e: drop(pkt) % revert to Classic drop due to overload 8e: drop(pkt) % revert to Classic drop due to overload
8f: continue % continue to the top of the while loop 8f: continue % continue to the top of the while loop
8g: } 8g: }
8h: if ( recur(lq, p_CL) ) % probability p_CL = k * p' 8h: if ( recur(lq, p_CL) ) % probability p_CL = k * p'
8i: mark(pkt) % linear marking of remaining packets 8i: mark(pkt) % linear marking of remaining packets
skipping to change at page 46, line 39 skipping to change at page 46, line 48
15: } 15: }
16: mark(pkt) % squared mark 16: mark(pkt) % squared mark
17: } 17: }
18: } 18: }
19: return(pkt) % return the packet and stop 19: return(pkt) % return the packet and stop
20: } 20: }
21: return(NULL) % no packet to dequeue 21: return(NULL) % no packet to dequeue
22: } 22: }
Figure 7: Example Dequeue Pseudocode for DualQ Coupled PI2 AQM Figure 7: Example Dequeue Pseudocode for DualQ Coupled PI2 AQM
(Including Overload Code) (Including Code for Edge-Cases)
1: dualpi2_update(lq, cq) { % Update p' every Tupdate 1: dualpi2_update(lq, cq) { % Update p' every Tupdate
2a: if ( cq.byt() > 0 ) 2a: if ( cq.byt() > 0 )
2b: curq = cq.time() %use queuing time of first-in Classic packet 2b: curq = cq.time() %use queuing time of first-in Classic packet
2c: else % Classic queue empty 2c: else % Classic queue empty
2d: curq = lq.time() % use queuing time of first-in L4S packet 2d: curq = lq.time() % use queuing time of first-in L4S packet
3: p' = p' + alpha * (curq - target) + beta * (curq - prevq) 3: p' = p' + alpha * (curq - target) + beta * (curq - prevq)
4: p_CL = p' * k % Coupled L4S prob = base prob * coupling factor 4: p_CL = p' * k % Coupled L4S prob = base prob * coupling factor
5: p_C = p'^2 % Classic prob = (base prob)^2 5: p_C = p'^2 % Classic prob = (base prob)^2
6: prevq = curq 6: prevq = curq
skipping to change at page 58, line 38 skipping to change at page 58, line 38
k* = 1.64 * (R_C / R_L) (7) k* = 1.64 * (R_C / R_L) (7)
We say that this coupling factor is theoretical, because it is in We say that this coupling factor is theoretical, because it is in
terms of two RTTs, which raises two practical questions: i) for terms of two RTTs, which raises two practical questions: i) for
multiple flows with different RTTs, the RTT for each traffic class multiple flows with different RTTs, the RTT for each traffic class
would have to be derived from the RTTs of all the flows in that class would have to be derived from the RTTs of all the flows in that class
(actually the harmonic mean would be needed); ii) a network node (actually the harmonic mean would be needed); ii) a network node
cannot easily know the RTT of any of the flows anyway. cannot easily know the RTT of any of the flows anyway.
RTT-dependence is cuased by window-based congestion control, so it RTT-dependence is caused by window-based congestion control, so it
ought to be reversed there, not in the network, Therefore, we use a ought to be reversed there, not in the network. Therefore, we use a
fixed coupling factor in the network, and reduce RTT-dependence in fixed coupling factor in the network, and reduce RTT-dependence in
L4S senders. We cannot expect Classic senders to all be updated to L4S senders. We cannot expect Classic senders to all be updated to
reduce their RTT-dependence. But solely addressing the problem in reduce their RTT-dependence. But solely addressing the problem in
L4S senders at least makes RTT-dependence no worse - not just between L4S senders at least makes RTT-dependence no worse - not just between
L4S senders, but also between L4S and Classic senders. L4S senders, but also between L4S and Classic senders.
Traditionally, throughput equivalence has been defined for flows Traditionally, throughput equivalence has been defined for flows
under comparable conditions, including with the same base RTT under comparable conditions, including with the same base RTT
[RFC2914]. So if we assume the same base RTT, R_b, for comparable [RFC2914]. So if we assume the same base RTT, R_b, for comparable
flows, we can put both R_C and R_L in terms of R_b. flows, we can put both R_C and R_L in terms of R_b.
 End of changes. 29 change blocks. 
98 lines changed or deleted 98 lines changed or added

This html diff was produced by rfcdiff 1.48. The latest version is available from http://tools.ietf.org/tools/rfcdiff/