Transport Area Working Group                                  B. Briscoe
Internet-Draft                                                        BT
Updates: 2309 (if approved)                                    J. Manner
Intended status: Informational                          Aalto University
Expires: January 13, 2011                                  July 12, 2010


                Byte and Packet Congestion Notification
                 draft-ietf-tsvwg-byte-pkt-congest-02

Abstract

This memo concerns dropping or marking packets using active queue management (AQM) such as random early detection (RED) or pre-congestion notification (PCN). We give two strong recommendations: (1) packet size should be taken into account when transports read congestion indications, not when network equipment writes them, and (2) the byte-mode packet drop variant of AQM algorithms, such as RED, should not be used to drop fewer small packets.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on January 13, 2011.

Copyright Notice

Copyright (c) 2010 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
       1.1.  Terminology and Scoping
       1.2.  Why now?
   2.  Motivating Arguments
       2.1.  Scaling Congestion Control with Packet Size
       2.2.  Avoiding Perverse Incentives to (ab)use Smaller Packets
       2.3.  Small != Control
       2.4.  Implementation Efficiency
   3.  The State of the Art
       3.1.  Congestion Measurement: Status
             3.1.1.  Fixed Size Packet Buffers
             3.1.2.  Congestion Measurement without a Queue
       3.2.  Congestion Coding: Status
             3.2.1.  Network Bias when Encoding
             3.2.2.  Transport Bias when Decoding
             3.2.3.  Making Transports Robust against Control Packet
                     Losses
             3.2.4.  Congestion Coding: Summary of Status
   4.  Outstanding Issues and Next Steps
       4.1.  Bit-congestible World
       4.2.  Bit- & Packet-congestible World
   5.  Recommendation and Conclusions
       5.1.  Recommendation on Queue Measurement
       5.2.  Recommendation on Notifying Congestion
       5.3.  Recommendation on Responding to Congestion
       5.4.  Recommended Future Research
   6.  Security Considerations
   7.  Acknowledgements
   8.  Comments Solicited
   9.  References
       9.1.  Normative References
       9.2.  Informative References
   Appendix A.  Congestion Notification Definition: Further
                Justification
   Appendix B.  Idealised Wire Protocol
       B.1.  Protocol Coding
       B.2.  Example Scenarios
             B.2.1.  Notation
             B.2.2.  Bit-congestible resource, equal bit rates (Ai)
             B.2.3.  Bit-congestible resource, equal packet rates (Bi)
             B.2.4.  Pkt-congestible resource, equal bit rates (Aii)
             B.2.5.  Pkt-congestible resource, equal packet rates (Bii)
   Appendix C.  Byte-mode Drop Complicates Policing Congestion Response
   Appendix D.  Changes from Previous Versions
   Author's Address
1. Introduction

When notifying congestion, the problem of how (and whether) to take packet sizes into account has exercised the minds of researchers and practitioners for as long as active queue management (AQM) has been discussed. Indeed, one reason AQM was originally introduced was to reduce the lock-out effects that small packets can have on large packets in drop-tail queues. This memo aims to state the principles we should be using and come to conclusions on what these principles will mean for future protocol design, taking into account the deployments we have already.

The byte vs. packet dilemma arises at three stages in the congestion notification process:

Measuring congestion: When the congested resource decides locally how to measure how congested it is. (Should the queue measure its length in bytes or packets?);

Coding congestion notification into the wire protocol: When the congested resource decides whether to notify the level of congestion on each particular packet. (When a queue considers whether to notify congestion by dropping or marking a particular packet, should its decision depend on the byte-size of the particular packet being dropped or marked?);

Decoding congestion notification from the wire protocol: When the transport interprets the notification in order to decide how much to respond to congestion. (Should the transport take into account the byte-size of each missing or marked packet?).

Consensus has emerged over the years concerning the first stage: whether queues are measured in bytes or packets, termed byte-mode queue measurement or packet-mode queue measurement. This memo records this consensus in the RFC Series. In summary, the choice solely depends on whether the resource is congested by bytes or packets. The controversy is mainly around the last two stages: whether to allow for the size of the specific packet notifying congestion i) when the network encodes or ii) when the transport decodes the congestion notification.

Currently, the RFC series is silent on this matter other than a paper trail of advice referenced from [RFC2309], which conditionally recommends byte-mode (packet-size dependent) drop [pktByteEmail]. The primary purpose of this memo is to build a definitive consensus against such deliberate preferential treatment for small packets in AQM algorithms and to record this advice within the RFC series. Fortunately, none of the implementers who responded to our survey (Section 3.2.4) has followed the earlier advice, so the consensus this memo argues for seems to already exist in implementations.

The primary conclusion of this memo is that packet size should be taken into account when transports read congestion indications, not when network equipment writes them. Reducing drop of small packets has some tempting advantages: i) it drops fewer control packets, which tend to be small, and ii) it makes TCP's bit-rate less dependent on packet size. However, there are ways of addressing these issues at the transport layer, rather than reverse engineering network forwarding to fix specific transport problems. The second conclusion is that network layer algorithms like the byte-mode packet drop variant of RED should not be used to drop fewer small packets, because that creates a perverse incentive for transports to use tiny segments, consequently also opening up a DoS vulnerability.

This memo is initially concerned with how we should correctly scale congestion control functions with packet size for the long term. But it also recognises that expediency may be necessary to deal with existing widely deployed protocols that don't live up to the long term goal. It turns out that the 'correct' variant of RED to deploy seems to be the one everyone has deployed, and no-one who responded to our survey has implemented the other variant. However, at the transport layer, TCP congestion control is a widely deployed protocol that we argue doesn't scale correctly with packet size. To date this hasn't been a significant problem because most TCPs have been used with similar packet sizes. But, as we design new congestion controls, we should build in scaling with packet size rather than assuming we should follow TCP's example.

This memo continues as follows. Terminology and scoping are discussed next, and the reasons for making the recommendations in this memo now are given in Section 1.2. Motivating arguments for our advice are given in Section 2. We then survey the advice given previously in the RFC series, the research literature and the deployed legacy (Section 3) before listing outstanding issues (Section 4) that will need resolution both to inform future protocol designs and to handle legacy. We then give concrete recommendations for the way forward in Section 5. Finally, we give security considerations in Section 6. The interested reader can also find further discussion of the byte vs. packet theme in the appendices. This memo intentionally includes a non-negligible amount of material on the subject; a busy reader can jump straight to Section 5 for a summary of the recommendations for the Internet community.
1.1. Terminology and Scoping

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].

Congestion Notification: Rather than aiming to achieve what many have tried and failed to do, this memo will not try to define congestion. It will give a working definition of what congestion notification should be taken to mean for this document. Congestion notification is a changing signal that aims to communicate the ratio E/L, where E is the instantaneous excess load offered to a resource that it is either incapable of serving or unwilling to serve, and L is the instantaneous offered load. The phrase `unwilling to serve' is added because AQM systems (e.g. RED, PCN [RFC5670]) set a virtual limit smaller than the actual limit to the resource, then notify when this virtual limit is exceeded in order to avoid congestion of the actual capacity.

Note that the denominator is offered load, not capacity. Therefore congestion notification is a real number bounded by the range [0,1]. This ties in with the most well-understood measure of congestion notification: drop fraction (often loosely called loss rate). It also means that congestion has a natural interpretation as a probability: the probability of offered traffic not being served (or being marked as at risk of not being served). Appendix A describes a further incidental benefit that arises from using load as the denominator of congestion notification.
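To make the working definition concrete, the short program below computes E/L for one set of invented figures (a sketch only; the link, its 8 Mb/s virtual limit and the 12 Mb/s offered load are purely illustrative, not drawn from any measurement or specification).

   #include <stdio.h>

   int main(void)
   {
       /* Illustrative numbers only: a link whose AQM notifies
        * congestion against a virtual limit of 8 Mb/s, currently
        * offered 12 Mb/s.                                           */
       double offered_load  = 12e6; /* L: instantaneous offered load */
       double virtual_limit =  8e6; /* load the AQM will serve       */

       /* E: excess load the resource is unwilling to serve */
       double excess = offered_load - virtual_limit;
       if (excess < 0)
           excess = 0;

       /* Congestion notification = E/L, a real number in [0,1],
        * interpretable as the drop (or marking) fraction.           */
       double congestion = excess / offered_load;

       printf("congestion notification = %.3f\n", congestion); /* 0.333 */
       return 0;
   }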
Explicit and Implicit Notification: The byte vs. packet dilemma concerns congestion notification irrespective of whether it is signalled implicitly by drop or using explicit congestion notification (ECN [RFC3168] or PCN [RFC5670]). Throughout this document, unless clear from the context, the term marking will be used to mean notifying congestion explicitly, while congestion notification will be used to mean notifying congestion either implicitly by drop or explicitly by marking.

Bit-congestible vs. Packet-congestible: If the load on a resource depends on the rate at which packets arrive, it is called packet-congestible. If the load depends on the rate at which bits arrive it is called bit-congestible.

Examples of packet-congestible resources are route look-up engines and firewalls, because load depends on how many packet headers they have to process. Examples of bit-congestible resources are transmission links, radio power and most buffer memory, because the load depends on how many bits they have to transmit or store. Some machine architectures use fixed size packet buffers, so buffer memory in these cases is packet-congestible (see Section 3.1.1).

Currently a design goal of network processing equipment such as routers and firewalls is to keep packet processing uncongested even under worst case bit rates with minimum packet sizes. Therefore, packet-congestion is currently rare, but there is no guarantee that it will not become common with future technology trends.

Note that information is generally processed or transmitted with a minimum granularity greater than a bit (e.g. octets). The appropriate granularity for the resource in question should be used, but for the sake of brevity we will talk in terms of bytes in this memo.
Coarser granularity: Resources may be congestible at higher levels of granularity than packets, for instance stateful firewalls are flow-congestible and call-servers are session-congestible. This memo focuses on congestion of connectionless resources, but the same principles may be applicable for congestion notification protocols controlling per-flow and per-session processing or state.

RED Terminology: In RED, whether to use packets or bytes when measuring queues is respectively called packet-mode or byte-mode queue measurement. And if the probability of dropping a packet depends on its byte-size it is called byte-mode drop, whereas if the drop probability is independent of a packet's byte-size it is called packet-mode drop.
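To make the RED terminology concrete, the sketch below shows how each drop variant would decide the fate of an arriving packet, assuming a base drop probability p has already been derived from the queue length; the function names and the MAX_PKT_SIZE constant are ours, purely for illustration, and are not taken from any particular RED implementation.

   #include <stdbool.h>
   #include <stdlib.h>

   #define MAX_PKT_SIZE 1500   /* bytes; illustrative value */

   /* Uniform random number in [0,1) */
   static double urand(void) { return rand() / (RAND_MAX + 1.0); }

   /* Packet-mode drop: probability p, independent of packet size. */
   bool drop_packet_mode(double p, unsigned pkt_size)
   {
       (void)pkt_size;             /* size deliberately ignored */
       return urand() < p;
   }

   /* Byte-mode drop (the variant this memo argues against):
    * probability scaled down in proportion to the packet's size. */
   bool drop_byte_mode(double p, unsigned pkt_size)
   {
       return urand() < p * (double)pkt_size / MAX_PKT_SIZE;
   }

With p = 0.1, a 1500B packet faces a 10% drop probability under either variant, but a 60B packet faces only 0.4% under byte-mode drop--the preferential treatment of small packets that the rest of this memo argues against.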
1.2. Why now?

Now is a good time to discuss whether fairness between different sized packets would best be implemented in the network layer, or at the transport, for a number of reasons:

1. The packet vs. byte issue requires speedy resolution because the IETF pre-congestion notification (PCN) working group is standardising the external behaviour of a PCN congestion notification (AQM) algorithm [RFC5670];

2. [RFC2309] says RED may either take account of packet size or not when dropping, but gives no recommendation between the two, referring instead to advice on the performance implications in an email [pktByteEmail], which recommends byte-mode drop. Further, just before RFC2309 was issued, an addendum was added to the archived email that revisited the issue of packet vs. byte-mode drop in its last paragraph, making the recommendation less clear-cut;

3. Without the present memo, the only advice in the RFC series on packet size bias in AQM algorithms would be a reference to an archived email in [RFC2309] (including an addendum at the end of the email to correct the original);

4. The IRTF Internet Congestion Control Research Group (ICCRG) recently took on the challenge of building consensus on what common congestion control support should be required from network forwarding functions in future [I-D.irtf-iccrg-welzl]. The wider Internet community needs to discuss whether the complexity of adjusting for packet size should be in the network or in transports;

5. Given there are many good reasons why larger path max transmission units (PMTUs) would help solve a number of scaling issues, we don't want to create any bias against large packets that is greater than their true cost;

6. The IETF has started to consider the question of fairness between flows that use different packet sizes (e.g. in the small-packet variant of TCP-friendly rate control, TFRC-SP [RFC4828]). Given transports with different packet sizes, if we don't decide whether the network or the transport should allow for packet size, it will be hard if not impossible to design any transport protocol so that its bit-rate relative to other transports meets design guidelines [RFC5033]. (Note however that, if the concern were fairness between users rather than between flows [Rate_fair_Dis], relative rates between flows would have to come under run-time control rather than being embedded in protocol designs.)

2. Motivating Arguments

2.1. Scaling Congestion Control with Packet Size

There are two ways of interpreting a dropped or marked packet. It can either be considered as a single loss event or as loss/marking of the bytes in the packet. Here we try to design a test to see which approach scales with packet size.

Given bit-congestible is the more common case (see Section 1.1), consider a bit-congestible link shared by many flows, so that each busy period tends to cause packets to be lost from different flows. The test compares two identical scenarios with the same applications, the same numbers of sources and the same load. But the sources break the load into large packets in one scenario and small packets in the other. Of course, because the load is the same, there will be proportionately more packets in the small packet case.

The test of whether a congestion control scales with packet size is that it should respond in the same way to the same congestion excursion, irrespective of the size of the packets that the bytes causing congestion happen to be broken down into.

A bit-congestible queue suffering a congestion excursion has to drop or mark the same excess bytes whether they are in a few large packets or many small packets. So for the same congestion excursion, the same number of bytes has to be shed to get the load back to its operating point. But, of course, for smaller packets more packets will have to be discarded to shed the same bytes.

If all the transports interpret each drop/mark as a single loss event irrespective of the size of the packet dropped, those with smaller packets will respond more to the same congestion excursion, failing our test. On the other hand, if they respond proportionately less when smaller packets are dropped/marked, overall they will be able to respond the same to the same congestion excursion. Therefore, for a congestion control to scale with packet size it should respond to dropped or marked bytes (as TFRC-SP [RFC4828] effectively does), not just to dropped or marked packets irrespective of packet size (as TCP does).
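The test can be illustrated numerically. The sketch below (illustrative figures only) sheds the same 15,000 excess bytes from a flow of 1500B packets and from a flow of 60B packets, then shows what a transport would register if it counted loss events versus lost bytes.

   #include <stdio.h>

   int main(void)
   {
       unsigned excess_bytes = 15000;     /* same congestion excursion */
       unsigned sizes[2] = { 1500, 60 };  /* large vs. small packets   */

       for (int i = 0; i < 2; i++) {
           unsigned drops = excess_bytes / sizes[i]; /* packets shed   */

           /* A transport counting loss events sees 25x more congestion
            * for the small-packet flow; one counting lost bytes sees
            * the same congestion either way.                          */
           printf("%4uB packets: %3u loss events, %5u lost bytes\n",
                  sizes[i], drops, drops * sizes[i]);
       }
       return 0;
   }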
The email [pktByteEmail] referred to by RFC2309 says the question of whether a packet's own size should affect its drop probability "depends on the dominant end-to-end congestion control mechanisms". But we argue the network layer should not be optimised for whatever transport is predominant. TCP congestion control ensures that flows competing for the same resource each maintain the same number of segments in flight, irrespective of segment size. So under similar conditions, flows with different segment sizes will get different bit rates. But even though reducing the drop probability of small packets helps ensure TCPs with different packet sizes will achieve similar bit rates, we argue this correction should be made to TCP itself, not to the network in order to fix one transport, no matter how prominent it is. Effectively, favouring small packets is reverse engineering of the network layer around TCP, contrary to the excellent advice in [RFC3426], which asks designers to question "Why are you proposing a solution at this layer of the protocol stack, rather than at another layer?"

2.2. Avoiding Perverse Incentives to (ab)use Smaller Packets

Increasingly, it is being recognised that a protocol design must take care not to cause unintended consequences by giving the parties in the protocol exchange perverse incentives [Evol_cc][RFC3426]. Again, imagine a scenario where the same bit rate of packets will contribute the same to bit-congestion of a link irrespective of whether it is sent as fewer larger packets or more smaller packets. A protocol design that caused larger packets to be more likely to be dropped than smaller ones would be dangerous in this case:

Normal transports: Even if a transport is not actually malicious, if it finds small packets go faster, over time it will tend to act in its own interest and use them. Queues that give advantage to small packets create an evolutionary pressure for transports to send at the same bit-rate but break their data stream down into tiny segments to reduce their drop rate. Encouraging a high volume of tiny packets might in turn unnecessarily overload a completely unrelated part of the system, perhaps more limited by header-processing than bandwidth.

Malicious transports: A queue that gives an advantage to small packets can be used to amplify the force of a flooding attack. By sending a flood of small packets, the attacker can get the queue to discard more traffic in large packets, allowing more attack traffic to get through to cause further damage. Such a queue allows attack traffic to have a disproportionately large effect on regular traffic without the attacker having to do much work.

Note that, although the byte-mode drop variant of RED amplifies small packet attacks, drop-tail queues amplify small packet attacks even more (see Security Considerations in Section 6). Wherever possible neither should be used.
Imagine two unresponsive flows arrive at a bit-congestible transmission link each with the same bit rate, say 1Mbps, but one consists of 1500B and the other 60B packets, which are 25x smaller. Consider a scenario where gentle RED [gentle_RED] is used, along with the variant of RED we advise against, i.e. where the RED algorithm is configured to adjust the drop probability of packets in proportion to each packet's size (byte-mode packet drop). In this case, if RED drops 25% of the larger packets, it will aim to drop 1% of the smaller packets (but in practice it may drop more as congestion increases [RFC4828](S.B.4)). Even though both flows arrive with the same bit rate, the bit rate the RED queue aims to pass to the line will be 750k for the flow of larger packets but 990k for the smaller packets (but because of rate variation it will be less than this target). It can be seen that this behaviour reopens the same denial of service vulnerability that drop tail queues offer to floods of small packets, though not necessarily as strongly (see Section 6).
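The figures in this scenario can be reproduced with a short calculation, assuming an idealised byte-mode drop whose probability is exactly proportional to packet size and ignoring the extra variation with congestion noted above.

   #include <stdio.h>

   int main(void)
   {
       double arrival_rate = 1e6;   /* each flow arrives at 1 Mb/s    */
       double p_large      = 0.25;  /* RED drops 25% of 1500B packets */

       /* Byte-mode drop scales the probability by packet size. */
       double p_small = p_large * 60.0 / 1500.0;         /* = 0.01    */

       printf("small-packet drop probability: %.2f\n", p_small);
       printf("1500B flow passed to line: %.0f b/s\n",
              arrival_rate * (1 - p_large));             /* 750000    */
       printf("  60B flow passed to line: %.0f b/s\n",
              arrival_rate * (1 - p_small));             /* 990000    */
       return 0;
   }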
2.3. Small != Control

It is tempting to drop small packets with lower probability to improve performance, because many control packets are small (TCP SYNs & ACKs, DNS queries & responses, SIP messages, HTTP GETs, etc) and dropping fewer control packets considerably improves performance. However, we must not give control packets preference purely by virtue of their smallness, otherwise it is too easy for any data source to get the same preferential treatment simply by sending data in smaller packets. Again we should not create perverse incentives to favour small packets rather than to favour control packets, which is what we intend. Just because many control packets are small does not mean all small packets are control packets.

So again, rather than fix these problems in the network layer, we argue that the transport should be made more robust against losses of control packets (see 'Making Transports Robust against Control Packet Losses' in Section 3.2.3).

2.4. Implementation Efficiency

Allowing for packet size at the transport rather than in the network ensures that neither the network nor the transport needs to do a multiply operation--multiplication by packet size is effectively achieved as a repeated add when the transport adds to its count of marked bytes as each congestion event is fed to it. This isn't a principled reason in itself, but it is a happy consequence of the other principled reasons.

3. The State of the Art

The original 1993 paper on RED [RED93] proposed two options for the RED active queue management algorithm: packet mode and byte mode. Packet mode measured the queue length in packets and dropped (or marked) individual packets with a probability independent of their size. Byte mode measured the queue length in bytes and marked an individual packet with probability in proportion to its size (relative to the maximum packet size). In the paper's outline of further work, it was stated that no recommendation had been made on whether the queue size should be measured in bytes or packets, but noted that the difference could be significant.

When RED was recommended for general deployment in 1998 [RFC2309], the two modes were mentioned implying the choice between them was a question of performance, referring to a 1997 email [pktByteEmail] for advice on tuning. This email clarified that there were in fact two orthogonal choices: whether to measure queue length in bytes or packets (Section 3.1 below) and whether the drop probability of an individual packet should depend on its own size (Section 3.2 below).

3.1. Congestion Measurement: Status

The choice of which metric to use to measure queue length was left open in RFC2309. It is now well understood that queues for bit-congestible resources should be measured in bytes, and queues for packet-congestible resources should be measured in packets.
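In other words, the queue simply needs to maintain its length in the unit that matches what congests the resource. A minimal sketch (structure and function names are ours, for illustration only):

   /* Byte-mode vs. packet-mode queue measurement: the only difference
    * is the unit of the counter that the AQM algorithm later reads.   */
   struct queue_meas {
       unsigned long len_bytes;    /* for a bit-congestible resource    */
       unsigned long len_packets;  /* for a packet-congestible resource */
   };

   static void on_enqueue(struct queue_meas *q, unsigned pkt_size)
   {
       q->len_bytes   += pkt_size;
       q->len_packets += 1;
   }

   static void on_dequeue(struct queue_meas *q, unsigned pkt_size)
   {
       q->len_bytes   -= pkt_size;
       q->len_packets -= 1;
   }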
Where buffers are not configured or legacy buffers cannot be configured to the above guideline, we do not have to make allowances for such legacy in future protocol design. If a bit-congestible buffer is measured in packets, the operator will have set the thresholds mindful of a typical mix of packet sizes. Any AQM algorithm on such a buffer will be oversensitive to high proportions of small packets, e.g. a DoS attack, and undersensitive to high proportions of large packets. But an operator can safely keep such a legacy buffer because any undersensitivity during unusual traffic mixes cannot lead to congestion collapse given the buffer will eventually revert to tail drop, discarding proportionately more large packets.

Some modern queue implementations give a choice for setting RED's thresholds in byte-mode or packet-mode. This may merely be an administrator-interface preference, not altering how the queue itself is measured, but on some hardware it does actually change the way it measures its queue. Whether a resource is bit-congestible or packet-congestible is a property of the resource, so an admin should not ever need to, or be able to, configure the way a queue measures itself.

We believe the question of whether to measure queues in bytes or packets is fairly well understood these days. The only outstanding issues concern how to measure congestion when the queue is bit-congestible but the resource is packet-congestible or vice versa. There is no controversy over what should be done. It's just that you have to be an expert in probability to work out what should be done (summarised in the following section) and, even if you have, it's not always easy to find a practical algorithm to implement it.

3.1.1. Fixed Size Packet Buffers

Some, mostly older, queuing hardware sets aside fixed sized buffers in which to store each packet in the queue. Also, with some hardware, any fixed sized buffers not completely filled by a packet are padded when transmitted to the wire. If we imagine a theoretical forwarding system with both queuing and transmission in fixed, MTU-sized units, it should clearly be treated as packet-congestible, because the queue length in packets would be a good model of congestion of the lower layer link.

If we now imagine a hybrid forwarding system with transmission delay largely dependent on the byte-size of packets but buffers of one MTU per packet, it should strictly require a more complex algorithm to determine the probability of congestion.
It should be treated as two resources in sequence, where the sum of the byte-sizes of the packets within each packet buffer models congestion of the line while the length of the queue in packets models congestion of the queue. Then the probability of congesting the forwarding buffer would be a conditional probability--conditional on the previously calculated probability of congesting the line.

In systems that use fixed size buffers, it is unusual for all the buffers used by an interface to be the same size. Typically pools of different sized buffers are provided (Cisco uses the term 'buffer carving' for the process of dividing up memory into these pools [IOSArch]). Usually, if the pool of small buffers is exhausted, arriving small packets can borrow space in the pool of large buffers, but not vice versa. However, it is easier to work out what should be done if we temporarily set aside the possibility of such borrowing. Then, with fixed pools of buffers for different sized packets and no borrowing, the size of each pool and the current queue length in each pool would both be measured in packets. So an AQM algorithm would have to maintain the queue length for each pool, and judge whether to drop/mark a packet of a particular size by looking at the pool for packets of that size and using the length (in packets) of its queue.

We now return to the issue we temporarily set aside: small packets borrowing space in larger buffers. In this case, the only difference is that the pools for smaller packets have a maximum queue size that includes all the pools for larger packets. And every time a packet takes a larger buffer, the current queue size has to be incremented for all queues in the pools of buffers less than or equal to the buffer size used.

We will return to borrowing of fixed sized buffers when we discuss biasing the drop/marking probability of a specific packet because of its size in Section 3.2.1. But here we can give a simple summary of the present discussion on how to measure the length of queues of fixed buffers: no matter how complicated the scheme is, ultimately any fixed buffer system will need to measure its queue length in packets not bytes.
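The bookkeeping just described for fixed-size buffer pools with borrowing can be sketched as follows; the three pool sizes are invented for illustration and real buffer-carving implementations will differ.

   #define NPOOLS 3

   /* Buffer sizes of each pool, smallest first (illustrative values). */
   static const unsigned pool_buf_size[NPOOLS] = { 128, 512, 1600 };

   /* Queue length of each pool, measured in packets (never bytes). */
   static unsigned pool_qlen[NPOOLS];

   /* A packet that takes a buffer of size 'buf_size' occupies space
    * that any smaller packet could also have used, so the queue
    * length is incremented for every pool of buffers less than or
    * equal to the buffer size used.                                  */
   static void account_enqueue(unsigned buf_size)
   {
       for (int i = 0; i < NPOOLS; i++)
           if (pool_buf_size[i] <= buf_size)
               pool_qlen[i]++;
   }

   static void account_dequeue(unsigned buf_size)
   {
       for (int i = 0; i < NPOOLS; i++)
           if (pool_buf_size[i] <= buf_size)
               pool_qlen[i]--;
   }

An AQM algorithm would then judge whether to drop or mark a packet of a given size against the length (in packets) of the pool that size maps to.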
3.1.2. Congestion Measurement without a Queue

AQM algorithms are nearly always described assuming there is a queue for a congested resource and the algorithm can use the queue length to determine the probability that it will drop or mark each packet. But not all congested resources lead to queues. For instance, wireless spectrum is bit-congestible (for a given coding scheme), because interference increases with the rate at which bits are transmitted. But wireless link protocols do not always maintain a queue that depends on spectrum interference. Similarly, power limited resources are also usually bit-congestible if energy is primarily required for transmission rather than header processing, but it is rare for a link protocol to build a queue as it approaches maximum power.

Nonetheless, AQM algorithms do not require a queue in order to work. For instance spectrum congestion can be modelled by signal quality using target bit-energy-to-noise-density ratio. And, to model radio power exhaustion, transmission power levels can be measured and compared to the maximum power available. [ECNFixedWireless] proposes a practical and theoretically sound way to combine congestion notification for different bit-congestible resources at different layers along an end to end path, whether wireless or wired, and whether with or without queues.

3.2. Congestion Coding: Status

3.2.1. Network Bias when Encoding

The previously mentioned email [pktByteEmail] referred to by [RFC2309] gave advice we now disagree with. It said that drop probability should depend on the size of the packet being considered for drop if the resource is bit-congestible, but not if it is packet-congestible, but advised that most scarce resources in the Internet were currently bit-congestible. The argument continued that if packet drops were inflated by packet size (byte-mode dropping), "a flow's fraction of the packet drops is then a good indication of that flow's fraction of the link bandwidth in bits per second". This was consistent with a referenced policing mechanism being worked on at the time for detecting unusually high bandwidth flows, eventually published in 1999 [pBox]. [The problem could and should have been solved by making the policing mechanism count the volume of bytes randomly dropped, not the number of packets.]

A few months before RFC2309 was published, an addendum was added to the above archived email referenced from the RFC, in which the final paragraph seemed to partially retract what had previously been said. It clarified that the question of whether the probability of dropping/marking a packet should depend on its size was not related to whether the resource itself was bit-congestible, but a completely orthogonal question. However, the only example given had the queue measured in packets but packet drop depended on the byte-size of the packet in question. No example was given the other way round.
In 2000, Cnodder et al [REDbyte] pointed out that there was an error in the part of the original 1993 RED algorithm that aimed to distribute drops uniformly, because it didn't correctly take into account the adjustment for packet size. They recommended an algorithm called RED_4 to fix this. But they also recommended a further change, RED_5, to adjust drop rate dependent on the square of relative packet size. This was indeed consistent with one implied motivation behind RED's byte-mode drop--that we should reverse engineer the network to improve the performance of dominant end-to-end congestion control mechanisms.

By 2003, a further change had been made to the adjustment for packet size, this time in the RED algorithm of the ns2 simulator. Instead of taking each packet's size relative to a `maximum packet size' it was taken relative to a `mean packet size', intended to be a static value representative of the `typical' packet size on the link. We have not been able to find a justification for this change in the literature; however, Eddy and Allman conducted experiments [REDbias] that assessed how sensitive RED was to this parameter, amongst other things. No-one seems to have pointed out that this changed algorithm can often lead to drop probabilities of greater than 1 [which should ring alarm bells hinting that there's a mistake in the theory somewhere]. On 10-Nov-2004, this variant of byte-mode packet drop was made the default in the ns2 simulator.
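The problem with normalising to a `mean packet size' is easy to demonstrate with invented numbers: as soon as a packet is larger than the mean packet size divided by the base probability, the scaled value exceeds 1 and is no longer a probability.

   #include <stdio.h>

   int main(void)
   {
       double p             = 0.4;   /* base drop probability from RED */
       double mean_pkt_size = 500;   /* configured 'typical' size (B)  */
       double pkt_size      = 1500;  /* actual packet being considered */

       /* ns2-style byte-mode adjustment relative to the mean size. */
       double p_adjusted = p * pkt_size / mean_pkt_size;

       printf("adjusted drop probability = %.2f\n", p_adjusted); /* 1.20 */
       return 0;
   }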
The byte-mode drop variant of RED is, of course, not the only possible bias towards small packets in queueing algorithms. We have already mentioned that tail-drop queues naturally tend to lock out large packets once they are full. But also queues with fixed sized buffers reduce the probability that small packets will be dropped if (and only if) they allow small packets to borrow buffers from the pools for larger packets. As was explained in Section 3.1.1 on fixed size buffer carving, borrowing effectively makes the maximum queue size for small packets greater than that for large packets, because more buffers can be used by small packets while fewer will fit large packets.

In itself, the bias towards small packets caused by buffer borrowing is perfectly correct. Lower drop probability for small packets is legitimate in buffer borrowing schemes, because small packets genuinely congest the machine's buffer memory less than large packets, given they can fit in more spaces. The bias towards small packets is not artificially added (as it is in RED's byte-mode drop algorithm); it merely reflects the reality of the way fixed buffer memory gets congested. Incidentally, the bias towards small packets from buffer borrowing is nothing like as large as that of RED's byte-mode drop.

Nonetheless, fixed-buffer memory with tail drop is still prone to lock out large packets, purely because of the tail-drop aspect. So a good AQM algorithm like RED with packet-mode drop should be used with fixed buffer memories where possible. If RED is too complicated to implement with multiple fixed buffer pools, the minimum necessary to prevent large packet lock-out is to ensure smaller packets never use the last available buffer in any of the pools for larger packets.
3.2.2.  Transport Bias when Decoding

The above proposals to alter the network equipment to bias towards smaller packets have largely carried on outside the IETF process (unless one counts a reference in an informational RFC to an archived email!).  Whereas, within the IETF, there are many different proposals to alter transport protocols to achieve the same goals, i.e. either to make the flow bit-rate take account of packet size, or to protect control packets from loss.  This memo argues that altering transport protocols is the more principled approach.

A recently approved experimental RFC adapts its transport layer protocol to take account of packet sizes relative to typical TCP packet sizes.  This proposes a new small-packet variant of TCP-friendly rate control [RFC3448] called TFRC-SP [RFC4828].  Essentially, it proposes a rate equation that inflates the flow rate by the ratio of a typical TCP segment size (1500B including TCP header) over the actual segment size [PktSizeEquCC].  (There are also other important differences of detail relative to TFRC, such as using virtual packets [CCvarPktSize] to avoid responding to multiple losses per round trip and using a minimum inter-packet interval.)

Section 4.5.1 of this TFRC-SP spec discusses the implications of operating in an environment where queues have been configured to drop smaller packets with proportionately lower probability than larger ones.  But it only discusses TCP operating in such an environment, only mentioning TFRC-SP briefly when discussing how to define fairness with TCP.  And it only discusses the byte-mode dropping version of RED as it was before Cnodder et al pointed out it didn't sufficiently bias towards small packets to make TCP independent of packet size.
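As a rough illustration of the rate equation just described -- using a simplified square-root throughput model in place of the full TFRC equation of [RFC3448], with constants and names of our own choosing -- inflating the rate by the ratio of a typical segment size over the actual segment size makes the target bit-rate independent of the segment size actually used:

   from math import sqrt

   TYPICAL_SEGMENT = 1500.0   # bytes; 'typical' TCP segment incl. TCP header

   def tfrc_like_rate(s, p, rtt):
       """Simplified stand-in for the TFRC equation: X ~ s/(rtt*sqrt(2p/3)).
       The real equation in [RFC3448] also models retransmission timeouts."""
       return s / (rtt * sqrt(2 * p / 3))

   def small_packet_rate(s, p, rtt):
       """TFRC-SP-style adjustment as described above: inflate the rate by
       the ratio of a typical segment size over the actual segment size,
       which removes the dependence on s."""
       return (TYPICAL_SEGMENT / s) * tfrc_like_rate(s, p, rtt)

   if __name__ == "__main__":
       for s in (60, 500, 1500):
           print(s, round(tfrc_like_rate(s, 0.01, 0.1)),
                 round(small_packet_rate(s, 0.01, 0.1)))
       # small_packet_rate() returns the same value whatever s is.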
So the TFRC-SP spec doesn't address the issue of which of the network or the transport _should_ handle fairness between different packet sizes.  In its Appendix B.4 it discusses the possibility of both TFRC-SP and some network buffers duplicating each other's attempts to deliberately bias towards small packets.  But the discussion is not conclusive, instead reporting simulations of many of the possibilities in order to assess performance but not recommending any particular course of action.

The paper originally proposing TFRC with virtual packets (VP-TFRC) [CCvarPktSize] proposed that there should perhaps be two variants to cater for the different variants of RED.  However, as the TFRC-SP authors point out, there is no way for a transport to know whether some queues on its path have deployed RED with byte-mode packet drop (except if an exhaustive survey found that no-one has deployed it!--see Section 3.2.4).  Incidentally, VP-TFRC also proposed that byte-mode RED dropping should really square the packet size compensation factor (like that of RED_5, but apparently unaware of it).

Pre-congestion notification [I-D.ietf-pcn] is a proposal to use a virtual queue for AQM marking for packets within one Diffserv class in order to give early warning prior to any real queuing.  The proposed PCN marking algorithms have been designed not to take account of packet size when forwarding through queues.  Instead the general principle has been to take account of the sizes of marked packets when monitoring the fraction of marking at the edge of the network.
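For instance, an edge monitor following this principle might compute the fraction of marking over bytes rather than packets, along the lines of the sketch below (the function and data layout are ours, not taken from the PCN specifications):

   def marked_fraction(packets):
       """Fraction of marking seen at the network edge, weighted by size.

       `packets` is an iterable of (size_in_bytes, marked) tuples for one
       monitored aggregate; the result counts marked bytes, not packets."""
       total = sum(size for size, _ in packets)
       marked = sum(size for size, marked in packets if marked)
       return marked / total if total else 0.0

   # One marked 1500B packet among nine unmarked 60B packets gives a
   # marked-byte fraction of about 0.74, although only 10% of the
   # packets were marked.
   print(marked_fraction([(1500, True)] + [(60, False)] * 9))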
3.2.3.  Making Transports Robust against Control Packet Losses

Recently, two RFCs have defined changes to TCP that make it more robust against losing small control packets [RFC5562] [RFC5690].  In both cases they note that the case for these TCP changes would be weaker if RED were biased against dropping small packets.  We argue here that these two proposals are a safer and more principled way to achieve TCP performance improvements than reverse engineering RED to benefit TCP.

Although no proposals exist as far as we know, it would also be possible and perfectly valid to make control packets robust against drop by explicitly requesting a lower drop probability using their Diffserv code point [RFC2474] to request a scheduling class with lower drop.

The re-ECN protocol proposal [I-D.briscoe-tsvwg-re-ecn-tcp] is designed so that transports can be made more robust against losing control packets.  It gives queues an incentive to optionally give preference against drop to packets with the 'feedback not established' codepoint in the proposed 'extended ECN' field.  Senders have incentives to use this codepoint sparingly, but they can use it on control packets to reduce their chance of being dropped.
For instance, the proposed modification to TCP for re-ECN uses this codepoint on the SYN and SYN-ACK.

Although not brought to the IETF, a simple proposal from Wischik [DupTCP] suggests that the first three packets of every TCP flow should be routinely duplicated after a short delay.  It shows that this would greatly improve the chances of short flows completing quickly, but it would hardly increase traffic levels on the Internet, because Internet bytes have always been concentrated in the large flows.  It further shows that the performance of many typical applications depends on completion of long serial chains of short messages.  It argues that, given most of the value people get from the Internet is concentrated within short flows, this simple expedient would greatly increase the value of the best efforts Internet at minimal cost.

3.2.4.  Congestion Coding: Summary of Status

   +-----------+----------------+-----------------+--------------------+
   | transport | RED_1 (packet  | RED_4 (linear   | RED_5 (square byte |
   | cc        |   mode drop)   | byte mode drop) |     mode drop)     |
   +-----------+----------------+-----------------+--------------------+
   | TCP or    |   s/sqrt(p)    |    sqrt(s/p)    |     1/sqrt(p)      |
   | TFRC      |                |                 |                    |
   | TFRC-SP   |   1/sqrt(p)    |   1/sqrt(sp)    |   1/(s.sqrt(p))    |
   +-----------+----------------+-----------------+--------------------+

   Table 1: Dependence of flow bit-rate per RTT on packet size s and
   drop rate p when network and/or transport bias towards small packets
   to varying degrees

Table 1 aims to summarise the positions we may now be in.  Each column shows a different possible AQM behaviour in different queues in the network, using the terminology of Cnodder et al outlined earlier (RED_1 is basic RED with packet-mode drop).  Each row shows a different transport behaviour: TCP [RFC5681] and TFRC [RFC3448] on the top row with TFRC-SP [RFC4828] below.
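The cells of Table 1 can be reproduced by composing a simplified transport response with the effective drop probability that each AQM column presents to a packet of relative size s, as in the sketch below (ours; constant factors are ignored, so the values are proportionalities rather than absolute rates):

   from math import sqrt

   # Effective drop/mark probability for a packet of relative size s
   # under each AQM column of Table 1 (constant factors ignored).
   aqm = {
       "RED_1 (packet-mode drop)":      lambda p, s: p,
       "RED_4 (linear byte-mode drop)": lambda p, s: p * s,
       "RED_5 (square byte-mode drop)": lambda p, s: p * s * s,
   }

   # Simplified per-RTT bit-rate response of each transport row, again
   # up to a constant: TCP/TFRC ~ s/sqrt(p_eff); TFRC-SP ~ 1/sqrt(p_eff).
   transport = {
       "TCP or TFRC": lambda p_eff, s: s / sqrt(p_eff),
       "TFRC-SP":     lambda p_eff, s: 1 / sqrt(p_eff),
   }

   def bitrate(row, col, p, s):
       return transport[row](aqm[col](p, s), s)

   if __name__ == "__main__":
       # E.g. TCP through linear byte-mode drop: s/sqrt(p*s) = sqrt(s/p),
       # matching the middle cell of the top row of Table 1.
       print(bitrate("TCP or TFRC", "RED_4 (linear byte-mode drop)", 0.01, 0.5))
       print(sqrt(0.5 / 0.01))   # same value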
Suppressing all inessential details the table shows that independence from packet size should either be achievable by not altering the TCP transport in a RED_5 network, or using the small packet TFRC-SP transport in a network without any byte-mode dropping RED (top right and bottom left).  Top left is the `do nothing' scenario, while bottom right is the `do-both' scenario in which bit-rate would become far too biased towards small packets.  Of course, if any form of byte-mode dropping RED has been deployed on a selection of congested queues, each path will present a different hybrid scenario to its transport.

Whatever, we can see that the linear byte-mode drop column in the middle considerably complicates the Internet.  It's a half-way house that doesn't bias enough towards small packets even if one believes the network should be doing the biasing.  We argue below that _all_ network layer bias towards small packets should be turned off--if indeed any equipment vendors have implemented it--leaving packet size bias solely as the preserve of the transport layer (solely the leftmost, packet-mode drop column).

A survey has been conducted of 84 vendors to assess how widely drop probability based on packet size has been implemented in RED.  Prior to the survey, an individual approach to Cisco received confirmation that, having checked the code-base for each of the product ranges, Cisco has not implemented any discrimination based on packet size in any AQM algorithm in any of its products.  Also an individual approach to Alcatel-Lucent drew a confirmation that it was very likely that none of their products contained RED code that implemented any packet-size bias.
Turning tobenefit TCP.our more formal survey (Table 2), about 19% of those surveyed have replied so far, giving a sample size of 16. Althoughno proposals exist as far asweknow, it would also be possible and perfectly valid to make control packets robust against drop by explicitly requesting a lower drop probability using their Diffserv code point [RFC2474]do not have permission torequest a scheduling class with lower drop. The re-ECN protocol proposal [I-D.briscoe-tsvwg-re-ecn-tcp] is designed so that transportsidentify the respondents, we canbe made more robust against losing control packets. It gives queues an incentive to optionally give preference against drop to packets with the 'feedback not established' codepoint in the proposed 'extended ECN' field. Senderssay that those that haveincentives to use this codepoint sparingly, but they can use it on control packets to reduce their chanceresponded include most ofbeing dropped. For instance, the proposed modification to TCP for re-ECN uses this codepoint on the SYN and SYN-ACK. Although not brought to the IETF, a simple proposal from Wischik [DupTCP] suggests thatthefirst three packets of every TCP flow should be routinely duplicated afterlarger vendors, covering ashort delay. It shows that this would greatly improve the chanceslarge fraction ofshort flows completing quickly, but it would hardly increase traffic levels ontheInternet, because Internet bytes have always been concentrated inmarket. They range across the largeflows. It further showsnetwork equipment vendors at L3 & L2, firewall vendors, wireless equipment vendors, as well as large software businesses with a small selection of networking products. So far, all those who have responded have confirmed that they have not implemented theperformancevariant ofmany typical applications dependsRED with drop dependent oncompletion of long serial chains of short messages. It argues that, given mostpacket size (2 were fairly sure they had not but needed to check more thoroughly). +-------------------------------+----------------+-----------------+ | Response | No. ofthe value people get from the Internet is concentrated within short flows, this simple expedient would greatly increase the valuevendors | %age ofthe best efforts Internet at minimal cost. 6.2.4. Congestion Coding: Summary of Status +-----------+----------------+-----------------+--------------------+ | transport | RED_1 (packet | RED_4 (linear | RED_5 (square bytevendors | +-------------------------------+----------------+-----------------+ |ccNot implemented |mode drop)14 |byte mode drop)17% |mode drop)|+-----------+----------------+-----------------+--------------------+Not implemented (probably) |TCP or2 |s/sqrt(p)2% |sqrt(s/p)|1/sqrt(p)Implemented | 0 |TFRC0% | | No response | 68 | 81% |TFRC-SP|1/sqrt(p)Total companies/orgs surveyed |1/sqrt(sp)84 |1/(s.sqrt(p))100% |+-----------+----------------+-----------------+--------------------++-------------------------------+----------------+-----------------+ Table1: Dependence of flow bit-rate per RTT2: Vendor Survey onpacket size s andbyte-mode droprate p when network and/or transport bias towardsvariant of RED (lower drop probability for smallpackets to varying degrees Table 1 aims to summarisepackets) Where reasons have been given, thepositions we may now be in. 
Where reasons have been given, the extra complexity of packet bias code has been most prevalent, though one vendor had a more principled reason for avoiding it--similar to the argument of this document.

We have established that Linux does not implement RED with packet size drop bias, although we have not investigated a wider range of open source code.

Finally, we repeat that RED's byte mode drop is not the only way to bias towards small packets--tail-drop tends to lock-out large packets very effectively.  Our survey was of vendor implementations, so we cannot be certain about operator deployment.  But we believe many queues in the Internet are still tail-drop.  The company of one of the co-authors (BT) has widely deployed RED, but there are bound to be many tail-drop queues, particularly in access network equipment and on middleboxes like firewalls, where RED is not always available.  Routers using a memory architecture based on fixed size buffers with borrowing may also still be prevalent in the Internet.  As explained in Section 3.2.1, these also provide a marginal (but legitimate) bias towards small packets.  So even though RED byte-mode drop is not prevalent, it is likely there is still some bias towards small packets in the Internet due to tail drop and fixed buffer borrowing.

4.  Outstanding Issues and Next Steps

4.1.  Bit-congestible World

For a connectionless network with nearly all resources being bit-congestible we believe the recommended position is now unarguably clear--that the network should not make allowance for packet sizes and the transport should.  This leaves two outstanding issues:

   o  How to handle any legacy of AQM with byte-mode drop already deployed;

   o  The need to start a programme to update transport congestion control protocol standards to take account of packet size.
The sample of returns from our vendor survey (Section 3.2.4) suggests that byte-mode packet drop seems not to be implemented at all, let alone deployed, or if it is, it is likely to be very sparse.  Therefore, we do not really need a migration strategy from all but nothing to nothing.

A programme of standards updates to take account of packet size in transport congestion control protocols has started with TFRC-SP [RFC4828], while weighted TCPs implemented in the research community [WindowPropFair] could form the basis of a future change to TCP congestion control [RFC5681] itself.

4.2.  Bit- & Packet-congestible World

Nonetheless, a connectionless network with both bit-congestible and packet-congestible resources is a different matter.  If we believe we should allow for this possibility in the future, this space contains a truly open research issue.

We develop the concept of an idealised congestion notification protocol that supports both bit-congestible and packet-congestible resources in Appendix B.  The congestion notification requires at least two flags for congestion of bit-congestible and packet-congestible resources.  This hides a fundamental problem--much more fundamental than whether we can magically create header space for yet another ECN flag in IPv4, or whether it would work while being deployed incrementally.  A congestion notification protocol must survive a transition from low levels of congestion to high.  Marking two states is feasible with explicit marking, but much harder if packets are dropped.  Also, it will not always be cost-effective to implement AQM at every low level resource, so drop will often have to suffice.

Distinguishing drop from delivery naturally provides just one congestion flag--it is hard to drop a packet in two ways that are distinguishable remotely.  This is a similar problem to that of distinguishing wireless transmission losses from congestive losses.
We should also note that, strictly, packet-congestible resources are actually cycle-congestible because load also depends on the complexity of each look-up and whether the pattern of arrivals is amenable to caching or not.  Further, this reminds us that any solution must not require a forwarding engine to use excessive processor cycles in order to decide how to say it has no spare processor cycles.

Recently, the dual resource queue (DRQ) proposal [DRQ] has been made on the premise that, as network processors become more cost effective, per packet operations will become more complex (irrespective of whether more function in the network layer is desirable).  Consequently the premise is that CPU congestion will become more common.  DRQ is a proposed modification to the RED algorithm that folds both bit congestion and packet congestion into one signal (either loss or ECN).

The problem of signalling packet processing congestion is not pressing, as most Internet resources are designed to be bit-congestible before packet processing starts to congest (see Section 1.1).  However, the IRTF Internet congestion control research group (ICCRG) has set itself the task of reaching consensus on generic forwarding mechanisms that are necessary and sufficient to support the Internet's future congestion control requirements (the first challenge in [I-D.irtf-iccrg-welzl]).
Therefore, rather than not giving this problem any thought at all, just because it is hard and currently hypothetical, we defer the question of whether packet congestion might become common and what to do if it does to the IRTF (the 'Small Packets' challenge in [I-D.irtf-iccrg-welzl]).

5.  Recommendations and Conclusions

5.1.  Recommendation on Queue Measurement

Queue length is usually the most correct and simplest way to measure congestion of a resource.  To avoid the pathological effects of drop tail, an AQM function can then be used to transform queue length into the probability of dropping or marking a packet (e.g. RED's piecewise linear function between thresholds).

If the resource is bit-congestible, the length of the queue SHOULD be measured in bytes.  If the resource is packet-congestible, the length of the queue SHOULD be measured in packets.  No other choice makes sense, because the number of packets waiting in the queue isn't relevant if the resource gets congested by bytes and vice versa.  We discuss the implications on RED's byte mode and packet mode for measuring queue length in Section 3.

NOTE WELL that RED's byte-mode queue measurement is fine, being completely orthogonal to byte-mode drop.
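A minimal sketch of this recommendation follows (our illustration only: the thresholds are arbitrary and a simple linear ramp stands in for RED's full averaging and dropping algorithm).  The queue of a bit-congestible resource is measured in bytes, that of a packet-congestible resource in packets, and in both cases the resulting probability is applied to each packet irrespective of that packet's own size:

   import random

   def aqm_probability(level, min_th, max_th, max_p=0.1):
       """Piecewise-linear ramp from 0 to max_p between two thresholds,
       standing in for RED's averaging and drop/mark function."""
       if level <= min_th:
           return 0.0
       if level >= max_th:
           return 1.0
       return max_p * (level - min_th) / (max_th - min_th)

   def should_mark(queue_pkt_sizes, bit_congestible, arriving_pkt_size):
       """Packet-mode decision: the arriving packet's own size plays no
       part in the probability that it is dropped or marked."""
       if bit_congestible:
           level = sum(queue_pkt_sizes)        # queue measured in bytes
           p = aqm_probability(level, min_th=15000, max_th=45000)
       else:
           level = len(queue_pkt_sizes)        # queue measured in packets
           p = aqm_probability(level, min_th=10, max_th=30)
       return random.random() < p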
If a RED implementation has a byte-mode but does not specify what sort of byte-mode, it is most probably byte-mode queue measurement, which is fine.  However, if in doubt, the vendor should be consulted.

5.2.  Recommendation on Notifying Congestion

The strong recommendation is that AQM algorithms such as RED SHOULD NOT use byte-mode drop.  More generally, the Internet's congestion notification protocols (drop, ECN & PCN) SHOULD take account of packet size when the notification is read by the transport layer, NOT when it is written by the network layer.  This approach offers sufficient and correct congestion information for all known and future transport protocols and also ensures no perverse incentives are created that would encourage transports to use inappropriately small packet sizes.

The alternative of deflating RED's drop probability for smaller packet sizes (byte-mode drop) has no enduring advantages.  It is more complex, it creates the perverse incentive to fragment segments into tiny pieces and it reopens the vulnerability to floods of small packets that drop-tail queues suffered from and AQM was designed to remove.

Byte-mode drop is a change to the network layer that makes allowance for an omission from the design of TCP, effectively reverse engineering the network layer to contrive to make two TCPs with different packet sizes run at equal bit rates (rather than packet rates) under the same path conditions.  It also improves TCP performance by reducing the chance that a SYN or a pure ACK will be dropped, because they are small.  But we SHOULD NOT hack the network layer to improve or fix certain transport protocols.  No matter how predominant a transport protocol is (even if it's TCP), trying to correct for its failings by biasing towards small packets in the network layer creates a perverse incentive to break down all flows from all transports into tiny segments.
So far, our survey of 84 vendors across the industry has drawn responses from about 19%, none of whom have implemented the byte mode packet drop variant of RED.  Given there appears to be little, if any, installed base it seems we can recommend removal of byte-mode drop from RED with little, if any, incremental deployment impact.

If a vendor has implemented byte-mode drop, and an operator has turned it on, it is strongly RECOMMENDED that it SHOULD be turned off.  Note that RED as a whole SHOULD NOT be turned off, as without it, a drop tail queue also biases against large packets.  But note also that turning off byte-mode may alter the relative performance of applications using different packet sizes, so it would be advisable to establish the implications before turning it off.

5.3.  Recommendation on Responding to Congestion

Instead of network equipment biasing its congestion notification for small packets, the IETF transport area should continue its programme of updating congestion control protocols to take account of packet size and to make transports less sensitive to losing control packets like SYNs and pure ACKs.

5.4.  Recommended Future Research

The above conclusions cater for the Internet as it is today with most, if not all, resources being primarily bit-congestible.  A secondary conclusion of this memo is that we may see more packet-congestible resources in the future, so research may be needed to extend the Internet's congestion notification (drop or ECN) so that it can handle a mix of bit-congestible and packet-congestible resources.

6.  Security Considerations

This draft recommends that queues do not bias drop probability towards small packets as this creates a perverse incentive for transports to break down their flows into tiny segments.
One of the benefits of implementing AQM was meant to be to remove this perverse incentive that drop-tail queues gave to small packets.  Of course, if transports really want to make the greatest gains, they don't have to respond to congestion anyway.  But we don't want applications that are trying to behave to discover that they can go faster by using smaller packets.

In practice, transports cannot all be trusted to respond to congestion.  So another reason for recommending that queues do not bias drop probability towards small packets is to avoid the vulnerability to small packet DDoS attacks that would otherwise result.  One of the benefits of implementing AQM was meant to be to remove drop-tail's DoS vulnerability to small packets, so we shouldn't add it back again.

If most queues implemented AQM with byte-mode drop, the resulting network would amplify the potency of a small packet DDoS attack.  At the first queue the stream of packets would push aside a greater proportion of large packets, so more of the small packets would survive to attack the next queue.  Thus a flood of small packets would continue on towards the destination, pushing regular traffic with large packets out of the way in one queue after the next, but suffering much less drop itself.

Appendix C explains why the ability of networks to police the response of _any_ transport to congestion depends on bit-congestible network resources only doing packet-mode not byte-mode drop.
In summary, it says that making drop probability depend on the size of the packets that bits happen to be divided into simply encourages the bits to be divided into smaller packets.  Byte-mode drop would therefore irreversibly complicate any attempt to fix the Internet's incentive structures.

7.  Acknowledgements

Thank you to Sally Floyd, who gave extensive and useful review comments.  Also thanks for the reviews from Philip Eardley, Toby Moncaster and Arnaud Jacquet as well as helpful explanations of different hardware approaches from Larry Dunn and Fred Baker.  I am grateful to Bruce Davie and his colleagues for providing a timely and efficient survey of RED implementation in Cisco's product range.  Also grateful thanks to Toby Moncaster, Will Dormann, John Regnault, Simon Carter and Stefaan De Cnodder who further helped survey the current status of RED implementation and deployment and, finally, thanks to the anonymous individuals who responded.

Bob Briscoe and Jukka Manner are partly funded by Trilogy, a research project (ICT-216372) supported by the European Community under its Seventh Framework Programme.  The views expressed here are those of the authors only.

8.  Comments Solicited

Comments and questions are encouraged and very welcome.  They can be addressed to the IETF Transport Area working group mailing list <tsvwg@ietf.org>, and/or to the authors.

9.  References

9.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2309]  Braden, B., Clark, D., Crowcroft, J., Davie, B., Deering, S., Estrin, D., Floyd, S., Jacobson, V., Minshall, G., Partridge, C., Peterson, L., Ramakrishnan, K., Shenker, S., Wroclawski, J., and L. Zhang, "Recommendations on Queue Management and Congestion Avoidance in the Internet", RFC 2309, April 1998.

   [RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition of Explicit Congestion Notification (ECN) to IP", RFC 3168, September 2001.

   [RFC3426]  Floyd, S., "General Architectural and Policy Considerations", RFC 3426, November 2002.

   [RFC5033]  Floyd, S. and M. Allman, "Specifying New Congestion Control Algorithms", BCP 133, RFC 5033, August 2007.

9.2.  Informative References

   [CCvarPktSize]  Widmer, J., Boutremans, C., and J-Y. Le Boudec, "Congestion Control for Flows with Variable Packet Size", ACM CCR 34(2) 137--151, 2004, <http://doi.acm.org/10.1145/997150.997162>.

   [DRQ]  Shin, M., Chong, S., and I. Rhee, "Dual-Resource TCP/AQM for Processing-Constrained Networks", IEEE/ACM Transactions on Networking Vol 16, issue 2, April 2008, <http://dx.doi.org/10.1109/TNET.2007.900415>.

   [DupTCP]  Wischik, D., "Short messages", Royal Society workshop on networks: modelling and control, September 2007, <http://www.cs.ucl.ac.uk/staff/ucacdjw/Research/shortmsg.html>.

   [ECNFixedWireless]  Siris, V., "Resource Control for Elastic Traffic in CDMA Networks", Proc. ACM MOBICOM'02, September 2002, <http://www.ics.forth.gr/netlab/publications/resource_control_elastic_cdma.html>.
   [Evol_cc]  Gibbens, R. and F. Kelly, "Resource pricing and the evolution of congestion control", Automatica 35(12) 1969--1985, December 1999, <http://www.statslab.cam.ac.uk/~frank/evol.html>.

   [I-D.briscoe-tsvwg-re-ecn-tcp]  Briscoe, B., Jacquet, A., Moncaster, T., and A. Smith, "Re-ECN: Adding Accountability for Causing Congestion to TCP/IP", draft-briscoe-tsvwg-re-ecn-tcp-08 (work in progress), September 2009.

   [I-D.ietf-pcn]  Eardley, P., "Metering and marking behaviour of PCN-nodes", draft-ietf-pcn-marking-behaviour-05 (work in progress), August 2009.

   [I-D.irtf-iccrg-welzl]  Welzl, M., Scharf, M., Briscoe, B., and D. Papadimitriou, "Open Research Issues in Internet Congestion Control", draft-irtf-iccrg-welzl-congestion-control-open-research-07 (work in progress), June 2010.

   [IOSArch]  Bollapragada, V., White, R., and C. Murphy, "Inside Cisco IOS Software Architecture", Cisco Press: CCIE Professional Development, ISBN13: 978-1-57870-181-0, July 2000.

   [MulTCP]  Crowcroft, J. and Ph. Oechslin, "Differentiated End to End Internet Services using a Weighted Proportional Fair Sharing TCP", CCR 28(3) 53--69, July 1998, <http://www.cs.ucl.ac.uk/staff/J.Crowcroft/hipparch/pricing.html>.

   [PktSizeEquCC]  Vasallo, P., "Variable Packet Size Equation-Based Congestion Control", ICSI Technical Report tr-00-008, 2000, <http://http.icsi.berkeley.edu/ftp/global/pub/techreports/2000/tr-00-008.pdf>.

   [RED93]  Floyd, S. and V. Jacobson, "Random Early Detection (RED) gateways for Congestion Avoidance", IEEE/ACM Transactions on Networking 1(4) 397--413, August 1993, <http://www.icir.org/floyd/papers/red/red.html>.

   [REDbias]  Eddy, W. and M. Allman, "A Comparison of RED's Byte and Packet Modes", Computer Networks 42(3) 261--280, June 2003, <http://www.ir.bbn.com/documents/articles/redbias.ps>.

   [REDbyte]  De Cnodder, S., Elloumi, O., and K. Pauwels, "RED behavior with different packet sizes", Proc. 5th IEEE Symposium on Computers and Communications (ISCC) 793--799, July 2000, <http://www.icir.org/floyd/red/Elloumi99.pdf>.

   [RFC2474]  Nichols, K., Blake, S., Baker, F., and D. Black, "Definition of the Differentiated Services Field (DS Field) in the IPv4 and IPv6 Headers", RFC 2474, December 1998.

   [RFC3448]  Handley, M., Floyd, S., Padhye, J., and J. Widmer, "TCP Friendly Rate Control (TFRC): Protocol Specification", RFC 3448, January 2003.

   [RFC3714]  Floyd, S. and J. Kempf, "IAB Concerns Regarding Congestion Control for Voice Traffic in the Internet", RFC 3714, March 2004.

   [RFC4782]  Floyd, S., Allman, M., Jain, A., and P. Sarolahti, "Quick-Start for TCP and IP", RFC 4782, January 2007.
   [RFC4828]  Floyd, S. and E. Kohler, "TCP Friendly Rate Control (TFRC): The Small-Packet (SP) Variant", RFC 4828, April 2007.

   [RFC5562]  Kuzmanovic, A., Mondal, A., Floyd, S., and K. Ramakrishnan, "Adding Explicit Congestion Notification (ECN) Capability to TCP's SYN/ACK Packets", RFC 5562, June 2009.

   [RFC5670]  Eardley, P., "Metering and Marking Behaviour of PCN-Nodes", RFC 5670, November 2009.

   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion Control", RFC 5681, September 2009.

   [RFC5690]  Floyd, S., Arcia, A., Ros, D., and J. Iyengar, "Adding Acknowledgement Congestion Control to TCP", RFC 5690, February 2010.

   [Rate_fair_Dis]  Briscoe, B., "Flow Rate Fairness: Dismantling a Religion", ACM CCR 37(2) 63--74, April 2007, <http://portal.acm.org/citation.cfm?id=1232926>.

   [WindowPropFair]  Siris, V., "Service Differentiation and Performance of Weighted Window-Based Congestion Control and Packet Marking Algorithms in ECN Networks", Computer Communications 26(4) 314--326, 2002, <http://www.ics.forth.gr/netgroup/publications/weighted_window_control.html>.

   [gentle_RED]  Floyd, S., "Recommendation on using the "gentle_" variant of RED", Web page, March 2000, <http://www.icir.org/floyd/red/gentle.html>.

   [pBox]  Floyd, S. and K. Fall, "Promoting the Use of End-to-End Congestion Control in the Internet", IEEE/ACM Transactions on Networking 7(4) 458--472, August 1999, <http://www.aciri.org/floyd/end2end-paper.html>.

   [pktByteEmail]  Floyd, S., "RED: Discussions of Byte and Packet Modes", email, March 1997, <http://www-nrg.ee.lbl.gov/floyd/REDaveraging.txt>.

   [xcp-spec]  Falk, A., "Specification for the Explicit Control Protocol (XCP)", draft-falk-xcp-spec-03 (work in progress), July 2007.

Appendix A.  Congestion Notification Definition: Further Justification

In Section 1.1 on the definition of congestion notification, load not capacity was used as the denominator.  This also has a subtle significance in the related debate over the design of new transport protocols--typical new protocol designs (e.g. in XCP [xcp-spec] & Quickstart [RFC4782]) expect the sending transport to communicate its desired flow rate to the network and network elements to progressively subtract from this so that the achievable flow rate emerges at the receiving transport.

Congestion notification with total load in the denominator can serve a similar purpose (though in retrospect not in advance like XCP & QuickStart).  Congestion notification is a dimensionless fraction but each source can extract necessary rate information from it because it already knows what its own rate is.  Even though congestion notification doesn't communicate a rate explicitly, from each source's point of view congestion notification represents the fraction of the rate it was sending a round trip ago that couldn't (or wouldn't) be served by available resources.

Appendix B.  Idealised Wire Protocol

We will start by inventing an idealised congestion notification protocol before discussing how to make it practical.  The idealised protocol is shown to be correct using examples later in this appendix.

B.1.  Protocol Coding

Congestion notification involves the congested resource coding a congestion notification signal into the packet stream and the transports decoding it.  The idealised protocol uses two different (imaginary) fields in each datagram to signal congestion: one for byte congestion and one for packet congestion.
We are not saying two ECN fields will be needed (and we are not saying that somehow a resource should be able to drop a packet in one of two different ways so that the transport can distinguish which sort of drop it was!).  These two congestion notification channels are just a conceptual device.  They allow us to defer having to decide whether to distinguish between byte and packet congestion when the network resource codes the signal or when the transport decodes it.

However, although this idealised mechanism isn't intended for implementation, we do want to emphasise that we may need to find a way to implement it, because it could become necessary to somehow distinguish between bit and packet congestion [RFC3714].  Currently, packet-congestion is not the common case, but there is no guarantee that it will not become common with future technology trends.

The idealised wire protocol is given below.  It accounts for packet sizes at the transport layer, not in the network, and then only in the case of bit-congestible resources.  This avoids the perverse incentive to send smaller packets and the DoS vulnerability that would otherwise result if the network were to bias towards them (see the motivating argument about avoiding perverse incentives in Section 2.2):

   1.  A packet-congestible resource trying to code congestion level p_p into a packet stream should mark the idealised `packet congestion' field in each packet with probability p_p irrespective of the packet's size.  The transport should then take a packet with the packet congestion field marked to mean just one mark, irrespective of the packet size.

   2.  A bit-congestible resource trying to code time-varying byte-congestion level p_b into a packet stream should mark the `byte congestion' field in each packet with probability p_b, again irrespective of the packet's size.  Unlike before, the transport should take a packet with the byte congestion field marked to count as a mark on each byte in the packet.

The worked examples in Appendix B.2 show that transports can extract sufficient and correct congestion notification from these protocols for cases when two flows with different packet sizes have matching bit rates or matching packet rates.  Examples are also given that mix these two flows into one to show that a flow with mixed packet sizes would still be able to extract sufficient and correct information.

Sufficient and correct congestion information means that there is sufficient information for the two different types of transport requirements:

   Ratio-based:  Established transport congestion controls like TCP's [RFC5681] aim to achieve equal segment rates per RTT through the same bottleneck--TCP friendliness [RFC3448].  They work with the ratio of dropped to delivered segments (or marked to unmarked segments in the case of ECN).  The example scenarios show that these ratio-based transports are effectively the same whether counting in bytes or packets, because the units cancel out.  (Incidentally, this is why TCP's bit rate is still proportional to packet size even when byte-counting is used, as recommended for TCP in [RFC5681], mainly for orthogonal security reasons.)

   Absolute-target-based:  Other congestion controls proposed in the research community aim to limit the volume of congestion caused to a constant weight parameter.  [MulTCP][WindowPropFair] are examples of weighted proportionally fair transports designed for cost-fair environments [Rate_fair_Dis].  In this case, the transport requires a count (not a ratio) of dropped/marked bytes in the bit-congestible case and of dropped/marked packets in the packet congestible case.
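The two coding rules above, together with the two styles of decoding they support, could be sketched as follows (purely conceptual, mirroring the imaginary two-field protocol; all names are ours and nothing here is proposed for implementation):

   import random

   def mark(pkt, p_b, p_p):
       """Idealised network coding: each field is set with its own
       probability, irrespective of the packet's size."""
       pkt["byte_congestion"] = random.random() < p_b   # bit-congestible resource
       pkt["pkt_congestion"] = random.random() < p_p    # packet-congestible resource
       return pkt

   def decode(packets):
       """Idealised transport decoding of a received packet stream.

       The ratio-based estimates recover p_b and p_p; the absolute
       counters give congestion-volume: marked bytes for byte
       congestion, marked packets for packet congestion."""
       total_bytes = sum(p["size"] for p in packets)
       marked_bytes = sum(p["size"] for p in packets if p["byte_congestion"])
       marked_pkts = sum(1 for p in packets if p["pkt_congestion"])
       return {
           "p_b_estimate": marked_bytes / total_bytes,
           "p_p_estimate": marked_pkts / len(packets),
           "byte_congestion_volume": marked_bytes,   # unit: bytes
           "pkt_congestion_volume": marked_pkts,     # unit: packets
       }

   stream = [mark({"size": s}, p_b=0.02, p_p=0.001)
             for s in random.choices([60, 1500], k=10000)]
   print(decode(stream))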
   The worked examples in Appendix B.2 show that transports can extract sufficient and correct congestion notification from these protocols for cases when two flows with different packet sizes have matching bit rates or matching packet rates.  Examples are also given that mix these two flows into one to show that a flow with mixed packet sizes would still be able to extract sufficient and correct information.  Sufficient and correct congestion information means that there is sufficient information for the two different types of transport requirements:

   Ratio-based:  Established transport congestion controls like TCP's [RFC5681] aim to achieve equal segment rates per RTT through the same bottleneck--TCP friendliness [RFC3448].  They work with the ratio of dropped to delivered segments (or marked to unmarked segments in the case of ECN).  The example scenarios show that these ratio-based transports are effectively the same whether counting in bytes or packets, because the units cancel out.  (Incidentally, this is why TCP's bit rate is still proportional to packet size even when byte-counting is used, as recommended for TCP in [RFC5681], mainly for orthogonal security reasons.)

   Absolute-target-based:  Other congestion controls proposed in the research community aim to limit the volume of congestion caused to a constant weight parameter.  [MulTCP] and [WindowPropFair] are examples of weighted proportionally fair transports designed for cost-fair environments [Rate_fair_Dis].  In this case, the transport requires a count (not a ratio) of dropped/marked bytes in the bit-congestible case and of dropped/marked packets in the packet-congestible case.
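   The difference between these two requirements can be made concrete with two hypothetical decoder helpers, sketched below; the function names are invented here and do not correspond to any specified API.

      def marking_fraction(marked_units, total_units):
          # Ratio-based controls only need the fraction of marked units.
          # Because marking is independent of packet size, counting in
          # bytes or in packets gives the same estimate of p.
          return marked_units / total_units

      def congestion_volume(pkts):
          # Absolute-target-based controls need a count, not a ratio:
          # marked bytes for byte-congestion, marked packets for
          # packet-congestion (reusing the IdealisedSignal sketch above).
          marked_bytes = sum(p.size_bytes for p in pkts if p.byte_congestion)
          marked_pkts = sum(1 for p in pkts if p.pkt_congestion)
          return marked_bytes, marked_pkts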
B.2.  Example Scenarios

B.2.1.  Notation

   To prove our idealised wire protocol (Appendix B.1) is correct, we will compare two flows with different packet sizes, s_1 and s_2 [bit/pkt], to make sure their transports each see the correct congestion notification.  Initially, within each flow we will take all packets as having equal sizes, but later we will generalise to flows within which packet sizes vary.  A flow's bit rate, x [bit/s], is related to its packet rate, u [pkt/s], by x(t) = s.u(t).  We will consider a 2x2 matrix of four scenarios:

   +-----------------------------+------------------+------------------+
   | resource type and           | A) Equal bit     | B) Equal pkt     |
   | congestion level            | rates            | rates            |
   +-----------------------------+------------------+------------------+
   | i) bit-congestible, p_b     | (Ai)             | (Bi)             |
   | ii) pkt-congestible, p_p    | (Aii)            | (Bii)            |
   +-----------------------------+------------------+------------------+

                                 Table 3

B.2.2.  Bit-congestible resource, equal bit rates (Ai)

   Starting with the bit-congestible scenario, for two flows to maintain equal bit rates (Ai) the ratio of the packet rates must be the inverse of the ratio of packet sizes: u_2/u_1 = s_1/s_2.  So, for instance, a flow of 60B packets would have to send 25x more packets to achieve the same bit rate as a flow of 1500B packets.  If a congested resource marks proportion p_b of packets irrespective of size, the ratio of marked packets received by each transport will still be the same as the ratio of their packet rates, p_b.u_2/p_b.u_1 = s_1/s_2.  So of the 25x more 60B packets sent, 25x more will be marked than in the 1500B packet flow, but 25x more will be unmarked too.

   In this scenario the resource is bit-congestible, so it always uses our idealised byte-congestion field when it marks packets.  Therefore the transport should count marked bytes, not packets.  But it doesn't actually matter for ratio-based transports like TCP (Appendix B.1).  The ratio of marked to unmarked bytes seen by each flow will be p_b, as will the ratio of marked to unmarked packets.  Because they are ratios, the units cancel out.

   If a flow sent an inconsistent mixture of packet sizes, we have said it should count the ratio of marked and unmarked bytes, not packets, in order to correctly decode the level of congestion.  But actually, if all it is trying to do is decode p_b, it still doesn't matter.  For instance, imagine the two equal bit rate flows were actually one flow at twice the bit rate sending a mixture of one 1500B packet for every twenty-five 60B packets.  25x more small packets will be marked and 25x more will be unmarked.  The transport can still calculate p_b whether it uses bytes or packets for the ratio.  In general, for any algorithm which works on a ratio of marks to non-marks, either bytes or packets can be counted interchangeably, because the choice cancels out in the ratio calculation.

   However, where an absolute target rather than a relative volume of congestion caused is important (Appendix B.1), as it is for congestion accountability [Rate_fair_Dis], the transport must count marked bytes, not packets, in this bit-congestible case.  Aside from the goal of congestion accountability, this is how the bit rate of a transport can be made independent of packet size: by ensuring the rate of congestion caused is kept to a constant weight [WindowPropFair], rather than merely responding to the ratio of marked and unmarked bytes.  Note the unit of byte-congestion-volume is the byte.

B.2.3.  Bit-congestible resource, equal packet rates (Bi)

   If two flows send different packet sizes but at the same packet rate, their bit rates will be in the same ratio as their packet sizes, x_2/x_1 = s_2/s_1.  For instance, a flow sending 1500B packets at the same packet rate as another sending 60B packets will be sending at 25x greater bit rate.  In this case, if a congested resource marks proportion p_b of packets irrespective of size, the ratio of packets received with the byte-congestion field marked by each transport will be the same, p_b.u_2/p_b.u_1 = 1.

   Because the byte-congestion field is marked, the transport should count marked bytes, not packets.  But because each flow sends consistently sized packets it still doesn't matter for ratio-based transports.  The ratio of marked to unmarked bytes seen by each flow will be p_b, as will the ratio of marked to unmarked packets.  Therefore, if the congestion control algorithm is only concerned with the ratio of marked to unmarked packets (as is TCP), both flows will be able to decode p_b correctly, whether they count packets or bytes.

   But if the absolute volume of congestion is important, e.g. for congestion accountability, the transport must count marked bytes, not packets.  Then the lower bit rate flow using smaller packets will rightly be perceived as causing less byte-congestion, even though its packet rate is the same.

   If the two flows are mixed into one, of bit rate x_1+x_2, with equal packet rates of each size packet, the ratio p_b will still be measurable by counting the ratio of marked to unmarked bytes (or packets, because the ratio cancels out the units).  However, if the absolute volume of congestion is required, the transport must count the sum of congestion-marked bytes, which indeed gives a correct measure of the rate of byte-congestion p_b(x_1 + x_2) caused by the combined bit rate.
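   A quick expected-value check of scenarios Ai and Bi, offered only as a sketch (the bit rate, packet rate and value of p_b are invented for the example):

      # Bit-congestible resource marking with probability p_b.
      p_b = 0.01
      s1, s2 = 1500 * 8, 60 * 8            # packet sizes [bit/pkt]

      # (Ai) equal bit rates, x = 10 Mb/s each:
      x = 10e6
      u1, u2 = x / s1, x / s2              # packet rates [pkt/s]
      assert abs(u2 / u1 - 25) < 1e-9      # 25x more small packets...
      # ...but the rate of congestion-volume (marked bits/s) is equal:
      assert abs(p_b * u1 * s1 - p_b * u2 * s2) < 1e-6

      # (Bi) equal packet rates, u = 1000 pkt/s each: the 1500B flow
      # rightly causes 25x more byte-congestion-volume.
      u = 1000.0
      assert abs((p_b * u * s1) / (p_b * u * s2) - 25) < 1e-9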
B.2.4.  Pkt-congestible resource, equal bit rates (Aii)

   Moving to the case of packet-congestible resources, we now take two flows that send different packet sizes at the same bit rate, but this time the pkt-congestion field is marked by the resource with probability p_p.  As in scenario Ai, with the same bit rates but a bit-congestible resource, the flow with smaller packets will have a higher packet rate, so more packets will be both marked and unmarked, but in the same proportion.  This time, the transport should only count marks, without taking into account packet sizes.  Transports will get the same result, p_p, by decoding the ratio of marked to unmarked packets in either flow.

   If one flow imitates the two flows but merged together, the bit rate will double, with more small packets than large.  The ratio of marked to unmarked packets will still be p_p.  But if the absolute number of pkt-congestion-marked packets is counted, it will accumulate at the combined packet rate times the marking probability, p_p(u_1+u_2), 26x faster than packet congestion accumulates in the single 1500B packet flow of our example, as required.

   But if the transport is interested in the absolute amount of packet congestion, it should just count how many marked packets arrive.  For instance, a flow sending 60B packets will see 25x more marked packets than one sending 1500B packets at the same bit rate, because it is sending more packets through a packet-congestible resource.  Note the unit of packet congestion is a packet.

B.2.5.  Pkt-congestible resource, equal packet rates (Bii)

   Finally, if two flows with the same packet rate pass through a packet-congestible resource, they will both suffer the same proportion of marking, p_p, irrespective of their packet sizes.  On detecting that the pkt-congestion field is marked, the transport should count packets, and it will be able to extract the ratio p_p of marked to unmarked packets from both flows, irrespective of packet sizes.  Even if the transport is monitoring the absolute amount of packet congestion over a period, it will still see the same amount of packet congestion from either flow.

   And if the two equal packet rates of different size packets are mixed together in one flow, the packet rate will double, so the absolute volume of packet-congestion will accumulate at twice the rate of either flow, 2p_p.u_1 = p_p(u_1+u_2).
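   The same kind of expected-value check for the packet-congestible scenarios Aii and Bii, again only a sketch with invented rates and an invented value of p_p:

      # Packet-congestible resource marking the pkt-congestion field
      # with probability p_p.
      p_p = 0.01
      s1, s2 = 1500 * 8, 60 * 8                # [bit/pkt]

      # (Aii) equal bit rates, x = 10 Mb/s each:
      x = 10e6
      u1, u2 = x / s1, x / s2                  # packet rates [pkt/s]
      assert abs((p_p * u2) / (p_p * u1) - 25) < 1e-9  # 25x more marked packets
      assert abs((u1 + u2) / u1 - 26) < 1e-9           # merged flow: 26x faster

      # (Bii) equal packet rates, u = 1000 pkt/s each: both flows see the
      # same packet congestion; the merged flow accumulates it 2x faster.
      u = 1000.0
      assert abs(p_p * (u + u) - 2 * p_p * u) < 1e-9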
Appendix C.  Byte-mode Drop Complicates Policing Congestion Response

   This appendix explains why the ability of networks to police the response of _any_ transport to congestion depends on bit-congestible network resources only doing packet-mode drop, not byte-mode drop.

   To be able to police a transport's response to congestion when fairness can only be judged over time and over all an individual's flows, the policer has to have an integrated view of all the congestion an individual (not just one flow) has caused due to all traffic entering the Internet from that individual.  This is termed congestion accountability.

   But a byte-mode drop algorithm has to depend on the local MTU of the line - an algorithm needs to use some concept of a 'normal' packet size.  Therefore, one dropped or marked packet is not necessarily equivalent to another unless you know the MTU at the queue where it was dropped/marked.  To have an integrated view of a user, we believe congestion policing has to be located at an individual's attachment point to the Internet [I-D.briscoe-tsvwg-re-ecn-tcp].  But from there it cannot know the MTU of each remote queue that caused each drop/mark.  Therefore it cannot take an integrated approach to policing all the responses to congestion of all the transports of one individual.  Therefore it cannot police anything.

   The security/incentive argument _for_ packet-mode drop is similar.  Firstly, confining RED to packet-mode drop would not preclude bottleneck policing approaches such as [pBox], as it seems likely they could work just as well by monitoring the volume of dropped bytes rather than packets.  Secondly, packet-mode dropping/marking naturally allows the congestion notification of packets to be globally meaningful without relying on MTU information held elsewhere.  Because we recommend that a dropped/marked packet should be taken to mean that all the bytes in the packet are dropped/marked, a policer can remain robust against bits being re-divided into different size packets or across different size flows [Rate_fair_Dis].  Therefore policing would work naturally with just simple packet-mode drop in RED.

   In summary, making drop probability depend on the size of the packets that bits happen to be divided into simply encourages the bits to be divided into smaller packets.  Byte-mode drop would therefore irreversibly complicate any attempt to fix the Internet's incentive structures.
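   The congestion-accountability point can be illustrated with a toy per-user policer at the attachment point that limits the rate of congestion-volume a user causes, counting every byte of each congestion-marked packet as recommended above.  This is only a sketch of the idea under the assumptions of this appendix, not a description of any specified policer; the class name and parameters are invented.

      class CongestionPolicer:
          # Token bucket drained by congestion-volume (marked *bytes*).
          def __init__(self, allowance_bytes_per_s, bucket_bytes):
              self.rate = allowance_bytes_per_s  # permitted congestion-volume rate
              self.depth = bucket_bytes
              self.tokens = bucket_bytes

          def on_packet(self, size_bytes, congestion_marked, dt):
              self.tokens = min(self.depth, self.tokens + self.rate * dt)
              if congestion_marked:
                  # A marked packet counts as all its bytes marked, so
                  # splitting the same bits into smaller packets gains
                  # the sender nothing.
                  self.tokens -= size_bytes
              return self.tokens >= 0    # False: allowance exceeded

   Because the measure is in marked bytes rather than marked packets, it does not depend on the MTU at whichever remote queue did the marking.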
Appendix D.  Changes from Previous Versions

   To be removed by the RFC Editor on publication.

   Full incremental diffs between each version are available at <http://www.cs.ucl.ac.uk/staff/B.Briscoe/pubs.html#byte-pkt-congest> or <http://tools.ietf.org/wg/tsvwg/draft-ietf-tsvwg-byte-pkt-congest/> (courtesy of the rfcdiff tool):

   From -01 to -02 (this version):

   *  Restructured the whole document for (hopefully) easier reading and clarity.  The concrete recommendation, in RFC2119 language, is now in Section 5.

   From -00 to -01:

   *  Minor clarifications throughout and updated references.

   From briscoe-byte-pkt-mark-02 to ietf-byte-pkt-congest-00:

   *  Added note on relationship to existing RFCs.

   *  Posed the question of whether packet-congestion could become common and deferred it to the IRTF ICCRG.  Added ref to the dual-resource queue (DRQ) proposal.

   *  Changed PCN references from the PCN charter & architecture to the PCN marking behaviour draft most likely to imminently become the standards track WG item.

   From -01 to -02:

   *  Abstract reorganised to align with clearer separation of issues in the memo.

   *  Introduction reorganised with motivating arguments removed to new Section 2.

   *  Clarified avoiding lock-out of large packets is not the main or only motivation for RED.

   *  Mentioned choice of drop or marking explicitly throughout, rather than trying to coin a word to mean either.

   *  Generalised the discussion throughout to any packet forwarding function on any network equipment, not just routers.

   *  Clarified the last point about why this is a good time to sort out this issue: because it will be hard / impossible to design new transports unless we decide whether the network or the transport is allowing for packet size.

   *  Added statement explaining the horizon of the memo is long term, but with short term expediency in mind.

   *  Added material on scaling congestion control with packet size (Section 2.1).

   *  Separated out issue of normalising TCP's bit rate from issue of preference to control packets (Section 2.3).

   *  Divided up Congestion Measurement section for clarity, including new material on fixed size packet buffers and buffer carving (Section 3.1.1 & Section 3.2.1) and on congestion measurement in wireless link technologies without queues (Section 3.1.2).
   *  Added section on 'Making Transports Robust against Control Packet Losses' (Section 3.2.3) with existing & new material included.

   *  Added tabulated results of vendor survey on byte-mode drop variant of RED (Table 2).

   From -00 to -01:

   *  Clarified applicability to drop as well as ECN.

   *  Highlighted DoS vulnerability.

   *  Emphasised that drop-tail suffers from similar problems to byte-mode drop, so only byte-mode drop should be turned off, not RED itself.

   *  Clarified that the original apparent motivations for recommending byte-mode drop included protecting SYNs and pure ACKs more than equalising the bit rates of TCPs with different segment sizes.  Removed some conjectured motivations.

   *  Added support for updates to TCP in progress (ackcc & ecn-syn-ack).

   *  Updated survey results with newly arrived data.

   *  Pulled all recommendations together into the conclusions.

   *  Moved some detailed points into two additional appendices and a note.

   *  Considerable clarifications throughout.

   *  Updated references.

Authors' Addresses

   Bob Briscoe
   BT
   B54/77, Adastral Park
   Martlesham Heath
   Ipswich  IP5 3RE
   UK

   Phone: +44 1473 645196
   EMail: bob.briscoe@bt.com
   URI:   http://bobbriscoe.net/


   Jukka Manner
   Aalto University
   Department of Communications and Networking (Comnet)
   P.O. Box 13000
   FIN-00076 Aalto
   Finland

   Phone: +358 9 470 22481
   EMail: jukka.manner@tkk.fi
   URI:   http://www.netlab.tkk.fi/~jmanner/