IP Performance Working Group                                  M. Mathis
Internet-Draft                                               Google, Inc
Intended status: Experimental                                  A. Morton
Expires: April 24, 2014                                        AT&T Labs
                                                        October 21, 2013

                 Model Based Bulk Performance Metrics
              draft-ietf-ippm-model-based-metrics-01.txt

Abstract

We introduce a new class of model based metrics designed to determine if a long network path can meet predefined end-to-end application performance targets by applying a suite of IP diagnostic tests to successive subpaths.  The subpath at a time tests are designed to exclude all known conditions which might prevent the full end-to-end path from meeting the user's target application performance.  This approach makes it possible to determine the IP performance requirements needed to support the desired end-to-end TCP performance.  The IP metrics are based on traffic patterns that mimic TCP or another transport protocol but are precomputed independently of the actual behavior of the transport protocol over the subpath under test.  This makes the measurements open loop, eliminating nearly all of the difficulties encountered by traditional bulk transport metrics, which fundamentally depend on congestion control equilibrium behavior.  A natural consequence of this methodology is verifiable network measurement: measurements from any given vantage point can be verified by repeating them from other vantage points.

Formatted: Mon Oct 21 15:42:35 PDT 2013

Status of this Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF).  Note that other groups may also distribute working documents as Internet-Drafts.  The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time.  It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on April 24, 2014.

Copyright Notice

Copyright (c) 2013 IETF Trust and the persons identified as the document authors.  All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document.  Please review these documents carefully, as they describe your rights and restrictions with respect to this document.  Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  5
     1.1.  TODO . . . . . . . . . . . . . . . . . . . . . . . . . . .  6
   2.  Terminology  . . . . . . . . . . . . . . . . . . . . . . . . .  6
   3.  New requirements relative to RFC 2330  . . . . . . . . . . . .  9
   4.  Background . . . . . . . . . . . . . . . . . . . . . . . . . . 10
     4.1.  TCP properties . . . . . . . . . . . . . . . . . . . . . . 12
   5.  Common Models and Parameters . . . . . . . . . . . . . . . . . 14
     5.1.  Target End-to-end parameters . . . . . . . . . . . . . . . 14
     5.2.  Common Model Calculations  . . . . . . . . . . . . . . . . 15
     5.3.  Parameter Derating . . . . . . . . . . . . . . . . . . . . 16
   6.  Common testing procedures  . . . . . . . . . . . . . . . . . . 16
     6.1.  Traffic generating techniques  . . . . . . . . . . . . . . 16
       6.1.1.  Paced transmission . . . . . . . . . . . . . . . . . . 16
       6.1.2.  Constant window pseudo CBR . . . . . . . . . . . . . . 17
       6.1.3.  Scanned window pseudo CBR  . . . . . . . . . . . . . . 18
       6.1.4.  Concurrent or channelized testing  . . . . . . . . . . 18
       6.1.5.  Intermittent Testing . . . . . . . . . . . . . . . . . 19
       6.1.6.  Intermittent Scatter Testing . . . . . . . . . . . . . 20
     6.2.  Interpreting the Results . . . . . . . . . . . . . . . . . 20
       6.2.1.  Test outcomes  . . . . . . . . . . . . . . . . . . . . 20
       6.2.2.  Statistical criteria for measuring run_length  . . . . 21
       6.2.3.  Reordering Tolerance . . . . . . . . . . . . . . . . . 23
     6.3.  Test Qualifications  . . . . . . . . . . . . . . . . . . . 23
       6.3.1.  Verify the Traffic Generation Accuracy . . . . . . . . 23
       6.3.2.  Verify the absence of cross traffic  . . . . . . . . . 24
       6.3.3.  Additional test preconditions  . . . . . . . . . . . . 25
   7.  Diagnostic Tests . . . . . . . . . . . . . . . . . . . . . . . 25
     7.1.  Basic Data Rate and Run Length Tests . . . . . . . . . . . 25
       7.1.1.  Run Length at Paced Full Data Rate . . . . . . . . . . 26
       7.1.2.  Run Length at Full Data Windowed Rate  . . . . . . . . 26
       7.1.3.  Background Run Length Tests  . . . . . . . . . . . . . 26
     7.2.  Standing Queue tests . . . . . . . . . . . . . . . . . . . 26
       7.2.1.  Congestion Avoidance . . . . . . . . . . . . . . . . . 28
       7.2.2.  Bufferbloat  . . . . . . . . . . . . . . . . . . . . . 28
       7.2.3.  Non excessive loss . . . . . . . . . . . . . . . . . . 28
       7.2.4.  Duplex Self Interference . . . . . . . . . . . . . . . 28
     7.3.  Slowstart tests  . . . . . . . . . . . . . . . . . . . . . 29
       7.3.1.  Full Window slowstart test . . . . . . . . . . . . . . 29
       7.3.2.  Slowstart AQM test . . . . . . . . . . . . . . . . . . 29
     7.4.  Sender Rate Burst tests  . . . . . . . . . . . . . . . . . 29
     7.5.  Combined Tests . . . . . . . . . . . . . . . . . . . . . . 30
       7.5.1.  Sustained burst test . . . . . . . . . . . . . . . . . 30
       7.5.2.  Live Streaming Media . . . . . . . . . . . . . . . . . 31
   8.  Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
     8.1.  Near serving HD streaming video  . . . . . . . . . . . . . 32
     8.2.  Far serving SD streaming video . . . . . . . . . . . . . . 32
     8.3.  Bulk delivery of remote scientific data  . . . . . . . . . 33
   9.  Validation . . . . . . . . . . . . . . . . . . . . . . . . . . 33
   10. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 34
   11. Informative References . . . . . . . . . . . . . . . . . . . . 35
   Appendix A.  Model Derivations . . . . . . . . . . . . . . . . . . 36
     A.1.  Aggregate Reno . . . . . . . . . . . . . . . . . . . . . . 37
     A.2.  CUBIC  . . . . . . . . . . . . . . . . . . . . . . . . . . 37
   Appendix B.  Version Control . . . . . . . . . . . . . . . . . . . 38
   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 38

1.  Introduction

Model based bulk performance metrics evaluate an Internet path's ability to carry bulk data.  TCP models are used to design a targeted diagnostic suite (TDS) of IP performance tests which can be applied independently to each subpath of the full end-to-end path.  A targeted diagnostic suite is constructed such that independent tests of the subpaths will accurately predict if the full end-to-end path can deliver bulk data at the specified performance target, independent of the measurement vantage points or other details of the test procedures used to measure each subpath.

Each test in the TDS consists of a precomputed traffic pattern and statistical criteria for evaluating packet delivery.  TCP models are used to design traffic patterns that mimic TCP or another bulk transport protocol operating at the target performance and RTT over a full range of conditions, including flows that are bursty at multiple time scales.  The traffic patterns are computed in advance based on the properties of the full end-to-end path and independent of the properties of individual subpaths.  As much as possible the traffic is generated deterministically, in ways that minimize the extent to which test methodology, measurement points, measurement vantage or path partitioning affect the details of the traffic.

Models are also used to compute the bounds on the packet delivery statistics for acceptable IP performance.  The criteria for passing each test are determined from the end-to-end target performance and are independent of the subpath under test.  In addition to passing or failing, a test can be inconclusive if the precomputed traffic pattern was not authentically generated, test preconditions were not met or the measurement results were not statistically significant.

TCP's ability to compensate for less than ideal network conditions is fundamentally affected by the RTT and MTU of the end-to-end Internet path that it traverses.  The end-to-end path determines fixed bounds on these parameters.  The target values for these three parameters, Data Rate, RTT and MTU, are determined by the application, its intended use and the physical infrastructure over which it is intended to traverse.  These parameters are used to inform the models used to design the TDS.

This document describes a framework for deriving the traffic and delivery statistics for model based metrics.  It does not fully specify any measurement techniques.  Important details such as packet type-p selection, sampling techniques, vantage selection, etc. are out of scope for this document.  We imagine Fully Specified Targeted Diagnostic Suites (FSTDS) that fully define all of these details.  We use TDS to refer to the subset of such a specification that is in scope for this document.  A TDS includes the specifications for the traffic and delivery statistics for the diagnostic tests themselves, documentation of the models and any assumptions or derating used to derive the test parameters, and a description of the test setup used to calibrate the models, as described in later sections.
Section 2 defines terminology used throughout this document.  It has been difficult to develop BTC metrics due to some overlooked requirements described in Section 3 and some intrinsic problems with using protocols for measurement, described in Section 4.  In Section 5 we describe the models and common parameters used to derive the targeted diagnostic suite.  In Section 6 we describe common testing procedures.  Each subpath is evaluated using a suite of far simpler and more predictable diagnostic tests described in Section 7.  In Section 8 we present three example TDS: one that might be representative of HD video served fairly close to the user, a second that might be representative of standard definition video served from a greater distance, and a third that might be representative of a network designed to support high performance bulk download.

There exists a small risk that the model based metrics themselves might yield a false pass result, in the sense that every subpath of an end-to-end path passes every IP diagnostic test and yet a real application fails to attain the performance target over the end-to-end path.  If this happens, then the validation procedure described in Section 9 needs to be used to prove and potentially revise the models.

Future documents will define model based metrics for other traffic classes and application types, such as real time streaming media.

1.1.  TODO

Please send comments on this draft to ippm@ietf.org.  See http://goo.gl/02tkD for more information including: interim drafts, an up to date todo list and information on contributing.

Formatted: Mon Oct 21 15:42:35 PDT 2013

2.  Terminology

Terminology about paths, etc.  See [RFC2330] and [I-D.morton-ippm-lmap-path].

[data] sender:  Host sending data and receiving ACKs, typically via TCP.
[data] receiver:  Host receiving data and sending ACKs, typically via TCP.
subpath:  A portion of the full path.  Note that there is no requirement that subpaths be non-overlapping.
Measurement Point:  Measurement points as described in [I-D.morton-ippm-lmap-path].
test path:  A path between two measurement points that includes a subpath of the end-to-end path under test, plus possibly additional infrastructure between the measurement points and the subpath.
[Dominant] Bottleneck:  The bottleneck that determines a flow's self clock.  It generally determines the traffic statistics for the entire path.  See Section 4.1.
front path:  The subpath from the data sender to the dominant bottleneck.
back path:  The subpath from the dominant bottleneck to the receiver.
return path:  The path taken by the ACKs from the data receiver to the data sender.
cross traffic:  Other, potentially interfering, traffic competing for resources (network and/or queue capacity).

Properties determined by the end-to-end path and application.  They are described in more detail in Section 5.1.

Data Rate:  General term for the data rate as seen by the application above the transport layer.  This is the payload data rate, and excludes TCP/IP (or other protocol) headers and retransmits.
Link Data Rate:  General term for the data rate as seen by the link or lower layers.
   It includes transport and IP headers, retransmits and other transport layer overhead.  This document is agnostic as to whether the link data rate includes or excludes framing, MAC or other lower layer overheads, except that they must be treated uniformly.
end-to-end target parameters:  Application or transport performance goals for the end-to-end path.  They include the target data rate, RTT and MTU described below.
Target Data Rate:  The application or ultimate user's performance goal.  When converted to link data rate, it must be slightly smaller than the actual link data rate, otherwise there is no margin for compensating for RTT or other path properties.  These tests will be excessively brittle if the target data rate does not include any built in headroom.
Target RTT (Round Trip Time):  The baseline (minimum) RTT of the longest end-to-end path over which the application expects to meet the target performance.  This must be specified considering authentic packet sizes: MTU sized packets on the forward path, header_overhead sized packets on the return (ACK) path.
Target MTU (Maximum Transmission Unit):  The maximum MTU supported by the end-to-end path over which the application expects to meet the target performance.  Assume 1500 Bytes per packet unless otherwise specified.  If some subpath forces a smaller MTU, then it becomes the target MTU, and all model calculations and subpath tests must use the same smaller MTU.
Effective Bottleneck Data Rate:  The bottleneck data rate that might be inferred from the ACK stream, by looking at how much data the ACK stream reports was delivered per unit time.  See Section 4.1 for more details.
[sender] [interface] rate:  The burst data rate, constrained by the data sender's interfaces.  Today 1 or 10 Gb/s are typical.
Header overhead:  The IP and TCP header sizes, which are the portion of each MTU not available for carrying application payload.  Without loss of generality this is assumed to be the size for returning acknowledgements (ACKs).  For TCP, the Maximum Segment Size (MSS) is the Target MTU minus the header overhead.

Basic parameters common to models and subpath tests.  They are described in more detail in Section 5.2.
pipe size:  A general term for the number of packets needed in flight (the window size) to exactly fill some network path or subpath.  This is the window size, which is normally at the onset of queueing.
target_pipe_size:  The number of packets in flight (the window size) needed to exactly meet the target rate, with a single stream and no cross traffic, for the specified target data rate, RTT and MTU.
run length:  A general term for the observed, measured or specified number of packets that are (to be) delivered between losses or ECN marks.  Nominally one over the loss or ECN marking probability.
target_run_length:  The required run length computed from the target data rate, RTT and MTU.

Ancillary parameters used for some tests:

derating:  Under some conditions the standard models are too conservative.  The modeling framework permits some latitude in relaxing or derating some test parameters as described in Section 5.3, in exchange for more stringent TDS validation procedures, described in Section 9.
subpath_data_rate:  The maximum IP data rate supported by a subpath.  This typically includes TCP/IP overhead, including headers, retransmits, etc.
test_path_RTT:  The RTT (using appropriate packet sizes) between two measurement points.
test_path_pipe:  The amount of data necessary to fill a test path.  Nominally the test path RTT times the subpath_data_rate (which should be part of the end-to-end subpath).
test_window:  The window necessary to meet the target_rate over a subpath.  Typically test_window = target_data_rate * test_RTT / target_MTU.

Tests can be classified into groups according to their applicability.

Capacity tests determine if a network subpath has sufficient capacity to deliver the target performance.  As long as the test traffic is within the proper envelope for the target end-to-end performance, the average packet losses or ECN marks must be below the threshold computed by the model.  As such, capacity tests reflect parameters that can transition from passing to failing as a consequence of additional presented load or the actions of other network users.  By definition, capacity tests also consume significant network resources (data capacity and/or buffer space), and the test schedules must be balanced by their cost.

Monitoring tests are designed to capture the most important aspects of a capacity test, but without causing unreasonable ongoing load themselves.  As such they may miss some details of the network performance, but can serve as a useful reduced cost proxy for a capacity test.

Engineering tests evaluate how network algorithms (such as AQM and channel allocation) interact with TCP style self clocked protocols and adaptive congestion control based on packet loss and ECN marks.  These tests are likely to have complicated interactions with other traffic and under some conditions can be inversely sensitive to load.
For example, a test to verify that an AQM algorithm causes ECN marks or packet drops early enough to limit queue occupancy may experience a false pass result in the presence of bursty cross traffic.  It is important that engineering tests be performed under a wide range of conditions, including both in situ and bench testing, and over a wide variety of load conditions.  Ongoing monitoring is less likely to be useful for engineering tests, although sparse in situ testing might be appropriate.

3.  New requirements relative to RFC 2330

[Move this entire section to a future paper]

Model Based Metrics are designed to fulfill some additional requirements that were not recognized at the time RFC 2330 [RFC2330] was written.  These missing requirements may have significantly contributed to policy difficulties in the IP measurement space.  Some additional requirements are:
o  Metrics must be actionable by the ISP - they have to be interpreted in terms of behaviors or properties at the IP or lower layers that an ISP can test, repair and verify.
o  Metrics must be vantage point invariant over a significant range of measurement point choices (e.g., measurement points as described in [I-D.morton-ippm-lmap-path]), including off path measurement points.  The only requirements on MP selection should be that the portion of the path that is not under test is effectively ideal (or is non ideal in calibratable ways) and that the RTT between MPs is below some reasonable bound.
o  Metrics must be repeatable by multiple parties.  It must be possible for different parties to make the same measurement and observe the same results.  In particular it is specifically important that both a consumer (or their delegate) and an ISP be able to perform the same measurement and get the same result.

NB: All of the metric requirements in RFC 2330 should be reviewed and potentially revised.  If such a document is opened soon enough, this entire section should be dropped.

4.  Background

[Move to a future paper, abridge here]

At the time the IPPM WG was chartered, sound Bulk Transport Capacity measurement was known to be beyond our capabilities.  In hindsight it is now clear why it is such a hard problem:
o  TCP is a control system with circular dependencies - everything affects performance, including components that are explicitly not part of the test.
o  Congestion control is an equilibrium process: transport protocols change the network (raise the loss probability and/or RTT) to conform to their behavior.
o  TCP's ability to compensate for network flaws is directly proportional to the number of roundtrips per second (i.e. inversely proportional to the RTT).  As a consequence a flawed link may pass a short RTT local test even though it fails when the path is extended by a perfect network to some larger RTT.
o  TCP has a meta Heisenberg problem - measurement and cross traffic interact in unknown and ill defined ways.  The situation is actually worse than the traditional physics problem, where you can at least estimate the relative momentum of the measurement and measured particles.
   For network measurement you can not in general determine the relative "elasticity" of the measurement traffic and cross traffic, so you can not even gauge the relative magnitude of their effects on each other.

The MBM approach is to "open loop" TCP by precomputing traffic patterns that are typically generated by TCP operating at the given target parameters, and evaluating delivery statistics (losses, ECN marks and delay).  In this approach the measurement software explicitly controls the data rate, transmission pattern or cwnd (TCP's primary congestion control state variables) to create repeatable traffic patterns that mimic TCP behavior but are independent of the actual network behavior of the subpath under test.  These patterns are manipulated to probe the network to verify that it can deliver all of the traffic patterns that a transport protocol is likely to generate under normal operation at the target rate and RTT.

Models are used to determine the actual test parameters (burst size, loss rate, etc.) from the target parameters.  The basic method is to use models to estimate specific network properties required to sustain a given transport flow (or set of flows), and to use a suite of metrics to confirm that the network meets the required properties.

A network is expected to be able to sustain a Bulk TCP flow of a given data rate, MTU and RTT when the following conditions are met:
o  The raw link rate is higher than the target data rate.
o  The raw packet run length is larger than required by a suitable TCP performance model.
o  There is sufficient buffering at the dominant bottleneck to absorb a slowstart rate burst large enough to get the flow out of slowstart at a suitable window size.
o  There is sufficient buffering in the front path to absorb and smooth sender interface rate bursts at all scales that are likely to be generated by the application, any channel arbitration in the ACK path or other mechanisms.
o  When there is a standing queue at a bottleneck for a shared media subpath, there are suitable bounds on how the data and ACKs interact, for example due to the channel arbitration mechanism.
o  When there is a slowly rising standing queue at the bottleneck, the onset of packet loss has to be at an appropriate point (time or queue depth) and progressive.
The tests to verify these conditions are described in Section 7.

A singleton [RFC2330] measurement is a pass/fail evaluation of a given path or subpath at a given performance.  Note that measurements to confirm that a link passes at one particular performance might not be useful to predict if the link will pass at a different performance.

A TDS does have several valuable properties, such as natural ways to define several different composition metrics [RFC5835].  [Add text on algebra on metrics (A-Frame from [RFC2330]) and tomography.]

The Spatial Composition of fundamental IPPM metrics has been studied and standardized.  For example, the algebra to combine empirical assessments of loss ratio to estimate complete path performance is described in section 5.1.5 of [RFC6049].  We intend to use this and other composition metrics as necessary.
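As a non-normative illustration of that composition algebra, the following sketch (ours, not from this document; see [RFC6049] for the normative definition) combines empirical subpath loss ratios into an estimate for the complete path:

   def composed_loss_ratio(subpath_loss_ratios):
       # The complete path loss ratio is approximately the complement
       # of the product of the subpath delivery ratios, assuming
       # losses are independent across subpaths.
       delivered = 1.0
       for p in subpath_loss_ratios:
           delivered *= (1.0 - p)
       return 1.0 - delivered

   # Example: three subpaths at 0.01% loss each compose to ~0.03%.
   print(composed_loss_ratio([0.0001, 0.0001, 0.0001]))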
We are developing a tool that can perform many of the tests described here [MBMSource].

4.1.  TCP properties

[Move this entire section to a future paper]

TCP and SCTP are self clocked protocols.  The dominant steady state behavior is to have an approximately fixed quantity of data and acknowledgements (ACKs) circulating in the network.  The receiver reports arriving data by returning ACKs to the data sender, and the data sender most frequently responds by sending exactly the same quantity of data back into the network.  The quantity of data plus the data represented by ACKs circulating in the network is referred to as the window.  The mandatory congestion control algorithms incrementally adjust the window by sending slightly more or less data in response to each ACK.  The fundamentally important property of this system is that it is entirely self clocked: the data transmissions are a reflection of the ACKs that were delivered by the network, and the ACKs are a reflection of the data arriving from the network.

A number of phenomena can cause bursts of data, even in idealized networks that are modeled as simple queueing systems.  During slowstart the data rate is doubled on each RTT by sending twice as much data as was delivered to the receiver on the prior RTT.  For slowstart to be able to fill such a network, the network must be able to tolerate slowstart bursts up to the full pipe size inflated by the anticipated window reduction on the first loss or ECN mark.  For example, with classic Reno congestion control, an optimal slowstart has to end with a burst that is twice the bottleneck rate for exactly one RTT in duration.  This burst causes a queue which is exactly equal to the pipe size (the window is exactly twice the pipe size), so when the window is halved, the new window will be exactly the pipe size.

Another source of bursts is application pauses.  If the application pauses (stops reading or writing data) for some fraction of one RTT, a state-of-the-art TCP "catches up" to the earlier window size by sending a burst of data at the full sender interface rate.  To fill such a network with a realistic application, the network has to be able to tolerate interface rate bursts from the data sender large enough to cover application pauses.

Note that if the bottleneck data rate is significantly slower than the rest of the path, the slowstart bursts will not cause significant queues anywhere else along the path; they primarily exercise the queue at the dominant bottleneck.  Furthermore, although the interface rate bursts caused by the application are likely to be smaller than the last burst of a slowstart, they are at a higher rate, so they can exercise queues at arbitrary points along the "front path" from the data sender up to and including the queue at the bottleneck.

For many network technologies a simple queueing model does not apply: the network schedules, thins or otherwise alters the timing of ACKs and data, generally to raise the efficiency of the channel allocation process when confronted with relatively widely spaced small ACKs.  These efficiency strategies are ubiquitous for half duplex, wireless or broadcast media.

Altering the ACK stream generally has two consequences: it raises the effective bottleneck data rate, making slowstart bursts happen at higher rates (possibly as high as the sender's interface rate), and it effectively raises the RTT by the time that the ACKs were postponed.  The first effect can be partially mitigated by reclocking ACKs once they are beyond the bottleneck on the return path to the sender, however this further raises the effective RTT.
The most extreme example of this class of behaviors is a half duplex channel that is never released until the current end point has no pending traffic.  Such environments cause self clocked protocols to revert to extremely inefficient stop and wait behavior, where they send an entire window of data as a single burst, followed by the entire window of ACKs on the return path.

If a particular end-to-end path contains a link or device that alters the ACK stream, then the entire path from the sender up to the bottleneck must be tested at the burst parameters implied by the ACK scheduling algorithm.  The most important parameter is the Effective Bottleneck Data Rate, which is the average rate at which the ACKs advance snd.una.  Note that thinning the ACKs (relying on the cumulative nature of seg.ack to permit discarding some ACKs) implies an effectively infinite bottleneck data rate.

To verify that a path can meet the performance target, it is necessary to independently confirm that the entire path can tolerate bursts in the dimensions that are likely to be induced by the application and any data or ACK scheduling anywhere in the path.  Two common cases are the most important: slowstart bursts at twice the effective bottleneck data rate, and somewhat smaller sender interface rate bursts.  The slowstart rate bursts must be at least as large as target_pipe_size packets and should be twice as large (so that the peak queue occupancy at the dominant bottleneck would be approximately target_pipe_size).

There is no general model for how well the network needs to tolerate sender interface rate bursts.  All existing TCP implementations send full sized, full rate bursts under some typically uncommon conditions, such as application pauses that approximately match the RTT, or when ACKs are lost or thinned.  Strawman: partial window bursts (some fraction of target_pipe_size) should be tolerated without significantly raising the loss probability.  Full target_pipe_size bursts may slightly increase the loss probability.  Interface rate bursts as large as twice target_pipe_size should not cause deterministic packet drops.
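The queue arithmetic behind this slowstart burst guidance can be made concrete with a toy fluid model (ours, illustrative only): a burst of 2*target_pipe_size packets delivered at twice the bottleneck rate lasts one RTT, during which the bottleneck drains target_pipe_size packets, leaving a peak queue of approximately target_pipe_size:

   def slowstart_peak_queue(target_pipe_size):
       arrivals = 2 * target_pipe_size  # one RTT at twice the bottleneck rate
       drained = target_pipe_size       # one RTT of service at the bottleneck rate
       return arrivals - drained        # peak queue ~ target_pipe_size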
5.  Common Models and Parameters

5.1.  Target End-to-end parameters

The target end-to-end parameters are the target data rate, target RTT and target MTU as defined in Section 2.  These parameters are determined by the needs of the application or the ultimate end user and the end-to-end Internet path over which the application is expected to operate.  The target parameters are in units that make sense to the upper layer: payload bytes delivered to the application, above TCP.  They exclude overheads associated with TCP and IP headers, retransmits and other protocols (e.g. DNS).

In addition, other end-to-end parameters include the effective bottleneck data rate, the sender interface data rate and the TCP/IP header sizes (overhead).  Note that the target parameters can be specified for a hypothetical path, for example to construct a TDS designed for bench testing in the absence of a real application, or for a real physical test, for in situ testing of production infrastructure.

The number of concurrent connections is explicitly not a parameter to this model [unlike earlier drafts].  If a subpath requires multiple connections in order to meet the specified performance, that must be stated explicitly, and the procedure described in Section 6.1.4 applies.

5.2.  Common Model Calculations

The most important derived parameter is target_pipe_size (in packets), which is the window size: the number of packets needed in flight to exactly meet the target rate, with no cross traffic, for the specified target data rate, RTT and MTU.  It is given by:

   target_pipe_size = target_rate * target_RTT /
                      ( target_MTU - header_overhead )

If the transport protocol (e.g. TCP) average window size is smaller than this, it will not meet the target rate.

The reference target_run_length is a very conservative model for the minimum required spacing between losses or ECN marks.  It can be derived as follows: assume the subpath_data_rate is infinitesimally larger than the target_data_rate plus the required header overheads.  Then target_pipe_size also predicts the onset of queueing.  If the transport protocol (e.g. TCP) has a window size that is larger than the target_pipe_size, the excess packets will raise the RTT, typically by forming a standing queue at the bottleneck.

Assume the transport protocol is using standard Reno style Additive Increase, Multiplicative Decrease congestion control [RFC5681] and the receiver is using standard delayed ACKs.  With delayed ACKs there must be 2*target_pipe_size roundtrips between losses; otherwise the multiplicative window reduction triggered by a loss would cause the network to be underfilled.  We derive the number of packets between losses from the area under the AIMD sawtooth, following [MSMO97].  Losses must be no more frequent than every 1 in (3/2)*target_pipe_size*(2*target_pipe_size) packets.  This simplifies to:

   target_run_length = 3*(target_pipe_size^2)

Note that this calculation is very conservative and is based on a number of assumptions that may not apply.  Appendix A discusses these assumptions and provides some alternative models.  If a less conservative model is used, a fully specified TDS or FSTDS MUST document the actual method for computing target_run_length, along with the rationale for the underlying assumptions and the ratio of the chosen target_run_length to the reference target_run_length calculated above.
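A minimal sketch of these two calculations in Python (ours, not part of this framework; the function names and unit conventions are our assumptions):

   import math

   def target_pipe_size(target_rate, target_rtt, target_mtu,
                        header_overhead):
       # target_rate in bytes/second, target_rtt in seconds,
       # packet sizes in bytes
       mss = target_mtu - header_overhead  # payload bytes per packet
       return math.ceil(target_rate * target_rtt / mss)

   def reference_target_run_length(pipe_size):
       # (3/2) * pipe * (2 * pipe) = 3 * pipe^2, per the derivation above
       return 3 * pipe_size ** 2

   # Example: 1 Mb/s (125000 B/s), 100 ms RTT, 1500 B MTU and 40 B of
   # headers give target_pipe_size = 9 packets and a reference
   # target_run_length of 243 packets.
   pipe = target_pipe_size(125000, 0.100, 1500, 40)
   print(pipe, reference_target_run_length(pipe))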
These two parameters, target_pipe_size and target_run_length, directly imply most of the individual parameters for the tests below.  Target_pipe_size is the window size, the amount of circulating data required to meet the target data rate, and implies the scale of the bursts that the network might experience.  Target_run_length is the amount of data required between losses or ECN marks for standard congestion control.  The individual parameters for each diagnostic test are described below.  In a few cases there are no well established models for what constitutes correct network operation; in many of these cases the problems might be partially mitigated by future improvements to TCP implementations.

5.3.  Parameter Derating

Since some aspects of the models are very conservative, this framework permits some latitude in derating test parameters.  Rather than trying to formalize more complicated models, we permit some test parameters to be relaxed as long as they meet some additional procedural constraints:
o  The TDS or FSTDS MUST document and justify the actual method used to compute the derated metric parameters.
o  The validation procedures described in Section 9 must be used to demonstrate the feasibility of meeting the performance targets with infrastructure that infinitesimally passes the derated tests.
o  The validation process itself must be documented in such a way that other researchers can duplicate the validation experiments.

Except as noted, all tests below assume no derating.  Tests for which there is not currently a well established model for the required parameters include derating as a way to indicate flexibility in the parameters.

6.  Common testing procedures

6.1.  Traffic generating techniques
6.1.1.  Paced transmission

Paced (burst) transmissions: send bursts of data on a timer to meet a particular target rate and pattern.  In all cases the specified data rate can be either the application or link rate.  Header overheads must be included in the calculations as appropriate.

Paced single packets:  Send individual packets at the specified rate or headway.
Burst:  Send sender interface rate bursts on a timer.  Specify any 3 of: average rate, packet size, burst size (number of packets) and burst headway (burst start to start); see the sketch following this list.  These bursts are typically sent as back-to-back packets at the tester's interface rate.
Slowstart bursts:  Send 4 packet sender interface rate bursts at an average data rate equal to twice the effective bottleneck link rate (but not more than the sender interface rate).  This corresponds to the average rate during a TCP slowstart when Appropriate Byte Counting [ABC] is present or delayed ack is disabled.
Repeated Slowstart bursts:  Slowstart bursts are typically part of a larger scale pattern of repeated bursts, such as sending target_pipe_size packets as slowstart bursts on a target_RTT headway (burst start to burst start).  Such a stream has three different average rates, depending on the averaging time scale.  At the finest time scale the average rate is the same as the sender interface rate, at a medium scale the average rate is twice the effective bottleneck link rate, and at the longest time scales the average rate is the target data rate.  Note that if the effective bottleneck link rate is more than half of the sender interface rate, slowstart bursts become sender interface rate bursts.
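The relationship among the burst parameters can be sketched as follows (ours, illustrative only; given any three parameters the fourth follows):

   def burst_headway(burst_size, packet_size, average_rate):
       # Burst start-to-start headway (seconds) that yields
       # average_rate.  burst_size in packets, packet_size in bytes
       # (at the layer whose rate is specified), average_rate in
       # bytes/second.
       return burst_size * packet_size / average_rate

   # Example: 4 packet slowstart bursts of 1500 B packets at twice a
   # 10 Mb/s effective bottleneck (1.25e6 B/s) have a 2.4 ms headway.
   print(burst_headway(4, 1500, 2 * 1.25e6))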
6.1.2.  Constant window pseudo CBR

Implement pseudo constant bit rate by running a standard protocol such as TCP with a fixed bound on the window size.  The rate is only maintained on average over each RTT, and is subject to limitations of the transport protocol.  The bound on the window size is computed from the target_data_rate and the actual RTT of the test path.

If the transport protocol fails to maintain the test rate within the prescribed data rates, the test MUST NOT be considered passing.  If there is a signature of a network problem (e.g. the run length is too small), then the test can be considered to fail.  Since packet loss and ECN marks are required to reduce the data rate for standard transport protocols, the test specification must include suitable allowances in the prescribed data rates.  If there is not a sufficient signature of a network problem, then failing to make the prescribed data rate must be considered inconclusive.  Otherwise there are some cases where tester failures might cause false negative test results.

6.1.3.  Scanned window pseudo CBR

Same as the above, except the window is scanned across a range of sizes designed to include two key events: the onset of queueing and the onset of packet loss or ECN marks.  The window is scanned by incrementing it by one packet for every 2*target_pipe_size delivered packets (a schedule sketched at the end of this section).  This mimics the additive increase phase of standard congestion avoidance and normally separates the window increases by approximately twice the target_RTT.

There are two versions of this test: one built by applying a window clamp to standard congestion control, and one built by stiffening a non-standard transport protocol.  When standard congestion control is in effect, any losses or ECN marks cause the transport to revert to a window smaller than the clamp, such that the scanning clamp loses control of the window size.  The NPAD pathdiag tool is an example of this class of algorithms [Pathdiag].

Alternatively, a non-standard congestion control algorithm can respond to losses by transmitting extra data, such that it (attempts to) maintain the specified window size independent of losses or ECN marks.  Such a stiffened transport explicitly violates mandatory Internet congestion control and is not suitable for in situ testing.  It is only appropriate for engineering testing under laboratory conditions.  The Windowed Ping tool implemented such a test [WPING].  This tool has been updated and is under test [mpingSource].

The test procedures in Section 7.2 describe how to partition the scans into regions and how to interpret the results.
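The scanning schedule itself can be sketched as follows (ours, a simplification; a real implementation must also detect the onset of queueing and of losses or ECN marks):

   def scanned_windows(start_window, max_window, target_pipe_size):
       # Yield (clamp, packets to deliver at that clamp): one packet
       # of window increase per 2*target_pipe_size delivered packets,
       # i.e. roughly one increase per two RTTs.
       window = start_window
       while window <= max_window:
           yield window, 2 * target_pipe_size
           window += 1

Scanning from below the expected onset of queueing up past target_pipe_size covers both key events.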
6.1.4.  Concurrent or channelized testing

The procedures described in this document are only directly applicable to single stream performance measurement, e.g. one TCP connection.  In an ideal world we would disallow all performance claims based on multiple concurrent streams, but this is not practical due to at least two different issues.  First, many very high rate link technologies are channelized, and pin individual flows to specific channels to minimize reordering or to solve other problems; second, TCP itself has scaling limits.

Although the former problem might be overcome through different design decisions, the latter problem is more deeply rooted.  All standard [RFC5681] and de facto standard [CUBIC] congestion control algorithms have scaling limits, in the sense that as a network over a fixed RTT and MTU gets faster, all congestion control algorithms get less accurate.  In general their noise immunity drops (a single packet drop should have less effect as individual packets become smaller relative to the window size) and the control frequency of the AIMD sawtooth also drops, meaning that as TCP is using more total capacity it gets less information about the state of the network and other traffic.  These properties are a direct consequence of the original Reno design and are implicitly required by the requirement that all transport protocols be "TCP friendly" [Guidelines].

There are a number of reasons to want to specify performance in terms of multiple concurrent flows, although there are a number of downsides @@@@.  The use of multiple connections in the Internet has been very controversial since the beginning of the World-Wide-Web [first complaint].  Modern browsers open many connections [BScope].  Experts associated with the IETF transport area have frequently spoken against this practice [long list].  It is not inappropriate to assume some small number of concurrent connections (e.g. 4 or 6), to compensate for limitations in TCP.  However, choosing too large a number is at risk of being interpreted as a signal by the web browser community that this practice has been embraced by the Internet service provider community.  It may not be desirable to send such a signal.  Note that the current proposal for httpbis [SPDY] is specifically designed to work best with a single TCP connection per client server pair, because it uses adaptive compression, which requires sending separate compression dictionaries per connection.  As long as TCP can use IW10 and some of the transport parameters can be cached, multiple connections provide a negative gain, due to the replicated compression overhead.

Specifying multiple connections is not recommended for data rates below several Mb/s, which can be attained with run lengths under 10000 packets.  Since the required run length goes as the square of the data rate, at higher rates (see Section 8.3) the run lengths can be unfeasibly large, and multiple connections might be the only feasible approach.
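As an illustration of that quadratic scaling (our numbers, computed directly from the Section 5.2 formulas with target_RTT = 100 ms, target_MTU = 1500 bytes and 40 bytes of header overhead): raising the target rate from 10 Mb/s (target_pipe_size of roughly 86 packets, target_run_length of roughly 22,000 packets) to 1 Gb/s (roughly 8600 packets and 220 million packets respectively) raises the required run length by four orders of magnitude.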
6.1.5.  Intermittent Testing

Any test which does not depend on queueing (e.g. the CBR tests) or which experiences periodic zero outstanding data during normal operation (e.g. between bursts for the various burst tests) can be formulated as an intermittent test.  Intermittent testing can be used for ongoing monitoring for changes in subpath quality with minimal disruption to users.  It should be used in conjunction with the full rate test because this method assesses an average_run_length over a long time interval w.r.t. user sessions.  It may false fail due to other legitimate congestion causing traffic, or may false pass changes in underlying link properties (e.g. a modem retraining to an out of contract lower rate).  [Need text about bias (false pass) in the shadow of loss caused by excessive bursts]

6.1.6.  Intermittent Scatter Testing

Intermittent scatter testing: when testing the network path to or from an ISP subscriber aggregation point (CMTS, DSLAM, etc), intermittent tests can be spread across a pool of users such that no one user experiences the full impact of the testing, even though the traffic to or from the ISP subscriber aggregation point is sustained at full rate.

6.2.  Interpreting the Results

6.2.1.  Test outcomes

A singleton is a pass/fail measurement of a subpath.  If any subpath fails any test, then the end-to-end path is also expected to fail to attain the target performance under some conditions.

In addition we use the "inconclusive" outcome to indicate that a test failed to attain the required test conditions.  A test is inconclusive if the precomputed traffic pattern was not authentically generated, test preconditions were not met or the measurement results were not statistically significant.  This is important to the extent that the diagnostic tests use protocols which themselves include built in control systems that might interfere with some aspect of the test.
For example, consider a test that is implemented by adding rate controls and loss instrumentation to TCP: meeting the run length specification while failing to attain the specified data rate must be treated as an inconclusive result, because we can not a priori determine whether the reduced data rate was caused by a TCP problem or a network problem, or whether the reduced data rate had a material effect on the run length measurement.  (Note that if the measured run length was too small, the test can be considered to have failed, because it doesn't really matter that the test didn't attain the required data rate.)

The vantage independence properties of Model Based Metrics depend on the accuracy of the distinction between conclusive (pass or fail) and inconclusive tests.  One way to view inconclusive tests is that they reflect situations where the signature is ambiguous between problems with the subpath and problems with the diagnostic test itself.  One of the goals for evolving diagnostic test designs will be to keep sharpening this distinction.  One of the goals of evolving the testing process, procedures and measurement point selection should be to minimize the number of inconclusive tests.

Note that procedures that attempt to sweep the target parameter space to find the bounds on some parameter (for example to find the highest data rate for a subpath) are likely to break the location independent properties of Model Based Metrics: the boundary between passing and inconclusive is extremely likely to be RTT sensitive, because TCP's ability to compensate for problems scales with the number of round trips per second.

6.2.2.  Statistical criteria for measuring run_length

When evaluating the observed run_length, we need to determine appropriate packet stream sizes and acceptable error levels for efficient methods of measurement.  In practice, can we compare the empirically estimated loss probabilities with the targets as the sample size grows?  How large a sample is needed to say that the measurements of packet transfer indicate a particular run-length is present?

The generalized measurement can be described as recursive testing: send packets (individually or in patterns) and observe the packet transfer performance (loss ratio or other metric, any defect we define).  As each packet is sent and measured, we have an ongoing estimate of the performance in terms of the defect to total packet ratio (an empirical probability).  We continue to send until conditions support a conclusion or a maximum sending limit has been reached.

We have a target_defect_probability, 1 defect per target_run_length, where a "defect" is defined as a lost packet, a packet with an ECN mark, or other impairment.  This constitutes the null Hypothesis:

   H0: no more than one defect in target_run_length =
       3*(target_pipe_size)^2 packets

and we can stop sending packets if on-going measurements support accepting H0 with the specified Type I error = alpha (= 0.05 for example).

We also have an alternative Hypothesis to evaluate: whether performance is significantly lower than the target_defect_probability.  Based on an analysis of typical values and practical limits on measurement duration, we choose four times the H0 probability:

   H1: one or more defects in (target_run_length/4) packets

and we can stop sending packets if measurements support rejecting H0 with the specified Type II error = beta (= 0.05 for example), thus preferring the alternate hypothesis H1.
   H0 and H1 constitute the Success and Failure outcomes described
   elsewhere in the memo; while the ongoing measurements do not
   support either hypothesis the current status of measurements is
   inconclusive.

   The problem above is formulated to match the Sequential Probability
   Ratio Test (SPRT) [StatQC], which also starts with a pair of
   hypotheses specified as above:

      H0: p0 = one defect in target_run_length
      H1: p1 = one defect in target_run_length/4

   As packets are sent and measurements collected, the tester
   evaluates the cumulative defect count against two boundaries
   representing H0 Acceptance or Rejection (and acceptance of H1):

      Acceptance line: Xa = -h1 + s*n
      Rejection line:  Xr =  h2 + s*n

   where n increases linearly for each packet sent and

      h1 = log((1-alpha)/beta)/k
      h2 = log((1-beta)/alpha)/k
      k  = log( (p1*(1-p0)) / (p0*(1-p1)) )
      s  = log( (1-p0)/(1-p1) )/k

   for p0 and p1 as defined in the null and alternative Hypothesis
   statements above, and alpha and beta as the Type I and Type II
   errors.

   The SPRT specifies simple stopping rules:

   o  Xa < defect_count(n) < Xr: continue testing
   o  defect_count(n) <= Xa: Accept H0
   o  defect_count(n) >= Xr: Accept H1

   The calculations above are implemented in the R-tool for
   Statistical Analysis [Rtool], in the add-on package for Cross-
   Validation via Sequential Testing (CVST) [CVST].

   Using the equations above, we can calculate the minimum number of
   packets (n) needed to accept H0 when x defects are observed.  For
   example, when x = 0:

      Xa = 0 = -h1 + s*n  and  n = h1/s
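   The decision procedure above is simple to implement.  The following
   sketch (Python; variable names are illustrative, and alpha = beta =
   0.05 as in the example above) computes the two lines and applies
   the stopping rules.  Any logarithm base works, since the ratios
   cancel:

       import math

       def sprt_boundaries(target_run_length, alpha=0.05, beta=0.05):
           """Return (h1, h2, s) for H0: p0 = 1/target_run_length and
           H1: p1 = 4/target_run_length, per Section 6.2.2."""
           p0 = 1.0 / target_run_length
           p1 = 4.0 / target_run_length   # one defect per run_length/4
           k = math.log((p1 * (1 - p0)) / (p0 * (1 - p1)))
           h1 = math.log((1 - alpha) / beta) / k
           h2 = math.log((1 - beta) / alpha) / k
           s = math.log((1 - p0) / (1 - p1)) / k
           return h1, h2, s

       def sprt_decision(defect_count, n, h1, h2, s):
           """Apply the stopping rules after n packets."""
           xa, xr = -h1 + s * n, h2 + s * n   # acceptance, rejection
           if defect_count <= xa:
               return "accept H0 (pass)"
           if defect_count >= xr:
               return "accept H1 (fail)"
           return "continue testing"

       h1, h2, s = sprt_boundaries(1452)  # run length from Section 8.1
       # minimum error-free packets needed to accept H0 (about 1423)
       print(math.ceil(h1 / s))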
6.2.3.  Reordering Tolerance

   All tests must be instrumented for reordering [RFC4737].

   NB: there is no global consensus for how much reordering tolerance
   is appropriate or reasonable.  ("None" is absolutely unreasonable.)

   Section 5 of [RFC4737] proposed a metric that may be sufficient to
   designate isolated reordered packets as effectively lost, because
   TCP's retransmission response would be the same.

   [As a strawman, we propose the following:] TCP should be able to
   adapt to reordering as long as the reordering extent is no more
   than the maximum of one half window or 1 ms, whichever is larger.
   Note that there is a fundamental tradeoff between tolerance to
   reordering and how quickly algorithms such as fast retransmit can
   repair losses.  Within this limit on reordering extent, there
   should be no bound on reordering density.

   NB: Traditional TCP implementations were not compatible with this
   metric; newer implementations still need to be evaluated.

   Parameters:

   Reordering displacement:  the maximum of one half of
      target_pipe_size or 1 ms.

6.3.  Test Qualifications

   This entire section might be summarized as "needs to be specified
   in a FSTDS".  Things to monitor before, during and after a test.

6.3.1.  Verify the Traffic Generation Accuracy

   [Excess detail for this doc.  To be summarized]

   For most tests, failing to accurately generate the test traffic
   indicates an inconclusive test, since it has to be presumed that
   the error in traffic generation might have affected the test
   outcome.  To the extent that the network itself had an effect on
   the traffic generation (e.g. in the standing queue tests), the
   possibility exists that allowing too large an error margin in the
   traffic generation might introduce feedback loops that compromise
   the vantage independence properties of these tests.

   Parameters:

   Maximum Data Rate Error  The permitted amount that the test traffic
      can differ from the rate specified for the current test.  This
      is a symmetrical bound.

   Maximum Data Rate Overage  The permitted amount that the test
      traffic can be above the rate specified for the current test.

   Maximum Data Rate Underage  The permitted amount that the test
      traffic can be below the rate specified for the current test.
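   Taken together, these bounds amount to a simple qualification
   check.  A minimal sketch follows (Python; the parameter names
   mirror the list above but the function itself is an illustrative
   assumption).  Note that a failed check marks the test inconclusive,
   not failed:

       def traffic_generation_ok(measured_rate, specified_rate,
                                 max_overage, max_underage):
           """True if the generated traffic stayed within the
           Section 6.3.1 bounds; False means inconclusive."""
           return (measured_rate <= specified_rate + max_overage and
                   measured_rate >= specified_rate - max_underage)

       # Example: a 5 Mb/s test with a symmetrical 2% bound.
       assert traffic_generation_ok(5.06e6, 5.0e6, 0.1e6, 0.1e6)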
6.3.2.  Verify the absence of cross traffic

   [Excess detail for this doc.  To be summarized]

   The proper treatment of cross traffic is different for different
   subpaths.  In general when testing infrastructure which is
   associated with only one subscriber, the test should be treated as
   inconclusive if that subscriber is active on the network.  However,
   for shared infrastructure the question at hand is likely to be
   whether the provider has sufficient total capacity.  In such cases
   the presence of cross traffic due to other subscribers is
   explicitly part of the network conditions, and its effects are
   explicitly part of the test.

   @@@@ Need to distinguish between ISP managed sharing and unmanaged
   sharing, e.g. WiFi.

   Note that canceling tests due to load on subscriber lines may
   introduce sampling errors for testing other parts of the
   infrastructure.  For this reason tests that are scheduled but not
   run due to load should be treated as a special case of
   "inconclusive".

   Use passive packet or SNMP monitoring to verify that the traffic
   volume on the subpath agrees with the traffic generated by a test.
   Ideally this should be performed before, during and after each
   test.  The goal is to provide quality assurance on the overall
   measurement process, and specifically to detect the following
   measurement failure: a user observes unexpectedly poor application
   performance, while the ISP observes that the access link is running
   at the rated capacity; both fail to observe that the user's
   computer has been infected by a virus which is spewing traffic as
   fast as it can.

   Parameters:

   Maximum Cross Traffic Data Rate  The amount of excess traffic
      permitted.  Note that this will be different for different
      tests.

   One possible method is an adaptation of
   www-didc.lbl.gov/papers/SCNM-PAM03.pdf (D. Agarwal et al., "An
   Infrastructure for Passive Network Monitoring of Application Data
   Streams"): use the same technique as that paper to trigger the
   capture of SNMP statistics for the link.

6.3.3.  Additional test preconditions

   [Excess detail for this doc.  To be summarized]

   Send pre-load traffic as needed to activate radios with a sleep
   mode, or other "reactive network" elements (term defined in
   [draft-morton-ippm-2330-update-01]).  Use the procedure above to
   confirm that the pre-test background traffic is low enough.

7.  Diagnostic Tests

   The diagnostic tests are organized by which properties are being
   tested: run length, standing queues, slowstart bursts, sender rate
   bursts, and combined tests.  The combined tests reduce overhead at
   the expense of conflating the signatures of multiple failures.

7.1.  Basic Data Rate and Run Length Tests

   We propose several versions of the basic data rate and run length
   test.  All measure the number of packets delivered between losses
   or ECN marks, using a data stream that is rate controlled at or
   below the target_data_rate.  The tests below differ in how the data
   rate is controlled: the data can be paced on a timer, or window
   controlled at the full target data rate.  The first two tests
   implicitly confirm that the sub_path has sufficient raw capacity to
   carry the target_data_rate.  They are recommended for relatively
   infrequent testing, such as during an installation or auditing
   process.  The third, background run length, is a low rate test
   designed for ongoing monitoring for changes in subpath quality.
   All rely on the receiver accumulating packet delivery statistics as
   described in Section 6.2.2 to score the outcome:

   Pass: it is statistically significant that the observed run length
   is larger than the target_run_length.

   Fail: it is statistically significant that the observed run length
   is smaller than the target_run_length.

   A test is considered to be inconclusive if it failed to meet the
   data rate as specified below, failed to meet the qualifications
   defined in Section 6.3, or neither run length statistical
   hypothesis was confirmed in the allotted test duration.

7.1.1.  Run Length at Paced Full Data Rate

   Confirm that the observed run length is at least the
   target_run_length while relying on a timer to send data at the
   target_rate, using the procedure described in Section 6.1.1 with a
   burst size of 1 (single packets).  The test is considered to be
   inconclusive if the packet transmission can not be accurately
   controlled for any reason.

7.1.2.  Run Length at Full Data Windowed Rate

   Confirm that the observed run length is at least the
   target_run_length while sending at an average rate equal to the
   target_data_rate, by controlling (or clamping) the window size of a
   conventional transport protocol to a fixed value computed from the
   properties of the test path, typically
   test_window=target_data_rate*test_RTT/target_MTU (see the worked
   example below).

   Since losses and ECN marks generally cause transport protocols to
   at least temporarily reduce their data rates, this test is expected
   to be less precise about controlling its data rate.  It should not
   be considered inconclusive as long as at least some of the round
   trips reached the full target_data_rate without incurring losses.
   To pass this test the network MUST deliver target_pipe_size packets
   in target_RTT time without any losses or ECN marks at least once
   per two target_pipe_size round trips, in addition to meeting the
   run length statistical test.
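   A worked example of the window clamp calculation follows (Python;
   units and the round-up convention are assumptions here, and header
   overheads are ignored).  Note that the draft's Table 1 arrives at
   22 packets for the same parameters via a slightly different
   rounding convention:

       import math

       def test_window(target_data_rate, test_rtt, target_mtu):
           """test_window = target_data_rate*test_RTT/target_MTU,
           with rate in bits/s, RTT in seconds, MTU in bytes."""
           return math.ceil(target_data_rate * test_rtt /
                            (target_mtu * 8))

       # The 5 Mb/s / 50 ms TDS from Section 8.1:
       print(test_window(5e6, 0.050, 1500))   # -> 21 packets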
7.1.3.  Background Run Length Tests

   The background run length test is a low rate version of the target
   rate test above, designed for ongoing lightweight monitoring for
   changes in the observed subpath run length without disrupting
   users.  It should be used in conjunction with one of the above full
   rate tests because it does not confirm that the subpath can support
   the raw data rate.  Existing loss metrics such as [RFC 6673] might
   be appropriate for measuring background run length.

7.2.  Standing Queue tests

   These tests confirm that the bottleneck is well behaved across the
   onset of packet loss, which typically follows after the onset of
   queueing.  Well behaved generally means lossless for transient
   queues, but once the queue has been sustained for a sufficient
   period of time (or reaches a sufficient queue depth) there should
   be a small number of losses to signal to the transport protocol
   that it should reduce its window.  Losses that are too early can
   prevent the transport from averaging at the target_data_rate.
   Losses that are too late indicate that the queue might be subject
   to bufferbloat [Bufferbloat] and inflict excess queuing delays on
   all flows sharing the bottleneck.  Excess losses make loss recovery
   problematic for the transport protocol.  Non-linear or erratic RTT
   fluctuations suggest poor interactions between the channel
   acquisition systems and the transport self clock.  All of the tests
   in this section use the same basic scanning algorithm but score the
   link on the basis of how well it avoids each of these problems.

   For some technologies the data might not be subject to increasing
   delays, in which case the data rate will vary with the window size
   all the way up to the onset of losses or ECN marks.  For these
   technologies, the discussion of queueing does not apply, but it is
   still required that the onset of losses (or ECN marks) be at an
   appropriate point and progressive.

   Use the procedure in Section 6.1.3 to sweep the window across the
   onset of queueing and the onset of loss.
   The tests below all assume that the scan emulates standard additive
   increase and delayed ACK by incrementing the window by one packet
   for every 2*target_pipe_size packets delivered.  A scan can be
   divided into three regions: below the onset of queueing, a standing
   queue, and at or beyond the onset of loss.

   Below the onset of queueing the RTT is typically fairly constant,
   and the data rate varies in proportion to the window size.  Once
   the data rate reaches the link rate, the data rate becomes fairly
   constant, and the RTT increases in proportion to the window size.
   The precise transition between the two regions can be identified by
   the maximum network power, defined to be the ratio of data rate to
   RTT [POWER].

   For technologies that do not have conventional queues, start the
   scan at a window equal to the test_window, i.e. starting at the
   target rate instead of the power point.

   If there is random background loss (e.g. bit errors, etc), precise
   determination of the onset of packet loss may require multiple
   scans.  Above the onset of loss, all transport protocols are
   expected to experience periodic losses.  For the stiffened
   transport case these will be determined by the AQM algorithm in the
   network or the details of how the window increase function responds
   to loss.  For the standard transport case the details of the
   periodic losses are typically dominated by the behavior of the
   transport protocol itself.
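   The power point can be located directly from the scan samples.  The
   sketch below (Python; the sample data structure is an illustrative
   assumption) picks the window that maximizes data rate over RTT:

       def power_point(samples):
           """samples: list of (window, data_rate, rtt) tuples from
           one scan.  Returns the window with maximum network power,
           power = data_rate / RTT."""
           return max(samples, key=lambda s: s[1] / s[2])[0]

       # Toy scan: the data rate saturates at window 22 while the RTT
       # keeps growing, so the power point falls at the knee between
       # the two regions.
       scan = [(w, min(w, 22) * 1.0, 0.05 + max(0, w - 22) * 0.002)
               for w in range(4, 44)]
       print(power_point(scan))   # -> 22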
7.2.1.  Congestion Avoidance

   A link passes the congestion avoidance standing queue test if more
   than target_run_length packets are delivered between the power
   point (or test_window) and the first loss or ECN mark.  If this
   test is implemented using a standard congestion control algorithm
   with a clamp, it can be used in situ in the production internet as
   a capacity test.  For an example of such a test see [NPAD].

7.2.2.  Bufferbloat

   This test confirms that there is some mechanism to limit buffer
   occupancy (i.e. that prevents bufferbloat).  Note that this is not
   strictly a requirement for single stream bulk performance; however
   if there is no mechanism to limit buffer occupancy then a single
   stream with sufficient data to deliver is likely to cause the
   problems described in [RFC 2309] and [Bufferbloat].  This may cause
   only minor symptoms for the dominant flow, but has the potential to
   make the link unusable for all other flows and applications.

   Pass if the onset of loss is before a standing queue has introduced
   more delay than twice the target_RTT, or other well defined limit.
   Note that there is not yet a model for how much standing queue is
   acceptable; the factor of two chosen here reflects a rule of thumb.
   Note that in conjunction with the previous test, this test implies
   that the first loss should occur at a queueing delay which is
   between one and two times the target_RTT.

7.2.3.  Non excessive loss

   This test confirms that the onset of loss is not excessive.  Pass
   if losses are bounded by the fluctuations in the cross traffic,
   such that transient load (bursts) does not cause dips in aggregate
   raw throughput; e.g. pass as long as the losses are no more bursty
   than would be expected from a simple drop tail queue.  Although
   this test could be made more precise, it is really included here
   for pedantic completeness.

7.2.4.  Duplex Self Interference

   This engineering test confirms a bound on the interactions between
   the forward data path and the ACK return path.  Fail if the RTT
   rises by more than some fixed bound above the expected queueing
   time computed from the excess window divided by the link data rate.
   @@@@ This needs further testing.
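   The criteria of Sections 7.2.1 and 7.2.2 can be summarized as a
   small scoring function over the scan results.  The sketch below
   (Python; field names are illustrative assumptions, with delays in
   seconds) checks that the first loss comes after at least
   target_run_length delivered packets and before the standing queue
   adds more than twice target_RTT of delay:

       def score_standing_queue(delivered_before_loss,
                                queue_delay_at_loss,
                                target_run_length, target_rtt):
           return {
               # Section 7.2.1: losses not too early.
               "congestion_avoidance":
                   delivered_before_loss > target_run_length,
               # Section 7.2.2: losses not too late.
               "bufferbloat":
                   queue_delay_at_loss < 2 * target_rtt,
           }

       print(score_standing_queue(2000, 0.08, 1452, 0.050))
       # -> {'congestion_avoidance': True, 'bufferbloat': True}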
7.3.  Slowstart tests

   These tests mimic slowstart: data is sent at twice the effective
   bottleneck rate to exercise the queue at the dominant bottleneck.
   They are deemed inconclusive if the elapsed time to send the data
   burst is not less than half of the time to receive the ACKs (i.e.
   sending data too fast is ok, but sending it slower than twice the
   actual bottleneck rate as indicated by the ACKs is deemed
   inconclusive).  Space the bursts such that the average data rate is
   equal to the target_data_rate.

7.3.1.  Full Window slowstart test

   This is a capacity test to confirm that slowstart is not likely to
   exit prematurely.  Send slowstart bursts that are target_pipe_size
   total packets.  Accumulate packet delivery statistics as described
   in Section 6.2.2 to score the outcome.  Pass if it is statistically
   significant that the observed run length is larger than the
   target_run_length.  Fail if it is statistically significant that
   the observed run length is smaller than the target_run_length.

   Note that these are the same parameters as the Sender Full Window
   burst test, except the burst rate is at the slowstart rate rather
   than the sender interface rate.
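   The burst spacing follows directly from the rule above.  A minimal
   sketch (Python; units in bits/s, bytes, and seconds are
   assumptions) computes the headway that makes the average rate equal
   to the target_data_rate, and the timing check that makes a burst
   conclusive:

       def slowstart_burst_headway(burst_packets, target_data_rate,
                                   target_mtu):
           """Seconds between burst starts so the average data rate
           equals target_data_rate."""
           return burst_packets * target_mtu * 8 / target_data_rate

       def burst_conclusive(send_time, ack_elapsed):
           """Per Section 7.3: inconclusive unless the burst was sent
           in less than half the time the ACKs took to return."""
           return send_time < ack_elapsed / 2

       # Example: 22 packet bursts for the 5 Mb/s TDS of Section 8.1
       # are launched roughly every 53 ms.
       print(slowstart_burst_headway(22, 5e6, 1500))   # -> 0.0528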
7.3.2.  Slowstart AQM test

   Do a continuous slowstart (send data continuously at
   slowstart_rate) until the first loss; stop, allow the network to
   drain, and repeat, gathering statistics on the last packet
   delivered before the loss, the loss pattern, the maximum RTT, and
   the window size.  Justify the results.  There is not currently
   sufficient theory to justify requiring any particular result;
   however design decisions that affect the outcome of these tests
   also affect how the network balances between long and short flows
   (the "mice and elephants" problem).

   This is an engineering test: it would be best performed on a
   quiescent network or testbed, since cross traffic has the potential
   to change the results.

7.4.  Sender Rate Burst tests

   These tests determine how well the network can deliver bursts sent
   at the sender's interface rate.  Note that this test most heavily
   exercises the front path, and is likely to include infrastructure
   nominally out of scope.  Also, there are several details that are
   not precisely defined.  For starters there is no standard server
   interface rate: 1 Gb/s is very common today, but higher rates (e.g.
   10 Gb/s) are becoming cost effective and can be expected to be
   dominant some time in the future.

   Current standards permit TCP to send full window bursts following
   an application pause.  Congestion Window Validation [RFC 2861] is
   not required, and even if it were, it does not take effect until an
   application pause is longer than an RTO.  Since this is standard
   behavior, it is desirable that the network be able to deliver such
   bursts, otherwise application pauses will cause unwarranted losses.

   It is also understood in the application and serving community that
   interface rate bursts have a cost to the network that has to be
   balanced against other costs in the servers themselves.  For
   example TCP Segmentation Offload [TSO] reduces server CPU in
   exchange for larger network bursts, which increase the stress on
   network buffer memory.

   There is not yet theory to unify these costs or to provide a
   framework for trying to optimize global efficiency.  We do not yet
   have a model for how much the network should tolerate server rate
   bursts.  Some bursts must be tolerated by the network, but it is
   probably unreasonable to expect the network to efficiently deliver
   all data as a series of bursts.  For this reason, this is the only
   test for which we explicitly encourage derating.
   A TDS should include a table of pairs of derating parameters: what
   burst size to use as a fraction of the target_pipe_size, and how
   much each burst size is permitted to reduce the run length,
   relative to the target_run_length.  @@@@ Needs more work and
   experimentation.

7.5.  Combined Tests

   These tests are more efficient from a deployment/operational
   perspective, but may not be possible to diagnose if they fail.

7.5.1.  Sustained burst test

   Send target_pipe_size*derate sender interface rate bursts every
   target_RTT*derate, for derate between 0 and 1 (see the schedule
   sketch after this list).  Verify that the observed run length meets
   the target_run_length.  Key observations:

   o  This test is subpath RTT invariant, as long as the tester can
      generate the required pattern.

   o  The subpath under test is expected to go idle for some fraction
      of the time:
      (subpath_data_rate-target_rate)/subpath_data_rate.  Failing to
      do so suggests a problem with the procedure.

   o  This test is more strenuous than the slowstart tests: they are
      not needed if the link passes this test with derate=1.

   o  A link that passes this test is likely to be able to sustain
      higher rates (close to subpath_data_rate) for paths with RTTs
      smaller than the target_RTT.  Offsetting this performance
      underestimation is part of the rationale behind permitting
      derating in general.

   o  This test can be implemented with standard instrumented TCP
      [RFC 4898], using a specialized measurement application at one
      end and a minimal service at the other end [RFC 863, RFC 864].
      It may require tweaks to the TCP implementation.

   o  This test is efficient to implement, since it does not require
      per-packet timers, and can make use of TSO in modern NIC
      hardware.

   o  This test is not totally sufficient: the standing window
      engineering tests are also needed to be sure that the link is
      well behaved at and beyond the onset of congestion.

   o  This one test can be proven to be the one capacity test to
      supplant them all.
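   A sketch of the sustained burst schedule follows (Python; the
   sender loop, clock, and send_burst callable are illustrative
   assumptions; send_burst(n) is assumed to emit n back-to-back
   packets at interface rate, with the receiver scoring run length per
   Section 6.2.2):

       import time

       def sustained_burst_schedule(target_pipe_size, target_rtt,
                                    derate=1.0):
           burst = round(target_pipe_size * derate)  # packets/burst
           headway = target_rtt * derate             # seconds apart
           return burst, headway

       def run_test(send_burst, duration, burst, headway):
           end = time.monotonic() + duration
           while time.monotonic() < end:
               send_burst(burst)
               time.sleep(headway)

       print(sustained_burst_schedule(22, 0.050))    # -> (22, 0.05)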
7.5.2.  Live Streaming Media

   Model Based Metrics can be implemented as a side effect of serving
   any non-throughput maximizing traffic, such as streaming media, by
   applying some additional controls to the traffic.  The essential
   requirement is that the traffic be constrained such that even with
   arbitrary application pauses, bursts and data rate fluctuations,
   the traffic stays within the envelope determined by all of the
   individual tests described above, for a specific TDS.

   If the serving RTT is less than the target_RTT, this constraint is
   most easily implemented by clamping the transport window size to
   test_window=target_data_rate*serving_RTT/target_MTU.  This
   test_window size will limit both the serving data rate and the
   burst sizes to be no larger than permitted by the procedures in
   Section 7.1.2 and Section 7.4, assuming burst size derating equal
   to the serving_RTT divided by the target_RTT.

   Note that if the application tolerates fluctuations in its actual
   data rate (say by use of a playout buffer), it is important that
   the target_data_rate be above the actual average rate needed by the
   application, so it can recover after transient pauses caused by
   congestion or the application itself.  Since the serving RTT is
   smaller than the target_RTT, the worst case bursts that might be
   generated under these conditions are smaller than called for by
   Section 7.4.

8.  Examples

   In this section we present example TDSes for three performance
   specifications: 5 Mb/s over a 50 ms path, 1 Mb/s over a 100 ms
   path, and 100 Mb/s over a 200 ms path.

8.1.  Near serving HD streaming video

   Today the best quality HD video requires slightly less than 5 Mb/s
   [HDvideo].  Since it is desirable to serve such content locally, we
   assume that the content will be within 50 ms, which is enough to
   cover continental Europe or either US coast.

   5 Mb/s over a 50 ms path

          +----------------------+-------+---------+
          | End to End Parameter | Value | units   |
          +----------------------+-------+---------+
          | target_rate          | 5     | Mb/s    |
          | target_RTT           | 50    | ms      |
          | target_MTU           | 1500  | bytes   |
          | target_pipe_size     | 22    | packets |
          | target_run_length    | 1452  | packets |
          +----------------------+-------+---------+

                                  Table 1

   This example uses the most conservative TCP model and no derating.
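   The table entries follow from the model of Section 5.2.  The sketch
   below (Python; the round-up convention is an assumption, and header
   overheads are ignored) computes target_pipe_size and the reference
   target_run_length = 3*(target_pipe_size)^2.  It reproduces Table 2
   exactly; for Tables 1 and 3 the draft's rounding yields slightly
   larger pipe sizes (22 and 1741 packets), and the tabulated run
   lengths follow from those:

       import math

       def tds_model(target_rate_bps, target_rtt_s,
                     target_mtu_bytes=1500):
           pipe = math.ceil(target_rate_bps * target_rtt_s /
                            (target_mtu_bytes * 8))
           return pipe, 3 * pipe ** 2

       print(tds_model(1e6, 0.100))    # -> (9, 243): matches Table 2
       print(tds_model(5e6, 0.050))    # -> (21, 1323); Table 1 uses
                                       #    22 packets -> 1452
       print(tds_model(100e6, 0.200))  # -> (1667, ...); Table 3 uses
                                       #    1741 packets -> 9093243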
8.2.  Far serving SD streaming video

   Standard quality video typically fits in 1 Mb/s [SDvideo].  This
   can be reasonably delivered via longer paths with larger RTTs; we
   assume 100 ms.

   1 Mb/s over a 100 ms path

          +----------------------+-------+---------+
          | End to End Parameter | Value | units   |
          +----------------------+-------+---------+
          | target_rate          | 1     | Mb/s    |
          | target_RTT           | 100   | ms      |
          | target_MTU           | 1500  | bytes   |
          | target_pipe_size     | 9     | packets |
          | target_run_length    | 243   | packets |
          +----------------------+-------+---------+

                                  Table 2

   This example uses the most conservative TCP model and no derating.

8.3.  Bulk delivery of remote scientific data

   This example corresponds to 100 Mb/s of bulk scientific data over a
   moderately long RTT.  Note that the target_run_length is infeasible
   for most networks.

   100 Mb/s over a 200 ms path

          +----------------------+---------+---------+
          | End to End Parameter | Value   | units   |
          +----------------------+---------+---------+
          | target_rate          | 100     | Mb/s    |
          | target_RTT           | 200     | ms      |
          | target_MTU           | 1500    | bytes   |
          | target_pipe_size     | 1741    | packets |
          | target_run_length    | 9093243 | packets |
          +----------------------+---------+---------+

                                  Table 3

9.  Validation

   This document permits alternate models and parameter derating, as
   described in Section 5.2 and Section 5.3.  In exchange for this
   latitude in the modelling process, it requires the ability to
   demonstrate authentic applications and protocol implementations
   meeting the target end-to-end performance goals over infrastructure
   that infinitesimally passes the TDS.

   The validation process relies on constructing a test network such
   that all of the individual load tests pass only infinitesimally,
   and proving that an authentic application running over a real TCP
   implementation (or other protocol as appropriate) can be expected
   to meet the end-to-end target parameters on such a network.

   For example, using the HD streaming video TDS described in
   Section 8.1: the bottleneck data rate should be 5 Mb/s, the per
   packet random background loss probability should be 1/1453 (for a
   run length of 1452 packets), the bottleneck queue should be 22
   packets, and the front path should have just enough buffering to
   withstand 22 packet line rate bursts.  We want every one of the TDS
   tests to fail if we slightly increase the relevant test parameter,
   so for example sending a 23 packet slowstart burst should cause
   excess (possibly deterministic) packet drops at the dominant queue
   at the bottleneck.  On this infinitesimally passing network it
   should be possible for a real application using a stock TCP
   implementation in the vendor's default configuration to attain
   5 Mb/s over a 50 ms path.  @@@@ Need to better specify the
   workload: both short and long flows.

   The difficult part of this process is arranging for each subpath to
   infinitesimally pass the individual tests.
   We suggest two approaches: constraining resources in devices by
   configuring them not to use all available buffer space or data
   rate, and preloading subpaths with cross traffic.  Note that it is
   important that a single environment be constructed that
   infinitesimally passes all tests; otherwise there is a chance that
   TCP can exploit extra latitude in some parameters (such as data
   rate) to partially compensate for constraints in other parameters.

   If a TDS validated according to these procedures is used to inform
   public dialog, the validation experiment itself should also be
   public, with sufficient precision for the experiment to be
   replicated by other researchers.  All components should either be
   open source or be fully specified proprietary implementations that
   are available to the research community.  TODO: paper proving the
   validation process.

10.  Acknowledgements

   Ganga Maguluri suggested the statistical test for measuring loss
   probability in the target run length.  Meredith Whittaker improved
   the clarity of the communications.

11.  Informative References

   [RFC2330]  Paxson, V., Almes, G., Mahdavi, J., and M. Mathis,
              "Framework for IP Performance Metrics", RFC 2330,
              May 1998.

   [RFC4737]  Morton, A., Ciavattone, L., Ramachandran, G., Shalunov,
              S., and J. Perser, "Packet Reordering Metrics",
              RFC 4737, November 2006.

   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
              Control", RFC 5681, September 2009.

   [RFC5835]  Morton, A. and S. Van den Berghe, "Framework for Metric
              Composition", RFC 5835, April 2010.

   [RFC6049]  Morton, A. and E. Stephan, "Spatial Composition of
              Metrics", RFC 6049, January 2011.

   [I-D.morton-ippm-lmap-path]
              Bagnulo, M., Burbridge, T., Crawford, S., Eardley, P.,
              and A. Morton, "A Reference Path and Measurement Points
              for LMAP", draft-morton-ippm-lmap-path-00 (work in
              progress), January 2013.

   [MSMO97]   Mathis, M., Semke, J., Mahdavi, J., and T. Ott, "The
              Macroscopic Behavior of the TCP Congestion Avoidance
              Algorithm", Computer Communications Review, volume 27,
              number 3, July 1997.

   [WPING]    Mathis, M., "Windowed Ping: An IP Level Performance
              Diagnostic", INET 94, June 1994.

   [mpingSource]
              Fan, X., Mathis, M., and D. Hamon, "Git Repository for
              mping: An IP Level Performance Diagnostic", Sept 2013,
              <https://github.com/m-lab/mping>.

   [MBMSource]
              Hamon, D., "Git Repository for Model Based Metrics",
              Sept 2013, <https://github.com/m-lab/MBM>.

   [Pathdiag] Mathis, M., Heffner, J., O'Neil, P., and P. Siemsen,
              "Pathdiag: Automated TCP Diagnosis", Passive and Active
              Measurement, June 2008.

   [BScope]   Browserscope, "Browserscope Network tests", Sept 2012,
              <http://www.browserscope.org/?category=network>.

   [Rtool]    R Development Core Team, "R: A language and environment
              for statistical computing", R Foundation for Statistical
              Computing, Vienna, Austria, ISBN 3-900051-07-0, 2011,
              <http://www.R-project.org/>.

   [StatQC]   Montgomery, D., "Introduction to Statistical Quality
              Control - 2nd ed.", ISBN 0-471-51988-X, 1990.

   [CVST]     Krueger, T. and M.
              Braun, "R package: Fast Cross-Validation via Sequential
              Testing", version 0.1, November 2012.

   [LMCUBIC]  Ledesma Goyzueta, R. and Y. Chen, "A Deterministic Loss
              Model Based Analysis of CUBIC", IEEE International
              Conference on Computing, Networking and Communications
              (ICNC), E-ISBN: 978-1-4673-5286-4, January 2013.

Appendix A.  Model Derivations

   The reference target_run_length described in Section 5.2 is based
   on very conservative assumptions: that all window above
   target_pipe_size contributes to a standing queue that raises the
   RTT, and that classic Reno congestion control is in effect.  In
   this section we provide two alternative calculations using
   different assumptions.

   It may seem out of place to allow such latitude in a measurement
   standard, but the section provides offsetting requirements.  These
   models provide estimates that make the most sense if network
   performance is viewed logarithmically.
   In the operational internet, data rates span more than 8 orders of
   magnitude, RTT spans more than 3 orders of magnitude, and loss
   probability spans at least 8 orders of magnitude.  When viewed
   logarithmically (as in decibels), these correspond to 80 dB of
   dynamic range.  On an 80 dB scale, a 3 dB error is less than 4% of
   the scale, even though it might represent a factor of 2 in raw
   parameter.

   Although this document gives a lot of latitude for calculating
   target_run_length, people designing suites of tests need to
   consider the effect of their choices on the ongoing conversation
   and tussle about the relevance of "TCP friendliness" as an
   appropriate model for capacity allocation.  Choosing a
   target_run_length that is substantially smaller than the reference
   target_run_length specified in Section 5.2 is equivalent to saying
   that it is appropriate for the transport research community to
   abandon "TCP friendliness" as a fairness model and to develop more
   aggressive Internet transport protocols, and for applications to
   continue (or even increase) the number of connections that they
   open concurrently.

A.1.  Aggregate Reno

   In Section 5.2 it is assumed that the target rate is the same as
   the link rate, and that any excess window causes a standing queue
   at the bottleneck.  This might be representative of a non-shared
   access link.  An alternative situation would be a heavily
   aggregated subpath where individual flows do not significantly
   contribute to the queueing delay, and losses are determined by the
   average data rate, for example by the use of a virtual queue as in
   [AFD].  In such a scheme the RTT is constant and TCP's AIMD
   congestion control causes the data rate to fluctuate in a sawtooth.
   If the traffic is being controlled in a manner that is consistent
   with the metrics here, the goal would be to make the actual average
   rate equal to the target_data_rate.

   We can derive a model for Reno TCP and delayed ACK under the above
   set of assumptions: for some value of Wmin, the window will sweep
   from Wmin to 2*Wmin in 2*Wmin RTTs.  Between losses each sawtooth
   delivers (1/2)(Wmin+2*Wmin)(2*Wmin) packets in 2*Wmin round trip
   times.  However, unlike the queueing case where
   Wmin = target_pipe_size, we want the average of Wmin and 2*Wmin to
   be the target_pipe_size, so that the average rate is the target
   rate.  Thus we want Wmin = (2/3)*target_pipe_size.
   (@@@@ something is wrong above)

   Substituting these together we get:

      target_run_length = (8/3)*(target_pipe_size^2)

   Note that this is always 88% of the reference run length.
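   The following sketch (Python) evaluates the sawtooth arithmetic
   exactly as stated above.  It reproduces the discrepancy the @@@@
   note already flags, rather than resolving it: the literal
   substitution gives (4/3)*target_pipe_size^2, half of the (8/3)
   figure, while the 88% claim corresponds to (8/3) versus the
   reference factor of 3:

       def reno_sawtooth_run_length(target_pipe_size):
           """Packets delivered between losses, per the text:
           (1/2)(Wmin + 2*Wmin)(2*Wmin) with Wmin = (2/3)*pipe."""
           wmin = (2.0 / 3.0) * target_pipe_size
           return 0.5 * (wmin + 2 * wmin) * (2 * wmin)

       p = 22                                  # Section 8.1 pipe size
       print(reno_sawtooth_run_length(p))      # -> (4/3)*p^2 = 645.3
       print((8.0 / 3.0) * p * p)              # text's result: 1290.7
       print(3 * p * p)                        # reference: 1452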
A.2.  CUBIC

   CUBIC has three operating regions.  The model for the expected
   value of the window size derived in [LMCUBIC] assumes operation in
   the "concave" region only, which is a non-TCP friendly region for
   long-lived flows.  The authors make the following assumptions: the
   packet loss probability, p, is independent and periodic, losses
   occur one at a time, and they are true losses due to tail drop or
   corruption.  This definition of p aligns very well with our
   definition of target_run_length and the requirement for progressive
   loss (AQM).  Although the CUBIC window increase depends on
   continuous time, the authors transform the time to reach the
   maximum window size in terms of RTT and a parameter for the
   multiplicative rate decrease on observing loss, beta (whose default
   value is 0.2 in CUBIC).

   The expected value of the window size, E[W], is also dependent on
   C, a parameter of CUBIC that determines its window-growth
   aggressiveness (values from 0.01 to 4):

      E[W] = ( C*(RTT/p)^3 * ((4-beta)/beta) )^(1/4)

   and, further assuming Poisson arrivals, the mean throughput, x, is

      x = E[W]/RTT

   We note that under these conditions (deterministic single losses),
   the value of E[W] is always greater than 0.8 of the maximum window
   size ~= reference_run_length (as far as I can tell).

Appendix B.  Version Control

   Formatted: Mon Oct 21 15:42:35 PDT 2013

Authors' Addresses

   Matt Mathis
   Google, Inc
   1600 Amphitheater Parkway
   Mountain View, California  93117
   USA

   Email: mattmathis@google.com

   Al Morton
   AT&T Labs
   200 Laurel Avenue South
   Middletown, NJ  07748
   USA

   Phone: +1 732 420 1571
   Email: acmorton@att.com
   URI:   http://home.comcast.net/~acmacm/