--- 1/draft-ietf-opsawg-large-flow-load-balancing-10.txt 2014-04-22 12:14:36.094862705 -0700 +++ 2/draft-ietf-opsawg-large-flow-load-balancing-11.txt 2014-04-22 12:14:36.142863867 -0700 @@ -1,26 +1,26 @@ OPSAWG R. Krishnan Internet Draft Brocade Communications Intended status: Informational L. Yong -Expires: October 8, 2014 Huawei USA +Expires: October 22, 2014 Huawei USA A. Ghanwani Dell Ning So Tata Communications B. Khasnabish ZTE Corporation - April 8, 2014 + April 22, 2014 Mechanisms for Optimizing LAG/ECMP Component Link Utilization in Networks - draft-ietf-opsawg-large-flow-load-balancing-10.txt + draft-ietf-opsawg-large-flow-load-balancing-11.txt Status of this Memo This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79. This document may not be modified, and derivative works of it may not be created, except to publish it as an RFC and to translate it into languages other than English. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that @@ -31,21 +31,21 @@ and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html - This Internet-Draft will expire on October 8, 2014. + This Internet-Draft will expire on October 22, 2014. Copyright Notice Copyright (c) 2014 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents @@ -70,44 +70,44 @@ 1. Introduction...................................................3 1.1. Acronyms..................................................4 1.2. Terminology...............................................4 2. Flow Categorization............................................5 3. Hash-based Load Distribution in LAG/ECMP.......................5 4. Mechanisms for Optimizing LAG/ECMP Component Link Utilization..7 4.1. Differences in LAG vs ECMP................................8 4.2. Operational Overview......................................9 4.3. Large Flow Recognition...................................10 4.3.1. Flow Identification.................................10 - 4.3.2. Criteria for Recognizing a Large Flow...............10 + 4.3.2. Criteria and Techniques for Large Flow Recognition..11 4.3.3. Sampling Techniques.................................11 - 4.3.4. Automatic Hardware Recognition......................12 - 4.3.5. Use of More Than One Detection Method...............13 - 4.4. Load Rebalancing Options.................................13 + 4.3.4. Inline Data Path Measurement........................13 + 4.3.5. Use of More Than One Method for Large Flow Recognition13 + 4.4. Load Rebalancing Options.................................14 4.4.1. Alternative Placement of Large Flows................14 - 4.4.2. Redistributing Small Flows..........................14 + 4.4.2. Redistributing Small Flows..........................15 4.4.3. Component Link Protection Considerations............15 4.4.4. Load Rebalancing Algorithms.........................15 - 4.4.5. 
Load Rebalancing Example............................15 - 5. Information Model for Flow Rebalancing........................16 - 5.1. Configuration Parameters for Flow Rebalancing............16 - 5.2. System Configuration and Identification Parameters.......17 - 5.3. Information for Alternative Placement of Large Flows.....18 + 4.4.5. Load Rebalancing Example............................16 + 5. Information Model for Flow Rebalancing........................17 + 5.1. Configuration Parameters for Flow Rebalancing............17 + 5.2. System Configuration and Identification Parameters.......18 + 5.3. Information for Alternative Placement of Large Flows.....19 5.4. Information for Redistribution of Small Flows............19 - 5.5. Export of Flow Information...............................19 + 5.5. Export of Flow Information...............................20 5.6. Monitoring information...................................20 5.6.1. Interface (link) utilization........................20 - 5.6.2. Other monitoring information........................20 + 5.6.2. Other monitoring information........................21 6. Operational Considerations....................................21 6.1. Rebalancing Frequency....................................21 - 6.2. Handling Route Changes...................................21 - 7. IANA Considerations...........................................21 - 8. Security Considerations.......................................21 + 6.2. Handling Route Changes...................................22 + 7. IANA Considerations...........................................22 + 8. Security Considerations.......................................22 9. Contributing Authors..........................................22 10. Acknowledgements.............................................22 11. References...................................................22 11.1. Normative References....................................22 11.2. Informative References..................................22 1. Introduction Networks extensively use link aggregation groups (LAG) [802.1AX] and equal cost multi-paths (ECMP) [RFC 2991] as techniques for capacity @@ -431,30 +431,33 @@ for example: . Layer 2: source MAC address, destination MAC address, VLAN ID. . IP header: IP Protocol, IP source address, IP destination address, flow label (IPv6 only), TCP/UDP source port, TCP/UDP destination port. . MPLS Labels. - For tunneling protocols like GRE, VXLAN, NVGRE, STT, etc., flow + For tunneling protocols like Generic Routing Encapsulation (GRE) + [RFC 2784], Virtual eXtensible Local Area Network (VXLAN) [VXLAN], + Network Virtualization using Generic Routing Encapsulation (NVGRE) + [NVGRE], Stateless Transport Tunneling (STT) [STT], etc., flow identification is possible based on inner and/or outer headers. The above list is not exhaustive. The mechanisms described in this document are agnostic to the fields that are used for flow identification. This method of flow identification is consistent with that of IPFIX [RFC 7011]. -4.3.2. Criteria for Recognizing a Large Flow +4.3.2. Criteria and Techniques for Large Flow Recognition From a bandwidth and time duration perspective, in order to recognize large flows we define an observation interval and observe the bandwidth of the flow over that interval. A flow that exceeds a certain minimum bandwidth threshold over that observation interval would be considered a large flow. 
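As an illustration of this criterion, the short sketch below (in Python) classifies a flow as large when its observed rate over the observation interval meets or exceeds the minimum bandwidth threshold; the per-flow byte counts, flow keys, interval length, and threshold value are assumed example inputs rather than values defined by this document.

      # Sketch: classify flows as large based on the bandwidth observed
      # over an observation interval.  The flow keys, byte counts, and
      # threshold used here are illustrative assumptions only.
      def recognize_large_flows(flow_byte_counts, interval_sec, min_bw_bps):
          """flow_byte_counts maps a flow key (e.g., a 5-tuple) to the
          number of bytes observed for that flow during the interval."""
          large_flows = set()
          for flow_key, byte_count in flow_byte_counts.items():
              observed_bps = (byte_count * 8.0) / interval_sec
              if observed_bps >= min_bw_bps:
                  large_flows.add(flow_key)
          return large_flows

      # Example: 10-second observation interval, 100 Mb/s threshold.
      counts = {
          ("10.0.0.1", "10.0.0.2", 6, 12345, 80): 500_000_000,  # ~400 Mb/s
          ("10.0.0.3", "10.0.0.4", 17, 53, 53): 1_000_000,      # ~0.8 Mb/s
      }
      print(recognize_large_flows(counts, interval_sec=10, min_bw_bps=100e6))

In this example only the first flow would be reported as a large flow; the second remains a small flow for the purposes of load distribution.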
The two parameters -- the observation interval, and the minimum bandwidth threshold over that observation interval -- should be programmable to facilitate handling of different use cases and @@ -529,88 +532,95 @@ . Requires minimal router resources. Disadvantages: . In order to minimize the error inherent in sampling, there is a minimum delay in recognizing large flows, and in the time that it takes to react to this information. With sampling, the detection of large flows can be done on the order - of one second [DevoFlow]. + of one second [DevoFlow]. A discussion on determining the + appropriate sampling frequency is available in [SAMP-BASIC]. -4.3.4. Automatic Hardware Recognition +4.3.4. Inline Data Path Measurement - Implementations may perform automatic recognition of large flows in - hardware on a router. Since this is done in hardware, it is an inline - solution and would be expected to operate at line rate. + Implementations may perform recognition of large flows by performing + measurements on traffic in the data path of a router. Such an + approach would be expected to operate at the interface speed on every + interface, accounting for all packets processed by the data path of + the router. An example of such an approach is described in IPFIX + [RFC 5470]. - Using automatic hardware recognition of large flows, a faster + Using inline data path measurement, a faster and more accurate indication of large flows mapped to each of the component links in a - LAG/ECMP group is available (as compared to the sampling approach - described above). + LAG/ECMP group may be possible (as compared to the sampling-based + approach). - The advantages and disadvantages of automatic hardware recognition - are: + The advantages and disadvantages of inline data path measurement are: Advantages: - . Large flow detection is offloaded to hardware freeing up - software resources and possible dependence on an external - management station. - . As link speeds get higher, sampling rates are typically reduced to keep the number of samples manageable, which places a lower - bound on the detection time. With automatic hardware - recognition, large flows can be detected in shorter windows on - higher link speeds since every packet is accounted for in - hardware [NDTM]. + bound on the detection time. With inline data path measurement, + large flows can be recognized in shorter windows on higher link + speeds since every packet is accounted for [NDTM]. + + . Eliminates the potential dependence on an external management + station for large flow recognition. Disadvantages: - . Such techniques are not supported in many routers. + . It is more resource intensive in terms of the table sizes + required for monitoring all flows in order to perform the + measurement. As mentioned earlier, the observation interval for determining a large flow and the bandwidth threshold for classifying a flow as a large flow should be programmable parameters in a router. - The implementation of automatic hardware recognition of large flows - is vendor dependent and beyond the scope of this document. + The implementation details of inline data path measurement of large + flows are vendor dependent and beyond the scope of this document. -4.3.5. Use of More Than One Detection Method +4.3.5. Use of More Than One Method for Large Flow Recognition It is possible that a router may have line cards that support a - sampling technique while other line cards support automatic hardware - detection of large flows.
As long as there is a way for the router + sampling technique while other line cards support inline data path + measurement of large flows. As long as there is a way for the router to reliably determine the mapping of large flows to component links of a LAG/ECMP group, it is acceptable for the router to use more than one method for large flow recognition. + If both methods are supported, inline data path measurement may be + preferable because of its speed of detection [FLOW-ACC]. + 4.4. Load Rebalancing Options Below are suggested techniques for load rebalancing. Equipment vendors should implement all of these techniques and allow the operator to choose one or more techniques based on their applications. Note that regardless of the method used, perfect rebalancing of large flows may not be possible since flows arrive and depart at different times. Also, any flows that are moved from one component link to another may experience momentary packet reordering. 4.4.1. Alternative Placement of Large Flows Within a LAG/ECMP group, the member component links with the least average port utilization are identified. Some large flow(s) from the heavily loaded component links are then moved to those lightly-loaded - member component links using a PBR rule in the ingress processing - element(s) in the routers. + member component links using a policy-based routing (PBR) rule in the + ingress processing element(s) in the routers. With this approach, only certain large flows are subjected to momentary flow re-ordering. When a large flow is moved, this will increase the utilization of the link that it moved to, potentially creating imbalance in the utilization once again across the component links. Therefore, when moving large flows, care must be taken to account for the existing load, and what the future load will be after the large flow has been moved. Further, the appearance of new large flows may require a @@ -991,69 +1001,94 @@ [bin-pack] Coffman, Jr., E., M. Garey, and D. Johnson. Approximation Algorithms for Bin-Packing -- An Updated Survey. In Algorithm Design for Computer System Design, ed. by Ausiello, Lucertini, and Serafini. Springer-Verlag, 1984. [CAIDA] Caida Internet Traffic Analysis, http://www.caida.org/home. [DevoFlow] Mogul, J., et al., "DevoFlow: Cost-Effective Flow Management for High Performance Enterprise Networks," Proceedings of the ACM SIGCOMM, August 2011. + [FLOW-ACC] Zseby, T., et al., "Packet sampling for flow accounting: + challenges and limitations," Proceedings of the 9th International + Conference on Passive and Active Network Measurement, 2008. + [ID.ietf-rtgwg-cl-requirement] Villamizar, C. et al., "Requirements for MPLS over a Composite Link," September 2013. [ITCOM] Jo, J., et al., "Internet traffic load balancing using dynamic hashing with flow volume," SPIE ITCOM, 2002. [NDTM] Estan, C. and G. Varghese, "New directions in traffic measurement and accounting," Proceedings of ACM SIGCOMM, August 2002. + [NVGRE] Sridharan, M. et al., "NVGRE: Network Virtualization using + Generic Routing Encapsulation," draft-sridharan-virtualization-nvgre-04, February 2014. + + [RFC 2784] Farinacci, D. et al., "Generic Routing Encapsulation + (GRE)," March 2000. + [RFC 2991] Thaler, D. and C. Hopps, "Multipath Issues in Unicast and Multicast," November 2000. [RFC 6790] Kompella, K. et al., "The Use of Entropy Labels in MPLS Forwarding," November 2012. [RFC 1213] McCloghrie, K., "Management Information Base for Network Management of TCP/IP-based internets: MIB-II," March 1991.
[RFC 2992] Hopps, C., "Analysis of an Equal-Cost Multi-Path Algorithm," November 2000. [RFC 3273] Waldbusser, S., "Remote Network Monitoring Management Information Base for High Capacity Networks," July 2002. [RFC 3954] Claise, B., "Cisco Systems NetFlow Services Export Version 9," October 2004. - [RFC 5475] Zseby T., et al., "Sampling and Filtering Techniques for + [RFC 5470] Sadasivan, G. et al., "Architecture for IP Flow Information + Export," March 2009. + + [RFC 5475] Zseby, T. et al., "Sampling and Filtering Techniques for IP Packet Selection," March 2009. + [RFC 5681] Allman, M. et al., "TCP Congestion Control," September + 2009. [RFC 7011] Claise, B. et al., "Specification of the IP Flow Information Export (IPFIX) Protocol for the Exchange of IP Traffic Flow Information," September 2013. [RFC 7012] Claise, B. and B. Trammell, "Information Model for IP Flow Information Export (IPFIX)," September 2013. + [SAMP-BASIC] Phaal, P. and S. Panchen, "Packet Sampling Basics," + http://www.sflow.org/packetSamplingBasics/. + [sFlow-LAG] Phaal, P. and A. Ghanwani, "sFlow LAG counters structure," http://www.sflow.org/sflow_lag.txt, September 2012. [sFlow-v5] Phaal, P. and M. Lavine, "sFlow version 5," http://www.sflow.org/sflow_version_5.txt, July 2004. + [STT] Davie, B. (ed.) and J. Gross, "A Stateless Transport Tunneling + Protocol for Network Virtualization (STT)," draft-davie-stt-06, March + 2014. + + [VXLAN] Mahalingam, M. et al., "VXLAN: A Framework for Overlaying + Virtualized Layer 2 Networks over Layer 3 Networks," draft-mahalingam-dutt-dcops-vxlan-09, April 2014. + [YONG] Yong, L., "Enhanced ECMP and Large Flow Aware Transport," draft-yong-pwe3-enhance-ecmp-lfat-01, September 2010. - [RFC 5681] Allman, M. et al., "TCP Congestion Control," September - 2009 - Appendix A. Internet Traffic Analysis and Load Balancing Simulation Internet traffic [CAIDA] has been analyzed to obtain flow statistics such as the number of packets in a flow and the flow duration. The five-tuple in the packet header (IP addresses, TCP/UDP ports, and IP protocol) is used for flow identification. The analysis indicates that < ~2% of the flows take ~30% of the total traffic volume while the rest of the flows (> ~98%) contribute ~70% [YONG]. The simulation has shown that, given the Internet traffic pattern, the hash-based technique does not evenly distribute the flows over ECMP