--- 1/draft-ietf-opsawg-large-flow-load-balancing-11.txt 2014-06-13 18:14:23.781445369 -0700 +++ 2/draft-ietf-opsawg-large-flow-load-balancing-12.txt 2014-06-13 18:14:23.833446633 -0700 @@ -1,26 +1,26 @@ OPSAWG R. Krishnan Internet Draft Brocade Communications Intended status: Informational L. Yong -Expires: October 22, 2014 Huawei USA +Expires: December 13, 2014 Huawei USA A. Ghanwani Dell Ning So Tata Communications B. Khasnabish ZTE Corporation - April 22, 2014 + June 13, 2014 Mechanisms for Optimizing LAG/ECMP Component Link Utilization in Networks - draft-ietf-opsawg-large-flow-load-balancing-11.txt + draft-ietf-opsawg-large-flow-load-balancing-12.txt Status of this Memo This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79. This document may not be modified, and derivative works of it may not be created, except to publish it as an RFC and to translate it into languages other than English. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that @@ -31,21 +31,21 @@ and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html - This Internet-Draft will expire on October 22, 2014. + This Internet-Draft will expire on December 13, 2014. Copyright Notice Copyright (c) 2014 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents @@ -64,55 +64,57 @@ link aggregation groups and equal cost multi-paths as techniques for bandwidth scaling. This draft explores some of the mechanisms useful for achieving this. Table of Contents 1. Introduction...................................................3 1.1. Acronyms..................................................4 1.2. Terminology...............................................4 2. Flow Categorization............................................5 - 3. Hash-based Load Distribution in LAG/ECMP.......................5 + 3. Hash-based Load Distribution in LAG/ECMP.......................6 4. Mechanisms for Optimizing LAG/ECMP Component Link Utilization..7 4.1. Differences in LAG vs ECMP................................8 4.2. Operational Overview......................................9 4.3. Large Flow Recognition...................................10 4.3.1. Flow Identification.................................10 4.3.2. Criteria and Techniques for Large Flow Recognition..11 4.3.3. Sampling Techniques.................................11 4.3.4. Inline Data Path Measurement........................13 - 4.3.5. Use of More Than One Method for Large Flow Recognition13 + 4.3.5. Use of More Than One Method for Large Flow + Recognition.........................................13 4.4. Load Rebalancing Options.................................14 4.4.1. Alternative Placement of Large Flows................14 4.4.2. Redistributing Small Flows..........................15 4.4.3. Component Link Protection Considerations............15 4.4.4. Load Rebalancing Algorithms.........................15 4.4.5. 
Load Rebalancing Example............................16 5. Information Model for Flow Rebalancing........................17 5.1. Configuration Parameters for Flow Rebalancing............17 5.2. System Configuration and Identification Parameters.......18 5.3. Information for Alternative Placement of Large Flows.....19 5.4. Information for Redistribution of Small Flows............19 5.5. Export of Flow Information...............................20 5.6. Monitoring information...................................20 5.6.1. Interface (link) utilization........................20 - 5.6.2. Other monitoring information........................21 + 5.6.2. Other monitoring information........................20 6. Operational Considerations....................................21 6.1. Rebalancing Frequency....................................21 - 6.2. Handling Route Changes...................................22 + 6.2. Handling Route Changes...................................21 + 6.3. Forwarding Resources.....................................21 7. IANA Considerations...........................................22 8. Security Considerations.......................................22 9. Contributing Authors..........................................22 10. Acknowledgements.............................................22 - 11. References...................................................22 - 11.1. Normative References....................................22 - 11.2. Informative References..................................22 + 11. References...................................................23 + 11.1. Normative References....................................23 + 11.2. Informative References..................................23 1. Introduction Networks extensively use link aggregation groups (LAG) [802.1AX] and equal cost multi-paths (ECMP) [RFC 2991] as techniques for capacity scaling. For the problems addressed by this document, network traffic can be predominantly categorized into two traffic types: long-lived large flows and other flows. These other flows, which include long- lived small flows, short-lived small flows, and short-lived large flows, are referred to as "small flows" in this document. Long-lived @@ -136,22 +138,20 @@ of bandwidth on a link, e.g. greater than 5% of link bandwidth. The number of such flows would necessarily be fairly small, e.g. on the order of 10's or 100's per LAG/ECMP. In other words, the number of large flows is NOT expected to be on the order of millions of flows. Examples of such large flows would be IPsec tunnels in service provider backbone networks or storage backup traffic in data center networks. 1.1. Acronyms - COTS: Commercial Off-the-shelf - DOS: Denial of Service ECMP: Equal Cost Multi-path GRE: Generic Routing Encapsulation LAG: Link Aggregation Group MPLS: Multiprotocol Label Switching @@ -162,20 +162,26 @@ QoS: Quality of Service STT: Stateless Transport Tunneling TCAM: Ternary Content Addressable Memory VXLAN: Virtual Extensible LAN 1.2. Terminology + Central management entity: Refers to an entity that is capable of + monitoring information about link utilization and flows in routers + across the network and may be capable of making traffic engineering + decisions for placement of large flows. It may include the functions + of a collector if the routers employ a sampling technique [RFC 7011]. + ECMP component link: An individual nexthop within an ECMP group. An ECMP component link may itself comprise a LAG. 
ECMP table: A table that is used as the nexthop of an ECMP route that comprises the set of component links and the weights associated with each of those component links. The weights are used to determine which values of the hash function map to a given component link. LAG component link: An individual link within a LAG. A LAG component link is typically a physical link. @@ -350,25 +357,26 @@ [figure body: two spine nodes at the top, each connected by crossing links to the three leaf nodes L1, L2, and L3 below] - Figure 3: Two-level Fat Tree + Figure 3: Two-level Clos Network To demonstrate the limitations of local optimization, consider a two- - level fat-tree topology with three leaf nodes (L1, L2, L3) and two - spine nodes (S1, S2) and assume all of the links are 10 Gbps. + level Clos network topology as shown in Figure 3 with three leaf + nodes (L1, L2, L3) and two spine nodes (S1, S2). Assume all of the + links are 10 Gbps. Let L1 have two flows of 4 Gbps each towards L3, and let L2 have one flow of 7 Gbps also towards L3. If L1 balances the load optimally between S1 and S2, and L2 sends the flow via S1, then the downlink from S1 to L3 would get congested resulting in packet discards. On the other hand, if L1 had sent both its flows towards S1 and L2 had sent its flow towards S2, there would have been no congestion at either S1 or S2. The other issue with applying this scheme to ECMP groups is that it @@ -434,25 +442,27 @@ . IP header: IP Protocol, IP source address, IP destination address, flow label (IPv6 only), TCP/UDP source port, TCP/UDP destination port. . MPLS Labels. For tunneling protocols like Generic Routing Encapsulation (GRE) [RFC 2784], Virtual eXtensible Local Area Network (VXLAN) [VXLAN], Network Virtualization using Generic Routing Encapsulation (NVGRE) - [NVGRE], Stateless Transport Tunneling (STT) [STT], etc., flow - identification is possible based on inner and/or outer headers. The - above list is not exhaustive. The mechanisms described in this - document are agnostic to the fields that are used for flow - identification. + [NVGRE], Stateless Transport Tunneling (STT) [STT], Layer 2 Tunneling + Protocol (L2TP) [RFC 3931], etc., flow identification is possible + based on inner and/or outer headers as well as fields introduced by + the tunnel header, as any or all such fields may be used for load + balancing decisions [RFC 5640]. The above list is not exhaustive. + The mechanisms described in this document are agnostic to the fields + that are used for flow identification. This method of flow identification is consistent with that of IPFIX [RFC 7011]. 4.3.2. Criteria and Techniques for Large Flow Recognition From a bandwidth and time duration perspective, in order to recognize large flows we define an observation interval and observe the bandwidth of the flow over that interval. A flow that exceeds a certain minimum bandwidth threshold over that observation interval @@ -591,23 +601,23 @@ to reliably determine the mapping of large flows to component links of a LAG/ECMP group, it is acceptable for the router to use more than one method for large flow recognition. If both methods are supported, inline data path measurement may be preferable because of its speed of detection [FLOW-ACC].
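As a concrete illustration of the recognition criteria in Section 4.3.2, the following minimal Python sketch classifies a flow as large when its measured bandwidth over an observation interval exceeds a minimum threshold expressed as a fraction of link speed. The flow key, threshold value, and interval length are assumptions for illustration, not values taken from this document.

    # Sketch of large flow recognition per Section 4.3.2: accumulate
    # bytes per flow over an observation interval and flag flows whose
    # average bandwidth exceeds a minimum threshold.
    from collections import defaultdict

    LINK_SPEED_BPS = 10e9     # assumed 10 Gbps component link
    MIN_BW_FRACTION = 0.05    # e.g., 5% of link bandwidth (Section 1)
    OBS_INTERVAL_S = 10.0     # assumed observation interval

    byte_count = defaultdict(int)  # flow key -> bytes seen in interval

    def account(flow_key, packet_len_bytes):
        # flow_key could be the IP 5-tuple of Section 4.3.1, an MPLS
        # label stack, or inner/outer tunnel header fields.
        byte_count[flow_key] += packet_len_bytes

    def large_flows():
        # Flows whose bits sent in the interval imply an average rate
        # of at least MIN_BW_FRACTION of the link speed.
        min_bits = MIN_BW_FRACTION * LINK_SPEED_BPS * OBS_INTERVAL_S
        return [k for k, b in byte_count.items() if b * 8 >= min_bits]

The same structure applies whether the per-flow byte counts come from sampling (Section 4.3.3) or from inline data path measurement (Section 4.3.4); only the source of the counters differs.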
4.4. Load Rebalancing Options Below are suggested techniques for load rebalancing. Equipment - vendors should implement all of these techniques and allow the - operator to choose one or more techniques based on their - applications. + vendors may implement more than one technique, including those not + described in this document, allowing the operator to choose between + them. Note that regardless of the method used, perfect rebalancing of large flows may not be possible since flows arrive and depart at different times. Also, any flows that are moved from one component link to another may experience momentary packet reordering. 4.4.1. Alternative Placement of Large Flows Within a LAG/ECMP group, the member component links with least average port utilization are identified. Some large flow(s) from the @@ -755,24 +765,24 @@ be recognized as a large flow until it falls below this threshold. This is also configured as a percentage of link speed and is typically lower than the minimum bandwidth threshold defined above. . Imbalance threshold: A measure of the deviation of the component link utilizations from the utilization of the overall LAG/ECMP group. Since component links can be of a different speed, the imbalance can be computed as follows. Let the utilization of each component link in a LAG/ECMP group with n - links of speed b_1, b_2, .., b_n, be u_1, u_2, .., u_n. The mean + links of speed b_1, b_2 ... b_n, be u_1, u_2 ... u_n. The mean utilization is computed as u_ave = [ (u_1 x b_1) + (u_2 x b_2) + - .. + (u_n x b_n) ] / [b_1 + b_2 + b_n]. The imbalance is then - computed as max_{i=1..n} | u_i - u_ave | / u_ave. + ... + (u_n x b_n) ] / [b_1 + b_2 + ... + b_n]. The imbalance is + then computed as max_{i=1 ... n} | u_i - u_ave |. . Rebalancing interval: The minimum amount of time between rebalancing events. This parameter ensures that rebalancing is not invoked too frequently as it impacts packet ordering. These parameters may be configured on a system-wide basis or may apply to an individual LAG. They may be applied to an ECMP group provided the component links are not shared with any other ECMP group.
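To make the imbalance computation of Section 5.1 above concrete, here is a minimal sketch; the function name and the example values are illustrative only, not taken from this document.

    # Imbalance per Section 5.1: the mean utilization is weighted by
    # component link speed, and the imbalance is the maximum absolute
    # deviation of any component link from that mean.
    def imbalance(speeds_bps, utilizations):
        total = sum(speeds_bps)
        u_ave = sum(u * b for u, b in zip(utilizations, speeds_bps)) / total
        return max(abs(u - u_ave) for u in utilizations)

    # Example: three 10 Gbps links at 90%, 40%, and 20% utilization.
    # u_ave = 0.5, so the imbalance is |0.9 - 0.5| = 0.4; rebalancing
    # would be triggered if this exceeds the imbalance threshold.
    print(imbalance([10e9, 10e9, 10e9], [0.9, 0.4, 0.2]))  # -> 0.4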
@@ -812,39 +822,23 @@ ECMP groups, or it may be configured specifically for a given LAG or ECMP group. 5.3. Information for Alternative Placement of Large Flows In cases where large flow recognition is handled by an external management station (see Section 4.3.3), an information model for flows is required to allow the import of large flow information to the router. - The following are some of the elements of information model for - importing of flows: - - . Layer 2: source MAC address, destination MAC address, VLAN ID. - - . Layer 3 IP: IP Protocol, IP source address, IP destination - address, flow label (IPv6 only), TCP/UDP source port, TCP/UDP - destination port. - - . MPLS Labels. - - This list is not exhaustive. For example, with overlay protocols - such as VXLAN and NVGRE, fields from the outer and/or inner headers - may be specified. In general, all fields in the packet that can be - used by forwarding decisions should be available for use when - importing flow information from an external management station. - - The IPFIX information model [RFC 7012] can be leveraged for large - flow identification. + Typical fields used for identifying large flows were discussed in + Section 4.3.1. The IPFIX information model [RFC 7012] can be + leveraged for large flow identification. Large Flow placement is achieved by specifying the relevant flow information along with the following: . For LAG: Router's IP address, LAG ID, LAG component link ID. . For ECMP: Router's IP address, ECMP group, ECMP component link ID. In the case where the ECMP component link itself comprises a LAG, we @@ -887,23 +881,23 @@ 5.6. Monitoring information 5.6.1. Interface (link) utilization The incoming bytes (ifInOctets), outgoing bytes (ifOutOctets) and interface speed (ifSpeed) can be measured from the Interface table (ifTable) MIB [RFC 1213]. The link utilization can then be computed as follows: - Incoming link utilization = (ifInOctets * 8 / ifSpeed) + Incoming link utilization = (ifInOctets * 8) / ifSpeed - Outgoing link utilization = (ifOutOctets * 8 / ifSpeed) + Outgoing link utilization = (ifOutOctets * 8) / ifSpeed For high speed Ethernet links, the etherStatsHighCapacityTable MIB [RFC 3273] can be used. For scalability, it is recommended to use the counter push mechanism in [sFlow-v5] for the interface counters. Doing so would help avoid counter polling through the MIB interface. The outgoing link utilization of the component links within a LAG/ECMP group can be used to compute the imbalance (See Section 5.1)
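Since the MIB counters above are cumulative, a practical computation takes the difference between two counter reads over a polling interval. A minimal sketch with illustrative names; the explicit measurement interval is an assumption made here to turn the octet counts into a rate:

    # Link utilization per Section 5.6.1: convert the octet delta over
    # the polling interval to bits, then divide by the capacity
    # available over that same interval.
    def link_utilization(octets_start, octets_end, interval_s, if_speed_bps):
        bits = (octets_end - octets_start) * 8  # octets -> bits
        return bits / (interval_s * if_speed_bps)

    # Example: 7.5e9 octets in 10 s on a 10 Gbps link -> 0.6 (60%).
    print(link_utilization(0, 7.5e9, 10.0, 10e9))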
@@ -953,64 +947,99 @@ links to tune the solution for their environment. 6.2. Handling Route Changes Large flow rebalancing must be aware of any changes to the FIB. In cases where the nexthop of a route no longer points to the LAG, or to an ECMP group, any PBR entries added as described in Sections 4.4.1 and 4.4.2 must be withdrawn in order to avoid the creation of forwarding loops. +6.3. Forwarding Resources + + Hash-based techniques used for load balancing with LAG/ECMP are + usually stateless. The mechanisms described in this document require + additional resources in the forwarding plane of routers for creating + PBR rules that are capable of overriding the forwarding decision from + the hash-based approach. These resources may limit the number of + flows that can be rebalanced and may also impact the latency + experienced by packets due to the additional lookups that are + required. 7. IANA Considerations This memo includes no request to IANA. 8. Security Considerations This document does not directly impact the security of the Internet infrastructure or its applications. In fact, it could help if there is a DOS attack pattern which causes a hash imbalance resulting in certain LAG/ECMP component links being heavily overloaded by large flows. + An attacker with knowledge of the large flow recognition algorithm + and any stateless distribution method can generate flows that are + distributed in a way that overloads a specific path. This could be + used to cause the creation of PBR rules that exhaust the available + rule capacity on nodes. If PBR rules are consequently discarded, + this could result in congestion on the attacker-selected path. + Alternatively, tracking large numbers of PBR rules could result in + performance degradation. 9. Contributing Authors Sanjay Khanna Cisco Systems Email: sanjakha@gmail.com 10. Acknowledgements The authors would like to thank the following individuals for their review and valuable feedback on earlier versions of this document: Shane Amante, Fred Baker, Michael Bugenhagen, Zhen Cao, Brian Carpenter, Benoit Claise, Michael Fargano, Wes George, Sriganesh Kini, Roman Krzanowski, Andrew Malis, Dave McDysan, Pete Moyer, Peter Phaal, Dan Romascanu, Curtis Villamizar, Jianrong Wong, George - Yum, and Weifeng Zhang. + Yum, and Weifeng Zhang. As a part of the IETF Last Call process, + valuable comments were received from Martin Thomson. 11. References 11.1. Normative References -11.2. Informative References - [802.1AX] IEEE Standards Association, "IEEE Std 802.1AX-2008 IEEE Standard for Local and Metropolitan Area Networks - Link Aggregation", 2008. + [RFC 2991] Thaler, D. and C. Hopps, "Multipath Issues in Unicast and + Multicast," November 2000. + + [RFC 7011] Claise, B. et al., "Specification of the IP Flow + Information Export (IPFIX) Protocol for the Exchange of IP Traffic + Flow Information," September 2013. + + [RFC 7012] Claise, B. and B. Trammell, "Information Model for IP Flow + Information Export (IPFIX)," September 2013. + + [sFlow-v5] Phaal, P. and M. Lavine, "sFlow version 5," + http://www.sflow.org/sflow_version_5.txt, July 2004. + +11.2. Informative References + [bin-pack] Coffman, Jr., E., M. Garey, and D. Johnson. Approximation Algorithms for Bin-Packing -- An Updated Survey. In Algorithm Design for Computer System Design, ed. by Ausiello, Lucertini, and Serafini. Springer-Verlag, 1984. - [CAIDA] Caida Internet Traffic Analysis, http://www.caida.org/home. + [CAIDA] "Caida Internet Traffic Analysis," http://www.caida.org/home. + [DevoFlow] Mogul, J., et al., "DevoFlow: Cost-Effective Flow Management for High Performance Enterprise Networks," Proceedings of the ACM SIGCOMM, August 2011. [FLOW-ACC] Zseby, T., et al., "Packet sampling for flow accounting: challenges and limitations," Proceedings of the 9th international conference on Passive and active network measurement, 2008. [ID.ietf-rtgwg-cl-requirement] Villamizar, C. et al., "Requirements for MPLS over a Composite Link," September 2013. @@ -1021,64 +1050,57 @@ [NDTM] Estan, C. and G. Varghese, "New directions in traffic measurement and accounting," Proceedings of ACM SIGCOMM, August 2002. [NVGRE] Sridharan, M. et al., "NVGRE: Network Virtualization using Generic Routing Encapsulation," draft-sridharan-virtualization-nvgre-04, February 2014. [RFC 2784] Farinacci, D. et al., "Generic Routing Encapsulation (GRE)," March 2000. - [RFC 2991] Thaler, D. and C. Hopps, "Multipath Issues in Unicast and - Multicast," November 2000. [RFC 6790] Kompella, K. et al., "The Use of Entropy Labels in MPLS Forwarding," November 2012. [RFC 1213] McCloghrie, K., "Management Information Base for Network Management of TCP/IP-based internets: MIB-II," March 1991. [RFC 2992] Hopps, C., "Analysis of an Equal-Cost Multi-Path Algorithm," November 2000. [RFC 3273] Waldbusser, S., "Remote Network Monitoring Management Information Base for High Capacity Networks," July 2002. + [RFC 3931] Lau, J. (Ed.), M. Townsley (Ed.), and I. Goyret (Ed.), + "Layer 2 Tunneling Protocol - Version 3," March 2005. + [RFC 3954] Claise, B., "Cisco Systems NetFlow Services Export Version 9," October 2004. [RFC 5470] Sadasivan, G. et al., "Architecture for IP Flow Information Export," March 2009. [RFC 5475] Zseby, T. et al., "Sampling and Filtering Techniques for IP Packet Selection," March 2009. + [RFC 5640] Filsfils, C., P. Mohapatra, and C. Pignataro, "Load + Balancing for Mesh Softwires," August 2009. + [RFC 5681] Allman, M. et al., "TCP Congestion Control," September 2009. - [RFC 7011] Claise, B. et al., "Specification of the IP Flow - Information Export (IPFIX) Protocol for the Exchange of IP Traffic - Flow Information," September 2013. - - [RFC 7012] Claise, B. and B. Trammell, "Information Model for IP Flow - Information Export (IPFIX)," September 2013. - [SAMP-BASIC] Phaal, P. and S. Panchen, "Packet Sampling Basics," http://www.sflow.org/packetSamplingBasics/. [sFlow-LAG] Phaal, P. and A. Ghanwani, "sFlow LAG counters structure," http://www.sflow.org/sflow_lag.txt, September 2012. - [sFlow-v5] Phaal, P. and M. Lavine, "sFlow version 5," - http://www.sflow.org/sflow_version_5.txt, July 2004. - [STT] Davie, B. (ed) and J.
Gross, "A Stateless Transport Tunneling + [STT] Davie, B. (Ed.) and J. Gross, "A Stateless Transport Tunneling Protocol for Network Virtualization (STT)," draft-davie-stt-06, March 2014. [VXLAN] Mahalingam, M. et al., "VXLAN: A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks," draft- mahalingam-dutt-dcops-vxlan-09, April 2014. [YONG] Yong, L., "Enhanced ECMP and Large Flow Aware Transport," draft-yong-pwe3-enhance-ecmp-lfat-01, September 2010.