draft-ietf-opsawg-large-flow-load-balancing-10.txt   draft-ietf-opsawg-large-flow-load-balancing-11.txt 
OPSAWG R. Krishnan OPSAWG R. Krishnan
Internet Draft Brocade Communications Internet Draft Brocade Communications
Intended status: Informational L. Yong Intended status: Informational L. Yong
Expires: October 8, 2014 Huawei USA Expires: October 22, 2014 Huawei USA
A. Ghanwani A. Ghanwani
Dell Dell
Ning So Ning So
Tata Communications Tata Communications
B. Khasnabish B. Khasnabish
ZTE Corporation ZTE Corporation
April 8, 2014 April 22, 2014
Mechanisms for Optimizing LAG/ECMP Component Link Utilization in Mechanisms for Optimizing LAG/ECMP Component Link Utilization in
Networks Networks
draft-ietf-opsawg-large-flow-load-balancing-10.txt draft-ietf-opsawg-large-flow-load-balancing-11.txt
Status of this Memo Status of this Memo
This Internet-Draft is submitted in full conformance with the This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79. This document may not be modified, provisions of BCP 78 and BCP 79. This document may not be modified,
and derivative works of it may not be created, except to publish it and derivative works of it may not be created, except to publish it
as an RFC and to translate it into languages other than English. as an RFC and to translate it into languages other than English.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that Task Force (IETF), its areas, and its working groups. Note that
skipping to change at page 1, line 42 skipping to change at page 1, line 42
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html http://www.ietf.org/shadow.html
This Internet-Draft will expire on October 8, 2014. This Internet-Draft will expire on October 22, 2014.
Copyright Notice Copyright Notice
Copyright (c) 2014 IETF Trust and the persons identified as the Copyright (c) 2014 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of (http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents publication of this document. Please review these documents
skipping to change at page 2, line 37 skipping to change at page 2, line 37
1. Introduction...................................................3 1. Introduction...................................................3
1.1. Acronyms..................................................4 1.1. Acronyms..................................................4
1.2. Terminology...............................................4 1.2. Terminology...............................................4
2. Flow Categorization............................................5 2. Flow Categorization............................................5
3. Hash-based Load Distribution in LAG/ECMP.......................5 3. Hash-based Load Distribution in LAG/ECMP.......................5
4. Mechanisms for Optimizing LAG/ECMP Component Link Utilization..7 4. Mechanisms for Optimizing LAG/ECMP Component Link Utilization..7
4.1. Differences in LAG vs ECMP................................8 4.1. Differences in LAG vs ECMP................................8
4.2. Operational Overview......................................9 4.2. Operational Overview......................................9
4.3. Large Flow Recognition...................................10 4.3. Large Flow Recognition...................................10
4.3.1. Flow Identification.................................10 4.3.1. Flow Identification.................................10
4.3.2. Criteria for Recognizing a Large Flow...............10 4.3.2. Criteria and Techniques for Large Flow Recognition..11
4.3.3. Sampling Techniques.................................11 4.3.3. Sampling Techniques.................................11
4.3.4. Automatic Hardware Recognition......................12 4.3.4. Inline Data Path Measurement........................13
4.3.5. Use of More Than One Detection Method...............13 4.3.5. Use of More Than One Method for Large Flow Recognition13
4.4. Load Rebalancing Options.................................13 4.4. Load Rebalancing Options.................................14
4.4.1. Alternative Placement of Large Flows................14 4.4.1. Alternative Placement of Large Flows................14
4.4.2. Redistributing Small Flows..........................14 4.4.2. Redistributing Small Flows..........................15
4.4.3. Component Link Protection Considerations............15 4.4.3. Component Link Protection Considerations............15
4.4.4. Load Rebalancing Algorithms.........................15 4.4.4. Load Rebalancing Algorithms.........................15
4.4.5. Load Rebalancing Example............................15 4.4.5. Load Rebalancing Example............................16
5. Information Model for Flow Rebalancing........................16 5. Information Model for Flow Rebalancing........................17
5.1. Configuration Parameters for Flow Rebalancing............16 5.1. Configuration Parameters for Flow Rebalancing............17
5.2. System Configuration and Identification Parameters.......17 5.2. System Configuration and Identification Parameters.......18
5.3. Information for Alternative Placement of Large Flows.....18 5.3. Information for Alternative Placement of Large Flows.....19
5.4. Information for Redistribution of Small Flows............19 5.4. Information for Redistribution of Small Flows............19
5.5. Export of Flow Information...............................19 5.5. Export of Flow Information...............................20
5.6. Monitoring information...................................20 5.6. Monitoring information...................................20
5.6.1. Interface (link) utilization........................20 5.6.1. Interface (link) utilization........................20
5.6.2. Other monitoring information........................20 5.6.2. Other monitoring information........................21
6. Operational Considerations....................................21 6. Operational Considerations....................................21
6.1. Rebalancing Frequency....................................21 6.1. Rebalancing Frequency....................................21
6.2. Handling Route Changes...................................21 6.2. Handling Route Changes...................................22
7. IANA Considerations...........................................21 7. IANA Considerations...........................................22
8. Security Considerations.......................................21 8. Security Considerations.......................................22
9. Contributing Authors..........................................22 9. Contributing Authors..........................................22
10. Acknowledgements.............................................22 10. Acknowledgements.............................................22
11. References...................................................22 11. References...................................................22
11.1. Normative References....................................22 11.1. Normative References....................................22
11.2. Informative References..................................22 11.2. Informative References..................................22
1. Introduction 1. Introduction
Networks extensively use link aggregation groups (LAG) [802.1AX] and Networks extensively use link aggregation groups (LAG) [802.1AX] and
equal cost multi-paths (ECMP) [RFC 2991] as techniques for capacity equal cost multi-paths (ECMP) [RFC 2991] as techniques for capacity
skipping to change at page 7, line 11 skipping to change at page 7, line 11
o The presence of 2 large flows causes congestion on this o The presence of 2 large flows causes congestion on this
component link. component link.
+-----------+ -> +-----------+ +-----------+ -> +-----------+
| | -> | | | | -> | |
| | ===> | | | | ===> | |
| (1)|--------|(1) | | (1)|--------|(1) |
| | -> | | | | -> | |
| | -> | | | | -> | |
| (R1) | -> | (R2) | | (R1) | -> | (R2) |
| (2)|--------|(2) | | (2)|--------|(2) |
| | -> | | | | -> | |
| | -> | | | | -> | |
| | ===> | | | | ===> | |
| | ===> | | | | ===> | |
| (3)|--------|(3) | | (3)|--------|(3) |
| | | | | | | |
+-----------+ +-----------+ +-----------+ +-----------+
Where: -> small flow Where: -> small flow
skipping to change at page 10, line 37 skipping to change at page 10, line 37
for example: for example:
. Layer 2: source MAC address, destination MAC address, VLAN ID. . Layer 2: source MAC address, destination MAC address, VLAN ID.
. IP header: IP Protocol, IP source address, IP destination . IP header: IP Protocol, IP source address, IP destination
address, flow label (IPv6 only), TCP/UDP source port, TCP/UDP address, flow label (IPv6 only), TCP/UDP source port, TCP/UDP
destination port. destination port.
. MPLS Labels. . MPLS Labels.
For tunneling protocols like GRE, VXLAN, NVGRE, STT, etc., flow For tunneling protocols like Generic Routing Encapsulation (GRE)
[RFC 2784], Virtual eXtensible Local Area Network (VXLAN) [VXLAN],
Network Virtualization using Generic Routing Encapsulation (NVGRE)
[NVGRE], Stateless Transport Tunneling (STT) [STT], etc., flow
identification is possible based on inner and/or outer headers. The identification is possible based on inner and/or outer headers. The
above list is not exhaustive. The mechanisms described in this above list is not exhaustive. The mechanisms described in this
document are agnostic to the fields that are used for flow document are agnostic to the fields that are used for flow
identification. identification.
This method of flow identification is consistent with that of IPFIX This method of flow identification is consistent with that of IPFIX
[RFC 7011]. [RFC 7011].
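As an illustration of the flow identification described above, the following is a minimal sketch of a flow key built from the IP header fields listed in this section. The names FlowKey and flow_key_from_headers, and the dictionary layout of the parsed headers, are assumptions made for this example and do not come from the draft. As noted above, the mechanisms are agnostic to the fields actually used.

```python
from typing import NamedTuple, Optional


class FlowKey(NamedTuple):
    """Hypothetical flow key using the IP header fields listed above."""
    ip_protocol: int
    src_ip: str
    dst_ip: str
    src_port: Optional[int]  # None for protocols without ports
    dst_port: Optional[int]


def flow_key_from_headers(headers: dict) -> FlowKey:
    """Build a flow key from already-parsed header fields.

    The dictionary keys used here are illustrative assumptions.
    """
    return FlowKey(
        ip_protocol=headers["ip_protocol"],
        src_ip=headers["src_ip"],
        dst_ip=headers["dst_ip"],
        src_port=headers.get("src_port"),
        dst_port=headers.get("dst_port"),
    )
```

Because the key is an immutable tuple, it can be used directly as a dictionary key for per-flow counters.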
4.3.2. Criteria for Recognizing a Large Flow 4.3.2. Criteria and Techniques for Large Flow Recognition
From a bandwidth and time duration perspective, in order to recognize From a bandwidth and time duration perspective, in order to recognize
large flows we define an observation interval and observe the large flows we define an observation interval and observe the
bandwidth of the flow over that interval. A flow that exceeds a bandwidth of the flow over that interval. A flow that exceeds a
certain minimum bandwidth threshold over that observation interval certain minimum bandwidth threshold over that observation interval
would be considered a large flow. would be considered a large flow.
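The criterion above can be sketched as follows. This is only an illustration under stated assumptions: the function name, the (timestamp, flow key, size) packet representation, and the use of average bandwidth over a single interval are choices made for this example, not details taken from the draft.

```python
from collections import defaultdict


def recognize_large_flows(packets, interval_start, observation_interval_s,
                          min_bandwidth_bps):
    """Return the set of flow keys classified as large flows.

    packets: iterable of (timestamp_s, flow_key, size_bytes) tuples
             (a hypothetical representation of observed traffic).
    A flow is large if its average bandwidth over the observation
    interval exceeds min_bandwidth_bps.
    """
    interval_end = interval_start + observation_interval_s
    bytes_per_flow = defaultdict(int)
    for timestamp, flow_key, size_bytes in packets:
        # Only count packets that fall inside the observation interval.
        if interval_start <= timestamp < interval_end:
            bytes_per_flow[flow_key] += size_bytes

    large_flows = set()
    for flow_key, total_bytes in bytes_per_flow.items():
        bandwidth_bps = total_bytes * 8 / observation_interval_s
        if bandwidth_bps > min_bandwidth_bps:
            large_flows.add(flow_key)
    return large_flows
```

Both the observation interval and the bandwidth threshold are passed as arguments, reflecting the text's point that these two parameters define what counts as a large flow.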
The two parameters -- the observation interval, and the minimum The two parameters -- the observation interval, and the minimum
bandwidth threshold over that observation interval -- should be bandwidth threshold over that observation interval -- should be
programmable to facilitate handling of different use cases and programmable to facilitate handling of different use cases and
skipping to change at page 12, line 41 skipping to change at page 12, line 47
. Requires minimal router resources. . Requires minimal router resources.
Disadvantages: Disadvantages:
. In order to minimize the error inherent in sampling, there is a . In order to minimize the error inherent in sampling, there is a
minimum delay for the recognition time of large flows, and in minimum delay for the recognition time of large flows, and in
the time that it takes to react to this information. the time that it takes to react to this information.
With sampling, the detection of large flows can be done on the order With sampling, the detection of large flows can be done on the order
of one second [DevoFlow]. of one second [DevoFlow]. A discussion on determining the
appropriate sampling frequency is available in the following
reference [SAMP-BASIC].
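As a rough illustration of how a sampling-based estimate is formed, the sketch below scales the bytes seen in sampled packets by the sampling rate. It assumes simple 1-in-N packet sampling. The function and parameter names are hypothetical, and the estimate becomes noisier as the number of samples for a flow decreases.

```python
def estimate_flow_bandwidth_bps(sampled_packet_sizes, sampling_rate_n,
                                observation_interval_s):
    """Estimate a flow's bandwidth from 1-in-N packet samples.

    sampled_packet_sizes: sizes in bytes of the sampled packets that
                          belong to the flow.
    sampling_rate_n:      N, where on average 1 in N packets is sampled.
    """
    # Each sampled packet stands in for roughly N packets on the wire.
    estimated_bytes = sum(sampled_packet_sizes) * sampling_rate_n
    return estimated_bytes * 8 / observation_interval_s
```

For example, ten sampled 1500-byte packets at a 1-in-1000 sampling rate over a one-second interval correspond to an estimate of 120 Mb/s.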
4.3.4. Automatic Hardware Recognition 4.3.4. Inline Data Path Measurement
Implementations may perform automatic recognition of large flows in Implementations may perform recognition of large flows by performing
hardware on a router. Since this is done in hardware, it is an inline measurements on traffic in the data path of a router. Such an
solution and would be expected to operate at line rate. approach would be expected to operate at the interface speed on every
interface, accounting for all packets processed by the data path of
the router. An example of such an approach is described in IPFIX
[RFC 5470].
Using automatic hardware recognition of large flows, a faster Using inline data path measurement, a faster and more accurate
indication of large flows mapped to each of the component links in a indication of large flows mapped to each of the component links in a
LAG/ECMP group is available (as compared to the sampling approach LAG/ECMP group may be possible (as compared to the sampling-based
described above). approach).
The advantages and disadvantages of automatic hardware recognition The advantages and disadvantages of inline data path measurement are:
are:
Advantages: Advantages:
. Large flow detection is offloaded to hardware freeing up
software resources and possible dependence on an external
management station.
. As link speeds get higher, sampling rates are typically reduced . As link speeds get higher, sampling rates are typically reduced
to keep the number of samples manageable which places a lower to keep the number of samples manageable which places a lower
bound on the detection time. With automatic hardware bound on the detection time. With inline data path measurement,
recognition, large flows can be detected in shorter windows on large flows can be recognized in shorter windows on higher link
higher link speeds since every packet is accounted for in speeds since every packet is accounted for [NDTM].
hardware [NDTM].
. Eliminates the potential dependence on an external management
station for large flow recognition.
Disadvantages: Disadvantages:
. Such techniques are not supported in many routers. . It is more resource intensive in terms of the table sizes
required for monitoring all flows in order to perform the
measurement.
As mentioned earlier, the observation interval for determining a As mentioned earlier, the observation interval for determining a
large flow and the bandwidth threshold for classifying a flow as a large flow and the bandwidth threshold for classifying a flow as a
large flow should be programmable parameters in a router. large flow should be programmable parameters in a router.
The implementation of automatic hardware recognition of large flows The implementation details of inline data path measurement of large
is vendor dependent and beyond the scope of this document. flows are vendor dependent and beyond the scope of this document.
4.3.5. Use of More Than One Detection Method 4.3.5. Use of More Than One Method for Large Flow Recognition
It is possible that a router may have line cards that support a It is possible that a router may have line cards that support a
sampling technique while other line cards support automatic hardware sampling technique while other line cards support inline data path
detection of large flows. As long as there is a way for the router measurement of large flows. As long as there is a way for the router
to reliably determine the mapping of large flows to component links to reliably determine the mapping of large flows to component links
of a LAG/ECMP group, it is acceptable for the router to use more than of a LAG/ECMP group, it is acceptable for the router to use more than
one method for large flow recognition. one method for large flow recognition.
If both methods are supported, inline data path measurement may be
preferable because of its speed of detection [FLOW-ACC].
4.4. Load Rebalancing Options 4.4. Load Rebalancing Options
Below are suggested techniques for load rebalancing. Equipment Below are suggested techniques for load rebalancing. Equipment
vendors should implement all of these techniques and allow the vendors should implement all of these techniques and allow the
operator to choose one or more techniques based on their operator to choose one or more techniques based on their
applications. applications.
Note that regardless of the method used, perfect rebalancing of large Note that regardless of the method used, perfect rebalancing of large
flows may not be possible since flows arrive and depart at different flows may not be possible since flows arrive and depart at different
times. Also, any flows that are moved from one component link to times. Also, any flows that are moved from one component link to
another may experience momentary packet reordering. another may experience momentary packet reordering.
4.4.1. Alternative Placement of Large Flows 4.4.1. Alternative Placement of Large Flows
Within a LAG/ECMP group, the member component links with least Within a LAG/ECMP group, the member component links with least
average port utilization are identified. Some large flow(s) from the average port utilization are identified. Some large flow(s) from the
heavily loaded component links are then moved to those lightly-loaded heavily loaded component links are then moved to those lightly-loaded
member component links using a PBR rule in the ingress processing member component links using a policy-based routing (PBR) rule in the
element(s) in the routers. ingress processing element(s) in the routers.
With this approach, only certain large flows are subjected to With this approach, only certain large flows are subjected to
momentary flow re-ordering. momentary flow re-ordering.
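One possible way to select which large flows to move is sketched below. This is a simple greedy heuristic written for illustration only. It is not an algorithm specified by the draft, and the function name, data layout, and max_moves limit are assumptions. Each returned move would correspond to one PBR rule installed in the ingress processing element.

```python
def place_large_flows(link_load_bps, large_flows, max_moves=10):
    """Greedily move large flows from the most to the least loaded link.

    link_load_bps: dict of link_id -> current load in b/s (all flows).
    large_flows:   dict of flow_key -> (link_id, rate_bps).
    Returns (moves, new_load) where moves is a list of
    (flow_key, from_link, to_link) tuples.
    """
    load = dict(link_load_bps)
    placement = {key: link for key, (link, _rate) in large_flows.items()}
    moves = []
    for _ in range(max_moves):
        src = max(load, key=load.get)  # most heavily loaded link
        dst = min(load, key=load.get)  # most lightly loaded link
        gap = load[src] - load[dst]
        # Only consider flows whose rate is below the gap, so the
        # destination does not end up more loaded than the source was.
        candidates = [
            (rate, key)
            for key, (_link, rate) in large_flows.items()
            if placement[key] == src and rate < gap
        ]
        if not candidates:
            break
        rate, key = max(candidates)  # move the largest eligible flow
        load[src] -= rate
        load[dst] += rate
        placement[key] = dst
        moves.append((key, src, dst))
    return moves, load
```

The check against the gap between the two links reflects the need to account for the load that the destination link will carry after the move.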
When a large flow is moved, this will increase the utilization of the When a large flow is moved, this will increase the utilization of the
link that it moved to, potentially creating imbalance in the link that it moved to, potentially creating imbalance in the
utilization once again across the component links. Therefore, when utilization once again across the component links. Therefore, when
moving large flows, care must be taken to account for the existing moving large flows, care must be taken to account for the existing
load, and what the future load will be after the large flow has been load, and what the future load will be after the large flow has been
moved. Further, the appearance of new large flows may require a moved. Further, the appearance of new large flows may require a
skipping to change at page 16, line 13 skipping to change at page 16, line 28
flow -- and the link utilization is normal now. flow -- and the link utilization is normal now.
+-----------+ -> +-----------+ +-----------+ -> +-----------+
| | -> | | | | -> | |
| | ===> | | | | ===> | |
| (1)|--------|(1) | | (1)|--------|(1) |
| | | | | | | |
| | ===> | | | | ===> | |
| | -> | | | | -> | |
| | -> | | | | -> | |
| (R1) | -> | (R2) | | (R1) | -> | (R2) |
| (2)|--------|(2) | | (2)|--------|(2) |
| | | | | | | |
| | -> | | | | -> | |
| | -> | | | | -> | |
| | ===> | | | | ===> | |
| (3)|--------|(3) | | (3)|--------|(3) |
| | | | | | | |
+-----------+ +-----------+ +-----------+ +-----------+
Where: -> small flow Where: -> small flow
skipping to change at page 22, line 43 skipping to change at page 23, line 15
[bin-pack] Coffman, Jr., E., M. Garey, and D. Johnson. Approximation [bin-pack] Coffman, Jr., E., M. Garey, and D. Johnson. Approximation
Algorithms for Bin-Packing -- An Updated Survey. In Algorithm Design Algorithms for Bin-Packing -- An Updated Survey. In Algorithm Design
for Computer System Design, ed. by Ausiello, Lucertini, and Serafini. for Computer System Design, ed. by Ausiello, Lucertini, and Serafini.
Springer-Verlag, 1984. Springer-Verlag, 1984.
[CAIDA] Caida Internet Traffic Analysis, http://www.caida.org/home. [CAIDA] Caida Internet Traffic Analysis, http://www.caida.org/home.
[DevoFlow] Mogul, J., et al., "DevoFlow: Cost-Effective Flow [DevoFlow] Mogul, J., et al., "DevoFlow: Cost-Effective Flow
Management for High Performance Enterprise Networks," Proceedings of Management for High Performance Enterprise Networks," Proceedings of
the ACM SIGCOMM, August 2011. the ACM SIGCOMM, August 2011.
[FLOW-ACC] Zseby, T., et al., "Packet sampling for flow accounting:
challenges and limitations," Proceedings of the 9th international
conference on Passive and active network measurement, 2008.
[ID.ietf-rtgwg-cl-requirement] Villamizar, C. et al., "Requirements [ID.ietf-rtgwg-cl-requirement] Villamizar, C. et al., "Requirements
for MPLS over a Composite Link," September 2013. for MPLS over a Composite Link," September 2013.
[ITCOM] Jo, J., et al., "Internet traffic load balancing using [ITCOM] Jo, J., et al., "Internet traffic load balancing using
dynamic hashing with flow volume," SPIE ITCOM, 2002. dynamic hashing with flow volume," SPIE ITCOM, 2002.
[NDTM] Estan, C. and G. Varghese, "New directions in traffic [NDTM] Estan, C. and G. Varghese, "New directions in traffic
measurement and accounting," Proceedings of ACM SIGCOMM, August 2002. measurement and accounting," Proceedings of ACM SIGCOMM, August 2002.
[NVGRE] Sridharan, M. et al., "NVGRE: Network Virtualization using
Generic Routing Encapsulation," draft-sridharan-virtualization-
nvgre-04, February 2014.
[RFC 2784] Farinacci, D. et al., "Generic Routing Encapsulation
(GRE)," March 2000.
[RFC 2991] Thaler, D. and C. Hopps, "Multipath Issues in Unicast and [RFC 2991] Thaler, D. and C. Hopps, "Multipath Issues in Unicast and
Multicast," November 2000. Multicast," November 2000.
[RFC 6790] Kompella, K. et al., "The Use of Entropy Labels in MPLS [RFC 6790] Kompella, K. et al., "The Use of Entropy Labels in MPLS
Forwarding," November 2012. Forwarding," November 2012.
[RFC 1213] McCloghrie, K., "Management Information Base for Network [RFC 1213] McCloghrie, K., "Management Information Base for Network
Management of TCP/IP-based internets: MIB-II," March 1991. Management of TCP/IP-based internets: MIB-II," March 1991.
[RFC 2992] Hopps, C., "Analysis of an Equal-Cost Multi-Path [RFC 2992] Hopps, C., "Analysis of an Equal-Cost Multi-Path
Algorithm," November 2000. Algorithm," November 2000.
[RFC 3273] Waldbusser, S., "Remote Network Monitoring Management [RFC 3273] Waldbusser, S., "Remote Network Monitoring Management
Information Base for High Capacity Networks," July 2002. Information Base for High Capacity Networks," July 2002.
[RFC 3954] Claise, B., "Cisco Systems NetFlow Services Export Version [RFC 3954] Claise, B., "Cisco Systems NetFlow Services Export Version
9," October 2004. 9," October 2004.
[RFC 5475] Zseby T., et al., "Sampling and Filtering Techniques for [RFC 5470] G. Sadasivan et al., "Architecture for IP Flow Information
Export," March 2009.
[RFC 5475] Zseby, T. et al., "Sampling and Filtering Techniques for
IP Packet Selection," March 2009. IP Packet Selection," March 2009.
[RFC 5681] Allman, M. et al., "TCP Congestion Control," September
2009.
[RFC 7011] Claise, B. et al., "Specification of the IP Flow [RFC 7011] Claise, B. et al., "Specification of the IP Flow
Information Export (IPFIX) Protocol for the Exchange of IP Traffic Information Export (IPFIX) Protocol for the Exchange of IP Traffic
Flow Information," September 2013. Flow Information," September 2013.
[RFC 7012] Claise, B. and B. Trammell, "Information Model for IP Flow [RFC 7012] Claise, B. and B. Trammell, "Information Model for IP Flow
Information Export (IPFIX)," September 2013. Information Export (IPFIX)," September 2013.
[SAMP-BASIC] Phaal, P. and S. Panchen, "Packet Sampling Basics,"
http://www.sflow.org/packetSamplingBasics/.
[sFlow-LAG] Phaal, P. and A. Ghanwani, "sFlow LAG counters [sFlow-LAG] Phaal, P. and A. Ghanwani, "sFlow LAG counters
structure," http://www.sflow.org/sflow_lag.txt, September 2012. structure," http://www.sflow.org/sflow_lag.txt, September 2012.
[sFlow-v5] Phaal, P. and M. Lavine, "sFlow version 5," [sFlow-v5] Phaal, P. and M. Lavine, "sFlow version 5,"
http://www.sflow.org/sflow_version_5.txt, July 2004. http://www.sflow.org/sflow_version_5.txt, July 2004.
[STT] Davie, B. (ed) and J. Gross, "A Stateless Transport Tunneling
Protocol for Network Virtualization (STT)," draft-davie-stt-06, March
2014.
[VXLAN] Mahalingam, M. et al., "VXLAN: A Framework for Overlaying
Virtualized Layer 2 Networks over Layer 3 Networks," draft-
mahalingam-dutt-dcops-vxlan-09, April 2014.
[YONG] Yong, L., "Enhanced ECMP and Large Flow Aware Transport," [YONG] Yong, L., "Enhanced ECMP and Large Flow Aware Transport,"
draft-yong-pwe3-enhance-ecmp-lfat-01, September 2010. draft-yong-pwe3-enhance-ecmp-lfat-01, September 2010.
[RFC 5681] Allman, M. et al., "TCP Congestion Control," September
2009
Appendix A. Internet Traffic Analysis and Load Balancing Simulation Appendix A. Internet Traffic Analysis and Load Balancing Simulation
Internet traffic [CAIDA] has been analyzed to obtain flow statistics Internet traffic [CAIDA] has been analyzed to obtain flow statistics
such as the number of packets in a flow and the flow duration. The such as the number of packets in a flow and the flow duration. The
five tuples in the packet header (IP addresses, TCP/UDP Ports, and IP five tuples in the packet header (IP addresses, TCP/UDP Ports, and IP
protocol) are used for flow identification. The analysis indicates protocol) are used for flow identification. The analysis indicates
that < ~2% of the flows take ~30% of total traffic volume while the that < ~2% of the flows take ~30% of total traffic volume while the
rest of the flows (> ~98%) contributes ~70% [YONG]. rest of the flows (> ~98%) contributes ~70% [YONG].
The simulation has shown that given the Internet traffic pattern, the The simulation has shown that given the Internet traffic pattern, the
hash-based technique does not evenly distribute the flows over ECMP hash-based technique does not evenly distribute the flows over ECMP
 End of changes. 36 change blocks. 
53 lines changed or deleted 88 lines changed or added
