draft-ietf-opsawg-large-flow-load-balancing-06.txt   draft-ietf-opsawg-large-flow-load-balancing-07.txt 
OPSAWG R. Krishnan OPSAWG R. Krishnan
Internet Draft Brocade Communications Internet Draft Brocade Communications
Intended status: Informational L. Yong Intended status: Informational L. Yong
Expires: June 26, 2014 Huawei USA Expires: July 23, 2014 Huawei USA
December 26, 2013 A. Ghanwani A. Ghanwani
Dell Dell
Ning So Ning So
Tata Communications Tata Communications
S. Khanna
Cisco Systems
B. Khasnabish B. Khasnabish
ZTE Corporation ZTE Corporation
January 15, 2014
Mechanisms for Optimal LAG/ECMP Component Link Utilization in Mechanisms for Optimal LAG/ECMP Component Link Utilization in
Networks Networks
draft-ietf-opsawg-large-flow-load-balancing-06.txt draft-ietf-opsawg-large-flow-load-balancing-07.txt
Status of this Memo Status of this Memo
This Internet-Draft is submitted in full conformance with the This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79. This document may not be modified, provisions of BCP 78 and BCP 79. This document may not be modified,
and derivative works of it may not be created, except to publish it and derivative works of it may not be created, except to publish it
as an RFC and to translate it into languages other than English. as an RFC and to translate it into languages other than English.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that Task Force (IETF), its areas, and its working groups. Note that
skipping to change at page 1, line 43 skipping to change at page 1, line 42
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html http://www.ietf.org/shadow.html
This Internet-Draft will expire on June 26, 2014. This Internet-Draft will expire on July 15, 2014.
Copyright Notice Copyright Notice
Copyright (c) 2013 IETF Trust and the persons identified as the Copyright (c) 2014 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of (http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as the Trust Legal Provisions and are provided without warranty as
skipping to change at page 3, line 15 skipping to change at page 3, line 15
5.4. Information for Redistribution of Small Flows............17 5.4. Information for Redistribution of Small Flows............17
5.5. Export of Flow Information...............................17 5.5. Export of Flow Information...............................17
5.6. Monitoring information...................................18 5.6. Monitoring information...................................18
5.6.1. Interface (link) utilization........................18 5.6.1. Interface (link) utilization........................18
5.6.2. Other monitoring information........................18 5.6.2. Other monitoring information........................18
6. Operational Considerations....................................19 6. Operational Considerations....................................19
6.1. Rebalancing Frequency....................................19 6.1. Rebalancing Frequency....................................19
6.2. Handling Route Changes...................................19 6.2. Handling Route Changes...................................19
7. IANA Considerations...........................................19 7. IANA Considerations...........................................19
8. Security Considerations.......................................19 8. Security Considerations.......................................19
9. Acknowledgements..............................................20 9. Contributing Authors..........................................19
10. References...................................................20 10. Acknowledgements.............................................20
10.1. Normative References....................................20 11. References...................................................20
10.2. Informative References..................................20 11.1. Normative References....................................20
11.2. Informative References..................................20
1. Introduction 1. Introduction
Networks extensively use link aggregation groups (LAG) [802.1AX] and Networks extensively use link aggregation groups (LAG) [802.1AX] and
equal cost multi-paths (ECMP) [RFC 2991] as techniques for capacity equal cost multi-paths (ECMP) [RFC 2991] as techniques for capacity
scaling. For the problems addressed by this document, network traffic scaling. For the problems addressed by this document, network traffic
can be predominantly categorized into two traffic types: long-lived can be predominantly categorized into two traffic types: long-lived
large flows and other flows. These other flows, which include long- large flows and other flows. These other flows, which include long-
lived small flows, short-lived small flows, and short-lived large lived small flows, short-lived small flows, and short-lived large
flows, are referred to as small flows in this document. Stateless flows, are referred to as small flows in this document. Stateless
skipping to change at page 3, line 47 skipping to change at page 3, line 48
and assigning the large flows to specific LAG/ECMP component links or and assigning the large flows to specific LAG/ECMP component links or
redistributing the small flows when a component link on the router is redistributing the small flows when a component link on the router is
congested. congested.
It is useful to keep in mind that in typical use cases for this It is useful to keep in mind that in typical use cases for this
mechanism the large flows are those that consume a significant amount mechanism the large flows are those that consume a significant amount
of bandwidth on a link, e.g. greater than 5% of link bandwidth. The of bandwidth on a link, e.g. greater than 5% of link bandwidth. The
number of such flows would necessarily be fairly small, e.g. on the number of such flows would necessarily be fairly small, e.g. on the
order of 10's or 100's per LAG/ECMP. In other words, the number of order of 10's or 100's per LAG/ECMP. In other words, the number of
large flows is NOT expected to be on the order of millions of flows. large flows is NOT expected to be on the order of millions of flows.
Examples of such large flows would be IPSec tunnels in service Examples of such large flows would be IPsec tunnels in service
provider backbone networks or storage backup traffic in data center provider backbone networks or storage backup traffic in data center
networks. networks.
1.1. Acronyms 1.1. Acronyms
COTS: Commercial Off-the-shelf COTS: Commercial Off-the-shelf
DOS: Denial of Service DOS: Denial of Service
ECMP: Equal Cost Multi-path ECMP: Equal Cost Multi-path
skipping to change at page 9, line 7 skipping to change at page 9, line 7
The various steps in achieving optimal LAG/ECMP component link The various steps in achieving optimal LAG/ECMP component link
utilization in networks are detailed below: utilization in networks are detailed below:
Step 1) This involves large flow recognition in routers and Step 1) This involves large flow recognition in routers and
maintaining the mapping of the large flow to the component link that maintaining the mapping of the large flow to the component link that
it uses. The recognition of large flows is explained in Section 4.3. it uses. The recognition of large flows is explained in Section 4.3.
Step 2) The egress component links are periodically scanned for link Step 2) The egress component links are periodically scanned for link
utilization. If the egress component link utilization exceeds a pre- utilization. If the egress component link utilization exceeds a pre-
programmed threshold, an operator alert is generated. The large flows programmed threshold, an operator alert is generated. Information
mapped to the congested egress component link are exported to a about the large flows mapped to the congested egress component link
central management entity. is exported to a central management entity.
Step 3) On receiving the alert about the congested component link, Step 3) On receiving the alert about the congested component link,
the operator, through a central management entity, finds the large the operator, through a central management entity, finds the large
flows mapped to that component link and the LAG/ECMP group to which flows mapped to that component link and the LAG/ECMP group to which
the component link belongs. the component link belongs.
Step 4) The operator can choose to rebalance the large flows on Step 4) The operator can choose to rebalance the large flows on
lightly loaded component links of the LAG/ECMP group or redistribute lightly loaded component links of the LAG/ECMP group or redistribute
the small flows on the congested link to other component links of the the small flows on the congested link to other component links of the
group. The operator, through a central management entity, can choose group. The operator, through a central management entity, can choose
skipping to change at page 9, line 35 skipping to change at page 9, line 35
3) Have the router redistribute all the small flows on the 3) Have the router redistribute all the small flows on the
congested link to other component links in the group. congested link to other component links in the group.
The central management entity conveys the above information to the The central management entity conveys the above information to the
router. The load re-balancing options are explained in Section 4.4. router. The load re-balancing options are explained in Section 4.4.
Steps 2) to 4) could be automated if desired. Steps 2) to 4) could be automated if desired.
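As an illustration only, the automated form of Steps 2) to 4) might look like the following sketch. The class name, field names, the 0.8 threshold, and the "move the largest flow to the least loaded link" policy are all assumptions made for this example; the draft does not prescribe them.

```python
from dataclasses import dataclass, field


@dataclass
class ComponentLink:
    """Hypothetical model of one LAG/ECMP component link."""
    name: str
    capacity_bps: float
    # Large flows currently mapped to this link: flow id -> measured rate (bps).
    large_flows: dict = field(default_factory=dict)
    # Aggregate rate of all small flows hashed onto this link (bps).
    small_flow_bps: float = 0.0

    def utilization(self) -> float:
        used = sum(self.large_flows.values()) + self.small_flow_bps
        return used / self.capacity_bps


def scan_and_rebalance(links, threshold=0.8):
    """Run one scan over a LAG/ECMP group and return the moves made.

    Each move is a tuple (flow_id, from_link_name, to_link_name).
    """
    moves = []
    for link in links:
        # Step 2: compare utilization with the pre-programmed threshold.
        while link.utilization() > threshold and link.large_flows:
            others = [l for l in links if l is not link]
            if not others:
                break
            # Step 3: find the large flows mapped to the congested link
            # (assumed policy: consider the largest one first).
            flow_id, rate = max(link.large_flows.items(), key=lambda kv: kv[1])
            # Step 4: pick the most lightly loaded other link in the group.
            target = min(others, key=lambda l: l.utilization())
            # Do not move the flow if that would congest the target link.
            if target.utilization() + rate / target.capacity_bps > threshold:
                break
            del link.large_flows[flow_id]
            target.large_flows[flow_id] = rate
            moves.append((flow_id, link.name, target.name))
    return moves
```

In a deployment the moves would be conveyed to the router by the central management entity rather than applied to an in-memory model as done here.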
Providing large flow information to a central management entity Providing large flow information to a central management entity
provides the capability to further optimize flow distribution at with provides the capability to globally optimize flow distribution as
multi-node visibility. Consider the following example. A router may described in Section 4.1. Consider the following example. A router
have 3 ECMP nexthops that lead down paths P1, P2, and P3. A couple may have 3 ECMP nexthops that lead down paths P1, P2, and P3. A
of hops downstream on P1 may be congested, while P2 and P3 may be couple of hops downstream on P1 may be congested, while P2 and P3 may
under-utilized, which the local router does not have visibility into. be under-utilized, which the local router does not have visibility
With the help of a central management entity, the operator could into. With the help of a central management entity, the operator
redistribute some of the flows from P1 to P2 and P3 resulting in a could redistribute some of the flows from P1 to P2 and P3 resulting
more optimized flow of traffic. in a more optimized flow of traffic.
The techniques described above are especially useful when bundling The techniques described above are especially useful when bundling
links of different bandwidths for e.g. 10Gbps and 100Gbps as links of different bandwidths for e.g. 10Gbps and 100Gbps as
described in [ID.ietf-rtgwg-cl-requirement]. described in [ID.ietf-rtgwg-cl-requirement].
4.3. Large Flow Recognition 4.3. Large Flow Recognition
4.3.1. Flow Identification 4.3.1. Flow Identification
A flow (large flow or small flow) can be defined as a sequence of A flow (large flow or small flow) can be defined as a sequence of
skipping to change at page 11, line 8 skipping to change at page 11, line 8
been recognized as a large flow, it should continue to be recognized been recognized as a large flow, it should continue to be recognized
as a large flow as long as the traffic received during an observation as a large flow as long as the traffic received during an observation
interval exceeds some fraction of the bandwidth threshold, for interval exceeds some fraction of the bandwidth threshold, for
example 80% of the bandwidth threshold. example 80% of the bandwidth threshold.
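The retention rule above is a form of hysteresis and can be sketched as follows. The function name and parameters are hypothetical, and the promotion condition (observed rate exceeding the full bandwidth threshold) is an assumption made for this example; only the 80% retention fraction is taken from the text.

```python
def update_large_flow_state(is_large, observed_bps, threshold_bps,
                            keep_fraction=0.8):
    """Return the new large-flow state after one observation interval.

    is_large      -- whether the flow is currently recognized as large
    observed_bps  -- rate measured during this observation interval
    threshold_bps -- bandwidth threshold for large flow recognition
    keep_fraction -- fraction of the threshold a recognized flow must
                     still exceed to remain a large flow (example: 0.8)
    """
    if is_large:
        # An already recognized flow only needs to exceed the lower bound.
        return observed_bps > keep_fraction * threshold_bps
    # Assumed promotion rule: exceed the full threshold.
    return observed_bps > threshold_bps
```

With a 100 Mbps threshold, a flow measured at 120 Mbps and then at 90 Mbps stays classified as large, which avoids repeatedly adding and removing a flow whose rate hovers near the threshold.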
Various techniques to identify a large flow are described below. Various techniques to identify a large flow are described below.
4.3.3. Sampling Techniques 4.3.3. Sampling Techniques
A number of routers support sampling techniques such as sFlow [sFlow- A number of routers support sampling techniques such as sFlow [sFlow-
v5, sFlow-LAG], PSAMP [RFC 5475] and Netflow Sampling [RFC 3954]. v5, sFlow-LAG], PSAMP [RFC 5475] and NetFlow Sampling [RFC 3954].
For the purpose of large flow identification, sampling must be For the purpose of large flow identification, sampling must be
enabled on all of the egress ports in the router where such enabled on all of the egress ports in the router where such
measurements are desired. measurements are desired.
Using sflow as an example, processing in an sFlow collector will Using sFlow as an example, processing in a sFlow collector will
provide an approximate indication of the large flows mapping to each provide an approximate indication of the large flows mapping to each
of the component links in each LAG/ECMP group. It is possible to of the component links in each LAG/ECMP group. It is possible to
implement this part of the collector function in the control plane of implement this part of the collector function in the control plane of
the router reducing dependence on an external management station, the router reducing dependence on an external management station,
assuming sufficient control plane resources are available. assuming sufficient control plane resources are available.
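A minimal sketch of the collector-side processing follows, assuming 1-in-N packet sampling and scaling the sampled byte counts by N to estimate each flow's rate. The tuple layout, function name, and parameters are hypothetical and are not part of the sFlow specification; a real collector would decode these fields from sFlow datagrams.

```python
from collections import defaultdict


def estimate_large_flows(samples, sampling_rate, interval_s, threshold_bps):
    """Estimate which flows on each component link are large flows.

    samples       -- iterable of (egress_link, flow_key, frame_bytes)
                     tuples collected during one observation interval
    sampling_rate -- N, for 1-in-N packet sampling
    interval_s    -- length of the observation interval in seconds
    threshold_bps -- bandwidth threshold for large flow recognition

    Returns {egress_link: {flow_key: estimated_bps}} for large flows only.
    """
    sampled_bytes = defaultdict(int)
    for egress_link, flow_key, frame_bytes in samples:
        sampled_bytes[(egress_link, flow_key)] += frame_bytes

    large = defaultdict(dict)
    for (egress_link, flow_key), nbytes in sampled_bytes.items():
        # Each sample stands for roughly `sampling_rate` packets, so the
        # estimate is approximate, as noted in the text.
        est_bps = nbytes * sampling_rate * 8 / interval_s
        if est_bps > threshold_bps:
            large[egress_link][flow_key] = est_bps
    return dict(large)
```

The accuracy of the estimate depends on the number of samples seen per flow, so short intervals or very high sampling ratios make the classification noisier.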
If egress sampling is not available, ingress sampling can suffice If egress sampling is not available, ingress sampling can suffice
since the central management entity used by the sampling technique since the central management entity used by the sampling technique
typically has multi-node visibility and can use the samples from an typically has multi-node visibility and can use the samples from an
immediately downstream node to make measurements for egress traffic immediately downstream node to make measurements for egress traffic
skipping to change at page 15, line 13 skipping to change at page 15, line 13
flow -- and the link utilization is normal now. flow -- and the link utilization is normal now.
+-----------+ -> +-----------+ +-----------+ -> +-----------+
| | -> | | | | -> | |
| | ===> | | | | ===> | |
| (1)|--------|(1) | | (1)|--------|(1) |
| | | | | | | |
| | ===> | | | | ===> | |
| | -> | | | | -> | |
| | -> | | | | -> | |
| (R1) | -> | (R2) | | (R1) | -> | (R2) |
| (2)|--------|(2) | | (2)|--------|(2) |
| | | | | | | |
| | -> | | | | -> | |
| | -> | | | | -> | |
| | ===> | | | | ===> | |
| (3)|--------|(3) | | (3)|--------|(3) |
| | | | | | | |
+-----------+ +-----------+ +-----------+ +-----------+
Where: -> small flows Where: -> small flows
skipping to change at page 20, line 5 skipping to change at page 19, line 46
This memo includes no request to IANA. This memo includes no request to IANA.
8. Security Considerations 8. Security Considerations
This document does not directly impact the security of the Internet This document does not directly impact the security of the Internet
infrastructure or its applications. In fact, it could help if there infrastructure or its applications. In fact, it could help if there
is a DOS attack pattern which causes a hash imbalance resulting in is a DOS attack pattern which causes a hash imbalance resulting in
heavy overloading of large flows to certain LAG/ECMP component heavy overloading of large flows to certain LAG/ECMP component
links. links.
9. Acknowledgements 9. Contributing Authors
Sanjay Khanna
Cisco Systems
Email: sanjakha@gmail.com
10. Acknowledgements
The authors would like to thank the following individuals for their The authors would like to thank the following individuals for their
review and valuable feedback on earlier versions of this document: review and valuable feedback on earlier versions of this document:
Shane Amante, Curtis Villamizar, Fred Baker, Wes George, Brian Shane Amante, Curtis Villamizar, Fred Baker, Wes George, Brian
Carpenter, George Yum, Michael Fargano, Michael Bugenhagen, Jianrong Carpenter, George Yum, Michael Fargano, Michael Bugenhagen, Jianrong
Wong, Peter Phaal, Roman Krzanowski, Weifeng Zhang, Pete Moyer, Wong, Peter Phaal, Roman Krzanowski, Weifeng Zhang, Pete Moyer,
Andrew Malis, Dave McDysan, Zhen Cao, Dan Romascanu, and Benoit Andrew Malis, Dave McDysan, Zhen Cao, Dan Romascanu, and Benoit
Claise. Claise.
10. References 11. References
10.1. Normative References 11.1. Normative References
10.2. Informative References 11.2. Informative References
[802.1AX] IEEE Standards Association, "IEEE Std 802.1AX-2008 IEEE [802.1AX] IEEE Standards Association, "IEEE Std 802.1AX-2008 IEEE
Standard for Local and Metropolitan Area Networks - Link Standard for Local and Metropolitan Area Networks - Link
Aggregation", 2008. Aggregation", 2008.
[bin-pack] Coffman, Jr., E., M. Garey, and D. Johnson. Approximation [bin-pack] Coffman, Jr., E., M. Garey, and D. Johnson. Approximation
Algorithms for Bin-Packing -- An Updated Survey. In Algorithm Design Algorithms for Bin-Packing -- An Updated Survey. In Algorithm Design
for Computer System Design, ed. by Ausiello, Lucertini, and Serafini. for Computer System Design, ed. by Ausiello, Lucertini, and Serafini.
Springer-Verlag, 1984. Springer-Verlag, 1984.
skipping to change at page 23, line 6 skipping to change at page 23, line 4
Dell Dell
San Jose, CA 95134 San Jose, CA 95134
Phone: +1-408-571-3228 Phone: +1-408-571-3228
Email: anoop@alumni.duke.edu Email: anoop@alumni.duke.edu
Ning So Ning So
Tata Communications Tata Communications
Plano, TX 75082, USA Plano, TX 75082, USA
Phone: +1-972-955-0914 Phone: +1-972-955-0914
Email: ning.so@tatacommunications.com Email: ning.so@tatacommunications.com
Sanjay Khanna
Cisco Systems
Email: sanjakha@gmail.com
Bhumip Khasnabish Bhumip Khasnabish
ZTE Corporation ZTE Corporation
New Jersey, 07960, USA New Jersey, 07960, USA
Phone: +1-781-752-8003 Phone: +1-781-752-8003
Email: vumip1@gmail.com Email: vumip1@gmail.com
 End of changes. 18 change blocks. 
35 lines changed or deleted 36 lines changed or added
