draft-ietf-opsawg-large-flow-load-balancing-07.txt   draft-ietf-opsawg-large-flow-load-balancing-08.txt 
OPSAWG R. Krishnan OPSAWG R. Krishnan
Internet Draft Brocade Communications Internet Draft Brocade Communications
Intended status: Informational L. Yong Intended status: Informational L. Yong
Expires: July 23, 2014 Huawei USA Expires: October 5, 2014 Huawei USA
A. Ghanwani A. Ghanwani
Dell Dell
Ning So Ning So
Tata Communications Tata Communications
B. Khasnabish B. Khasnabish
ZTE Corporation ZTE Corporation
January 15, 2014 April 6, 2014
Mechanisms for Optimal LAG/ECMP Component Link Utilization in Mechanisms for Optimizing LAG/ECMP Component Link Utilization in
Networks Networks
draft-ietf-opsawg-large-flow-load-balancing-07.txt draft-ietf-opsawg-large-flow-load-balancing-08.txt
Status of this Memo Status of this Memo
This Internet-Draft is submitted in full conformance with the This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79. This document may not be modified, provisions of BCP 78 and BCP 79. This document may not be modified,
and derivative works of it may not be created, except to publish it and derivative works of it may not be created, except to publish it
as an RFC and to translate it into languages other than English. as an RFC and to translate it into languages other than English.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that Task Force (IETF), its areas, and its working groups. Note that
skipping to change at page 1, line 42 skipping to change at page 1, line 42
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html http://www.ietf.org/shadow.html
This Internet-Draft will expire on July 15, 2014. This Internet-Draft will expire on October 6, 2014.
Copyright Notice Copyright Notice
Copyright (c) 2014 IETF Trust and the persons identified as the Copyright (c) 2014 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of (http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents publication of this document. Please review these documents
skipping to change at page 2, line 30 skipping to change at page 2, line 30
optimally use the bandwidth in wired networks that extensively use optimally use the bandwidth in wired networks that extensively use
link aggregation groups and equal cost multi-paths as techniques for link aggregation groups and equal cost multi-paths as techniques for
bandwidth scaling. This draft explores some of the mechanisms useful bandwidth scaling. This draft explores some of the mechanisms useful
for achieving this. for achieving this.
Table of Contents Table of Contents
1. Introduction...................................................3 1. Introduction...................................................3
1.1. Acronyms..................................................4 1.1. Acronyms..................................................4
1.2. Terminology...............................................4 1.2. Terminology...............................................4
2. Flow Categorization............................................4 2. Flow Categorization............................................5
3. Hash-based Load Distribution in LAG/ECMP.......................5 3. Hash-based Load Distribution in LAG/ECMP.......................5
4. Mechanisms for Optimal LAG/ECMP Component Link Utilization.....7 4. Mechanisms for Optimizing LAG/ECMP Component Link Utilization..7
4.1. Differences in LAG vs ECMP................................7 4.1. Differences in LAG vs ECMP................................8
4.2. Overview of the mechanism.................................8 4.2. Operational Overview......................................9
4.3. Large Flow Recognition...................................10 4.3. Large Flow Recognition...................................10
4.3.1. Flow Identification.................................10 4.3.1. Flow Identification.................................10
4.3.2. Criteria for Identifying a Large Flow...............10 4.3.2. Criteria for Recognizing a Large Flow...............10
4.3.3. Sampling Techniques.................................11 4.3.3. Sampling Techniques.................................11
4.3.4. Automatic Hardware Recognition......................12 4.3.4. Automatic Hardware Recognition......................12
4.4. Load Re-balancing Options................................12 4.3.5. Use of More Than One Detection Method...............13
4.4.1. Alternative Placement of Large Flows................13 4.4. Load Rebalancing Options.................................13
4.4.2. Redistributing Small Flows..........................13 4.4.1. Alternative Placement of Large Flows................14
4.4.3. Component Link Protection Considerations............14 4.4.2. Redistributing Small Flows..........................14
4.4.4. Load Re-balancing Algorithms........................14 4.4.3. Component Link Protection Considerations............15
4.4.5. Load Re-Balancing Example...........................14 4.4.4. Load Rebalancing Algorithms.........................15
5. Information Model for Flow Re-balancing.......................15 4.4.5. Load Rebalancing Example............................15
5.1. Configuration Parameters for Flow Re-balancing...........15 5. Information Model for Flow Rebalancing........................16
5.2. System Configuration and Identification Parameters.......16 5.1. Configuration Parameters for Flow Rebalancing............16
5.3. Information for Alternative Placement of Large Flows.....17 5.2. System Configuration and Identification Parameters.......17
5.4. Information for Redistribution of Small Flows............17 5.3. Information for Alternative Placement of Large Flows.....18
5.5. Export of Flow Information...............................17 5.4. Information for Redistribution of Small Flows............19
5.6. Monitoring information...................................18 5.5. Export of Flow Information...............................19
5.6.1. Interface (link) utilization........................18 5.6. Monitoring information...................................20
5.6.2. Other monitoring information........................18 5.6.1. Interface (link) utilization........................20
6. Operational Considerations....................................19 5.6.2. Other monitoring information........................20
6.1. Rebalancing Frequency....................................19 6. Operational Considerations....................................21
6.2. Handling Route Changes...................................19 6.1. Rebalancing Frequency....................................21
7. IANA Considerations...........................................19 6.2. Handling Route Changes...................................21
8. Security Considerations.......................................19 7. IANA Considerations...........................................21
9. Contributing Authors..........................................19 8. Security Considerations.......................................21
10. Acknowledgements.............................................20 9. Contributing Authors..........................................22
11. References...................................................20 10. Acknowledgements.............................................22
11.1. Normative References....................................20 11. References...................................................22
11.2. Informative References..................................20 11.1. Normative References....................................22
11.2. Informative References..................................22
1. Introduction 1. Introduction
Networks extensively use link aggregation groups (LAG) [802.1AX] and Networks extensively use link aggregation groups (LAG) [802.1AX] and
equal cost multi-paths (ECMP) [RFC 2991] as techniques for capacity equal cost multi-paths (ECMP) [RFC 2991] as techniques for capacity
scaling. For the problems addressed by this document, network traffic scaling. For the problems addressed by this document, network traffic
can be predominantly categorized into two traffic types: long-lived can be predominantly categorized into two traffic types: long-lived
large flows and other flows. These other flows, which include long- large flows and other flows. These other flows, which include long-
lived small flows, short-lived small flows, and short-lived large lived small flows, short-lived small flows, and short-lived large
flows, are referred to as small flows in this document. Stateless flows, are referred to as "small flows" in this document. Long-lived
hash-based techniques [ITCOM, RFC 2991, RFC 2992, RFC 6790] are often large flows are simply referred to as "large flows."
used to distribute both large flows and small flows over the
component links in a LAG/ECMP. However, the traffic may not be evenly
distributed over the component links due to the traffic pattern.
This draft describes mechanisms for optimal LAG/ECMP component link Stateless hash-based techniques [ITCOM, RFC 2991, RFC 2992, RFC 6790]
utilization while using hash-based techniques. The mechanisms are often used to distribute both large flows and small flows over
the component links in a LAG/ECMP. However, the traffic may not be
evenly distributed over the component links due to the traffic
pattern.
This draft describes mechanisms for optimizing LAG/ECMP component
link utilization while using hash-based techniques. The mechanisms
comprise the following steps -- recognizing large flows in a router; comprise the following steps -- recognizing large flows in a router;
and assigning the large flows to specific LAG/ECMP component links or and assigning the large flows to specific LAG/ECMP component links or
redistributing the small flows when a component link on the router is redistributing the small flows when a component link on the router is
congested. congested.
It is useful to keep in mind that in typical use cases for this It is useful to keep in mind that in typical use cases for this
mechanism the large flows are those that consume a significant amount mechanism the large flows are those that consume a significant amount
of bandwidth on a link, e.g. greater than 5% of link bandwidth. The of bandwidth on a link, e.g. greater than 5% of link bandwidth. The
number of such flows would necessarily be fairly small, e.g. on the number of such flows would necessarily be fairly small, e.g. on the
order of 10's or 100's per LAG/ECMP. In other words, the number of order of 10's or 100's per LAG/ECMP. In other words, the number of
skipping to change at page 4, line 33 skipping to change at page 4, line 37
QoS: Quality of Service QoS: Quality of Service
STT: Stateless Transport Tunneling STT: Stateless Transport Tunneling
TCAM: Ternary Content Addressable Memory TCAM: Ternary Content Addressable Memory
VXLAN: Virtual Extensible LAN VXLAN: Virtual Extensible LAN
1.2. Terminology 1.2. Terminology
Large flow(s): long-lived large flow(s) ECMP component link: An individual nexthop within an ECMP group. An
ECMP component link may itself comprise a LAG.
Small flow(s): long-lived small flow(s), short-lived small flows, and ECMP table: A table that is used as the nexthop of an ECMP route that
short-lived large flow(s) comprises the set of component links and the weights associated with
each of those component links. The weights are used to determine
which values of the hash function map to a given component link.
2. Flow Categorization LAG component link: An individual link within a LAG. A LAG component
link is typically a physical link.
In general, based on the size and duration, a flow can be categorized LAG table: A table that is used as the output port which is a LAG
into any one of the following four types, as shown in Figure 1: that comprises the set of component links and the weights associated
with each of those component links. The weights are used to
determine which values of the hash function map to a given component
link.
(a) Short-lived Large Flow (SLLF), Large flow(s): Refers to long-lived large flow(s).
(b) Short-lived Small Flow (SLSF),
(c) Long-lived Large Flow (LLLF), and
(d) Long-lived Small Flow (LLSF).
Flow Size Small flow(s): Refers to any of, or a combination of, long-lived
^ small flow(s), short-lived small flows, and short-lived large
|--------------------|--------------------| flow(s).
| | |
Large | SLLF | LLLF |
Flow | | |
|--------------------|--------------------|
| | |
Small | SLSF | LLSF |
Flow | | |
+--------------------+--------------------+---> Flow Duration
Short-lived Long-lived
Flow Flow
Figure 1: Flow Categorization 2. Flow Categorization
In this document, as mentioned earlier, we categorize long-lived large In general, based on the size and duration, a flow can be categorized
flows as "large flows", and all of the others -- long-lived small flows, into any one of the following four types, as shown in Figure 1:
short-lived small flows, and short-lived large flows as "small flows".
(a) Short-lived Large Flow (SLLF),
(b) Short-lived Small Flow (SLSF),
(c) Long-lived Large Flow (LLLF), and
(d) Long-lived Small Flow (LLSF).
Flow Size
^
|--------------------|--------------------|
| | |
Large | SLLF | LLLF |
Flow | | |
|--------------------|--------------------|
| | |
Small | SLSF | LLSF |
Flow | | |
+--------------------+--------------------+-->Flow Duration
Short-lived Long-lived
Flow Flow
Figure 1: Flow Categorization
In this document, as mentioned earlier, we categorize long-lived
large flows as "large flows", and all of the others -- long-lived
small flows, short-lived small flows, and short-lived large flows
as "small flows".
3. Hash-based Load Distribution in LAG/ECMP 3. Hash-based Load Distribution in LAG/ECMP
Hashing techniques are often used for traffic load balancing to Hash-based techniques are often used for traffic load balancing to
select among multiple available paths with LAG/ECMP. The advantages select among multiple available paths within a LAG/ECMP group. The
of hash-based load distribution are the preservation of the packet advantages of hash-based techniques for load distribution are the
sequence in a flow and the real-time distribution without maintaining preservation of the packet sequence in a flow and the real-time
per-flow state in the router. Hash-based techniques use a combination distribution without maintaining per-flow state in the router. Hash-
of fields in the packet's headers to identify a flow, and the hash based techniques use a combination of fields in the packet's headers
function on these fields is used to generate a unique number that to identify a flow, and the hash function computed using these fields
identifies a link/path in a LAG/ECMP. The result of the hashing is used to generate a unique number that identifies a link/path in a
procedure is a many-to-one mapping of flows to component links. LAG/ECMP group. The result of the hashing procedure is a many-to-one
mapping of flows to component links.
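The hash-based selection described above can be sketched as follows. This is a minimal illustration, not the method of any particular router: the choice of a 5-tuple as the flow key, the use of CRC-32 as the hash function, and the names FlowKey and select_component_link are all assumptions made for the example.

```python
import zlib
from typing import NamedTuple


class FlowKey(NamedTuple):
    """Hypothetical flow key: a 5-tuple taken from the packet headers."""
    src_ip: str
    dst_ip: str
    protocol: int
    src_port: int
    dst_port: int


def select_component_link(flow: FlowKey, num_links: int) -> int:
    """Map a flow to a component link index in the range [0, num_links).

    The hash is computed only from header fields, so every packet of a
    flow lands on the same link (packet order is preserved) and the
    router keeps no per-flow state. CRC-32 is an assumption here; real
    devices use their own hash functions.
    """
    key = "|".join(str(field) for field in flow).encode()
    return zlib.crc32(key) % num_links


# Ten flows hashed onto a 3-link LAG: a many-to-one mapping.
flows = [FlowKey("10.0.0.1", "10.0.1.1", 6, 40000 + i, 443) for i in range(10)]
placement = {flow: select_component_link(flow, 3) for flow in flows}
```

Note that the mapping depends only on the header fields and not on the rate of each flow, which is why large flows can end up concentrated on one component link.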
If the traffic mix constitutes flows such that the result of the hash If the traffic mix constitutes flows such that the result of the hash
function across these flows is fairly uniform so that a similar function across these flows is fairly uniform so that a similar
number of flows is mapped to each component link, if the individual number of flows is mapped to each component link, if the individual
flow rates are much smaller as compared to the link capacity, and if flow rates are much smaller as compared to the link capacity, and if
the rate differences are not dramatic, the hash-based algorithm the rate differences are not dramatic, hash-based techniques produce
produces good results with respect to utilization of the individual good results with respect to utilization of the individual component
component links. However, if one or more of these conditions are not links. However, if one or more of these conditions are not met, hash-
met, hash-based techniques may result in unbalanced loads on based techniques may result in imbalance in the loads on individual
individual component links. component links.
One example is illustrated in Figure 2. In Figure 2, there are two One example is illustrated in Figure 2. In Figure 2, there are two
routers, R1 and R2, and there is a LAG between them which has 3 routers, R1 and R2, and there is a LAG between them which has 3
component links (1), (2), (3). There are a total of 10 flows that component links (1), (2), (3). There are a total of 10 flows that
need to be distributed across the links in this LAG. The result of need to be distributed across the links in this LAG. The result of
hashing is as follows: applying the hash-based technique is as follows:
. Component link (1) has 3 flows -- 2 small flows and 1 large . Component link (1) has 3 flows -- 2 small flows and 1 large
flow -- and the link utilization is normal. flow -- and the link utilization is normal.
. Component link (2) has 3 flows -- 3 small flows and no large . Component link (2) has 3 flows -- 3 small flows and no large
flow -- and the link utilization is light. flow -- and the link utilization is light.
o The absence of any large flow leaves the component link o The absence of any large flow leaves the component link
under-utilized. under-utilized.
skipping to change at page 6, line 45 skipping to change at page 7, line 21
| (R1) | -> | (R2) | | (R1) | -> | (R2) |
| (2)|--------|(2) | | (2)|--------|(2) |
| | -> | | | | -> | |
| | -> | | | | -> | |
| | ===> | | | | ===> | |
| | ===> | | | | ===> | |
| (3)|--------|(3) | | (3)|--------|(3) |
| | | | | | | |
+-----------+ +-----------+ +-----------+ +-----------+
Where: -> small flow Where: -> small flow
===> large flow ===> large flow
Figure 2: Unevenly Utilized Component Links Figure 2: Unevenly Utilized Component Links
This document presents improved load distribution techniques based on This document presents mechanisms for addressing the imbalance in
the large flow awareness. The techniques compensate for unbalanced load distribution resulting from commonly used hash-based techniques
load distribution resulting from hashing as demonstrated in the above for LAG/ECMP that were shown in the above example. The mechanisms use
example. large flow awareness to compensate for the imbalance in load
distribution.
4. Mechanisms for Optimal LAG/ECMP Component Link Utilization 4. Mechanisms for Optimizing LAG/ECMP Component Link Utilization
The suggested techniques in this draft are about a local optimization The suggested mechanisms in this draft are about a local optimization
solution; they are local in the sense that both the identification of solution; they are local in the sense that both the identification of
large flows and re-balancing of the load can be accomplished large flows and re-balancing of the load can be accomplished
completely within individual nodes in the network without the need completely within individual nodes in the network without the need
for interaction with other nodes. for interaction with other nodes.
This approach may not yield a globally optimal placement of large This approach may not yield a global optimization of the placement of
flows across multiple nodes in a network, which may be desirable in large flows across multiple nodes in a network, which may be
some networks. On the other hand, a local approach may be adequate desirable in some networks. On the other hand, a local approach may
for some environments for the following reasons: be adequate for some environments for the following reasons:
1) Different links within a network experience different levels of 1) Different links within a network experience different levels of
utilization and, thus, a "targeted" solution is needed for those hot- utilization and, thus, a "targeted" solution is needed for those hot-
spots in the network. An example is the utilization of a LAG between spots in the network. An example is the utilization of a LAG between
two routers that needs to be optimized. two routers that needs to be optimized.
2) Some networks may lack end-to-end visibility, e.g. when a 2) Some networks may lack end-to-end visibility, e.g. when a
certain network, under the control of a given operator, is a transit certain network, under the control of a given operator, is a transit
network for traffic from other networks that are not under the network for traffic from other networks that are not under the
control of the same operator. control of the same operator.
4.1. Differences in LAG vs ECMP 4.1. Differences in LAG vs ECMP
While the mechanisms explained herein are applicable to both LAGs and While the mechanisms explained herein are applicable to both LAGs and
ECMP, it is useful to note that there are some key differences ECMP groups, it is useful to note that there are some key differences
between the two that may impact how effective the mechanism is. This between the two that may impact how effective the mechanism is. This
relates, in part, to the localized information with which the scheme relates, in part, to the localized information with which the scheme
is intended to operate. is intended to operate.
A LAG is almost always between 2 adjacent routers. As a result, the A LAG is usually established across links that are between 2 adjacent
scope of the problem of optimizing the bandwidth utilization on the routers. As a result, the scope of the problem of optimizing the
component links is fairly narrow. It simply involves re-balancing bandwidth utilization on the component links is fairly narrow. It
the load across the component links between these two routers, and simply involves re-balancing the load across the component links
there is no impact whatsoever to other parts of the network. The between these two routers, and there is no impact whatsoever to other
scheme works equally well for unicast and multicast flows. parts of the network. The scheme works equally well for unicast and
multicast flows.
On the other hand, with ECMP, redistributing the load across On the other hand, with ECMP, redistributing the load across
component links that are part of the ECMP group may impact traffic component links that are part of the ECMP group may impact traffic
patterns at all of the nodes that are downstream of the given router patterns at all of the nodes that are downstream of the given router
between itself and the destination. The local optimization may between itself and the destination. The local optimization may
result in congestion at a downstream node. (In its simplest form, an result in congestion at a downstream node. (In its simplest form, an
ECMP group may be used to distribute traffic on component links that ECMP group may be used to distribute traffic on component links that
are between two adjacent routers, and in that case, the ECMP group is are between two adjacent routers, and in that case, the ECMP group is
no different than a LAG for the purpose of this discussion.) no different than a LAG for the purpose of this discussion. It
should be noted that an ECMP component link may itself comprise a
LAG, in which case the scheme may be further applied to the component
links within the LAG.)
+-----+ +-----+
| S1 | | S2 |
+-----+ +-----+
/ \ \ / /\
/ +---------+ / \
/ / \ \ / \
/ / \ +------+ \
/ / \ / \ \
+-----+ +-----+ +-----+
| L1 | | L2 | | L3 |
+-----+ +-----+ +-----+
Figure 3: Two-level fat tree
To demonstrate the limitations of local optimization, consider a two- To demonstrate the limitations of local optimization, consider a two-
level fat-tree topology with three leaf nodes (L1, L2, L3) and two level fat-tree topology with three leaf nodes (L1, L2, L3) and two
spine nodes (S1, S2) and assume all of the links are 10 Gbps. spine nodes (S1, S2) and assume all of the links are 10 Gbps.
+-----+ +-----+
| S1 | | S2 |
+-----+ +-----+
/ \ \ / /\
/ +---------+ / \
/ / \ \ / \
/ / \ +------+ \
/ / \ / \ \
+-----+ +-----+ +-----+
| L1 | | L2 | | L3 |
+-----+ +-----+ +-----+
Figure 3: Two Level Fat-tree
Let L1 have two flows of 4 Gbps each towards L3, and let L2 have one Let L1 have two flows of 4 Gbps each towards L3, and let L2 have one
flow of 7 Gbps also towards L3. If L1 balances the load optimally flow of 7 Gbps also towards L3. If L1 balances the load optimally
between S1 and S2, and L2 sends the flow via S1, then the downlink between S1 and S2, and L2 sends the flow via S1, then the downlink
from S1 to L3 would get congested resulting in packet discards. On from S1 to L3 would get congested resulting in packet discards. On
the other hand, if L1 had sent both its flows towards S1 and L2 had the other hand, if L1 had sent both its flows towards S1 and L2 had
sent its flow towards S2, there would have been no congestion at sent its flow towards S2, there would have been no congestion at
either S1 or S2. either S1 or S2.
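The arithmetic of this example can be checked with a short sketch. The flow rates, spine names, and 10 Gbps link speed come from the example above; the function names and the representation of a placement as (rate, spine) pairs are assumptions made for illustration.

```python
LINK_CAPACITY_GBPS = 10  # all links in the example are 10 Gbps


def downlink_load(placements, spine):
    """Total rate (Gbps) that a spine must forward on its downlink to L3."""
    return sum(rate for rate, chosen in placements if chosen == spine)


def congested_spines(placements):
    """Spines whose downlink to L3 is offered more than the link capacity."""
    spines = {chosen for _, chosen in placements}
    return sorted(
        s for s in spines if downlink_load(placements, s) > LINK_CAPACITY_GBPS
    )


# Each entry is (flow rate in Gbps, spine chosen by the sending leaf).
# L1 balances its two 4 Gbps flows across S1 and S2; L2 sends 7 Gbps via S1.
local_optimum = [(4, "S1"), (4, "S2"), (7, "S1")]

# L1 sends both of its flows via S1; L2 sends its flow via S2.
alternative = [(4, "S1"), (4, "S1"), (7, "S2")]

print(congested_spines(local_optimum))  # S1 carries 4 + 7 = 11 Gbps
print(congested_spines(alternative))    # S1 carries 8 Gbps, S2 carries 7 Gbps
```

In the first placement the S1 downlink is offered 11 Gbps against a capacity of 10 Gbps, while in the second neither downlink exceeds 8 Gbps, even though L1's own uplinks are less evenly loaded.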
The other issue with applying this scheme to ECMP groups is that it The other issue with applying this scheme to ECMP groups is that it
may not apply equally to unicast and multicast traffic because of the may not apply equally to unicast and multicast traffic because of the
way multicast trees are constructed. way multicast trees are constructed.
4.2. Overview of the mechanism Finally, it is possible for a single physical link to participate as
a component link in multiple ECMP groups, whereas with LAGs, a link
can participate as a component link of only one LAG.
The various steps in achieving optimal LAG/ECMP component link 4.2. Operational Overview
utilization in networks are detailed below:
The various steps in optimizing LAG/ECMP component link utilization
in networks are detailed below:
Step 1) This involves large flow recognition in routers and Step 1) This involves large flow recognition in routers and
maintaining the mapping of the large flow to the component link that maintaining the mapping of the large flow to the component link that
it uses. The recognition of large flows is explained in Section 4.3. it uses. The recognition of large flows is explained in Section 4.3.
Step 2) The egress component links are periodically scanned for link Step 2) The egress component links are periodically scanned for link
utilization. If the egress component link utilization exceeds a pre- utilization and the imbalance for the LAG/ECMP group is monitored. If
programmed threshold, an operator alert is generated. Information the imbalance exceeds a certain imbalance threshold, then re-
about the large flows mapped to the congested egress component link balancing is triggered. Measurement of the imbalance is discussed
is exported to a central management entity. further in 5.1. Additional criteria may also be used to determine
whether or not to trigger rebalancing, such as the maximum
Step 3) On receiving the alert about the congested component link, utilization of any of the component links, in addition to the
the operator, through a central management entity, finds the large imbalance.
flows mapped to that component link and the LAG/ECMP group to which
the component link belongs.
Step 4) The operator can choose to rebalance the large flows on
lightly loaded component links of the LAG/ECMP group or redistribute
the small flows on the congested link to other component links of the
group. The operator, through a central management entity, can choose
one of the following actions:
1) Indicate specific large flows to rebalance;
2) Have the router decide the best large flows to rebalance;
3) Have the router redistribute all the small flows on the
congested link to other component links in the group.
The central management entity conveys the above information to the Step 3) As a part of rebalancing, the operator can choose to
router. The load re-balancing options are explained in Section 4.4. rebalance the large flows on to lightly loaded component links of the
LAG/ECMP group, redistribute the small flows on the congested link to
other component links of the group, or a combination of both.
Steps 2) to 4) could be automated if desired. All of the steps identified above can be done locally within the
router itself or could involve the use of a central management
entity.
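The rebalancing trigger in Step 2 can be sketched as below. The text above defers the measurement of imbalance to Section 5.1, so the measure used here (the largest deviation of any component link utilization from the speed-weighted mean utilization of the group) is an assumption for illustration only, as are the function names and the optional maximum-utilization check.

```python
def imbalance(utilizations, speeds):
    """Largest deviation of a link's utilization from the group mean.

    utilizations: per-link utilization as a fraction (0.0 to 1.0).
    speeds: per-link speed in any consistent unit (e.g. Gbps).
    The mean is weighted by link speed so that mixed-speed groups are
    handled. This particular measure is an assumption of the sketch.
    """
    total_speed = sum(speeds)
    mean_util = sum(u * b for u, b in zip(utilizations, speeds)) / total_speed
    return max(abs(u - mean_util) for u in utilizations)


def should_rebalance(utilizations, speeds, imbalance_threshold,
                     max_util_threshold=None):
    """Trigger rebalancing when the imbalance exceeds its threshold.

    If max_util_threshold is given, at least one component link must
    also exceed it (the "additional criteria" mentioned in Step 2).
    """
    if imbalance(utilizations, speeds) <= imbalance_threshold:
        return False
    if max_util_threshold is None:
        return True
    return max(utilizations) > max_util_threshold


# Three 10 Gbps links at 90%, 30%, and 30% utilization.
print(should_rebalance([0.9, 0.3, 0.3], [10, 10, 10], imbalance_threshold=0.2))
```

A periodic scan of the egress component links would call a check of this kind once per scan interval, either locally in the router or in a central management entity.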
Providing large flow information to a central management entity Providing large flow information to a central management entity
provides the capability to globally optimize flow distribution as provides the capability to globally optimize flow distribution as
described in Section 4.1. Consider the following example. A router described in Section 4.1. Consider the following example. A router
may have 3 ECMP nexthops that lead down paths P1, P2, and P3. A may have 3 ECMP nexthops that lead down paths P1, P2, and P3. A
couple of hops downstream on P1 may be congested, while P2 and P3 may couple of hops downstream on path P1 there may be a congested link,
be under-utilized, which the local router does not have visibility while paths P2 and P3 may be under-utilized. This is something that
into. With the help of a central management entity, the operator the local router does not have visibility into. With the help of a
could redistribute some of the flows from P1 to P2 and P3 resulting central management entity, the operator could redistribute some of
in a more optimized flow of traffic. the flows from P1 to P2 and/or P3 resulting in a more optimized flow
of traffic.
The techniques described above are especially useful when bundling The mechanisms described above are especially useful when bundling
links of different bandwidths, e.g. 10Gbps and 100Gbps, as links of different bandwidths, e.g. 10 Gbps and 100 Gbps, as
described in [ID.ietf-rtgwg-cl-requirement]. described in [ID.ietf-rtgwg-cl-requirement].
4.3. Large Flow Recognition 4.3. Large Flow Recognition
4.3.1. Flow Identification 4.3.1. Flow Identification
A flow (large flow or small flow) can be defined as a sequence of A flow (large flow or small flow) can be defined as a sequence of
packets for which ordered delivery should be maintained. Flows are packets for which ordered delivery should be maintained. Flows are
typically identified using one or more fields from the packet header, typically identified using one or more fields from the packet header,
for example: for example:
skipping to change at page 10, line 28 skipping to change at page 10, line 43
destination port. destination port.
. MPLS Labels. . MPLS Labels.
For tunneling protocols like GRE, VXLAN, NVGRE, STT, etc., flow For tunneling protocols like GRE, VXLAN, NVGRE, STT, etc., flow
identification is possible based on inner and/or outer headers. The identification is possible based on inner and/or outer headers. The
above list is not exhaustive. The mechanisms described in this above list is not exhaustive. The mechanisms described in this
document are agnostic to the fields that are used for flow document are agnostic to the fields that are used for flow
identification. identification.
This definition of flows is consistent with that in IPFIX [RFC 7011]. This method of flow identification is consistent with that of IPFIX
[RFC 7011].
4.3.2. Criteria for Identifying a Large Flow 4.3.2. Criteria for Recognizing a Large Flow
From a bandwidth and time duration perspective, in order to identify From a bandwidth and time duration perspective, in order to recognize
large flows we define an observation interval and observe the large flows we define an observation interval and observe the
bandwidth of the flow over that interval. A flow that exceeds a bandwidth of the flow over that interval. A flow that exceeds a
certain minimum bandwidth threshold over that observation interval certain minimum bandwidth threshold over that observation interval
would be considered a large flow. would be considered a large flow.
The two parameters -- the observation interval, and the minimum The two parameters -- the observation interval, and the minimum
bandwidth threshold over that observation interval -- should be bandwidth threshold over that observation interval -- should be
programmable in a router to facilitate handling of different use programmable to facilitate handling of different use cases and
cases and traffic characteristics. For example, a flow which is at or traffic characteristics. For example, a flow which is at or above 10%
above 10% of link bandwidth for a time period of at least 1 second of link bandwidth for a time period of at least 1 second could be
could be declared a large flow [DevoFlow]. declared a large flow [DevoFlow].
In order to avoid excessive churn in the rebalancing, once a flow has In order to avoid excessive churn in the rebalancing, once a flow has
been recognized as a large flow, it should continue to be recognized been recognized as a large flow, it should continue to be recognized
as a large flow as long as the traffic received during an observation as a large flow for as long as the traffic received during an
interval exceeds some fraction of the bandwidth threshold, for observation interval exceeds some fraction of the bandwidth
example 80% of the bandwidth threshold. threshold, for example 80% of the bandwidth threshold.
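The recognition and maintenance criteria above can be sketched as follows. This is a minimal illustration, not a normative algorithm: the class name, the field names, and the default values (10% of link bandwidth, a 1 second observation interval, and 80% of the threshold for maintenance) are assumptions taken from the examples in the text.

```python
from dataclasses import dataclass


@dataclass
class LargeFlowClassifier:
    """Hysteresis-based large flow recognition for one flow (illustrative)."""
    link_bps: float               # component link speed in bits per second
    interval_s: float = 1.0       # observation interval (example value)
    recognize_frac: float = 0.10  # minimum bandwidth threshold: 10% of link
    maintain_frac: float = 0.80   # maintenance level: 80% of the threshold
    is_large: bool = False

    def observe(self, bytes_in_interval: int) -> bool:
        """Update the state from the bytes seen in one observation interval."""
        rate_bps = bytes_in_interval * 8 / self.interval_s
        threshold_bps = self.recognize_frac * self.link_bps
        if self.is_large:
            # Stay large while the rate exceeds a fraction of the threshold.
            self.is_large = rate_bps > self.maintain_frac * threshold_bps
        else:
            # Become large only when the full threshold is met.
            self.is_large = rate_bps >= threshold_bps
        return self.is_large
```

With a 10 Gbps link, a flow is recognized at 1 Gbps and is only dropped from the large flow set once it falls to 800 Mbps or below, which limits churn around the threshold.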
Various techniques to identify a large flow are described below. Various techniques to recognize a large flow are described below.
4.3.3. Sampling Techniques 4.3.3. Sampling Techniques
A number of routers support sampling techniques such as sFlow [sFlow- A number of routers support sampling techniques such as sFlow [sFlow-
v5, sFlow-LAG], PSAMP [RFC 5475] and NetFlow Sampling [RFC 3954]. v5, sFlow-LAG], PSAMP [RFC 5475] and NetFlow Sampling [RFC 3954].
For the purpose of large flow identification, sampling must be For the purpose of large flow recognition, sampling needs to be
enabled on all of the egress ports in the router where such enabled on all of the egress ports in the router where such
measurements are desired. measurements are desired.
Using sFlow as an example, processing in an sFlow collector will Using sFlow as an example, processing in an sFlow collector will
provide an approximate indication of the large flows mapping to each provide an approximate indication of the large flows mapping to each
of the component links in each LAG/ECMP group. It is possible to of the component links in each LAG/ECMP group. It is possible to
implement this part of the collector function in the control plane of implement this part of the collector function in the control plane of
the router reducing dependence on an external management station, the router reducing dependence on an external management station,
assuming sufficient control plane resources are available. assuming sufficient control plane resources are available.
If egress sampling is not available, ingress sampling can suffice If egress sampling is not available, ingress sampling can suffice
since the central management entity used by the sampling technique since the central management entity used by the sampling technique
typically has multi-node visibility and can use the samples from an typically has multi-node visibility and can use the samples from an
immediately downstream node to make measurements for egress traffic immediately downstream node to make measurements for egress traffic
at the local node. This may not be available if the downstream at the local node.
device is under the control of a different operator, or if the
downstream device does not support sampling. Alternatively, since The option of using ingress sampling for this purpose may not be
sampling techniques require that the sample annotated with the available if the downstream device is under the control of a
packet's egress port information, ingress sampling may suffice. different operator, or if the downstream device does not support
However, this means that sampling would have to be enabled on all sampling.
ports, rather than only on those ports where such monitoring is
desired. Alternatively, since sampling techniques require that the sample be
annotated with the packet's egress port information, ingress sampling
may suffice. However, this means that sampling would have to be
enabled on all ports, rather than only on those ports where such
monitoring is desired. There is one situation in which this approach
may not work. If there are tunnels that originate from the given
router, and if the resulting tunnel comprises the large flow, then
this cannot be deduced from ingress sampling at the given router.
Instead, if egress sampling is unavailable, then ingress sampling
from the downstream router must be used.
To illustrate the use of ingress versus egress sampling, we refer to
Figure 2. Since we are looking at rebalancing flows at R1, we would
need to enable egress sampling on ports (1), (2), and (3) on R1. If
egress sampling is not available, and if R2 is also under the control
of the same administrator, enabling ingress sampling on R2's ports
(1), (2), and (3) would also work, but it would necessitate the
involvement of a central management entity in order for R1 to obtain
large flow information for each of its links. Finally, R1 can enable
ingress sampling only on all of its ports (not just the ports that
are part of the LAG/ECMP group being monitored) and that would
suffice if the sampling technique annotates the samples with the
egress port information.
The advantages and disadvantages of sampling techniques are as The advantages and disadvantages of sampling techniques are as
follows. follows.
Advantages: Advantages:
. Supported in most existing routers. . Supported in most existing routers.
. Requires minimal router resources. . Requires minimal router resources.
skipping to change at page 12, line 30 skipping to change at page 13, line 21
. Large flow detection is offloaded to hardware freeing up . Large flow detection is offloaded to hardware freeing up
software resources and possible dependence on an external software resources and possible dependence on an external
management station. management station.
. As link speeds get higher, sampling rates are typically reduced . As link speeds get higher, sampling rates are typically reduced
to keep the number of samples manageable which places a lower to keep the number of samples manageable which places a lower
bound on the detection time. With automatic hardware bound on the detection time. With automatic hardware
recognition, large flows can be detected in shorter windows on recognition, large flows can be detected in shorter windows on
higher link speeds since every packet is accounted for in higher link speeds since every packet is accounted for in
hardware [NDTM] hardware [NDTM].
Disadvantages: Disadvantages:
. Not supported in many routers. . Such techniques are not supported in many routers.
As mentioned earlier, the observation interval for determining a As mentioned earlier, the observation interval for determining a
large flow and the bandwidth threshold for classifying a flow as a large flow and the bandwidth threshold for classifying a flow as a
large flow should be programmable parameters in a router. large flow should be programmable parameters in a router.
The implementation of automatic hardware recognition of large flows The implementation of automatic hardware recognition of large flows
is vendor dependent and beyond the scope of this document. is vendor dependent and beyond the scope of this document.
4.4. Load Re-balancing Options 4.3.5. Use of More Than One Detection Method
Below are suggested techniques for load re-balancing. Equipment It is possible that a router may have line cards that support a
sampling technique while other line cards support automatic hardware
detection of large flows. As long as there is a way for the router
to reliably determine the mapping of large flows to component links
of a LAG/ECMP group, it is acceptable for the router to use more than
one method for large flow recognition.
4.4. Load Rebalancing Options
Below are suggested techniques for load rebalancing. Equipment
vendors should implement all of these techniques and allow the vendors should implement all of these techniques and allow the
operator to choose one or more techniques based on their operator to choose one or more techniques based on their
applications. applications.
Note that regardless of the method used, perfect re-balancing of Note that regardless of the method used, perfect rebalancing of large
large flows may not be possible since flows arrive and depart at flows may not be possible since flows arrive and depart at different
different times. Also, any flows that are moved from one component times. Also, any flows that are moved from one component link to
link to another may experience momentary packet reordering. another may experience momentary packet reordering.
4.4.1. Alternative Placement of Large Flows 4.4.1. Alternative Placement of Large Flows
Within a LAG/ECMP group, the member component links with least Within a LAG/ECMP group, the member component links with least
average port utilization are identified. Some large flow(s) from the average port utilization are identified. Some large flow(s) from the
heavily loaded component links are then moved to those lightly-loaded heavily loaded component links are then moved to those lightly-loaded
member component links using a PBR rule in the ingress processing member component links using a PBR rule in the ingress processing
element(s) in the routers. element(s) in the routers.
With this approach, only certain large flows are subjected to With this approach, only certain large flows are subjected to
momentary flow re-ordering. momentary flow re-ordering.
When a large flow is moved, this will increase the utilization of the When a large flow is moved, this will increase the utilization of the
link that it is moved to, potentially creating unbalanced utilization link that it is moved to, potentially creating imbalance in the
once again across the link components. Therefore, when moving large utilization once again across the component links. Therefore, when
flows, care must be taken to account for the existing load, and what moving large flows, care must be taken to account for the existing
the future load will be after the large flow has been moved. Further, load, and what the future load will be after the large flow has been
the appearance of new large flows may require a rearrangement of the moved. Further, the appearance of new large flows may require a
placement of existing flows. rearrangement of the placement of existing flows.
Consider a case where there is a LAG comprising 4 10 Gbps component Consider a case where there is a LAG comprising four 10 Gbps
links and there are 4 large flows each of 1 Gbps. These flows are component links and there are four large flows, each of 1 Gbps.
each placed on one of the component links. Subsequently, a 5-th large These flows are each placed on one of the component links.
flow of 2 Gbps is recognized and to maintain equitable load Subsequently, a fifth large flow of 2 Gbps is recognized and to
distribution, it may require placement of one of the existing 1 Gbps maintain equitable load distribution, it may require placement of one
flows to a different component link. And this would still result in of the existing 1 Gbps flows to a different component link. And this
some imbalance in the utilization across the component links. would still result in some imbalance in the utilization across the
component links.
4.4.2. Redistributing Small Flows 4.4.2. Redistributing Small Flows
Some large flows may consume the entire bandwidth of the component Some large flows may consume the entire bandwidth of the component
link(s). In this case, it would be desirable for the small flows to link(s). In this case, it would be desirable for the small flows to
not use the congested component link(s). This can be accomplished in not use the congested component link(s). This can be accomplished in
one of the following ways. one of the following ways.
This method works on some existing router hardware. The idea is to This method works on some existing router hardware. The idea is to
prevent, or reduce the probability, that the small flow hashes into prevent, or reduce the probability, that the small flow hashes into
skipping to change at page 14, line 19 skipping to change at page 15, line 22
reordering. reordering.
4.4.3. Component Link Protection Considerations 4.4.3. Component Link Protection Considerations
If desired, certain component links may be reserved for link If desired, certain component links may be reserved for link
protection. These reserved component links are not used for any flows protection. These reserved component links are not used for any flows
in the absence of any failures. In the case when the component in the absence of any failures. In the case when the component
link(s) fail, all the flows on the failed component link(s) are moved link(s) fail, all the flows on the failed component link(s) are moved
to the reserved component link(s). The mapping table of large flows to the reserved component link(s). The mapping table of large flows
to component link simply replaces the failed component link with the to component link simply replaces the failed component link with the
reserved link. Likewise, the LAG/ECMP hash table replaces the failed reserved link. Likewise, the LAG/ECMP table replaces the failed
component link with the reserved link. component link with the reserved link.
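The table replacement described above can be sketched as follows. The function name and the representation of the two tables (a dictionary for the large flow mapping and a list of component link IDs for the LAG/ECMP table) are hypothetical and chosen only for illustration.

```python
def fail_over(large_flow_map, lag_ecmp_table, failed_link, reserved_link):
    """Replace a failed component link with the reserved link in both tables.

    large_flow_map: dict mapping a flow identifier to a component link ID.
    lag_ecmp_table: list of component link IDs indexed by hash bucket.
    Returns new copies; the inputs are left unchanged.
    """
    new_map = {
        flow: (reserved_link if link == failed_link else link)
        for flow, link in large_flow_map.items()
    }
    new_table = [
        reserved_link if link == failed_link else link
        for link in lag_ecmp_table
    ]
    return new_map, new_table
```

Because only entries that pointed at the failed link change, flows on the surviving component links are not moved and are not exposed to reordering.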
4.4.4. Load Re-balancing Algorithms 4.4.4. Load Rebalancing Algorithms
Specific algorithms for placement of large flows are out of scope of Specific algorithms for placement of large flows are out of scope of
this document. One possibility is to formulate the problem for large this document. One possibility is to formulate the problem for large
flow placement as the well-known bin-packing problem and make use of flow placement as the well-known bin-packing problem and make use of
the various heuristics that are available for that problem [bin- the various heuristics that are available for that problem [bin-
pack]. pack].
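As one illustration of a bin-packing style heuristic, the sketch below places large flows in decreasing order of bandwidth, each on the component link that would have the lowest utilization after the placement. This is an assumed heuristic for illustration, not an algorithm specified by this document; the function and parameter names are hypothetical. It computes a placement from scratch and does not try to minimize the number of flows moved, which an implementation would likely also weigh because of reordering.

```python
def place_large_flows(flows_bps, links):
    """Greedy placement of large flows onto component links.

    flows_bps: dict mapping a flow ID to its measured rate in bits/s.
    links: dict mapping a link ID to (capacity_bps, existing_load_bps),
           where the existing load is traffic not being placed here.
    Returns (placement, utilization) dictionaries.
    """
    load = {lid: existing for lid, (_, existing) in links.items()}
    placement = {}
    # Largest flows first; ties are broken by flow ID for determinism.
    for fid, rate in sorted(flows_bps.items(), key=lambda kv: (-kv[1], kv[0])):
        # Pick the link with the lowest utilization after adding this flow.
        best = min(links, key=lambda lid: ((load[lid] + rate) / links[lid][0], lid))
        placement[fid] = best
        load[best] += rate
    utilization = {lid: load[lid] / links[lid][0] for lid in links}
    return placement, utilization
```

Applied to the example in Section 4.4.1 (four 10 Gbps component links, four 1 Gbps flows and one 2 Gbps flow), this yields loads of 2, 2, 1 and 1 Gbps, so some imbalance remains, as noted there.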
4.4.5. Load Re-Balancing Example 4.4.5. Load Rebalancing Example
Optimal LAG/ECMP component utilization for the use case in Figure 2 Optimizing LAG/ECMP component utilization for the use case in Figure
is depicted below in Figure 4. The large flow rebalancing explained 2 is depicted below in Figure 4. The large flow rebalancing explained
in Section 4.4 is used. The improved link utilization is as follows: in Section 4.4 is used. The improved link utilization is as follows:
. Component link (1) has 3 flows -- 2 small flows and 1 large . Component link (1) has 3 flows -- 2 small flows and 1 large
flow -- and the link utilization is normal. flow -- and the link utilization is normal.
. Component link (2) has 4 flows -- 3 small flows and 1 large . Component link (2) has 4 flows -- 3 small flows and 1 large
flow -- and the link utilization is normal now. flow -- and the link utilization is normal now.
. Component link (3) has 3 flows -- 2 small flows and 1 large . Component link (3) has 3 flows -- 2 small flows and 1 large
flow -- and the link utilization is normal now. flow -- and the link utilization is normal now.
skipping to change at page 15, line 23 skipping to change at page 16, line 23
| (R1) | -> | (R2) | | (R1) | -> | (R2) |
| (2)|--------|(2) | | (2)|--------|(2) |
| | | | | | | |
| | -> | | | | -> | |
| | -> | | | | -> | |
| | ===> | | | | ===> | |
| (3)|--------|(3) | | (3)|--------|(3) |
| | | | | | | |
+-----------+ +-----------+ +-----------+ +-----------+
Where: -> small flows Where: -> small flow
===> large flow ===> large flow
Figure 4: Evenly utilized Composite Links Figure 4: Evenly utilized Composite Links
Basically, the use of the mechanisms described in Section 4.4.1 Basically, the use of the mechanisms described in Section 4.4.1
resulted in a rebalancing of flows where one of the large flows on resulted in a rebalancing of flows where one of the large flows on
component link (3) which was previously congested was moved to component link (3) which was previously congested was moved to
component link (2) which was previously under-utilized. component link (2) which was previously under-utilized.
5. Information Model for Flow Re-balancing 5. Information Model for Flow Rebalancing
In order to support flow rebalancing in a router from an external In order to support flow rebalancing in a router from an external
system, the exchange of some information is necessary between the system, the exchange of some information is necessary between the
router and the external system. This section provides an exemplary router and the external system. This section provides an exemplary
information model covering the various components needed for the information model covering the various components needed for the
purpose. The model is intended to be informational and may be used purpose. The model is intended to be informational and may be used
as input for development of a data model. as input for development of a data model.
5.1. Configuration Parameters for Flow Re-balancing 5.1. Configuration Parameters for Flow Rebalancing
The following parameters are required for the configuration of this The following parameters are required for the configuration of this
feature: feature:
. Large flow recognition parameters: . Large flow recognition parameters:
o Observation interval: The observation interval is the time o Observation interval: The observation interval is the time
period in seconds over which the packet arrivals are period in seconds over which the packet arrivals are
observed for the purpose of large flow recognition. observed for the purpose of large flow recognition.
skipping to change at page 16, line 25 skipping to change at page 17, line 25
o Minimum bandwidth threshold for large flow maintenance: The o Minimum bandwidth threshold for large flow maintenance: The
minimum bandwidth threshold for large flow maintenance is minimum bandwidth threshold for large flow maintenance is
used to provide hysteresis for large flow recognition. used to provide hysteresis for large flow recognition.
Once a flow is recognized as a large flow, it continues to Once a flow is recognized as a large flow, it continues to
be recognized as a large flow until it falls below this be recognized as a large flow until it falls below this
threshold. This is also configured as a percentage of link threshold. This is also configured as a percentage of link
speed and is typically lower than the minimum bandwidth speed and is typically lower than the minimum bandwidth
threshold defined above. threshold defined above.
. Imbalance threshold: the difference between the utilization of . Imbalance threshold: A measure of the deviation of the
the least utilized and most utilized component links. Expressed component link utilizations from the utilization of the overall
as a percentage of link speed. LAG/ECMP group. Since component links can be of a different
speed, the imbalance can be computed as follows. Let the
utilization of each component link in a LAG/ECMP group with n
links of speed b_1, b_2, .., b_n, be u_1, u_2, .., u_n. The mean
utilization is computed as u_ave = [ (u_1 x b_1) + (u_2 x b_2) +
.. + (u_n x b_n) ] / [b_1 + b_2 + .. + b_n]. The imbalance is then
computed as max_{i=1..n} | u_i - u_ave | / u_ave.
. Rebalancing interval: the minimum amount of time between . Rebalancing interval: The minimum amount of time between
rebalancing events. This parameter ensures that rebalancing is rebalancing events. This parameter ensures that rebalancing is
not invoked too frequently as it impacts frame ordering. not invoked too frequently as it impacts packet ordering.
These parameters may be configured on a system-wide basis or they may These parameters may be configured on a system-wide basis or they may
apply to an individual LAG. apply to an individual LAG. They may be applied to an ECMP group
provided the component links are not shared with any other ECMP
group.
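The imbalance computation for component links of different speeds can be sketched as follows. The function name is hypothetical, and returning zero for an idle group is an assumption made to avoid a division by zero; the text does not define that case.

```python
def imbalance(utilizations, speeds):
    """Imbalance of a LAG/ECMP group with component links of mixed speeds.

    utilizations: u_1..u_n, each a fraction of its own link speed.
    speeds: b_1..b_n, in any consistent unit (for example Gbps).
    """
    if not speeds or len(utilizations) != len(speeds):
        raise ValueError("need one utilization per component link")
    # Speed-weighted mean utilization of the whole group.
    u_ave = sum(u * b for u, b in zip(utilizations, speeds)) / sum(speeds)
    if u_ave == 0:
        return 0.0  # assumption: an idle group is treated as balanced
    # Largest relative deviation of any component link from the mean.
    return max(abs(u - u_ave) for u in utilizations) / u_ave
```

For example, a 10 Gbps link at 50% and a 100 Gbps link at 25% give a mean utilization of 30/110 and an imbalance of about 0.83.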
5.2. System Configuration and Identification Parameters 5.2. System Configuration and Identification Parameters
The following parameters are useful for router configuration and
operation when using the mechanisms in this document.
. IP address: The IP address of a specific router that the . IP address: The IP address of a specific router that the
feature is being configured on, or that the large flow placement feature is being configured on, or that the large flow placement
is being applied to. is being applied to.
. LAG ID: Identifies the LAG. The LAG ID may be required when . LAG ID: Identifies the LAG on a given router. The LAG ID may be
configuring this feature (to apply a specific set of large flow required when configuring this feature (to apply a specific set
identification parameters to the LAG) and will be required when of large flow identification parameters to the LAG) and will be
specifying flow placement to achieve the desired rebalancing. required when specifying flow placement to achieve the desired
rebalancing.
. Component Link ID: Identifies the component link within a LAG. . Component Link ID: Identifies the component link within a LAG
This is required when specifying flow placement to achieve the or ECMP group. This is required when specifying flow placement
desired rebalancing. to achieve the desired rebalancing.
. Component Link Weight: The relative weight to be applied to
traffic for a given component link when using hash-based
techniques for load distribution.
. ECMP group: Identifies a particular ECMP group. The ECMP group
may be required when configuring this feature (to apply a
specific set of large flow identification parameters to the ECMP
group) and will be required when specifying flow placement to
achieve the desired rebalancing. We note that multiple ECMP
groups can share an overlapping set (or non-overlapping subset)
of component links. This document does not deal with the
complexity of addressing such configurations.
The feature may be configured globally for all LAGs and/or for all
ECMP groups, or it may be configured specifically for a given LAG or
ECMP group.
5.3. Information for Alternative Placement of Large Flows 5.3. Information for Alternative Placement of Large Flows
In cases where large flow recognition is handled by an external In cases where large flow recognition is handled by an external
management station (see Section 4.3.3 ), an information model for management station (see Section 4.3.3), an information model for
flows is required to allow the import of large flow information to flows is required to allow the import of large flow information to
the router. the router.
The following are some of the elements of the information model for The following are some of the elements of the information model for
importing of flows: importing of flows:
. Layer 2: source MAC address, destination MAC address, VLAN ID. . Layer 2: source MAC address, destination MAC address, VLAN ID.
. Layer 3 IP: IP Protocol, IP source address, IP destination . Layer 3 IP: IP Protocol, IP source address, IP destination
address, flow label (IPv6 only), TCP/UDP source port, TCP/UDP address, flow label (IPv6 only), TCP/UDP source port, TCP/UDP
destination port. destination port.
. MPLS Labels. . MPLS Labels.
This list is not exhaustive. For example, with overlay protocols This list is not exhaustive. For example, with overlay protocols
such as VXLAN and NVGRE, fields from the outer and/or inner headers such as VXLAN and NVGRE, fields from the outer and/or inner headers
may be specified. In general, all fields in the packet that can be may be specified. In general, all fields in the packet that can be
used by forwarding decisions should be available for use when used by forwarding decisions should be available for use when
importing flow information from an external management station. importing flow information from an external management station.
The IPFIX information model [RFC 7011] can be leveraged for large The IPFIX information model [RFC 7012] can be leveraged for large
flow identification. The component link ID would be used to specify flow identification.
the target component link for the flow.
Large Flow placement is achieved by specifying the relevant flow
information along with the following:
. For LAG: Router's IP address, LAG ID, LAG component link ID.
. For ECMP: Router's IP address, ECMP group, ECMP component link
ID.
In the case where the ECMP component link itself comprises a LAG, we
would have to specify the parameters for both the ECMP group as well
as the LAG to which the large flow is being directed.
5.4. Information for Redistribution of Small Flows 5.4. Information for Redistribution of Small Flows
For small flows, the LAG ID and the component link IDs along with the Redistribution of small flows is done using the following:
percentage of traffic to be assigned to each component link ID are
required. . For LAG: The LAG ID and the component link IDs along with the
relative weight of traffic to be assigned to each component link
ID are required.
. For ECMP: The ECMP group and the ECMP Nexthop along with the
relative weight of traffic to be assigned to each ECMP Nexthop
are required.
It is possible to have an ECMP nexthop that itself comprises a LAG.
In that case, we would have to specify the new weights for both the
ECMP nexthops within the ECMP group as well as the component links
within the LAG.
5.5. Export of Flow Information 5.5. Export of Flow Information
Exporting large flow information is required when large flow Exporting large flow information is required when large flow
recognition is being done on a router, but the decision to rebalance recognition is being done on a router, but the decision to rebalance
is being made in an external management station. Large flow is being made in an external management station. Large flow
information includes flow identification and the component link ID information includes flow identification and the component link ID
that the flow currently is assigned to. Other information such as that the flow currently is assigned to. Other information such as
flow QoS and bandwidth may be exported too. flow QoS and bandwidth may be exported too.
The IPFIX information model [RFC 7011] can be leveraged for large The IPFIX information model [RFC 7012] can be leveraged for large
flow identification. flow identification.
5.6. Monitoring information 5.6. Monitoring information
5.6.1. Interface (link) utilization 5.6.1. Interface (link) utilization
The incoming bytes (ifInOctets), outgoing bytes (ifOutOctets) and The incoming bytes (ifInOctets), outgoing bytes (ifOutOctets) and
interface speed (ifSpeed) can be measured from the Interface table interface speed (ifSpeed) can be measured from the Interface table
(ifTable) MIB [RFC 1213]. (ifTable) MIB [RFC 1213].
The link utilization can then be computed as follows: The link utilization can then be computed as follows:
Incoming link utilization = (ifInOctets * 8 / ifSpeed) Incoming link utilization = (ifInOctets * 8 / ifSpeed)
Outgoing link utilization = (ifOutOctets * 8 / ifSpeed) Outgoing link utilization = (ifOutOctets * 8 / ifSpeed)
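The utilization formulas above leave the measurement interval implicit. Since ifInOctets and ifOutOctets are cumulative counters and ifSpeed is in bits per second, one reading is that the octet count is the change between two polls divided by the polling interval. The sketch below follows that reading; the function name, the polling model, and the single-wrap handling are assumptions for illustration.

```python
def link_utilization(octets_start, octets_end, interval_s, if_speed_bps,
                     counter_bits=32):
    """Utilization of a link over one polling interval, as a fraction.

    octets_start, octets_end: two readings of ifInOctets or ifOutOctets.
    interval_s: time between the two readings, in seconds.
    if_speed_bps: ifSpeed, in bits per second.
    counter_bits: counter width; the modulo handles at most one wrap.
    """
    if interval_s <= 0 or if_speed_bps <= 0:
        raise ValueError("interval and speed must be positive")
    delta_octets = (octets_end - octets_start) % (2 ** counter_bits)
    return delta_octets * 8 / (interval_s * if_speed_bps)
```

On fast links a 32-bit octet counter can wrap more than once between polls, which this sketch cannot detect; that is one motivation for the high capacity counters mentioned next.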
For high speed links, the etherStatsHighCapacityTable MIB [RFC 3273] For high speed Ethernet links, the etherStatsHighCapacityTable MIB
can be used. [RFC 3273] can be used.
For further scalability, it is recommended to use the counter push For scalability, it is recommended to use the counter push mechanism
mechanism in [sflow-v5] for the interface counters; this would help in [sflow-v5] for the interface counters. Doing so would help avoid
avoid counter polling through the MIB interface. counter polling through the MIB interface.
The outgoing link utilization of the component links within a LAG can The outgoing link utilization of the component links within a
be used to compute the imbalance threshold (See Section 5.1) for the LAG/ECMP group can be used to compute the imbalance (See Section 5.1)
LAG. for the LAG/ECMP group.
5.6.2. Other monitoring information 5.6.2. Other monitoring information
Additional monitoring information that is useful includes: Additional monitoring information that is useful includes:
. Number of times rebalancing was done. . Number of times rebalancing was done.
. Time since the last rebalancing event. . Time since the last rebalancing event.
. The number of large flows currently rebalanced by the scheme. . The number of large flows currently rebalanced by the scheme.
skipping to change at page 19, line 9 skipping to change at page 21, line 12
o the interfaces that the large flows were (re)directed to. o the interfaces that the large flows were (re)directed to.
. The settings for the weights of the interfaces within a . The settings for the weights of the interfaces within a
LAG/ECMP used by the small flows which depend on hashing. LAG/ECMP used by the small flows which depend on hashing.
6. Operational Considerations 6. Operational Considerations
6.1. Rebalancing Frequency 6.1. Rebalancing Frequency
Flows should be re-balanced only when the imbalance in the Flows should be rebalanced only when the imbalance in the utilization
utilization across component links exceeds a certain threshold. across component links exceeds a certain threshold. Frequent
Frequent re-balancing to achieve precise equitable utilization across rebalancing to achieve precise equitable utilization across component
component links could be counter-productive as it may result in links could be counter-productive as it may result in moving flows
moving flows back and forth between the component links impacting back and forth between the component links impacting packet ordering
packet ordering and system stability. This applies regardless of and system stability. This applies regardless of whether large flows
whether large flows or small flows are re-distributed. It should be or small flows are redistributed. It should be noted that reordering
noted that reordering is a concern for TCP flows with even a few is a concern for TCP flows with even a few packets because three out-
packets because three out-of-order packets would trigger sufficient of-order packets would trigger sufficient duplicate ACKs to the
duplicate ACKs to the sender resulting in a retransmission [RFC sender resulting in a retransmission [RFC 5681].
5681].
The operator would have to experiment with various values of the The operator would have to experiment with various values of the
large flow recognition parameters (minimum bandwidth threshold, large flow recognition parameters (minimum bandwidth threshold,
observation interval) and the imbalance threshold across component observation interval) and the imbalance threshold across component
links to tune the solution for their environment. links to tune the solution for their environment.
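The interaction between the imbalance threshold and the rebalancing interval of Section 5.1 can be sketched as a simple gate. The function and parameter names are hypothetical, and the treatment of the first rebalancing event is an assumption.

```python
def should_rebalance(current_imbalance, imbalance_threshold,
                     now_s, last_rebalance_s, rebalancing_interval_s):
    """Decide whether a rebalancing event may be triggered now.

    last_rebalance_s is None if no rebalancing has happened yet.
    """
    # Do nothing while the imbalance is within the configured threshold.
    if current_imbalance <= imbalance_threshold:
        return False
    # Assumption: the first event is not held back by the interval.
    if last_rebalance_s is None:
        return True
    # Enforce the minimum time between rebalancing events.
    return now_s - last_rebalance_s >= rebalancing_interval_s
```

Raising either parameter trades a less even distribution for fewer flow moves, and therefore fewer opportunities for packet reordering.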
6.2. Handling Route Changes
Large flow rebalancing must be aware of any changes to the FIB. In
cases where the nexthop of a route no longer points to the LAG, or
to an ECMP group, any PBR entries added as described in Sections
4.4.1 and 4.4.2 must be withdrawn in order to avoid the creation of
forwarding loops.
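
The withdrawal rule above can be sketched as follows. This is an
illustration under stated assumptions, not an implementation from the
draft: the PbrEntry record, its field names, and the representation
of the FIB as a mapping from prefix to the name of the LAG or ECMP
group are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PbrEntry:
    flow: str     # hypothetical flow identifier
    prefix: str   # route the large flow was matched against
    group: str    # LAG or ECMP group the entry was installed for
    member: str   # component link the flow is pinned to


def entries_to_withdraw(pbr_entries, fib):
    """Return PBR entries whose route no longer points to their group.

    `fib` maps a prefix to the name of its current nexthop group; a
    missing prefix means the route has been removed.
    """
    return [e for e in pbr_entries if fib.get(e.prefix) != e.group]


entries = [
    PbrEntry("flow-a", "10.0.0.0/8", "lag1", "lag1-port2"),
    PbrEntry("flow-b", "192.0.2.0/24", "ecmp7", "ecmp7-path1"),
]

# After a route change, 10.0.0.0/8 points to a different group.
fib = {"10.0.0.0/8": "lag3", "192.0.2.0/24": "ecmp7"}

for entry in entries_to_withdraw(entries, fib):
    print("withdraw", entry.flow)
```

Only the entry for flow-a is withdrawn here, because its route no
longer points to the group whose component link it was pinned to.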
7. IANA Considerations
This memo includes no request to IANA.
8. Security Considerations
This document does not directly impact the security of the Internet
infrastructure or its applications. In fact, it could help if there
is a DOS attack pattern which causes a hash imbalance resulting in
heavy overloading of large flows to certain LAG/ECMP component
links.
9. Contributing Authors
Sanjay Khanna
Cisco Systems
Email: sanjakha@gmail.com
10. Acknowledgements
The authors would like to thank the following individuals for their
review and valuable feedback on earlier versions of this document:
Shane Amante, Fred Baker, Michael Bugenhagen, Zhen Cao, Brian
Carpenter, Benoit Claise, Michael Fargano, Wes George, Sriganesh
Kini, Roman Krzanowski, Andrew Malis, Dave McDysan, Pete Moyer,
Peter Phaal, Dan Romascanu, Curtis Villamizar, Jianrong Wong, George
Yum, and Weifeng Zhang.
11. References
11.1. Normative References
11.2. Informative References
[802.1AX] IEEE Standards Association, "IEEE Std 802.1AX-2008 IEEE
Standard for Local and Metropolitan Area Networks - Link
Aggregation", 2008.
[RFC 3273] Waldbusser, S., "Remote Network Monitoring Management
Information Base for High Capacity Networks," July 2002.
[RFC 3954] Claise, B., "Cisco Systems NetFlow Services Export Version
9," October 2004.
[RFC 5475] Zseby, T. et al., "Sampling and Filtering Techniques for
IP Packet Selection," March 2009.
[RFC 7011] Claise, B. et al., "Specification of the IP Flow
Information Export (IPFIX) Protocol for the Exchange of IP Traffic
Flow Information," September 2013.
[RFC 7012] Claise, B. and B. Trammell, "Information Model for IP Flow
Information Export (IPFIX)," September 2013.
[sFlow-LAG] Phaal, P. and A. Ghanwani, "sFlow LAG counters
structure," http://www.sflow.org/sflow_lag.txt, September 2012.
[sFlow-v5] Phaal, P. and M. Lavine, "sFlow version 5,"
http://www.sflow.org/sflow_version_5.txt, July 2004.
[YONG] Yong, L., "Enhanced ECMP and Large Flow Aware Transport,"
draft-yong-pwe3-enhance-ecmp-lfat-01, September 2010.
[RFC 5681] Allman, M. et al., "TCP Congestion Control," September
2009.
Appendix A. Internet Traffic Analysis and Load Balancing Simulation
Internet traffic [CAIDA] has been analyzed to obtain flow statistics
such as the number of packets in a flow and the flow duration. The
five tuples in the packet header (IP addresses, TCP/UDP Ports, and IP
protocol) are used for flow identification. The analysis indicates
that < ~2% of the flows take ~30% of total traffic volume while the
rest of the flows (> ~98%) contributes ~70% [YONG].
The simulation has shown that given Internet traffic pattern, the