draft-ietf-opsawg-large-flow-load-balancing-15.txt | rfc7424.txt | |||
---|---|---|---|---|
OPSAWG R. Krishnan | ||||
Internet Draft Brocade Communications | ||||
Intended status: Informational L. Yong | ||||
Expires: April 6, 2015 Huawei USA | ||||
A. Ghanwani | ||||
Dell | ||||
Ning So | ||||
Tata Communications | ||||
B. Khasnabish | ||||
ZTE Corporation | ||||
October 7, 2014 | ||||
Mechanisms for Optimizing LAG/ECMP Component Link Utilization in | Internet Engineering Task Force (IETF) R. Krishnan | |||
Networks | Request for Comments: 7424 Brocade Communications | |||
Category: Informational L. Yong | ||||
draft-ietf-opsawg-large-flow-load-balancing-15.txt | ISSN: 2070-1721 Huawei USA | |||
A. Ghanwani | ||||
Dell | ||||
N. So | ||||
Vinci Systems | ||||
B. Khasnabish | ||||
ZTE Corporation | ||||
January 2015 | ||||
Status of this Memo | Mechanisms for Optimizing Link Aggregation Group (LAG) and | |||
Equal-Cost Multipath (ECMP) Component Link Utilization in Networks | ||||
This Internet-Draft is submitted in full conformance with the | Abstract | |||
provisions of BCP 78 and BCP 79. This document may not be modified, | ||||
and derivative works of it may not be created, except to publish it | ||||
as an RFC and to translate it into languages other than English. | ||||
Internet-Drafts are working documents of the Internet Engineering | Demands on networking infrastructure are growing exponentially due to | |||
Task Force (IETF), its areas, and its working groups. Note that | bandwidth-hungry applications such as rich media applications and | |||
other groups may also distribute working documents as Internet- | inter-data-center communications. In this context, it is important | |||
Drafts. | to optimally use the bandwidth in wired networks that extensively use | |||
link aggregation groups and equal-cost multipaths as techniques for | ||||
bandwidth scaling. This document explores some of the mechanisms | ||||
useful for achieving this. | ||||
Internet-Drafts are draft documents valid for a maximum of six months | Status of This Memo | |||
and may be updated, replaced, or obsoleted by other documents at any | ||||
time. It is inappropriate to use Internet-Drafts as reference | ||||
material or to cite them other than as "work in progress." | ||||
The list of current Internet-Drafts can be accessed at | This document is not an Internet Standards Track specification; it is | |||
http://www.ietf.org/ietf/1id-abstracts.txt | published for informational purposes. | |||
The list of Internet-Draft Shadow Directories can be accessed at | This document is a product of the Internet Engineering Task Force | |||
http://www.ietf.org/shadow.html | (IETF). It represents the consensus of the IETF community. It has | |||
received public review and has been approved for publication by the | ||||
Internet Engineering Steering Group (IESG). Not all documents | ||||
approved by the IESG are a candidate for any level of Internet | ||||
Standard; see Section 2 of RFC 5741. | ||||
This Internet-Draft will expire on April 6, 2015. | Information about the current status of this document, any errata, | |||
and how to provide feedback on it may be obtained at | ||||
http://www.rfc-editor.org/info/rfc7424. | ||||
Copyright Notice | Copyright Notice | |||
Copyright (c) 2014 IETF Trust and the persons identified as the | Copyright (c) 2015 IETF Trust and the persons identified as the | |||
document authors. All rights reserved. | document authors. All rights reserved. | |||
This document is subject to BCP 78 and the IETF Trust's Legal | This document is subject to BCP 78 and the IETF Trust's Legal | |||
Provisions Relating to IETF Documents | Provisions Relating to IETF Documents | |||
(http://trustee.ietf.org/license-info) in effect on the date of | (http://trustee.ietf.org/license-info) in effect on the date of | |||
publication of this document. Please review these documents | publication of this document. Please review these documents | |||
carefully, as they describe your rights and restrictions with respect | carefully, as they describe your rights and restrictions with respect | |||
to this document. Code Components extracted from this document must | to this document. Code Components extracted from this document must | |||
include Simplified BSD License text as described in Section 4.e of | include Simplified BSD License text as described in Section 4.e of | |||
the Trust Legal Provisions and are provided without warranty as | the Trust Legal Provisions and are provided without warranty as | |||
described in the Simplified BSD License. | described in the Simplified BSD License. | |||
Abstract | ||||
Demands on networking infrastructure are growing exponentially due to | ||||
bandwidth hungry applications such as rich media applications and | ||||
inter-data center communications. In this context, it is important to | ||||
optimally use the bandwidth in wired networks that extensively use | ||||
link aggregation groups and equal cost multi-paths as techniques for | ||||
bandwidth scaling. This draft explores some of the mechanisms useful | ||||
for achieving this. | ||||
Table of Contents | Table of Contents | |||
1. Introduction...................................................3 | 1. Introduction ....................................................4 | |||
1.1. Acronyms..................................................4 | 1.1. Acronyms ...................................................4 | |||
1.2. Terminology...............................................4 | 1.2. Terminology ................................................5 | |||
2. Flow Categorization............................................5 | 2. Flow Categorization .............................................6 | |||
3. Hash-based Load Distribution in LAG/ECMP.......................6 | 3. Hash-Based Load Distribution in LAG/ECMP ........................6 | |||
4. Mechanisms for Optimizing LAG/ECMP Component Link Utilization..7 | 4. Mechanisms for Optimizing LAG/ECMP Component Link Utilization ...8 | |||
4.1. Differences in LAG vs ECMP................................8 | 4.1. Differences in LAG vs. ECMP ................................9 | |||
4.2. Operational Overview......................................9 | 4.2. Operational Overview ......................................10 | |||
4.3. Large Flow Recognition...................................10 | 4.3. Large Flow Recognition ....................................11 | |||
4.3.1. Flow Identification.................................10 | 4.3.1. Flow Identification ................................11 | |||
4.3.2. Criteria and Techniques for Large Flow Recognition..11 | 4.3.2. Criteria and Techniques for Large Flow | |||
4.3.3. Sampling Techniques.................................11 | Recognition ........................................12 | |||
4.3.4. Inline Data Path Measurement........................13 | 4.3.3. Sampling Techniques ................................12 | |||
4.3.5. Use of Multiple Methods for Large Flow Recognition..14 | 4.3.4. Inline Data Path Measurement .......................14 | |||
4.4. Load Rebalancing Options.................................14 | 4.3.5. Use of Multiple Methods for Large Flow | |||
4.4.1. Alternative Placement of Large Flows................14 | Recognition ........................................15 | |||
4.4.2. Redistributing Small Flows..........................15 | 4.4. Options for Load Rebalancing ..............................15 | |||
4.4.3. Component Link Protection Considerations............15 | 4.4.1. Alternative Placement of Large Flows ...............15 | |||
4.4.4. Load Rebalancing Algorithms.........................15 | 4.4.2. Redistributing Small Flows .........................16 | |||
4.4.5. Load Rebalancing Example............................16 | 4.4.3. Component Link Protection Considerations ...........16 | |||
5. Information Model for Flow Rebalancing........................17 | 4.4.4. Algorithms for Load Rebalancing ....................17 | |||
5.1. Configuration Parameters for Flow Rebalancing............17 | 4.4.5. Example of Load Rebalancing ........................17 | |||
5.2. System Configuration and Identification Parameters.......18 | 5. Information Model for Flow Rebalancing .........................18 | |||
5.3. Information for Alternative Placement of Large Flows.....19 | 5.1. Configuration Parameters for Flow Rebalancing .............18 | |||
5.4. Information for Redistribution of Small Flows............19 | 5.2. System Configuration and Identification Parameters ........19 | |||
5.5. Export of Flow Information...............................20 | 5.3. Information for Alternative Placement of Large Flows ......20 | |||
5.6. Monitoring information...................................20 | 5.4. Information for Redistribution of Small Flows .............21 | |||
5.6.1. Interface (link) utilization........................20 | 5.5. Export of Flow Information ................................21 | |||
5.6.2. Other monitoring information........................21 | 5.6. Monitoring Information ....................................21 | |||
6. Operational Considerations....................................21 | 5.6.1. Interface (Link) Utilization .......................21 | |||
6.1. Rebalancing Frequency....................................21 | 5.6.2. Other Monitoring Information .......................22 | |||
6.2. Handling Route Changes...................................21 | 6. Operational Considerations .....................................23 | |||
6.3. Forwarding Resources.....................................22 | 6.1. Rebalancing Frequency .....................................23 | |||
7. IANA Considerations...........................................22 | 6.2. Handling Route Changes ....................................23 | |||
8. Security Considerations.......................................22 | 6.3. Forwarding Resources ......................................23 | |||
9. Contributing Authors..........................................22 | 7. Security Considerations ........................................23 | |||
10. Acknowledgements.............................................23 | 8. References .....................................................24 | |||
11. References...................................................23 | 8.1. Normative References ......................................24 | |||
11.1. Normative References....................................23 | 8.2. Informative References ....................................25 | |||
11.2. Informative References..................................23 | Appendix A. Internet Traffic Analysis and Load-Balancing | |||
Simulation ...........................................28 | ||||
Acknowledgements ..................................................28 | ||||
Contributors ......................................................28 | ||||
Authors' Addresses ................................................29 | ||||
1. Introduction | 1. Introduction | |||
Networks extensively use link aggregation groups (LAG) [802.1AX] and | Networks extensively use link aggregation groups (LAGs) [802.1AX] and | |||
equal cost multi-paths (ECMP) [RFC 2991] as techniques for capacity | equal-cost multipaths (ECMPs) [RFC2991] as techniques for capacity | |||
scaling. For the problems addressed by this document, network traffic | scaling. For the problems addressed by this document, network | |||
can be predominantly categorized into two traffic types: long-lived | traffic can be predominantly categorized into two traffic types: | |||
large flows and other flows. These other flows, which include long- | long-lived large flows and other flows. These other flows, which | |||
lived small flows, short-lived small flows, and short-lived large | include long-lived small flows, short-lived small flows, and short- | |||
flows, are referred to as "small flows" in this document. Long-lived | lived large flows, are referred to as "small flows" in this document. | |||
large flows are simply referred to as "large flows." | Long-lived large flows are simply referred to as "large flows". | |||
Stateless hash-based techniques [ITCOM, RFC 2991, RFC 2992, RFC 6790] | Stateless hash-based techniques [ITCOM] [RFC2991] [RFC2992] [RFC6790] | |||
are often used to distribute both large flows and small flows over | are often used to distribute both large flows and small flows over | |||
the component links in a LAG/ECMP. However the traffic may not be | the component links in a LAG/ECMP. However, the traffic may not be | |||
evenly distributed over the component links due to the traffic | evenly distributed over the component links due to the traffic | |||
pattern. | pattern. | |||
This draft describes mechanisms for optimizing LAG/ECMP component | This document describes mechanisms for optimizing LAG/ECMP component | |||
link utilization while using hash-based techniques. The mechanisms | link utilization when using hash-based techniques. The mechanisms | |||
comprise the following steps -- recognizing large flows in a router; | comprise the following steps: 1) recognizing large flows in a router, | |||
and assigning the large flows to specific LAG/ECMP component links or | and 2) assigning the large flows to specific LAG/ECMP component links | |||
redistributing the small flows when a component link on the router is | or redistributing the small flows when a component link on the router | |||
congested. | is congested. | |||
It is useful to keep in mind that in typical use cases for this | It is useful to keep in mind that in typical use cases for these | |||
mechanism the large flows are those that consume a significant amount | mechanisms, the large flows consume a significant amount of bandwidth | |||
of bandwidth on a link, e.g. greater than 5% of link bandwidth. The | on a link, e.g., greater than 5% of link bandwidth. The number of | |||
number of such flows would necessarily be fairly small, e.g. on the | such flows would necessarily be fairly small, e.g., on the order of | |||
order of 10's or 100's per LAG/ECMP. In other words, the number of | 10s or 100s per LAG/ECMP. In other words, the number of large flows | |||
large flows is NOT expected to be on the order of millions of flows. | is NOT expected to be on the order of millions of flows. Examples of | |||
Examples of such large flows would be IPsec tunnels in service | such large flows would be IPsec tunnels in service provider backbone | |||
provider backbone networks or storage backup traffic in data center | networks or storage backup traffic in data center networks. | |||
networks. | ||||
1.1. Acronyms | 1.1. Acronyms | |||
DOS: Denial of Service | DoS: Denial of Service | |||
ECMP: Equal Cost Multi-path | ECMP: Equal-Cost Multipath | |||
GRE: Generic Routing Encapsulation | GRE: Generic Routing Encapsulation | |||
LAG: Link Aggregation Group | IPFIX: IP Flow Information Export | |||
MPLS: Multiprotocol Label Switching | LAG: Link Aggregation Group | |||
NVGRE: Network Virtualization using Generic Routing Encapsulation | MPLS: Multiprotocol Label Switching | |||
PBR: Policy Based Routing | NVGRE: Network Virtualization using Generic Routing Encapsulation | |||
PBR: Policy-Based Routing | ||||
QoS: Quality of Service | QoS: Quality of Service | |||
STT: Stateless Transport Tunneling | STT: Stateless Transport Tunneling | |||
TCAM: Ternary Content Addressable Memory | VXLAN: Virtual eXtensible LAN | |||
VXLAN: Virtual Extensible LAN | 1.2. Terminology | |||
1.2. Terminology | Central management entity: | |||
An entity that is capable of monitoring information about link | ||||
utilization and flows in routers across the network and may be | ||||
capable of making traffic-engineering decisions for placement of | ||||
large flows. It may include the functions of a collector | ||||
[RFC7011]. | ||||
Central management entity: Refers to an entity that is capable of | ECMP component link: | |||
monitoring information about link utilization and flows in routers | An individual next hop within an ECMP group. An ECMP component | |||
across the network and may be capable of making traffic engineering | link may itself comprise a LAG. | |||
decisions for placement of large flows. It may include the functions | ||||
of a collector [RFC 7011]. | ||||
ECMP component link: An individual nexthop within an ECMP group. An | ECMP table: | |||
ECMP component link may itself comprise a LAG. | A table that is used as the next hop of an ECMP route that | |||
comprises the set of ECMP component links and the weights | ||||
associated with each of those ECMP component links. The input for | ||||
looking up the table is the hash value for the packet, and the | ||||
weights are used to determine which values of the hash function | ||||
map to a given ECMP component link. | ||||
ECMP table: A table that is used as the nexthop of an ECMP route that | Flow (large or small): | |||
comprises the set of ECMP component links and the weights associated | A sequence of packets for which ordered delivery should be | |||
with each of those ECMP component links. The input for looking up | maintained, e.g., packets belonging to the same TCP connection. | |||
the table is the hash value for the packet, and the weights are used | ||||
to determine which values of the hash function map to a given ECMP | ||||
component link. | ||||
LAG component link: An individual link within a LAG. A LAG component | LAG component link: | |||
link is typically a physical link. | An individual link within a LAG. A LAG component link is | |||
typically a physical link. | ||||
LAG table: A table that is used as the output port which is a LAG | LAG table: | |||
that comprises the set of LAG component links and the weights | A table that is used as the output port, which is a LAG, that | |||
associated with each of those component links. The input for looking | comprises the set of LAG component links and the weights | |||
up the table is the hash value for the packet, and the weights are | associated with each of those component links. The input for | |||
used to determine which values of the hash function map to a given | looking up the table is the hash value for the packet, and the | |||
LAG component link. | weights are used to determine which values of the hash function | |||
map to a given LAG component link. | ||||
Large flow(s): Refers to long-lived large flow(s). | Large flow(s): | |||
Refers to long-lived large flow(s). | ||||
Small flow(s): Refers to any of, or a combination of, long-lived | Small flow(s): | |||
small flow(s), short-lived small flows, and short-lived large | Refers to any of, or a combination of, long-lived small flow(s), | |||
flow(s). | short-lived small flows, and short-lived large flow(s). | |||
2. Flow Categorization | 2. Flow Categorization | |||
In general, based on the size and duration, a flow can be categorized | In general, based on the size and duration, a flow can be categorized | |||
into any one of the following four types, as shown in Figure 1: | into any one of the following four types, as shown in Figure 1: | |||
(a) Short-lived Large Flow (SLLF), | o short-lived large flow (SLLF), | |||
(b) Short-lived Small Flow (SLSF), | ||||
(c) Long-lived Large Flow (LLLF), and | o short-lived small flow (SLSF), | |||
(d) Long-lived Small Flow (LLSF). | ||||
o long-lived large flow (LLLF), and | ||||
o long-lived small flow (LLSF). | ||||
Flow Bandwidth | Flow Bandwidth | |||
^ | ^ | |||
|--------------------|--------------------| | |--------------------|--------------------| | |||
| | | | | | | | |||
Large | SLLF | LLLF | | Large | SLLF | LLLF | | |||
Flow | | | | Flow | | | | |||
|--------------------|--------------------| | |--------------------|--------------------| | |||
| | | | | | | | |||
Small | SLSF | LLSF | | Small | SLSF | LLSF | | |||
Flow | | | | Flow | | | | |||
+--------------------+--------------------+-->Flow Duration | +--------------------+--------------------+-->Flow Duration | |||
Short-lived Long-lived | Short-Lived Long-Lived | |||
Flow Flow | Flow Flow | |||
Figure 1: Flow Categorization | Figure 1: Flow Categorization | |||
In this document, as mentioned earlier, we categorize long-lived | In this document, as mentioned earlier, we categorize long-lived | |||
large flows as "large flows", and all of the others -- long-lived | large flows as "large flows", and all of the others (long-lived small | |||
small flows, short-lived small flows, and short-lived large flows | flows, short-lived small flows, and short-lived large flows) as | |||
as "small flows". | "small flows". | |||
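
The categorization above can be sketched in a few lines of Python. This is only an illustration. The 5% bandwidth fraction echoes the example given in the Introduction, while the duration cutoff, the `Flow` type, and the function names are hypothetical and do not come from the document.

```python
from dataclasses import dataclass

# Assumed thresholds, chosen only for illustration.
LARGE_FRACTION = 0.05       # "large" if the rate exceeds 5% of link bandwidth
LONG_LIVED_SECONDS = 60.0   # hypothetical cutoff; the document gives no value


@dataclass(frozen=True)
class Flow:
    rate_bps: float     # observed flow bandwidth
    duration_s: float   # observed flow duration


def four_way_type(flow: Flow, link_bps: float) -> str:
    """Return the Figure 1 quadrant: SLLF, SLSF, LLLF, or LLSF."""
    large = flow.rate_bps > LARGE_FRACTION * link_bps
    long_lived = flow.duration_s >= LONG_LIVED_SECONDS
    if large:
        return "LLLF" if long_lived else "SLLF"
    return "LLSF" if long_lived else "SLSF"


def document_category(flow: Flow, link_bps: float) -> str:
    """Collapse the four quadrants into the two categories used in the text."""
    return "large flow" if four_way_type(flow, link_bps) == "LLLF" else "small flow"
```

Note that a short-lived large flow falls into the "small flow" category here, which matches the terminology defined above.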
3. Hash-based Load Distribution in LAG/ECMP | 3. Hash-Based Load Distribution in LAG/ECMP | |||
Hash-based techniques are often used for traffic load balancing to | Hash-based techniques are often used for load balancing of traffic to | |||
select among multiple available paths within a LAG/ECMP group. The | select among multiple available paths within a LAG/ECMP group. The | |||
advantages of hash-based techniques for load distribution are the | advantages of hash-based techniques for load distribution are the | |||
preservation of the packet sequence in a flow and the real-time | preservation of the packet sequence in a flow and the real-time | |||
distribution without maintaining per-flow state in the router. Hash- | distribution without maintaining per-flow state in the router. Hash- | |||
based techniques use a combination of fields in the packet's headers | based techniques use a combination of fields in the packet's headers | |||
to identify a flow, and the hash function computed using these fields | to identify a flow, and the hash function computed using these fields | |||
is used to generate a unique number that identifies a link/path in a | is used to generate a unique number that identifies a link/path in a | |||
LAG/ECMP group. The result of the hashing procedure is a many-to-one | LAG/ECMP group. The result of the hashing procedure is a many-to-one | |||
mapping of flows to component links. | mapping of flows to component links. | |||
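
A minimal sketch of stateless hash-based link selection follows. It assumes a 5-tuple flow key and uses CRC32 as the hash function. Both are illustrative choices, since the document does not prescribe particular header fields or a particular hash, and `select_link` is a hypothetical name.

```python
import zlib


def select_link(five_tuple: tuple, num_links: int) -> int:
    """Map a flow's header fields to a component link index.

    No per-flow state is kept: the same header fields always hash to the
    same link, which preserves packet order within a flow.
    """
    key = "|".join(str(field) for field in five_tuple).encode()
    return zlib.crc32(key) % num_links


# Hypothetical flows: (source IP, destination IP, protocol, source port, destination port).
flows = [("10.0.0.1", "10.0.1.1", 6, 40000 + i, 443) for i in range(100)]
links = [select_link(flow, 3) for flow in flows]

# Many-to-one: 100 flows share at most 3 component links.
print({link: links.count(link) for link in sorted(set(links))})
```

The per-link counts printed here depend on the hash and on the traffic mix. Nothing in the selection step accounts for the bandwidth of each flow, which is the root of the imbalance discussed next.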
If the traffic mix constitutes flows such that the result of the hash | Hash-based techniques produce good results with respect to | |||
function across these flows is fairly uniform so that a similar | utilization of the individual component links if: | |||
number of flows is mapped to each component link, if the individual | ||||
flow rates are much smaller as compared to the link capacity, and if | o the traffic mix constitutes flows such that the result of the hash | |||
the rate differences are not dramatic, hash-based techniques produce | function across these flows is fairly uniform so that a similar | |||
good results with respect to utilization of the individual component | number of flows is mapped to each component link, | |||
links. However, if one or more of these conditions are not met, hash- | ||||
based techniques may result in imbalance in the loads on individual | o the individual flow rates are much smaller as compared to the link | |||
capacity, and | ||||
o the differences in flow rates are not dramatic. | ||||
However, if one or more of these conditions are not met, hash-based | ||||
techniques may result in imbalance in the loads on individual | ||||
component links. | component links. | |||
One example is illustrated in Figure 2. In Figure 2, there are two | An example is illustrated in Figure 2. As shown, there are two | |||
routers, R1 and R2, and there is a LAG between them which has 3 | routers, R1 and R2, and there is a LAG between them that has three | |||
component links (1), (2), (3). There are a total of 10 flows that | component links (1), (2), and (3). A total of ten flows need to be | |||
need to be distributed across the links in this LAG. The result of | distributed across the links in this LAG. The result of applying the | |||
applying the hash-based technique is as follows: | hash-based technique is as follows: | |||
. Component link (1) has 3 flows -- 2 small flows and 1 large | o Component link (1) has three flows (two small flows and one large | |||
flow -- and the link utilization is normal. | flow), and the link utilization is normal. | |||
. Component link (2) has 3 flows -- 3 small flows and no large | o Component link (2) has three flows (three small flows and no large | |||
flow -- and the link utilization is light. | flows), and the link utilization is light. | |||
o The absence of any large flow causes the component link | - The absence of any large flow causes the component link to be | |||
under-utilized. | underutilized. | |||
. Component link (3) has 4 flows -- 2 small flows and 2 large | o Component link (3) has four flows (two small flows and two large | |||
flows -- and the link capacity is exceeded resulting in | flows), and the link capacity is exceeded resulting in congestion. | |||
congestion. | ||||
o The presence of 2 large flows causes congestion on this | - The presence of two large flows causes congestion on this | |||
component link. | component link. | |||
+-----------+ -> +-----------+ | +-----------+ -> +-----------+ | |||
| | -> | | | | | -> | | | |||
| | ===> | | | | | ===> | | | |||
| (1)|--------|(1) | | | (1)|--------|(1) | | |||
| | -> | | | | | -> | | | |||
| | -> | | | | | -> | | | |||
| (R1) | -> | (R2) | | | (R1) | -> | (R2) | | |||
| (2)|--------|(2) | | | (2)|--------|(2) | | |||
| | -> | | | | | -> | | | |||
skipping to change at page 7, line 28 | skipping to change at page 8, line 28 | |||
| | | | | | | | | | |||
+-----------+ +-----------+ | +-----------+ +-----------+ | |||
Where: -> small flow | Where: -> small flow | |||
===> large flow | ===> large flow | |||
Figure 2: Unevenly Utilized Component Links | Figure 2: Unevenly Utilized Component Links | |||
This document presents mechanisms for addressing the imbalance in | This document presents mechanisms for addressing the imbalance in | |||
load distribution resulting from commonly used hash-based techniques | load distribution resulting from commonly used hash-based techniques | |||
for LAG/ECMP that were shown in the above example. The mechanisms use | for LAG/ECMP that are shown in the above example. The mechanisms use | |||
large flow awareness to compensate for the imbalance in load | large flow awareness to compensate for the imbalance in load | |||
distribution. | distribution. | |||
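
The imbalance in Figure 2 can be reproduced with simple arithmetic. The flow counts per component link are taken from the example above. The link capacity and the per-flow rates are assumptions made only for this sketch, since the example does not state them.

```python
# Assumed rates (not given in the example), in Mbps.
CAPACITY_MBPS = 10_000
SMALL_MBPS = 500
LARGE_MBPS = 5_000

# (small flows, large flows) per component link, as listed for Figure 2.
placement = {1: (2, 1), 2: (3, 0), 3: (2, 2)}


def link_load(small: int, large: int) -> int:
    """Total offered load on a component link in Mbps."""
    return small * SMALL_MBPS + large * LARGE_MBPS


loads = {link: link_load(*counts) for link, counts in placement.items()}
congested = [link for link, load in loads.items() if load > CAPACITY_MBPS]

for link, load in loads.items():
    print(f"link ({link}): {load} Mbps, {load / CAPACITY_MBPS:.0%} of capacity")
print("congested links:", congested)
```

Under these assumed rates, link (3) carries 11000 Mbps against a capacity of 10000 Mbps while link (2) carries only 1500 Mbps. Moving one large flow from link (3) to link (2) would leave loads of 6000, 6500, and 6000 Mbps, all within capacity.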
4. Mechanisms for Optimizing LAG/ECMP Component Link Utilization | 4. Mechanisms for Optimizing LAG/ECMP Component Link Utilization | |||
The suggested mechanisms in this draft are about a local optimization | The suggested mechanisms in this document are local optimization | |||
solution; they are local in the sense that both the identification of | solutions; they are local in the sense that both the identification | |||
large flows and re-balancing of the load can be accomplished | of large flows and rebalancing of the load can be accomplished | |||
completely within individual nodes in the network without the need | completely within individual routers in the network without the need | |||
for interaction with other nodes. | for interaction with other routers. | |||
This approach may not yield a global optimization of the placement of | This approach may not yield a global optimization of the placement of | |||
large flows across multiple nodes in a network, which may be | large flows across multiple routers in a network, which may be | |||
desirable in some networks. On the other hand, a local approach may | desirable in some networks. On the other hand, a local approach may | |||
be adequate for some environments for the following reasons: | be adequate for some environments for the following reasons: | |||
1) Different links within a network experience different levels of | 1) Different links within a network experience different levels of | |||
utilization and, thus, a "targeted" solution is needed for those hot- | utilization; thus, a "targeted" solution is needed for those hot | |||
spots in the network. An example is the utilization of a LAG between | spots in the network. An example is the utilization of a LAG | |||
two routers that needs to be optimized. | between two routers that needs to be optimized. | |||
2) Some networks may lack end-to-end visibility, e.g. when a | 2) Some networks may lack end-to-end visibility, e.g., when a | |||
certain network, under the control of a given operator, is a transit | certain network, under the control of a given operator, is a | |||
network for traffic from other networks that are not under the | transit network for traffic from other networks that are not | |||
control of the same operator. | under the control of the same operator. | |||
4.1. Differences in LAG vs ECMP | 4.1. Differences in LAG vs. ECMP | |||
While the mechanisms explained herein are applicable to both LAGs and | While the mechanisms explained herein are applicable to both LAGs and | |||
ECMP groups, it is useful to note that there are some key differences | ECMP groups, it is useful to note that there are some key differences | |||
between the two that may impact how effective the mechanism is. This | between the two that may impact how effective the mechanisms are. | |||
relates, in part, to the localized information with which the scheme | This relates, in part, to the localized information with which the | |||
is intended to operate. | mechanisms are intended to operate. | |||
A LAG is usually established across links that are between 2 adjacent | A LAG is usually established across links that are between two | |||
routers. As a result, the scope of problem of optimizing the | adjacent routers. As a result, the scope of the problem of | |||
bandwidth utilization on the component links is fairly narrow. It | optimizing the bandwidth utilization on the component links is fairly | |||
simply involves re-balancing the load across the component links | narrow. It simply involves rebalancing the load across the component | |||
between these two routers, and there is no impact whatsoever to other | links between these two routers, and there is no impact whatsoever to | |||
parts of the network. The scheme works equally well for unicast and | other parts of the network. The scheme works equally well for | |||
multicast flows. | unicast and multicast flows. | |||
On the other hand, with ECMP, redistributing the load across | On the other hand, with ECMP, redistributing the load across | |||
component links that are part of the ECMP group may impact traffic | component links that are part of the ECMP group may impact traffic | |||
patterns at all of the nodes that are downstream of the given router | patterns at all of the routers that are downstream of the given | |||
between itself and the destination. The local optimization may | router between itself and the destination. The local optimization | |||
result in congestion at a downstream node. (In its simplest form, an | may result in congestion at a downstream node. (In its simplest | |||
ECMP group may be used to distribute traffic on component links that | form, an ECMP group may be used to distribute traffic on component | |||
are between two adjacent routers, and in that case, the ECMP group is | links that are between two adjacent routers, and in that case, the | |||
no different than a LAG for the purpose of this discussion. It | ECMP group is no different than a LAG for the purpose of this | |||
should be noted that an ECMP component link may itself comprise a | discussion. It should be noted that an ECMP component link may | |||
LAG, in which case the scheme may be further applied to the component | itself comprise a LAG, in which case the scheme may be further | |||
links within the LAG.) | applied to the component links within the LAG.) | |||
+-----+ +-----+ | ||||
| S1 | | S2 | | ||||
+-----+ +-----+ | ||||
/ \ \ / /\ | ||||
/ +---------+ / \ | ||||
/ / \ \ / \ | ||||
/ / \ +------+ \ | ||||
/ / \ / \ \ | ||||
+-----+ +-----+ +-----+ | ||||
| L1 | | L2 | | L3 | | ||||
+-----+ +-----+ +-----+ | ||||
Figure 3: Two-level Clos Network | ||||
To demonstrate the limitations of local optimization, consider a two- | To demonstrate the limitations of local optimization, consider a two- | |||
level Clos network topology as shown in Figure 3 with three leaf | level Clos network topology as shown in Figure 3 with three leaf | |||
nodes (L1, L2, L3) and two spine nodes (S1, S2). Assume all of the | routers (L1, L2, and L3) and two spine routers (S1 and S2). Assume | |||
links are 10 Gbps. | all of the links are 10 Gbps. | |||
Let L1 have two flows of 4 Gbps each towards L3, and let L2 have one | Let L1 have two flows of 4 Gbps each towards L3, and let L2 have one | |||
flow of 7 Gbps also towards L3. If L1 balances the load optimally | flow of 7 Gbps also towards L3. If L1 balances the load optimally | |||
between S1 and S2, and L2 sends the flow via S1, then the downlink | between S1 and S2, and L2 sends the flow via S1, then the downlink | |||
from S1 to L3 would get congested resulting in packet discards. On | from S1 to L3 would get congested, resulting in packet discards. On | |||
the other hand, if L1 had sent both its flows towards S1 and L2 had | the other hand, if L1 had sent both its flows towards S1 and L2 had | |||
sent its flow towards S2, there would have been no congestion at | sent its flow towards S2, there would have been no congestion at | |||
either S1 or S2. | either S1 or S2. | |||
+-----+ +-----+ | ||||
| S1 | | S2 | | ||||
+-----+ +-----+ | ||||
/ \ \ / /\ | ||||
/ +---------+ / \ | ||||
/ / \ \ / \ | ||||
/ / \ +------+ \ | ||||
/ / \ / \ \ | ||||
+-----+ +-----+ +-----+ | ||||
| L1 | | L2 | | L3 | | ||||
+-----+ +-----+ +-----+ | ||||
Figure 3: Two-Level Clos Network | ||||
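The arithmetic behind this example can be checked with a short sketch. Only the flow sizes and the 10 Gbps link capacity come from the text; the function name and data layout are illustrative assumptions and do not describe any router implementation.

```python
LINK_CAPACITY_GBPS = 10  # all links are 10 Gbps in the example


def downlink_load(placements):
    """Sum the offered load (Gbps) on each spine-to-L3 downlink.

    placements: list of (flow_gbps, spine) pairs for flows destined to L3.
    """
    load = {"S1": 0, "S2": 0}
    for gbps, spine in placements:
        load[spine] += gbps
    return load


# Case 1: L1 splits its two 4 Gbps flows across S1 and S2; L2 uses S1.
local_opt = downlink_load([(4, "S1"), (4, "S2"), (7, "S1")])
# Case 2: L1 sends both of its flows via S1; L2 uses S2.
global_opt = downlink_load([(4, "S1"), (4, "S1"), (7, "S2")])

for name, load in (("local", local_opt), ("global", global_opt)):
    congested = [s for s, g in load.items() if g > LINK_CAPACITY_GBPS]
    print(name, load, "congested downlinks:", congested)
```

In the first case the S1-to-L3 downlink is offered 11 Gbps, which exceeds its capacity; in the second case the downlinks carry 8 Gbps and 7 Gbps.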
The other issue with applying this scheme to ECMP groups is that it | The other issue with applying this scheme to ECMP groups is that it | |||
may not apply equally to unicast and multicast traffic because of the | may not apply equally to unicast and multicast traffic because of the | |||
way multicast trees are constructed. | way multicast trees are constructed. | |||
Finally, it is possible for a single physical link to participate as | Finally, it is possible for a single physical link to participate as | |||
a component link in multiple ECMP groups, whereas with LAGs, a link | a component link in multiple ECMP groups, whereas with LAGs, a link | |||
can participate as a component link of only one LAG. | can participate as a component link of only one LAG. | |||
4.2. Operational Overview | 4.2. Operational Overview | |||
The various steps in optimizing LAG/ECMP component link utilization | The various steps in optimizing LAG/ECMP component link utilization | |||
in networks are detailed below: | in networks are detailed below: | |||
Step 1) This involves large flow recognition in routers and | Step 1: | |||
maintaining the mapping of the large flow to the component link that | This step involves recognizing large flows in routers and | |||
it uses. The recognition of large flows is explained in Section 4.3. | maintaining the mapping for each large flow to the component link | |||
that it uses. Recognition of large flows is explained in Section | ||||
4.3. | ||||
Step 2) The egress component links are periodically scanned for link | Step 2: | |||
utilization and the imbalance for the LAG/ECMP group is monitored. If | The egress component links are periodically scanned for link | |||
the imbalance exceeds a certain imbalance threshold, then re- | utilization, and the imbalance for the LAG/ECMP group is | |||
balancing is triggered. Measurement of the imbalance is discussed | monitored. If the imbalance exceeds a certain threshold, then | |||
further in 5.1. Additional criteria may also be used to determine | rebalancing is triggered. Measurement of the imbalance is | |||
whether or not to trigger rebalancing, such as the maximum | discussed further in Section 5.1. In addition to the imbalance, | |||
utilization of any of the component links, in addition to the | further criteria (such as the maximum utilization of any of the | |||
imbalance. The use of sampling techniques for the measurement of | component links) may also be used to determine whether or not to | |||
egress component link utilization, including the issues of depending | trigger rebalancing. The use of sampling techniques for the | |||
on ingress sampling for these measurements, are discussed in Section | measurement of egress component link utilization, including the | |||
4.3.3. | issues of depending on ingress sampling for these measurements, | |||
are discussed in Section 4.3.3. | ||||
Step 3) As a part of rebalancing, the operator can choose to | Step 3: | |||
rebalance the large flows on to lightly loaded component links of the | As a part of rebalancing, the operator can choose to rebalance the | |||
LAG/ECMP group, redistribute the small flows on the congested link to | large flows by placing them on lightly loaded component links of | |||
other component links of the group, or a combination of both. | the LAG/ECMP group, redistribute the small flows on the congested | |||
link to other component links of the group, or a combination of | ||||
both. | ||||
All of the steps identified above can be done locally within the | All of the steps identified above can be done locally within the | |||
router itself or could involve the use of a central management | router itself or could involve the use of a central management | |||
entity. | entity. | |||
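As a rough illustration of the trigger in Step 2, the sketch below computes an imbalance value for one LAG/ECMP group and decides whether to rebalance. The imbalance measure used here (largest deviation of any link's utilization from the speed-weighted mean) and both threshold values are assumptions made for illustration; the measurement itself is discussed in Section 5.1.

```python
def imbalance(utilization, speed_gbps):
    """Largest deviation of a link's utilization from the group mean.

    utilization: per-link utilization as fractions in [0, 1].
    speed_gbps:  per-link speeds, used to weight the mean.
    """
    total_speed = sum(speed_gbps)
    mean = sum(u * b for u, b in zip(utilization, speed_gbps)) / total_speed
    return max(abs(u - mean) for u in utilization)


def should_rebalance(utilization, speed_gbps,
                     imbalance_threshold=0.2, max_util_threshold=0.8):
    """Trigger only if the group is imbalanced AND some link is heavily used.

    Both thresholds are hypothetical, operator-tunable values.
    """
    return (imbalance(utilization, speed_gbps) > imbalance_threshold
            and max(utilization) > max_util_threshold)


# One link at 90% while the other two sit at 30%: rebalancing is triggered.
print(should_rebalance([0.9, 0.3, 0.3], [10, 10, 10]))
# Evenly loaded group: no rebalancing.
print(should_rebalance([0.5, 0.5, 0.5], [10, 10, 10]))
```

Requiring a high maximum utilization in addition to the imbalance avoids moving flows in a group that is uneven but lightly loaded.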
Providing large flow information to a central management entity | Providing large flow information to a central management entity | |||
provides the capability to globally optimize flow distribution as | provides the capability to globally optimize flow distribution as | |||
described in Section 4.1. Consider the following example. A router | described in Section 4.1. Consider the following example. A router | |||
may have 3 ECMP nexthops that lead down paths P1, P2, and P3. A | may have three ECMP next hops that lead down paths P1, P2, and P3. A | |||
couple of hops downstream on path P1 there may be a congested link, | couple of hops downstream on path P1, there may be a congested link, | |||
while paths P2 and P3 may be under-utilized. This is something that | while paths P2 and P3 may be underutilized. This is something that | |||
the local router does not have visibility into. With the help of a | the local router does not have visibility into. With the help of a | |||
central management entity, the operator could redistribute some of | central management entity, the operator could redistribute some of | |||
the flows from P1 to P2 and/or P3 resulting in a more optimized flow | the flows from P1 to P2 and/or P3, resulting in a more optimized flow | |||
of traffic. | of traffic. | |||
The mechanisms described above are especially useful when bundling | The steps described above are especially useful when bundling links | |||
links of different bandwidths for e.g. 10 Gbps and 100 Gbps as | of different bandwidths, e.g., 10 Gbps and 100 Gbps as described in | |||
described in [ID.ietf-rtgwg-cl-requirement]. | [RFC7226]. | |||
4.3. Large Flow Recognition | 4.3. Large Flow Recognition | |||
4.3.1. Flow Identification | 4.3.1. Flow Identification | |||
A flow (large flow or small flow) can be defined as a sequence of | Flows are typically identified using one or more fields from the | |||
packets for which ordered delivery should be maintained. Flows are | packet header, for example: | |||
typically identified using one or more fields from the packet header, | ||||
for example: | ||||
. Layer 2: Source MAC address, destination MAC address, VLAN ID. | o Layer 2: Source Media Access Control (MAC) address, destination | |||
MAC address, VLAN ID. | ||||
. IP header: IP Protocol, IP source address, IP destination | o IP header: IP protocol, IP source address, IP destination address, | |||
address, flow label (IPv6 only) | flow label (IPv6 only). | |||
. Transport protocol header: Source port number, destination port | o Transport protocol header: Source port number, destination port | |||
number. These apply to protocols such as TCP, UDP, SCTP. | number. These apply to protocols such as TCP, UDP, and the Stream | |||
Control Transmission Protocol (SCTP). | ||||
. MPLS Labels. | o MPLS labels. | |||
For tunneling protocols like Generic Routing Encapsulation (GRE) | For tunneling protocols like Generic Routing Encapsulation (GRE) | |||
[RFC 2784], Virtual eXtensible Local Area Network (VXLAN) [RFC 7348], | [RFC2784], Virtual eXtensible LAN (VXLAN) [RFC7348], Network | |||
Network Virtualization using Generic Routing Encapsulation (NVGRE) | Virtualization using Generic Routing Encapsulation (NVGRE) [NVGRE], | |||
[NVGRE], Stateless Transport Tunneling (STT) [STT], Layer 2 Tunneling | Stateless Transport Tunneling (STT) [STT], Layer 2 Tunneling Protocol | |||
Protocol (L2TP) [RFC 3931], etc., flow identification is possible | (L2TP) [RFC3931], etc., flow identification is possible based on | |||
based on inner and/or outer headers as well as fields introduced by | inner and/or outer headers as well as fields introduced by the tunnel | |||
the tunnel header, as any or all such fields may be used for load | header, as any or all such fields may be used for load balancing | |||
balancing decisions [RFC 5640]. The above list is not exhaustive. | decisions [RFC5640]. | |||
The above list is not exhaustive. | ||||
The mechanisms described in this document are agnostic to the fields | The mechanisms described in this document are agnostic to the fields | |||
that are used for flow identification. | that are used for flow identification. | |||
This method of flow identification is consistent with that of IPFIX | This method of flow identification is consistent with that of IPFIX | |||
[RFC 7011]. | [RFC7011]. | |||
4.3.2. Criteria and Techniques for Large Flow Recognition | 4.3.2. Criteria and Techniques for Large Flow Recognition | |||
From a bandwidth and time duration perspective, in order to recognize | From the perspective of bandwidth and time duration, in order to | |||
large flows we define an observation interval and observe the | recognize large flows, we define an observation interval and measure | |||
bandwidth of the flow over that interval. A flow that exceeds a | the bandwidth of the flow over that interval. A flow that exceeds a | |||
certain minimum bandwidth threshold over that observation interval | certain minimum bandwidth threshold over that observation interval | |||
would be considered a large flow. | would be considered a large flow. | |||
The two parameters -- the observation interval, and the minimum | The two parameters -- the observation interval and the minimum | |||
bandwidth threshold over that observation interval -- should be | bandwidth threshold over that observation interval -- should be | |||
programmable to facilitate handling of different use cases and | programmable to facilitate handling of different use cases and | |||
traffic characteristics. For example, a flow which is at or above 10% | traffic characteristics. For example, a flow that is at or above 10% | |||
of link bandwidth for a time period of at least 1 second could be | of link bandwidth for a time period of at least one second could be | |||
declared a large flow [DevoFlow]. | declared a large flow [DEVOFLOW]. | |||
In order to avoid excessive churn in the rebalancing, once a flow has | In order to avoid excessive churn in the rebalancing, once a flow has | |||
been recognized as a large flow, it should continue to be recognized | been recognized as a large flow, it should continue to be recognized | |||
as a large flow for as long as the traffic received during an | as a large flow for as long as the traffic received during an | |||
observation interval exceeds some fraction of the bandwidth | observation interval exceeds some fraction of the bandwidth | |||
threshold, for example 80% of the bandwidth threshold. | threshold, for example, 80% of the bandwidth threshold. | |||
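The two programmable parameters and the churn-avoidance rule can be sketched as follows. The class and method names are hypothetical; the 10%, one-second, and 80% values are the examples given in the text, not recommended settings.

```python
class LargeFlowClassifier:
    """Classify flows per observation interval, with hysteresis."""

    def __init__(self, link_bps, interval_s=1.0,
                 threshold_pct=10, hold_pct=80):
        self.interval_s = interval_s
        # Minimum bandwidth needed to be declared a large flow.
        self.threshold_bps = link_bps * threshold_pct // 100
        # Lower bar that an already-recognized large flow must stay above.
        self.hold_bps = self.threshold_bps * hold_pct // 100
        self.large = set()

    def observe(self, flow_id, bytes_seen):
        """Record one interval's byte count; return True if the flow is large."""
        rate_bps = bytes_seen * 8 / self.interval_s
        if flow_id in self.large:
            is_large = rate_bps > self.hold_bps
        else:
            is_large = rate_bps >= self.threshold_bps
        if is_large:
            self.large.add(flow_id)
        else:
            self.large.discard(flow_id)
        return is_large


clf = LargeFlowClassifier(link_bps=10_000_000_000)
history = [
    clf.observe("f", 125_000_000),  # 1.0 Gbps: reaches the 10% threshold
    clf.observe("f", 112_500_000),  # 0.9 Gbps: still above 80% of threshold
    clf.observe("f", 87_500_000),   # 0.7 Gbps: drops out
    clf.observe("f", 112_500_000),  # 0.9 Gbps: below threshold, stays small
]
print(history)
```

The same 0.9 Gbps rate is classified differently depending on whether the flow was already recognized as large, which is what limits churn in the rebalancing.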
Various techniques to recognize a large flow are described below. | Various techniques to recognize a large flow are described in | |||
Sections 4.3.3, 4.3.4, and 4.3.5. | ||||
4.3.3. Sampling Techniques | 4.3.3. Sampling Techniques | |||
A number of routers support sampling techniques such as sFlow [sFlow- | A number of routers support sampling techniques such as sFlow | |||
v5, sFlow-LAG], PSAMP [RFC 5475] and NetFlow Sampling [RFC 3954]. | [sFlow-v5] [sFlow-LAG], Packet Sampling (PSAMP) [RFC5475], and | |||
For the purpose of large flow recognition, sampling needs to be | NetFlow Sampling [RFC3954]. For the purpose of large flow | |||
enabled on all of the egress ports in the router where such | recognition, sampling needs to be enabled on all of the egress ports | |||
measurements are desired. | in the router where such measurements are desired. | |||
Using sFlow as an example, processing in a sFlow collector will | Using sFlow as an example, processing in an sFlow collector can | |||
provide an approximate indication of the large flows mapping to each | provide an approximate indication of the mapping of large flows to | |||
of the component links in each LAG/ECMP group. It is possible to | each of the component links in each LAG/ECMP group. Assuming | |||
sufficient control plane resources are available, it is possible to | ||||
implement this part of the collector function in the control plane of | implement this part of the collector function in the control plane of | |||
the router reducing dependence on an external management station, | the router to reduce dependence on a central management entity. | |||
assuming sufficient control plane resources are available. | ||||
If egress sampling is not available, ingress sampling can suffice | If egress sampling is not available, ingress sampling can suffice | |||
since the central management entity used by the sampling technique | since the central management entity used by the sampling technique | |||
typically has multi-node visibility and can use the samples from an | typically has visibility across multiple routers in a network and can | |||
immediately downstream node to make measurements for egress traffic | use the samples from an immediately downstream router to make | |||
at the local node. | measurements for egress traffic at the local router. | |||
The option of using ingress sampling for this purpose may not be | The option of using ingress sampling for this purpose may not be | |||
available if the downstream device is under the control of a | available if the downstream router is under the control of a | |||
different operator, or if the downstream device does not support | different operator or if the downstream device does not support | |||
sampling. | sampling. | |||
Alternatively, since sampling techniques require that the sample be | Alternatively, since sampling techniques require that the sample be | |||
annotated with the packet's egress port information, ingress sampling | annotated with the packet's egress port information, ingress sampling | |||
may suffice. However, this means that sampling would have to be | may suffice. However, this means that sampling would have to be | |||
enabled on all ports, rather than only on those ports where such | enabled on all ports, rather than only on those ports where such | |||
monitoring is desired. There is one situation in which this approach | monitoring is desired. There is one situation in which this approach | |||
may not work. If there are tunnels that originate from the given | may not work. If there are tunnels that originate from the given | |||
router, and if the resulting tunnel comprises the large flow, then | router and if the resulting tunnel comprises the large flow, then | |||
this cannot be deduced from ingress sampling at the given router. | this cannot be deduced from ingress sampling at the given router. | |||
Instead, if egress sampling is unavailable, then ingress sampling | Instead, for this scenario, if egress sampling is unavailable, then | |||
from the downstream router must be used. | ingress sampling from the downstream router must be used. | |||
To illustrate the use of ingress versus egress sampling, we refer to | To illustrate the use of ingress versus egress sampling, we refer to | |||
Figure 2. Since we are looking at rebalancing flows at R1, we would | Figure 2. Since we are looking at rebalancing flows at R1, we would | |||
need to enable egress sampling on ports (1), (2), and (3) on R1. If | need to enable egress sampling on ports (1), (2), and (3) on R1. If | |||
egress sampling is not available, and if R2 is also under the control | egress sampling is not available and if R2 is also under the control | |||
of the same administrator, enabling ingress sampling on R2's ports | of the same administrator, enabling ingress sampling on R2's ports | |||
(1), (2), and (3) would also work, but it would necessitate the | (1), (2), and (3) would also work, but it would necessitate the | |||
involvement of a central management entity in order for R1 to obtain | involvement of a central management entity in order for R1 to obtain | |||
large flow information for each of its links. Finally, R1 can enable | large flow information for each of its links. Finally, R1 can enable | |||
ingress sampling alone on all of its ports (not just the ports that | ingress sampling alone on all of its ports (not just the ports that | |||
are part of the LAG/ECMP group being monitored) and that would | are part of the LAG/ECMP group being monitored), and that would | |||
suffice if the sampling technique annotates the samples with the | suffice if the sampling technique annotates the samples with the | |||
egress port information. | egress port information. | |||
The advantages and disadvantages of sampling techniques are as | The advantages and disadvantages of sampling techniques are as | |||
follows. | follows. | |||
Advantages: | Advantages: | |||
. Supported in most existing routers. | o Supported in most existing routers. | |||
. Requires minimal router resources. | o Requires minimal router resources. | |||
Disadvantages: | Disadvantage: | |||
. In order to minimize the error inherent in sampling, there is a | o In order to minimize the error inherent in sampling, there is a | |||
minimum delay for the recognition time of large flows, and in | minimum delay for the recognition time of large flows, and in the | |||
the time that it takes to react to this information. | time that it takes to react to this information. | |||
With sampling, the detection of large flows can be done on the order | With sampling, the detection of large flows can be done on the order | |||
of one second [DevoFlow]. A discussion on determining the | of one second [DEVOFLOW]. A discussion on determining the | |||
appropriate sampling frequency is available in the following | appropriate sampling frequency is available in [SAMP-BASIC]. | |||
reference [SAMP-BASIC]. | ||||
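A collector can estimate per-flow bandwidth on each egress component link by scaling the sampled bytes by the sampling rate. The sketch below assumes 1-in-N packet sampling and a simplified sample tuple of (flow identifier, egress port, frame size in bytes); this tuple is a hypothetical layout for illustration, not the record format of sFlow, PSAMP, or NetFlow.

```python
from collections import defaultdict


def estimate_rates(samples, sampling_rate, interval_s):
    """Estimate bits per second for each (egress_port, flow_id) pair.

    samples:       iterable of (flow_id, egress_port, frame_bytes)
    sampling_rate: N, where one packet in N is sampled
    interval_s:    length of the observation interval in seconds
    """
    sampled_bytes = defaultdict(int)
    for flow_id, egress_port, frame_bytes in samples:
        sampled_bytes[(egress_port, flow_id)] += frame_bytes
    # Each sample stands in for roughly N packets of the same flow.
    return {key: total * sampling_rate * 8 / interval_s
            for key, total in sampled_bytes.items()}


samples = [("f1", 1, 1500)] * 100 + [("f2", 1, 100)] * 10
rates = estimate_rates(samples, sampling_rate=1000, interval_s=1.0)
for (port, flow), bps in sorted(rates.items()):
    print(f"port {port} flow {flow}: ~{bps / 1e9:.3f} Gbps")
```

The estimate is approximate: with few samples per interval the error is large, which is the reason for the minimum recognition delay noted above.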
4.3.4. Inline Data Path Measurement | 4.3.4. Inline Data Path Measurement | |||
Implementations may perform recognition of large flows by performing | Implementations may perform recognition of large flows by performing | |||
measurements on traffic in the data path of a router. Such an | measurements on traffic in the data path of a router. Such an | |||
approach would be expected to operate at the interface speed on every | approach would be expected to operate at the interface speed on every | |||
interface, accounting for all packets processed by the data path of | interface, accounting for all packets processed by the data path of | |||
the router. An example of such an approach is described in IPFIX | the router. An example of such an approach is described in IPFIX | |||
[RFC 5470]. | [RFC5470]. | |||
Using inline data path measurement, a faster and more accurate | Using inline data path measurement, a faster and more accurate | |||
indication of large flows mapped to each of the component links in a | indication of large flows mapped to each of the component links in a | |||
LAG/ECMP group may be possible (as compared to the sampling-based | LAG/ECMP group may be possible (as compared to the sampling-based | |||
approach). | approach). | |||
The advantages and disadvantages of inline data path measurement are: | The advantages and disadvantages of inline data path measurement are | |||
as follows: | ||||
Advantages: | Advantages: | |||
. As link speeds get higher, sampling rates are typically reduced | o As link speeds get higher, sampling rates are typically reduced to | |||
to keep the number of samples manageable which places a lower | keep the number of samples manageable, which places a lower bound | |||
bound on the detection time. With inline data path measurement, | on the detection time. With inline data path measurement, large | |||
large flows can be recognized in shorter windows on higher link | flows can be recognized in shorter windows on higher link speeds | |||
speeds since every packet is accounted for [NDTM]. | since every packet is accounted for [NDTM]. | |||
. Eliminates the potential dependence on an external management | o Inline data path measurement eliminates the potential dependence | |||
station for large flow recognition. | on a central management entity for large flow recognition. | |||
Disadvantages: | Disadvantage: | |||
. It is more resource intensive in terms of the tables sizes | o Inline data path measurement is more resource intensive in terms | |||
required for monitoring all flows in order to perform the | of the table sizes required for monitoring all flows. | |||
measurement. | ||||
As mentioned earlier, the observation interval for determining a | As mentioned earlier, the observation interval for determining a | |||
large flow and the bandwidth threshold for classifying a flow as a | large flow and the bandwidth threshold for classifying a flow as a | |||
large flow should be programmable parameters in a router. | large flow should be programmable parameters in a router. | |||
The implementation details of inline data path measurement of large | The implementation details of inline data path measurement of large | |||
flows are vendor dependent and beyond the scope of this document. | flows are vendor dependent and beyond the scope of this document. | |||
4.3.5. Use of Multiple Methods for Large Flow Recognition | 4.3.5. Use of Multiple Methods for Large Flow Recognition | |||
It is possible that a router may have line cards that support a | It is possible that a router may have line cards that support a | |||
sampling technique while other line cards support inline data path | sampling technique while other line cards support inline data path | |||
measurement of large flows. As long as there is a way for the router | measurement. As long as there is a way for the router to reliably | |||
to reliably determine the mapping of large flows to component links | determine the mapping of large flows to component links of a LAG/ECMP | |||
of a LAG/ECMP group, it is acceptable for the router to use more than | group, it is acceptable for the router to use more than one method | |||
one method for large flow recognition. | for large flow recognition. | |||
If both methods are supported, inline data path measurement may be | If both methods are supported, inline data path measurement may be | |||
preferable because of its speed of detection [FLOW-ACC]. | preferable because of its speed of detection [FLOW-ACC]. | |||
4.4. Load Rebalancing Options | 4.4. Options for Load Rebalancing | |||
Below are suggested techniques for load balancing. Equipment vendors | The following subsections describe suggested techniques for load | |||
may implement more than one technique, including those not described | balancing. Equipment vendors may implement more than one technique, | |||
in this document, and allow the operator to choose between them. | including those not described in this document, and allow the | |||
operator to choose between them. | ||||
Note that regardless of the method used, perfect rebalancing of large | Note that regardless of the method used, perfect rebalancing of large | |||
flows may not be possible since flows arrive and depart at different | flows may not be possible since flows arrive and depart at different | |||
times. Also, any flows that are moved from one component link to | times. Also, any flows that are moved from one component link to | |||
another may experience momentary packet reordering. | another may experience momentary packet reordering. | |||
4.4.1. Alternative Placement of Large Flows | 4.4.1. Alternative Placement of Large Flows | |||
Within a LAG/ECMP group, the member component links with least | Within a LAG/ECMP group, member component links with the least | |||
average port utilization are identified. Some large flow(s) from the | average link utilization are identified. Some large flow(s) from the | |||
heavily loaded component links are then moved to those lightly-loaded | heavily loaded component links are then moved to those lightly loaded | |||
member component links using a policy-based routing (PBR) rule in the | member component links using a PBR rule in the ingress processing | |||
ingress processing element(s) in the routers. | element(s) in the routers. | |||
With this approach, only certain large flows are subjected to | With this approach, only certain large flows are subjected to | |||
momentary flow re-ordering. | momentary flow reordering. | |||
When a large flow is moved, this will increase the utilization of the | Moving a large flow will increase the utilization of the link that it | |||
link that it moved to potentially creating imbalance in the | is moved to, potentially once again creating an imbalance in the | |||
utilization once again across the component links. Therefore, when | utilization across the component links. Therefore, when moving a | |||
moving large flows, care must be taken to account for the existing | large flow, care must be taken to account for the existing load and | |||
load, and what the future load will be after large flow has been | the future load after the large flow has been moved. Further, the | |||
moved. Further, the appearance of new large flows may require a | appearance of new large flows may require a rearrangement of the | |||
rearrangement of the placement of existing flows. | placement of existing flows. | |||
Consider a case where there is a LAG comprising four 10 Gbps | Consider a case where there is a LAG comprising four 10 Gbps | |||
component links and there are four large flows, each of 1 Gbps. | component links and there are four large flows, each of 1 Gbps. | |||
These flows are each placed on one of the component links. | These flows are each placed on one of the component links. | |||
Subsequent, a fifth large flow of 2 Gbps is recognized and to | Subsequently, a fifth large flow of 2 Gbps is recognized, and to | |||
maintain equitable load distribution, it may require placement of one | maintain equitable load distribution, it may require placement of one | |||
of the existing 1 Gbps flows to a different component link. And this | of the existing 1 Gbps flows to a different component link. This | |||
would still result in some imbalance in the utilization across the | would still result in some imbalance in the utilization across the | |||
component links. | component links. | |||
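The example can be worked through numerically. The sketch below compares two placements of the fifth flow; the link indices and helper name are assumptions made for illustration.

```python
def link_loads(placement, num_links=4):
    """Return the large-flow load (Gbps) per component link.

    placement: list of (flow_gbps, link_index) pairs.
    """
    loads = [0] * num_links
    for gbps, link in placement:
        loads[link] += gbps
    return loads


# Option A: add the 2 Gbps flow to link 0 and move nothing.
option_a = link_loads([(1, 0), (1, 1), (1, 2), (1, 3), (2, 0)])
# Option B: place the 2 Gbps flow on link 0 and move the 1 Gbps flow
# that was on link 0 to link 1.
option_b = link_loads([(1, 1), (1, 1), (1, 2), (1, 3), (2, 0)])

for name, loads in (("A", option_a), ("B", option_b)):
    print(name, loads, "spread:", max(loads) - min(loads))
```

Moving one existing flow reduces the spread between the most and least loaded links from 2 Gbps to 1 Gbps, but 6 Gbps of large flows cannot be divided evenly across four links without splitting a flow, so some imbalance remains.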
4.4.2. Redistributing Small Flows | 4.4.2. Redistributing Small Flows | |||
Some large flows may consume the entire bandwidth of the component | Some large flows may consume the entire bandwidth of the component | |||
link(s). In this case, it would be desirable for the small flows to | link(s). In this case, it would be desirable for the small flows to | |||
not use the congested component link(s). This can be accomplished in | not use the congested component link(s). | |||
one of the following ways. | ||||
This method works on some existing router hardware. The idea is to | o The LAG/ECMP table is modified to include only non-congested | |||
prevent, or reduce the probability, that the small flow hashes into | component link(s). Small flows hash into this table to be mapped | |||
the congested component link(s). | to a destination component link. Alternatively, if certain | |||
component links are heavily loaded but not congested, the output | ||||
of the hash function can be adjusted to account for large flow | ||||
loading on each of the component links. | ||||
. The LAG/ECMP table is modified to include only non-congested | o The PBR rules for large flows (refer to Section 4.4.1) must have | |||
component link(s). Small flows hash into this table to be mapped | strict precedence over the LAG/ECMP table lookup result. | |||
to a destination component link. Alternatively, if certain | ||||
component links are heavily loaded, but not congested, the | ||||
output of the hash function can be adjusted to account for large | ||||
flow loading on each of the component links. | ||||
. The PBR rules for large flows (refer to Section 4.4.1) must | This method works on some existing router hardware. The idea is to | |||
have strict precedence over the LAG/ECMP table lookup result. | prevent, or reduce the probability, that a small flow hashes into the | |||
congested component link(s). | ||||
With this approach the small flows that are moved would be subject to | With this approach, the small flows that are moved would be subject | |||
reordering. | to reordering. | |||
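A minimal sketch of the two points above: the LAG/ECMP table is rebuilt without the congested links, and PBR rules for large flows are consulted before the hash lookup. The CRC-32 hash, the string flow key, and the fallback used when every link is congested are all assumptions chosen for illustration; real hardware uses its own hash function and table format.

```python
import zlib


def flow_key(src_ip, dst_ip, proto, src_port, dst_port):
    """Build a hash key from header fields (hypothetical encoding)."""
    return f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()


def rebuild_table(all_links, congested):
    """Keep only non-congested component links in the LAG/ECMP table."""
    table = [link for link in all_links if link not in congested]
    # Assumption: if every link is congested, fall back to the full group.
    return table or list(all_links)


def select_link(key, lag_table, pbr_rules):
    """PBR rules take strict precedence over the LAG/ECMP table lookup."""
    if key in pbr_rules:
        return pbr_rules[key]
    return lag_table[zlib.crc32(key) % len(lag_table)]


large = flow_key("10.0.0.1", "10.0.1.1", 6, 5000, 443)
pbr_rules = {large: 3}                      # large flow pinned to link 3
table = rebuild_table([1, 2, 3], congested={3})

small_flows = [flow_key("10.0.0.2", "10.0.1.1", 6, port, 80)
               for port in range(1024, 1124)]
chosen = {select_link(k, table, pbr_rules) for k in small_flows}
print("small flows use links:", sorted(chosen))
print("large flow uses link:", select_link(large, table, pbr_rules))
```

Because the table size changes, small flows that previously hashed to a non-congested link may also move, which is one reason the moved flows are subject to reordering.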
4.4.3. Component Link Protection Considerations | 4.4.3. Component Link Protection Considerations | |||
If desired, certain component links may be reserved for link | If desired, certain component links may be reserved for link | |||
protection. These reserved component links are not used for any flows | protection. These reserved component links are not used for any | |||
in the absence of any failures. In the case when the component | flows in the absence of any failures. When there is a failure of one | |||
link(s) fail, all the flows on the failed component link(s) are moved | or more component links, all the flows on the failed component | |||
to the reserved component link(s). The mapping table of large flows | link(s) are moved to the reserved component link(s). The mapping | |||
to component link simply replaces the failed component link with the | table of large flows to component links simply replaces the failed | |||
reserved link. Likewise, the LAG/ECMP table replaces the failed | component link with the reserved component link. Likewise, the | |||
component link with the reserved link. | LAG/ECMP table replaces the failed component link with the reserved | |||
component link. | ||||
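The substitution described above amounts to replacing one link identifier in both tables. In this sketch the table representations (a dictionary for the large flow mapping and a list for the LAG/ECMP table) are assumptions made for illustration.

```python
def fail_over(large_flow_map, lag_table, failed, reserved):
    """Replace a failed component link with the reserved one in both tables.

    large_flow_map: {flow_id: link_id} mapping for large flows
    lag_table:      list of link_ids that small flows hash into
    Returns new copies; the inputs are left unmodified.
    """
    new_map = {flow: (reserved if link == failed else link)
               for flow, link in large_flow_map.items()}
    new_table = [reserved if link == failed else link for link in lag_table]
    return new_map, new_table


flow_map = {"big-1": 1, "big-2": 2, "big-3": 2}
table = [1, 2, 3]          # link 4 is held in reserve and carries no flows
new_map, new_table = fail_over(flow_map, table, failed=2, reserved=4)
print(new_map)
print(new_table)
```

Since the table keeps its size and ordering, flows on the surviving links keep their existing mapping and only flows from the failed link move.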
4.4.4. Load Rebalancing Algorithms | 4.4.4. Algorithms for Load Rebalancing | |||
Specific algorithms for placement of large flows are out of scope of | Specific algorithms for placement of large flows are out of the scope | |||
this document. One possibility is to formulate the problem for large | of this document. One possibility is to formulate the problem for | |||
flow placement as the well-known bin-packing problem and make use of | large flow placement as the well-known bin-packing problem and make | |||
the various heuristics that are available for that problem [bin- | use of the various heuristics that are available for that problem | |||
pack]. | [BIN-PACK]. | |||
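As one illustration of such a heuristic, the sketch below applies worst-fit decreasing: flows are sorted by decreasing rate, and each is assigned to the component link with the most remaining capacity. The choice of this particular heuristic, the function name, and the notion of residual capacity (link capacity left over for large flows) are assumptions; this document does not specify an algorithm.

```python
def worst_fit_decreasing(flows_gbps, residual_gbps):
    """Assign each large flow to the link with the most remaining capacity.

    flows_gbps:    {flow_id: rate in Gbps}
    residual_gbps: {link_id: capacity available for large flows}
    Returns {flow_id: link_id}; raises ValueError if a flow fits nowhere.
    """
    remaining = dict(residual_gbps)
    placement = {}
    # Place the largest flows first; they are the hardest to fit.
    by_rate = sorted(flows_gbps.items(), key=lambda kv: kv[1], reverse=True)
    for flow, rate in by_rate:
        link = max(remaining, key=remaining.get)
        if remaining[link] < rate:
            raise ValueError(f"no component link can carry flow {flow!r}")
        remaining[link] -= rate
        placement[flow] = link
    return placement


placement = worst_fit_decreasing(
    {"a": 6, "b": 5, "c": 4, "d": 3},
    {1: 10, 2: 10},
)
print(placement)
```

A practical implementation would also weigh the cost of moving flows that are already placed, since every move risks momentary reordering.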
4.4.5. Load Rebalancing Example | 4.4.5. Example of Load Rebalancing | |||
Optimizing LAG/ECMP component utilization for the use case in Figure | Optimizing LAG/ECMP component utilization for the use case in Figure | |||
2 is depicted below in Figure 4. The large flow rebalancing explained | 2 is depicted below in Figure 4. The large flow rebalancing | |||
in Section 4.4 is used. The improved link utilization is as follows: | explained in Section 4.4.1 is used. The improved link utilization is | |||
as follows: | ||||
. Component link (1) has 3 flows -- 2 small flows and 1 large | o Component link (1) has three flows (two small flows and one large | |||
flow -- and the link utilization is normal. | flow), and the link utilization is normal. | |||
. Component link (2) has 4 flows -- 3 small flows and 1 large | o Component link (2) has four flows (three small flows and one large | |||
flow -- and the link utilization is normal now. | flow), and the link utilization is normal now. | |||
. Component link (3) has 3 flows -- 2 small flows and 1 large | o Component link (3) has three flows (two small flows and one large | |||
flow -- and the link utilization is normal now. | flow), and the link utilization is normal now. | |||
+-----------+ -> +-----------+ | +-----------+ -> +-----------+ | |||
| | -> | | | | | -> | | | |||
| | ===> | | | | | ===> | | | |||
| (1)|--------|(1) | | | (1)|--------|(1) | | |||
| | | | | | | | | | |||
| | ===> | | | | | ===> | | | |||
| | -> | | | | | -> | | | |||
| | -> | | | | | -> | | | |||
| (R1) | -> | (R2) | | | (R1) | -> | (R2) | | |||
| (2)|--------|(2) | | | (2)|--------|(2) | | |||
| | | | | | | | | | |||
| | -> | | | | | -> | | | |||
| | -> | | | | | -> | | | |||
| | ===> | | | | | ===> | | | |||
| (3)|--------|(3) | | | (3)|--------|(3) | | |||
| | | | | | | | | | |||
+-----------+ +-----------+ | +-----------+ +-----------+ | |||
Where: -> small flow | Where: -> small flow | |||
===> large flow | ===> large flow | |||
Figure 4: Evenly Utilized Composite Links | Figure 4: Evenly Utilized Composite Links | |||
Basically, the use of the mechanisms described in Section 4.4.1 | Basically, the use of the mechanisms described in Section 4.4.1 | |||
resulted in a rebalancing of flows where one of the large flows on | resulted in a rebalancing of flows where one of the large flows on | |||
component link (3) which was previously congested was moved to | component link (3), which was previously congested, was moved to | |||
component link (2) which was previously under-utilized. | component link (2), which was previously underutilized. | |||
5. Information Model for Flow Rebalancing | 5. Information Model for Flow Rebalancing | |||
In order to support flow rebalancing in a router from an external | In order to support flow rebalancing in a router from an external | |||
system, the exchange of some information is necessary between the | system, the exchange of some information is necessary between the | |||
router and the external system. This section provides an exemplary | router and the external system. This section provides an exemplary | |||
information model covering the various components needed for the | information model covering the various components needed for this | |||
purpose. The model is intended to be informational and may be used | purpose. The model is intended to be informational and may be used | |||
as input for development of a data model. | as a guide for the development of a data model. | |||
5.1. Configuration Parameters for Flow Rebalancing | 5.1. Configuration Parameters for Flow Rebalancing | |||
The following parameters are required the configuration of this | The following parameters are required for configuration of this | |||
feature: | feature: | |||
. Large flow recognition parameters: | o Large flow recognition parameters: | |||
o Observation interval: The observation interval is the time | - Observation interval: The observation interval is the time | |||
period in seconds over which the packet arrivals are | period in seconds over which packet arrivals are observed for | |||
observed for the purpose of large flow recognition. | the purpose of large flow recognition. | |||
o Minimum bandwidth threshold: The minimum bandwidth threshold | - Minimum bandwidth threshold: The minimum bandwidth threshold | |||
would be configured as a percentage of link speed and | would be configured as a percentage of link speed and | |||
translated into a number of bytes over the observation | translated into a number of bytes over the observation | |||
interval. A flow for which the number of bytes received, | interval. A flow for which the number of bytes received over a | |||
for a given observation interval, exceeds this number would | given observation interval exceeds this number would be | |||
be recognized as a large flow. | recognized as a large flow. | |||
o Minimum bandwidth threshold for large flow maintenance: The | - Minimum bandwidth threshold for large flow maintenance: The | |||
minimum bandwidth threshold for large flow maintenance is | minimum bandwidth threshold for large flow maintenance is used | |||
used to provide hysteresis for large flow recognition. | to provide hysteresis for large flow recognition. Once a flow | |||
Once a flow is recognized as a large flow, it continues to | is recognized as a large flow, it continues to be recognized as | |||
be recognized as a large flow until it falls below this | a large flow until it falls below this threshold. This is also | |||
threshold. This is also configured as a percentage of link | configured as a percentage of link speed and is typically lower | |||
speed and is typically lower than the minimum bandwidth | than the minimum bandwidth threshold defined above. | |||
threshold defined above. | ||||
. Imbalance threshold: A measure of the deviation of the | o Imbalance threshold: A measure of the deviation of the component | |||
component link utilizations from the utilization of the overall | link utilizations from the utilization of the overall LAG/ECMP | |||
LAG/ECMP group. Since component links can be of a different | group. Since component links can be different speeds, the | |||
speed, the imbalance can be computed as follows. Let the | imbalance can be computed as follows. Let the utilization of each | |||
utilization of each component link in a LAG/ECMP group with n | component link in a LAG/ECMP group with n links of speed b_1, b_2 | |||
links of speed b_1, b_2 .. b_n, be u_1, u_2 .. u_n. The mean | .. b_n be u_1, u_2 .. u_n. The mean utilization is computed as | |||
utilization is computed is u_ave = [ (u_1 x b_1) + (u_2 x b_2) + | ||||
.. + (u_n x b_n) ] / [b_1 + b_2 + .. + b_n]. The imbalance is | ||||
then computed as max_{i=1..n} | u_i - u_ave |. | ||||
. Rebalancing interval: The minimum amount of time between | u_ave = [ (u_1 * b_1) + (u_2 * b_2) + .. + (u_n * b_n) ] / | |||
rebalancing events. This parameter ensures that rebalancing is | [b_1 + b_2 + .. + b_n]. | |||
not invoked too frequently as it impacts packet ordering. | ||||
These parameters may be configured on a system-wide basis or it may | The imbalance is then computed as | |||
apply to an individual LAG. It may be applied to an ECMP group | ||||
provided the component links are not shared with any other ECMP | ||||
group. | ||||
5.2. System Configuration and Identification Parameters | max_{i=1..n} | u_i - u_ave |. | |||
o Rebalancing interval: The minimum amount of time between | ||||
rebalancing events. This parameter ensures that rebalancing is | ||||
not invoked too frequently as it impacts packet ordering. | ||||
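The large flow recognition parameters above can be sketched as follows. This is a minimal illustration, assuming per-flow byte counts are available once per observation interval. The function names and the dictionary of byte counts are hypothetical and are not part of the information model.

```python
def threshold_bytes(percent_of_link_speed, link_speed_bps, interval_s):
    """Translate a percentage of link speed into bytes per interval."""
    return percent_of_link_speed * link_speed_bps * interval_s / 800

def update_large_flows(large, bytes_seen, recognize_bytes, maintain_bytes):
    """Return the set of large flows after one observation interval.

    large: set of flow ids currently recognized as large
    bytes_seen: dict of flow id -> bytes counted in this interval
    recognize_bytes: minimum bandwidth threshold, in bytes
    maintain_bytes: threshold for large flow maintenance, in bytes
    """
    updated = set()
    for flow, count in bytes_seen.items():
        if count > recognize_bytes:
            # Exceeds the recognition threshold.
            updated.add(flow)
        elif flow in large and count >= maintain_bytes:
            # Hysteresis: stays large until it falls below maintenance.
            updated.add(flow)
    return updated
```

A flow that was never recognized as large is not promoted by the lower maintenance threshold. That threshold only delays demotion of flows that are already large.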
These parameters may be configured on a system-wide basis or may | ||||
apply to an individual LAG/ECMP group. They may be applied to an | ||||
ECMP group, provided that the component links are not shared with any | ||||
other ECMP group. | ||||
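The imbalance computation defined above can be written directly from the two formulas. The sketch below assumes utilizations are fractions between 0 and 1 and speeds are in any consistent unit. The function name is illustrative.

```python
def compute_imbalance(utilizations, speeds):
    """Return (imbalance, u_ave) for a LAG/ECMP group.

    utilizations: u_1 .. u_n, utilization of each component link
    speeds: b_1 .. b_n, speed of each component link
    """
    if len(utilizations) != len(speeds) or not speeds:
        raise ValueError("need one utilization per component link")
    # Speed-weighted mean utilization of the group.
    u_ave = sum(u * b for u, b in zip(utilizations, speeds)) / sum(speeds)
    # Largest deviation of any component link from the mean.
    imbalance = max(abs(u - u_ave) for u in utilizations)
    return imbalance, u_ave
```

For example, two links of speed 10 and 40 with utilizations 0.5 and 1.0 give a mean utilization of 0.9 and an imbalance of about 0.4.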
5.2. System Configuration and Identification Parameters | ||||
The following parameters are useful for router configuration and | The following parameters are useful for router configuration and | |||
operation when using the mechanisms in this document. | operation when using the mechanisms in this document. | |||
. IP address: The IP address of a specific router that the | o IP address: The IP address of a specific router that the feature | |||
feature is being configured on, or that the large flow placement | is being configured on or that the large flow placement is being | |||
is being applied to. | applied to. | |||
. LAG ID: Identifies the LAG on a given router. The LAG ID may be | o LAG ID: Identifies the LAG on a given router. The LAG ID may be | |||
required when configuring this feature (to apply a specific set | required when configuring this feature (to apply a specific set of | |||
of large flow identification parameters to the LAG) and will be | large flow identification parameters to the LAG) and will be | |||
required when specifying flow placement to achieve the desired | required when specifying flow placement to achieve the desired | |||
rebalancing. | rebalancing. | |||
. Component Link ID: Identifies the component link within a LAG | o Component Link ID: Identifies the component link within a LAG or | |||
or ECMP group. This is required when specifying flow placement | ECMP group. This is required when specifying flow placement to | |||
to achieve the desired rebalancing. | achieve the desired rebalancing. | |||
. Component Link Weight: The relative weight to be applied to | o Component Link Weight: The relative weight to be applied to | |||
traffic for a given component link when using hash-based | traffic for a given component link when using hash-based | |||
techniques for load distribution. | techniques for load distribution. | |||
. ECMP group: Identifies a particular ECMP group. The ECMP group | o ECMP group: Identifies a particular ECMP group. The ECMP group | |||
may be required when configuring this feature (to apply a | may be required when configuring this feature (to apply a specific | |||
specific set of large flow identification parameters to the ECMP | set of large flow identification parameters to the ECMP group) and | |||
group) and will be required when specifying flow placement to | will be required when specifying flow placement to achieve the | |||
achieve the desired rebalancing. We note that multiple ECMP | desired rebalancing. We note that multiple ECMP groups can share | |||
groups can share an overlapping set (or non-overlapping subset) | an overlapping set (or non-overlapping subset) of component links. | |||
of component links. This document does not deal with the | This document does not deal with the complexity of addressing such | |||
complexity of addressing such configurations. | configurations. | |||
The feature may be configured globally for all LAGs and/or for all | The feature may be configured globally for all LAGs and/or for all | |||
ECMP groups, or it may be configured specifically for a given LAG or | ECMP groups, or it may be configured specifically for a given LAG or | |||
ECMP group. | ECMP group. | |||
5.3. Information for Alternative Placement of Large Flows | 5.3. Information for Alternative Placement of Large Flows | |||
In cases where large flow recognition is handled by an external | In cases where large flow recognition is handled by a central | |||
management station (see Section 4.3.3), an information model for | management entity (see Section 4.3.3), an information model for flows | |||
flows is required to allow the import of large flow information to | is required to allow the import of large flow information to the | |||
the router. | router. | |||
Typical fields use for identifying large flows were discussed in | Typical fields used for identifying large flows were discussed in | |||
Section 4.3.1. The IPFIX information model [RFC 7012] can be | Section 4.3.1. The IPFIX information model [RFC7012] can be | |||
leveraged for large flow identification. | leveraged for large flow identification. | |||
Large Flow placement is achieved by specifying the relevant flow | Large flow placement is achieved by specifying the relevant flow | |||
information along with the following: | information along with the following: | |||
. For LAG: Router's IP address, LAG ID, LAG component link ID. | o For LAG: router's IP address, LAG ID, LAG component link ID. | |||
. For ECMP: Router's IP address, ECMP group, ECMP component link | o For ECMP: router's IP address, ECMP group, ECMP component link ID. | |||
ID. | ||||
In the case where the ECMP component link itself comprises a LAG, we | In the case where the ECMP component link itself comprises a LAG, we | |||
would have to specify the parameters for both the ECMP group as well | would have to specify the parameters for both the ECMP group as well | |||
as the LAG to which the large flow is being directed. | as the LAG to which the large flow is being directed. | |||
5.4. Information for Redistribution of Small Flows | 5.4. Information for Redistribution of Small Flows | |||
Redistribution of small flows is done using the following: | Redistribution of small flows is done using the following: | |||
. For LAG: The LAG ID and the component link IDs along with the | o For LAG: The LAG ID and the component link IDs along with the | |||
relative weight of traffic to be assigned to each component link | relative weight of traffic to be assigned to each component link | |||
ID are required. | ID are required. | |||
. For ECMP: The ECMP group and the ECMP Nexthop along with the | o For ECMP: The ECMP group and the ECMP next hop along with the | |||
relative weight of traffic to be assigned to each ECMP Nexthop | relative weight of traffic to be assigned to each ECMP next hop | |||
are required. | are required. | |||
It is possible to have an ECMP nexthop that itself comprises a LAG. | It is possible to have an ECMP next hop that itself comprises a LAG. | |||
In that case, we would have to specify the new weights for both the | In that case, we would have to specify the new weights for both the | |||
ECMP nexthops within the ECMP group as well as the component links | ECMP component links and the LAG component links. | |||
within the LAG. | ||||
In the case where an ECMP component link itself comprises a LAG, we | In the case where an ECMP component link itself comprises a LAG, we | |||
would have to specify new weights for both the component links within | would have to specify new weights for both the component links within | |||
the ECMP group as well as the component links within the LAG. | the ECMP group as well as the component links within the LAG. | |||
5.5. Export of Flow Information | 5.5. Export of Flow Information | |||
Exporting large flow information is required when large flow | Exporting large flow information is required when large flow | |||
recognition is being done on a router, but the decision to rebalance | recognition is being done on a router but the decision to rebalance | |||
is being made in an external management station. Large flow | is being made in a central management entity. Large flow information | |||
information includes flow identification and the component link ID | includes flow identification and the component link ID that the flow | |||
that the flow currently is assigned to. Other information such as | is currently assigned to. Other information such as flow QoS and | |||
flow QoS and bandwidth may be exported too. | bandwidth may be exported too. | |||
The IPFIX information model [RFC 7012] can be leveraged for large | The IPFIX information model [RFC7012] can be leveraged for large flow | |||
flow identification. | identification. | |||
5.6. Monitoring information | 5.6. Monitoring Information | |||
5.6.1. Interface (link) utilization | 5.6.1. Interface (Link) Utilization | |||
The incoming bytes (ifInOctets), outgoing bytes (ifOutOctets) and | The incoming bytes (ifInOctets), outgoing bytes (ifOutOctets), and | |||
interface speed (ifSpeed) can be obtained, for example, from the | interface speed (ifSpeed) can be obtained, for example, from the | |||
Interface table (iftable) MIB [RFC 1213]. | Interfaces table (ifTable) in the MIB module defined in [RFC1213]. | |||
The link utilization can then be computed as follows: | The link utilization can then be computed as follows: | |||
Incoming link utilization = (delta_ifInOctets * 8) / (ifSpeed * T) | Incoming link utilization = (delta_ifInOctets * 8) / (ifSpeed * T) | |||
Outgoing link utilization = (delta_ifOutOctets * 8) / (ifSpeed * T) | Outgoing link utilization = (delta_ifOutOctets * 8) / (ifSpeed * T) | |||
Where T is the interval over which the utilization is being measured, | Where T is the interval over which the utilization is being measured, | |||
delta_ifInOctets is the change in ifInOctets over that interval, and | delta_ifInOctets is the change in ifInOctets over that interval, and | |||
delta_ifOutOctets is the change in ifOutOctets over that interval. | delta_ifOutOctets is the change in ifOutOctets over that interval. | |||
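The utilization formulas above can be sketched as shown below. The example assumes that the octet counters are 32-bit counters that wrap at 2^32, that at most one wrap occurs between two samples, and that ifSpeed is in bits per second. The function names are illustrative.

```python
COUNTER32_MODULUS = 2 ** 32  # assumption: 32-bit wrapping octet counter

def counter_delta(previous, current, modulus=COUNTER32_MODULUS):
    """Change in a wrapping counter, assuming at most one wrap."""
    return (current - previous) % modulus

def link_utilization(prev_octets, curr_octets, if_speed_bps, interval_s):
    """Utilization = (delta_octets * 8) / (ifSpeed * T)."""
    delta = counter_delta(prev_octets, curr_octets)
    return delta * 8 / (if_speed_bps * interval_s)
```

The same function serves for incoming and outgoing utilization, depending on whether ifInOctets or ifOutOctets samples are supplied. On fast links a 32-bit counter can wrap more than once per interval, which this sketch cannot detect.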
For high speed Ethernet links, the etherStatsHighCapacityTable MIB | For high-speed Ethernet links, the etherStatsHighCapacityTable in the | |||
[RFC 3273] can be used. | MIB module defined in [RFC3273] can be used. | |||
Similar results may be achieved using the corresponding objects of | Similar results may be achieved using the corresponding objects of | |||
other interface management data models such as YANG [RFC 7223] if | other interface management data models such as YANG [RFC7223] if | |||
those are used instead of MIBs. | those are used instead of MIBs. | |||
For scalability, it is recommended to use the counter push mechanism | For scalability, it is recommended to use the counter push mechanism | |||
in [sflow-v5] for the interface counters. Doing so would help avoid | in [sFlow-v5] for the interface counters. Doing so would help avoid | |||
counter polling through the MIB interface. | counter polling through the MIB interface. | |||
The outgoing link utilization of the component links within a | The outgoing link utilization of the component links within a | |||
LAG/ECMP group can be used to compute the imbalance (See Section 5.1) | LAG/ECMP group can be used to compute the imbalance (see Section 5.1) | |||
for the LAG/ECMP group. | for the LAG/ECMP group. | |||
5.6.2. Other monitoring information | 5.6.2. Other Monitoring Information | |||
Additional monitoring information that is useful includes: | Additional monitoring information that is useful includes: | |||
. Number of times rebalancing was done. | o Number of times rebalancing was done. | |||
. Time since the last rebalancing event. | o Time since the last rebalancing event. | |||
. The number of large flows currently rebalanced by the scheme. | o The number of large flows currently rebalanced by the scheme. | |||
. A list of the large flows that have been rebalanced including | o A list of the large flows that have been rebalanced including | |||
o the rate of each large flow at the time of the last | - the rate of each large flow at the time of the last rebalancing | |||
rebalancing for that flow, | for that flow, | |||
o the time that rebalancing was last performed for the given | - the time that rebalancing was last performed for the given | |||
large flow, and | large flow, and | |||
o the interfaces that the large flow was (re)directed to. | - the interfaces that the large flow was (re)directed to. | |||
. The settings for the weights of the interfaces within a | o The settings for the weights of the interfaces within a LAG/ECMP | |||
LAG/ECMP used by the small flows which depend on hashing. | group used by the small flows that depend on hashing. | |||
6. Operational Considerations | 6. Operational Considerations | |||
6.1. Rebalancing Frequency | 6.1. Rebalancing Frequency | |||
Flows should be rebalanced only when the imbalance in the utilization | Flows should be rebalanced only when the imbalance in the utilization | |||
across component links exceeds a certain threshold. Frequent | across component links exceeds a certain threshold. Frequent | |||
rebalancing to achieve precise equitable utilization across component | rebalancing to achieve precise equitable utilization across component | |||
links could be counter-productive as it may result in moving flows | links could be counterproductive as it may result in moving flows | |||
back and forth between the component links impacting packet ordering | back and forth between the component links, impacting packet ordering | |||
and system stability. This applies regardless of whether large flows | and system stability. This applies regardless of whether large flows | |||
or small flows are redistributed. It should be noted that reordering | or small flows are redistributed. It should be noted that reordering | |||
is a concern for TCP flows with even a few packets because three out- | is a concern for TCP flows with even a few packets because three out- | |||
of-order packets would trigger sufficient duplicate ACKs to the | of-order packets would trigger sufficient duplicate ACKs to the | |||
sender resulting in a retransmission [RFC 5681]. | sender, resulting in a retransmission [RFC5681]. | |||
The operator would have to experiment with various values of the | The operator would have to experiment with various values of the | |||
large flow recognition parameters (minimum bandwidth threshold, | large flow recognition parameters (minimum bandwidth threshold, | |||
minimum bandwidth threshold for large flow maintenance, and | ||||
observation interval) and the imbalance threshold across component | observation interval) and the imbalance threshold across component | |||
links to tune the solution for their environment. | links to tune the solution for their environment. | |||
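A rebalancing decision that respects both the imbalance threshold and the rebalancing interval from Section 5.1 might look like the sketch below. The function and parameter names are hypothetical, and timestamps are assumed to be in seconds from a monotonic clock.

```python
def should_rebalance(imbalance, imbalance_threshold,
                     now_s, last_rebalance_s, rebalancing_interval_s):
    """Decide whether to trigger a rebalancing event.

    last_rebalance_s is None if no rebalancing has happened yet.
    """
    if last_rebalance_s is not None:
        if now_s - last_rebalance_s < rebalancing_interval_s:
            # Too soon after the previous event; avoid moving flows
            # back and forth, which would impact packet ordering.
            return False
    return imbalance > imbalance_threshold
```

Checking the interval first means that a persistent imbalance triggers at most one rebalancing event per interval.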
6.2. Handling Route Changes | 6.2. Handling Route Changes | |||
Large flow rebalancing must be aware of any changes to the FIB. In | Large flow rebalancing must be aware of any changes to the Forwarding | |||
cases where the nexthop of a route no longer points to the LAG, or | Information Base (FIB). In cases where the next hop of a route no | |||
to an ECMP group, any PBR entries added as described in Section 4.4.1 | longer points to the LAG or to an ECMP group, any PBR entries | |||
and 4.4.2 must be withdrawn in order to avoid the creation of | added as described in Sections 4.4.1 and 4.4.2 must be withdrawn in | |||
forwarding loops. | order to avoid the creation of forwarding loops. | |||
6.3. Forwarding Resources | 6.3. Forwarding Resources | |||
Hash-based techniques used for load balancing with LAG/ECMP are | Hash-based techniques used for load balancing with LAG/ECMP are | |||
usually stateless. The mechanisms described in this document require | usually stateless. The mechanisms described in this document require | |||
additional resources in the forwarding plane of routers for creating | additional resources in the forwarding plane of routers for creating | |||
PBR rules that are capable of overriding the forwarding decision from | PBR rules that are capable of overriding the forwarding decision from | |||
the hash-based approach. These resources may limit the number of | the hash-based approach. These resources may limit the number of | |||
flows that can be rebalanced and may also impact the latency | flows that can be rebalanced and may also impact the latency | |||
experienced by packets due to the additional lookups that are | experienced by packets due to the additional lookups that are | |||
required. | required. | |||
7. IANA Considerations | 7. Security Considerations | |||
This memo includes no request to IANA. | ||||
8. Security Considerations | ||||
This document does not directly impact the security of the Internet | This document does not directly impact the security of the Internet | |||
infrastructure or its applications. In fact, it could help if there | infrastructure or its applications. In fact, it could help if there | |||
is a DOS attack pattern which causes a hash imbalance resulting in | is a DoS attack pattern that causes a hash imbalance resulting in | |||
heavy overloading of large flows to certain LAG/ECMP component | heavy overloading of large flows to certain LAG/ECMP component links. | |||
links. | ||||
An attacker with knowledge of the large flow recognition algorithm | An attacker with knowledge of the large flow recognition algorithm | |||
and any stateless distribution method can generate flows that are | and any stateless distribution method can generate flows that are | |||
distributed in a way that overloads a specific path. This could be | distributed in a way that overloads a specific path. This could be | |||
used to cause the creation of PBR rules that exhaust the available | used to cause the creation of PBR rules that exhaust the available | |||
rule capacity on nodes. If PBR rules are consequently discarded, | PBR rule capacity on routers in the network. If PBR rules are | |||
this could result in congestion on the attacker-selected path. | consequently discarded, this could result in congestion on the | |||
Alternatively, tracking large numbers of PBR rules could result in | attacker-selected path. Alternatively, tracking large numbers of PBR | |||
performance degradation. | rules could result in performance degradation. | |||
9. Contributing Authors | ||||
Sanjay Khanna | ||||
Cisco Systems | ||||
Email: sanjakha@gmail.com | ||||
10. Acknowledgements | ||||
The authors would like to thank the following individuals for their | ||||
review and valuable feedback on earlier versions of this document: | ||||
Shane Amante, Fred Baker, Michael Bugenhagen, Zhen Cao, Brian | ||||
Carpenter, Benoit Claise, Michael Fargano, Wes George, Sriganesh | ||||
Kini, Roman Krzanowski, Andrew Malis, Dave McDysan, Pete Moyer, | ||||
Peter Phaal, Dan Romascanu, Curtis Villamizar, Jianrong Wong, George | ||||
Yum, and Weifeng Zhang. As a part of the IETF Last Call process, | ||||
valuable comments were received from Martin Thomson and Carlos | ||||
Pignatro. | ||||
11. References | 8. References | |||
11.1. Normative References | 8.1. Normative References | |||
[802.1AX] IEEE Standards Association, "IEEE Std 802.1AX-2008 IEEE | [802.1AX] IEEE, "IEEE Standard for Local and metropolitan area | |||
Standard for Local and Metropolitan Area Networks - Link | networks - Link Aggregation", IEEE Std 802.1AX-2008, | |||
Aggregation", 2008. | 2008. | |||
[RFC 2991] Thaler, D. and C. Hopps, "Multipath Issues in Unicast and | [RFC2991] Thaler, D. and C. Hopps, "Multipath Issues in Unicast | |||
Multicast," November 2000. | and Multicast Next-Hop Selection", RFC 2991, November | |||
2000, <http://www.rfc-editor.org/info/rfc2991>. | ||||
[RFC 7011] Claise, B. et al., "Specification of the IP Flow | [RFC7011] Claise, B., Ed., Trammell, B., Ed., and P. Aitken, | |||
Information Export (IPFIX) Protocol for the Exchange of IP Traffic | "Specification of the IP Flow Information Export (IPFIX) | |||
Flow Information," September 2013. | Protocol for the Exchange of Flow Information", STD 77, | |||
RFC 7011, September 2013, | ||||
<http://www.rfc-editor.org/info/rfc7011>. | ||||
[RFC 7012] Claise, B. and B. Trammell, "Information Model for IP Flow | [RFC7012] Claise, B., Ed., and B. Trammell, Ed., "Information | |||
Information Export (IPFIX)," September 2013. | Model for IP Flow Information Export (IPFIX)", RFC 7012, | |||
September 2013, | ||||
<http://www.rfc-editor.org/info/rfc7012>. | ||||
11.2. Informative References | 8.2. Informative References | |||
[bin-pack] Coffman, Jr., E., M. Garey, and D. Johnson. Approximation | [BIN-PACK] Coffman, Jr., E., Garey, M., and D. Johnson. | |||
Algorithms for Bin-Packing -- An Updated Survey. In Algorithm Design | "Approximation Algorithms for Bin-Packing -- An Updated | |||
for Computer System Design, ed. by Ausiello, Lucertini, and Serafini. | Survey" (in "Algorithm Design for Computer System | |||
Springer-Verlag, 1984. | Design"), Springer, 1984. | |||
[CAIDA] "Caida Internet Traffic Analysis," http://www.caida.org/home. | [CAIDA] "Caida Traffic Analysis Research", | |||
<http://www.caida.org/research/traffic-analysis/>. | ||||
[DevoFlow] Mogul, J., et al., "DevoFlow: Cost-Effective Flow | [DEVOFLOW] Mogul, J., Tourrilhes, J., Yalagandula, P., Sharma, P., | |||
Management for High Performance Enterprise Networks," Proceedings of | Curtis, R., and S. Banerjee, "DevoFlow: Cost-Effective | |||
the ACM SIGCOMM, August 2011. | Flow Management for High Performance Enterprise | |||
Networks", Proceedings of the ACM SIGCOMM, 2010. | ||||
[FLOW-ACC] Zseby, T., et al., "Packet sampling for flow accounting: | [FLOW-ACC] Zseby, T., Hirsch, T., and B. Claise, "Packet Sampling | |||
challenges and limitations," Proceedings of the 9th international | for Flow Accounting: Challenges and Limitations", | |||
conference on Passive and active network measurement, 2008. | Proceedings of the 9th international Passive and Active | |||
Measurement Conference, 2008. | ||||
[ID.ietf-rtgwg-cl-requirement] Villamizar, C. et al., "Requirements | [ITCOM] Jo, J., Kim, Y., Chao, H., and F. Merat, "Internet | |||
for MPLS over a Composite Link," September 2013. | traffic load balancing using dynamic hashing with flow | |||
volume", SPIE ITCOM, 2002. | ||||
[ITCOM] Jo, J., et al., "Internet traffic load balancing using | [NDTM] Estan, C. and G. Varghese, "New Directions in Traffic | |||
dynamic hashing with flow volume," SPIE ITCOM, 2002. | Measurement and Accounting", Proceedings of ACM SIGCOMM, | |||
August 2002. | ||||
[NDTM] Estan, C. and G. Varghese, "New directions in traffic | [NVGRE] Garg, P. and Y. Wang, "NVGRE: Network Virtualization | |||
measurement and accounting," Proceedings of ACM SIGCOMM, August 2002. | using Generic Routing Encapsulation", Work in Progress, | |||
draft-sridharan-virtualization-nvgre-07, November 2014. | ||||
[NVGRE] Sridharan, M. et al., "NVGRE: Network Virtualization using | [RFC2784] Farinacci, D., Li, T., Hanks, S., Meyer, D., and P. | |||
Generic Routing Encapsulation," draft-sridharan-virtualization- | Traina, "Generic Routing Encapsulation (GRE)", RFC 2784, | |||
nvgre-06, January 2015. | March 2000, <http://www.rfc-editor.org/info/rfc2784>. | |||
[RFC 2784] Farinacci, D. et al., "Generic Routing Encapsulation | [RFC6790] Kompella, K., Drake, J., Amante, S., Henderickx, W., and | |||
(GRE)," March 2000. | L. Yong, "The Use of Entropy Labels in MPLS Forwarding", | |||
RFC 6790, November 2012, | ||||
<http://www.rfc-editor.org/info/rfc6790>. | ||||
[RFC 6790] Kompella, K. et al., "The Use of Entropy Labels in MPLS | [RFC1213] McCloghrie, K. and M. Rose, "Management Information Base | |||
Forwarding," November 2012. | for Network Management of TCP/IP-based internets: | |||
MIB-II", STD 17, RFC 1213, March 1991, | ||||
<http://www.rfc-editor.org/info/rfc1213>. | ||||
[RFC 1213] McCloghrie, K., "Management Information Base for Network | [RFC2992] Hopps, C., "Analysis of an Equal-Cost Multi-Path | |||
Management of TCP/IP-based internets: MIB-II," March 1991. | Algorithm", RFC 2992, November 2000, | |||
<http://www.rfc-editor.org/info/rfc2992>. | ||||
[RFC 2992] Hopps, C., "Analysis of an Equal-Cost Multi-Path | [RFC3273] Waldbusser, S., "Remote Network Monitoring Management | |||
Algorithm," November 2000. | Information Base for High Capacity Networks", RFC 3273, | |||
July 2002, <http://www.rfc-editor.org/info/rfc3273>. | ||||
[RFC 3273] Waldbusser, S., "Remote Network Monitoring Management | [RFC3931] Lau, J., Ed., Townsley, M., Ed., and I. Goyret, Ed., | |||
Information Base for High Capacity Networks," July 2002. | "Layer Two Tunneling Protocol - Version 3 (L2TPv3)", RFC | |||
3931, March 2005, | ||||
<http://www.rfc-editor.org/info/rfc3931>. | ||||
[RFC 3931] Lau, J. (Ed.), M. Townsley (Ed.), and I. Goyret (Ed.), | [RFC3954] Claise, B., Ed., "Cisco Systems NetFlow Services Export | |||
"Layer 2 Tunneling Protocol - Version 3," March 2005. | Version 9", RFC 3954, October 2004, | |||
<http://www.rfc-editor.org/info/rfc3954>. | ||||
[RFC 3954] Claise, B., "Cisco Systems NetFlow Services Export Version | [RFC5470] Sadasivan, G., Brownlee, N., Claise, B., and J. Quittek, | |||
9," October 2004. | "Architecture for IP Flow Information Export", RFC 5470, | |||
March 2009, <http://www.rfc-editor.org/info/rfc5470>. | ||||
[RFC 5470] G. Sadasivan et al., "Architecture for IP Flow Information | [RFC5475] Zseby, T., Molina, M., Duffield, N., Niccolini, S., and | |||
Export," March 2009. | F. Raspall, "Sampling and Filtering Techniques for IP | |||
Packet Selection", RFC 5475, March 2009, | ||||
<http://www.rfc-editor.org/info/rfc5475>. | ||||
[RFC 5475] Zseby, T. et al., "Sampling and Filtering Techniques for | [RFC5640] Filsfils, C., Mohapatra, P., and C. Pignataro, "Load- | |||
IP Packet Selection," March 2009. | Balancing for Mesh Softwires", RFC 5640, August 2009, | |||
<http://www.rfc-editor.org/info/rfc5640>. | ||||
[RFC 5640] Filsfils, C., P. Mohapatra, and C. Pignataro, "Load | [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion | |||
Balancing for Mesh Softwires," August 2009. | Control", RFC 5681, September 2009, | |||
<http://www.rfc-editor.org/info/rfc5681>. | ||||
[RFC 5681] Allman, M. et al., "TCP Congestion Control," September | [RFC7223] Bjorklund, M., "A YANG Data Model for Interface | |||
2009. | Management", RFC 7223, May 2014, | |||
<http://www.rfc-editor.org/info/rfc7223>. | ||||
[RFC 7223] Bjorklund, M., "A YANG Data Model for Interface | [RFC7226] Villamizar, C., Ed., McDysan, D., Ed., Ning, S., Malis, | |||
Management," May 2014. | A., and L. Yong, "Requirements for Advanced Multipath in | |||
MPLS Networks", RFC 7226, May 2014, | ||||
<http://www.rfc-editor.org/info/rfc7226>. | ||||
[SAMP-BASIC] Phaal, P. and S. Panchen, "Packet Sampling Basics," | [SAMP-BASIC] Phaal, P. and S. Panchen, "Packet Sampling Basics", | |||
http://www.sflow.org/packetSamplingBasics/. | <http://www.sflow.org/packetSamplingBasics/>. | |||
[sFlow-v5] Phaal, P. and M. Lavine, "sFlow version 5," | [sFlow-v5] Phaal, P. and M. Lavine, "sFlow version 5", July 2004, | |||
http://www.sflow.org/sflow_version_5.txt, July 2004. | <http://www.sflow.org/sflow_version_5.txt>. | |||
[sFlow-LAG] Phaal, P. and A. Ghanwani, "sFlow LAG counters | [sFlow-LAG] Phaal, P. and A. Ghanwani, "sFlow LAG Counters | |||
structure," http://www.sflow.org/sflow_lag.txt, September 2012. | Structure", September 2012, | |||
<http://www.sflow.org/sflow_lag.txt>. | ||||
[STT] Davie, B. (Ed.) and J. Gross, "A Stateless Transport Tunneling | [STT] Davie, B., Ed., and J. Gross, "A Stateless Transport | |||
Protocol for Network Virtualization (STT)," draft-davie-stt-06, March | Tunneling Protocol for Network Virtualization (STT)", | |||
2014. | Work in Progress, draft-davie-stt-06, April 2014. | |||
[RFC 7348] Mahalingam, M. et al., "VXLAN: A Framework for Overlaying | [RFC7348] Mahalingam, M., Dutt, D., Duda, K., Agarwal, P., | |||
Virtualized Layer 2 Networks over Layer 3 Networks," August 2014. | Kreeger, L., Sridhar, T., Bursell, M., and C. Wright, | |||
"Virtual eXtensible Local Area Network (VXLAN): A | ||||
Framework for Overlaying Virtualized Layer 2 Networks | ||||
over Layer 3 Networks", RFC 7348, August 2014, | ||||
<http://www.rfc-editor.org/info/rfc7348>. | ||||
[YONG] Yong, L., "Enhanced ECMP and Large Flow Aware Transport," | [YONG] Yong, L. and P. Yang, "Enhanced ECMP and Large Flow | |||
draft-yong-pwe3-enhance-ecmp-lfat-01, September 2010. | Aware Transport", Work in Progress, | |||
draft-yong-pwe3-enhance-ecmp-lfat-01, March 2010. | ||||
Appendix A. Internet Traffic Analysis and Load Balancing Simulation | Appendix A. Internet Traffic Analysis and Load-Balancing Simulation | |||
Internet traffic [CAIDA] has been analyzed to obtain flow statistics | Internet traffic [CAIDA] has been analyzed to obtain flow statistics | |||
such as the number of packets in a flow and the flow duration. The | such as the number of packets in a flow and the flow duration. The | |||
five tuples in the packet header (IP addresses, TCP/UDP Ports, and IP | 5-tuple in the packet header (IP source address, IP destination | |||
protocol) are used for flow identification. The analysis indicates | address, transport protocol source port number, transport protocol | |||
that < ~2% of the flows take ~30% of total traffic volume while the | destination port number, and IP protocol) is used for flow | |||
rest of the flows (> ~98%) contributes ~70% [YONG]. | identification. The analysis indicates that < ~2% of the flows take | |||
~30% of total traffic volume while the rest of the flows (> ~98%) | ||||
contributes ~70% [YONG]. | ||||
The simulation has shown that given Internet traffic pattern, the | The simulation has shown that, given Internet traffic patterns, the | |||
hash-based technique does not evenly distribute the flows over ECMP | hash-based technique does not evenly distribute flows over ECMP | |||
paths. Some paths may be > 90% loaded while others are < 40% loaded. | paths. Some paths may be > 90% loaded while others are < 40% loaded. | |||
The more ECMP paths exist, the more severe the misbalancing. This | The greater the number of ECMP paths, the more severe is the | |||
implies that hash-based distribution can cause some paths to become | imbalance in the load distribution. This implies that hash-based | |||
congested while other paths are underutilized [YONG]. | distribution can cause some paths to become congested while other | |||
paths are underutilized [YONG]. | ||||
The simulation also shows substantial improvement by using the large | The simulation also shows substantial improvement by using the large | |||
flow-aware hash-based distribution technique described in this | flow-aware, hash-based distribution technique described in this | |||
document. In using the same simulated traffic, the improved | document. In using the same simulated traffic, the improved | |||
rebalancing can achieve < 10% load differences among the paths. It | rebalancing can achieve < 10% load differences among the paths. It | |||
proves how large flow-aware hash-based distribution can effectively | proves how large flow-aware, hash-based distribution can effectively | |||
compensate for the uneven load balancing caused by hashing and the | compensate for the uneven load balancing caused by hashing and the | |||
traffic characteristics [YONG]. | traffic characteristics [YONG]. | |||
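The contrast between plain hash-based distribution and large flow-aware distribution can be illustrated with a minimal sketch. It is not the simulation from [YONG]. The function names, the use of CRC-32 as the hash, the byte-count threshold for classifying a flow as large, and the greedy "place on the least-loaded path" rule are all assumptions chosen for illustration.

```python
import zlib


def hash_path(five_tuple, n_paths):
    """Map a flow's 5-tuple to a path index (CRC-32 is an illustrative choice)."""
    key = "|".join(str(field) for field in five_tuple).encode()
    return zlib.crc32(key) % n_paths


def distribute(flows, n_paths, large_threshold=None):
    """Return the per-path load for a set of flows.

    flows: dict mapping a 5-tuple to that flow's volume (e.g., bytes).
    large_threshold: hypothetical cutoff; flows at or above it are treated
    as large flows and placed explicitly instead of being hashed.
    With large_threshold=None, every flow is simply hashed.
    """
    load = [0] * n_paths
    large = []
    for five_tuple, volume in flows.items():
        if large_threshold is not None and volume >= large_threshold:
            large.append((five_tuple, volume))
        else:
            load[hash_path(five_tuple, n_paths)] += volume
    # Assumed placement rule: biggest large flow first, each one going to
    # the path that currently carries the least load.
    for _, volume in sorted(large, key=lambda item: -item[1]):
        target = min(range(n_paths), key=load.__getitem__)
        load[target] += volume
    return load


def spread(load):
    """Difference between the most and least loaded paths."""
    return max(load) - min(load)
```

Calling distribute(flows, n) and distribute(flows, n, large_threshold=t) on the same flow set and comparing spread() of the two results gives a rough feel for the effect described above. The actual figures depend entirely on the traffic mix and on the hash function used.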
Acknowledgements | ||||
The authors would like to thank the following individuals for their | ||||
review and valuable feedback on earlier versions of this document: | ||||
Shane Amante, Fred Baker, Michael Bugenhagen, Zhen Cao, Brian | ||||
Carpenter, Benoit Claise, Michael Fargano, Wes George, Sriganesh | ||||
Kini, Roman Krzanowski, Andrew Malis, Dave McDysan, Pete Moyer, Peter | ||||
Phaal, Dan Romascanu, Curtis Villamizar, Jianrong Wong, George Yum, | ||||
and Weifeng Zhang. As a part of the IETF Last Call process, valuable | ||||
comments were received from Martin Thomson and Carlos Pignataro. | ||||
Contributors | ||||
Sanjay Khanna | ||||
Cisco Systems | ||||
EMail: sanjakha@gmail.com | ||||
Authors' Addresses | Authors' Addresses | |||
Ram Krishnan | Ram Krishnan | |||
Brocade Communications | Brocade Communications | |||
San Jose, 95134, USA | San Jose, CA 95134 | |||
United States | ||||
Phone: +1-408-406-7890 | Phone: +1-408-406-7890 | |||
Email: ramkri123@gmail.com | EMail: ramkri123@gmail.com | |||
Lucy Yong | Lucy Yong | |||
Huawei USA | Huawei USA | |||
5340 Legacy Drive | 5340 Legacy Drive | |||
Plano, TX 75025, USA | Plano, TX 75025 | |||
United States | ||||
Phone: +1-469-277-5837 | Phone: +1-469-277-5837 | |||
Email: lucy.yong@huawei.com | EMail: lucy.yong@huawei.com | |||
Anoop Ghanwani | Anoop Ghanwani | |||
Dell | Dell | |||
San Jose, CA 95134 | 5450 Great America Pkwy | |||
Santa Clara, CA 95054 | ||||
United States | ||||
Phone: +1-408-571-3228 | Phone: +1-408-571-3228 | |||
Email: anoop@alumni.duke.edu | EMail: anoop@alumni.duke.edu | |||
Ning So | Ning So | |||
Tata Communications | Vinci Systems | |||
Plano, TX 75082, USA | 2613 Fairbourne Cir | |||
Phone: +1-972-955-0914 | Plano, TX 75093 | |||
Email: ning.so@tatacommunications.com | United States | |||
EMail: ningso@yahoo.com | ||||
Bhumip Khasnabish | Bhumip Khasnabish | |||
ZTE Corporation | ZTE Corporation | |||
New Jersey, 07960, USA | New Jersey 07960 | |||
United States | ||||
Phone: +1-781-752-8003 | Phone: +1-781-752-8003 | |||
Email: vumip1@gmail.com | EMail: vumip1@gmail.com | |||
End of changes. 251 change blocks. | ||||
688 lines changed or deleted | 745 lines changed or added | |||