draft-ietf-opsawg-large-flow-load-balancing-00.txt | draft-ietf-opsawg-large-flow-load-balancing-01.txt | |||
---|---|---|---|---|
OPSAWG R. Krishnan | OPSAWG R. Krishnan | |||
Internet Draft S. Khanna | Internet Draft S. Khanna | |||
Intended status: Informational Brocade Communications | Intended status: Informational Brocade Communications | |||
Expires: November 2013 L. Yong | Expires: December 23, 2013 L. Yong | |||
May 8, 2013 Huawei USA | June 23, 2013 Huawei USA | |||
A. Ghanwani | A. Ghanwani | |||
Dell | Dell | |||
Ning So | Ning So | |||
Tata Communications | Tata Communications | |||
B. Khasnabish | B. Khasnabish | |||
ZTE Corporation | ZTE Corporation | |||
Mechanisms for Optimal LAG/ECMP Component Link Utilization in | Mechanisms for Optimal LAG/ECMP Component Link Utilization in | |||
Networks | Networks | |||
draft-ietf-opsawg-large-flow-load-balancing-00.txt | draft-ietf-opsawg-large-flow-load-balancing-01.txt | |||
Status of this Memo | Status of this Memo | |||
This Internet-Draft is submitted in full conformance with the | This Internet-Draft is submitted in full conformance with the | |||
provisions of BCP 78 and BCP 79. This document may not be modified, | provisions of BCP 78 and BCP 79. This document may not be modified, | |||
and derivative works of it may not be created, except to publish it | and derivative works of it may not be created, except to publish it | |||
as an RFC and to translate it into languages other than English. | as an RFC and to translate it into languages other than English. | |||
Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
Task Force (IETF), its areas, and its working groups. Note that | Task Force (IETF), its areas, and its working groups. Note that | |||
skipping to change at page 1, line 42 | skipping to change at page 1, line 42 | |||
and may be updated, replaced, or obsoleted by other documents at any | and may be updated, replaced, or obsoleted by other documents at any | |||
time. It is inappropriate to use Internet-Drafts as reference | time. It is inappropriate to use Internet-Drafts as reference | |||
material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
The list of current Internet-Drafts can be accessed at | The list of current Internet-Drafts can be accessed at | |||
http://www.ietf.org/ietf/1id-abstracts.txt | http://www.ietf.org/ietf/1id-abstracts.txt | |||
The list of Internet-Draft Shadow Directories can be accessed at | The list of Internet-Draft Shadow Directories can be accessed at | |||
http://www.ietf.org/shadow.html | http://www.ietf.org/shadow.html | |||
This Internet-Draft will expire on November 8, 2013. | This Internet-Draft will expire on December 23, 2013. | |||
Copyright Notice | Copyright Notice | |||
Copyright (c) 2013 IETF Trust and the persons identified as the | Copyright (c) 2013 IETF Trust and the persons identified as the | |||
document authors. All rights reserved. | document authors. All rights reserved. | |||
This document is subject to BCP 78 and the IETF Trust's Legal | This document is subject to BCP 78 and the IETF Trust's Legal | |||
Provisions Relating to IETF Documents | Provisions Relating to IETF Documents | |||
(http://trustee.ietf.org/license-info) in effect on the date of | (http://trustee.ietf.org/license-info) in effect on the date of | |||
publication of this document. Please review these documents | publication of this document. Please review these documents | |||
skipping to change at page 2, line 29 | skipping to change at page 2, line 29 | |||
center communications, etc. In this context, it is important to | center communications, etc. In this context, it is important to | |||
optimally use the bandwidth in wired networks that extensively use | optimally use the bandwidth in wired networks that extensively use | |||
LAG/ECMP techniques for bandwidth scaling. This draft explores some | LAG/ECMP techniques for bandwidth scaling. This draft explores some | |||
of the mechanisms useful for achieving this. | of the mechanisms useful for achieving this. | |||
Table of Contents | Table of Contents | |||
1. Introduction...................................................3 | 1. Introduction...................................................3 | |||
1.1. Acronyms..................................................3 | 1.1. Acronyms..................................................3 | |||
1.2. Terminology...............................................4 | 1.2. Terminology...............................................4 | |||
2. Hash-based Load Distribution in LAG/ECMP.......................4 | 2. Flow Categorization............................................4 | |||
3. Mechanisms for Optimal LAG/ECMP Component Link Utilization.....5 | 3. Hash-based Load Distribution in LAG/ECMP.......................5 | |||
3.1. Large Flow Recognition....................................7 | 4. Mechanisms for Optimal LAG/ECMP Component Link Utilization.....7 | |||
3.1.1. Flow Identification..................................7 | 4.1. Differences in LAG vs ECMP................................7 | |||
3.1.2. Criteria for Identifying a Large Flow................8 | 4.2. Overview of the mechanism.................................8 | |||
3.1.3. Sampling Techniques..................................8 | 4.3. Large Flow Recognition....................................9 | |||
3.1.4. Automatic Hardware Recognition.......................9 | 4.3.1. Flow Identification..................................9 | |||
3.2. Load Re-balancing Options................................10 | 4.3.2. Criteria for Identifying a Large Flow...............10 | |||
3.2.1. Alternative Placement of Large Flows................10 | 4.3.3. Sampling Techniques.................................10 | |||
3.2.2. Redistributing Small Flows..........................11 | 4.3.4. Automatic Hardware Recognition......................11 | |||
3.2.3. Component Link Protection Considerations............11 | 4.4. Load Re-balancing Options................................12 | |||
3.2.4. Load Re-Balancing Example...........................12 | 4.4.1. Alternative Placement of Large Flows................12 | |||
4. Information Model for Flow Re-balancing.......................13 | 4.4.2. Redistributing Small Flows..........................13 | |||
4.1. Configuration Parameters.................................13 | 4.4.3. Component Link Protection Considerations............13 | |||
4.2. Import of Flow Information...............................13 | 4.4.4. Load Re-balancing Algorithms........................14 | |||
5. Operational Considerations....................................14 | 4.4.5. Load Re-Balancing Example...........................14 | |||
6. IANA Considerations...........................................14 | 5. Information Model for Flow Re-balancing.......................15 | |||
7. Security Considerations.......................................15 | 5.1. Configuration Parameters for Flow Re-balancing...........15 | |||
8. Acknowledgements..............................................15 | 5.2. System Configuration and Identification Parameters.......16 | |||
9. References....................................................15 | 5.3. Information for Alternative Placement of Large Flows.....16 | |||
9.1. Normative References.....................................15 | 5.4. Information for Redistribution of Small Flows............17 | |||
9.2. Informative References...................................15 | 5.5. Export of Flow Information...............................17 | |||
5.6. Monitoring information...................................17 | ||||
5.6.1. Interface (link) utilization........................17 | ||||
5.6.2. Other monitoring information........................17 | ||||
6. Operational Considerations....................................18 | ||||
7. IANA Considerations...........................................18 | ||||
8. Security Considerations.......................................18 | ||||
9. Acknowledgements..............................................18 | ||||
10. References...................................................19 | ||||
10.1. Normative References....................................19 | ||||
10.2. Informative References..................................19 | ||||
1. Introduction | 1. Introduction | |||
Networks extensively use LAG/ECMP techniques for capacity scaling. | Networks extensively use LAG/ECMP techniques for capacity scaling. | |||
Network traffic can be predominantly categorized into two traffic | Network traffic can be predominantly categorized into two traffic | |||
types: long-lived large flows and other flows (which include long- | types: long-lived large flows and other flows (which include long- | |||
lived small flows, short-lived small/large flows). Stateless hash- | lived small flows, short-lived small/large flows). Stateless hash- | |||
based techniques [ITCOM, RFC 2991, RFC 2992, RFC 6790] are often used | based techniques [ITCOM, RFC 2991, RFC 2992, RFC 6790] are often used | |||
to distribute both long-lived large flows and other flows over the | to distribute both long-lived large flows and other flows over the | |||
component links in a LAG/ECMP. However, the traffic may not be evenly | component links in a LAG/ECMP. However, the traffic may not be evenly | |||
distributed over the component links due to the traffic pattern. | distributed over the component links due to the traffic pattern. | |||
This draft describes best practices for optimal LAG/ECMP component | This draft describes mechanisms for optimal LAG/ECMP component link | |||
link utilization while using hash-based techniques. These best | utilization while using hash-based techniques. The mechanisms | |||
practices comprise the following steps -- recognizing long-lived | comprise the following steps -- recognizing long-lived large flows in | |||
large flows in a router; and assigning the long-lived large flows to | a router; and assigning the long-lived large flows to specific | |||
specific LAG/ECMP component links or redistributing other flows when | LAG/ECMP component links or redistributing other flows when a | |||
a component link on the router is congested. | component link on the router is congested. | |||
It is useful to keep in mind that the typical use case is where the | It is useful to keep in mind that the typical use case is where the | |||
long-lived large flows are those that consume a significant amount of | long-lived large flows are those that consume a significant amount of | |||
bandwidth on a link, e.g. greater than 5% of link bandwidth. The | bandwidth on a link, e.g. greater than 5% of link bandwidth. The | |||
number of such flows would necessarily be fairly small, e.g. on the | number of such flows would necessarily be fairly small, e.g. on the | |||
order of 10's or 100's per link. In other words, the number of long- | order of 10's or 100's per link. In other words, the number of long- | |||
lived large flows is NOT expected to be on the order of millions of | lived large flows is NOT expected to be on the order of millions of | |||
flows. Examples of such long-lived large flows would be IPSec | flows. Examples of such long-lived large flows would be IPSec | |||
tunnels in service provider backbones or storage backup traffic in | tunnels in service provider backbones or storage backup traffic in | |||
data center networks. | data center networks. | |||
skipping to change at page 4, line 19 | skipping to change at page 4, line 29 | |||
VXLAN: Virtual Extensible LAN | VXLAN: Virtual Extensible LAN | |||
1.2. Terminology | 1.2. Terminology | |||
Large flow(s): long-lived large flow(s) | Large flow(s): long-lived large flow(s) | |||
Small flow(s): long-lived small flow(s) and short-lived small/large | Small flow(s): long-lived small flow(s) and short-lived small/large | |||
flow(s) | flow(s) | |||
2. Hash-based Load Distribution in LAG/ECMP | 2. Flow Categorization | |||
In general, based on the size and duration, a flow can be categorized | ||||
into any one of the following four types, as shown in Figure 1: | ||||
(a) Short-Lived Large Flow (SLLF), | ||||
(b) Short-Lived Small Flow (SLSF), | ||||
(c) Long-Lived Large Flow (LLLF), and | ||||
(d) Long-Lived Small Flow (LLSF). | ||||
Flow Size | ||||
^ | ||||
|--------------------|--------------------| | ||||
| | | | ||||
Large | SLLF | LLLF | | ||||
Flow | | | | ||||
|--------------------|--------------------| | ||||
| | | | ||||
Small | SLSF | LLSF | | ||||
Flow | | | | ||||
+--------------------+--------------------+---> Flow duration | ||||
Short-Lived Long-Lived | ||||
Flow Flow | ||||
Figure 1: Flow Categorization | ||||
In this document, we categorize long-lived large flow(s) as "Large" | ||||
flow(s), and all of the others -- long-lived small flow(s) and short- | ||||
lived small/large flow(s) -- as "Small" flow(s). | ||||
3. Hash-based Load Distribution in LAG/ECMP | ||||
Hashing techniques are often used for traffic load balancing to | Hashing techniques are often used for traffic load balancing to | |||
select among multiple available paths with LAG/ECMP. The advantages | select among multiple available paths with LAG/ECMP. The advantages | |||
of hash-based load distribution are the preservation of the packet | of hash-based load distribution are the preservation of the packet | |||
sequence in a flow and the real-time distribution without maintaining | sequence in a flow and the real-time distribution without maintaining | |||
per-flow state in the router. Hash-based techniques use a combination | per-flow state in the router. Hash-based techniques use a combination | |||
of fields in the packet's headers to identify a flow, and the hash | of fields in the packet's headers to identify a flow, and the hash | |||
function on these fields is used to generate a unique number that | function on these fields is used to generate a unique number that | |||
identifies a link/path in a LAG/ECMP. The result of the hashing | identifies a link/path in a LAG/ECMP. The result of the hashing | |||
procedure is a many-to-one mapping of flows to component links. | procedure is a many-to-one mapping of flows to component links. | |||
skipping to change at page 4, line 41 | skipping to change at page 6, line 7 | |||
If the traffic load constitutes flows such that the result of the | If the traffic load constitutes flows such that the result of the | |||
hash function across these flows is fairly uniform so that a similar | hash function across these flows is fairly uniform so that a similar | |||
number of flows is mapped to each component link, if the individual | number of flows is mapped to each component link, if the individual | |||
flow rates are much smaller as compared to the link capacity, and if | flow rates are much smaller as compared to the link capacity, and if | |||
the rate differences are not dramatic, the hash-based algorithm | the rate differences are not dramatic, the hash-based algorithm | |||
produces good results with respect to utilization of the individual | produces good results with respect to utilization of the individual | |||
component links. However, if one or more of these conditions are not | component links. However, if one or more of these conditions are not | |||
met, hash-based techniques may result in unbalanced loads on | met, hash-based techniques may result in unbalanced loads on | |||
individual component links. | individual component links. | |||
One example is illustrated in Figure 1. In the figure, there are two | One example is illustrated in Figure 2. In Figure 2, there are two | |||
routers, R1 and R2, and there is a LAG between them which has 3 | routers, R1 and R2, and there is a LAG between them which has 3 | |||
component links (1), (2), (3). There are a total of 10 flows that | component links (1), (2), (3). There are a total of 10 flows that | |||
need to be distributed across the links in this LAG. The result of | need to be distributed across the links in this LAG. The result of | |||
hashing is as follows: | hashing is as follows: | |||
. Component link (1) has 3 flows -- 2 small flows and 1 large | . Component link (1) has 3 flows -- 2 small flows and 1 large | |||
flow -- and the link utilization is normal. | flow -- and the link utilization is normal. | |||
. Component link (2) has 3 flows -- 3 small flows and no large | . Component link (2) has 3 flows -- 3 small flows and no large | |||
flow -- and the link utilization is light. | flow -- and the link utilization is light. | |||
skipping to change at page 5, line 42 | skipping to change at page 6, line 48 | |||
| | -> -> | | | | | -> -> | | | |||
| |=====> | | | | |=====> | | | |||
| |=====> | | | | |=====> | | | |||
| (3)|--/---/-|(3) | | | (3)|--/---/-|(3) | | |||
| | | | | | | | | | |||
+-----------+ +-----------+ | +-----------+ +-----------+ | |||
Where: ->-> small flows | Where: ->-> small flows | |||
===> large flow | ===> large flow | |||
Figure 1: Unevenly Utilized Component Links | Figure 2: Unevenly Utilized Component Links | |||
This document presents improved load distribution techniques based on | This document presents improved load distribution techniques based on | |||
large flow awareness. The techniques compensate for unbalanced | large flow awareness. The techniques compensate for unbalanced | |||
load distribution resulting from hashing as demonstrated in the above | load distribution resulting from hashing as demonstrated in the above | |||
example. | example. | |||
3. Mechanisms for Optimal LAG/ECMP Component Link Utilization | 4. Mechanisms for Optimal LAG/ECMP Component Link Utilization | |||
The suggested techniques in this draft are about a local optimization | The suggested techniques in this draft are about a local optimization | |||
solution; they are local in the sense that both the identification of | solution; they are local in the sense that both the identification of | |||
large flows and re-balancing of the load can be accomplished | large flows and re-balancing of the load can be accomplished | |||
completely within individual nodes in the network without the need | completely within individual nodes in the network without the need | |||
for interaction with other nodes. | for interaction with other nodes. | |||
This approach may not yield a globally optimal placement of large | This approach may not yield a globally optimal placement of large | |||
flows across multiple nodes in a network, which may be desirable in | flows across multiple nodes in a network, which may be desirable in | |||
some networks. On the other hand, a local approach may be adequate | some networks. On the other hand, a local approach may be adequate | |||
skipping to change at page 6, line 28 | skipping to change at page 7, line 33 | |||
1) Different links within a network experience different levels of | 1) Different links within a network experience different levels of | |||
utilization and, thus, a "targeted" solution is needed for those hot- | utilization and, thus, a "targeted" solution is needed for those hot- | |||
spots in the network. An example is the utilization of a LAG between | spots in the network. An example is the utilization of a LAG between | |||
two routers that needs to be optimized. | two routers that needs to be optimized. | |||
2) Some networks may lack end-to-end visibility, e.g. when a | 2) Some networks may lack end-to-end visibility, e.g. when a | |||
certain network, under the control of a given operator, is a transit | certain network, under the control of a given operator, is a transit | |||
network for traffic from other networks that are not under the | network for traffic from other networks that are not under the | |||
control of the same operator. | control of the same operator. | |||
4.1. Differences in LAG vs ECMP | ||||
While the mechanisms explained herein are applicable to both LAGs and | ||||
ECMP, it is useful to note that there are some key differences | ||||
between the two that may impact how effective the mechanism is. This | ||||
relates, in part, to the localized information with which the scheme | ||||
is intended to operate. | ||||
A LAG is almost always between 2 adjacent routers. As a result, the | ||||
scope of the problem of optimizing the bandwidth utilization on the | ||||
component links is fairly narrow. It simply involves re-balancing | ||||
the load across the component links between these two routers, and | ||||
there is no impact whatsoever to other parts of the network. The | ||||
scheme works equally well for unicast and multicast flows. | ||||
On the other hand, with ECMP, redistributing the load across | ||||
component links that are part of the ECMP group may impact traffic | ||||
patterns at all of the nodes that are downstream of the given router | ||||
between itself and the destination. The local optimization may | ||||
result in congestion at a downstream node. (In its simplest form, an | ||||
ECMP group may be used to distribute traffic on component links that | ||||
are between two adjacent routers, and in that case, the ECMP group is | ||||
no different than a LAG for the purpose of this discussion.) | ||||
To demonstrate the limitations of local optimization, consider a two- | ||||
level fat-tree topology with three leaf nodes (L1, L2, L3) and two | ||||
spine nodes (S1, S2) and assume all of the links are 10 Gbps. Let L1 | ||||
have two flows of 4 Gbps each towards L3, and let L2 have one flow of | ||||
7 Gbps also towards L3. If L1 balances the load optimally between S1 | ||||
and S2, and L2 sends the flow via S1, then the downlink from S1 to L3 | ||||
would get congested resulting in packet discards. On the other hand, | ||||
if L1 had sent both its flows towards S1 and L2 had sent its flow | ||||
towards S2, there would have been no congestion at either S1 or S2. | ||||
The other issue with applying this scheme to ECMP groups is that it | ||||
may not apply equally to unicast and multicast traffic because of the | ||||
way multicast trees are constructed. | ||||
4.2. Overview of the mechanism | ||||
The various steps in achieving optimal LAG/ECMP component link | The various steps in achieving optimal LAG/ECMP component link | |||
utilization in networks are detailed below: | utilization in networks are detailed below: | |||
Step 1) This involves large flow recognition in routers and | Step 1) This involves large flow recognition in routers and | |||
maintaining the mapping of the large flow to the component link that | maintaining the mapping of the large flow to the component link that | |||
it uses. The recognition of large flows is explained in Section 3.1. | it uses. The recognition of large flows is explained in Section 3.1. | |||
Step 2) The egress component links are periodically scanned for link | Step 2) The egress component links are periodically scanned for link | |||
utilization. If the egress component link utilization exceeds a pre- | utilization. If the egress component link utilization exceeds a pre- | |||
programmed threshold, an operator alert is generated. The large flows | programmed threshold, an operator alert is generated. The large flows | |||
skipping to change at page 7, line 37 | skipping to change at page 9, line 37 | |||
of hops downstream on P1 may be congested, while P2 and P3 may be | of hops downstream on P1 may be congested, while P2 and P3 may be | |||
under-utilized, which the local router does not have visibility into. | under-utilized, which the local router does not have visibility into. | |||
With the help of a central management entity, the operator could | With the help of a central management entity, the operator could | |||
redistribute some of the flows from P1 to P2 and P3 resulting in a | redistribute some of the flows from P1 to P2 and P3 resulting in a | |||
more optimized flow of traffic. | more optimized flow of traffic. | |||
The techniques described above are especially useful when bundling | The techniques described above are especially useful when bundling | |||
links of different bandwidths, e.g. 10 Gbps and 100 Gbps, as | links of different bandwidths, e.g. 10 Gbps and 100 Gbps, as | |||
described in [I-D.ietf-rtgwg-cl-requirement]. | described in [I-D.ietf-rtgwg-cl-requirement]. | |||
3.1. Large Flow Recognition | 4.3. Large Flow Recognition | |||
3.1.1. Flow Identification | 4.3.1. Flow Identification | |||
A flow (large flow or small flow) can be defined as a sequence of | A flow (large flow or small flow) can be defined as a sequence of | |||
packets for which ordered delivery should be maintained. Flows are | packets for which ordered delivery should be maintained. Flows are | |||
typically identified using one or more fields from the packet header | typically identified using one or more fields from the packet header | |||
from the following list: | from the following list: | |||
. Layer 2: source MAC address, destination MAC address, VLAN ID. | . Layer 2: source MAC address, destination MAC address, VLAN ID. | |||
. IP header: IP Protocol, IP source address, IP destination | . IP header: IP Protocol, IP source address, IP destination | |||
address, flow label (IPv6 only), TCP/UDP source port, TCP/UDP | address, flow label (IPv6 only), TCP/UDP source port, TCP/UDP | |||
destination port. | destination port. | |||
. MPLS Labels. | . MPLS Labels. | |||
For tunneling protocols like GRE, VXLAN, NVGRE, STT, etc., flow | For tunneling protocols like GRE, VXLAN, NVGRE, STT, etc., flow | |||
identification is possible based on inner and/or outer headers. The | identification is possible based on inner and/or outer headers. The | |||
above list is not exhaustive. The mechanisms described in this | above list is not exhaustive. The mechanisms described in this | |||
document are agnostic to the fields that are used for flow | document are agnostic to the fields that are used for flow | |||
identification. | identification. | |||
3.1.2. Criteria for Identifying a Large Flow | 4.3.2. Criteria for Identifying a Large Flow | |||
From a bandwidth and time duration perspective, in order to identify | From a bandwidth and time duration perspective, in order to identify | |||
large flows we define an observation interval and observe the | large flows we define an observation interval and observe the | |||
bandwidth of the flow over that interval. A flow that exceeds a | bandwidth of the flow over that interval. A flow that exceeds a | |||
certain minimum bandwidth threshold over that observation interval | certain minimum bandwidth threshold over that observation interval | |||
would be considered a large flow. | would be considered a large flow. | |||
The two parameters -- the observation interval, and the minimum | The two parameters -- the observation interval, and the minimum | |||
bandwidth threshold over that observation interval -- should be | bandwidth threshold over that observation interval -- should be | |||
programmable in a router to facilitate handling of different use | programmable in a router to facilitate handling of different use | |||
skipping to change at page 8, line 36 | skipping to change at page 10, line 36 | |||
could be declared a large flow [DevoFlow]. | could be declared a large flow [DevoFlow]. | |||
In order to avoid excessive churn in the rebalancing, once a flow has | In order to avoid excessive churn in the rebalancing, once a flow has | |||
been recognized as a large flow, it should continue to be recognized | been recognized as a large flow, it should continue to be recognized | |||
as a large flow as long as the traffic received during an observation | as a large flow as long as the traffic received during an observation | |||
interval exceeds some fraction of the bandwidth threshold, for | interval exceeds some fraction of the bandwidth threshold, for | |||
example 80% of the bandwidth threshold. | example 80% of the bandwidth threshold. | |||
Various techniques to identify a large flow are described below. | Various techniques to identify a large flow are described below. | |||
3.1.3. Sampling Techniques | 4.3.3. Sampling Techniques | |||
A number of routers support sampling techniques such as sFlow [sFlow- | A number of routers support sampling techniques such as sFlow [sFlow- | |||
v5, sFlow-LAG], PSAMP [RFC 5475] and Netflow Sampling [RFC 3954]. | v5, sFlow-LAG], PSAMP [RFC 5475] and Netflow Sampling [RFC 3954]. | |||
For the purpose of large flow identification, sampling must be | For the purpose of large flow identification, sampling must be | |||
enabled on all of the egress ports in the router where such | enabled on all of the egress ports in the router where such | |||
measurements are desired. | measurements are desired. | |||
Using sflow as an example, processing in an sFlow collector will | Using sflow as an example, processing in a sFlow collector will | |||
provide an approximate indication of the large flows mapping to each | provide an approximate indication of the large flows mapping to each | |||
of the component links in each LAG/ECMP group. It is possible to | of the component links in each LAG/ECMP group. It is possible to | |||
implement this part of the collector function in the control plane of | implement this part of the collector function in the control plane of | |||
the router reducing dependence on an external management station, | the router reducing dependence on an external management station, | |||
assuming sufficient control plane resources are available. | assuming sufficient control plane resources are available. | |||
If egress sampling is not available, ingress sampling can suffice | If egress sampling is not available, ingress sampling can suffice | |||
since the central management entity used by the sampling technique | since the central management entity used by the sampling technique | |||
typically has multi-node visibility and can use the samples from an | typically has multi-node visibility and can use the samples from an | |||
immediately downstream node to make measurements for egress traffic | immediately downstream node to make measurements for egress traffic | |||
skipping to change at page 9, line 36 | skipping to change at page 11, line 36 | |||
Disadvantages: | Disadvantages: | |||
. In order to minimize the error inherent in sampling, there is a | . In order to minimize the error inherent in sampling, there is a | |||
minimum delay for the recognition time of large flows, and in | minimum delay for the recognition time of large flows, and in | |||
the time that it takes to react to this information. | the time that it takes to react to this information. | |||
With sampling, the detection of large flows can be done on the order | With sampling, the detection of large flows can be done on the order | |||
of one second [DevoFlow]. | of one second [DevoFlow]. | |||
3.1.4. Automatic Hardware Recognition | 4.3.4. Automatic Hardware Recognition | |||
Implementations may perform automatic recognition of large flows in | Implementations may perform automatic recognition of large flows in | |||
hardware on a router. Since this is done in hardware, it is an inline | hardware on a router. Since this is done in hardware, it is an inline | |||
solution and would be expected to operate at line rate. | solution and would be expected to operate at line rate. | |||
Using automatic hardware recognition of large flows, a faster | Using automatic hardware recognition of large flows, a faster | |||
indication of large flows mapped to each of the component links in a | indication of large flows mapped to each of the component links in a | |||
LAG/ECMP group is available (as compared to the sampling approach | LAG/ECMP group is available (as compared to the sampling approach | |||
described above). | described above). | |||
skipping to change at page 10, line 27 | skipping to change at page 12, line 27 | |||
. Not supported in many routers. | . Not supported in many routers. | |||
As mentioned earlier, the observation interval for determining a | As mentioned earlier, the observation interval for determining a | |||
large flow and the bandwidth threshold for classifying a flow as a | large flow and the bandwidth threshold for classifying a flow as a | |||
large flow should be programmable parameters in a router. | large flow should be programmable parameters in a router. | |||
The implementation of automatic hardware recognition of large flows | The implementation of automatic hardware recognition of large flows | |||
is vendor dependent and beyond the scope of this document. | is vendor dependent and beyond the scope of this document. | |||
3.2. Load Re-balancing Options | 4.4. Load Re-balancing Options | |||
Below are suggested techniques for load re-balancing. Equipment | Below are suggested techniques for load re-balancing. Equipment | |||
vendors should implement all of these techniques and allow the | vendors should implement all of these techniques and allow the | |||
operator to choose one or more techniques based on their | operator to choose one or more techniques based on their | |||
applications. | applications. | |||
Note that regardless of the method used, perfect re-balancing of | Note that regardless of the method used, perfect re-balancing of | |||
large flows may not be possible since flows arrive and depart at | large flows may not be possible since flows arrive and depart at | |||
different times. Also, any flows that are moved from one component | different times. Also, any flows that are moved from one component | |||
link to another may experience momentary packet reordering. | link to another may experience momentary packet reordering. | |||
3.2.1. Alternative Placement of Large Flows | 4.4.1. Alternative Placement of Large Flows | |||
Within a LAG/ECMP group, the member component links with least | Within a LAG/ECMP group, the member component links with least | |||
average port utilization are identified. Some large flow(s) from the | average port utilization are identified. Some large flow(s) from the | |||
heavily loaded component links are then moved to those lightly-loaded | heavily loaded component links are then moved to those lightly-loaded | |||
member component links using a PBR rule in the ingress processing | member component links using a PBR rule in the ingress processing | |||
element(s) in the routers. | element(s) in the routers. | |||
With this approach, only certain large flows are subjected to | With this approach, only certain large flows are subjected to | |||
momentary flow re-ordering. | momentary flow re-ordering. | |||
skipping to change at page 11, line 18 | skipping to change at page 13, line 18 | |||
placement of existing flows. | placement of existing flows. | |||
Consider a case where there is a LAG comprising four 10 Gbps component | Consider a case where there is a LAG comprising four 10 Gbps component | |||
links and there are 4 large flows each of 1 Gbps. These flows are | links and there are 4 large flows each of 1 Gbps. These flows are | |||
each placed on one of the component links. Subsequently, a fifth large | each placed on one of the component links. Subsequently, a fifth large | |||
flow of 2 Gbps is recognized and, to maintain equitable load | flow of 2 Gbps is recognized and, to maintain equitable load | |||
distribution, one of the existing 1 Gbps flows may have to be moved | distribution, one of the existing 1 Gbps flows may have to be moved | |||
to a different component link. Even then, some imbalance in the | to a different component link. Even then, some imbalance in the | |||
utilization across the component links would remain. | utilization across the component links would remain. | |||
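The example above can be traced with a short sketch. It is illustrative only: the function names, the flow labels (f1 to f5) and the "one corrective move" heuristic are assumptions made here, not an algorithm taken from the draft. The new flow goes to the least-loaded link, and then at most one existing flow is moved from the most-loaded link to the least-loaded link if that narrows the gap between them.

```python
def link_loads(placement, rates, num_links):
    """Sum the rates (Gbps) of the large flows mapped to each component link."""
    loads = [0.0] * num_links
    for flow, link in placement.items():
        loads[link] += rates[flow]
    return loads


def place_new_flow(placement, rates, flow, rate, num_links):
    """Hypothetical heuristic: place the new flow, then try one corrective move.

    Returns the name of the existing flow that was moved, or None.
    """
    rates[flow] = rate
    loads = link_loads(placement, rates, num_links)
    placement[flow] = loads.index(min(loads))  # least-loaded link

    loads = link_loads(placement, rates, num_links)
    src, dst = loads.index(max(loads)), loads.index(min(loads))
    gap = loads[src] - loads[dst]
    # Moving a flow of rate r from src to dst only helps if r < gap.
    candidates = [f for f, l in placement.items() if l == src and rates[f] < gap]
    if not candidates:
        return None
    # The best single move brings the two links closest together.
    moved = min(candidates, key=lambda f: abs(gap - 2 * rates[f]))
    placement[moved] = dst
    return moved


# Four 1 Gbps large flows, one per component link of a 4-link LAG.
rates = {"f1": 1.0, "f2": 1.0, "f3": 1.0, "f4": 1.0}
placement = {"f1": 0, "f2": 1, "f3": 2, "f4": 3}

# A fifth large flow of 2 Gbps is recognized.
moved = place_new_flow(placement, rates, "f5", 2.0, num_links=4)
final_loads = link_loads(placement, rates, 4)
imbalance = max(final_loads) - min(final_loads)
```

Under these assumptions exactly one existing 1 Gbps flow is moved, and the large-flow load per link ends up uneven (two links carry 2 Gbps, two carry 1 Gbps), which matches the residual imbalance described in the text.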
3.2.2. Redistributing Small Flows | 4.4.2. Redistributing Small Flows | |||
Some large flows may consume the entire bandwidth of the component | Some large flows may consume the entire bandwidth of the component | |||
link(s). In this case, it would be desirable for the small flows to | link(s). In this case, it would be desirable for the small flows to | |||
not use the congested component link(s). This can be accomplished in | not use the congested component link(s). This can be accomplished in | |||
one of the following ways. | one of the following ways. | |||
This method works on some existing router hardware. The idea is to | This method works on some existing router hardware. The idea is to | |||
prevent, or reduce the probability of, small flows hashing into | prevent, or reduce the probability of, small flows hashing into | |||
the congested component link(s). | the congested component link(s). | |||
skipping to change at page 11, line 42 | skipping to change at page 13, line 42 | |||
component links are heavily loaded, but not congested, the | component links are heavily loaded, but not congested, the | |||
output of the hash function can be adjusted to account for large | output of the hash function can be adjusted to account for large | |||
flow loading on each of the component links. | flow loading on each of the component links. | |||
. The PBR rules for large flows (refer to Section 3.2.1) must | . The PBR rules for large flows (refer to Section 3.2.1) must | |||
have strict precedence over the LAG/ECMP table lookup result. | have strict precedence over the LAG/ECMP table lookup result. | |||
With this approach the small flows that are moved would be subject to | With this approach the small flows that are moved would be subject to | |||
reordering. | reordering. | |||
3.2.3. Component Link Protection Considerations | 4.4.3. Component Link Protection Considerations | |||
If desired, certain component links may be reserved for link | If desired, certain component links may be reserved for link | |||
protection. These reserved component links are not used for any flows | protection. These reserved component links are not used for any flows | |||
in the absence of any failures. In the case when the component | in the absence of any failures. In the case when the component | |||
link(s) fail, all the flows on the failed component link(s) are moved | link(s) fail, all the flows on the failed component link(s) are moved | |||
to the reserved component link(s). The mapping table of large flows | to the reserved component link(s). The mapping table of large flows | |||
to component link simply replaces the failed component link with the | to component link simply replaces the failed component link with the | |||
reserved link. Likewise, the LAG/ECMP hash table replaces the failed | reserved link. Likewise, the LAG/ECMP hash table replaces the failed | |||
component link with the reserved link. | component link with the reserved link. | |||
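A minimal sketch of the substitution described above, assuming the large-flow mapping is a dictionary from flow identifier to component link and the LAG/ECMP hash table is a list of component links indexed by hash bucket. Both representations and the function name are hypothetical; real implementations keep these tables in hardware.

```python
def fail_over(large_flow_map, hash_table, failed_link, reserved_link):
    """Replace every occurrence of the failed link with the reserved link."""
    new_map = {
        flow: (reserved_link if link == failed_link else link)
        for flow, link in large_flow_map.items()
    }
    new_table = [
        reserved_link if link == failed_link else link for link in hash_table
    ]
    return new_map, new_table


# Links 1-3 carry traffic; link 4 is held in reserve (illustrative values).
large_flow_map = {"flowA": 1, "flowB": 2, "flowC": 2}
hash_table = [1, 2, 3, 1, 2, 3, 1, 2]

new_map, new_table = fail_over(large_flow_map, hash_table, failed_link=2, reserved_link=4)
```

Because only the failed link's entries change, flows on the surviving component links keep their placement and are not reordered.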
3.2.4. Load Re-Balancing Example | 4.4.4. Load Re-balancing Algorithms | |||
Optimal LAG/ECMP component utilization for the use case in Figure 1 | Specific algorithms for placement of large flows are out of scope of | |||
is depicted below in Figure 2. The large flow rebalancing explained | this document. One possibility is to formulate the problem for large | |||
flow placement as the well-known bin-packing problem and make use of | ||||
the various heuristics that are available for that problem [bin- | ||||
pack]. | ||||
4.4.5. Load Re-Balancing Example | ||||
Optimal LAG/ECMP component utilization for the use case in Figure 2 | ||||
is depicted below in Figure 3. The large flow rebalancing explained | ||||
in Section 3.2.1 is used. The improved link utilization is as | in Section 3.2.1 is used. The improved link utilization is as | |||
follows: | follows: | |||
. Component link (1) has 3 flows -- 2 small flows and 1 large | . Component link (1) has 3 flows -- 2 small flows and 1 large | |||
flow -- and the link utilization is normal. | flow -- and the link utilization is normal. | |||
. Component link (2) has 4 flows -- 3 small flows and 1 large | . Component link (2) has 4 flows -- 3 small flows and 1 large | |||
flow -- and the link utilization is normal now. | flow -- and the link utilization is normal now. | |||
. Component link (3) has 3 flows -- 2 small flows and 1 large | . Component link (3) has 3 flows -- 2 small flows and 1 large | |||
skipping to change at page 12, line 42 | skipping to change at page 15, line 5 | |||
| | | | | | | | | | |||
| | -> -> | | | | | -> -> | | | |||
| |=====> | | | | |=====> | | | |||
| (3)|--/---/-|(3) | | | (3)|--/---/-|(3) | | |||
| | | | | | | | | | |||
+-----------+ +-----------+ | +-----------+ +-----------+ | |||
Where: ->-> small flows | Where: ->-> small flows | |||
===> large flow | ===> large flow | |||
Figure 2: Evenly utilized Composite Links | Figure 3: Evenly utilized Composite Links | |||
Basically, the use of the mechanisms described in Section 3.2.1 | Basically, the use of the mechanisms described in Section 3.2.1 | |||
resulted in a rebalancing of flows where one of the large flows on | resulted in a rebalancing of flows where one of the large flows on | |||
component link (3) which was previously congested was moved to | component link (3) which was previously congested was moved to | |||
component link (2) which was previously under-utilized. | component link (2) which was previously under-utilized. | |||
4. Information Model for Flow Re-balancing | 5. Information Model for Flow Re-balancing | |||
4.1. Configuration Parameters | 5.1. Configuration Parameters for Flow Re-balancing | |||
The following parameters are required for the configuration of this | The following parameters are required for the configuration of this | |||
feature: | feature: | |||
. Large flow recognition parameters. | . Large flow recognition parameters: | |||
o Observation interval: The observation interval is the time | o Observation interval: The observation interval is the time | |||
period in seconds over which the packet arrivals are | period in seconds over which the packet arrivals are | |||
observed for the purpose of large flow recognition. | observed for the purpose of large flow recognition. | |||
o Minimum bandwidth threshold: The minimum bandwidth threshold | o Minimum bandwidth threshold: The minimum bandwidth threshold | |||
would be configured as a percentage of link speed and | would be configured as a percentage of link speed and | |||
translated into a number of bytes over the observation | translated into a number of bytes over the observation | |||
interval. A flow for which the number of bytes received, | interval. A flow for which the number of bytes received, | |||
for a given observation interval, exceeds this number would | for a given observation interval, exceeds this number would | |||
skipping to change at page 13, line 38 | skipping to change at page 16, line 5 | |||
Once a flow is recognized as a large flow, it continues to | Once a flow is recognized as a large flow, it continues to | |||
be recognized as a large flow until it falls below this | be recognized as a large flow until it falls below this | |||
threshold. This is also configured as a percentage of link | threshold. This is also configured as a percentage of link | |||
speed and is typically lower than the minimum bandwidth | speed and is typically lower than the minimum bandwidth | |||
threshold defined above. | threshold defined above. | |||
. Imbalance threshold: the difference between the utilization of | . Imbalance threshold: the difference between the utilization of | |||
the least utilized and most utilized component links. Expressed | the least utilized and most utilized component links. Expressed | |||
as a percentage of link speed. | as a percentage of link speed. | |||
4.2. Import of Flow Information | . Rebalancing interval: the minimum amount of time between | |||
rebalancing events. This parameter ensures that rebalancing is | ||||
not invoked too frequently as it impacts frame ordering. | ||||
These parameters may be configured on a system-wide basis or may | ||||
apply to an individual LAG. | ||||
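The recognition parameters above can be illustrated with a small sketch. It assumes thresholds are given as a percentage of link speed and converted to a byte count per observation interval, as the text describes. The name "release threshold" for the second, lower threshold is a placeholder chosen here; the function names are likewise hypothetical.

```python
def threshold_bytes(link_speed_bps, percent, interval_s):
    """Convert a percentage of link speed into bytes per observation interval."""
    return link_speed_bps * percent * interval_s / (100 * 8)


def update_large_flow_state(is_large, bytes_seen, min_bytes, release_bytes):
    """Apply hysteresis to a flow's byte count for one observation interval.

    A flow becomes large when it exceeds min_bytes, and stays large until it
    falls below release_bytes (assumed to be lower than min_bytes).
    """
    if is_large:
        return bytes_seen >= release_bytes
    return bytes_seen > min_bytes


# 10 Gbps link, 1 second observation interval (illustrative values).
min_bytes = threshold_bytes(10_000_000_000, 10, 1)     # 10% of link speed
release_bytes = threshold_bytes(10_000_000_000, 5, 1)  # 5% of link speed

state = False
history = []
for observed in (130_000_000, 100_000_000, 50_000_000, 100_000_000):
    state = update_large_flow_state(state, observed, min_bytes, release_bytes)
    history.append(state)
```

Using two thresholds keeps a flow that hovers around the minimum bandwidth threshold from flapping between the large and small classifications on every interval.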
5.2. System Configuration and Identification Parameters | ||||
. IP address: The IP address of a specific router that the | ||||
feature is being configured on, or that the large flow placement | ||||
is being applied to. | ||||
. LAG ID: Identifies the LAG. The LAG ID may be required when | ||||
configuring this feature (to apply a specific set of large flow | ||||
identification parameters to the LAG) and will be required when | ||||
specifying flow placement to achieve the desired rebalancing. | ||||
. Component Link ID: Identifies the component link within a LAG. | ||||
This is required when specifying flow placement to achieve the | ||||
desired rebalancing. | ||||
5.3. Information for Alternative Placement of Large Flows | ||||
In cases where large flow recognition is handled by an external | In cases where large flow recognition is handled by an external | |||
management station (see Section 3.1.3), an information model for | management station (see Section 3.1.3), an information model for | |||
flows is required to allow the import of large flow information to | flows is required to allow the import of large flow information to | |||
the router. | the router. | |||
The following are some of the elements of the information model for | The following are some of the elements of the information model for | |||
importing flows: | importing flows: | |||
. Layer 2: source MAC address, destination MAC address, VLAN ID. | . Layer 2: source MAC address, destination MAC address, VLAN ID. | |||
skipping to change at page 14, line 22 | skipping to change at page 17, line 5 | |||
destination port. | destination port. | |||
. MPLS Labels. | . MPLS Labels. | |||
This list is not exhaustive. For example, with overlay protocols | This list is not exhaustive. For example, with overlay protocols | |||
such as VXLAN and NVGRE, fields from the outer and/or inner headers | such as VXLAN and NVGRE, fields from the outer and/or inner headers | |||
may be specified. In general, all fields in the packet that can be | may be specified. In general, all fields in the packet that can be | |||
used by forwarding decisions should be available for use when | used by forwarding decisions should be available for use when | |||
importing flow information from an external management station. | importing flow information from an external management station. | |||
5. Operational Considerations | The IPFIX information model [RFC 5101] can be leveraged for large | |||
flow identification. The component link ID would be used to specify | ||||
the target component link for the flow. | ||||
5.4. Information for Redistribution of Small Flows | ||||
For small flows, the LAG ID and the component link IDs, along with the | ||||
percentage of traffic to be assigned to each component link ID, are | ||||
required. | ||||
5.5. Export of Flow Information | ||||
Exporting flow information is required when large flow identification | ||||
is being done on a router, but the decision to rebalance is being | ||||
made in an external management station. | ||||
It is recommended to use the IPFIX protocol [RFC 5101] for exporting | ||||
large flows from the router to an external management station. | ||||
5.6. Monitoring information | ||||
5.6.1. Interface (link) utilization | ||||
The incoming bytes (ifInOctets), outgoing bytes (ifOutOctets) and | ||||
interface speed (ifSpeed) can be measured from the Interface table | ||||
(ifTable) MIB [RFC 1213]. | ||||
The link utilization can then be computed as follows: | ||||
Incoming link utilization = (ifInOctets * 8 / ifSpeed) | ||||
Outgoing link utilization = (ifOutOctets * 8 / ifSpeed) | ||||
For high speed links, the etherStatsHighCapacityTable MIB [RFC 3273] | ||||
can be used. | ||||
The outgoing link utilization of the component links within a LAG can | ||||
be used to compute the imbalance threshold (See Section 5.1) for the | ||||
LAG. | ||||
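In practice ifInOctets and ifOutOctets are cumulative counters, so a utilization figure is normally derived from the difference between two polls divided by the polling interval. The sketch below shows that calculation under stated assumptions: a 32-bit counter that wraps at most once between polls, ifSpeed in bits per second, and a hypothetical function name. It is not an SNMP client and performs no polling itself.

```python
def link_utilization(octets_start, octets_end, interval_s, if_speed_bps,
                     counter_bits=32):
    """Fraction of link capacity used between two counter readings.

    The modulo handles a single counter wrap between the two polls.
    """
    delta_octets = (octets_end - octets_start) % (1 << counter_bits)
    return delta_octets * 8 / (interval_s * if_speed_bps)


# 1 Gbps link polled 10 seconds apart (illustrative values).
util = link_utilization(0, 625_000_000, 10, 1_000_000_000)

# Counter wrapped between polls: 1000 octets before the wrap, 500 after.
wrapped = link_utilization(2**32 - 1000, 500, 10, 1_000_000_000)
```

On fast links a 32-bit octet counter can wrap more than once within a polling interval, which is one reason high-capacity counters are preferred there.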
5.6.2. Other monitoring information | ||||
Additional monitoring information includes: | ||||
. Number of times rebalancing was done. | ||||
. Time since the last rebalancing event. | ||||
6. Operational Considerations | ||||
Flows should be re-balanced only when the imbalance in the | Flows should be re-balanced only when the imbalance in the | |||
utilization across component links exceeds a certain threshold. | utilization across component links exceeds a certain threshold. | |||
Frequent re-balancing to achieve precise equitable utilization across | Frequent re-balancing to achieve precise equitable utilization across | |||
component links could be counter-productive as it may result in | component links could be counter-productive as it may result in | |||
moving flows back and forth between the component links impacting | moving flows back and forth between the component links impacting | |||
packet ordering and system stability. This applies regardless of | packet ordering and system stability. This applies regardless of | |||
whether large flows or small flows are re-distributed. | whether large flows or small flows are re-distributed. It should be | |||
noted that reordering is a concern for TCP flows with even a few | ||||
packets because three out-of-order packets would trigger sufficient | ||||
duplicate ACKs to the sender resulting in a retransmission [RFC | ||||
5681]. | ||||
The operator would have to experiment with various values of the | The operator would have to experiment with various values of the | |||
large flow recognition parameters (minimum bandwidth threshold, | large flow recognition parameters (minimum bandwidth threshold, | |||
observation interval) and the imbalance threshold across component | observation interval) and the imbalance threshold across component | |||
links to tune the solution for their environment. | links to tune the solution for their environment. | |||
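One way to picture the rebalancing decision is as a check against the imbalance threshold combined with a minimum spacing between rebalancing events. The sketch below is an assumption-laden illustration: utilizations and the threshold are expressed as fractions of link speed, times are in seconds, and the function name is hypothetical.

```python
def should_rebalance(link_utils, imbalance_threshold, now_s,
                     last_rebalance_s, rebalance_interval_s):
    """Return True only if the imbalance is large enough and enough time has passed."""
    imbalance = max(link_utils) - min(link_utils)
    interval_elapsed = (now_s - last_rebalance_s) >= rebalance_interval_s
    return imbalance > imbalance_threshold and interval_elapsed


# Outgoing utilization of three component links in a LAG (illustrative values).
congested = should_rebalance([0.9, 0.3, 0.4], 0.25, now_s=100,
                             last_rebalance_s=0, rebalance_interval_s=60)
balanced = should_rebalance([0.5, 0.45, 0.4], 0.25, now_s=100,
                            last_rebalance_s=0, rebalance_interval_s=60)
too_soon = should_rebalance([0.9, 0.3, 0.4], 0.25, now_s=30,
                            last_rebalance_s=0, rebalance_interval_s=60)
```

Raising either the threshold or the interval trades tighter balance for fewer flow moves, and therefore fewer reordering events.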
6. IANA Considerations | 7. IANA Considerations | |||
This memo includes no request to IANA. | This memo includes no request to IANA. | |||
7. Security Considerations | 8. Security Considerations | |||
This document does not directly impact the security of the Internet | This document does not directly impact the security of the Internet | |||
infrastructure or its applications. In fact, it could help if there | infrastructure or its applications. In fact, it could help if there | |||
is a DoS attack pattern which causes a hash imbalance resulting in | is a DoS attack pattern which causes a hash imbalance resulting in | |||
heavy overloading of large flows to certain LAG/ECMP component | heavy overloading of large flows to certain LAG/ECMP component | |||
links. | links. | |||
8. Acknowledgements | 9. Acknowledgements | |||
The authors would like to thank the following individuals for their | The authors would like to thank the following individuals for their | |||
review and valuable feedback on earlier versions of this document: | review and valuable feedback on earlier versions of this document: | |||
Shane Amante, Curtis Villamizar, Fred Baker, Wes George, Brian | Shane Amante, Curtis Villamizar, Fred Baker, Wes George, Brian | |||
Carpenter, George Yum, Michael Fargano, Michael Bugenhagen, Jianrong | Carpenter, George Yum, Michael Fargano, Michael Bugenhagen, Jianrong | |||
Wong, Peter Phaal, Roman Krzanowski, Weifeng Zhang, Pete Moyer, Andrew | Wong, Peter Phaal, Roman Krzanowski, Weifeng Zhang, Pete Moyer, | |||
Mallis, Dave Mcdysan and Zhen Cao | Andrew Malis, Dave McDysan, Zhen Cao, and Dan Romascanu. | |||
9. References | 10. References | |||
9.1. Normative References | 10.1. Normative References | |||
9.2. Informative References | 10.2. Informative References | |||
[I-D.ietf-rtgwg-cl-requirement] Villamizar, C. et al., "Requirements | [I-D.ietf-rtgwg-cl-requirement] Villamizar, C. et al., "Requirements | |||
for MPLS over a Composite Link", June 2012. | for MPLS over a Composite Link," September 2013. | |||
[RFC 6790] Kompella, K. et al., "The Use of Entropy Labels in MPLS | [RFC 6790] Kompella, K. et al., "The Use of Entropy Labels in MPLS | |||
Forwarding", November 2012. | Forwarding," November 2012. | |||
[CAIDA] Caida Internet Traffic Analysis, http://www.caida.org/home. | [CAIDA] Caida Internet Traffic Analysis, http://www.caida.org/home. | |||
[YONG] Yong, L., "Enhanced ECMP and Large Flow Aware Transport", | [YONG] Yong, L., "Enhanced ECMP and Large Flow Aware Transport," | |||
draft-yong-pwe3-enhance-ecmp-lfat-01, September 2010. | draft-yong-pwe3-enhance-ecmp-lfat-01, September 2010. | |||
[ITCOM] Jo, J., et al., "Internet traffic load balancing using | [ITCOM] Jo, J., et al., "Internet traffic load balancing using | |||
dynamic hashing with flow volume", SPIE ITCOM, 2002. | dynamic hashing with flow volume," SPIE ITCOM, 2002. | |||
[RFC 2991] Thaler, D. and C. Hopps, "Multipath Issues in Unicast and | [RFC 2991] Thaler, D. and C. Hopps, "Multipath Issues in Unicast and | |||
Multicast", November 2000. | Multicast," November 2000. | |||
[RFC 2992] Hopps, C., "Analysis of an Equal-Cost Multi-Path | [RFC 2992] Hopps, C., "Analysis of an Equal-Cost Multi-Path | |||
Algorithm", November 2000. | Algorithm," November 2000. | |||
[RFC 5475] Zseby, T., et al., "Sampling and Filtering Techniques for | [RFC 5475] Zseby, T., et al., "Sampling and Filtering Techniques for | |||
IP Packet Selection", March 2009. | IP Packet Selection," March 2009. | |||
[sFlow-v5] Phaal, P. and M. Lavine, "sFlow version 5", July 2004. | [sFlow-v5] Phaal, P. and M. Lavine, "sFlow version 5," July 2004. | |||
[sFlow-LAG] Phaal, P. and A. Ghanwani, "sFlow LAG counters | [sFlow-LAG] Phaal, P. and A. Ghanwani, "sFlow LAG counters | |||
structure", September 2012. | structure," September 2012. | |||
[RFC 3954] Claise, B., "Cisco Systems NetFlow Services Export Version | [RFC 3954] Claise, B., "Cisco Systems NetFlow Services Export Version | |||
9", October 2004. | 9," October 2004. | |||
[RFC 5101] Claise, B., "Specification of the IP Flow Information | ||||
Export (IPFIX) Protocol for the Exchange of IP Traffic Flow | ||||
Information," January 2008. | ||||
[RFC 1213] McCloghrie, K., "Management Information Base for Network | ||||
Management of TCP/IP-based internets: MIB-II," March 1991. | ||||
[RFC 3273] Waldbusser, S., "Remote Network Monitoring Management | ||||
Information Base for High Capacity Networks," July 2002. | ||||
[DevoFlow] Mogul, J., et al., "DevoFlow: Cost-Effective Flow | [DevoFlow] Mogul, J., et al., "DevoFlow: Cost-Effective Flow | |||
Management for High Performance Enterprise Networks", Proceedings of | Management for High Performance Enterprise Networks," Proceedings of | |||
the ACM SIGCOMM, August 2011. | the ACM SIGCOMM, August 2011. | |||
[Bloom] Bloom, B. H., "Space /Time Trade-offs in Hash Coding with | ||||
Allowable Errors", Communications of the ACM, July 1970. | ||||
[NDTM] Estan, C. and G. Varghese, "New directions in traffic | [NDTM] Estan, C. and G. Varghese, "New directions in traffic | |||
measurement and accounting", Proceedings of ACM SIGCOMM, August 2002. | measurement and accounting," Proceedings of ACM SIGCOMM, August 2002. | |||
[bin-pack] Coffman, Jr., E., M. Garey, and D. Johnson. Approximation | ||||
Algorithms for Bin-Packing -- An Updated Survey. In Algorithm Design | ||||
for Computer System Design, ed. by Ausiello, Lucertini, and Serafini. | ||||
Springer-Verlag, 1984. | ||||
Appendix A. Internet Traffic Analysis and Load Balancing Simulation | Appendix A. Internet Traffic Analysis and Load Balancing Simulation | |||
Internet traffic [CAIDA] has been analyzed to obtain flow statistics | Internet traffic [CAIDA] has been analyzed to obtain flow statistics | |||
such as the number of packets in a flow and the flow duration. The | such as the number of packets in a flow and the flow duration. The | |||
5-tuple in the packet header (IP addresses, TCP/UDP ports, and IP | 5-tuple in the packet header (IP addresses, TCP/UDP ports, and IP | |||
protocol) is used for flow identification. The analysis indicates | protocol) is used for flow identification. The analysis indicates | |||
that < ~2% of the flows take ~30% of total traffic volume while the | that < ~2% of the flows take ~30% of total traffic volume while the | |||
rest of the flows (> ~98%) contribute ~70% [YONG]. | rest of the flows (> ~98%) contribute ~70% [YONG]. | |||
End of changes. 49 change blocks. | ||||
79 lines changed or deleted | 253 lines changed or added | |||