draft-ietf-opsawg-large-flow-load-balancing-11.txt | draft-ietf-opsawg-large-flow-load-balancing-12.txt | |||
---|---|---|---|---|
OPSAWG R. Krishnan | OPSAWG R. Krishnan | |||
Internet Draft Brocade Communications | Internet Draft Brocade Communications | |||
Intended status: Informational L. Yong | Intended status: Informational L. Yong | |||
Expires: October 22, 2014 Huawei USA | Expires: December 13, 2014 Huawei USA | |||
A. Ghanwani | A. Ghanwani | |||
Dell | Dell | |||
Ning So | Ning So | |||
Tata Communications | Tata Communications | |||
B. Khasnabish | B. Khasnabish | |||
ZTE Corporation | ZTE Corporation | |||
April 22, 2014 | June 13, 2014 | |||
Mechanisms for Optimizing LAG/ECMP Component Link Utilization in | Mechanisms for Optimizing LAG/ECMP Component Link Utilization in | |||
Networks | Networks | |||
draft-ietf-opsawg-large-flow-load-balancing-11.txt | draft-ietf-opsawg-large-flow-load-balancing-12.txt | |||
Status of this Memo | Status of this Memo | |||
This Internet-Draft is submitted in full conformance with the | This Internet-Draft is submitted in full conformance with the | |||
provisions of BCP 78 and BCP 79. This document may not be modified, | provisions of BCP 78 and BCP 79. This document may not be modified, | |||
and derivative works of it may not be created, except to publish it | and derivative works of it may not be created, except to publish it | |||
as an RFC and to translate it into languages other than English. | as an RFC and to translate it into languages other than English. | |||
Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
Task Force (IETF), its areas, and its working groups. Note that | Task Force (IETF), its areas, and its working groups. Note that | |||
skipping to change at page 1, line 42 | skipping to change at page 1, line 42 | |||
and may be updated, replaced, or obsoleted by other documents at any | and may be updated, replaced, or obsoleted by other documents at any | |||
time. It is inappropriate to use Internet-Drafts as reference | time. It is inappropriate to use Internet-Drafts as reference | |||
material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
The list of current Internet-Drafts can be accessed at | The list of current Internet-Drafts can be accessed at | |||
http://www.ietf.org/ietf/1id-abstracts.txt | http://www.ietf.org/ietf/1id-abstracts.txt | |||
The list of Internet-Draft Shadow Directories can be accessed at | The list of Internet-Draft Shadow Directories can be accessed at | |||
http://www.ietf.org/shadow.html | http://www.ietf.org/shadow.html | |||
This Internet-Draft will expire on October 22, 2014. | This Internet-Draft will expire on December 13, 2014. | |||
Copyright Notice | Copyright Notice | |||
Copyright (c) 2014 IETF Trust and the persons identified as the | Copyright (c) 2014 IETF Trust and the persons identified as the | |||
document authors. All rights reserved. | document authors. All rights reserved. | |||
This document is subject to BCP 78 and the IETF Trust's Legal | This document is subject to BCP 78 and the IETF Trust's Legal | |||
Provisions Relating to IETF Documents | Provisions Relating to IETF Documents | |||
(http://trustee.ietf.org/license-info) in effect on the date of | (http://trustee.ietf.org/license-info) in effect on the date of | |||
publication of this document. Please review these documents | publication of this document. Please review these documents | |||
skipping to change at page 2, line 31 | skipping to change at page 2, line 31 | |||
link aggregation groups and equal cost multi-paths as techniques for | link aggregation groups and equal cost multi-paths as techniques for | |||
bandwidth scaling. This draft explores some of the mechanisms useful | bandwidth scaling. This draft explores some of the mechanisms useful | |||
for achieving this. | for achieving this. | |||
Table of Contents | Table of Contents | |||
1. Introduction...................................................3 | 1. Introduction...................................................3 | |||
1.1. Acronyms..................................................4 | 1.1. Acronyms..................................................4 | |||
1.2. Terminology...............................................4 | 1.2. Terminology...............................................4 | |||
2. Flow Categorization............................................5 | 2. Flow Categorization............................................5 | |||
3. Hash-based Load Distribution in LAG/ECMP.......................5 | 3. Hash-based Load Distribution in LAG/ECMP.......................6 | |||
4. Mechanisms for Optimizing LAG/ECMP Component Link Utilization..7 | 4. Mechanisms for Optimizing LAG/ECMP Component Link Utilization..7 | |||
4.1. Differences in LAG vs ECMP................................8 | 4.1. Differences in LAG vs ECMP................................8 | |||
4.2. Operational Overview......................................9 | 4.2. Operational Overview......................................9 | |||
4.3. Large Flow Recognition...................................10 | 4.3. Large Flow Recognition...................................10 | |||
4.3.1. Flow Identification.................................10 | 4.3.1. Flow Identification.................................10 | |||
4.3.2. Criteria and Techniques for Large Flow Recognition..11 | 4.3.2. Criteria and Techniques for Large Flow Recognition..11 | |||
4.3.3. Sampling Techniques.................................11 | 4.3.3. Sampling Techniques.................................11 | |||
4.3.4. Inline Data Path Measurement........................13 | 4.3.4. Inline Data Path Measurement........................13 | |||
4.3.5. Use of More Than One Method for Large Flow Recognition13 | 4.3.5. Use of More Than One Method for Large Flow | |||
Recognition.........................................13 | ||||
4.4. Load Rebalancing Options.................................14 | 4.4. Load Rebalancing Options.................................14 | |||
4.4.1. Alternative Placement of Large Flows................14 | 4.4.1. Alternative Placement of Large Flows................14 | |||
4.4.2. Redistributing Small Flows..........................15 | 4.4.2. Redistributing Small Flows..........................15 | |||
4.4.3. Component Link Protection Considerations............15 | 4.4.3. Component Link Protection Considerations............15 | |||
4.4.4. Load Rebalancing Algorithms.........................15 | 4.4.4. Load Rebalancing Algorithms.........................15 | |||
4.4.5. Load Rebalancing Example............................16 | 4.4.5. Load Rebalancing Example............................16 | |||
5. Information Model for Flow Rebalancing........................17 | 5. Information Model for Flow Rebalancing........................17 | |||
5.1. Configuration Parameters for Flow Rebalancing............17 | 5.1. Configuration Parameters for Flow Rebalancing............17 | |||
5.2. System Configuration and Identification Parameters.......18 | 5.2. System Configuration and Identification Parameters.......18 | |||
5.3. Information for Alternative Placement of Large Flows.....19 | 5.3. Information for Alternative Placement of Large Flows.....19 | |||
5.4. Information for Redistribution of Small Flows............19 | 5.4. Information for Redistribution of Small Flows............19 | |||
5.5. Export of Flow Information...............................20 | 5.5. Export of Flow Information...............................20 | |||
5.6. Monitoring information...................................20 | 5.6. Monitoring information...................................20 | |||
5.6.1. Interface (link) utilization........................20 | 5.6.1. Interface (link) utilization........................20 | |||
5.6.2. Other monitoring information........................21 | 5.6.2. Other monitoring information........................20 | |||
6. Operational Considerations....................................21 | 6. Operational Considerations....................................21 | |||
6.1. Rebalancing Frequency....................................21 | 6.1. Rebalancing Frequency....................................21 | |||
6.2. Handling Route Changes...................................22 | 6.2. Handling Route Changes...................................21 | |||
6.3. Forwarding Resources.....................................21 | ||||
7. IANA Considerations...........................................22 | 7. IANA Considerations...........................................22 | |||
8. Security Considerations.......................................22 | 8. Security Considerations.......................................22 | |||
9. Contributing Authors..........................................22 | 9. Contributing Authors..........................................22 | |||
10. Acknowledgements.............................................22 | 10. Acknowledgements.............................................22 | |||
11. References...................................................22 | 11. References...................................................23 | |||
11.1. Normative References....................................22 | 11.1. Normative References....................................23 | |||
11.2. Informative References..................................22 | 11.2. Informative References..................................23 | |||
1. Introduction | 1. Introduction | |||
Networks extensively use link aggregation groups (LAG) [802.1AX] and | Networks extensively use link aggregation groups (LAG) [802.1AX] and | |||
equal cost multi-paths (ECMP) [RFC 2991] as techniques for capacity | equal cost multi-paths (ECMP) [RFC 2991] as techniques for capacity | |||
scaling. For the problems addressed by this document, network traffic | scaling. For the problems addressed by this document, network traffic | |||
can be predominantly categorized into two traffic types: long-lived | can be predominantly categorized into two traffic types: long-lived | |||
large flows and other flows. These other flows, which include long- | large flows and other flows. These other flows, which include long- | |||
lived small flows, short-lived small flows, and short-lived large | lived small flows, short-lived small flows, and short-lived large | |||
flows, are referred to as "small flows" in this document. Long-lived | flows, are referred to as "small flows" in this document. Long-lived | |||
skipping to change at page 4, line 11 | skipping to change at page 4, line 12 | |||
of bandwidth on a link, e.g. greater than 5% of link bandwidth. The | of bandwidth on a link, e.g. greater than 5% of link bandwidth. The | |||
number of such flows would necessarily be fairly small, e.g. on the | number of such flows would necessarily be fairly small, e.g. on the | |||
order of 10's or 100's per LAG/ECMP. In other words, the number of | order of 10's or 100's per LAG/ECMP. In other words, the number of | |||
large flows is NOT expected to be on the order of millions of flows. | large flows is NOT expected to be on the order of millions of flows. | |||
Examples of such large flows would be IPsec tunnels in service | Examples of such large flows would be IPsec tunnels in service | |||
provider backbone networks or storage backup traffic in data center | provider backbone networks or storage backup traffic in data center | |||
networks. | networks. | |||
1.1. Acronyms | 1.1. Acronyms | |||
COTS: Commercial Off-the-shelf | ||||
DOS: Denial of Service | DOS: Denial of Service | |||
ECMP: Equal Cost Multi-path | ECMP: Equal Cost Multi-path | |||
GRE: Generic Routing Encapsulation | GRE: Generic Routing Encapsulation | |||
LAG: Link Aggregation Group | LAG: Link Aggregation Group | |||
MPLS: Multiprotocol Label Switching | MPLS: Multiprotocol Label Switching | |||
skipping to change at page 4, line 37 | skipping to change at page 4, line 36 | |||
QoS: Quality of Service | QoS: Quality of Service | |||
STT: Stateless Transport Tunneling | STT: Stateless Transport Tunneling | |||
TCAM: Ternary Content Addressable Memory | TCAM: Ternary Content Addressable Memory | |||
VXLAN: Virtual Extensible LAN | VXLAN: Virtual Extensible LAN | |||
1.2. Terminology | 1.2. Terminology | |||
Central management entity: Refers to an entity that is capable of | ||||
monitoring information about link utilization and flows in routers | ||||
across the network and may be capable of making traffic engineering | ||||
decisions for placement of large flows. It may include the functions | ||||
of a collector if the routers employ a sampling technique [RFC 7011]. | ||||
ECMP component link: An individual nexthop within an ECMP group. An | ECMP component link: An individual nexthop within an ECMP group. An | |||
ECMP component link may itself comprise a LAG. | ECMP component link may itself comprise a LAG. | |||
ECMP table: A table that is used as the nexthop of an ECMP route that | ECMP table: A table that is used as the nexthop of an ECMP route that | |||
comprises the set of component links and the weights associated with | comprises the set of component links and the weights associated with | |||
each of those component links. The weights are used to determine | each of those component links. The weights are used to determine | |||
which values of the hash function map to a given component link. | which values of the hash function map to a given component link. | |||
LAG component link: An individual link within a LAG. A LAG component | LAG component link: An individual link within a LAG. A LAG component | |||
link is typically a physical link. | link is typically a physical link. | |||
skipping to change at page 7, line 11 | skipping to change at page 7, line 11 | |||
o The presence of 2 large flows causes congestion on this | o The presence of 2 large flows causes congestion on this | |||
component link. | component link. | |||
+-----------+ -> +-----------+ | +-----------+ -> +-----------+ | |||
| | -> | | | | | -> | | | |||
| | ===> | | | | | ===> | | | |||
| (1)|--------|(1) | | | (1)|--------|(1) | | |||
| | -> | | | | | -> | | | |||
| | -> | | | | | -> | | | |||
| (R1) | -> | (R2) | | | (R1) | -> | (R2) | | |||
| (2)|--------|(2) | | | (2)|--------|(2) | | |||
| | -> | | | | | -> | | | |||
| | -> | | | | | -> | | | |||
| | ===> | | | | | ===> | | | |||
| | ===> | | | | | ===> | | | |||
| (3)|--------|(3) | | | (3)|--------|(3) | | |||
| | | | | | | | | | |||
+-----------+ +-----------+ | +-----------+ +-----------+ | |||
Where: -> small flow | Where: -> small flow | |||
skipping to change at page 8, line 47 | skipping to change at page 8, line 47 | |||
+-----+ +-----+ | +-----+ +-----+ | |||
/ \ \ / /\ | / \ \ / /\ | |||
/ +---------+ / \ | / +---------+ / \ | |||
/ / \ \ / \ | / / \ \ / \ | |||
/ / \ +------+ \ | / / \ +------+ \ | |||
/ / \ / \ \ | / / \ / \ \ | |||
+-----+ +-----+ +-----+ | +-----+ +-----+ +-----+ | |||
| L1 | | L2 | | L3 | | | L1 | | L2 | | L3 | | |||
+-----+ +-----+ +-----+ | +-----+ +-----+ +-----+ | |||
Figure 3: Two-level Fat Tree | Figure 3: Two-level Clos Network | |||
To demonstrate the limitations of local optimization, consider a two- | To demonstrate the limitations of local optimization, consider a two- | |||
level fat-tree topology with three leaf nodes (L1, L2, L3) and two | level Clos network topology as shown in Figure 3 with three leaf | |||
spine nodes (S1, S2) and assume all of the links are 10 Gbps. | nodes (L1, L2, L3) and two spine nodes (S1, S2). Assume all of the | |||
links are 10 Gbps. | ||||
Let L1 have two flows of 4 Gbps each towards L3, and let L2 have one | Let L1 have two flows of 4 Gbps each towards L3, and let L2 have one | |||
flow of 7 Gbps also towards L3. If L1 balances the load optimally | flow of 7 Gbps also towards L3. If L1 balances the load optimally | |||
between S1 and S2, and L2 sends the flow via S1, then the downlink | between S1 and S2, and L2 sends the flow via S1, then the downlink | |||
from S1 to L3 would get congested resulting in packet discards. On | from S1 to L3 would get congested resulting in packet discards. On | |||
the other hand, if L1 had sent both its flows towards S1 and L2 had | the other hand, if L1 had sent both its flows towards S1 and L2 had | |||
sent its flow towards S2, there would have been no congestion at | sent its flow towards S2, there would have been no congestion at | |||
either S1 or S2. | either S1 or S2. | |||
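The arithmetic behind this example can be checked with a short sketch. It is an illustration only: the function names and the representation of a placement as (spine, rate) pairs are assumptions made here, not part of the draft. Only the spine-to-L3 downlinks are modelled, since that is where the example places the congestion.

```python
# Illustrative sketch: names and data layout are assumptions, not from the draft.
LINK_GBPS = 10.0  # every link in the example is 10 Gbps


def downlink_load(placement):
    """Sum the offered load (Gbps) on each spine's downlink towards L3.

    `placement` is a list of (spine, rate_gbps) pairs, one per flow.
    """
    load = {}
    for spine, rate_gbps in placement:
        load[spine] = load.get(spine, 0.0) + rate_gbps
    return load


def congested(load):
    """Return the spines whose downlink load exceeds the link speed."""
    return sorted(s for s, gbps in load.items() if gbps > LINK_GBPS)


# Case 1: L1 splits its two 4 Gbps flows across S1 and S2; L2 picks S1.
local_opt = [("S1", 4.0), ("S2", 4.0), ("S1", 7.0)]
# Case 2: L1 sends both flows via S1; L2 sends its 7 Gbps flow via S2.
global_opt = [("S1", 4.0), ("S1", 4.0), ("S2", 7.0)]

for name, placement in (("local", local_opt), ("global", global_opt)):
    load = downlink_load(placement)
    print(name, load, "congested:", congested(load))
```

In the first case the S1 downlink is offered 11 Gbps on a 10 Gbps link; in the second case the downlinks carry 8 Gbps and 7 Gbps, so neither is congested.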
The other issue with applying this scheme to ECMP groups is that it | The other issue with applying this scheme to ECMP groups is that it | |||
skipping to change at page 10, line 40 | skipping to change at page 10, line 40 | |||
. IP header: IP Protocol, IP source address, IP destination | . IP header: IP Protocol, IP source address, IP destination | |||
address, flow label (IPv6 only), TCP/UDP source port, TCP/UDP | address, flow label (IPv6 only), TCP/UDP source port, TCP/UDP | |||
destination port. | destination port. | |||
. MPLS Labels. | . MPLS Labels. | |||
For tunneling protocols like Generic Routing Encapsulation (GRE) | For tunneling protocols like Generic Routing Encapsulation (GRE) | |||
[RFC 2784], Virtual eXtensible Local Area Network (VXLAN) [VXLAN], | [RFC 2784], Virtual eXtensible Local Area Network (VXLAN) [VXLAN], | |||
Network Virtualization using Generic Routing Encapsulation (NVGRE) | Network Virtualization using Generic Routing Encapsulation (NVGRE) | |||
[NVGRE], Stateless Transport Tunneling (STT) [STT], etc., flow | [NVGRE], Stateless Transport Tunneling (STT) [STT], Layer 2 Tunneling | |||
identification is possible based on inner and/or outer headers. The | Protocol (L2TP) [RFC 3931], etc., flow identification is possible | |||
above list is not exhaustive. The mechanisms described in this | based on inner and/or outer headers as well as fields introduced by | |||
document are agnostic to the fields that are used for flow | the tunnel header, as any or all such fields may be used for load | |||
identification. | balancing decisions [RFC 5640]. The above list is not exhaustive. | |||
The mechanisms described in this document are agnostic to the fields | ||||
that are used for flow identification. | ||||
This method of flow identification is consistent with that of IPFIX | This method of flow identification is consistent with that of IPFIX | |||
[RFC 7011]. | [RFC 7011]. | |||
4.3.2. Criteria and Techniques for Large Flow Recognition | 4.3.2. Criteria and Techniques for Large Flow Recognition | |||
From a bandwidth and time duration perspective, in order to recognize | From a bandwidth and time duration perspective, in order to recognize | |||
large flows we define an observation interval and observe the | large flows we define an observation interval and observe the | |||
bandwidth of the flow over that interval. A flow that exceeds a | bandwidth of the flow over that interval. A flow that exceeds a | |||
certain minimum bandwidth threshold over that observation interval | certain minimum bandwidth threshold over that observation interval | |||
skipping to change at page 14, line 13 | skipping to change at page 14, line 13 | |||
to reliably determine the mapping of large flows to component links | to reliably determine the mapping of large flows to component links | |||
of a LAG/ECMP group, it is acceptable for the router to use more than | of a LAG/ECMP group, it is acceptable for the router to use more than | |||
one method for large flow recognition. | one method for large flow recognition. | |||
If both methods are supported, inline data path measurement may be | If both methods are supported, inline data path measurement may be | |||
preferable because of its speed of detection [FLOW-ACC]. | preferable because of its speed of detection [FLOW-ACC]. | |||
4.4. Load Rebalancing Options | 4.4. Load Rebalancing Options | |||
Below are suggested techniques for load rebalancing. Equipment | Below are suggested techniques for load rebalancing. Equipment | |||
vendors should implement all of these techniques and allow the | vendors may implement more than one technique, including those not | |||
operator to choose one or more techniques based on their | described in this document, allowing the operator to choose between | |||
applications. | them. | |||
Note that regardless of the method used, perfect rebalancing of large | Note that regardless of the method used, perfect rebalancing of large | |||
flows may not be possible since flows arrive and depart at different | flows may not be possible since flows arrive and depart at different | |||
times. Also, any flows that are moved from one component link to | times. Also, any flows that are moved from one component link to | |||
another may experience momentary packet reordering. | another may experience momentary packet reordering. | |||
4.4.1. Alternative Placement of Large Flows | 4.4.1. Alternative Placement of Large Flows | |||
Within a LAG/ECMP group, the member component links with least | Within a LAG/ECMP group, the member component links with least | |||
average port utilization are identified. Some large flow(s) from the | average port utilization are identified. Some large flow(s) from the | |||
skipping to change at page 16, line 28 | skipping to change at page 16, line 28 | |||
flow -- and the link utilization is normal now. | flow -- and the link utilization is normal now. | |||
+-----------+ -> +-----------+ | +-----------+ -> +-----------+ | |||
| | -> | | | | | -> | | | |||
| | ===> | | | | | ===> | | | |||
| (1)|--------|(1) | | | (1)|--------|(1) | | |||
| | | | | | | | | | |||
| | ===> | | | | | ===> | | | |||
| | -> | | | | | -> | | | |||
| | -> | | | | | -> | | | |||
| (R1) | -> | (R2) | | | (R1) | -> | (R2) | | |||
| (2)|--------|(2) | | | (2)|--------|(2) | | |||
| | | | | | | | | | |||
| | -> | | | | | -> | | | |||
| | -> | | | | | -> | | | |||
| | ===> | | | | | ===> | | | |||
| (3)|--------|(3) | | | (3)|--------|(3) | | |||
| | | | | | | | | | |||
+-----------+ +-----------+ | +-----------+ +-----------+ | |||
Where: -> small flow | Where: -> small flow | |||
===> large flow | ===> large flow | |||
Figure 4: Evenly Utilized Composite Links | Figure 4: Evenly Utilized Composite Links | |||
Basically, the use of the mechanisms described in Section 4.4.1 | Basically, the use of the mechanisms described in Section 4.4.1 | |||
resulted in a rebalancing of flows where one of the large flows on | resulted in a rebalancing of flows where one of the large flows on | |||
component link (3) which was previously congested was moved to | component link (3) which was previously congested was moved to | |||
component link (2) which was previously under-utilized. | component link (2) which was previously under-utilized. | |||
5. Information Model for Flow Rebalancing | 5. Information Model for Flow Rebalancing | |||
skipping to change at page 17, line 46 | skipping to change at page 17, line 46 | |||
be recognized as a large flow until it falls below this | be recognized as a large flow until it falls below this | |||
threshold. This is also configured as a percentage of link | threshold. This is also configured as a percentage of link | |||
speed and is typically lower than the minimum bandwidth | speed and is typically lower than the minimum bandwidth | |||
threshold defined above. | threshold defined above. | |||
. Imbalance threshold: A measure of the deviation of the | . Imbalance threshold: A measure of the deviation of the | |||
component link utilizations from the utilization of the overall | component link utilizations from the utilization of the overall | |||
LAG/ECMP group. Since component links can be of a different | LAG/ECMP group. Since component links can be of a different | |||
speed, the imbalance can be computed as follows. Let the | speed, the imbalance can be computed as follows. Let the | |||
utilization of each component link in a LAG/ECMP group with n | utilization of each component link in a LAG/ECMP group with n | |||
links of speed b_1, b_2, .., b_n, be u_1, u_2, .., u_n. The mean | links of speed b_1, b_2 ... b_n, be u_1, u_2 ... u_n. The mean | |||
utilization is computed as u_ave = [ (u_1 x b_1) + (u_2 x b_2) + | utilization is computed as u_ave = [ (u_1 x b_1) + (u_2 x b_2) + | |||
.. + (u_n x b_n) ] / [b_1 + b_2 + b_n]. The imbalance is then | ... + (u_n x b_n) ] / [b_1 + b_2 + ... + b_n]. The imbalance is | |||
computed as max_{i=1..n} | u_i - u_ave | / u_ave. | then computed as max_{i=1 ... n} | u_i - u_ave |. | |||
. Rebalancing interval: The minimum amount of time between | . Rebalancing interval: The minimum amount of time between | |||
rebalancing events. This parameter ensures that rebalancing is | rebalancing events. This parameter ensures that rebalancing is | |||
not invoked too frequently as it impacts packet ordering. | not invoked too frequently as it impacts packet ordering. | |||
These parameters may be configured on a system-wide basis or they may | These parameters may be configured on a system-wide basis or they may | |||
apply to an individual LAG. They may be applied to an ECMP group | apply to an individual LAG. They may be applied to an ECMP group | |||
provided the component links are not shared with any other ECMP | provided the component links are not shared with any other ECMP | |||
group. | group. | |||
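The imbalance measure defined in the parameter list above can be sketched as follows. This is a minimal sketch, not the authors' implementation: the function names and the `relative` flag are assumptions introduced here. With `relative=False` it follows the -12 wording (maximum absolute deviation from the speed-weighted mean); with `relative=True` it follows the -11 wording, which divides that deviation by u_ave.

```python
# Minimal sketch; function names and the `relative` flag are hypothetical.

def mean_utilization(utils, speeds):
    """Speed-weighted mean utilization u_ave of a LAG/ECMP group.

    utils[i]  is the utilization u_i of component link i (fraction 0..1).
    speeds[i] is the speed b_i of component link i (any consistent unit).
    """
    if len(utils) != len(speeds) or not utils:
        raise ValueError("need one utilization per component link")
    return sum(u * b for u, b in zip(utils, speeds)) / sum(speeds)


def imbalance(utils, speeds, relative=False):
    """max_{i=1..n} |u_i - u_ave|, optionally divided by u_ave."""
    u_ave = mean_utilization(utils, speeds)
    deviation = max(abs(u - u_ave) for u in utils)
    if not relative:
        return deviation
    # An idle group has no imbalance; this also avoids dividing by zero.
    return deviation / u_ave if u_ave > 0 else 0.0


# Example: one 40G link and two 10G links (speeds in Gbps).
speeds = [10, 10, 40]
utils = [0.9, 0.3, 0.5]
print(round(mean_utilization(utils, speeds), 4))
print(round(imbalance(utils, speeds), 4))
print(round(imbalance(utils, speeds, relative=True), 4))
```

A rebalancing pass would compare the returned value against the configured imbalance threshold, so the threshold has to be chosen to match whichever of the two definitions is in use.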
skipping to change at page 19, line 12 | skipping to change at page 19, line 12 | |||
ECMP groups, or it may be configured specifically for a given LAG or | ECMP groups, or it may be configured specifically for a given LAG or | |||
ECMP group. | ECMP group. | |||
5.3. Information for Alternative Placement of Large Flows | 5.3. Information for Alternative Placement of Large Flows | |||
In cases where large flow recognition is handled by an external | In cases where large flow recognition is handled by an external | |||
management station (see Section 4.3.3), an information model for | management station (see Section 4.3.3), an information model for | |||
flows is required to allow the import of large flow information to | flows is required to allow the import of large flow information to | |||
the router. | the router. | |||
The following are some of the elements of information model for | Typical fields used for identifying large flows were discussed in | |||
importing of flows: | Section 4.3.1. The IPFIX information model [RFC 7012] can be | |||
leveraged for large flow identification. | ||||
. Layer 2: source MAC address, destination MAC address, VLAN ID. | ||||
. Layer 3 IP: IP Protocol, IP source address, IP destination | ||||
address, flow label (IPv6 only), TCP/UDP source port, TCP/UDP | ||||
destination port. | ||||
. MPLS Labels. | ||||
This list is not exhaustive. For example, with overlay protocols | ||||
such as VXLAN and NVGRE, fields from the outer and/or inner headers | ||||
may be specified. In general, all fields in the packet that can be | ||||
used by forwarding decisions should be available for use when | ||||
importing flow information from an external management station. | ||||
The IPFIX information model [RFC 7012] can be leveraged for large | ||||
flow identification. | ||||
Large Flow placement is achieved by specifying the relevant flow | Large Flow placement is achieved by specifying the relevant flow | |||
information along with the following: | information along with the following: | |||
. For LAG: Router's IP address, LAG ID, LAG component link ID. | . For LAG: Router's IP address, LAG ID, LAG component link ID. | |||
. For ECMP: Router's IP address, ECMP group, ECMP component link | . For ECMP: Router's IP address, ECMP group, ECMP component link | |||
ID. | ID. | |||
In the case where the ECMP component link itself comprises a LAG, we | In the case where the ECMP component link itself comprises a LAG, we | |||
skipping to change at page 20, line 40 | skipping to change at page 20, line 27 | |||
5.6. Monitoring information | 5.6. Monitoring information | |||
5.6.1. Interface (link) utilization | 5.6.1. Interface (link) utilization | |||
The incoming bytes (ifInOctets), outgoing bytes (ifOutOctets) and | The incoming bytes (ifInOctets), outgoing bytes (ifOutOctets) and | |||
interface speed (ifSpeed) can be measured from the Interface table | interface speed (ifSpeed) can be measured from the Interface table | |||
(ifTable) MIB [RFC 1213]. | (ifTable) MIB [RFC 1213]. | |||
The link utilization can then be computed as follows: | The link utilization can then be computed as follows: | |||
Incoming link utilization = (ifInOctets * 8 / ifSpeed) | Incoming link utilization = (ifInOctets/8) / ifSpeed | |||
Outgoing link utilization = (ifOutOctets * 8 / ifSpeed) | Outgoing link utilization = (ifOutOctets/8) / ifSpeed | |||
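The formulas above do not state a measurement interval or how the units are reconciled. One way to apply them in practice is sketched below, under assumptions that are not in the draft text: the octet counters are cumulative 32-bit counters sampled twice, ifSpeed is in bits per second, and octets are converted to bits by multiplying by 8. The function and parameter names are hypothetical.

```python
# Sketch under stated assumptions; names are hypothetical, not from the draft.
COUNTER32_MOD = 2 ** 32  # ifInOctets/ifOutOctets wrap at 2^32


def octet_delta(prev_octets, curr_octets, modulus=COUNTER32_MOD):
    """Octets transferred between two samples, tolerating one counter wrap."""
    return (curr_octets - prev_octets) % modulus


def link_utilization(prev_octets, curr_octets, interval_s, if_speed_bps):
    """Fraction of link capacity used between two counter samples.

    interval_s   : seconds between the two samples (assumed known).
    if_speed_bps : ifSpeed, in bits per second.
    """
    if interval_s <= 0 or if_speed_bps <= 0:
        raise ValueError("interval and speed must be positive")
    bits = octet_delta(prev_octets, curr_octets) * 8  # octets -> bits
    return bits / (interval_s * if_speed_bps)


# Example: 625,000,000 octets in 10 s on a 1 Gbps link.
print(link_utilization(0, 625_000_000, 10.0, 1_000_000_000))
```

The same function serves for incoming and outgoing utilization, depending on which counter is sampled. On fast links a 32-bit counter can wrap more than once between samples, which this sketch cannot detect.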
For high speed Ethernet links, the etherStatsHighCapacityTable MIB | For high speed Ethernet links, the etherStatsHighCapacityTable MIB | |||
[RFC 3273] can be used. | [RFC 3273] can be used. | |||
For scalability, it is recommended to use the counter push mechanism | For scalability, it is recommended to use the counter push mechanism | |||
in [sflow-v5] for the interface counters. Doing so would help avoid | in [sflow-v5] for the interface counters. Doing so would help avoid | |||
counter polling through the MIB interface. | counter polling through the MIB interface. | |||
The outgoing link utilization of the component links within a | The outgoing link utilization of the component links within a | |||
LAG/ECMP group can be used to compute the imbalance (See Section 5.1) | LAG/ECMP group can be used to compute the imbalance (See Section 5.1) | |||
skipping to change at page 22, line 13 | skipping to change at page 21, line 48 | |||
links to tune the solution for their environment. | links to tune the solution for their environment. | |||
6.2. Handling Route Changes | 6.2. Handling Route Changes | |||
Large flow rebalancing must be aware of any changes to the FIB. In | Large flow rebalancing must be aware of any changes to the FIB. In | |||
cases where the nexthop of a route no longer points to the LAG, or | cases where the nexthop of a route no longer points to the LAG, or | |||
to an ECMP group, any PBR entries added as described in Section 4.4.1 | to an ECMP group, any PBR entries added as described in Section 4.4.1 | |||
and 4.4.2 must be withdrawn in order to avoid the creation of | and 4.4.2 must be withdrawn in order to avoid the creation of | |||
forwarding loops. | forwarding loops. | |||
6.3. Forwarding Resources | ||||
Hash-based techniques used for load balancing with LAG/ECMP are | ||||
usually stateless. The mechanisms described in this document require | ||||
additional resources in the forwarding plane of routers for creating | ||||
PBR rules that are capable of overriding the forwarding decision from | ||||
the hash-based approach. These resources may limit the number of | ||||
flows that can be rebalanced and may also impact the latency | ||||
experienced by packets due to the additional lookups that are | ||||
required. | ||||
7. IANA Considerations | 7. IANA Considerations | |||
This memo includes no request to IANA. | This memo includes no request to IANA. | |||
8. Security Considerations | 8. Security Considerations | |||
This document does not directly impact the security of the Internet | This document does not directly impact the security of the Internet | |||
infrastructure or its applications. In fact, it could help if there | infrastructure or its applications. In fact, it could help if there | |||
is a DOS attack pattern which causes a hash imbalance resulting in | is a DOS attack pattern which causes a hash imbalance resulting in | |||
heavy overloading of large flows to certain LAG/ECMP component | heavy overloading of large flows to certain LAG/ECMP component | |||
links. | links. | |||
An attacker with knowledge of the large flow recognition algorithm | ||||
and any stateless distribution method can generate flows that are | ||||
distributed in a way that overloads a specific path. This could be | ||||
used to cause the creation of PBR rules that exhaust the available | ||||
rule capacity on nodes. If PBR rules are consequently discarded, | ||||
this could result in congestion on the attacker-selected path. | ||||
Alternatively, tracking large numbers of PBR rules could result in | ||||
performance degradation. | ||||
9. Contributing Authors | 9. Contributing Authors | |||
Sanjay Khanna | Sanjay Khanna | |||
Cisco Systems | Cisco Systems | |||
Email: sanjakha@gmail.com | Email: sanjakha@gmail.com | |||
10. Acknowledgements | 10. Acknowledgements | |||
The authors would like to thank the following individuals for their | The authors would like to thank the following individuals for their | |||
review and valuable feedback on earlier versions of this document: | review and valuable feedback on earlier versions of this document: | |||
Shane Amante, Fred Baker, Michael Bugenhagen, Zhen Cao, Brian | Shane Amante, Fred Baker, Michael Bugenhagen, Zhen Cao, Brian | |||
Carpenter, Benoit Claise, Michael Fargano, Wes George, Sriganesh | Carpenter, Benoit Claise, Michael Fargano, Wes George, Sriganesh | |||
Kini, Roman Krzanowski, Andrew Malis, Dave McDysan, Pete Moyer, | Kini, Roman Krzanowski, Andrew Malis, Dave McDysan, Pete Moyer, | |||
Peter Phaal, Dan Romascanu, Curtis Villamizar, Jianrong Wong, George | Peter Phaal, Dan Romascanu, Curtis Villamizar, Jianrong Wong, George | |||
Yum, and Weifeng Zhang. | Yum, and Weifeng Zhang. As a part of the IETF Last Call process, | |||
valuable comments were received from Martin Thomson, | ||||
11. References | 11. References | |||
11.1. Normative References | 11.1. Normative References | |||
11.2. Informative References | ||||
[802.1AX] IEEE Standards Association, "IEEE Std 802.1AX-2008 IEEE | [802.1AX] IEEE Standards Association, "IEEE Std 802.1AX-2008 IEEE | |||
Standard for Local and Metropolitan Area Networks - Link | Standard for Local and Metropolitan Area Networks - Link | |||
Aggregation", 2008. | Aggregation", 2008. | |||
[RFC 2991] Thaler, D. and C. Hopps, "Multipath Issues in Unicast and | ||||
Multicast," November 2000. | ||||
[RFC 7011] Claise, B. et al., "Specification of the IP Flow | ||||
Information Export (IPFIX) Protocol for the Exchange of IP Traffic | ||||
Flow Information," September 2013. | ||||
[RFC 7012] Claise, B. and B. Trammell, "Information Model for IP Flow | ||||
Information Export (IPFIX)," September 2013. | ||||
[sFlow-v5] Phaal, P. and M. Lavine, "sFlow version 5," | ||||
http://www.sflow.org/sflow_version_5.txt, July 2004. | ||||
11.2. Informative References | ||||
[bin-pack] Coffman, Jr., E., M. Garey, and D. Johnson. Approximation | [bin-pack] Coffman, Jr., E., M. Garey, and D. Johnson. Approximation | |||
Algorithms for Bin-Packing -- An Updated Survey. In Algorithm Design | Algorithms for Bin-Packing -- An Updated Survey. In Algorithm Design | |||
for Computer System Design, ed. by Ausiello, Lucertini, and Serafini. | for Computer System Design, ed. by Ausiello, Lucertini, and Serafini. | |||
Springer-Verlag, 1984. | Springer-Verlag, 1984. | |||
[CAIDA] Caida Internet Traffic Analysis, http://www.caida.org/home. | [CAIDA] "Caida Internet Traffic Analysis," http://www.caida.org/home. | |||
[DevoFlow] Mogul, J., et al., "DevoFlow: Cost-Effective Flow | [DevoFlow] Mogul, J., et al., "DevoFlow: Cost-Effective Flow | |||
Management for High Performance Enterprise Networks," Proceedings of | Management for High Performance Enterprise Networks," Proceedings of | |||
the ACM SIGCOMM, August 2011. | the ACM SIGCOMM, August 2011. | |||
[FLOW-ACC] Zseby, T., et al., "Packet sampling for flow accounting: | [FLOW-ACC] Zseby, T., et al., "Packet sampling for flow accounting: | |||
challenges and limitations," Proceedings of the 9th international | challenges and limitations," Proceedings of the 9th international | |||
conference on Passive and active network measurement, 2008. | conference on Passive and active network measurement, 2008. | |||
[ID.ietf-rtgwg-cl-requirement] Villamizar, C. et al., "Requirements | [ID.ietf-rtgwg-cl-requirement] Villamizar, C. et al., "Requirements | |||
for MPLS over a Composite Link," September 2013. | for MPLS over a Composite Link," September 2013. | |||
skipping to change at page 23, line 35 | skipping to change at page 24, line 18 | |||
[NDTM] Estan, C. and G. Varghese, "New directions in traffic | [NDTM] Estan, C. and G. Varghese, "New directions in traffic | |||
measurement and accounting," Proceedings of ACM SIGCOMM, August 2002. | measurement and accounting," Proceedings of ACM SIGCOMM, August 2002. | |||
[NVGRE] Sridharan, M. et al., "NVGRE: Network Virtualization using | [NVGRE] Sridharan, M. et al., "NVGRE: Network Virtualization using | |||
Generic Routing Encapsulation," draft-sridharan-virtualization- | Generic Routing Encapsulation," draft-sridharan-virtualization- | |||
nvgre-04, February 2014. | nvgre-04, February 2014. | |||
[RFC 2784] Farinacci, D. et al., "Generic Routing Encapsulation | [RFC 2784] Farinacci, D. et al., "Generic Routing Encapsulation | |||
(GRE)," March 2000. | (GRE)," March 2000. | |||
[RFC 2991] Thaler, D. and C. Hopps, "Multipath Issues in Unicast and | ||||
Multicast," November 2000. | ||||
[RFC 6790] Kompella, K. et al., "The Use of Entropy Labels in MPLS | [RFC 6790] Kompella, K. et al., "The Use of Entropy Labels in MPLS | |||
Forwarding," November 2012. | Forwarding," November 2012. | |||
[RFC 1213] McCloghrie, K., "Management Information Base for Network | [RFC 1213] McCloghrie, K., "Management Information Base for Network | |||
Management of TCP/IP-based internets: MIB-II," March 1991. | Management of TCP/IP-based internets: MIB-II," March 1991. | |||
[RFC 2992] Hopps, C., "Analysis of an Equal-Cost Multi-Path | [RFC 2992] Hopps, C., "Analysis of an Equal-Cost Multi-Path | |||
Algorithm," November 2000. | Algorithm," November 2000. | |||
[RFC 3273] Waldbusser, S., "Remote Network Monitoring Management | [RFC 3273] Waldbusser, S., "Remote Network Monitoring Management | |||
Information Base for High Capacity Networks," July 2002. | Information Base for High Capacity Networks," July 2002. | |||
[RFC 3931] Lau, J. (Ed.), M. Townsley (Ed.), and I. Goyret (Ed.), | ||||
"Layer 2 Tunneling Protocol - Version 3," March 2005. | ||||
[RFC 3954] Claise, B., "Cisco Systems NetFlow Services Export Version | [RFC 3954] Claise, B., "Cisco Systems NetFlow Services Export Version | |||
9," October 2004. | 9," October 2004. | |||
[RFC 5470] G. Sadasivan et al., "Architecture for IP Flow Information | [RFC 5470] G. Sadasivan et al., "Architecture for IP Flow Information | |||
Export," March 2009. | Export," March 2009. | |||
[RFC 5475] Zseby, T. et al., "Sampling and Filtering Techniques for | [RFC 5475] Zseby, T. et al., "Sampling and Filtering Techniques for | |||
IP Packet Selection," March 2009. | IP Packet Selection," March 2009. | |||
[RFC 5640] Filsfils, C., P. Mohapatra, and C. Pignataro, "Load | ||||
Balancing for Mesh Softwires," August 2009. | ||||
[RFC 5681] Allman, M. et al., "TCP Congestion Control," September | [RFC 5681] Allman, M. et al., "TCP Congestion Control," September | |||
2009. | 2009. | |||
[RFC 7011] Claise, B. et al., "Specification of the IP Flow | ||||
Information Export (IPFIX) Protocol for the Exchange of IP Traffic | ||||
Flow Information," September 2013. | ||||
[RFC 7012] Claise, B. and B. Trammell, "Information Model for IP Flow | ||||
Information Export (IPFIX)," September 2013. | ||||
[SAMP-BASIC] Phaal, P. and S. Panchen, "Packet Sampling Basics," | [SAMP-BASIC] Phaal, P. and S. Panchen, "Packet Sampling Basics," | |||
http://www.sflow.org/packetSamplingBasics/. | http://www.sflow.org/packetSamplingBasics/. | |||
[sFlow-LAG] Phaal, P. and A. Ghanwani, "sFlow LAG counters | [sFlow-LAG] Phaal, P. and A. Ghanwani, "sFlow LAG counters | |||
structure," http://www.sflow.org/sflow_lag.txt, September 2012. | structure," http://www.sflow.org/sflow_lag.txt, September 2012. | |||
[sFlow-v5] Phaal, P. and M. Lavine, "sFlow version 5," | [STT] Davie, B. (Ed.) and J. Gross, "A Stateless Transport Tunneling | |||
http://www.sflow.org/sflow_version_5.txt, July 2004. | ||||
[STT] Davie, B. (ed) and J. Gross, "A Stateless Transport Tunneling | ||||
Protocol for Network Virtualization (STT)," draft-davie-stt-06, March | Protocol for Network Virtualization (STT)," draft-davie-stt-06, March | |||
2014. | 2014. | |||
[VXLAN] Mahalingam, M. et al., "VXLAN: A Framework for Overlaying | [VXLAN] Mahalingam, M. et al., "VXLAN: A Framework for Overlaying | |||
Virtualized Layer 2 Networks over Layer 3 Networks," draft- | Virtualized Layer 2 Networks over Layer 3 Networks," draft- | |||
mahalingam-dutt-dcops-vxlan-09, April 2014. | mahalingam-dutt-dcops-vxlan-09, April 2014. | |||
[YONG] Yong, L., "Enhanced ECMP and Large Flow Aware Transport," | [YONG] Yong, L., "Enhanced ECMP and Large Flow Aware Transport," | |||
draft-yong-pwe3-enhance-ecmp-lfat-01, September 2010. | draft-yong-pwe3-enhance-ecmp-lfat-01, September 2010. | |||
End of changes. 34 change blocks. | ||||
69 lines changed or deleted | 90 lines changed or added | |||