draft-ietf-opsawg-large-flow-load-balancing-01.txt | draft-ietf-opsawg-large-flow-load-balancing-02.txt | |||
---|---|---|---|---|
OPSAWG R. Krishnan | OPSAWG R. Krishnan | |||
Internet Draft S. Khanna | Internet Draft S. Khanna | |||
Intended status: Informational Brocade Communications | Intended status: Informational Brocade Communications | |||
Expires: December 23, 2013 L. Yong | Expires: December 25, 2013 L. Yong | |||
June 23, 2013 Huawei USA | June 25, 2013 Huawei USA | |||
A. Ghanwani | A. Ghanwani | |||
Dell | Dell | |||
Ning So | Ning So | |||
Tata Communications | Tata Communications | |||
B. Khasnabish | B. Khasnabish | |||
ZTE Corporation | ZTE Corporation | |||
Mechanisms for Optimal LAG/ECMP Component Link Utilization in | Mechanisms for Optimal LAG/ECMP Component Link Utilization in | |||
Networks | Networks | |||
draft-ietf-opsawg-large-flow-load-balancing-01.txt | draft-ietf-opsawg-large-flow-load-balancing-02.txt | |||
Status of this Memo | Status of this Memo | |||
This Internet-Draft is submitted in full conformance with the | This Internet-Draft is submitted in full conformance with the | |||
provisions of BCP 78 and BCP 79. This document may not be modified, | provisions of BCP 78 and BCP 79. This document may not be modified, | |||
and derivative works of it may not be created, except to publish it | and derivative works of it may not be created, except to publish it | |||
as an RFC and to translate it into languages other than English. | as an RFC and to translate it into languages other than English. | |||
Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
Task Force (IETF), its areas, and its working groups. Note that | Task Force (IETF), its areas, and its working groups. Note that | |||
skipping to change at page 1, line 42 | skipping to change at page 1, line 42 | |||
and may be updated, replaced, or obsoleted by other documents at any | and may be updated, replaced, or obsoleted by other documents at any | |||
time. It is inappropriate to use Internet-Drafts as reference | time. It is inappropriate to use Internet-Drafts as reference | |||
material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
The list of current Internet-Drafts can be accessed at | The list of current Internet-Drafts can be accessed at | |||
http://www.ietf.org/ietf/1id-abstracts.txt | http://www.ietf.org/ietf/1id-abstracts.txt | |||
The list of Internet-Draft Shadow Directories can be accessed at | The list of Internet-Draft Shadow Directories can be accessed at | |||
http://www.ietf.org/shadow.html | http://www.ietf.org/shadow.html | |||
This Internet-Draft will expire on December 23, 2013. | This Internet-Draft will expire on December 25, 2013. | |||
Copyright Notice | Copyright Notice | |||
Copyright (c) 2013 IETF Trust and the persons identified as the | Copyright (c) 2013 IETF Trust and the persons identified as the | |||
document authors. All rights reserved. | document authors. All rights reserved. | |||
This document is subject to BCP 78 and the IETF Trust's Legal | This document is subject to BCP 78 and the IETF Trust's Legal | |||
Provisions Relating to IETF Documents | Provisions Relating to IETF Documents | |||
(http://trustee.ietf.org/license-info) in effect on the date of | (http://trustee.ietf.org/license-info) in effect on the date of | |||
publication of this document. Please review these documents | publication of this document. Please review these documents | |||
skipping to change at page 2, line 32 | skipping to change at page 2, line 32 | |||
of the mechanisms useful for achieving this. | of the mechanisms useful for achieving this. | |||
Table of Contents | Table of Contents | |||
1. Introduction...................................................3 | 1. Introduction...................................................3 | |||
1.1. Acronyms..................................................3 | 1.1. Acronyms..................................................3 | |||
1.2. Terminology...............................................4 | 1.2. Terminology...............................................4 | |||
2. Flow Categorization............................................4 | 2. Flow Categorization............................................4 | |||
3. Hash-based Load Distribution in LAG/ECMP.......................5 | 3. Hash-based Load Distribution in LAG/ECMP.......................5 | |||
4. Mechanisms for Optimal LAG/ECMP Component Link Utilization.....7 | 4. Mechanisms for Optimal LAG/ECMP Component Link Utilization.....7 | |||
4.1. Differences in LAG vs ECMP................................7 | 4.1. Differences in LAG vs ECMP................................8 | |||
4.2. Overview of the mechanism.................................8 | 4.2. Overview of the mechanism.................................9 | |||
4.3. Large Flow Recognition....................................9 | 4.3. Large Flow Recognition...................................10 | |||
4.3.1. Flow Identification..................................9 | 4.3.1. Flow Identification.................................10 | |||
4.3.2. Criteria for Identifying a Large Flow...............10 | 4.3.2. Criteria for Identifying a Large Flow...............10 | |||
4.3.3. Sampling Techniques.................................10 | 4.3.3. Sampling Techniques.................................11 | |||
4.3.4. Automatic Hardware Recognition......................11 | 4.3.4. Automatic Hardware Recognition......................12 | |||
4.4. Load Re-balancing Options................................12 | 4.4. Load Re-balancing Options................................13 | |||
4.4.1. Alternative Placement of Large Flows................12 | 4.4.1. Alternative Placement of Large Flows................13 | |||
4.4.2. Redistributing Small Flows..........................13 | 4.4.2. Redistributing Small Flows..........................13 | |||
4.4.3. Component Link Protection Considerations............13 | 4.4.3. Component Link Protection Considerations............14 | |||
4.4.4. Load Re-balancing Algorithms........................14 | 4.4.4. Load Re-balancing Algorithms........................14 | |||
4.4.5. Load Re-Balancing Example...........................14 | 4.4.5. Load Re-Balancing Example...........................14 | |||
5. Information Model for Flow Re-balancing.......................15 | 5. Information Model for Flow Re-balancing.......................15 | |||
5.1. Configuration Parameters for Flow Re-balancing...........15 | 5.1. Configuration Parameters for Flow Re-balancing...........15 | |||
5.2. System Configuration and Identification Parameters.......16 | 5.2. System Configuration and Identification Parameters.......16 | |||
5.3. Information for Alternative Placement of Large Flows.....16 | 5.3. Information for Alternative Placement of Large Flows.....17 | |||
5.4. Information for Redistribution of Small Flows............17 | 5.4. Information for Redistribution of Small Flows............17 | |||
5.5. Export of Flow Information...............................17 | 5.5. Export of Flow Information...............................17 | |||
5.6. Monitoring information...................................17 | 5.6. Monitoring information...................................18 | |||
5.6.1. Interface (link) utilization........................17 | 5.6.1. Interface (link) utilization........................18 | |||
5.6.2. Other monitoring information........................17 | 5.6.2. Other monitoring information........................18 | |||
6. Operational Considerations....................................18 | 6. Operational Considerations....................................18 | |||
7. IANA Considerations...........................................18 | 7. IANA Considerations...........................................19 | |||
8. Security Considerations.......................................18 | 8. Security Considerations.......................................19 | |||
9. Acknowledgements..............................................18 | 9. Acknowledgements..............................................19 | |||
10. References...................................................19 | 10. References...................................................20 | |||
10.1. Normative References....................................19 | 10.1. Normative References....................................20 | |||
10.2. Informative References..................................19 | 10.2. Informative References..................................20 | |||
1. Introduction | 1. Introduction | |||
Networks extensively use LAG/ECMP techniques for capacity scaling. | Networks extensively use LAG/ECMP techniques for capacity scaling. | |||
Network traffic can be predominantly categorized into two traffic | Network traffic can be predominantly categorized into two traffic | |||
types: long-lived large flows and other flows (which include long- | types: long-lived large flows and other flows (which include long- | |||
lived small flows, short-lived small/large flows). Stateless hash- | lived small flows, short-lived small/large flows). Stateless hash- | |||
based techniques [ITCOM, RFC 2991, RFC 2992, RFC 6790] are often used | based techniques [ITCOM, RFC 2991, RFC 2992, RFC 6790] are often used | |||
to distribute both long-lived large flows and other flows over the | to distribute both long-lived large flows and other flows over the | |||
component links in a LAG/ECMP. However, the traffic may not be evenly | component links in a LAG/ECMP. However, the traffic may not be evenly | |||
skipping to change at page 8, line 37 | skipping to change at page 9, line 18 | |||
may not apply equally to unicast and multicast traffic because of the | may not apply equally to unicast and multicast traffic because of the | |||
way multicast trees are constructed. | way multicast trees are constructed. | |||
4.2. Overview of the mechanism | 4.2. Overview of the mechanism | |||
The various steps in achieving optimal LAG/ECMP component link | The various steps in achieving optimal LAG/ECMP component link | |||
utilization in networks are detailed below: | utilization in networks are detailed below: | |||
Step 1) This involves large flow recognition in routers and | Step 1) This involves large flow recognition in routers and | |||
maintaining the mapping of the large flow to the component link that | maintaining the mapping of the large flow to the component link that | |||
it uses. The recognition of large flows is explained in Section 3.1. | it uses. The recognition of large flows is explained in Section 4.3. | |||
Step 2) The egress component links are periodically scanned for link | Step 2) The egress component links are periodically scanned for link | |||
utilization. If the egress component link utilization exceeds a pre- | utilization. If the egress component link utilization exceeds a pre- | |||
programmed threshold, an operator alert is generated. The large flows | programmed threshold, an operator alert is generated. The large flows | |||
mapped to the congested egress component link are exported to a | mapped to the congested egress component link are exported to a | |||
central management entity. | central management entity. | |||
Step 3) On receiving the alert about the congested component link, | Step 3) On receiving the alert about the congested component link, | |||
the operator, through a central management entity, finds the large | the operator, through a central management entity, finds the large | |||
flows mapped to that component link and the LAG/ECMP group to which | flows mapped to that component link and the LAG/ECMP group to which | |||
skipping to change at page 9, line 19 | skipping to change at page 9, line 45 | |||
one of the following actions: | one of the following actions: | |||
1) Indicate specific large flows to rebalance; | 1) Indicate specific large flows to rebalance; | |||
2) Have the router decide the best large flows to rebalance; | 2) Have the router decide the best large flows to rebalance; | |||
3) Have the router redistribute all the small flows on the | 3) Have the router redistribute all the small flows on the | |||
congested link to other component links in the group. | congested link to other component links in the group. | |||
The central management entity conveys the above information to the | The central management entity conveys the above information to the | |||
router. The load re-balancing options are explained in Section 3.2. | router. The load re-balancing options are explained in Section 4.4. | |||
Steps 2) to 4) could be automated if desired. | Steps 2) to 4) could be automated if desired. | |||
Providing large flow information to a central management entity | Providing large flow information to a central management entity | |||
provides the capability to further optimize flow distribution with | provides the capability to further optimize flow distribution with | |||
multi-node visibility. Consider the following example. A router may | multi-node visibility. Consider the following example. A router may | |||
have 3 ECMP nexthops that lead down paths P1, P2, and P3. A couple | have 3 ECMP nexthops that lead down paths P1, P2, and P3. A couple | |||
of hops downstream on P1 may be congested, while P2 and P3 may be | of hops downstream on P1 may be congested, while P2 and P3 may be | |||
under-utilized, which the local router does not have visibility into. | under-utilized, which the local router does not have visibility into. | |||
With the help of a central management entity, the operator could | With the help of a central management entity, the operator could | |||
skipping to change at page 10, line 44 | skipping to change at page 11, line 23 | |||
Various techniques to identify a large flow are described below. | Various techniques to identify a large flow are described below. | |||
4.3.3. Sampling Techniques | 4.3.3. Sampling Techniques | |||
A number of routers support sampling techniques such as sFlow [sFlow- | A number of routers support sampling techniques such as sFlow [sFlow- | |||
v5, sFlow-LAG], PSAMP [RFC 5475] and Netflow Sampling [RFC 3954]. | v5, sFlow-LAG], PSAMP [RFC 5475] and Netflow Sampling [RFC 3954]. | |||
For the purpose of large flow identification, sampling must be | For the purpose of large flow identification, sampling must be | |||
enabled on all of the egress ports in the router where such | enabled on all of the egress ports in the router where such | |||
measurements are desired. | measurements are desired. | |||
Using sflow as an example, processing in a sFlow collector will | Using sflow as an example, processing in an sFlow collector will | |||
provide an approximate indication of the large flows mapping to each | provide an approximate indication of the large flows mapping to each | |||
of the component links in each LAG/ECMP group. It is possible to | of the component links in each LAG/ECMP group. It is possible to | |||
implement this part of the collector function in the control plane of | implement this part of the collector function in the control plane of | |||
the router, reducing dependence on an external management station, | the router, reducing dependence on an external management station, | |||
assuming sufficient control plane resources are available. | assuming sufficient control plane resources are available. | |||
If egress sampling is not available, ingress sampling can suffice | If egress sampling is not available, ingress sampling can suffice | |||
since the central management entity used by the sampling technique | since the central management entity used by the sampling technique | |||
typically has multi-node visibility and can use the samples from an | typically has multi-node visibility and can use the samples from an | |||
immediately downstream node to make measurements for egress traffic | immediately downstream node to make measurements for egress traffic | |||
skipping to change at page 13, line 36 | skipping to change at page 14, line 16 | |||
prevent, or reduce the probability, that the small flow hashes into | prevent, or reduce the probability, that the small flow hashes into | |||
the congested component link(s). | the congested component link(s). | |||
. The LAG/ECMP table is modified to include only non-congested | . The LAG/ECMP table is modified to include only non-congested | |||
component link(s). Small flows hash into this table to be mapped | component link(s). Small flows hash into this table to be mapped | |||
to a destination component link. Alternatively, if certain | to a destination component link. Alternatively, if certain | |||
component links are heavily loaded, but not congested, the | component links are heavily loaded, but not congested, the | |||
output of the hash function can be adjusted to account for large | output of the hash function can be adjusted to account for large | |||
flow loading on each of the component links. | flow loading on each of the component links. | |||
. The PBR rules for large flows (refer to Section 3.2.1) must | . The PBR rules for large flows (refer to Section 4.4.1) must | |||
have strict precedence over the LAG/ECMP table lookup result. | have strict precedence over the LAG/ECMP table lookup result. | |||
With this approach the small flows that are moved would be subject to | With this approach the small flows that are moved would be subject to | |||
reordering. | reordering. | |||
4.4.3. Component Link Protection Considerations | 4.4.3. Component Link Protection Considerations | |||
If desired, certain component links may be reserved for link | If desired, certain component links may be reserved for link | |||
protection. These reserved component links are not used for any flows | protection. These reserved component links are not used for any flows | |||
in the absence of any failures. In the case when the component | in the absence of any failures. In the case when the component | |||
skipping to change at page 14, line 19 | skipping to change at page 14, line 45 | |||
Specific algorithms for placement of large flows are out of scope of | Specific algorithms for placement of large flows are out of scope of | |||
this document. One possibility is to formulate the problem for large | this document. One possibility is to formulate the problem for large | |||
flow placement as the well-known bin-packing problem and make use of | flow placement as the well-known bin-packing problem and make use of | |||
the various heuristics that are available for that problem [bin- | the various heuristics that are available for that problem [bin- | |||
pack]. | pack]. | |||
4.4.5. Load Re-Balancing Example | 4.4.5. Load Re-Balancing Example | |||
Optimal LAG/ECMP component utilization for the use case in Figure 2 | Optimal LAG/ECMP component utilization for the use case in Figure 2 | |||
is depicted below in Figure 3. The large flow rebalancing explained | is depicted below in Figure 3. The large flow rebalancing explained | |||
in Section 3.2.1 is used. The improved link utilization is as | in Section 4.4 is used. The improved link utilization is as follows: | |||
follows: | ||||
. Component link (1) has 3 flows -- 2 small flows and 1 large | . Component link (1) has 3 flows -- 2 small flows and 1 large | |||
flow -- and the link utilization is normal. | flow -- and the link utilization is normal. | |||
. Component link (2) has 4 flows -- 3 small flows and 1 large | . Component link (2) has 4 flows -- 3 small flows and 1 large | |||
flow -- and the link utilization is normal now. | flow -- and the link utilization is normal now. | |||
. Component link (3) has 3 flows -- 2 small flows and 1 large | . Component link (3) has 3 flows -- 2 small flows and 1 large | |||
flow -- and the link utilization is normal now. | flow -- and the link utilization is normal now. | |||
skipping to change at page 15, line 7 | skipping to change at page 15, line 29 | |||
| |=====> | | | | |=====> | | | |||
| (3)|--/---/-|(3) | | | (3)|--/---/-|(3) | | |||
| | | | | | | | | | |||
+-----------+ +-----------+ | +-----------+ +-----------+ | |||
Where: ->-> small flows | Where: ->-> small flows | |||
===> large flow | ===> large flow | |||
Figure 3: Evenly utilized Composite Links | Figure 3: Evenly utilized Composite Links | |||
Basically, the use of the mechanisms described in Section 3.2.1 | Basically, the use of the mechanisms described in Section 4.4.1 | |||
resulted in a rebalancing of flows where one of the large flows on | resulted in a rebalancing of flows where one of the large flows on | |||
component link (3) which was previously congested was moved to | component link (3) which was previously congested was moved to | |||
component link (2) which was previously under-utilized. | component link (2) which was previously under-utilized. | |||
5. Information Model for Flow Re-balancing | 5. Information Model for Flow Re-balancing | |||
5.1. Configuration Parameters for Flow Re-balancing | 5.1. Configuration Parameters for Flow Re-balancing | |||
The following parameters are required for the configuration of this | The following parameters are required for the configuration of this | |||
feature: | feature: | |||
skipping to change at page 16, line 30 | skipping to change at page 17, line 8 | |||
identification parameters to the LAG) and will be required when | identification parameters to the LAG) and will be required when | |||
specifying flow placement to achieve the desired rebalancing. | specifying flow placement to achieve the desired rebalancing. | |||
. Component Link ID: Identifies the component link within a LAG. | . Component Link ID: Identifies the component link within a LAG. | |||
This is required when specifying flow placement to achieve the | This is required when specifying flow placement to achieve the | |||
desired rebalancing. | desired rebalancing. | |||
5.3. Information for Alternative Placement of Large Flows | 5.3. Information for Alternative Placement of Large Flows | |||
In cases where large flow recognition is handled by an external | In cases where large flow recognition is handled by an external | |||
management station (see Section 3.1.3), an information model for | management station (see Section 4.3.3), an information model for | |||
flows is required to allow the import of large flow information to | flows is required to allow the import of large flow information to | |||
the router. | the router. | |||
The following are some of the elements of the information model for | The following are some of the elements of the information model for | |||
importing of flows: | importing of flows: | |||
. Layer 2: source MAC address, destination MAC address, VLAN ID. | . Layer 2: source MAC address, destination MAC address, VLAN ID. | |||
. Layer 3 IP: IP Protocol, IP source address, IP destination | . Layer 3 IP: IP Protocol, IP source address, IP destination | |||
address, flow label (IPv6 only), TCP/UDP source port, TCP/UDP | address, flow label (IPv6 only), TCP/UDP source port, TCP/UDP | |||
skipping to change at page 17, line 17 | skipping to change at page 17, line 41 | |||
the target component link for the flow. | the target component link for the flow. | |||
5.4. Information for Redistribution of Small Flows | 5.4. Information for Redistribution of Small Flows | |||
For small flows, the LAG ID and the component link IDs along with the | For small flows, the LAG ID and the component link IDs along with the | |||
percentage of traffic to be assigned to each component link ID is | percentage of traffic to be assigned to each component link ID is | |||
required. | required. | |||
5.5. Export of Flow Information | 5.5. Export of Flow Information | |||
Exporting flow information is required when large flow identification | Exporting large flow information is required when large flow | |||
is being done on a router, but the decision to rebalance is being | recognition is being done on a router, but the decision to rebalance | |||
made in an external management station. | is being made in an external management station. Large flow | |||
information includes flow identification and the component link ID | ||||
that the flow currently is assigned to. Other information such as | ||||
flow QoS and bandwidth may be exported too. | ||||
It is recommended to use IPFIX protocol [RFC 5101] for exporting of | The IPFIX information model [RFC 5101] can be leveraged for large | |||
large flows from the router to an external management station. | flow identification. | |||
5.6. Monitoring information | 5.6. Monitoring information | |||
5.6.1. Interface (link) utilization | 5.6.1. Interface (link) utilization | |||
The incoming bytes (ifInOctets), outgoing bytes (ifOutOctets) and | The incoming bytes (ifInOctets), outgoing bytes (ifOutOctets) and | |||
interface speed (ifSpeed) can be measured from the Interface table | interface speed (ifSpeed) can be measured from the Interface table | |||
(ifTable) MIB [RFC 1213]. | (ifTable) MIB [RFC 1213]. | |||
The link utilization can then be computed as follows: | The link utilization can then be computed as follows: | |||
Incoming link utilization = (ifInOctets * 8 / ifSpeed) | Incoming link utilization = (ifInOctets * 8 / ifSpeed) | |||
Outgoing link utilization = (ifOutOctets * 8 / ifSpeed) | Outgoing link utilization = (ifOutOctets * 8 / ifSpeed) | |||
For high speed links, the etherStatsHighCapacityTable MIB [RFC 3273] | For high speed links, the etherStatsHighCapacityTable MIB [RFC 3273] | |||
can be used. | can be used. | |||
For further scalability, it is recommended to use the counter push | ||||
mechanism in [sFlow-v5] for the interface counters; this would help | ||||
avoid counter polling through the MIB interface. | ||||
The outgoing link utilization of the component links within a LAG can | The outgoing link utilization of the component links within a LAG can | |||
be used to compute the imbalance threshold (See Section 5.1) for the | be used to compute the imbalance threshold (See Section 5.1) for the | |||
LAG. | LAG. | |||
5.6.2. Other monitoring information | 5.6.2. Other monitoring information | |||
Additional monitoring information includes: | Additional monitoring information includes: | |||
. Number of times rebalancing was done. | . Number of times rebalancing was done. | |||
skipping to change at page 20, line 27 | skipping to change at page 21, line 23 | |||
the ACM SIGCOMM, August 2011. | the ACM SIGCOMM, August 2011. | |||
[NDTM] Estan, C. and G. Varghese, "New directions in traffic | [NDTM] Estan, C. and G. Varghese, "New directions in traffic | |||
measurement and accounting," Proceedings of ACM SIGCOMM, August 2002. | measurement and accounting," Proceedings of ACM SIGCOMM, August 2002. | |||
[bin-pack] Coffman, Jr., E., M. Garey, and D. Johnson. Approximation | [bin-pack] Coffman, Jr., E., M. Garey, and D. Johnson. Approximation | |||
Algorithms for Bin-Packing -- An Updated Survey. In Algorithm Design | Algorithms for Bin-Packing -- An Updated Survey. In Algorithm Design | |||
for Computer System Design, ed. by Ausiello, Lucertini, and Serafini. | for Computer System Design, ed. by Ausiello, Lucertini, and Serafini. | |||
Springer-Verlag, 1984. | Springer-Verlag, 1984. | |||
Appendix A. Internet Traffic Analysis and Load Balancing Simulation | Appendix A. Internet Traffic Analysis and Load Balancing Simulation | |||
Internet traffic [CAIDA] has been analyzed to obtain flow statistics | Internet traffic [CAIDA] has been analyzed to obtain flow statistics | |||
such as the number of packets in a flow and the flow duration. The | such as the number of packets in a flow and the flow duration. The | |||
5-tuple in the packet header (IP addresses, TCP/UDP ports, and IP | 5-tuple in the packet header (IP addresses, TCP/UDP ports, and IP | |||
protocol) is used for flow identification. The analysis indicates | protocol) is used for flow identification. The analysis indicates | |||
that < ~2% of the flows take ~30% of total traffic volume while the | that < ~2% of the flows take ~30% of total traffic volume while the | |||
rest of the flows (> ~98%) contribute ~70% [YONG]. | rest of the flows (> ~98%) contribute ~70% [YONG]. | |||
The simulation has shown that given the Internet traffic pattern, the | The simulation has shown that given the Internet traffic pattern, the | |||
hash-based technique does not evenly distribute the flows over ECMP | hash-based technique does not evenly distribute the flows over ECMP | |||
End of changes. 20 change blocks. | ||||
37 lines changed or deleted | 43 lines changed or added | |||
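The link utilization computation in Section 5.6.1 and the threshold check in Step 2 of Section 4.2 can be illustrated with a minimal Python sketch. It is not part of either draft revision. It assumes that ifInOctets and ifOutOctets are cumulative 32-bit counters read at two points in time, so the octet count used in the formula is the difference between two readings divided by the polling interval. The names `link_utilization`, `congested_links`, and the threshold value are hypothetical and chosen only for this example.

```python
# Illustrative sketch only; function names and the threshold are assumptions,
# not taken from the draft.

COUNTER32_MODULUS = 2 ** 32  # assumed wrap point of a 32-bit octet counter


def link_utilization(prev_octets, curr_octets, interval_s, if_speed_bps,
                     modulus=COUNTER32_MODULUS):
    """Return utilization as a fraction of link speed over one polling interval.

    prev_octets and curr_octets are two successive readings of ifInOctets
    (or ifOutOctets); if_speed_bps is ifSpeed in bits per second.
    """
    if interval_s <= 0 or if_speed_bps <= 0:
        raise ValueError("interval and interface speed must be positive")
    # Modular subtraction tolerates a single counter wrap between readings.
    delta_octets = (curr_octets - prev_octets) % modulus
    return delta_octets * 8 / (interval_s * if_speed_bps)


def congested_links(utilization_by_link, threshold):
    """Return the component link IDs whose utilization exceeds the threshold."""
    return sorted(link_id for link_id, util in utilization_by_link.items()
                  if util > threshold)


if __name__ == "__main__":
    # 125,000,000 octets sent in 10 seconds on a 1 Gb/s component link.
    util = link_utilization(0, 125_000_000, 10, 1_000_000_000)
    print(f"outgoing utilization: {util:.2f}")
    # Hypothetical per-link utilizations for a three-link LAG, 0.9 threshold.
    print(congested_links({1: 0.40, 2: 0.95, 3: 0.70}, threshold=0.9))
```

A management station could run the same check on each polling cycle and raise the operator alert for every link ID that is returned. For high-speed links, where a 32-bit counter can wrap more than once per interval, the high-capacity counters mentioned in Section 5.6.1 would be the safer input, with `modulus` set to 2 ** 64.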