draft-ietf-opsawg-large-flow-load-balancing-01.txt   draft-ietf-opsawg-large-flow-load-balancing-02.txt 
OPSAWG R. Krishnan OPSAWG R. Krishnan
Internet Draft S. Khanna Internet Draft S. Khanna
Intended status: Informational Brocade Communications Intended status: Informational Brocade Communications
Expires: December 23, 2013 L. Yong Expires: December 25, 2013 L. Yong
June 23, 2013 Huawei USA June 25, 2013 Huawei USA
A. Ghanwani A. Ghanwani
Dell Dell
Ning So Ning So
Tata Communications Tata Communications
B. Khasnabish B. Khasnabish
ZTE Corporation ZTE Corporation
Mechanisms for Optimal LAG/ECMP Component Link Utilization in Mechanisms for Optimal LAG/ECMP Component Link Utilization in
Networks Networks
draft-ietf-opsawg-large-flow-load-balancing-01.txt draft-ietf-opsawg-large-flow-load-balancing-02.txt
Status of this Memo Status of this Memo
This Internet-Draft is submitted in full conformance with the This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79. This document may not be modified, provisions of BCP 78 and BCP 79. This document may not be modified,
and derivative works of it may not be created, except to publish it and derivative works of it may not be created, except to publish it
as an RFC and to translate it into languages other than English. as an RFC and to translate it into languages other than English.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that Task Force (IETF), its areas, and its working groups. Note that
skipping to change at page 1, line 42 skipping to change at page 1, line 42
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html http://www.ietf.org/shadow.html
This Internet-Draft will expire on December 23, 2013. This Internet-Draft will expire on December 25, 2013.
Copyright Notice Copyright Notice
Copyright (c) 2013 IETF Trust and the persons identified as the Copyright (c) 2013 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of (http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents publication of this document. Please review these documents
skipping to change at page 2, line 32 skipping to change at page 2, line 32
of the mechanisms useful for achieving this. of the mechanisms useful for achieving this.
Table of Contents Table of Contents
1. Introduction...................................................3 1. Introduction...................................................3
1.1. Acronyms..................................................3 1.1. Acronyms..................................................3
1.2. Terminology...............................................4 1.2. Terminology...............................................4
2. Flow Categorization............................................4 2. Flow Categorization............................................4
3. Hash-based Load Distribution in LAG/ECMP.......................5 3. Hash-based Load Distribution in LAG/ECMP.......................5
4. Mechanisms for Optimal LAG/ECMP Component Link Utilization.....7 4. Mechanisms for Optimal LAG/ECMP Component Link Utilization.....7
4.1. Differences in LAG vs ECMP................................7 4.1. Differences in LAG vs ECMP................................8
4.2. Overview of the mechanism.................................8 4.2. Overview of the mechanism.................................9
4.3. Large Flow Recognition....................................9 4.3. Large Flow Recognition...................................10
4.3.1. Flow Identification..................................9 4.3.1. Flow Identification.................................10
4.3.2. Criteria for Identifying a Large Flow...............10 4.3.2. Criteria for Identifying a Large Flow...............10
4.3.3. Sampling Techniques.................................10 4.3.3. Sampling Techniques.................................11
4.3.4. Automatic Hardware Recognition......................11 4.3.4. Automatic Hardware Recognition......................12
4.4. Load Re-balancing Options................................12 4.4. Load Re-balancing Options................................13
4.4.1. Alternative Placement of Large Flows................12 4.4.1. Alternative Placement of Large Flows................13
4.4.2. Redistributing Small Flows..........................13 4.4.2. Redistributing Small Flows..........................13
4.4.3. Component Link Protection Considerations............13 4.4.3. Component Link Protection Considerations............14
4.4.4. Load Re-balancing Algorithms........................14 4.4.4. Load Re-balancing Algorithms........................14
4.4.5. Load Re-Balancing Example...........................14 4.4.5. Load Re-Balancing Example...........................14
5. Information Model for Flow Re-balancing.......................15 5. Information Model for Flow Re-balancing.......................15
5.1. Configuration Parameters for Flow Re-balancing...........15 5.1. Configuration Parameters for Flow Re-balancing...........15
5.2. System Configuration and Identification Parameters.......16 5.2. System Configuration and Identification Parameters.......16
5.3. Information for Alternative Placement of Large Flows.....16 5.3. Information for Alternative Placement of Large Flows.....17
5.4. Information for Redistribution of Small Flows............17 5.4. Information for Redistribution of Small Flows............17
5.5. Export of Flow Information...............................17 5.5. Export of Flow Information...............................17
5.6. Monitoring information...................................17 5.6. Monitoring information...................................18
5.6.1. Interface (link) utilization........................17 5.6.1. Interface (link) utilization........................18
5.6.2. Other monitoring information........................17 5.6.2. Other monitoring information........................18
6. Operational Considerations....................................18 6. Operational Considerations....................................18
7. IANA Considerations...........................................18 7. IANA Considerations...........................................19
8. Security Considerations.......................................18 8. Security Considerations.......................................19
9. Acknowledgements..............................................18 9. Acknowledgements..............................................19
10. References...................................................19 10. References...................................................20
10.1. Normative References....................................19 10.1. Normative References....................................20
10.2. Informative References..................................19 10.2. Informative References..................................20
1. Introduction 1. Introduction
Networks extensively use LAG/ECMP techniques for capacity scaling. Networks extensively use LAG/ECMP techniques for capacity scaling.
Network traffic can be predominantly categorized into two traffic Network traffic can be predominantly categorized into two traffic
types: long-lived large flows and other flows (which include long- types: long-lived large flows and other flows (which include long-
lived small flows, short-lived small/large flows). Stateless hash- lived small flows, short-lived small/large flows). Stateless hash-
based techniques [ITCOM, RFC 2991, RFC 2992, RFC 6790] are often used based techniques [ITCOM, RFC 2991, RFC 2992, RFC 6790] are often used
to distribute both long-lived large flows and other flows over the to distribute both long-lived large flows and other flows over the
component links in a LAG/ECMP. However the traffic may not be evenly component links in a LAG/ECMP. However the traffic may not be evenly
skipping to change at page 8, line 37 skipping to change at page 9, line 18
may not apply equally to unicast and multicast traffic because of the may not apply equally to unicast and multicast traffic because of the
way multicast trees are constructed. way multicast trees are constructed.
4.2. Overview of the mechanism 4.2. Overview of the mechanism
The various steps in achieving optimal LAG/ECMP component link The various steps in achieving optimal LAG/ECMP component link
utilization in networks are detailed below: utilization in networks are detailed below:
Step 1) This involves large flow recognition in routers and Step 1) This involves large flow recognition in routers and
maintaining the mapping of the large flow to the component link that maintaining the mapping of the large flow to the component link that
it uses. The recognition of large flows is explained in Section 3.1. it uses. The recognition of large flows is explained in Section 4.3.
Step 2) The egress component links are periodically scanned for link Step 2) The egress component links are periodically scanned for link
utilization. If the egress component link utilization exceeds a pre- utilization. If the egress component link utilization exceeds a pre-
programmed threshold, an operator alert is generated. The large flows programmed threshold, an operator alert is generated. The large flows
mapped to the congested egress component link are exported to a mapped to the congested egress component link are exported to a
central management entity. central management entity.
Step 3) On receiving the alert about the congested component link, Step 3) On receiving the alert about the congested component link,
the operator, through a central management entity, finds the large the operator, through a central management entity, finds the large
flows mapped to that component link and the LAG/ECMP group to which flows mapped to that component link and the LAG/ECMP group to which
skipping to change at page 9, line 19 skipping to change at page 9, line 45
one of the following actions: one of the following actions:
1) Indicate specific large flows to rebalance; 1) Indicate specific large flows to rebalance;
2) Have the router decide the best large flows to rebalance; 2) Have the router decide the best large flows to rebalance;
3) Have the router redistribute all the small flows on the 3) Have the router redistribute all the small flows on the
congested link to other component links in the group. congested link to other component links in the group.
The central management entity conveys the above information to the The central management entity conveys the above information to the
router. The load re-balancing options are explained in Section 3.2. router. The load re-balancing options are explained in Section 4.4.
Steps 2) to 4) could be automated if desired. Steps 2) to 4) could be automated if desired.
Providing large flow information to a central management entity Providing large flow information to a central management entity
provides the capability to further optimize flow distribution with provides the capability to further optimize flow distribution with
multi-node visibility. Consider the following example. A router may multi-node visibility. Consider the following example. A router may
have 3 ECMP nexthops that lead down paths P1, P2, and P3. A couple have 3 ECMP nexthops that lead down paths P1, P2, and P3. A couple
of hops downstream on P1 may be congested, while P2 and P3 may be of hops downstream on P1 may be congested, while P2 and P3 may be
under-utilized, which the local router does not have visibility into. under-utilized, which the local router does not have visibility into.
With the help of a central management entity, the operator could With the help of a central management entity, the operator could
skipping to change at page 10, line 44 skipping to change at page 11, line 23
Various techniques to identify a large flow are described below. Various techniques to identify a large flow are described below.
4.3.3. Sampling Techniques 4.3.3. Sampling Techniques
A number of routers support sampling techniques such as sFlow [sFlow- A number of routers support sampling techniques such as sFlow [sFlow-
v5, sFlow-LAG], PSAMP [RFC 5475] and Netflow Sampling [RFC 3954]. v5, sFlow-LAG], PSAMP [RFC 5475] and Netflow Sampling [RFC 3954].
For the purpose of large flow identification, sampling must be For the purpose of large flow identification, sampling must be
enabled on all of the egress ports in the router where such enabled on all of the egress ports in the router where such
measurements are desired. measurements are desired.
Using sflow as an example, processing in a sFlow collector will Using sflow as an example, processing in an sFlow collector will
provide an approximate indication of the large flows mapping to each provide an approximate indication of the large flows mapping to each
of the component links in each LAG/ECMP group. It is possible to of the component links in each LAG/ECMP group. It is possible to
implement this part of the collector function in the control plane of implement this part of the collector function in the control plane of
the router reducing dependence on an external management station, the router reducing dependence on an external management station,
assuming sufficient control plane resources are available. assuming sufficient control plane resources are available.
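As a rough illustration of the collector-side processing described above, the following Python sketch scales sampled packet lengths by the sampling rate to estimate per-flow bandwidth on each component link. The record layout, the function name estimate_large_flows, and the simple 1-in-N scaling are assumptions made for this example; they are not defined by sFlow or by this draft.

```python
from collections import defaultdict


def estimate_large_flows(samples, sampling_rate, interval_s, threshold_bps):
    """Estimate per-flow bandwidth from packet samples (illustrative only).

    samples: iterable of (component_link, flow_key, packet_bytes) tuples,
             a hypothetical record layout for this sketch.
    sampling_rate: N, where 1 in N packets is sampled.
    interval_s: length of the observation window in seconds.
    threshold_bps: estimated rate at or above which a flow is called large.
    Returns {component_link: {flow_key: estimated_bps}} for large flows.
    """
    sampled_bytes = defaultdict(int)
    for link, flow, nbytes in samples:
        sampled_bytes[(link, flow)] += nbytes

    large = {}
    for (link, flow), nbytes in sampled_bytes.items():
        # Scale the sampled volume up by the sampling rate to approximate
        # the real volume, then convert to bits per second.
        est_bps = nbytes * 8 * sampling_rate / interval_s
        if est_bps >= threshold_bps:
            large.setdefault(link, {})[flow] = est_bps
    return large
```

Because the estimate comes from samples, it is only approximate, and the error grows for flows with few sampled packets in the window.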
If egress sampling is not available, ingress sampling can suffice If egress sampling is not available, ingress sampling can suffice
since the central management entity used by the sampling technique since the central management entity used by the sampling technique
typically has multi-node visibility and can use the samples from an typically has multi-node visibility and can use the samples from an
immediately downstream node to make measurements for egress traffic immediately downstream node to make measurements for egress traffic
skipping to change at page 13, line 36 skipping to change at page 14, line 16
prevent, or reduce the probability, that the small flow hashes into prevent, or reduce the probability, that the small flow hashes into
the congested component link(s). the congested component link(s).
. The LAG/ECMP table is modified to include only non-congested . The LAG/ECMP table is modified to include only non-congested
component link(s). Small flows hash into this table to be mapped component link(s). Small flows hash into this table to be mapped
to a destination component link. Alternatively, if certain to a destination component link. Alternatively, if certain
component links are heavily loaded, but not congested, the component links are heavily loaded, but not congested, the
output of the hash function can be adjusted to account for large output of the hash function can be adjusted to account for large
flow loading on each of the component links. flow loading on each of the component links.
. The PBR rules for large flows (refer to Section 3.2.1) must . The PBR rules for large flows (refer to Section 4.4.1) must
have strict precedence over the LAG/ECMP table lookup result. have strict precedence over the LAG/ECMP table lookup result.
With this approach the small flows that are moved would be subject to With this approach the small flows that are moved would be subject to
reordering. reordering.
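The table modification and precedence rule above can be sketched as follows in Python. This is a minimal sketch under stated assumptions: the names build_table and select_link, the use of a dictionary for PBR rules, and CRC32 as the hash are all hypothetical choices for illustration, not what any particular router implements.

```python
import zlib


def build_table(component_links, congested_links):
    """Return a LAG/ECMP table holding only non-congested component links."""
    table = [link for link in component_links if link not in congested_links]
    if not table:
        raise ValueError("no non-congested component links available")
    return table


def select_link(flow_key, table, pbr_rules):
    """Pick the egress component link for a flow.

    pbr_rules maps large-flow keys to a pinned component link and is
    consulted first, so it has strict precedence over the hash lookup.
    """
    if flow_key in pbr_rules:
        return pbr_rules[flow_key]
    # Stateless hash of the flow key; CRC32 is an assumption of this sketch.
    index = zlib.crc32(repr(flow_key).encode()) % len(table)
    return table[index]
```

Rebuilding the table changes the modulus, so small flows that were not on the congested link may also move; this is consistent with the reordering noted above.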
4.4.3. Component Link Protection Considerations 4.4.3. Component Link Protection Considerations
If desired, certain component links may be reserved for link If desired, certain component links may be reserved for link
protection. These reserved component links are not used for any flows protection. These reserved component links are not used for any flows
in the absence of any failures. In the case when the component in the absence of any failures. In the case when the component
skipping to change at page 14, line 19 skipping to change at page 14, line 45
Specific algorithms for placement of large flows are out of scope of Specific algorithms for placement of large flows are out of scope of
this document. One possibility is to formulate the problem for large this document. One possibility is to formulate the problem for large
flow placement as the well-known bin-packing problem and make use of flow placement as the well-known bin-packing problem and make use of
the various heuristics that are available for that problem [bin- the various heuristics that are available for that problem [bin-
pack]. pack].
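To make the bin-packing formulation concrete, here is a minimal Python sketch of one common heuristic, first-fit decreasing. It is only one of many possible heuristics and is not prescribed by this document; the function name place_large_flows and the input shapes are assumptions for illustration.

```python
def place_large_flows(flows, spare_capacity):
    """First-fit-decreasing placement of large flows (illustrative only).

    flows: {flow_id: bandwidth} for the large flows to place.
    spare_capacity: {link_id: available bandwidth}, in the same units.
    Returns (placement, unplaced), where placement maps flow_id to link_id
    and unplaced lists flows that fit on no component link.
    """
    spare = dict(spare_capacity)  # work on a copy; leave the input untouched
    placement = {}
    unplaced = []
    # Consider the largest flows first, since they are the hardest to fit.
    for flow_id, bw in sorted(flows.items(), key=lambda kv: kv[1], reverse=True):
        for link_id in spare:
            if spare[link_id] >= bw:
                spare[link_id] -= bw
                placement[flow_id] = link_id
                break
        else:
            unplaced.append(flow_id)
    return placement, unplaced
```

For example, flows of 6, 5, 4 and 3 units on two links with 10 units of spare capacity each are placed as 6+4 on the first link and 5+3 on the second.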
4.4.5. Load Re-Balancing Example 4.4.5. Load Re-Balancing Example
Optimal LAG/ECMP component utilization for the use case in Figure 2 Optimal LAG/ECMP component utilization for the use case in Figure 2
is depicted below in Figure 3. The large flow rebalancing explained is depicted below in Figure 3. The large flow rebalancing explained
in Section 3.2.1 is used. The improved link utilization is as in Section 4.4 is used. The improved link utilization is as follows:
follows:
. Component link (1) has 3 flows -- 2 small flows and 1 large . Component link (1) has 3 flows -- 2 small flows and 1 large
flow -- and the link utilization is normal. flow -- and the link utilization is normal.
. Component link (2) has 4 flows -- 3 small flows and 1 large . Component link (2) has 4 flows -- 3 small flows and 1 large
flow -- and the link utilization is normal now. flow -- and the link utilization is normal now.
. Component link (3) has 3 flows -- 2 small flows and 1 large . Component link (3) has 3 flows -- 2 small flows and 1 large
flow -- and the link utilization is normal now. flow -- and the link utilization is normal now.
skipping to change at page 15, line 7 skipping to change at page 15, line 29
| |=====> | | | |=====> | |
| (3)|--/---/-|(3) | | (3)|--/---/-|(3) |
| | | | | | | |
+-----------+ +-----------+ +-----------+ +-----------+
Where: ->-> small flows Where: ->-> small flows
===> large flow ===> large flow
Figure 3: Evenly utilized Composite Links Figure 3: Evenly utilized Composite Links
Basically, the use of the mechanisms described in Section 3.2.1 Basically, the use of the mechanisms described in Section 4.4.1
resulted in a rebalancing of flows where one of the large flows on resulted in a rebalancing of flows where one of the large flows on
component link (3) which was previously congested was moved to component link (3) which was previously congested was moved to
component link (2) which was previously under-utilized. component link (2) which was previously under-utilized.
5. Information Model for Flow Re-balancing 5. Information Model for Flow Re-balancing
5.1. Configuration Parameters for Flow Re-balancing 5.1. Configuration Parameters for Flow Re-balancing
The following parameters are required for the configuration of this The following parameters are required for the configuration of this
feature: feature:
skipping to change at page 16, line 30 skipping to change at page 17, line 8
identification parameters to the LAG) and will be required when identification parameters to the LAG) and will be required when
specifying flow placement to achieve the desired rebalancing. specifying flow placement to achieve the desired rebalancing.
. Component Link ID: Identifies the component link within a LAG. . Component Link ID: Identifies the component link within a LAG.
This is required when specifying flow placement to achieve the This is required when specifying flow placement to achieve the
desired rebalancing. desired rebalancing.
5.3. Information for Alternative Placement of Large Flows 5.3. Information for Alternative Placement of Large Flows
In cases where large flow recognition is handled by an external In cases where large flow recognition is handled by an external
management station (see Section 3.1.3), an information model for management station (see Section 4.3.3 ), an information model for
flows is required to allow the import of large flow information to flows is required to allow the import of large flow information to
the router. the router.
The following are some of the elements of information model for The following are some of the elements of information model for
importing of flows: importing of flows:
. Layer 2: source MAC address, destination MAC address, VLAN ID. . Layer 2: source MAC address, destination MAC address, VLAN ID.
. Layer 3 IP: IP Protocol, IP source address, IP destination . Layer 3 IP: IP Protocol, IP source address, IP destination
address, flow label (IPv6 only), TCP/UDP source port, TCP/UDP address, flow label (IPv6 only), TCP/UDP source port, TCP/UDP
skipping to change at page 17, line 17 skipping to change at page 17, line 41
the target component link for the flow. the target component link for the flow.
5.4. Information for Redistribution of Small Flows 5.4. Information for Redistribution of Small Flows
For small flows, the LAG ID and the component link IDs along with the For small flows, the LAG ID and the component link IDs along with the
percentage of traffic to be assigned to each component link ID are percentage of traffic to be assigned to each component link ID are
required. required.
5.5. Export of Flow Information 5.5. Export of Flow Information
Exporting flow information is required when large flow identification Exporting large flow information is required when large flow
is being done on a router, but the decision to rebalance is being recognition is being done on a router, but the decision to rebalance
made in an external management station. is being made in an external management station. Large flow
information includes flow identification and the component link ID
that the flow currently is assigned to. Other information such as
flow QoS and bandwidth may be exported too.
It is recommended to use IPFIX protocol [RFC 5101] for exporting of The IPFIX information model [RFC 5101] can be leveraged for large
large flows from the router to an external management station. flow identification.
5.6. Monitoring information 5.6. Monitoring information
5.6.1. Interface (link) utilization 5.6.1. Interface (link) utilization
The incoming bytes (ifInOctets), outgoing bytes (ifOutOctets) and The incoming bytes (ifInOctets), outgoing bytes (ifOutOctets) and
interface speed (ifSpeed) can be measured from the Interface table interface speed (ifSpeed) can be measured from the Interface table
(iftable) MIB [RFC 1213]. (iftable) MIB [RFC 1213].
The link utilization can then be computed as follows: The link utilization can then be computed as follows:
Incoming link utilization = (ifInOctets * 8 / ifSpeed) Incoming link utilization = (ifInOctets * 8 / ifSpeed)
Outgoing link utilization = (ifOutOctets * 8 / ifSpeed) Outgoing link utilization = (ifOutOctets * 8 / ifSpeed)
For high speed links, the etherStatsHighCapacityTable MIB [RFC 3273] For high speed links, the etherStatsHighCapacityTable MIB [RFC 3273]
can be used. can be used.
For further scalability, it is recommended to use the counter push
mechanism in [sflow-v5] for the interface counters; this would help
avoid counter polling through the MIB interface.
The outgoing link utilization of the component links within a LAG can The outgoing link utilization of the component links within a LAG can
be used to compute the imbalance threshold (See Section 5.1) for the be used to compute the imbalance threshold (See Section 5.1) for the
LAG. LAG.
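Since ifInOctets and ifOutOctets are cumulative counters, a practical computation would typically take the difference between two readings over a polling interval. The Python sketch below shows one way to do that, assuming a single counter wrap at most between readings; the function name link_utilization and its parameters are hypothetical and not part of any MIB.

```python
def link_utilization(octets_start, octets_end, interval_s, if_speed_bps,
                     counter_bits=32):
    """Fraction of link capacity used between two counter readings.

    octets_start, octets_end: ifInOctets or ifOutOctets at the start and
        end of the polling interval.
    interval_s: polling interval in seconds.
    if_speed_bps: ifSpeed, in bits per second.
    counter_bits: counter width; 32 is assumed here, use 64 for
        high-capacity counters.
    """
    # The modulo handles one wrap of the counter between the two readings.
    delta_octets = (octets_end - octets_start) % (1 << counter_bits)
    return delta_octets * 8 / (interval_s * if_speed_bps)
```

For instance, 125,000,000 octets sent in 10 seconds on a 1 Gb/s link gives a utilization of 0.1. On fast links a 32-bit counter can wrap more than once within a polling interval, which this sketch cannot detect.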
5.6.2. Other monitoring information 5.6.2. Other monitoring information
Additional monitoring information includes: Additional monitoring information includes:
. Number of times rebalancing was done. . Number of times rebalancing was done.
skipping to change at page 20, line 27 skipping to change at page 21, line 23
the ACM SIGCOMM, August 2011. the ACM SIGCOMM, August 2011.
[NDTM] Estan, C. and G. Varghese, "New directions in traffic [NDTM] Estan, C. and G. Varghese, "New directions in traffic
measurement and accounting," Proceedings of ACM SIGCOMM, August 2002. measurement and accounting," Proceedings of ACM SIGCOMM, August 2002.
[bin-pack] Coffman, Jr., E., M. Garey, and D. Johnson. Approximation [bin-pack] Coffman, Jr., E., M. Garey, and D. Johnson. Approximation
Algorithms for Bin-Packing -- An Updated Survey. In Algorithm Design Algorithms for Bin-Packing -- An Updated Survey. In Algorithm Design
for Computer System Design, ed. by Ausiello, Lucertini, and Serafini. for Computer System Design, ed. by Ausiello, Lucertini, and Serafini.
Springer-Verlag, 1984. Springer-Verlag, 1984.
Appendix A. Internet Traffic Analysis and Load Balancing Simulation Appendix A. Internet Traffic Analysis and Load Balancing Simulation
Internet traffic [CAIDA] has been analyzed to obtain flow statistics Internet traffic [CAIDA] has been analyzed to obtain flow statistics
such as the number of packets in a flow and the flow duration. The such as the number of packets in a flow and the flow duration. The
five tuples in the packet header (IP addresses, TCP/UDP Ports, and IP five tuples in the packet header (IP addresses, TCP/UDP Ports, and IP
protocol) are used for flow identification. The analysis indicates protocol) are used for flow identification. The analysis indicates
that < ~2% of the flows take ~30% of total traffic volume while the that < ~2% of the flows take ~30% of total traffic volume while the
rest of the flows (> ~98%) contributes ~70% [YONG]. rest of the flows (> ~98%) contributes ~70% [YONG].
The simulation has shown that given Internet traffic pattern, the The simulation has shown that given Internet traffic pattern, the
hash-based technique does not evenly distribute the flows over ECMP hash-based technique does not evenly distribute the flows over ECMP
 End of changes. 20 change blocks. 
37 lines changed or deleted 43 lines changed or added
