draft-ietf-ipoib-link-multicast-03.txt   draft-ietf-ipoib-link-multicast-04.txt 
INTERNET-DRAFT H.K. Jerry Chu INTERNET-DRAFT H.K. Jerry Chu
<draft-ietf-ipoib-link-multicast-03.txt> Sun Microsystems <draft-ietf-ipoib-link-multicast-04.txt> Sun Microsystems
Vivek Kashyap Vivek Kashyap
IBM IBM
Expires: September, 2003 March, 2003 Expires: December, 2003 June, 2003
IP link and multicast over InfiniBand networks IP link and multicast over InfiniBand networks
Status of this Memo Status of this Memo
This document is an Internet-Draft and is in full conformance with This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026. all provisions of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that other Task Force (IETF), its areas, and its working groups. Note that other
skipping to change at page 2, line 12 skipping to change at page 2, line 12
4.0 IB Multicast Architecture 4.0 IB Multicast Architecture
5.0 IB Links vs. IPoIB Links 5.0 IB Links vs. IPoIB Links
6.0 Setting up an IPoIB Link 6.0 Setting up an IPoIB Link
6.1 Maximum Transmission Unit 6.1 Maximum Transmission Unit
6.2 IPoIB Link Q_Key 6.2 IPoIB Link Q_Key
6.3 Other Link Attributes 6.3 Other Link Attributes
7.0 The IPoIB Broadcast Group 7.0 The IPoIB Broadcast Group
8.0 Mapping for other Multicast Groups 8.0 Mapping for other Multicast Groups
9.0 Sending and Receiving IP Multicast Packets 9.0 Sending and Receiving IP Multicast Packets
10.0 IP Multicast Routing 10.0 IP Multicast Routing
11.0 Security Considerations 11.0 New Types of Vulnerability in IB Multicast
12.0 Acknowledgments 12.0 Security Considerations
13.0 References 13.0 Acknowledgments
14.0 Author's Address 14.0 References
15.0 Full Copyright Statement 15.0 Author's Address
16.0 Full Copyright Statement
1.0 Introduction 1.0 Introduction
InfiniBand Architecture (IBA) defines four layers of network services InfiniBand Architecture (IBA) defines four layers of network services
corresponding to layer one through layer four of the OSI reference corresponding to layer one through layer four of the OSI reference
model. For the purpose of running IP over an InfiniBand (IB) model. For the purpose of running IP over an InfiniBand (IB)
network, the IB link, network, and transport layers collectively network, the IB link, network, and transport layers collectively
constitute the data link layer to the IP stack. One can find a constitute the data link layer to the IP stack. One can find a
general overview of IB architecture related to IP networks in general overview of IB architecture related to IP networks in
[IPoIB_ARCH]. [IPoIB_ARCH].
skipping to change at page 3, line 32 skipping to change at page 3, line 33
is referred to [IBTA]. is referred to [IBTA].
IBA defines two layers of multicast services. Its link layer uses IBA defines two layers of multicast services. Its link layer uses
multicast LIDs (MLIDs) in the Local Route Header (LRH). LIDs are multicast LIDs (MLIDs) in the Local Route Header (LRH). LIDs are
allocated by the Subnet Manager (SM) and fall in the range between allocated by the Subnet Manager (SM) and fall in the range between
0xC000 and 0xFFFE (approximately 16k). MLIDs are used by IB switches 0xC000 and 0xFFFE (approximately 16k). MLIDs are used by IB switches
to program their multicast forwarding tables. An IB switch to program their multicast forwarding tables. An IB switch
implementation may support far fewer MLIDs in its forwarding table implementation may support far fewer MLIDs in its forwarding table
though. though.
IB network layer uses multicast GIDs (MGIDs) in the Global Route The IB network layer uses multicast GIDs (MGIDs) in the Global Route
Header (GRH). MGIDs closely resemble IPv6 multicast addresses [AARCH] Header (GRH). MGIDs closely resemble IPv6 multicast addresses [AARCH]
shown below. shown below.
| 8 | 4 | 4 | 112 bits | | 8 | 4 | 4 | 112 bits |
+--------+----+----+---------------------------------------------+ +--------+----+----+---------------------------------------------+
|11111111|flgs|scop| group ID | |11111111|flgs|scop| group ID |
+--------+----+----+---------------------------------------------+ +--------+----+----+---------------------------------------------+
Figure 1 Figure 1
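As a minimal sketch of the layout in Figure 1, the following Python splits a 16-byte MGID into its three fields. The names `MgidFields` and `parse_mgid` are illustrative assumptions; they do not come from the draft or from any IB API.

```python
from typing import NamedTuple


class MgidFields(NamedTuple):
    flags: int     # 4-bit "flgs" field
    scope: int     # 4-bit "scop" field
    group_id: int  # 112-bit group ID


def parse_mgid(mgid: bytes) -> MgidFields:
    """Split a 128-bit MGID into the fields shown in Figure 1."""
    if len(mgid) != 16:
        raise ValueError("an MGID is exactly 16 bytes")
    if mgid[0] != 0xFF:
        raise ValueError("a multicast GID starts with the octet 11111111")
    flags = mgid[1] >> 4        # high nibble of the second octet
    scope = mgid[1] & 0x0F      # low nibble of the second octet
    group_id = int.from_bytes(mgid[2:], "big")  # remaining 14 octets
    return MgidFields(flags, scope, group_id)
```

For example, an MGID whose second octet is 0x12 parses to flags 1 and scope 2.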
skipping to change at page 5, line 39 skipping to change at page 5, line 40
Once an IB partition is established for IPoIB use, the link MTU and Once an IB partition is established for IPoIB use, the link MTU and
Q_Key are two other attributes that must be chosen before an IPoIB Q_Key are two other attributes that must be chosen before an IPoIB
link can be configured. link can be configured.
6.1 Maximum Transmission Unit 6.1 Maximum Transmission Unit
IB defines five permissible maximum payload sizes (MTUs). They are IB defines five permissible maximum payload sizes (MTUs). They are
256, 512, 1024, 2048 and 4096 bytes. [IPV6] requires a link MTU of 256, 512, 1024, 2048 and 4096 bytes. [IPV6] requires a link MTU of
1280 bytes or greater. To be better compatible with Ethernet, the 1280 bytes or greater. To be better compatible with Ethernet, the
dominant network media in both the LAN and WAN environment, the IPoIB dominant network media in both the LAN and WAN environment, the IPoIB
link MTU SHALL be 1500 bytes or greater. This leaves only 2048 and link MTU should be 1500 bytes or greater. This leaves only 2048 and
4096 bytes as the two acceptable MTUs for IPoIB. Channel adaptors 4096 bytes as the two acceptable MTUs for IPoIB. Channel adaptors
supporting an MTU less than the minimal requirement can still expose supporting an MTU less than the minimal requirement can still expose
an acceptable MTU to IP through an adaptation layer that fragments an acceptable MTU to IP through an adaptation layer that fragments
larger messages into smaller IB packets, and reassembles them on the larger messages into smaller IB packets, and reassembles them on the
receiving end. But this must be done in a way that is transparent to receiving end. But this must be done in a way that is transparent to
the IP stack. the IP stack.
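The MTU reasoning above can be checked with a short sketch. The constant and function names are hypothetical; the numbers are the ones given in the text.

```python
IB_MTUS = (256, 512, 1024, 2048, 4096)  # the five IB payload sizes listed above
IPV6_MIN_LINK_MTU = 1280                # minimum required by [IPV6]
IPOIB_MIN_LINK_MTU = 1500               # Ethernet-compatible floor from the text


def acceptable_ipoib_mtus(minimum: int = IPOIB_MIN_LINK_MTU) -> list:
    """Return the IB MTUs that meet the given minimum link MTU."""
    return [mtu for mtu in IB_MTUS if mtu >= minimum]
```

With either the 1280-byte or the 1500-byte floor, only 2048 and 4096 remain, which matches the conclusion in the paragraph above.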
It is up to the network administrator to select a link MTU to use It is up to the network administrator to select a link MTU to use
when configuring an IPoIB link. The link MTU SHALL NOT be greater when configuring an IPoIB link. The link MTU SHALL NOT be greater
than the MTU of any IB devices on the IPoIB link. Here the IB devices than the MTU of any IB devices on the IPoIB link. Here the IB devices
skipping to change at page 8, line 12 skipping to change at page 8, line 12
Once the right MGID and broadcast group are identified, the local Once the right MGID and broadcast group are identified, the local
node SHOULD use the MTU associated with the broadcast group. In case node SHOULD use the MTU associated with the broadcast group. In case
the MTU of the broadcast group is greater than what the local HCA can the MTU of the broadcast group is greater than what the local HCA can
support, the node cannot join the IPoIB link and operate as an IP support, the node cannot join the IPoIB link and operate as an IP
node. Otherwise the local node must join the broadcast group as a node. Otherwise the local node must join the broadcast group as a
"full member" and use the rest of link attributes associated with the "full member" and use the rest of link attributes associated with the
group for all future communication to the link. group for all future communication to the link.
In addition to the special all-node multicast group for broadcast In addition to the special all-node multicast group for broadcast
purpose, an all-router multicast group SHOULD be created at link purpose, an all-router multicast group may be created at link
configuration time if an IP router will be attached to the link. This configuration time if an IP router will be attached to the link. This
is to facilitate IP multicast operations described later. An IB is to facilitate IP multicast operations described later. An IB
multicast group for the all-router MGID must cover every IB subnet multicast group for the all-router MGID must cover every IB subnet
that the IPoIB link encompasses. The format of the all-router MGID that the IPoIB link encompasses. The format of the all-router MGID
will be covered in the next section. will be covered in the next section.
8.0 Mapping for other Multicast Groups 8.0 Mapping for other Multicast Groups
The general IP multicast [IPMULT] support over IB is similar to the The general IP multicast [IPMULT] support over IB is similar to the
case of the special broadcast group discussed above. An algorithmic case of the special broadcast group discussed above. An algorithmic
skipping to change at page 9, line 46 skipping to change at page 9, line 46
supporting IP multicast over IB. supporting IP multicast over IB.
A) An IB multicast group must be explicitly created through the SA A) An IB multicast group must be explicitly created through the SA
before it can be used. before it can be used.
This implies that in order to send a packet destined for an IP This implies that in order to send a packet destined for an IP
multicast address, the IPoIB implementation must check with the SA on multicast address, the IPoIB implementation must check with the SA on
the outbound link first for a "MCMemberRecord" that matches the MGID. the outbound link first for a "MCMemberRecord" that matches the MGID.
If one does exist, the MLID associated with the multicast group is If one does exist, the MLID associated with the multicast group is
used as the DLID for the packet. Otherwise, it implies no member used as the DLID for the packet. Otherwise, it implies no member
exists on the local link. The packet SHOULD be forwarded to locally exists on the local link. If the scope of the IP multicast group is
connected routers. This is to allow local routers to forward the beyond link-local, the packet must be sent to the on-link routers
packet to multicast listeners on remote networks. The specific through the use of the all-router multicast group or the broadcast
mechanism for a sender to forward packets to routers are left to group. This is to allow local routers to forward the packet to
implementations. One can use, for example, the broadcast group, or multicast listeners on remote networks. The all-router multicast
the all-router multicast group for this purpose. group is preferred over the broadcast group for better efficiency. If
the all-router multicast group does not exist, the sender can assume
that there are no routers on the local link; hence the packet can be
safely dropped.
B) A multicast sender must join the target multicast group as a B) A multicast sender must join the target multicast group as a
"SendOnlyNonMember" before outgoing multicast messages from it can be "SendOnlyNonMember" before outgoing multicast messages from it can be
successfully routed. The "SendOnlyNonMember" join is different from successfully routed. The "SendOnlyNonMember" join is different from
the regular "FullMember" join in two aspects. First, both types of the regular "FullMember" join in two aspects. First, both types of
joins enable multicast packets to be routed FROM the local port, but joins enable multicast packets to be routed FROM the local port, but
only the "FullMember" join causes multicast packets to be routed TO only the "FullMember" join causes multicast packets to be routed TO
the port. Second, the sender port of a "SendOnlyNonMember" join will the port. Second, the sender port of a "SendOnlyNonMember" join will
not be counted as a member of the multicast group for purposes of not be counted as a member of the multicast group for purposes of
group creation and deletion. group creation and deletion.
skipping to change at page 10, line 26 skipping to change at page 10, line 28
implementation when processing an egress multicast packet. implementation when processing an egress multicast packet.
if the egress port is already a "SendOnlyNonMember", or a if the egress port is already a "SendOnlyNonMember", or a
"FullMember" "FullMember"
=> send the packet => send the packet
else if the target multicast group exists else if the target multicast group exists
=> do "SendOnlyNonMember" join => do "SendOnlyNonMember" join
=> send the packet => send the packet
else if the all-router multicast group exists else if scope > link-local AND the all-router multicast group exists
=> send the packet to all routers => send the packet to all routers
else else
=> drop the packet => drop the packet
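The decision procedure above can be sketched in Python as a pure function. This follows the right-hand (-04) wording, which adds the scope test. The `Action` enum, the function name, and the use of the IPv6 link-local scope value 2 as the comparison threshold are assumptions made for illustration.

```python
from enum import Enum

LINK_LOCAL_SCOPE = 2  # IPv6 link-local scope value; an assumed threshold here


class Action(Enum):
    SEND = "send the packet"
    JOIN_THEN_SEND = "SendOnlyNonMember join, then send the packet"
    SEND_TO_ROUTERS = "send the packet to all routers"
    DROP = "drop the packet"


def egress_action(already_joined: bool, group_exists: bool,
                  scope: int, all_router_group_exists: bool) -> Action:
    """Pick the action for one egress multicast packet."""
    if already_joined:           # port is SendOnlyNonMember or FullMember
        return Action.SEND
    if group_exists:             # target IB multicast group exists
        return Action.JOIN_THEN_SEND
    if scope > LINK_LOCAL_SCOPE and all_router_group_exists:
        return Action.SEND_TO_ROUTERS
    return Action.DROP
```

Keeping the decision free of side effects makes it easy to test separately from the SA queries that supply its inputs.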
Implementations should cache the information about the existence of Implementations should cache the information about the existence of
an IB multicast group, its MLID and other attributes. This is to an IB multicast group, its MLID and other attributes. This is to
avoid expensive SA calls on every outgoing multicast packet. The avoid expensive SA calls on every outgoing multicast packet. Senders
cache may need to be validated periodically. Senders should also MUST subscribe to the multicast group create and delete traps in
subscribe to the multicast group create and delete traps in order to order to monitor the status of specific IB multicast groups. E.g.,
monitor the status of specific IB multicast groups. Multicast packets multicast packets directed to the all-router multicast group due to a
directed to the all-router multicast group due to a lack of listener lack of listener on the local subnet must be forwarded to the right
on the local subnet must be forwarded to the right multicast group if multicast group if the group is created later. This happens when a
the group is created later. This happens when a listener shows up on listener shows up on the local subnet.
the local subnet.
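A minimal sketch of the caching behavior described above, assuming the create and delete traps are delivered to the two callback methods. All class and function names are hypothetical, and the scope check from the egress procedure is omitted for brevity.

```python
from typing import Dict, Optional


class GroupCache:
    """Remembers which IB multicast groups exist and their MLIDs."""

    def __init__(self) -> None:
        self._mlid_by_mgid: Dict[bytes, int] = {}

    def on_group_created(self, mgid: bytes, mlid: int) -> None:
        # Called when a group-create trap (or an SA lookup) reports a group.
        self._mlid_by_mgid[mgid] = mlid

    def on_group_deleted(self, mgid: bytes) -> None:
        # Called when a group-delete trap arrives.
        self._mlid_by_mgid.pop(mgid, None)

    def lookup(self, mgid: bytes) -> Optional[int]:
        return self._mlid_by_mgid.get(mgid)


def pick_dlid(cache: GroupCache, target_mgid: bytes,
              all_router_mgid: bytes) -> Optional[int]:
    """Prefer the target group; fall back to the all-router group."""
    mlid = cache.lookup(target_mgid)
    if mlid is not None:
        return mlid
    return cache.lookup(all_router_mgid)  # None means drop the packet
```

Once a create trap for the target group updates the cache, later packets stop going to the all-router group and use the target group's MLID instead.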
A node joining an IP multicast group must first construct an MGID A node joining an IP multicast group must first construct an MGID
according to the rule described in section 8 above. Once the correct according to the rule described in section 8 above. Once the correct
MGID is calculated, the node must call the SA of the outbound link to MGID is calculated, the node must call the SA of the outbound link to
attempt a "FullMember" join of the IB multicast group corresponding attempt a "FullMember" join of the IB multicast group corresponding
to the MGID. If the IB multicast group doesn't already exist, one to the MGID. If the IB multicast group doesn't already exist, one
must be created first with the IPoIB link MTU. For the rest of must be created first with the IPoIB link MTU. For the rest of
attributes, it is recommended the same values from the all-node attributes, the same values from the all-node multicast/broadcast
multicast/broadcast group be used. group SHOULD be used.
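The join sequence above can be sketched against a toy in-memory stand-in for the SA. `ToySA` and its methods are hypothetical and do not model real SA MAD traffic; the sketch separates creation from joining only to make the two steps visible.

```python
class ToySA:
    """In-memory stand-in for the Subnet Administrator (illustration only)."""

    def __init__(self):
        self.groups = {}  # mgid -> {"mtu", "attrs", "full_members"}

    def create_group(self, mgid, mtu, attrs):
        self.groups[mgid] = {"mtu": mtu, "attrs": dict(attrs),
                             "full_members": set()}

    def full_member_join(self, mgid, port):
        self.groups[mgid]["full_members"].add(port)


def join_ip_multicast_group(sa, mgid, port, link_mtu, broadcast_mgid):
    """Create the group if needed, then join it as a FullMember."""
    if mgid not in sa.groups:
        # Use the IPoIB link MTU and copy the remaining attributes
        # from the all-node multicast/broadcast group.
        inherited = sa.groups[broadcast_mgid]["attrs"]
        sa.create_group(mgid, link_mtu, inherited)
    sa.full_member_join(mgid, port)
```

A second node joining the same MGID finds the group already present and only adds itself as a member.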
The join request will cause the local port to be added to the The join request will cause the local port to be added to the
multicast group. It also enables the SM to program IB switches and multicast group. It also enables the SM to program IB switches and
routers with the new multicast information to ensure the correct routers with the new multicast information to ensure the correct
forwarding of multicast packets for the group. forwarding of multicast packets for the group.
When a node leaves an IP multicast group, it SHOULD make a When a node leaves an IP multicast group, it SHOULD make a
"FullMember" leave request to the SA. This gives SM an opportunity to "FullMember" leave request to the SA. This gives SM an opportunity to
update relevant forwarding information, to delete an IB multicast update relevant forwarding information, to delete an IB multicast
group if the local port is the last FullMember to leave, and free up group if the local port is the last FullMember to leave, and free up
skipping to change at page 11, line 46 skipping to change at page 11, line 49
C) Subscribe to the IB multicast group creation events using a C) Subscribe to the IB multicast group creation events using a
wildcarded MGID so that the router can "NonMember" join all IB wildcarded MGID so that the router can "NonMember" join all IB
multicast groups created subsequently for IPv4 or IPv6. multicast groups created subsequently for IPv4 or IPv6.
The "NonMember" join has the same effect as a "FullMember" join The "NonMember" join has the same effect as a "FullMember" join
except that the former will not be counted as a member of the except that the former will not be counted as a member of the
multicast group for purposes of group creation or deletion. That is, multicast group for purposes of group creation or deletion. That is,
when the last "FullMember" leaves a multicast group, the group can be when the last "FullMember" leaves a multicast group, the group can be
safely deleted by the SA without regard to any "NonMember" routers. safely deleted by the SA without regard to any "NonMember" routers.
11.0 Security Considerations 11.0 New Types of Vulnerability in IB Multicast
Many IB multicast functions are subject to failures due to a number
of possible resource constraints. These include the creation of IB
multicast groups, the join calls ("SendOnlyNonMember", "FullMember",
and "NonMember"), and the attaching of a QP to a multicast group.
In general, the occurrence of these failure conditions is highly
implementation dependent, and is believed to be rare. Usually a
failed multicast operation at the IB level can be propagated back to
the IP level, causing the original operation to fail, and the
initiator of the operation to be notified. But some IB multicast
functions are not tied to any foreground operation, making their
failures hard to detect. E.g., if an IP multicast router attempts to
"NonMember" join a newly created multicast group in the local subnet,
but the join call fails, packet forwarding for that particular
multicast group will likely fail silently, that is, without the
attention of local multicast senders. This type of problem can add
more vulnerability to the already unreliable IP multicast operations.
Implementations should log error messages upon any failure from an IB
multicast operation. Network administrators should be aware of this
vulnerability, and preserve enough multicast resources at the points
where IP multicast will be used heavily. E.g., HCAs with ample
multicast resources should be used at any IP multicast router.
12.0 Security Considerations
All the operations for creating and configuring an IPoIB link All the operations for creating and configuring an IPoIB link
described in this document, including assigning P_Keys to CAs, described in this document, including assigning P_Keys to CAs,
creating IB multicast groups in SA, creating and attaching QPs to IB creating IB multicast groups in SA, creating and attaching QPs to IB
multicast groups, etc., are privileged operations, and MUST be multicast groups, etc., are privileged operations, and MUST be
protected by the underlying operating system. This is to prevent protected by the underlying operating system. This is to prevent
malicious, non- privileged software from hijacking important malicious, non-privileged software from hijacking important resources
resources and configurations. For example, a bogus IPoIB broadcast and configurations. For example, a bogus IPoIB broadcast group may
group may prevent a proper one from being created when the network prevent a proper one from being created when the network
administrator tries to set up a link. administrator tries to set up a link.
Controlled Q_Keys SHOULD be used in IPoIB links. This is to prevent Controlled Q_Keys SHOULD be used in IPoIB links. This is to prevent
non-privileged software from fabricating IP datagrams to send, as non-privileged software from fabricating IP datagrams to send, as
mentioned in section 6.2. mentioned in section 6.2.
12.0 Acknowledgments 13.0 Acknowledgments
The authors would like to thank Bruce Beukema, David Brean, Dan The authors would like to thank Bruce Beukema, David Brean, Dan
Cassiday, Aditya Dube, Yaron Haviv, Michael Krause, Thomas Narten, Cassiday, Aditya Dube, Yaron Haviv, Michael Krause, Thomas Narten,
Erik Nordmark, Greg Pfister, Renato Recio, Satya Sharma, and David L. Erik Nordmark, Greg Pfister, Renato Recio, Kanoj Sarcar, Satya
Stevens for their suggestions and many clarifications on the IBA Sharma, and David L. Stevens for their suggestions and many
specification. clarifications on the IBA specification.
13.0 References
14.0 References
[AARCH] Hinden, R. and S. Deering "IP Version 6 Addressing [AARCH] Hinden, R. and S. Deering "IP Version 6 Addressing
Architecture", RFC 2373, July 1998. Architecture", RFC 2373, July 1998.
[DHCP] R. Droms "Dynamic Host Configuration Protocol", RFC 2131, [DHCP] R. Droms "Dynamic Host Configuration Protocol", RFC 2131,
March 1997. March 1997.
[DISC] Narten, T., Nordmark, E. and W. Simpson, "Neighbor [DISC] Narten, T., Nordmark, E. and W. Simpson, "Neighbor
Discovery for IP Version 6 (IPv6)", RFC 2461, December Discovery for IP Version 6 (IPv6)", RFC 2461, December
1998. 1998.
skipping to change at page 13, line 11 skipping to change at page 13, line 39
[IPV6] Deering, S. and R. Hinden, "Internet Protocol, Version 6 [IPV6] Deering, S. and R. Hinden, "Internet Protocol, Version 6
(IPv6) Specification", RFC 2460, December 1998. (IPv6) Specification", RFC 2460, December 1998.
[IP6MLD] Deering S., Fenner W., Haberman B., "Multicast Listener [IP6MLD] Deering S., Fenner W., Haberman B., "Multicast Listener
Discovery (MLD) for IPv6", RFC 2710, October 1999. Discovery (MLD) for IPv6", RFC 2710, October 1999.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997. Requirement Levels", BCP 14, RFC 2119, March 1997.
14.0 Author's Address 15.0 Author's Address
H.K. Jerry Chu H.K. Jerry Chu
17 Network Circle, UMPK17-201 17 Network Circle, UMPK17-201
Menlo Park, CA 94025 Menlo Park, CA 94025
USA USA
Phone: +1 650 786-5146 Phone: +1 650 786-5146
EMail: jerry.chu@sun.com EMail: jerry.chu@sun.com
Vivek Kashyap Vivek Kashyap
IBM IBM
15450, SW Koll Parkway 15450, SW Koll Parkway
Beaverton, OR 97006 Beaverton, OR 97006
Phone: 503 578 3422 Phone: 503 578 3422
EMail: vivk@us.ibm.com EMail: vivk@us.ibm.com
15.0 Full Copyright Statement 16.0 Full Copyright Statement
Copyright (C) The Internet Society (2003). All Rights Reserved. Copyright (C) The Internet Society (2003). All Rights Reserved.
This document and translations of it may be copied and furnished to This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published or assist in its implementation may be prepared, copied, published
and distributed, in whole or in part, without restriction of any and distributed, in whole or in part, without restriction of any
kind, provided that the above copyright notice and this paragraph are kind, provided that the above copyright notice and this paragraph are
included on all such copies and derivative works. However, this included on all such copies and derivative works. However, this
document itself may not be modified in any way, such as by removing document itself may not be modified in any way, such as by removing
 End of changes. 