draft-ietf-ipoib-link-multicast-02.txt   draft-ietf-ipoib-link-multicast-03.txt 
INTERNET-DRAFT H.K. Jerry Chu INTERNET-DRAFT H.K. Jerry Chu
<draft-ietf-ipoib-link-multicast-02.txt> Sun Microsystems <draft-ietf-ipoib-link-multicast-03.txt> Sun Microsystems
Vivek Kashyap Vivek Kashyap
IBM IBM
Expires: December, 2002 June, 2002 Expires: September, 2003 March, 2003
IP link and multicast over InfiniBand networks IP link and multicast over InfiniBand networks
Status of this Memo Status of this Memo
This document is an Internet-Draft and is in full conformance with This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026. all provisions of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that other Task Force (IETF), its areas, and its working groups. Note that other
skipping to change at page 1, line 31 skipping to change at page 1, line 31
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html. http://www.ietf.org/shadow.html.
Copyright (C) The Internet Society (date). All Rights Reserved. Copyright (C) The Internet Society (2003). All Rights Reserved.
Abstract Abstract
This document specifies a method for setting up IP subnets and This document specifies a method for setting up IP subnets and
multicast services over InfiniBand(TM) networks. Discussions in this multicast services over InfiniBand(TM) networks. Discussions in this
document are applicable to both IPv4 and IPv6, unless explicitly document are applicable to both IPv4 and IPv6, unless explicitly
specified. A separate document will cover unicast and encapsulation specified. A separate document will cover unicast and encapsulation
of IP datagrams over InfiniBand networks. of IP datagrams over InfiniBand networks.
Table of Contents Table of Contents
1.0 Introduction 1.0 Introduction
2.0 Terminology 2.0 Terminology
3.0 Basic IPoIB Transport - Unreliable Datagram 3.0 Basic IPoIB Transport - Unreliable Datagram
4.0 IB Multicast Architecture 4.0 IB Multicast Architecture
5.0 IB Links vs IPoIB Links 5.0 IB Links vs. IPoIB Links
6.0 Setting up an IPoIB Link 6.0 Setting up an IPoIB Link
6.1 Maximum Transmission Unit 6.1 Maximum Transmission Unit
6.2 IPoIB Link Q_Key 6.2 IPoIB Link Q_Key
6.3 Other Link Attributes 6.3 Other Link Attributes
7.0 The IPoIB Broadcast Group 7.0 The IPoIB Broadcast Group
8.0 Mapping for other Multicast Groups 8.0 Mapping for other Multicast Groups
9.0 Sending and Receiving IP Multicast Packets 9.0 Sending and Receiving IP Multicast Packets
10.0 Security Considerations 10.0 IP Multicast Routing
11.0 Acknowledgments 11.0 Security Considerations
12.0 References 12.0 Acknowledgments
13.0 Author's Address 13.0 References
14.0 Full Copyright Statement 14.0 Author's Address
15.0 Full Copyright Statement
1.0 Introduction 1.0 Introduction
InfiniBand Architecture (IBA) defines four layers of network services InfiniBand Architecture (IBA) defines four layers of network services
corresponding to layer one through layer four of the OSI reference corresponding to layer one through layer four of the OSI reference
model. For the purpose of running IP over an InfiniBand (IB) model. For the purpose of running IP over an InfiniBand (IB)
network, the IB link, network, and transport layers collectively network, the IB link, network, and transport layers collectively
constitute the data link layer to the IP stack. One can find a constitute the data link layer to the IP stack. One can find a
general overview of IB architecture related to IP networks in general overview of IB architecture related to IP networks in
[IPoIB_ARCH]. [IPoIB_ARCH].
This document will focus on the necessary steps in order to lay out This document will focus on the necessary steps in order to lay out
an IP network on top of an IB network. It will describe all the an IP network on top of an IB network. It will describe all the
elements of an IP over InfiniBand (IPoIB) link, how to configure its elements of an IP over InfiniBand (IPoIB) link, how to configure its
associated attributes, and how to set up basic broadcast and associated attributes, and how to set up basic broadcast and
multicast services for it. IPoIB link is the building block upon multicast services for it. IPoIB links are the building blocks upon
which an IP network consisting of many IP subnets connected by which an IP network consisting of many IP subnets connected by
routers can be built. Subnetting allows the containment of broadcast routers can be built. Subnetting allows the containment of broadcast
traffic within a single link. It also provides a certain degree of traffic within a single link. It also provides a certain degree of
isolation for administration purpose between nodes on different isolation for the administration purpose between nodes on different
subnets. subnets.
2.0 Terminology 2.0 Terminology
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC2119]. document are to be interpreted as described in [RFC2119].
3.0 Basic IPoIB Transport - Unreliable Datagram 3.0 Basic IPoIB Transport - Unreliable Datagram
InfiniBand defines four types of transport services [IBTA]. They are InfiniBand defines four types of transport services [IBTA]. They are
reliable connection, unreliable connection, reliable datagram, reliable connection, unreliable connection, reliable datagram,
unreliable datagram. IBA also defines a special raw datagram service unreliable datagram. IBA also defines a special raw datagram service
for encapsulation purpose. Both unreliable datagram and raw datagram for encapsulation purpose. Both unreliable datagram and raw datagram
define support for multicast. They provide the basic transport define support for multicast. They provide the basic transport
mechanism that best matches the IP datagram paradigm. mechanism that best matches the IP datagram paradigm.
IB unreliable datagram provides many additional features such as the IB unreliable datagram provides many additional features such as the
partition key (P_Key) protection, multiple queue pairs (QPs), and partition key (P_Key) protection, multiple queue pairs (QPs), and
Q_Key protection. Moreover, it requires a 32-bit invariant CRC Q_Key protection. Moreover, it defines a 32-bit invariant CRC
checksum, which provides a much stronger protection against data checksum, which provides a much stronger protection against data
corruption, compared with the 16-bit CRC that a raw datagram carries. corruption, compared with the 16-bit CRC that a raw datagram carries.
For these reasons, IB unreliable datagram is considered to be a much For these reasons, IB unreliable datagram is considered to be a much
better choice as the basic IPoIB transport than the raw datagram, and better choice as the basic IPoIB transport than the raw datagram, and
is chosen as the default IPoIB transport mechanism ([IPoIB_ARCH], is chosen as the default IPoIB transport mechanism ([IPoIB_ARCH],
[IPoIB_ENCAP]). [IPoIB_ENCAP]).
4.0 IB Multicast Architecture 4.0 IB Multicast Architecture
The following discussion gives a short overview of the multicast The following discussion gives a short overview of the multicast
architecture in InfiniBand. For a more complete description, the architecture in InfiniBand. For a complete specification, the reader
reader is referred to [IBTA] and [IPoIB_ARCH]. is referred to [IBTA].
IBA defines two layers of multicast services. Its link layer uses IBA defines two layers of multicast services. Its link layer uses
multicast LIDs (MLIDs), which are allocated by the Subnet Manager multicast LIDs (MLIDs) in the Local Route Header (LRH). LIDs are
(SM) and fall in the range from 0xC000 to 0xFFFE (approximately allocated by the Subnet Manager (SM) and fall in the range from
16k). MLIDs are used by IB switches to program their multicast 0xC000 to 0xFFFE (approximately 16k). MLIDs are used by IB switches
forwarding tables. An IB switch implementation may support much fewer to program their multicast forwarding tables. An IB switch
MLIDs in its forwarding table though. implementation may support much fewer MLIDs in its forwarding table
though.
IB network layer uses multicast GIDs (MGIDs), which closely resemble IB network layer uses multicast GIDs (MGIDs) in the Global Route
IPv6 multicast addresses [AARCH] shown below. Header (GRH). MGIDs closely resemble IPv6 multicast addresses [AARCH]
shown below.
| 8 | 4 | 4 | 112 bits | | 8 | 4 | 4 | 112 bits |
+--------+----+----+---------------------------------------------+ +--------+----+----+---------------------------------------------+
|11111111|flgs|scop| group ID | |11111111|flgs|scop| group ID |
+--------+----+----+---------------------------------------------+ +--------+----+----+---------------------------------------------+
Figure 1 Figure 1
[IPoIB_ARCH] describes each field in more detail. [IPoIB_ARCH] describes each field in more detail.
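As an illustration of the Figure 1 layout, the following minimal sketch splits a 128-bit MGID into its prefix, flgs, scop, and group ID fields. The function name parse_mgid and the use of Python's ipaddress module to read the textual form are assumptions made for this example and are not part of the draft; the sample MGID is the IPv4 all-router example given in section 8.

```python
import ipaddress


def parse_mgid(mgid: str) -> dict:
    """Split a 128-bit MGID into the fields shown in Figure 1.

    Hypothetical helper for illustration only.
    """
    value = int(ipaddress.IPv6Address(mgid))
    return {
        "prefix": (value >> 120) & 0xFF,       # 8 bits, all ones for multicast
        "flags": (value >> 116) & 0xF,         # 4-bit flgs field
        "scope": (value >> 112) & 0xF,         # 4-bit scop field
        "group_id": value & ((1 << 112) - 1),  # remaining 112 bits
    }


fields = parse_mgid("FF12:401B:8006::2")
print({name: hex(v) for name, v in fields.items()})
```

For this sample the prefix is 0xff, flgs is 1, and scop is 2 (link-local).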
Since every IB multicast packet is required to carry both LRH and Since every IB multicast packet is required to carry a LRH and a GRH,
GRH, a valid MGID and a valid MLID are both needed before a valid IB both a valid MGID and a valid MLID are needed before an IB multicast
multicast packet can be constructed. packet can be constructed.
An IB multicast group is uniquely identified by a valid MGID. Before An IB multicast group is uniquely identified by a valid MGID. Before
a MGID can be used within an IB subnet, either as a destination a MGID can be used within an IB subnet, either as a destination
address of a multicast packet, or representing a multicast group that address of a multicast packet, or to represent a multicast group that
an IB node can join, an IB multicast group corresponding to the MGID an IB node can join, an IB multicast group corresponding to the MGID
must be created through the Subnet Administrator (SA). Besides the must be created through the Subnet Administrator (SA). Besides the
MGID, the creator must supply values of the path MTU, Q_Key, TClass, MGID, the creator of an IB multicast group must supply values of
FlowLabel, HopLimit that are appropriate for all the potential path MTU, P_Key, Q_Key, Service Level (SL), FlowLabel, TClass that
clients of the multicast group to use. In return, SA will allocate a are appropriate for all the potential clients of the multicast group
MLID to be used by switches in the local IB subnet. to use. In return, SA will allocate a MLID to be used by switches in
the local IB subnet.
Unreliable multicast is defined by IBA as an optional functionality Unreliable multicast is defined by IBA as an optional functionality
for channel adaptors (CAs) and switches. In today's IP technology, for channel adaptors (CAs) and switches. In today's IP technology,
link multicast has become an indispensable function for better link multicast has become an indispensable function for better
supporting a modern IP network. For this reason, it is required that supporting a modern IP network. For this reason, it is required that
an IPoIB fabric supports multicast. This includes all the CAs and an IPoIB fabric supports multicast. This includes all the CAs and
switches that are part of an IP network. switches that are part of an IP network.
5.0 IB Links vs IPoIB Links 5.0 IB Links vs. IPoIB Links
A link segment on top of which an IP subnet can be configured is A link segment on top of which an IP subnet can be configured is
defined in [IPV6] as a communication facility or medium over which defined in [IPV6] as a communication facility or medium over which
nodes can communicate at the "link" layer. For most types of nodes can communicate at the "link" layer. For most types of
communication media, the boundary between different data link communication media, the boundary between different data link
segments follows the physical topology of the network. E.g. an segments closely follows the physical topology of the network. For
Ethernet network connected by switches, hubs, or bridges usually instance, an Ethernet network connected by switches, hubs, or bridges
forms a single link segment and broadcast/multicast domain. Different usually forms a single link segment and broadcast/multicast domain.
Ethernet segments can be connected together by IP routers at the Different Ethernet segments can be connected by IP routers at the
network layer. network layer to form an IP network.
InfiniBand defines its own link-layer and subnets consisting of nodes InfiniBand defines its own link-layer and subnets consisting of nodes
connected by IB switches. However, the IPoIB link boundary need not connected by IB switches and routers. However, the IPoIB link
follow the IB link boundary. Nodes residing on different IB subnets boundary need not follow the IB link boundary. Nodes residing on
can still communicate directly with one another through IB routers at different IB subnets can still communicate directly with one another
the InfiniBand network layer. This communication at the network layer through IB routers at the InfiniBand network layer. This
applies to unicast as well as multicast. communication at the network layer applies to unicast as well as
multicast.
The ultimate requirement for two nodes in the same IB fabric to The ultimate requirement for two nodes in the same IB fabric to
communicate at the IB level, besides physical connectivity, is a communicate at the IB level, besides physical connectivity, is a
common P_Key. common P_Key.
Partitioning in IB provides an isolation mechanism among nodes in an Partitioning in IB provides an isolation mechanism among nodes in an
IB fabric, much like VLANs in the Ethernet network. Each HCA (Host IB fabric, much like VLANs in the Ethernet network. Each port of an
Channel Adaptor) port of an endnode contains a P_Key table holding HCA (Host Channel Adaptor) contains a P_Key table holding all the
all the valid P_Keys the port is allowed to use. The P_Key table is valid P_Keys the port is allowed to use. The P_Key table is set up by
set up by the SM of the local IB subnet. Each QP is programmed with a the SM of the local IB subnet. Each QP is programmed with a P_Key
P_Key from the local P_Key table. This P_Key is carried in all the from the local P_Key table. This P_Key is carried in all the outgoing
outgoing packets from the QP, and is used to compare against the packets from the QP, and is used to compare against the P_Key of all
P_Key of incoming packets to the QP. Any packet with an invalid P_Key incoming packets to the QP. Any packet with an invalid P_Key will be
will be discarded by the QP and trigger a P_Key violation trap. IB discarded by the QP and a P_Key violation trap will be generated. IB
switches may optionally enforce partition checking too. switches may optionally enforce partition checking too.
Following the above, IB partitions are the natural choice for Following the above, IB partitions are the natural choice for
defining IPoIB link boundary. It also provides much needed defining IPoIB link boundary. It also provides much needed
flexibility for a network administrator to group nodes logically into flexibility for a network administrator to group nodes logically into
different subnets in a large network. different subnets in a large network.
6.0 Setting up an IPoIB Link 6.0 Setting up an IPoIB Link
A network administrator defines an IPoIB link by setting up an IB A network administrator defines an IPoIB link by setting up an IB
partition and assigning it a unique P_Key. An IB partition may or may partition and assigning it a unique P_Key. Since a full-duplex
not span multiple IB subnets; and whether it does or not is mostly communication is required among IP nodes, full-membership P_Keys,
transparent to IPoIB. that is, those with the high-order bit set to 1 shall be used. An IB
partition may or may not span multiple IB subnets; and whether it
does or not is mostly transparent to IPoIB.
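The full-membership requirement stated in the -03 text can be checked with a single bit test. The sketch below assumes a 16-bit P_Key whose high-order bit marks full membership, as described above; the names FULL_MEMBER_BIT, is_full_member, and to_full_member are hypothetical.

```python
FULL_MEMBER_BIT = 0x8000  # high-order bit of the 16-bit P_Key


def is_full_member(pkey: int) -> bool:
    """Return True if the P_Key has the full-membership bit set."""
    if not 0 <= pkey <= 0xFFFF:
        raise ValueError("P_Key must fit in 16 bits")
    return bool(pkey & FULL_MEMBER_BIT)


def to_full_member(pkey: int) -> int:
    """Return the full-membership form of a P_Key."""
    if not 0 <= pkey <= 0xFFFF:
        raise ValueError("P_Key must fit in 16 bits")
    return pkey | FULL_MEMBER_BIT


print(is_full_member(0x8006), is_full_member(0x0006), hex(to_full_member(0x0006)))
```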
Each node attached to the IB partition MUST have one of its HCAs Each node attached to an IB partition MUST have one of its HCAs
assigned the P_Key to use. Note that the P_Key table of an HCA port assigned the P_Key to use. Note that the P_Key table of an HCA port
may contain many P_Keys. It is up to the implementation to define the may contain many P_Keys. It is up to the implementation to define the
method by which the P_Key relevant to a particular IPoIB subnet is method by which the P_Key relevant to a particular IPoIB subnet is
determined and conveyed to the IPoIB stack. E.g. implementations can determined and conveyed to the IPoIB stack. For instance,
resort to a manual configuration to choose the P_Key or a set of implementations may resort to a manual configuration when choosing
P_Keys for IPoIB use, and rely on DHCP [DHCP] to assign an IP subnet the P_Key or a set of P_Keys for IPoIB, and rely on DHCP [DHCP] to
number to each IPoIB link. assign an IP subnet number to each IPoIB link.
Once an IB partition is established for IPoIB use, the link MTU and Once an IB partition is established for IPoIB use, the link MTU and
Q_Key are two other attributes that must be chosen before the IPoIB Q_Key are two other attributes that must be chosen before an IPoIB
link can be configured. link can be configured.
6.1 Maximum Transmission Unit 6.1 Maximum Transmission Unit
IB defines five permissible maximum payload sizes (MTUs). They are IB defines five permissible maximum payload sizes (MTUs). They are
256, 512, 1024, 2048 and 4096 bytes. [IPV6] requires a link MTU of 256, 512, 1024, 2048 and 4096 bytes. [IPV6] requires a link MTU of
1280 bytes or greater. To be better compatible with Ethernet, the 1280 bytes or greater. To be better compatible with Ethernet, the
dominant network media in both the LAN and WAN environment, the IPoIB dominant network media in both the LAN and WAN environment, the IPoIB
link MTU SHALL be 1500 bytes or greater. This leaves only 2048 and link MTU SHALL be 1500 bytes or greater. This leaves only 2048 and
4096 bytes as the two acceptable MTUs for IPoIB. Channel adaptors 4096 bytes as the two acceptable MTUs for IPoIB. Channel adaptors
supporting a MTU less than the minimal requirement can still expose supporting a MTU less than the minimal requirement can still expose
an acceptable MTU to IP through an adaptation layer that fragments an acceptable MTU to IP through an adaptation layer that fragments
larger messages into smaller IB packets, and reassembles them on the larger messages into smaller IB packets, and reassembles them on the
receiving end. But this must be done in a way that is transparent to receiving end. But this must be done in a way that is transparent to
the IP stack. the IP stack.
It is up to the network administrator to select a link MTU to use It is up to the network administrator to select a link MTU to use
when configuring an IPoIB link. The link MTU SHALL NOT be greater when configuring an IPoIB link. The link MTU SHALL NOT be greater
than the MTU of any IB device on the IPoIB link minus the size of the than the MTU of any IB devices on the IPoIB link. Here the IB devices
"Type" field encapsulated in the payload [IPoIB_ENCAP]. Here the IB include IB switches, CAs, or routers.
devices include IB switches, CAs, or routers.
In general, a maximal link MTU should be employed whenever possible In general, a maximum link MTU should be employed whenever possible
to attain better throughput performance. One caveat is that once a to attain a better throughput performance. One caveat is that once a
link MTU is chosen for a given IPoIB link, nodes connected by CAs of link MTU is chosen for a given IPoIB link, nodes connected by CAs of
a smaller MTU won't be able to join the link unless the whole link a smaller MTU won't be able to join the link unless the whole link
and all the devices attached to it are reconfigured to use a smaller and all the devices attached to it are reconfigured to use the
MTU. smaller MTU.
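The selection rule above can be sketched as follows, under simplifying assumptions: the link MTU is taken to be one of the five IB MTU values, it must be at least 1500 bytes, and it must not exceed the smallest MTU among the IB devices on the link. The sketch follows the -03 wording and does not subtract the "Type" field size mentioned in -02, nor does it model the adaptation layer. The names IB_MTUS, MIN_IPOIB_MTU, and select_link_mtu are hypothetical.

```python
IB_MTUS = (256, 512, 1024, 2048, 4096)  # the five permissible IB MTUs
MIN_IPOIB_MTU = 1500                    # minimum IPoIB link MTU in bytes


def select_link_mtu(device_mtus):
    """Pick the largest IB MTU usable by every device on the link."""
    limit = min(device_mtus)
    candidates = [m for m in IB_MTUS if MIN_IPOIB_MTU <= m <= limit]
    if not candidates:
        raise ValueError("no IB MTU of at least 1500 bytes fits all devices")
    return max(candidates)


print(select_link_mtu([4096, 2048, 4096]))
```

A device limited to 1024 bytes makes the selection fail, which corresponds to the caveat that such a node cannot join the link without help from an adaptation layer.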
The flexibility of configuring a smaller than the full link MTU size It may be desirable in some cases to use a smaller link MTU than the
does make it easier for one to bridge an IPoIB link with an Ethernet full size. For example, bridging an IPoIB link with an Ethernet link
link, by setting the IPoIB link MTU to 1500 bytes. For IPv4, this may could be made much easier if the IPoIB link MTU is reduced to 1500
require a manual configuration of a different link MTU than the bytes. For IPv4, this may require a manual configuration of a
maximum that all the nodes support. (See 7.0 below.) For IPv6, one different link MTU than the maximum that all the nodes support. For
can use the MTU option of the router advertisement [DISC] to announce IPv6, one can use the MTU option of the router advertisement [DISC]
a smaller MTU to all the nodes. to announce a smaller MTU to all the nodes.
In case an IPoIB link spans more than one IB subnet, the IPoIB link In case an IPoIB link spans more than one IB subnet, the IPoIB link
MTU MUST NOT exceed the path MTU of any path connecting two nodes in MTU MUST NOT exceed the path MTU of any path connecting two nodes in
the same IB partition. It is up to the network administrator to the same IB partition. It is up to the network administrator to
determine the appropriate path MTU value that will work for any node determine the appropriate path MTU value that will work for any node
in the same IPoIB link. in the same IPoIB link.
6.2 IPoIB Link Q_Key 6.2 IPoIB Link Q_Key
A Q_Key is programmed by the source QP in every IB datagram, and is A Q_Key is programmed by the source QP in every IB datagram, and is
compared against the Q_Key of the destination QP. A Q_Key violation compared against the Q_Key of the destination QP. A Q_Key violation
will cause the offending datagram to be dropped, and a Q_Key will cause the offending datagram to be dropped, and a Q_Key
violation trap to be raised. violation counter to be incremented on the receiving port. A trap is
also generated if the feature is supported on that port.
A Q_Key must be selected to be used by all the QPs attached to an A single Q_Key must be selected for all the QPs attached to an IPoIB
IPoIB link. It is recommended that a controlled Q_Key be used with link to use. It is recommended that a controlled Q_Key be used with
the high order bit set. This is to prevent non-privileged software the high order bit set. This is to prevent non-privileged software
from fabricating and sending out bogus IP datagrams. All QPs from fabricating and sending out bogus IP datagrams. All QPs
configured to be used on a given IPoIB link SHALL be assigned the configured for a given IPoIB link SHALL be assigned the same per-link
same per-link Q_Key. Q_Key.
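The recommendation to use a controlled Q_Key amounts to a test on the high-order bit. The sketch below assumes a 32-bit Q_Key, as defined by IBA; the names CONTROLLED_QKEY_BIT and is_controlled_qkey, and the sample values, are hypothetical.

```python
CONTROLLED_QKEY_BIT = 0x80000000  # high-order bit of the 32-bit Q_Key


def is_controlled_qkey(qkey: int) -> bool:
    """Return True if the Q_Key has its high-order bit set."""
    if not 0 <= qkey <= 0xFFFFFFFF:
        raise ValueError("Q_Key must fit in 32 bits")
    return bool(qkey & CONTROLLED_QKEY_BIT)


print(is_controlled_qkey(0x80000B1B), is_controlled_qkey(0x00000B1B))
```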
6.3 Other Link Attributes 6.3 Other Link Attributes
TClass, FlowLabel, and HopLimit are three other attributes that are TClass, FlowLabel, HopLimit, and SL are four other attributes that
required if an IPoIB link covers more than a single IB subnet. The are required if an IPoIB link covers more than a single IB subnet.
selection of these values is implementation dependent. But it must The selection of these values is implementation dependent.
take into account the topology of IB subnets comprising the IPoIB Implementations must take into account the topology of IB subnets
link in order to allow successful communication between any two nodes comprising the IPoIB link to ensure a successful communication
in the same IPoIB link. between any two nodes in the same IPoIB link.
7.0 The IPoIB Broadcast Group 7.0 The IPoIB Broadcast Group
Once an IB partition is created with link attributes identified for Once an IB partition is created with link attributes identified for
an IPoIB link, the network administrator must create a special IB an IPoIB link, the network administrator must create a special IB
all-node multicast group (henceforth referred to as the broadcast all-node multicast group (henceforth referred to as the broadcast
group) with these link attributes that every node on the IPoIB link group) with these link attributes for every node on the IPoIB link to
can join. join. The creation of an IB multicast group is through the use of
the "MCMemberRecord" SA attribute as described in the IBA
specification.
The MGID of the broadcast group will embed in it the P_Key of the IB The MGID of an IPoIB broadcast group will embed in it the P_Key of
partition that defines the IPoIB link. A special signature is also the IB partition that defines the IPoIB link. A special signature is
embedded to identify the MGID for IPoIB use only. For IPv4 over IB, also embedded to identify all the MGIDs for IPoIB use only. For IPv4
the signature will be "0x401B". For IPv6 over IB, the signature will over IB, the signature will be "0x401B". For IPv6 over IB, the
be "0x601B". signature will be "0x601B".
For an IPv4 subnet, the MGID for this special IB multicast group For an IPv4 subnet, the MGID for this special IB multicast group
SHALL have the following format: SHALL have the following format:
| 8 | 4 | 4 | 16 bits | 16 bits | 48 bits | 32 bits | | 8 | 4 | 4 | 16 bits | 16 bits | 48 bits | 32 bits |
+--------+----+----+----------------+---------+----------+---------+ +--------+----+----+----------------+---------+----------+---------+
|11111111|0001|scop|0100000000011011|< P_Key >|00.......0|<all 1's>| |11111111|0001|scop|0100000000011011|< P_Key >|00.......0|<all 1's>|
+--------+----+----+----------------+---------+----------+---------+ +--------+----+----+----------------+---------+----------+---------+
Figure 2 Figure 2
skipping to change at page 7, line 31 skipping to change at page 7, line 41
|11111111|0001|scop|0110000000011011|< P_Key >|000.............0001| |11111111|0001|scop|0110000000011011|< P_Key >|000.............0001|
+--------+----+----+----------------+---------+--------------------+ +--------+----+----+----------------+---------+--------------------+
Figure 3 Figure 3
As for the scop bits, if the IPoIB link is fully contained within a As for the scop bits, if the IPoIB link is fully contained within a
single IB subnet, the scop bits SHALL be set to 2 (link-local). single IB subnet, the scop bits SHALL be set to 2 (link-local).
Otherwise the scope will be set higher. Otherwise the scope will be set higher.
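The broadcast MGID layouts in Figures 2 and 3 can be assembled with shifts and ORs. The following is a minimal sketch: the function name broadcast_mgid and its parameters are hypothetical, the 80-bit width of the trailing field in the IPv6 layout is inferred from the visible part of Figure 3, and the result is returned as an ipaddress.IPv6Address only because that type prints 128-bit values conveniently.

```python
import ipaddress

IPV4_SIGNATURE = 0x401B
IPV6_SIGNATURE = 0x601B


def broadcast_mgid(pkey: int, ipv6: bool = False, scope: int = 2):
    """Build the IPoIB broadcast MGID for a link P_Key.

    scope defaults to 2 (link-local), the value required when the
    IPoIB link is fully contained within a single IB subnet.
    """
    if not 0 <= pkey <= 0xFFFF:
        raise ValueError("P_Key must fit in 16 bits")
    if not 0 <= scope <= 0xF:
        raise ValueError("scop must fit in 4 bits")
    signature = IPV6_SIGNATURE if ipv6 else IPV4_SIGNATURE
    # IPv4: 48 zero bits followed by 32 one bits; IPv6: 0...0001.
    low80 = 0x1 if ipv6 else 0xFFFFFFFF
    value = (
        (0xFF << 120)        # 11111111 prefix
        | (0x1 << 116)       # flgs = 0001
        | (scope << 112)     # scop
        | (signature << 96)  # IPoIB signature
        | (pkey << 80)       # link P_Key
        | low80
    )
    return ipaddress.IPv6Address(value)


print(broadcast_mgid(0x8006))
print(broadcast_mgid(0x8006, ipv6=True))
```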
The broadcast group for IPv4 will serve to provide a broadcast The broadcast group for IPv4 will serve to provide a broadcast
service for protocol like ARP to use. service for protocols like ARP to use.
When a node is brought up on an IPoIB link identified by a P_Key, it When a node is first brought up on an IPoIB link identified by a
must look for the right broadcast group to join. This is done by P_Key, it must look for the right broadcast group to join. This is
constructing the MGID with the link P_Key and the IPoIB signature. done by querying the SA MCMemberRecord database for a multicast group
The node SHOULD always look for a MGID of a link-local scope first with a MGID matching the one constructed from the link P_Key and the
before attempting one with a greater scope. IPoIB signature. The node SHOULD always look for a MGID of a link-
local scope first before attempting one with a greater scope.
Once the right MGID and broadcast group are identified, the local Once the right MGID and broadcast group are identified, the local
node SHOULD use the MTU associated with the broadcast group. In case node SHOULD use the MTU associated with the broadcast group. In case
the MTU of the broadcast group is greater than what the local HCA can the MTU of the broadcast group is greater than what the local HCA can
support, the node cannot join the IPoIB link and operate as an IP support, the node cannot join the IPoIB link and operate as an IP
node. Otherwise the local node must join the broadcast group and use node. Otherwise the local node must join the broadcast group as a
the rest of link attributes associated with the group for all future "full member" and use the rest of link attributes associated with the
communication to the link. group for all future communication to the link.
In addition to the special all-node multicast group for broadcast In addition to the special all-node multicast group for broadcast
purpose, an all-router multicast group SHOULD be created at link purpose, an all-router multicast group SHOULD be created at link
configuration time if an IP router will be attached to the link. This configuration time if an IP router will be attached to the link. This
is to facilitate IP multicast operations described later. An IB is to facilitate IP multicast operations described later. An IB
multicast group for the all-router MGID must cover every IB subnet multicast group for the all-router MGID must cover every IB subnet
that the IPoIB link encompasses. The format of the all-router MGID that the IPoIB link encompasses. The format of the all-router MGID
will be covered in next section. will be covered in the next section.
8.0 Mapping for other Multicast Groups 8.0 Mapping for other Multicast Groups
The support of general IP multicast [IPMULT] over IB is similar to The general IP multicast [IPMULT] support over IB is similar to the
the case of the special broadcast group discussed above. An case of the special broadcast group discussed above. An algorithmic
algorithmic mapping is used so that given an IP multicast address, mapping is used so that given an IP multicast address, individual
individual host can compute the corresponding IB multicast address host can compute the corresponding IB multicast address (MGID) all by
(MGID) all by itself without having to consult an external entity. itself without having to consult an external entity. This also
This also removes the need for an externally maintained IP to IB removes the need for an externally maintained IP to IB multicast
multicast mapping table. mapping table.
The IPoIB multicast mapping is depicted in Figure 4. The same mapping The IPoIB multicast mapping is depicted in Figure 4. The same mapping
function is used for both IPv4 and IPv6 except the IPoIB signature function is used for both IPv4 and IPv6 except the IPoIB signature
field. field.
| 8 | 4 | 4 | 16 bits | 16 bits | 80 bits | | 8 | 4 | 4 | 16 bits | 16 bits | 80 bits |
+--------+----+----+-----------------+---------+--------------------+ +--------+----+----+-----------------+---------+--------------------+
|11111111|0001|scop|<IPoIB signature>|< P_Key >| group ID | |11111111|0001|scop|<IPoIB signature>|< P_Key >| group ID |
+--------+----+----+-----------------+---------+--------------------+ +--------+----+----+-----------------+---------+--------------------+
Figure 4 Figure 4
Since a MGID allocated for transporting IP multicast datagrams is Since a MGID allocated for transporting IP multicast datagrams is
considered only a transient link-layer multicast address, all IB considered only a transient link-layer multicast address, all IB
MGIDs allocated for IPoIB purpose SHOULD have T = 1. The scope bits MGIDs allocated for IPoIB purpose SHOULD have T = 1. The scope bits
SHALL be the same as that of the all-node MGID for the same IPoIB SHALL be the same as that of the all-node MGID for the same IPoIB
link. link.
The IP multicast address is used together with a given IPoIB link An IP multicast address is used together with a given IPoIB link
P_Key to form the MGID of the IB multicast group. For IPv6 the lower P_Key to form the MGID of the IB multicast group. For IPv6 the lower
80 bits of the group ID are used directly in the lower 80 bits of the 80 bits of the group ID are used directly in the lower 80 bits of the
MGID. For IPv4, the group ID is only 28 bits long and the rest of the MGID. For IPv4, the group ID is only 28 bits long and the rest of the
80 bits are filled with 0. 80 bits are filled with 0.
The rest of the bits are the same as those of the broadcast MGID. The rest of the bits are the same as those of the broadcast MGID.
E.g. on an IPoIB link that is fully contained within a single IB For example, on an IPoIB link that is fully contained within a single
subnet with a P_Key of 8, the MGIDs for the all-router multicast IB subnet with a P_Key of 0x8006, the MGIDs for the all-router
group with group ID 2 [AARCH, IGMP2] are: multicast group with group ID 2 [AARCH, IGMP2] are:
FF12:401B:8:0:0:0:0:2 FF12:401B:8006:0:0:0:0:2
or or
FF12:401B:8::2 FF12:401B:8006::2
for IPv4 in a compressed format, and for IPv4 in a compressed format, and
FF12:601B:8:0:0:0:0:2 FF12:601B:8006:0:0:0:0:2
or or
FF12:601B:8::2 FF12:601B:8006::2
for IPv6 in a compressed format. for IPv6 in a compressed format.
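The mapping of Figure 4 can be sketched in a few lines of code. The following is a minimal illustration, not a normative implementation. The function name ipoib_mgid and its parameters are hypothetical, and the signature values 0x401B (IPv4) and 0x601B (IPv6) are taken from the example MGIDs above. Non-multicast input, including the IPv4 limited broadcast address, is rejected here because it maps to the broadcast MGID by a separate rule.

```python
import ipaddress

# Signature values as they appear in the example MGIDs of this section.
IPV4_SIGNATURE = 0x401B
IPV6_SIGNATURE = 0x601B


def ipoib_mgid(ip_mcast, p_key, scope):
    """Build the MGID for an IP multicast address (hypothetical helper).

    ip_mcast: IPv4 or IPv6 multicast address as a string
    p_key:    16-bit P_Key of the IPoIB link
    scope:    4-bit scope, same as that of the all-node MGID of the link
    """
    addr = ipaddress.ip_address(ip_mcast)
    if not addr.is_multicast:
        raise ValueError("not an IP multicast address: %s" % ip_mcast)
    if not 0 <= p_key <= 0xFFFF:
        raise ValueError("P_Key must fit in 16 bits")
    if not 0 <= scope <= 0xF:
        raise ValueError("scope must fit in 4 bits")

    if addr.version == 4:
        signature = IPV4_SIGNATURE
        group_id = int(addr) & 0x0FFFFFFF        # 28-bit IPv4 group ID
    else:
        signature = IPV6_SIGNATURE
        group_id = int(addr) & ((1 << 80) - 1)   # lower 80 bits

    mgid = (
        (0xFF << 120)          # 8 bits: 11111111
        | (0x1 << 116)         # 4 bits: flags 0001 (T = 1)
        | (scope << 112)       # 4 bits: scope
        | (signature << 96)    # 16 bits: IPoIB signature
        | (p_key << 80)        # 16 bits: P_Key
        | group_id             # 80 bits: group ID, zero-filled for IPv4
    )
    return ipaddress.IPv6Address(mgid)
```

With a P_Key of 0x8006 and link-local scope (2), ipoib_mgid("224.0.0.2", 0x8006, 2) and ipoib_mgid("ff02::2", 0x8006, 2) reproduce the two all-router MGIDs listed above, printed in lower case.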
A special case exists for the IPv4 limited broadcast address A special case exists for the IPv4 limited broadcast address
"255.255.255.255" [HOSTS]. The address SHALL be mapped to the "255.255.255.255" [HOSTS]. The address SHALL be mapped to the
broadcast MGID for IPv4 networks as described in section 7 above. broadcast MGID for IPv4 networks as described in section 7 above.
Also the IPv6 all-node multicast address "FF0X::1" [AARCH] maps Also the IPv6 all-node multicast address "FF0X::1" [AARCH] maps
naturally to the special broadcast MGID for IPv6 networks. naturally to the special broadcast MGID for IPv6 networks.
9.0 Sending and Receiving IP Multicast Packets 9.0 Sending and Receiving IP Multicast Packets
For any MGID the equivalent IB multicast group must be created first Multicast in InfiniBand differs in a number of ways from multicast in
before use. The implication for a sender is that to send a packet Ethernet. This adds some complexity to an IPoIB implementation when
destined for an IP multicast address, it must first check for the supporting IP multicast over IB.
existence of the IB multicast group corresponding to the MGID on the
outbound link. If one already exists, the MLID associated with the
multicast group is used as the DLID for the packet. Otherwise, it
implies no member exists on the local link. The packet should be
forwarded to locally connected routers. This is to allow local
routers to forward the packet to multicast listeners on remote
networks. The specific mechanism for a sender to forward packets to
routers is left to implementations. One can use, for example, the
broadcast group, or the all-router multicast group for this purpose.
A sender of multicast packets should cache information regarding the A) An IB multicast group must be explicitly created through the SA
MLID and other attributes of the target IB multicast group in before it can be used.
order to avoid expensive SA calls on every outgoing multicast packet.
The cache may need to be validated periodically. E.g., if SA supports
multicast group create/delete traps, the sender should register to
monitor the status of the target IB multicast group through event
notification. If multicast packets were sent to the all-router
multicast group because no local listener existed, the sender must be
notified by SA when listeners show up later on the local link. This
allows the sender to change the forwarding to the right multicast
group.
For a node to join an IP multicast group, it must first construct a This implies that in order to send a packet destined for an IP
MGID for it, using the rule described above. Note that it must multicast address, the IPoIB implementation must check with the SA on
remember the scope bits from the all-node MGID, and use the same the outbound link first for a "MCMemberRecord" that matches the MGID.
scope in all the MGIDs it constructs. If one does exist, the MLID associated with the multicast group is
used as the DLID for the packet. Otherwise, it implies no member
exists on the local link. The packet SHOULD be forwarded to locally
connected routers. This is to allow local routers to forward the
packet to multicast listeners on remote networks. The specific
mechanism for a sender to forward packets to routers is left to
implementations. One can use, for example, the broadcast group, or
the all-router multicast group for this purpose.
The local node then calls SA to join the IB multicast group B) A multicast sender must join the target multicast group as a
corresponding to the MGID. If the group doesn't already exist, one "SendOnlyNonMember" before outgoing multicast messages from it can be
successfully routed. The "SendOnlyNonMember" join is different from
the regular "FullMember" join in two aspects. First, both types of
joins enable multicast packets to be routed FROM the local port, but
only the "FullMember" join causes multicast packets to be routed TO
the port. Second, the sender port of a "SendOnlyNonMember" join will
not be counted as a member of the multicast group for purposes of
group creation and deletion.
The following code snippet demonstrates the steps in a typical
implementation when processing an egress multicast packet.
if the egress port is already a "SendOnlyNonMember", or a
"FullMember"
=> send the packet
else if the target multicast group exists
=> do "SendOnlyNonMember" join
=> send the packet
else if the all-router multicast group exists
=> send the packet to all routers
else
=> drop the packet
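The decision steps above can also be written as a small runnable sketch. This is only an illustration under stated assumptions: the names Action and egress_action are hypothetical, the dictionary "joined" stands in for the join state of the local port, and the set "existing_groups" stands in for the result of querying the SA for a matching "MCMemberRecord". A real implementation would issue SA requests instead.

```python
from enum import Enum


class Action(Enum):
    SEND = "send to the target multicast group"
    SEND_TO_ROUTERS = "send to the all-router multicast group"
    DROP = "drop the packet"


def egress_action(mgid, joined, existing_groups, all_router_mgid):
    """Decide how to handle an egress multicast packet (hypothetical helper).

    joined:          dict mapping MGID -> join state of the egress port
    existing_groups: set of MGIDs known to exist (stand-in for an SA query)
    """
    # Already a "SendOnlyNonMember" or a "FullMember": just send.
    if mgid in joined:
        return Action.SEND
    # The group exists: do a "SendOnlyNonMember" join, then send.
    if mgid in existing_groups:
        joined[mgid] = "SendOnlyNonMember"
        return Action.SEND
    # No local listener: hand the packet to the routers if possible.
    if all_router_mgid in existing_groups:
        return Action.SEND_TO_ROUTERS
    return Action.DROP
```

Recording the join state in "joined" means the second packet to the same group takes the first branch and needs no further SA interaction.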
Implementations should cache the information about the existence of
an IB multicast group, its MLID and other attributes. This is to
avoid expensive SA calls on every outgoing multicast packet. The
cache may need to be validated periodically. Senders should also
subscribe to the multicast group create and delete traps in order to
monitor the status of specific IB multicast groups. Multicast packets
directed to the all-router multicast group due to a lack of listener
on the local subnet must be forwarded to the right multicast group if
the group is created later. This happens when a listener shows up on
the local subnet.
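One possible shape for such a cache is sketched below. It is an assumption-laden illustration rather than a prescribed design: the class McastGroupCache, the callable sa_lookup (returning the MLID of a group, or None if the group does not exist), and the max_age parameter are all hypothetical. The invalidate method is meant to be called from the handlers of the group create and delete traps.

```python
import time


class McastGroupCache:
    """Cache of MGID -> MLID lookups (hypothetical sketch)."""

    def __init__(self, sa_lookup, max_age=30.0, clock=time.monotonic):
        self._sa_lookup = sa_lookup   # stand-in for a query to the SA
        self._max_age = max_age       # seconds before an entry is revalidated
        self._clock = clock
        self._entries = {}            # MGID -> (MLID or None, time stored)

    def lookup(self, mgid):
        """Return the MLID for mgid, or None if the group does not exist."""
        now = self._clock()
        entry = self._entries.get(mgid)
        if entry is not None and now - entry[1] < self._max_age:
            return entry[0]
        mlid = self._sa_lookup(mgid)
        # A missing group is cached too, so that packets redirected to
        # the routers do not trigger an SA call each time.
        self._entries[mgid] = (mlid, now)
        return mlid

    def invalidate(self, mgid):
        """Forget mgid; call this when a create or delete trap arrives."""
        self._entries.pop(mgid, None)
```

Invalidating on a create trap is what lets a sender that was forwarding to the all-router group switch to the right multicast group once a listener shows up on the local subnet.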
A node joining an IP multicast group must first construct a MGID
according to the rule described in section 8 above. Once the correct
MGID is calculated, the node must call the SA of the outbound link to
attempt a "FullMember" join of the IB multicast group corresponding
to the MGID. If the IB multicast group doesn't already exist, one
must be created first with the IPoIB link MTU. For the rest of must be created first with the IPoIB link MTU. For the rest of
the attributes, it is recommended the same values from the all-node the attributes, it is recommended the same values from the all-node
multicast/broadcast group be used. multicast/broadcast group be used.
The join call enables SM to program local IB switches and routers The join request will cause the local port to be added to the
with the new multicast information. Specifically it causes an IB multicast group. It also enables the SM to program IB switches and
switch to add the LID of the caller to its forwarding table entry routers with the new multicast information to ensure the correct
corresponding to the MLID allocated for the group. It also causes an forwarding of multicast packets for the group.
IB router to attach itself to the IB multicast tree corresponding to
the MGID.
When a node leaves an IP multicast group, it SHOULD notify the SA in When a node leaves an IP multicast group, it SHOULD make a
order for all the related resources to be freed up. This gives SM an "FullMember" leave request to the SA. This gives SM an opportunity to
opportunity to delete an IB multicast group that is no longer in use, update relevant forwarding information, to delete an IB multicast
and free up the MLID allocated for it. The specific algorithm is group if the local port is the last FullMember to leave, and free up
implementation-dependent, and is out of the scope of this document. the MLID allocated for it. The specific algorithm is implementation-
dependent, and is out of the scope of this document.
Note that for an IPoIB link that spans more than one IB subnet Note that for an IPoIB link that spans more than one IB subnet
connected by IB routers, adequate multicast forwarding support at connected by IB routers, adequate multicast forwarding support at
the IB level is required for multicast packets to reach listeners on the IB level is required for multicast packets to reach listeners on
remote IB subnets. The specific mechanism for this will be covered in a remote IB subnet. The specific mechanism for this will be covered
[IBTA], and is beyond the scope of IPoIB. in [IBTA], and is beyond the scope of IPoIB.
10.0 Security Considerations 10.0 IP Multicast Routing
IP multicast routing requires multicast routers to receive a copy of
every link multicast packet on a locally connected link [IPMULT,
IP6MLD]. For Ethernet this is usually achieved by turning on the
promiscuous multicast mode on a locally connected Ethernet interface.
IBA does not provide any hardware support for promiscuous multicast
mode. Fortunately a promiscuous multicast mode can be emulated in
the software running on a router through the following steps.
A) Obtain a list of all active IB multicast groups from the local SA.
B) Make a "NonMember" join request to the SA for every group that has
a signature in its MGID matching the one for either IPv4 or IPv6.
C) Subscribe to the IB multicast group creation events using a
wildcarded MGID so that the router can "NonMember" join all IB
multicast groups created subsequently for IPv4 or IPv6.
The "NonMember" join has the same effect as a "FullMember" join
except that the former will not be counted as a member of the
multicast group for purposes of group creation or deletion. That is,
when the last "FullMember" leaves a multicast group, the group can be
safely deleted by the SA without regard to any "NonMember" routers.
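Steps A) and B) amount to filtering the active groups by the signature field of the MGID. The sketch below illustrates only that filter. The helper names are hypothetical, the signature values 0x401B and 0x601B are taken from the example MGIDs in section 8, and the list of active MGIDs is assumed to have been obtained from the SA by some other means.

```python
import ipaddress

# IPoIB signatures as they appear in the example MGIDs of section 8.
IPOIB_SIGNATURES = {0x401B, 0x601B}


def is_ipoib_mgid(mgid):
    """True if mgid is a multicast GID carrying an IPv4 or IPv6 signature."""
    value = int(ipaddress.IPv6Address(mgid))
    is_multicast = (value >> 120) == 0xFF
    signature = (value >> 96) & 0xFFFF
    return is_multicast and signature in IPOIB_SIGNATURES


def groups_to_join(active_mgids):
    """Select the groups a router would "NonMember" join (steps A and B)."""
    return [mgid for mgid in active_mgids if is_ipoib_mgid(mgid)]
```

The same predicate could be applied in the handler for the group creation events of step C) to decide whether a newly created group needs a "NonMember" join.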
11.0 Security Considerations
All the operations for creating and configuring an IPoIB link All the operations for creating and configuring an IPoIB link
described in this document, including assigning P_Keys to CAs, described in this document, including assigning P_Keys to CAs,
creating IB multicast groups in SA, creating and attaching QPs to IB creating IB multicast groups in SA, creating and attaching QPs to IB
multicast groups,... etc are privileged operations, and MUST be multicast groups,... etc, are privileged operations, and MUST be
protected by the underlying operating system. This is to prevent protected by the underlying operating system. This is to prevent
malicious, non-privileged software from hijacking important malicious, non-privileged software from hijacking important
resources and configurations. E.g. A bogus IPoIB broadcast group may resources and configurations. For example, a bogus IPoIB broadcast
prevent a proper one from being created when the network group may prevent a proper one from being created when the network
administrator tries to set up a link. administrator tries to set up a link.
Controlled Q_Keys SHOULD be used in IPoIB links. This is to prevent Controlled Q_Keys SHOULD be used in IPoIB links. This is to prevent
non-privileged software from fabricating IP datagrams to send, as non-privileged software from fabricating IP datagrams to send, as
mentioned in section 6.2. mentioned in section 6.2.
11.0 Acknowledgments 12.0 Acknowledgments
The authors would like to thank Bruce Beukema, David Brean, Dan The authors would like to thank Bruce Beukema, David Brean, Dan
Cassiday, Aditya Dube, Yaron Haviv, Michael Krause, Thomas Narten, Cassiday, Aditya Dube, Yaron Haviv, Michael Krause, Thomas Narten,
Erik Nordmark, Greg Pfister, Renato Recio, Satya Sharma, and David L. Erik Nordmark, Greg Pfister, Renato Recio, Satya Sharma, and David L.
Stevens for their suggestions and many clarifications on the IBA Stevens for their suggestions and many clarifications on the IBA
specification. specification.
12.0 References 13.0 References
[AARCH] Hinden, R. and S. Deering "IP Version 6 Addressing [AARCH] Hinden, R. and S. Deering "IP Version 6 Addressing
Architecture", RFC 2373, July 1998. Architecture", RFC 2373, July 1998.
[DHCP] R. Droms "Dynamic Host Configuration Protocol", RFC 2131, [DHCP] R. Droms "Dynamic Host Configuration Protocol", RFC 2131,
March 1997. March 1997.
[DISC] Narten, T., Nordmark, E. and W. Simpson, "Neighbor [DISC] Narten, T., Nordmark, E. and W. Simpson, "Neighbor
Discovery for IP Version 6 (IPv6)", RFC 2461, December Discovery for IP Version 6 (IPv6)", RFC 2461, December
1998. 1998.
skipping to change at page 11, line 36 skipping to change at page 13, line 5
[IPMULT] Deering S., "Host Extensions for IP Multicasting", RFC [IPMULT] Deering S., "Host Extensions for IP Multicasting", RFC
1112, August 1989. 1112, August 1989.
[IPoIB_ARCH] draft-ietf-ipoib-architecture-01.txt [IPoIB_ARCH] draft-ietf-ipoib-architecture-01.txt
[IPoIB_ENCAP] draft-ietf-ipoib-ip-over-infiniband-01.txt [IPoIB_ENCAP] draft-ietf-ipoib-ip-over-infiniband-01.txt
[IPV6] Deering, S. and R. Hinden, "Internet Protocol, Version 6 [IPV6] Deering, S. and R. Hinden, "Internet Protocol, Version 6
(IPv6) Specification", RFC 2460, December 1998. (IPv6) Specification", RFC 2460, December 1998.
[IP6MLD] Deering S., Fenner W., Haberman B., "Multicast Listener
Discovery (MLD) for IPv6", RFC 2710, October 1999.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997. Requirement Levels", BCP 14, RFC 2119, March 1997.
13.0 Author's Address 14.0 Author's Address
H.K. Jerry Chu H.K. Jerry Chu
17 Network Circle, UMPK17-201 17 Network Circle, UMPK17-201
Menlo Park, CA 94025 Menlo Park, CA 94025
USA USA
Phone: +1 650 786-5146 Phone: +1 650 786-5146
EMail: jerry.chu@sun.com EMail: jerry.chu@sun.com
Vivek Kashyap Vivek Kashyap
IBM IBM
15450, SW Koll Parkway 15450, SW Koll Parkway
Beaverton, OR 97006 Beaverton, OR 97006
Phone: 503 578 3422 Phone: 503 578 3422
EMail: vivk@us.ibm.com EMail: vivk@us.ibm.com
14.0 Full Copyright Statement 15.0 Full Copyright Statement
Copyright (C) The Internet Society (2002). All Rights Reserved. Copyright (C) The Internet Society (2003). All Rights Reserved.
This document and translations of it may be copied and furnished to This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published or assist in its implementation may be prepared, copied, published
and distributed, in whole or in part, without restriction of any and distributed, in whole or in part, without restriction of any
kind, provided that the above copyright notice and this paragraph are kind, provided that the above copyright notice and this paragraph are
included on all such copies and derivative works. However, this included on all such copies and derivative works. However, this
document itself may not be modified in any way, such as by removing document itself may not be modified in any way, such as by removing
the copyright notice or references to the Internet Society or other the copyright notice or references to the Internet Society or other
Internet organizations, except as needed for the purpose of Internet organizations, except as needed for the purpose of