draft-ietf-ipoib-link-multicast-00.txt   draft-ietf-ipoib-link-multicast-01.txt 
INTERNET-DRAFT H.K. Jerry Chu INTERNET-DRAFT H.K. Jerry Chu
<draft-ietf-ipoib-link-multicast-00.txt> Sun Microsystems <draft-ietf-ipoib-link-multicast-01.txt> Sun Microsystems
Vivek Kashyap Vivek Kashyap
IBM IBM
Expires: July, 2002 January, 2002 Expires: October, 2002 April, 2002
IP link and multicast over InfiniBand networks IP link and multicast over InfiniBand networks
Status of this Memo Status of this Memo
This document is an Internet-Draft and is in full conformance with This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026. all provisions of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that other Task Force (IETF), its areas, and its working groups. Note that other
skipping to change at page 1, line 41 skipping to change at page 1, line 41
Copyright (C) The Internet Society (date). All Rights Reserved. Copyright (C) The Internet Society (date). All Rights Reserved.
Abstract Abstract
This document specifies a method for setting up IP subnets and This document specifies a method for setting up IP subnets and
multicast services over InfiniBand(TM) networks. Discussions in this multicast services over InfiniBand(TM) networks. Discussions in this
document are applicable to both IPv4 and IPv6, unless explicitly document are applicable to both IPv4 and IPv6, unless explicitly
specified. A separate document will cover unicast and encapsulation specified. A separate document will cover unicast and encapsulation
of IP datagrams over InfiniBand networks. of IP datagrams over InfiniBand networks.
1. Introduction Table of Contents
1.0 Introduction
2.0 Terminology
3.0 Basic IPoIB Transport - Unreliable Datagram
4.0 IB Multicast Architecture
5.0 IB Links vs IPoIB Links
6.0 Setting up an IPoIB Link
6.1 Maximum Transmission Unit
6.2 IPoIB Link Q_Key
6.3 Other Link Attributes
7.0 The IPoIB All-Node Multicast and Broadcast Group
8.0 Mapping for other Multicast Groups
9.0 Sending and Receiving IP Multicast Packets
10.0 Support for IP Multicast Routing
11.0 Security Considerations
12.0 Acknowledgments
13.0 References
14.0 Author's Address
15.0 Full Copyright Statement
1.0 Introduction
InfiniBand Architecture (IBA) defines four layers of network services InfiniBand Architecture (IBA) defines four layers of network services
corresponding to layer one through layer four of the OSI reference corresponding to layer one through layer four of the OSI reference
model. For the purpose of running IP over an InfiniBand (IB) model. For the purpose of running IP over an InfiniBand (IB)
network, the IB network and all its link, network, and transport network, the IB link, network, and transport layers collectively
layers collectively constitute the data link layer to the IP stack. constitute the data link layer to the IP stack. One can find a
general overview of IB architecture related to IP networks in
An IP network is often divided into many subnets connected by IP [IPoIB_ARCH].
routers. Subnetting allows the containment of broadcast traffic
within a single subnet. It also provides a certain degree of isolation
between nodes on different subnets. The latter may be an important
consideration for administrative purposes.
This document will focus on all the steps required to lay out an IP This document will focus on the steps required to lay out an IP
network on top of an IB network. It will describe all the elements an network on top of an IB network. It will describe all the elements an
IP over InfiniBand (IPoIB) link consists of, how to configure its IP over InfiniBand (IPoIB) link consists of, how to configure its
associated link attributes, and how to set up basic broadcast and associated link attributes, and how to set up basic broadcast and
multicast services on an IPoIB link. multicast services on an IPoIB link. An IPoIB link is the building block
for an IP network to be decomposed into multiple subnets connected by
IP routers. Subnetting allows the containment of broadcast traffic
within a single subnet. It also provides a certain degree of isolation
between nodes on different subnets. The latter may be an important
consideration for administrative purposes.
2. Terminology 2.0 Terminology
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC2119]. document are to be interpreted as described in [RFC2119].
3. Basic IPoIB Transport - Unreliable Datagram 3.0 Basic IPoIB Transport - Unreliable Datagram
InfiniBand defines four types of transport services [IBTA]. They are InfiniBand defines four types of transport services [IBTA]. They are
reliable connection, unreliable connection, reliable datagram, reliable connection, unreliable connection, reliable datagram,
unreliable datagram. IBA also defines a special raw datagram service unreliable datagram. IBA also defines a special raw datagram service
for encapsulation purposes. Both unreliable datagram and raw datagram for encapsulation purposes. Both unreliable datagram and raw datagram
define support for multicast. They provide the basic transport define support for multicast. They provide the basic transport
mechanism that best matches the IP datagram paradigm. mechanism that best matches the IP datagram paradigm.
IB unreliable datagram provides many additional transport features IB unreliable datagram provides many additional transport features
such as the partition key (P_Key) protection, multiple queue pairs such as the partition key (P_Key) protection, multiple queue pairs
(QPs), and Q_Key protection. Moreover, it requires a 32-bit invariant (QPs), and Q_Key protection. Moreover, it requires a 32-bit invariant
CRC checksum, which provides a much stronger protection against data CRC checksum, which provides a much stronger protection against data
corruption, compared with the 16-bit CRC that a raw datagram carries. corruption, compared with the 16-bit CRC that a raw datagram carries.
For these reasons, IB unreliable datagram is considered to be a much For these reasons, IB unreliable datagram is considered to be a much
better choice as the basic IPoIB transport than the raw datagram, and better choice as the basic IPoIB transport than the raw datagram, and
is chosen as the default IPoIB transport mechanism for the rest of is chosen as the default IPoIB transport mechanism ([IPoIB_ARCH],
discussions in this document. [IPoIB_ENCAP]).
An IB unreliable datagram contains the following headers:
o Local Route Header (LRH) - provides IB link-layer addressing
information. An IB link layer address is based on a 16-bit
identifier called Local Identifier (LID), and is used by IB
switches to relay packets within an IB subnet.
o Global Route Header (GRH) - provides routing information for IB
routers to relay packets between IB subnets inside an IB fabric.
GRH is only required for all multicast packets and any unicast
packet that is destined to a node in a different IB subnet. GRH
carries an IB network-layer address, which is a 128-bit
identifier called Global Identifier (GID) that closely mimics IPv6
addressing architecture [AARCH].
o Base Transport Header (BTH) - provides various information,
including P_Key, destination queue pair number (QPN) for IB
transport services.
o Datagram Extended Header (DETH) - provides additional IB
information such as Q_Key, source queue pair number for datagram
services.
From the perspective of IP over IB encapsulation, all the above IB 4.0 IB Multicast Architecture
headers are considered as link layer encapsulation for IP datagrams.
4. IB Multicast Architecture The following discussion gives a short overview of the multicast
architecture in InfiniBand. For a more complete description, the
reader is referred to [IBTA] and [IPoIB_ARCH].
IBA defines two layers of multicast services. Its link layer uses IBA defines two layers of multicast services. Its link layer uses
multicast LIDs (MLIDs), which are allocated by the Subnet Manager multicast LIDs (MLIDs), which are allocated by the Subnet Manager
(SM) and fall in the range 0xC000 to 0xFFFE (approximately (SM) and fall in the range 0xC000 to 0xFFFE (approximately
16k). MLIDs are used by IB switches to program their multicast 16k). MLIDs are used by IB switches to program their multicast
forwarding tables. An IB switch implementation may support much fewer forwarding tables. An IB switch implementation may support much fewer
MLIDs in its forwarding table though. MLIDs in its forwarding table though.
IB network layer uses multicast GIDs (MGIDs), which closely resemble IB network layer uses multicast GIDs (MGIDs), which closely resemble
IPv6 multicast addresses [AARCH] as shown in the following. IPv6 multicast addresses [AARCH] shown below.
| 8 | 4 | 4 | 112 bits | | 8 | 4 | 4 | 112 bits |
+--------+----+----+---------------------------------------------+ +--------+----+----+---------------------------------------------+
|11111111|flgs|scop| group ID | |11111111|flgs|scop| group ID |
+--------+----+----+---------------------------------------------+ +--------+----+----+---------------------------------------------+
11111111 at the start of the address identifies the address as [IPoIB_ARCH] describes each field in more detail.
being a multicast address.
+-+-+-+-+
flgs is a set of 4 flags: |0|0|0|T|
+-+-+-+-+
The high-order 3 flags are reserved, and must be initialized to
0.
T = 0 indicates a permanently-assigned ("well-known") multicast
address, assigned by the global internet numbering authority.
T = 1 indicates a non-permanently-assigned ("transient")
multicast address.
scop is a 4-bit multicast scope value used to limit the scope of
the multicast group. The values are:
0 reserved
1 node-local scope
2 link-local scope
3 (unassigned)
4 (unassigned)
5 site-local scope
6 (unassigned)
7 (unassigned)
8 organization-local scope
9 (unassigned)
A (unassigned)
B (unassigned)
C (unassigned)
D (unassigned)
E global scope
F reserved
group ID identifies the multicast group, either permanent or
transient, within the given scope.
MGIDs are mainly used by IB routers when forwarding multicast packets
to remote IB subnets that are part of a multicast forwarding tree.
Since every IB multicast packet is required to carry both LRH and Since every IB multicast packet is required to carry both LRH and
GRH, a valid MGID and a valid MLID are both needed before a valid IB GRH, a valid MGID and a valid MLID are both needed before a valid IB
multicast packet can be constructed. multicast packet can be constructed.
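As a rough illustration of the MGID layout above, the following Python sketch splits a 128-bit MGID into its flags, scope, and group ID fields. The function name and the keys of the returned dictionary are illustrative assumptions, not names taken from the IBA specification. The MGID is written in IPv6 textual form only because the two formats share the same layout.

```python
import ipaddress


def parse_mgid(mgid_text):
    """Split an MGID (given in IPv6 textual form) into its fields.

    Hypothetical helper for illustration only.
    """
    value = int(ipaddress.IPv6Address(mgid_text))
    prefix = value >> 120                 # top 8 bits must be 11111111
    if prefix != 0xFF:
        raise ValueError("not a multicast GID: %s" % mgid_text)
    flags = (value >> 116) & 0xF          # 4-bit flgs field
    return {
        "flags": flags,
        "transient": bool(flags & 0x1),   # T bit is the low-order flag
        "scope": (value >> 112) & 0xF,    # 4-bit scop field
        "group_id": value & ((1 << 112) - 1),  # low 112 bits
    }
```

For example, `parse_mgid("FF12::1")` reports a transient group with link-local scope (2) and group ID 1.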
An IB multicast group is uniquely identified by a valid MGID. Before An IB multicast group is uniquely identified by a valid MGID. Before
a MGID can be used within an IB subnet, either as a destination a MGID can be used within an IB subnet, either as a destination
address of a multicast packet, or representing a multicast group that address of a multicast packet, or representing a multicast group that
an IB node can join, a "MCGroupRecord" corresponding to the MGID must an IB node can join, a "MCGroupRecord" corresponding to the MGID must
be created through the Subnet Administrator (SA). Besides the MGID, be created through the Subnet Administrator (SA). Besides the MGID,
the creator must supply values of the path MTU, Q_Key, TClass, the creator must supply values of the path MTU, Q_Key, TClass,
FlowLabel, HopLimit that are appropriate for all the potential FlowLabel, HopLimit that are appropriate for all the potential
clients of the multicast group to use. In return, SA will allocate a clients of the multicast group to use. In return, SA will allocate a
MLID to be used by switches in the local IB subnet. MLID to be used by switches in the local IB subnet.
Note that MLIDs are allocated and managed by SM when new MGIDs are
created through the creation of MCGroupRecords. The number of valid
MLIDs that are available in a given IB subnet is limited by the
implementation-dependent size of multicast forwarding table of IB
switches. Since the number can be small, reuses of MLIDs for MGIDs
may be inevitable. Implementation should nevertheless avoid sharing
the same MLID among high volume multicast groups in order to reduce
software filtering overhead and attain higher efficiency.
Unreliable multicast is defined by IBA as an optional functionality Unreliable multicast is defined by IBA as an optional functionality
for channel adaptors (CAs) and switches. In today's IP technology, for channel adaptors (CAs) and switches. In today's IP technology,
link multicast has become an indispensable function for better link multicast has become an indispensable function for better
supporting a modern IP network. For this reason, it is required that supporting a modern IP network. For this reason, it is required that
an IPoIB fabric supports multicast. This includes all the CAs and an IPoIB fabric supports multicast. This includes all the CAs and
switches that make up an IP network. switches that make up an IP network.
5. IB Links vs IPoIB Links 5.0 IB Links vs IPoIB Links
A link segment on top of which an IP subnet can be configured is A link segment on top of which an IP subnet can be configured is
defined in [IPV6] as a communication facility or medium over which defined in [IPV6] as a communication facility or medium over which
nodes can communicate at the "link" layer. For most types of nodes can communicate at the "link" layer. For most types of
communication media, the boundary between different data link communication media, the boundary between different data link
segments follows the physical topology of the network connectivity, segments follows the physical topology of the network connectivity,
and is pretty obvious. E.g. an Ethernet network connected by and is pretty obvious. E.g. an Ethernet network connected by
switches, hubs, or bridges usually forms a single link segment and switches, hubs, or bridges usually forms a single link segment and
broadcast/multicast domain. Different Ethernet segments can be broadcast/multicast domain. Different Ethernet segments can be
connected by IP routers at the network layer. connected by IP routers at the network layer.
InfiniBand defines its own link-layer and subnets consisting of nodes InfiniBand defines its own link-layer and subnets consisting of nodes
connected by IB switches. However, the IPoIB link boundary needs not connected by IB switches. However, the IPoIB link boundary need not
follow the IB link boundary. Nodes residing on different IB subnets follow the IB link boundary. Nodes residing on different IB subnets
can still communicate directly with one another through IB routers at can still communicate directly with one another through IB routers at
the InfiniBand network layer. The same applies to multicast as well. the InfiniBand network layer. This communication at the network layer
I.e. nodes on the same IB subnet can exchange multicast packets with applies to both unicast as well as multicast.
one another all within the same subnet through the IB link multicast
facility. But even nodes on different IB subnets can still exchange
multicast packets with one another using IB network-layer multicast.
The ultimate requirement for two nodes in the same IB fabric to The ultimate requirement for two nodes in the same IB fabric to
communicate at the IB level, besides the physical connectivity, is a communicate at the IB level, besides the physical connectivity, is a
common P_Key. common P_Key.
Partitioning in IB provides an isolation mechanism among nodes in an Partitioning in IB provides an isolation mechanism among nodes in an
IB fabric, much like VLANs in an Ethernet network. Each port of an IB fabric, much like VLANs in an Ethernet network. Each HCA (Host
endnode contains a P_Key table of all the valid P_Keys the port is Channel Adaptor) port of an endnode contains a P_Key table of all the
allowed to use. The P_Key table is set up by the SM of the local IB valid P_Keys the port is allowed to use. The P_Key table is set up by
subnet. Each QP is programmed with a P_Key from the local P_Key the SM of the local IB subnet. Each QP is programmed with a P_Key
table. This P_Key is carried in the BTH of all the outgoing packets from the local P_Key table. This P_Key is carried in all the
from the QP, and is used to compare against the P_Key in the BTH of outgoing packets from the QP, and is used to compare against the
all the incoming packets to the QP. Reception of an invalid P_Key P_Key of the incoming packets to the QP. Reception of an invalid
causes the packet to be discarded. IB switches may optionally enforce P_Key causes the packet to be discarded. IB switches may optionally
partition checking too. enforce partition checking too.
Therefore P_Key and IB partition are the most natural choice for Therefore P_Key and IB partition are the natural choice for defining
defining IPoIB link boundary. It also affords much flexibility to the IPoIB link boundary. It also affords much flexibility to the network
network administrators when different links are set up in a large administrators when multiple links are set up in a large network.
network. This is very similar to VLANs in Ethernet.
6. Setting up an IPoIB Link 6.0 Setting up an IPoIB Link
A network administrator defines an IPoIB link by setting up an IB A network administrator defines an IPoIB link by setting up an IB
partition and assigning it a unique P_Key. An IB partition may or may partition and assigning it a unique P_Key. An IB partition may or may
not span multiple IB subnets. But whether it does or not is mostly not span multiple IB subnets, and whether it does or not is mostly
irrelevant to IPoIB. transparent to IPoIB.
Each node attached to the IB partition MUST have one of its CA Each node attached to the IB partition MUST have one of its HCAs
assigned the P_Key to use. assigned the P_Key to use. Note that the P_Key table of a HCA port
may contain many P_Keys. It is up to an implementation to define the
method by which the P_Key relevant to a particular IPoIB subnet is
determined and conveyed to the IPoIB stack. E.g. implementations can
resort to a manual configuration to choose the P_Key or a set of
P_Keys for IPoIB use, and rely on DHCP [DHCP] on each IPoIB link to
assign an IP subnet number.
Once an IB partition is established for IPoIB use, the link MTU and Once an IB partition is established for IPoIB use, the link MTU and
Q_Key are two other important attributes that must be chosen before Q_Key are two other important attributes that must be chosen before
the IPoIB link can be configured. the IPoIB link can be configured.
6.1 Maximum Transmission Unit 6.1 Maximum Transmission Unit
IB defines five permissible maximum payload sizes. They are 256, 512, IB defines five permissible maximum payload sizes. They are 256, 512,
1024, 2048 and 4096 bytes. [IPV6] requires a link MTU of 1280 bytes 1024, 2048 and 4096 bytes. [IPV6] requires a link MTU of 1280 bytes
or greater. This leaves only 2048 and 4096 bytes as two acceptable or greater. To be better compatible with Ethernet, the dominant
choices for IPv6. Channel adaptors supporting a maximum payload size network media in both the LAN and WAN environment, the IPoIB link MTU
less than the minimal MTU requirement can still expose an acceptable SHALL be 1500 bytes or greater. This leaves only 2048 and 4096 bytes
link MTU to IP through an adaptation layer that fragments larger as two acceptable maximum payload sizes for IPoIB. Channel adaptors
messages into smaller IB packets, and reassembles them on the supporting a maximum payload size less than the minimal requirement
receiving end. But this must be done in a way that is completely can still expose an acceptable link MTU to IP through an adaptation
transparent to the IP stack. layer that fragments larger messages into smaller IB packets, and
reassembles them on the receiving end. But this must be done in a way
that is totally transparent to the IP stack.
It is up to the network administrator to select a link MTU to use It is up to the network administrator to select a link MTU to use
when configuring an IPoIB link. The link MTU SHALL NOT be greater when configuring an IPoIB link. The link MTU SHALL NOT be greater
than the maximum payload size of any CA or switch connected to the than the maximum payload size of any IB component that is part of the
IPoIB link. IPoIB link minus "EtherType" [IPoIB_ENCAP]. This includes IB
switches, CAs, or routers.
In general a larger link MTU can potentially offer a better In general, a full link MTU should be employed whenever possible to
throughput performance. The caveat is that once a link MTU is chosen attain better throughput performance. One caveat is that once a link
for a given IPoIB link, nodes connected by CAs of a smaller maximum MTU is chosen for a given IPoIB link, nodes connected by CAs of a
payload size won't be able to join the link unless the whole link and smaller maximum payload size won't be able to join the link unless
all the nodes attached to it are reconfigured to use a smaller MTU. the whole link and all the devices attached to it are reconfigured to
use a smaller MTU.
Note that the above discussion assumes that IP datagrams are fully The flexibility of configuring a smaller than the full link MTU size
encapsulated in the payload of IB unreliable datagrams. The actual does make it easier for one to bridge an IPoIB link with an Ethernet
MTU size, i.e., the payload size available for IP datagrams to use, link, by setting the MTU of all the connecting nodes to 1500 bytes.
may be slightly smaller. This will depend on the actual IPoIB For IPv4, this may require a manual configuration of a MTU different
encapsulation scheme, which will be covered in a separate document. from the default, link MTU size on all the nodes belonging to an
IPoIB link. For IPv6, one can use the link MTU option of the router
advertisement [DISC] to announce a smaller MTU to all the nodes.
Note also that in case an IPoIB link spans more than one IB subnet, In case an IPoIB link spans more than one IB subnet, the IPoIB link
the IPoIB link MTU MUST NOT be set to greater than the path MTU of MTU MUST NOT exceed the path MTU of any path connecting two nodes in
any path connecting two nodes in the same IB partition. It is up to the same IB partition. It is up to the network administrator to
the network administrator to determine the appropriate path MTU value determine the appropriate path MTU value that will work for any node
that will work for any node in the same IPoIB link. in the same IPoIB link.
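The MTU rules of this section can be sketched as a small calculation. The Python below follows the -01 wording: the link MTU is bounded by the smallest maximum payload size of any IB component on the link, minus the encapsulation overhead, and must be at least 1500 bytes. The function name is hypothetical, and the overhead is left as a parameter because its size is defined in [IPoIB_ENCAP], not here. The value 4 used in the example is an assumption for illustration only.

```python
# The five maximum payload sizes that IB permits, in bytes.
IB_PAYLOAD_SIZES = (256, 512, 1024, 2048, 4096)

# Minimum IPoIB link MTU required by the -01 text.
MIN_IPOIB_LINK_MTU = 1500


def max_link_mtu(component_payload_sizes, encap_overhead):
    """Return the largest link MTU usable on an IPoIB link.

    component_payload_sizes: maximum payload size of every CA, switch,
    and router on the link. encap_overhead: bytes consumed by the IPoIB
    encapsulation (an assumption supplied by the caller).
    """
    for size in component_payload_sizes:
        if size not in IB_PAYLOAD_SIZES:
            raise ValueError("invalid IB payload size: %d" % size)
    limit = min(component_payload_sizes) - encap_overhead
    if limit < MIN_IPOIB_LINK_MTU:
        raise ValueError("link cannot offer an MTU of 1500 bytes or more")
    return limit


# Example: one 4096-byte CA and two 2048-byte components, assuming a
# hypothetical 4-byte encapsulation overhead.
print(max_link_mtu([4096, 2048, 2048], 4))
```

A component limited to 1024 bytes makes the function raise, which mirrors the statement that such a device needs a transparent adaptation layer or cannot join the link.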
6.2. IPoIB Link Q_Key 6.2 IPoIB Link Q_Key
A Q_Key is programmed by the source QP in every IB datagram, and is A Q_Key is programmed by the source QP in every IB datagram, and is
verified by the destination QP against the Q_Key it has been verified by the destination QP against the Q_Key it has been
assigned. A Q_Key violation will cause the offending datagram to be assigned. A Q_Key violation will cause the offending datagram to be
dropped, and a Q_Key violation trap to be raised. dropped, and a Q_Key violation trap to be raised.
A Q_Key must be selected to be used by all the QPs attached to an A Q_Key must be selected to be used by all the QPs attached to an
IPoIB link. It is recommended that a controlled Q_Key be used with IPoIB link. It is recommended that a controlled Q_Key be used with
the high order bit set. This is to prevent non-privileged software the high order bit set. This is to prevent non-privileged software
from fabricating and sending out bogus IP datagrams. All QPs from fabricating and sending out bogus IP datagrams. All QPs
skipping to change at page 7, line 32 skipping to change at page 6, line 40
6.3 Other Link Attributes 6.3 Other Link Attributes
TClass, FlowLabel, and HopLimit are three other attributes that are TClass, FlowLabel, and HopLimit are three other attributes that are
required for an IPoIB link covering more than a single IB subnet. required for an IPoIB link covering more than a single IB subnet.
The selection of these values is implementation dependent. But it The selection of these values is implementation dependent. But it
must take into account the topology of IB subnets comprising the must take into account the topology of IB subnets comprising the
IPoIB link to ensure successful communication between any two nodes IPoIB link to ensure successful communication between any two nodes
in the same IPoIB link. in the same IPoIB link.
7. The IPoIB All-Node Multicast and Broadcast Group 7.0 The IPoIB All-Node Multicast and Broadcast Group
Once an IB partition is created with link attributes identified for Once an IB partition is created with link attributes identified for
an IPoIB link, the network administrator must create a special IB an IPoIB link, the network administrator must create a special IB
multicast group for every node on the IPoIB link to join. This is multicast group for every node on the IPoIB link to join. This is
achieved through the creation of "MCGroupRecord" in each IB subnet achieved through the creation of "MCGroupRecord" in each IB subnet
that the IB partition encompasses, as described in section 4 above. that the IB partition encompasses, as described in section 4 above.
The MGID will have the P_Key of the IB partition that defines the The MGID will have the P_Key of the IB partition that defines the
IPoIB link embedded in it. A special signature is also embedded to IPoIB link embedded in it. A special signature is also embedded to
identify the MGID for IPoIB use only. For IPv4 over IB, the signature identify the MGID for IPoIB use only. For IPv4 over IB, the signature
skipping to change at page 8, line 16 skipping to change at page 7, line 25
| 8 | 4 | 4 | 16 bits | 16 bits | 80 bits | | 8 | 4 | 4 | 16 bits | 16 bits | 80 bits |
+--------+----+----+-----------------+---------+--------------------+ +--------+----+----+-----------------+---------+--------------------+
|11111111|0001|scop|<IPoIB signature>|< P_Key >|000.............0001| |11111111|0001|scop|<IPoIB signature>|< P_Key >|000.............0001|
+--------+----+----+-----------------+---------+--------------------+ +--------+----+----+-----------------+---------+--------------------+
As for the scop bits, if the IPoIB link is fully contained within a As for the scop bits, if the IPoIB link is fully contained within a
single IB subnet, the scop bits SHALL be set to 2 (link-local). single IB subnet, the scop bits SHALL be set to 2 (link-local).
Otherwise the scope will be set higher. Otherwise the scope will be set higher.
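The field layout above can be assembled with a few shifts, as in the minimal Python sketch below. The function name is hypothetical. The signature values are not visible in this part of the text; 0x401B (IPv4) and 0x601B (IPv6) are inferred from the example addresses in section 8 and should be treated as assumptions, as should the sample P_Key 0x8001.

```python
import ipaddress


def all_node_mgid(signature, p_key, scope=2):
    """Build the all-node multicast/broadcast MGID for an IPoIB link.

    Layout: 11111111 | flags 0001 | scop | 16-bit signature |
    16-bit P_Key | 80-bit group ID equal to 1.
    """
    for name, val, limit in (("signature", signature, 0xFFFF),
                             ("p_key", p_key, 0xFFFF),
                             ("scope", scope, 0xF)):
        if not 0 <= val <= limit:
            raise ValueError("%s out of range" % name)
    value = ((0xFF << 120)         # multicast prefix
             | (0x1 << 116)        # flags: T = 1 (transient)
             | (scope << 112)      # scop bits, 2 = link-local
             | (signature << 96)   # IPoIB signature
             | (p_key << 80)       # P_Key of the partition
             | 0x1)                # low 80 bits: 000...0001
    return ipaddress.IPv6Address(value)


# Assumed IPv4 signature and a hypothetical P_Key, link-local scope.
print(all_node_mgid(0x401B, 0x8001))
```

Passing a larger `scope` value gives the MGID a node would try next when the link spans more than one IB subnet.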
A MCGroupRecord will be created with all the IPoIB link attributes A MCGroupRecord will be created with all the IPoIB link attributes
described before, including the link MTU, Q_Key, TClass, FlowLabel, described before. When a node is attached to an IPoIB link identified
and HopLimit. When a node is attached to an IPoIB link identified by by a P_Key, it must look for a special, all-node multicast/broadcast
a P_Key, it must look for a special, all-node multicast/broadcast
group to join. This is done by constructing the MGID with the link group to join. This is done by constructing the MGID with the link
P_Key and the IPoIB signature. The node SHOULD always look for a MGID P_Key and the IPoIB signature. The node SHOULD always look for a MGID
of a link-local scope first before attempting one with a greater of a link-local scope first before attempting one with a greater
scope. scope.
Once the right MGID and MCGroupRecord are identified, the local node Once the right MGID and MCGroupRecord are identified, the local node
SHOULD use the link MTU recorded in the MCGroupRecord. It MUST accept SHOULD use the link MTU recorded in the MCGroupRecord. In case the
a smaller MTU if one is advertised through the link MTU option of a link MTU is greater than the maximum payload size that the local HCA
router advertisement [DISC]. can support, the node can not join the IPoIB link and operate as an
IP node. Otherwise the local node must join the special all-node
In case the link MTU is greater than the maximum payload size that multicast/broadcast group by calling the SA to create a
the local HCA can support, the node can not join the IPoIB link and MCMemberRecord corresponding to the MGID. The SA will return all the
operate as an IP node. link attributes for the local node to use. The node MUST use these
attributes in all future multicast operations to the local IPoIB
After the right MTU is determined, the local node must join the link. The broadcast group for IPv4 will serve to provide a broadcast
special all-node multicast/broadcast group by calling the SA to service for protocols like ARP to use.
create a MCMemberRecord corresponding to the MGID. The SA will return
all the link attributes for the local node to use. The node MUST use
these attributes in all future multicast operations to the local
IPoIB link. The broadcast group for IPv4 will serve to provide a
broadcast service for protocols like ARP to use.
In addition to the all-node multicast/broadcast group, an all-router In addition to the all-node multicast/broadcast group, an all-router
multicast group SHOULD be created at link configuration time if an IP multicast group SHOULD be created at link configuration time if an IP
router will be attached to the link. This is to facilitate IP router will be attached to the link. This is to facilitate IP
multicast operations described later. A MCGroupRecord for the all- multicast operations described later. A MCGroupRecord for the all-
router MGID must be created in every IB subnet that the IPoIB link router MGID must be created in every IB subnet that the IPoIB link
encompasses. The format of the all-router MGID will be covered in encompasses. The format of the all-router MGID will be covered in
next section. next section.
8. Mapping for other Multicast Groups 8.0 Mapping for other Multicast Groups
The support of general IP multicast [IPMULT] over IB is similar to The support of general IP multicast [IPMULT] over IB is similar to
the case of the special all-node multicast/broadcast group discussed the case of the special all-node multicast/broadcast group discussed
above. An algorithmic mapping is used so that given an IP multicast above. An algorithmic mapping is used so that given an IP multicast
address, an individual host can compute the corresponding IB multicast address, an individual host can compute the corresponding IB multicast
address (MGID) all by itself without having to consult an external address (MGID) all by itself without having to consult an external
entity. This also removes the need for an externally maintained IP to entity. This also removes the need for an externally maintained IP to
IB multicast mapping table. IB multicast mapping table.
The IPoIB multicast mapping is defined as follows. The same mapping The IPoIB multicast mapping is defined as follows. The same mapping
function is used for both IPv4 and IPv6 except the IPoIB signature function is used for both IPv4 and IPv6 except the IPoIB signature
skipping to change at page 9, line 50 skipping to change at page 9, line 4
FF12:401B:8::2 FF12:401B:8::2
for IPv4 in a compressed format, and for IPv4 in a compressed format, and
FF12:601B:8:0:0:0:0:2 FF12:601B:8:0:0:0:0:2
or or
FF12:601B:8::2 FF12:601B:8::2
for IPv6 in a compressed format. for IPv6 in a compressed format.
A special case exists for the IPv4 limited broadcast address A special case exists for the IPv4 limited broadcast address
"255.255.255.255" [HOSTS]. The address SHALL be mapped to the "255.255.255.255" [HOSTS]. The address SHALL be mapped to the
broadcast MGID for IPv4 networks as described in section 7 above. broadcast MGID for IPv4 networks as described in section 7 above.
Also the IPv6 all-node multicast address "FF0X::1" [AARCH] will be Also the IPv6 all-node multicast address "FF0X::1" [AARCH] will be
mapped to the special all-node MGID for IPv6 networks. mapped to the special all-node MGID for IPv6 networks.
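The two special cases above can be sketched as a small check that runs before the general mapping. This is only an illustration: the function name is_special_case is hypothetical, and the general mapping function is not shown because its definition falls in a part of the text that is skipped here. The "FF0X::1" test assumes X stands for any scope value with the flags nibble equal to zero.

```python
import ipaddress


def is_special_case(addr: str) -> bool:
    """Return True if addr must map to the broadcast/all-node MGID.

    Hypothetical helper: covers only the IPv4 limited broadcast address
    and the IPv6 all-node address FF0X::1 (any scope X, flags nibble 0).
    """
    ip = ipaddress.ip_address(addr)
    if ip.version == 4:
        return ip == ipaddress.IPv4Address("255.255.255.255")
    packed = ip.packed
    is_multicast = packed[0] == 0xFF
    flags_are_zero = (packed[1] >> 4) == 0
    # The 112 bits after the flags/scope byte hold the group ID.
    group_id = int.from_bytes(packed[2:], "big")
    return is_multicast and flags_are_zero and group_id == 1
```

Any address for which this check returns False would go through the general algorithmic mapping instead.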
When a node wishes to join an IP multicast group on a local link, it 9.0 Sending and Receiving IP Multicast Packets
first needs to construct the corresponding MGID, using the rule
described above. Note that it must remember the scope bits from the In order to send a packet destined for an IP multicast address, a
all-node MGID, and use the same scope in all later MGIDs it node must first check for the existence of MCGroupRecord
constructs. corresponding to the MGID of the outbound link. If one already
exists, the MLID allocated by the SM for the MCGroupRecord is used as
the DLID for the packet. Otherwise, it means no member exists on the
local link. The packet should be forwarded to the all-router
multicast group to ensure the correct delivery of the packet to
multicast listeners on remote networks. (See section 10 below.) If an
all-router multicast group doesn't already exist, it implies no
router presence on the local subnet. The packet can then be safely
dropped.
Note that the local node MUST be notified when an IB multicast group
corresponding to the MGID ever comes into existence later. This
signifies that an interested party just showed up on the local link
and therefore must be copied.
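The send-side decision described above can be sketched as follows. This is a minimal illustration, not an implementation of any real IB verbs or SA API: the group_mlids dictionary is a hypothetical local view of the MCGroupRecords known to the SA (MGID string to MLID), and the MLID values used with it are made up.

```python
from typing import Dict, Optional


def choose_dlid(
    mgid: str,
    all_router_mgid: str,
    group_mlids: Dict[str, int],
) -> Optional[int]:
    """Pick the DLID for an outbound IP multicast packet.

    Returns the MLID of the group if an MCGroupRecord exists for mgid,
    else the MLID of the all-router group, else None (drop the packet).
    """
    mlid = group_mlids.get(mgid)
    if mlid is not None:
        # A member exists on the local link: use the group's MLID.
        return mlid
    # No local member: hand the packet to the all-router group, if any.
    # A missing all-router group implies no router, so the result is None.
    return group_mlids.get(all_router_mgid)
```

A caller would treat a None result as "drop", and would re-run the lookup once it is notified that a group for the MGID has come into existence.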
For a node to join an IP multicast group to receive IP multicast
packets, it must first construct a MGID corresponding to the IP
multicast group, using the rule described above. Note that it must
remember the scope bits from the all-node MGID, and use the same
scope in all the MGIDs it constructs.
The local node then checks with SA to see if a MCGroupRecord The local node then checks with SA to see if a MCGroupRecord
corresponding to the MGID already exists. If not, one must be created corresponding to the MGID already exists. If not, one must be
first. The MCGroupRecord MUST be created with the IPoIB link MTU. For created. The MCGroupRecord MUST be created with the IPoIB link MTU.
the rest of the attributes, it is recommended that it uses the same For the rest of the attributes, it is recommended that it uses the
values from the all-node multicast/broadcast group corresponding to same values from the all-node multicast/broadcast group corresponding
the link. to the link.
Note that for an IPoIB link that spans more than one IB subnet Note that for an IPoIB link that spans more than one IB subnet
connected by IB routers, adequate multicast forwarding support at the connected by IB routers, adequate multicast forwarding support at the
IB level is required for multicast packets to be forwarded properly IB level is required for multicast packets to be forwarded properly
to members in remote IB subnets. The specific mechanism for this will to members in remote IB subnets. The specific mechanism for this will
be covered in [IBTA], and is out of scope of this document. be covered in [IBTA], and is out of scope of this document.
Once the IB multicast group is identified, the node must join the Once the IB multicast group is identified, the node must join the
group, unless it is a member already, by calling the SA to create a group, unless it is a member already, by calling the SA to create a
MCMemberRecord corresponding to the MGID. The join call enables SM MCMemberRecord corresponding to the MGID. The join call enables SM
to program local IB switches and routers with the new multicast to program local IB switches and routers with the new multicast
information. Specifically it causes an IB switch to add the LID of information. Specifically it causes an IB switch to add the LID of
the caller to its forwarding table entry corresponding to the MLID the caller to its forwarding table entry corresponding to the MLID
allocated for the group. It also causes an IB router to attach allocated for the group. It also causes an IB router to attach
itself to the IB multicast tree corresponding to the MGID. itself to the IB multicast tree corresponding to the MGID.
In case any of the above IB operations fails, a node MAY choose to
simply join the all-router multicast group. This will ensure it
receives a copy of every multicast packet on the local link. Nodes
doing so MUST filter out those multicast packets that are of no
interest to the local node.
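The receive-side steps above (create the group if needed, join it, and fall back to the all-router group on failure) can be sketched as follows. Everything here is hypothetical scaffolding for illustration: FakeSA stands in for the SA and is not a real interface, the names ensure_group and join are assumptions, and the sketch assumes the all-router group already exists.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set, Tuple


class SAError(Exception):
    """Raised by the stand-in SA when an operation fails."""


@dataclass
class GroupRecord:
    mtu: int
    members: Set[int] = field(default_factory=set)


class FakeSA:
    """In-memory stand-in for the SA; 'failing' MGIDs always error out."""

    def __init__(self, failing: Iterable[str] = ()) -> None:
        self.groups: Dict[str, GroupRecord] = {}
        self.failing = set(failing)

    def ensure_group(self, mgid: str, mtu: int) -> GroupRecord:
        if mgid in self.failing:
            raise SAError(mgid)
        # Create the record only if it does not already exist.
        return self.groups.setdefault(mgid, GroupRecord(mtu))

    def join(self, mgid: str, lid: int) -> None:
        if mgid in self.failing or mgid not in self.groups:
            raise SAError(mgid)
        self.groups[mgid].members.add(lid)


def join_group(
    sa: FakeSA, mgid: str, all_router_mgid: str, lid: int, link_mtu: int
) -> Tuple[str, bool]:
    """Join mgid; on failure join the all-router group instead.

    Returns (joined MGID, must_filter). must_filter is True when the
    fallback was used, so the caller has to discard unwanted packets.
    """
    try:
        sa.ensure_group(mgid, link_mtu)  # created with the IPoIB link MTU
        sa.join(mgid, lid)
        return mgid, False
    except SAError:
        sa.join(all_router_mgid, lid)
        return all_router_mgid, True
```

The boolean in the result reflects the filtering requirement: a node that fell back to the all-router group receives every multicast packet on the link and has to drop those it did not ask for.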
When a node leaves an IP multicast group, it SHOULD delete the When a node leaves an IP multicast group, it SHOULD delete the
MCMemberRecord from the SA. This allows the SA to free up related MCMemberRecord from the SA. This allows the SA to free up related
resources. SM should delete MCGroupRecords that are no longer in use, resources. SM should delete MCGroupRecords that are no longer in use,
and free up the MLIDs allocated for them. The specific algorithm is and free up the MLIDs allocated for them. The specific algorithm is
implementation-dependent, and therefore is out of scope of this implementation-dependent, and therefore is out of scope of this
document. document.
In order to send a packet destined for an IP multicast address, a 10.0 Support for IP Multicast Routing
node must first check if a MCGroupRecord for the corresponding MGID
of the outbound link exists or not. If one already exists, the MLID
allocated by the SM for the MCGroupRecord is used as the DLID for the
packet. Otherwise, it means no member exists on the local link. The
packet should be forwarded to the all-router multicast group
described before. If one doesn't already exist, it implies no router
presence on the local subnet. The packet can then be silently
dropped.
Note that the local node MUST be notified when an IB multicast group
corresponding to the MGID ever comes into existence later. This
signifies that an interested party just showed up on the local link
and therefore must be copied.
9. Support for IP Multicast Routing
IP multicast routing requires a router to receive a copy of every IP multicast routing requires a router to receive a copy of every
link multicast packet on a locally connected link [IPMULT, IP6MLD]. link multicast packet on a locally connected link [IPMULT, IP6MLD].
For Ethernet this is usually done by turning on promiscuous multicast For Ethernet this is usually done by turning on promiscuous multicast
mode on a locally connected Ethernet interface. mode on a locally connected Ethernet interface.
Unfortunately IBA does not support promiscuous multicast mode. Unfortunately IBA does not support promiscuous multicast mode.
Therefore the IPoIB driver should forward a copy of every outbound Therefore the IPoIB driver should forward a copy of every outbound
multicast datagram to the MGID corresponding to the all-router multicast datagram to the MGID corresponding to the all-router
multicast group. This is to ensure multicast packets are properly multicast group. This is to ensure multicast packets are properly
forwarded to remote IP networks. forwarded to remote IP networks, and applies to IP hosts as well as
multicast routers.
10. Security Considerations 11.0 Security Considerations
All the operations for creating and configuring an IPoIB link All the operations for creating and configuring an IPoIB link
described in this document, including assigning P_Keys to CAs, described in this document, including assigning P_Keys to CAs,
creating MCGroupRecords and MCMemberRecords in SA, creating and creating MCGroupRecords and MCMemberRecords in SA, creating and
attaching QPs to IB multicast groups, etc., are privileged attaching QPs to IB multicast groups, etc., are privileged
operations, and MUST be protected by the underlying operating system. operations, and MUST be protected by the underlying operating system.
This is to prevent malicious, non-privileged software from hijacking This is to prevent malicious, non-privileged software from hijacking
important resources and configurations. For example, a bogus all-node IPoIB important resources and configurations. For example, a bogus all-node IPoIB
multicast group may prevent a proper one from being created when the multicast group may prevent a proper one from being created when the
network administrator tries to set up a link. network administrator tries to set up a link.
Controlled Q_Keys SHOULD be used in IB multicast groups in order to Controlled Q_Keys SHOULD be used in IB multicast groups in order to
prevent non-privileged software from fabricating IP datagrams to prevent non-privileged software from fabricating IP datagrams to
send, as mentioned in section 6.2. send, as mentioned in section 6.2.
11. Acknowledgments 12.0 Acknowledgments
The authors would like to thank Bruce Beukema, David Brean, Dan The authors would like to thank Bruce Beukema, David Brean, Dan
Cassiday, Thomas Narten, Erik Nordmark, Greg Pfister, Renato Recio, Cassiday, Aditya Dube, Yaron Haviv, Michael Krause, Thomas Narten,
David L. Stevens, and Madhu Talluri for their suggestions and many Erik Nordmark, Greg Pfister, Renato Recio, Satya Sharma, David L.
Stevens, and Madhu Talluri for their suggestions and many
clarifications on the IBA specification. clarifications on the IBA specification.
12. References 13.0 References
[AARCH] Hinden, R. and S. Deering "IP Version 6 Addressing [AARCH] Hinden, R. and S. Deering "IP Version 6 Addressing
Architecture", RFC 2373, July 1998. Architecture", RFC 2373, July 1998.
[DHCP] R. Droms "Dynamic Host Configuration Protocol", RFC 2131,
March 1997.
[DISC] Narten, T., Nordmark, E. and W. Simpson, "Neighbor [DISC] Narten, T., Nordmark, E. and W. Simpson, "Neighbor
Discovery for IP Version 6 (IPv6)", RFC 2461, December Discovery for IP Version 6 (IPv6)", RFC 2461, December
1998. 1998.
[HOSTS] Braden R., "Requirements for Internet Hosts -- [HOSTS] Braden R., "Requirements for Internet Hosts --
Communication Layers", RFC 1122, October 1989 Communication Layers", RFC 1122, October 1989
[IBTA] InfiniBand Architecture Specification, Release 1.0.a by [IBTA] InfiniBand Architecture Specification, Release 1.0.a by
InfiniBand Trade Association at www.infinibandta.org InfiniBand Trade Association at www.infinibandta.org
[IGMP2] Fenner W., "Internet Group Management Protocol, Version 2", [IGMP2] Fenner W., "Internet Group Management Protocol, Version 2",
RFC 2236, November 1997. RFC 2236, November 1997.
[IPMULT] Deering S., "Host Extensions for IP Multicasting", RFC [IPMULT] Deering S., "Host Extensions for IP Multicasting", RFC
1112, August 1989. 1112, August 1989.
[IPoIB_ARCH] draft-ietf-ipoib-architecture-01.txt
[IPoIB_ENCAP] draft-ietf-ipoib-ip-over-infiniband-00.txt
[IPV6] Deering, S. and R. Hinden, "Internet Protocol, Version 6 [IPV6] Deering, S. and R. Hinden, "Internet Protocol, Version 6
(IPv6) Specification", RFC 2460, December 1998. (IPv6) Specification", RFC 2460, December 1998.
[IP6MLD] Deering S., Fenner W., Haberman B., "Multicast Listener [IP6MLD] Deering S., Fenner W., Haberman B., "Multicast Listener
Discovery (MLD) for IPv6", RFC 2710, October 1999. Discovery (MLD) for IPv6", RFC 2710, October 1999.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997. Requirement Levels", BCP 14, RFC 2119, March 1997.
13. Author's Address 14.0 Author's Address
H.K. Jerry Chu H.K. Jerry Chu
901 San Antonio Road, UMPK17-201 901 San Antonio Road, UMPK17-201
Palo Alto, CA 94303-4900 Palo Alto, CA 94303-4900
USA USA
Phone: +1 650 786-5146 Phone: +1 650 786-5146
EMail: jerry.chu@sun.com EMail: jerry.chu@sun.com
Vivek Kashyap Vivek Kashyap
IBM IBM
15450, SW Koll Parkway 15450, SW Koll Parkway
Beaverton, OR 97006 Beaverton, OR 97006
Phone: 503 578 3422 Phone: 503 578 3422
EMail: vivk@us.ibm.com EMail: vivk@us.ibm.com
14. Full Copyright Statement 15.0 Full Copyright Statement
Copyright (C) The Internet Society (2001). All Rights Reserved. Copyright (C) The Internet Society (2001). All Rights Reserved.
This document and translations of it may be copied and furnished to This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published or assist in its implementation may be prepared, copied, published
and distributed, in whole or in part, without restriction of any and distributed, in whole or in part, without restriction of any
kind, provided that the above copyright notice and this paragraph are kind, provided that the above copyright notice and this paragraph are
included on all such copies and derivative works. However, this included on all such copies and derivative works. However, this
document itself may not be modified in any way, such as by removing document itself may not be modified in any way, such as by removing
the copyright notice or references to the Internet Society or other the copyright notice or references to the Internet Society or other
 End of changes. 
