--- 1/draft-ietf-intarea-frag-fragile-05.txt 2019-01-29 11:13:09.442352760 -0800 +++ 2/draft-ietf-intarea-frag-fragile-06.txt 2019-01-29 11:13:09.490353940 -0800 @@ -1,27 +1,27 @@ Internet Area WG R. Bonica Internet-Draft Juniper Networks Intended status: Best Current Practice F. Baker -Expires: July 15, 2019 Unaffiliated +Expires: August 2, 2019 Unaffiliated G. Huston APNIC R. Hinden Check Point Software O. Troan Cisco F. Gont SI6 Networks - January 11, 2019 + January 29, 2019 IP Fragmentation Considered Fragile - draft-ietf-intarea-frag-fragile-05 + draft-ietf-intarea-frag-fragile-06 Abstract This document describes IP fragmentation and explains how it reduces the reliability of Internet communication. This document also proposes alternatives to IP fragmentation and provides recommendations for developers and network operators. Status of This Memo @@ -32,21 +32,21 @@ Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet- Drafts is at https://datatracker.ietf.org/drafts/current/. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." - This Internet-Draft will expire on July 15, 2019. + This Internet-Draft will expire on August 2, 2019. Copyright Notice Copyright (c) 2019 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents @@ -60,29 +60,30 @@ 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 2. IP Fragmentation . . . . . . . . . . . . . . . . . . . . . . 
3 2.1. Links, Paths, MTU and PMTU . . . . . . . . . . . . . . . 3 2.2. Fragmentation Procedures . . . . . . . . . . . . . . . . 5 2.3. Upper-Layer Reliance on IP Fragmentation . . . . . . . . 6 3. Requirements Language . . . . . . . . . . . . . . . . . . . . 7 4. Reduced Reliability . . . . . . . . . . . . . . . . . . . . . 7 4.1. Policy-Based Routing . . . . . . . . . . . . . . . . . . 7 4.2. Network Address Translation (NAT) . . . . . . . . . . . . 8 - 4.3. Stateless Firewalls . . . . . . . . . . . . . . . . . . . 8 + 4.3. Stateless Firewalls . . . . . . . . . . . . . . . . . . . 9 4.4. Stateless Load Balancers . . . . . . . . . . . . . . . . 9 - 4.5. IPv4 Reassembly Errors at High Data Rates . . . . . . . . 10 - 4.6. Security Vulnerabilities . . . . . . . . . . . . . . . . 10 - 4.7. Blackholing Due to ICMP Loss . . . . . . . . . . . . . . 11 - 4.7.1. Transient Loss . . . . . . . . . . . . . . . . . . . 12 - 4.7.2. Incorrect Implementation of Security Policy . . . . . 12 - 4.7.3. Persistent Loss Caused By Anycast . . . . . . . . . . 13 - 4.8. Blackholing Due To Filtering . . . . . . . . . . . . . . 13 + 4.5. Equal Cost Multipath (ECMP) . . . . . . . . . . . . . . . 10 + 4.6. IPv4 Reassembly Errors at High Data Rates . . . . . . . . 10 + 4.7. Security Vulnerabilities . . . . . . . . . . . . . . . . 10 + 4.8. PMTU Blackholing Due to ICMP Loss . . . . . . . . . . . . 11 + 4.8.1. Transient Loss . . . . . . . . . . . . . . . . . . . 12 + 4.8.2. Incorrect Implementation of Security Policy . . . . . 12 + 4.8.3. Persistent Loss Caused By Anycast . . . . . . . . . . 13 + 4.9. Blackholing Due To Filtering or Loss . . . . . . . . . . 13 5. Alternatives to IP Fragmentation . . . . . . . . . . . . . . 14 5.1. Transport Layer Solutions . . . . . . . . . . . . . . . . 14 5.2. Application Layer Solutions . . . . . . . . . . . . . . . 15 6. Applications That Rely on IPv6 Fragmentation . . . . . . . . 16 6.1. DNS . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 6.2. OSPF . 
. . . . . . . . . . . . . . . . . . . . . . . . . 17 6.3. Packet-in-Packet Encapsulations . . . . . . . . . . . . . 17 6.4. Licklider Transmission Protocol (LTP) . . . . . . . . . . 17 7. Recommendations . . . . . . . . . . . . . . . . . . . . . . . 18 7.1. For Application and Protocol Developers . . . . . . . . . 18 @@ -128,23 +129,23 @@ link to the next. Internet paths are dynamic. Assume that the path from one node to another contains a set of links and routers. If the network topology changes, that path can also change so that it includes a different set of links and routers. Each link is constrained by the number of bytes that it can convey in a single IP packet. This constraint is called the link Maximum Transmission Unit (MTU). IPv4 [RFC0791] requires every link to - support a specified MTU (see footnote). IPv6 [RFC8200] requires - every link to support an MTU of 1280 bytes or greater. These are - called the IPv4 and IPv6 minimum link MTU's. + support a specified MTU (see NOTE 1). IPv6 [RFC8200] requires every + link to support an MTU of 1280 bytes or greater. These are called + the IPv4 and IPv6 minimum link MTU's. Likewise, each Internet path is constrained by the number of bytes that it can convey in a single IP packet. This constraint is called the Path MTU (PMTU). For any given path, the PMTU is equal to the smallest of its link MTU's. Because Internet paths are dynamic, PMTU is also dynamic. For reasons described below, source nodes estimate the PMTU between themselves and destination nodes. A source node can produce extremely conservative PMTU estimates in which: @@ -161,84 +162,86 @@ performance. By executing Path MTU Discovery (PMTUD) [RFC1191] [RFC8201] procedures, a source node can maintain a less conservative estimate of the PMTU between itself and a destination node. In PMTUD, the source node produces an initial PMTU estimate. This initial estimate is equal to the MTU of the first link along the path to the destination node.
It can be greater than the actual PMTU. Having produced an initial PMTU estimate, the source node sends non- - fragmentable IP packets to the destination node. If one of these - packets is larger than the actual PMTU, a downstream router will not - be able to forward the packet through the next link along the path. - Therefore, the downstream router drops the packet and sends an - Internet Control Message Protocol (ICMP) [RFC0792] [RFC4443] Packet - Too Big (PTB) message to the source node. The ICMP PTB message - indicates the MTU of the link through which the packet could not be - forwarded. The source node uses this information to refine its PMTU - estimate. + fragmentable IP packets to the destination node (see NOTE 2). If one + of these packets is larger than the actual PMTU, a downstream router + will not be able to forward the packet through the next link along + the path. Therefore, the downstream router drops the packet and + sends an Internet Control Message Protocol (ICMP) [RFC0792] [RFC4443] + Packet Too Big (PTB) message to the source node (see NOTE 3). The + ICMP PTB message indicates the MTU of the link through which the + packet could not be forwarded. The source node uses this information + to refine its PMTU estimate. PMTUD produces a running estimate of the PMTU between a source node and a destination node. Because PMTU is dynamic, at any given time, the PMTU estimate can differ from the actual PMTU. In order to detect PMTU increases, PMTUD occasionally resets the PMTU estimate to its initial value and repeats the procedure described above. - PMTUD has the following characteristics: + Ideally, PMTUD operates as described above. However, in some + scenarios, PMTUD fails. For example: - o It relies on the network's ability to deliver ICMP PTB messages to - the source node. + o PMTUD relies on the network's ability to deliver ICMP PTB messages + to the source node. If the network cannot deliver ICMP PTB + messages to the source node, PMTUD fails. 
- o It is susceptible to attack because ICMP messages are easily - forged [RFC5927]. + o PMTUD is susceptible to attack because ICMP messages are easily + forged [RFC5927]. Such attacks can cause PMTUD to produce + unnecessarily conservative PMTU estimates. - FOOTNOTE: In IPv4, every host must be capable of receiving a packet + NOTE 1: In IPv4, every host must be capable of receiving a packet whose length is equal to 576 bytes. However, the IPv4 minimum link MTU is not 576. Section 3.2 of RFC 791 explicitly states that the IPv4 minimum link MTU is 68 bytes. But for practical purposes, many network operators consider the IPv4 minimum link MTU to be 576 bytes. So, for the purposes of this document, we assume that the IPv4 minimum link MTU is 576 bytes. - FOOTNOTE: In the paragraphs above, the term "non-fragmentable packet" - is introduced. A non-fragmentable packet can be fragmented at its - source. However, it cannot be fragmented by a downstream node. An - IPv4 packet whose DF-bit is set to zero is fragmentable. An IPv4 - packet whose DF-bit is set to one is non-fragmentable. All IPv6 - packets are also non-fragmentable. + NOTE 2: A non-fragmentable packet can be fragmented at its source. + However, it cannot be fragmented by a downstream node. An IPv4 + packet whose DF-bit is set to zero is fragmentable. An IPv4 packet + whose DF-bit is set to one is non-fragmentable. All IPv6 packets are + also non-fragmentable. - FOOTNOTE: In the paragraphs above, the term "ICMP PTB message" is - introduced. The ICMP PTB message has two instantiations. In ICMPv4 + NOTE 3: The ICMP PTB message has two instantiations. In ICMPv4 [RFC0792], the ICMP PTB message is a Destination Unreachable message with Code equal to (4) fragmentation needed and DF set. This message - was augmented by [RFC1191] to indicates the MTU of the link through + was augmented by [RFC1191] to indicate the MTU of the link through which the packet could not be forwarded.
In ICMPv6 [RFC4443], the ICMP PTB message is a Packet Too Big Message with Code equal to (0). This message also indicates the MTU of the link through which the packet could not be forwarded. 2.2. Fragmentation Procedures When an upper-layer protocol submits data to the underlying IP module, and the resulting IP packet's length is greater than the - PMTU, the packet can be divided into fragments. Each fragment - includes an IP header and a portion of the original packet. + PMTU, the packet is divided into fragments. Each fragment includes + an IP header and a portion of the original packet. [RFC0791] describes IPv4 fragmentation procedures. An IPv4 packet - whose DF-bit is set to one cannot be fragmented. An IPv4 packet - whose DF-bit is set to zero can be fragmented by the source node or - by any downstream router. When an IPv4 packet is fragmented, all IP - options appear in the first fragment, but only options whose "copy" - bit is set to one appear in subsequent fragments. + whose DF-bit is set to one can be fragmented by the source node, but + cannot be fragmented by a downstream router. An IPv4 packet whose + DF-bit is set to zero can be fragmented by the source node or by a + downstream router. When an IPv4 packet is fragmented, all IP options + appear in the first fragment, but only options whose "copy" bit is + set to one appear in subsequent fragments. - [RFC8200] describes IPv6 fragmentation procedures. An IPv6 packets + [RFC8200] describes IPv6 fragmentation procedures. An IPv6 packet can be fragmented at the source node only. When an IPv6 packet is fragmented, all extension headers appear in the first fragment, but only per-fragment headers appear in subsequent fragments. Per- fragment headers include the following: o The IPv6 header. 
o The Hop-by-hop Options header (if present) o The Destination Options header (if present and if it precedes a @@ -326,21 +329,21 @@ | | | | | | | 2 | Policy- | 2001:db8::1/128 | TCP / 80 | 2001:db8::3 | | | based | | | | +-------+--------------+-----------------+------------+-------------+ Table 1: Policy-Based Routing FIB Assume that a router maintains the FIB in Table 1. The first FIB entry is destination-based. It maps a destination prefix (2001:db8::1/128) to a next-hop (2001:db8::2). The second FIB entry - is a policy-based. It maps the same destination prefix + is policy-based. It maps the same destination prefix (2001:db8::1/128) and a destination port (TCP / 80) to a different next-hop (2001:db8::3). The second entry is more specific than the first. When the router receives the first fragment of a packet that is destined for TCP port 80 on 2001:db8::1, it interrogates the FIB. Both FIB entries satisfy the query. The router selects the second FIB entry because it is more specific and forwards the packet to 2001:db8::3. @@ -366,46 +369,51 @@ o The Destination IP Address and Destination Port on each inbound packet. A+P [RFC6346] and Carrier Grade NAT (CGN) [RFC6888] are two common NAT strategies. In both approaches, the NAT device must virtually reassemble fragmented packets in order to translate and forward each fragment. Virtual reassembly in the network is problematic, because it is computationally expensive and because it is prone to attacks - (Section 4.6). + (Section 4.7). 4.3. Stateless Firewalls IP fragmentation causes problems for stateless firewalls whose rules include TCP and UDP ports. Because port information is not available in the trailing fragments, the firewall is limited to the following options: o Accept all trailing fragments, possibly admitting certain classes of attack. o Block all trailing fragments, possibly blocking legitimate traffic. Neither option is attractive. - This problem does not occur in stateful firewalls.
+ This problem does not occur in stateful firewalls or Network Address + Translation (NAT) devices. Such devices maintain state so that they + can afford identical treatment to each fragment that belongs to a + packet. 4.4. Stateless Load Balancers IP fragmentation causes problems for stateless load balancers. In order to assign a packet or packet fragment to a link, the load- - balancer executes an algorithm. If the packet or packet fragment - contains a transport-layer header, the load balancing algorithm - accepts the following 5-tuple as input: + balancer executes an algorithm. The following paragraphs describe a + commonly deployed load-balancing algorithm. + + If the packet or packet fragment contains a transport-layer header, + the load balancing algorithm accepts the following 5-tuple as input: o IP Source Address. o IP Destination Address. o IPv4 Protocol or IPv6 Next Header. o transport-layer source port. o transport-layer destination port. @@ -418,35 +426,42 @@ o IP Destination Address. o IPv4 Protocol or IPv6 Next Header. Therefore, non-fragmented packets belonging to a flow can be assigned to one link while fragmented packets belonging to the same flow can be divided between that link and another. This can cause suboptimal load balancing. -4.5. IPv4 Reassembly Errors at High Data Rates +4.5. Equal Cost Multipath (ECMP) + + IP fragmentation causes problems for routers that support Equal Cost + Multipath (ECMP). Many routers that support ECMP execute the + algorithm described in Section 4.4. Therefore, they exhibit the same + problematic behaviors described in Section 4.4. + +4.6. IPv4 Reassembly Errors at High Data Rates IPv4 fragmentation is not sufficiently robust for use under some conditions in today's Internet.
At high data rates, the 16-bit IP identification field is not large enough to prevent frequent incorrectly assembled IP fragments, and the TCP and UDP checksums are insufficient to prevent the resulting corrupted datagrams from being delivered to higher protocol layers. [RFC4963] describes some easily reproduced experiments demonstrating the problem, and discusses some of the operational implications of these observations. These reassembly issues are not easily reproducible in IPv6 because the IPv6 identification field is 32 bits long. -4.6. Security Vulnerabilities +4.7. Security Vulnerabilities Security researchers have documented several attacks that exploit IP fragmentation. The following are examples: o Overlapping fragment attacks [RFC1858][RFC3128][RFC5722] o Resource exhaustion attacks (such as the Rose Attack) o Attacks based on predictable fragment identification values [RFC7739] @@ -487,63 +502,63 @@ for an attacker to forge malicious IP fragments that would cause the reassembly procedure for legitimate packets to fail. NIDS aims at identifying malicious activity by analyzing network traffic. Ambiguity in the possible result of the fragment reassembly process may allow an attacker to evade these systems. Many of these systems try to mitigate some of these evasion techniques (e.g., by computing all possible outcomes of the fragment reassembly process, at the expense of increased processing requirements). -4.7. Blackholing Due to ICMP Loss +4.8. PMTU Blackholing Due to ICMP Loss As mentioned in Section 2.3, upper-layer protocols can be configured to rely on PMTUD. Because PMTUD relies upon the network to deliver ICMP PTB messages, those protocols also rely on the network to deliver ICMP PTB messages. According to [RFC4890], ICMP PTB messages must not be filtered. However, ICMP PTB delivery is not reliable. It is subject to both transient and persistent loss. - Transient loss of ICMP PTB messages can cause transient black holes.
- When the conditions contributing to transient loss abate, the network - regains its ability to deliver ICMP PTB messages and connectivity - between the source and destination nodes is restored. Section 4.7.1 - of this document describes conditions that lead to transient loss of - ICMP PTB messages. + Transient loss of ICMP PTB messages can cause transient PMTU black + holes. When the conditions contributing to transient loss abate, the + network regains its ability to deliver ICMP PTB messages and + connectivity between the source and destination nodes is restored. + Section 4.8.1 of this document describes conditions that lead to + transient loss of ICMP PTB messages. Persistent loss of ICMP PTB messages can cause persistent black - holes. Section 4.7.2 and Section 4.7.3 of this document describe + holes. Section 4.8.2 and Section 4.8.3 of this document describe conditions that lead to persistent loss of ICMP PTB messages. The problem described in this section is specific to PMTUD. It does not occur when the upper-layer protocol obtains its PMTU estimate from PLPMTUD or from any other source. -4.7.1. Transient Loss +4.8.1. Transient Loss The following factors can contribute to transient loss of ICMP PTB messages: o Network congestion. o Packet corruption. o Transient routing loops. o ICMP rate limiting. The effect of rate limiting may be severe, as RFC 4443 recommends strict rate limiting of IPv6 traffic. -4.7.2. Incorrect Implementation of Security Policy +4.8.2. Incorrect Implementation of Security Policy Incorrect implementation of security policy can cause persistent loss of ICMP PTB messages. Assume that a Customer Premise Equipment (CPE) router implements the following zone-based security policy: o Allow any traffic to flow from the inside zone to the outside zone. @@ -559,42 +574,42 @@ allows the ICMP PTB to flow from the outside zone to the inside zone. If not, the implementation discards the ICMP PTB message. 
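The difference between a correct and an incorrect implementation of this policy can be sketched in a few lines of Python. This is a minimal illustration, not a description of any particular CPE product. It assumes that the correct implementation matches on the original packet quoted in the ICMP PTB payload, while the incorrect one matches only on the PTB message's own source address. The names Flow, IcmpPtb, correct_policy_allows and incorrect_policy_allows, and all addresses, are hypothetical.

```python
from dataclasses import dataclass
from typing import NamedTuple, Set


class Flow(NamedTuple):
    """5-tuple of a flow that an inside host opened toward the outside."""
    src: str
    dst: str
    proto: str
    sport: int
    dport: int


@dataclass(frozen=True)
class IcmpPtb:
    """Hypothetical, simplified view of an inbound ICMP PTB message."""
    outer_src: str   # router that dropped the packet and generated the PTB
    outer_dst: str   # inside host that sent the original packet
    quoted: Flow     # 5-tuple of the dropped packet, quoted in the PTB payload
    mtu: int         # MTU reported by the router


def correct_policy_allows(ptb: IcmpPtb, flows: Set[Flow]) -> bool:
    # Assumed correct behavior: match on the packet quoted inside the PTB,
    # not on the PTB's own header.
    return ptb.quoted in flows


def incorrect_policy_allows(ptb: IcmpPtb, flows: Set[Flow]) -> bool:
    # Incorrect behavior: match on the PTB's outer source address only.
    # An intermediate router is rarely the far end of an existing flow.
    return any(ptb.outer_src == flow.dst for flow in flows)


# One existing inside-to-outside flow, and a PTB sent by a transit router.
flows = {Flow("2001:db8:a::10", "2001:db8:b::20", "tcp", 50000, 443)}
ptb = IcmpPtb(outer_src="2001:db8:ffff::1", outer_dst="2001:db8:a::10",
              quoted=next(iter(flows)), mtu=1400)
```

With these inputs, the first check admits the PTB and the second discards it, which is the persistent loss described in this section.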
When an incorrect implementation of the above-mentioned security policy receives an ICMP PTB message, it discards the packet because its source address is not associated with an existing flow. The security policy described above is implemented incorrectly on many consumer CPE routers. -4.7.3. Persistent Loss Caused By Anycast +4.8.3. Persistent Loss Caused By Anycast Anycast can cause persistent loss of ICMP PTB messages. Consider the example below: A DNS client sends a request to an anycast address. The network routes that DNS request to the nearest instance of that anycast address (i.e., a DNS server). The DNS server generates a response and sends it back to the DNS client. While the response does not exceed the DNS server's PMTU estimate, it does exceed the actual PMTU. A downstream router drops the packet and sends an ICMP PTB message to the packet's source (i.e., the anycast address). The network routes the ICMP PTB message to the anycast instance closest to the downstream router. That anycast instance may not be the DNS server that originated the DNS response. It may be another DNS server with the same anycast address. The DNS server that originated the - response may never receive the ICMP PTB message and may never updates - it PMTU estimate. + response may never receive the ICMP PTB message and may never update + its PMTU estimate. -4.8. Blackholing Due To Filtering +4.9. Blackholing Due To Filtering or Loss In RFC 7872, researchers sampled Internet paths to determine whether they would convey packets that contain IPv6 extension headers. Sampled paths terminated at popular Internet sites (e.g., popular web, mail and DNS servers). The study revealed that at least 28% of the sampled paths did not convey packets containing the IPv6 Fragment extension header. In most cases, fragments were dropped in the destination autonomous system.
In other cases, the fragments were dropped in transit @@ -608,21 +623,21 @@ Possible causes follow: o Hardware inability to process fragmented packets. o Failure to change vendor defaults. o Unintentional misconfiguration. o Intentional configuration (e.g., network operators consciously choose to drop IPv6 fragments in order to address the issues - raised in Section 4.1 through Section 4.7, above.) + raised in Section 4.1 through Section 4.8, above.) 5. Alternatives to IP Fragmentation 5.1. Transport Layer Solutions The Transmission Control Protocol (TCP) [RFC0793] can be operated in a mode that does not require IP fragmentation. Applications submit a stream of data to TCP. TCP divides that stream of data into segments, with no segment exceeding the TCP Maximum @@ -829,40 +844,41 @@ (e.g., PLPMTUD). 7.2. For System Developers Software libraries SHOULD include provision for PLPMTUD for each supported transport protocol. 7.3. For Middle Box Developers Middle boxes SHOULD process IP fragments in a manner that is - compliant with RFC 791 and RFC 8200. In many cases, middle boxes + consistent with [RFC0791] and [RFC8200]. In many cases, middle boxes must maintain state in order to achieve this goal. Price and performance considerations frequently motivate network operators to deploy stateless middle boxes. These stateless middle boxes may perform sub-optimally, process IP fragments in a manner that is not compliant with RFC 791 or RFC 8200, or even discard IP fragments completely. Such behaviors are NOT RECOMMENDED. If a middle box implements non-standard behavior with respect to IP fragmentation, then that behavior MUST be clearly documented. 7.4. For Network Operators + Operators MUST ensure proper PMTUD operation in their network, + including making sure the network generates PTB packets when dropping + packets too large compared to outgoing interface MTU.
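As an illustration of the forwarding behavior that this requirement refers to, the following Python sketch models a router's decision for a single packet and a single outgoing link. It is a simplified model under stated assumptions, not an implementation of any router: the names Packet, Ptb and forward are hypothetical, and rate limiting of ICMP messages is ignored.

```python
from typing import NamedTuple, Optional, Tuple


class Packet(NamedTuple):
    length: int          # total IP packet length in bytes
    fragmentable: bool   # True only for IPv4 with the DF-bit set to zero


class Ptb(NamedTuple):
    mtu: int             # MTU of the link the packet could not traverse


def forward(packet: Packet, out_mtu: int) -> Tuple[str, Optional[Ptb]]:
    """Return the router's action and the PTB it sends to the source, if any."""
    if packet.length <= out_mtu:
        return ("forward", None)
    if packet.fragmentable:
        # A downstream router may fragment only IPv4 packets with DF = 0.
        return ("fragment", None)
    # Non-fragmentable and too large: drop it and report the link MTU so
    # that the source node can refine its PMTU estimate.
    return ("drop", Ptb(mtu=out_mtu))
```

A router that takes the "drop" branch without emitting the PTB leaves the source node with a stale PMTU estimate, which is the failure that this recommendation is meant to prevent.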
+ As per RFC 4890, network operators MUST NOT filter ICMPv6 PTB messages unless they are known to be forged or otherwise - illegitimate. As stated in Section 4.7, filtering ICMPv6 PTB packets - causes PMTUD to fail. Operators MUST ensure proper PMTUD operation - in their network, including making sure the network generates PTB - packets when dropping packets too large compared to outgoing - interface MTU. Many upper-layer protocols rely on PMTUD. + illegitimate. As stated in Section 4.8, filtering ICMPv6 PTB packets + causes PMTUD to fail. Many upper-layer protocols rely on PMTUD. As per RFC 8200, network operators MUST NOT deploy IPv6 links whose MTU is less than 1280 bytes. Network operators SHOULD NOT filter IP fragments if they originated at a domain name server or are destined for a domain name server. 8. IANA Considerations This document makes no request of IANA.