CURRENT_MEETING_REPORT_

Reported by Jeffrey Mogul/DEC

AGENDA

(a) Report on current draft (McCloghrie/Fox/Mogul)
(b) Review other alternatives
(c) Review goals and assumptions
(d) Obtain consensus on approach
(e) Focus on details
(f) What next?

MINUTES

This was the second meeting of the MTU Discovery Working Group. We started with a quick presentation by Keith McCloghrie of the draft that he and Rich Fox wrote based on the apparent consensus of the December meeting. Some attendees had not read the draft, and we tried to ensure that everyone understood the basic outline.

[Summary: senders occasionally attach an IP PMTU-Query Option to their datagrams. Routers update the PMTU value in the option; the last-hop router returns the PMTU to the sender using the ICMP Path-MTU message. If the destination host detects a change in the MTU (when a fragment is received), it sends an ICMP Unexpected Fragment Report message.]

We also reviewed the "Steve Deering" proposal from last year, as there was a realization that it might not be dead after all. Among other things, we now know that there are not 1 but 4 spare bits in the IP header (there are 3 unused in the TOS field), and that the powers that be might therefore be likely to let us use one.

[Summary of Deering proposal: senders often send datagrams with the "RF" (Report Fragmentation) bit set in the IP header. A host receiving fragment-0 of a datagram with RF set sends an ICMP Fragmentation Occurred message.]

We then started a fairly unstructured discussion comparing the costs and benefits of the two approaches.

1. Lifetime of protocol: on the one hand, in principle MTU discovery should be obviated by the coming revolution in routing protocols. Within "a few" years, the routing protocols will provide path-MTU information, so MTU discovery will be unnecessary. Of course, we all know about things that are supposed to happen "real soon now"; we particularly all know about relatively new things that "everyone" implements.
Still, while avoiding the trap of assuming that the world will be perfect in just a couple of years, it may not be worth trying to solve the problem of MTU discovery for all time, since it may not be useful for that long.

2. Rapidity of deployment: Clearly, MTU discovery of any form only works for a sender if some subset of the other nodes (routers and/or destinations) support it. Query-based schemes depend upon support from a large fraction of the routers; RF-style schemes only help if a large fraction of the end-hosts support them. There was some debate about which population is more likely to upgrade soon (routers or end-hosts). No consensus was reached.

3. Connection lifetimes: Van's data suggest that most non-local TCP connections are short (ca. 4 datagrams). This makes some sense (mostly SMTP), although this is only one sample point, and we agreed that more data would be useful.

Van argued that this works against a query-based scheme, since by the time one has useful information, there's not much left to do with it. His argument in favor of the RF scheme was that the right way to use it is to assume that you can send large datagrams (sized by your first-hop MTU, or perhaps some estimate of the NSFNET PMTU, ca. 1500), and let the destination tell you if you are screwing up.

In general, we realize that fragmentation is not inherently evil. Although it might create some extra overhead for the routers, what we really have to avoid is the "deterministic fragment loss" problem, which causes connections to stall. Thus (I hope I am correctly paraphrasing Van's argument), MTU discovery is only worth doing for connections that last a while, either because they are carrying lots of data, or because they are stalled due to fragment loss. Query-based schemes waste router resources because processing IP options is expensive, and the payoff is unlikely.
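
As a rough illustration of the sender behavior Van described (start large, shrink only when the destination complains), here is a minimal sketch. It is not taken from any draft. The class name, the method name, and the assumption that the ICMP Fragmentation Occurred message reports a usable size (for example, the length of fragment-0) are all hypothetical; the minutes do not say what the message carries.

```python
class PathMtuEstimate:
    """Hypothetical sender-side state for one destination under an RF-style scheme."""

    def __init__(self, first_hop_mtu: int) -> None:
        # Start optimistic: assume the whole path carries first-hop-sized datagrams.
        self.mtu = first_hop_mtu

    def on_fragmentation_report(self, reported_size: int) -> None:
        # ASSUMPTION: the ICMP report carries a size the sender can adopt.
        # Only ever lower the estimate; larger or nonsensical values are ignored.
        if 0 < reported_size < self.mtu:
            self.mtu = reported_size


est = PathMtuEstimate(first_hop_mtu=1500)
est.on_fragmentation_report(1006)   # destination saw a 1006-byte fragment-0
est.on_fragmentation_report(1500)   # a larger value does not raise the estimate
print(est.mtu)                      # prints 1006
```

Nothing is sent or scheduled by the sender beyond setting the bit, which is the property that made this approach attractive in the discussion.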

It was argued that, since the senders cache the MTU values learned by either scheme in the per-host routing entries, querying would not have to be done on every connection to be useful. Again, Van drew on his traffic studies to suggest that (even over a 12-hour period) there was generally little correlation between connections; that is, just because one pair of hosts makes a connection does not mean that they will do so again any time soon. Some of us did not believe that this is necessarily true (for example, how much traffic comes from mail-hub machines like DECWRL and UUNET?). Again, we agreed that it would be nice to have more traffic data available.

4. Complexity: Now that the draft specification for the query-based scheme is done, we realized that it is a lot more complex than we thought. One problem is the number of tunable parameters. Since the RF scheme doesn't require the receiver to maintain any state about the sender [actually, this is not quite true, as noted later], doesn't require the sender to schedule when to send the option, doesn't cause the receiver to send notifications when intentional fragmentation occurs [NFS would probably not set RF], and requires no support at all from the routers, it appears to be simpler [but keep reading].

After this discussion, it was pretty clear that the consensus had shifted to trying to use the RF scheme. We made the assumption that we could get a header bit (Van argued that although the RF scheme could be done using an option, the cost/benefit analysis might be against it). The next step was to explore how well that would really work.
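
To make the simplicity argument concrete, the receiver side of the RF scheme, as summarized earlier, might look roughly like the sketch below. It is only an illustration: the `Fragment` fields, the function name, and the choice to report the length of fragment-0 are assumptions, not anything the group specified. The function keeps no per-sender state, which matches the simplest form of the scheme (any rate limiting of the ICMP would add state).

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Fragment:
    """Hypothetical view of the IP header fields the receiver needs."""
    src: str              # source address of the datagram
    rf: bool              # the proposed Report Fragmentation bit
    offset: int           # fragment offset, in bytes
    more_fragments: bool  # the existing More Fragments flag
    total_length: int     # length of this fragment, in bytes


def fragmentation_report(frag: Fragment) -> Optional[Tuple[str, int]]:
    """Return (destination, size) for an ICMP Fragmentation Occurred message,
    or None if no report is called for."""
    is_fragment_zero = frag.offset == 0 and frag.more_fragments
    if frag.rf and is_fragment_zero:
        # ASSUMPTION: report the size of fragment-0 back to the sender.
        return (frag.src, frag.total_length)
    return None


print(fragmentation_report(Fragment("10.0.0.1", True, 0, True, 1006)))
```

A datagram sent without RF (as NFS probably would) or a later fragment of the same datagram produces no report.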

One problem that came up right away is that James VanBokkelen believes that there are many PC-based systems that

(1) do not reassemble fragments
(2) do advertise MSS values of 1500 to non-local peers

Currently, these hosts function because the 576-if-nonlocal rule observed by most non-PC hosts means that, given today's Internet, even when they advertise an MSS of 1500 to a non-local host, the host at the other end will not send datagrams big enough to be fragmented. [I suppose it is unlikely for two PCs to talk to each other over long distances.]

However, if we use the simplest RF scheme, these hosts are going to get fragmented datagrams. Since we assume that any host which implements MTU discovery is also in conformance with the other rules (specifically, fragment reassembly), we therefore know that such sub-standard PCs won't send the ICMP Fragmentation Occurred message, and these connections would stall.

The obvious fix is to not invoke MTU discovery (i.e., not send segments > 576 bytes) unless you are sure that the other end supports it. This means that you have to have seen a datagram with RF set coming back to you from the destination before you can send large datagrams. More subtly, since we don't want to mislead these stupid PCs (which apparently don't follow the 576-byte rule in either direction), you cannot even send an MSS > 576 to a non-local peer until you have seen an RF bit from it.

Thus, since the TCP MSS option can only be sent on the SYN datagram, a host initiating a TCP connection may not be able to use MTU discovery (and large segments) unless it has talked with the other end recently. (The second host is in a better position; since it sees the RF bit before it has to send its own MSS option, it can set a large MSS immediately. This is nice for FTP retrieves; it doesn't help for SMTP, alas.) The consensus was that this limitation was acceptable, since it erred on the conservative side.
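
The handshake rule above can be sketched as follows. This is a simplification under stated assumptions: `PeerState`, `note_incoming`, and `size_limit` are hypothetical names, the peer is taken to be non-local, and, like the discussion itself, the sketch uses one limit for both the advertised MSS and the datagrams sent, without accounting for header overhead.

```python
DEFAULT_NONLOCAL_SIZE = 576  # the "576-if-nonlocal" rule mentioned in the minutes


class PeerState:
    """Hypothetical per-destination record: has this peer ever sent us RF?"""

    def __init__(self) -> None:
        self.rf_seen = False

    def note_incoming(self, rf_bit_set: bool) -> None:
        # Seeing RF from the peer is taken as evidence that it implements
        # MTU discovery, and therefore also reassembles fragments.
        if rf_bit_set:
            self.rf_seen = True


def size_limit(peer: PeerState, first_hop_mtu: int) -> int:
    """Largest size to advertise or send to a non-local peer."""
    if peer.rf_seen:
        return first_hop_mtu
    # Stay conservative until the peer has shown that it sets RF.
    return min(first_hop_mtu, DEFAULT_NONLOCAL_SIZE)


peer = PeerState()
print(size_limit(peer, 1500))   # prints 576: nothing heard from the peer yet
peer.note_incoming(rf_bit_set=True)
print(size_limit(peer, 1500))   # prints 1500: the peer has set RF
```

The initiator of a connection has to call `size_limit` before it has heard anything from the peer, which is why it is stuck with 576 unless the two hosts have talked recently.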
(Although it errs on the case of the most common connection type [SMTP], since SMTP connections are normally short we wouldn't gain much anyway.) When two connections are made in quick succession, things work nicely (e.g., several mail messages, or the control connection of an FTP session followed by the data connection. The control connection will seldom carry large segments, but the exchange of RF bits done then will allow the data connection to use large segments right away.)

Mike Karels proposed (off-the-cuff, not necessarily believing that it was right) that routers fragmenting a datagram with RF set could also send the Fragmentation Occurred ICMP. This seemed to create problems given the requirement for handshaking imposed by the broken-PC crowd, so Mike agreed to go off and think about this one.

One question arose about the use of a previously unused bit in the IP header: what would current implementations do if they see it set? (We know that we can safely add options, since by definition these are ignored if not known.) While the IP spec says these bits must be zero, the "robustness principle" implies that routers and hosts should ignore them. Unfortunately, John Moy from Proteon admitted that Proteon routers drop such datagrams, and Noel Chiappa says that this is true of other implementations based on his old MIT "C-gateway" code. We have to find out just how bad this is going to be; perhaps Proteon will be able to upgrade all of its customers before MTU discovery is widely implemented.

[Side note: Clearly, implementations contrary to the basic IP spec are causing us serious grief. How much do we twist the protocol to accommodate them?]

An orthogonal issue is that in high-speed long-distance networks, there might be lots of packets in flight when the route changes to one with a lower MTU (e.g., on a satellite link with a half-second RTT, 4-Kbyte packets, and a 100 Mbit/sec channel, this means about 1500 packets per RTT!)
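
The figure in the example can be checked with a few lines of arithmetic. The sketch below assumes that the packets are 4096 bytes and that 1 Mbit is 10^6 bits; the function name is ours.

```python
def packets_in_flight(rtt_s: float, rate_bps: float, packet_bytes: int) -> float:
    """Number of packets needed to fill the pipe for one round-trip time."""
    bits_per_rtt = rtt_s * rate_bps
    return bits_per_rtt / (packet_bytes * 8)


n = packets_in_flight(rtt_s=0.5, rate_bps=100e6, packet_bytes=4096)
print(round(n))   # prints 1526, i.e. roughly the 1500 quoted
```
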
Since the source cannot react to a Fragmentation Occurred message sooner than one RTT worth of packets after the one that triggered the message, we are concerned that setting the RF bit on every packet could lead to positive (i.e., anti-stability) feedback in a network that is losing capacity.

This could be attacked in two ways: limit the rate at which the RF bit is sent, or limit the rate at which the ICMP is sent. The former could be done "once per RTT", once per some constant time period, or perhaps once per window. It's not clear if there is a convenient way of marking out the boundaries between windows.

ACTION ITEMS

1. Noel Chiappa and Van Jacobson were assigned to try to get the IESG to free up an IP header bit.

2. Mike Karels was going to think more about having routers send ICMPs when they fragment.

3. We need to determine how many routers will drop packets with RF set, and how hard it will be to fix this. Is it any different if we use one of the bits in the TOS area?

4. Ditto for end-hosts; are there any that drop such packets?

5. The Router Requirements WG was known to be considering changing the way that fragmentation is done (fragment into equal-size pieces; currently, routers are supposed to send N maximal-size fragments and one smaller one). This would make the RF scheme nearly useless. [Phil Almquist says that the RRWG will work with us on this, so it shouldn't be a problem.]

6. Perhaps more traffic studies would be useful.

7. Someone has to write the next draft. Keith and Rich were thanked for their hard work on their draft, which is now tabled, and were not coerced into starting a different document. Since Van was the fiercest proponent of RF at the meeting, he was given responsibility to see to it that the draft is written. He agreed but said he was going to try to get Steve Deering to do the work (Steve was absent due to serious thesis time-pressure, so maybe Van is going to be stuck with it.)
The chair requested a draft within one month (7 March 1990).

8. James VanBokkelen was going to see just how many hosts out there are unable to reassemble fragmented IP datagrams, how hard it would be to fix this, how many vendors are involved, etc.

IESG ACTION

On Thursday, February 8, at the open IESG meeting, the IESG was asked to allow this bit to be used for MTU discovery. I was not there, but I understand that the IESG is willing to release this bit if we come to a consensus on a protocol that they think is reasonable.

SCHEDULE

We expect to meet again at the May IETF meeting. At that point, we will probably either adopt one of the schemes, or give up.

ATTENDEES

Ballard Bare          bare%hprnd@hplabs.hp.com
Art Berggreen         art@sage.acc.com
Richard Bosch         probe@mit.edu
Ron Broersma          ron@nosc.mil
John Cavanaugh        John.Cavanaugh@StPaul.ncr.com
Noel Chiappa          jnc@LCS.MIT.EDU
James Davin           jrd@ptt.lcs.mit.edu
Farokh Deboo          sun!iruucp!ntrlink!fjd
Rich Fox              sytek!rfox@sun.com
Van Jacobson          van@lbl-csam.arpa
Mike Karels           karels@berkeley.edu
Mike Marcinkevicz     mdm@gumby.dsd.trw.com
Tony Mason            mason@transarc.com
Keith McCloghrie      sytek!kzm@hplabs.HP.COM
Bill Melohn           melohn@sun.com
Jeff Mogul            mogul@decwrl.dec.com
John Moy              jmoy@proteon.com
Drew Perkins          ddp@andrew.cmu.edu
Michael Petry         petry@trantor.umd.edu
Nuggehalli Pradeep    pradeep@orville.nas.nasa.gov
Mark Rosenstein       mar@athena.mit.edu
Tony Staw             staw@marvin.enet.dec.com
James VanBokkelen     jbvb@ftp.com
John Veizades         veizades@apple.com
Steve Willis          swillis@wellfleet.com
John Wobus            JMWobus@suvm.acs.syr.edu
David Zimmerman       dpz@convex.com