Reported by Steve Casner/USC-ISI

Minutes of the Audio/Video Transport Working Group (AVT)

1. Overview

In the first AVT session, rough consensus was given to submit the revised Real-time Transport Protocol specification for area directorate review and IESG Last Call as a Proposed Standard. This revision, denoted RTP version 2, incorporates changes requested by the first area directorate review in November 1993. It is the refinement by Steve Casner, Ron Frederick, Van Jacobson and Henning Schulzrinne of the rough protocol changes presented and discussed at the March 1994 IETF meeting in Seattle.

An overview of the revised RTP was presented and discussed in the first AVT session and part of the second. The group concurred with the choices made on all of the previously open issues. It was agreed that the hooks provided for protocol extensions were adequate for planned experiments with mechanisms not included in the current protocol. More details on the issues discussed are included in the sections below.

There are a few explanatory paragraphs and example algorithm appendices that need to be completed in the draft RTP specification, then it will be submitted. This should be done well before the next IETF meeting.

Later in the second AVT session, presentations were given on specifications for the encapsulation in RTP of three different video encodings: Christian Huitema presented the H.261 encoding, Bill Fenner presented the JPEG encoding, and Don Hoffman presented the MPEG encoding. These specifications will also be completed as Internet-Drafts and then submitted as Proposed Standards.

On-line versions of the slides are available via FTP from ftp.isi.edu in directory mbone/avt/toronto-july94/.

2. Changes in RTP Version 2

In a message posted to the working group mailing list (rem-conf) in June, a list of the protocol changes from version 1 to version 2 was given.
The changes may be summarized as follows:

   o  Carry the control and data traffic on separate ports
   o  New control packet format, including mandatory reception reports
   o  No options in data packet; CSRC count in fixed header
   o  Locally unique 16-bit SSRC ID is now a 32-bit random global ID, always present (header now 12 octets)
   o  Media-specific timestamps (versus fixed 65536 Hz rate)
   o  Remove the application-level multiplexing of the Channel ID and move it to an encapsulation for the cases where it's needed
   o  Application-specific sync marker bit
   o  Encryption covers the whole packet; authentication omitted
   o  With the change to global and potentially encrypted SSRC IDs, translators cannot do unicast reverse path control packets
   o  Beginning of sync unit (BOS) option was eliminated; encodings that need the info should include it in their own headers
   o  Length fields changed to count 32-bit words not including the length word to eliminate some validity checks at receiver

Comments were sought both on the changes that were proposed in a finished form as well as on some open issues that were undecided. Some comments were received, but no major objections. Although not all the open issues were discussed, the authors completed the specification and posted it as Internet-Draft draft-ietf-avt-rtp-05.txt before this meeting. During this meeting, each of the open issues was discussed as outlined in the sections below, and the e-mail comments were addressed.

2.1 Provision for Testing Bolot/Turletti/Wakeman Scheme

RTPv2 includes a receiver-initiated congestion reporting scheme based on multicast reception reports. An alternative scheme is the one based on sender-initiated polling described in a paper by Bolot, Turletti and Wakeman at SIGCOMM '94. This was implemented in RTPv1 using options carried in the data packets. Christian Huitema and Ian Wakeman argued that it was important that RTPv2 not preclude further experimentation with this scheme.
RTPv2 does not include control options, but does provide a header extension mechanism intended for application-specific extensions. It was agreed that provision should be made for further experiments to compare the two schemes, and that the header extension mechanism would be suitable. Details of the extension format for this use must be defined in the Audio/Video profile that accompanies the RTP specification.

2.2 Provision for Testing Unicast Feedback Mechanism

In the Bolot/Turletti/Wakeman scheme, reports from the polled receivers are returned to the sender using the ``unicast reverse path control packet'' mechanism of RTPv1. This mechanism was eliminated in RTPv2 because the change of SSRC ID scheme and complete encryption of the packets preclude translators from passing the reverse packets.

Van Jacobson suggested that it would be better to multicast the reports even under the sender-initiated polling scheme, and Ian Wakeman agreed this might be the best method. However, Christian Huitema did not want to rule out entirely the possibility of sending unicast reports as it might have lower cost especially for symmetric sessions with many very small sources. Unicast control packets are also utilized in the H.261 video packetization scheme.

It was again agreed that provision should be made for experimentation with both unicast and multicast methods. Since it is feasible to do these experiments in scenarios that do not include translators, receivers can use the IP/UDP source information to return unicast replies directly. The INRIA folks prefer that the source port of the data packets be used as the destination port for the unicast replies. INRIA will take responsibility for designing the details of unicast packet use under RTPv2 in this scenario, and provide a report back to the group on the results of the experiments.

2.3 Need to Define Encapsulations for Protocols Other Than UDP

In RTPv2, control and data are sent on separate ports when using UDP.
For other protocols, either two associations must be used or some encapsulation must be defined to provide multiplexing of control and data on one association. The RTPv2 specification does not specify any such encapsulations; instead, that task is left to separate specifications in the same manner as the IP encapsulation on Ethernet is defined separately from the IP specification.

Don Hoffman suggested that it was the responsibility of ancillary groups, such as ATM Forum for AAL5, to decide whether to provide multiplexing or use two associations, and to define the details of the encapsulations. It was agreed that the RTP specification would simply state the requirement for the underlying layers to provide the multiplexing for separate control and data.

2.4 Removal of FMT Control Packet

Van Jacobson, Ron Frederick and Christian Huitema all lobbied for the removal of the FMT control packet. Henning Schulzrinne was the primary proponent but was not present. Henning's argument is that, due to the combinatorics of encoding parameters, one cannot define ahead of time all the payload types that may be needed in a session. The creator of the session cannot know all the ones that others in the session may want to use. Van countered that only a small number of the combinations will ever be used.

The group was asked about other uses of dynamically defined payload types that might affect this decision, but none were identified. It was agreed that FMT should not be defined in the main RTP specification, but that it could be defined in profiles as needed. For the initial Audio/Video profile, it was further agreed not to include FMT at this time. If a clear need is demonstrated later, we can define it then, as a profile extension.

2.5 Authentication Omitted

The RTPv2 specification does not specify any authentication methods.
Encryption is defined because the primary security concern is for privacy in conversations, which seems to be a stronger concern for audio than for typed words. Furthermore, Ron Frederick asserted that without a key management system to use for the authentication, it's a moot point. There were no objections to omitting authentication from the specification.

2.6 Rules for Sending Receiver Reports

There are a few items that remain to be fully specified in the RTP draft. One is to clarify when reception reports are required and when they may be omitted. The current statement is simply that they are required when IP multicast is being used. It may not make sense for the specification to describe in detail under what circumstances reports might not be used; we know about the IP multicast case, but we have not really learned about the others yet (e.g., unidirectional systems).

Van Jacobson has promised to supply an algorithm for calculating the interval between reception reports such that the overall rate of control traffic from all sources is kept to a small fraction (1%) of the data rate. This will go in an appendix of the RTP specification.

Van also brought up a new aspect to be considered for this algorithm that was suggested by Henning Schulzrinne. If more of the control bandwidth is allocated to senders than to receivers so that they can send CNAMEs more often, this will allow receivers to more quickly establish the cross-media binding for functions such as audio/video synchronization. For example, giving 50% to senders and the rest to receivers seems reasonable. If all participants get the same amount of control bandwidth, in a 1000-person conference it might be 5 minutes before a new participant received the sender's CNAME.

The details need to be designed for clamping the sender control rate to a reasonable maximum and ensuring that randomization of the sending interval will avoid exceeding the overall control bandwidth on a transient basis over the scale of session sizes.
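As an illustration only, the bandwidth split discussed above can be sketched in a few lines of Python. This is not the algorithm Van Jacobson promised, which had not yet been supplied. The function names, the `sender_share` and `min_interval` parameters, the packet size and the 0.5 to 1.5 randomization range are all assumptions made for the example.

```python
import random


def mean_report_interval(members, senders, is_sender, control_bw_bps,
                         avg_packet_bits, sender_share=0.5, min_interval=0.0):
    """Mean seconds between control packets sent by one participant.

    sender_share is the fraction of the control bandwidth reserved for
    senders (hypothetical parameter; 0.5 mirrors the 50% example above).
    min_interval is a hypothetical clamp on how often anyone may report.
    """
    if members < 1 or not 0 <= senders <= members:
        raise ValueError("need members >= 1 and 0 <= senders <= members")
    if not 0.0 < sender_share < 1.0:
        raise ValueError("sender_share must be strictly between 0 and 1")
    receivers = members - senders
    if senders == 0 or receivers == 0:
        # Only one class of participant exists, so it shares all the bandwidth.
        group, bandwidth = members, control_bw_bps
    elif is_sender:
        group, bandwidth = senders, control_bw_bps * sender_share
    else:
        group, bandwidth = receivers, control_bw_bps * (1.0 - sender_share)
    # Each member of the group gets an equal slice of the group's bandwidth.
    interval = group * avg_packet_bits / bandwidth
    return max(interval, min_interval)


def jittered(interval, rng=random):
    """Randomize the interval so that participants do not synchronize.

    The 0.5..1.5 range is an assumption, not a value taken from the draft.
    """
    return interval * rng.uniform(0.5, 1.5)


if __name__ == "__main__":
    # Hypothetical session: 1000 members, 2 senders, 1000 bit/s of control
    # bandwidth, 1000-bit control packets.
    for role, flag in (("sender", True), ("receiver", False)):
        mean = mean_report_interval(1000, 2, flag, 1000, 1000)
        print(role, "mean interval (s):", mean, "next send in:", jittered(mean))
```

With these made-up numbers a sender reports every few seconds while a receiver reports far less often, which is the effect the proposal is after: a newly arrived participant would learn the senders' CNAMEs quickly without the total control traffic growing.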
This new feature should be tested before it goes into the specification.

2.7 Bit Allocations, Lengths in 32-bit Units, Control Packet Types

The RTP specification defines a particular allocation of bits to functions in the data header. In particular, only 4 bits are allocated to the count of CSRC identifiers following the header so that 7 bits may be allocated to the payload type field. There were no objections to this allocation.

A recent change in the specification was that all length fields covering areas required to be a multiple of 32 bits should be counted in units of 32 bits rather than octets, and should not count the first 32-bit word that contains the length field. This avoids a validity check that the bottom two bits of length are zero and a second validity check that the value is not zero. No objections were voiced.

Steve Casner proposed that the control packet type space be partitioned among the main specification, profiles, and applications within a profile, as was done with option codes in RTPv1. This allows profiles and applications to define types without conflicting with each other or future definitions in the main specification. This topic was not yet addressed in the specification, but was agreed and will be added.

2.8 RTP Timestamps and Relationship to Real Time

Although it was not listed as an open issue, some questions were raised about how the RTP timestamp should be related to real time for purposes of synchronization. Christian Huitema pointed out that the RTPv1 timestamp provided the relationship to real time directly in the signal stream where it fits naturally. In RTPv2 the relationship is carried in the control packets to optimize data packet processing, and this may be less convenient for some implementations.

Julio Escobar noted that for some applications such as data fusion, the limitations on control traffic bandwidth might make the delay before synchronization too long.
For such applications, the profile may specify that the RTP timestamp will carry part of a real-time timestamp and/or that additional real-time timestamp information may be carried in a header that's part of the encoding or in an RTP header extension. However, the RTP timestamp is supposed to have a random initial offset for stronger encryption, so for the RTP timestamp to carry part of an NTP timestamp this offset must be communicated to the receivers out of band so it can be subtracted.

Christian said the IVS implementors had also observed a problem that the audio input on some workstations skips samples under heavy load, thereby causing the media clock to drift with respect to real time. It should be possible for the normal playout buffer adaptation to accommodate this. For synchronized playback, the relationship to real time may be adjusted at the next start of talkspurt following each Sender Report control packet that is received. These must be sent often enough that the drift out of sync does not become too large in between, which relates back to the control packet bandwidth limit.

3. Open Questions about Audio/Video Profile

In addition to the open issues regarding the RTP specification itself, there were a few open issues to be settled for the specification of the initial Audio/Video profile. These are described below.

3.1 Number and Meaning of Marker Bits

The RTP specification allows a profile to trade off the number of marker bits and payload type bits in the second octet of the data header. The proposal for the audio/video profile is to have one marker bit, and that it would mark the start of a talkspurt for audio and the end of a frame for video. There was some discussion of the value of marking the end of a talkspurt, but Van Jacobson argued that the functions to be performed were independent of the bit. The choice of marker bit was accepted by the group.
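The bit allocation from Section 2.7 and the single marker bit chosen here can be made concrete with a short Python sketch that packs and parses a 12-octet data header. The minutes confirm only the 4-bit CSRC count, the 7-bit payload type, the one marker bit, the 32-bit SSRC and the 12-octet total. The position of the version, padding and extension bits and the order of the remaining fields are assumed to match the layout later published for RTP version 2, and the function names are hypothetical.

```python
import struct

RTP_VERSION = 2
# Two flag octets, 16-bit sequence number, 32-bit timestamp, 32-bit SSRC,
# all in network byte order: 12 octets in total.
_FIXED = struct.Struct("!BBHII")


def pack_header(payload_type, seq, timestamp, ssrc, marker=False, csrcs=()):
    """Build the fixed header followed by any CSRC identifiers."""
    if not 0 <= payload_type < 128:
        raise ValueError("payload type must fit in 7 bits")
    if len(csrcs) > 15:
        raise ValueError("CSRC count must fit in 4 bits")
    # Assumed layout: version in the top 2 bits, padding and extension
    # bits left at 0, CSRC count in the low 4 bits.
    octet0 = (RTP_VERSION << 6) | len(csrcs)
    # One marker bit (the profile's choice) above the 7-bit payload type.
    octet1 = (int(marker) << 7) | payload_type
    fixed = _FIXED.pack(octet0, octet1, seq & 0xFFFF,
                        timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)
    return fixed + b"".join(struct.pack("!I", c) for c in csrcs)


def unpack_header(data):
    """Parse a header produced by pack_header into a dictionary."""
    octet0, octet1, seq, timestamp, ssrc = _FIXED.unpack_from(data)
    count = octet0 & 0x0F
    csrcs = struct.unpack_from("!%dI" % count, data, _FIXED.size)
    return {
        "version": octet0 >> 6,
        "marker": bool(octet1 >> 7),
        "payload_type": octet1 & 0x7F,
        "seq": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
        "csrcs": csrcs,
    }


if __name__ == "__main__":
    # Marker set, as for the first packet of an audio talkspurt.
    header = pack_header(0, 1, 160, 0x12345678, marker=True)
    print(len(header), unpack_header(header))
```

Under the audio/video profile described above, a sender would set `marker=True` on the first packet of a talkspurt for audio and on the last packet of a frame for video.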
3.2 Default Encryption Method

Encryption at the RTP level is defined to cover the entire packet, and header validity checks are used to verify decryption with the correct key. The specification also identifies an alternative to not use encryption at the RTP level, but instead to allow both unencrypted and encrypted payload types to be defined. For example, two payload types could be defined, one for unencrypted PCM and one for encrypted PCM. This allows feeding an encrypted, compressed stream to hardware that expects such a stream.

It was proposed that the audio/video profile should specify RTP-level encryption as the default, based on the general principle to encrypt all information that does not have to be left in the clear. This was accepted by the group.

3.3 Relationship Between Control and Data Port Numbers

The RTP specification currently defines the default relationship between the control and data port numbers to be ``control = data + 1,'' but allows profiles to define a different relationship. Van Jacobson proposed to change this default to be more strict: that the data port must be even (making the control port odd), and that we use this choice in the audio/video profile. This change would allow a network provider to notice traffic on either port and find the control channel to monitor without having any external information about the conference.

This proposed change was agreed, and in addition it was agreed that both the address allocator and the media applications should force the data port number to be even. This policy could be implemented only in the address allocator, such as the sd tool. However, since the current implementation of sd does not force the data port to be even, it was agreed that enforcing the policy in both places would ensure that it was upheld and avoid compatibility problems.

4.
Profiles for Packetization of Video Encodings

During the second session, presentations were given on the specifications of how the H.261, JPEG and MPEG video encodings should be packetized for carriage over RTP. These specifications will be companions to the RTP specification.

4.1 Packetization of H.261 Video Encoding

Christian Huitema gave a presentation on the revision of the H.261 video packetization specification. This encoding works without the H.221 bit-level multiplexing that is used with H.261 over circuits, carrying GOBs (groups of blocks) in packets instead. INRIA implemented compression in software; UCL has interfaced a hardware codec and stripped off the H.221 framing. The packet format allows for arbitrary bit alignment of the data to accommodate the hardware codecs.

After the RTP header, there is a 16-bit header that describes the format of the encoding that follows. Included are the bit positions of the starting and ending bits within their bytes. These are now constrained to be zero (byte aligned) except at the beginning and end of a GOB. Also in the header are several flags and the image size.

Van Jacobson requested a change to allow reassembly of the packets of a GOB into a contiguous buffer even when packets arrive out of order. The contiguous buffer permits a simpler and faster decoding loop. This can be achieved by establishing the rule that all packets of a GOB other than the last are the same size; or, alternatively, by adding a fragment offset field to the H.261 packetization header. Christian preferred the first option because it did not introduce an incompatibility in the packet format and did not add more overhead. It was agreed to make this change in the specification.

Steve Casner pointed out that the recent draft still requires some changes in packet formats to reflect the use of RTCP control packets on the control port rather than options in the data packets.
As was noted above, some detail on the use of unicast reverse packets must also be specified. When these steps are completed, it was agreed that this draft should be submitted in conjunction with the RTP specification as a Proposed Standard.

4.2 Packetization of JPEG Video Encoding

Bill Fenner gave a description of the JPEG over RTP specification which has resulted from discussions with Ron Frederick, Steve McCanne and Lance Berc. (See the slides for details.) An Internet-Draft on the JPEG packetization specification will be produced by the next IETF meeting in December.

The encoding has a 64-bit header including a fragment offset since it is not possible to guarantee same-size packets in JPEG. JPEG markers are defined to be 0xFF bytes in the data stream; if you have a hardware codec that does not support this, you have to remove them in software. (Since there are also hardware codecs that require the 0xFF stuffing, you cannot always win, and including them allows additional functions.) The only JPEG markers supported are restarts which allow recovery in case some data is lost.

A type field has replaced the collection of individual parameters in the previous version of this specification. Types 0-127 will be statically defined, with type 0 being YUV 4:2:2 and type 1 being YUV 4:2:0. Since not all hardware supports restarts, type 0 is defined to exclude them to maximize interoperability. Restart codes will be supported in some future types once all the details are determined. Types 128-255 are dynamically defined by the session protocol or by a control packet, basically by sending all of the JFIF header describing that type.

4.3 Packetization of MPEG Video Encoding

Don Hoffman gave an update on the changes in the MPEG packetization draft for RTPv2. (The Cell-B draft will just be re-issued and discussed via e-mail.) MPEG-2 is in development as an ISO/IEC standard. In the MPEG profile for RTP, two formats are proposed.
The first translates and encapsulates the information in the MPEG-2 Systems environment for interoperation with other transport mechanisms. The second is a much simpler format for ``native'' Internet uses (eliminating a lot of the application-level functionality that does not apply). It is expected that MPEG hardware will provide an interface at the Packetized Elementary Stream (PES) level to make this possible. For both the MPEG and Cell-B specifications, the goal is to have Internet-Drafts completed by the next meeting.

This specification uses the 90 kHz MPEG presentation timestamp clock for RTP timestamps. There is a transport header at the start of the RTP payload that carries a translation of the MPEG transport information. The transport header includes some optional fields whose presence is indicated by a bit field in the first word.

There is one issue with regard to MPEG timing. There are I, P and B frames that are produced and interpolated at the receiver. The output from the encoder is not in temporal order; it is in frame dependency order. Therefore, the presentation timestamps in the RTP header will be transmitted out of order with respect to the sequence numbers. The group did not see this as a problem.

For the PES encapsulation, we will need payload types assigned for MPEG 1 video, MPEG 2 video, and MPEG 2 audio. A 16-bit header was proposed at the start of the payload to carry some flag bits and a slice counter. One of the flag bits indicates that another 16 bits follow to carry the macro-block absolute position field. However, Ron Frederick suggested that 32-bit alignment was valuable, so the second 16 bits should always be included. Don agreed this was a worthy consideration.

5. Conclusion

During this meeting, the group agreed on essentially all of the open issues for the RTP specification.
At the end of the meeting, the group was asked for a show of hands from those who thought the specification choices that had been made were fine, and that we should proceed with filling in the example algorithms and completing the areas in the specification where it now says ``to be determined'' and then submit this protocol specification to the Area Directorate again for publication as a Proposed Standard RFC. The chair interpreted the response as consensus that we should proceed. There is now nothing in the way except completing the remaining details; the people who are responsible know who they are. The specification should be completed and submitted well in advance of the next meeting.