Minutes of the June 21 Message format extensions working group.

Attendees
---------

Phill Gross             pgross@nis.ans.net
Peter Svanberg          psu@nada.kth.se
Byungnam Chung          bnchung.sokri.etra.re.kr
Bob Kummerfeld          bob@ca.pn.oz.au
Jonny Eriksson          bygg@sunet.se
Jan Michael Rynning     jmr@nada.kth.se
Keld Simonsen           keld.simonsen@dkuug.dk
Greg Vaudreuil          gvaudre@nri.reston.va.us

Agenda
------

1) Character Set Selection

   - Status and Input to the ISO 10646 process
        o Unicode <=> ISO 10646 Union?
        o Use of C0 and C1 codespace

   - Selection of "Common" character sets or schemes
        o ISO 8859-1, ISO 8859-n, Profiles for the use of ISO 2022?
        o Specifying "requiredness"

   - Specification of 8 bit character sets in headers

Minutes
-------

1) Character Set Issues

   a) Unified character set

      1) Administrative

      At last word, the ISO DIS 10646 received 9 YES votes and 14 NO
      votes, and work is proceeding to resolve the remaining issues.

      An unofficial but promising effort is the work underway to unify
      ISO DIS 10646 and Unicode, another scheme for a global character
      set. This effort is being conducted outside the normal ISO
      process. The working group was asked to discuss the effort and
      endorse it if possible. It did so, and agreed that the efforts to
      combine Unicode and 10646 were in fact positive.

      2) Technical

      The unification of ISO DIS 10646 and Unicode requires the
      resolution of several technical issues. The primary issue,
      tentatively resolved, involves "Han unification", a scheme that
      re-uses many of the graphics of the various Kanji character sets.
      Other issues involve the use of C0 and C1 codespace.

      The use of C0 and C1 codespace involves transport issues, and
      this working group was asked for its input.

      C0 codespace consists of positions 0 through 31 and 127,
      traditionally used for control characters. There is a proposal to
      use this space in the second octet of a multi-byte character for
      graphic characters. The working group discussed this proposal and
      rejected the use of this space.
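The transport concern with this proposal can be illustrated with a short sketch. It assumes two hypothetical two-octet characters whose second octets happen to be 13 and 10; the code values are invented for illustration and are not taken from any 10646 draft. The helper name c0_octets is likewise an assumption. The sketch shows what byte-oriented, line-based software sees in such data.

```python
def c0_octets(data):
    """Return (offset, value) for each octet in C0 space (0-31) or equal to 127."""
    return [(i, b) for i, b in enumerate(data) if b <= 31 or b == 127]

# Two hypothetical two-octet characters; the second octet of each falls in C0.
chars = [(0x41, 0x0D), (0x42, 0x0A)]
stream = bytes(octet for pair in chars for octet in pair)

# Offsets and values of the octets a byte-oriented transport would treat as controls.
print(c0_octets(stream))

# A line-oriented reader treats 13 (CR) and 10 (LF) as line breaks, so the
# two characters are torn apart and their second octets are discarded.
print(stream.splitlines())
```

Software that is unaware of the multi-byte structure has no way to tell these octets from real control characters.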
      A graphic character in the C0 space will likely be interpreted by
      a transport protocol as a control character. Many transport
      protocols which interpret in-band data, such as SMTP, may behave
      unpredictably in this situation. One example is where a sequence
      of graphics legally sent by an 8 bit sender is misinterpreted by
      a 7 bit receiver, after bit stripping, as a 13-10-46-13-10
      sequence (CR LF "." CR LF), terminating the SMTP session
      prematurely. Other related anomalies were envisioned. Unless all
      transport protocols are made aware of the multi-byte nature of
      the data, an unlikely occurrence any time soon, reuse of C0 space
      is not recommended.

      C1 codespace consists of positions 128 through 159, space that
      may be interpreted as control characters if the high order bit is
      stripped. The ISO 8859-n character sets and the current 10646
      proposal reserve this space for control characters only, with an
      eye toward backward compatibility with 7 bit systems. The working
      group discussed this and concluded that C1 codespace could be
      used for graphics if transport protocols could be relied upon
      never to strip the high order bit and interpret the resulting
      characters as control sequences. The working group did not make a
      specific recommendation, noting only that the use of C1 space to
      compact a character set was a positive thing, and that the future
      evolution of transport protocols should support the use of this
      space for graphics.

   b) Common Character Sets

      In the absence of a single international standard character set,
      the working group needs to profile the use of a limited number of
      the 200+ character sets in use worldwide to facilitate
      interoperation.

      Keld S. gave an overview of the character sets currently in use.

      ISO 7 bit family:
           ASCII
           National Versions
                10 National use
                2 Alternate rep # $
           ECMA registry 7, 8, 16 bit
           ISO 2022 shifts

      ISO 8 bit 8859 family:
           1 char = 1 octet
           ASCII in pos 0-127
           Pos 160-255
                Latin sets (5)
                Cyrillic
                Greek
                Arabic
                Hebrew

      ISO 6937-2 family 8/16 bit:
           6937-2, T.61
           Non-Spacing accents
           1 char = 1 or 2 bytes
           about 330 graphical chars

      Vendor 8 bit sets
           DEC-MCS
           HP Roman8
           IBM PC codepages (5)
                Uses also 128-159 (C1)
           IBM EBCDIC
                Many versions
                Not ASCII Compatible

      16 bit char sets
           Japanese: JIS 0208, 0212
           Chinese: GB 1980
           Korean:
           Japanese 8/16 bit: Shift JIS
           Unicode: New vendor charset
                unifies CN, JP, KO sets
                Incompatible with ISO

      Multi-byte:
           EUC: Extended UNIX code
                ISO 2022 shifting SS1 SS2 SS3
                4 char sets
                8/16/24 bits

      32 Bits:
           ISO 10646
                Also usable in 8, 16, or 24 bit compaction methods
                Proper encoding subsets: ASCII and ISO 8859-1

      Control Character Sets:
           ISO 646: 0-31, 127
           ISO 6429: 0-31, 127-159
           EBCDIC: as ISO 646

      Several ideas were batted around, including strict use of ISO
      2022, profiling language to character set mapping, and the use of
      "preferred" character sets. The working group felt that the best
      approach was to codify existing practice in the interim, pending
      adoption of an "international" character set. This existing
      practice was reduced to the following.

      If possible, use ISO 8859, with the lowest version number
      possible, i.e., use 8859-1 (Latin 1) over 8859-10? (Latin 5?).

      If the characters needed are not in the 8859 sets (e.g., Kanji),
      use the 2022 character switching standard, declaring 2022 in the
      header of the document. While this may lead to the use of any of
      the many character sets in the ECMA registry, the WG felt that in
      practice only the current Oriental mail systems will use the 2022
      system, and only with limited character sets.

   c) Use of Non-ASCII character sets in headers.

      What a mess! The attendees of this meeting spent over an hour
      working on various schemes for indicating character sets other
      than ASCII in the headers of a message.

      It was identified as a requirement that the fields defined as
      TEXT be able to have variable character sets. While this goal was
      stated, no mechanism for the implementation was agreed upon.

      A modification of the BNF notation was suggested by Keld S.

           CHAR-EIGHT  = ; (0-377, 0.-255)
           qtext       = ,"\" & CR, and including linear-white-space>
           quoted-pair = "\" CHAR-EIGHT
           text        =

      This notation was accepted by the attendees of the meeting;
      however, several problems were identified and not resolved:
      1) identification of the header character set and the need for
      conversion, and 2) encoding the header character sets in 7 bit
      transport format.

      It was not clear how a conversion gateway would know that the
      header was 8 bit and needed encoding. A suggestion accepted by
      the group was that the use of the new BNF requires the use of a
      header-charset header line. This additional header adds
      complexity to user agents and conversion gateways by requiring
      two passes over the header to determine its character set and
      convert it into a passable or readable form. It was felt that
      this was inelegant but do-able.

      Several proposals were discussed for encoding the 8 bit text
      strings when 7 bit transport was required. It was accepted that
      this was a hard requirement.

      1) Variable Substitution

         One proposal for the insertion of 8 bit text was to substitute
         a variable name in the header for each text string needing 8
         bit characters. The variable could then be defined elsewhere
         in the header, including the encoded actual string and a token
         indicating the character set. This was rejected as messy and
         difficult to implement in current user agents.

      2) Message Encapsulation

         Encapsulate the mail message using the message type body part
         and a suitable transport encoding, preferably
         quoted-printable. This proposal is controversial with at least
         one implementor of the message format standard, who regards it
         as imposing excessive complexity on the user agent.
         It is not clear that the encapsulated message will be
         permitted to have a transport encoding.

      3) Encoded Text Fields

         This proposal would specify a standard encoding for the header
         fields, possibly quoted-readable or quoted-printable, and
         identify this fact in a header-transport-encoding header or
         the header-character-set header.

      Conclusions

      While no one was happy, the group tentatively agreed not to
      permit 8 bit text in the headers. The only reasonable way to
      carry such text over 7 bit transport was to encode the text
      fields and insert a new header line. Given this overhead, the
      group agreed that, while not ideal, extended character sets
      should always be required to be encoded, eliminating the need for
      intermediate gateways to parse and convert the headers.
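
The direction reached in the conclusions, always encoding extended-character header text so that only 7 bit data crosses the transport, can be sketched as follows. This is an illustration only. It assumes a quoted-printable style encoding and ISO 8859-1 as the character set, and the function names and the exact spelling of the Header-Charset line are hypothetical, since the group did not settle the mechanism. Line-length limits and trailing-whitespace rules are ignored.

```python
def encode_header_text(text, charset="iso-8859-1"):
    """Render header text as 7 bit characters, writing other octets as =XX."""
    out = []
    for octet in text.encode(charset):
        printable = 33 <= octet <= 126 and octet != ord("=")
        if printable or octet == ord(" "):
            out.append(chr(octet))
        else:
            out.append("=%02X" % octet)  # two uppercase hex digits
    return "".join(out)

def decode_header_text(encoded, charset="iso-8859-1"):
    """Reverse encode_header_text, given the declared character set."""
    octets = bytearray()
    i = 0
    while i < len(encoded):
        if encoded[i] == "=":
            octets.append(int(encoded[i + 1:i + 3], 16))
            i += 3
        else:
            octets.append(ord(encoded[i]))
            i += 1
    return octets.decode(charset)

# Sample subject containing o-diaeresis and a-ring, written with escapes
# so that this source text itself stays 7 bit.
subject = "M\u00f6te p\u00e5 fredag"
encoded = encode_header_text(subject)

print("Header-Charset: iso-8859-1")  # hypothetical header line
print("Subject: " + encoded)
```

Because the sender always encodes, an intermediate gateway can pass the header through unchanged; only the receiving user agent needs the declared character set to recover the original text.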