Next: Message Catalogs Up: Internationalization Issues Previous: The Locale

Character Encodings

Beyond the natural language issues, character encoding issues are probably the most vexing for the Mailman developers. "Character encoding" is usually referred to as the character set or charset, after the email header parameter described in RFC 2045.

A naive view would create a one-to-one correspondence between language and charset. For example, you might say that all Spanish text should be rendered in the iso-8859-1 (Latin-1) character set [ISOSoup]. However, even this simple example isn't accurate, because iso-8859-1 has no Euro sign; among the Latin charsets it is available only in iso-8859-15.
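The Euro sign case can be checked directly. The following is a minimal sketch in present-day Python 3 (not the Python version Mailman 2.1 targeted); the helper name `can_encode` is made up for illustration.

```python
# Illustrative only: compare how two Latin charsets handle the Euro sign.
EURO = "\u20ac"  # the Euro sign


def can_encode(text, charset):
    """Return True if every character in text exists in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


latin1_ok = can_encode(EURO, "iso-8859-1")    # Latin-1 has no Euro sign
latin9_ok = can_encode(EURO, "iso-8859-15")   # Latin-9 does
euro_bytes = EURO.encode("iso-8859-15")       # a single byte in Latin-9
spanish_ok = can_encode("ma\u00f1ana", "iso-8859-1")  # ordinary Spanish fits

print(latin1_ok, latin9_ok, euro_bytes, spanish_ok)
```

So ordinary Spanish text fits in Latin-1, but a message that mentions a price in Euros does not.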

The problem is exacerbated by some Asian languages. Japanese, for example, may appear in any of euc-jp, iso-2022-jp, or shift-jis, and the charset used may differ depending on whether the text appears in a web browser or in an email message. In fact, Mailman 2.1's naive approach causes some problems for Japanese users, especially when an email message is displayed as a web page in the archiver. This will be fixed in a future release.
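To make the Japanese case concrete, here is a small sketch in present-day Python 3 that encodes the same three characters in each of the charsets named above. The sample text and variable names are chosen purely for illustration, and `shift_jis` is the spelling Python's codec registry uses for shift-jis.

```python
# Illustrative only: one Japanese string, three incompatible byte forms.
TEXT = "\u65e5\u672c\u8a9e"  # the word for "Japanese language"

CHARSETS = ("euc-jp", "iso-2022-jp", "shift_jis")
encoded = {cs: TEXT.encode(cs) for cs in CHARSETS}

for cs, raw in encoded.items():
    print(cs, len(raw), raw)

# iso-2022-jp is a 7-bit encoding (it switches modes with escape
# sequences), which is one reason it is common in email.
seven_bit = all(byte < 0x80 for byte in encoded["iso-2022-jp"])

# Decoding with the wrong charset does not recover the text.
misread = encoded["euc-jp"].decode("shift_jis", errors="replace")
```

Because the byte sequences differ, software that assumes one Japanese charset per language will misread text that arrives in another.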

Usually, English text uses the us-ascii character set, but for maximum interoperability, a list conducted in English may still want to be aware of Latin-1 characters. Mailman has to be careful when combining characters in different charsets, especially those for which us-ascii is not a subset.

For example, say a Spanish list receives a message in Turkish, which uses Latin-5 (a.k.a. iso-8859-9). When that message is archived, some parts of the HTML page for the message will be in iso-8859-1 and other parts will be in iso-8859-9. But since HTML cannot mix multiple charsets in a single web page, the characters in one or the other of those charsets must be converted to HTML entities (numeric character references), using their Unicode equivalents.
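One way to perform that conversion is sketched below in present-day Python 3, using the standard `xmlcharrefreplace` codec error handler. This is not taken from Mailman's archiver; the function name `to_page_charset` and the sample words are assumptions made for illustration.

```python
import html


def to_page_charset(text, page_charset="iso-8859-1"):
    """Encode text for a page served in page_charset.

    Markup-significant characters are escaped first; any character that
    page_charset cannot represent becomes a numeric character reference
    built from its Unicode code point.
    """
    escaped = html.escape(text, quote=False)
    return escaped.encode(page_charset, errors="xmlcharrefreplace")


# Turkish letters outside Latin-1 become references such as &#351;
turkish = to_page_charset("\u0131\u015f\u0131k")
# Spanish text is left as plain Latin-1 bytes
spanish = to_page_charset("ma\u00f1ana")
print(turkish, spanish)
```

The page can then be labeled with a single charset, since the references are pure ASCII and valid in any of the ISO 8859 charsets.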

Multiple character set issues can also arise in the processing of email messages. Say, for example, that a message to a German list arrives in Japanese. Mailman has a feature called "headers and footers" which allows the list administrator to add some canned text to the start and end of a message (e.g. "To unsubscribe, click here"). Previous versions of Mailman would simply paste the header and/or footer around the original message body. This was broken for several reasons. The most obvious one is that if the message is really a Base64 encoded image, adding spurious ASCII text around the original body would break the decoding. Likewise, if the message contained text in a different character set than the header or footer text, concatenation could render the original body unreadable. The solution requires careful examination of the original message and, in the extreme, ripping apart and reconstituting the structure of the original message, so that the headers and footers will always be added in a MIME-safe way.
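The idea of a MIME-safe footer can be sketched as follows. This is not Mailman's implementation: it uses the `EmailMessage` API from present-day Python 3 rather than the email package of the time, and the function name `add_footer` and the sample footer text are assumptions for illustration. The footer is appended in place only when the message is a single text/plain part whose charset can represent it; otherwise the original content is left intact as the first part of a multipart/mixed message and the footer becomes a separate inline text part.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

FOOTER = "\n-- \nTo unsubscribe, visit the list page.\n"  # made-up text


def add_footer(msg: EmailMessage, footer: str) -> EmailMessage:
    """Add footer to msg without corrupting the original body."""
    if msg.get_content_type() == "text/plain" and not msg.is_multipart():
        charset = msg.get_content_charset("us-ascii")
        try:
            body = msg.get_payload(decode=True).decode(charset)
            footer.encode(charset)  # can this charset hold the footer?
        except (LookupError, UnicodeError):
            pass  # unknown charset or incompatible text: wrap instead
        else:
            # Safe to concatenate: re-encode body and footer together.
            msg.set_content(body + footer, charset=charset)
            return msg
    if "Content-Type" not in msg:
        # Make the implicit default explicit so the body is kept
        # when the message is restructured below.
        msg["Content-Type"] = 'text/plain; charset="us-ascii"'
    if "MIME-Version" not in msg:
        msg["MIME-Version"] = "1.0"
    # Moves the existing content into the first sub-part of a
    # multipart/mixed container, then appends the footer part.
    msg.add_attachment(footer, disposition="inline")
    return msg


raw_image = (b"From: a@example.com\nMIME-Version: 1.0\n"
             b"Content-Type: image/png\n"
             b"Content-Transfer-Encoding: base64\n\niVBORw0KGgo=\n")
image_msg = message_from_bytes(raw_image, policy=policy.default)
image_bytes = image_msg.get_payload(decode=True)
image_out = add_footer(image_msg, FOOTER)
```

With this arrangement the Base64 image is never touched, and a Japanese body is never concatenated with text it cannot represent. Restructuring a message this way may still have side effects this sketch ignores, for example on signed messages.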

Internationalization standards for email and HTML are defined in a series of RFCs, and these must be adhered to. For example, the most fundamental email RFC is 2822 [RFC2822] (which recently superseded RFC 822). This RFC describes the structure of an email message, but it is naive in its ASCII bias. RFCs 2045 through 2047 were added to address the use of multilingual character sets in email messages. In particular, RFC 2047 [RFC2047] describes how non-ASCII characters are to be encoded in Subject fields and in other email headers. Mailman must be able both to interpret email messages with RFC 2047 encoded headers and to produce properly formatted ones when necessary. The challenge is to parse well-intentioned but erroneously encoded headers, giving the sender the benefit of the doubt. Such errors are all too common in email messages found in the wild, and Mailman must be made robust against these poorly formed messages.
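Both directions can be sketched with the `email.header` module as it exists in present-day Python 3. The function name `decode_subject` and its fallback policy (substituting a replacement character for bytes that cannot be decoded) are assumptions for illustration, not Mailman's actual behavior.

```python
from email.header import Header, decode_header


def decode_subject(raw):
    """Decode an RFC 2047 header value to text, tolerating bad input."""
    pieces = []
    for chunk, charset in decode_header(raw):
        if isinstance(chunk, str):
            # Header contained no encoded words at all.
            pieces.append(chunk)
            continue
        try:
            pieces.append(chunk.decode(charset or "us-ascii"))
        except (LookupError, UnicodeError):
            # Unknown charset name, or bytes that do not match the
            # declared charset: keep what is readable.
            pieces.append(chunk.decode("us-ascii", errors="replace"))
    return "".join(pieces)


# Producing an encoded header from non-ASCII text.
encoded = Header("p\u00f6stal", "iso-8859-1").encode()

good = decode_subject(encoded)
unknown_charset = decode_subject("=?x-unknown-charset?q?hello?=")
mislabeled = decode_subject("=?utf-8?q?caf=E9?=")  # Latin-1 byte labeled utf-8
print(encoded, good, unknown_charset, mislabeled)
```

A stricter policy would reject the mislabeled header outright; the lenient fallback shown here reflects the "benefit of the doubt" approach described above.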

Prodded by these various issues, a comprehensive email package [Email] was developed and added to Python 2.2. The email package is compliant with all the relevant MIME RFCs, as well as other mail related standards.


Barry Warsaw 2003-04-08