Next: Other Issues Up: GNU Mailman, Internationalized Previous: Templates

Unicode

Python has two types of string objects: traditional 8-bit byte strings and Unicode character strings. Each type has its own literal form: quoted text is an 8-bit string unless the leading quote is prefixed with a "u", in which case it is a Unicode string. Because strings can come into Mailman in a variety of ways (e.g. through the web, an email message, or a message catalog), the code must be prepared to handle both encoded 8-bit strings and Unicode strings. Encoded 8-bit strings must be converted to Unicode via the unicode() built-in function before they can be properly combined with other strings by concatenation or interpolation. In addition, Unicode strings must be re-encoded when they are printed to certain streams, such as the log files or standard output, and these encoding operations must watch out for unsupported characters. For example, if a Unicode string containing non-ASCII Latin-1 characters is printed to an ASCII-only terminal, an exception can be raised.
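The round trip described above can be sketched as follows. This is only an illustration, and it assumes modern Python 3, where the 8-bit string type is called bytes, the Unicode type is str, and bytes.decode() takes the place of the unicode() built-in; the sample data is made up.

```python
# Assumption: Python 3, where bytes is the 8-bit string type and str is Unicode.
raw = b"caf\xe9"                    # Latin-1 encoded bytes, e.g. from an email message
text = raw.decode("latin-1")        # decode to a Unicode string before combining

greeting = "Bienvenue au " + text   # safe: both operands are Unicode strings

try:
    greeting.encode("ascii")        # what an ASCII-only stream would require
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False                # the accented character has no ASCII form
```

Here the concatenation succeeds because the bytes were decoded first, while the attempt to re-encode the result as ASCII fails on the accented character.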

Character conversion problems have been the thorniest and most common bugs reported against Mailman 2.1 to date. While many of them have been fixed, the most important lesson learned is that Mailman should convert all text (not necessarily all strings!) to Unicode at the earliest possible time, ideally when the text enters the system. Mailman should use Unicode strings everywhere internally, converting to encoded 8-bit strings only where needed, and only at the last possible moment. Analysis will still be needed to decide how to handle conversion errors, such as those described above. In Python, the conversion function can be given an additional argument which specifies how strict the conversion should be: for example, raise an exception if illegal characters are found, throw the illegal characters away, or substitute a question mark for each illegal character. The exact choice of the strictness flag will depend on the context in which the conversion occurs.
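A minimal sketch of this "decode early, encode late" discipline, together with the three strictness choices, might look like the following. It assumes Python 3 (bytes and str); the helper names to_text and for_stream are hypothetical and are not part of Mailman.

```python
# Assumption: Python 3. to_text and for_stream are hypothetical helpers.

def to_text(data, charset="utf-8"):
    """Decode incoming bytes once, at the boundary; pass Unicode through."""
    if isinstance(data, bytes):
        return data.decode(charset, "replace")
    return data

def for_stream(text, charset="ascii", errors="strict"):
    """Encode at the last moment, for a stream that only accepts `charset`."""
    return text.encode(charset, errors)

subject = to_text(b"caf\xc3\xa9")   # UTF-8 bytes entering the system
line = "Subject: " + subject        # internal processing stays in Unicode

replaced = for_stream(line, errors="replace")  # question mark per bad character
dropped = for_stream(line, errors="ignore")    # bad characters thrown away
try:
    for_stream(line, errors="strict")          # raises on the first bad character
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

Which errors value is appropriate depends on the destination: a log file can probably tolerate question marks, whereas silently dropping characters from an outgoing message may not be acceptable.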


Barry Warsaw 2003-04-08