
Message Catalogs

GNU gettext [Gettext] is a widespread formal model for supporting multiple languages in traditional C applications. Gettext encourages the use of implicit message ids: the source-language string itself serves as the lookup key. This leads to a rhythm whereby the C programmer marks translatable text in the source code by wrapping it in a function call. The function is usually _() - called ``the underscore function'' - and it has both a run-time behavior and an off-line purpose. At run-time, the underscore function looks up the message id in a global language catalog. Off-line, a separate tool searches all the source code for marked strings, extracts them, and places them in a message catalog template, called a .pot file.

GNU gettext contains both a C library and a suite of tools provided by The Translation Project [TranslationProject] to manage internationalized programs. The message extraction tool is called xgettext. While newer versions of xgettext understand Python source code to some degree, a pure-Python version of the program called pygettext was developed and is distributed with Python. pygettext has some additional benefits, including the ability to extract Python docstrings, which may not be marked with the underscore function.
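
To make the extraction step concrete, here is a minimal sketch of what such a tool does. It is not how xgettext or pygettext is implemented: the function names, the sample source, and the simplified .pot-style output are all assumptions made for illustration, and it handles only plain string literals passed to _().

```python
import ast


def extract_marked_strings(source):
    """Return sorted (lineno, msgid) pairs for every _('literal') call."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == '_'
                and node.args
                and isinstance(node.args[0], ast.Constant)
                and isinstance(node.args[0].value, str)):
            found.append((node.lineno, node.args[0].value))
    return sorted(found)


def format_pot_entries(pairs, filename):
    """Render extracted msgids as simplified .pot-style entries."""
    lines = []
    for lineno, msgid in pairs:
        escaped = msgid.replace('\\', '\\\\').replace('"', '\\"')
        lines.append('#: %s:%d' % (filename, lineno))
        lines.append('msgid "%s"' % escaped)
        lines.append('msgstr ""')   # left empty for the translator
        lines.append('')
    return '\n'.join(lines)


# Hypothetical source file contents, used only for this illustration.
SAMPLE = '''
def greet():
    print(_('Welcome to the list'))
    print('not translated')
    print(_('Goodbye'))
'''

pairs = extract_marked_strings(SAMPLE)
print(format_pot_entries(pairs, 'sample.py'))
```

Only the two strings wrapped in _() are collected; the unmarked string is ignored, which is why marking matters. A real extractor also writes a catalog header and escapes newlines, among other details omitted here.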

Mailman has adopted the gettext model of marking and translating source strings, and to that end, a GNU gettext-like standard module was implemented for Python [GettextModule]. While the gettext module implements the same global translation model as the C library, two elaborations were necessary for a more Pythonic interface.

First, for long-running daemon processes such as Mailman 2.1's mail processor, multiple language contexts are required, so the global state implied by gettext isn't always appropriate. Here's an example to illustrate why.

When a new member subscribes to a mailing list, two notification messages can be sent. One is a welcome message sent to the member, and the other a new member notification sent to the list administrator. If the list's preferred language is Spanish, but the user prefers German, these two notifications will be sent out in two different languages. Since a single process crafts and sends both notifications, simply using _() wrapping doesn't give enough information. Which language should the underscore function translate its message id to?

Python solves this problem by providing an object-oriented API in addition to gettext's traditional functional API. Using the object interface, a program can create instances which represent the translation context; in other words, a single target language catalog is fully encapsulated in an object. For convenience, this object can be stored in some global context, and in the Mailman source, this global object can be saved and restored as necessary. Here is a simplified Python example:

# The list's preferred language is in
# effect right now
saved = i18n.get_translation()
try:
    i18n.set_language(
        users_preferred_language)
    send_user_notification()
finally:
    i18n.set_translation(saved)
send_admin_notification()

The second problem might be termed syntactic sugar or simple convenience, but it turns out to be extremely important in a Python program filled with translatable text. Python strings support variable substitution (also called ``interpolation''), whereby a dictionary can be used to supply the substitutions. For example:

listname = get_listname()
member = get_username()
d = {'listname': listname,
     'member': member,
     }
print(_('%(member)s has been '
        'subscribed to %(listname)s') % d)

This is a critically important feature for internationalized programs because some languages may require a different order of the substitutions to be grammatically correct. While stock Python supports this requirement, its implementation leads to overly verbose code. In the above example, we've written the words ``listname'' and ``member'' four times each. Now imagine that level of verbosity duplicated a hundred times per source file. ``Tedious'' comes to mind!
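
A small sketch shows why named substitutions matter for word order. The values and the reordered string are assumptions: the reordered text is an English stand-in for a translation whose grammar puts the list name first.

```python
d = {'listname': 'test-list', 'member': 'anne'}

source_text = '%(member)s has been subscribed to %(listname)s'
# Hypothetical "translation" that mentions the list before the member.
reordered_text = 'The list %(listname)s has a new subscriber: %(member)s'

named_source = source_text % d
named_reordered = reordered_text % d

# With positional placeholders, the order is fixed at the call site, so a
# translator who reorders the sentence silently swaps the values.
positional_reordered = ('The list %s has a new subscriber: %s'
                        % ('anne', 'test-list'))

print(named_source)
print(named_reordered)
print(positional_reordered)
```

The named form stays correct however the translator arranges the sentence, while the positional form reports a list called ``anne''.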

Mailman solves this by providing its own underscore function, which wraps the standard gettext function but adds a little bit of useful magic by looking up substitution variables in the local and global namespaces of the caller. Using Mailman's special underscore function, the above code can then be rewritten as:

listname = get_listname()
member = get_username()
print(_('%(member)s has been '
        'subscribed to %(listname)s'))

While the average Perl programmer might ask what all the fuss is about, the Python programmer will notice something interesting: there's no interpolation dictionary and no modulo operator. The dictionary is created from the namespaces of the caller of the underscore function, which contain the ``listname'' and ``member'' local variables. The trick is that the underscore function uses a little-known Python function called sys._getframe() to capture the global and local namespaces of its caller. It then merges these into an interpolation dictionary, with local variables overriding global variables, and applies the modulo operator to the translated string using this dictionary.
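
The mechanism just described can be sketched in a few lines. This is a simplified illustration, not Mailman's actual underscore function: translate() is a hypothetical placeholder for the catalog lookup, and the sample functions and values are assumptions.

```python
import sys


def translate(message):
    """Placeholder for the catalog lookup; returns the message id as-is."""
    return message


def _(message):
    frame = sys._getframe(1)            # frame of whoever called _()
    namespace = dict(frame.f_globals)   # start with the caller's globals
    namespace.update(frame.f_locals)    # locals override globals
    return translate(message) % namespace


listname = 'global-list'


def announce():
    member = 'anne'
    # listname is found in the globals, member in the locals.
    return _('%(member)s has been subscribed to %(listname)s')


def announce_override():
    member = 'bart'
    listname = 'local-list'             # shadows the global of the same name
    return _('%(member)s has been subscribed to %(listname)s')


print(announce())
print(announce_override())
```

Two caveats apply to this sketch: sys._getframe() is a CPython implementation detail that other Python implementations need not provide, and any literal percent sign in a message must be written as %% because every translated string passes through the modulo operator.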

Marked translatable texts are used all over Mailman, and we run pygettext over all the source code to produce a gettext-compatible mailman.pot catalog file. To translate this to a new language, the translation team would start by copying mailman.pot to messages/xx/LC_MESSAGES/mailman.po, where ``xx'' is the language code for the new language. From here, standard tools such as po-mode for Emacs or KDE's kbabel can be used to provide translations for all the source message ids. Then, standard gettext tools can be used to generate a mailman.mo binary file, which Python's gettext module can read. In this way, internationalized Python programs can leverage most of the tools translation teams normally use for C programs. Translators don't have to learn new tools just to translate Python programs.
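
As a rough sketch of the final step, the standard gettext module can locate a compiled catalog in the directory layout described above. The load_catalog helper and its default arguments are assumptions for illustration; with fallback=True, a missing mailman.mo yields a translation object that returns each message id unchanged rather than raising an error.

```python
import gettext


def load_catalog(language, localedir='messages', domain='mailman'):
    """Look for <localedir>/<language>/LC_MESSAGES/<domain>.mo."""
    return gettext.translation(domain, localedir=localedir,
                               languages=[language], fallback=True)


# 'xx' is the placeholder language code from the text above.
translation = load_catalog('xx')
_ = translation.gettext
print(_('Welcome'))
```

Once a translator's mailman.mo is in place under messages/xx/LC_MESSAGES/, the same call returns an object backed by that catalog, and no change to the calling code is needed.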


Barry Warsaw 2003-04-08