BSDCon 2002 Paper   
Experiences on an Open Source Translation Effort in Japan
Hiroki Sato, Keitaro Sekine
location of the source tree | en | ja | % |
doc/${LANG}/articles | 26 | 8 | 30.8% |
doc/${LANG}/books | 72 | 47 | 65.3% |
www/${LANG} | 203 | 153 | 75.4% |
src/release/doc/${LANG} | 22 | 7 | 31.8% |
Total | 323 | 215 | 66.6% |
In this section, I describe several problems that have arisen around the translation work. They may be somewhat biased, but they come from real experience.
As far as I know, some translation teams have their own CVS repositories. Doc-jp had its own repository in the past, but merged everything into the FreeBSD CVS repository three years ago because of the time lag before merging the results.
This dual development model undoubtedly has some advantages. Two of them are that the main CVS repository is not damaged in the event of a mistake, and that non-committers can use the repository. Although I cannot say which model is preferable (working in the main repository or in a separate one), both need a certain amount of time to merge their results into the main tree.
Roughly speaking, the release engineering process of FreeBSD is as follows. First, the source tree is ``frozen,'' and only the release engineers can make changes during the freeze period. This period usually lasts from a few weeks to a month while fatal bugs are worked out, and then the release is made. Documentation is prepared in parallel during this process. However, most release-related documents, such as the release notes, are prepared just before the release point, so we are usually pressed for time to translate them.
You might think that the translation need not be included with the release. However, most translation teams would like to see the release documentation in their native language and have it included. Under some circumstances, the translation cannot be prepared in time because of a lack of members and of available time. As shown in Figure 3 and Figure 4, the translation work usually needs at least one or two weeks to catch up with the original documents.
Today, most of the documents in the FreeBSD source tree are marked up as DocBook/SGML or XML, so in order to make them readable, we need a toolchain to process them appropriately. Naturally, doc-jp uses the same toolchain as the FDP. However, this has raised some problems.
As you know, non-English languages have their own encoding schemes. For example, Japanese has EUC-JP, ISO-2022-JP, and ShiftJIS (also known as MSKanji). All of these are multibyte encodings (EUC-JP and ShiftJIS use 8-bit bytes, while ISO-2022-JP switches character sets with escape sequences), and many toolchains do not support them, so we must find a way to make them work.
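As a minimal sketch of what these encodings look like in practice, the following Python fragment encodes one sample word with the standard `euc_jp`, `iso2022_jp`, and `shift_jis` codecs and reports whether any byte has the high bit set. The sample word and the helper name `byte_profile` are chosen here for illustration only; they are not part of the doc-jp toolchain.

```python
# Illustrative only: SAMPLE and byte_profile are assumptions, not doc-jp code.
SAMPLE = "日本語"  # the word "Japanese (language)"


def byte_profile(text, codec):
    """Encode text and report its byte length and whether any byte is >= 0x80."""
    raw = text.encode(codec)
    return {
        "codec": codec,
        "length": len(raw),
        "uses_high_bit": any(b >= 0x80 for b in raw),
    }


profiles = [byte_profile(SAMPLE, c) for c in ("euc_jp", "iso2022_jp", "shift_jis")]

for p in profiles:
    print(p)
```

A tool that strips or misreads the high bit will damage EUC-JP and ShiftJIS text, while a tool that is unaware of escape sequences will misrender ISO-2022-JP, which is why each stage of the toolchain has to be checked separately.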
For example, Jade, the DSSSL engine that can output documents in several formats (HTML, PostScript, PDF, etc.), can produce Japanese output, but its TeX-based backend, which is used for PostScript and PDF output, does not work properly. I am working on this issue, but since the work is not complete, a PostScript or PDF version of the FreeBSD Handbook in Japanese is not available.
You may think that Unicode is the solution. While this may be true, it solves only part of the problem. Few of the currently available tools support Unicode. In addition, from a Japanese point of view, Unicode is not sufficient for representing Kanji characters.
This section discusses the kind of sentences and CVS operations that translators have the most trouble dealing with during translation efforts.
While good sentences often include figures of speech, translators tend to run into trouble when dealing with such complicated expressions. Most translators want sentences to be simple, straightforward, and to the point rather than roundabout. A typical example is a joke. Do you know of any jokes that people across the world can understand? It is extremely difficult to translate jokes into non-English languages and retain the humor.
Another example is slang, which does not appear in dictionaries. It also makes translation work very difficult. I do not insist that jokes be kept out of documents; however, they should be used conservatively so that the text remains easy to understand.
It is also preferable to write full sentences rather than one- or two-word phrases. Some experienced translators can understand such phrases, but they tend to mislead translators. Remember, simple is preferable.
Unfortunately, I cannot give many examples because the reader would need knowledge of both Japanese and English in order to understand them, but I ask that you remember this: if your message is valuable, it can be translated even if it is a post to a mailing list or newsgroup. To increase the chance of reaching people in other countries who are interested in your message, use sentences that they can translate smoothly.
In CVS, original documents should follow the rule of ``carefully separated commits.'' This means that any commit to original documents should be divided into cosmetic changes and content changes. If the two are not divided properly, the diff deltas generated by CVS grow unnecessarily large. Most cosmetic changes have nothing to do with translation, so translation teams always appreciate separate commits because they reduce the amount of work placed upon them.
For example, consider an SGML document that consists of two paragraphs enclosed in <para> elements, as shown below.
    <para>This is a sample document marked up with
      DocBook/SGML. If you are familiar with HTML,
      understanding SGML documents is not
      difficult.</para>

    <para>Now, consider what kinds of difficulties
      there are in management of SGML
      documents.</para>
When a commit is made that changes the spacing of the first paragraph, the document and delta generated by CVS could look like this:
The modified document:
    <para>This is a sample document marked up with
      DocBook/SGML.
      If you are familiar with HTML, understanding
      SGML documents is not difficult.</para>

    <para>Now, consider what kinds of difficulties
      there are in management of SGML
      documents.</para>
The delta:
     <para>This is a sample document marked up with
    -  DocBook/SGML. If you are familiar with HTML,
    -  understanding SGML documents is not
    -  difficult.</para>
    +  DocBook/SGML.
    +  If you are familiar with HTML, understanding
    +  SGML documents is not difficult.</para>
If the ``separated commits'' rule is not followed, the translator must carefully compare deltas like the one above. This is a very difficult and wasteful effort. A simple note such as ``cosmetic changes only'' in the CVS log and a carefully separated commit greatly help us reduce the amount of work in translation.
The rule of ``separate commits'' originated in the FDP for this reason. Recently, I have noticed another situation that gives translators trouble. The following shows the same two paragraphs as in the above example, but in reverse order. This often occurs when documents are being reorganized.
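The comparison a translator has to do by eye can be approximated mechanically. The following is a minimal sketch, assuming that a ``cosmetic'' change means one that alters only whitespace; the function names are hypothetical, and the line breaks in the sample strings follow those implied by the delta above.

```python
import re


def normalize(text):
    """Collapse every run of whitespace to one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def is_cosmetic_only(old, new):
    """Return True when old and new differ at most in whitespace."""
    return normalize(old) == normalize(new)


OLD = """<para>This is a sample document marked up with
  DocBook/SGML. If you are familiar with HTML,
  understanding SGML documents is not
  difficult.</para>"""

NEW = """<para>This is a sample document marked up with
  DocBook/SGML.
  If you are familiar with HTML, understanding
  SGML documents is not difficult.</para>"""

print(is_cosmetic_only(OLD, NEW))
```

Such a check only confirms that nothing but spacing changed; it cannot replace a commit log that states the intent of the change.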
    <para>Now, consider what kinds of difficulties
      there are in management of SGML
      documents.</para>

    <para>This is a sample document marked up with
      DocBook/SGML. If you are familiar with HTML,
      understanding SGML documents is not
      difficult.</para>
This change generates the following delta:
    +<para>Now, consider what kinds of difficulties
    +  there are in management of SGML
    +  documents.</para>
    +
     <para>This is a sample document marked up with
       DocBook/SGML. If you are familiar with HTML,
       understanding SGML documents is not
       difficult.</para>
    -
    -<para>Now, consider what kinds of difficulties
    -  there are in management of SGML
    -  documents.</para>
While this does not seem to be widely known, interchanging paragraphs can confuse translators a great deal. It increases the size of the delta that the translators must examine, and the delta can be very difficult to understand. You can imagine such interchanges occurring in more complicated ways.
I suggest that changes like the one described above be treated as cosmetic changes and committed separately from content changes. CVS logs help very much in these situations, so when such a change is made, please state what sort of change it is in the CVS commit message.
In short, as translators, we hope that those writing documentation will pay more attention to the translation work. If this is done, the documents will be readable by a larger number of people, which is also good for the project.
Up to this point, I have described problems involving both the translation work and the parent project. Next, I will show several ways to make the translation work itself more efficient.
It is difficult for every member of the translation project to determine the status of the original and translated documents, so members sometimes hesitate over which document to work on. If split documents are actively offered on the project's mailing list and similar channels, members can translate and review them immediately without unnecessary trouble.
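A pure reordering can also be recognized mechanically, which may help when a commit log does not say what kind of change was made. The sketch below is an assumption-laden illustration, not an FDP tool: it treats each `<para>` element as a unit, ignores whitespace differences, and uses a hypothetical function name.

```python
import re
from collections import Counter


def paragraphs(doc):
    """Return whitespace-normalized <para> elements in document order."""
    found = re.findall(r"<para>.*?</para>", doc, flags=re.DOTALL)
    return [re.sub(r"\s+", " ", p).strip() for p in found]


def is_reordering_only(old, new):
    """True when both documents hold the same paragraphs in a different order."""
    a, b = paragraphs(old), paragraphs(new)
    return a != b and Counter(a) == Counter(b)


P1 = ("<para>This is a sample document marked up with\n"
      "  DocBook/SGML. If you are familiar with HTML,\n"
      "  understanding SGML documents is not\n"
      "  difficult.</para>")
P2 = ("<para>Now, consider what kinds of difficulties\n"
      "  there are in management of SGML\n"
      "  documents.</para>")

print(is_reordering_only(P1 + "\n\n" + P2, P2 + "\n\n" + P1))
```

Real documents nest elements more deeply than this, so a regular expression over `<para>` is only adequate for flat examples like the one in this paper.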
Original documents that will be translated should be divided into relatively small text fragments and provided to translation project members. In addition, it is better for reviewers to keep the translated and the reviewed document side by side with the original text so they can easily compare the two.
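One way to picture such fragments is the following sketch, which cuts a document at `<para>` boundaries and pairs each original fragment with the translated fragment at the same position. Splitting at `<para>` and pairing by position are assumptions made here for illustration; the function names are hypothetical and this is not how any of the cited interfaces is known to work.

```python
import re


def split_paragraphs(sgml):
    """Return each <para>...</para> element as one review fragment."""
    return [m.group(0) for m in re.finditer(r"<para>.*?</para>", sgml, flags=re.DOTALL)]


def side_by_side(original, translated):
    """Pair fragments by position; a missing counterpart is shown as None."""
    src = split_paragraphs(original)
    dst = split_paragraphs(translated)
    width = max(len(src), len(dst))
    src += [None] * (width - len(src))
    dst += [None] * (width - len(dst))
    return list(zip(src, dst))


rows = side_by_side(
    "<para>First paragraph.</para>\n<para>Second paragraph.</para>",
    "<para>最初の段落。</para>",
)
for en, ja in rows:
    print(en, "|", ja)
```

A fragment paired with `None` is one that still needs translation, which is the kind of status a reviewer wants to see at a glance.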
In doc-jp, translators need to fetch a target document via CVS, but doing so is sometimes difficult if the translator has no experience with CVS. Thus, an interface that supplies translators with target documents and with translated documents to be reviewed should be prepared. For instance, there is an experimental one for doc-jp[6], and other projects have similar facilities[7-9]. In particular, [7] is more functional since it allows translations to be reserved.
Finally, older translated documents should be marked so that readers are aware of their status. As mentioned earlier, obsolete documents are nothing but harmful for everyone. In doc-jp, a revision checker[10] is used in the build process of the translated documents.
The revision check mechanism implemented by [10] is quite simple. Every original document has a CVS ID line like this (it is actually a single line):
    $FreeBSD: doc/en_US.ISO8859-1/books/handbook/book.sgml,v 1.119 2001/11/19 11:38:45 murray Exp $
We also give each translated document a line indicating the revision of its parent document, as shown below:
    Original revision: 1.119
    $FreeBSD: doc/ja_JP.eucJP/books/handbook/book.sgml,v 1.70 2001/10/27 18:12:06 hrs Exp $
The revision checker compares the CVS ID of the original document with the ``Original revision'' line in the translated document, and the result is reflected in the definition of a parameter entity called %rev.diff; as either ``IGNORE'' or ``INCLUDE,'' which is used for an SGML marked section. Specifically, when the two revisions match:
    <!ENTITY % rev.diff 'IGNORE'>
    ...
    <![ %rev.diff; [
    this document is obsoleted!
    ]]>
and when they do not match:
    <!ENTITY % rev.diff 'INCLUDE'>
    ...
    <![ %rev.diff; [
    this document is obsoleted!
    ]]>
When the documents are rebuilt, this definition is included in each document, so the documents themselves can notify the reader and maintainer that the translation is not up to date.
The important things are: 1) keep translated documents as up-to-date as possible, and 2) if circumstances do not allow this, notify the readers that the translated document is not up-to-date. The simple revision checker described above does just that.
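The comparison described above can be sketched in a few lines of Python. This is not the actual checker [10], whose implementation is not shown in this paper; the regular expressions and the function name `rev_diff_entity` are assumptions made for illustration, based only on the two sample lines given earlier.

```python
import re

# Assumed pattern for the revision number in a CVS ID line such as
# "$FreeBSD: path,v 1.119 2001/11/19 ... $".
CVS_ID = re.compile(r"\$FreeBSD: \S+,v (\d+(?:\.\d+)+) ")
# Assumed pattern for the marker line maintained in the translation.
ORIG_REV = re.compile(r"Original revision:\s*(\d+(?:\.\d+)+)")


def rev_diff_entity(original, translation):
    """Return the rev.diff entity declaration for one document pair."""
    current = CVS_ID.search(original)
    based_on = ORIG_REV.search(translation)
    if current is None or based_on is None:
        raise ValueError("revision information not found")
    # IGNORE hides the "obsoleted" notice; INCLUDE makes it appear.
    state = "IGNORE" if current.group(1) == based_on.group(1) else "INCLUDE"
    return "<!ENTITY % rev.diff '{}'>".format(state)


ORIGINAL = ("$FreeBSD: doc/en_US.ISO8859-1/books/handbook/book.sgml,v "
            "1.119 2001/11/19 11:38:45 murray Exp $")
TRANSLATION = ("Original revision: 1.119\n"
               "$FreeBSD: doc/ja_JP.eucJP/books/handbook/book.sgml,v "
               "1.70 2001/10/27 18:12:06 hrs Exp $")

print(rev_diff_entity(ORIGINAL, TRANSLATION))
```

Because the result is only an entity declaration, the notice itself stays in the SGML source and the build decides whether the reader sees it.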
During translation, we often wonder, especially about technical terms, ``what does this word mean?'' To make things easier, we maintain a translation word list. Generally, it is a list that pairs original and translated words on a word-by-word basis. Personally, I think this approach has problems and is not sufficient.
First, maintenance of the list is relatively hard work. While many people translate documents, how do we determine which words should be candidates for the list? We have to discuss it, and the discussion usually takes quite a bit of time. Moreover, an objective answer cannot always be reached.
Second, the translation of a sentence always depends on its context. Because the word list does not include context, it can mislead translators.
However, there is no doubt that the word list is useful for keeping translated words consistent. Its primary disadvantage is that it increases the project's workload, and our goal is not to make a comprehensive word list.
I am designing an alternative that identifies already translated documents and includes a full-text search engine. Although it is not finished at the time of writing, this method will allow translators to search for a word and obtain output that includes a translated example sentence. Since the results of translation work are used as a database, I believe that the trouble described above can be somewhat relieved.
In this report, I have described characteristics and problems specific to translation work through my own experiences. It is a mistake to think little of translation efforts or to regard them as ordinary software development.
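The idea of looking up a term and getting translated example sentences back can be illustrated with a very small sketch. It does not reflect the unfinished design itself: a linear substring scan stands in for a real full-text search engine, and the sentence pairs, including their Japanese renderings, are hypothetical samples written for this illustration.

```python
def find_examples(term, pairs, limit=3):
    """Return up to `limit` (original, translation) pairs whose original contains term."""
    needle = term.lower()
    hits = [(src, dst) for src, dst in pairs if needle in src.lower()]
    return hits[:limit]


# Hypothetical aligned sentence pairs, standing in for already translated documents.
PAIRS = [
    ("The source tree is frozen before a release.",
     "リリース前にソースツリーはフリーズされます。"),
    ("Only committers can commit to the repository.",
     "リポジトリにコミットできるのはコミッターだけです。"),
]

for src, dst in find_examples("frozen", PAIRS):
    print(src)
    print(dst)
```

Unlike a word list, each hit carries its surrounding sentence, so the translator sees the term in context.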
I think that the translation efforts in various projects need much more technical cooperation and information exchange about their efficient management. The majority of frameworks can be shared, and the maintenance of them, such as word lists and style guides, can be done cooperatively instead of reinventing the wheel. The primary objective of the work is translation and not providing infrastructure itself.
With such a goal in mind, in early 2001 I proposed a project called the ``Doc-ja Archive Project[11],'' which supports various Japanese translation projects. Unfortunately, the project has obtained virtually no results thus far, but we hope it will become a place for discussion of translation efforts in Japan.
I thank the Japan FreeBSD Users Group and the FreeBSD Japanese Documentation Project for supporting my translation activities.
This paper was originally published in the
Proceedings of the BSDCon '02 Conference on File and Storage Technologies, February 11-14, 2002, Cathedral Hill Hotel, San Francisco, California, USA.
Last changed: 28 Dec. 2001 ml