ISO C Amendment 1 (MSE)
David Lindner and Finnbarr Murphy
The Single UNIX Specification, Version 2 includes in its System Interfaces Specification (XSH) the ISO/IEC 9899:1990/Amendment 1:1995 (E) to ISO/IEC 9899:1990, Programming Languages C (ISO C). This paper is a brief introduction to this extension. It is assumed that the reader is familiar with the C language and has some basic understanding of internationalization concepts and character encoding methods.
ISO C Amendment 1 (MSE) was part of the first amendment made to the ISO C standard. The MSE consists of a set of library functions that provide a relatively complete and consistent set of functions for application programming using multibyte and wide characters.
The other major items included in this amendment are digraphs, alternate spellings for several C tokens, and the header <iso646.h>. These items are not discussed here since they are outside the scope of this paper.
The ISO C standard laid some groundwork for multibyte and wide character programming by providing a small number of multibyte and wide character functions. The working group decided to wait for the C developer community to acquire more experience with implementing multibyte and wide character libraries before extending this model further.
A working group (ISO/JTC1/SC22/WG14) was set up to study the various existing implementations and developed the Multibyte Support Extension as part of the first amendment (called C Integrity) to the ISO C standard.
The System Interfaces Specification, XSH, Issue 4, Version 2, which was developed in 1994, incorporated a draft version of the MSE. XSH, Issue 5 incorporates the final version of the MSE.
We traditionally think of characters as one-byte entities represented by the char data type. This model is simple, but it allows for a maximum of 256 distinct characters.
In the MSE model, the concept of a character has been extended. Extended characters can be represented in three ways: as multibyte characters, as wide characters, and as generalized multibyte characters.
A multibyte character is a sequence of one or more bytes that can be represented as an array of type char; in other words, a single character may occupy one or more consecutive bytes. An example of such an encoding is EUC (Extended UNIX Code). EUC provides a structure by which any number of codesets may be encoded into a multibyte encoding.
The primary advantage of the one byte/one character model is that it is very easy to process data in fixed-width chunks. For this reason, the concept of the wide character was invented. A wide character is an abstract data type large enough to contain the largest character that is supported on a particular platform. To date, most system implementors have chosen 32 bits, although there are implementations with 16-bit and 8-bit wide characters. Because the wide character is an abstract type, its size is not guaranteed to be the same across all platforms, even though many vendors have chosen 32 bits.
To support the concept of wide characters, the MSE defines the integral type wchar_t. However, it does not define the size of wchar_t, but states it shall be as wide as necessary to hold the largest character in the code sets of the locales that an implementation supports.
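Because the size of wchar_t is left to the implementation, a program that cares about it has to ask. The following is a minimal sketch of such a query; the helper names wchar_bits() and describe_wchar() are ours and are not part of the MSE.

```c
#include <assert.h>
#include <limits.h>  /* CHAR_BIT */
#include <stdio.h>   /* snprintf */
#include <wchar.h>   /* wchar_t */

/* Returns the width of wchar_t, in bits, on this implementation. */
unsigned wchar_bits(void)
{
    return (unsigned)(sizeof(wchar_t) * CHAR_BIT);
}

/* Writes a one-line description of wchar_t into buf.
   Returns the number of characters snprintf() would have written. */
int describe_wchar(char *buf, size_t size)
{
    assert(buf != NULL && size > 0);
    return snprintf(buf, size, "wchar_t is %u bits wide", wchar_bits());
}
```

The value reported will differ between platforms, which is exactly why portable code should not store wide characters in a fixed-size integer type.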
In addition to the traditional concept of the multibyte character, the MSE has added the concept of the generalized multibyte character.
There are many different multibyte encoding schemes, but these can be broken down into three basic categories:
Restartable multibyte encodings are defined such that if you were to process a multibyte data stream, it would be possible to determine the correct separation of characters no matter where you were positioned in the data stream. In the case of stateful encodings, you need one extra piece of information to be able to correctly process characters in the data stream. This extra piece of information is commonly referred to as the state of the data stream.
Why must we be able to unambiguously restart a data stream? If any byte sequence can have more than one meaning as a sequence of characters, then the multibyte code is ambiguous; that is, you could have multiple meanings for the same data stream depending upon where you started in the data stream. For example, the following multibyte encoding is not restartable:
0x41 0x42 0x61 0x62 0x43
In this particular encoding, the combination of 0x61 and 0x62 produces an "F." If we start processing this string at the beginning, all the characters would be processed correctly and the result would be the string:
A B F C
If we start processing the string at 0x62, then the result would be the partial string:
b C
In a restartable encoding, the conversion interfaces would have recognized the 0x62 as an illegal multibyte character, and our program could choose to ignore that illegal character and move on, or perhaps it might try to back up and see if it could form a complete multibyte character.
In restartable multibyte encodings, each byte sequence in a particular encoding scheme stands for one character; the same character regardless of context. Stateful multibyte encoding schemes have a concept of shift state; certain codes called shift sequences effectively change the data stream to a different shift state, and the meaning of byte sequences is changed according to the current shift state.
If we use the same multibyte encoding and make it a stateful encoding, we will introduce two new operators called shift state operators, SS0 and SS1. The default shift state for this particular codeset is SS0. In this example, the 0x61 in its shifted state produces an "F," and in its default state produces an "a":
0x41 0x42 SS1 0x61 SS0 0x43 0x61
Since the default shift state is SS0, the above sequence of bytes should produce the string:
A B F C a
The stateful multibyte encodings are not restartable either, because if we started processing the string after a shift state operator, we could potentially get the wrong string.
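The stateful example can be sketched as a toy decoder. This is an illustration only: the paper does not assign byte values to the shift operators, so the values chosen for SS0 and SS1 below are hypothetical, as is the function name toy_decode(). All bytes other than the shifted 0x61 are assumed to keep their ASCII meaning.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical byte values for the shift operators. */
enum { SS0 = 0x0F, SS1 = 0x0E };

/* Decodes n bytes of the toy stateful encoding into out, which is
   NUL-terminated. start_state is the shift state assumed at the first
   byte. Returns the number of characters produced. */
size_t toy_decode(const unsigned char *in, size_t n, int start_state,
                  char *out, size_t out_size)
{
    int state = start_state;
    size_t count = 0;

    assert(in != NULL && out != NULL && out_size > 0);
    for (size_t i = 0; i < n && count + 1 < out_size; i++) {
        if (in[i] == SS0 || in[i] == SS1) {
            state = in[i];               /* shift operator: no output */
        } else if (state == SS1 && in[i] == 0x61) {
            out[count++] = 'F';          /* shifted meaning of 0x61 */
        } else {
            out[count++] = (char)in[i];  /* default (ASCII) meaning */
        }
    }
    out[count] = '\0';
    return count;
}
```

Decoding the whole sequence from the default state yields "ABFCa". Decoding from the byte after SS1 while assuming the default state yields "aCa" rather than "FCa", which is the failure the text describes.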
Normally, if you try to pass a string containing multibyte characters to a function that does not know about them, such a function treats a string as a sequence of bytes, and interprets certain byte values specially; for example, the null byte, the slash character. Since it is illegal for a multibyte character to use any of the special byte values as part of its encoding, the function should pass it through as if it were a single byte string. (Note: The multibyte encoding may still use the slash or null byte; it just cannot use them as part of another multibyte character.)
This is where the concept of the generalized multibyte encoding arises. Traditionally, we think of multibyte encodings as file code and wide characters as process code, where file code resides on disk and process code is used by an application. This is not to say that multibyte encodings are not used by applications. Indeed many applications today use multibyte encodings routinely, but because they do not require the ability to process characters as discrete chunks they have no need to convert the multibyte encodings to wide characters.
In summary, generalized multibyte encodings can be encoded in any way. The special byte values discussed above have no meaning in generalized multibyte encodings. Functions that have no concept of multibyte encodings would fail if they tried to process generalized multibyte encodings. By defining the concept of generalized multibyte encodings, we provide a method by which we can say a particular file is associated with a particular locale, and can only be processed by specific routines running in this locale. Generalized multibyte encodings are more of a logical grouping than a specific definition. They provide us with a way to associate files with specific locales and codesets, and allow us to safely operate on those files as long as we are in the proper locale. The important restriction is that generalized multibyte characters can never be processed directly; they can exist only on disk. (Note: Processed refers to the parsing routines available in C. Any file may be processed as binary data.)
To take an example of a generalized multibyte encoding, Unicode is a 16-bit codeset that can be found on Windows 95 and Windows NT. One of the problems with Unicode is that it has NULL bytes embedded in its encoding. For example, the string:
a b c
is actually encoded as follows:
0x00 0x61 0x00 0x62 0x00 0x63 0x00 0x00
Those who are familiar with any of the string handling routines in C can see that these routines will have problems with this string. Similarly, if you tried to read this file from a disk as a text file you would have problems. However, with the concept of generalized multibyte encodings we can say this file is associated with a Unicode locale, and the stdio routines can be smart enough to know that when they are in the Unicode locale they can read the Unicode file properly.
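The problem is easy to demonstrate. The sketch below holds the byte sequence shown above in a char array and measures it twice: once with strlen(), which stops at the first null byte, and once with a small helper that counts 16-bit units. The names utf16_abc, byte_string_length(), and unit16_length() are ours, chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The byte sequence from the text: "abc" as 16-bit units, high byte
   first, followed by a 16-bit terminator. */
const char utf16_abc[] = {0x00, 0x61, 0x00, 0x62, 0x00, 0x63, 0x00, 0x00};

/* Length as seen by a byte-oriented routine. */
size_t byte_string_length(const char *s)
{
    assert(s != NULL);
    return strlen(s);
}

/* Length in 16-bit units, counting up to the 16-bit terminator. */
size_t unit16_length(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (s[2 * n] != 0 || s[2 * n + 1] != 0)
        n++;
    return n;
}
```

Here byte_string_length(utf16_abc) is 0, because the very first byte is a null byte, while unit16_length(utf16_abc) is 3.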
The MSE defines two headers to support the new functionality: <wchar.h> and <wctype.h>.
Character Classification and Mapping Functions
Character classification determines whether a particular character code refers to an upper-case alphabetic, lower-case alphabetic, alphanumeric, digit, punctuation, control or space character, or any one of a number of other groupings.
Mapping functions are sometimes called case conversion functions, because the original mapping functions simply mapped upper-case to lower-case and vice versa.
In the past, macros were often used to classify or map character codes. This was possible since the assumption was that an application was dealing with ASCII characters. Today, classification functions are used which classify wide character codes according to the type rules defined by the category LC_CTYPE of the application's current locale.
In the ISO C standard the behavior of character classification functions is affected by the current locale. Some functions have implementation-dependent behavior when not in the POSIX locale. For example, in the POSIX locale, isupper() returns true (non-zero) only for upper-case letters. The MSE contains no description of how the POSIX locale affects the behavior of the above functions, but states that when a character c causes an isxxx(c) function to return true, the corresponding wide character wc shall cause the corresponding wide character function to return true. Note, however, that the converse is not true.
The ISO C standard defines 11 classification (also known as character testing) functions. The MSE defines an analogous set of wide character classification functions, returning non-zero for true and zero for false; for example, iswalnum() is analogous to isalnum().
As the number of defined locales increased, the requirement for additional character classes increased. For example, while a classification function such as isupper() makes perfect sense in the English language, it does not make any sense in a language such as Japanese that has no concept of case. Conversely, a function such as iskana() makes perfect sense for Japanese, but doesn't make any sense in English. For this reason, the MSE defined a number of extensible wide character classification functions wctype(), iswctype(), wctrans(), and towctrans() as general-purpose solutions to this problem.
The wctype() and iswctype() functions are generally used in combination. However, sometimes the wctype() function is used on its own by an application to test whether a character classification is available in a specific locale. If the current setting of the LC_CTYPE category changes between the call to wctype() and the call to iswctype(), the behavior is undefined.
The MSE specifies that the following code segments are equivalent to each other:
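A minimal sketch of this kind of equivalence, assuming the standard property names "alnum" and "tolower": a specific function such as iswalnum(wc) is expected to behave like iswctype(wc, wctype("alnum")), and towlower(wc) like towctrans(wc, wctrans("tolower")). The wrapper names below are ours and exist only to put the two forms side by side.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Specific form: a dedicated classification function. */
int is_alnum_direct(wint_t wc)
{
    return iswalnum(wc) != 0;
}

/* Extensible form: look the class up by name, then test against it. */
int is_alnum_generic(wint_t wc)
{
    return iswctype(wc, wctype("alnum")) != 0;
}

/* Specific and extensible forms of a case mapping. */
wint_t lower_direct(wint_t wc)
{
    return towlower(wc);
}

wint_t lower_generic(wint_t wc)
{
    return towctrans(wc, wctrans("tolower"));
}

/* wctype() on its own: is the named class available in this locale? */
int has_class(const char *name)
{
    assert(name != NULL);
    return wctype(name) != (wctype_t)0;
}
```

A locale-specific class would be tested the same way as "alnum", after first checking with has_class() that the current locale defines it.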
Number Conversion Functions
Three new functions are included to facilitate conversion from wide character strings (also known as wide strings) to a variety of numeric formats. These are the wide character versions of the ISO C functions strtod(), strtol(), and strtoul().
These functions work as follows:
In other than the POSIX locale, implementation-dependent forms of a subject sequence may be supported.
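As a short illustration, the sketch below wraps wcstol() and wcstod(). Like their char-based counterparts, they skip leading white space, convert the longest valid subject sequence, and report where conversion stopped. The wrapper names parse_long() and parse_double() are ours.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Parses a leading integer from a wide string in the given base.
   On return, *rest points at the first wide character not consumed. */
long parse_long(const wchar_t *text, int base, const wchar_t **rest)
{
    wchar_t *end;
    long value;

    assert(text != NULL && rest != NULL);
    value = wcstol(text, &end, base);
    *rest = end;
    return value;
}

/* Parses a leading floating-point number from a wide string. */
double parse_double(const wchar_t *text, const wchar_t **rest)
{
    wchar_t *end;
    double value;

    assert(text != NULL && rest != NULL);
    value = wcstod(text, &end);
    *rest = end;
    return value;
}
```

If no conversion can be performed, the returned value is zero and *rest equals text, which is how a caller can tell "0" from "no digits at all".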
Sixteen new wide character string functions are defined. Most are similar to their char-based counterparts. For example, wcscpy() is analogous to strcpy(), but operates on wide strings. In general, the data types of some parameters differ, but the purpose of the parameters is the same. The comparison functions wcscmp() and wcsncmp() compare two wide character strings by comparing the wide characters based on the character's encoded value, while the wcscoll() function compares each wide character interpreted according to the collating sequence information specified by the LC_COLLATE category of the current locale.
The wcsxfrm() function transforms a wide character string and places the result in an array of wide characters. The transformation is such that if the wcscmp() function is applied to two transformed wide character strings, the result is the same as if the two wide character strings were compared using wcscoll(). Both wide character strings must be transformed using wcsxfrm(). It is invalid to compare a transformed string to a non-transformed string. Note that no function is defined to restore a transformed string to its original layout.
When wide character strings are likely to be compared more than once, it is more efficient to transform them using wcsxfrm(), compare them using wcscmp(), and retain the transformed strings for subsequent comparisons.
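The transform-once, compare-many approach can be sketched as follows. The helper make_sort_key() is a hypothetical name; it first calls wcsxfrm() with a zero length to learn how much space the transformed string needs, then allocates the key and fills it.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Returns a newly allocated transformed copy of s, or NULL on failure.
   The caller frees the result. */
wchar_t *make_sort_key(const wchar_t *s)
{
    size_t need;
    wchar_t *key;

    assert(s != NULL);
    /* With n == 0 the destination may be a null pointer; the return
       value is the length of the transformed string. */
    need = wcsxfrm(NULL, s, 0) + 1;
    if (need == 0)
        return NULL;              /* wcsxfrm() reported (size_t)-1 */
    key = malloc(need * sizeof *key);
    if (key != NULL)
        wcsxfrm(key, s, need);
    return key;
}

/* Reduces a comparison result to -1, 0, or 1. */
int sign_of(int v)
{
    return (v > 0) - (v < 0);
}
```

Comparing two keys with wcscmp() should give a result with the same sign as wcscoll() applied to the original strings, so the keys can be stored alongside the data and reused for every later comparison.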
The MSE also defines a number of wide character array functions. These functions operate on arrays of type wchar_t whose size is specified by a separate count argument. These functions are not affected by locale and all wchar_t values are treated identically, including the null wide character and wide characters not corresponding to valid multibyte characters. Thus, the wmemcmp() function compares each wide character array element using the encoded value of each wide character.
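The difference from the string functions shows up as soon as an array contains a null wide character. In this small sketch (the function name arrays_differ() is ours), wmemcmp() looks at all n elements, whereas wcscmp() would stop at the first null wide character.

```c
#include <assert.h>
#include <wchar.h>

/* Returns 1 if the first n elements of a and b differ anywhere,
   including after an embedded null wide character. */
int arrays_differ(const wchar_t *a, const wchar_t *b, size_t n)
{
    assert(a != NULL && b != NULL);
    return wmemcmp(a, b, n) != 0;
}
```

For the arrays { L'a', L'\0', L'x' } and { L'a', L'\0', L'y' }, wcscmp() reports equality because both strings end after the first element, while arrays_differ() with n set to 3 reports a difference.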
The Input/Output Model
The MSE input/output model assumes that characters are handled as wide characters within an application and stored as multibyte characters in files, and that all the wide character input/output functions begin executing with the stream positioned at the boundary between two multibyte characters.
The definition of a stream was changed to include the concept of an orientation for both text and binary streams. After a stream is associated with a file, but before any operations are performed on the stream, the stream is without orientation. If a wide character input or output function is applied to a stream without orientation, the stream becomes wide-oriented. Likewise, if a byte input or output operation is applied to a stream without orientation, the stream becomes byte-oriented. A new function fwide() is used to determine or alter the orientation of a stream.
Byte input/output functions cannot be applied to a wide-oriented stream and wide character input/output functions cannot be applied to a byte-oriented stream.
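Stream orientation can be observed with fwide(). Called with a mode of zero, it reports the current orientation without changing it; a positive mode requests wide orientation and a negative mode requests byte orientation. The helper orientation() below is our own wrapper, shown as a sketch rather than as part of the MSE.

```c
#include <assert.h>
#include <stdio.h>
#include <wchar.h>

/* Returns 1 if fp is wide-oriented, -1 if byte-oriented, and 0 if the
   stream has no orientation yet. Does not change the orientation. */
int orientation(FILE *fp)
{
    int r;

    assert(fp != NULL);
    r = fwide(fp, 0);
    return (r > 0) - (r < 0);
}
```

A freshly opened stream reports 0. After a call such as fputwc(L'x', fp) it reports 1, and from then on byte functions such as fputc() should not be used on that stream.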
While wide-oriented streams are sequences of wide characters, the external file associated with a wide-oriented stream may be an implementation-dependent multibyte encoding. Furthermore, it is acceptable that the file associated with this stream is a generalized multibyte encoding such as Unicode.
Note that the input/output model does not preclude applications from storing data in external files as wide characters.
As discussed earlier, multibyte character streams may have state-dependent encodings. To handle state-dependent encodings, the MSE includes the concept of a conversion state, associated with each FILE object, that affects the behavior of conversions between multibyte and wide character encodings.
The conversion state information augments the FILE object's information about the current position of the multibyte character stream with information about the parse state for the next multibyte character to be obtained from the stream. For state-dependent encodings, the remembered shift state is part of this parse state. Every wide character input or output function makes use of this state information and updates its corresponding FILE object's conversion state accordingly.
The non-array type mbstate_t is defined to encode the conversion state under the rules of the current locale and provide a character accumulator. This implies that encoding rule information is part of the conversion state. No initialization function is provided to initialize mbstate_t. A zero-valued mbstate_t is assumed to describe the initial conversion state. Such a zero-valued mbstate_t object is said to be unbound. Once a multibyte or wide character conversion function is called with the mbstate_t object as an argument, the object becomes bound and holds the conversion state information which it obtains from the LC_CTYPE category of the current locale. No comparison function is specified for comparing two mbstate_t objects.
The MSE assumes that only wide character input/output functions can maintain consistency between a stream and its corresponding conversion state. Byte input/output functions do not manipulate or use conversion state information. Wide-character input/output functions are assumed to begin processing a stream at the boundary between two multibyte characters. Seek operations reset the conversion state corresponding to the new file position.
The mbsinit() function is specified because many conversion functions treat the initial shift state as a special case and need a portable means of determining whether an mbstate_t object is at initial conversion state.
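Since no initialization function is provided, the usual idiom is to zero the object and, where needed, confirm the result with mbsinit(). The following is a minimal sketch of that idiom; reset_state() and at_initial_state() are hypothetical helper names.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Puts *ps into the initial conversion state by zeroing it. */
void reset_state(mbstate_t *ps)
{
    assert(ps != NULL);
    memset(ps, 0, sizeof *ps);
}

/* Returns 1 if *ps describes the initial conversion state. */
int at_initial_state(const mbstate_t *ps)
{
    assert(ps != NULL);
    return mbsinit(ps) != 0;
}
```

After a complete character has been converted, at_initial_state() is expected to return 1 again for encodings without shift states; it returns 0 while a partial character or a non-initial shift state is being remembered.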
The MSE provides a method to distinguish between an invalid sequence of bytes in a multibyte stream and a valid prefix to a still incomplete multibyte character. Upon encountering such an incomplete multibyte sequence, the functions mbrlen() and mbrtowc() return (size_t)-2 instead of (size_t)-1, and the character accumulator in the mbstate_t object may store the partial character information. This allows applications to convert streams one byte at a time or even to suspend and resume conversion if required. The conversion functions are thus said to be restartable.
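Byte-at-a-time conversion can be sketched as a loop that hands mbrtowc() exactly one byte per call and lets the mbstate_t object carry any partial character between calls. The function name count_chars_bytewise() is ours. Note that in the default "C" locale every byte is typically a complete character, so the (size_t)-2 branch is only exercised after setlocale() has selected a multibyte locale.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Feeds buf to mbrtowc() one byte at a time and counts the complete
   characters seen. Returns (size_t)-1 on an invalid sequence. */
size_t count_chars_bytewise(const char *buf, size_t len)
{
    mbstate_t st;
    size_t count = 0;

    assert(buf != NULL);
    memset(&st, 0, sizeof st);           /* initial conversion state */
    for (size_t i = 0; i < len; i++) {
        wchar_t wc;
        size_t r = mbrtowc(&wc, buf + i, 1, &st);

        if (r == (size_t)-1)
            return (size_t)-1;           /* invalid sequence */
        if (r == (size_t)-2)
            continue;                    /* valid prefix, kept in st */
        count++;                         /* a character was completed */
    }
    return count;
}
```

Because all the context lives in st, the loop could equally be split across several calls, for example when data arrives from a network connection in arbitrary pieces.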
The function btowc() is used to determine whether its argument is a valid multibyte character in the initial shift state, and to return the corresponding wide character. The function returns WEOF if the character has a value of EOF or if it is not a valid multibyte character in the initial shift state.
Similarly, the function wctob() is used to determine whether its argument is a member of the extended character set whose multibyte character representation is a single byte when in the initial shift state, and to return the corresponding single byte character. The function returns EOF if the character does not correspond to a valid multibyte character of length 1 in the initial shift state.
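These two functions make single-byte checks straightforward. The sketch below uses hypothetical wrapper names; it relies only on the documented return values WEOF and EOF.

```c
#include <assert.h>
#include <limits.h>  /* UCHAR_MAX */
#include <stdio.h>   /* EOF */
#include <wchar.h>   /* btowc, wctob, WEOF */

/* Returns 1 if the byte c is, on its own, a complete character in the
   initial shift state of the current locale. */
int is_single_byte_char(int c)
{
    /* btowc() expects EOF or a value representable as unsigned char. */
    assert(c == EOF || (c >= 0 && c <= UCHAR_MAX));
    return btowc(c) != WEOF;
}

/* Returns 1 if wc can be written as exactly one byte in the initial
   shift state. */
int fits_in_one_byte(wint_t wc)
{
    return wctob(wc) != EOF;
}
```

A parser might use fits_in_one_byte() to take a fast path for single-byte characters and fall back to the restartable conversion functions for everything else.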
The MSE specifies a number of restartable functions which take as their last argument a pointer to an object of type mbstate_t. If the pointer is NULL, each function uses its own internal mbstate_t object instead, which is initialized at startup to the initial conversion state. Note that, unlike their corresponding ISO C standard functions, a function's return value does not represent whether the encoding is state-dependent. These functions are mbrlen(), mbrtowc(), wcrtomb(), mbsrtowcs(), and wcsrtombs().
A more detailed explanation of two of the above functions will help to clarify the concept of restartable functions.
The function mbrtowc() inspects at most n bytes to determine the number of bytes needed to complete the next multibyte character. If a multibyte character can be completed, mbrtowc() determines the corresponding wide character and returns it in *pwc. If the corresponding wide character is the null wide character, the conversion state is reset to the initial conversion state. This function returns one of the following:
The function mbsrtowcs() converts a sequence of multibyte characters, beginning in the conversion state described by the object pointed to by ps, from the array indirectly pointed to by src into a sequence of corresponding wide characters pointed to by dst. Conversion continues up to and including a terminating null character, which is also stored in dst. Each conversion takes place as if by a call to the mbrtowc() function. If an error occurs, errno is set to the value of the macro EILSEQ and mbsrtowcs() returns (size_t)-1.
Conversion stops when one of the following occurs:
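A minimal sketch of a whole-string conversion with mbsrtowcs() follows. The wrapper to_wide() and its complete flag are ours. It relies on two behaviors of mbsrtowcs() as we understand them from the ISO C text: the source pointer is set to a null pointer when the terminating null character has been converted, and conversion stops early once len wide characters have been stored.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Converts the multibyte string mbs into dst, which has room for len
   wide characters. Returns the number of wide characters stored, not
   counting any terminator, or (size_t)-1 on an invalid sequence.
   *complete is set to 1 if the terminating null was reached. */
size_t to_wide(const char *mbs, wchar_t *dst, size_t len, int *complete)
{
    mbstate_t st;
    const char *src = mbs;
    size_t n;

    assert(mbs != NULL && dst != NULL && complete != NULL);
    memset(&st, 0, sizeof st);           /* initial conversion state */
    n = mbsrtowcs(dst, &src, len, &st);
    *complete = (n != (size_t)-1 && src == NULL);
    return n;
}
```

When *complete is 0 and no error occurred, the destination was too small; a caller that kept src and st could resume the conversion into a second buffer.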
The wcsftime() function behaves as if the character string generated by the strftime() function is passed to the mbstowcs() function as the character string parameter, and the mbstowcs() function places the result in the wcs parameter of wcsftime(), up to the limit of the number of wide characters specified by maxsize.
This function uses the local time zone information. The format parameter is a wide character string consisting of a sequence of wide character format codes that specify the format of the date and time to be written to wcs.
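As an illustration, the sketch below formats a calendar date with wcsftime() using the wide format string L"%Y-%m-%d". The helper format_date() is a hypothetical name, and only the year, month, and day fields of struct tm are filled in because the chosen format codes use nothing else.

```c
#include <assert.h>
#include <string.h>
#include <time.h>
#include <wchar.h>

/* Formats a date as YYYY-MM-DD into wcs. Returns the number of wide
   characters written, excluding the terminator, or 0 if the result
   does not fit in maxsize wide characters. */
size_t format_date(wchar_t *wcs, size_t maxsize, int year, int month, int day)
{
    struct tm tm;

    assert(wcs != NULL);
    memset(&tm, 0, sizeof tm);
    tm.tm_year = year - 1900;    /* years since 1900 */
    tm.tm_mon = month - 1;       /* months since January */
    tm.tm_mday = day;
    return wcsftime(wcs, maxsize, L"%Y-%m-%d", &tm);
}
```

As with strftime(), a return value of zero means the contents of the output array should not be relied on.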
More information on the Single UNIX Specification, Version 2 can be obtained from the following sources:
About the Authors
David Lindner is a principal engineer with Digital Equipment Corporation and a former member of The Open Group Internationalization Technical Working Group.
Finnbarr P. Murphy is a principal software engineer with Digital Equipment Corporation and is vice-chair of The Open Group Base Technical Working Group.