Hello,

I have started to write a multibyte- and multi-codepoint-aware, byte-based interface for C that tries to address the deficiencies of the current ISO C and POSIX interfaces in this area. There was recently some talk on the POSIX mailing list regarding this topic, and chances are that this issue will be addressed for real if a reference implementation is available. I am far from claiming that the draft I have written over the last ten days (yes, I lost an entire week because I got lost in looking at Plan9, then 9front, and then had to read about ATF(7)) can be anything more than just that, but it is true that something has to happen to make the C and POSIX interfaces Unicode-aware.

I have spent the last almost four hours writing a README for what I have so far, with you in mind as an addressee (and not being a native English speaker, plus slow at such things), so I would like to paste that now. I'll attach a tarball that also includes the complete README (the "File layout" section is missing below).

The next step, after possible adjustments to the interface and addressing the FIXMEs, would be to integrate this into the C library (just read below), meaning that the thread-safe level 2 could be implemented from this code's point of view, plus adding new flags for the string/buffer conversions of printf() and scanf() etc., so that these would optionally work character- instead of byte-wise. (If it were done like that.)

P.S.: Just in case you're wondering, I've asked Christos whether it would be acceptable for me to use the NetBSD copyright header.

Thank you, and ciao from Germany

--steffen

Overview
--------

1. Introduction
2. ISO C99 / POSIX interfaces and their "ctext" mappings
3. Implementation, status and porting discussion

1. Introduction
---------------

Unix continues to be byte-based; the wchar_t wide character family of
functions has not found its way into daily programming practice.
Moreover, the wide character interface family is not capable of dealing
with characters that are formed from sequences of multiple codepoints,
so-called grapheme clusters ("user-perceived characters" [1]). But it is
also much too restricted to truly deal with properties that are -- or can
be, and truly will be in the future -- transported through Unicode text.
The biggest problem for this interface, however, seems to be that Unix /
POSIX continues to be byte-based, slowly drifting towards the byte-based
UTF-8 multibyte character set, and that, to be able to work with data
stored in this format, a round-trip conversion from UTF-8 to wchar_t and
then back to UTF-8 is necessary. Only in the temporary wchar_t based
layer is it possible to work with the data on a per-codepoint level. And
this is insufficient.

What follows is a heavily shortened citation of Tom Christiansen from
Perl.org, from some years ago:

  "Character" can sometimes be a confusing term when it means something
  different to us programmers as it does to users. Code point to mean
  the integer is a lot clearer to us but to no one else. At work I
  often just give in and go along with the crowd and say character for
  the number that sits in a char or wchar_t or Character variable, even
  though of course that's a code point. I only rebel when they start
  calling code units characters, which (inexperienced) Java people tend
  to do, because that leads to surrogate splitting and related errors.

  By grapheme I mean something the user perceives as a single
  character. In full Unicodese, this is an extended grapheme cluster.
  These are code point sequences that start with a grapheme base and
  have zero or more grapheme extenders following it. For our purposes,
  that's *mostly* like saying you have a non-Mark followed by any
  number of Mark code points, the main exception being that a CR
  followed by a LF also counts as a single grapheme in Unicode.

  If you are in an editor and wanted to swap two "characters", the one
  under the user's cursor and the one next to it, you have to deal with
  graphemes not individual code points, or else you'd get the wrong
  answer. Imagine swapping the last two characters of the first string
  below, or the first two characters of the second one:

      contrôlée   contro\x{302}le\x{301}e
      élève       e\x{301}le\x{300}ve

  While you can sometimes fake a correct answer by considering things
  in NFC not NFD, that doesn't work in the general case, as there are
  only a few compatibility glyphs for round-tripping for legacy
  encodings (like ISO 8859-1) compared with infinitely many
  combinations of combining marks. Particularly in mathematics and in
  phonetics, you often end up using marks on characters for which no
  pre-combined variant glyph exists. Here's the IPA for a couple of
  Spanish words with their tight (phonetic, not phonemic)
  transcriptions:

      anécdota  [a̠ˈne̞ɣ̞ð̞o̞t̪a̠]
      rincón    [rĩŋˈkõ̞n]

  NFD:
      ane\x{301}cdota  [a\x{320}\x{2C8}ne\x{31E}\x{263}\x{31E}\x{F0}\x{31E}o\x{31E}t\x{32A}a\x{320}]
      rinco\x{301}n    [ri\x{303}\x{14B}\x{2C8}ko\x{31E}\x{303}n]

  NFC:
      an\x{E9}cdota    [a\x{320}\x{2C8}ne\x{31E}\x{263}\x{31E}\x{F0}\x{31E}o\x{31E}t\x{32A}a\x{320}]
      rinc\x{F3}n      [r\x{129}\x{14B}\x{2C8}k\x{F5}\x{31E}n]

  So combining marks don't "just go away" in NFC, and you really do
  have to deal with them. Notice that to get the tabs right (your
  favorite subject :), you have to deal with print widths, which is
  another place that you get into trouble if you only count code
  points. […] It's something you only want to do once and never think
  about again. :( --tom

  [Full text was under <http://bugs.python.org/issue12729#msg143061>]

It follows that we have exactly two possibilities. Either we extend the
wide character interface so that it is capable of working on sequences,
i.e., multiple adjacent wchar_t codepoints. Or we introduce a new
byte-based interface, one that works with multi-codepoint sequences of
(most likely) multibyte characters.

So why should we pay the cost in time and the space inefficiency of
round-tripping byte->wchar_t->byte, when the temporary intermediate layer
must itself be capable of working on sequences anyway? Especially when we
take into account that Unix drifts towards the omnipresent use of UTF-8,
which was designed to be self-synchronizing, meaning that it is possible
to exactly identify the start of a multibyte sequence and, once found,
how long the forthcoming sequence is. Also, UTF-8 is backward compatible
with US-ASCII, meaning that documents written forty years ago are
implicitly compatible with a UTF-8 text reader that will try to read them
forty years from today.

So the interface presented here takes the byte-based approach. It tries
to map the string (and wide string) interface of ISO C99 and the
extensions of POSIX.1-2008 into a new byte-based interface that is
capable of dealing with multibyte codepoints and multi-codepoint
"characters", i.e., sequences-of-sequences. It tries to address the
problem that such strings represent "bytes" as well as "characters /
graphemes" by extending the meaning of return values that occur in case
of errors, and sometimes by additional, yet optional, arguments. The
interface should look familiar.

It is designed to have several levels of quality-of-implementation,
represented by the TEXT_SUPPORT_LEVEL macro and some additional auxiliary
macros. The main goal was to design these levels of support in a way that
makes it possible to start using this new interface immediately in all
system environments which offer a wide character string interface
compliant with ISO C90, Amendment 1. This non-integrated support level 1
cannot deal with multi-codepoint characters, but extended support levels
will not require interface changes.
Furthermore, should the system represent wide characters as UCS values
(ISO 10646; TEXT_SUPPORT_WUCS is defined), as seems to be the case on all
systems tested so far, it is possible to reliably offer upward-compatible
and extended features, like a better replacement for the wcwidth(3)
function family, already today. Please see the introductory comment in
src/text.h for more on this.

[1] http://www.unicode.org/reports/tr29/

2. ISO C99 / POSIX interfaces and their "ctext" mappings
--------------------------------------------------------

####
int mblen(const char *s, size_t n);
int mblen_l(const char *s, size_t n, locale_t loc);
size_t mbrlen(const char *restrict s, size_t n, mbstate_t *restrict ps);
size_t mbrlen_l(const char *restrict s, size_t n, mbstate_t *restrict ps,
    locale_t loc);
-> txtbound() series, also txtlen()

####
char *stpcpy(char *restrict s1, const char *restrict s2);
char *strcpy(char *restrict s1, const char *restrict s2);
-> unchanged
wchar_t *wcpcpy(wchar_t *restrict ws1, const wchar_t *restrict ws2);
wchar_t *wcscpy(wchar_t *restrict ws1, const wchar_t *restrict ws2);
-> not needed, txtcpy() series returns length

####
char *stpncpy(char *restrict s1, const char *restrict s2, size_t n);
char *strncpy(char *restrict s1, const char *restrict s2, size_t n);
wchar_t *wcpncpy(wchar_t *restrict ws1, const wchar_t *restrict ws2, size_t n);
wchar_t *wcsncpy(wchar_t *restrict ws1, const wchar_t *restrict ws2, size_t n);
-> txtcpy() series

####
int strcasecmp(const char *s1, const char *s2);
int strcasecmp_l(const char *s1, const char *s2, locale_t locale);
int wcscasecmp(const wchar_t *ws1, const wchar_t *ws2);
int wcscasecmp_l(const wchar_t *ws1, const wchar_t *ws2, locale_t locale);
int strncasecmp(const char *s1, const char *s2, size_t n);
int strncasecmp_l(const char *s1, const char *s2, size_t n, locale_t locale);
int wcsncasecmp(const wchar_t *ws1, const wchar_t *ws2, size_t n);
int wcsncasecmp_l(const wchar_t *ws1, const wchar_t *ws2,
    size_t n, locale_t locale);
-> txtcasecmp() series

####
char *strcat(char *restrict s1, const char *restrict s2);
wchar_t *wcscat(wchar_t *restrict ws1, const wchar_t *restrict ws2);
-> txtcat() series

####
char *strchr(const char *s, int c);
wchar_t *wcschr(const wchar_t *ws, wchar_t wc);
-> txtstr() series

####
int strcmp(const char *s1, const char *s2);
int strncmp(const char *s1, const char *s2, size_t n);
int wcscmp(const wchar_t *ws1, const wchar_t *ws2);
int wcsncmp(const wchar_t *ws1, const wchar_t *ws2, size_t n);
-> txtcmp() series

####
int strcoll(const char *s1, const char *s2);
int strcoll_l(const char *s1, const char *s2, locale_t locale);
int wcscoll(const wchar_t *ws1, const wchar_t *ws2);
int wcscoll_l(const wchar_t *ws1, const wchar_t *ws2, locale_t locale);
-> TODO not yet implemented

####
size_t strcspn(const char *s1, const char *s2);
size_t wcscspn(const wchar_t *ws1, const wchar_t *ws2);
-> TODO some kind of find_(first|last)_(not_)?of() set is missing

####
char *strdup(const char *s);
char *strndup(const char *s, size_t size);
wchar_t *wcsdup(const wchar_t *string);
-> txtdup() series
   TODO plain txtdup() doesn't check validity!

####
char *strerror(int errnum);
char *strerror_l(int errnum, locale_t locale);
int strerror_r(int errnum, char *strerrbuf, size_t buflen);
-> No change needed since strerror_r() bails with ERANGE if *buflen* is
   insufficient. XXX For interface clarity there may be wrappers?

####
ssize_t strfmon(char *restrict s, size_t maxsize,
    const char *restrict format, ...);
ssize_t strfmon_l(char *restrict s, size_t maxsize, locale_t locale,
    const char *restrict format, ...);
-> No change needed since an E2BIG error occurs if *maxsize* is
   insufficient. XXX For interface clarity there may be wrappers?
####
size_t strftime(char *restrict s, size_t maxsize,
    const char *restrict format, const struct tm *restrict timeptr);
size_t strftime_l(char *restrict s, size_t maxsize,
    const char *restrict format, const struct tm *restrict timeptr,
    locale_t locale);
size_t wcsftime(wchar_t *restrict wcs, size_t maxsize,
    const wchar_t *restrict format, const struct tm *restrict timeptr);
-> No change needed, since, if *maxsize* is insufficient, "0 shall be
   returned and the contents of the array are unspecified".
   XXX For interface clarity there may be wrappers?

####
size_t strlen(const char *s);
size_t strnlen(const char *s, size_t maxlen);
size_t wcslen(const wchar_t *ws);
size_t wcsnlen(const wchar_t *ws, size_t maxlen);
-> txtlen() series

####
char *strncat(char *restrict s1, const char *restrict s2, size_t n);
wchar_t *wcsncat(wchar_t *restrict ws1, const wchar_t *restrict ws2, size_t n);
-> txtcat() series

####
char *strpbrk(const char *s1, const char *s2);
wchar_t *wcspbrk(const wchar_t *ws1, const wchar_t *ws2);
-> TODO some kind of find_(first|last)_(not_)?of() set is missing

####
char *strptime(const char *restrict buf, const char *restrict format,
    struct tm *restrict tm);
-> The system environment needs to be extended to use the text interface
   when parsing.

####
char *strrchr(const char *s, int c);
wchar_t *wcsrchr(const wchar_t *ws, wchar_t wc);
-> txtstr() series

####
char *strsignal(int signum);
-> XXX For interface clarity there may be a wrapper?
####
size_t strspn(const char *s1, const char *s2);
size_t wcsspn(const wchar_t *ws1, const wchar_t *ws2);
-> TODO some kind of find_(first|last)_(not_)?of() set is missing

####
char *strstr(const char *s1, const char *s2);
wchar_t *wcsstr(const wchar_t *restrict ws1, const wchar_t *restrict ws2);
-> txtstr() series

####
double strtod(const char *restrict nptr, char **restrict endptr);
float strtof(const char *restrict nptr, char **restrict endptr);
long double strtold(const char *restrict nptr, char **restrict endptr);
double wcstod(const wchar_t *restrict nptr, wchar_t **restrict endptr);
float wcstof(const wchar_t *restrict nptr, wchar_t **restrict endptr);
long double wcstold(const wchar_t *restrict nptr, wchar_t **restrict endptr);
-> The system environment needs to be extended to use the text interface
   when parsing.
   TODO TEXT_SUPPORT_LEVEL 3 and above is required to offer txtdigittoi().

####
intmax_t strtoimax(const char *restrict nptr, char **restrict endptr,
    int base);
uintmax_t strtoumax(const char *restrict nptr, char **restrict endptr,
    int base);
intmax_t wcstoimax(const wchar_t *restrict nptr, wchar_t **restrict endptr,
    int base);
uintmax_t wcstoumax(const wchar_t *restrict nptr, wchar_t **restrict endptr,
    int base);
-> The system environment needs to be extended to use the text interface
   when parsing.
   TODO TEXT_SUPPORT_LEVEL 3 and above is required to offer txtdigittoi().

####
char *strtok(char *restrict s1, const char *restrict s2);
char *strtok_r(char *restrict s, const char *restrict sep,
    char **restrict lasts);
wchar_t *wcstok(wchar_t *restrict ws1, const wchar_t *restrict ws2,
    wchar_t **restrict ptr);
-> TODO a strtok() mapping is missing; better some kind of strsep() instead?
####
long strtol(const char *restrict str, char **restrict endptr, int base);
long long strtoll(const char *restrict str, char **restrict endptr,
    int base);
unsigned long strtoul(const char *restrict str, char **restrict endptr,
    int base);
unsigned long long strtoull(const char *restrict str, char **restrict endptr,
    int base);
long wcstol(const wchar_t *restrict nptr, wchar_t **restrict endptr,
    int base);
long long wcstoll(const wchar_t *restrict nptr, wchar_t **restrict endptr,
    int base);
unsigned long wcstoul(const wchar_t *restrict nptr, wchar_t **restrict endptr,
    int base);
unsigned long long wcstoull(const wchar_t *restrict nptr,
    wchar_t **restrict endptr, int base);
-> The system environment needs to be extended to use the text interface
   when parsing.
   TODO TEXT_SUPPORT_LEVEL 3 and above is required to offer txtdigittoi().

####
size_t strxfrm(char *restrict s1, const char *restrict s2, size_t n);
size_t strxfrm_l(char *restrict s1, const char *restrict s2, size_t n,
    locale_t locale);
size_t wcsxfrm(wchar_t *restrict ws1, const wchar_t *restrict ws2, size_t n);
size_t wcsxfrm_l(wchar_t *restrict ws1, const wchar_t *restrict ws2, size_t n,
    locale_t locale);
-> TODO xfrm() is missing; while at it, it should be considered whether
   the result of xfrm() should really be usable via txtcmp()!
   TODO Better would be a special function, like xcoll().
   TODO *Very* complicated, and Unicode collation is in permanent
   transition!
####
int wcswidth(const wchar_t *pwcs, size_t n);
int wcwidth(wchar_t wc);
-> txtwidth() series

####
int toascii(int c);
-> TODO a mapping for this is missing

####
int tolower(int c);
int tolower_l(int c, locale_t locale);
wint_t towlower(wint_t wc);
wint_t towlower_l(wint_t wc, locale_t locale);
int toupper(int c);
int toupper_l(int c, locale_t locale);
wint_t towupper(wint_t wc);
wint_t towupper_l(wint_t wc, locale_t locale);
-> tot*() series

####
int isalnum(int);   int isalnum_l(int, locale_t);
int isalpha(int);   int isalpha_l(int, locale_t);
int isascii(int);
int isblank(int);   int isblank_l(int, locale_t);
int iscntrl(int);   int iscntrl_l(int, locale_t);
int isdigit(int);   int isdigit_l(int, locale_t);
int isgraph(int);   int isgraph_l(int, locale_t);
int islower(int);   int islower_l(int, locale_t);
int isprint(int);   int isprint_l(int, locale_t);
int ispunct(int);   int ispunct_l(int, locale_t);
int isspace(int);   int isspace_l(int, locale_t);
int isupper(int);   int isupper_l(int, locale_t);
int isxdigit(int);  int isxdigit_l(int, locale_t);
int iswalnum(wint_t);  int iswalpha(wint_t);  int iswblank(wint_t);
int iswcntrl(wint_t);  int iswctype(wint_t, wctype_t);
int iswdigit(wint_t);  int iswgraph(wint_t);  int iswlower(wint_t);
int iswprint(wint_t);  int iswpunct(wint_t);  int iswspace(wint_t);
int iswupper(wint_t);  int iswxdigit(wint_t);
-> ist*() series

####
*printf() family
-> No changes except a flag for %s that changes the handling of the
   'char const *' argument from byte-wise to character-wise. The '#'
   flag seems sensible, ''' being possibly even better. If this flag is
   set, it is also to be ensured that .* precision stops at a completed
   multibyte character.
   Alternatively: a t_*printf() series with a new %T conversion that
   implicitly works on multibyte. (Or both, using aliases.)

####
*scanf() family
-> Ditto *printf(), but it is possibly better to go with a t_*scanf()
   series because of the many string-related scan formats and the
   necessary whitespace skip, which should/has to use the t_is*()
   series.
####
There needs to be a better way to gain a locale_t argument. Even better
would be some kind of charset_t or so, because the real locale is mostly
of interest for collation, monetary, or time formatting purposes, rather
than for character and digit classification and string transformation in
the Unicode sense.

3. Implementation, status and porting discussion
------------------------------------------------

For a general first overview, please see src/text.h.

- This interface does not support so-called locking shift states, as
  used by, e.g., ISO-2022-JP; this is in line with POSIX:

    6.2 Character Encoding

    A locking-shift encoding (where the state of the character is
    determined by a shift code that may affect more than the single
    character following it) cannot be defined with the current
    character set description file format. Use of a locking-shift
    encoding with any of the standard utilities in the Shell and
    Utilities volume of POSIX.1-2008 or with any of the functions in
    the System Interfaces volume of POSIX.1-2008 that do not
    specifically mention the effects of state-dependent encoding is
    implementation-defined.

    A.6.2 Character Encoding

    Encoding mechanisms based on single shifts, such as the EUC
    encoding used in some Asian and other countries, can be supported
    via the current charmap mechanism. With single-shift encoding, each
    character is preceded by a shift code (SS2 or SS3). A complete EUC
    code, consisting of the portable character set (G0) and up to three
    additional character sets (G1, G2, G3), can be described using the
    current charmap mechanism; the encoding for each character in
    additional character sets G2 and G3 must then include their
    single-shift code. Other mechanisms to support locales based on
    encoding mechanisms such as locking shift are not addressed by this
    volume of POSIX.1-2008.

  Encodings with locking shift states cannot be addressed, even with
  this new multi-codepoint interface.
  It seems that the only possible way to deal with such encodings is
  through a sequential stream interface, like iconv(3), that encodes
  complete lines or other bounded buffers. UTF-8 can be used as a
  self-synchronizing multibyte target of such an encoding.

- The source code of support level 1 should be very portable, and at
  most need some adjustments of preprocessor feature macros in
  src/text.h and interface macros in src/local.h. Further levels of
  course need integration into the locale environment of a system, but
  adjusting those macros may still be sufficient.

- The current code:

  - mbstate_t objects are currently recreated from scratch; we should
    use one or two pages full of them and zero those as a whole, i.e.,
    a "cache". This is important.

  - We are not yet optimized for the MB_CUR_MAX == 1 cases.

  - Most of the tests already need overhauling; tests for the *n*
    versions should be extended so that no NULs are in sight. (The
    first tests that were written, starting at 2013-03-20, place tests
    for each function in one file, instead of doing so family-wise. All
    those tests are suspicious.)

  - The tot(lower|upper)*() series should use the restrict keyword, I
    think. (So far I personally have *never* used that keyword.)

  - Since handling of multibyte/multi-codepoint data is a bit
    distressing, convenience functions like convert-entire-string
    (lower, upper, etc.) may be desirable.

  - errno is not yet used (except for the passed-through case for
    dup()).

  - There is a problem with the interface with respect to illegal
    sequences. For the byte -> wchar_t -> byte round-trip, illegal
    sequences would be detected when the conversion takes place. This
    interface will have to check for them over and over again. However,
    since for higher support levels we would internally work on UTF-8
    only, penalizing other encodings by converting them first (under
    the assumption that in the long run Unix will have switched to
    UTF-8 in its entirety), this is very wasteful.
    UTF-8 is self-synchronizing, which means we could jump over entire
    sequences simply by looking at the leading byte. It also seems very
    redundant to keep checking for invalid UTF-8 sequences. So it is
    desirable to offer an interface series that does not verify
    sequences, assuming that the user performs a validation once
    (txtlen() can be used) and then works with the validated result.

  - FIXME txtdup(), txtcpy() and txtcat() don't validate multibyte
    FIXME characters, whereas txtcmp() necessarily does, and txtlen()
    FIXME does by definition. This should probably, or even more
    FIXME *definitely*, be changed so that all the interfaces behave
    FIXME the same!
    FIXME This is an outdated approach!! They need to be changed!!!
    FIXME See the discussion on the check-less interface after input
    FIXME validation above!!
Attachment:
ctext.tar.gz
Description: GNU Zip compressed data