Hello,

I have started to write a multibyte- and multi-codepoint-aware, byte-based interface for C that tries to address the deficiencies of the current ISO C and POSIX interfaces in this area. There was recently some talk on the POSIX mailing list regarding this topic, and chances are that this issue will be addressed for real if a reference implementation is available. I am far from claiming that the draft I have written over the last ten days (yes, I lost an entire week because I got lost in looking at Plan9, then 9front, and then had to read about ATF(7)) can be anything more than just that, but it is true that something has to happen to make the C and POSIX interfaces Unicode-aware.

I have spent the last almost four hours writing a README for what I have so far, with you in mind as an addressee (and not being a native English speaker, plus slow at such things), so I would like to paste that now. I'll attach a tarball that also includes the complete README (the "File layout" section is missing below).

The next step, after possible adjustments to the interface and addressing the FIXMEs, would be to integrate this into the C library (just read below), meaning that the thread-safe level 2 could be implemented from this code's point of view, plus adding new flags for the string/buffer conversions of printf() and scanf() etc., so that these would optionally work character- instead of byte-wise. (If it were done like that.)

P.S.: Just in case you're wondering, I've asked Christos whether it would be acceptable for me to use the NetBSD copyright header.

Thank you, and ciao from Germany

--steffen

Overview
--------

1. Introduction
2. ISO C99 / POSIX interfaces and their "ctext" mappings
3. Implementation, status and porting discussion

1. Introduction
---------------

Unix continues to be byte-based; the wchar_t wide character family of
functions has not found its way into daily programming practice.
Moreover, the wide character interface family is not capable of dealing
with characters that are formed from sequences of multiple codepoints,
so-called grapheme clusters ("user-perceived characters" [1]). But it is
also much too restricted to truly deal with properties that are -- or can
be, and truly will be in the future -- transported through Unicode text.
The biggest problem for this interface, however, seems to be that Unix /
POSIX continues to be byte-based, slowly drifting towards the byte-based
UTF-8 multibyte character set, and that, to be able to work with data
stored in this format, a round-trip conversion from UTF-8 to wchar_t and
then back to UTF-8 is necessary. Only in the temporary wchar_t based
layer is it possible to work with the data on a per-codepoint level. And
this is insufficient.

What follows is a heavily shortened citation of Tom Christiansen from
Perl.org, from some years ago:

  "Character" can sometimes be a confusing term when it means something
  different to us programmers as it does to users. Code point to mean
  the integer is a lot clearer to us but to no one else. At work I
  often just give in and go along with the crowd and say character for
  the number that sits in a char or wchar_t or Character variable, even
  though of course that's a code point. I only rebel when they start
  calling code units characters, which (inexperienced) Java people tend
  to do, because that leads to surrogate splitting and related errors.

  By grapheme I mean something the user perceives as a single
  character. In full Unicodese, this is an extended grapheme cluster.
  These are code point sequences that start with a grapheme base and
  have zero or more grapheme extenders following it. For our purposes,
  that's *mostly* like saying you have a non-Mark followed by any
  number of Mark code points, the main exception being that a CR
  followed by a LF also counts as a single grapheme in Unicode.

  If you are in an editor and wanted to swap two "characters", the one
  under the user's cursor and the one next to it, you have to deal with
  graphemes not individual code points, or else you'd get the wrong
  answer. Imagine swapping the last two characters of the first string
  below, or the first two characters of the second one:

      contrôlée   contro\x{302}le\x{301}e
      élève       e\x{301}le\x{300}ve

  While you can sometimes fake a correct answer by considering things
  in NFC not NFD, that doesn't work in the general case, as there are
  only a few compatibility glyphs for round-tripping for legacy
  encodings (like ISO 8859-1) compared with infinitely many
  combinations of combining marks. Particularly in mathematics and in
  phonetics, you often end up using marks on characters for which no
  pre-combined variant glyph exists. Here's the IPA for a couple of
  Spanish words with their tight (phonetic, not phonemic)
  transcriptions:

      anécdota  [a̠ˈne̞ɣ̞ð̞o̞t̪a̠]
      rincón    [rĩŋˈkõ̞n]

  NFD:
      ane\x{301}cdota  [a\x{320}\x{2C8}ne\x{31E}\x{263}\x{31E}\x{F0}\x{31E}o\x{31E}t\x{32A}a\x{320}]
      rinco\x{301}n    [ri\x{303}\x{14B}\x{2C8}ko\x{31E}\x{303}n]

  NFC:
      an\x{E9}cdota    [a\x{320}\x{2C8}ne\x{31E}\x{263}\x{31E}\x{F0}\x{31E}o\x{31E}t\x{32A}a\x{320}]
      rinc\x{F3}n      [r\x{129}\x{14B}\x{2C8}k\x{F5}\x{31E}n]

  So combining marks don't "just go away" in NFC, and you really do
  have to deal with them. Notice that to get the tabs right (your
  favorite subject :), you have to deal with print widths, which is
  another place that you get into trouble if you only count code
  points. […] It's something you only want to do once and never think
  about again. :( --tom

  [Full text was under <http://bugs.python.org/issue12729#msg143061>]

It follows that we have exactly two possibilities. Either we extend the
wide character interface so that it is capable of working on sequences,
i.e., multiple adjacent wchar_t codepoints. Or we introduce a new
byte-based interface, one that works with multi-codepoint sequences of
(most likely) multibyte characters.

So why should we pay the cost in time and the space inefficiency of
round-tripping byte->wchar_t->byte, when the temporary intermediate layer
must itself be capable of working on sequences anyway? Especially when we
take into account that Unix drifts towards the omnipresent use of UTF-8,
which was designed to be self-synchronizing, meaning that it is possible
to exactly identify the start of a multibyte sequence and, once found,
how long the forthcoming sequence is. Also, UTF-8 is backward compatible
with US-ASCII, meaning that documents written forty years ago are
implicitly compatible with a UTF-8 text reader that will try to read them
forty years from today.

So the interface presented here takes the byte-based approach. It tries
to map the string (and wide string) interface of ISO C99 and the
extensions of POSIX.1-2008 into a new byte-based interface that is
capable of dealing with multibyte codepoints and multi-codepoint
"characters", i.e., sequences-of-sequences. It tries to address the
problem that such strings represent "bytes" as well as "characters /
graphemes" by extending the meaning of return values that occur in case
of errors, and sometimes by additional, yet optional, arguments. The
interface should look familiar.

It is designed to have several levels of quality-of-implementation,
represented by the TEXT_SUPPORT_LEVEL macro and some additional auxiliary
macros. The main goal was to design these levels of support in a way that
makes it possible to start using this new interface immediately in all
system environments which offer a wide character string interface
compliant with ISO C90, Amendment 1. This non-integrated support level 1
cannot deal with multi-codepoint characters, but extended support levels
will not require interface changes.
Furthermore, should the system represent wide characters as UCS values
(ISO 10646; TEXT_SUPPORT_WUCS is defined), as seems to be the case on all
systems tested so far, it is possible to reliably offer upward-compatible
and extended features, like a better replacement for the wcwidth(3)
function family, already today. Please see the introductory comment in
src/text.h for more on this.

[1] http://www.unicode.org/reports/tr29/

2. ISO C99 / POSIX interfaces and their "ctext" mappings
--------------------------------------------------------

####
int mblen(const char *s, size_t n);
int mblen_l(const char *s, size_t n, locale_t loc);
size_t mbrlen(const char *restrict s, size_t n, mbstate_t *restrict ps);
size_t mbrlen_l(const char *restrict s, size_t n, mbstate_t *restrict ps,
    locale_t loc);
-> txtbound() series, also txtlen()

####
char *stpcpy(char *restrict s1, const char *restrict s2);
char *strcpy(char *restrict s1, const char *restrict s2);
-> unchanged
wchar_t *wcpcpy(wchar_t *restrict ws1, const wchar_t *restrict ws2);
wchar_t *wcscpy(wchar_t *restrict ws1, const wchar_t *restrict ws2);
-> not needed, txtcpy() series returns length

####
char *stpncpy(char *restrict s1, const char *restrict s2, size_t n);
char *strncpy(char *restrict s1, const char *restrict s2, size_t n);
wchar_t *wcpncpy(wchar_t *restrict ws1, const wchar_t *restrict ws2, size_t n);
wchar_t *wcsncpy(wchar_t *restrict ws1, const wchar_t *restrict ws2, size_t n);
-> txtcpy() series

####
int strcasecmp(const char *s1, const char *s2);
int strcasecmp_l(const char *s1, const char *s2, locale_t locale);
int wcscasecmp(const wchar_t *ws1, const wchar_t *ws2);
int wcscasecmp_l(const wchar_t *ws1, const wchar_t *ws2, locale_t locale);
int strncasecmp(const char *s1, const char *s2, size_t n);
int strncasecmp_l(const char *s1, const char *s2, size_t n, locale_t locale);
int wcsncasecmp(const wchar_t *ws1, const wchar_t *ws2, size_t n);
int wcsncasecmp_l(const wchar_t *ws1, const wchar_t *ws2,
    size_t n, locale_t locale);
-> txtcasecmp() series

####
char *strcat(char *restrict s1, const char *restrict s2);
wchar_t *wcscat(wchar_t *restrict ws1, const wchar_t *restrict ws2);
-> txtcat() series

####
char *strchr(const char *s, int c);
wchar_t *wcschr(const wchar_t *ws, wchar_t wc);
-> txtstr() series

####
int strcmp(const char *s1, const char *s2);
int strncmp(const char *s1, const char *s2, size_t n);
int wcscmp(const wchar_t *ws1, const wchar_t *ws2);
int wcsncmp(const wchar_t *ws1, const wchar_t *ws2, size_t n);
-> txtcmp() series

####
int strcoll(const char *s1, const char *s2);
int strcoll_l(const char *s1, const char *s2, locale_t locale);
int wcscoll(const wchar_t *ws1, const wchar_t *ws2);
int wcscoll_l(const wchar_t *ws1, const wchar_t *ws2, locale_t locale);
-> TODO not yet implemented

####
size_t strcspn(const char *s1, const char *s2);
size_t wcscspn(const wchar_t *ws1, const wchar_t *ws2);
-> TODO some kind of find_(first|last)_(not_)?of() set is missing

####
char *strdup(const char *s);
char *strndup(const char *s, size_t size);
wchar_t *wcsdup(const wchar_t *string);
-> txtdup() series
   TODO plain txtdup() doesn't check validity!

####
char *strerror(int errnum);
char *strerror_l(int errnum, locale_t locale);
int strerror_r(int errnum, char *strerrbuf, size_t buflen);
-> No change needed since strerror_r() bails with ERANGE if *buflen* is
   insufficient. XXX For interface clarity there may be wrappers?

####
ssize_t strfmon(char *restrict s, size_t maxsize,
    const char *restrict format, ...);
ssize_t strfmon_l(char *restrict s, size_t maxsize, locale_t locale,
    const char *restrict format, ...);
-> No change needed since an E2BIG error occurs if *maxsize* is
   insufficient. XXX For interface clarity there may be wrappers?
####
size_t strftime(char *restrict s, size_t maxsize,
    const char *restrict format, const struct tm *restrict timeptr);
size_t strftime_l(char *restrict s, size_t maxsize,
    const char *restrict format, const struct tm *restrict timeptr,
    locale_t locale);
size_t wcsftime(wchar_t *restrict wcs, size_t maxsize,
    const wchar_t *restrict format, const struct tm *restrict timeptr);
-> No change needed, since, if *maxsize* is insufficient, "0 shall be
   returned and the contents of the array are unspecified".
   XXX For interface clarity there may be wrappers?

####
size_t strlen(const char *s);
size_t strnlen(const char *s, size_t maxlen);
size_t wcslen(const wchar_t *ws);
size_t wcsnlen(const wchar_t *ws, size_t maxlen);
-> txtlen() series

####
char *strncat(char *restrict s1, const char *restrict s2, size_t n);
wchar_t *wcsncat(wchar_t *restrict ws1, const wchar_t *restrict ws2, size_t n);
-> txtcat() series

####
char *strpbrk(const char *s1, const char *s2);
wchar_t *wcspbrk(const wchar_t *ws1, const wchar_t *ws2);
-> TODO some kind of find_(first|last)_(not_)?of() set is missing

####
char *strptime(const char *restrict buf, const char *restrict format,
    struct tm *restrict tm);
-> The system environment needs to be extended to use the text interface
   when parsing.

####
char *strrchr(const char *s, int c);
wchar_t *wcsrchr(const wchar_t *ws, wchar_t wc);
-> txtstr() series

####
char *strsignal(int signum);
-> XXX For interface clarity there may be a wrapper?
####
size_t strspn(const char *s1, const char *s2);
size_t wcsspn(const wchar_t *ws1, const wchar_t *ws2);
-> TODO some kind of find_(first|last)_(not_)?of() set is missing

####
char *strstr(const char *s1, const char *s2);
wchar_t *wcsstr(const wchar_t *restrict ws1, const wchar_t *restrict ws2);
-> txtstr() series

####
double strtod(const char *restrict nptr, char **restrict endptr);
float strtof(const char *restrict nptr, char **restrict endptr);
long double strtold(const char *restrict nptr, char **restrict endptr);
double wcstod(const wchar_t *restrict nptr, wchar_t **restrict endptr);
float wcstof(const wchar_t *restrict nptr, wchar_t **restrict endptr);
long double wcstold(const wchar_t *restrict nptr, wchar_t **restrict endptr);
-> The system environment needs to be extended to use the text interface
   when parsing.
   TODO TEXT_SUPPORT_LEVEL 3 and above is required to offer txtdigittoi().

####
intmax_t strtoimax(const char *restrict nptr, char **restrict endptr,
    int base);
uintmax_t strtoumax(const char *restrict nptr, char **restrict endptr,
    int base);
intmax_t wcstoimax(const wchar_t *restrict nptr, wchar_t **restrict endptr,
    int base);
uintmax_t wcstoumax(const wchar_t *restrict nptr, wchar_t **restrict endptr,
    int base);
-> The system environment needs to be extended to use the text interface
   when parsing.
   TODO TEXT_SUPPORT_LEVEL 3 and above is required to offer txtdigittoi().

####
char *strtok(char *restrict s1, const char *restrict s2);
char *strtok_r(char *restrict s, const char *restrict sep,
    char **restrict lasts);
wchar_t *wcstok(wchar_t *restrict ws1, const wchar_t *restrict ws2,
    wchar_t **restrict ptr);
-> TODO a strtok() mapping is missing; better some kind of strsep() instead?
####
long strtol(const char *restrict str, char **restrict endptr, int base);
long long strtoll(const char *restrict str, char **restrict endptr,
    int base);
unsigned long strtoul(const char *restrict str, char **restrict endptr,
    int base);
unsigned long long strtoull(const char *restrict str, char **restrict endptr,
    int base);
long wcstol(const wchar_t *restrict nptr, wchar_t **restrict endptr,
    int base);
long long wcstoll(const wchar_t *restrict nptr, wchar_t **restrict endptr,
    int base);
unsigned long wcstoul(const wchar_t *restrict nptr, wchar_t **restrict endptr,
    int base);
unsigned long long wcstoull(const wchar_t *restrict nptr,
    wchar_t **restrict endptr, int base);
-> The system environment needs to be extended to use the text interface
   when parsing.
   TODO TEXT_SUPPORT_LEVEL 3 and above is required to offer txtdigittoi().

####
size_t strxfrm(char *restrict s1, const char *restrict s2, size_t n);
size_t strxfrm_l(char *restrict s1, const char *restrict s2, size_t n,
    locale_t locale);
size_t wcsxfrm(wchar_t *restrict ws1, const wchar_t *restrict ws2, size_t n);
size_t wcsxfrm_l(wchar_t *restrict ws1, const wchar_t *restrict ws2, size_t n,
    locale_t locale);
-> TODO xfrm() is missing; while at it, it should be considered whether
   the result of xfrm() should really be usable via txtcmp()!
   TODO Better would be a special function, like xcoll().
   TODO *Very* complicated, and Unicode collation is in permanent
   transition!
####
int wcswidth(const wchar_t *pwcs, size_t n);
int wcwidth(wchar_t wc);
-> txtwidth() series

####
int toascii(int c);
-> TODO a mapping for this is missing

####
int tolower(int c);
int tolower_l(int c, locale_t locale);
wint_t towlower(wint_t wc);
wint_t towlower_l(wint_t wc, locale_t locale);
int toupper(int c);
int toupper_l(int c, locale_t locale);
wint_t towupper(wint_t wc);
wint_t towupper_l(wint_t wc, locale_t locale);
-> tot*() series

####
int isalnum(int);   int isalnum_l(int, locale_t);
int isalpha(int);   int isalpha_l(int, locale_t);
int isascii(int);
int isblank(int);   int isblank_l(int, locale_t);
int iscntrl(int);   int iscntrl_l(int, locale_t);
int isdigit(int);   int isdigit_l(int, locale_t);
int isgraph(int);   int isgraph_l(int, locale_t);
int islower(int);   int islower_l(int, locale_t);
int isprint(int);   int isprint_l(int, locale_t);
int ispunct(int);   int ispunct_l(int, locale_t);
int isspace(int);   int isspace_l(int, locale_t);
int isupper(int);   int isupper_l(int, locale_t);
int isxdigit(int);  int isxdigit_l(int, locale_t);
int iswalnum(wint_t);  int iswalpha(wint_t);  int iswblank(wint_t);
int iswcntrl(wint_t);  int iswctype(wint_t, wctype_t);
int iswdigit(wint_t);  int iswgraph(wint_t);  int iswlower(wint_t);
int iswprint(wint_t);  int iswpunct(wint_t);  int iswspace(wint_t);
int iswupper(wint_t);  int iswxdigit(wint_t);
-> ist*() series

####
*printf() family
-> No changes except a flag for %s that changes the handling of the
   'char const *' argument from byte-wise to character-wise. The '#'
   flag seems sensible, ''' being possibly even better. If this flag is
   set, it is also to be ensured that .* precision stops at a completed
   multibyte character.
   Alternatively: a t_*printf() series with a new %T conversion that
   implicitly works on multibyte. (Or both, using aliases.)

####
*scanf() family
-> Ditto *printf(), but it is possibly better to go with a t_*scanf()
   series because of the many string-related scan formats and the
   necessary whitespace skip, which should/has to use the t_is*()
   series.
####
There needs to be a better way to gain a locale_t argument. Even better
would be some kind of charset_t or so, because the real locale is mostly
of interest for collation, monetary, or time formatting purposes, rather
than for character and digit classification and string transformation in
the Unicode sense.

3. Implementation, status and porting discussion
------------------------------------------------

For a general first overview, please see src/text.h.

- This interface does not support so-called locking shift states, as
  used by, e.g., ISO-2022-JP; this is in line with POSIX:

    6.2 Character Encoding

    A locking-shift encoding (where the state of the character is
    determined by a shift code that may affect more than the single
    character following it) cannot be defined with the current
    character set description file format. Use of a locking-shift
    encoding with any of the standard utilities in the Shell and
    Utilities volume of POSIX.1-2008 or with any of the functions in
    the System Interfaces volume of POSIX.1-2008 that do not
    specifically mention the effects of state-dependent encoding is
    implementation-defined.

    A.6.2 Character Encoding

    Encoding mechanisms based on single shifts, such as the EUC
    encoding used in some Asian and other countries, can be supported
    via the current charmap mechanism. With single-shift encoding, each
    character is preceded by a shift code (SS2 or SS3). A complete EUC
    code, consisting of the portable character set (G0) and up to three
    additional character sets (G1, G2, G3), can be described using the
    current charmap mechanism; the encoding for each character in
    additional character sets G2 and G3 must then include their
    single-shift code. Other mechanisms to support locales based on
    encoding mechanisms such as locking shift are not addressed by this
    volume of POSIX.1-2008.

  Encodings with locking shift states cannot be addressed, even with
  this new multi-codepoint interface.
  It seems that the only possible way to deal with such encodings is
  through a sequential stream interface, like iconv(3), that encodes
  complete lines or other bounded buffers. UTF-8 can be used as a
  self-synchronizing multibyte target of such an encoding.

- The source code of support level 1 should be very portable, and at
  most need some adjustments of preprocessor feature macros in
  src/text.h and interface macros in src/local.h. Further levels of
  course need integration into the locale environment of a system, but
  adjusting those macros may still be sufficient.

- The current code:

  - mbstate_t objects are currently recreated from scratch; we should
    use one or two pages full of them and zero those as a whole, i.e.,
    a "cache". This is important.

  - We are not yet optimized for the MB_CUR_MAX == 1 cases.

  - Most of the tests already need overhauling; tests for the *n*
    versions should be extended so that no NULs are in sight. (The
    first tests that were written, starting at 2013-03-20, place tests
    for each function in one file, instead of doing so family-wise. All
    those tests are suspicious.)

  - The tot(lower|upper)*() series should use the restrict keyword, I
    think. (So far I personally have *never* used that keyword.)

  - Since handling of multibyte/multi-codepoint data is a bit
    distressing, convenience functions like convert-entire-string
    (lower, upper, etc.) may be desirable.

  - errno is not yet used (except for the passed-through case for
    dup()).

  - There is a problem with the interface with respect to illegal
    sequences. For the byte -> wchar_t -> byte round-trip, illegal
    sequences would be detected when the conversion takes place. This
    interface will have to check for them over and over again. However,
    since for higher support levels we would internally work on UTF-8
    only, penalizing other encodings by converting them first (under
    the assumption that in the long run Unix will have switched to
    UTF-8 in its entirety), this is very wasteful.
    UTF-8 is self-synchronizing, which means we could jump over entire
    sequences simply by looking at the leading byte. It also seems very
    redundant to keep checking for invalid UTF-8 sequences. So it is
    desirable to offer an interface series that does not verify
    sequences, assuming that the user performs a validation once
    (txtlen() can be used) and then works with the validated result.

  - FIXME txtdup(), txtcpy() and txtcat() don't validate multibyte
    FIXME characters, whereas txtcmp() necessarily does, and txtlen()
    FIXME does by definition. This should probably, or even more
    FIXME *definitely*, be changed so that all the interfaces behave
    FIXME the same!
    FIXME This is an outdated approach!! They need to be changed!!!
    FIXME See the discussion on the check-less interface after input
    FIXME validation above!!
Attachment:
ctext.tar.gz
Description: GNU Zip compressed data