Subject: Re: fs transcoding, was Re: Unicode support in iso9660.
To: None <tech-kern@NetBSD.org>
From: Ian Lance Taylor <ian@wasabisystems.com>
List: tech-kern
Date: 11/23/2004 12:49:01

der Mouse <mouse@Rodents.Montreal.QC.CA> writes:
> > I just checked SUSv3. It says nothing particularly useful.
>
> > "For a filename to be portable across implementations conforming to
> > IEEE Std 1003.1-2001, it shall consist only of the portable filename
> > character set as defined in Portable Filename Character Set.
>
> That's very interesting information. But it makes me ask, does the SUS
> specify any particular encoding scheme for converting those characters
> into addressing units, or is the encoding left unspecified?

Not really. There is this description of the Portable Character Set
(the Portable Filename Character Set is a subset of this):

"IEEE Std 1003.1-2001 places only the following requirements on the
encoded values of the characters in the portable character set:
  * If the encoded values associated with each member of the
    portable character set are not invariant across all locales
    supported by the implementation, if an application accesses any
    pair of locales where the character encodings differ, or
    accesses data from an application running in a locale which has
    different encodings from the application's current locale, the
    results are unspecified.
  * The encoded values associated with the digits 0 to 9 shall be
    such that the value of each character after 0 shall be one
    greater than the value of the previous character.
  * A null character, NUL, which has all bits set to zero, shall be
    in the set of characters.
  * The encoded values associated with the members of the portable
    character set are each represented in a single byte. Moreover,
    if the value is stored in an object of C-language type char, it
    is guaranteed to be positive (except the NUL, which is always
    zero)."
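
To make those requirements concrete, here is a rough sketch in C of what
an application can rely on. The helper names is_portable_filename() and
digit_value() are made up for illustration and come from neither the
standard nor any library. The first only tests membership in the
Portable Filename Character Set (A-Z, a-z, 0-9, period, underscore,
hyphen); treating an empty name as non-portable is my own choice. The
second leans on the guarantee that the digits are encoded consecutively.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: return nonzero if every byte of name is a member
 * of the Portable Filename Character Set. Only set membership is
 * checked; an empty name is rejected as a design choice of this sketch.
 */
static int is_portable_filename(const char *name)
{
    static const char allowed[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "abcdefghijklmnopqrstuvwxyz"
        "0123456789._-";

    if (name[0] == '\0')
        return 0;
    /* strspn() counts the leading bytes of name that appear in allowed. */
    return name[strspn(name, allowed)] == '\0';
}

/*
 * Hypothetical helper: numeric value of a digit character, or -1.
 * The subtraction works in any conforming encoding because the values
 * of '0' through '9' are required to be consecutive.
 */
static int digit_value(char c)
{
    return (c >= '0' && c <= '9') ? c - '0' : -1;
}
```

Anything outside that set, including any byte of a multi-byte
character, makes is_portable_filename() return 0, which is about all
the standard lets a portable application assume.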

Also, I found this in the rationale:

"At the present time, the primary responsibility for truncating
filenames containing multi-byte characters must reside with the
application. Some industry groups involved in internationalization
believe that in the future the responsibility must reside with the
kernel. For the moment, a clearer understanding of the implications
of making the kernel responsible for truncation of multi-byte
filenames is needed.

Character-level truncation was not adopted because there is no
support in POSIX.1 that advises how the kernel distinguishes between
single and multi-byte characters. Until that time, it must be
incumbent upon application writers to determine where multi-byte
characters must be truncated."
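
As an illustration of what that leaves to the application, here is a
minimal sketch of truncating a name to a byte limit without splitting a
character. It assumes the name is valid UTF-8, which is my assumption
and not something POSIX specifies; the helper name utf8_truncated_len()
is made up. Other encodings would need different boundary logic, and
this does nothing about combining sequences.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: return the length in bytes of the longest prefix
 * of name that fits in max_bytes and does not end in the middle of a
 * UTF-8 sequence. Assumes name is valid UTF-8.
 */
static size_t utf8_truncated_len(const char *name, size_t max_bytes)
{
    size_t len = strlen(name);

    if (len <= max_bytes)
        return len;            /* nothing to cut */

    len = max_bytes;
    /*
     * name[len] is the first byte that would be dropped. If it is a
     * continuation byte (bit pattern 10xxxxxx), the cut falls inside a
     * character, so back up until the cut sits on a lead byte.
     */
    while (len > 0 && ((unsigned char)name[len] & 0xC0) == 0x80)
        len--;
    return len;
}
```

The caller would then copy that many bytes and append the NUL itself.
A kernel doing byte-level truncation has no way to make this choice,
since it does not know which encoding the application had in mind.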

Ian