tech-userlevel archive
Re: Alternative to hash-bang
Hello,
Justin Cormack <justin%specialbusservice.com@localhost> wrote:
|On Jul 19, 2014 4:00 PM, "Steffen Nurpmeso" <sdaoden%yandex.com@localhost> wrote:
|> And because of this last part again i finally come to the conclusion
|> that the UTF-8 BOM will become a vivid part of the future, because
|> it carries information of a file's encoding along with the file as
|> a part of the encoding itself.
|
|UTF8 BOMs are only really used on Windows due to its UTF16 heritage. I have
|never seen them used on a Unix system. That is probably why Perl added
|support. That should not mean the use should be encouraged.
Maybe. Yes. But regarding the first two points i had to learn that
some Unix systems (AIX) also use UTF-16. i don't know how hard IBM,
which i think is a paying core member of the POSIX standard, will try
to push UTF-16 into the standard once that finally moves forward
towards true support for the languages of the world; maybe not at
all (their ICU library seems to be improving its UTF-8 support, though
i think the core is still UTF-16).
|> The real question is: what should be done with BOMs in `$ cat f1
|> f2 > f3', they cannot simply be stripped off?
|
|Write a utfcat command?
Well. A locale modifier like POSIX.UTF-8@BOM wouldn't do the
right thing. Martin Dürst of W3C wrote a few years ago:
  Yes exactly. In the RFC 2070 and HTML4 time-frame, nobody that I know
  was thinking about a BOM for UTF-8. Only later BOMs at the start of
  HTML4 started to turn up, and browser makers were surprised. Roughly the
  same happened for XML. Early XML parsers didn't handle the BOM.
  When Windows notepad started to use the BOM to distinguish between UTF-8
  and "ANSI" (the local system legacy encoding), this BOM leaked into
  HTML, and was difficult to stop. So XML got updated, and parsers started
  to get updated, too.
  ...
  The problem with the BOM in UTF-8 is that it can be quite helpful (for
  quickly distinguishing between UTF-8 and legacy-encoded files) and quite
  damaging (for programs that use the Unix/Linux model of text
  processing), and that's why it creates so much controversy.
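
To make the `cat f1 f2 > f3' question and the "utfcat" suggestion from
above a bit more concrete, here is a minimal sketch in Python. It is
only one possible policy, not an existing tool: the names utfcat,
utfcat_files and keep_first_bom are made up for illustration, and it
assumes that every input is already UTF-8, so that the only BOM of
interest is the three-byte sequence EF BB BF. A BOM at the start of
the first input is kept, while a BOM at the start of any later input
is dropped, so that it does not end up in the middle of the output.

```python
# Hypothetical "utfcat" sketch: concatenate UTF-8 inputs without
# letting a BOM land in the middle of the result.

UTF8_BOM = b"\xef\xbb\xbf"  # U+FEFF encoded as UTF-8


def utfcat(chunks, keep_first_bom=True):
    """Join byte strings, stripping a leading UTF-8 BOM from every
    chunk except (optionally) the very first one."""
    out = []
    for index, data in enumerate(chunks):
        strip = index > 0 or not keep_first_bom
        if strip and data.startswith(UTF8_BOM):
            data = data[len(UTF8_BOM):]
        out.append(data)
    return b"".join(out)


def utfcat_files(paths, out_stream, keep_first_bom=True):
    """Read each path in binary mode and write the joined result to
    out_stream (any object with a write() method taking bytes)."""
    contents = []
    for path in paths:
        with open(path, "rb") as handle:
            contents.append(handle.read())
    out_stream.write(utfcat(contents, keep_first_bom))
```

Something like utfcat_files(["f1", "f2"], sys.stdout.buffer) would
then stand in for `cat f1 f2'. Whether the first BOM should survive,
or whether one should be added when only a later file carries it, is
exactly the open policy question. The sketch just picks one answer.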
--steffen