Re: Unicode to ASCII

To: Netbsd-Users-List <netbsd-users%netbsd.org@localhost>
Subject: Re: Unicode to ASCII
From: Bob Proulx <bob%proulx.com@localhost>
Date: Sat, 20 Feb 2021 21:48:07 -0700

Silas wrote:
> Bob Proulx wrote:
> >    iconv -f UTF-8 -t ASCII//TRANSLIT <filein >fileout
> 
> It seems it is not possible on NetBSD 9.0 iconv :-(

It looks like //TRANSLIT is a GNU glibc extension not available in
NetBSD's version of libc.  Sorry.

> $ echo 'pão' | iconv -f UTF-8 -t ASCII//TRANSLIT
> iconv: iconv_open(ASCII//TRANSLIT, UTF-8): Invalid argument

I can use iconv to translate from one codeset to another, but it
doesn't know how to transliterate.  Transliteration isn't mentioned
anywhere in its documentation.

    man iconv

     -t    Specifies the destination codeset name as to_name.

And that is all it says.  So it can change codesets.

    $ echo 'pão' | iconv -f UTF-8 -t LATIN1 | od -tx1 -c
    0000000   70  e3  6f  0a
               p 343   o  \n

I passed the output through od to show the e3 byte that LATIN1 uses
for the ã, and to avoid the mishmash a raw LATIN1 byte would make in
what will be a UTF-8 mailing.  But I can show that it can be
converted back.

    $ echo 'pão' | iconv -f UTF-8 -t LATIN1 | iconv -f LATIN1 -t UTF-8
    pão

> Is there something that could be installed from pkgsrc (or another
> iconv implementation) to make it work?

For transliteration it looks like you would need the GNU version of
iconv.  Sorry!

    https://manpages.debian.org/buster/manpages/iconv.1.en.html

    -t to-encoding, --to-code=to-encoding
        Use to-encoding for output characters.

        If the string //IGNORE is appended to to-encoding, characters that
        cannot be converted are discarded and an error is printed after
        conversion.

        If the string //TRANSLIT is appended to to-encoding, characters
        being converted are transliterated when needed and possible. This
        means that when a character cannot be represented in the target
        character set, it can be approximated through one or several
        similar looking characters. Characters that are outside of the
        target character set and cannot be transliterated are replaced
        with a question mark (?) in the output.
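
If a script has to run on systems both with and without the GNU
extension, one option is to probe for //TRANSLIT first and fall back
to something cruder when it is missing.  What follows is only a
sketch under assumptions: the function names (have_translit,
sed_fallback, to_ascii) are made up for illustration, and the sed
table is a small hand-written mapping for a few Portuguese letters,
not a real transliteration and not something iconv provides.

```shell
#!/bin/sh
# Sketch only: these helper names are hypothetical, not part of iconv.

# have_translit: succeeds if the local iconv accepts the //TRANSLIT suffix.
# The probe feeds iconv its own one-byte input, so the caller's stdin
# is left untouched.
have_translit() {
    printf 'x' | iconv -f UTF-8 -t ASCII//TRANSLIT >/dev/null 2>&1
}

# sed_fallback: hand-maintained table covering only a few letters.
# Each rule is a literal substitution, so it matches the raw UTF-8
# byte sequence regardless of locale.  Extend the table as needed.
sed_fallback() {
    sed -e 's/ã/a/g' -e 's/á/a/g' -e 's/à/a/g' -e 's/â/a/g' \
        -e 's/é/e/g' -e 's/ê/e/g' -e 's/í/i/g' \
        -e 's/ó/o/g' -e 's/ô/o/g' -e 's/õ/o/g' \
        -e 's/ú/u/g' -e 's/ç/c/g'
}

# to_ascii: read UTF-8 on stdin, write an ASCII approximation on stdout.
to_ascii() {
    if have_translit; then
        iconv -f UTF-8 -t ASCII//TRANSLIT
    else
        sed_fallback
    fi
}
```

Used as a filter it would look like

    $ echo 'pão' | to_ascii

Characters that are not in the sed table pass through unchanged on
the fallback path, so the output is only guaranteed to be ASCII when
the table covers everything in the input.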

Bob

Follow-Ups:
- Re: Unicode to ASCII
  - From: Mark Carroll

References:
- Unicode to ASCII
  - From: Todd Gruhn
- Re: Unicode to ASCII
  - From: Bob Proulx
- Re: Unicode to ASCII
  - From: Silas
