NetBSD-Users archive
Re: Unicode to ASCII
On Fri, 19 Feb 2021, Todd Gruhn wrote:
> I extracted the "text" from a large PDF using a NetBSD prog called
> pdftotext(1).
> I got the desired ASCII text, but it has many occurrences of the sequence
> \x{80}\x{9c} ... \x{80}\x{9d}
> Is there a nice and universal utility that can convert these to ASCII chars?
Those look like Unicode code points rather than UTF-8:
U+809c = https://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=809C
U+809d = https://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=809D
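
One way to check this, rather than guessing from the escaped form, is to
look at the raw bytes in the extracted text. The sketch below builds a small
stand-in file (the sample text and the temp file are assumptions for
illustration, not the real pdftotext output) and dumps it as hex with od(1).
In this sample the curly quotes are stored as the 3-byte UTF-8 sequences
e2 80 9c and e2 80 9d. If the real file shows the same e2 prefix, the
\x{80}\x{9c} pairs would be the tail of UTF-8 encoded quotation marks
(U+201C/U+201D) and not code points by themselves.

```shell
# Stand-in for the pdftotext output; the contents are made up for this sketch.
sample=$(mktemp)
printf 'He said \342\200\234hi\342\200\235.\n' > "$sample"

# Dump every byte as two hex digits, joined onto one line with single spaces.
hex=$(od -An -tx1 "$sample" | tr -d '\n' | tr -s ' ')
echo "$hex"

# Look for the UTF-8 encoding of U+201C (left double quotation mark).
case "$hex" in
    *"e2 80 9c"*) echo "found UTF-8 bytes for U+201C" ;;
    *)            echo "no e2 80 9c sequence found" ;;
esac

rm -f "$sample"
```

To run the same check on the real output, point od at that file in place of
the sample.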
Rather than trying to convert to ASCII (which is either a) nonsensical,
or b) already being done above with the \x{} representation), you
should set your locale to a UTF-8 one and then use a font that
covers the code points you're likely to encounter.
If you set a proper locale then pdftotext can just convert Unicode
to UTF-8 in a "lossless" manner.
Put this minimal set of environment variables in your ~/.xinitrc or
~/.xsession file (I think the NetBSD console doesn't handle UTF-8
natively yet, so the settings below are not useful there):
For a US native:
export LANG=en_US.UTF-8
export LC_CTYPE=$LANG
export LC_ALL=""
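
Which of these variables wins follows the POSIX precedence: a non-empty
LC_ALL overrides everything, then the specific LC_* variable, then LANG.
A minimal sketch of that lookup for LC_CTYPE (effective_ctype is a
hypothetical helper written for this example, and falling back to "C" when
nothing is set is an assumption about the default):

```shell
# effective_ctype LC_ALL LC_CTYPE LANG
# Hypothetical helper: prints the locale name that would apply to LC_CTYPE.
# Empty values are treated the same as unset ones.
effective_ctype() {
    if [ -n "$1" ]; then
        printf '%s\n' "$1"      # LC_ALL overrides everything
    elif [ -n "$2" ]; then
        printf '%s\n' "$2"      # then the category-specific variable
    elif [ -n "$3" ]; then
        printf '%s\n' "$3"      # then LANG
    else
        printf 'C\n'            # assumed default when nothing is set
    fi
}

# With the three settings above (LC_ALL empty, the other two en_US.UTF-8):
effective_ctype "" "en_US.UTF-8" "en_US.UTF-8"

# Against whatever the current shell has exported:
effective_ctype "${LC_ALL:-}" "${LC_CTYPE:-}" "${LANG:-}"
```

On a live system, `locale charmap' should report UTF-8 once the settings
have taken effect.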
For fonts (in xterm):
Use bitmap fonts with the widest glyph coverage:
$ xterm -fn '-misc-*-r-normal--20-*-iso10646-1' \
        -fw '-misc-*-r-normal-ko-18-*-iso10646-1' \
        -fg Black -bg Ivory ...
For TTF fonts, install `noto-ttf' (warning: 800MB+, but, you get
practically every font):
# pkgin install noto-ttf
Then, start xterm like this:
$ xterm -fa 'Noto Mono:style=Regular' \
        -fd 'Noto Sans Mono CJK JP:style=Regular' \
        ...
You can choose other fonts for -fn and -fa (the standard ASCII ones)
if you want.
-RVP