Subject: bin/21645: Localized comments and indent(1)
To: None <gnats-bugs@gnats.netbsd.org>
From: None <mishka@terabyte.com.ua>
List: netbsd-bugs
Date: 05/22/2003 18:57:53
>Number: 21645
>Category: bin
>Synopsis: indent(1) doesn't handle non-English characters in comments
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: bin-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Thu May 22 15:59:01 UTC 2003
>Closed-Date:
>Last-Modified:
>Originator: Mishka
>Release: NetBSD 1.6R
>Organization:
Terabyte ACS
>Environment:
System: NetBSD batraq.anything3d.com 1.6R NetBSD 1.6R (BATRAQ) #0: Fri Apr 25 14:37:48 EEST 2003 mishka@batraq.anything3d.com:/usr/home/mishka/netbsd/src-current/sys/arch/i386/compile/BATRAQ i386
Architecture: i386
Machine: i386
>Description:
Greetings!
indent(1) is an excellent tool for normalizing ugly code,
but it currently mishandles non-English characters in
comments. Sometimes it "eats" them, and sometimes it splits
a comment line at an unreasonable place. English text is
processed correctly.
This happens because, in several places, the character class
is determined by comparing a signed char against a numeric
value with a relational operator, i.e. "<", ">", "<=", ">=".
In some codesets many characters are located in the second
half of the extended ASCII table, and for them such a
comparison gives the wrong result.
For example, in Russian (KOI8-R) the letter "A" has the
code <E1>, which is greater than <7F> (the last character
in the first half of the ASCII table). A *signed* variable
"foo" holding that code, compared to, say, space (code <20>),
will be less than space!
	char foo;

	foo = 0xe1;	/* Cyrillic A in KOI8-R */
	if (foo > 0x20)
		printf("Is character.\n");
	else
		printf("Is control.\n");
Where plain char is signed (as on i386), the fragment above
prints "Is control."
In general, this problem is not limited to indent(1). Moreover,
it would also appear outside comments (fortunately, C code
itself does not use non-English characters).
>How-To-Repeat:
Create any C source file with comments that contain characters
from the upper half of the ASCII table and run indent on it.
Please note: to reproduce the effect, indent has to split
long lines, preferably several times within one
comment.
>Fix:
Please use the patch below. In this case setlocale() is
not strictly needed, but as soon as functions such as
isalnum() are used, it has to be called. Perhaps in some
locales the blank characters are not only ' ' and '\t', who knows?
Index: indent.c
===================================================================
RCS file: /cvsroot/src/usr.bin/indent/indent.c,v
retrieving revision 1.13
diff -u -r1.13 indent.c
--- indent.c 2002/05/26 22:53:38 1.13
+++ indent.c 2003/05/22 15:08:16
@@ -61,6 +61,7 @@
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
+#include <locale.h>
#define EXTERN
#include "indent_globs.h"
#undef EXTERN
@@ -104,6 +105,8 @@
| INITIALIZATION |
\*-----------------------------------------------*/
+ if (!setlocale(LC_ALL, ""))
+ fprintf(stderr, "indent: can't set locale.\n");
hd_type = 0;
ps.p_stack[0] = stmt; /* this is the parser's stack */
Index: pr_comment.c
===================================================================
RCS file: /cvsroot/src/usr.bin/indent/pr_comment.c,v
retrieving revision 1.7
diff -u -r1.7 pr_comment.c
--- pr_comment.c 2002/05/26 22:53:38 1.7
+++ pr_comment.c 2003/05/22 15:08:31
@@ -47,6 +47,7 @@
#include <stdio.h>
#include <stdlib.h>
+#include <ctype.h>
#include "indent_globs.h"
/*
@@ -184,7 +185,7 @@
while (1) { /* this loop will go until the comment is
* copied */
- if (*buf_ptr > 040 && *buf_ptr != '*')
+ if (!iscntrl(*buf_ptr) && *buf_ptr != '*')
ps.last_nl = 0;
CHECK_SIZE_COM;
switch (*buf_ptr) { /* this checks for various spcl cases */
@@ -376,7 +377,8 @@
/* remember we saw a blank */
++e_com;
- if (now_col > adj_max_col && !ps.box_com && unix_comment == 1 && e_com[-1] > ' ') {
+ if (now_col > adj_max_col && !ps.box_com && unix_comment == 1
+ && !iscntrl(e_com[-1]) && !isblank(e_com[-1])) {
/*
* the comment is too long, it must be broken up
*/
@@ -399,7 +401,7 @@
}
*e_com = '\0'; /* print what we have */
*last_bl = '\0';
- while (last_bl > s_com && last_bl[-1] < 040)
+ while (last_bl > s_com && iscntrl(last_bl[-1]) )
*--last_bl = 0;
e_com = last_bl;
dump_line();
--
Best regards,
Mishka.
>Release-Note:
>Audit-Trail:
>Unformatted: