Subject: port-i386/21255: i386 random stack alignment causes wildly inconsistent double-precision performance
To: None <gnats-bugs@gnats.netbsd.org>
From: None <jbernard@mines.edu>
List: netbsd-bugs
Date: 04/21/2003 17:41:01
>Number: 21255
>Category: port-i386
>Synopsis: i386 random stack alignment causes wildly inconsistent double-precision performance
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: port-i386-maintainer
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Mon Apr 21 23:42:01 UTC 2003
>Closed-Date:
>Last-Modified:
>Originator: Jim Bernard
>Release: NetBSD 1.6Q
>Organization:
>Environment:
System: NetBSD nool 1.6Q NetBSD 1.6Q (NOOL-$Revision: 1.35 $) #0: Sat Mar 22 17:59:26 MST 2003 jim@roc:/wd1/var/tmp/compile/sys/arch/i386/compile/NOOL i386
Architecture: i386
Machine: i386
>Description:
The performance of programs that use double-precision floating-point
variables heavily can vary dramatically, depending on accidents of
alignment of stack-resident variables. I'll include a small
demonstration program below that illustrates the effect. With that
program, I observed a 55% increase in execution time when the stack
misaligns the double-precision array, relative to the time when it
is aligned on an 8-byte boundary. This is similar to the slowdown
I've found in a production code written in Fortran. (FWIW, this same
test code exhibits stable, 8-byte alignment of doubles on a Linux/i386
system. I haven't tried it on other BSDs, etc.)
There appear to be two places where the alignment problems occur.
The first is in the execve system call (sys_execve in
sys/kern/kern_exec.c). There the argv array and the environment
array are copied to the stack, and the stack size is rounded up
to the nearest 4-byte boundary, using the ALIGN and ALIGNBYTES
macros from sys/arch/i386/include/param.h. Because the size is
rounded only to a 4-byte boundary, the alignment of stack-resident
double-precision variables in the program can be inefficient,
depending on details of the argument list (e.g., the program name)
and the environment (e.g., PWD and OLDPWD). Variations in these
cause the execution time to vary, seemingly randomly, between a
fast time and a much slower time.
Rounding up the stack size in execve to an 8-byte boundary removes
the variability, but the result is guaranteed misalignment and
slow execution all of the time. Adding an additional 4-byte
increment to the stack size in execve yields correct alignment
and fast execution all the time. So, there is evidently another
place where the stack size is changed by an odd multiple of 4 bytes,
and that change seems to be constant.
I owe many thanks to Sverre Froyen for coming up with several
critical insights that nailed this down. In particular, he
pointed the finger at execve and figured out that the environment
was causing variable execution times, as well as generating a
fix that proved this was the key to the varying alignment.
>How-To-Repeat:
The code below demonstrates the effect. It's a slightly modified
version of a program found in:
http://compilers.iecc.com/comparch/article/00-11-168
originally designed to demonstrate the importance of 8-byte
alignment for double-precision calculations. Here the array is
declared, rather than being allocated with malloc, so we can
observe the effects of execve on alignment of stack-resident
double-precision variables. (Thanks to Sverre for the idea to
declare the array.)
#include <stdio.h>
#define N 10000
int main (int argc, char **argv) {
double *x;
int i, j;
/*
x=(double*)malloc((N+1)*sizeof(double));
*/
double y[N+1];
x = y;
/*
if(argc==2) x=(double*)((int)x+4);
*/
printf("0x%x\n", (int)x);
printf("%d\n", (int)x%8);
for(i=0;i<N;i++) x[i]=(double)i;
for(i=0;i<N;i++) for (j=0;j<N;j++) x[i]=0.5*(x[j]+x[i]);
printf("%f\n", x[N-1]);
exit(0);
}
To observe the effect:
copy the code to a file, say dbl.c
cc -O2 dbl.c
ln a.out a
ln a.out aa
ln a.out aaa
ln a.out aaaa
time ./a
time ./aa
time ./aaa
time ./aaaa
You will notice that two of the executions are fast, and two
are slow. For those that are fast, the code will print a 0
for x%8, indicating 8-byte alignment of x. For those that are
slow, it will print +-4, indicating misalignment. The variation
is due to the different lengths of argv[0]. The numerical results
of all of the calculations will be identical.
You can also try changing your environment (e.g., through a
couple of cd's) to see its effect on the execution times.
Changes in the lengths of PWD and OLDPWD (e.g.) can change
the stack size and thus the alignment of x.
>Fix:
Changing ALIGNBYTES to (sizeof(double) - 1) in param.h and
adding 4 to len (the stack size) in sys_execve in kern_exec.c
(just after len = ALIGN(len)) does the trick, at least if only
the kernel is rebuilt.
But both of these changes may be a bit dangerous, and they
don't seem ideal. Both ALIGN and ALIGNBYTES are used in a
number of places in the kernel, in libc, and elsewhere, and
it may not be desirable to change the alignment in all those
places. I note that the arm port defines STACKALIGN and
STACKALIGNBYTES macros in its param.h, which sound like good
names to use in execve, but they're not used there. Maybe
they should be.
Certainly it would not be friendly to other ports
to arbitrarily add 4 to the stack size in execve, but maybe
adding sizeof(int) or some such would be acceptable.
I don't know what is the source of the additional change in
the stack size by an odd multiple of 4 bytes. I've looked
a bit at crt0.c (lib/csu/i386_elf/crt0.c), which seems like a
good candidate, but I haven't managed to figure out exactly
what its effect on the stack is, so I'll leave it to someone
familiar with that code to sort out whether it is at fault
and whether the fix should go there or in execve.
Also, given that this is probably going to be easy to break
in the future, it might be a good idea to put something like
the test program above into a regression test.
>Release-Note:
>Audit-Trail:
>Unformatted: