Port-macppc archive
Re: Userland instability in NetBSD 6.0.1 MacPPC
>> HOWEVER, I noticed that if I run the exact program with the same
>> input twice, I get different crazy numbers. (!!)
>
>My first inclination would be to suspect flaky hardware.
>
>> This may well be due to some bug in analog where it is referencing
>> some uninitialized data that just happens to be different on every
>> run.
>
>> It occurs to me, though that if single threaded (and analog is old,
>> so I would expect that), even bugs should be deterministic.
>
>True as far as it goes. But...
>
>> I wonder if the "different answers on different runs" might be caused
>> by some OS behavior where it is not properly zeroing new vm pages, or
>> some other anti-social, but not fatally incorrect behavior.
>
>...this, while perhaps possible, is rather unlikely. But there is
>something I've seen called address space layout randomization, which
>tries to put the various pieces of the address space at different
>addresses each run. It's intended, AIUI, to mostly-defeat
>code-injection malware that has fixed addresses and/or offsets wired
>into it. If NetBSD has anything of the sort (you said 6.0.1, so it's
>not a version I know), this could mean that the trash left on the stack
>from one routine call to the next can differ from run to run.
>
>> I have seen some strange behavior that seems non-reproducible, though
>> it's hard to tell when bringing up a new box and debugging 12 things
>> at once.
>
>So true.
>
I ran my test case on my x86_64 VM. I did 60 runs, and none failed.
Rock solid.
I have another non-Quicksilver PPC machine, but I don't have time to pursue
this more right now.
I have packed up my test case into a 55 MByte tgz file at:
ftp mercy.icompute.com pub drop analog.bug.tgz (add slashes for spaces)
(ftp command is cryptic to avoid crawlers finding this file. It's not
high security, but I don't want it smeared all over the net)
Any machine with analog can run the test. The scripts are a little cryptic,
but you change the input and output directories in stattmp, and then
use runN 1 5 to do 5 runs. (The "nums" script is below.)
My nickel is on this being an OS problem of some sort - cache flush, I/O
timing, VM page locking. Something easy to find and fix. <snicker>
I can't see how it could be in the VM or I/O subsystems without showing up
elsewhere, though. It's a mystery......
By the way... It **appears** to happen less frequently when the CPU is
otherwise fairly idle. It seems to trigger more failures if I am pulling one
of the big files into vi while I run the test case. It still only fails about
1 in 10 runs, though. My big production runs that take 2 hours to run _all_
fail.
-dgl-
$ cat ~/bin/nums
#!/bin/ksh
# Print the integers from start to end, one per line.
if [ $# -ne 2 ] ; then
    echo "usage: $0 start end"
    exit 1
fi
start=$1
end=$2
i=$start
while [ $i -le $end ] ; do
    echo $i
    i=`expr $i + 1`
done
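
For anyone who wants to check run-to-run differences without the full
test case, here is a minimal sketch of the same idea: run one command
several times and count how many distinct outputs come back. The function
name count_distinct_outputs is made up for illustration; it is not part of
the tgz and not how runN or stattmp are actually written.

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not from the original test case):
# run a command N times and print the number of distinct outputs seen.
# usage: count_distinct_outputs N command [args...]
count_distinct_outputs() {
    runs=$1
    shift
    i=1
    sums=""
    while [ "$i" -le "$runs" ]; do
        # cksum reads stdin and prints "checksum size"; one line per run.
        sum=$("$@" 2>&1 | cksum)
        sums="$sums$sum
"
        i=$((i + 1))
    done
    # A result of 1 means every run produced identical output.
    printf '%s' "$sums" | sort -u | wc -l | tr -d ' '
}
```

Calling count_distinct_outputs 5 followed by the real analog command line
(whatever stattmp ends up running) should print 1 on a healthy machine.
Anything larger means at least two runs disagreed.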