Subject: Re: macppc trap (-current from yesterday) -- tracked down
To: None <port-macppc@NetBSD.org>
From: Thomas Klausner <wiz@NetBSD.org>
List: port-macppc
Date: 01/05/2005 13:06:19
Hi again.
About the problem I reported earlier:
On Sun, Nov 28, 2004 at 04:22:42PM +0100, Thomas Klausner wrote:
> Yesterday, during a bulk build, I got the following trap:
> trap: kernel read DSI trap @ 0x7c3143c6 by 0x2758e4 (DSISR 0x40000000, err=14)
>
> It was just untarring a tar file served over NFS.
> The setup is approximately:
> shell chrooted in /usr/sandbox
> nearly everything in /usr/sandbox unionfs mounted from /
> (except for /dev, /etc and perhaps some others)
> /usr/pkgsrc/packages is NFS-mounted on a local network,
> and then nullfs-mounted into the sandbox.
> The bulk build in the sandbox was just adding a dependency,
> i.e. untarring an already built package (tar file on NFS,
> target file system local, both via nullfs).
>
> Stopped in pid 28455.1 (tar) at netbsd:cpu_Debugger+0x10
> db> bt
> panic
> trap
> kernel DSI read trap @ 0x7c3143c6 by cache_lookup+0x84
> cache_lookup
> ufs_lookup
> layer_lookup
> lookup
> namei
> rename_files
> syscall_plain
> user SC trap #128 by 0x418737f8: srr1=0xf032 r1=0xffffd330 cr=0x24004082 xer=0 xctr=0x418737f0
>
> I still have the db prompt if you want to know more.
> This is with a kernel without awacs (I took it out after
> I had the same trap with awacs in the kernel, just to remove
> one possible cause).
>
> Any idea what's happening here?
I could, and still can, reproduce this problem with -current kernels
starting sometime in October, while 2.0 and earlier kernels are rock solid.
My test case is just trying to compile firefox-gtk2 locally (no
NFS, no nullfs) -- at some point during the compilation I will
usually get the trap.
Since 2.0 is rock solid (completed a ~120 package bulk build without
problems), I discarded the 'hardware problem' thought and tried to
track the software one down. Now I've narrowed it down to two
commits, of which one is a no-op, and the other one doesn't really
look like it should cause this. Here they are, anyway:
sys/lib/libkern/arc4random.c
revision 1.13
date: 2004/09/17 21:54:28; author: enami; state: Exp; lines: +4 -3
Redo part of rev. 1.10.
Diff:
@@ -217,8 +217,9 @@
buf = (u_int8_t *)p;
- for (i = 0; i < len; buf[i] = arc4_randbyte(), i++);
- arc4_numruns += len / sizeof(u_int32_t);
+ for (i = 0; i < len; buf[i] = arc4_randbyte(), i++)
+ ;
+ arc4_numruns += len / sizeof(u_int32_t);
if ((arc4_numruns > ARC4_MAXRUNS) ||
(mono_time.tv_sec > arc4_tv_nextreseed.tv_sec)) {
arc4_randrekey();
Looks like a no-op to me (note the ';' in the '-' lines).
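To make the "no-op" reading concrete, here is a small userland sketch (not the kernel code) comparing the two loop forms. fake_randbyte(), fill_old(), fill_new() and fills_match() are hypothetical names invented for illustration; fake_randbyte() is a deterministic stand-in for arc4_randbyte() so the two results can be compared.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for arc4_randbyte(): a simple counter. */
static uint8_t next_byte;

static uint8_t
fake_randbyte(void)
{
	return next_byte++;
}

/* Old form: the empty statement sits on the same line as the for. */
static void
fill_old(uint8_t *buf, size_t len)
{
	size_t i;

	for (i = 0; i < len; buf[i] = fake_randbyte(), i++);
}

/* New form: the empty statement is moved to its own line. */
static void
fill_new(uint8_t *buf, size_t len)
{
	size_t i;

	for (i = 0; i < len; buf[i] = fake_randbyte(), i++)
		;
}

/* Returns 1 if both forms fill the first len bytes identically. */
static int
fills_match(size_t len)
{
	uint8_t a[64], b[64];

	assert(len <= sizeof(a));
	next_byte = 0;
	fill_old(a, len);
	next_byte = 0;
	fill_new(b, len);
	return memcmp(a, b, len) == 0;
}
```

Both functions have an empty loop body, so only the layout of the ';' differs. This says nothing about whatever whitespace change touched the arc4_numruns line.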
The other one:
sys/uvm/uvm_page.c
revision 1.100
date: 2004/09/17 20:46:03; author: yamt; state: Exp; lines: +3 -3
make free page queue filo rather than fifo.
data in pages freed more recently are more likely on cpu cache.
sys/uvm/uvm_pglist.c
revision 1.32
date: 2004/09/17 20:46:03; author: yamt; state: Exp; lines: +3 -3
make free page queue filo rather than fifo.
data in pages freed more recently are more likely on cpu cache.
--- uvm/uvm_page.c 1 Sep 2004 11:53:38 -0000 1.99
+++ uvm/uvm_page.c 17 Sep 2004 20:46:03 -0000 1.100
@@ -1427,7 +1427,7 @@
uvm_pagezerocheck(pg);
#endif /* DEBUG */
- TAILQ_INSERT_TAIL(pgfl, pg, pageq);
+ TAILQ_INSERT_HEAD(pgfl, pg, pageq);
uvmexp.free++;
if (iszero)
uvmexp.zeropages++;
--- uvm/uvm_pglist.c 24 Mar 2004 07:47:33 -0000 1.31
+++ uvm/uvm_pglist.c 17 Sep 2004 20:46:03 -0000 1.32
@@ -483,7 +483,7 @@
if (iszero)
uvm_pagezerocheck(pg);
#endif /* DEBUG */
- TAILQ_INSERT_TAIL(&uvm.page_free[uvm_page_lookup_freelist(pg)].
+ TAILQ_INSERT_HEAD(&uvm.page_free[uvm_page_lookup_freelist(pg)].
pgfl_buckets[VM_PGCOLOR_BUCKET(pg)].
pgfl_queues[iszero ? PGFL_ZEROS : PGFL_UNKNOWN], pg, pageq);
uvmexp.free++;
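For illustration, the effect of swapping TAILQ_INSERT_TAIL for TAILQ_INSERT_HEAD can be sketched in userland with <sys/queue.h>. This is only a sketch: struct fake_page, fake_pglist and first_allocated() are hypothetical names, and it assumes the allocator takes pages from the head of the free queue, which is not shown in the diffs above.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for struct vm_page: an id plus queue linkage. */
struct fake_page {
	int id;
	TAILQ_ENTRY(fake_page) pageq;
};

TAILQ_HEAD(fake_pglist, fake_page);

/*
 * "Free" pages 0..n-1 in that order, then "allocate" one page from the
 * head of the queue (assumed allocation policy).  at_head != 0 mimics
 * the new TAILQ_INSERT_HEAD behaviour, at_head == 0 the old
 * TAILQ_INSERT_TAIL behaviour.  Returns the id of the allocated page.
 */
static int
first_allocated(struct fake_page *pages, int n, int at_head)
{
	struct fake_pglist freeq;
	struct fake_page *pg;
	int i;

	assert(n > 0);
	TAILQ_INIT(&freeq);
	for (i = 0; i < n; i++) {
		pages[i].id = i;
		if (at_head)
			TAILQ_INSERT_HEAD(&freeq, &pages[i], pageq);
		else
			TAILQ_INSERT_TAIL(&freeq, &pages[i], pageq);
	}
	pg = TAILQ_FIRST(&freeq);
	TAILQ_REMOVE(&freeq, pg, pageq);
	return pg->id;
}
```

With tail insertion the oldest freed page (id 0) is handed out first. With head insertion the most recently freed page (id n-1) is reused immediately, so any code that still touches a page after freeing it would see that page recycled much sooner.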
Since I assume that the TAILQ_* macros work, and since this doesn't
seem to cause breakage on other archs (at least, not that I have
heard of), I don't know what's wrong. Any ideas?
Thomas