port-macppc: Re: macppc trap (-current from yesterday) -- tracked down

Subject: Re: macppc trap (-current from yesterday) -- tracked down
To: None <port-macppc@NetBSD.org>
From: Thomas Klausner <wiz@NetBSD.org>
List: port-macppc
Date: 01/05/2005 13:06:19

Hi again.

About the problem I reported earlier:
On Sun, Nov 28, 2004 at 04:22:42PM +0100, Thomas Klausner wrote:
> Yesterday, during a bulk build, I got the following trap:
> trap: kernel read DSI trap @ 0x7c3143c6 by 0x2758e4 (DSISR 0x40000000, err=14)
> 
> It was just untarring a tar file served over NFS.
> The setup is approximately:
> shell chrooted in /usr/sandbox
> nearly everything in /usr/sandbox unionfs mounted from /
> (except for /dev, /etc and perhaps some others)
> /usr/pkgsrc/packages is NFS-mounted on a local network,
> and then nullfs-mounted into the sandbox.
> The bulk build in the sandbox was just adding a dependency,
> i.e. untarring an already built package (tar file on NFS,
> target file system local, both via nullfs).
> 
> Stopped in pid 28455.1 (tar) at netbsd:cpu_Debugger+0x10
> db> bt
> panic
> trap
> kernel DSI read trap @ 0x7c3143c6 by cache_lookup+0x84
> cache_lookup
> ufs_lookup
> layer_lookup
> lookup
> namei
> rename_files
> syscall_plain
> user SC trap #128 by 0x418737f8: srr1=0xf032 r1=0xffffd330 cr=0x24004082 xer=0 xctr=0x418737f0
> 
> I still have the db prompt if you want to know more.
> This is with a kernel without awacs (I took it out after
> I had the same trap with awacs in the kernel, just to remove
> one possible cause).
> 
> Any idea what's happening here?

I could and can reproduce this problem with -current kernels starting
sometime in October, while 2.0 and earlier kernels are rock solid.
My test case is just trying to compile firefox-gtk2 locally (no
NFS, no nullfs) -- at some point during the compilation I will
usually get the trap.

Since 2.0 is rock solid (completed a ~120 package bulk build without
problems), I discarded the 'hardware problem' thought and tried to
track the software one down. Now I've narrowed it down to two
commits, of which one is a no-op, and the other one doesn't really
look like it should cause this. Here they are, anyway:

sys/lib/libkern/arc4random.c
revision 1.13
date: 2004/09/17 21:54:28;  author: enami;  state: Exp;  lines: +4 -3
Redo part of rev. 1.10.

Diff:
@@ -217,8 +217,9 @@

        buf = (u_int8_t *)p;

-       for (i = 0; i < len; buf[i] = arc4_randbyte(), i++);
-               arc4_numruns += len / sizeof(u_int32_t);
+       for (i = 0; i < len; buf[i] = arc4_randbyte(), i++)
+               ;
+       arc4_numruns += len / sizeof(u_int32_t);
        if ((arc4_numruns > ARC4_MAXRUNS) ||
            (mono_time.tv_sec > arc4_tv_nextreseed.tv_sec)) {
                arc4_randrekey();

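Looks like a no-op to me: the ';' at the end of the old 'for' line
already terminated the loop, so the indented arc4_numruns update ran
exactly once after it, just as in the new version.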
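
To make the no-op claim concrete, here is a minimal user-space sketch that
runs both layouts of the loop and compares the results. It is not the kernel
code: fake_randbyte(), fill_old(), fill_new(), layouts_agree() and MAXLEN are
hypothetical names invented for this illustration, and a simple counter
stands in for arc4_randbyte() so that both runs see the same byte stream.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAXLEN 64	/* hypothetical upper bound for this sketch */

/* Hypothetical stand-in for arc4_randbyte(): a counter, so runs are repeatable. */
static uint8_t fake_state;
static uint8_t fake_randbyte(void) { return fake_state++; }

/* Layout before rev. 1.13: the trailing ';' is the whole loop body. */
static size_t fill_old(uint8_t *buf, size_t len)
{
	size_t i, numruns = 0;

	for (i = 0; i < len; buf[i] = fake_randbyte(), i++);
		numruns += len / sizeof(uint32_t);	/* runs once, after the loop */
	return numruns;
}

/* Layout after rev. 1.13: same statements, honest indentation. */
static size_t fill_new(uint8_t *buf, size_t len)
{
	size_t i, numruns = 0;

	for (i = 0; i < len; buf[i] = fake_randbyte(), i++)
		;
	numruns += len / sizeof(uint32_t);
	return numruns;
}

/* Returns 1 if both layouts fill the buffer and count runs identically. */
static int layouts_agree(size_t len)
{
	uint8_t a[MAXLEN], b[MAXLEN];
	size_t ra, rb;

	assert(len <= MAXLEN);
	fake_state = 0;
	ra = fill_old(a, len);
	fake_state = 0;
	rb = fill_new(b, len);
	return ra == rb && memcmp(a, b, len) == 0;
}
```

Calling layouts_agree() with any length up to MAXLEN should return 1, which
is consistent with the commit being a pure whitespace fix. A compiler may
warn about the misleading indentation in fill_old(); that is the point of
the cleanup.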

The other one:

sys/uvm/uvm_page.c
revision 1.100
date: 2004/09/17 20:46:03;  author: yamt;  state: Exp;  lines: +3 -3
make free page queue filo rather than fifo.
data in pages freed more recently are more likely on cpu cache.
sys/uvm/uvm_pglist.c
revision 1.32
date: 2004/09/17 20:46:03;  author: yamt;  state: Exp;  lines: +3 -3
make free page queue filo rather than fifo.
data in pages freed more recently are more likely on cpu cache.

--- uvm/uvm_page.c      1 Sep 2004 11:53:38 -0000       1.99
+++ uvm/uvm_page.c      17 Sep 2004 20:46:03 -0000      1.100
@@ -1427,7 +1427,7 @@
                uvm_pagezerocheck(pg);
 #endif /* DEBUG */
 
-       TAILQ_INSERT_TAIL(pgfl, pg, pageq);
+       TAILQ_INSERT_HEAD(pgfl, pg, pageq);
        uvmexp.free++;
        if (iszero)
                uvmexp.zeropages++;
--- uvm/uvm_pglist.c    24 Mar 2004 07:47:33 -0000      1.31
+++ uvm/uvm_pglist.c    17 Sep 2004 20:46:03 -0000      1.32
@@ -483,7 +483,7 @@
                if (iszero)
                        uvm_pagezerocheck(pg);
 #endif /* DEBUG */
-               TAILQ_INSERT_TAIL(&uvm.page_free[uvm_page_lookup_freelist(pg)].
+               TAILQ_INSERT_HEAD(&uvm.page_free[uvm_page_lookup_freelist(pg)].
                    pgfl_buckets[VM_PGCOLOR_BUCKET(pg)].
                    pgfl_queues[iszero ? PGFL_ZEROS : PGFL_UNKNOWN], pg, pageq);
                uvmexp.free++;


Since I assume that the TAILQ_* macros work, and since this doesn't
seem to cause breakage on other archs (at least, not that I have
heard of), I don't know what's wrong. Any ideas?

 Thomas