Issue #2
========
The second failure mode also occurred while running a package-source
build. In my "live" environment, it doesn't happen until the 369th
package in my build list, which happens to be www/firefox (a package
with a lot of dependencies). It takes about 3 hours to reach this point
in my build sequence (on real hardware).
I tried to build _only_ firefox and its dependencies, but the bug did
not trigger. Yet when I re-ran the entire set of package builds, it
reappeared.
In my qemu environment, the problem shows up much earlier, on the 29th
package in my list, boost-libs. (It took more than 4 hours to get here
in the slow qemu environment!)
Since it triggers on different packages, the problem is unlikely to be
related to any specific package. Even with kern.module.verbose = 1 set,
there is no module-related load/unload activity for more than three
hours prior to the crash (the last such activity occurred while building
perl5).
It's not clear where this one is being triggered from. The backtrace
gives almost no clue:
(From console)
uvm_fault(0xffffffff805dbb00, 0xffffffff807c5000, 1) -> e
fatal page fault in supervisor mode
trap type 6 code 0 rip ffffffff802ca7e5 cs 8 rflags 282 cr2
ffffffff807c5b40 ilevel 0 rsp fffffe8002f55dd8
curlwp 0xfffffe8002f320c0 pid 0.33 lowest kstack 0xfffffe8002f522c0
crash> bt
_KERNEL_OPT_NAGR() at 0
_KERNEL_OPT_NAGR() at 0
db_reboot_cmd() at db_reboot_cmd
db_command() at db_command+0xf0
db_command_loop() at db_command_loop+0x8a
db_trap() at db_trap+0xe9
kdb_trap() at kdb_trap+0xe5
trap() at trap+0x1b4
--- trap (number 6) ---
pool_drain() at pool_drain+0x3b
uvm_pageout() at uvm_pageout+0x45f
crash> show proc/p 0
system: pid 0 proc ffffffff80576580 vmspace/map ffffffff805e01a0 flags
20002
...
lwp 33 [pgdaemon] fffffe8002f320c0 pcb fffffe8002f52000
stat 7 flags 200 cpu 0 pri 126
...
The code at this point is:
1427		mutex_enter(&pool_head_lock);
1428		do {
1429			if (drainpp == NULL) {
1430				drainpp = TAILQ_FIRST(&pool_head);
1431			}
1432			if (drainpp != NULL) {
1433				pp = drainpp;
1434				drainpp = TAILQ_NEXT(pp, pr_poollist);
1435			}
1436			/*
1437			 * Skip completely idle pools.  We depend on at least
1438			 * one pool in the system being active.
1439			 */
1440		} while (pp == NULL || pp->pr_npages == 0);
(gdb) disass pool_drain
...
0xffffffff802ca7cb <+33>: callq 0xffffffff8011bbc0 <mutex_enter>
   0xffffffff802ca7d0 <+38>:	mov    0x2acec9(%rip),%rcx   # 0xffffffff805776a0 <pool_head>
   0xffffffff802ca7d7 <+45>:	mov    0x31cb82(%rip),%rax   # 0xffffffff805e7360 <drainpp>
0xffffffff802ca7de <+52>: xor %ebx,%ebx
0xffffffff802ca7e0 <+54>: test %rax,%rax
0xffffffff802ca7e3 <+57>: je 0xffffffff802ca7fa <pool_drain+80>
=> 0xffffffff802ca7e5 <+59>: mov (%rax),%rdx
...
(gdb) print drainpp
$1 = (struct pool *) 0xffffffff807c5b40
(gdb) print *drainpp
Cannot access memory at address 0xffffffff807c5b40
(gdb) info reg
rax 0xffffffff807c5b40 -2139333824
rbx 0x0 0
rcx 0xffffffff805d8c40 -2141352896
rdx 0x0 0
...
This problem also appears to be 100% reproducible on both real and
virtual machines.