Issue #2
========
The second failure mode also occurred while running a package-source
build. In my "live" environment, it doesn't happen until the 369th
package in my build list, which happens to be www/firefox (a package
with a lot of dependencies). It takes about 3 hours to reach this point
in my build sequence (on real hardware).
I tried to build _only_ firefox and its dependencies, but the bug did
not trigger. Yet when I re-ran the entire set of package builds, it
reappeared.
In my qemu environment, the problem shows up much earlier, on the 29th
package in my list, boost-libs. (It took more than 4 hours to get here
in the slow qemu environment!)
Since it triggers on different packages, the problem is unlikely to be
related to any specific package. Even with kern.module.verbose = 1 set,
there is no module-related load/unload activity for more than three
hours prior to the crash (the last such activity occurred while building
perl5).
It's not clear where this one is being triggered from. The backtrace
gives almost no clue:
(From console)
uvm_fault(0xffffffff805dbb00, 0xffffffff807c5000, 1) -> e
fatal page fault in supervisor mode
trap type 6 code 0 rip ffffffff802ca7e5 cs 8 rflags 282 cr2
ffffffff807c5b40 ilevel 0 rsp fffffe8002f55dd8
curlwp 0xfffffe8002f320c0 pid 0.33 lowest kstack 0xfffffe8002f522c0
crash> bt
_KERNEL_OPT_NAGR() at 0
_KERNEL_OPT_NAGR() at 0
db_reboot_cmd() at db_reboot_cmd
db_command() at db_command+0xf0
db_command_loop() at db_command_loop+0x8a
db_trap() at db_trap+0xe9
kdb_trap() at kdb_trap+0xe5
trap() at trap+0x1b4
--- trap (number 6) ---
pool_drain() at pool_drain+0x3b
uvm_pageout() at uvm_pageout+0x45f
crash> show proc/p 0
system: pid 0 proc ffffffff80576580 vmspace/map ffffffff805e01a0 flags
20002
...
lwp 33 [pgdaemon] fffffe8002f320c0 pcb fffffe8002f52000
stat 7 flags 200 cpu 0 pri 126
...
The code at this point is:
1427		mutex_enter(&pool_head_lock);
1428		do {
1429			if (drainpp == NULL) {
1430				drainpp = TAILQ_FIRST(&pool_head);
1431			}
1432			if (drainpp != NULL) {
1433				pp = drainpp;
1434				drainpp = TAILQ_NEXT(pp, pr_poollist);
1435			}
1436			/*
1437			 * Skip completely idle pools.  We depend on at least
1438			 * one pool in the system being active.
1439			 */
1440		} while (pp == NULL || pp->pr_npages == 0);
(gdb) disass pool_drain
...
0xffffffff802ca7cb <+33>: callq 0xffffffff8011bbc0 <mutex_enter>
   0xffffffff802ca7d0 <+38>:	mov    0x2acec9(%rip),%rcx   # 0xffffffff805776a0 <pool_head>
   0xffffffff802ca7d7 <+45>:	mov    0x31cb82(%rip),%rax   # 0xffffffff805e7360 <drainpp>
0xffffffff802ca7de <+52>: xor %ebx,%ebx
0xffffffff802ca7e0 <+54>: test %rax,%rax
0xffffffff802ca7e3 <+57>: je 0xffffffff802ca7fa <pool_drain+80>
=> 0xffffffff802ca7e5 <+59>: mov (%rax),%rdx
...
(gdb) print drainpp
$1 = (struct pool *) 0xffffffff807c5b40
(gdb) print *drainpp
Cannot access memory at address 0xffffffff807c5b40
(gdb) info reg
rax 0xffffffff807c5b40 -2139333824
rbx 0x0 0
rcx 0xffffffff805d8c40 -2141352896
rdx 0x0 0
...
This problem also appears to be 100% reproducible on both real and
virtual machines.