Current-Users archive

Re: Three failure modes seen with -current amd64 modular kernels (somewhat lengthy)



On Tue, 27 Oct 2015, Paul Goyette wrote:

With current kernels, I am seeing three distinct failure modes when
loading/unloading modules.  (FWIW, my base kernel contains as few
built-in modules as possible; everything is loaded as needed.)  At
least issues 1 and 3 have been uncovered as a result of my updating
some inter-module dependencies on Oct. 1 (as far as I can tell, the
issues were present but not triggered because some modules weren't
being loaded).

Hmm, almost a week gone by, and no comments at all?  Am I the only
one who has so strongly embraced modular kernels?  I feel so lonely!

:)

I'm still trying to figure out what's going on with all three of these
issues, but I'm not making any progress.  I have determined that, for
issue #2, the problem can be masked by manually loading the nfs module,
preventing it from being auto-{,un}loaded as a dependency of nfsserver,
which in turn is a recently-introduced dependency of compat_netbsd32.
(This is similar to the work-around for issue #1, where manually loading
the mqueue module hides the issue.)  Without this dependency, the
compat_netbsd32 module fails to load, because the symbol do_nfssvc is
not resolved.
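
For anyone unfamiliar with the mechanism: a module declares its
requirements via the third argument of the MODULE() macro, and the
framework auto-loads (and later may auto-unload) the entire chain.
A simplified sketch of the chain described above; the real in-tree
declarations may name additional modules and use different module
classes:

	#include <sys/module.h>

	/* Sketch only: compat_netbsd32 requires nfsserver, which
	 * provides do_nfssvc, and nfsserver in turn requires nfs,
	 * so loading compat_netbsd32 drags in the whole chain. */
	MODULE(MODULE_CLASS_EXEC, compat_netbsd32, "nfsserver");
	MODULE(MODULE_CLASS_MISC, nfsserver, "nfs");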

If anyone has any ideas on how to make some progress, I'd be happy to
try them out.  At least I can recreate two of the issues in a qemu
environment without putting my production box at risk!


Issue #1
========

While running a package-source build of perl5, something calls
exec_autoload() to load a long list of CLASS_EXEC modules (many of
which don't even exist on amd64 systems, like exec_aout, exec_coff,
and exec_ecoff).  Following inter-module dependencies, this eventually
causes the mqueue module to be auto-loaded.  A short while later,
many/most of the modules (including mqueue) get auto-unloaded.  Some
time later, exec_autoload() is called again, and the same modules get
loaded and then unloaded.

The problem occurs during the second auto-unload of mqueue.  The
final code in mqueue_modcmd() calls pool_cache_destroy(), which in
turn calls pool_destroy().  The list of pools associated with
pool_allocator_nointr is apparently corrupt, and the line

	TAILQ_REMOVE(&pp->pr_alloc->pa_list, pp, pr_alloc_list);

follows an invalid pointer.
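
For reference, TAILQ_REMOVE expands to roughly the following (this is
the classic sys/queue.h definition; NetBSD's adds QUEUEDEBUG checks
around it, omitted here):

	#define	TAILQ_REMOVE(head, elm, field) do {			\
		if (((elm)->field.tqe_next) != NULL)			\
			(elm)->field.tqe_next->field.tqe_prev =		\
			    (elm)->field.tqe_prev;			\
		else							\
			(head)->tqh_last = (elm)->field.tqe_prev;	\
		*(elm)->field.tqe_prev = (elm)->field.tqe_next;		\
	} while (/*CONSTCOND*/0)

The faulting instruction in the disassembly below is the final
*tqe_prev store, with %rax holding pp's (invalid) tqe_prev value and
%rdx the NULL tqe_next.  Note also that since tqe_next is NULL, the
else branch (tqh_last = tqe_prev) runs before the faulting store, so
the interrupted remove itself may be what left pa_list.tqh_last equal
to the invalid address.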

(gdb) bt
#0  0xffffffff801196a5 in cpu_reboot (howto=howto@entry=256,
   bootstr=bootstr@entry=0x0)
   at /build/netbsd-local/src/sys/arch/amd64/amd64/machdep.c:671
#1  0xffffffff80238426 in db_sync_cmd (addr=<optimized out>,
   have_addr=<optimized out>, count=<optimized out>,
   modif=<optimized out>)
   at /build/netbsd-local/src/sys/ddb/db_command.c:1359
#2  0xffffffff80238be7 in db_command (
   last_cmdp=last_cmdp@entry=0xffffffff806b90a0 <db_last_command>)
   at /build/netbsd-local/src/sys/ddb/db_command.c:908
#3  0xffffffff80238f2e in db_command_loop ()
   at /build/netbsd-local/src/sys/ddb/db_command.c:566
#4  0xffffffff8023c5bd in db_trap (type=type@entry=6,
   code=code@entry=0)
   at /build/netbsd-local/src/sys/ddb/db_trap.c:90
#5  0xffffffff80116130 in kdb_trap (type=type@entry=6,
   code=code@entry=0, regs=regs@entry=0xfffffe810e5e0cd0)
   at /build/netbsd-local/src/sys/arch/amd64/amd64/db_interface.c:227
#6  0xffffffff8011a82f in trap (frame=0xfffffe810e5e0cd0)
   at /build/netbsd-local/src/sys/arch/amd64/amd64/trap.c:287
#7  0xffffffff80100fde in alltraps ()
#8  0xffffffff80332510 in pool_destroy (pp=pp@entry=0xfffffe81dda47300)
   at /build/netbsd-local/src/sys/kern/subr_pool.c:690
#9  0xffffffff803339dd in pool_cache_bootstrap_destroy (
   pc=pc@entry=0xfffffe81dda47300)
   at /build/netbsd-local/src/sys/kern/subr_pool.c:1864
#10 0xffffffff803339fd in pool_cache_destroy (pc=0xfffffe81dda47300)
   at /build/netbsd-local/src/sys/kern/subr_pool.c:1828
#11 0xffffffff809a027d in ?? ()
#12 0xfffffe81f141d0c8 in ?? ()
#13 0xfffffe810e5e0e98 in ?? ()
#14 0xffffffff809a02f5 in ?? ()
#15 0x00000000ffffffff in ?? ()
#16 0xfffffe810e5e0e60 in ?? ()
#17 0xffffffff80333ca6 in splraiseipl (icookie=...) at ./x86/intr.h:164
#18 splvm () at /build/netbsd-local/src/sys/sys/spl.h:56
#19 pool_cache_put_paddr (pc=0x0, object=0xfffffe821f53d880,
   pa=18446744071572167040)
   at /build/netbsd-local/src/sys/kern/subr_pool.c:2403
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) fr 8
#8  0xffffffff80332510 in pool_destroy (pp=pp@entry=0xfffffe81dda47300)
   at /build/netbsd-local/src/sys/kern/subr_pool.c:690
690             TAILQ_REMOVE(&pp->pr_alloc->pa_list, pp, pr_alloc_list);
(gdb) disass pool_destroy
...
  0xffffffff803324df <+128>:   callq  0xffffffff8011bbc0 <mutex_enter>
  0xffffffff803324e4 <+133>:   mov    0x80(%rbx),%rax
  0xffffffff803324eb <+140>:   test   %rax,%rax
  0xffffffff803324ee <+143>:   je     0xffffffff803326bc <pool_destroy+605>
  0xffffffff803324f4 <+149>:   mov    0x88(%rbx),%rdx
  0xffffffff803324fb <+156>:   mov    %rdx,0x88(%rax)
  0xffffffff80332502 <+163>:   mov    0x88(%rbx),%rax
  0xffffffff80332509 <+170>:   mov    0x80(%rbx),%rdx
=> 0xffffffff80332510 <+177>:   mov    %rdx,(%rax)
...
(gdb) info reg
rax            0xffffffff8099cbc0       -2137404480
rbx            0xfffffe81dda47300       -1641253932288
rcx            0x0      0
rdx            0x0      0
rsi            0xffffffff       4294967295
rdi            0xffffffff806c0a5a       -2140403110
rbp            0xfffffe810e5e0e00       0xfffffe810e5e0e00
...
(gdb) print *0xffffffff8099cbc0
Cannot access memory at address 0xffffffff8099cbc0
(gdb) print *pp
$2 = {pr_poollist = {tqe_next = 0xfffffe821f9cf440,
   tqe_prev = 0xfffffe821f53e280}, pr_emptypages = {lh_first = 0x0},
 pr_fullpages = {lh_first = 0x0}, pr_partpages = {lh_first = 0x0},
 pr_curpage = 0x0, pr_phpool = 0xffffffff80732fc0 <phpool>, pr_cache = 0x0,
 pr_size = 1024, pr_align = 64, pr_itemoffset = 0, pr_minitems = 0,
 pr_minpages = 0, pr_maxpages = 4294967295, pr_npages = 0,
 pr_itemsperpage = 4, pr_slack = 0, pr_nitems = 0, pr_nout = 0,
 pr_hardlimit = 4294967295, pr_refcnt = 0,
 pr_alloc = 0xffffffff806c0a40 <pool_allocator_nointr>, pr_alloc_list = {
   tqe_next = 0x0, tqe_prev = 0xffffffff8099cbc0}, pr_drain_hook = 0x0,
 pr_drain_hook_arg = 0x0, pr_wchan = 0xffffffff809a1958 "mqmsgpl",
 pr_flags = 0, pr_roflags = 0, pr_lock = {u = {mtxa_owner = 0}}, pr_cv = {
   cv_opaque = {0x0, 0xfffffe81dda473b8, 0xffffffff809a1958}}, pr_ipl = 0,
 pr_phtree = {sph_root = 0x0}, pr_maxcolor = 0, pr_curcolor = 0,
 pr_phoffset = 0, pr_hardlimit_warning = 0x0, pr_hardlimit_ratecap = {
   tv_sec = 0, tv_usec = 0}, pr_hardlimit_warning_last = {tv_sec = 0,
   tv_usec = 0}, pr_nget = 0, pr_nfail = 0, pr_nput = 0, pr_npagealloc = 0,
 pr_npagefree = 0, pr_hiwat = 0, pr_nidle = 0, pr_freecheck = 0x0,
 pr_qcache = 0x0, pr_redzone = false, pr_reqsize = 63}
(gdb) print *pp->pr_alloc
$3 = {pa_alloc = 0xffffffff8033043c <pool_page_alloc>,
pa_free = 0xffffffff80330476 <pool_page_free>, pa_pagesz = 4096, pa_lock = {
   u = {mtxa_owner = 67073}}, pa_list = {
   tqh_first = 0xffffffff80727940 <uvm_map_entry_cache>,
   tqh_last = 0xffffffff8099cbc0}, pa_refcnt = 56, pa_pagemask = -4096,
 pa_pageshift = 12}
(gdb)

(Note that pp->pr_alloc->pa_list.tqh_last points to the same invalid
memory address; this would imply that the pool at pp was not (yet)
found in its allocator's list.)

This problem has been reproduced in a qemu environment, but I have not
yet figured out where the allocator's pool list is getting corrupted.

This problem is 100% reproducible, and even in a slow qemu-based
environment it takes only about 10-15 minutes to trigger.  The issue
can be avoided by building the mqueue module into the kernel, or by
manually loading the module (which prevents the auto-unload).

(If anyone wants to recreate the qemu environment, I can easily
provide the required kernel configuration file.)

Perhaps a clue here? ...

The pool being removed at this point, pp, seems to have its
pr_alloc_list.tqe_prev pointing to the same invalid address as the
allocator's pa_list.tqh_last value.  Could this mean that we have
just destroyed and unmapped an entry that was still linked on these
lists?
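
One scenario consistent with that state can be demonstrated in
userland (a hypothetical illustration, not a confirmed diagnosis):
after a tail element is removed, its tqe_prev and the head's tqh_last
alias each other, so removing the same pool a second time, after the
memory its tqe_prev points into has been unmapped, faults on the
final store exactly as seen above.

	/* Hypothetical userland illustration; compile and run with
	 * any C compiler.  Not the actual kernel code path. */
	#include <sys/queue.h>
	#include <stdio.h>

	struct pool {
		TAILQ_ENTRY(pool) pr_alloc_list;
	};

	int
	main(void)
	{
		TAILQ_HEAD(, pool) pa_list =
		    TAILQ_HEAD_INITIALIZER(pa_list);
		struct pool a, b;

		TAILQ_INSERT_TAIL(&pa_list, &a, pr_alloc_list);
		TAILQ_INSERT_TAIL(&pa_list, &b, pr_alloc_list);

		/* First (legitimate) removal of the tail element. */
		TAILQ_REMOVE(&pa_list, &b, pr_alloc_list);

		/* b's tqe_prev and pa_list.tqh_last now both point
		 * at &a.pr_alloc_list.tqe_next, the same aliasing
		 * seen in the gdb output above. */
		printf("tqh_last=%p b.tqe_prev=%p\n",
		    (void *)pa_list.tqh_last,
		    (void *)b.pr_alloc_list.tqe_prev);

		/* A second TAILQ_REMOVE of &b would store through
		 * b's stale tqe_prev; in the kernel that address
		 * was unmapped module space, hence the trap. */
		return 0;
	}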


Issue #2
========
The second failure mode also occurred while running a package-source
build.  In my "live" environment, it doesn't happen until the 369th
package in my build-list, which happens to be www/firefox (which has a
lot of dependencies).  It takes about 3 hours to get to this point in
my build sequence (on real hardware).

I tried to build _only_ firefox and its dependencies, but the bug did
not trigger.  Yet when I ran the entire set of package builds again, it
showed up again.

In my qemu environment, the problem shows up much earlier, on the 29th
package in my list - boost-libs.  (It took more than 4 hours to get here
in the slow qemu environment!)

Since it triggers on different packages, it is unlikely that the
problem is related to any specific package.  And even though I have
set kern.module.verbose = 1, there was no module-related load/unload
activity for more than three hours prior to the crash (the last of it
happened back when perl5 was being built).

It's not clear from where this one is being triggered.  The backtrace
gives almost no clue:

(From console)
uvm_fault(0xffffffff805dbb00, 0xffffffff807c5000, 1) -> e
fatal page fault in supervisor mode
trap type 6 code 0 rip ffffffff802ca7e5 cs 8 rflags 282 cr2 ffffffff807c5b40 ilevel 0 rsp fffffe8002f55dd8
curlwp 0xfffffe8002f320c0 pid 0.33 lowest kstack 0xfffffe8002f522c0


crash> bt
_KERNEL_OPT_NAGR() at 0
_KERNEL_OPT_NAGR() at 0
db_reboot_cmd() at db_reboot_cmd
db_command() at db_command+0xf0
db_command_loop() at db_command_loop+0x8a
db_trap() at db_trap+0xe9
kdb_trap() at kdb_trap+0xe5
trap() at trap+0x1b4
--- trap (number 6) ---
pool_drain() at pool_drain+0x3b
uvm_pageout() at uvm_pageout+0x45f
crash> show proc/p 0
system: pid 0 proc ffffffff80576580 vmspace/map ffffffff805e01a0 flags 20002
...
lwp 33 [pgdaemon] fffffe8002f320c0 pcb fffffe8002f52000
   stat 7 flags 200 cpu 0 pri 126
...

The code at this point (in pool_drain()) is:

1427            mutex_enter(&pool_head_lock);
1428            do {
1429                    if (drainpp == NULL) {
1430                            drainpp = TAILQ_FIRST(&pool_head);
1431                    }
1432                    if (drainpp != NULL) {
1433                            pp = drainpp;
1434                            drainpp = TAILQ_NEXT(pp, pr_poollist);
1435                    }
1436                    /*
1437                     * Skip completely idle pools.  We depend on at least
1438                     * one pool in the system being active.
1439                     */
1440            } while (pp == NULL || pp->pr_npages == 0);

(gdb) disass pool_drain
...
  0xffffffff802ca7cb <+33>:    callq  0xffffffff8011bbc0 <mutex_enter>
  0xffffffff802ca7d0 <+38>:    mov    0x2acec9(%rip),%rcx        # 0xffffffff805776a0 <pool_head>
  0xffffffff802ca7d7 <+45>:    mov    0x31cb82(%rip),%rax        # 0xffffffff805e7360 <drainpp>
  0xffffffff802ca7de <+52>:    xor    %ebx,%ebx
  0xffffffff802ca7e0 <+54>:    test   %rax,%rax
  0xffffffff802ca7e3 <+57>:    je     0xffffffff802ca7fa <pool_drain+80>
=> 0xffffffff802ca7e5 <+59>:    mov    (%rax),%rdx
...
(gdb) print drainpp
$1 = (struct pool *) 0xffffffff807c5b40
(gdb) print *drainpp
Cannot access memory at address 0xffffffff807c5b40
(gdb) info reg
rax            0xffffffff807c5b40       -2139333824
rbx            0x0      0
rcx            0xffffffff805d8c40       -2141352896
rdx            0x0      0
...
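
The faulting instruction dereferences drainpp itself: pr_poollist is
the first member of struct pool (see the print of *pp under issue #1),
so TAILQ_NEXT(pp, pr_poollist) is a read at offset 0.  That means the
drainpp cursor was left pointing at a pool whose memory has since been
unmapped, presumably a pool torn down by (or living in) an
auto-unloaded module.  pool_destroy() does guard against this, but
only when drainpp points at the exact pool being destroyed; quoting
from memory of subr_pool.c, so details may differ:

	/* From pool_destroy(); the cursor is reset only when it
	 * points at the pool being destroyed.  Any other path that
	 * frees pool memory leaves a dangling cursor behind. */
	mutex_enter(&pool_head_lock);
	while (pp->pr_refcnt != 0)
		cv_wait(&pool_busy, &pool_head_lock);
	TAILQ_REMOVE(&pool_head, pp, pr_poollist);
	if (drainpp == pp)
		drainpp = NULL;
	mutex_exit(&pool_head_lock);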

This problem also appears to be 100% reproducible, on both real and
virtual machines.


Issue #3
========
Sometimes when loading the sysv_ipc module, some of the entries that
are placed on the sysctllog (for later use in sysctl_teardown) have an
incorrect mib-path.  Instead of 1.82.x (kern.ipc.x), the entries
contain 1.83.x (kern.boottime.x).  This is obviously wrong, since
kern.boottime is a terminal/leaf node in the sysctl tree, so it can't
possibly have any children!  One result of this failure mode is that
these entries don't get removed during module unload, and if you later
try to read the entries (for example, using the command "sysctl kern")
it will try to execute the associated func for those entries and
likely panic the machine.  (The address for func points to memory
which may have been unmapped and/or re-used; in any case it no longer
points to a valid copy of the sysctl's func code.)

The problem never happened on the first load of the sysv_ipc module,
and sometimes took many (ten or more) load/unload/load cycles before
triggering.  It affected only the kern.ipc.sem* and kern.ipc.msg*
variables, all of which use dynamically-assigned mib-numbers;  the
entries for kern.ipc.info and kern.ipc.shm* were not affected.
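
For context, these nodes are created with CTL_CREATE in the mib path,
which asks sysctl_createv() to assign the next free node number under
the parent at creation time; the new node is also recorded on the
given sysctllog so that sysctl_teardown() can remove it at unload.  A
hypothetical sketch of such a call (names and values are illustrative,
not the actual sysv_ipc code):

	static struct sysctllog *clog;

	/* Illustrative only; the real sysv_ipc setup differs.
	 * CTL_CREATE makes sysctl_createv() pick the next free mib
	 * number under kern.ipc (1.82), and logs the new node on
	 * clog for a later sysctl_teardown(&clog) at unload. */
	sysctl_createv(&clog, 0, NULL, NULL,
	    CTLFLAG_READONLY,
	    CTLTYPE_INT, "semmni",
	    SYSCTL_DESCR("Max number of semaphore identifiers"),
	    NULL, 0, &seminfo.semmni, 0,
	    CTL_KERN, KERN_SYSVIPC, CTL_CREATE, CTL_EOL);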

This problem initially appeared while I was starting to debug issue #2
above.  I was getting close to figuring out where this one was being
triggered, but when I put a printf() at the point where the code adds
the entry to the sysctllog, the problem went away.  I know that this
bug is still lurking in there somewhere; I just need to find a way to
trigger it with appropriate debug code in place, without perturbing
the code in a way that masks the bug.   :)  For now, I have applied
the following patch in an attempt to panic when the issue occurs;
unfortunately this too seems to mask the bug, as it has not yet
triggered a panic.  :(

Index: kern_sysctl.c
===================================================================
RCS file: /cvsroot/src/sys/kern/kern_sysctl.c,v
retrieving revision 1.258
diff -u -p -r1.258 kern_sysctl.c
--- kern_sysctl.c       23 Oct 2015 01:58:43 -0000      1.258
+++ kern_sysctl.c       27 Oct 2015 02:10:46 -0000
@@ -1955,6 +1955,7 @@ sysctl_createv(struct sysctllog **log, i
	const struct sysctlnode *root, *pnode;
	struct sysctlnode nnode, onode, *dnode;
	size_t sz;
+	const struct sysctlnode *snode;

	/*
	 * where are we putting this?
@@ -2111,6 +2112,7 @@ sysctl_createv(struct sysctllog **log, i
		 */
		pnode = root;
		error = sysctl_locate(NULL, &name[0], namelen - 1, &pnode, &ni);
+		snode = pnode;

		/*
		 * manual scan of last layer so that aliased nodes
@@ -2131,6 +2133,9 @@ sysctl_createv(struct sysctllog **log, i
		 * not expecting an error here, but...
		 */
		if (error == 0) {
+			if (pnode->sysctl_parent != snode)
+				panic("sysctl parentage error %p vs %p",
+					pnode->sysctl_parent, snode);
			if (log != NULL)
				sysctl_log_add(log, pnode);
			if (cnode != NULL)


For issues 1 and 2, I have (relatively small) core dumps available,
exported from the qemu environment, if anyone wants to look more
closely.


+------------------+--------------------------+-------------------------+
| Paul Goyette     | PGP Key fingerprint:     | E-mail addresses:       |
| (Retired)        | FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com    |
| Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd.org  |
+------------------+--------------------------+-------------------------+

