NetBSD-Bugs archive
Re: kern/52769: hang with an ffs stored in an nvme device
The following reply was made to PR kern/52769; it has been noted by GNATS.
From: Chuck Silvers <chuq%chuq.com@localhost>
To: gnats-bugs%NetBSD.org@localhost
Cc: kern-bug-people%netbsd.org@localhost, gnats-admin%netbsd.org@localhost,
netbsd-bugs%netbsd.org@localhost
Subject: Re: kern/52769: hang with an ffs stored in an nvme device
Date: Fri, 1 Dec 2017 10:45:09 -0800
I got a copy of the dump from martin and here are some things I found.
the two buffers that he pointed out are for different parts of the same device:
(gdb) p *(struct buf *)0xffffe4041d8f7240
$4 = {
b_u = {
u_actq = {
tqe_next = 0xffffe4040e6b6c68,
tqe_prev = 0xffffe4029c033ea0
},
u_rbnode = {
rb_nodes = {0xffffe4040e6b6c68, 0xffffe4029c033ea0},
rb_info = 0x0
},
u_work = {
wk_dummy = 0xffffe4040e6b6c68
}
},
b_iodone = 0x0,
b_error = 0x0,
b_resid = 0x388e,
b_flags = 0x404,
#define B_ASYNC 0x00000004 /* Start I/O, do not wait. */
#define B_COWDONE 0x00000400 /* Copy-on-write already done. */
b_prio = 0x1,
b_bufsize = 0x8000,
b_bcount = 0x8000,
b_dev = 0x1300,
b_data = 0xffff800148420000,
b_blkno = 0xddd06c0,
b_rawblkno = 0xddd0ec0,
b_proc = 0x0,
b_saveaddr = 0x0,
b_private = 0x0,
b_dcookie = 0x0,
b_busy = {
cv_opaque = {0xffffe4040092db80, 0xffffe4040092dc90, 0xffffffff8116c394}
},
b_refcnt = 0x2,
b_unused = 0x0,
b_hash = {
le_next = 0x0,
le_prev = 0xffffe403f8913f40
},
b_vnbufs = {
le_next = 0xffffe4029c033ea0,
le_prev = 0xffffe4040e6b6d18
},
b_freelist = {
tqe_next = 0xffffe4041739a6c0,
tqe_prev = 0xffffe403b883c308
},
b_wapbllist = {
tqe_next = 0x0,
tqe_prev = 0x0
},
b_lblkno = 0xddd06c0,
b_freelistindex = 0x1,
b_cflags = 0x4800010,
#define BC_BUSY 0x00000010 /* I/O in progress. */
#define BC_WANTED 0x00800000 /* Process wants this buffer. */
#define BC_VFLUSH 0x04000000 /* Buffer is being synced. */
b_vp = 0xffffe4042b5b2bd0,
b_done = {
cv_opaque = {0x0, 0xffffe4041d8f7338, 0xffffffff8116c39c}
},
b_oflags = 0x0,
b_objlock = 0xffffe4042b619640
}
(gdb) p *(struct buf *)0xffffe40294774480
$5 = {
b_u = {
u_actq = {
tqe_next = 0xffffe403b48ac368,
tqe_prev = 0xffffe403b48ac488
},
u_rbnode = {
rb_nodes = {0xffffe403b48ac368, 0xffffe403b48ac488},
rb_info = 0x0
},
u_work = {
wk_dummy = 0xffffe403b48ac368
}
},
b_iodone = 0x0,
b_error = 0x0,
b_resid = 0x49ec,
b_flags = 0x404,
#define B_ASYNC 0x00000004 /* Start I/O, do not wait. */
#define B_COWDONE 0x00000400 /* Copy-on-write already done. */
b_prio = 0x1,
b_bufsize = 0x8000,
b_bcount = 0x8000,
b_dev = 0x1300,
b_data = 0xffff80014cfb0000,
b_blkno = 0x121f1100,
b_rawblkno = 0x121f1900,
b_proc = 0x0,
b_saveaddr = 0x0,
b_private = 0x0,
b_dcookie = 0x0,
b_busy = {
cv_opaque = {0xffffe403ccf032c0, 0xffffe403ccf033d0, 0xffffffff8116c394}
},
b_refcnt = 0x2,
b_unused = 0x0,
b_hash = {
le_next = 0xffffe403e8c7b480,
le_prev = 0xffff80001d75c158
},
b_vnbufs = {
le_next = 0xffffe403b48ac488,
le_prev = 0xffffe403b48ac418
},
b_freelist = {
tqe_next = 0xffffe40412c51120,
tqe_prev = 0xffffe4041739a300
},
b_wapbllist = {
tqe_next = 0x0,
tqe_prev = 0x0
},
b_lblkno = 0x121f1100,
b_freelistindex = 0x1,
b_cflags = 0x4800010,
#define BC_BUSY 0x00000010 /* I/O in progress. */
#define BC_WANTED 0x00800000 /* Process wants this buffer. */
#define BC_VFLUSH 0x04000000 /* Buffer is being synced. */
b_vp = 0xffffe4042b5b2bd0,
b_done = {
cv_opaque = {0x0, 0xffffe40294774578, 0xffffffff8116c39c}
},
b_oflags = 0x0,
b_objlock = 0xffffe4042b619640
}
b_dev = 0x1300 is "ld0a", which is associated with nvme0 as martin mentioned.
the buffers are B_ASYNC, so they will be unlocked automatically by biodone().
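(as a sanity check, the decoding of those fields can be reproduced with the
standard macros; this fragment is only an illustration of the arithmetic, not
code from the tree, and decode_buf_fields() is a made-up name:)

#include <sys/types.h>		/* dev_t, major(), minor() */
#include <sys/systm.h>		/* printf() */
#include <sys/buf.h>		/* B_ASYNC */
#include <sys/disklabel.h>	/* DISKUNIT(), DISKPART() */

/*
 * illustrative only: decode the fields seen in the dump.
 * b_dev = 0x1300 -> major 0x13 = 19 (ld), minor 0 -> unit 0,
 * partition 'a', i.e. ld0a.  b_flags = 0x404 has B_ASYNC set,
 * so biodone() releases the buffer itself via brelse() instead
 * of waking a waiter.
 */
static void
decode_buf_fields(dev_t dev, int flags)
{
	printf("major %d minor %d -> unit %d partition %c, async: %s\n",
	    major(dev), minor(dev), DISKUNIT(dev), 'a' + DISKPART(dev),
	    (flags & B_ASYNC) ? "yes" : "no");
}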
however the nvme driver does not have any I/O pending:
(gdb) p *((struct nvme_softc *)nvme_cd.cd_devs[0]->dv_private)->sc_admin_q
$9 = {
q_sc = 0xffff80001da34000,
q_sq_mtx = {
u = {
mtxa_owner = 0x601
}
},
q_cq_mtx = {
u = {
mtxa_owner = 0x601
}
},
q_sq_dmamem = 0xffffe4042f7ceac8,
q_cq_dmamem = 0xffffe4042f7ceaf0,
q_sqtdbl = 0x1000,
q_cqhdbl = 0x1004,
q_id = 0x0,
q_entries = 0x20,
q_sq_tail = 0x4,
q_cq_head = 0x4,
q_cq_phase = 0x1,
q_ccb_mtx = {
u = {
mtxa_owner = 0x601
}
},
q_nccbs = 0x20,
q_nccbs_avail = 0x21,
q_ccbs = 0xffffe4011d6e6000,
q_ccb_list = {
sqh_first = 0xffffe4011d6e6000,
sqh_last = 0xffffe4011d6e6e88
},
q_ccb_prpls = 0xffffe4042f7ceb18
}
(gdb) p *((struct nvme_softc *)nvme_cd.cd_devs[0]->dv_private)->sc_q[0]
$12 = {
q_sc = 0xffff80001da34000,
q_sq_mtx = {
u = {
mtxa_owner = 0x601
}
},
q_cq_mtx = {
u = {
mtxa_owner = 0x601
}
},
q_sq_dmamem = 0xffffe4042f7ceb40,
q_cq_dmamem = 0xffffe4042f7ceb68,
q_sqtdbl = 0x1008,
q_cqhdbl = 0x100c,
q_id = 0x1,
q_entries = 0x400,
q_sq_tail = 0x3b7,
q_cq_head = 0x3b7,
q_cq_phase = 0x0,
q_ccb_mtx = {
u = {
mtxa_owner = 0x601
}
},
q_nccbs = 0x3ff,
q_nccbs_avail = 0x3ff,
q_ccbs = 0xffff800020d5b000,
q_ccb_list = {
sqh_first = 0xffff800020d78cb8,
sqh_last = 0xffff800020d78f10
},
q_ccb_prpls = 0xffffe4042f7ceb90
}
the one normal queue is empty (q_nccbs == q_nccbs_avail, and q_sq_tail == q_cq_head).
the admin queue is a bit odd in that q_nccbs_avail is one greater than q_nccbs,
but it does appear to also be empty since q_sq_tail == q_cq_head.
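(in other words, the idle test being applied to both queues is just:
the submission tail has caught up with the completion head, and every ccb is
back on the free pool.  spelled out as a throwaway helper, using the field
names from the dump; nvme_q_idle() is not a real function in the driver:)

#include <sys/types.h>
#include <dev/ic/nvmevar.h>	/* struct nvme_queue, as dumped above */

/*
 * hypothetical helper, not in the tree: a queue is idle when the
 * controller has consumed everything submitted and all ccbs have been
 * returned.  ">=" rather than "==" on the ccb count because the admin
 * queue above shows q_nccbs_avail one greater than q_nccbs.
 */
static bool
nvme_q_idle(const struct nvme_queue *q)
{
	return q->q_sq_tail == q->q_cq_head &&
	    q->q_nccbs_avail >= q->q_nccbs;
}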
the two buffers above, along with 554 others, are still in the bufq of ld0:
(gdb) p ((struct ld_softc *)ld_cd.cd_devs[0]->dv_private)->sc_dksc.sc_bufq
$1 = (struct bufq_state *) 0xffffe4042f686ce8
(gdb) xbufq_fcfs 0xffffe4042f686ce8
bufq 0xffffe4042f686ce8
0xffffe4025ff23b48
0xffffe4025ff23a28
0xffffe4025ff23908
0xffffe4025ff237e8
0xffffe4025ff236c8
...
0xffffe403b23b4480
0xffffe403cfef36c0
total buffers 556
ld0's dk_softc also indicates that no buffers are currently in the hands of
the underlying driver:
(gdb) p *(struct ld_softc *)ld_cd.cd_devs[0]->dv_private
$14 = {
sc_dksc = {
...
sc_bufq = 0xffffe4042f686ce8,
sc_dtype = 0xf,
sc_deferred = 0xffffe4025ff23c68,
sc_busy = 0x0,
...
sc_queuecnt = 0x0,
...
sc_maxqueuecnt = 0x3ff,
}
so the bug appears to be that the dk layer is sitting on a bunch of buffers
even though the underlying nvme driver is idle.
one problem I see is in ld_diskstart():
static int
ld_diskstart(device_t dev, struct buf *bp)
{
	struct ld_softc *sc = device_private(dev);
	int error;

	if (sc->sc_queuecnt >= sc->sc_maxqueuecnt)
		return EAGAIN;

	if ((sc->sc_flags & LDF_MPSAFE) == 0)
		KERNEL_LOCK(1, curlwp);

	mutex_enter(&sc->sc_mutex);

	if (sc->sc_queuecnt >= sc->sc_maxqueuecnt)
		error = EAGAIN;
	else {
		error = (*sc->sc_start)(sc, bp);
		if (error == 0)
			sc->sc_queuecnt++;
	}

	mutex_exit(&sc->sc_mutex);

	if ((sc->sc_flags & LDF_MPSAFE) == 0)
		KERNEL_UNLOCK_ONE(curlwp);

	return error;
}
the first sc_queuecnt check here is done without holding any locks,
so it can race with lddone() decrementing sc_queuecnt: ld_diskstart()
returns EAGAIN even though the driver queue is no longer full, and
dk_start() then stops sending new I/O to the underlying nvme driver.
lddone() calls dk_start(), but the buffer that ld_diskstart() is rejecting
is not on the bufq at that point so the lddone() thread wouldn't know about it.
just removing the unlocked check of sc_queuecnt here would fix this race.
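(i.e. something along these lines; an untested sketch where only the early
unlocked test is dropped and the rest of the function is as quoted above:)

static int
ld_diskstart(device_t dev, struct buf *bp)
{
	struct ld_softc *sc = device_private(dev);
	int error;

	if ((sc->sc_flags & LDF_MPSAFE) == 0)
		KERNEL_LOCK(1, curlwp);

	mutex_enter(&sc->sc_mutex);

	/* check sc_queuecnt only while holding sc_mutex */
	if (sc->sc_queuecnt >= sc->sc_maxqueuecnt)
		error = EAGAIN;
	else {
		error = (*sc->sc_start)(sc, bp);
		if (error == 0)
			sc->sc_queuecnt++;
	}

	mutex_exit(&sc->sc_mutex);

	if ((sc->sc_flags & LDF_MPSAFE) == 0)
		KERNEL_UNLOCK_ONE(curlwp);

	return error;
}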
however, it seems unlikely that this race would result in more than 1 I/O
becoming stuck in the bufq, and in this dump we have a bunch of them,
so it seems very likely that there are more problems that I haven't found yet.
the logic in dk_start() that allows two threads to be processing the bufq
in parallel looks like it could result in losing track of a buffer entirely.
if both threads get EAGAIN from d_diskstart then they will both try to
save the buffer pointer in dksc->sc_deferred, and the first one will be lost.
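(stripped down, that hazard is just two threads storing into a single shared
slot with no check that it is already occupied; this is not the real
dk_start() code, just the shape of the problem:)

#include <sys/buf.h>

/*
 * not the real dk_start(): minimal illustration of how a single
 * sc_deferred slot can lose a buffer when two threads defer at once.
 */
static struct buf *sc_deferred;		/* one slot, shared between threads */

static void
defer_rejected_buf(struct buf *bp)	/* hypothetical helper name */
{
	/*
	 * thread A stores bp1, thread B then stores bp2: bp1 is now on
	 * neither the bufq nor sc_deferred, so it is never resubmitted
	 * and never completes.
	 */
	sc_deferred = bp;
}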
but that isn't the problem with the two buffers above since they are both
still on the bufq. the comment says this special hidey hole for one buffer
is intended to keep the order of I/Os the same, but that isn't needed for
correctness, and doing that in the dk layer rather than in the bufq layer
seems poor. if we want a way to undo a bufq_get() rather than just putting
the buffer back on the bufq with bufq_put(), then we should add a new API
for that rather than making callers deal with it.
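(something like a bufq_unget() that reinserts a buffer at the head of the
queue would let callers undo a bufq_get() directly and preserve ordering
without a private sc_deferred slot.  the name and the caller below are
invented here to sketch the shape of such an API, nothing more:)

#include <sys/types.h>
#include <sys/errno.h>
#include <sys/buf.h>
#include <sys/bufq.h>

/*
 * hypothetical API, not in the tree: undo a bufq_get() by putting the
 * buffer back at the head of the queue, preserving issue order.
 */
void	bufq_unget(struct bufq_state *bufq, struct buf *bp);

/*
 * sketch of a caller: "start" is whatever submits a buf to the
 * underlying driver and may return EAGAIN when its queue is full.
 */
static void
start_queued_io(struct bufq_state *bufq, int (*start)(struct buf *))
{
	struct buf *bp;

	while ((bp = bufq_get(bufq)) != NULL) {
		if ((*start)(bp) == EAGAIN) {
			bufq_unget(bufq, bp);	/* undo the bufq_get() */
			break;
		}
	}
}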
(for nvme especially it would be better to not require putting all buffers
for a device on a single bufq at all, since that rather reduces the benefit
of the hardware's multiple queues. but that kind of design
change is beyond the scope of the issue at hand.)
I'll keep looking for more bugs that could have caused this.
-Chuck