Hello. After thinking about Jaromir's message for a while, I began
looking into this issue more. It's definitely a software issue. As part
of that process, I put some instrumentation in
src/sys/arch/xen/xen/xbd_xenbus.c to see if I could figure out what was
going on. I was inspired by the port-xen/53506 bug report.

After a bunch of trial and error, I've narrowed down the issue, or, at
least, I think I have. The problem seems to be that we get duplicate
requests off of the ring between the domu and the backend from time to time
in xbd_handler(). In the debug output below, the xbd_handler() printf that
shows the bp being processed also prints two numbers in parentheses. The
first is the value of i in the for loop of xbd_handler() and the second is
the value of resp_prod. The bug triggers when the difference between i and
resp_prod is greater than 1, that is, when the handler finds more than one
response waiting on the ring in a single pass.
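
For illustration, here is a minimal userland sketch of a response-ring
consume loop that prints the same "(i, resp_prod)" pair. It is not the
driver code: struct ring, struct resp, ring_push() and ring_drain() are
hypothetical names, and the ring size is an arbitrary assumption. It only
shows how one pass sees more than one response when resp_prod has moved
ahead of the consumer index by more than 1.

```c
#include <assert.h>
#include <stdio.h>

#define RING_SIZE 32u                 /* hypothetical; power of two */
#define RING_MASK (RING_SIZE - 1u)

struct resp { unsigned id; };         /* stand-in for a block response */

struct ring {
    struct resp slot[RING_SIZE];
    unsigned rsp_prod;                /* advanced by the backend */
    unsigned rsp_cons;                /* advanced by the frontend */
};

/* Backend side: publish one response carrying request id `id`. */
static void ring_push(struct ring *r, unsigned id)
{
    assert(r->rsp_prod - r->rsp_cons < RING_SIZE);   /* ring not full */
    r->slot[r->rsp_prod & RING_MASK].id = id;
    r->rsp_prod++;
}

/*
 * Frontend side: consume everything up to a snapshot of rsp_prod and
 * print "(i, resp_prod)" for each response. Returns how many responses
 * carried the same id as the response just before them in this pass.
 */
static int ring_drain(struct ring *r)
{
    unsigned resp_prod = r->rsp_prod; /* snapshot taken once per pass */
    unsigned i, prev_id = 0;
    int have_prev = 0, dups = 0;

    for (i = r->rsp_cons; i != resp_prod; i++) {
        unsigned id = r->slot[i & RING_MASK].id;

        printf("handler: id %u (%u, %u)\n", id, i, resp_prod);
        if (have_prev && id == prev_id)
            dups++;                   /* same request seen twice in a row */
        prev_id = id;
        have_prev = 1;
    }
    r->rsp_cons = i;                  /* everything up to resp_prod consumed */
    return dups;
}
```

Pushing two responses with the same id before a single ring_drain() call
makes the loop run twice against one resp_prod snapshot, which is the
shape of the (383, 385) / (384, 385) lines in the bad trip.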

Given that these domUs work flawlessly on Xen-3.3.2 and on
FreeBSD running as dom0 on Xen-4.12, I'm thinking this behavior is a
symptom of the problem, rather than the cause of the problem.

Given this additional information, does anyone have an idea what might be
going on or what I might try next to resolve the issue?

I tried checking whether the bp was the same on sequential passes through
the for loop and skipping the biodone() call on the second pass. That
stops the panic, but the domu then freezes in physio. So, I think I'm
close to the problem.
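
A possible explanation for the freeze, offered as an assumption rather
than a confirmed diagnosis: in the bad trip below, two buffers are started
(0xffffa0000fa06d20 and 0xffffa0000fa06e38), but both responses resolve to
the second one. If the duplicate is simply skipped, the first buffer is
never completed, so whoever is waiting on it would wait forever. The
userland sketch below models that; struct fake_buf, fake_biodone() and
complete_skip_dups() are hypothetical stand-ins, not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct fake_buf { bool done; };       /* stand-in for struct buf */

/*
 * Mark a buffer complete. Returns false if it was already complete,
 * which is the case where the real kernel panics in biodone2().
 */
static bool fake_biodone(struct fake_buf *bp)
{
    if (bp->done)
        return false;
    bp->done = true;
    return true;
}

/*
 * Complete the buffer named by each response, skipping a response that
 * names the same buffer as the one just before it. Returns the number
 * of buffers actually completed.
 */
static int complete_skip_dups(struct fake_buf **resp, size_t n)
{
    struct fake_buf *prev = NULL;
    int completed = 0;

    assert(resp != NULL || n == 0);
    for (size_t k = 0; k < n; k++) {
        if (resp[k] == prev)
            continue;                 /* duplicate: do not complete again */
        if (fake_biodone(resp[k]))
            completed++;
        prev = resp[k];
    }
    return completed;
}
```

With two outstanding buffers and two responses that both point at the
second buffer, this completes one buffer without a double completion and
leaves the first buffer's done flag unset.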
-thanks
-Brian

<Good trip through xbd_handler()>
xbdstrategy(0xffffa0000fa06d20): b_bcount = 16384
xbdstart(0xffffa0000fa06d20): b_bcount = 16384
xbd_handler(xbd0)
xbd_handler(0xffffa0000fa06d20): b_bcount = 16384 (376, 377)
xbd_handler(xbd0)
. . .
<Bad trip through xbd_handler()>
xbdstrategy(0xffffa0000fa06d20): b_bcount = 32768
xbdstart(0xffffa0000fa06d20): b_bcount = 32768
xbd_handler(xbd0)
xbdstrategy(0xffffa0000fa06e38): b_bcount = 32768
xbdstart(0xffffa0000fa06e38): b_bcount = 32768
xbd_handler(xbd0)
xbd_handler(0xffffa0000fa06e38): b_bcount = 32768 (383, 385)
xbd_handler(0xffffa0000fa06e38): b_bcount = 32768 (384, 385)
panic: biodone2 already
fatal breakpoint trapxbd_handler(xbd0)
in supervisor mode
trap type 1 code 0 rip ffffffff8031d08d cs e030 rflags 246 cr2 7f7ffd60a087 cpl 0 rsp ffffa00055407b00
Stopped in pid 0.4 (system) at netbsd:breakpoint+0x5: leave
db> bt
breakpoint() at netbsd:breakpoint+0x5
panic() at netbsd:panic+0x242
biodone2() at netbsd:biodone2+0xd8
biointr() at netbsd:biointr+0x31
softint_thread() at netbsd:softint_thread+0x66
db>