Subject: Re: Corrupt data when reading filesystems under Linux guest
To: None <port-xen@NetBSD.org>
From: Jed Davis <jdev@panix.com>
List: port-xen
Date: 06/09/2005 00:46:34
In article <200506062108.j56L8gG23306@panix5.panix.com>,
<marcotte@panix.com> wrote:
>
> xbdback op 0 req_cons 0xcd40 req_prod 0xcd47/0x7 resp_prod 0xcd40/0x0
> xbdback_io domain 2: frame_and_sects[0]=0x27f68000
> xbdback_io domain 2: frame_and_sects[1]=0x27f68009
> xbdback_io domain 2: frame_and_sects[2]=0x27f68012
> xbdback_io domain 2: frame_and_sects[3]=0x27f6801b
> xbdback_io domain 2: frame_and_sects[4]=0x27f68024
> xbdback_io domain 2: frame_and_sects[5]=0x27f6802d
> xbdback_io domain 2: frame_and_sects[6]=0x27f68036
> xbdback_io domain 2: frame_and_sects[7]=0x27f6803f
> xbdback_io domain 2: frame_and_sects[8]=0x27f67000
> xbdback_io domain 2: frame_and_sects[9]=0x27f67009
> xbdback_io domain 2: frame_and_sects[10]=0x27f67012
> xbdback_io domain 2: start sect 1 start 0 size 5632
> xbdback_io domain 2: start request
While I know very little about the innards of Xen, it looks to me, from
reading xbdback_io(), as if it's assuming that all of the elements of
frame_and_sects will specify consecutive areas of memory, or at least of
whatever address space they end up in after the mapping operation. This
bit in particular:
572 start_offset =
573 blkif_first_sect(req->frame_and_sects[0]) * VBD_BSIZE;
574 req_size = blkif_last_sect(req->frame_and_sects[0]) -
575 blkif_first_sect(req->frame_and_sects[0]) + 1;
576 XENPRINTF(("xbdback_io domain %d: frame_and_sects[0]=0x%lx\n",
577 xbdi->domid, req->frame_and_sects[0]));
578 for (i = 1; i < req->nr_segments; i++) {
579 XENPRINTF(("xbdback_io domain %d: frame_and_sects[%d]=0x%lx\n",
580 xbdi->domid, i, req->frame_and_sects[i]));
581 req_size += blkif_last_sect(req->frame_and_sects[i]) -
582 blkif_first_sect(req->frame_and_sects[i]) + 1;
583 }
and this bit lower down:
608 xbd_req->rq_buf.b_blkno = req->sector_number;
609 xbd_req->rq_buf.b_bcount = (daddr_t)req_size;
610 xbd_req->rq_buf.b_data = (void *)start_offset;
suggest that if the next segment doesn't pick up where the last one
left off, the wrong thing will be done. However, that's not the
problem here, as a look at the frame_and_sects in the quoted log
shows. Fortunately, I'm not done yet; this, in xbdback_map_shm (also in
arch/xen/xen/xbdback.c):
788 for (i = 0; i < req->nr_segments; i++) {
789 xbd_req->rq_ma[i] = (req->frame_and_sects[i] & ~PAGE_MASK);
790 }
791 return xen_shm_map(xbd_req->rq_ma, req->nr_segments,
792 xbd_req->rq_xbdi->domid, &xbd_req->rq_vaddr, 0);
and this, in xen_shm_map (in arch/xen/i386/xen_shm_machdep.c):
156 for (i = 0; i < nentries; i++, new_va_pg++) {
157 mcl[i].op = __HYPERVISOR_update_va_mapping_otherdomain;
158 mcl[i].args[0] = new_va_pg;
159 mcl[i].args[1] = ma[i] | remap_prot;
160 mcl[i].args[2] = 0;
161 mcl[i].args[3] = domid;
suggest to me that something decidedly wrong will happen if a frame
number is repeated in the scatter/gather array; namely, that it will
map the same physical page into multiple successive virtual pages, or
something along those lines, then transfer to/from some contiguous
virtual address range, thus causing some of the transfer to get sent
to/from the wrong place in the wrong page, and also leaving other parts
of the guest's buffer untouched.
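To make that concrete, here's a minimal stand-alone sketch (mine, not
anything from the tree) that decodes the failing request quoted above and
compares where each sector of the single contiguous transfer ends up with
where the guest actually asked for it to go. It assumes the usual Xen 2.x
blkif encoding of frame_and_sects (last_sect in bits 2:0, first_sect in
bits 5:3, and the machine page address in the bits above), which is at
least consistent with the ~PAGE_MASK masking in xbdback_map_shm():

/*
 * Hypothetical stand-alone program, not part of the tree: replay the
 * failing request from the log above and compare, for every 512-byte
 * sector, the machine address the guest asked for with the machine
 * address the contiguous transfer would actually hit, given that
 * xen_shm_map() maps the frame from fas[i] at VA page i of the window.
 */
#include <stdio.h>

#define PAGE_SIZE	4096UL
#define PAGE_MASK	(PAGE_SIZE - 1)
#define VBD_BSIZE	512UL

/* Assumed Xen 2.x blkif encoding of frame_and_sects entries. */
#define blkif_first_sect(fas)	(((fas) >> 3) & 7)
#define blkif_last_sect(fas)	((fas) & 7)

int
main(void)
{
	/* frame_and_sects[] of the failing request, from the log. */
	unsigned long fas[] = {
		0x27f68000, 0x27f68009, 0x27f68012, 0x27f6801b,
		0x27f68024, 0x27f6802d, 0x27f68036, 0x27f6803f,
		0x27f67000, 0x27f67009, 0x27f67012,
	};
	int nsegs = sizeof(fas) / sizeof(fas[0]);
	unsigned long xfer_off;
	int i;

	/*
	 * xbdback_io() does one transfer of req_size sectors starting
	 * first_sect(fas[0]) * VBD_BSIZE bytes into the mapped window.
	 */
	xfer_off = blkif_first_sect(fas[0]) * VBD_BSIZE;

	for (i = 0; i < nsegs; i++) {
		unsigned long want_ma, got_ma, s;

		for (s = blkif_first_sect(fas[i]);
		    s <= blkif_last_sect(fas[i]); s++) {
			/* Where the guest asked for this sector. */
			want_ma = (fas[i] & ~PAGE_MASK) + s * VBD_BSIZE;
			/*
			 * Where it actually goes: VA page n of the window
			 * is backed by whatever frame is in rq_ma[n].
			 */
			got_ma = (fas[xfer_off / PAGE_SIZE] & ~PAGE_MASK) +
			    (xfer_off & PAGE_MASK);
			printf("seg %2d sector %lu: wanted ma 0x%08lx, "
			    "transfer hits ma 0x%08lx%s\n", i, s,
			    want_ma, got_ma,
			    want_ma == got_ma ? "" : "  <-- wrong page");
			xfer_off += VBD_BSIZE;
		}
	}
	return 0;
}

Working that through, the first eight sectors land where they should
(sectors 0-7 of frame 0x27f68), but the last three, which the guest wanted
in sectors 0-2 of frame 0x27f67, get written over sectors 0-2 of frame
0x27f68 again, through the second alias of that page, and frame 0x27f67
never gets written at all.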
Which, conveniently, matches what I was seeing with hexdump: some blocks
showing up in the wrong order, other blocks being replaced with random
kernel memory on read, and all of this being rather nondeterministic --
i.e., I could dd off the first 256 sectors of the guest's /dev/hdc1 and
pipe that to "openssl sha1", 100 times, and get 90-odd different hashes.
(That is, of course, with nothing else using that disk.)
Meanwhile, for the case where the xbd was working correctly, the
transfers all consisted entirely of whole pages, like this:
xbdback op 1 req_cons 0xd01c req_prod 0xd03a/0x3a resp_prod 0xd01b/0x1b
xbdback_io domain 2: frame_and_sects[0]=0x226af007
xbdback_io domain 2: frame_and_sects[1]=0x226ac007
xbdback_io domain 2: frame_and_sects[2]=0x2260d007
xbdback_io domain 2: frame_and_sects[3]=0x225e1007
xbdback_io domain 2: start sect 524304 start 0 size 16384
xbdback_io domain 2: start request
So: am I completely wrong in my reading of the relevant source, or is
NetBSD's xbd backend broken?
And no, I don't see the point in scatter/gathering a single page of I/O
into individual segments for each constituent sector, which is what it
looks like the problematic guest is doing, but that doesn't change the
fact that the backend ought to handle such a request correctly.
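Handling that correctly presumably means doing the transfer segment by
segment (or bouncing it through a private buffer), but at the very least
the backend could notice when the layout it's assuming doesn't hold. Here
is a minimal sketch of such a check, reusing the blkif_first_sect() and
blkif_last_sect() macros that xbdback_io() already uses; the function name
is mine, and it assumes the blkif_request_t layout from Xen's blkif
headers:

/*
 * Sketch only, not a patch.  Returns nonzero iff the request describes
 * one contiguous byte range within the VA window that xen_shm_map()
 * sets up, which is what xbdback_io() currently assumes: every segment
 * except the first must start at sector 0 of its page, and every
 * segment except the last must run through the page's final sector.
 */
static int
xbdback_segs_contiguous(blkif_request_t *req)
{
	int last = (PAGE_SIZE / VBD_BSIZE) - 1; /* 7: last sector in a page */
	int i;

	for (i = 0; i < req->nr_segments; i++) {
		if (i > 0 &&
		    blkif_first_sect(req->frame_and_sects[i]) != 0)
			return 0;
		if (i < req->nr_segments - 1 &&
		    blkif_last_sect(req->frame_and_sects[i]) != last)
			return 0;
	}
	return 1;
}

The request from the log above fails that check immediately (its first
segment stops at sector 0, and the second one starts at sector 1), while
the whole-page requests from the working trace pass. What to do with the
requests that fail (per-segment transfers, a bounce buffer, or just an
error back to the guest) is a separate question.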
--
(let ((C call-with-current-continuation)) (apply (lambda (x y) (x y)) (map
((lambda (r) ((C C) (lambda (s) (r (lambda l (apply (s s) l)))))) (lambda
(f) (lambda (l) (if (null? l) C (lambda (k) (display (car l)) ((f (cdr l))
(C k))))))) '((#\J #\d #\D #\v #\s) (#\e #\space #\a #\i #\newline)))))