Subject: Re: Corrupt data when reading filesystems under Linux guest
To: None <port-xen@NetBSD.org>
From: Jed Davis <jdev@panix.com>
List: port-xen
Date: 06/09/2005 00:46:34
In article <200506062108.j56L8gG23306@panix5.panix.com>,
 <marcotte@panix.com> wrote:
> 
> xbdback op 0 req_cons 0xcd40 req_prod 0xcd47/0x7 resp_prod 0xcd40/0x0
> xbdback_io domain 2: frame_and_sects[0]=0x27f68000
> xbdback_io domain 2: frame_and_sects[1]=0x27f68009
> xbdback_io domain 2: frame_and_sects[2]=0x27f68012
> xbdback_io domain 2: frame_and_sects[3]=0x27f6801b
> xbdback_io domain 2: frame_and_sects[4]=0x27f68024
> xbdback_io domain 2: frame_and_sects[5]=0x27f6802d
> xbdback_io domain 2: frame_and_sects[6]=0x27f68036
> xbdback_io domain 2: frame_and_sects[7]=0x27f6803f
> xbdback_io domain 2: frame_and_sects[8]=0x27f67000
> xbdback_io domain 2: frame_and_sects[9]=0x27f67009
> xbdback_io domain 2: frame_and_sects[10]=0x27f67012
> xbdback_io domain 2: start sect 1 start 0 size 5632
> xbdback_io domain 2: start request

While I know very little about the innards of Xen, it looks to me, from
reading xbdback_io(), as if it's assuming that all of the elements of
frame_and_sects will specify consecutive areas of what appears to be
memory after some sort of mapping operation.  This bit in particular:

    572         start_offset =
    573             blkif_first_sect(req->frame_and_sects[0]) * VBD_BSIZE;
    574         req_size = blkif_last_sect(req->frame_and_sects[0]) -
    575             blkif_first_sect(req->frame_and_sects[0]) + 1;
    576         XENPRINTF(("xbdback_io domain %d: frame_and_sects[0]=0x%lx\n",
    577              xbdi->domid, req->frame_and_sects[0]));
    578         for (i = 1; i < req->nr_segments; i++) {
    579                 XENPRINTF(("xbdback_io domain %d: frame_and_sects[%d]=0x%lx\n",
    580                      xbdi->domid, i, req->frame_and_sects[i])); 
    581                 req_size += blkif_last_sect(req->frame_and_sects[i]) -  
    582                     blkif_first_sect(req->frame_and_sects[i]) + 1;
    583         }

and this bit lower down:

    608         xbd_req->rq_buf.b_blkno = req->sector_number;
    609         xbd_req->rq_buf.b_bcount = (daddr_t)req_size;
    610         xbd_req->rq_buf.b_data = (void *)start_offset;

suggest that if the next segment doesn't pick up where the last one
left off, the wrong thing will be done.  However, that's not the
problem here, as a look at the frame_and_sects in the quoted log
shows.  Fortunately, I'm not done yet; this, in xbdback_map_shm (also in
arch/xen/xen/xbdback.c):

    788         for (i = 0; i < req->nr_segments; i++) {
    789                 xbd_req->rq_ma[i] = (req->frame_and_sects[i] & ~PAGE_MASK);
    790         }
    791         return xen_shm_map(xbd_req->rq_ma, req->nr_segments,
    792                 xbd_req->rq_xbdi->domid, &xbd_req->rq_vaddr, 0);

and this, in xen_shm_map (in arch/xen/i386/xen_shm_machdep.c):

    156         for (i = 0; i < nentries; i++, new_va_pg++) {
    157                 mcl[i].op = __HYPERVISOR_update_va_mapping_otherdomain;
    158                 mcl[i].args[0] = new_va_pg;
    159                 mcl[i].args[1] = ma[i] | remap_prot;
    160                 mcl[i].args[2] = 0;
    161                 mcl[i].args[3] = domid;

suggest to me that something decidedly wrong will happen if a frame
number is repeated in the scatter/gather array; namely, that it will
map the same physical page into multiple successive virtual pages, or
something along those lines, and then transfer to/from one contiguous
virtual address range, thus causing part of the transfer to get sent
to/from the wrong place in the wrong page, and also leaving other parts
of the guest's buffer untouched.
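
To make that concrete, here's a rough userland sketch (not driver code;
the blkif_* macros are copied from my reading of the blkif headers, and
fas_frame() is just my own shorthand) of where I think the bytes of the
quoted 11-segment read end up, versus where the guest asked for them:

    #include <stdio.h>

    #define VBD_BSIZE       512
    #define PAGE_SIZE       4096
    #define SECTS_PER_PAGE  (PAGE_SIZE / VBD_BSIZE)

    /* per my reading of the blkif headers */
    #define blkif_first_sect(fas)   (((fas) >> 3) & 7)
    #define blkif_last_sect(fas)    ((fas) & 7)
    /* my own shorthand for the frame number in the upper bits */
    #define fas_frame(fas)          ((fas) >> 12)

    int
    main(void)
    {
        /* the 11 segments from the corrupting request quoted above */
        unsigned long fas[] = {
            0x27f68000, 0x27f68009, 0x27f68012, 0x27f6801b,
            0x27f68024, 0x27f6802d, 0x27f68036, 0x27f6803f,
            0x27f67000, 0x27f67009, 0x27f67012
        };
        int nsegs = sizeof(fas) / sizeof(fas[0]);
        int i, n, s, total, start_sect;

        /* where the guest wants each sector of the transfer to go */
        total = 0;
        for (i = 0; i < nsegs; i++)
            for (s = blkif_first_sect(fas[i]);
                s <= blkif_last_sect(fas[i]); s++, total++)
                printf("guest:   xfer sector %2d -> frame 0x%lx"
                    " sector %d\n", total, fas_frame(fas[i]), s);

        /*
         * Where one contiguous copy of req_size sectors puts them, if
         * fas_frame(fas[i]) gets mapped at virtual page i: sector n of
         * the transfer lands in virtual page (start_sect + n) /
         * SECTS_PER_PAGE, i.e. in whatever frame that segment named.
         */
        start_sect = blkif_first_sect(fas[0]);
        for (n = 0; n < total; n++)
            printf("backend: xfer sector %2d -> frame 0x%lx"
                " sector %d\n", n,
                fas_frame(fas[(start_sect + n) / SECTS_PER_PAGE]),
                (start_sect + n) % SECTS_PER_PAGE);
        return 0;
    }

If I have that right, transfer sectors 8-10 land back in frame 0x27f68
(clobbering its first three sectors through the duplicate mapping) and
frame 0x27f67 never gets written at all.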

Which, conveniently, matches what I was seeing with hexdump: some blocks
showing up in the wrong order, other blocks being replaced with random
kernel memory on read, and all of this being rather nondeterministic --
i.e., I could dd off the first 256 sectors of the guest's /dev/hdc1 and
pipe that to "openssl sha1", 100 times, and get 90-odd different hashes.
(That is, of course, with nothing else using that disk.)

Meanwhile, for the case where the xbd was working correctly, the
transfers all consisted entirely of whole pages, like this:

 xbdback op 1 req_cons 0xd01c req_prod 0xd03a/0x3a resp_prod 0xd01b/0x1b
 xbdback_io domain 2: frame_and_sects[0]=0x226af007
 xbdback_io domain 2: frame_and_sects[1]=0x226ac007
 xbdback_io domain 2: frame_and_sects[2]=0x2260d007
 xbdback_io domain 2: frame_and_sects[3]=0x225e1007
 xbdback_io domain 2: start sect 524304 start 0 size 16384
 xbdback_io domain 2: start request
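
Decoding one entry from each log with the same macros as in the sketch
above (again assuming I have the encoding right: frame number in the
upper bits, first sector in bits 3-5, last sector in bits 0-2):

    #include <stdio.h>

    #define blkif_first_sect(fas)   (((fas) >> 3) & 7)
    #define blkif_last_sect(fas)    ((fas) & 7)

    int
    main(void)
    {
        /* one entry from the corrupting request, one from the good one */
        unsigned long bad = 0x27f68009, good = 0x226af007;

        printf("0x%lx: frame 0x%lx, sectors %lu..%lu (one sector)\n",
            bad, bad >> 12, blkif_first_sect(bad), blkif_last_sect(bad));
        printf("0x%lx: frame 0x%lx, sectors %lu..%lu (a whole page)\n",
            good, good >> 12, blkif_first_sect(good),
            blkif_last_sect(good));
        return 0;
    }

So the working guest hands the backend whole-page (8-sector) segments,
while the corrupting one hands it one segment per sector, naming the
same frame eight times in a row.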

So: am I completely wrong in my reading of the relevant source, or is
NetBSD's xbd backend broken?

And no, I don't see the point in scatter/gathering a single page of I/O
into individual segments for each constituent sector, which is what it
looks like the problematic guest is doing, but that doesn't change the
fact that the backend should answer such a request correctly.
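
If nothing else, a check along these lines in xbdback_io() -- purely a
sketch, not a tested patch -- could catch such requests and fail them
cleanly instead of quietly corrupting data; the "one contiguous
transfer" assumption only holds when every segment but the first starts
at sector 0 and every segment but the last ends at sector 7:

    /* sketch only: does the contiguity assumption hold for a request? */
    #define blkif_first_sect(fas)   (((fas) >> 3) & 7)
    #define blkif_last_sect(fas)    ((fas) & 7)

    static int
    segments_look_contiguous(const unsigned long *fas, int nsegs)
    {
        int i;

        for (i = 0; i < nsegs; i++) {
            /* every segment but the first must start at sector 0 */
            if (i > 0 && blkif_first_sect(fas[i]) != 0)
                return 0;
            /* every segment but the last must end at sector 7 */
            if (i < nsegs - 1 && blkif_last_sect(fas[i]) != 7)
                return 0;
        }
        return 1;
    }

That still wouldn't catch a guest that names the same whole frame
twice, but it would have caught this one.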

-- 
(let ((C call-with-current-continuation)) (apply (lambda (x y) (x y)) (map
((lambda (r) ((C C) (lambda (s) (r (lambda l (apply (s s) l))))))  (lambda
(f) (lambda (l) (if (null? l) C (lambda (k) (display (car l)) ((f (cdr l))
(C k)))))))    '((#\J #\d #\D #\v #\s) (#\e #\space #\a #\i #\newline)))))