Jean-Yves Migeon wrote:
Christoph Egger wrote:
jym: Does your save/restore/migration work have some fixes related to machine-to-phys / phys-to-machine tables? If yes, can you commit them, please?

I'm actively investigating this matter, as I have similar problems with the page fault handler during a migration with an updated current, on a call to xc_map_foreign_batch().

During the live save, accesses to vaddr 0xbba9c000 <> 0xbba9d000 trigger a page fault in dom0, and it keeps calling privpgop_fault in a loop, which leads to a hang.

These faults happen when dom0 maps the p2m translation tables. I am looking at it.
Alright, a little update, as this thing is a real pain to track down. From what I have gathered so far, the p2m/m2p tables are handled correctly by NetBSD (manually auditing the content of the tables does not reveal any bogus entry).
However, I discovered today that the privcmd routines seem to handle their associated ioctls (the IOCTL_PRIVCMD_MMAP{BATCH} commands) incorrectly. I frequently get "off by one" errors (that is, the expected data is found at index 1 of an array instead of index 0, for example).
The first element of the array contains a poison value (see Christoph's mail), so the fault routine operates on incorrect values, which results in either an endless loop or a crash during an mmu_update.
In my case, it happens with the mfn array during xc_map_foreign_batch. The array[0] value contains 4101 (== 0x1005, the "poison entry"), and array[1] contains the correct mfn to map.
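To make the symptom above concrete, here is a minimal user-space sketch of the kind of check one could use while debugging: it compares the mfn array as submitted with the array as seen further down the path, and reports by how many slots the data is shifted. The helper names (find_index_shift, first_entry_is_poison) are hypothetical and exist in neither NetBSD nor Xen. Only the 0x1005 value comes from the observation above; any mfn values fed to it are up to the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Poison value reported in this thread (4101 in decimal). */
#define POISON_ENTRY 0x1005UL

/*
 * Hypothetical helper: return the smallest shift k in [0, max_shift]
 * such that observed[i + k] == expected[i] for every i with i + k < n,
 * or -1 if no such shift exists. A result of 1 corresponds to the
 * "off by one" pattern described above.
 */
static int
find_index_shift(const unsigned long *expected,
    const unsigned long *observed, size_t n, size_t max_shift)
{
	assert(expected != NULL && observed != NULL);

	for (size_t k = 0; k <= max_shift && k < n; k++) {
		size_t i;

		for (i = 0; i + k < n; i++) {
			if (observed[i + k] != expected[i])
				break;
		}
		if (i + k == n)
			return (int)k;	/* every compared slot matched */
	}
	return -1;
}

/* Hypothetical helper: nonzero if the first slot holds the poison value. */
static int
first_entry_is_poison(const unsigned long *observed, size_t n)
{
	return n > 0 && observed[0] == POISON_ENTRY;
}
```

With a made-up expected array {A, B, C} and an observed array {0x1005, A, B}, find_index_shift() returns 1 and first_entry_is_poison() is true, which matches the pattern seen with xc_map_foreign_batch. This only detects the shift; it says nothing about where in the privcmd/uvm path it is introduced.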
I am now looking at the internals between privcmd and uvm. From my PoV, the bug lies somewhere in there (an alignment issue, or something like an improper cast; I don't know specifically yet), but IMHO it is not directly Xen's fault.
Somewhere between uvm_map and privpgop_fault, the mfns are not passed down correctly.
Stay tuned.

Cheers,

-- 
Jean-Yves Migeon
jeanyves.migeon%free.fr@localhost