Port-xen archive
Re: Crashes with large SuperMicro based server.
Greg,
gdt%ir.bbn.com@localhost:
> Good, that is indeed what you should have verified first before xen.
I did. Before Xen. :-)
> Presumably you are using amd64.
I am indeed.
> It seems there is a bug in the kernel someplace, probably in a driver
> (because it works fine for most other people), that is somehow tickled
> with xen and your hardware.
Tobias Nygren (in a private mail) seems to have nailed it down. He
hinted that there might be a problem with the Areca RAID controller not
being recognized properly, and since I don't use it, I unplugged it from
the PCI bus, and the machine came up - WITH Xen. Tobias also proposed a
patch, which I haven't had time to install yet, but I'll try to build a
kernel with the patch later in the week.
> Suggestions more or less in order of increasing difficulty:
Thanks for all of these!
> 1) Look up the program counter above (rip) in the kernel binary. One
> way is to run gdb and then "disass 0xffffffff80540792" to find the
> function it's in.
Ack.
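(For the archive: when gdb is not at hand, the same lookup can be
approximated from a symbol listing. Below is a minimal sketch, assuming
the input looks like the "address type name" lines printed by `nm -n` on
the kernel binary. The symbol names and addresses in the sample are
hypothetical placeholders, not taken from a real NetBSD kernel; only the
rip value comes from the suggestion above.)

```python
import bisect

def parse_nm(text):
    """Parse "address type name" lines into sorted (addr, name) pairs.

    Only text (code) symbols, type T or t, are kept.
    """
    syms = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[1] in ("T", "t"):
            syms.append((int(parts[0], 16), parts[2]))
    syms.sort()
    return syms

def lookup(syms, pc):
    """Return (name, offset) of the last symbol starting at or below pc."""
    addrs = [addr for addr, _ in syms]
    i = bisect.bisect_right(addrs, pc) - 1
    if i < 0:
        return None  # pc lies below every known symbol
    addr, name = syms[i]
    return name, pc - addr

# Hypothetical sample listing, for illustration only.
SAMPLE = """\
ffffffff80540000 T hypothetical_probe
ffffffff80540700 T hypothetical_attach
ffffffff80540900 T hypothetical_detach
ffffffff80a00000 D hypothetical_data
"""

if __name__ == "__main__":
    result = lookup(parse_nm(SAMPLE), 0xffffffff80540792)
    if result is not None:
        name, offset = result
        print(f"{name}+{offset:#x}")
```

(Note that this only finds the nearest preceding symbol. It cannot tell
whether the address is actually inside that function, so gdb's "disass"
remains the more reliable check.)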
> 2) photograph/video the boot screen to find the netbsd kernel messages
> preceding the hang. The key point is to know where in the boot sequence
> it is. Both the driver that printed the last line that came out and the
> driver after that are suspect (compare to non-xen boot).
(I actually tried that, but my handy camera proved broken, and my phone
wasn't good enough to catch the rapid scrolling on the screen.)
> 3) Use a serial console and capture the output from xen and netbsd
> before the crash.
That would have been my next step, but it would require a lot of
fiddling ...
> 4) Figure out how to do remote gdb on the netbsd kernel. I am not sure
> how to do this in xen.
A good challenge! :-)
>> Are there any limitations I should know about (# of cores, max mem)?
> Not that I know of (that you're close to; if you had 256 cores and 1024G
> of RAM I would not be sure).
Only in my dreams ... :-)
>> Are there any BIOS settings that I need to check? (CPU flags?)
> I would try disabling SMP in the bios, so that you boot with one core.
> Probably that's not it, but it's easy to try.
I'll save that one for the future.
>> Are there any combos of hypervisor and kernel that are less or more
>> likely to work?
> Hard to say, but xen41 and xen45 are good versions to try. I would
> suggest trying to boot a netbsd-6 DOM0 kernel also. I don't think it's
> likely to work better, but it's an easy test.
I'll save that one too.
>> Are there BIOS devices that can get in the way and should be
>> removed/disabled? (USB, COM, IPMI ...)?
> Not really, but you could try to turn off everything that isn't
> necessary. IPMI I would leave on.
Ack.
>> Should I look at the PCI bus? (RAID board)?
> It's unlikely that an unrecognized PCI device would cause trouble.
> (Note that I am saying "unlikely"; there are more or less no
> certainties.)
It does indeed seem to be the case, though. But I fully agree. It's
a surprise to me too!
> I am unfamiliar with bootscrub; try without.
(Disabling it saves time on large-mem machines when you boot. I've used
it successfully on Xen/Debian.)
> I don't see anything scary in your non-xen dmesg.
Again, thanks for all good hints!
Cheers,
/Liman