Port-xen archive
Re: Problems with many DOMUs on a single DOM0.
Hi Harry,
On Jan 7, 2013, at 22:18 , Harry Waddell wrote:
> At the risk of being completely and utterly wrong in a public forum,
No, I think that's my role ;-)
> I would suggest you look at your open file descriptor limits to at least rule
> out the possibility that xenconsoled is running out of file descriptors for
> the pty's it's managing.
Good idea, and Brian Buhrow suggested the same thing.
> I haven't looked into this too deeply or examined the source, but lsof seems
> to indicate that there are 2 fd used per domU ( which makes sense ), plus a
> few used for overhead. It wouldn't take long to run out if you didn't take
> some steps to increase things from the defaults.
Turns out that 3 fds per DOMU + overhead are needed, which with the NetBSD
default of 128 file descriptors works out to a limit of around 40 or so DOMUs.
That is exactly where I was seeing problems.
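As a back-of-the-envelope check of that arithmetic, here is a minimal sketch.
The 3 fds per DOMU and the 128 limit are the figures above; the overhead of 8
descriptors is purely an assumed value for illustration, since I have not
counted the real overhead exactly.

```python
def max_domus(fd_limit, fds_per_domu=3, overhead=8):
    """Estimate how many DOMU consoles fit under a file descriptor limit.

    overhead=8 is an assumption for illustration, not a measured value.
    """
    return max(0, (fd_limit - overhead) // fds_per_domu)


if __name__ == "__main__":
    # With the default limit of 128 this lands at about 40 DOMUs.
    print(max_domus(128))
```

With a larger limit the same estimate scales linearly, so the limit can be
chosen to leave comfortable headroom above the number of DOMUs planned.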
Small patch to xend and I'm a happy camper. Problems that have simple
solutions are wonderful ;-)
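The patch itself is not reproduced here. As a rough sketch of the general
idea only, assuming the fix amounts to raising the soft open-file limit in a
Python process before it starts anything that needs the descriptors, something
like the following would do it. The function name raise_nofile_limit and the
target value are hypothetical and not taken from xend.

```python
import resource


def raise_nofile_limit(wanted):
    """Raise the soft RLIMIT_NOFILE towards `wanted`, capped by the hard limit.

    Returns the soft limit in effect afterwards. Hypothetical helper for
    illustration; this is not the actual xend patch.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        wanted = min(wanted, hard)          # cannot exceed the hard limit
    if soft == resource.RLIM_INFINITY or soft >= wanted:
        return soft                         # already high enough
    resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


if __name__ == "__main__":
    # 512 is an arbitrary example target, not a recommended value.
    print(raise_nofile_limit(512))
```

Child processes inherit the limits of the process that starts them, so the
limit has to be raised before the children are spawned.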
> I've been using xen for a long time now -- nearly a decade -- and with each
> major version, the console support seems to be improving. Once 4.2 hits
> pkgsrc and has a chance to gel a bit, you may want to consider upgrading.
Yes, I'm considering that. However, I depend heavily on file-backed DOMUs, and
until recently there seemed to be problems with that in Xen 4. So "4.2 hitting
pkgsrc" is exactly what I'm waiting for.
BTW, I notice that there's no xentools pkg at all for Xen 4.1.2, although there
is a xenkernel-4.1.2 pkg. Hmm?
> Also, with that many domUs, even with SSD, that's a lot of backend I/O, so
> you'll also want to take the normal steps to make sure dom0 gets the resources it
> needs.
Yes, that's certainly true. Back in the day I striped my DOMUs across four
physical servers to get the number of spinning disks that I needed for IO. Now
I do the same thing in a Shuttle box with four SSDs...
Many, many thanks for the quick clue to where I ought to have looked first.
Regards,
Johan
> On Mon, 7 Jan 2013 18:53:06 +0100
> Johan Ihrén <johani%johani.org@localhost> wrote:
>
>> Hi,
>>
>> All of this is NetBSD-6.0, XEN 3.3.2, with ptyfs mounted, all VND-devices
>> created, etc. However, the results are basically the same for 5.2. I have
>> looked at the XEN logs, but haven't found any clues there.
>>
>> I run many DOMUs on the same DOM0. No need for optimal performance, but
>> strong need for many separate DOMUs. They are all file-backed, using VND and
>> PV (not HVM). The DOM0 is always amd64, while the DOMUs used to be i386pae,
>> but I'm migrating them to also be amd64.
>>
>> Previously over the years I've been limited by CPU, by disk IO, by available
>> memory, etc, which put the reasonable limit at around 30 DOMUs on a quad core box
>> with 8GB memory and four SSDs, and that works like a charm. I.e. I've been
>> constrained by the hardware, not the OS.
>>
>> But I would like to get to around 50-60 DOMUs and current hardware has
>> enough cores and memory to provide that without too much fuss. I.e. if there
>> are constraints now, they are likely OS or XEN constraints.
>>
>> And I'm running into problems. Several problems actually.
>>
>> As I start more DOMUs eventually I reach a point where the consoles no
>> longer work:
>> ------
>> witch:labconfig# xm console domu38
>> NetBSD/amd64 (domu38) (console)
>> login: # login prompt, this DOMU is fine
>>
>> witch:labconfig# xm console domu39 # this one, however, is not:
>>
>> xenconsole: Could not read tty from store: No such file or directory
>> ------
>> It is interesting to note that the limit is "soft" in the sense that if I
>> kill a couple of machines it is possible to start a few other ones that will
>> then get working consoles. I.e. it is not a permanent resource exhaustion.
>>
>> What's also interesting, though, is that sometimes (but not always) "domu39"
>> is fine, except for the lack of a console. I.e. as long as I don't screw up
>> my networking, I can add some more DOMUs... until I hit the next problem.
>> This time, all machines up to and including "domu44" were ok. But "domu45" is
>> not working ("not working" defined as "doesn't respond to ping").
>>
>> There's another problem with non-working DOMUs, and that is that they tend
>> to go to 100% CPU and stay there. It is not exactly clear to me when this
>> happens. Sometimes it is immediately when the DOMU is created, sometimes
>> I've been able to use a DOMU for hours with no problems (except lack of
>> console) and then it goes to 100% CPU when I try to kill it off with "xm
>> shutdown" (which doesn't work). "xm destroy" does kill them off, though.
>>
>> And now it gets really strange. If I kill off the non-working DOMUs with "xm
>> destroy" and then start them again then sometimes they work (still no
>> console, but networking ok, so it is possible to get to them). This way, by
>> booting DOMUs, and destroying and rebooting them until they work, I've been
>> able to get to 52 working DOMUs, which is enough for me. But the last few
>> machines are really skittish and may require several restarts before they
>> work at all.
>>
>> And sometimes (but not always) I get problems with xend:
>> ------
>> Unable to connect to xend: Connection refused. Is xend running?
>> ------
>> xend IS running. But not functioning for some reason.
>>
>> When this happens, it is not possible to restart xend with "/etc/rc.d/xend
>> restart". Only way to kill xend is with "kill -9" (it is in state "Il"). But
>> once xend is restarted it is possible to recover without rebooting.
>>
>> The first problem (no console for machines ~40 and up) is likely some sort
>> of PTY resource exhaustion, although I don't understand why or where. When
>> it happens I've run a small python script to check whether (the python)
>> openpty function is able to allocate a PTY and that seems to work ok. I used
>> python only because xen is written in python. Other suggestions for what to
>> try would be appreciated.
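For reference, a check of that kind might look like the sketch below. This is
not the original script, just a minimal reconstruction of the idea using
os.openpty; the function name and the number of attempts are made up for
illustration. Note that it runs under the file descriptor limit of whoever
runs it, not under the limit of the Xen daemons, so success here does not rule
out descriptor exhaustion inside another process.

```python
import os


def count_allocatable_ptys(max_tries=16):
    """Try to open up to max_tries PTY pairs and report how many succeeded.

    Every pair that was opened is closed again before returning.
    """
    pairs = []
    try:
        for _ in range(max_tries):
            try:
                pairs.append(os.openpty())   # returns (master_fd, slave_fd)
            except OSError:
                break                        # allocation failed, stop trying
        return len(pairs)
    finally:
        for master, slave in pairs:
            os.close(master)
            os.close(slave)


if __name__ == "__main__":
    print("PTY pairs allocated:", count_allocatable_ptys())
```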
>>
>> The second problem (some DOMUs going to 100% CPU and in general not
>> functioning) is probably more difficult. But without a console it is
>> difficult to diagnose.
>>
>> The third problem (xend becoming catatonic) happens less frequently, and
>> sometimes not at all. And as it is possible to recover by killing xend and
>> restarting it, it is less of a pain than the others. But there's still a
>> problem in there somewhere.
>>
>> Suggestions anyone?
>>
>> Regards,
>>
>> Johan Ihrén
>>
>>
>