Re: Problems with many DOMUs on a single DOM0.

To: Harry Waddell <waddell%caravaninfotech.com@localhost>
Subject: Re: Problems with many DOMUs on a single DOM0.
From: Johan Ihrén <johani%johani.org@localhost>
Date: Mon, 7 Jan 2013 22:50:40 +0100

Hi Harry,

On Jan 7, 2013, at 22:18 , Harry Waddell wrote:

> At the risk of being completely and utterly wrong in a public forum,

No, I think that's my role ;-)

> I would suggest you look at your open file descriptor limits to at least rule 
> out the possibility that xenconsoled is running out of file descriptors for 
> the pty's it's managing.

Good idea, and Brian Buhrow suggested the same thing.
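
For anyone wanting to run the same check, here is a minimal Python sketch that reports the open-file limits of the current process. It uses only the standard `resource` module; the helper name `fd_limits` is made up for illustration. Note that it reports the limits of the Python process itself, which is only a hint about what a daemon started from the same environment would get.

```python
import resource

def fd_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)

def show(value):
    # RLIM_INFINITY is the marker for "no limit".
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)

if __name__ == "__main__":
    soft, hard = fd_limits()
    print("soft:", show(soft), "hard:", show(hard))
```

The soft limit is the one a process actually runs into; it can be raised up to the hard limit without special privileges.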

> I haven't looked into this too deeply or examined the source, but lsof seems 
> to indicate that there are 2 fd used per domU ( which makes sense ), plus a 
> few used for overhead. It wouldn't take long to run out if you didn't take 
> some steps to increase things from the defaults. 

It turns out that 3 fds per DOMU plus overhead are needed, and with the NetBSD 
default of 128 file descriptors that works out to around 40 or so DOMUs, which 
is exactly where I was seeing problems.
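
The arithmetic can be sketched in a few lines of Python. The figures of 3 fds per DOMU and a limit of 128 come from this message; the overhead of 8 descriptors is only an assumed placeholder, since the exact number is not given here, and `max_domus` is a hypothetical helper name.

```python
def max_domus(fd_limit, fds_per_domu=3, overhead=8):
    """Largest number of DOMUs whose descriptors fit under fd_limit.

    overhead=8 is an assumed figure, not one taken from the thread.
    """
    return max(0, (fd_limit - overhead) // fds_per_domu)

print(max_domus(128))              # (128 - 8) // 3 -> 40
print(max_domus(128, overhead=0))  # 128 // 3 -> 42, the upper bound
```

Whatever the real overhead is, the ceiling lands in the low 40s, which fits the observed failures.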

A small patch to xend and I'm a happy camper. Problems that have simple 
solutions are wonderful ;-)
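
The patch itself is not shown in this message, so the following is not it. It is only a sketch of the general technique in Python: a process raises its own soft descriptor limit with the standard `resource` module, and processes it starts afterwards inherit the new limit. The function name `raise_fd_limit` and the target of 1024 are assumptions chosen for illustration.

```python
import resource

def raise_fd_limit(wanted):
    """Raise the soft RLIMIT_NOFILE towards `wanted`, capped at the hard limit.

    Returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY or soft >= wanted:
        return soft  # already high enough, nothing to do
    if hard != resource.RLIM_INFINITY:
        wanted = min(wanted, hard)  # unprivileged processes cannot exceed this
    resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]

if __name__ == "__main__":
    # 1024 is an arbitrary illustrative target, not a value from the thread.
    print("soft limit now:", raise_fd_limit(1024))
```

The same effect can usually be had without touching any code by raising the limit in the shell or rc script that starts the daemon.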

> I've been using xen for a long time now -- nearly a decade -- and with each 
> major version, the console support seems to be improving. Once 4.2 hits 
> pkgsrc and has a chance to gel a bit, you may want to consider upgrading. 

Yes, I'm considering that. However, I depend heavily on file-backed DOMUs, and 
until recently there seemed to be problems with that in Xen 4. So "4.2 hitting 
pkgsrc" is exactly what I'm waiting for. 

BTW, I notice that there's no xentools pkg at all for Xen 4.1.2, although there 
is a xenkernel-4.1.2 pkg. Hmm?

> Also, with that many domUs, even with SSD, that's a lot of backend I/O, so 
> you'll also want to take the normal steps to make sure dom0 gets the resources it 
> needs. 

Yes, that's certainly true. Back in the day I striped my DOMUs across four 
physical servers to get the number of spinning disks that I needed for IO. Now 
I do the same thing in a Shuttle box with four SSDs...

Many, many thanks for the quick clue to where I ought to have looked first.

Regards,

Johan

> On Mon, 7 Jan 2013 18:53:06 +0100
> Johan Ihrén <johani%johani.org@localhost> wrote:
> 
>> Hi,
>> 
>> All of this is NetBSD-6.0, XEN 3.3.2, with ptyfs mounted, all VND-devices 
>> created, etc. However, the results are basically the same for 5.2. I have 
>> looked at the XEN logs, but haven't found any clues there.
>> 
>> I run many DOMUs on the same DOM0. No need for optimal performance, but 
>> strong need for many separate DOMUs. They are all file-backed, using VND and 
>> PV (not HVM). The DOM0 is always amd64, while the DOMUs used to be i386pae, 
>> but I'm migrating them to also be amd64.
>> 
>> Over the years I've been limited by CPU, by disk IO, by available memory, 
>> etc., which put the reasonable limit at around 30 DOMUs on a quad core box 
>> with 8GB memory and four SSDs, and that works like a charm. I.e. I've been 
>> constrained by the hardware, not the OS.
>> 
>> But I would like to get to around 50-60 DOMUs and current hardware has 
>> enough cores and memory to provide that without too much fuss. I.e. if there 
>> are constraints now, they are likely OS or XEN constraints.
>> 
>> And I'm running into problems. Several problems actually.
>> 
>> As I start more DOMUs eventually I reach a point where the consoles no 
>> longer work:
>> ------
>> witch:labconfig# xm console domu38
>> NetBSD/amd64 (domu38) (console)
>> login:                                 # login prompt, this DOMU is fine
>> 
>> witch:labconfig# xm console domu39     # this one, however, is not:
>> 
>> xenconsole: Could not read tty from store: No such file or directory
>> ------
>> It is interesting to note that the limit is "soft" in the sense that if I 
>> kill a couple of machines it is possible to start a few other ones that will 
>> then get working consoles. I.e. it is not a permanent resource exhaustion.
>> 
>> What's also interesting, though, is that sometimes (but not always) "domu39" 
>> is fine, except for the lack of a console. I.e. as long as I don't screw up 
>> my networking, I can add some more DOMUs... until I hit the next problem. 
>> This time, all machines up to and including "domu44" were ok. But "domu45" is 
>> not working ("not working" defined as "doesn't respond to ping").
>> 
>> There's another problem with non-working DOMUs, and that is that they tend 
>> to go to 100% CPU and stay there. It is not exactly clear to me when this 
>> happens. Sometimes it is immediately when the DOMU is created, sometimes 
>> I've been able to use a DOMU for hours with no problems (except lack of 
>> console) and then it goes to 100% CPU when I try to kill it off with "xm 
>> shutdown" (which doesn't work). "xm destroy" does kill them off, though.
>> 
>> And now it gets really strange. If I kill off the non-working DOMUs with "xm 
>> destroy" and then start them again then sometimes they work (still no 
>> console, but networking ok, so it is possible to get to them). This way, by 
>> booting DOMUs, and destroying and rebooting them until they work, I've been 
>> able to get to 52 working DOMUs, which is enough for me. But the last few 
>> machines are really skittish and may require several restarts before they 
>> work at all.
>> 
>> And sometimes (but not always) I get problems with xend:
>> ------
>> Unable to connect to xend: Connection refused. Is xend running?
>> ------
>> xend IS running. But not functioning for some reason.
>> 
>> When this happens, it is not possible to restart xend with "/etc/rc.d/xend 
>> restart". Only way to kill xend is with "kill -9" (it is in state "Il"). But 
>> once xend is restarted it is possible to recover without rebooting.
>> 
>> The first problem (no console for machines ~40 and up) is likely some sort 
>> of PTY resource exhaustion, although I don't understand why or where. When 
>> it happens I've run a small python script to check whether (the python) 
>> openpty function is able to allocate a PTY and that seems to work ok. I used 
>> python only because xend is written in python. Other suggestions for what to 
>> try would be appreciated.
>> 
>> The second problem (some DOMUs going to 100% CPU and in general not 
>> functioning) is probably more difficult. But without a console it is 
>> difficult to diagnose.
>> 
>> The third problem (xend becoming catatonic) happens less frequently, and 
>> sometimes not at all. And as it is possible to recover by killing xend and 
>> restarting it, it is less of a pain than the others. But there's still a 
>> problem in there somewhere.
>> 
>> Suggestions anyone?
>> 
>> Regards,
>> 
>> Johan Ihrén
>> 
>> 
>
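
The quoted message mentions a small Python script that checks whether openpty can allocate a PTY. That script is not included in the thread; a minimal sketch of such a probe, assuming only the standard `os.openpty()` call and a made-up helper name `probe_ptys`, might look like this:

```python
import os

def probe_ptys(count):
    """Try to allocate `count` PTY pairs; return how many succeeded."""
    pairs = []
    try:
        for _ in range(count):
            try:
                pairs.append(os.openpty())  # returns (master_fd, slave_fd)
            except OSError:
                break  # no PTY available, or this process is out of fds
        return len(pairs)
    finally:
        # Always release whatever was allocated.
        for master, slave in pairs:
            os.close(master)
            os.close(slave)

if __name__ == "__main__":
    print("allocated", probe_ptys(4), "of 4 PTY pairs")
```

A probe like this runs in a fresh process with its own descriptor table, so it can succeed even when a long-running daemon has hit its per-process descriptor limit. That is consistent with how the console problem in this thread was resolved.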
