On Sun, 22 Mar 2009, Martin Husemann wrote:
On Sun, Mar 22, 2009 at 03:27:40PM +0100, Anders Lindgren wrote:
I got an RC3 DIAGNOSTIC+DEBUG+LOCKDEBUG kernel running with the changes suggested by Martin. For good measure, I eliminated RAIDframe from the picture by booting a second install from a different disk and started a build.sh -j8 release-build on it to kill it. Rather than deadlock hard within 10 minutes, it now survived 37 minutes -- but then it panicked! But now I have ddb!
It's running out of mmu contexts on one of the cpus - and something goes wrong in the code supposed to recover from that (not really a heavily tested code path). I'll have to read the code again and try to see if I can reproduce it more quickly on a local machine with an artificially limited number of contexts.
Are we talking about ASI leakage here? Dunno about USII, but USI has (iirc) 4k ASIs.. I haven't looked into how they're handled, but if they're just used round-robin (can't see a reason to do otherwise?), it should be impossible to run out of them unless there are more than 4k concurrent processes?
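To make the round-robin argument concrete, here is a minimal sketch of such an allocator. It is not the NetBSD pmap code: NCTX, ctx_alloc() and ctx_free() are hypothetical names, the pool is deliberately tiny, and reserving number 0 for the kernel is an assumption. It only illustrates that a round-robin scheme fails when every number is in use at once, either by live processes or because some were never released.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical pool size; tiny on purpose so exhaustion is easy to see. */
#define NCTX 8

static bool ctx_in_use[NCTX];   /* slot 0 is assumed reserved for the kernel */
static int next_ctx = 1;        /* round-robin cursor over 1..NCTX-1 */

/* Hand out the next free context number, or -1 if all are taken. */
int ctx_alloc(void)
{
    for (int tries = 0; tries < NCTX - 1; tries++) {
        int c = next_ctx;

        /* Advance the cursor, wrapping past the end back to 1. */
        next_ctx = (next_ctx + 1 < NCTX) ? next_ctx + 1 : 1;
        if (!ctx_in_use[c]) {
            ctx_in_use[c] = true;
            return c;
        }
    }
    return -1;  /* every number is in use: the caller must recover somehow */
}

/* Release a context number so a later ctx_alloc() can reuse it. */
void ctx_free(int c)
{
    assert(c > 0 && c < NCTX);
    ctx_in_use[c] = false;
}
```

In this sketch a number that is never passed to ctx_free() stays taken forever, so a leak would eventually exhaust the pool even with few live processes.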
I see your kmutex_init patch made it into RC4, so I removed my local modification. However, something fishy appears to have sneaked into RC4 too: I built a new RC4 LOCKDEBUG kernel, but it never gets as far as the ASI leakage bug. It drops into ddb trying to read address 0x40 in an openfirmware() call from OF_read, coming from pcons_poll. This happens at the "filesystem type (generic)?" question in the boot -a dialogue, right after answering the root- and dump-device questions.
On a different note: I'm looking into getting a remotely controlled relay to the power cord of this E3k box so I can remotely power cycle it. Does anyone know if this could cause damage to the box (as opposed to power cycling it with the key)? It's no worse than a regular power outage, but I'm not sure how healthy that really is. Manuals tend to not recommend it.
/ali:)