Subject: Re: RAIDframe crash
To: Chris Jones <chris@cjones.org>
From: Greg Oster <oster@cs.usask.ca>
List: current-users
Date: 05/08/2001 18:03:39
Chris Jones writes:
> Thanks for the verbose help, Greg.
>
> On Tue, May 08, 2001 at 05:20:58PM -0600, Greg Oster wrote:
>
> > The date jumps here... is there data missing? If so, at least some of it is
> > critical to solving this...
>
> Yeah; sorry for my incomplete report. As you guessed later on, sd4
> wasn't in the system at all.
Ah! Ok.. that helps.
> I was basically testing the whole thing
> to see how well it's going to work, before I put it into production.
> So I've got two out of my three disks online, and I'm thrashing the
> filesystems.
Fair 'nough. (It's good to see people actually doing this -- that way they
have a better idea of what to expect if something goes wrong)
> I guess I was hoping that, while the RAID array was operating in
> degraded mode, it would fail analogously to a single-disk filesystem:
> Reboot, fsck (and possibly lose some un-synced data), and keep going.
> In fact, it looks more like: Reboot, make the sysadmin force a RAID
> configure, then fsck (and possibly lose data).
Arguably RAIDframe should fail much like a single disk... but parts of the
guts of RAIDframe aren't really geared for that...
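To spell out the "force a RAID configure" step: raidctl(8) can be pointed at
the config file with -C, which overrides the component-label checks that a
plain -c would refuse on. Something like this -- the config file path and the
raid1e partition letter are just examples, not from your setup:

   raidctl -C /etc/raid1.conf raid1   # force configuration despite the dead component
   raidctl -s raid1                   # sanity check: sd4e should show as failed
   fsck /dev/rraid1e                  # then let fsck clean up the filesystem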
> > > May 8 16:10:46 gamera /netbsd: sd2(siop1:0:0): command timeout
> >
> > "Uh oh.." This will not make RAIDframe happy if the IO didn't complete...
> > but no big deal... RAID 5 will deal with a single disk failure... I'm
> > guessing that sd2e, sd3e, and sd4e are all in this RAID 5 set... but where
> > is sd4? If it's not there, then you've already had 2 failures, which is
> > fatal in a RAID 5 set...
>
> Yeah. I don't know the cause of the underlying error; I'll have to
> investigate that. This machine has been giving me a lot of trouble,
> though, with various SCSI controllers, cables, drives, and
> enclosures. Sometimes I wonder if SCSI just doesn't like me...
You haven't mentioned "termination" in that mix ;)
> > > May 8 16:10:47 gamera /netbsd: sd3(siop1:1:0): parity error
> >
> > "Uh Oh#2". If sd3 is in the RAID set, RAIDframe is going to be really
> > upset, as with 2 (or is it 3 now?) disks gone, it's pretty much game over.
> > (And RAIDframe will turn out the lights, rather than continuing on...)
> > (It should probably just stop doing IO's to the RAID set, but that's a
> > different problem).
>
> Aha. So you're saying that all RAID sets will fail (or more
> accurately, RAIDframe will fail) in the event of a double disk
> failure? That's fine, really; it's just something I wasn't aware of.
Yes. And the last time I looked at teaching RAIDframe to fail gracefully on
this it looked like it was going to be quite entertaining...
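(The arithmetic behind that: RAID 5 keeps exactly one parity block per
stripe, and the parity is just the XOR of the data blocks, so any *one*
missing block can be rebuilt from the survivors. A toy sh illustration,
nothing RAIDframe-specific:

   d0=165 d1=60           # two "data blocks"
   p=$(( d0 ^ d1 ))       # parity = bitwise XOR of the data (153)
   echo $(( p ^ d1 ))     # prints 165: d0 rebuilt from parity + d1

Lose a second block and that equation has two unknowns, hence "game over".)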
> > > May 8 16:10:47 gamera /netbsd: siop1: scsi bus reset
> > > May 8 16:10:47 gamera /netbsd: cmd 0xc06700c0 (target 0:0) in reset list
> > >
> > > ...and then it crashed. The console had some message about RAIDframe
> > > being unable to allocate a DAG. I didn't write it down or get a
> > > backtrace, because I knew it would make a core dump. :-/
> >
> > Writing it down would not have caused a core dump, and would have helped
> > confirm what I suspect happened. Basically when 2 disks in a RAID 5 set
> > fail, RAIDframe gives up. And by the looks of it, the machine had some
> > serious SCSI problems, errors were returned to RAIDframe, RAIDframe marked
> > the components as failed, and when more than one component failed,
> > RAIDframe said "enough".
>
> :) I didn't mean to say that I thought writing down a backtrace would
> cause a core dump. I meant to say that I didn't bother, because I
> assumed that this crash would trigger a core dump, regardless. Which
> it did. I just wasn't able to grab the core dump off the dump device
> when it came back up.
You know, it wasn't till the 2nd time I read it that I interpreted it the way
you intended, but got distracted, and forgot to change my comment...
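(For next time: assuming the dump actually made it onto the dump device,
running savecore(8) by hand after the reboot should fish it out, e.g.

   savecore /var/crash   # target directory is just an example; rc normally does this at boot

though if the SCSI bus was wedged at panic time, the dump may never have made
it to disk in the first place.)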
> > > Problem 2: I'd like to get raid1 back up again, but it won't
> > > configure:
> > > May 8 16:38:31 gamera /netbsd: raidlookup on device: /dev/sd4e failed!
> > > May 8 16:38:31 gamera /netbsd: Hosed component: /dev/sd4e
> >
> > So sd4 is no longer on the system? Oh... looks like it wasn't there
> > before on May 7??? (Unless the logs you have here are incomplete...)
>
> I included all information relevant to scsibuses and my fxp0. If you
> want more, I can certainly send it. :) Though it sounds like that's
> all irrelevant to what's going on.
Ya... sd4 being there or not was the critical bit of information that was
missing...
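For anyone following along: "raidlookup on device: /dev/sd4e failed!" just
means the kernel couldn't open a component named in the config file. A
minimal config for a set like this would look roughly like the following
(the layout numbers are invented for illustration):

   START array
   # numRow numCol numSpare
   1 3 0

   START disks
   /dev/sd2e
   /dev/sd3e
   /dev/sd4e

   START layout
   # sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level
   32 1 1 5

   START queue
   fifo 100

With sd4 physically absent, the /dev/sd4e line is what raidlookup trips over,
which is why the configuration has to be forced.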
> > [...] If,
> > however, sd4 was *not* in the system and you were running in
> > degraded mode from the get-go, then you can just use sd2 and sd3,
> > and things should be reasonably ok. (there will likely be some
> > filesystem lossage, but hopefully not much.) Once we know whether
> > sd4 was there or not we'll have a better idea of what components you
> > want to forcibly configure...
>
> Yeah; sd2 and sd3. I've done that, and I'm in the process of running
> fsck now. For what it's worth, given the SCSI errors. Sigh.
You should actually be fairly OK. (well.. in terms of RAIDframe having most of
the bits right -- the filesystem might be hosed, but RAIDframe would have done
the best it could with the bits it got :) ).
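Oh, and once sd4 (or its replacement) is physically back in the box,
something like

   raidctl -R /dev/sd4e raid1   # reconstruct in place onto the returned component

should rebuild it and get you back to full redundancy (see raidctl(8) for the
exact incantation, e.g. adding it as a hot spare with -a instead).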
Later...
Greg Oster