Subject: Re: raidframe re-mirroring (cont'd)
To: Louis Guillaume <lguillaume@berklee.edu>
From: Greg Oster <oster@cs.usask.ca>
List: current-users
Date: 08/13/2004 08:34:02
Louis Guillaume writes:
> Hi Everyone,
>
> I posted a few weeks ago about a problem I had with a raid set, where
> one disk was failed and I wanted to bring it back online. Here's what
> happened...
>
> . Booted into single-user
>
> . Rebuilt all arrays on the pair of disks: raid0 raid1 raid2 raid3 raid4
> - all raid-1. It's set up like this...
>
> #############################
> raid0 raid1 raid2 raid3 raid4
>
> wd0a wd0e wd0f wd0g wd0b
> wd1a wd1e wd1f wd1g wd1b
>
> / /usr /var /home swap
> #############################
>
> . fsck-ed all filesystems. reboot
>
> Immediately, I noticed apache2 and spamass-milter fail during startup
> (recently built from pkgsrc and very reliable). Immediately!
How do they fail? What do they do/not do? (i.e. what is the nature
of the error?)
> This is
> what caused me to believe the second disk was bad in the first place.
>
> Now I believed that the disk was actually bad and not the kernel/raidframe.
>
> . Rebooted back to single user.
> . Failed all wd1 raid components.
> . fsck (finds and fixes errors) and reboot again.
>
> All is well! For a week and a half, not a hitch.
>
> More reason to believe it's the disk.
>
> . Replaced suspect disk with another one, disklabeled, raidctl -a ...etc.
>
> . Incorporated new spare components into arrays.
>
> . rebooted. raidctl -F ... , fsck , reboot.
>
> SAME FAILURES as before!! Apache2 and spamass-milter are the first to
> go. In the past I had not noticed these right away and kept running.
>
> This is very strange. I'd really like to get my redundancy back. But
> once again, I'm running on a set of single-component raid-1 arrays.
>
> Here is some other information that may be useful...
>
> Machine - i386
> Problem first noticed at NetBSD-2.0E GENERIC.MP kernel
> Still a problem at NetBSD-2.0G GENERIC.MP kernel
>
> I'm guessing my disk is good. The machine runs great on one disk. Weeks
> of uptime - even months without a peep. So I'm not thinking that there's
> a memory problem as someone suggested earlier.
>
> The only other thing I can think of is perhaps the ribbon cable from the
> board to the disk. But if that was bad, wouldn't we have much more
> obvious issues?
>
> I don't know if this is a config problem, or something else. But there
> definitely is a strange problem that's preventing me from mirroring
> successfully.
>
> Perhaps too many raid devices on one pair of disks?
No.
> Maybe problems with MP kernel and raidframe?
There aren't supposed to be any. I haven't seen anything here that
would suggest that...
> Any help would be great. Please let me know if I can provide more
> information.
The exact apache/milter error messages would be useful. The RAID
config files and 'dmesg' output would also help.
Have you tried isolating which of the RAID sets seems to be causing
the problem?
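
A rough sketch of one way to collect per-set status for that, assuming
the five sets from the layout quoted above (raid0 through raid4). It uses
only 'raidctl -s' and 'dmesg'; the helper names run and collect and the
RUN variable are made up for this illustration. By default it only prints
the commands; set RUN=1 (as root) to actually execute them.

```sh
#!/bin/sh
# Assumption: set names are taken from the layout quoted in this thread.
RAID_SETS="raid0 raid1 raid2 raid3 raid4"

# Hypothetical helper: print the command, or execute it when RUN=1.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        echo "### $*"
        "$@"
    else
        echo "$*"
    fi
}

# Hypothetical helper: status of each set, then the kernel messages.
collect() {
    for r in $RAID_SETS; do
        # raidctl -s reports the component and parity status of one set
        run raidctl -s "$r"
    done
    run dmesg
}

collect
```

Comparing the output set by set, before and after a component is added
back, may show whether one particular set is involved.
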
Later...
Greg Oster