Subject: kern/32018: raidframe reconstruction will panic when new component fails
To: None <kern-bug-people@netbsd.org, gnats-admin@netbsd.org,>
From: Wolfgang Stukenbrock <Wolfgang.Stukenbrock@nagler-company.com>
List: netbsd-bugs
Date: 11/08/2005 14:04:00
>Number: 32018
>Category: kern
>Synopsis: raidframe reconstruction will panic when new component fails
>Confidential: no
>Severity: serious
>Priority: high
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Tue Nov 08 14:04:00 +0000 2005
>Originator: Wolfgang Stukenbrock
>Release: NetBSD 2.0.2
>Organization:
Dr. Nagler & Company GmbH
>Environment:
System: NetBSD s011 2.0.2 NetBSD 2.0.2 (NSW-Webproxy) #10: Mon Jun 13 14:14:26 CEST 2005 wgstuken@s012:/export/netbsd-2.0.2/usr/src/sys/arch/i386/compile/NSW-Webproxy i386
Architecture: i386
Machine: i386
>Description:
A panic will occur during reconstruction (e.g. of a mirror) if there is a problem
writing some blocks to the new device.
A message "raid0: Recon write failed" is printed, followed by a panic in line
880 of rf_reconstruct.c.
This is very bad behaviour. If such an error occurs, the new component should be
marked as failed and the reconstruction as a whole should fail.
There is no need to bring down a running server because a reconstruction failed. The
previous state of the raid-device (degraded mode) is still intact.
The problem is located in dev/raidframe/rf_reconstruct.c.
At line 872 there is the label RF_REVENT_WRITE_FAILED in the event-processing
code, and it falls through into the panic at line 880.
The event RF_REVENT_WRITE_FAILED is set at line 1290 in ReconWriteDoneProc() in
dev/raidframe/rf_reconstruct.c. This is the one and only place where this event
is triggered.
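
The fall-through described above can be modelled with a small, self-contained
sketch. This is not the code from rf_reconstruct.c: only the name
RF_REVENT_WRITE_FAILED and the message text come from this report, while
REVENT_WRITE_DONE, the recon_outcome enum and process_recon_event() are
hypothetical names invented for illustration. The panic is represented by a
return value so that the sketch can run in userland.

```c
#include <stdio.h>

/*
 * Simplified model of the event dispatch described in this report.
 * Only RF_REVENT_WRITE_FAILED is a name taken from the report; every
 * other identifier here is hypothetical.
 */
enum recon_event {
	REVENT_WRITE_DONE,		/* hypothetical: a recon write completed */
	RF_REVENT_WRITE_FAILED		/* named in the report */
};

enum recon_outcome {
	RECON_CONTINUE,			/* keep processing events */
	RECON_PANIC			/* stands in for the kernel panic */
};

static enum recon_outcome
process_recon_event(enum recon_event ev)
{
	switch (ev) {
	case REVENT_WRITE_DONE:
		return RECON_CONTINUE;
	case RF_REVENT_WRITE_FAILED:
		printf("raid0: Recon write failed\n");
		/* FALLTHROUGH: no recovery code, so the panic path is reached */
	default:
		return RECON_PANIC;
	}
}
```

In this model a single failed write on the new component is enough to reach
the panic path, which matches the behaviour reported here.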
>How-To-Repeat:
This is a little complicated, because you need a disk that fails to write
some blocks. If you have such a disk, set up a raid device (e.g. a mirror), fail
one component, and start reconstruction onto the disk with the write problem.
When the blocks that cannot be written are reached, the system will panic.
>Fix:
Add code to the event processing part (around line 872 in rf_reconstruct.c) that
will abort the reconstruction and set the new component to failed.
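
A rough sketch of the intended behaviour, under stated assumptions: the
structure mirror_model, the comp_status values and the function
on_recon_write_failed() are all hypothetical and do not correspond to real
RAIDframe data structures or functions. The sketch only shows the state
change that is wanted: the new component becomes failed, the reconstruction
stops, the surviving component is left alone, and an error is returned
instead of a panic.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical component states; not the real RAIDframe definitions. */
enum comp_status {
	COMP_OPTIMAL,
	COMP_RECONSTRUCTING,
	COMP_FAILED
};

/* Hypothetical model of a two-disk mirror during reconstruction. */
struct mirror_model {
	enum comp_status surviving;	/* the good half of the mirror */
	enum comp_status target;	/* the new component being rebuilt */
	int recon_active;		/* nonzero while reconstruction runs */
};

/*
 * Hypothetical handler for a failed reconstruction write: mark the new
 * component as failed, stop the reconstruction, and report an error to
 * the caller. The surviving component is not touched, so the set stays
 * in the degraded state it had before reconstruction started.
 */
static int
on_recon_write_failed(struct mirror_model *m)
{
	printf("raid0: Recon write failed, aborting reconstruction\n");
	m->target = COMP_FAILED;
	m->recon_active = 0;
	return EIO;
}
```

A real fix would also have to release whatever resources the reconstruction
holds; that part is not modelled here.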
PS. Perhaps something equivalent should be added for read errors. In that case,
the reconstruction has failed and at least one more component of the raid-device
is gone (-> status = failed). I don't know whether such a read error is already
handled somewhere else.
I don't have the time to completely understand the whole raidframe code, so I
cannot provide code that fixes this problem. Sorry.
>Unformatted: