Subject: Data corruption with dump (mmap related??)
To: None <port-mips@netbsd.org>
From: Wayne Knowles <w.knowles@niwa.cri.nz>
List: port-mips
Date: 08/25/2000 23:06:21
I have recently uncovered a serious problem with dump corrupting data
that appears to be a -mips related problem.
What is happening is that at random 8k intervals (sometimes 2k or 4k)
4 bytes of the file are getting corrupted. Generally the replacement
data is all 0's.
This problem was first witnessed on NetBSD/mipsco 1.5E, but it can
also be reproduced at will under NetBSD/pmax running 1.5B.
The bug cannot be reproduced on NetBSD/alpha or NetBSD/sparc 1.4.2, which
eliminates dump itself as a possible cause.
This is the script used to reproduce the problem. You will have to set
OUTDIR to a different disk partition from root to avoid problems.
------CUT HERE-----
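#! /bin/sh
OUTDIR=/home/tmp
# Stop if OUTDIR is unusable, so restore never unpacks into the wrong place
mkdir -p "$OUTDIR" || exit 1
cd "$OUTDIR" || exit 1
dump -0f - / | restore -rf -
for F in /sbin/*
do
    echo "file: $F"
    cmp -l "$F" "$OUTDIR/$F"
done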
------CUT HERE-----
When the script runs on a mipsco or pmax system it produces errors like
the following:
file: /sbin/atactl
8189 214 0
8190 306 0
file: /sbin/badsect
file: /sbin/ccdconfig
2045 217 0
2046 231 0
2047 200 0
2048 130 0
24573 24 0
24574 100 0
24575 377 0
24576 272 0
43005 257 0
43006 264 0
43008 340 0
65533 217 0
65534 231 0
65535 202 0
65536 224 0
......
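To make the pattern in this output easier to see, the offsets can be reduced to their position inside a fixed-size block. The following is only a rough sketch: the function name boundary_offsets is made up for illustration, and the 2048-byte default block size is an assumption taken from the smallest interval mentioned above. It reads `cmp -l` output on stdin and, for each differing byte, prints how many bytes remain before the end of its block.

```shell
#! /bin/sh
# boundary_offsets [BLOCKSIZE]
# Hypothetical helper: reads `cmp -l` output on stdin and reports, for each
# differing byte, how many bytes lie between it and the end of its block.
# BLOCKSIZE defaults to 2048 (an assumption, not something dump reports).
boundary_offsets() {
    awk -v block="${1:-2048}" '
        # cmp -l prints: 1-based byte offset, then the two byte values in octal.
        # Lines such as "file: /sbin/atactl" do not match and are skipped.
        NF == 3 && $1 ~ /^[0-9]+$/ {
            off  = $1 - 1                       # convert to a 0-based offset
            left = block - 1 - (off % block)    # bytes after this one in the block
            printf "offset=%d to_block_end=%d\n", $1, left
        }'
}
```

Feeding it the output shown above (for example by piping the script output through boundary_offsets) gives to_block_end values between 0 and 3 for every listed offset, i.e. each differing byte falls within the last 4 bytes of a 2048-byte block.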
I'm pretty confident it is mmap related, as the following patch to dump,
which fills the mmap'ed region with 255 (0xff), also changes the 0 to 255
in the corrupted region:
Index: rcache.c
===================================================================
RCS file: /cvsroot/basesrc/sbin/dump/rcache.c,v
retrieving revision 1.4
diff -u -r1.4 rcache.c
--- rcache.c 1999/10/01 04:35:23 1.4
+++ rcache.c 2000/08/24 22:18:52
@@ -139,6 +139,7 @@
sizeof(struct cdesc) * cachebufs;
memset(shareBuffer, '\0', sharedSize);
+ memset(cdata, (char) 0xff, nblksread *cachebufs*dev_bsize);
}
}
/*-----------------------------------------------------------------------*/
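One way to check the effect of this patch is to tally the replacement byte values in the `cmp -l` output. Note that `cmp -l` prints byte values in octal, so 255 shows up as 377. The sketch below assumes the same output format as shown earlier; the function name replacement_bytes is invented for illustration.

```shell
#! /bin/sh
# replacement_bytes
# Hypothetical helper: reads `cmp -l` output on stdin and prints each distinct
# value found in the third column (the byte in the restored file, in octal)
# together with the number of times it occurs.
replacement_bytes() {
    awk '
        NF == 3 && $1 ~ /^[0-9]+$/ { count[$3]++ }
        END { for (b in count) print b, count[b] }
    ' | sort
}
```

With the unpatched dump the tally should be dominated by 0; with the patch applied the expectation, going by the description above, is that the same positions report 377 instead.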
If the machine is performing other tasks (e.g. large compiles) there is a
higher chance of data corruption. Also, 'dump -k 16' does not corrupt data
whereas the default (-k 32) does. If your test works first time around you
might want to try -r 512 to allocate 512k in the mmap memory segment.
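For anyone comparing several option settings, the reproduction script can be wrapped so that each run prints a single count. This is a sketch only: the function names count_corrupt and try_dump are made up here, and the -k and -r flags are used exactly as described in this message, so check dump(8) on your system before relying on them.

```shell
#! /bin/sh
# count_corrupt SRCDIR RESTOREROOT
# Print how many regular files in SRCDIR differ from (or are missing in)
# their restored copy under RESTOREROOT, using the same $OUTDIR/$F layout
# as the reproduction script.
count_corrupt() {
    src=$1 root=$2 bad=0
    for f in "$src"/*; do
        [ -f "$f" ] || continue
        cmp -s "$f" "$root/$f" || bad=$((bad + 1))
    done
    echo "$bad"
}

# try_dump RESTOREROOT [extra dump flags...]
# Run one dump/restore pass into RESTOREROOT with the given extra flags.
# The body runs in a subshell so the caller's working directory is untouched.
try_dump() (
    root=$1; shift
    mkdir -p "$root" || exit 1
    cd "$root" || exit 1
    dump "$@" -0f - / | restore -rf -
    echo "flags: $*  corrupt files in /sbin: $(count_corrupt /sbin "$root")"
)
```

A run might then look like `try_dump /home/tmp/k16 -k 16` followed by `try_dump /home/tmp/k32 -k 32`, using a separate directory on a non-root partition for each pass.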
I would be interested in hearing reports about other MIPS machines,
in particular those that are R4000 based. If we can cover all of the ports a
better picture might start to emerge as to the cause.
This is not the kind of bug we want lurking on a production system!!!!
Any feedback will be appreciated.
Wayne
--
_____ Wayne Knowles, Systems Manager
/ o \/ National Institute of Water & Atmospheric Research Ltd
\/ v /\ P.O. Box 14-901 Kilbirnie, Wellington, NEW ZEALAND
`---' Email: w.knowles@niwa.cri.nz