Subject: kern/8889: -current LFS corruption
To: None <gnats-bugs@gnats.netbsd.org>
From: None <jbernard@mines.edu>
List: netbsd-bugs
Date: 11/26/1999 15:30:57
>Number: 8889
>Category: kern
>Synopsis: null files, dirty files in clean segments, non-removable directories
>Confidential: no
>Severity: critical
>Priority: high
>Responsible: kern-bug-people (Kernel Bug People)
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Fri Nov 26 15:30:01 1999
>Last-Modified:
>Originator: Jim Bernard
>Organization:
Speaking for myself
>Release: Nov. 25, 1999
>Environment:
1.4P, Nov. 25, 1999, i386
>Description:
LFS is exhibiting several forms of corruption. I don't know exactly
what minimum sequence of events leads to each, so I'll describe the
sequence that has so far caused the problems.
I rebuilt the LFS (3 GB in size) under a Nov. 14 kernel with
userland from the Nov. 13 snapshot, and unpacked a -current source
tree onto the filesystem. No corruption was evident (except for
the appearance of UNREF FILE messages from fsck_lfs, which is
apparently harmless) in a couple of days of relatively light
operation. At some point (I'm not sure exactly when, but I think it was
during this period), I noticed a couple of files with wrong timestamps
(more recent than they should be), but dismissed that as not critical.
I then built a new kernel (sources supped after Nov. 25 supscan; built
in a scratch directory on the LFS, with the source tree, also on LFS,
union mounted beneath the scratch directory), and booted it, with no
immediately apparent problems. I then did a full system build (again
with sources on the LFS filesystem union mounted beneath a scratch
directory on the LFS), with no immediately apparent problems, and
rebooted. Then "fsck_lfs -n -d" reported in phase 1 some 15,000
messages like:
! INO 1318: daddr 0x2779b3 is in clean segment 1263
(The number of these decreased somewhat over a period of a bit less
than 24 hours, down to a minimum of about 7,000, and then rose
slightly.) No other corruption, besides the UNREF FILE messages,
was found by fsck_lfs.
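For anyone trying to track how the count of these messages changes
over time, here is a minimal sketch that tallies them from saved
"fsck_lfs -n -d" output. The pattern is inferred only from the single
sample line quoted above, so the exact message format is an assumption,
and the function name is hypothetical rather than part of any NetBSD
tool.

```python
import re
from collections import Counter

# Pattern inferred from the one sample line in this report (assumption):
#   ! INO 1318: daddr 0x2779b3 is in clean segment 1263
# Real fsck_lfs output may differ between versions.
CLEAN_SEG_RE = re.compile(
    r"INO\s+(\d+):\s+daddr\s+(0x[0-9a-fA-F]+)\s+is in clean segment\s+(\d+)"
)


def tally_clean_segment_messages(lines):
    """Count 'is in clean segment' messages.

    Returns (total, per_segment, inodes), where per_segment is a Counter
    keyed by segment number and inodes is the set of inode numbers seen.
    Lines that do not match the pattern are ignored.
    """
    per_segment = Counter()
    inodes = set()
    total = 0
    for line in lines:
        match = CLEAN_SEG_RE.search(line)
        if match is None:
            continue
        ino, _daddr, seg = match.groups()
        total += 1
        per_segment[int(seg)] += 1
        inodes.add(int(ino))
    return total, per_segment, inodes
```

Passing an open file of saved fsck_lfs output to
tally_clean_segment_messages() on successive runs would show whether
the total is shrinking and which segments account for most of it.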
I then unpacked the latest xsrc tarball, and supped pkgsrc (which was
about a week out of date) and xsrc (all onto the LFS), with no apparent
problems. BTW: at this point, df reported something like 600 MB of
space in use. I then successfully built and installed the tcl80
package (the installed files go on a different filesystem, but the
source tree and build space were on the LFS---no union mount was used
here). An immediately subsequent attempt to build tk80 failed
miserably, because some directories and files in the work subdirectory
(where the source tarball gets unpacked) were null (ls showed all mode
bits off and 0 link count):
---------- 0 someuser somegroup (not sure of the rest)
Furthermore, some directories could not be removed, reminiscent of
the problem reported in kern/8815, which has, however, been fixed
(the current directory-removal problem evidently occurs far less
frequently).
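To look for such null entries without reading ls output by eye, a
small sketch along these lines may help. It treats an entry as "null"
when all permission bits are off and the link count is 0, matching the
symptom described above; whether that test catches every corrupted
entry is an assumption, and the function names are hypothetical.

```python
import os
import stat


def is_null_entry(st):
    """True if a stat result has no permission bits set and 0 links."""
    return stat.S_IMODE(st.st_mode) == 0 and st.st_nlink == 0


def find_null_entries(root):
    """Walk root and return paths whose lstat() looks like a null entry.

    Entries that cannot be stat'ed at all are skipped, not reported.
    """
    suspects = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue
            if is_null_entry(st):
                suspects.append(path)
    return suspects
```

Running find_null_entries() on the package's work subdirectory after
each failed build would give a list to compare between attempts.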
Altogether, I tried building tk80 three times, finding different
problem directories and files each time (only on the second try
were there non-removable directories). The third try crashed the
machine (and I'm miles from the machine at the moment, so can't
get a traceback right now---I'll submit an addendum when I get one).
>How-To-Repeat:
I imagine any use of LFS for a while would eventually lead to these
problems, but the sequence that did it for me is described above.
>Fix:
Unknown.
>Audit-Trail:
>Unformatted: