On 16 Feb 2016, at 07:29, Dave Vitek <dvitek%grammatech.com@localhost> wrote:
Hi all,
We have an amd64 NetBSD 6.1.4 (stable) machine that we use as a build server and also for testing. We're seeing an intermittent problem: occasionally, a 4096-byte, 4096-byte-aligned chunk of an archive (.a) file gets overwritten with human-readable text that we recognize as stdout output from another process, one that has nothing to do with building archive files.
This other process runs (potentially concurrently) as a different user, in a far-away directory on the same file system. It can produce quite a bit of output. The text is redirected to a file and/or sent over a socket. Either way, there is no legitimate way for the middle of that log to end up in the middle of this unrelated archive file.
On Feb 10, between 14:41 and 19:00, the .a file was created by ar and ranlib. The linker used the file almost immediately and succeeded; it would certainly have choked if the corruption had been present at that point. A copy of the archive was also made on the same file system for later use. The copy process should not have had access to the text later observed in the file.
Later the same day, between 19:00 and 23:12, the copy of the archive file was read. By this time it contained the damaged page, and the linker complained.
I haven't yet determined when the process that logged the human readable text ran. I may never know.
I have both the original undamaged .o file and the damaged .a file. I used "ar x" to extract the bad object file and did a binary diff of the entire file to find the damaged 4096-byte chunk. The rest of the file is unchanged. The entire archive is about 10 MB.
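For anyone wanting to repeat that comparison, here is a minimal Python sketch of the binary diff. The function name diff_span and the file names in the usage note are hypothetical, and it assumes both object files fit comfortably in memory. It reports the span from the first to the last differing byte, since the damaged page is aligned within the archive rather than within the extracted member.

```python
def diff_span(good, bad):
    """Return (start, end) byte offsets covering every difference between
    two byte strings, with end exclusive, or None if they are identical."""
    n = min(len(good), len(bad))
    first = next((i for i in range(n) if good[i] != bad[i]), None)
    longest = max(len(good), len(bad))
    if first is None:
        # Common prefix matches; any difference is a length mismatch.
        return None if len(good) == len(bad) else (n, longest)
    if len(good) != len(bad):
        # Lengths differ, so the difference runs to the end of the longer one.
        return (first, longest)
    # Scan backwards for the last differing byte.
    last = next(i for i in range(n - 1, -1, -1) if good[i] != bad[i])
    return (first, last + 1)


def diff_span_files(good_path, bad_path):
    """Convenience wrapper that reads both files fully before comparing."""
    with open(good_path, "rb") as g, open(bad_path, "rb") as b:
        return diff_span(g.read(), b.read())
```

Calling diff_span_files("good.o", "bad.o") gives the damaged range inside the member. Adding the member's data offset within the archive to the start of that range gives an offset in the .a file, which can then be checked against a multiple of 4096.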
/var/log/messages shows a handful of these messages overlapping the time period in question:
/netbsd: file: table is full - increase kern.maxfiles or MAXFILES
At the risk of speculating: Are there any known issues with horrible things happening in the kernel when there is file descriptor pressure?
We've also seen software-layer I/O checksum errors intermittently, with the same sort of text overwriting chunks of files. Now that we've also seen it in these .a files, I'm leaning towards blaming the OS.
We're pretty sure the hardware isn't to blame: This machine was originally a virtual machine running the same version of NetBSD, and it had the same problem. Other guests had no problems. It's now a physical machine on completely different hardware and still has the problem. I don't know of a lot of hardware problems that would consistently manifest in this fashion anyway.
There's only one disk on the system:
/dev/sd0a at /
The disk likely always has 500 GB+ of free space. The machine has 16 logical cores and 24 GB of RAM. It's a busy system doing lots of I/O all the time.
There are also a few nfs mounts, but they aren't used much and shouldn't be involved with the data in question.
We have roughly the same setup on many other platforms (Linux, Mac, Solaris, FreeBSD, Windows), none of which have this problem.
I am not yet able to artificially cause the problem to manifest. I am thankful that the file corruption occurs in large enough chunks that the consequences are unlikely to be subtle.
I could start storing checksums alongside the archive files. Let's assume I do that and I discover that the file was OK when it was written, but the checksum no longer matches a couple of hours later. What next?
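To make the checksum idea concrete, here is a rough Python sketch. The ".sha256" sidecar suffix and the function names are assumptions for illustration, not something we run today. One caveat: a checksum computed right after the write is probably served from the buffer cache, so it says what the kernel believed the file contained at that moment, not necessarily what reached the disk.

```python
import hashlib


def sha256_of(path, bufsize=1 << 20):
    """Hash a file in fixed-size blocks so large archives need little memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(bufsize), b""):
            h.update(block)
    return h.hexdigest()


def write_sidecar(path):
    """Record the file's current digest in a hypothetical '<path>.sha256' file."""
    digest = sha256_of(path)
    with open(path + ".sha256", "w") as f:
        f.write(digest + "\n")
    return digest


def verify_sidecar(path):
    """Return True if the file still matches the digest recorded earlier."""
    with open(path + ".sha256") as f:
        expected = f.read().strip()
    return sha256_of(path) == expected
```

write_sidecar would run right after ranlib finishes, and verify_sidecar just before the linker reads the copy. A mismatch would at least narrow the window in which the page was damaged.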
I couldn't find any PRs that looked like this issue, but who knows if my search was any good. Does this sound familiar to anyone?
I could imagine trying several things at this point:
- Turning on assertions in the kernel
- Running in single CPU mode to see if it helps
- Switching file systems
- Trying different versions of NetBSD (6.1.5?)
Suggestions? I suspect I need something that maintains binary compatibility with the 6 series.