Subject: port-mips/29395: r5k (specific?) cache problem / data corruption
To: None <port-mips-maintainer@netbsd.org, gnats-admin@netbsd.org,>
From: Markus W Kilbinger <kilbi@rad.rwth-aachen.de>
List: netbsd-bugs
Date: 02/15/2005 22:18:00
>Number: 29395
>Category: port-mips
>Synopsis: r5k (specific?) cache problem / data corruption
>Confidential: no
>Severity: critical
>Priority: high
>Responsible: port-mips-maintainer
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Tue Feb 15 22:18:00 +0000 2005
>Originator: kilbi@rad.rwth-aachen.de
>Release: NetBSD 2.99.15, netbsd-2... branch also
>Organization:
>Environment:
System: NetBSD qube 2.99.15 NetBSD 2.99.15 (QUBE) #3: Tue Feb 15 17:04:08 MET 2005 kilbi@qie:/usr/src/sys/arch/cobalt/compile/QUBE cobalt
Architecture: cobalt
Machine: mipsel
>Description:
For several months now (as long as I have owned a qube2) I have
been observing some kind of data corruption on my qube2 when
handling 'larger' amounts of data / disk access. I first noticed
it while installing new userland *.tgz sets, which left me with a
non-working system (libc.so... was corrupted).
In a subsequent, more systematic approach, disk access (viaide,
wd0) was always involved whenever data corruption occurred. I did
not notice these corruptions in pure network traffic (I use the
qube2 as a router) or RAM access (pkgsrc/sysutils/memtester
showed no errors).
The corruptions always occur on 32-byte boundaries and in 32-byte
chunks (see below), which matches my qube2's cache line size very
well:
cpu0 at mainbus0: QED RM5200 CPU (0x28a0) Rev. 10.0 with built-in FPU Rev. 10.0
cpu0: 32KB/32B 2-way set-associative L1 Instruction cache, 48 TLB entries
cpu0: 32KB/32B 2-way set-associative write-back L1 Data cache
The data corruption seems to occur both while reading from and
while writing to disk (see below).
As Izumi Tsutsui <tsutsui@ceres.dti.ne.jp> noted, this problem
seems to affect other platforms with the same CPU type as well
(an R5000 O2 sgimips in his case):
http://mail-index.netbsd.org/port-mips/2005/02/14/0000.html
The problem can be diminished (but not avoided!) by:
- Putting some additional CPU load onto my qube2: e.g. for
installing new *.tgz sets I run 'nice pax -zvrpe ...' over an
ssh connection, so that pax's '-v' verbose output produces
some additional load, which prevents most file corruptions.
- Compiling the kernel with higher optimization (-O3 -mtune=r5000
-mips2).
>How-To-Repeat:
First I copied a 100 MB file multiple times onto the machine's
hard disk and compared each copy ('cmp') with the original file.
This revealed the above-mentioned mismatches, quite randomly
spread, on 32-byte boundaries and of 32-byte size.
Repeating just the 'cmp' runs (without copying the file again)
revealed different mismatches between the files from time to
time.
On the advice of Chuck Silvers <chuq@chuq.com> I wrote a small
pattern generator (a C program) which generates/writes large
files containing consecutive int (4-byte) numbers, to better
distinguish whether the corruption is write- and/or read-related.
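The generator is roughly of the following form (a sketch
reconstructed for this report, not the exact program; the file
name and size are only example values):

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	const char *path = "pattern.dat";		/* example name */
	const uint32_t nints = 25 * 1024 * 1024;	/* 100 MB of 4-byte words */
	FILE *fp;
	uint32_t i;

	fp = fopen(path, "wb");
	if (fp == NULL) {
		perror("fopen");
		return 1;
	}
	/* Write consecutive 32-bit numbers 0, 1, 2, ... to the file. */
	for (i = 0; i < nints; i++) {
		if (fwrite(&i, sizeof(i), 1, fp) != 1) {
			perror("fwrite");
			return 1;
		}
	}
	if (fclose(fp) != 0) {
		perror("fclose");
		return 1;
	}
	return 0;
}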
In this scenario all data corruptions occurred while writing the
data to disk. Reading a single file back and comparing it against
the consecutive numbering has shown no data corruption so far, in
contrast to 'cmp' (two files open simultaneously), which did show
data corruptions.
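The read-back check is the mirror image (again only a sketch); it
prints one line per mismatching 4-byte word, in the same
'expected: read' format as the dump below:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int
main(void)
{
	const char *path = "pattern.dat";	/* example name */
	FILE *fp;
	uint32_t expect = 0, got;

	fp = fopen(path, "rb");
	if (fp == NULL) {
		perror("fopen");
		return 1;
	}
	/* Every 4-byte word should contain its own index. */
	while (fread(&got, sizeof(got), 1, fp) == 1) {
		if (got != expect)
			printf("%08" PRIx32 ": %08" PRIx32 "\n", expect, got);
		expect++;
	}
	fclose(fp);
	return 0;
}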
Following is a listing of all data corruptions that occurred
after writing a 100 MB file with my pattern generator. The first
column shows the expected/generated consecutive number (4-byte
int, starting with 00000000), the second the corrupted value read
back from the test file. Each line therefore represents 4 bytes
of data:
008aebf0: 00000000
008aebf1: 00000000
008aebf2: 00000000
008aebf3: 00000000
008aebf4: 00000000
008aebf5: 00000000
008aebf6: 00000000
008aebf7: cc4abee8
00a44a00: deadbeef
00a44a01: 8b6ef400
00a44a02: 8b6efc04
00a44a03: 01a24a09
00a44a04: 0204000c
00a44a05: 00002e2e
00a44a06: 01a27289
00a44a07: 08080014
00a8aa00: deadbeef
00a8aa01: 8b347c00
00a8aa02: 8bada7ec
00a8aa03: 01262041
00a8aa04: 0204000c
00a8aa05: 00002e2e
00a8aa06: 01267b16
00a8aa07: 0f080018
00bdc100: deadbeef
00bdc101: 8789dc00
00bdc102: 8aa2c894
00bdc103: 01a24a03
00bdc104: 0204000c
00bdc105: 00002e2e
00bdc106: 01a24b6f
00bdc107: 07080010
00c7f400: deadbeef
00c7f401: 00000000
00c7f402: 8fe6d06c
00c7f403: 00000002
00c7f404: 0204000c
00c7f405: 00002e2e
00c7f406: 00030601
00c7f407: 0304000c
00da23f0: 01000a02
00da23f1: 0000000b
00da23f2: 69736f70
00da23f3: 635f3278
00da23f4: 6e69625f
00da23f5: 00000064
00da23f6: 00000000
00da23f7: cc4abee8
00df87f0: 00c1af00
00df87f1: 00c1aff1
00df87f2: 00c1aff2
00df87f3: 00c1aff3
00df87f4: 00c1aff4
00df87f5: 00c1aff5
00df87f6: 00000000
00df87f7: cc4abee8
00e66ff0: 008c1ff0
00e66ff1: 008c1ff1
00e66ff2: 008c1ff2
00e66ff3: 008c1ff3
00e66ff4: 008c1ff4
00e66ff5: 008c1ff5
00e66ff6: 80f89fd8
00e66ff7: cc4abee8
If I understood Chuck correctly, he suspects some kind of
interaction problem between bus_dma and the R5k cache handling
(missing cache (line) invalidation?).
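To illustrate the suspected spot (my own sketch of the general
idea, not code from the viaide/wd driver; the names below are
placeholders): a driver doing a DMA read is expected to bracket
the transfer with bus_dmamap_sync() calls, and a missing or
incomplete POSTREAD invalidation would leave stale 32-byte R5k
cache lines, exactly the pattern shown above:

#include <machine/bus.h>

void
dma_read_sketch(bus_dma_tag_t tag, bus_dmamap_t map, bus_size_t len)
{
	/* Before the device starts writing into the buffer. */
	bus_dmamap_sync(tag, map, 0, len, BUS_DMASYNC_PREREAD);

	/* ... program the controller and wait for the transfer ... */

	/*
	 * After the transfer the D-cache lines covering the buffer
	 * must be invalidated.  If this step is skipped, or a 32-byte
	 * line is only partially covered, the CPU keeps serving stale
	 * data in 32-byte chunks, matching the corruption seen above.
	 */
	bus_dmamap_sync(tag, map, 0, len, BUS_DMASYNC_POSTREAD);
}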
>Fix:
n/a
>Unformatted: