Subject: stability problems with NetBSD/sparc 1.3.2
To: NetBSD/sparc Discussion List <port-sparc@netbsd.org>
From: Greg A. Woods <woods@most.weird.com>
List: netbsd-users
Date: 10/21/1998 23:26:15
Since upgrading my Sparcstation-2 to 1.3.2 I've got the perception that
the machine is a bit less reliable.  With 1.3.1 I don't really remember
any crashes at all that were not due to known bugs.  With 1.3.2 I seem
to be suffering unexplained reboots at a frequency that varies somewhere
between weekly and daily (mind you it's only been four total since the
upgrade on Sept 20).  The longest uptime was 30 days (following the
first boot of the new kernel), and the shortest time between crashes was
less than 12 hours.  Once the system was hung and didn't reboot itself
even after waiting about 15 minutes.  I guess that leaves only two
totally spontaneous reboots.  Maybe I'm jumping the gun, but when they're
less than 24 hours apart on a machine that should be seeing uptimes of
months I'm somewhat perplexed.

Unfortunately except for the time it was hung I've never made it down to
the console terminal fast enough yet to see what the heck might be
printed on it.  The time the system appeared hung there were no abnormal
messages on the screen and the symptoms seemed to be due to something in
the filesystem I/O -- there was no response except to ping and there was
no apparent disk activity.  An attempt to "sync" from the firmware just
gave a "syncing... " message and froze with no disk activity.  However
the SCSI bus must not have been hung because syslog was able to fsync
the "stopping on keyboard abort" message (and I did see the disk
activity light blink when this happened).

To date these crashes do not leave a core dump (savecore: no core dump)
even though my kernel should be configured properly to do a dump.

The console is a 19200bps rs232 terminal on ttya, and I do have
watchdog-reboot?=true set in the eeprom.

These crashes seem to be "triggered" by heavy activity of one sort or
another, but nothing specific has jumped out as an obvious trigger and I
can't seem to cause them on demand.  There have been no hardware errors
reported by the kernel to date either.

I've not been paying quite close enough attention to the commits going
into the preparation of 1.3.3 to notice if anything that might be
causing problems such as this has already been fixed....

Even worse is the fact that these server crashes seem to cause my
diskless client to get its I/O system into a lock-up state (it's a
sparc-1, also running 1.3.2, sharing /usr with the server).  The X clock
keeps running, the caps-lock and num-lock lights toggle, but anything
that might cause/require NFS activity freezes or is already frozen.  The
machine is pingable, but doesn't answer connections.  To date I've found
no recourse but to reboot the workstation.  I suspect this is because I
mount my partitions with '-T' (i.e. NFS over TCP).  I'll take this out
to see if anything improves at the next crash.  And I thought TCP mounts
would make things more reliable and easier to recover!  :-(

Which reminds me:  the diskless client seems to maintain its dmesg
buffer with a manual reboot, but the server's buffer seems to have been
cleared after a crash and automatic reboot.

Lastly I received this rather cryptic error message in my diskless
client's xconsole, presumably from some application that I started after
logging in (it comes after a message I customarily see when starting
emacs with rsh to the server):

	ERROR! ERROR! ERROR! YOU SHOULD NOT BE HERE!!!

I've poked around in most of the obvious places (emacs sources, xterm
sources, etc.), but as yet have not found where it could have come from.
Something weird about this rings a tiny almost silent bell in the back
of my mind, but I don't know what it is.....  I should have installed
glimpse long ago, I guess.  Any clues would be appreciated....

Speaking of reboots -- I still seem to have to manually publish the arp
entry for my diskless workstation else it can't seem to boot itself more
than once.  It gets stuck with an "(incomplete)" entry.  If I delete it
and manually publish it while the client is hanging waiting for the TFTP
to start up, everything magically comes to life and all is well.
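
For reference, the manual workaround amounts to roughly the following
sketch.  The host name and Ethernet address are placeholders, not the
real values for my workstation, and the `run` wrapper and the APPLY
variable are only there for illustration: by default the script just
prints the arp(8) commands, and it runs them (as root) only when
APPLY=1 is set.

```shell
#!/bin/sh
# Sketch of the manual ARP workaround.  CLIENT and ETHER are
# placeholders, not the real name or address of the workstation.
CLIENT=${CLIENT:-diskless-client}
ETHER=${ETHER:-08:00:20:00:00:01}

# Print each command; execute it only when APPLY=1 is set (needs root).
run() {
    echo "+ $*"
    if [ "${APPLY:-0}" = 1 ]; then
        "$@"
    fi
}

republish_arp() {
    run arp -d "$CLIENT"                # drop the stale "(incomplete)" entry
    run arp -s "$CLIENT" "$ETHER" pub   # publish a static entry for the client
}

republish_arp
```

Run on the server while the client is sitting waiting for TFTP, this
is the delete-and-republish step described above.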

-- 
							Greg A. Woods

+1 416 218-0098      VE3TCP      <gwoods@acm.org>      <robohack!woods>
Planix, Inc. <woods@planix.com>; Secrets of the Weird <woods@weird.com>


--kcLsSdV5ll
Content-Type: text/plain
Content-Description: dmesg output from most recent reboot
Content-Disposition: inline;
	filename="dmesg.out"
Content-Transfer-Encoding: 7bit

NetBSD 1.3.2 (MOST) #0: Sun Sep 20 01:28:07 EDT 1998
    woods@most:/usr/src-1.3.2/sys/arch/sparc/compile/MOST
real mem = 67022848
avail mem = 63967232
using 128 buffers containing 524288 bytes of memory
bootpath: /sbus0/esp@0,800000/sd@1,0
mainbus0 (root): SUNW,Sun 4/75
cpu0 at mainbus0: cache chip bug; trap page uncached: CY7C601 @ 40 MHz, TMS390C602A FPU
cpu0: 64K byte write-through, 32 bytes/line, hw flush: cache enabled
memreg0 at mainbus0 ioaddr 0xf4000000
clock0 at mainbus0 ioaddr 0xf2000000: mk48t02 (eeprom)
timer0 at mainbus0 ioaddr 0xf3000000 delay constant 17
auxreg0 at mainbus0 ioaddr 0xf7400003
zs0 at mainbus0 ioaddr 0xf1000000 pri 12, softpri 6
zstty0 at zs0 channel 0 (console)
zstty1 at zs0 channel 1
zs1 at mainbus0 ioaddr 0xf0000000 pri 12, softpri 6
kbd0 at zs1 channel 0
ms0 at zs1 channel 1
audioamd0 at mainbus0 ioaddr 0xf7201000 pri 13, softpri 4
audio0 at audioamd0
sbus0 at mainbus0 ioaddr 0xf8000000: clock = 20 MHz
dma0 at sbus0 slot 0 offset 0x400000: rev 1+
esp0 at sbus0 slot 0 offset 0x800000 pri 3: ESP100A, 20MHz, SCSI ID 7
scsibus0 at esp0: 8 targets
probe(esp0:0:0): max sync rate 4.03Mb/s
sd3 at scsibus0 targ 0 lun 0: <SEAGATE, ST32430N, 0300> SCSI2 0/direct fixed
sd3: 2049MB, 3992 cyl, 9 head, 116 sec, 512 bytes/sect x 4197405 sectors
probe(esp0:1:0): max sync rate 4.03Mb/s
sd1 at scsibus0 targ 1 lun 0: <CONNER, CFP2105S  2.14GB, 172A> SCSI2 0/direct fixed
sd1: 2048MB, 3940 cyl, 10 head, 106 sec, 512 bytes/sect x 4194304 sectors
probe(esp0:2:0): max sync rate 4.03Mb/s
sd2 at scsibus0 targ 2 lun 0: <CONNER, CFP2105S  2.14GB, 172A> SCSI2 0/direct fixed
sd2: 2048MB, 3940 cyl, 10 head, 106 sec, 512 bytes/sect x 4194304 sectors
esp:3: async
sd0 at scsibus0 targ 3 lun 0: <QUANTUM, FIREBALL SE4.3S, PJ09> SCSI2 0/direct fixed
sd0: 4110MB, 7637 cyl, 4 head, 19 sec, 512 bytes/sect x 8418816 sectors
st0 at scsibus0 targ 4 lun 0: <EXABYTE, EXB-8200, 425A> SCSI1 1/sequential removable
st0: density code 0x0, 1024-byte blocks, write-enabled
st1 at scsibus0 targ 5 lun 0: <ARCHIVE, VIPER 2525 25462, -007> SCSI1 1/sequential removable
st1: rogue, drive empty
cd0 at scsibus0 targ 6 lun 0: <TOSHIBA, CD-ROM XM-3301TA, 0272> SCSI2 5/cdrom removable
le0 at sbus0 slot 0 offset 0xc00000 pri 5: address 08:00:20:0e:92:5f
le0: 8 receive buffers, 2 transmit buffers
fdc0 at mainbus0 ioaddr 0xf7200000 pri 11, softpri 4: chip 82072
root on sd1a dumps on sd1b
root file system type: ffs

--kcLsSdV5ll
Content-Type: text/plain
Content-Description: my kernel configuration
Content-Disposition: inline;
	filename="MOST"
Content-Transfer-Encoding: 7bit

include "arch/sparc/conf/std.sparc"

maxusers	64

makeoptions	DEBUG="-g"

options 	SUN4C		# sun4c - SS1, 1+, 2, ELC, SLC, IPC, IPX, etc.

options 	KTRACE		# system call tracing
options 	SYSVMSG		# System V message queues
options 	SYSVSEM		# System V semaphores
options 	SYSVSHM		# System V shared memory
options 	MAXUPRC=128	# max procs per user (RLIMIT_NPROC)

options 	DIAGNOSTIC	# extra kernel sanity checking
options 	SCSIVERBOSE	# Verbose SCSI errors

options 	COMPAT_43	# 4.3BSD system interfaces
options 	COMPAT_10	# NetBSD 1.0 binary compatibility
options 	COMPAT_11	# NetBSD 1.1 binary compatibility
options 	COMPAT_12	# NetBSD 1.2 binary compatibility
options 	COMPAT_SUNOS	# SunOS 4.x binary compatibility
options 	COMPAT_SVR4	# SunOS 5.x binary compatibility
options 	EXEC_ELF32	# Exec module for SunOS 5.x binaries.

file-system	FFS		# Berkeley Fast Filesystem
file-system	NFS		# Sun NFS-compatible filesystem client
file-system	KERNFS		# kernel data-structure filesystem
file-system	NULLFS		# NULL layered filesystem
file-system	MFS		# memory-based filesystem
file-system	FDESC		# user file descriptor filesystem
file-system	PORTAL		# portal filesystem (still experimental)
file-system	PROCFS		# /proc
file-system	CD9660		# ISO 9660 + Rock Ridge file system
file-system	UNION		# union file system
file-system	MSDOSFS		# MS-DOS FAT filesystem(s).

options 	NFSSERVER	# Sun NFS-compatible filesystem server
options 	QUOTA		# FFS quotas
options 	FIFO		# POSIX fifo support (in all filesystems)

options 	INET		# IP stack
options 	GATEWAY		# IP packet forwarding
options 	MROUTING	# packet forwarding of multicast packets
options 	PFIL_HOOKS	# pfil(9) packet filter hooks.
options 	IPFILTER_LOG	# enables logging in ip-filter.

options 	BLINK		# blink the led on supported machines

config		netbsd	root on ? type ?

mainbus0 at root
cpu0	at mainbus0

sbus0	at mainbus0				# sun4c
obio0	at mainbus0				# sun4 and sun4m

audioamd0	at mainbus0				# sun4c
audio*	at audioamd0

auxreg0	at mainbus0				# sun4c

clock0	at mainbus0				# sun4c

memreg0	at mainbus0				# sun4c

timer0	at mainbus0				# sun4c

zs0	at mainbus0					# sun4c
zstty0	at zs0 channel 0	# ttya
zstty1	at zs0 channel 1	# ttyb

zs1	at mainbus0					# sun4c
kbd0	at zs1 channel 0	# keyboard
ms0	at zs1 channel 1	# mouse

dma0	at sbus0 slot ? offset ?			# on-board SCSI
esp0	at sbus0 slot ? offset ? flags 0x0000		# sun4c

dma*	at sbus? slot ? offset ?			# SBus SCSI
esp*	at sbus? slot ? offset ? flags 0x0000		# two flavours
esp*	at dma? flags 0x0000				# depending on model

isp*	at sbus? slot ? offset ?

le0	at sbus0 slot ? offset ?			# sun4c on-board
le*	at sbus? slot ? offset ?

bwtwo0	at sbus0 slot ? offset ?		# sun4c on-board
bwtwo*	at sbus? slot ? offset ?		# sun4c and sun4m
scsibus* at esp?
scsibus* at isp?

sd0	at scsibus? target 3 lun ?		# first SCSI disk
sd1	at scsibus? target 1 lun ?		# second SCSI disk
sd2	at scsibus? target 2 lun ?		# third SCSI disk
sd3	at scsibus? target 0 lun ?		# fourth SCSI disk
st*	at scsibus? target ? lun ?		# SCSI tapes
cd*	at scsibus? target ? lun ?		# SCSI CD-ROMs
ch*	at scsibus? target ? lun ?		# SCSI changer devices

fdc0	at mainbus0				# sun4c controller
fd*	at fdc0					# the drive itself

pseudo-device	loop			# loopback interface; required
pseudo-device	pty		64	# pseudo-ttys (for network, etc.)
pseudo-device	ppp		2	# PPP interfaces
pseudo-device	tun		4	# Network "tunnel" device
pseudo-device	bpfilter	16	# Berkeley Packet Filter
pseudo-device	vnd		4	# disk-like interface to files
pseudo-device	ipfilter		# ip filter

--kcLsSdV5ll--