Subject: stability problems with NetBSD/sparc 1.3.2
To: NetBSD/sparc Discussion List <port-sparc@netbsd.org>
From: Greg A. Woods <woods@most.weird.com>
List: netbsd-users
Date: 10/21/1998 23:26:15
Since upgrading my Sparcstation-2 to 1.3.2 I've got the perception that
the machine is a bit less reliable. With 1.3.1 I don't really remember
any crashes at all that were not due to known bugs. With 1.3.2 I seem
to be suffering unexplained reboots at a frequency that varies somewhere
between weekly and daily (mind you it's only been four total since the
upgrade on Sept 20). The longest uptime was 30 days (following the
first boot of the new kernel), and the shortest time between crashes was
less than 12 hours. Once the system was hung and didn't reboot itself
even after waiting about 15 minutes. I guess that leaves only two
totally spontaneous reboots. Maybe I'm jumping the gun, but when they're
less than 24 hours apart on a machine that should be seeing uptimes of
months I'm somewhat perplexed.
Unfortunately except for the time it was hung I've never made it down to
the console terminal fast enough yet to see what the heck might be
printed on it. The time the system appeared hung there were no abnormal
messages on the screen and the symptoms seemed to be due to something in
the filesystem I/O -- there was no response except to ping and there was
no apparent disk activity. An attempt to "sync" from the firmware just
gave a "syncing... " message and froze with no disk activity. However
the SCSI bus must not have been hung because syslog was able to fsync
the "stopping on keyboard abort" message (and I did see the disk
activity light blink when this happened).
To date these crashes do not leave a core dump (savecore: no core dump)
even though my kernel should be configured properly to do a dump.
The console is a 19200bps rs232 terminal on ttya, and I do have
watchdog-reboot?=true set in the eeprom.
These crashes seem to be "triggered" by heavy activity of one sort or
another, but nothing specific has jumped out as an obvious trigger and I
can't seem to cause them on demand. There have been no hardware errors
reported by the kernel to date either.
I've not been paying quite close enough attention to the commits going
into the preparation of 1.3.3 to notice if anything that might be
causing problems such as this has already been fixed....
Even worse is the fact that these server crashes seem to cause my
diskless client to get its I/O system into a lock-up state (it's a
sparc-1, also running 1.3.2, sharing /usr with the server). The X clock
keeps running, the caps-lock and num-lock lights toggle, but anything
that might cause/require NFS activity freezes or is already frozen. The
machine is pingable, but doesn't answer connections. To date I've found
no recourse but to reboot the workstation. I suspect this is because I
mount my partitions with '-T' (i.e. NFS over TCP). I'll take this out
to see if anything improves at the next crash. And I thought TCP mounts
would make things more reliable and easier to recover! :-(
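For reference, the change described above amounts to dropping '-T' from the mount_nfs invocation. The following is only a sketch: the server name is a made-up placeholder, the run helper and the mount_usr function are illustrative names, and by default the commands are printed rather than executed (set DRYRUN=0 to actually run them as root).

```sh
#!/bin/sh
# Sketch only: compare mounting /usr over TCP ('-T') and over the default UDP.
# SERVER is a hypothetical placeholder, not the real server name.
SERVER=${SERVER:-server.example.com}

# Print the command when DRYRUN is 1 (the default); otherwise execute it.
run() {
    if [ "${DRYRUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# mount_usr tcp|udp -- mount the shared /usr with the chosen transport.
mount_usr() {
    case "$1" in
        tcp) run mount_nfs -T "${SERVER}:/usr" /usr ;;  # NFS over TCP
        udp) run mount_nfs "${SERVER}:/usr" /usr ;;     # no -T: UDP transport
        *)   echo "usage: mount_usr tcp|udp" >&2; return 1 ;;
    esac
}

mount_usr udp
```

Whether the UDP mount recovers any better after a server crash is exactly what remains to be seen; the sketch only shows the two invocations side by side.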
Which reminds me: the diskless client seems to maintain its dmesg
buffer with a manual reboot, but the server's buffer seems to have been
cleared after a crash and automatic reboot.
Lastly I received this rather cryptic error message in my diskless
client's xconsole, presumably from some application that I started after
logging in (it comes after a message I customarily see when starting
emacs with rsh to the server):
ERROR! ERROR! ERROR! YOU SHOULD NOT BE HERE!!!
I've poked around in most of the obvious places (emacs sources, xterm
sources, etc.), but as yet have not found where it could have come from.
Something weird about this rings a tiny almost silent bell in the back
of my mind, but I don't know what it is..... I should have installed
glimpse long ago, I guess. Any clues would be appreciated....
Speaking of reboots -- I still seem to have to manually publish the arp
entry for my diskless workstation else it can't seem to boot itself more
than once. It gets stuck with an "(incomplete)" entry. If I delete it
and manually publish it while the client is hanging waiting for the TFTP
to start up, everything magically comes to life and all is well.
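The manual workaround described above can be sketched roughly as follows. The client name and Ethernet address are made-up placeholders, the run helper and republish_arp function are illustrative names, and by default the commands are only printed (set DRYRUN=0 to run them for real as root on the server).

```sh
#!/bin/sh
# Sketch only: delete the stale "(incomplete)" ARP entry for the diskless
# client and publish a permanent one in its place.
# CLIENT and ETHER are hypothetical placeholders.
CLIENT=${CLIENT:-client.example.com}
ETHER=${ETHER:-08:00:20:00:00:01}

# Print the command when DRYRUN is 1 (the default); otherwise execute it.
run() {
    if [ "${DRYRUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

republish_arp() {
    run arp -d "$CLIENT"                # drop the incomplete entry
    run arp -s "$CLIENT" "$ETHER" pub   # publish a static entry for the client
}

republish_arp
```

This only automates the workaround; it does not explain why the entry gets stuck as "(incomplete)" in the first place.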
--
Greg A. Woods
+1 416 218-0098 VE3TCP <gwoods@acm.org> <robohack!woods>
Planix, Inc. <woods@planix.com>; Secrets of the Weird <woods@weird.com>
--kcLsSdV5ll
Content-Type: text/plain
Content-Description: dmesg output from most recent reboot
Content-Disposition: inline;
filename="dmesg.out"
Content-Transfer-Encoding: 7bit
NetBSD 1.3.2 (MOST) #0: Sun Sep 20 01:28:07 EDT 1998
woods@most:/usr/src-1.3.2/sys/arch/sparc/compile/MOST
real mem = 67022848
avail mem = 63967232
using 128 buffers containing 524288 bytes of memory
bootpath: /sbus0/esp@0,800000/sd@1,0
mainbus0 (root): SUNW,Sun 4/75
cpu0 at mainbus0: cache chip bug; trap page uncached: CY7C601 @ 40 MHz, TMS390C602A FPU
cpu0: 64K byte write-through, 32 bytes/line, hw flush: cache enabled
memreg0 at mainbus0 ioaddr 0xf4000000
clock0 at mainbus0 ioaddr 0xf2000000: mk48t02 (eeprom)
timer0 at mainbus0 ioaddr 0xf3000000 delay constant 17
auxreg0 at mainbus0 ioaddr 0xf7400003
zs0 at mainbus0 ioaddr 0xf1000000 pri 12, softpri 6
zstty0 at zs0 channel 0 (console)
zstty1 at zs0 channel 1
zs1 at mainbus0 ioaddr 0xf0000000 pri 12, softpri 6
kbd0 at zs1 channel 0
ms0 at zs1 channel 1
audioamd0 at mainbus0 ioaddr 0xf7201000 pri 13, softpri 4
audio0 at audioamd0
sbus0 at mainbus0 ioaddr 0xf8000000: clock = 20 MHz
dma0 at sbus0 slot 0 offset 0x400000: rev 1+
esp0 at sbus0 slot 0 offset 0x800000 pri 3: ESP100A, 20MHz, SCSI ID 7
scsibus0 at esp0: 8 targets
probe(esp0:0:0): max sync rate 4.03Mb/s
sd3 at scsibus0 targ 0 lun 0: <SEAGATE, ST32430N, 0300> SCSI2 0/direct fixed
sd3: 2049MB, 3992 cyl, 9 head, 116 sec, 512 bytes/sect x 4197405 sectors
probe(esp0:1:0): max sync rate 4.03Mb/s
sd1 at scsibus0 targ 1 lun 0: <CONNER, CFP2105S 2.14GB, 172A> SCSI2 0/direct fixed
sd1: 2048MB, 3940 cyl, 10 head, 106 sec, 512 bytes/sect x 4194304 sectors
probe(esp0:2:0): max sync rate 4.03Mb/s
sd2 at scsibus0 targ 2 lun 0: <CONNER, CFP2105S 2.14GB, 172A> SCSI2 0/direct fixed
sd2: 2048MB, 3940 cyl, 10 head, 106 sec, 512 bytes/sect x 4194304 sectors
esp:3: async
sd0 at scsibus0 targ 3 lun 0: <QUANTUM, FIREBALL SE4.3S, PJ09> SCSI2 0/direct fixed
sd0: 4110MB, 7637 cyl, 4 head, 19 sec, 512 bytes/sect x 8418816 sectors
st0 at scsibus0 targ 4 lun 0: <EXABYTE, EXB-8200, 425A> SCSI1 1/sequential removable
st0: density code 0x0, 1024-byte blocks, write-enabled
st1 at scsibus0 targ 5 lun 0: <ARCHIVE, VIPER 2525 25462, -007> SCSI1 1/sequential removable
st1: rogue, drive empty
cd0 at scsibus0 targ 6 lun 0: <TOSHIBA, CD-ROM XM-3301TA, 0272> SCSI2 5/cdrom removable
le0 at sbus0 slot 0 offset 0xc00000 pri 5: address 08:00:20:0e:92:5f
le0: 8 receive buffers, 2 transmit buffers
fdc0 at mainbus0 ioaddr 0xf7200000 pri 11, softpri 4: chip 82072
root on sd1a dumps on sd1b
root file system type: ffs
--kcLsSdV5ll
Content-Type: text/plain
Content-Description: my kernel configuration
Content-Disposition: inline;
filename="MOST"
Content-Transfer-Encoding: 7bit
include "arch/sparc/conf/std.sparc"
maxusers 64
makeoptions DEBUG="-g"
options SUN4C # sun4c - SS1, 1+, 2, ELC, SLC, IPC, IPX, etc.
options KTRACE # system call tracing
options SYSVMSG # System V message queues
options SYSVSEM # System V semaphores
options SYSVSHM # System V shared memory
options MAXUPRC=128 # max procs per user (RLIMIT_NPROC)
options DIAGNOSTIC # extra kernel sanity checking
options SCSIVERBOSE # Verbose SCSI errors
options COMPAT_43 # 4.3BSD system interfaces
options COMPAT_10 # NetBSD 1.0 binary compatibility
options COMPAT_11 # NetBSD 1.1 binary compatibility
options COMPAT_12 # NetBSD 1.2 binary compatibility
options COMPAT_SUNOS # SunOS 4.x binary compatibility
options COMPAT_SVR4 # SunOS 5.x binary compatibility
options EXEC_ELF32 # Exec module for SunOS 5.x binaries.
file-system FFS # Berkeley Fast Filesystem
file-system NFS # Sun NFS-compatible filesystem client
file-system KERNFS # kernel data-structure filesystem
file-system NULLFS # NULL layered filesystem
file-system MFS # memory-based filesystem
file-system FDESC # user file descriptor filesystem
file-system PORTAL # portal filesystem (still experimental)
file-system PROCFS # /proc
file-system CD9660 # ISO 9660 + Rock Ridge file system
file-system UNION # union file system
file-system MSDOSFS # MS-DOS FAT filesystem(s).
options NFSSERVER # Sun NFS-compatible filesystem server
options QUOTA # FFS quotas
options FIFO # POSIX fifo support (in all filesystems)
options INET # IP stack
options GATEWAY # IP packet forwarding
options MROUTING # packet forwarding of multicast packets
options PFIL_HOOKS # pfil(9) packet filter hooks.
options IPFILTER_LOG # enables logging in ip-filter.
options BLINK # blink the led on supported machines
config netbsd root on ? type ?
mainbus0 at root
cpu0 at mainbus0
sbus0 at mainbus0 # sun4c
obio0 at mainbus0 # sun4 and sun4m
audioamd0 at mainbus0 # sun4c
audio* at audioamd0
auxreg0 at mainbus0 # sun4c
clock0 at mainbus0 # sun4c
memreg0 at mainbus0 # sun4c
timer0 at mainbus0 # sun4c
zs0 at mainbus0 # sun4c
zstty0 at zs0 channel 0 # ttya
zstty1 at zs0 channel 1 # ttyb
zs1 at mainbus0 # sun4c
kbd0 at zs1 channel 0 # keyboard
ms0 at zs1 channel 1 # mouse
dma0 at sbus0 slot ? offset ? # on-board SCSI
esp0 at sbus0 slot ? offset ? flags 0x0000 # sun4c
dma* at sbus? slot ? offset ? # SBus SCSI
esp* at sbus? slot ? offset ? flags 0x0000 # two flavours
esp* at dma? flags 0x0000 # depending on model
isp* at sbus? slot ? offset ?
le0 at sbus0 slot ? offset ? # sun4c on-board
le* at sbus? slot ? offset ?
bwtwo0 at sbus0 slot ? offset ? # sun4c on-board
bwtwo* at sbus? slot ? offset ? # sun4c and sun4m
scsibus* at esp?
scsibus* at isp?
sd0 at scsibus? target 3 lun ? # first SCSI disk
sd1 at scsibus? target 1 lun ? # second SCSI disk
sd2 at scsibus? target 2 lun ? # third SCSI disk
sd3 at scsibus? target 0 lun ? # fourth SCSI disk
st* at scsibus? target ? lun ? # SCSI tapes
cd* at scsibus? target ? lun ? # SCSI CD-ROMs
ch* at scsibus? target ? lun ? # SCSI changer devices
fdc0 at mainbus0 # sun4c controller
fd* at fdc0 # the drive itself
pseudo-device loop # loopback interface; required
pseudo-device pty 64 # pseudo-ttys (for network, etc.)
pseudo-device ppp 2 # PPP interfaces
pseudo-device tun 4 # Network "tunnel" device
pseudo-device bpfilter 16 # Berkeley Packet Filter
pseudo-device vnd 4 # disk-like interface to files
pseudo-device ipfilter # ip filter
--kcLsSdV5ll--