Subject: Ultra 30 almost working
To: None <port-sparc64@netbsd.org>
From: Geoff Adams <gadams@avernus.com>
List: port-sparc64
Date: 12/09/2001 04:19:09
I've loaded an Ultra 30 with -current (kernels built from sources from
December 5 and from an hour ago), and it's almost usable. It comes up
into single user mode (sometimes), and every operation I've attempted so
far works (after some effort). However, processes frequently get "stuck."
I'll describe the symptoms in some detail, in hopes that someone can
help me figure out how to debug the problem.
If I run, for example, 'fsck /dev/rsd0a', it prints the first line of
output and hangs. If I drop into ddb with '+++++' and then continue,
the next couple of lines of fsck output appear, and it hangs again.
Repeating the ddb-continue cycle by typing '+++++<CR>' yields the next
line or two, and so on.
Most processes get stuck like this sooner or later. It seems that it
might be related to disk activity. Untarring /usr to disk, for instance,
gets stuck many dozens of times, usually with the disk activity light
lit. When I drop into the debugger and return, the disk just resumes
churning away. Doing a 'df' will hang the first time after filesystem
write activity (as will sync, so the cause is probably similar), while a
subsequent 'df' (after clearing up the first one via '+++++<CR>') will
print its output just fine.
It's not just disk activity, though. The same problem occurs if I
netboot the machine and never mount a local disk. For some reason,
probing SCSI devices takes over half an hour (when it completes at all)
if I netboot, while the same kernel, booted from disk, probes the SCSI
devices just fine. Also, some non-disk-related commands, such as
'ifconfig -a', exhibit the same behavior.
When a process is hung, I can still interact with other processes. For
instance, if the tar process hangs, I can hit ^Z, and after a
ddb-continue cycle, I'll see the "suspended" message. I can then 'bg'
the tar, and it will continue in the background for a few seconds
(before it hangs again), during which time I can do something else, such
as 'ls'. The ddb-continue cycle unsticks background processes just as
it does foreground ones.
Traces in ddb don't seem interesting, since I'm not actually breaking
into the debugger during execution of whatever is causing the stoppage.
In fact, it seems as if there's nothing really causing the processes to
stop, but rather some interrupt is being missed, or something is not
occurring to cause the process to be switched in from the run queue. If
there's some interesting piece of data I can provide, please let me know.
I haven't been able to infer much meaning from 'ps -alxww' output,
either. For instance, a hung 'sync' shows up in wait channel "getblk,"
status "D," with a running time of 71582788:15.99. I'm guessing it's
hung before it's even gotten started. A hung 'tar', on the other hand,
shows up in "biowait," status "D," with 0:00.40 on the clock.
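One thing that can be read off that odd running time: assuming the ps
TIME column is minutes:seconds.hundredths, 71582788:15.99 is one
hundredth of a second short of 2^32 seconds. That would be consistent
with a small negative time value being displayed as an unsigned 32-bit
quantity, though that is only a guess and not something I've confirmed
in the kernel or ps sources. A quick sketch of the arithmetic (the
helper name is made up for illustration):

```python
def ps_time_to_centiseconds(field: str) -> int:
    """Convert a ps TIME field of the form M:SS.CC to hundredths of a second.

    Assumes the field is minutes:seconds.hundredths; integer math is used
    so the comparison against 2**32 is exact.
    """
    minutes, rest = field.split(":")
    seconds, hundredths = rest.split(".")
    return (int(minutes) * 60 + int(seconds)) * 100 + int(hundredths)


sync_cs = ps_time_to_centiseconds("71582788:15.99")  # the hung 'sync'
tar_cs = ps_time_to_centiseconds("0:00.40")          # the hung 'tar'

wrap_cs = 2**32 * 100  # 2^32 seconds, expressed in hundredths

print("sync:", sync_cs, "hundredths of a second")
print("tar: ", tar_cs, "hundredths of a second")
print("sync is short of 2^32 seconds by", wrap_cs - sync_cs, "hundredth(s)")
```

So the sync figure is almost certainly not real CPU time, while the tar
figure (0.40 seconds) looks plausible.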
Is there anything anyone can think of that I could look at to narrow
down the problem? Does this pattern ring any bells?
This is so close to working, but this problem makes the machine
completely unusable.
Thanks,
- Geoff