netbsd-bugs: kern/6916: narcoleptic dump(8)

Subject: kern/6916: narcoleptic dump(8)
To: None <gnats-bugs@gnats.netbsd.org>
From: None <windsor@warthog.com>
List: netbsd-bugs
Date: 01/30/1999 22:12:18
>Number:         6916
>Category:       kern
>Synopsis:       My dump/restore went to sleep!
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people (Kernel Bug People)
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat Jan 30 20:20:01 1999
>Last-Modified:
>Originator:     Rob Windsor
>Organization:
NosePickers Anonymous
>Release:        1.3.3
>Environment:
System: NetBSD evolution 1.3.3 NetBSD 1.3.3 (EVOLUTION) #5: Sat Jan 23 01:00:29 CST 1999 windsor@evolution:/usr/src/sys/arch/sparc/compile/EVOLUTION sparc


>Description:
	I was running a "dump | restore" from cron (copying one disk to
	another) when dump hung.  The cron job runs the following commands
	(shown so that it is clear what is happening):

		mount /dev/sd1d /mirror/work
		(cd /mirror/work ; dump 0f - /usr | restore -rf - )
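
	For anyone trying to reproduce this, the two commands above can be
	wrapped so that a failed mount or cd stops the job before restore
	writes anywhere.  This is only a sketch: the function name
	mirror_fs and its argument order are hypothetical and not part of
	the original cron job.

```shell
#!/bin/sh
# Sketch only: mirror_fs is a hypothetical wrapper around the two
# commands from the report. It refuses to run restore unless both the
# mount and the cd succeeded.
mirror_fs() {
    dev=$1    # target device, e.g. /dev/sd1d
    mnt=$2    # mount point, e.g. /mirror/work
    src=$3    # filesystem to dump, e.g. /usr

    mount "$dev" "$mnt" || return 1

    # The subshell keeps the cd local. "&&" replaces the ";" from the
    # report so that restore never runs in the wrong directory.
    # Note: the status returned is that of restore, the last command
    # in the pipeline, not that of dump.
    ( cd "$mnt" && dump 0f - "$src" | restore -rf - )
}

# Example call (commented out so that sourcing this file is harmless):
# mirror_fs /dev/sd1d /mirror/work /usr
```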

	Top shows:

  PID USERNAME PRI NICE   SIZE   RES STATE   TIME   WCPU    CPU COMMAND
  758 windsor   28    0   972K  120K run     7:25  0.93%  0.93% top
  513 root       2    0   788K   68K sleep   1:36  0.05%  0.05% sshd1
  855 root       2    0  4196K   48K sleep   2:39  0.00%  0.00% restore
  159 root      18    0    20K    8K sleep   0:56  0.00%  0.00% update
  857 root      18   -5   600K   32K sleep   0:29  0.00%  0.00% dump
  859 root      18   -5   600K   32K sleep   0:28  0.00%  0.00% dump
  858 root      18   -5   600K   32K sleep   0:28  0.00%  0.00% dump
  205 root       2    0   368K  108K sleep   0:20  0.00%  0.00% sshd1
  487 root       2    0   788K   84K sleep   0:15  0.00%  0.00% sshd1
  856 root       2   -5   656K   88K sleep   0:12  0.00%  0.00% dump
   97 root      10    0    36K    8K sleep   0:11  0.00%  0.00% ipmon
  125 root      10    0    64M    0K sleep   0:11  0.00%  0.00% mount_mfs
  755 windsor    2    0   108K   20K sleep   0:10  0.00%  0.00% tail
  462 root       2    0   788K   68K sleep   0:07  0.00%  0.00% <sshd1>
  854 root      10   -5   600K   76K sleep   0:03  0.00%  0.00% <dump>

	(After it wedged, I tried renicing the dump processes to wake
	them up.)

	ps alx shows:

: evolution; ps alx | egrep 'dump|restore|PPID'
  UID   PID  PPID CPU PRI NI   VSZ  RSS WCHAN  STAT TT       TIME COMMAND
    0   854   853  93  10 -5   600   76 wait   IW<  ??    0:03.55 dump 0f - /us
    0   855   853 170   2  0  4196   48 netio  I    ??    2:39.19 (restore)
    0   856   854  15   2 -5   656   88 netio  I<   ??    0:12.62 dump 0f - /us
    0   857   856  25  18 -5   600   32 pause  I<   ??    0:29.05 dump 0f - /us
    0   858   856  28  18 -5   600   32 pause  I<   ??    0:28.65 dump 0f - /us
    0   859   856  25  18 -5   600   32 pause  I<   ??    0:28.42 dump 0f - /us
  101  1248   489   8  30  0    84   84 -      R+   p1    0:00.08 egrep dump|re
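
	The wait channels are the interesting part of that listing.  As a
	sketch, assuming the 13-column "ps alx" layout shown above (PID in
	column 2, WCHAN in 9, STAT in 10, COMMAND starting at 13), a small
	filter can pull out just those fields.  The name summarize_wchan
	is hypothetical.

```shell
#!/bin/sh
# Sketch only: reduce "ps alx" output to PID, WCHAN, STAT and command
# name for dump/restore processes. Matching on column 13 (the command
# name itself) also skips the egrep line that the pipeline in the
# report picks up.
summarize_wchan() {
    awk '$13 ~ /dump|restore/ { print $2, $9, $10, $13 }'
}

# Usage, assuming the same column layout as in the report:
#   ps alx | summarize_wchan
```

	Fed the listing above, this would print one line per process, for
	example "854 wait IW< dump" and "855 netio I (restore)".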


>How-To-Repeat:
	Hmm.  Do a "dump | restore" from one disk to another with enough
	else going on that one of the dump processes gets paged out?  I'm
	not sure how this happened, so I'm not sure how to repeat it.

>Fix:
	Unknown.

>Audit-Trail:
>Unformatted: