netbsd-bugs: kern/12348: deadlock like situation in getnewb

Subject: kern/12348: deadlock like situation in getnewb
To: None <gnats-bugs@gnats.netbsd.org>
From: None <jam@pobox.com>
List: netbsd-bugs
Date: 03/07/2001 10:44:14
>Number:         12348
>Category:       kern
>Synopsis:       deadlock like situation in getnewb
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Mar 07 08:45:00 PST 2001
>Closed-Date:
>Last-Modified:
>Originator:     Kazushi (Jam) Marukawa
>Release:        Feb 23, 2001
>Organization:
none
>Environment:
    Machine: Celeron 366, Asus P3V4X, and 512MB memory
    System: NetBSD sou.nerv.org NetBSD 1.5S (sou) #7: Sat Feb 24 05:21:30 CST 2001 jam@sou.nerv.org:/usr/src/sys/arch/i386/compile/sou i386
    Architecture: i386
    Machine: i386
>Description:
    This time, I experienced some deadlock like situation while
    I was writing large block data to the SCSI disk.  So, I'm
    reporting this.  I'm using serial console and it was off
    when I got this problem, so I don't have console messages.
    
    First, I noticed that all programs I tried to execute hangs.
    I was using screen, so I could switch among screens, but no
    more screens nor new programs.  For example, if I typed "ls"
    nothing happened and ^C, ^\, and ^Z didn't work.  Then,
    screend also hangs since it called getnewb.  However, still
    I can ping and send packets through nat on this machine.
    
    I waited about 30 minutes with a hope it would recover.
    However, it didn't.  So, I sent break through serial console
    and got trace, ps, and uvmexp.  I had more information, but
    I don't know what those information means.  So, this time, I
    just put some of them at the bottom of this message.  If you
    have an interest on this and want to see all log or the
    result of some specific commands, please ask me so.  Or
    should I send-pr about this?  Thanks.
    
    db> trace
    cpu_Debugger(c0f2aa80,e5053560,e5053560,e5092e5c,0) at cpu_Debugger+0x4
    comintr(c0f42c00) at comintr+0xcd
    Xintr4() at Xintr4+0x70
    --- interrupt ---
    idle(e5053560) at idle+0x1b
    bpendtsleep(e505495c,128,c0486fec,0,0) at bpendtsleep
    sigsuspend1(e5053560,e5092f50,e5092f80,0,0) at sigsuspend1+0xf0
    sys___sigsuspend14(e5053560,e5092f88,e5092f80) at sys___sigsuspend14+0x36
    syscall_plain(1f,bfbf001f,bfbfdc69,0,bfbfdc40) at syscall_plain+0x98
    db> c
    Stopped at      cpu_Debugger+0x4:       leave
    db> trace
    cpu_Debugger(c0f2aa80,e50a4abc,e50a4abc,e50c8da8,0) at cpu_Debugger+0x4
    comintr(c0f42c00) at comintr+0xcd
    Xintr4() at Xintr4+0x70
    --- interrupt ---
    idle(e50a4abc) at idle+0x1b
    bpendtsleep(c05e710c,118,c048954d,65,0) at bpendtsleep
    sys_select(e50a4abc,e50c8f88,e50c8f80) at sys_select+0x309
    syscall_plain(1f,1f,0,8,bfbfdc60) at syscall_plain+0x98
    db> ps
     PID             PPID       PGRP        UID S   FLAGS          COMMAND    WAIT
     23132            193        193          0 3     0x4             sshd getnewb
     23131           9668      23130          0 3  0x4006             less getnewb
     23130           9668      23130          0 3  0x4006               ps getnewb
     23129           9560       9560       1000 3  0x4006        fetchmail getnewb
     23128          21992      21992       1000 3  0x4006             perl getnewb
     23105            277      23104       1000 3  0x4006             perl getnewb
     23038            214        214      32767 3   0x184            httpd  netcon
     23036            214        214      32767 3   0x184            httpd  netcon
     23035            214        214      32767 3   0x184            httpd  netcon
     22994            214        214      32767 3   0x184            httpd  netcon
     22971            214        214      32767 3   0x104            httpd getnewb
     22962            214        214      32767 3   0x184            htd  netcon
     22930            214        214      32767 3   0x184            httpd  netcon
     22859           9668      22859          0 4  0x5002               vi
     21992            325      21992       1000 3  0x4086               sh    wait
     21106            266      21106          0 3  0x4006               dd getnewb
     9668             270       9668          0 3  0x5086              zsh   pause
     9560             264       9560       1000 3  0x5086             mush    wait
     1163            1162       1161          0 4  0x4002             less
     1162            1161       1161          0 4  0x4082               sh
     1161             266       1161          0 4  0x5082              man
     325              263        325       1000 3  0x4082              zsh   pause
     270              263        270       1000 3  0x4082              zsh   pause
     266              265        266          0 3  0x4082              zsh   pause
     265              263        265       1000 3  0x4082              zsh   pause
     263                1        263       1000 3   0x104     screen-3.9.8 getnewb
     238                1          1          0 3  0x4004            getty  vnlock
     237                1        237          0 3  0x4006            getty getnewb
     236              206        236      32767 3  0x4184           pinger  select
     231                1        231          0 3     0x4            inetd  vnlock
     224                1        224          0 3     0x4         sendmail  vnlock
     214                1        214          0 3    0x84            httpd  select
     211              206        211      32767 3  0x4080          unlinkd   netio
     206              202          7          0 3  0x4106            squid getnewb
     202                1          7          0 3  0x4082               sh    wait
     193                1        193          0 3     0x4             sshd  vnlock
     12                 0          0          0 3 0x20204             raid rfwcond
     6                  0          0          0 3 0x20204         aiodoned aiodone
     5                  0          0          0 3 0x20204          ioflush getnewb
     4                  0          0          0 3 0x20204           reaper  reaper
     3                  0          0          0 3 0x20204       pagedaemon pgdaemo
     2                  0          0          0 3 0x20204             usb0  usbevt
     1                  0          1          0 3  0x4080             init    wait
     0                 -1          0          0 3 0x20204          swapper schedul
    db> kill CE    <- squid in newcb
    db> kill 48B   <- less in normal status, but I cannot kill either
    db> show uvmexp
    Current UVM status:
      pagesize=4096 (0x1000), pagemask=0xfff, pageshift=12
      127787 VM pages: 78530 active, 36887 inactive, 973 wired, 140 free
        14843 anon, 101138 vnode, 0 vtext
      freemin=64, free-target=85, inactive-target=38518, wired-max=42595
      faults=6992210, traps=85475941, intrs=83588156, ctxswitch=59582042
      softint=78595794, syscalls=292274228, swapins=1434, swapouts=1467
      fault counts:
        noram=3, noanon=0, pgwait=0, pgrele=0
        ok relocks(total)=25163(25163), anget(retrys)=613856(3), amapcopy=241944
        neighbor anon/obj pg=602787/3113099, gets(lock/unlock)=941513/25164
        cases: anon=416621, anoncow=197235, obj=819333, prcopy=122176, przero=219795
    9
      daemon and swap counts:
        woke=797, revs=797, scans=5326460, obscans=4571110, anscans=3436
        busy=0, freed=4571111, reactivate=69923, deactivate=5554947
        pageouts=222, pending=222, nswget=3
        nswapdev=1, nanon=271923, nanonneeded=271923freeanon=257080
        swpages=152451, swpginuse=879, swpgonly=0 paging=0
      kernel pointers:
        objs(kern/kmem/mb)=0xc05cdd40/0xc05cde50/0xc05cde68
    db> sync
    syncing disks... 
    (waited more than 5 minutes, no disk I/O, send break, send
     break...  no response.  reset)
    
    Regards,
>How-To-Repeat:
    What I was doing is "dd if=/dev/zero of=/dev/sd0 count=XXXX bs=8192"
    on a SCSI HDD bigger than 40G.  I though I can zero-fill a hard drive in
    this way before I sell this...

    I received an email from Chuck Silvers <chuq@chuq.com> suggesting
    use rsd0* instead and send pr.

    sd0 is much faster than rsd0 for my case, so I chose it.  Thank
    you for the workaround.
>Fix:
    Chuck said:
	please send a PR about the new problem if you haven't already.
	(this new problem is unrelated to the previous "uvn_fp1" hang.)
    He might know or have an idea on this.  Thanks.
>Release-Note:
>Audit-Trail:
>Unformatted: