Re: kern/46136: processes get stuck in D under high I/O load

To: kern-bug-people%netbsd.org@localhost,gnats-admin%netbsd.org@localhost,netbsd-bugs%netbsd.org@localhost,Hauke Fath <hf%spg.tu-darmstadt.de@localhost>
Subject: Re: kern/46136: processes get stuck in D under high I/O load
From: Lars Heidieker <lars%heidieker.de@localhost>
Date: Sun, 4 Mar 2012 18:25:04 +0000 (UTC)
The following reply was made to PR kern/46136; it has been noted by GNATS.

From: Lars Heidieker <lars%heidieker.de@localhost>
To: gnats-bugs%NetBSD.org@localhost
Cc: 
Subject: Re: kern/46136: processes get stuck in D under high I/O load
Date: Sun, 04 Mar 2012 19:24:13 +0100

 On 03/03/2012 04:10 PM, Hauke Fath wrote:
 >> Number:         46136
 >> Category:       kern
 >> Synopsis:       processes get stuck in D under high I/O load
 >> Confidential:   no
 >> Severity:       critical
 >> Priority:       high
 >> Responsible:    kern-bug-people
 >> State:          open
 >> Class:          sw-bug
 >> Submitter-Id:   net
 >> Arrival-Date:   Sat Mar 03 15:10:00 +0000 2012
 >> Originator:     Hauke Fath
 >> Release:        NetBSD 6.0_BETA
 >> Organization:
 > TU Darmstadt
 >> Environment:
 > System: NetBSD venediger 6.0_BETA NetBSD 6.0_BETA (VENEDIGER) #0:
 > Thu Mar 1 18:10:56 CET 2012
 > hf@Hochstuhl:/var/obj/netbsd-builds/6/i386/sys/arch/i386/compile/VENEDIGER
 > i386
 > Architecture: i386
 > Machine: i386
 >> Description:
 > 
 > We run an i386 machine equipped with a Super Micro X7SBE (4 core
 > Xeon) and a SCSI MegaRAID 320-4X as file server - mainly NFS.
 > 
 > When we switched the RAID controller from a 320-2 to said 320-4X
 > under netbsd-5, the nfsd developed a tendency to get stuck in 'D'
 > state every other day, making a reboot necessary.
 > 
 > After upgrading to netbsd-6, and tuning buffer and pool sizes, the
 > nfsd problem is somewhat mitigated, although there is still a
 > string-and-duct-tape script in place, which checks if nfsd is stuck
 > in 'D' for an extended period of time, and reboots the machine.
 > 
 > Unfortunately, the jobs started from /etc/daily get stuck, too, and
 > wedge the machine such that even a 'reboot 0x04' from the debugger
 > will not work, and a hard reset is needed.
 > 
 > From the debugger 'ps' output:
 > 
 > [...]
 > About to run shutdown hooks...
 > Stopping cron.
 > Waiting for PIDS: 826.
 > Stopping inetd.
 > Waiting for PIDS: 302.
 > Saved entropy to disk.
 > Turning off accounting.
 > Removing block-type swap devices
 > swapctl: removing /dev/ld0b as swap device
 > Sat Mar  3 10:50:53 CET 2012
 >
 > Done running shutdown hooks.
 > Mar  3 10:50:59 venediger syslogd[184]: Exiting on signal 15
 > syncing disks... 3 done
 > [-- break #0(1) sent -- `\z' -- Sat Mar  3 10:53:16 2012]
 > fatal breakpoint trap in supervisor mode
 > trap type 1 code 0 eip c0183c64 cs 8 eflags 200286 cr2 bb688b04 ilevel 8
 > Stopped in pid 0.7 (system) at  netbsd:breakpoint+0x4:  popl    %ebp
 > db{0}> ps
 > PID    LID S CPU     FLAGS       STRUCT LWP *               NAME WAIT
 > 9408     1 3   0   9020000           c5ded800                amd tstile
 > 16127    1 3   3   9020000           c5dedd40                amd tstile
 > 545      1 3   1   9020000           c80a5000                amd tstile
 > 17941    1 3   2         0           cc47bd40             reboot tstile
 > 29808    1 3   3   9020000           cd494d20               find vmem
 > 28944    1 3   0   9020000           c8a86560               find vmem
 > 1        1 3   3   8020080           c5d78aa0               init wait
 > 0       78 3   3       200           c538e020              nfsio nfsiod
 > 0       77 3   2       200           c538e2c0              nfsio nfsiod
 > 0       76 3   1       200           c538e560              nfsio nfsiod
 > 0       75 3   2       200           c5ded560              nfsio nfsiod
 > 0       74 5   3       200           c5e34000           (zombie)
 > 0       73 3   3       200           c5ded020            physiod physiod
 > 0       72 3   3       200           c5dc5d20           aiodoned aiodoned
 > 0       71 3   2       200           c5d782c0            ioflush vmem
 > 0       70 3   1       200           c5d78020           pgdaemon xclocv
 > 0       67 3   3       200           c5d3b800          cryptoret crypto_w
 > 0       66 3   3       200           c5d78560          atapibus0 sccomp
 > 0       64 3   2       200           c5d25540               usb4 usbevt
 > 0       63 3   0       200           c5d3b2c0               usb7 usbevt
 > 0       62 3   3       200           c5d3b560               usb6 usbevt
 > 0       61 3   1       200           c5d3baa0               usb5 usbevt
 > 0       60 3   3       200           c5d78800               usb3 usbevt
 > 0       59 3   3       200           c5d252a0              unpgc unpgc
 > 0       58 3   0       200           c5d3bd40               usb0 usbevt
 > 0       57 3   0       200           c5d25000               usb2 usbevt
 > 0       56 3   2       200           c5d78d40         usbtask-dr usbtsk
 > 0       55 3   3       200           c5d3c000         usbtask-hc usbtsk
 > 0       54 3   3       200           c5d3c2a0               usb1 usbevt
 > 0       53 3   0       200           c5d3c540        vmem_rehash vmem_rehash
 > 0       52 3   0       200           c5d3c7e0          coretemp3 coretemp3
 > 0       51 3   3       200           c5d3ca80          coretemp2 coretemp2
 > 0       50 3   1       200           c5d3cd20          coretemp1 coretemp1
 > 0       49 3   2       200           c5d3b020          coretemp0 coretemp0
 > 0       40 3   2       200           c5d257e0            atabus3 atath
 > 0       39 3   0       200           c5d25a80            atabus2 atath
 > 0       38 3   3       200           c5d25d20               iic0 iicintr
 > 0       37 3   2       200           c5b29020            atabus1 atath
 > 0       36 3   0       200           c5b292c0            atabus0 atath
 > 0       35 3   0       200           c5b29560               apm0 apmev
 > 0       34 3   3       200           c5b29800            xcall/3 xcall
 > 0       33 1   3       200           c5b29aa0          softser/3
 > 0       32 1   3       200           c5b29d40          softclk/3
 > 0       31 1   3       200           c5b1e000          softbio/3
 > 0       30 1   3       200           c5b1e2a0          softnet/3
 > 0    >  29 7   3       201           c5b1e540             idle/3
 > 0       28 3   2       200           c5b1e7e0            xcall/2 xcall
 > 0       27 1   2       200           c5b1ea80          softser/2
 > 0       26 1   2       200           c5b1ed20          softclk/2
 > 0       25 1   2       200           c5b1a020          softbio/2
 > 0       24 1   2       200           c5b1a2c0          softnet/2
 > 0    >  23 7   2       201           c5b1a560             idle/2
 > 0       22 3   1       200           c5b1a800            xcall/1 xcall
 > 0       21 1   1       200           c5b1aaa0          softser/1
 > 0       20 1   1       200           c5b1ad40          softclk/1
 > 0       19 1   1       200           c4ffb000          softbio/1
 > 0       18 1   1       200           c4ffb2a0          softnet/1
 > 0    >  17 7   1       201           c4ffb540             idle/1
 > 0       16 3   0       200           c4ffb7e0             sysmon smtaskq
 > 0       15 3   0       200           c4ffba80         pmfsuspend pmfsuspend
 > 0       14 3   0       200           c4ffbd20           pmfevent pmfevent
 > 0       13 3   3       200           c4ff5020         sopendfree sopendfr
 > 0       12 3   0       200           c4ff52c0           nfssilly nfssilly
 > 0       11 3   0       200           c4ff5560            cachegc cachegc
 > 0       10 3   3       200           c4ff5800              vrele vrele
 > 0        9 3   2       200           c4ff5aa0             vdrain vdrain
 > 0        8 3   0       200           c4ff5d40          modunload mod_unld
 > 0    >   7 7   0       200           c4fed000            xcall/0
 > 0        6 1   0       200           c4fed2a0          softser/0
 > 0        5 1   0       200           c4fed540          softclk/0
 > 0        4 1   0       200           c4fed7e0          softbio/0
 > 0        3 1   0       200           c4feda80          softnet/0
 > 0        2 1   0       201           c4fedd20             idle/0
 > 0        1 3   3       200           c0652400            swapper uvm
 > db{0}> rev boot 0x04
 > [-- break #0(1) sent -- `\z' -- Sat Mar  3 10:56:58 2012]
 > [-- break #0(1) sent -- `\z' -- Sat Mar  3 10:57:02 2012]
 > 
 > [machine completely stuck]
 > 
 > 
 > Note the reboot(8) in 'tstile', and the find(1) processes (the 
 > original culprits) in 'vmem'.
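
To make the pattern in the listing easier to see, here is a minimal
sketch that groups LWPs by wait channel. It is only an illustration:
the function name by_wait_channel and the assumed column order
(PID, LID, S, CPU, FLAGS, STRUCT LWP, NAME, WAIT) are inferred from the
listing, and the sample rows are a small subset copied from it.

```python
from collections import defaultdict

# A few rows copied from the ddb 'ps' listing in this report.
# Assumed column order: PID LID S CPU FLAGS LWP-address NAME WAIT
ROWS = """\
9408 1 3 0 9020000 c5ded800 amd tstile
16127 1 3 3 9020000 c5dedd40 amd tstile
545 1 3 1 9020000 c80a5000 amd tstile
17941 1 3 2 0 cc47bd40 reboot tstile
29808 1 3 3 9020000 cd494d20 find vmem
28944 1 3 0 9020000 c8a86560 find vmem
0 71 3 2 200 c5d782c0 ioflush vmem
0 70 3 1 200 c5d78020 pgdaemon xclocv
0 > 29 7 3 201 c5b1e540 idle/3
"""


def by_wait_channel(text):
    """Map each wait channel to the list of 'name(pid.lid)' sleeping on it."""
    groups = defaultdict(list)
    for line in text.splitlines():
        # Drop the '>' marker that flags the LWP currently on a CPU.
        fields = [f for f in line.split() if f != ">"]
        if len(fields) != 8:
            continue  # no wait channel (running or idle LWP), or not a row
        pid, lid, _state, _cpu, _flags, _lwp, name, wait = fields
        groups[wait].append(f"{name}({pid}.{lid})")
    return dict(groups)


if __name__ == "__main__":
    for wait, lwps in sorted(by_wait_channel(ROWS).items()):
        print(f"{wait:8s} {' '.join(lwps)}")
```

On the sample rows this puts the three amd processes and reboot under
'tstile', and both find processes together with ioflush under 'vmem'.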
 > 
 >> How-To-Repeat:
 > 
 > Run netbsd-6 on a busy, SCSI-RAID-based NFS file server.
 > 
 >> Fix:
 >  None I can see.
 > 
 > The machine is easy to upset, so I can quickly provide any details
 > someone knowledgeable might be interested in, including ddb dances.
 > 
 > (Re-sent because of botched sender mail address)
 > 
 >> Unformatted:
 > 
 > 
 
 Can you reproduce the problem with src/sys/kern/subr_vmem.c updated
 to revision 1.73?
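
One way to confirm which revision of subr_vmem.c a source tree carries
is to read the $NetBSD$ RCS ID line at the top of the file. The sketch
below parses such a line and compares the revision against 1.73. It is
a rough illustration only: the helper names parse_rcsid and at_least
are made up here, and the SAMPLE string uses a placeholder date and
author rather than real commit data.

```python
import re

# Matches e.g. "$NetBSD: subr_vmem.c,v 1.72 <date> <time> <author> Exp $"
RCSID = re.compile(r"\$NetBSD: (?P<file>[^,\s]+),v (?P<rev>\d+(?:\.\d+)+) ")

# Illustrative only: the date and author here are placeholders.
SAMPLE = "/* $NetBSD: subr_vmem.c,v 1.72 2012/01/01 00:00:00 someone Exp $ */"


def parse_rcsid(text):
    """Return (filename, revision tuple) from an RCS ID string, or None."""
    match = RCSID.search(text)
    if match is None:
        return None
    revision = tuple(int(part) for part in match.group("rev").split("."))
    return match.group("file"), revision


def at_least(revision, wanted):
    """Compare trunk revisions such as (1, 72) and (1, 73).

    Branch revisions with more than two components (for example on a
    release branch) do not order meaningfully against trunk revisions,
    so treat the result for those as unknown rather than trusting it.
    """
    return revision >= wanted


if __name__ == "__main__":
    name, rev = parse_rcsid(SAMPLE)
    status = "ok" if at_least(rev, (1, 73)) else "older than 1.73"
    print(name, ".".join(map(str, rev)), status)
```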