NetBSD-Bugs archive
Re: kern/46136: processes get stuck in D under high I/O load
The following reply was made to PR kern/46136; it has been noted by GNATS.
From: Lars Heidieker <lars%heidieker.de@localhost>
To: gnats-bugs%NetBSD.org@localhost
Cc:
Subject: Re: kern/46136: processes get stuck in D under high I/O load
Date: Sun, 04 Mar 2012 19:24:13 +0100
On 03/03/2012 04:10 PM, Hauke Fath wrote:
>> Number: 46136
>> Category: kern
>> Synopsis: processes get stuck in D under high I/O load
>> Confidential: no
>> Severity: critical
>> Priority: high
>> Responsible: kern-bug-people
>> State: open
>> Class: sw-bug
>> Submitter-Id: net
>> Arrival-Date: Sat Mar 03 15:10:00 +0000 2012
>> Originator: Hauke Fath
>> Release: NetBSD 6.0_BETA
>> Organization:
> TU Darmstadt
>> Environment:
> System: NetBSD venediger 6.0_BETA NetBSD 6.0_BETA (VENEDIGER) #0: Thu Mar 1 18:10:56 CET 2012 hf@Hochstuhl:/var/obj/netbsd-builds/6/i386/sys/arch/i386/compile/VENEDIGER i386
> Architecture: i386
> Machine: i386
>> Description:
>
> We run an i386 machine equipped with a Super Micro X7SBE (4 core
> Xeon) and a SCSI MegaRAID 320-4X as file server - mainly NFS.
>
> When we switched the RAID controller from a 320-2 to said 320-4X
> under netbsd-5, the nfsd developed a tendency to get stuck in 'D'
> state every other day, making a reboot necessary.
>
> After upgrading to netbsd-6, and tuning buffer and pool sizes, the
> nfsd problem is somewhat mitigated, although there is still a
> string-and-ducttape script in place, which checks if nfsd is stuck
> in 'D' for an extended period of time, and reboots the machine.
>
> Unfortunately, the jobs started from /etc/daily get stuck, too, and
> wedge the machine such that even a 'reboot 0x04' from the debugger
> will not work, and a hard reset is needed.
>
> From the debugger 'ps' output:
>
> [...]
> About to run shutdown hooks...
> Stopping cron.
> Waiting for PIDS: 826.
> Stopping inetd.
> Waiting for PIDS: 302.
> Saved entropy to disk.
> Turning off accounting.
> Removing block-type swap devices
> swapctl: removing /dev/ld0b as swap device
> Sat Mar 3 10:50:53 CET 2012
>
> Done running shutdown hooks.
> Mar 3 10:50:59 venediger syslogd[184]: Exiting on signal 15
> syncing disks... 3 done
> [-- break #0(1) sent -- `\z' -- Sat Mar 3 10:53:16 2012]
> fatal breakpoint trap in supervisor mode
> trap type 1 code 0 eip c0183c64 cs 8 eflags 200286 cr2 bb688b04 ilevel 8
> Stopped in pid 0.7 (system) at netbsd:breakpoint+0x4: popl %ebp
> db{0}> ps
> PID    LID S CPU    FLAGS STRUCT LWP * NAME        WAIT
> 9408     1 3   0  9020000     c5ded800 amd         tstile
> 16127    1 3   3  9020000     c5dedd40 amd         tstile
> 545      1 3   1  9020000     c80a5000 amd         tstile
> 17941    1 3   2        0     cc47bd40 reboot      tstile
> 29808    1 3   3  9020000     cd494d20 find        vmem
> 28944    1 3   0  9020000     c8a86560 find        vmem
> 1        1 3   3  8020080     c5d78aa0 init        wait
> 0       78 3   3      200     c538e020 nfsio       nfsiod
> 0       77 3   2      200     c538e2c0 nfsio       nfsiod
> 0       76 3   1      200     c538e560 nfsio       nfsiod
> 0       75 3   2      200     c5ded560 nfsio       nfsiod
> 0       74 5   3      200     c5e34000 (zombie)
> 0       73 3   3      200     c5ded020 physiod     physiod
> 0       72 3   3      200     c5dc5d20 aiodoned    aiodoned
> 0       71 3   2      200     c5d782c0 ioflush     vmem
> 0       70 3   1      200     c5d78020 pgdaemon    xclocv
> 0       67 3   3      200     c5d3b800 cryptoret   crypto_w
> 0       66 3   3      200     c5d78560 atapibus0   sccomp
> 0       64 3   2      200     c5d25540 usb4        usbevt
> 0       63 3   0      200     c5d3b2c0 usb7        usbevt
> 0       62 3   3      200     c5d3b560 usb6        usbevt
> 0       61 3   1      200     c5d3baa0 usb5        usbevt
> 0       60 3   3      200     c5d78800 usb3        usbevt
> 0       59 3   3      200     c5d252a0 unpgc       unpgc
> 0       58 3   0      200     c5d3bd40 usb0        usbevt
> 0       57 3   0      200     c5d25000 usb2        usbevt
> 0       56 3   2      200     c5d78d40 usbtask-dr  usbtsk
> 0       55 3   3      200     c5d3c000 usbtask-hc  usbtsk
> 0       54 3   3      200     c5d3c2a0 usb1        usbevt
> 0       53 3   0      200     c5d3c540 vmem_rehash vmem_rehash
> 0       52 3   0      200     c5d3c7e0 coretemp3   coretemp3
> 0       51 3   3      200     c5d3ca80 coretemp2   coretemp2
> 0       50 3   1      200     c5d3cd20 coretemp1   coretemp1
> 0       49 3   2      200     c5d3b020 coretemp0   coretemp0
> 0       40 3   2      200     c5d257e0 atabus3     atath
> 0       39 3   0      200     c5d25a80 atabus2     atath
> 0       38 3   3      200     c5d25d20 iic0        iicintr
> 0       37 3   2      200     c5b29020 atabus1     atath
> 0       36 3   0      200     c5b292c0 atabus0     atath
> 0       35 3   0      200     c5b29560 apm0        apmev
> 0       34 3   3      200     c5b29800 xcall/3     xcall
> 0       33 1   3      200     c5b29aa0 softser/3
> 0       32 1   3      200     c5b29d40 softclk/3
> 0       31 1   3      200     c5b1e000 softbio/3
> 0       30 1   3      200     c5b1e2a0 softnet/3
> 0     > 29 7   3      201     c5b1e540 idle/3
> 0       28 3   2      200     c5b1e7e0 xcall/2     xcall
> 0       27 1   2      200     c5b1ea80 softser/2
> 0       26 1   2      200     c5b1ed20 softclk/2
> 0       25 1   2      200     c5b1a020 softbio/2
> 0       24 1   2      200     c5b1a2c0 softnet/2
> 0     > 23 7   2      201     c5b1a560 idle/2
> 0       22 3   1      200     c5b1a800 xcall/1     xcall
> 0       21 1   1      200     c5b1aaa0 softser/1
> 0       20 1   1      200     c5b1ad40 softclk/1
> 0       19 1   1      200     c4ffb000 softbio/1
> 0       18 1   1      200     c4ffb2a0 softnet/1
> 0     > 17 7   1      201     c4ffb540 idle/1
> 0       16 3   0      200     c4ffb7e0 sysmon      smtaskq
> 0       15 3   0      200     c4ffba80 pmfsuspend  pmfsuspend
> 0       14 3   0      200     c4ffbd20 pmfevent    pmfevent
> 0       13 3   3      200     c4ff5020 sopendfree  sopendfr
> 0       12 3   0      200     c4ff52c0 nfssilly    nfssilly
> 0       11 3   0      200     c4ff5560 cachegc     cachegc
> 0       10 3   3      200     c4ff5800 vrele       vrele
> 0        9 3   2      200     c4ff5aa0 vdrain      vdrain
> 0        8 3   0      200     c4ff5d40 modunload   mod_unld
> 0     >  7 7   0      200     c4fed000 xcall/0
> 0        6 1   0      200     c4fed2a0 softser/0
> 0        5 1   0      200     c4fed540 softclk/0
> 0        4 1   0      200     c4fed7e0 softbio/0
> 0        3 1   0      200     c4feda80 softnet/0
> 0        2 1   0      201     c4fedd20 idle/0
> 0        1 3   3      200     c0652400 swapper     uvm
> db{0}> rev boot 0x04
> [-- break #0(1) sent -- `\z' -- Sat Mar 3 10:56:58 2012]
> [-- break #0(1) sent -- `\z' -- Sat Mar 3 10:57:02 2012]
>
> [machine completely stuck]
>
>
> Note the reboot(8) in 'tstile', and the find(1) processes (the
> original culprits) in 'vmem'.
>
>> How-To-Repeat:
>
> Run netbsd-6 on a busy, SCSI RAID based NFS file server.
>
>> Fix:
> None I can see.
>
> The machine is easy to upset, so I can quickly provide any details
> someone knowledgeable might be interested in, including ddb dances.
>
> (Re-sent because of botched sender mail address)
>
>> Unformatted:
>
>
Can you reproduce the problem with src/sys/kern/subr_vmem.c updated
to revision 1.73?