tech-kern archive
Re: Random lockups on an email server - possibly kern/50168
On Sun, 3 Apr 2016 09:51:08 -0400
"D'Arcy J.M. Cain" <darcy%NetBSD.org@localhost> wrote:
> Meanwhile, my system crashed again. I have taken to rebooting every
> morning (better a controlled five minute down time than a minimum half
> hour crash). Here is what was on the screen when it locked up.
Based on discussions with David Maxwell, I dropped the daily reboot and
left crash(8) running in a screen(1) session. The idea was that if crash
was already running when the system hung, I could still run some commands.
Today it hung again. Here's the output of top when it hung:
load averages: 0.33, 0.31, 0.55; up 2+21:36:26 08:11:40
494 processes: 461 sleeping, 31 zombie, 2 on CPU
CPU states: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 99.9% idle
Memory: 19G Act, 9272M Inact, 11M Wired, 86M Exec, 26G File, 8584K Free
Swap: 32G Total, 32G Free
PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
0 root 0 0 0K 45M CPU/14 27:17 0.00% 0.00% [system]
597 root 117 0 24M 2252K tstile/1 1:12 0.00% 0.00% syslogd
29434 root 117 0 25M 14M tstile/8 1:04 0.00% 0.00% rsync
673 root 43 0 18M 3380K CPU/15 0:39 0.00% 0.00% top
15161 root 85 0 12M 2124K kqueue/1 0:18 0.00% 0.00% log
1713 postgrey 85 0 83M 21M select/3 0:17 0.00% 0.00% perl
234 mailman 117 0 129M 37M tstile/1 0:16 0.00% 0.00% python2.7
1796 mailman 117 0 122M 25M tstile/1 0:16 0.00% 0.00% python2.7
22368 druid 85 0 16M 5024K kqueue/1 0:16 0.00% 0.00% imap
2943 mailman 117 0 124M 30M tstile/1 0:15 0.00% 0.00% python2.7
2469 mailman 117 0 115M 17M tstile/1 0:15 0.00% 0.00% python2.7
21549 root 85 0 16M 6824K kqueue/5 0:15 0.00% 0.00% config
26849 root 117 0 89M 55M tstile/1 0:14 0.00% 0.00% auth
235 mailman 117 0 124M 29M tstile/1 0:14 0.00% 0.00% python2.7
233 mailman 117 0 115M 16M tstile/1 0:14 0.00% 0.00% python2.7
3024 mailman 117 0 115M 16M tstile/2 0:14 0.00% 0.00% python2.7
16888 darcy 85 0 16M 5048K kqueue/0 0:12 0.00% 0.00% imap
16363 www 85 0 354M 38M flt_no/8 0:11 0.00% 0.00% httpd
14358 www 85 0 355M 35M kqueue/1 0:11 0.00% 0.00% httpd
1532 root 85 0 22M 10M pause/3 0:11 0.00% 0.00% ntpd
2245 root 85 0 48M 2472K kqueue/0 0:10 0.00% 0.00% master
25121 root 85 0 12M 1940K flt_no/1 0:10 0.00% 0.00% dovecot
21209 www 85 0 355M 34M semwai/1 0:08 0.00% 0.00% httpd
19179 root 85 0 78M 7324K select/8 0:06 0.00% 0.00% sshd
18999 gogo2 117 0 17M 5000K tstile/1 0:05 0.00% 0.00% imap
27442 www 85 0 353M 33M semwai/9 0:05 0.00% 0.00% httpd
13590 www 85 0 351M 29M semwai/5 0:05 0.00% 0.00% httpd
2430 darcy 85 0 20M 2156K select/0 0:04 0.00% 0.00% screen-4.3.1
2807 jbelknap 117 0 19M 7716K tstile/0 0:03 0.00% 0.00% imap
160 root 85 0 337M 26M select/8 0:03 0.00% 0.00% httpd
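A listing like the one above is easier to read once the processes are
tallied by wait channel, since that shows at a glance how many are parked
in tstile (waiting on a kernel lock turnstile). The following is only a
minimal Python sketch, not something that was run on the affected machine.
It assumes the 11-column layout shown in the header row (PID through
COMMAND) and that the STATE column has the form channel/cpu. The function
name tally_wait_channels and the sample rows are illustrative.

```python
from collections import Counter


def tally_wait_channels(top_lines):
    """Count processes per wait channel from top(1) process rows.

    Assumes rows shaped like:
    PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
    """
    counts = Counter()
    for line in top_lines:
        fields = line.split()
        # Skip summary and header lines: a process row has at least
        # 11 columns and starts with a numeric PID.
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        state = fields[6]                # e.g. "tstile/1" or "CPU/14"
        channel = state.split("/")[0]    # drop the trailing CPU number
        counts[channel] += 1
    return counts


# A few rows copied from the listing above, used as sample input.
sample = [
    "PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND",
    "597 root 117 0 24M 2252K tstile/1 1:12 0.00% 0.00% syslogd",
    "673 root 43 0 18M 3380K CPU/15 0:39 0.00% 0.00% top",
    "15161 root 85 0 12M 2124K kqueue/1 0:18 0.00% 0.00% log",
    "234 mailman 117 0 129M 37M tstile/1 0:16 0.00% 0.00% python2.7",
]

if __name__ == "__main__":
    for channel, count in tally_wait_channels(sample).most_common():
        print(channel, count)
```

Feeding it the full listing instead of the four sample rows would give the
tstile count for the whole screen.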
crash didn't help. When I pressed Enter it dumped ps output to the
screen, probably from the last command I ran while the system was up.
Here is the part of that output that screen's scrollback still held.
0 129 3 4 200 fffffe813ac685e0 coretemp1 coretemp1
0 128 3 10 200 fffffe813ac68a00 coretemp0 coretemp0
0 127 3 11 200 fffffe813ac3f1a0 ciss0 ciss0
0 118 3 0 200 fffffe813ab61140 pms0 pmsreset
0 117 3 0 200 fffffe813ab61560 atabus5 atath
0 116 3 0 200 fffffe813ab61980 atabus4 atath
0 115 3 1 200 fffffe813ab44120 atabus3 atath
0 114 3 1 200 fffffe813ab44540 atabus2 atath
0 113 3 0 200 fffffe813ab44960 atabus1 atath
0 112 3 0 200 fffffe813aa7e100 atabus0 atath
0 111 3 0 200 fffffe813aa7e520 usbtask-dr usbtsk
0 110 3 0 200 fffffe813aa7e940 usbtask-hc usbtsk
0 109 3 0 200 fffffe813a8720e0 scsibus0 sccomp
0 108 3 1 200 fffffe813a872500 lnxsyswq lnxsyswq
0 107 3 4 200 fffffe813a872920 ipmi ipmipoll
0 106 3 15 200 fffffe813a7f20c0 xcall/15 xcall
0 105 1 15 200 fffffe813a7f24e0 softser/15
0 104 1 15 200 fffffe813a7f2900 softclk/15
0 103 1 15 200 fffffe813a7db0a0 softbio/15
0 102 1 15 200 fffffe813a7db4c0 softnet/15
0 101 1 15 201 fffffe813a7db8e0 idle/15
0 100 3 14 200 fffffe813a7ce080 xcall/14 xcall
0 99 1 14 200 fffffe813a7ce4a0 softser/14
0 98 1 14 200 fffffe813a7ce8c0 softclk/14
0 97 1 14 200 fffffe813a7b9060 softbio/14
0 96 1 14 200 fffffe813a7b9480 softnet/14
0 > 95 7 14 201 fffffe813a7b98a0 idle/14
0 94 3 13 200 fffffe813a7aa040 xcall/13 xcall
0 93 1 13 200 fffffe813a7aa460 softser/13
0 92 1 13 200 fffffe813a7aa880 softclk/13
0 91 1 13 200 fffffe813a795020 softbio/13
0 90 1 13 200 fffffe813a795440 softnet/13
0 > 89 7 13 201 fffffe813a795860 idle/13
0 88 3 12 200 fffffe813a776000 xcall/12 xcall
0 87 1 12 200 fffffe813a776420 softser/12
0 86 1 12 200 fffffe813a776840 softclk/12
0 85 1 12 200 fffffe813a757360 softbio/12
0 84 1 12 200 fffffe813a757780 softnet/12
0 > 83 7 12 201 fffffe813a757ba0 idle/12
0 82 3 11 200 fffffe813a752340 xcall/11 xcall
0 81 1 11 200 fffffe813a752760 softser/11
0 80 1 11 200 fffffe813a752b80 softclk/11
0 79 1 11 200 fffffe813a75c320 softbio/11
0 78 1 11 200 fffffe813a75c740 softnet/11
0 > 77 7 11 201 fffffe813a75cb60 idle/11
0 76 3 10 200 fffffe813a736300 xcall/10 xcall
0 75 1 10 200 fffffe813a736720 softser/10
0 74 1 10 200 fffffe813a736b40 softclk/10
0 73 1 10 200 fffffe813a70f2e0 softbio/10
0 72 1 10 200 fffffe813a70f700 softnet/10
0 > 71 7 10 201 fffffe813a70fb20 idle/10
0 70 3 9 200 fffffe813a70a2c0 xcall/9 xcall
0 69 1 9 200 fffffe813a70a6e0 softser/9
0 68 1 9 200 fffffe813a70ab00 softclk/9
0 67 1 9 200 fffffe813a70b2a0 softbio/9
0 66 1 9 200 fffffe813a70b6c0 softnet/9
0 > 65 7 9 201 fffffe813a70bae0 idle/9
0 64 3 8 200 fffffe813a6fe280 xcall/8 xcall
0 63 1 8 200 fffffe813a6fe6a0 softser/8
0 62 1 8 200 fffffe813a6feac0 softclk/8
0 61 1 8 200 fffffe813a6e8260 softbio/8
0 60 1 8 200 fffffe813a6e8680 softnet/8
0 > 59 7 8 201 fffffe813a6e8aa0 idle/8
0 58 3 7 200 fffffe813a6b2240 xcall/7 xcall
0 57 1 7 200 fffffe813a6b2660 softser/7
0 56 1 7 200 fffffe813a6b2a80 softclk/7
0 55 1 7 200 fffffe813a6c3220 softbio/7
0 54 1 7 200 fffffe813a6c3640 softnet/7
0 > 53 7 7 201 fffffe813a6c3a60 idle/7
0 52 3 6 200 fffffe813a6b6200 xcall/6 xcall
0 51 1 6 200 fffffe813a6b6620 softser/6
0 50 1 6 200 fffffe813a6b6a40 softclk/6
0 49 1 6 200 fffffe813a6a01e0 softbio/6
0 48 1 6 200 fffffe813a6a0600 softnet/6
0 > 47 7 6 201 fffffe813a6a0a20 idle/6
0 46 3 5 200 fffffe813a67a1c0 xcall/5 xcall
0 45 1 5 200 fffffe813a67a5e0 softser/5
0 44 1 5 200 fffffe813a67aa00 softclk/5
0 43 1 5 200 fffffe813a6831a0 softbio/5
0 42 1 5 200 fffffe813a6835c0 softnet/5
0 > 41 7 5 201 fffffe813a6839e0 idle/5
0 40 3 4 200 fffffe813a66f180 xcall/4 xcall
0 39 1 4 200 fffffe813a66f5a0 softser/4
0 38 1 4 200 fffffe813a66f9c0 softclk/4
0 37 1 4 200 fffffe813a65f160 softbio/4
0 36 1 4 200 fffffe813a65f580 softnet/4
0 > 35 7 4 201 fffffe813a65f9a0 idle/4
0 34 3 3 200 fffffe813a629140 xcall/3 xcall
0 33 1 3 200 fffffe813a629560 softser/3
0 32 1 3 200 fffffe813a629980 softclk/3
0 31 1 3 200 fffffe813a61a120 softbio/3
0 30 1 3 200 fffffe813a61a540 softnet/3
0 > 29 7 3 201 fffffe813a61a960 idle/3
0 28 3 2 200 fffffe813a62d100 xcall/2 xcall
0 27 1 2 200 fffffe813a62d520 softser/2
0 26 1 2 200 fffffe813a62d940 softclk/2
0 25 1 2 200 fffffe813a6130e0 softbio/2
0 24 1 2 200 fffffe813a613500 softnet/2
0 > 23 7 2 201 fffffe813a613920 idle/2
0 22 3 1 200 fffffe813a6050c0 xcall/1 xcall
0 21 1 1 200 fffffe813a6054e0 softser/1
0 20 1 1 200 fffffe813a605900 softclk/1
0 19 1 1 200 fffffe813a5e80a0 softbio/1
0 18 1 1 200 fffffe813a5e84c0 softnet/1
0 17 1 1 201 fffffe813a5e88e0 idle/1
0 16 3 0 200 fffffe8836ef4080 sysmon smtaskq
0 15 3 0 200 fffffe8836ef44a0 pmfsuspend pmfsuspend
0 14 3 6 200 fffffe8836ef48c0 pmfevent pmfevent
0 13 3 0 200 fffffe883af10060 sopendfree sopendfr
0 12 3 0 200 fffffe883af10480 nfssilly nfssilly
0 11 3 11 200 fffffe883af108a0 cachegc cachegc
0 10 3 4 200 fffffe883df18040 vrele vrele
0 9 3 15 200 fffffe883df18460 vdrain vdrain
0 8 3 3 200 fffffe883df18880 modunload mod_unld
0 7 3 0 200 fffffe883df24020 xcall/0 xcall
0 6 1 0 200 fffffe883df24440 softser/0
0 5 1 0 200 fffffe883df24860 softclk/0
0 4 1 0 200 fffffe883df2a000 softbio/0
0 3 1 0 200 fffffe883df2a420 softnet/0
0 2 1 0 201 fffffe883df2a840 idle/0
0 1 3 7 200 ffffffff810345a0 swapper uvm
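To pick out which LWPs the listing above flags with ">", a short filter
is enough. This is a hedged Python sketch: it assumes the columns are PID,
the optional ">" marker, LID, state, CPU, flags, LWP address and name,
which is my reading of the pasted output and not something stated in it.
The function name flagged_lwps is illustrative.

```python
def flagged_lwps(ps_lines):
    """Return (lid, cpu, name) for each row carrying the '>' marker.

    Assumed row shape (marker present):
    PID > LID STATE CPU FLAGS LWP-ADDRESS NAME [WAIT]
    """
    found = []
    for line in ps_lines:
        fields = line.split()
        # Flagged rows have the marker as a separate second field,
        # followed by at least six more columns.
        if len(fields) >= 8 and fields[1] == ">":
            lid = int(fields[2])
            cpu = int(fields[4])
            name = fields[7]
            found.append((lid, cpu, name))
    return found


# Rows copied from the listing above, used as sample input.
sample = [
    "0 96 1 14 200 fffffe813a7b9480 softnet/14",
    "0 > 95 7 14 201 fffffe813a7b98a0 idle/14",
    "0 94 3 13 200 fffffe813a7aa040 xcall/13 xcall",
    "0 > 89 7 13 201 fffffe813a795860 idle/13",
]

if __name__ == "__main__":
    for lid, cpu, name in flagged_lwps(sample):
        print(lid, cpu, name)
```

On the portion of the output that survived, every flagged row is an idle
thread, which is consistent with the 99.9% idle figure top reported.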
I tried doing ps/n|more and crash just hung.
I was able to get someone to plug in a monitor and keyboard. He read
this off the screen.
07:56:55 smaug dovecot: imap (eref): fatal: master: service (imap):
child 11193 killed with signal 6 (core not dumped) set service
imap (drop_priv_before_exec=yes)
08:07:09 smaug dovecot: imap (eref): panic: file imap-client.c: line 841
(client_check_command_hangs): assertion failed:
(!have_wait_unfinished || unfinished_count > 0)
08:07:09 smaug dovecot: imap (eref): fatal: master: service (imap):
child 4798 killed with signal 6 (core not dumped) set service imap
(drop_priv_before_exec=yes)
I am going to look at those sources but I suspect that this is a
symptom, not a cause.
I had the on-site person press <CTRL><ALT><ESC> but it did not drop
into the debugger.
--
D'Arcy J.M. Cain <darcy%NetBSD.org@localhost>
http://www.NetBSD.org/ IM:darcy%Vex.Net@localhost