Current-Users archive
Re: Machine livelock with latest (4.99.48) kernel on sparc64 -- mem leak?
Rafal Boni wrote:
I just rebooted my trusty Netra T1 with a shiny new 4.99.48 kernel and
thought I'd kick off a userland build. Things seemed to go swimmingly
for a few minutes, then the machine ground to an unusable state --
userland seems to be mostly non-responsive, though the machine is
pingable, answers a ^T at a tty (well, it did for a while after the
apparent lockup; it seems to be wedged harder now), and the disk sounds
like progress is still being made on the build.
But I can't get any echo from a tty anymore, and god forbid I should
want to log in ;)
Anyone seeing anything similar? Should I go back to the last-known-good
kernel for a while? ;)
Machine is a Netra T1 200 -- UltraSPARC-IIe @ 500 MHz with 512MB RAM.
So I thought I'd give it one more try, and I saw the same thing happen
this time with a kernel build (I thought I'd see if maybe there was
something else in the latest CVS that would help).
The machine locked up ~ 18:01; it's now 2+ hours later and the disk is
still chugging along. Here's the last thing 'top' on the console said
before the hang:
load averages: 4.95, 4.71, 3.82    up 0 days, 13:48    18:01:34
29 processes: 1 runnable, 27 sleeping, 1 on processor
CPU states: 0.0% user, 0.0% nice, 8.1% system, 3.4% interrupt, 88.5% idle
Memory: 184K Act, 336K Inact, 6096K Wired, 128K Exec, 328K File, 304K Free
Swap: 2050M Total, 36M Used, 2014M Free
Unless top's reporting is just way off (it didn't seem to be at the
start), there's a sucking memory leak somewhere -- the categories above
add up to only about 7 MB on a 512 MB machine, so where'd the other
500 MB of memory go?
DDB's ps/l (as well as its backtrace) also shows an interesting fact --
the active LWP has been the system idle loop every time I've ended up
in DDB due to this hang.
vmstat seems to confirm this is due to some memory-related condition;
the stats below are sampled every 2 seconds (I'm impatient ;)). This
has happened every time I've started a more significant build on this
box running 4.99.48 -- be it just a kernel build or an attempt to
build the whole system.
Here's idle vmstat shortly after the machine booted:
procs memory page disks faults cpu
r b w avm fre flt re pi po fr sr m0 c0 in sy cs us sy id
0 0 0 17904 450064 595 0 0 0 0 0 0 0 152 503 123 1 5 94
0 0 0 17912 450056 5 0 0 0 0 0 0 0 112 17 47 0 0 100
0 0 0 17912 450056 4 0 0 0 0 0 0 0 105 11 35 0 0 100
Build kicked off:
procs memory page disks faults cpu
r b w avm fre flt re pi po fr sr m0 c0 in sy cs us sy id
0 0 0 24680 439448 2423 0 0 0 0 0 0 0 153 918 222 2 13 85
2 0 0 24320 430968 10055 0 0 0 0 0 0 0 134 6054 389 42 51 6
0 0 0 30928 417976 6342 0 0 0 0 0 0 0 183 3184 404 35 34 31
2 0 0 31560 410648 7866 0 0 0 0 0 0 0 170 5527 420 39 46 15
1 0 0 32240 403224 7973 0 0 0 0 0 0 0 175 5742 408 40 49 10
1 0 0 32520 396464 7351 0 0 0 0 0 0 0 186 5428 430 38 45 17
2 0 0 32496 389896 8045 0 0 0 0 0 0 0 173 5937 399 41 46 13
0 0 0 32704 384400 5585 0 0 0 0 0 0 0 217 4269 510 35 37 28
Now the system starts to get less and less usable (memory dropping, CPU
mostly idle, disk making lots of noise; where did all my processes go?):
procs memory page disks faults cpu
r b w avm fre flt re pi po fr sr m0 c0 in sy cs us sy id
4 0 0 36960 20360 8355 0 0 0 0 0 0 0 169 5331 375 51 48 0
4 0 0 38008 12432 8077 0 0 0 0 0 0 0 176 5338 379 52 47 1
2 0 0 38224 5632 7746 0 0 0 0 0 0 0 177 5092 378 50 45 5
2 0 0 24152 1200 8160 1 0 36 127 128 0 0 188 5353 404 48 48 4
2 0 0 22912 680 7761 0 0 152 495 495 0 0 200 5074 383 53 45 2
3 0 0 22416 1056 7672 33 0 265 337 369 0 0 222 5027 405 60 37 3
3 0 0 21944 792 7643 61 0 234 421 482 0 0 222 5013 407 50 44 6
1 0 0 21360 1240 8619 86 23 293 471 958 0 0 279 5476 508 50 44 6
0 0 0 21720 1048 3845 215 151 231 374 1913 0 0 506 2352 827 22 23 55
0 0 0 19928 992 2626 240 202 253 336 1758 0 0 630 1367 1000 10 19 71
0 0 0 20384 360 964 174 167 225 290 1535 0 0 598 267 871 9 12 80
0 0 0 21832 360 747 172 166 231 309 1562 0 0 620 255 908 4 8 88
0 0 0 20696 792 1154 130 181 186 335 1061 0 0 684 453 1000 2 12 86
0 0 0 21240 432 839 122 161 123 268 976 0 0 721 285 1061 2 11 88
0 0 0 21000 544 659 170 151 86 258 1135 0 0 654 142 980 1 8 91
0 0 0 21024 424 733 199 182 175 282 1496 0 0 673 188 1029 2 7 91
0 0 0 20968 480 588 175 133 103 253 1290 0 0 651 101 975 0 4 95
0 0 0 21040 416 657 162 146 100 277 1517 0 0 702 132 1026 1 7 92
0 0 0 20864 376 643 50 141 83 257 786 0 0 664 101 995 0 4 96
0 0 0 20256 272 867 201 187 140 340 1628 0 0 821 174 1231 1 9 90
0 0 0 20832 248 523 283 203 158 254 2141 0 0 675 90 1048 1 4 95
0 0 0 21000 368 637 101 151 100 282 1275 0 0 737 96 1088 1 7 92
0 0 0 20720 416 695 200 149 104 288 1586 0 0 723 139 1116 1 7 92
0 0 0 20704 600 726 124 149 90 318 912 0 0 724 135 1075 1 5 93
0 0 0 20232 608 812 186 157 151 318 844 0 0 755 211 1157 3 7 90
0 0 0 20632 336 684 128 118 81 269 656 0 0 709 128 1070 0 6 93
0 0 0 20040 416 733 222 158 150 303 1115 0 0 677 154 1057 3 7 91
0 0 0 20320 360 604 242 183 120 295 1172 0 0 763 73 1136 2 7 91
Here's the final fun spike of frenetic VM activity (note the sr column
-- the page-scan rate -- hitting the tens of thousands) before I
decided to kill the system due to lack of response:
procs memory page disks faults cpu
r b w avm fre flt re pi po fr sr m0 c0 in sy cs us sy id
0 0 0 18800 264 6396 2067 2173 701 3327 18417 0 0 8465 85 13536 0 9 91
0 0 0 18928 272 4386 1494 1490 431 2268 11808 0 0 5835 49 9153 0 9 91
0 0 0 18768 352 3475 1081 1195 319 1789 8276 0 0 4664 50 7081 0 8 92
0 0 0 18600 312 5580 1710 2177 658 2846 22349 0 0 7802 94 12409 0 8 92
0 0 0 18720 240 4348 1410 1692 484 2249 14998 0 0 5995 95 9471 0 8 92
0 0 0 18440 392 5522 1706 2164 660 2812 19205 0 0 7753 86 11928 0 10 90
0 0 0 18608 320 6945 2180 2706 818 3560 25290 0 0 9940 103 15098 0 9 91
0 0 0 18608 344 6709 1999 2567 838 3462 23361 0 0 9417 107 14569 0 9 91
0 0 0 18648 288 3505 1202 1400 386 1781 12668 0 0 4868 42 7655 0 9 91
The free list never topped 350K after the system cratered.
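In case it helps anyone watch this happen, below is a rough little
poller that reads the UVM counters straight from the kernel every
couple of seconds, independent of vmstat/top. Caveat: the VM_UVMEXP2
mib and struct uvmexp_sysctl (with its pagesize/free/active/inactive/
wired fields) are the interface I remember vmstat(1) using; I haven't
re-checked them against the 4.99.48 headers, so treat this as a
sketch, not something verified:

#include <sys/param.h>
#include <sys/sysctl.h>
#include <uvm/uvm_extern.h>

#include <err.h>
#include <inttypes.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	int mib[2] = { CTL_VM, VM_UVMEXP2 };
	struct uvmexp_sysctl u;
	size_t len;

	for (;;) {
		len = sizeof(u);
		/* vm.uvmexp2: snapshot of the kernel's UVM counters */
		if (sysctl(mib, 2, &u, &len, NULL, 0) == -1)
			err(1, "sysctl vm.uvmexp2");

		/* the counters are page counts; scale by page size for KB */
		printf("free %7" PRId64 "K  act %7" PRId64 "K  "
		    "inact %7" PRId64 "K  wired %7" PRId64 "K\n",
		    u.free * u.pagesize / 1024,
		    u.active * u.pagesize / 1024,
		    u.inactive * u.pagesize / 1024,
		    u.wired * u.pagesize / 1024);
		sleep(2);
	}
}

Running that alongside a build should make it easier to see whether
the free list drains steadily or falls off a cliff, and whether the
missing memory turns up in any of the categories the kernel exports.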