tech-kern: User process eating memory makes system unusable & sometimes crashes

Subject: User process eating memory makes system unusable & sometimes crashes
To: None <tech-kern@netbsd.org>
From: theo borm <theo4490@borm.org>
List: tech-kern
Date: 11/26/2004 14:16:19
Dear list members,

My problem is as follows:
A specific program used to calculate maps of genomes ("FPC") sometimes
makes my system unusable (the network stops responding except to ICMP
messages and it becomes impossible to log in on a console; top keeps
running, as do some other programs, but nothing new can be started), and
sometimes even makes the system reboot without any prior warning.

My setup is as follows:
I have a server PC (i386, NetBSD 2.0 BETA, 256 MByte memory) and a diskless
client PC (i386, NetBSD 1.6.2 STABLE, 512 MByte memory). The diskless client
boots reliably, mounts root and swap (512 MBytes) over NFS, and operates
quite happily (except for some minor preventable NFS problems) until the
problem occurs (that is: FPC is run on the client). Both use a GENERIC
kernel, and do not share any binaries.

I have looked into the source code of FPC, and (I think) traced the problem
to an excessive number of mallocs and reallocs, and have written a small
program (see below) that reproduces the problem (at least partially):

After allocating 344 MBytes the program hangs in the realloc (as far as I
can tell, it never returns from the realloc libc function).

There are multiple things I find odd about this:
first:  realloc should always return.
second: user processes should not have this effect on the system.
third:  how can a small program allocating a mere 344 MBytes on an otherwise
        idle system with 512 MByte physical memory, leaving (net) about
        150 MByte physical memory for other processes, do this?
fourth: why should the size of this program grow to 1008 MBytes? (see top
        output below). Even in the unlikely case that a realloc of
        344 MBytes really means allocating an /extra/ 344 MBytes above the
        previous (336 MByte) allocation, this would only add up to
        680 MBytes, coincidentally 328 MBytes short of the 1008 MByte mark.
fifth:  swap space usage follows a rather erratic pattern (graph available
        on request), increasing to 100% (at 296 MBytes) before dropping to
        25%, growing to 100% (at 320 MBytes), then dropping to 25% again
        before growing to a final 100% at 336 MBytes.
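
The arithmetic behind the fourth point, written out as a small sketch. The
function names are made up for illustration, and the "worst case" assumed
here is simply that realloc keeps the old block alive while it copies into
a freshly allocated new one; whether the libc in question really behaves
that way is exactly what I do not know.

```c
/* Hypothetical helper: peak footprint in MBytes if realloc holds both the
 * old and the new block at the same time while copying. */
long realloc_peak_mb(long old_mb, long new_mb)
{
    return old_mb + new_mb;
}

/* Hypothetical helper: part of the observed process size (in MBytes) that
 * the copy-based worst case above does not account for. */
long unexplained_mb(long observed_mb, long old_mb, long new_mb)
{
    return observed_mb - realloc_peak_mb(old_mb, new_mb);
}
```

With the numbers from this report, realloc_peak_mb(336, 344) gives 680 and
unexplained_mb(1008, 336, 344) gives 328.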

I would appreciate any advice on how to go about debugging this problem; I'm
not even sure where exactly (kernel/libc ?) the problem lies.
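
One thing that might make further experiments less painful is to cap the
test program's data segment before it starts allocating, so that realloc
has a chance to fail with NULL instead of exhausting swap. The following is
only a sketch: cap_limit_mb is a made-up helper name, and it assumes that
the allocator obtains its memory in a way that the chosen resource limit
(for example RLIMIT_DATA) actually covers, which I have not verified.

```c
#include <sys/resource.h>

/* Hypothetical helper: lower the soft limit of `resource` to at most
 * `mbytes` megabytes. Returns 0 on success, -1 on error. */
int cap_limit_mb(int resource, unsigned long mbytes)
{
    struct rlimit rl;
    rlim_t want = (rlim_t)mbytes * 1048576;

    if (getrlimit(resource, &rl) == -1)
        return -1;

    /* The soft limit can never exceed the hard limit. */
    if (rl.rlim_max != RLIM_INFINITY && want > rl.rlim_max)
        want = rl.rlim_max;

    rl.rlim_cur = want;
    return setrlimit(resource, &rl);
}
```

Calling cap_limit_mb(RLIMIT_DATA, 256) at the top of main() would be the
intended use; from a shell, "ulimit -d" adjusts the same limit (in KBytes).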

with kind regards,

Theo Borm


------------------------------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>   /* malloc, realloc, free, exit */
#include <unistd.h>   /* sleep */

void abusememory(char * mem,int length)
{
    int i;
    for (i=0; i<length; i+=2048) mem[i]='!';
}
int main(void)
{
    int i;
    char * mem;
    char * temp;
    mem=(char *)malloc(0);
    for (i=0; i<1024; i+=8)
    {
       printf("allocating %d MBytes\n",i);
       temp=(char *)realloc(mem,i*1048576);
       if (temp==NULL)
       {/* not being able to allocate more is no problem -> just exit */
           printf("failed\n");
           free(mem);
           exit(0);
       }
       mem=temp;
       printf("succeeded\n");
       /* memory is allocated but not used until something is written to it */
       abusememory(mem,i*1048576);
       sleep(1);
    }
    free(mem);
    return 0;
}
-------------------------------------------------------------------------------
Top output: (note: 'mt' is the test program)

load averages:  2.98,  2.54,  2.24                                     12:56:35
25 processes:  3 runnable, 21 sleeping, 1 on processor
CPU states:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
Memory: 293M Act, 147M Inact, 124K Wired, 2684K Exec, 9104K File, 4K Free
Swap: 512M Total, 512M Used, 320K Free

  PID USERNAME PRI NICE   SIZE   RES STATE      TIME   WCPU    CPU COMMAND
   10 root     -18    0     0K   19M pgdaemon  14:15  0.00%  0.00% [pagedaemon]
 1507 theo     -18    0  1008M  391M flt_nora   0:56  0.00%  0.00% mt
    6 root       2    0     0K   19M netio      0:31  0.00%  0.00% [nfsio]
    7 root      -1    0     0K   19M nfsrcvlk   0:29  0.00%  0.00% [nfsio]

etcetera...