Subject: User process eating memory makes system unusable & sometimes crashes
To: None <tech-kern@netbsd.org>
From: theo borm <theo4490@borm.org>
List: tech-kern
Date: 11/26/2004 14:16:19
Dear list members,
My problem is as follows:
A specific program that is used to calculate maps of genomes ("FPC")
sometimes makes my system unusable (the network stops responding except to
ICMP messages and it becomes impossible to log in on the console; top keeps
running, as do some other programs, but nothing new can be started), and
sometimes even makes the system reboot without any prior warning.
My setup is as follows:
I have a server PC (i386, NetBSD 2.0 BETA, 256 MByte memory) and a diskless
client PC (i386, NetBSD 1.6.2 STABLE, 512 MByte memory). The diskless client
boots reliably, mounts root and swap (512 MBytes) over NFS, and operates
quite happily (except for some minor preventable NFS problems) until the problem
occurs (that is: FPC is run on the client). Both use a GENERIC kernel, and
do not share any binaries.
I have looked into the source code of FPC, and (I think) traced the problem
to an excessive number of mallocs and reallocs, and have written a small
program (see below) that reproduces the problem (at least partially):
After allocating 344 MBytes the program hangs in the realloc (as far as I can
tell, it does not return from the realloc libc function).
There are multiple things I find odd about this:
first: realloc should always return.
second: user processes should not have this effect on the system
third: how can a small program allocating a mere 344 MBytes on an otherwise
       idle system with 512 MByte physical memory, leaving (net) about
       150 MByte physical memory for other processes, do this?
fourth: why should the size of this program grow to 1008 MBytes? (see top
       output below). Even in the unlikely case that a realloc of 344 MBytes
       really means allocating an /extra/ 344 MBytes above the previous
       (336 MByte) allocation, this would only add up to 680 MByte,
       coincidentally 328 MByte short of the 1008 MByte mark.
fifth: the swap space usage follows a rather erratic pattern (graph
       available on request): it increases to 100% (at 296 MBytes), drops
       to 25%, grows to 100% again (at 320 MBytes), drops to 25% again, and
       then grows to a final 100% at 336 MBytes.
I would appreciate any advice on how to go about debugging this problem; I'm
not even sure where exactly (kernel/libc ?) the problem lies.
with kind regards,
Theo Borm
------------------------------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>   /* malloc, realloc, free, exit */
#include <unistd.h>   /* sleep */

/* touch one byte in every 2048 so that each page is actually written */
void abusememory(char * mem,int length)
{
    int i;
    for (i=0; i<length; i+=2048) mem[i]='!';
}

int main(void)
{
    int i;
    char * mem;
    char * temp;

    mem=(char *)malloc(0);
    for (i=0; i<1024; i+=8)
    {
        printf("allocating %d MBytes\n",i);
        temp=(char *)realloc(mem,i*1048576);
        if (temp==NULL)
        {   /* not being able to allocate more is no problem -> just exit */
            printf("failed\n");
            free(mem);
            exit(0);
        }
        mem=temp;
        printf("succeeded\n");
        /* memory is allocated but not used until something is written to it */
        abusememory(mem,i*1048576);
        sleep(1);
    }
    free(mem);
    return 0;
}
-------------------------------------------------------------------------------
Top output: (note: 'mt' is the test program)
load averages:  2.98,  2.54,  2.24                                    12:56:35
25 processes:  3 runnable, 21 sleeping, 1 on processor
CPU states:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
Memory: 293M Act, 147M Inact, 124K Wired, 2684K Exec, 9104K File, 4K Free
Swap: 512M Total, 512M Used, 320K Free
  PID USERNAME PRI NICE   SIZE   RES STATE      TIME   WCPU    CPU COMMAND
   10 root     -18    0     0K   19M pgdaemon  14:15  0.00%  0.00% [pagedaemon]
 1507 theo     -18    0  1008M  391M flt_nora   0:56  0.00%  0.00% mt
    6 root       2    0     0K   19M netio      0:31  0.00%  0.00% [nfsio]
    7 root      -1    0     0K   19M nfsrcvlk   0:29  0.00%  0.00% [nfsio]
etcetera...