We had a problem with a Python2.6 program running on NetBSD-6 where the main application would fork() and then exec() some other NetBSD program. We would occassionally observe the child process hang. Running on a single core, this would occur maybe once a week. When run on multi-core, this would occur once every few hours. Our Python program is multi-threaded (about 15 threads), and we use Python 2.6 built from pkgsrc, so it is using native pthreads. After a lot of debugging I traced it down to an interaction between pthreads, the malloc library, and the malloc library's use of semaphores and locks and the rules for what can and can not be done after forking. The IEEE Std 1003.1 Specification says that only async-signal-safe calls may be called in an after-fork child before exec is called. http://pubs.opengroup.org/onlinepubs/009695399/functions/fork.html The number of async-signal-safe calls are few, and malloc and free are not included in this list. https://www.securecoding.cert.org/confluence/display/seccode/SIG30-C.+Call+only+asynchronous-safe+functions+within+signal+handlers In practice, this requirement is only necessary if the program is threaded, and if there is more than 1 thread running. Thus, daemonization (using double fork) at the beginning of a program before it launches its threads is okay, even though it is technically not compliant with the spec. Python does *not* register handlers with pthread_atfork, but it does call its own function PyOS_AfterFork after calling fork. This is in the C code that implements Python, so calling fork in Python (or anything that calls fork such as the functions in the subprocess module) guarantees a call to PyOS_AfterFork. PyOS_AfterFork calls functions which are not async-signal-safe. This is true for both Python2.* and Python3.*. Visual inspection of PyOS_AfterFork uncovered a number of calls to functions that are *not* async-signal-safe. However, the vast majority of the problem calls are to malloc and free. When a child is forked, it gets a copy of the executing thread only. If another thread happens to hold a mutex in the malloc module, and then a child tries to do a malloc or free, the child will hang, because the thread which holds the mutex will never come alive and free it. I have attached a simple non-compliant program to illustrate the problem with malloc and free. In this program, there are threads (four by default, configurabale via command line) which just call malloc and free in a loop. The main thread is a loop which forks a child and waits for it. The child does a single malloc and then exits. The parent process will print the pid of the child forked, and then print the pid and status after waiting for the child. Without the changes recommended below, the program will eventually hang while waiting for a child. When this happens, if the child is subsequently killed with a signal, the wait call in the parent will return a non-zero status, and execution of the parent will resume. A simple workaround is to take all the malloc module locks before calling fork, then release them after the fork. This guarantees that the locks are held in the executing thread, and thus the child will be able to release them. This is what the _malloc_prefork and _malloc_postfork functions do. In both malloc.c and jemalloc.c, there are _malloc_prefork and _malloc_postfork functions defined, but they are unused. They are also not documented, and I am guessing not intended for general use. Taking the malloc module locks must be the last thing done before calling fork, and releasing the locks must be the first thing done after calling fork. This is so other functions registered with pthread_atfork can use malloc and free in their functions. This can be guaranteed by registering _malloc_prefork and _malloc_postfork in pthread_init. pthread_init is called before main, so no user program can possibly get a pthread_atfork handler registered before the ones in pthread_init. Here is the diff for pthread.c: diff --git a/netbsd/src/lib/libpthread/pthread.c b/netbsd/src/lib/libpthread/pthread.c index 4880957..6ba7a00 100644 --- a/netbsd/src/lib/libpthread/pthread.c +++ b/netbsd/src/lib/libpthread/pthread.c @@ -55,6 +55,9 @@ __RCSID("$NetBSD: pthread.c,v 1.125.4.1 2012/05/07 03:12:33 riz Exp $"); #include "pthread.h" #include "pthread_int.h" +/* Need to use libc-private names for malloc pre/post fork operations. */ +#include "../libc/include/extern.h" + pthread_rwlock_t pthread__alltree_lock = PTHREAD_RWLOCK_INITIALIZER; RB_HEAD(__pthread__alltree, __pthread_st) pthread__alltree; @@ -230,6 +233,7 @@ pthread__init(void) } /* Tell libc that we're here and it should role-play accordingly. */ + pthread_atfork(_malloc_prefork, _malloc_postfork, _malloc_postfork); pthread_atfork(NULL, NULL, pthread__fork_callback); __isthreaded = 1; } I did find one subtle bug in the _malloc_prefork and _malloc_postfork implementation in jemalloc.c. In both the prefork and postfork functions, the arenas_mtx is taken and released. I am wondering why it isn't just held in the prefork call to be released in the postfork call. Here's the diff for jemalloc.c: diff --git a/netbsd/src/lib/libc/stdlib/jemalloc.c b/netbsd/src/lib/libc/stdlib/jemalloc.c index 81dbf45..7905d9d 100644 --- a/netbsd/src/lib/libc/stdlib/jemalloc.c +++ b/netbsd/src/lib/libc/stdlib/jemalloc.c @@ -3915,7 +3915,6 @@ _malloc_prefork(void) if (arenas[i] != NULL) malloc_mutex_lock(&arenas[i]->mtx); } - malloc_mutex_unlock(&arenas_mtx); malloc_mutex_lock(&base_mtx); @@ -3933,7 +3932,6 @@ _malloc_postfork(void) malloc_mutex_unlock(&base_mtx); - malloc_mutex_lock(&arenas_mtx); for (i = 0; i < narenas; i++) { if (arenas[i] != NULL) malloc_mutex_unlock(&arenas[i]->mtx); Has anyone had experience using _malloc_prefork and _malloc_postfork? Is there any reason why they are not registered in pthread_init? While the Posix specification doesn't require that we add these handlers, in practice, they are very useful because there is so much non-compliant software out there. (Python is only one example; I have found others.) It's a simple and reasonable change that avoids the vast majority of bugs caused by non-compliant programs. I have tried running my non-compliant program with the modifications described above. My program has not hung once. My non-compliant program runs correctly on my Mac (OS X 10.8.2) and on Fedora 15 Linux installed on an x86 host. I haven't looked into whether that's because of an explicit locking strategy or not. The key hard question is whether a tiny performance hit for fork is a good trade for making buggy programs work better. Given that fork is relatively rare, and many uses of fork are likely to be buggy, it seems like this change would be a net win. Bev Schwartz BBN Technologies bschwart%bbn.com@localhost
Attachment:
thread_fork.c
Description: Binary data