Current-Users archive


Re: parallel build failure with .c.o rule interrupted mid-step!



At Wed, 05 Mar 2025 19:17:06 -0800, "Greg A. Woods" <woods%planix.ca@localhost> wrote:
Subject: Re: parallel build failure with .c.o rule interrupted mid-step!
>
> >
> > Hmmmm.... maybe there's a clue to this problem in the fact the child
> > shell's output is to a pipe which must be read by make.  Maybe they get
> > closed too soon?
>
> Ah ha!  PIPE DEATH!  ???
>
> I think the job output pipes are closed too soon!  Anything in the job
> that writes after some other parallel job encounters an error, including
> the ongoing job's shell, triggers its own death.

One thing has been bothering me.

I have not yet been able to reproduce the problem I sometimes see in
real builds of NetBSD using commands that exactly mirror the structure
of the ".c.o" rule.

Perhaps it's just that much more rare.

Now that I see what's causing the intermediate file to be left (i.e. the
SIGPIPE killing the shell before it starts the final command), I'm
wondering if the empty ".o" is caused by a previous build having left a
".o.o" file behind.

It still doesn't quite make sense.

There's a race between the shell writing the final command line to the
pipe and the SIGPIPE being delivered; in that window the shell may or
may not get as far as forking the child process to run that command
line.
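
Here's a minimal stand-alone sketch of that failure mode (my own
contrivance, not make's actual plumbing), showing a multi-command job
script dying part-way through as soon as the consumer of its output
pipe goes away:

	#!/bin/sh
	rm -f final.o
	# "head -n 1" stands in for make abandoning the job's output
	# pipe after reading just one line of it.
	{
		echo "compile step done"	# this line is consumed
		sleep 1				# let the reader exit first
		echo "starting ctfconvert"	# read end closed: SIGPIPE
		touch final.o			# never reached
	} | head -n 1
	ls final.o	# fails: the script died before its last command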

If the child process does get started, then it should run to completion
so long as it never tries to write to stdout or stderr, and normally I
don't think ctfconvert ever writes any output messages.

On the other hand, the likelihood of a left-over ".o.o" file is high,
as the shell may be killed before it forks to run the ctfconvert line.

The next run should just run the compiler again, overwriting the
".o.o", and then the ctfconvert will run.

The compiler could also write a warning message, causing it to be
killed.  The shell running it will then also exit before running
ctfconvert, because a compiler killed by SIGPIPE gives the shell a
non-zero (128+13) exit status.

But even if that caused an incomplete object file to be left behind
(because say the compiler gets the SIGPIPE part-way through writing the
".o.o"), the next run should just run the compiler again, overwriting
the incomplete ".o.o", and then the ctfconvert will run cleanly.

What could cause the ctfconvert to die after it has created the final
".o", but before it writes anything to that file?  I would think the
only way is if it writes something to stdout/stderr and gets a SIGPIPE
at exactly the wrong time.
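
A contrived stand-in for ctfconvert shows how that window can leave an
empty output file behind (that the real tool creates its output before
filling it in is my assumption here):

	#!/bin/sh
	rm -f out.o
	# Hypothetical converter: create the output file first, emit a
	# warning, then fill the output in.
	fake_ctfconvert() {
		: > "$2"		# output now exists, but empty
		echo "warning: $1"	# with make, this goes to the job pipe
		cat "$1" > "$2"		# never reached in the failure case
	}
	echo data > in.o
	# ":" exits without reading, orphaning the write end of the pipe,
	# much as when make closes a job pipe after another job fails.
	{ sleep 1; fake_ctfconvert in.o out.o; } | :
	ls -l out.o		# an empty out.o is left behind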

I can reproduce a failure if the ctfconvert step writes to stdout, but
it is indeed much rarer.

The other thing that bothers me is that the .c.o rule with
CTFCONVERT_RUN is fundamentally wrong w.r.t. how one should write good
reliable Makefiles -- it creates an intermediate file that make does not
know about.  If this were Cook, and Peter Miller were still with us to
comment, he would almost certainly say something like "fix the rule --
the DAG must be complete!"
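
A DAG-complete restructuring might look something like the following
sketch (my own, with a made-up ".co" suffix for the pre-ctfconvert
object, and assuming ctfconvert can write to a separate file with -o;
this is not a drop-in replacement for the NetBSD rule):

	.SUFFIXES: .c .co .o

	# compile to an explicit intermediate that make knows about...
	.c.co:
		${COMPILE.c} -o ${.TARGET} ${.IMPSRC}

	# ...and convert it to the final object in a separate step
	.co.o:
		${CTFCONVERT} ${CTFFLAGS} -o ${.TARGET} ${.IMPSRC}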

Fixing this in the way I think it has to be fixed, and without fixing
or changing the .c.o rule, will cause parallel builds that error out to
sometimes stall for rather a long time while any still-running parallel
jobs with long-running compiles complete.  I think that will leave a
lot of people somewhat surprised, and possibly dismayed.

Maybe make should run each job in a unique process group, and if one
fails it should just kill all the other process groups for all the other
jobs immediately, and of course make sure all the known product files
for those killed jobs are also cleaned.  Then the .c.o file might leave
a possibly incomplete ".o.o" file, but that would be the worst of it.
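
At the shell level the idea looks roughly like this (make itself would
presumably use setpgid() and killpg() internally; this is only a sketch
of the intended semantics, with made-up job commands, run with a shell
whose job control works in scripts, e.g. bash):

	#!/bin/sh
	# "set -m" enables job control, which puts each background job
	# in its own process group.
	set -m
	( sleep 30; touch long-job.product ) &
	long=$!
	( sleep 1; exit 1 ) &
	failed=$!
	wait "$failed" || {
		# a negative PID signals the whole process group
		kill -TERM -- -"$long" 2>/dev/null
		rm -f long-job.product	# clean the known product files
	}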

And if someone does want to let other parallel jobs run to completion
even if an error occurs in the build, they use "-k".  (I used to always
use "-k", but I've fallen out of the habit for some reason that I'm not
currently conscious of.)

--
					Greg A. Woods <gwoods%acm.org@localhost>

Kelowna, BC     +1 250 762-7675           RoboHack <woods%robohack.ca@localhost>
Planix, Inc. <woods%planix.com@localhost>     Avoncote Farms <woods%avoncote.ca@localhost>

#
#	tfail.mk:
#
# Demo two-commands-in-one-script failure
#
# Run with (remove the trailing 'exit' unless running it in emacs!):
#
#	rm -rf tfail; mkdir -p tfail; touch tfail/tfail.trace; make MAKEOBJDIR=tfail -T tfail.trace -f tfail.mk -j 20; rc=$?; sleep 2; fgrep DEATH tfail/*
#
# or:
#
#	rm -rf tfail; mkdir -p tfail; cd tfail; touch tfail.trace; make -T tfail.trace -f ../tfail.mk -j 20; rc=$?; sleep 2; fgrep DEATH *
#
# or:
#
#	rm -rf tfail; mkdir -p tfail; touch tfail/tfail.trace; make MAKEOBJDIR=tfail -T tfail.trace -f tfail.mk -j 20; rc=$?; sleep 2; ls -l tfail/*.int; for file in tfail/*.int; do if [ -f $file ]; then ls -l $(basename $file .int); fi; done; fgrep DEATH tfail/*
#
# When it fails, if it behaves the same as I see in a NetBSD build, then there
# will be one or more .int files, and for each there might also be an empty
# associated .obj file too.
#
# A left-over .int file means the script ended early, and that shouldn't happen
# -- all scripts should run to completion.
#
# An empty .obj file means something interrupted the script and that causes
# errors as it is an incorrectly built, incomplete, product file.
#

# prevent Ksh or NetBSD sh from running any user-controlled setup
#
# This should speed up builds where $ENV is accidentally set to a valid
# pathname.  Even in running this test it appears to shave off about 1/3 of a
# second of user CPU time, and maybe as much as half the system CPU time.  (In
# my own setup with a login shell of Ksh, it is set to a variable expansion that
# fails, but this should still speed up builds by avoiding having to try to
# parse and expand it.)
#
# Note that on NetBSD the default shell used by make is /bin/sh, and by
# default it is passed the option "-q" (which shows up in "$-" as
# "eLqs") because of the .echoFlag 'q' option (which, FYI, /bin/ksh
# doesn't have).  This has the effect of hiding what is being read from
# $ENV, if anything.
#
ENV = 		# empty
.export ENV

# This should, and appears to, mimic how the "DEFSHELL" is set up now for use in
# NetBSD:
#
#.SHELL: name=sh path=/bin/sh hasErrCtl=false \
#	newline="\n" \
#	check="echo \"%s\"\n" \
#	ignore="%s\n" \
#	errout="{ %s \n} || exit $?\n" \
#	echoFlag=q \
#	comment="\#"

# This sets up NetBSD sh to be used like a modern shell:
#
# Note the echoFlag=qv!  Without the 'v' it doesn't have the desired effect, yet
# it still doesn't show up in the executed shell's "$-"!
#
# This is slightly more efficient than the default old Bourne sh setup.
#
#.SHELL: name=sh path=/bin/sh hasErrCtl=true \
#	check="set -e" ignore="set +e" \
#	echo="set -v" quiet="set +v" filter="set +v" \
#	echoFlag=qv errFlag=e newline="'\n'" \
#	comment="\#"

# This should set up ksh to be used in the same antiquated way /bin/sh is set up
# to work as the "DEFSHELL" in NetBSD make, i.e. without Echo or Error control
# (as in the original KSH setup in Make, sans using "print" instead of "echo")
#
#.SHELL: name=ksh path=/bin/ksh hasErrCtl=false \
#	newline="\n" \
#	check="print \"%s\"\n" \
#	ignore="%s\n" \
#	errout="{ %s \n} || exit $?\n" \
#	comment="\#"

# This sets up ksh to be used like a modern shell
#
#.SHELL: name=ksh path=/bin/ksh hasErrCtl=true \
#	check="set -e" ignore="set +e" \
#	echo="set -v" quiet="set +v" filter="set +v" \
#	echoFlag=v errFlag=e newline="'\n'" \
#	comment="\#"

# This sets up ksh to be used like a modern shell with tracing support
#
#.SHELL: name=sh path=/bin/ksh hasErrCtl=true \
#	check="set -e" ignore="set +e" \
#	echo="set -v" quiet="set +v" filter="set +v" \
#	echoFlag=v errFlag=e newline="'\n'" \
#	comment="\#"

#.SHELL: name=sh path=/bin/dash hasErrCtl=true \
#	check="set -e" ignore="set +e" \
#	echo="set -v" quiet="set +v" filter="set +v" \
#	echoFlag=v errFlag=e newline="'\n'" \
#	comment="\#"

#.SHELL: name=ksh path=/usr/pkg/bin/dash

# Clear the list first just to eliminate any possible side-effects from
# <sys.mk>...
#
.SUFFIXES:

.SUFFIXES: .src .obj

OBJECT_TARGET	= ${.TARGET}.int

# N.B.:  In this form it is common for ${OBJECT_TARGET} to remain, and sometimes
# for the corresponding ${.TARGET} to also be there, but complete.
#
.src.obj:
#	@trap 'echo INT DEATH! >> ${.TARGET}; exit 2' INT
#	@trap 'echo HUP DEATH! >> ${.TARGET}; exit 11' HUP
#	@trap 'echo PIPE DEATH! >> ${.TARGET}; exit 13' PIPE
#	@trap 'echo TERM DEATH! >> ${.TARGET}; exit 15' TERM
# pretend one compile has a syntax error
	if [ ${.TARGET} = "src-3-9.obj" ]; then exit 1; fi
# simulate compilation
	touch ${OBJECT_TARGET} && sleep 0.1 && cat ${.IMPSRC} >> ${OBJECT_TARGET}
# simulate ctfconvert, with some compiles writing to stdout
	touch ${.TARGET} && ( if expr ${.TARGET} : src-3 >/dev/null; then echo warning about ${.TARGET}; fi; sleep 0.2 && cat ${OBJECT_TARGET} >> ${.TARGET} ) && rm -f ${OBJECT_TARGET}

# This form often reproduces the problem
#
#.src.obj:
## pretend one compile has a syntax error
#	if [ ${.TARGET} = "src-3-9.obj" ]; then  exit 1; fi
##
## job shells hang around and burn CPU for a bit with any of these traps!  (with
## or without the '@' flag) presumably because they try to write to a closed pipe
## but SIGPIPE is caught
##	@trap 'echo INT DEATH!; exit 2' INT
##	@trap 'echo HUP DEATH!; exit 11' HUP
##	@trap 'echo PIPE DEATH!; exit 13' PIPE
##	@trap 'echo TERM DEATH!; exit 15' TERM
##
## AH HA!  They fail if the job (including shell) writes to stdout/stderr!
##
#	@trap 'echo INT DEATH! >> ${.TARGET}; exit 2' INT
#	@trap 'echo HUP DEATH! >> ${.TARGET}; exit 11' HUP
#	@trap 'echo PIPE DEATH! >> ${.TARGET}; exit 13' PIPE
#	@trap 'echo TERM DEATH! >> ${.TARGET}; exit 15' TERM
## simulate compilation
#	touch ${OBJECT_TARGET}
#	sleep 0.1
#	cat ${.IMPSRC} >> ${OBJECT_TARGET}
## simulate ctfconvert
#	touch ${.TARGET}
#	sleep 0.2
#	cat ${OBJECT_TARGET} >> ${.TARGET} && rm -f ${OBJECT_TARGET}


PROD_ITERS ?= 40
SRC_ITERS ?= 40

all: .PHONY info .WAIT

info: .PHONY
	@printf ".SHELL = '${.SHELL}'\n"
	@printf "ENV = '${ENV}'($${ENV})\n"
	@printf "shell params = $${#}:'$${-}'\n"

# magic range expansions from Roland Illig -- ${:U:${:Urange=N}}
# expands to the list "1 2 ... N"
#
.for _i in ${:U:${:Urange=${PROD_ITERS}}}

all: dir-${_i}

.for _j in ${:U:${:Urange=${SRC_ITERS}}}

SRCS.${_i} += src-${_i}-${_j}.src
OBJS.${_i} += src-${_i}-${_j}.obj

src-${_i}-${_j}.src:
	echo ${.TARGET} > ${.TARGET}
.endfor

#
# pretend each "foo-*" is built in a separate subdirectory so that they can be
# built in parallel
#
dir-${_i}: .PHONY srcs-${_i}
	${MAKE} -f ${MAKEFILE} foo-${_i}

foo-${_i}: ${OBJS.${_i}}
	touch ${.TARGET}
# pretend foo-1 takes a really long time to build
	@if [ ${.TARGET} = "foo-1" ]; then sleep 10; fi
	cat ${OBJS.${_i}} > ${.TARGET}
	@if [ ${.TARGET} = "foo-1" ]; then echo ${.TARGET} finally done!; fi

# make each set of sources separately just to be sure they exist before the
# "dir-N" is built...
#
srcs-${_i}: .PHONY
	${MAKE} -f ${MAKEFILE} do-srcs-${_i}

do-srcs-${_i}: .PHONY ${SRCS.${_i}}

.endfor


#
# Local Variables:
# eval: (make-local-variable 'compile-command)
# compile-command: (concat "rm -rf tfail; mkdir -p tfail; touch tfail/tfail.trace; ENV=$HOME/.shrc time make MAKEOBJDIR=tfail -T tfail.trace -f tfail.mk -j 20; rc=$?; sleep 2; ls -l tfail/*.int; for file in tfail/*.int; do if [ -f $file ]; then ls -l $(basename $file .int); fi; done; fgrep DEATH tfail/*; exit $rc")
# End:
#



