NetBSD-Bugs archive


Re: kern/59081: Add close_range() system call



The following reply was made to PR kern/59081; it has been noted by GNATS.

From: Taylor R Campbell <riastradh%NetBSD.org@localhost>
To: Jörg Sonnenberger <joerg%bec.de@localhost>
Cc: gnats-bugs%NetBSD.org@localhost, netbsd-bugs%NetBSD.org@localhost,
	Ricardo Branco <rbranco%suse.de@localhost>,
	"David H. Gutteridge" <david%gutteridge.ca@localhost>,
	Christos Zoulas <christos%zoulas.com@localhost>
Subject: Re: kern/59081: Add close_range() system call
Date: Sun, 30 Mar 2025 21:41:05 +0000

 > Date: Wed, 26 Mar 2025 16:46:37 +0100
 > From: Jörg Sonnenberger <joerg%bec.de@localhost>
 >
 > > It would be nice to have this available natively, for sure. I was asked
 > > by an upstream project why NetBSD didn't have this.
 >
 > I've never seen a use case that closefrom(3) doesn't cover.
 
 I wish the motivation were more clearly spelled out.  My best guess is
 the following:
 
 Suppose you want to create a process with a specific fd mapping.  It
 is not necessarily contiguous: for example, with librumphijack, we
 deliberately use two separate ranges of file descriptors, one for
 `host' fds (e.g., the socket to talk to the rump server) and one for
 `rump' fds (interpreted by the rump server); these are separated by a
 large number to reduce the chance of collision.
 
 So, the fd mapping might look like this:
 
 parent                 child
 ------                 -----
 0 (stdin)              0 (stdin)
 3 (output file)        1 (stdout)
 3 (output file)        2 (stderr)
 4 (rump socket)        65536
 
 This shape of mapping is, really, the right interface for a program
 running a subprocess, and I was always disappointed that
 posix_spawn(2) had a sequence of open/dup2/close actions instead of
 such a mapping.
 
 How do you effect this mapping?
 
 With closefrom(2), you might do something like this:
 
 	bitmap_t keepopen = {0}
 	int maxfd = -1
 	for (entry in map) {
 		bitmap_set(&keepopen, entry.child)
 		/* Track the highest target even if it needs no dup2. */
 		maxfd = MAX(maxfd, entry.child)
 		if (entry.child == entry.parent)
 			continue
 		/* If target entry.child is needed as a source, dup. */
 		for (entry1 in map) {
 			if (entry.child == entry1.parent)
 				entry1.parent = dup(entry1.parent)
 		}
 		dup2(entry.parent, entry.child)
 	}
 	for (fd = 0; fd < maxfd; fd++) {
 		if (!bitmap_isset(&keepopen, fd))
 			close(fd)
 	}
 	closefrom(maxfd + 1)
 
 With close_range(2), you can instead do:
 
 	close_range(0, UINT_MAX, CLOSE_RANGE_CLOEXEC)
 	for (entry in map) {
 		if (entry.child != entry.parent) {
 			/* If target entry.child is needed as a source, dup. */
 			for (entry1 in map) {
 				if (entry.child == entry1.parent)
 					entry1.parent = dup_cloexec(entry.child)
 			}
 			dup2(entry.parent, entry.child)
 		}
 		/*
 		 * Clear FD_CLOEXEC, i.e., keep it open on exec.  This is
 		 * needed even when entry.child == entry.parent, because
 		 * close_range marked every fd close-on-exec above.
 		 */
 		fcntl(entry.child, F_SETFD,
 		    fcntl(entry.child, F_GETFD) & ~FD_CLOEXEC)
 	}
 
 (The inner loop could be eliminated, of course, by first indexing the
 parent sources in linear time and then updating a parent->replacement
 map as we go so the whole thing runs in linear rather than quadratic
 time and never dups the same source repeatedly.  But this is the same
 for both algorithms; it doesn't distinguish closefrom(2) from
 close_range(2).)
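 
 For concreteness, here is a rough C sketch of the remapping loop from
 the close_range(2) variant.  The names struct fd_map_entry and
 apply_fd_map are made up for illustration; they are not an existing
 API.  The sketch assumes the caller has already marked every open fd
 close-on-exec (e.g., with close_range and CLOSE_RANGE_CLOEXEC where
 that is available), it spells dup_cloexec as fcntl F_DUPFD_CLOEXEC,
 and it keeps the quadratic inner loop for brevity:

```c
#define _POSIX_C_SOURCE 200809L	/* for F_DUPFD_CLOEXEC */
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical row of the parent->child table; not an existing API. */
struct fd_map_entry {
	int parent;	/* fd number in the parent (source) */
	int child;	/* fd number the child should see (target) */
};

/*
 * Assumption: the caller has already marked every open fd
 * close-on-exec, e.g. with
 * close_range(0, UINT_MAX, CLOSE_RANGE_CLOEXEC) where available.
 * Returns 0 on success, -1 on failure with errno set.
 */
static int
apply_fd_map(struct fd_map_entry *map, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		int flags;

		if (map[i].child != map[i].parent) {
			/* If the target is needed as a later source, dup it. */
			for (size_t j = i + 1; j < n; j++) {
				int tmp;

				if (map[j].parent != map[i].child)
					continue;
				tmp = fcntl(map[j].parent, F_DUPFD_CLOEXEC, 0);
				if (tmp == -1)
					return -1;
				map[j].parent = tmp;
			}
			if (dup2(map[i].parent, map[i].child) == -1)
				return -1;
		}
		/* Clear FD_CLOEXEC, i.e., keep it open on exec. */
		flags = fcntl(map[i].child, F_GETFD);
		if (flags == -1 ||
		    fcntl(map[i].child, F_SETFD, flags & ~FD_CLOEXEC) == -1)
			return -1;
	}
	return 0;
}
```

 With the table above, the map would be { {0, 0}, {3, 1}, {3, 2},
 {4, 65536} }, applied in the child between fork and exec.  Any
 temporary dups stay close-on-exec, so the exec cleans them up along
 with everything else that was not explicitly kept.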
 
 Here's an example of the second algorithm in the real world (with
 
 https://github.com/GNOME/vte/blob/b23aaaeeca588439d4579f4ed06c1f4850219fc5/src/spawn.cc#L380-L385
 https://github.com/GNOME/vte/blob/b23aaaeeca588439d4579f4ed06c1f4850219fc5/src/spawn.cc#L437-L505
 
 One advantage of the second algorithm with close_range(2) is that it
 doesn't require computing any auxiliary data structure for a
 (potentially sparse) bit map in userland, and doesn't require userland
 to iterate over a (potentially large and sparse) range of file
 descriptors below the first one to closefrom(2).
 
 One advantage of the first algorithm with closefrom(2) is that it has
 only one traversal over the whole fd table (userland loop +
 closefrom), while
 the second algorithm with close_range(2) has two -- close_range(2)
 traverses it once to set CLOEXEC, and then in the subsequent exec, the
 kernel traverses it once more to interpret CLOEXEC.  Maybe the kernel
 traversal is cheaper so that doesn't matter.
 
 So, it's not a priori clear to me that one algorithm wins over the
 other in performance with large fd tables.  But close_range(2) is a
 little more convenient for implementing the interface that is really
 useful.
 

