Subject: Problems with PF_KEY SADB_DUMP
To: None <tech-net@netbsd.org>
From: Jonathan Stone <jonathan@DSG.Stanford.EDU>
List: tech-net
Date: 09/19/2003 15:46:17
Here's a summary of the current status on PF_KEY problems with
SADB_DUMP of a modest-to-large SA database (at least as I see it):
* There is a consensus that NetBSD needs a correct, reliable, robust
interface to PF_KEY; and that a kernfs-based approach (as kernfs
is strictly optional in NetBSD) is by definition not a suitable API.
(Bill Studenmund disagrees; Bill would like to make kernfs more standard.
Bill has been heard, but for now that's a different issue).
* The PF_KEY API defines SADB_DUMP so that the app sends one
SADB_DUMP message, to which the kernel responds with multiple SADB_DUMP
responses. Each response carries one SA. Thus, SADB_DUMP cannot be reworked
to use Matt Thomas's suggestion (do the uiomove() directly) without
changing the userspace API.
* There is a genuine bug in the KAME PF_KEY, which has also been
faithfully copied in fast-ipsec (NetBSD and FreeBSD): if a process
requests an SADB_DUMP and the kernel fills the requester's so_rcv queue,
KAME fails to place an error indication in the last-delivered packet.
(That's why racoon hangs in sbwait(): it is waiting to read another
SADB_DUMP message.)
KAME setkey has a kludge to avoid the bug: it does a setsockopt()
with SO_RCVTIMEO, and in the loop to read subsequent SADB_DUMP responses,
setkey interprets a subsequent EAGAIN as a sign to abort the loop.
IMNSO, that's not up to the standards to which NetBSD code aspires.
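For anyone who hasn't looked at it, the shape of that kludge is roughly
the following. This is only a sketch: drain_until_timeout() is a name I
made up, it takes any socket fd rather than a real PF_KEY socket, and
setkey's actual loop differs in detail.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>

/*
 * Hypothetical helper: read messages from `fd` until the receive
 * timeout fires. Returns the number of messages read, or -1 on a
 * real error. A timeout (EAGAIN) is treated as "end of dump".
 */
static int
drain_until_timeout(int fd, long timeout_sec)
{
	struct timeval tv = { .tv_sec = timeout_sec, .tv_usec = 0 };
	char buf[4096];
	int count = 0;

	assert(fd >= 0);
	if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) == -1)
		return -1;

	for (;;) {
		ssize_t n = recv(fd, buf, sizeof(buf), 0);

		if (n == -1) {
			if (errno == EAGAIN || errno == EWOULDBLOCK)
				break;		/* timed out: assume nothing more is coming */
			if (errno == EINTR)
				continue;	/* interrupted: retry */
			return -1;		/* genuine error */
		}
		if (n == 0)
			break;			/* EOF on a stream socket */
		count++;
	}
	return count;
}
```

Note that the caller cannot tell "dump complete" from "dump truncated
because the queue filled": both end in the same EAGAIN.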
A more correct fix is to have the sendup code check whether additional
SADB_DUMP messages are required; if more are required, and there
isn't space for at least one more (in addition to the current
message) then set sadb_msg_errno to (e.g.) ENOBUFS, to indicate
the SADB_DUMP responses are truncated at that message.
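A toy model of that check, to make the proposal concrete. This is not
kernel code: struct toy_sockbuf only mirrors the four sockbuf counters
that matter here, every toy_* name is invented for illustration, and I
assume (as a simplification) that the next response needs the same
space as the current one.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Stand-in for the relevant fields of a socket receive buffer. */
struct toy_sockbuf {
	size_t sb_cc;		/* bytes of data queued */
	size_t sb_hiwat;	/* limit on sb_cc */
	size_t sb_mbcnt;	/* bytes of mbuf storage in use */
	size_t sb_mbmax;	/* limit on sb_mbcnt */
};

/* Is there room for `cc` more data bytes using `mb` bytes of storage? */
static int
toy_fits(const struct toy_sockbuf *sb, size_t cc, size_t mb)
{
	return sb->sb_cc + cc <= sb->sb_hiwat &&
	    sb->sb_mbcnt + mb <= sb->sb_mbmax;
}

/*
 * Pick sadb_msg_errno for the response about to be queued.
 * `remaining` is the number of SAs still to send after this one.
 * Returns ENOBUFS if this response must mark the truncation point.
 */
static int
toy_dump_errno(const struct toy_sockbuf *sb, size_t len, size_t mbsize,
    size_t remaining)
{
	assert(sb != NULL);
	if (remaining == 0)
		return 0;	/* last SA anyway: nothing is being cut off */
	/* Need room for this message plus at least one more. */
	if (!toy_fits(sb, 2 * len, 2 * mbsize))
		return ENOBUFS;
	return 0;
}

/*
 * Queue up to `total` responses. Returns how many were queued;
 * *last_errno gets the sadb_msg_errno of the final one queued.
 */
static size_t
toy_dump(struct toy_sockbuf *sb, size_t total, size_t len, size_t mbsize,
    int *last_errno)
{
	size_t sent = 0;

	*last_errno = 0;
	while (sent < total && toy_fits(sb, len, mbsize)) {
		*last_errno = toy_dump_errno(sb, len, mbsize, total - sent - 1);
		sb->sb_cc += len;
		sb->sb_mbcnt += mbsize;
		sent++;
		if (*last_errno != 0)
			break;	/* truncation point delivered: stop here */
	}
	return sent;
}
```

With this, a reader that sees ENOBUFS in a response knows the dump was
cut short at that message, and has no need to guess from a timeout.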
* A major reason we run into this is the very small size of the
SADB_DUMP responses. They leave about 70% of each mbuf empty. The
nett result is that the requesting PF_KEY socket is hitting its
sb_mbmax limit while sb_cc is still only at 70k or thereabouts (with
the sb_hiwat limit at 256k).
Thus, increasing the receive queue via setsockopt(..., SO_RCVBUF, ...)
*on its own* doesn't help one iota (exactly as I reported to Itojun):
SO_RCVBUF does an sbreserve(), and sbreserve() clips the socket queue's
sb_mbmax at sb_max (NetBSD sysctl kern.sbmax).
To increase the number of SAs that can be returned, you have to bump
sb_max: and bump it to values way beyond what I consider reasonable
for general-purpose use. (Setting sb_max to 1024*1024 is still on the
low side for the applications I want.)
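The arithmetic behind this can be sketched as below. Everything named
toy_* is invented for illustration. The factor of 2 in toy_mbmax() is
my recollection of the classic BSD sbreserve(), so treat it as an
assumption; likewise the numbers in the note that follows (256-byte
mbufs, 77 bytes of payload each, sb_max equal to 256k) are illustrative
guesses chosen to match the "about 70% empty" figure, not measurements.

```c
#include <assert.h>
#include <stddef.h>

static size_t
min_sz(size_t a, size_t b)
{
	return a < b ? a : b;
}

/*
 * Model of the clip: reserving `cc` bytes sets sb_mbmax to
 * min(2 * cc, sb_max). The factor 2 is an assumption.
 */
static size_t
toy_mbmax(size_t cc, size_t sb_max)
{
	return min_sz(cc * 2, sb_max);
}

/*
 * How many responses fit in the queue, when each carries `len` data
 * bytes but occupies `mbsize` bytes of mbuf storage? Whichever limit
 * (sb_hiwat on data, sb_mbmax on storage) is hit first wins.
 */
static size_t
toy_capacity(size_t hiwat, size_t mbmax, size_t len, size_t mbsize)
{
	assert(len > 0 && mbsize > 0);
	return min_sz(hiwat / len, mbmax / mbsize);
}
```

Under those assumed numbers, a 256k sb_hiwat with sb_mbmax clipped to
256k holds 1024 responses, i.e. about 77k of sb_cc. Raising the
reservation to 1M while sb_max stays at 256k still gives 1024; only
with sb_max also at 1M does it rise (to 4096 in this model).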
* I have verified that bumping both sb_max *and* the per-socket receive
queue does indeed increase the number of SAs the kernel can return,
on both a week-old NetBSD fast-ipsec and on FreeBSD 4.x fast-ipsec.
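For reference, the per-socket half of that looks something like the
sketch below (bump_rcvbuf() is a made-up helper name, and it works on
any socket fd; a real test would use the PF_KEY socket). The sb_max
half is a sysctl and has to be raised first, by root.

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Hypothetical helper: ask for a receive buffer of `want` bytes on
 * `fd`, then return the size the kernel reports, or -1 on error.
 */
static int
bump_rcvbuf(int fd, int want)
{
	int got = 0;
	socklen_t len = sizeof(got);

	assert(fd >= 0 && want > 0);
	if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &want, sizeof(want)) == -1)
		return -1;	/* request rejected by the kernel */
	if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &got, &len) == -1)
		return -1;
	return got;		/* may differ from `want`; systems round or clamp */
}
```

The value read back need not equal the value requested, so it is worth
checking rather than assuming the request took effect as asked.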
To paraphrase another developer's private email: we may have to do
some papering-over here, but I'm not yet sure whether we paper over
the implementation, or get a ladder big enough to start papering over the spec.
Packing the SAs more densely into the socket queue would have the most
immediate pay-back (if we can do that without breaking the API?). I'm
wondering if the long-term fix is to add an ioctl()-style API, where
we can return an atomic snapshot of the SADB, up to whatever size the
userland process has address-space for.
That's where it's at. Where do we go from here?