Subject: Re: kern/17616: accept on a tcp socket loses options
To: None <gnats-admin@netbsd.org>
From: Bill Studenmund <wrstuden@netbsd.org>
List: tech-net
Date: 07/17/2002 13:11:03
I finally figured out the weird TCP behavior I reported back in May.
On Thu, 2 May 2002, Bill Studenmund wrote:
> I've been playing with some iSCSI code, and noticed a very odd behavior
> with a userland test program. It performs a series of iSCSI pings (NOP-OUT
> with optional NOP-IN echo). They all fly along (less than like .2 seconds
> for 1000 iterations), until the one test case where the iSCSI target is
> echoing 4k of data back. That test takes 200 seconds.
>
> I looked into it, and the problem is that, for that one test case, the
> target is sending data back in two writes. The first is a 48-byte iSCSI
> PDU, the other is the 4k of test data. For some reason, the target waits
> for an ack from the initiator before sending the 4k response. That ack
> takes .199 seconds to arrive, thus adding the delay.
>
> What I really don't get is that both sides are doing the same write
> sequence (48 bytes, 4k), with the same tcp options (TCP_NODELAY), running
> on the same machine (using localhost), but only one side of it is having
> to delay.
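(For context, the target side boils down to something like the following
sketch; ls, sock, pdu and data are made-up names, not the actual iSCSI
code, and the usual socket headers are assumed.)

	int one = 1;

	/* TCP_NODELAY is set on the listening socket... */
	setsockopt(ls, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
	sock = accept(ls, NULL, NULL);

	/* ...but the accepted socket doesn't inherit it, so Nagle holds
	 * the 4k response behind the unacked 48-byte PDU until the
	 * initiator's delayed ack (~200ms) finally arrives. */
	write(sock, pdu, 48);
	write(sock, data, 4096);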
It turns out that the problem is that our code doesn't copy TCP options
from the listening socket to the connected socket. As mentioned in the PR,
this seems like a bug to me.
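You can see the lossage from userland with something like this (a rough
sketch with error checking and the bind/listen boilerplate omitted; assumes
the usual <sys/socket.h>, <netinet/in.h>, <netinet/tcp.h> and <stdio.h>
includes):

	int ls, s, one = 1, val = 0;
	socklen_t len = sizeof(val);

	ls = socket(AF_INET, SOCK_STREAM, 0);
	setsockopt(ls, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
	/* ... bind(), listen() ... */
	s = accept(ls, NULL, NULL);

	/* Without the patch this prints 0: the accepted socket did not
	 * inherit TCP_NODELAY from the listening socket. */
	getsockopt(s, IPPROTO_TCP, TCP_NODELAY, &val, &len);
	printf("TCP_NODELAY on accepted socket: %d\n", val);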
I have a patch which corrects the behavior:
Index: tcp_input.c
===================================================================
RCS file: /cvsroot/syssrc/sys/netinet/tcp_input.c,v
retrieving revision 1.122.2.8
diff -u -r1.122.2.8 tcp_input.c
--- tcp_input.c 2002/06/20 03:48:54 1.122.2.8
+++ tcp_input.c 2002/07/17 20:11:55
@@ -3238,6 +3238,7 @@
#endif
else
tp = NULL;
+ tp->t_flags = sototcpcb(oso)->t_flags & TF_NODELAY;
if (sc->sc_request_r_scale != 15) {
tp->requested_s_scale = sc->sc_requested_s_scale;
tp->request_r_scale = sc->sc_request_r_scale;
I only copy over TF_NODELAY, as it's the only user-settable TCP flag.
This patch does alter existing behavior: before, if you set TCP_NODELAY on
a listening socket, you got connected sockets without it; now you'll get
connected sockets with it.
I don't think we need to worry about this (in this particular case), as I
can't think of a case where a program would set TCP_NODELAY on a listening
socket and expect it not to be set on the connected ones. BSD wisdom was
that you set TCP_NODELAY after the accept; Linux wisdom (one of the things
I think Linux actually got right) is that you set TCP_NODELAY on the
listening socket so that it gets set on all of the connected ones.
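In code, the two idioms look like this (again a sketch; ls is a listening
socket and one is an int set to 1):

	/* BSD style: set it per-connection, after the accept. */
	s = accept(ls, NULL, NULL);
	setsockopt(s, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));

	/* Linux style: set it once on the listener; with this patch
	 * every accepted socket comes back with it already set. */
	setsockopt(ls, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
	s = accept(ls, NULL, NULL);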
I think the change should be documented, but I'm not sure where. It would
seem weird to discuss TCP_NODELAY on the accept(2) man page, but anywhere
else it might be a bit buried.
Thoughts?
Take care,
Bill