Source-Changes-HG archive
[src/trunk]: src Reduces the resources demanded by TCP sessions in TIME_WAIT-...
details: https://anonhg.NetBSD.org/src/rev/c231e1897ffd
branches: trunk
changeset: 764776:c231e1897ffd
user: dyoung <dyoung%NetBSD.org@localhost>
date: Tue May 03 18:28:44 2011 +0000
description:
Reduce the resources demanded by TCP sessions in the TIME_WAIT state
using two methods, Vestigial Time-Wait (VTW) and Maximum Segment
Lifetime Truncation (MSLT).
MSLT and VTW were contributed by Coyote Point Systems, Inc.
Even after a TCP session enters the TIME_WAIT state, its corresponding
socket and protocol control blocks (PCBs) stick around until the TCP
Maximum Segment Lifetime (MSL) expires. On a host whose workload
necessarily creates and closes down many TCP sockets, the sockets & PCBs
for TCP sessions in TIME_WAIT state amount to many megabytes of dead
weight in RAM.
Maximum Segment Lifetime Truncation (MSLT) assigns each TCP session to
a class based on the nearness of the peer. Corresponding to each class
is an MSL, and a session uses the MSL of its class. The classes are
loopback (local host equals remote host), local (local host and remote
host are on the same link/subnet), and remote (local host and remote
host communicate via one or more gateways). Classes corresponding to
nearer peers have lower MSLs by default: 2 seconds for loopback, 10
seconds for local, 60 seconds for remote. Loopback and local sessions
expire more quickly when MSLT is used.
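
The class assignment described above can be sketched roughly as follows.
This is only an illustration, not the code in tcp_vtw.c: the names
msl_class, msl_classify and msl_for_peer are hypothetical, the sketch
covers IPv4 only, and it assumes that "same link/subnet" can be
approximated by one netmask comparison. The default MSL values are the
ones given in this description.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; not the identifiers used by the real MSLT code. */
enum msl_class { MSL_LOOPBACK, MSL_LOCAL, MSL_REMOTE };

/* Default MSLs in seconds, taken from the commit description. */
static const unsigned msl_default[] = { 2, 10, 60 };

/*
 * Classify an IPv4 peer.  Addresses and netmask are in host byte order.
 * Assumption: "on the same link" is approximated by a single
 * subnet-mask comparison.
 */
static enum msl_class
msl_classify(uint32_t laddr, uint32_t faddr, uint32_t netmask)
{
	if (laddr == faddr)
		return MSL_LOOPBACK;	/* local host equals remote host */
	if ((laddr & netmask) == (faddr & netmask))
		return MSL_LOCAL;	/* same subnet */
	return MSL_REMOTE;		/* reached via one or more gateways */
}

/* Return the MSL, in seconds, that a session with this peer would use. */
static unsigned
msl_for_peer(uint32_t laddr, uint32_t faddr, uint32_t netmask)
{
	enum msl_class c = msl_classify(laddr, faddr, netmask);

	assert(c <= MSL_REMOTE);
	return msl_default[c];
}
```

With a /24 netmask, a session from 10.0.0.1 to 10.0.0.2 would fall in
the local class and use the 10-second MSL in this sketch.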
Vestigial Time-Wait (VTW) replaces a TIME_WAIT session's PCB/socket
dead weight with a compact representation of the session, called a
"vestigial PCB". VTW data structures are designed to be very fast and
memory-efficient: for fast insertion and lookup of vestigial PCBs,
the PCBs are stored in a hash table that is designed to minimize the
number of cacheline visits per lookup/insertion. The memory for both
vestigial PCBs and elements of the PCB hash table comes from
fixed-size pools, and linked data structures exploit this to conserve
memory by representing references with a narrow index/offset from the
start of a pool instead of a pointer. When space for new vestigial PCBs
runs out, VTW makes room by discarding old vestigial PCBs, oldest first.
VTW cooperates with MSLT.
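
To illustrate the idea of fixed-size pools with index-based links and
oldest-first eviction, here is a minimal sketch. It is not the data
structure in tcp_vtw.c: every name (struct vestige, vtw_insert,
vtw_lookup, and so on) is hypothetical, the hash is a toy, and the
cacheline-conscious layout of the real hash table is not modelled. It
assumes entries are only ever removed by eviction, so reusing slots in
ring order always discards the oldest entry.

```c
#include <assert.h>
#include <stdint.h>

#define POOL_SIZE 4		/* tiny pool, for illustration only */
#define NBUCKETS  8
#define IDX_NONE  UINT16_MAX	/* plays the role of a NULL pointer */

/* Hypothetical compact record for one TIME_WAIT session. */
struct vestige {
	uint32_t faddr, laddr;
	uint16_t fport, lport;
	uint16_t hash_next;	/* 16-bit pool index instead of a pointer */
	uint8_t  in_use;
};

static struct vestige pool[POOL_SIZE];
static uint16_t bucket[NBUCKETS];
static uint16_t next_slot;	/* slots are reused in ring order */

static void
vtw_init(void)
{
	for (unsigned i = 0; i < NBUCKETS; i++)
		bucket[i] = IDX_NONE;
	for (unsigned i = 0; i < POOL_SIZE; i++)
		pool[i].in_use = 0;
	next_slot = 0;
}

static unsigned
vtw_hash(uint32_t faddr, uint16_t fport, uint32_t laddr, uint16_t lport)
{
	return (faddr ^ laddr ^ (((uint32_t)fport << 16) | lport)) % NBUCKETS;
}

/* Remove pool[i] from its hash chain and mark the slot free. */
static void
vtw_unlink(uint16_t i)
{
	struct vestige *v = &pool[i];
	uint16_t *pp = &bucket[vtw_hash(v->faddr, v->fport, v->laddr, v->lport)];

	while (*pp != IDX_NONE && *pp != i)
		pp = &pool[*pp].hash_next;
	assert(*pp == i);
	*pp = v->hash_next;
	v->in_use = 0;
}

/* Insert a session; when the pool is full, the oldest entry is evicted. */
static uint16_t
vtw_insert(uint32_t faddr, uint16_t fport, uint32_t laddr, uint16_t lport)
{
	uint16_t i = next_slot;
	struct vestige *v = &pool[i];
	unsigned b = vtw_hash(faddr, fport, laddr, lport);

	if (v->in_use)
		vtw_unlink(i);
	v->faddr = faddr; v->fport = fport;
	v->laddr = laddr; v->lport = lport;
	v->in_use = 1;
	v->hash_next = bucket[b];
	bucket[b] = i;
	next_slot = (uint16_t)((i + 1) % POOL_SIZE);
	return i;
}

/* Return the pool index of a matching session, or IDX_NONE. */
static uint16_t
vtw_lookup(uint32_t faddr, uint16_t fport, uint32_t laddr, uint16_t lport)
{
	uint16_t i = bucket[vtw_hash(faddr, fport, laddr, lport)];

	for (; i != IDX_NONE; i = pool[i].hash_next) {
		const struct vestige *v = &pool[i];
		if (v->faddr == faddr && v->fport == fport &&
		    v->laddr == laddr && v->lport == lport)
			return i;
	}
	return IDX_NONE;
}
```

Each link here costs two bytes rather than the four or eight a pointer
would need, which is the kind of saving the description attributes to
index/offset references.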
It may help to think of VTW as a "FIN cache" by analogy to the SYN
cache.
A 2.8-GHz Pentium 4 running a test workload that creates TIME_WAIT
sessions as fast as it can is approximately 17% idle when VTW is active
versus 0% idle when VTW is inactive. It has 103 megabytes more free RAM
when VTW is active (approximately 64k vestigial PCBs are created) than
when it is inactive.
diffstat:
distrib/sets/lists/comp/mi | 3 +-
sys/dist/pf/net/pf.c | 8 +-
sys/netinet/Makefile | 5 +-
sys/netinet/files.netinet | 3 +-
sys/netinet/in_pcb.c | 108 +-
sys/netinet/in_pcb.h | 7 +-
sys/netinet/in_pcb_hdr.h | 25 +-
sys/netinet/tcp_input.c | 333 +++-
sys/netinet/tcp_subr.c | 72 +-
sys/netinet/tcp_usrreq.c | 81 +-
sys/netinet/tcp_var.h | 14 +-
sys/netinet/tcp_vtw.c | 2425 ++++++++++++++++++++++++++++++
sys/netinet/tcp_vtw.h | 420 +++++
sys/netinet/udp_usrreq.c | 9 +-
sys/netinet6/in6_pcb.c | 94 +-
sys/netinet6/in6_pcb.h | 9 +-
sys/netinet6/in6_src.c | 14 +-
sys/netinet6/ip6_input.c | 6 +-
sys/netinet6/raw_ip6.c | 10 +-
sys/netinet6/udp6_usrreq.c | 6 +-
sys/rump/net/lib/libnetinet/Makefile.inc | 6 +-
usr.bin/netstat/Makefile | 4 +-
usr.bin/netstat/inet.c | 85 +-
usr.bin/netstat/inet6.c | 98 +-
usr.bin/netstat/main.c | 55 +-
usr.bin/netstat/netstat.h | 6 +-
usr.bin/netstat/vtw.c | 431 +++++
usr.bin/netstat/vtw.h | 8 +
28 files changed, 4200 insertions(+), 145 deletions(-)
diffs (truncated from 5565 to 300 lines):
diff -r ddd2e9439de6 -r c231e1897ffd distrib/sets/lists/comp/mi
--- a/distrib/sets/lists/comp/mi Tue May 03 17:44:30 2011 +0000
+++ b/distrib/sets/lists/comp/mi Tue May 03 18:28:44 2011 +0000
@@ -1,4 +1,4 @@
-# $NetBSD: mi,v 1.1619 2011/04/20 18:55:53 haad Exp $
+# $NetBSD: mi,v 1.1620 2011/05/03 18:28:44 dyoung Exp $
#
# Note: don't delete entries from here - mark them as "obsolete" instead.
#
@@ -1614,6 +1614,7 @@
./usr/include/netinet/tcp_seq.h comp-c-include
./usr/include/netinet/tcp_timer.h comp-c-include
./usr/include/netinet/tcp_var.h comp-c-include
+./usr/include/netinet/tcp_vtw.h comp-c-include
./usr/include/netinet/tcpip.h comp-c-include
./usr/include/netinet/udp.h comp-c-include
./usr/include/netinet/udp_var.h comp-c-include
diff -r ddd2e9439de6 -r c231e1897ffd sys/dist/pf/net/pf.c
--- a/sys/dist/pf/net/pf.c Tue May 03 17:44:30 2011 +0000
+++ b/sys/dist/pf/net/pf.c Tue May 03 18:28:44 2011 +0000
@@ -1,4 +1,4 @@
-/* $NetBSD: pf.c,v 1.64 2010/05/07 17:41:57 degroote Exp $ */
+/* $NetBSD: pf.c,v 1.65 2011/05/03 18:28:45 dyoung Exp $ */
/* $OpenBSD: pf.c,v 1.552.2.1 2007/11/27 16:37:57 henning Exp $ */
/*
@@ -37,7 +37,7 @@
*/
#include <sys/cdefs.h>
-__KERNEL_RCSID(0, "$NetBSD: pf.c,v 1.64 2010/05/07 17:41:57 degroote Exp $");
+__KERNEL_RCSID(0, "$NetBSD: pf.c,v 1.65 2011/05/03 18:28:45 dyoung Exp $");
#include "pflog.h"
@@ -2798,9 +2798,9 @@
#ifdef __NetBSD__
#define in_pcbhashlookup(tbl, saddr, sport, daddr, dport) \
- in_pcblookup_connect(tbl, saddr, sport, daddr, dport)
+ in_pcblookup_connect(tbl, saddr, sport, daddr, dport, NULL)
#define in6_pcbhashlookup(tbl, saddr, sport, daddr, dport) \
- in6_pcblookup_connect(tbl, saddr, sport, daddr, dport, 0)
+ in6_pcblookup_connect(tbl, saddr, sport, daddr, dport, 0, NULL)
#define in_pcblookup_listen(tbl, addr, port, zero) \
in_pcblookup_bind(tbl, addr, port)
#define in6_pcblookup_listen(tbl, addr, port, zero) \
diff -r ddd2e9439de6 -r c231e1897ffd sys/netinet/Makefile
--- a/sys/netinet/Makefile Tue May 03 17:44:30 2011 +0000
+++ b/sys/netinet/Makefile Tue May 03 18:28:44 2011 +0000
@@ -1,4 +1,4 @@
-# $NetBSD: Makefile,v 1.19 2007/10/05 03:28:13 dyoung Exp $
+# $NetBSD: Makefile,v 1.20 2011/05/03 18:28:45 dyoung Exp $
INCSDIR= /usr/include/netinet
@@ -8,7 +8,8 @@
in_var.h ip.h ip_carp.h ip6.h ip_ecn.h ip_encap.h \
ip_icmp.h ip_mroute.h ip_var.h pim.h pim_var.h \
tcp.h tcp_debug.h tcp_fsm.h tcp_seq.h tcp_timer.h tcp_var.h \
- tcpip.h udp.h udp_var.h
+ tcpip.h udp.h udp_var.h \
+ tcp_vtw.h
# ipfilter headers
# XXX shouldn't be here
diff -r ddd2e9439de6 -r c231e1897ffd sys/netinet/files.netinet
--- a/sys/netinet/files.netinet Tue May 03 17:44:30 2011 +0000
+++ b/sys/netinet/files.netinet Tue May 03 18:28:44 2011 +0000
@@ -1,4 +1,4 @@
-# $NetBSD: files.netinet,v 1.21 2010/07/13 22:16:10 rmind Exp $
+# $NetBSD: files.netinet,v 1.22 2011/05/03 18:28:45 dyoung Exp $
defflag opt_tcp_debug.h TCP_DEBUG
defparam opt_tcp_debug.h TCP_NDEBUG
@@ -40,5 +40,6 @@
file netinet/tcp_timer.c inet | inet6
file netinet/tcp_usrreq.c inet | inet6
file netinet/tcp_congctl.c inet | inet6
+file netinet/tcp_vtw.c inet | inet6
file netinet/udp_usrreq.c inet | inet6
diff -r ddd2e9439de6 -r c231e1897ffd sys/netinet/in_pcb.c
--- a/sys/netinet/in_pcb.c Tue May 03 17:44:30 2011 +0000
+++ b/sys/netinet/in_pcb.c Tue May 03 18:28:44 2011 +0000
@@ -1,4 +1,4 @@
-/* $NetBSD: in_pcb.c,v 1.137 2009/05/12 22:22:46 elad Exp $ */
+/* $NetBSD: in_pcb.c,v 1.138 2011/05/03 18:28:45 dyoung Exp $ */
/*
* Copyright (C) 1995, 1996, 1997, and 1998 WIDE Project.
@@ -30,10 +30,12 @@
*/
/*-
- * Copyright (c) 1998 The NetBSD Foundation, Inc.
+ * Copyright (c) 1998, 2011 The NetBSD Foundation, Inc.
* All rights reserved.
*
* This code is derived from software contributed to The NetBSD Foundation
+ * by Coyote Point Systems, Inc.
+ * This code is derived from software contributed to The NetBSD Foundation
* by Public Access Networks Corporation ("Panix"). It was developed under
* contract to Panix by Eric Haszlakiewicz and Thor Lancelot Simon.
*
@@ -91,7 +93,7 @@
*/
#include <sys/cdefs.h>
-__KERNEL_RCSID(0, "$NetBSD: in_pcb.c,v 1.137 2009/05/12 22:22:46 elad Exp $");
+__KERNEL_RCSID(0, "$NetBSD: in_pcb.c,v 1.138 2011/05/03 18:28:45 dyoung Exp $");
#include "opt_inet.h"
#include "opt_ipsec.h"
@@ -137,6 +139,8 @@
#include <netipsec/key.h>
#endif /* IPSEC */
+#include <netinet/tcp_vtw.h>
+
struct in_addr zeroin_addr;
#define INPCBHASH_PORT(table, lport) \
@@ -269,9 +273,12 @@
lport = *lastport - 1;
for (cnt = mymax - mymin + 1; cnt; cnt--, lport--) {
+ vestigial_inpcb_t vestigial;
+
if (lport < mymin || lport > mymax)
lport = mymax;
- if (!in_pcblookup_port(table, sin->sin_addr, htons(lport), 1)) {
+ if (!in_pcblookup_port(table, sin->sin_addr, htons(lport), 1,
+ &vestigial) && !vestigial.valid) {
/* We have a free port, check with the secmodel(s). */
sin->sin_port = lport;
error = kauth_authorize_network(cred,
@@ -347,6 +354,7 @@
return (error);
} else {
struct inpcb *t;
+ vestigial_inpcb_t vestige;
#ifdef INET6
struct in6pcb *t6;
struct in6_addr mapped;
@@ -373,14 +381,19 @@
mapped.s6_addr16[5] = 0xffff;
memcpy(&mapped.s6_addr32[3], &sin->sin_addr,
sizeof(mapped.s6_addr32[3]));
- t6 = in6_pcblookup_port(table, &mapped, sin->sin_port, wild);
+ t6 = in6_pcblookup_port(table, &mapped, sin->sin_port, wild, &vestige);
if (t6 && (reuseport & t6->in6p_socket->so_options) == 0)
return (EADDRINUSE);
+ if (!t6 && vestige.valid) {
+ if (!!reuseport != !!vestige.reuse_port) {
+ return EADDRINUSE;
+ }
+ }
#endif
/* XXX-kauth */
if (so->so_uidinfo->ui_uid && !IN_MULTICAST(sin->sin_addr.s_addr)) {
- t = in_pcblookup_port(table, sin->sin_addr, sin->sin_port, 1);
+ t = in_pcblookup_port(table, sin->sin_addr, sin->sin_port, 1, &vestige);
/*
* XXX: investigate ramifications of loosening this
* restriction so that as long as both ports have
@@ -393,10 +406,22 @@
&& (so->so_uidinfo->ui_uid != t->inp_socket->so_uidinfo->ui_uid)) {
return (EADDRINUSE);
}
+ if (!t && vestige.valid) {
+ if ((!in_nullhost(sin->sin_addr)
+ || !in_nullhost(vestige.laddr.v4)
+ || !vestige.reuse_port)
+ && so->so_uidinfo->ui_uid != vestige.uid) {
+ return EADDRINUSE;
+ }
+ }
}
- t = in_pcblookup_port(table, sin->sin_addr, sin->sin_port, wild);
+ t = in_pcblookup_port(table, sin->sin_addr, sin->sin_port, wild, &vestige);
if (t && (reuseport & t->inp_socket->so_options) == 0)
return (EADDRINUSE);
+ if (!t
+ && vestige.valid
+ && !(reuseport && vestige.reuse_port))
+ return EADDRINUSE;
inp->inp_lport = sin->sin_port;
in_pcbstate(inp, INP_BOUND);
@@ -464,6 +489,7 @@
struct in_ifaddr *ia = NULL;
struct sockaddr_in *ifaddr = NULL;
struct sockaddr_in *sin = mtod(nam, struct sockaddr_in *);
+ vestigial_inpcb_t vestige;
int error;
if (inp->inp_af != AF_INET)
@@ -524,7 +550,8 @@
}
if (in_pcblookup_connect(inp->inp_table, sin->sin_addr, sin->sin_port,
!in_nullhost(inp->inp_laddr) ? inp->inp_laddr : ifaddr->sin_addr,
- inp->inp_lport) != 0)
+ inp->inp_lport, &vestige) != 0
+ || vestige.valid)
return (EADDRINUSE);
if (in_nullhost(inp->inp_laddr)) {
if (inp->inp_lport == 0) {
@@ -794,7 +821,7 @@
struct inpcb *
in_pcblookup_port(struct inpcbtable *table, struct in_addr laddr,
- u_int lport_arg, int lookup_wildcard)
+ u_int lport_arg, int lookup_wildcard, vestigial_inpcb_t *vp)
{
struct inpcbhead *head;
struct inpcb_hdr *inph;
@@ -802,6 +829,9 @@
int matchwild = 3, wildcard;
u_int16_t lport = lport_arg;
+ if (vp)
+ vp->valid = 0;
+
head = INPCBHASH_PORT(table, lport);
LIST_FOREACH(inph, head, inph_lhash) {
inp = (struct inpcb *)inph;
@@ -833,6 +863,54 @@
break;
}
}
+ if (match && matchwild == 0)
+ return match;
+
+ if (vp && table->vestige) {
+ void *state = (*table->vestige->init_ports4)(laddr, lport_arg, lookup_wildcard);
+ vestigial_inpcb_t better;
+
+ while (table->vestige
+ && (*table->vestige->next_port4)(state, vp)) {
+
+ if (vp->lport != lport)
+ continue;
+ wildcard = 0;
+ if (!in_nullhost(vp->faddr.v4))
+ wildcard++;
+ if (in_nullhost(vp->laddr.v4)) {
+ if (!in_nullhost(laddr))
+ wildcard++;
+ } else {
+ if (in_nullhost(laddr))
+ wildcard++;
+ else {
+ if (!in_hosteq(vp->laddr.v4, laddr))
+ continue;
+ }
+ }
+ if (wildcard && !lookup_wildcard)
+ continue;
+ if (wildcard < matchwild) {
+ better = *vp;
+ match = (void*)&better;
+
+ matchwild = wildcard;
+ if (matchwild == 0)
+ break;
+ }
+ }
+
+ if (match) {
+ if (match != (void*)&better)
+ return match;
+ else {
+ *vp = better;
+ return 0;
+ }
+ }
+ }
+
return (match);
}
@@ -843,13 +921,17 @@
struct inpcb *
in_pcblookup_connect(struct inpcbtable *table,
struct in_addr faddr, u_int fport_arg,
- struct in_addr laddr, u_int lport_arg)
+ struct in_addr laddr, u_int lport_arg,
+ vestigial_inpcb_t *vp)
{
struct inpcbhead *head;
struct inpcb_hdr *inph;
struct inpcb *inp;
u_int16_t fport = fport_arg, lport = lport_arg;
+ if (vp)
+ vp->valid = 0;
+
head = INPCBHASH_CONNECT(table, faddr, fport, laddr, lport);