Subject: bin/32519: ypbind spams syslog
To: None <gnats-admin@netbsd.org, netbsd-bugs@netbsd.org>
From: None <dholland@eecs.harvard.edu>
List: netbsd-bugs
Date: 01/14/2006 07:15:01
>Number: 32519
>Category: bin
>Synopsis: ypbind spams syslog if it loses connectivity
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: bin-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Sat Jan 14 07:15:00 +0000 2006
>Originator: David A. Holland <dholland@eecs.harvard.edu>
>Release: NetBSD 3.99.10 (-20051026) (ypbind.c 1.53)
>Organization:
Harvard EECS
>Environment:
System: NetBSD tanaqui 3.99.10 NetBSD 3.99.10 (TANAQUI) #3: Wed Oct 26 18:52:27 EDT 2005 root@tanaqui:/usr/src/sys/arch/i386/compile/TANAQUI i386
Architecture: i386
Machine: i386
>Description:
A couple of days ago we had network maintenance in our building
overnight, and so everything lost connectivity for a while. I
came back the next morning to find the system log stuffed with
messages like these:
Jan 11 07:30:26 tanaqui ypbind[307]: [server1]: Host name lookup failure
Jan 11 07:30:26 tanaqui ypbind[307]: [server2]: Host name lookup failure
Jan 11 07:30:26 tanaqui ypbind[307]: no contactable servers found in
/var/yp/binding/[domain].ypservers
Jan 11 07:30:26 tanaqui ypbind[307]: broadcast: sendto: Host is down
Jan 11 07:30:26 tanaqui ypbind[307]: [server1]: Host name lookup failure
Jan 11 07:30:26 tanaqui ypbind[307]: [server2]: Host name lookup failure
Jan 11 07:30:26 tanaqui ypbind[307]: no contactable servers found in
/var/yp/binding/[domain].ypservers
(Yes, more than one of these groups per second...)
The problem, or at least a problem, appears to be this code:
if (ypdb->dom_check_t >= now) {
/* don't flood it */
ypdb->dom_check_t = 0;
check++;
}
(lines 298-302 of ypbind.c 1.53)
Setting dom_check_t to 0 forces a check of the server at the
next call to checkwork(). This happens at least once a second,
but also after servicing rpc calls. So when the network goes
down and processes start contacting ypbind, it'll discover
right away that it's lost all the servers. However, once
things are dead every subsequent request it gets will trigger
another try at contacting the servers, possibly many times per
second. I think this is what I was seeing.
(And of course every time it tries to contact servers and
fails, it prints assorted messages to the system log.)
>How-To-Repeat:
Find a system using ypbind (that isn't also a yp server),
pull the network plug, and wait a while. If necessary, have
some cron jobs or the like start up while the network is down.
>Fix:
I think the right answer is to make the test
(ypdb->dom_check_t >= now && ypdb->dom_alive == 1)
as if everything's already been found to be unreachable
dom_alive will apparently be 2. However, I could easily be
missing something and I'd like someone else to think about it.
So I'm not including a patch. :-)
Even with this issue fixed ypbind will still probably spam the
system log once every five seconds (the traditional behavior)
and it would be nice to have some exponential backoff on that,
and/or have it suppress the error messages after the first
time. However, this is a separate problem and I'd be inclined
right now to leave it for later.
I do however suggest the following patch to fix a misleading
log message:
Index: ypbind.c
===================================================================
RCS file: /cvsroot/src/usr.sbin/ypbind/ypbind.c,v
retrieving revision 1.53
diff -u -r1.53 ypbind.c
--- ypbind.c 30 Oct 2004 15:57:43 -0000 1.53
+++ ypbind.c 14 Jan 2006 06:52:28 -0000
@@ -774,7 +774,7 @@
if (sendto(rpcsock, buf, outlen, 0, (struct sockaddr *)&bindsin,
sizeof bindsin) == -1)
- yp_log(LOG_WARNING, "broadcast: sendto: %m");
+ yp_log(LOG_WARNING, "nag_servers: sendto: %m");
}
switch (ypbindmode) {
(And I know, the real fix for these problems is to not run
ypbind. Unfortunately, that's not always an option.)