Subject: kern/10362: clock slips under heavy network/interrupt load
To: None <gnats-bugs@gnats.netbsd.org>
From: Hal Murray <murray@pa.dec.com>
List: netbsd-bugs
Date: 06/14/2000 01:23:11
>Number: 10362
>Category: kern
>Synopsis: clock slips under heavy network/interrupt load
>Confidential: no
>Severity: non-critical
>Priority: low
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Wed Jun 14 01:24:00 PDT 2000
>Closed-Date:
>Last-Modified:
>Originator: Hal Murray
>Release: 1.4Z
>Organization:
Systems Research Center, Compaq Computer Corporation
>Environment:
Alpha and rarely on i386
System: NetBSD mckinley 1.4Z NetBSD 1.4Z (MIATA) #4: Fri Jun 9 04:01:32 PDT 2000
murray@mckinley:/usr/src/sys/arch/alpha/compile/MIATA alpha
>Description:
Sorry this is so long/verbose but I don't have a simple smoking gun.
I'm pretty sure the problem is that the system is spending
more than one clock tick at interrupt level - thus missing
clock interrupts.
The initial symptoms will be something like this:
Jun 13 21:14:02 foraker ntpd[178]: time reset 9.962291 s
I see this when running nasty network tests on a system
with an otherwise rock solid clock.
Note that ntpd is not very good at processing this type of
"noise" on the clock.
A printf from interrupt level is very good at aggravating
this problem. There are quite a few of those in network
drivers, especially the no-buffer case. As an experiment,
I commented out a few of them but that didn't solve the problem.
I've seen this on i386 too, but just barely. It only lost
a few ticks during several minutes of testing. The clock
ticks there are much longer (10 ms vs 1 ms on the alpha), which
makes the problem harder to provoke: an interrupt path has to
run ten times longer before it costs a tick.
Another way to get a feel for this problem is to run a script
like this on the target machine.
#!/bin/sh
myname=`hostname`
ref="xxxx"              # fill in a stable reference machine here
while true
do
        # The next three lines are only interesting if the two local
        # "date" calls land close together in time.
        echo -n "$myname: "; date
        echo -n "$ref: "; rsh $ref date
        echo -n "$myname: "; date
        ntpq -pn
        if [ -x /usr/sbin/ntptime ]; then
                /usr/sbin/ntptime | grep frequency
        fi
        sleep 100
        echo; echo
done
Here is some sample printout:
mckinley: Tue Jun 6 20:23:14 PDT 2000
16.4.80.64: Tue Jun 6 20:23:14 PDT 2000
mckinley: Tue Jun 6 20:23:14 PDT 2000
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+204.123.2.21    204.123.2.5      2 u  280 1024  377   1.130   -0.158   0.094
*204.123.2.71    204.123.2.5      2 u  271 1024  377   2.606    0.327   0.126
+204.123.2.70    204.123.2.5      2 u  245 1024  377   1.913   -0.173   0.010
offset 29.000 us, frequency 161.851 ppm, interval 1 s,
Above is a typical sample after the system had been idle for a long
time. The "when" column is the number of seconds since the last
time the local ntp daemon checked with that server. The "poll" column
is how long to go before checking again. (It ranges from 64 to 1024
by factors of 2.)
Note that the "poll" column is maxed out at 1024 seconds. That indicates
that ntpd is happy with the local clock and believes it can coast
for a long time before checking again.
mckinley: Tue Jun 13 22:36:35 PDT 2000 <== mckinley's clock is a second
16.4.80.64: Tue Jun 13 22:36:36 PDT 2000 <== slower than 16.4.80.64
mckinley: Tue Jun 13 22:36:35 PDT 2000 <==
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+204.123.2.21    204.123.2.5      2 u   18   64  377   1.620  1034.67   9.797
+204.123.2.71    204.123.2.5      2 u   50   64  377   1.632  1045.78  12.030
*204.123.2.70    204.123.2.5      2 u   37   64  377   1.656  1033.64  10.104
offset 1407.000 us, frequency 353.973 ppm, interval 1 s,
^^^
ntpd has turned up the clock speed, trying to catch up.
(it was about 162 above.)
mckinley: Tue Jun 13 22:38:20 PDT 2000
16.4.80.64: Tue Jun 13 22:38:20 PDT 2000
mckinley: Tue Jun 13 22:38:20 PDT 2000
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 204.123.2.21    204.123.2.5      2 u   57   64    1   0.985  -36.575   0.000
 204.123.2.71    204.123.2.5      2 u   27   64    1   0.980  -48.408   0.000
 204.123.2.70    204.123.2.5      2 u   10   64    1   0.990  -53.891   0.000
offset 0.000 us, frequency 512.000 ppm, interval 1 s,
^^^
ntpd has panicked and is trying to catch up as rapidly as it can.
(512 is the limit on how far it will adjust the frequency knob.)
From the log file between the two samples above:
Jun 13 22:37:05 mckinley ntpd[177]: time reset 1.031576 s
I've seen it overshoot so far that the clock gets reset back.
I've seen huge jumps. I think 50 or 60 seconds was my previous record
but I just got this on a machine with an Alteon Gigabit Ethernet
card.
Jun 14 00:36:08 quartz ntpd[178]: time reset 737.971573 s
>How-To-Repeat:
Heavy network traffic, especially things that generate a
lot of printfs.
UDP blast-em tests that use up the whole buffer pool on the
receive end often provoke can't-get-buffer messages. I think
large UDP packets cause more trouble because partial packets
pile up on the reassembly queue when a fragment gets lost.
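In case a concrete starting point helps, here is the rough shape of
the blaster I mean. This is a from-memory sketch, not the actual
test program; the only interesting parts are the large datagram size
(to force fragmentation and reassembly) and the complete lack of
pacing.

/*
 * Sketch of a UDP blaster: send big datagrams as fast as possible
 * so the receiver has to fragment/reassemble and runs low on mbufs.
 * Usage: blast <host> <port>
 */
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netdb.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int
main(int argc, char **argv)
{
        struct sockaddr_in sin;
        struct hostent *hp;
        char buf[8192];         /* big enough to force IP fragmentation */
        int s;

        if (argc != 3) {
                fprintf(stderr, "usage: %s host port\n", argv[0]);
                exit(1);
        }
        if ((hp = gethostbyname(argv[1])) == NULL) {
                fprintf(stderr, "%s: unknown host\n", argv[1]);
                exit(1);
        }
        memset(&sin, 0, sizeof(sin));
        sin.sin_family = AF_INET;
        sin.sin_port = htons(atoi(argv[2]));
        memcpy(&sin.sin_addr, hp->h_addr, hp->h_length);

        if ((s = socket(AF_INET, SOCK_DGRAM, 0)) < 0) {
                perror("socket");
                exit(1);
        }
        memset(buf, 0, sizeof(buf));
        for (;;) {
                /* Send errors (ENOBUFS etc.) are deliberately ignored. */
                (void)sendto(s, buf, sizeof(buf), 0,
                    (struct sockaddr *)&sin, sizeof(sin));
        }
        /* NOTREACHED */
}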
For testing, it might be simplest to hack in a spin loop
(at spl-int) at some convenient place in the kernel.
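The sort of thing I mean, purely as a sketch (the function name and
the headers are approximate, and where you call it from is up to
you):

/*
 * Hypothetical test hack, not part of any driver: burn roughly "ms"
 * milliseconds at splhigh() so the clock interrupt can't get in,
 * imitating a driver that stays at interrupt level too long.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <machine/intr.h>

void
clockslip_spin(int ms)
{
        int s, i;

        s = splhigh();          /* no clock interrupts past this point */
        for (i = 0; i < ms; i++)
                DELAY(1000);    /* busy-wait about 1 ms per iteration */
        splx(s);                /* drop back to the old level */
}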
>Fix:
Using ratecheck on all printfs from interrupt level is obviously
a step in the right direction.
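For what it's worth, the shape of that change is roughly this; the
function name and the message are invented for illustration, but
ratecheck() itself is the existing interface I mean (see
ratecheck(9)):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/time.h>

/*
 * Illustration only, not a real driver: limit a no-buffer complaint
 * to once a second, no matter how often the interrupt handler hits
 * the condition.
 */
static struct timeval nobuf_lasttime;
static const struct timeval nobuf_interval = { 1, 0 };  /* 1 second */

void
example_nobuf_printf(const char *xname)
{

        if (ratecheck(&nobuf_lasttime, &nobuf_interval))
                printf("%s: no receive buffers\n", xname);
}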
I think a solid fix will need something like using the CPU cycle
counter, where the architecture provides one, to notice and credit
the ticks that were missed.
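To be concrete about the cycle counter idea, something shaped like
this is what I'm picturing. None of it is existing code:
read_cycle_counter() and cycles_per_tick are placeholders for
whatever the port provides (rpcc on the alpha, the TSC on newer
i386 CPUs), calibrated at boot.

#include <sys/types.h>

/*
 * Sketch only: figure out how many ticks really elapsed since the
 * clock interrupt last ran, so ticks swallowed by a long stay at
 * interrupt level can be credited instead of silently lost.
 */
extern u_int64_t read_cycle_counter(void);      /* hypothetical MD hook */
extern u_int64_t cycles_per_tick;               /* calibrated at boot */

static u_int64_t last_tick_cycles;

int
ticks_elapsed(void)
{
        u_int64_t now;
        int n;

        now = read_cycle_counter();
        n = 0;
        while (now - last_tick_cycles >= cycles_per_tick) {
                last_tick_cycles += cycles_per_tick;
                n++;
        }
        return n;       /* caller runs the per-tick work this many times */
}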
>Release-Note:
>Audit-Trail:
>Unformatted: