Subject: kern/10362: clock slips under heavy network/interrupt load
To: None <gnats-bugs@gnats.netbsd.org>
From: Hal Murray <murray@pa.dec.com>
List: netbsd-bugs
Date: 06/14/2000 01:23:11
>Number: 10362
>Category: kern
>Synopsis: clock slips under heavy network/interrupt load
>Confidential: no
>Severity: non-critical
>Priority: low
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Wed Jun 14 01:24:00 PDT 2000
>Closed-Date:
>Last-Modified:
>Originator: Hal Murray
>Release: 1.4Z
>Organization:
Systems Research Center, Compaq Computer Corporation
>Environment:
Alpha and rarely on i386
System: NetBSD mckinley 1.4Z NetBSD 1.4Z (MIATA) #4: Fri Jun 9 04:01:32 PDT 2000
murray@mckinley:/usr/src/sys/arch/alpha/compile/MIATA alpha
>Description:
Sorry this is so long/verbose but I don't have a simple smoking gun.
I'm pretty sure the problem is that the system is spending
more than one clock tick at interrupt level - thus missing
clock interrupts.
The initial symptoms will be something like this:
Jun 13 21:14:02 foraker ntpd[178]: time reset 9.962291 s
I see this when running nasty network tests on a system
with an otherwise rock solid clock.
Note that ntpd is not very good at processing this type of
"noise" on the clock.
A printf from interrupt level is very good at aggravating
this problem. There are quite a few of those in network
drivers, especially the no-buffer case. As an experiment,
I commented out a few of them but that didn't solve the problem.
I've seen this on i386 too, but just barely. It only lost
a few ticks during several minutes of testing. The clock
ticks there are much longer (10 ms vs 1 ms on the alpha), which
makes the problem harder to provoke: an interrupt path has to
run ten times longer before it costs a tick.
Another way to get a feel for this problem is to run a script
like this on the target machine.
#!/bin/sh
myname=`hostname`
ref="xxxx"              # fill in a stable reference machine here
while true
do
        # The next three lines are only interesting if the two local
        # "date" calls land close together in time.
        echo -n "$myname: "; date
        echo -n "$ref: "; rsh $ref date
        echo -n "$myname: "; date
        ntpq -pn
        if [ -x /usr/sbin/ntptime ]; then
                /usr/sbin/ntptime | grep frequency
        fi
        sleep 100
        echo; echo
done
Here is some sample printout:
mckinley: Tue Jun 6 20:23:14 PDT 2000
16.4.80.64: Tue Jun 6 20:23:14 PDT 2000
mckinley: Tue Jun 6 20:23:14 PDT 2000
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+204.123.2.21    204.123.2.5      2 u  280 1024  377   1.130   -0.158   0.094
*204.123.2.71    204.123.2.5      2 u  271 1024  377   2.606    0.327   0.126
+204.123.2.70    204.123.2.5      2 u  245 1024  377   1.913   -0.173   0.010
offset 29.000 us, frequency 161.851 ppm, interval 1 s,
Above is a typical sample after the system had been idle for a long
time. The "when" column is the number of seconds since the last
time the local ntp daemon checked with that server. The "poll" column
is how long to go before checking again. (It ranges from 64 to 1024
by factors of 2.)
Note that the "poll" column is maxed out at 1024 seconds. That indicates
that ntpd is happy with the local clock and believes it can coast
for a long time before checking again.
mckinley: Tue Jun 13 22:36:35 PDT 2000 <== mckinley's clock is a second
16.4.80.64: Tue Jun 13 22:36:36 PDT 2000 <== slower than 16.4.80.64
mckinley: Tue Jun 13 22:36:35 PDT 2000 <==
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+204.123.2.21    204.123.2.5      2 u   18   64  377   1.620  1034.67   9.797
+204.123.2.71    204.123.2.5      2 u   50   64  377   1.632  1045.78  12.030
*204.123.2.70    204.123.2.5      2 u   37   64  377   1.656  1033.64  10.104
offset 1407.000 us, frequency 353.973 ppm, interval 1 s,
^^^
ntpd has turned up the clock speed, trying to catch up.
(it was about 162 above.)
mckinley: Tue Jun 13 22:38:20 PDT 2000
16.4.80.64: Tue Jun 13 22:38:20 PDT 2000
mckinley: Tue Jun 13 22:38:20 PDT 2000
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 204.123.2.21    204.123.2.5      2 u   57   64    1   0.985  -36.575   0.000
 204.123.2.71    204.123.2.5      2 u   27   64    1   0.980  -48.408   0.000
 204.123.2.70    204.123.2.5      2 u   10   64    1   0.990  -53.891   0.000
offset 0.000 us, frequency 512.000 ppm, interval 1 s,
^^^
ntpd has panicked and is trying to catch up as rapidly as it can.
(512 is the limit on how far it will adjust the frequency knob.)
From the log file between the two samples above:
Jun 13 22:37:05 mckinley ntpd[177]: time reset 1.031576 s
I've seen it overshoot so far that the clock gets reset back.
I've seen huge jumps. I think 50 or 60 seconds was my previous record
but I just got this on a machine with an Alteon Gigabit Ethernet
card.
Jun 14 00:36:08 quartz ntpd[178]: time reset 737.971573 s
>How-To-Repeat:
Heavy network traffic, especially things that generate a
lot of printfs.
UDP blast-em tests that use up the whole buffer pool on the
receive end often provoke can't-get-buffer messages. I think
large UDP packets cause more trouble because partial packets
pile up on the reassembly queue when a fragment gets lost.
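In case a concrete starting point helps, here is the rough shape of
the blaster I mean. This is a from-memory sketch, not the actual
test program; the only interesting parts are the large datagram size
(to force fragmentation and reassembly) and the complete lack of
pacing.

/*
 * Sketch of a UDP blaster: send big datagrams as fast as possible
 * so the receiver has to fragment/reassemble and runs low on mbufs.
 * Usage: blast <host> <port>
 */
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netdb.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int
main(int argc, char **argv)
{
        struct sockaddr_in sin;
        struct hostent *hp;
        char buf[8192];         /* big enough to force IP fragmentation */
        int s;

        if (argc != 3) {
                fprintf(stderr, "usage: %s host port\n", argv[0]);
                exit(1);
        }
        if ((hp = gethostbyname(argv[1])) == NULL) {
                fprintf(stderr, "%s: unknown host\n", argv[1]);
                exit(1);
        }
        memset(&sin, 0, sizeof(sin));
        sin.sin_family = AF_INET;
        sin.sin_port = htons(atoi(argv[2]));
        memcpy(&sin.sin_addr, hp->h_addr, hp->h_length);

        if ((s = socket(AF_INET, SOCK_DGRAM, 0)) < 0) {
                perror("socket");
                exit(1);
        }
        memset(buf, 0, sizeof(buf));
        for (;;) {
                /* Send errors (ENOBUFS etc.) are deliberately ignored. */
                (void)sendto(s, buf, sizeof(buf), 0,
                    (struct sockaddr *)&sin, sizeof(sin));
        }
        /* NOTREACHED */
}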
For testing, it might be simplest to hack in a spin loop
(at spl-int) at some convenient place in the kernel.
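The sort of thing I mean, purely as a sketch (the function name and
the headers are approximate, and where you call it from is up to
you):

/*
 * Hypothetical test hack, not part of any driver: burn roughly "ms"
 * milliseconds at splhigh() so the clock interrupt can't get in,
 * imitating a driver that stays at interrupt level too long.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <machine/intr.h>

void
clockslip_spin(int ms)
{
        int s, i;

        s = splhigh();          /* no clock interrupts past this point */
        for (i = 0; i < ms; i++)
                DELAY(1000);    /* busy-wait about 1 ms per iteration */
        splx(s);                /* drop back to the old level */
}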
>Fix:
Using ratecheck on all printfs from interrupt level is obviously
a step in the right direction.
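For what it's worth, the shape of that change is roughly this; the
function name and the message are invented for illustration, but
ratecheck() itself is the existing interface I mean (see
ratecheck(9)):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/time.h>

/*
 * Illustration only, not a real driver: limit a no-buffer complaint
 * to once a second, no matter how often the interrupt handler hits
 * the condition.
 */
static struct timeval nobuf_lasttime;
static const struct timeval nobuf_interval = { 1, 0 };  /* 1 second */

void
example_nobuf_printf(const char *xname)
{

        if (ratecheck(&nobuf_lasttime, &nobuf_interval))
                printf("%s: no receive buffers\n", xname);
}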
I think a solid fix will need something like using the CPU cycle
counter, where the architecture provides one, to notice and credit
the ticks that were missed.
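To be concrete about the cycle counter idea, something shaped like
this is what I'm picturing. None of it is existing code:
read_cycle_counter() and cycles_per_tick are placeholders for
whatever the port provides (rpcc on the alpha, the TSC on newer
i386 CPUs), calibrated at boot.

#include <sys/types.h>

/*
 * Sketch only: figure out how many ticks really elapsed since the
 * clock interrupt last ran, so ticks swallowed by a long stay at
 * interrupt level can be credited instead of silently lost.
 */
extern u_int64_t read_cycle_counter(void);      /* hypothetical MD hook */
extern u_int64_t cycles_per_tick;               /* calibrated at boot */

static u_int64_t last_tick_cycles;

int
ticks_elapsed(void)
{
        u_int64_t now;
        int n;

        now = read_cycle_counter();
        n = 0;
        while (now - last_tick_cycles >= cycles_per_tick) {
                last_tick_cycles += cycles_per_tick;
                n++;
        }
        return n;       /* caller runs the per-tick work this many times */
}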
>Release-Note:
>Audit-Trail:
>Unformatted: