netbsd-bugs: Re: kern/34461: multiple problems; ioapic related?

Subject: Re: kern/34461: multiple problems; ioapic related?
To: None <gnats-bugs@NetBSD.org>
From: Bert Kiers <kiersb@xs4all.net>
List: netbsd-bugs
Date: 09/04/2006 16:05:17
On Mon, Sep 04, 2006 at 12:20:02PM +0000, Rui Paulo wrote:
> The following reply was made to PR kern/34461; it has been noted by GNATS.
> 
> From: Rui Paulo <rpaulo@fnop.net>
> To: gnats-bugs@NetBSD.org
> Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
> 	netbsd-bugs@netbsd.org
> Subject: Re: kern/34461: multiple problems; ioapic related?
> Date: Mon, 4 Sep 2006 13:16:53 +0100
> 
>  On Sep 4, 2006, at 1:35 AM, kiersb@xs4all.net wrote:
>  
>  >> Number:         34461
>  >> Category:       kern
>  >> Synopsis:       multiple problems; ioapic related?
>  >> Confidential:   no
>  >> Severity:       critical
>  >> Priority:       high
>  >> Responsible:    kern-bug-people
>  >> State:          open
>  >> Class:          sw-bug
>  >> Submitter-Id:   net
>  >> Arrival-Date:   Mon Sep 04 00:35:00 +0000 2006
>  >> Originator:     kiersb@xs4all.net
>  >> Release:        NetBSD 4.99.1
>  >> Organization:
>  > 	XS4All
>  >> Environment:
>  > 	
>  > 	
>  > System: NetBSD raidtest2 4.99.1 NetBSD 4.99.1 (GENERIC.MPACPI) #0:  
>  > Wed Aug 16 17:48:50 CEST 2006 kiers@kleurtjes:/disk2/obj/sys/arch/ 
>  > i386/compile/GENERIC.MPACPI i386
>  > Architecture: i386
>  > Machine: i386
>  >> Description:
>  > 	
>  > While rsyncing 1.3 TB from an ufs file system to an lfs file system  
>  > on same box, after around 200 MB:
>  >
>  > db{0}> bt
>  > Xintr_lapic_ipi() at netbsd:Xintr_lapic_ipi+0x7
>  > --- interrupt ---
>  > Bad frame pointer: 0xc3077318
>  > 0x30:
>  >
>  > then, after rebooting and issueing save rsync command, after 392 GB:
>  >
>  > sd5(isp0:0:21:0): unable to allocate scsipi_xfer
>  >
>  > and nothing special in kernel messages; after this sd5 message, the  
>  > computer is *very* slow,
>  >
>  > root@raidtest2:~# uptime
>  > 12:58AM  up  9:11, 2 users, load averages: 8.66, 8.59, 8.52
>  > took around two minutes. Even typing 'uptime' on the serial console  
>  > echoes around 1 char in two seconds.
>  
>  Can you monitor disk I/O with iostat at the same time you are doing a  
>  copy ?


before starting rsync, parity reconstruction is working on raid1:
(37% complete)

device  read KB/t    r/s   time     MB/s write KB/t    w/s   time     MB/s
fd0          0.00      0   0.00     0.00       0.00      0   0.00     0.00
ld0          0.00      0   0.00     0.00       0.00      0   0.00     0.00
md0          0.00      0   0.00     0.00       0.00      0   0.00     0.00
cd0          0.00      0   0.00     0.00       0.00      0   0.00     0.00
sd0          0.00      0   0.00     0.00       0.00      0   0.00     0.00
sd1          0.00      0   0.00     0.00       0.00      0   0.00     0.00
sd2          0.00      0   0.00     0.00       0.00      0   0.00     0.00
sd3          0.00      0   0.00     0.00       0.00      0   0.00     0.00
sd4          0.00      0   0.00     0.00       0.00      0   0.00     0.00
sd5          0.00      0   0.00     0.00       0.00      0   0.00     0.00
sd6          0.00      0   0.00     0.00       0.00      0   0.00     0.00
sd7         16.00    674   0.61    10.54       0.00      0   0.61     0.00
sd8         16.00    675   0.61    10.55       0.00      0   0.61     0.00
sd9         16.00    674   0.56    10.54       0.00      0   0.56     0.00
sd10        16.00    674   0.59    10.54       0.00      0   0.59     0.00
sd11        16.00    673   0.56    10.52       0.00      0   0.56     0.00
sd12        16.00    674   0.55    10.54       0.00      0   0.55     0.00
sd13         0.00      0   0.00     0.00       0.00      0   0.00     0.00
sd14         0.00      0   0.00     0.00       0.00      0   0.00     0.00
sd15         0.00      0   0.00     0.00       0.00      0   0.00     0.00
raid0        0.00      0   0.00     0.00       0.00      0   0.00     0.00
raid1        0.00      0   0.00     0.00       0.00      0   0.00     0.00
raid2        0.00      0   0.00     0.00       0.00      0   0.00     0.00

during root@raidtest2:/disk0/pub# rsync --size-only -r --progress /disk1/pub/* .:
(i/o to disks sd0..sd5 is very bursty)

device  read KB/t    r/s   time     MB/s write KB/t    w/s   time     MB/s
fd0          0.00      0   0.00     0.00       0.00      0   0.00     0.00
ld0          0.00      0   0.00     0.00       0.00      0   0.00     0.00
md0          0.00      0   0.00     0.00       0.00      0   0.00     0.00
cd0          0.00      0   0.00     0.00       0.00      0   0.00     0.00
sd0         10.69    339   0.85     3.53      12.25    479   0.85     5.73
sd1         12.37    259   0.75     3.13      13.78    422   0.75     5.68
sd2         10.80    340   0.88     3.58      12.27    476   0.88     5.71
sd3         13.20    249   0.71     3.20      14.31    407   0.71     5.69
sd4         10.69    374   0.88     3.91      11.94    491   0.88     5.72
sd5         13.20    241   0.63     3.10      14.35    408   0.63     5.72
sd6          0.00      0   0.00     0.00       0.00      0   0.00     0.00
sd7         16.00    593   0.61     9.27       0.00      0   0.61     0.00
sd8         16.00    593   0.58     9.27       0.00      0   0.58     0.00
sd9         16.00    593   0.57     9.27       0.00      0   0.57     0.00
sd10        16.00    593   0.61     9.27       0.00      0   0.61     0.00
sd11        16.00    591   0.61     9.24       0.00      0   0.61     0.00
sd12        16.00    592   0.59     9.25       0.00      0   0.59     0.00
sd13         0.00      0   0.00     0.00       0.00      0   0.00     0.00
sd14         0.00      0   0.00     0.00       0.00      0   0.00     0.00
sd15         0.00      0   0.00     0.00       0.00      0   0.00     0.00
raid0        0.00      0   1.01     0.00      50.19    489   1.01    23.97
raid1       64.00      4   0.01     0.25       0.00      0   0.01     0.00
raid2        0.00      0   0.00     0.00       0.00      0   0.00     0.00


now, after 20 GB done, system is very unresponsive, but nothing in messages or
dmesg:
(iostat -w 1, rsync and systat vmstat: no output, ssh to box time-out, but
serial console is 'only' very very slow)

device  read KB/t    r/s   time     MB/s write KB/t    w/s   time     MB/s
fd0          0.00      0   0.00     0.00       0.00      0   0.00     0.00
ld0          0.00      0   0.00     0.00       0.00      0   0.00     0.00
md0          0.00      0   0.00     0.00       0.00      0   0.00     0.00
cd0          0.00      0   0.00     0.00       0.00      0   0.00     0.00
sd0         11.37    277   0.74     3.08      12.57    374   0.74     4.59
sd1         11.88    257   0.65     2.99      13.07    362   0.65     4.63
sd2         11.80    258   0.67     2.98      13.03    365   0.67     4.65
sd3         11.24    271   0.81     2.98      12.58    377   0.81     4.63
sd4         11.82    247   0.69     2.85      13.14    362   0.69     4.65
sd5         11.74    256   0.70     2.94      12.98    362   0.70     4.59
sd6          0.00      0   0.00     0.00       0.00      0   0.00     0.00
sd7         16.00    530   0.50     8.28       0.00      0   0.50     0.00
sd8         16.00    530   0.57     8.28       0.00      0   0.57     0.00
sd9         16.00    530   0.57     8.28       0.00      0   0.57     0.00
sd10        16.00    530   0.56     8.28       0.00      0   0.56     0.00
sd11        16.00    530   0.51     8.28       0.00      0   0.51     0.00
sd12        16.00    530   0.51     8.28       0.00      0   0.51     0.00
sd13         0.00      0   0.00     0.00       0.00      0   0.00     0.00
sd14         0.00      0   0.00     0.00       0.00      0   0.00     0.00
sd15         0.00      0   0.00     0.00       0.00      0   0.00     0.00
raid0        0.00      0   1.00     0.00      36.31    535   1.00    18.96
raid1        0.00      0   0.00     0.00       0.00      0   0.00     0.00
raid2        0.00      0   0.00     0.00       0.00      0   0.00     0.00

systat vmstat at this moment:

    4 users    Load  9.34  9.10  8.04                  Mon Sep  4 15:39:01

Proc:r  d  s  w     Csw    Trp    Sys   Int   Sof    Flt      PAGING   SWAPPING
        2 10       6435     23    117 13250   119     11      in  out   in  out
                                                        ops
  20.5% Sy   0.0% Us   0.0% Ni   3.3% In  76.2% Id    pages
|    |    |    |    |    |    |    |    |    |    |
==========%%                                                              forks
                                                                          fkppw
           memory totals (in kB)           23237 Interrupts               fksvm
          real  virtual     free             100 cpu0 softclock           pwait
Active 1227868  1227868     1620              13 cpu0 softnet             relck
All    2052948  2052948   134672                 cpu0 softserial          rlkok
                                             100 cpu0 timer               noram
Namei         Sys-cache     Proc-cache           FPU flush IPI            ndcpy
    Calls     hits    %     hits     %           FPU synch IPI            fltcp
        8        8  100                     2146 TLB shootdown I          zfod
                                                 cpu1 softnet             cow
Disks: seeks xfers bytes %busy               100 cpu1 timer            64 fmin
   fd0                                           FPU flush IPI         85 ftarg
   ld0                                           FPU synch IPI     150021 itarg
   md0                                      2512 TLB shootdown I     1038 wired
   cd0                                           cpu2 softnet             pdfre
   sd0         685 8390K  76.9               100 cpu2 timer               pdscn
   sd1         669 8330K  72.6                   FPU flush IPI
   sd2         671 8349K  73.4                   FPU synch IPI
   sd3         679 8339K  73.8              2403 TLB shootdown I
   sd4         669 8296K  71.6                   cpu3 softnet
   sd5         669 8301K  72.2               100 cpu3 timer
   sd6                                           FPU flush IPI
   sd7         556 8896K  57.6                   FPU synch IPI
   sd8         556 8893K  58.0              2412 TLB shootdown I
   sd9         556 8893K  58.2                   ioapic0 pin 4
  sd10         556 8890K  58.6                   ioapic0 pin 6
  sd11         556 8893K  59.8             13238 ioapic2 pin 0
  sd12         556 8893K  61.4                13 ioapic2 pin 6
  sd13                                           ioapic1 pin 6
  sd14                                           ioapic0 pin 15
  sd15
 raid0         575   20M 100.0
 raid1
 raid2

root@raidtest2:~# date
Mon Sep  4 15:58:02 CEST 2006
root@raidtest2:~# 

took 1 minute, and see that vmstat is frozen for 20 minutes already.

root@raidtest2:~# netstat -anfinet|grep EST
tcp        0      0  194.109.0.22.22        194.109.0.97.63281     ESTABLISHED
tcp        0      0  194.109.0.22.22        194.109.0.97.63327     ESTABLISHED
tcp        0     80  194.109.0.22.22        194.109.0.97.63328     ESTABLISHED

on the console took 30 seconds, just to see the connections are still there and now
(16:00) suddenly the system starts going again; rsync, iostat and systat just continue

>  
>  >
>  > dmesg available on http://dia.zepam.nl/pr-20060904-1-dmesg.txt
>  >
>  > Not production system, so I can test stuff.
>  >> How-To-Repeat:
>  > 	
>  >> Fix:
>  > 	
>  >
>  >> Unformatted:
>  >  	
>  >  	
>  
>  
>  
>  	-- Rui Paulo
>  
>  
> 

-- 
Bert Kiers
XS4All UNIX systeembeheerder, lockpicker & techno anarchist