Subject: Re: kern/34461: multiple problems; ioapic related?
To: None <gnats-bugs@NetBSD.org>
From: Bert Kiers <kiersb@xs4all.net>
List: netbsd-bugs
Date: 09/04/2006 16:05:17
On Mon, Sep 04, 2006 at 12:20:02PM +0000, Rui Paulo wrote:
> The following reply was made to PR kern/34461; it has been noted by GNATS.
>
> From: Rui Paulo <rpaulo@fnop.net>
> To: gnats-bugs@NetBSD.org
> Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
> netbsd-bugs@netbsd.org
> Subject: Re: kern/34461: multiple problems; ioapic related?
> Date: Mon, 4 Sep 2006 13:16:53 +0100
>
> On Sep 4, 2006, at 1:35 AM, kiersb@xs4all.net wrote:
>
> >> Number: 34461
> >> Category: kern
> >> Synopsis: multiple problems; ioapic related?
> >> Confidential: no
> >> Severity: critical
> >> Priority: high
> >> Responsible: kern-bug-people
> >> State: open
> >> Class: sw-bug
> >> Submitter-Id: net
> >> Arrival-Date: Mon Sep 04 00:35:00 +0000 2006
> >> Originator: kiersb@xs4all.net
> >> Release: NetBSD 4.99.1
> >> Organization:
> > XS4All
> >> Environment:
> >
> >
> > System: NetBSD raidtest2 4.99.1 NetBSD 4.99.1 (GENERIC.MPACPI) #0:
> > Wed Aug 16 17:48:50 CEST 2006 kiers@kleurtjes:/disk2/obj/sys/arch/
> > i386/compile/GENERIC.MPACPI i386
> > Architecture: i386
> > Machine: i386
> >> Description:
> >
> > While rsyncing 1.3 TB from an ufs file system to an lfs file system
> > on same box, after around 200 MB:
> >
> > db{0}> bt
> > Xintr_lapic_ipi() at netbsd:Xintr_lapic_ipi+0x7
> > --- interrupt ---
> > Bad frame pointer: 0xc3077318
> > 0x30:
> >
> > then, after rebooting and issueing save rsync command, after 392 GB:
> >
> > sd5(isp0:0:21:0): unable to allocate scsipi_xfer
> >
> > and nothing special in kernel messages; after this sd5 message, the
> > computer is *very* slow,
> >
> > root@raidtest2:~# uptime
> > 12:58AM up 9:11, 2 users, load averages: 8.66, 8.59, 8.52
> > took around two minutes. Even typing 'uptime' on the serial console
> > echoes around 1 char in two seconds.
>
> Can you monitor disk I/O with iostat at the same time you are doing a
> copy ?
before starting rsync, parity reconstruction is working on raid1:
(37% complete)
device read KB/t r/s time MB/s write KB/t w/s time MB/s
fd0 0.00 0 0.00 0.00 0.00 0 0.00 0.00
ld0 0.00 0 0.00 0.00 0.00 0 0.00 0.00
md0 0.00 0 0.00 0.00 0.00 0 0.00 0.00
cd0 0.00 0 0.00 0.00 0.00 0 0.00 0.00
sd0 0.00 0 0.00 0.00 0.00 0 0.00 0.00
sd1 0.00 0 0.00 0.00 0.00 0 0.00 0.00
sd2 0.00 0 0.00 0.00 0.00 0 0.00 0.00
sd3 0.00 0 0.00 0.00 0.00 0 0.00 0.00
sd4 0.00 0 0.00 0.00 0.00 0 0.00 0.00
sd5 0.00 0 0.00 0.00 0.00 0 0.00 0.00
sd6 0.00 0 0.00 0.00 0.00 0 0.00 0.00
sd7 16.00 674 0.61 10.54 0.00 0 0.61 0.00
sd8 16.00 675 0.61 10.55 0.00 0 0.61 0.00
sd9 16.00 674 0.56 10.54 0.00 0 0.56 0.00
sd10 16.00 674 0.59 10.54 0.00 0 0.59 0.00
sd11 16.00 673 0.56 10.52 0.00 0 0.56 0.00
sd12 16.00 674 0.55 10.54 0.00 0 0.55 0.00
sd13 0.00 0 0.00 0.00 0.00 0 0.00 0.00
sd14 0.00 0 0.00 0.00 0.00 0 0.00 0.00
sd15 0.00 0 0.00 0.00 0.00 0 0.00 0.00
raid0 0.00 0 0.00 0.00 0.00 0 0.00 0.00
raid1 0.00 0 0.00 0.00 0.00 0 0.00 0.00
raid2 0.00 0 0.00 0.00 0.00 0 0.00 0.00
during root@raidtest2:/disk0/pub# rsync --size-only -r --progress /disk1/pub/* .:
(i/o to disks sd0..sd5 is very bursty)
device read KB/t r/s time MB/s write KB/t w/s time MB/s
fd0 0.00 0 0.00 0.00 0.00 0 0.00 0.00
ld0 0.00 0 0.00 0.00 0.00 0 0.00 0.00
md0 0.00 0 0.00 0.00 0.00 0 0.00 0.00
cd0 0.00 0 0.00 0.00 0.00 0 0.00 0.00
sd0 10.69 339 0.85 3.53 12.25 479 0.85 5.73
sd1 12.37 259 0.75 3.13 13.78 422 0.75 5.68
sd2 10.80 340 0.88 3.58 12.27 476 0.88 5.71
sd3 13.20 249 0.71 3.20 14.31 407 0.71 5.69
sd4 10.69 374 0.88 3.91 11.94 491 0.88 5.72
sd5 13.20 241 0.63 3.10 14.35 408 0.63 5.72
sd6 0.00 0 0.00 0.00 0.00 0 0.00 0.00
sd7 16.00 593 0.61 9.27 0.00 0 0.61 0.00
sd8 16.00 593 0.58 9.27 0.00 0 0.58 0.00
sd9 16.00 593 0.57 9.27 0.00 0 0.57 0.00
sd10 16.00 593 0.61 9.27 0.00 0 0.61 0.00
sd11 16.00 591 0.61 9.24 0.00 0 0.61 0.00
sd12 16.00 592 0.59 9.25 0.00 0 0.59 0.00
sd13 0.00 0 0.00 0.00 0.00 0 0.00 0.00
sd14 0.00 0 0.00 0.00 0.00 0 0.00 0.00
sd15 0.00 0 0.00 0.00 0.00 0 0.00 0.00
raid0 0.00 0 1.01 0.00 50.19 489 1.01 23.97
raid1 64.00 4 0.01 0.25 0.00 0 0.01 0.00
raid2 0.00 0 0.00 0.00 0.00 0 0.00 0.00
now, after 20 GB done, system is very unresponsive, but nothing in messages or
dmesg:
(iostat -w 1, rsync and systat vmstat: no output, ssh to box time-out, but
serial console is 'only' very very slow)
device read KB/t r/s time MB/s write KB/t w/s time MB/s
fd0 0.00 0 0.00 0.00 0.00 0 0.00 0.00
ld0 0.00 0 0.00 0.00 0.00 0 0.00 0.00
md0 0.00 0 0.00 0.00 0.00 0 0.00 0.00
cd0 0.00 0 0.00 0.00 0.00 0 0.00 0.00
sd0 11.37 277 0.74 3.08 12.57 374 0.74 4.59
sd1 11.88 257 0.65 2.99 13.07 362 0.65 4.63
sd2 11.80 258 0.67 2.98 13.03 365 0.67 4.65
sd3 11.24 271 0.81 2.98 12.58 377 0.81 4.63
sd4 11.82 247 0.69 2.85 13.14 362 0.69 4.65
sd5 11.74 256 0.70 2.94 12.98 362 0.70 4.59
sd6 0.00 0 0.00 0.00 0.00 0 0.00 0.00
sd7 16.00 530 0.50 8.28 0.00 0 0.50 0.00
sd8 16.00 530 0.57 8.28 0.00 0 0.57 0.00
sd9 16.00 530 0.57 8.28 0.00 0 0.57 0.00
sd10 16.00 530 0.56 8.28 0.00 0 0.56 0.00
sd11 16.00 530 0.51 8.28 0.00 0 0.51 0.00
sd12 16.00 530 0.51 8.28 0.00 0 0.51 0.00
sd13 0.00 0 0.00 0.00 0.00 0 0.00 0.00
sd14 0.00 0 0.00 0.00 0.00 0 0.00 0.00
sd15 0.00 0 0.00 0.00 0.00 0 0.00 0.00
raid0 0.00 0 1.00 0.00 36.31 535 1.00 18.96
raid1 0.00 0 0.00 0.00 0.00 0 0.00 0.00
raid2 0.00 0 0.00 0.00 0.00 0 0.00 0.00
systat vmstat at this moment:
4 users Load 9.34 9.10 8.04 Mon Sep 4 15:39:01
Proc:r d s w Csw Trp Sys Int Sof Flt PAGING SWAPPING
2 10 6435 23 117 13250 119 11 in out in out
ops
20.5% Sy 0.0% Us 0.0% Ni 3.3% In 76.2% Id pages
| | | | | | | | | | |
==========%% forks
fkppw
memory totals (in kB) 23237 Interrupts fksvm
real virtual free 100 cpu0 softclock pwait
Active 1227868 1227868 1620 13 cpu0 softnet relck
All 2052948 2052948 134672 cpu0 softserial rlkok
100 cpu0 timer noram
Namei Sys-cache Proc-cache FPU flush IPI ndcpy
Calls hits % hits % FPU synch IPI fltcp
8 8 100 2146 TLB shootdown I zfod
cpu1 softnet cow
Disks: seeks xfers bytes %busy 100 cpu1 timer 64 fmin
fd0 FPU flush IPI 85 ftarg
ld0 FPU synch IPI 150021 itarg
md0 2512 TLB shootdown I 1038 wired
cd0 cpu2 softnet pdfre
sd0 685 8390K 76.9 100 cpu2 timer pdscn
sd1 669 8330K 72.6 FPU flush IPI
sd2 671 8349K 73.4 FPU synch IPI
sd3 679 8339K 73.8 2403 TLB shootdown I
sd4 669 8296K 71.6 cpu3 softnet
sd5 669 8301K 72.2 100 cpu3 timer
sd6 FPU flush IPI
sd7 556 8896K 57.6 FPU synch IPI
sd8 556 8893K 58.0 2412 TLB shootdown I
sd9 556 8893K 58.2 ioapic0 pin 4
sd10 556 8890K 58.6 ioapic0 pin 6
sd11 556 8893K 59.8 13238 ioapic2 pin 0
sd12 556 8893K 61.4 13 ioapic2 pin 6
sd13 ioapic1 pin 6
sd14 ioapic0 pin 15
sd15
raid0 575 20M 100.0
raid1
raid2
root@raidtest2:~# date
Mon Sep 4 15:58:02 CEST 2006
root@raidtest2:~#
took 1 minute, and see that vmstat is frozen for 20 minutes already.
root@raidtest2:~# netstat -anfinet|grep EST
tcp 0 0 194.109.0.22.22 194.109.0.97.63281 ESTABLISHED
tcp 0 0 194.109.0.22.22 194.109.0.97.63327 ESTABLISHED
tcp 0 80 194.109.0.22.22 194.109.0.97.63328 ESTABLISHED
on the console took 30 seconds, just to see the connections are still there and now
(16:00) suddenly the system starts going again; rsync, iostat and systat just continue
>
> >
> > dmesg available on http://dia.zepam.nl/pr-20060904-1-dmesg.txt
> >
> > Not production system, so I can test stuff.
> >> How-To-Repeat:
> >
> >> Fix:
> >
> >
> >> Unformatted:
> >
> >
>
>
>
> -- Rui Paulo
>
>
>
--
Bert Kiers
XS4All UNIX systeembeheerder, lockpicker & techno anarchist