Subject: port-i386/25670: Occasional "unable to allocate scsipi_xfer" error messages
To: None <gnats-bugs@gnats.netbsd.org>
From: None <he@netbsd.org>
List: netbsd-bugs
Date: 05/22/2004 12:29:51
>Number: 25670
>Category: port-i386
>Synopsis: Occasional "unable to allocate scsipi_xfer" error messages
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: port-i386-maintainer
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Sat May 22 13:27:00 UTC 2004
>Closed-Date:
>Last-Modified:
>Originator: Havard Eidnes
>Release: NetBSD 2.0_BETA
>Organization:
>Environment:
System: NetBSD dolly.uninett.no 2.0_BETA NetBSD 2.0_BETA (DOLLY) #1: Fri May 21 18:31:09 CEST 2004 he@dolly.uninett.no:/usr/obj/sys/arch/i386/compile.i386/DOLLY i386
Architecture: i386
Machine: i386
>Description:
An IBM 1U server configured as shown below with a single
raidframe raid0 which concatenates sd0 and sd1 occasionally
emits error messages of this type:
May 22 00:43:52 dolly /netbsd: sd3(mpt0:0:1:0): unable to allocate scsipi_xfer
May 22 00:43:54 dolly /netbsd: sd3: not queued, error 12
May 22 00:43:54 dolly /netbsd: sd3(mpt0:0:1:0): unable to allocate scsipi_xfer
May 22 00:43:54 dolly /netbsd: sd3: not queued, error 12
May 22 01:03:09 dolly /netbsd: sd3(mpt0:0:1:0): unable to allocate scsipi_xfer
May 22 01:03:11 dolly /netbsd: sd3: not queued, error 12
This machine runs NetNews using innd with "cnfs" style article
and overview storage (in /newsdata):
dolly# df
Filesystem 1K-blocks Used Avail Capacity Mounted on
/dev/sd2a 79827 25168 50667 33% /
/dev/sd2f 2979262 114705 2715593 4% /var
/dev/sd2e 29913005 1678275 26739079 5% /usr
mfs:344 297955 4 283053 0% /tmp
kernfs 1 1 0 100% /kern
/dev/raid0a 351828176 40005772 294230992 11% /newsdata
/dev/sd3a 12386530 1473232 10293970 12% /news
/dev/sd3e 22609434 2 21478960 0% /spare
dolly#
I've also had this happen to either sd0 or sd1, and raidframe
then became quite upset, repeatedly logging
Mraid0: failed to create a dag. Too many component failures.
Eventually the machine paniced with a kernel page fault trap
(null pointer dereference?):
.
raid0: failed to create a dag. Too many component failures.
raid0: failed to create a dag. Too many component failures.
raid0: failed to create a dag. Too many component failures.
raid0: failed to create a dag. Too many component failures.
raid0: failed to create a dag. Too many component failures.
uvm_fault(0xd19ae8a4, 0, 0, 1) -> 0xe
kernel: page fault trap, code=0
Stopped in pid 19351.1 (tcsh) at netbsd:bgetvp+0x19: movl 0x3c(%ecx),%eax
db>
db> trace
bgetvp(0,c201665c,0,0,ce834984) at netbsd:bgetvp+0x19
getblk(0,c0,0,8000,0) at netbsd:getblk+0x20b
bread(0,c0,0,8000,ffffffff) at netbsd:bread+0x53
ffs_update(ce834a64,2,0,ce834a78,c064dd00) at netbsd:ffs_update+0x2db
VOP_UPDATE(ce189024,0,0,0,0) at netbsd:VOP_UPDATE+0x36
.
However, what I'm after is getting these apparent pool memory
allocation errors to stop happning, since they seem to be the
root cause of these problems. "vmstat -m" output and reading
of the source code indicates that these errors coincide with a
failure by the pool allocator to get a scsipi_xfer structure:
Memory resource pool statistics
Name Size Requests Fail Releases Pgreq Pgrel Npage Hiwat Minpg Maxpg Idle
...
scxspl 148 11813057 3 11813054 1755 1752 3 10 0 inf 2
...
and the code:
xs = pool_get(&scsipi_xfer_pool,
((flags & XS_CTL_NOSLEEP) != 0 ? PR_NOWAIT : PR_WAITOK));
if (xs == NULL) {
if (flags & XS_CTL_URGENT) {
if ((flags & XS_CTL_REQSENSE) == 0)
periph->periph_flags &= ~PERIPH_RECOVERY_ACTIVE;
} else
periph->periph_active--;
scsipi_printaddr(periph);
printf("unable to allocate %sscsipi_xfer\n",
(flags & XS_CTL_URGENT) ? "URGENT " : "");
}
splx(s);
I have "of course" no idea whether XS_CTL_NOSLEEP is set, but
it is probable; I've bumped NKMEMPAGES from 32K to 48K, and
the problem still occurs. Shouldn't the pool be extended if
there is no room in the currently allocated pages under all
circumstances?!?
Now, the big question is, what should be done to avoid having
this problem reoccur?
Boot messages from the machine:
NetBSD 2.0_BETA (DOLLY) #1: Fri May 21 18:31:09 CEST 2004
he@dolly.uninett.no:/usr/obj/sys/arch/i386/compile.i386/DOLLY
total memory = 1023 MB
avail memory = 998 MB
BIOS32 rev. 0 found at 0xfd7d1
mainbus0 (root)
cpu0 at mainbus0: (uniprocessor)
cpu0: Intel Xeon (686-class), 3056.68 MHz, id 0xf29
cpu0: features bfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR>
cpu0: features bfebfbff<PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX>
cpu0: features bfebfbff<FXSR,SSE,SSE2,SS,HTT,TM,SBF>
cpu0: I-cache 12K uOp cache 8-way, D-cache 8 KB 64b/line 4-way
cpu0: L2 cache 512 KB 64b/line 8-way
cpu0: ITLB 4K/4M: 64 entries
cpu0: DTLB 4K/4M: 64 entries
cpu0: 16 page colors
pci0 at mainbus0 bus 0: configuration mode 1
pci0: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
pchb0 at pci0 dev 0 function 0
pchb0: ServerWorks product 0x0014 (rev. 0x32)
pchb1 at pci0 dev 0 function 1
pchb1: ServerWorks product 0x0014 (rev. 0x00)
pci1 at pchb1 bus 255
pci1: no spaces enabled!
pchb2 at pci0 dev 0 function 2
pchb2: ServerWorks product 0x0014 (rev. 0x00)
pci2 at pchb2 bus 1
pci2: no spaces enabled!
vga1 at pci0 dev 1 function 0: ATI Technologies Rage XL (rev. 0x27)
wsdisplay0 at vga1 kbdmux 1
wsmux1: connecting to wsdisplay0
ahc0 at pci0 dev 2 function 0: Adaptec 29160 Ultra160 SCSI adapter
ahc0: interrupting at irq 9
ahc0: aic7892: Ultra160 Wide Channel A, SCSI Id=7, 32/253 SCBs
scsibus0 at ahc0: 16 targets, 8 luns per target
pchb3 at pci0 dev 15 function 0
pchb3: ServerWorks CSB5 SouthBridge (rev. 0x93)
pci3 at pchb3 bus 1
pci3: i/o space, memory space enabled
mpt0 at pci3 dev 1 function 0: LSI Logic 53c1030 Ultra320 SCSI
mpt0: interrupting at irq 9
scsibus1 at mpt0: 16 targets, 8 luns per target
rccide0 at pci0 dev 15 function 1
rccide0: ServerWorks CSB5 IDE Controller (rev. 0x93)
rccide0: bus-master DMA support present
rccide0: primary channel configured to compatibility mode
rccide0: primary channel interrupting at irq 14
atabus0 at rccide0 channel 0
rccide0: secondary channel configured to compatibility mode
rccide0: secondary channel interrupting at irq 15
atabus1 at rccide0 channel 1
ohci0 at pci0 dev 15 function 2: ServerWorks OSB4/CSB5 USB Host Controller (rev. 0x05)
ohci0: interrupting at irq 11
ohci0: OHCI version 1.0, legacy support
usb0 at ohci0: USB revision 1.0
uhub0 at usb0
uhub0: ServerWorks OHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 4 ports with 4 removable, self powered
pcib0 at pci0 dev 15 function 3
pcib0: ServerWorks product 0x0225 (rev. 0x00)
pchb4 at pci0 dev 17 function 0
pchb4: ServerWorks CIOB-X2 (rev. 0x05)
pci4 at pchb4 bus 1
pci4: i/o space, memory space enabled
mpt1 at pci4 dev 1 function 0: LSI Logic 53c1030 Ultra320 SCSI
mpt1: unable to map device registers
pchb5 at pci0 dev 17 function 2
pchb5: ServerWorks CIOB-X2 (rev. 0x05)
pci5 at pchb5 bus 2
pci5: i/o space, memory space enabled
bge0 at pci5 dev 1 function 0: Broadcom BCM5703X Gigabit Ethernet
bge0: interrupting at irq 3
bge0: ASIC BCM5703 A2 (0x1002), Ethernet address 00:0d:60:1c:76:12
brgphy0 at bge0 phy 1: BCM5703 1000BASE-T media interface, rev. 2
brgphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
bge1 at pci5 dev 2 function 0: Broadcom BCM5703X Gigabit Ethernet
bge1: interrupting at irq 5
bge1: ASIC BCM5703 A2 (0x1002), Ethernet address 00:0d:60:1c:76:13
brgphy1 at bge1 phy 1: BCM5703 1000BASE-T media interface, rev. 2
brgphy1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
isa0 at pcib0
com0 at isa0 port 0x3f8-0x3ff irq 4: ns16550a, working fifo
com0: console
pcppi0 at isa0 port 0x61
sysbeep0 at pcppi0
npx0 at isa0 port 0xf0-0xff: using exception 16
fdc0 at isa0 port 0x3f0-0x3f7 irq 6 drq 2
fd0 at fdc0 drive 0: 1.44MB, 80 cyl, 2 head, 18 sec
Kernelized RAIDframe activated
scsibus0: waiting 2 seconds for devices to settle...
scsibus1: waiting 2 seconds for devices to settle...
atapibus0 at atabus1: 2 targets
cd0 at atapibus0 drive 0: <LG CD-ROM CRN-8245B, , 1.16> cdrom removable
cd0: 32-bit data port
cd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 2 (Ultra/33)
cd0(rccide0:1:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 2 (Ultra/33) (using DMA data transfers)
sd0 at scsibus0 target 0 lun 0: <SEAGATE, ST1181677LWV, 0002> disk fixed
sd0: 169 GB, 24247 cyl, 24 head, 609 sec, 512 bytes/sect x 354600001 sectors
sd0: sync (12.50ns offset 63), 16-bit (160.000MB/s) transfers, tagged queueing
sd1 at scsibus0 target 1 lun 0: <SEAGATE, ST1181677LWV, 0002> disk fixed
sd1: 169 GB, 24247 cyl, 24 head, 609 sec, 512 bytes/sect x 354600001 sectors
sd1: sync (12.50ns offset 63), 16-bit (160.000MB/s) transfers, tagged queueing
sd2 at scsibus1 target 0 lun 0: <IBM-ESXS, MAS3367NC FN, B907> disk fixed
sd2: 34715 MB, 27150 cyl, 4 head, 654 sec, 512 bytes/sect x 71096640 sectors
sd2: sync (6.25ns offset 127), 16-bit (320.000MB/s) transfers, tagged queueing
sd3 at scsibus1 target 1 lun 0: <IBM-ESXS, MAS3367NC FN, B907> disk fixed
sd3: 34715 MB, 27150 cyl, 4 head, 654 sec, 512 bytes/sect x 71096640 sectors
sd3: sync (6.25ns offset 127), 16-bit (320.000MB/s) transfers, tagged queueing
ses0 at scsibus1 target 8 lun 0: <IBM, 25P3495a S320 1, 1> processor fixed
ses0: SAF-TE Compliant Device
ses0: async, 8-bit transfers
boot device: sd2
root on sd2a dumps on sd2b
root file system type: ffs
raid0: Component /dev/sd0a being configured at col: 0
Column: 0 Num Columns: 2
Version: 2 Serial Number: 1142 Mod Counter: -1193820954
Clean: Yes Status: 0
raid0: Component /dev/sd1a being configured at col: 1
Column: 1 Num Columns: 2
Version: 2 Serial Number: 1142 Mod Counter: -1193820954
Clean: Yes Status: 0
raid0: RAID Level 0
raid0: Components: /dev/sd0a /dev/sd1a
raid0: Total Sectors: 709197552 (346287 MB)
wsdisplay0: screen 1 added (80x25, vt100 emulation)
wsdisplay0: screen 2 added (80x25, vt100 emulation)
wsdisplay0: screen 3 added (80x25, vt100 emulation)
wsdisplay0: screen 4 added (80x25, vt100 emulation)
sd3(mpt0:0:1:0): unable to allocate scsipi_xfer
sd3: not queued, error 12
sd3(mpt0:0:1:0): unable to allocate scsipi_xfer
sd3: not queued, error 12
sd3(mpt0:0:1:0): unable to allocate scsipi_xfer
sd3: not queued, error 12
The kernel config is GENERIC with irrelevant parts trimmed
out, and with the following tweaks:
maxusers 64 # estimated number of users
options NKMEMPAGES=49152 # elevated from the default 32k
"vmstat -m" output:
dolly# vmstat -m
vmstat: Kmem statistics are not being gathered by the kernel.
Memory resource pool statistics
Name Size Requests Fail Releases Pgreq Pgrel Npage Hiwat Minpg Maxpg Idle
phpool 40 3470 0 2793 11 0 11 11 0 inf 0
pcgpool 140 5099 0 5095 1 0 1 1 0 inf 0
pmappl 76 16672 0 16614 13 11 2 2 0 inf 0
pdppl 4096 908 0 849 522 463 59 62 0 inf 0
vmsppl 220 16672 0 16614 23 19 4 4 0 inf 0
vmmpepl 88 13075545 0 13074194 2480 2444 36 52 0 inf 4
vmmpekpl 88 190757 0 190536 16 10 6 11 0 inf 0
uaoeltpl 84 338 0 320 1 0 1 1 0 inf 0
aobjpl 52 1 0 0 1 0 1 1 0 inf 0
amappl 40 179692 0 179109 6 0 6 6 0 inf 0
bufpl 120 14809868 0 14802335 1006 440 566 807 0 inf 0
buf1k 1024 57212 0 51134 627 493 134 359 1 inf 0
buf2k 2048 1441 0 1291 22 15 7 11 1 inf 0
buf4k 4096 468 0 441 14 9 5 7 1 inf 0
buf8k 8192 13826 0 12718 705 469 236 398 1 inf 0
buf16k 16384 374 0 306 99 81 18 18 1 inf 1
buf32k 32768 309 0 207 118 66 52 58 1 inf 1
mbpl 256 16584 0 16300 59 38 21 26 1 inf 0
mclpl 2048 6661 0 6400 1278 1143 135 169 4 512 2
kqueuepl 260 2 0 1 1 0 1 1 0 inf 0
knotepl 72 6 0 1 1 0 1 1 0 inf 0
sockpl 200 6530 0 6389 11 3 8 9 0 inf 0
ttypl 264 67 0 0 5 0 5 5 0 inf 0
rndsample 528 188408 0 188402 256 255 1 2 0 inf 0
procpl 408 16691 0 16620 14 6 8 8 0 inf 0
lwppl 132 16691 0 16620 3 0 3 3 0 inf 0
pgrppl 16 512 0 477 1 0 1 1 0 inf 0
pcredpl 24 16691 0 16620 1 0 1 1 0 inf 0
plimitpl 156 66 0 56 1 0 1 1 0 inf 0
pstatspl 224 16691 0 16620 8 4 4 5 0 inf 0
rusgepl 72 16620 0 16620 264 263 1 1 0 inf 1
ptimerpl 124 76 0 54 1 0 1 1 0 inf 0
filepl 56 526571 0 526258 47 42 5 6 0 inf 0
cwdipl 12 16678 0 16620 1 0 1 1 0 inf 0
fdescpl 280 16678 0 16620 17 12 5 5 0 inf 0
pnbufpl 1024 1533 0 1530 11 10 1 2 0 inf 0
vnodepl 164 87193 0 21759 2839 112 2727 2727 0 inf 0
ncachepl 84 599740 0 534299 1387 23 1364 1364 0 inf 0
sdpcpool 120 65205 0 65205 232 200 32 40 0 inf 32
pagedeppl 68 10918 0 10918 170 169 1 89 0 inf 1
inodedeppl 88 43361 0 43361 522 521 1 375 0 inf 1
newblkpl 36 70302 0 70302 207 206 1 1 0 inf 1
bmsafemappl 36 2495 0 2495 184 183 1 1 0 inf 1
allocdirectpl 80 33533 0 33533 164 163 1 33 0 inf 1
indirdeppl 32 438 0 438 136 135 1 1 0 inf 1
allocindirpl 64 36769 0 36769 343 163 180 180 0 inf 180
freefragpl 40 42 0 42 30 30 0 1 0 inf 0
freeblkspl 172 33102 0 33102 681 680 1 671 0 inf 1
freefilepl 36 21465 0 21465 205 204 1 118 0 inf 1
diraddpl 36 24636 0 24636 99 98 1 19 0 inf 1
mkdirpl 32 9910 0 9910 7 7 0 6 0 inf 0
dirrempl 36 24624 0 24624 233 232 1 148 0 inf 1
newdirblkpl 20 1 0 1 1 1 0 1 0 inf 0
ffsinopl 176 806188 0 740770 2975 130 2845 2845 0 inf 0
dino1pl 128 806188 0 740770 2184 73 2111 2111 0 inf 0
scxspl 148 12058071 3 12058071 1824 1815 9 10 0 inf 9
wdcspl 52 8 0 8 1 1 0 1 0 inf 0
extent 20 8180472 0 8179235 7 0 7 7 0 inf 0
raidpl 128 3102981 0 3102981 2 0 2 2 2 4 2
rtentpl 132 27 0 0 1 0 1 1 0 inf 0
in6pcbpl 156 40 0 26 1 0 1 1 0 inf 0
inpcbpl 108 4513 0 4497 1 0 1 1 0 inf 0
inmltpl 32 2 0 0 1 0 1 1 0 inf 0
ipqepl 44 207 0 207 100 100 0 1 0 inf 0
tcpcbpl 316 1463 0 1440 83 81 2 3 0 inf 0
synpl 184 721 0 721 170 170 0 1 0 inf 0
sigapl 2052 16678 0 16620 786 727 59 62 0 inf 1
pipepl 140 24098 0 24063 61 59 2 2 0 inf 0
rf_alloclist_pl 168 2909800 0 2909799 785 781 4 4 3 11 3
rf_asmhdr_pl 16 2723924 0 2723924 1 0 1 1 1 2 1
rf_asm_pl 288 2909799 0 2909799 948 942 6 6 5 14 6
rf_pda_pl 40 3102981 0 3102981 1 0 1 1 1 2 1
rf_vple_pl 8 30 0 0 1 0 1 1 1 2 0
rf_rad_pl 324 2723924 0 2723924 916 912 4 4 3 11 4
rf_dagnode_pl 444 11832378 0 11832378 2861 2843 18 20 15 57 18
rf_dagh_pl 76 2909799 0 2909799 1 0 1 1 1 3 1
rf_daglist_pl 236 2909799 0 2909799 649 646 3 3 2 8 3
rf_funclist_pl 8 2909799 0 2909799 1 0 1 1 1 2 1
rf_dqd_pl 116 3102981 0 3102981 872 869 3 3 2 8 3
In use 57190K, total allocated 70104K; utilization 81.6%
dolly#
>How-To-Repeat:
Run NetNews on a box running 2.0_BETA; watch the SCSI
subsystem complain when put under pressure.
>Fix:
Sorry, don't know.
>Release-Note:
>Audit-Trail:
>Unformatted: