kern/46896: iSCSI initiator ccb_pool gets corrupted

To: kern-bug-people%netbsd.org@localhost,gnats-admin%netbsd.org@localhost,netbsd-bugs%netbsd.org@localhost
Subject: kern/46896: iSCSI initiator ccb_pool gets corrupted
From: mhitch%lightning.msu.montana.edu@localhost
Date: Mon, 3 Sep 2012 20:40:00 +0000 (UTC)

>Number:         46896
>Category:       kern
>Synopsis:       iSCSI initiator ccb_pool gets corrupted
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon Sep 03 20:40:00 +0000 2012
>Originator:     Michael L. Hitch
>Release:        NetBSD 6.0_RC1 as of 19-Aug-2012
>Organization:
        Montana State University
>Environment:
System: NetBSD net5.msu.montana.edu 6.0_RC1 NetBSD 6.0_RC1 (XEN3_DOM0) #43: Sun Sep 2 20:19:33 MDT 2012 mhitch%net8.msu.montana.edu@localhost:/home/mhitch/NetBSD-6/OBJ/amd64/home/mhitch/NetBSD-6/src/sys/arch/amd64/compile/XEN3_DOM0 amd64
Architecture: x86_64
Machine: amd64
>Description:
        After updating to 6.0_RC1, I started a XEN DOMU kernel using an iSCSI
        disk.  I'm fairly certain that I had been able to run this for some time
        previously (netbsd-6 tree as of 24-May).  Shortly after starting the
        DOMU kernel, the iSCSI initiator started reporting no ccbs:

        Aug 30 00:20:11 net5 /netbsd: S2C1: No CCB in run_xfer
        Aug 30 00:20:11 net5 /netbsd: sd1(iscsi0:0:0:0): adapter resource shortage
        Aug 30 00:20:12 net5 /netbsd: S2C1: No CCB in run_xfer
        Aug 30 00:20:12 net5 /netbsd: sd1(iscsi0:0:0:0): adapter resource shortage

        I'm running a 6.0_RC1 XEN3_DOM0 kernel (with the iSCSI initiator added
        to the kernel config, since Xen kernels won't load modules), and an i386
        XEN3 DOMU running cacti (lots and lots of disk updates).

        After writing a quick kernel groveler to extract information from the
        various iSCSI initiator tables, I found that indeed, the ccb_pool
        head for the session showed it was empty.  Dumping out the contents of
        all the ccbs seemed to indicate they were all free, just no longer on
        the free list.

        Session 0xffffa00002945000: id=2
        ccb_pool 0x0000000000000000:0xffffa0000294c588 ccb_throttled 0x0000000000000000
        ccb[ 0]  0xffffa00002945208 next 0xffffa0000294d3f8 status 0 disp 0 ITT 80000200
        ...
        ccb[55]  0xffffa0000294c378 next 0xffffa0000294c168 status 0 disp 0 ITT 49000237
        ccb[56]  0xffffa0000294c588 next 0x0000000000000000 status 0 disp 0 ITT 89000238
        ccb[57]  0xffffa0000294c798 next 0xffffa0000294c588 status 0 disp 0 ITT 87000239

        I was not able to see anything obvious in changes to sys/dev/iscsi
        source that might have caused this.  I then added the ccbs_waiting
        queue header, and noted that when this condition occurs, the tail entry
        of the header pointed to the ccb_pool - certainly not correct.

        This leads me to suspect that moving ccbs from ccbs_waiting to the
        free pool has some trouble.  From looking at the code, it looks to me
        like a ccb on the ccbs_waiting queue is passed to wake_ccb(), which
        removes it from the ccbs_waiting queue.  However, nothing appears to
        prevent something else from finding the same ccb on the ccbs_waiting
        queue and also calling wake_ccb().  The first caller wins, removing the
        ccb from ccbs_waiting and adding it to ccb_pool.  The second caller then
        tries to remove the same ccb from ccbs_waiting and add it to ccb_pool,
        with nasty results.  I'm now working on seeing if this is indeed the
        case (adding some debug code to check and print information if it
        detects this occurring).
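
        The suspected sequence can be modeled outside the kernel.  The
        following is only a sketch, assuming both queues are tail queues in
        the style of <sys/queue.h>; the struct layout, the list helpers and
        wake_ccb_unguarded() are hypothetical stand-ins, not the driver's
        code.  Under those assumptions, waking the same ccb twice leaves the
        pool head looking empty and the waiting queue's tail pointing into
        the pool head, which is similar to what the dump above shows.

```c
#include <stddef.h>

/* Simplified stand-in for a ccb: only the tail-queue linkage is modeled. */
struct ccb {
    struct ccb  *next;   /* next element, or NULL at the tail */
    struct ccb **prev;   /* address of the pointer that points at us */
};

struct ccb_list {
    struct ccb  *first;  /* first element, or NULL when empty */
    struct ccb **last;   /* address of the last element's next pointer */
};

static void list_init(struct ccb_list *l)
{
    l->first = NULL;
    l->last = &l->first;
}

static void list_insert_tail(struct ccb_list *l, struct ccb *c)
{
    c->next = NULL;
    c->prev = l->last;
    *l->last = c;
    l->last = &c->next;
}

/* Unlinks c, trusting (without checking) that c really is on list l. */
static void list_remove(struct ccb_list *l, struct ccb *c)
{
    if (c->next != NULL)
        c->next->prev = c->prev;
    else
        l->last = c->prev;
    *c->prev = c->next;
}

/* Hypothetical unguarded wake: move a ccb from the waiting queue to the pool. */
static void wake_ccb_unguarded(struct ccb_list *waiting,
                               struct ccb_list *pool, struct ccb *c)
{
    list_remove(waiting, c);
    list_insert_tail(pool, c);
}

/*
 * Returns 1 if waking the same ccb twice leaves the pool head empty
 * and the waiting queue's tail pointing at the pool head.
 */
static int double_wake_corrupts(void)
{
    struct ccb_list waiting, pool;
    struct ccb c;

    list_init(&waiting);
    list_init(&pool);
    list_insert_tail(&waiting, &c);

    wake_ccb_unguarded(&waiting, &pool, &c);  /* first caller: correct */
    wake_ccb_unguarded(&waiting, &pool, &c);  /* second caller: c is already on pool */

    return pool.first == NULL && waiting.last == &pool.first;
}
```

        The second removal runs against the wrong head: it stores the ccb's
        stale back pointer (which now refers to the pool head) into the
        waiting queue's tail, and then writes NULL through that same pointer,
        emptying the pool head.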

>How-To-Repeat:
        I suspect this problem is relatively rare, and needs something similar
        to my above described setup to get enough random activity with the iSCSI
        target to duplicate.
>Fix:
        If the problem is multiple processing of a ccb on the ccbs_waiting
        queue, try to prevent that from happening, or at least prevent it from
        clobbering the ccb_pool and ccbs_waiting queues.
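
        One possible shape for such a guard, as a sketch only: record in the
        ccb whether it is currently linked on the waiting queue, and make a
        second wake a no-op.  The on_waiting flag, enqueue_waiting() and
        wake_ccb_guarded() are hypothetical names, not part of the driver.
        In the kernel the flag test and the unlink would also have to happen
        under whatever lock or spl level protects the queues; this
        single-threaded model shows only the idempotence.

```c
#include <stddef.h>

struct ccb {
    struct ccb  *next;
    struct ccb **prev;
    int          on_waiting;  /* hypothetical: 1 while linked on the waiting queue */
};

struct ccb_list {
    struct ccb  *first;
    struct ccb **last;
};

static void list_init(struct ccb_list *l)
{
    l->first = NULL;
    l->last = &l->first;
}

static void list_insert_tail(struct ccb_list *l, struct ccb *c)
{
    c->next = NULL;
    c->prev = l->last;
    *l->last = c;
    l->last = &c->next;
}

static void list_remove(struct ccb_list *l, struct ccb *c)
{
    if (c->next != NULL)
        c->next->prev = c->prev;
    else
        l->last = c->prev;
    *c->prev = c->next;
}

/* Put a ccb on the waiting queue and remember that it is there. */
static void enqueue_waiting(struct ccb_list *waiting, struct ccb *c)
{
    list_insert_tail(waiting, c);
    c->on_waiting = 1;
}

/* Returns 1 if this call moved the ccb to the pool, 0 if it was already woken. */
static int wake_ccb_guarded(struct ccb_list *waiting,
                            struct ccb_list *pool, struct ccb *c)
{
    if (!c->on_waiting)
        return 0;             /* second caller: leave both queues untouched */
    c->on_waiting = 0;
    list_remove(waiting, c);
    list_insert_tail(pool, c);
    return 1;
}

/* Returns 1 if a double wake leaves both queues consistent. */
static int double_wake_is_harmless(void)
{
    struct ccb_list waiting, pool;
    struct ccb c;
    int first, second;

    list_init(&waiting);
    list_init(&pool);
    enqueue_waiting(&waiting, &c);

    first = wake_ccb_guarded(&waiting, &pool, &c);
    second = wake_ccb_guarded(&waiting, &pool, &c);

    return first == 1 && second == 0 &&
        pool.first == &c && pool.last == &c.next &&
        waiting.first == NULL && waiting.last == &waiting.first;
}
```

        Returning whether the call actually moved the ccb would also give
        debug code a cheap place to log a detected double wake.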
