kern/58755: panic: ahci_cmd_kill_xfer: not supposed to be requeued

To: kern-bug-people%netbsd.org@localhost,gnats-admin%netbsd.org@localhost,netbsd-bugs%netbsd.org@localhost
Subject: kern/58755: panic: ahci_cmd_kill_xfer: not supposed to be requeued
From: campbell+netbsd%mumble.net@localhost
Date: Wed, 16 Oct 2024 16:55:00 +0000 (UTC)

>Number:         58755
>Category:       kern
>Synopsis:       panic: ahci_cmd_kill_xfer: not supposed to be requeued
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Oct 16 16:55:00 +0000 2024
>Originator:     Taylor R Campbell
>Release:        10.0_BETA
>Organization:
>Environment:
NetBSD manticore.local 10.0_BETA NetBSD 10.0_BETA (GENERIC) #27: Mon Aug 28 11:34:52 UTC 2023  root@singbulli.local:/home/riastradh/netbsd/10/obj.amd64/sys/arch/amd64/compile/GENERIC amd64
>Description:
[ 2166162.642057] wd3: (uncorrectable data error)
[ 2166165.442122] wd3d: requeue reading fsbn 6947777000 of 6947777000-6947777127 (wd3 bn 6947777000; cn 6892635 tn 14 sn 38)
[ 2166165.452120] wd3d: error reading fsbn 6947777000 of 6947777000-6947777127 (wd3 bn 6947777000; cn 6892635 tn 14 sn 38)
[ 2166165.462121] cgd6d: error reading fsbn 6947774952 of 6947774952-6947775079 (cgd6 bn 6947774952; cn 3392468 tn 0 sn 488)
[ 2166169.302206] wd3: soft error (corrected) xfer 60
[ 2166175.792352] wd3d: requeue reading fsbn 6947776744 of 6947776744-6947776871 (wd3 bn 6947776744; cn 6892635 tn 10 sn 34), xfer 420, retry 4
[ 2166182.332499] wd3d: requeue reading fsbn 6947776744 of 6947776744-6947776871 (wd3 bn 6947776744; cn 6892635 tn 10 sn 34)
[ 2166182.348966] wd3d: error reading fsbn 6947776744 of 6947776744-6947776871 (wd3 bn 6947776744; cn 6892635 tn 10 sn 34)
[ 2166182.352776] cgd6d: error reading fsbn 6947774696 of 6947774696-6947774823 (cgd6 bn 6947774696; cn 3392468 tn 0 sn 232)
[ 2166215.763252] wd3d: device timeout reading fsbn 6946603768 of 6946603768-6946603895 (wd3 bn 6946603768; cn 6891471 tn 15 sn 55), xfer 4c0, retry 0
[ 2166215.776451] wd3d: device timeout reading fsbn 6940815200 of 6940815200-6940815327 (wd3 bn 6940815200; cn 6885729 tn 5 sn 53), xfer 240, retry 0
[ 2166215.789623] wd3d: device timeout reading fsbn 6940815328 of 6940815328-6940815455 (wd3 bn 6940815328; cn 6885729 tn 7 sn 55), xfer 2e0, retry 0
[ 2166216.813275] panic: ahci_cmd_kill_xfer: not supposed to be requeued
[ 2166216.824387] cpu0: Begin traceback...
[ 2166216.824387] vpanic() at netbsd:vpanic+0x183
[ 2166216.833503] panic() at netbsd:panic+0x3c
[ 2166216.833503] ahci_cmd_kill_xfer() at netbsd:ahci_cmd_kill_xfer+0xbb
[ 2166216.844434] ata_recovery_resume() at netbsd:ata_recovery_resume+0x11c
[ 2166216.854788] ata_thread_run() at netbsd:ata_thread_run+0x17f
[ 2166216.854788] atabus_thread() at netbsd:atabus_thread+0x236
[ 2166216.865778] cpu0: End traceback...
>How-To-Repeat:
have a flaky disk


>Fix:
Yes, please!

(Obviously this disk needs to be replaced, but NetBSD's recovery path shouldn't panic like this.)

Here's a possible explanation of the stack trace:

atabus_thread(...)
   (assume chp->ch_flags & ATACH_TH_RECOVERY)
-> ata_thread_run(chp, AT_WAIT, ATACH_TH_RECOVERY, chp->recovery_tfd)
   https://nxr.netbsd.org/xref/src/sys/dev/ata/ata.c?r=1.169#504
-> (*atac->atac_bustype_ata->ata_recovery)(chp, flags, tfd)
   https://nxr.netbsd.org/xref/src/sys/dev/ata/ata.c?r=1.169#1657
 = ahci_channel_recover(chp, flags, tfd)
-> ata_recovery_resume(chp, drive, tfd, flags)
   https://nxr.netbsd.org/xref/src/sys/dev/ic/ahcisata_core.c?r=1.107#1752
   (assume ata_read_log_ext_ncq returns 0)
-> xfer->ops->c_kill_xfer(chp, xfer, (error == 0) ? KILL_REQUEUE : KILL_RESET)
   https://nxr.netbsd.org/xref/src/sys/dev/ata/ata_recovery.c?r=1.4#244
 = ahci_cmd_kill_xfer(chp, xfer, KILL_REQUEUE)
->	case KILL_REQUEUE:
		panic("%s: not supposed to be requeued\n", __func__);
   https://nxr.netbsd.org/xref/src/sys/dev/ic/ahcisata_core.c?r=1.107#1298
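
The trace above can be modelled in userland as a rough sketch. This is not the kernel code: the names model_xfer, xfer_kind, outcome, and the model_* functions are hypothetical stand-ins invented for illustration, and the behaviour of the block-I/O handler is an assumption (the "requeue reading" console messages suggest it accepts a requeue). Only the (error == 0) ? KILL_REQUEUE : KILL_RESET selection and the panic in the command handler's KILL_REQUEUE case are taken from the report.

```c
#include <assert.h>

/* Hypothetical stand-ins; these are NOT the real NetBSD definitions. */
enum kill_reason { KILL_REQUEUE, KILL_RESET };
enum xfer_kind { XFER_BIO, XFER_CMD };
enum outcome { OUT_REQUEUED, OUT_FAILED, OUT_PANIC };

struct model_xfer {
	enum xfer_kind kind;
};

/* From the report: the reason depends only on whether reading the log worked. */
static enum kill_reason
model_pick_reason(int error)
{
	return (error == 0) ? KILL_REQUEUE : KILL_RESET;
}

/* Command transfers: per the report, KILL_REQUEUE ends in panic(). */
static enum outcome
model_cmd_kill_xfer(enum kill_reason reason)
{
	switch (reason) {
	case KILL_REQUEUE:
		return OUT_PANIC;	/* real code: panic("... not supposed to be requeued") */
	case KILL_RESET:
		return OUT_FAILED;
	}
	return OUT_FAILED;
}

/* Block I/O transfers: ASSUMED to tolerate a requeue. */
static enum outcome
model_bio_kill_xfer(enum kill_reason reason)
{
	return (reason == KILL_REQUEUE) ? OUT_REQUEUED : OUT_FAILED;
}

/* The reason is chosen without looking at which kind of transfer is killed. */
static enum outcome
model_recovery_resume(const struct model_xfer *xfer, int error)
{
	enum kill_reason reason = model_pick_reason(error);

	if (xfer->kind == XFER_CMD)
		return model_cmd_kill_xfer(reason);
	return model_bio_kill_xfer(reason);
}
```

In this model, a command transfer that is outstanding when the log read succeeds (error == 0) always lands in the panic case, because nothing between the choice of reason and the dispatch checks the transfer kind. Whether the real recovery path is supposed to exclude such transfers earlier is exactly what I don't know.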

(Side note: callout_stop in ata_recovery_resume looks suspicious, since it
does not wait for a callout that is already running to finish. It should
probably be callout_halt instead, with some appropriate locking.)

However, I'm not familiar enough with the ata(4) or ahci(4) data flow to know what went wrong with this path.
