NetBSD-Bugs archive
kern/58755: panic: ahci_cmd_kill_xfer: not supposed to be requeued
>Number: 58755
>Category: kern
>Synopsis: panic: ahci_cmd_kill_xfer: not supposed to be requeued
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Wed Oct 16 16:55:00 +0000 2024
>Originator: Taylor R Campbell
>Release: 10.0_BETA
>Organization:
>Environment:
NetBSD manticore.local 10.0_BETA NetBSD 10.0_BETA (GENERIC) #27: Mon Aug 28 11:34:52 UTC 2023 root@singbulli.local:/home/riastradh/netbsd/10/obj.amd64/sys/arch/amd64/compile/GENERIC amd64
>Description:
[ 2166162.642057] wd3: (uncorrectable data error)
[ 2166165.442122] wd3d: requeue reading fsbn 6947777000 of 6947777000-6947777127 (wd3 bn 6947777000; cn 6892635 tn 14 sn 38)
[ 2166165.452120] wd3d: error reading fsbn 6947777000 of 6947777000-6947777127 (wd3 bn 6947777000; cn 6892635 tn 14 sn 38)
[ 2166165.462121] cgd6d: error reading fsbn 6947774952 of 6947774952-6947775079 (cgd6 bn 6947774952; cn 3392468 tn 0 sn 488)
[ 2166169.302206] wd3: soft error (corrected) xfer 60
[ 2166175.792352] wd3d: requeue reading fsbn 6947776744 of 6947776744-6947776871 (wd3 bn 6947776744; cn 6892635 tn 10 sn 34), xfer 420, retry 4
[ 2166182.332499] wd3d: requeue reading fsbn 6947776744 of 6947776744-6947776871 (wd3 bn 6947776744; cn 6892635 tn 10 sn 34)
[ 2166182.348966] wd3d: error reading fsbn 6947776744 of 6947776744-6947776871 (wd3 bn 6947776744; cn 6892635 tn 10 sn 34)
[ 2166182.352776] cgd6d: error reading fsbn 6947774696 of 6947774696-6947774823 (cgd6 bn 6947774696; cn 3392468 tn 0 sn 232)
[ 2166215.763252] wd3d: device timeout reading fsbn 6946603768 of 6946603768-6946603895 (wd3 bn 6946603768; cn 6891471 tn 15 sn 55), xfer 4c0, retry 0
[ 2166215.776451] wd3d: device timeout reading fsbn 6940815200 of 6940815200-6940815327 (wd3 bn 6940815200; cn 6885729 tn 5 sn 53), xfer 240, retry 0
[ 2166215.789623] wd3d: device timeout reading fsbn 6940815328 of 6940815328-6940815455 (wd3 bn 6940815328; cn 6885729 tn 7 sn 55), xfer 2e0, retry 0
[ 2166216.813275] panic: ahci_cmd_kill_xfer: not supposed to be requeued
[ 2166216.824387] cpu0: Begin traceback...
[ 2166216.824387] vpanic() at netbsd:vpanic+0x183
[ 2166216.833503] panic() at netbsd:panic+0x3c
[ 2166216.833503] ahci_cmd_kill_xfer() at netbsd:ahci_cmd_kill_xfer+0xbb
[ 2166216.844434] ata_recovery_resume() at netbsd:ata_recovery_resume+0x11c
[ 2166216.854788] ata_thread_run() at netbsd:ata_thread_run+0x17f
[ 2166216.854788] atabus_thread() at netbsd:atabus_thread+0x236
[ 2166216.865778] cpu0: End traceback...
>How-To-Repeat:
Have a flaky disk (here, wd3 reporting uncorrectable data errors and device timeouts on reads).
>Fix:
Yes, please!
(Obviously this disk needs to be replaced, but NetBSD's recovery path shouldn't panic like this.)
Here's a possible explanation of the stack trace:
atabus_thread(...)
  (assume chp->ch_flags & ATACH_TH_RECOVERY)
  -> ata_thread_run(chp, AT_WAIT, ATACH_TH_RECOVERY, chp->recovery_tfd)
     https://nxr.netbsd.org/xref/src/sys/dev/ata/ata.c?r=1.169#504
     -> (*atac->atac_bustype_ata->ata_recovery)(chp, flags, tfd)
        https://nxr.netbsd.org/xref/src/sys/dev/ata/ata.c?r=1.169#1657
        = ahci_channel_recover(chp, flags, tfd)
        -> ata_recovery_resume(chp, drive, tfd, flags)
           https://nxr.netbsd.org/xref/src/sys/dev/ic/ahcisata_core.c?r=1.107#1752
           (assume ata_read_log_ext_ncq returns 0)
           -> xfer->ops->c_kill_xfer(chp, xfer, (error == 0) ? KILL_REQUEUE : KILL_RESET)
              https://nxr.netbsd.org/xref/src/sys/dev/ata/ata_recovery.c?r=1.4#244
              = ahci_cmd_kill_xfer(chp, xfer, KILL_REQUEUE)
              -> case KILL_REQUEUE:
                     panic("%s: not supposed to be requeued\n", __func__);
                 https://nxr.netbsd.org/xref/src/sys/dev/ic/ahcisata_core.c?r=1.107#1298
(Side note: callout_stop in ata_recovery_resume looks suspicious; it
should probably be callout_halt instead, with some appropriate locking,
since callout_stop does not wait for a handler that is already running.)
However, I'm not familiar enough with the ata(4) or ahci(4) data flow to know what went wrong with this path.
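
To make the suspected mismatch concrete, here is a minimal userland sketch of the two halves of the path traced above. It is not the NetBSD source: recovery_kill_reason and model_cmd_kill_xfer are hypothetical names, the enum is a cut-down stand-in that only carries the two reasons named in this report, and the panic is modelled as a -1 return so the sketch can run outside the kernel.

```c
/*
 * Simplified model, NOT the kernel code.  Only KILL_REQUEUE and
 * KILL_RESET are taken from the report; everything else is hypothetical.
 */
enum kill_reason { KILL_RESET, KILL_REQUEUE };

/*
 * Caller side (models the expression in ata_recovery_resume):
 * if reading the NCQ error log succeeded (error == 0), ask the
 * transfer's kill handler to requeue; otherwise ask for a reset.
 */
static enum kill_reason
recovery_kill_reason(int error)
{
	return (error == 0) ? KILL_REQUEUE : KILL_RESET;
}

/*
 * Callee side (models a command-type kill handler such as
 * ahci_cmd_kill_xfer): a reset is handled, but a requeue is treated
 * as "cannot happen".  Returns 0 when handled and -1 where the real
 * handler would panic.
 */
static int
model_cmd_kill_xfer(enum kill_reason reason)
{
	switch (reason) {
	case KILL_RESET:
		return 0;	/* handled: command marked as reset */
	case KILL_REQUEUE:
		return -1;	/* real code: panic("not supposed to be requeued") */
	}
	return -1;
}
```

In this model, model_cmd_kill_xfer(recovery_kill_reason(0)) returns -1: the caller requests a requeue exactly when the log read succeeds, and the callee rejects requeue unconditionally. Whether the right fix belongs in the caller (never pass KILL_REQUEUE to a command-type xfer) or in the callee (tolerate it) is the open question.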