NetBSD-Bugs archive
Re: kern/53441: nouveau panic in 8.0_RC2 amd64
The following reply was made to PR kern/53441; it has been noted by GNATS.
From: Greg Oster <oster%netbsd.org@localhost>
To: gnats-bugs%NetBSD.org@localhost
Cc:
Subject: Re: kern/53441: nouveau panic in 8.0_RC2 amd64
Date: Fri, 3 Aug 2018 17:40:41 -0600
On Tue, 10 Jul 2018 16:15:00 +0000 (UTC)
oster%netbsd.org@localhost wrote:
> >Number: 53441
> >Category: kern
> >Synopsis: nouveau panic in 8.0_RC2 amd64
> >Confidential: no
> >Severity: critical
> >Priority: high
> >Responsible: kern-bug-people
> >State: open
> >Class: sw-bug
> >Submitter-Id: net
> >Arrival-Date: Tue Jul 10 16:15:00 +0000 2018
> >Originator: Greg Oster
> >Release: NetBSD 8.0_RC2
> >Organization:
> >Environment:
> System: NetBSD thog 8.0_RC2 NetBSD 8.0_RC2 (THOG.gdb) #0: Fri Jun 29 15:10:23 CST 2018 oster@thog:/u1/builds/build183/src/obj/amd64/u1/builds/build183/src/sys/arch/amd64/compile/THOG.gdb amd64
> Architecture: x86_64
> Machine: amd64
> >Description:
>
> The nouveau driver occasionally panics for no good reason. It can
> panic when X11 is being used, and it can panic when no-one is on the
> console.
>
> Panic looks like:
>
> uvm_fault(0xffffffff819b7d80, 0x0, 1) -> e
> fatal page fault in supervisor mode
> trap type 6 code 0 rip 0xffffffff8114d302 cs 0x8 rflags 0x10282 cr2 0x70 ilevel 0x8 rsp 0xffff80013ce5bdd0
> curlwp 0xfffffe843b5a0080 pid 0.16 lowest kstack 0xffff80013ce592c0
> panic: trap
> cpu2: Begin traceback...
> vpanic() at netbsd:vpanic+0x219
> vpanic() at netbsd:vpanic
> trap() at netbsd:trap+0x2b9
> --- trap (number 6) ---
> nouveau_fence_update() at netbsd:nouveau_fence_update+0x10
> nouveau_fence_done() at netbsd:nouveau_fence_done+0x29
> nouveau_bo_fence_signalled() at netbsd:nouveau_bo_fence_signalled+0x18
> ttm_bo_wait() at netbsd:ttm_bo_wait+0x90
> ttm_bo_cleanup_refs_and_unlock() at netbsd:ttm_bo_cleanup_refs_and_unlock+0x66
> ttm_bo_delayed_delete() at netbsd:ttm_bo_delayed_delete+0x175
> ttm_bo_delayed_workqueue() at netbsd:ttm_bo_delayed_workqueue+0x2b
> linux_worker() at netbsd:linux_worker+0xf9
> workqueue_runlist() at netbsd:workqueue_runlist+0x59
> workqueue_worker() at netbsd:workqueue_worker+0xb1
> cpu2: End traceback...
> uvm_fault(0xfffffe842f5fd5c0, 0x0, 2) -> e
>
> fatal page fault in supervisor mode
> dumping to dev 0,1 (offset=8425399, size=4189705):
> trap type 6 code 0x2 rip 0xffffffff80cb5d7b cs 0x8 rflags 0x10296 cr2 0x84 ilevel 0x8 rsp 0xffff800d1u4m2p4 b2b90
> curlwp 0xfffffe8403f36120 pid 885.2 lowest kstack 0xffff8001424b02c0
> coretemp0: workqueue busy: updates stopped
> coretemp1: workqueue busy: updates stopped
> coretemp2: workqueue busy: updates stopped
> coretemp3: workqueue busy: updates stopped
>
>
>
> >How-To-Repeat:
>
> Run the nouveau driver on NetBSD-8.0_RC2/amd64 using an NVIDIA GeForce
> GT 420: ...
> pci1 at ppb0 bus 1
> pci1: i/o space, memory space enabled, rd/line, wr/inv ok
> nouveau0 at pci1 dev 0 function 0: vendor 10de product 0de2 (rev. 0xa1)
> drm kern info: nouveau [ DEVICE][nouveau0] BOOT0 : 0x0c1100a1
> drm kern info: nouveau [ DEVICE][nouveau0] Chipset: GF108 (NVC1)
> drm kern info: nouveau [ DEVICE][nouveau0] Family : NVC0
> drm kern info: nouveau [ VBIOS][nouveau0] checking PRAMIN for image...
> drm kern info: nouveau [ VBIOS][nouveau0] ... appears to be valid
> drm kern info: nouveau [ VBIOS][nouveau0] using image from PRAMIN
> drm kern info: nouveau [ VBIOS][nouveau0] BIT signature found
> drm kern info: nouveau [ VBIOS][nouveau0] version 70.08.1f.00.0c
> nouveau0: interrupting at ioapic0 pin 16 (nouveau)
> drm kern warning: nouveau W[ PFB][nouveau0][0x00000000][0xfffffe811d51b808] reclocking of this ram type unsupported
> drm kern info: nouveau [ PFB][nouveau0] RAM type: DDR3
> drm kern info: nouveau [ PFB][nouveau0] RAM size: 512 MiB
> drm kern info: nouveau [ PFB][nouveau0] ZCOMP: 0 tags
> drm kern info: nouveau [ VOLT][nouveau0] GPU voltage: 900000uv
> drm kern info: nouveau [ PTHERM][nouveau0] FAN control: PWM
> drm kern info: nouveau [ PTHERM][nouveau0] fan management: automatic
> drm kern info: nouveau [ PTHERM][nouveau0] internal sensor: yes
> drm kern info: nouveau [ CLK][nouveau0] 03: core 50 MHz memory 135 MHz
> drm kern info: nouveau [ CLK][nouveau0] 07: core 405 MHz memory 324 MHz
> drm kern info: nouveau [ CLK][nouveau0] 0f: core 700 MHz memory 800 MHz
> drm kern info: nouveau [ CLK][nouveau0] --: core 405 MHz memory 324 MHz
> Zone kernel: Available graphics memory: 5504634 kiB
> Zone dma32: Available graphics memory: 2097152 kiB
> drm kern info: nouveau [ DRM] VRAM: 512 MiB
> drm kern info: nouveau [ DRM] GART: 1048576 MiB
> drm kern info: nouveau [ DRM] TMDS table version 2.0
> drm kern info: nouveau [ DRM] DCB version 4.0
> drm kern info: nouveau [ DRM] DCB outp 00: 01800302 00020030
> drm kern info: nouveau [ DRM] DCB outp 01: 02000300 00000000
> drm kern info: nouveau [ DRM] DCB outp 02: 08811392 00020020
> drm kern info: nouveau [ DRM] DCB outp 03: 04822310 00000000
> drm kern info: nouveau [ DRM] DCB conn 00: 00001030
> drm kern info: nouveau [ DRM] DCB conn 01: 00002161
> drm kern info: nouveau [ DRM] DCB conn 02: 00000200
> drm: Supports vblank timestamp caching Rev 2 (21.10.2013).
> drm: Driver supports precise vblank timestamp query.
> drm kern info: nouveau [ DRM] MM: using COPY0 for buffer copies
> nouveaufb0 at nouveau0
> nouveau0: info: registered panic notifier
> nouveaufb0: framebuffer at 0xffff8001400b4000, size 1920x1200, depth 32, stride 7680
> ...
>
>
> and then wait for the boom. The panic may happen in hours or days.
>
>
> >Fix:
> Please. I have a kernel with full debug symbols and a couple of
> crash dumps related to this if someone wants additional information
> from them.
Traceback from gdb kernel:
(gdb) bt
#0 cpu_reboot (howto=260, bootstr=0x0)
    at /u1/builds/build185/src/sys/arch/amd64/amd64/machdep.c:710
#1 0xffffffff80ceece2 in vpanic (fmt=0xffffffff81207070 "trap",
    ap=0xffff80013ce5bbb8)
    at /u1/builds/build185/src/sys/kern/subr_prf.c:342
#2 0xffffffff80ceeaba in panic (fmt=0xffffffff81207070 "trap")
    at /u1/builds/build185/src/sys/kern/subr_prf.c:258
#3 0xffffffff80228bfd in trap (frame=0xffff80013ce5bce0)
    at /u1/builds/build185/src/sys/arch/amd64/amd64/trap.c:336
#4 0xffffffff8021f61f in alltraps ()
#5 0xffffffff8114d577 in nouveau_fence_update (chan=0x0)
    at /u1/builds/build185/src/sys/external/bsd/drm2/dist/drm/nouveau/nouveau_fence.c:132
#6 0xffffffff8114d72d in nouveau_fence_done (fence=0xfffffe834add5c48)
    at /u1/builds/build185/src/sys/external/bsd/drm2/dist/drm/nouveau/nouveau_fence.c:171
#7 0xffffffff811419f5 in nouveau_bo_fence_signalled
    (sync_obj=0xfffffe834add5c48)
    at /u1/builds/build185/src/sys/external/bsd/drm2/dist/drm/nouveau/nouveau_bo.c:1566
#8 0xffffffff8119841a in ttm_bo_wait (bo=0xfffffe82f9fc0408,
    lazy=false, interruptible=false, no_wait=true)
    at /u1/builds/build185/src/sys/external/bsd/drm2/dist/drm/ttm/ttm_bo.c:1671
#9 0xffffffff81195d15 in ttm_bo_cleanup_refs_and_unlock
    (bo=0xfffffe82f9fc0408, interruptible=false, no_wait_gpu=true)
    at /u1/builds/build185/src/sys/external/bsd/drm2/dist/drm/ttm/ttm_bo.c:516
#10 0xffffffff81196108 in ttm_bo_delayed_delete
    (bdev=0xfffffe811d500160, remove_all=false)
    at /u1/builds/build185/src/sys/external/bsd/drm2/dist/drm/ttm/ttm_bo.c:621
#11 0xffffffff811961da in ttm_bo_delayed_workqueue
    (work=0xfffffe811d500520)
    at /u1/builds/build185/src/sys/external/bsd/drm2/dist/drm/ttm/ttm_bo.c:650
#12 0xffffffff80abf6a9 in linux_worker (wk=0xfffffe811d500520,
    arg=0xfffffe843e620f80)
    at /u1/builds/build185/src/sys/external/bsd/common/linux/linux_work.c:505
#13 0xffffffff80cf85ef in workqueue_runlist (wq=0xfffffe843b5b7d00,
    list=0xfffffe843b5b7d70)
    at /u1/builds/build185/src/sys/kern/subr_workqueue.c:106
#14 0xffffffff80cf86b2 in workqueue_worker (cookie=0xfffffe843b5b7d00)
    at /u1/builds/build185/src/sys/kern/subr_workqueue.c:133
#15 0xffffffff80208747 in lwp_trampoline ()
#16 0x0000000000000000 in ?? ()
(gdb)
...
(gdb) list
166
167 bool
168 nouveau_fence_done(struct nouveau_fence *fence)
169 {
170 if (fence->channel)
171 nouveau_fence_update(fence->channel);
172 return !fence->channel;
173 }
174
175 static int
(gdb) down
#5 0xffffffff8114d577 in nouveau_fence_update (chan=0x0)
at /u1/builds/build185/src/sys/external/bsd/drm2/dist/drm/nouveau/nouveau_fence.c:132
132 struct nouveau_fence_chan *fctx = chan->fence;
(gdb) list
127 }
128
129 static void
130 nouveau_fence_update(struct nouveau_channel *chan)
131 {
132 struct nouveau_fence_chan *fctx = chan->fence;
133 struct nouveau_fence *fence, *fnext;
134
135 spin_lock(&fctx->lock);
136 list_for_each_entry_safe(fence, fnext, &fctx->pending, head) {
(gdb) print chan
$11 = (struct nouveau_channel *) 0x0
(gdb)
"huh?"
nouveau_fence_done() had just checked fence->channel for non-zero
before calling nouveau_fence_update(), and now the argument is
suddenly zero? The field is read twice, once for the test and once
for the call, so something else must have cleared it in between.
Methinks there are some locking issues happening here if the rug is
getting pulled out that fast! Also: are there other uses of
fence->channel where it could suddenly change from something to 0
and cause issues?
(the machine worked fine for 8 days before this panic...)
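For illustration only, here is a minimal userspace sketch of the pattern in question. It assumes the cause really is another thread clearing fence->channel between the test and the call; the dump shows chan=0x0 but does not show who cleared it. All sketch_* names are hypothetical stand-ins, not the nouveau structures or any proposed patch. The idea is to load the pointer once and then test and use that same local copy:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a channel; not the real nouveau_channel. */
struct sketch_channel {
	int updates;		/* counts calls to sketch_update() */
};

/* Hypothetical stand-in for a fence; not the real nouveau_fence. */
struct sketch_fence {
	/* Assumed to be set to NULL by another thread once signalled. */
	_Atomic(struct sketch_channel *) channel;
};

static void
sketch_update(struct sketch_channel *chan)
{
	chan->updates++;	/* would fault here if chan were NULL */
}

/*
 * Load fence->channel exactly once, then test and use the local copy,
 * so a concurrent store of NULL cannot slip in between the check and
 * the call.
 */
static bool
sketch_fence_done(struct sketch_fence *fence)
{
	struct sketch_channel *chan = atomic_load(&fence->channel);

	if (chan != NULL)
		sketch_update(chan);
	return atomic_load(&fence->channel) == NULL;
}
```

Note that this only removes the double read. It does nothing to keep the channel itself alive while it is in use, so a real fix would presumably still need proper locking or a reference on the channel.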
Later...
Greg Oster