NetBSD-Bugs archive
Re: kern/53441: nouveau panic in 8.0_RC2 amd64
The following reply was made to PR kern/53441; it has been noted by GNATS.
From: Greg Oster <oster%netbsd.org@localhost>
To: kern-bug-people%netbsd.org@localhost, gnats-admin%netbsd.org@localhost,
netbsd-bugs%netbsd.org@localhost, oster%netbsd.org@localhost
Cc: gnats-bugs%NetBSD.org@localhost
Subject: Re: kern/53441: nouveau panic in 8.0_RC2 amd64
Date: Fri, 3 Aug 2018 20:53:16 -0600
On Fri, 3 Aug 2018 23:45:01 +0000 (UTC)
Greg Oster <oster%netbsd.org@localhost> wrote:
> The following reply was made to PR kern/53441; it has been noted by
> GNATS.
>
> From: Greg Oster <oster%netbsd.org@localhost>
> To: gnats-bugs%NetBSD.org@localhost
> Cc:
> Subject: Re: kern/53441: nouveau panic in 8.0_RC2 amd64
> Date: Fri, 3 Aug 2018 17:40:41 -0600
>
> On Tue, 10 Jul 2018 16:15:00 +0000 (UTC)
> oster%netbsd.org@localhost wrote:
>
> > >Number: 53441
> > >Category: kern
> > >Synopsis: nouveau panic in 8.0_RC2 amd64
> > >Confidential: no
> > >Severity: critical
> > >Priority: high
> > >Responsible: kern-bug-people
> > >State: open
> > >Class: sw-bug
> > >Submitter-Id: net
> > >Arrival-Date: Tue Jul 10 16:15:00 +0000 2018
> > >Originator: Greg Oster
> > >Release: NetBSD 8.0_RC2
> > >Organization:
> > >Environment:
> > System: NetBSD thog 8.0_RC2 NetBSD 8.0_RC2 (THOG.gdb) #0: Fri Jun
> > 29 15:10:23 CST 2018
> > oster@thog:/u1/builds/build183/src/obj/amd64/u1/builds/build183/src/sys/arch/amd64/compile/THOG.gdb
> > amd64 Architecture: x86_64 Machine: amd64
> > >Description:
> >
> > The nouveau driver occasionally panics for no good reason. It can
> > panic when X11 is being used, and it can panic when no-one is on
> > the console.
> >
> > Panic looks like:
> >
> > uvm_fault(0xffffffff819b7d80, 0x0, 1) -> e
> > fatal page fault in supervisor mode
> > trap type 6 code 0 rip 0xffffffff8114d302 cs 0x8 rflags 0x10282 cr2
> > 0x70 ilevel 0x8 rsp 0xffff80013ce5bdd0 curlwp 0xfffffe843b5a0080
> > pid 0.16 lowest kstack 0xffff80013ce592c0
> > panic: trap
> > cpu2: Begin traceback...
> > vpanic() at netbsd:vpanic+0x219
> > vpanic() at netbsd:vpanic
> > trap() at netbsd:trap+0x2b9
> > --- trap (number 6) ---
> > nouveau_fence_update() at netbsd:nouveau_fence_update+0x10
> > nouveau_fence_done() at netbsd:nouveau_fence_done+0x29
> > nouveau_bo_fence_signalled() at netbsd:nouveau_bo_fence_signalled+0x18
> > ttm_bo_wait() at netbsd:ttm_bo_wait+0x90
> > ttm_bo_cleanup_refs_and_unlock() at netbsd:ttm_bo_cleanup_refs_and_unlock+0x66
> > ttm_bo_delayed_delete() at netbsd:ttm_bo_delayed_delete+0x175
> > ttm_bo_delayed_workqueue() at netbsd:ttm_bo_delayed_workqueue+0x2b
> > linux_worker() at netbsd:linux_worker+0xf9
> > workqueue_runlist() at netbsd:workqueue_runlist+0x59
> > workqueue_worker() at netbsd:workqueue_worker+0xb1
> > cpu2: End traceback...
> > uvm_fault(0xfffffe842f5fd5c0, 0x0, 2) -> e
> >
> > fatal page fault in supervisor mode
> > dumping to dev 0,1 (offset=8425399, size=4189705):
> > trap type 6 code 0x2 rip 0xffffffff80cb5d7b cs 0x8 rflags 0x10296
> > cr2 0x84 ilevel 0x8 rsp 0xffff800d1u4m2p4 b2b90 curlwp
> > 0xfffffe8403f36120 pid 885.2 lowest kstack 0xffff8001424b02c0
> > coretemp0: workqueue busy: updates stopped
> > coretemp1: workqueue busy: updates stopped
> > coretemp2: workqueue busy: updates stopped
> > coretemp3: workqueue busy: updates stopped
> >
> >
> >
> > >How-To-Repeat:
> >
> > Run the nouveau driver on NetBSD-8.0_RC2/amd64 using a NVIDIA
> > GeForce GT 420:
> > ...
> > pci1 at ppb0 bus 1
> > pci1: i/o space, memory space enabled, rd/line, wr/inv ok
> > nouveau0 at pci1 dev 0 function 0: vendor 10de product 0de2 (rev. 0xa1)
> > drm kern info: nouveau [ DEVICE][nouveau0] BOOT0 : 0x0c1100a1
> > drm kern info: nouveau [ DEVICE][nouveau0] Chipset: GF108 (NVC1)
> > drm kern info: nouveau [ DEVICE][nouveau0] Family : NVC0
> > drm kern info: nouveau [ VBIOS][nouveau0] checking PRAMIN for image...
> > drm kern info: nouveau [ VBIOS][nouveau0] ... appears to be valid
> > drm kern info: nouveau [ VBIOS][nouveau0] using image from PRAMIN
> > drm kern info: nouveau [ VBIOS][nouveau0] BIT signature found
> > drm kern info: nouveau [ VBIOS][nouveau0] version 70.08.1f.00.0c
> > nouveau0: interrupting at ioapic0 pin 16 (nouveau)
> > drm kern warning: nouveau W[ PFB][nouveau0][0x00000000][0xfffffe811d51b808] reclocking of this ram type unsupported
> > drm kern info: nouveau [ PFB][nouveau0] RAM type: DDR3
> > drm kern info: nouveau [ PFB][nouveau0] RAM size: 512 MiB
> > drm kern info: nouveau [ PFB][nouveau0] ZCOMP: 0 tags
> > drm kern info: nouveau [ VOLT][nouveau0] GPU voltage: 900000uv
> > drm kern info: nouveau [ PTHERM][nouveau0] FAN control: PWM
> > drm kern info: nouveau [ PTHERM][nouveau0] fan management: automatic
> > drm kern info: nouveau [ PTHERM][nouveau0] internal sensor: yes
> > drm kern info: nouveau [ CLK][nouveau0] 03: core 50 MHz memory 135 MHz
> > drm kern info: nouveau [ CLK][nouveau0] 07: core 405 MHz memory 324 MHz
> > drm kern info: nouveau [ CLK][nouveau0] 0f: core 700 MHz memory 800 MHz
> > drm kern info: nouveau [ CLK][nouveau0] --: core 405 MHz memory 324 MHz
> > Zone kernel: Available graphics memory: 5504634 kiB
> > Zone dma32: Available graphics memory: 2097152 kiB
> > drm kern info: nouveau [ DRM] VRAM: 512 MiB
> > drm kern info: nouveau [ DRM] GART: 1048576 MiB
> > drm kern info: nouveau [ DRM] TMDS table version 2.0
> > drm kern info: nouveau [ DRM] DCB version 4.0
> > drm kern info: nouveau [ DRM] DCB outp 00: 01800302 00020030
> > drm kern info: nouveau [ DRM] DCB outp 01: 02000300 00000000
> > drm kern info: nouveau [ DRM] DCB outp 02: 08811392 00020020
> > drm kern info: nouveau [ DRM] DCB outp 03: 04822310 00000000
> > drm kern info: nouveau [ DRM] DCB conn 00: 00001030
> > drm kern info: nouveau [ DRM] DCB conn 01: 00002161
> > drm kern info: nouveau [ DRM] DCB conn 02: 00000200
> > drm: Supports vblank timestamp caching Rev 2 (21.10.2013).
> > drm: Driver supports precise vblank timestamp query.
> > drm kern info: nouveau [ DRM] MM: using COPY0 for buffer copies
> > nouveaufb0 at nouveau0
> > nouveau0: info: registered panic notifier
> > nouveaufb0: framebuffer at 0xffff8001400b4000, size 1920x1200, depth 32, stride 7680
> > ...
> >
> >
> > and then wait for the boom. The panic may happen in hours or days.
> >
> >
> > >Fix:
> > Please. I have a kernel with full debug symbols and a couple of
> > crash dumps related to this if someone wants additional information
> > from them.
>
> Traceback from gdb kernel:
>
> (gdb) bt
> #0 cpu_reboot (howto=260, bootstr=0x0)
>     at /u1/builds/build185/src/sys/arch/amd64/amd64/machdep.c:710
> #1 0xffffffff80ceece2 in vpanic (fmt=0xffffffff81207070 "trap",
>     ap=0xffff80013ce5bbb8)
>     at /u1/builds/build185/src/sys/kern/subr_prf.c:342
> #2 0xffffffff80ceeaba in panic (fmt=0xffffffff81207070 "trap")
>     at /u1/builds/build185/src/sys/kern/subr_prf.c:258
> #3 0xffffffff80228bfd in trap (frame=0xffff80013ce5bce0)
>     at /u1/builds/build185/src/sys/arch/amd64/amd64/trap.c:336
> #4 0xffffffff8021f61f in alltraps ()
> #5 0xffffffff8114d577 in nouveau_fence_update (chan=0x0)
>     at /u1/builds/build185/src/sys/external/bsd/drm2/dist/drm/nouveau/nouveau_fence.c:132
> #6 0xffffffff8114d72d in nouveau_fence_done
>     (fence=0xfffffe834add5c48)
>     at /u1/builds/build185/src/sys/external/bsd/drm2/dist/drm/nouveau/nouveau_fence.c:171
> #7 0xffffffff811419f5 in nouveau_bo_fence_signalled
>     (sync_obj=0xfffffe834add5c48)
>     at /u1/builds/build185/src/sys/external/bsd/drm2/dist/drm/nouveau/nouveau_bo.c:1566
> #8 0xffffffff8119841a in ttm_bo_wait (bo=0xfffffe82f9fc0408,
>     lazy=false, interruptible=false, no_wait=true)
>     at /u1/builds/build185/src/sys/external/bsd/drm2/dist/drm/ttm/ttm_bo.c:1671
> #9 0xffffffff81195d15 in ttm_bo_cleanup_refs_and_unlock
>     (bo=0xfffffe82f9fc0408, interruptible=false, no_wait_gpu=true)
>     at /u1/builds/build185/src/sys/external/bsd/drm2/dist/drm/ttm/ttm_bo.c:516
> #10 0xffffffff81196108 in ttm_bo_delayed_delete
>     (bdev=0xfffffe811d500160, remove_all=false)
>     at /u1/builds/build185/src/sys/external/bsd/drm2/dist/drm/ttm/ttm_bo.c:621
> #11 0xffffffff811961da in ttm_bo_delayed_workqueue
>     (work=0xfffffe811d500520)
>     at /u1/builds/build185/src/sys/external/bsd/drm2/dist/drm/ttm/ttm_bo.c:650
> #12 0xffffffff80abf6a9 in linux_worker (wk=0xfffffe811d500520,
>     arg=0xfffffe843e620f80)
>     at /u1/builds/build185/src/sys/external/bsd/common/linux/linux_work.c:505
> #13 0xffffffff80cf85ef in workqueue_runlist (wq=0xfffffe843b5b7d00,
>     list=0xfffffe843b5b7d70)
>     at /u1/builds/build185/src/sys/kern/subr_workqueue.c:106
> #14 0xffffffff80cf86b2 in workqueue_worker (cookie=0xfffffe843b5b7d00)
>     at /u1/builds/build185/src/sys/kern/subr_workqueue.c:133
> #15 0xffffffff80208747 in lwp_trampoline ()
> #16 0x0000000000000000 in ?? ()
> (gdb) ...
> (gdb) list
> 166
> 167 bool
> 168 nouveau_fence_done(struct nouveau_fence *fence)
> 169 {
> 170 if (fence->channel)
> 171 nouveau_fence_update(fence->channel);
> 172 return !fence->channel;
> 173 }
> 174
> 175 static int
> (gdb) down
> #5 0xffffffff8114d577 in nouveau_fence_update (chan=0x0)
> at /u1/builds/build185/src/sys/external/bsd/drm2/dist/drm/nouveau/nouveau_fence.c:132
> 132 struct nouveau_fence_chan *fctx = chan->fence;
> (gdb) list
> 127 }
> 128
> 129 static void
> 130 nouveau_fence_update(struct nouveau_channel *chan)
> 131 {
> 132 struct nouveau_fence_chan *fctx = chan->fence;
> 133 struct nouveau_fence *fence, *fnext;
> 134
> 135 spin_lock(&fctx->lock);
> 136 list_for_each_entry_safe(fence, fnext,
> &fctx->pending, head) {
> (gdb) print chan
> $11 = (struct nouveau_channel *) 0x0
> (gdb)
>
> "huh?"
>
> We just checked fence->channel for non-zero before the call to
> nouveau_fence_update(), and now it's suddenly zero? Methinks there
> are some locking issues happening here if the rug is getting pulled
> out that fast! Also: are there other uses of fence->channel where it
> could suddenly change from something to 0 and cause issues?
>
> (the machine worked fine for 8 days before this panic...)
>
> Later...
>
> Greg Oster
>
Just fell over again, so that's twice now today. There seem to be (at
least) two different failure modes: one where I can get a kernel trace,
and one where it's a fast trip to reboot.

uvm_fault(0xffffffff819b7d80, 0x0, 1) -> e
fatal page fault in supervisor mode
trap type 6 code 0 rip 0xffffffff8114d577 cs 0x8 rflags 0x10282 cr2
0x70 ilevel 0x8 rsp 0xffff80013ce5bdd0 curlwp 0xfffffe843b5a0080 pid
0.16 lowest kstack 0xffff80013ce592c0
panic: trap
cpu1: Begin traceback...
vpanic() at netbsd:vpanic+0x219
vpanic() at netbsd:vpanic
trap() at netbsd:trap+0x2b9
--- trap (number 6) ---
nouveau_fence_update() at netbsd:nouveau_fence_update+0x10
nouveau_fence_done() at netbsd:nouveau_fence_done+0x29
nouveau_bo_fence_signalled() at netbsd:nouveau_bo_fence_signalled+0x18
ttm_bo_wait() at netbsd:ttm_bo_wait+0x90
ttm_bo_cleanup_refs_and_unlock() at netbsd:ttm_bo_cleanup_refs_and_unlock+0x66
ttm_bo_delayed_delete() at netbsd:ttm_bo_delayed_delete+0x175
ttm_bo_delayed_workqueue() at netbsd:ttm_bo_delayed_workqueue+0x2b
linux_worker() at netbsd:linux_worker+0xf9
workqueue_runlist() at netbsd:workqueue_runlist+0x59
workqueue_worker() at netbsd:workqueue_worker+0xb1
cpu1: End traceback...
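
For illustration only: the check-then-use pattern in nouveau_fence_done()
reads fence->channel twice, once for the test and once for the call, so
another thread clearing the pointer in between would produce exactly the
chan=0x0 seen in the gdb session quoted above. The following is a minimal
user-space sketch of reading the pointer once and then testing and using
that same value. It is not the driver code and not a proposed fix: struct
channel, struct fence, fence_update() and fence_done_snapshot() are
hypothetical stand-ins, and a single read only avoids passing NULL. It
does nothing to guarantee that the channel is still alive, which would
need proper locking or reference counting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver structures; not the real nouveau types. */
struct channel {
	int updates;			/* counts calls to fence_update() */
};

struct fence {
	struct channel *volatile channel;	/* may be cleared by another thread */
};

/* Stand-in for nouveau_fence_update(): it dereferences chan, so NULL is fatal. */
static void
fence_update(struct channel *chan)
{
	assert(chan != NULL);
	chan->updates++;
}

/*
 * Read fence->channel exactly once into a local, then test and use the
 * local. The value tested is the value passed on, so the callee can never
 * see a NULL that appeared between the check and the call.
 */
static bool
fence_done_snapshot(struct fence *fence)
{
	struct channel *chan = fence->channel;	/* single read */

	if (chan != NULL)
		fence_update(chan);
	return fence->channel == NULL;
}
```

With a fence whose channel is set, fence_done_snapshot() calls
fence_update() once and returns false; with the channel already cleared
it skips the call and returns true.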
Later...
Greg Oster