tech-kern archive
Re: AMDGPU: Floating Point traps in Display Core code
> Date: Fri, 24 Feb 2023 23:21:35 -0800
> From: Jeff Frasca <thatguy%jeff-frasca.name@localhost>
>
> Ok, first off, the FP code I've run into is in the Display
> Core code, specifically in:
> sys/external/bsd/drm2/dist/drm/amd/display/dc/calcs/amdgpu_dcn_calcs.c
> It's all SIMD code operating on xmmN registers. To get to
> this codepath, I needed to have CONFIG_DRM_AMD_DC set during
> compilation. I've attached a diff that adds this to files.amdgpu.
>
> A typical backtrace printed out by ddb is:
> breakpoint()
> vpanic()
> panic()
> fputrap()
> Xtrap16()
> dcn10_create_resource_pool()
> [...]
>
> (I had to type it manually from a picture snapped on my
> phone, so, no offsets, if any of those are of interest,
> let me know.)
Probably not, except perhaps the one in dcn10_create_resource_pool to
confirm that it is where you think it is, in dcn_bw_update_from_pplib.
> There's a missing call from the backtrace that (I think)
> gets eaten by the trap jump: dcn_bw_update_from_pplib().
> (It's in amdgpu_dcn_calcs.c)
That's via dcn10_resource_construct, I assume? (which is a single-use
static that presumably gets compiled away)
> The actual trap number that's getting generated is 19
> rather than the 16 implied by the call to Xtrap16 (but
> I suspect y'all understand that quirk better than I do.)
The logic looks like this:
IDTVEC(trap16)
	ZTRAP_NJ(T_ARITHTRAP)
.Ldo_fputrap:
	...
	call	_C_LABEL(fputrap)
	jmp	.Lalltraps_checkusr
IDTVEC_END(trap16)

IDTVEC(trap19)
	ZTRAP_NJ(T_XMM)
	jmp	.Ldo_fputrap
IDTVEC_END(trap19)
So the return address of fputrap will always live in the Xtrap16
symbol, not the Xtrap19 one, even if it gets there by trap 19.
> dcn_bw_update_from_pplib() dutifully calls the macro
> DC_FP_START(), which I believe Taylor wired up to call
> fpu_kern_enter(), which seems like it should do the right
> thing. However, the x86 fpu_kern_enter() only appears to
> save registers and mask off the x87 FP trap flag in CR0.
>
> The instruction that's causing the trap in this case is
> the very first FP instruction in the function, and it's
> tripping the precision exception (MXCSR is set to 0x20
> when printed out in fputrap() by a debug printf I added
> in my local build; this is also where I'm getting the
> trap number 19 rather than 16).
This looks like a mistake on my part. It's possible that we never
noticed with the crypto code because it largely doesn't deal in
floating-point exceptions, and that's all that we've used the FP/SIMD
unit for in the kernel so far.
But we should have set the MXCSR (and FPSW/FPCW, if that matters) to a
reliable state. And we need to do that anyway for crypto on CPUs with
the MCDT bug (https://gnats.netbsd.org/57230).
Since we're definitely not in a position to handle floating-point
exception traps in the kernel, I just committed a change to set MXCSR
to 0x1fbf (all exception status bits set, denormals-are-zero disabled,
all exception trap mask bits set, round-to-nearest/ties-to-even,
flush-to-zero disabled).
Does that change anything?
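For reference, here is a minimal sketch of how the MXCSR values in this
thread decode. The bit positions follow the architectural MXCSR layout;
the macro and function names are made up for illustration and are not
the ones used in the NetBSD source.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural MXCSR layout; the names below are hypothetical. */
#define MXCSR_FLAGS      0x003fu /* bits 0-5: sticky exception status flags */
#define MXCSR_PE         0x0020u /* bit 5: precision (inexact) status flag */
#define MXCSR_DAZ        0x0040u /* bit 6: denormals-are-zero */
#define MXCSR_MASKS      0x1f80u /* bits 7-12: exception trap mask bits */
#define MXCSR_RC         0x6000u /* bits 13-14: rounding control, 0 = nearest */
#define MXCSR_FZ         0x8000u /* bit 15: flush-to-zero */
#define MXCSR_MASK_SHIFT 7       /* mask bit N+7 corresponds to flag bit N */

/*
 * Return the set of exceptions that are NOT masked, i.e. those that can
 * raise trap 19, expressed in status-flag bit positions (bits 0-5).
 */
static inline uint32_t
mxcsr_unmasked(uint32_t mxcsr)
{
	return ~(mxcsr >> MXCSR_MASK_SHIFT) & MXCSR_FLAGS;
}

/* True if every SIMD floating-point exception is masked. */
static inline bool
mxcsr_traps_disabled(uint32_t mxcsr)
{
	return mxcsr_unmasked(mxcsr) == 0;
}
```

With these definitions, mxcsr_traps_disabled(0x1fbf) is true, while
mxcsr_unmasked(0x1d00) is 0x05: that value still leaves the invalid
operation and divide-by-zero exceptions unmasked. An MXCSR of 0x20 has
only the precision status flag set and no mask bits at all, so every
exception, including precision, can trap.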
> Unless I add another function call to DC_FP_START() that
> masks all the non-fatal FP traps in MXCSR. I tried
> setting it to 0x00001d00 and 0x00009d40. The former just
> masks the non-fatal traps and the latter tries to set "do
> sane things with edge cases" flags. (If I try to mask
> MXCSR in fpu_kern_enter(), then some of the crypto code
> breaks.)
Can you be more specific about the crypto code breaking? Does it
still break with the change I just committed? (I verified all the
self-tests at boot run under qemu before committing, of course, but
it's possible I broke something on real hardware.)
https://mail-index.netbsd.org/source-changes/2023/02/25/msg143547.html