Subject: Re: port-dreamcast/34243
To: None <port-dreamcast-maintainer@netbsd.org, gnats-admin@netbsd.org,>
From: Valeriy E. Ushakov <uwe@ptc.spbu.ru>
List: netbsd-bugs
Date: 08/22/2006 06:05:04
The following reply was made to PR port-dreamcast/34243; it has been noted by GNATS.
From: "Valeriy E. Ushakov" <uwe@ptc.spbu.ru>
To: gnats-bugs@netbsd.org
Cc: Yasushi Oshima <oshima-ya@yagoto-urayama.jp>
Subject: Re: port-dreamcast/34243
Date: Tue, 22 Aug 2006 06:36:09 +0400
I can reproduce this problem on my usl-5p (running the uncommitted
landisk port, with Nonaka-san's patches integrated into current -current).
The machine boots all the way to the login prompt, but a login
attempt never succeeds. If I boot into single user mode and run
passwd(1), it just "loops", consuming all the CPU.
Running passwd under gdb (with a slightly tweaked ld.elf_so that does
the debugger handshake before calling .init sections):
# gdb -q passwd
(no debugging symbols found)...(gdb) run
Starting program: /usr/bin/passwd
(no debugging symbols found)...(no debugging symbols found)...
(no debugging symbols found)...(no debugging symbols found)...
(no debugging symbols found)...(no debugging symbols found)...
(no debugging symbols found)...(no debugging symbols found)...
^C(no debugging symbols found)...(no debugging symbols found)...
Program received signal SIGINT, Interrupt.
0x20690a9c in _init () from /usr/lib/libcom_err.so.4
(gdb) bt
#0 0x20690a9c in _init () from /usr/lib/libcom_err.so.4
#1 0x206907e0 in _init () from /usr/lib/libcom_err.so.4
#2 0x204234de in _rtld_call_init_functions (first=0x7fffdd50)
at /usr/src/libexec/ld.elf_so/rtld.c:147
#3 0x20423466 in _rtld_call_init_functions (first=0x20416e00)
at /usr/src/libexec/ld.elf_so/rtld.c:141
#4 0x20423466 in _rtld_call_init_functions (first=0x20416c00)
at /usr/src/libexec/ld.elf_so/rtld.c:141
#5 0x20423466 in _rtld_call_init_functions (first=0x20416a00)
at /usr/src/libexec/ld.elf_so/rtld.c:141
#6 0x20423466 in _rtld_call_init_functions (first=0x20416800)
at /usr/src/libexec/ld.elf_so/rtld.c:141
#7 0x20423466 in _rtld_call_init_functions (first=0x20416600)
at /usr/src/libexec/ld.elf_so/rtld.c:141
#8 0x20423466 in _rtld_call_init_functions (first=0x20416400)
at /usr/src/libexec/ld.elf_so/rtld.c:141
#9 0x2042427a in _rtld (sp=0x7fffddac, relocbase=541209686)
at /usr/src/libexec/ld.elf_so/rtld.c:477
(gdb) x/i $pc
0x20690a9c <_init+720>: mov.l @r4,r1
This is inside frame_dummy() called from libcom_err.so .init
Continuing passwd and interrupting it again later (it's still stuck)
ends up in exactly the same location.
Doing a stepi over this instruction and continuing gets passwd unstuck
and it prompts for a new password.
Setting a gdb breakpoint *anywhere* on that page makes passwd work
even if the breakpoint is never hit.
DDB confirms the picture:
# passwd
Stopped in pid 15.1 (passwd) at netbsd:cpu_Debugger+0x6: mov r14, r15
db> bt
cpu_Debugger() at netbsd:scifintr+0x64
scifintr() at netbsd:intc_intr+0x4a
intc_intr() at 0x8c000680
<EXPEVT 000; SSR=00000001> at 0x20690a9c
db> c
^Z[1] + Stopped passwd
# jobs -l
[1] + 15 Stopped passwd
# pmap 15
...
20690000 4K read/exec /usr/lib/libcom_err.so.4.1
20691000 60K /usr/lib/libcom_err.so.4.1
206A0000 8K read/write /usr/lib/libcom_err.so.4.1
...
Running a kernel with caches disabled doesn't change this failure
scenario (besides, there are no relocs in the vicinity of that
instruction).
I've tried replacing the kernel and ld.elf_so with the old ones from
the usl-5p cf image provided by Nonaka-san in /misc on ftp.n.o but
that doesn't change anything either.
But changing libcom_err.so to the old one (all the rest being
-current) makes passwd work.
I've tracked the important difference between two instances of
libcom_err.so to the following:
* OLD (works):
Program Headers:
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
LOAD 0x000000 0x00000000 0x00000000 0x010f8 0x010f8 R E 0x10000
LOAD 0x0010f8 0x000110f8 0x000110f8 0x00130 0x001d8 RW 0x10000
* NEW (doesn't work):
Program Headers:
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
LOAD 0x000000 0x00000000 0x00000000 0x00fbc 0x00fbc R E 0x10000
LOAD 0x000fbc 0x00010fbc 0x00010fbc 0x0012c 0x001d4 RW 0x10000
Note that in the new one the second segment starts on the first page
of the file (0x0fbc < 0x1000).
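The page arithmetic behind that observation can be checked with a
small sketch. It assumes 4K pages (consistent with the pmap output
above) and reuses the load base 0x20690000 from that pmap listing; the
helper names are made up for illustration, and the segment values are
copied from the two readelf listings.

```python
PAGE_SIZE = 0x1000  # assumption: 4K pages, as the pmap output suggests

# (offset, vaddr, filesz, memsz) of the two LOAD segments, copied from
# the program header listings above.
OLD = [(0x000000, 0x00000000, 0x010f8, 0x010f8),
       (0x0010f8, 0x000110f8, 0x00130, 0x001d8)]
NEW = [(0x000000, 0x00000000, 0x00fbc, 0x00fbc),
       (0x000fbc, 0x00010fbc, 0x0012c, 0x001d4)]


def first_file_page(segment):
    """Index of the file page that holds the first byte of the segment."""
    offset, _vaddr, _filesz, _memsz = segment
    return offset // PAGE_SIZE


def mapped_range(segment, base=0):
    """Page-aligned [start, end) address range the segment occupies."""
    _offset, vaddr, _filesz, memsz = segment
    start = (base + vaddr) & ~(PAGE_SIZE - 1)
    end = (base + vaddr + memsz + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1)
    return start, end


for name, segments in (("OLD", OLD), ("NEW", NEW)):
    print(name, "second segment starts on file page",
          first_file_page(segments[1]))
    for seg in segments:
        # The base is taken from the pmap listing of the NEW library and
        # is reused for OLD purely for illustration.
        start, end = mapped_range(seg, base=0x20690000)
        print("  %08x-%08x (%dK)" % (start, end, (end - start) // 1024))
```

For the NEW headers this yields a 4K mapping at 20690000 and an 8K
mapping at 206a0000, which agrees with the pmap listing.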
If I tweak the current libcom_err.so to include some dummy read-only
data to artificially inflate the size of the first segment to be larger
than 1 page, the resulting libcom_err.so does work.
I guess that the new binutils trigger this bug because they produce a
shorter .dynsym section (omitting some SECTION entries) and so make the
first loadable segment shorter than one page.
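To look for other shared objects with the same layout, something along
these lines might help. It is only a sketch: it assumes 32-bit
little-endian ELF files and 4K pages, reads the program headers
directly with the struct module, and the function names are invented
for this example.

```python
import struct
import sys

PAGE_SIZE = 0x1000  # assumption: 4K pages
PT_LOAD = 1         # program header type of a loadable segment


def load_segments(image):
    """Return (offset, vaddr, filesz, memsz) for every PT_LOAD header of
    a 32-bit little-endian ELF image passed in as bytes."""
    if image[:4] != b"\x7fELF":
        raise ValueError("not an ELF image")
    if image[4] != 1 or image[5] != 1:
        raise ValueError("only 32-bit little-endian ELF is handled here")
    # ELF32 header: e_phoff at byte 28, e_phentsize and e_phnum at 42.
    e_phoff, = struct.unpack_from("<I", image, 28)
    e_phentsize, e_phnum = struct.unpack_from("<HH", image, 42)
    segments = []
    for i in range(e_phnum):
        # First six words of an ELF32 program header.
        p_type, p_offset, p_vaddr, _p_paddr, p_filesz, p_memsz = \
            struct.unpack_from("<6I", image, e_phoff + i * e_phentsize)
        if p_type == PT_LOAD:
            segments.append((p_offset, p_vaddr, p_filesz, p_memsz))
    return segments


def second_segment_on_first_page(image):
    """True if the second LOAD segment starts within the first file page."""
    segments = load_segments(image)
    return len(segments) >= 2 and segments[1][0] < PAGE_SIZE


if __name__ == "__main__":
    for path in sys.argv[1:]:
        with open(path, "rb") as f:
            if second_segment_on_first_page(f.read()):
                print(path)
```

Run with a list of library paths as arguments, it prints those whose
second LOAD segment begins on the first page of the file.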
An interesting test would be to use a "working" system from before
the binutils upgrade and to replace libcom_err.so.4.1 with the one from
-current, and see if the bug is triggered. That should confirm
that the bug is an old bug in the kernel only made apparent by the
layout change triggered by the new binutils.
-uwe