Port-xen archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

Re: Xen hangs (NetBSD 7.0.1, amd64, Xen 4.5.3)



On 09/11/16 21:16, Alaric Snell-Pym wrote:
> On 17/10/16 18:43, Manuel Bouyer wrote:
>> this means that they're all stuck on a lock.
>> The first thing would be to use ddb (or crash(8), if this works on Xen)
>> to find where in the kernel these processes are stuck.
>>
>> The next step would be to which which process is actaully holding the lock
>> (this one is probably not in tstile state)
>>
> 
> 
> Thanks, Manuel... looks like crash(8) works, I'll check that out next
> time it happens! Does the lock show as the "wait channel" as shown by
> ps/w? How do I find what process currently holds it? And then I guess I
> do a "trace/t" on the pids involved to see what's what? I'm a little
> green with ddb.
> 

It's happened again, during an xl create! I'm not sure how to interpret
these results, but here's what I was doing:

Pid 4406 is my hanging "xl create".

-+= 00001 root init
 |--= 00526 root xl create /domU/domUs/vm2/xl.cfg
 |--= 00892 root /usr/sbin/syslogd -s
 |-+= 01180 root xl create /domU/domUs/vm4/xl.cfg
 | \--- 02832 root xl create /domU/domUs/vm4/xl.cfg
 |--- 01380 root /usr/pkg/sbin/xenstored --pid-file /var/run/xenstored.pid
 |--- 01397 root /usr/pkg/sbin/xenconsoled
 |--= 01492 root /usr/sbin/powerd
 |--= 01557 root xl create /domU/domUs/vm3/xl.cfg
 |--= 01637 root /usr/sbin/ntpd -p /var/run/ntpd.pid
 |-+= 01673 root /usr/sbin/sshd
 | |-+= 00379 root sshd: alaric [priv]
 | | \-+- 00301 alaric sshd: alaric@pts/1 (sshd)
 | |   \-+= 00250 alaric -sh
 | |     \-+= 02415 root su -
 | |       \-+= 02493 root -sh
 | |         \--= 05496 root /usr/pkg/libexec/xen/bin/xenconsole --num 0
--type pv 3
 | |-+= 02815 root sshd: alaric [priv]
 | | \-+- 00423 alaric sshd: alaric@pts/0 (sshd)
 | |   \-+= 02096 alaric -sh
 | |     \-+= 00938 root su -
 | |       \-+= 00210 root -sh
 | |         \--= 04406 root xl create -c /domU/domUs/vm4/xl.cfg
 | \-+= 08541 root sshd: alaric [priv]
 |   \-+- 05855 alaric sshd: alaric@pts/10 (sshd)
 |     \-+= 06627 alaric -sh
 |       \-+= 05133 root su -
 |         \-+= 07688 root -sh
 |           \-+= 16294 root /usr/pkg/bin/proctree
 |             \-+- 09586 root sh -c ps -axwwo user,pid,ppid,pgid,command
 |               \--- 17305 root ps -axwwo user,pid,ppid,pgid,command
 |--= 01931 root /usr/sbin/cron
 |-+= 02060 root /usr/libexec/postfix/master -w
 | |--- 01779 postfix qmgr -l -t unix -u
 | \--- 07732 postfix pickup -l -t unix -u
 |--= 02089 root xl create /domU/domUs/vm5/xl.cfg
 |--= 02206 root /usr/sbin/inetd -l
 |--= 02475 root xl create /domU/domUs/vm8/xl.cfg
 |--= 02511 root xl create /domU/domUs/vm1/xl.cfg
 |--= 02630 root xl create /domU/domUs/vm6/xl.cfg
 |--- 03217 root xenstore-read /local/domain/0/backend/vbd/4/3/params
 |--- 04535 root xenstore-read /local/domain/0/backend/vbd/7/1/params
 |--- 05442 root xenstore-rm /local/domain/0/backend/vif/4/0
 |--= 05978 root xl create /domU/domUs/vm7/xl.cfg
 |--- 08280 root xenstore-rm /local/domain/0/backend/vif/7/1
 |--= 00366 root /usr/libexec/getty Pc console
 |--= 02673 root /usr/libexec/getty Pc ttyE1
 |--= 02617 root /usr/libexec/getty Pc ttyE2
 \--= 00259 root /usr/libexec/getty Pc ttyE3

Here's what crash has to say:

crash> ps/a
PID     COMMAND      STRUCT PROC *            UAREA *     VMSPACE/VM_MAP
16638     crash   ffffa00001feea70   ffffa0000e508000   ffffa00001ef88b8
7732     pickup   ffffa00001fee030   ffffa0000e516000   ffffa00001ef8748
7688         sh   ffffa00001560a58   ffffa0000e429000   ffffa00001c9d460
5133         su   ffffa000020bea78   ffffa0000e377000   ffffa00001ef8188
6627         sh   ffffa000020bed08   ffffa0000de60000   ffffa00002ae3d10
5855       sshd   ffffa00001e5acf8   ffffa0000e519000   ffffa00001ef82f8
8541       sshd   ffffa00001feed00   ffffa0000e4ad000   ffffa00001c9d5d0
4406         xl   ffffa00001fee550   ffffa0000e505000   ffffa00001c9d8b0
2832         xl   ffffa00001cff2b0   ffffa0000e41b000   ffffa00001ef85d8
5442 xenstore-r   ffffa00001fee2c0   ffffa0000dd44000   ffffa00001ef8468
3217 xenstore-r   ffffa000020be7e8   ffffa0000fff0000   ffffa00002ae35e0
5496 xenconsole   ffffa00001560018   ffffa0000e314000   ffffa00001c9dd00
5978         xl   ffffa00001cff7d0   ffffa0000e4be000   ffffa00001ef8018
8280 xenstore-r   ffffa000020be558   ffffa0000e4eb000   ffffa00000fd4cf8
4535 xenstore-r   ffffa000020be2c8   ffffa0000dfea000   ffffa00002ae3750
2475         xl   ffffa00001fee7e0   ffffa0000e4e0000   ffffa00001c9d180
2630         xl   ffffa00001e5a028   ffffa0000e487000   ffffa00001ef8a28
2089         xl   ffffa00001e5aa68   ffffa0000e45a000   ffffa00001ef8b98
1180         xl   ffffa00001e5a2b8   ffffa0000e422000   ffffa00001ef8d08
1557         xl   ffffa00001e5a7d8   ffffa0000e3ca000   ffffa00001ef8e78
526          xl   ffffa00001e5a548   ffffa0000e399000   ffffa00001c9d010
2511         xl   ffffa00001cff020   ffffa0000e317000   ffffa00001c9d2f0
2493         sh   ffffa00000f7e7b8   ffffa0000ddb0000   ffffa00000e455c0
2415         su   ffffa00000e40520   ffffa0000dd1d000   ffffa00000e45a10
250          sh   ffffa00000e407b0   ffffa0000dcd1000   ffffa00000e45b80
301        sshd   ffffa00000f7ea48   ffffa0000dd9d000   ffffa00000e45450
379        sshd   ffffa00001cff540   ffffa0000e30d000   ffffa00000e458a0
210          sh   ffffa00000e40290   ffffa0000dd94000   ffffa00001c9d740
938          su   ffffa000015602a8   ffffa0000e283000   ffffa00001c9db90
2096         sh   ffffa00001cffcf0   ffffa0000e300000   ffffa00000fd4e68
423        sshd   ffffa00001560538   ffffa0000e280000   ffffa00000fd4008
2815       sshd   ffffa00000f7e528   ffffa0000ddb7000   ffffa00000e45170
366       getty   ffffa00000e40a40   ffffa0000dcce000   ffffa00000e45cf0
259       getty   ffffa0000130cce0   ffffa0000ddd6000   ffffa00000fd45c8
2617      getty   ffffa00000f7ecd8   ffffa0000dd9a000   ffffa00000e452e0
2673      getty   ffffa00000e40000   ffffa0000dd97000   ffffa00000e45730
1931       cron   ffffa00001cffa60   ffffa0000e303000   ffffa00001c9da20
2206      inetd   ffffa00000f7e008   ffffa0000ddd9000   ffffa00000fd4178
1779       qmgr   ffffa0000130c7c0   ffffa0000dfcc000   ffffa00000fd42e8
2060     master   ffffa000015607c8   ffffa0000dfed000   ffffa00001c9de70
1673       sshd   ffffa00001560ce8   ffffa0000dfe7000   ffffa00000fd4a18
1492     powerd   ffffa0000130c010   ffffa0000dfe4000   ffffa00000fd4458
1637       ntpd   ffffa0000130c530   ffffa0000dfc9000   ffffa00000fd48a8
1397 xenconsole   ffffa0000130c2a0   ffffa0000dd1a000   ffffa00000fd4738
1380  xenstored   ffffa00000f7e298   ffffa0000ddba000   ffffa00000e45000
892     syslogd   ffffa0000130ca50   ffffa0000ddd3000   ffffa00000fd4b88
1          init   ffffa00000e40cd0   ffffa0000dcdd000   ffffa00000e45e60
0        system   ffffffff80c60ea0   ffffa0000de6d000   ffffffff80cacf60

I took a look at the stacks of the ones that looked xen-related:

4406     1               xl   netbsd   43 tstile       ffffa00001a59c30

trace: pid 4406 lid 1 at 0xffffa0000e507960
sleepq_block() at sleepq_block+0x8e
turnstile_block() at turnstile_block+0x2c3
rw_vector_enter() at rw_vector_enter+0x11f
genfs_lock() at genfs_lock+0x7b
VOP_LOCK() at VOP_LOCK+0x37
vn_lock() at vn_lock+0x11
kernfs_root() at kernfs_root+0x41
lookup_once() at lookup_once+0x26e
namei_tryemulroot() at namei_tryemulroot+0x612
namei() at namei+0x43
vn_open() at vn_open+0x94
do_open() at do_open+0x111
do_sys_openat() at do_sys_openat+0x68
sys_open() at sys_open+0x24
syscall() at syscall+0x9a
--- syscall (number 5) ---
7f7ff603c40a:

2832     1               xl   netbsd   43 tstile       ffffa00001a59c30

trace: pid 2832 lid 1 at 0xffffa0000e41d960
sleepq_block() at sleepq_block+0x8e
turnstile_block() at turnstile_block+0x2c3
rw_vector_enter() at rw_vector_enter+0x11f
genfs_lock() at genfs_lock+0x7b
VOP_LOCK() at VOP_LOCK+0x37
vn_lock() at vn_lock+0x11
kernfs_root() at kernfs_root+0x41
lookup_once() at lookup_once+0x26e
namei_tryemulroot() at namei_tryemulroot+0x612
namei() at namei+0x43
vn_open() at vn_open+0x94
do_open() at do_open+0x111
do_sys_openat() at do_sys_openat+0x68
sys_open() at sys_open+0x24
syscall() at syscall+0x9a
--- syscall (number 5) ---
7f7ff603c40a:

5442     1      xenstore-rm   netbsd   43 tstile       ffffa00001e263e8

trace: pid 5442 lid 1 at 0xffffa0000dd46a38
sleepq_block() at sleepq_block+0x8e
turnstile_block() at turnstile_block+0x2c3
rw_vector_enter() at rw_vector_enter+0x11f
genfs_lock() at genfs_lock+0x7b
VOP_LOCK() at VOP_LOCK+0x37
vn_lock() at vn_lock+0x11
lookup_once() at lookup_once+0x1f4
namei_tryemulroot() at namei_tryemulroot+0x612
namei() at namei+0x43
fd_nameiat.isra.0() at fd_nameiat.isra.0+0x54
do_sys_statat() at do_sys_statat+0x72
sys___stat50() at sys___stat50+0x28
syscall() at syscall+0x9a
--- syscall (number 439) ---
7f7ff710809a:

3217     1    xenstore-read   netbsd   43 tstile       ffffa00001868dd8

trace: pid 3217 lid 1 at 0xffffa0000fff2a38
sleepq_block() at sleepq_block+0x8e
turnstile_block() at turnstile_block+0x2c3
rw_vector_enter() at rw_vector_enter+0x11f
genfs_lock() at genfs_lock+0x7b
VOP_LOCK() at VOP_LOCK+0x37
vn_lock() at vn_lock+0x11
lookup_once() at lookup_once+0x1f4
namei_tryemulroot() at namei_tryemulroot+0x612
namei() at namei+0x43
fd_nameiat.isra.0() at fd_nameiat.isra.0+0x54
do_sys_statat() at do_sys_statat+0x72
sys___stat50() at sys___stat50+0x28
syscall() at syscall+0x9a
--- syscall (number 439) ---
7f7ff710809a:

5496     2       xenconsole   netbsd   43 netio        ffffa00001e77f40
5496     1       xenconsole   netbsd   43 select       ffffa000006ed050

trace: pid 5496 lid 2 at 0xffffa0000e316cd0
sleepq_block() at sleepq_block+0x8e
cv_timedwait_sig() at cv_timedwait_sig+0x9c
sbwait() at sbwait+0x57
soreceive() at soreceive+0xd4d
dofileread() at dofileread+0x90
sys_read() at sys_read+0x5f
syscall() at syscall+0x9a
--- syscall (number 3) ---
7f7ff6c3c38a:

5978     2               xl   netbsd   43 netio        ffffa000020a1628
5978     1               xl   netbsd   43 select       ffffa000006ed050

trace: pid 5978 lid 2 at 0xffffa0000e4c0cd0
sleepq_block() at sleepq_block+0x8e
cv_timedwait_sig() at cv_timedwait_sig+0x9c
sbwait() at sbwait+0x57
soreceive() at soreceive+0xd4d
dofileread() at dofileread+0x90
sys_read() at sys_read+0x5f
syscall() at syscall+0x9a
--- syscall (number 3) ---
7f7ff603c38a:

8280     1      xenstore-rm   netbsd   43 xbrd         ffffa0000079e000

trace: pid 8280 lid 1 at 0xffffa0000e4edd08
sleepq_block() at sleepq_block+0x8e
xenbus_dev_read() at xenbus_dev_read+0x47
kernfs_try_fileop() at kernfs_try_fileop+0x56
VOP_READ() at VOP_READ+0x3c
vn_read() at vn_read+0x94
dofileread() at dofileread+0x90
sys_read() at sys_read+0x5f
syscall() at syscall+0x9a
--- syscall (number 3) ---
7f7ff703c38a:

4535     1    xenstore-read   netbsd   43 tstile       ffffa00001868dd8

trace: pid 4535 lid 1 at 0xffffa0000dfecc90
sleepq_block() at sleepq_block+0x8e
turnstile_block() at turnstile_block+0x2c3
rw_vector_enter() at rw_vector_enter+0x11f
genfs_lock() at genfs_lock+0x7b
VOP_LOCK() at VOP_LOCK+0x37
vn_lock() at vn_lock+0x11
vn_write() at vn_write+0x9b
dofilewrite() at dofilewrite+0x97
sys_write() at sys_write+0x5f
syscall() at syscall+0x9a
--- syscall (number 4) ---
7f7ff703c1ca:

2475     2               xl   netbsd   43 netio        ffffa0000203b870
2475     1               xl   netbsd   43 select       ffffa000006ed050

trace: pid 2475 lid 2 at 0xffffa0000e4e2cd0
sleepq_block() at sleepq_block+0x8e
cv_timedwait_sig() at cv_timedwait_sig+0x9c
sbwait() at sbwait+0x57
soreceive() at soreceive+0xd4d
dofileread() at dofileread+0x90
sys_read() at sys_read+0x5f
syscall() at syscall+0x9a
--- syscall (number 3) ---
7f7ff603c38a:

2630     2               xl   netbsd   43 netio        ffffa00001fe2628
2630     1               xl   netbsd   43 select       ffffa000006ed050

trace: pid 2630 lid 2 at 0xffffa0000e489cd0
sleepq_block() at sleepq_block+0x8e
cv_timedwait_sig() at cv_timedwait_sig+0x9c
sbwait() at sbwait+0x57
soreceive() at soreceive+0xd4d
dofileread() at dofileread+0x90
sys_read() at sys_read+0x5f
syscall() at syscall+0x9a
--- syscall (number 3) ---
7f7ff603c38a:

2089     2               xl   netbsd   43 netio        ffffa00001fe2f48
2089     1               xl   netbsd   43 select       ffffa000006ed050

trace: pid 2089 lid 2 at 0xffffa0000e45ccd0
sleepq_block() at sleepq_block+0x8e
cv_timedwait_sig() at cv_timedwait_sig+0x9c
sbwait() at sbwait+0x57
soreceive() at soreceive+0xd4d
dofileread() at dofileread+0x90
sys_read() at sys_read+0x5f
syscall() at syscall+0x9a
--- syscall (number 3) ---
7f7ff603c38a:

1180     2               xl   netbsd   43 netio        ffffa00001f0acf8
1180     1               xl   netbsd   43 select       ffffa000006ed050

trace: pid 1180 lid 2 at 0xffffa0000e424cd0
sleepq_block() at sleepq_block+0x8e
cv_timedwait_sig() at cv_timedwait_sig+0x9c
sbwait() at sbwait+0x57
soreceive() at soreceive+0xd4d
dofileread() at dofileread+0x90
sys_read() at sys_read+0x5f
syscall() at syscall+0x9a
--- syscall (number 3) ---
7f7ff603c38a:

1557     2               xl   netbsd   43 netio        ffffa00001ed5198
1557     1               xl   netbsd   43 select       ffffa000006ed050

trace: pid 1557 lid 2 at 0xffffa0000e3cccd0
sleepq_block() at sleepq_block+0x8e
cv_timedwait_sig() at cv_timedwait_sig+0x9c
sbwait() at sbwait+0x57
soreceive() at soreceive+0xd4d
dofileread() at dofileread+0x90
sys_read() at sys_read+0x5f
syscall() at syscall+0x9a
--- syscall (number 3) ---
7f7ff603c38a:

526      2               xl   netbsd   43 netio        ffffa00001e77190
526      1               xl   netbsd   43 select       ffffa000006ed050

trace: pid 526 lid 2 at 0xffffa0000e39bcd0
sleepq_block() at sleepq_block+0x8e
cv_timedwait_sig() at cv_timedwait_sig+0x9c
sbwait() at sbwait+0x57
soreceive() at soreceive+0xd4d
dofileread() at dofileread+0x90
sys_read() at sys_read+0x5f
syscall() at syscall+0x9a
--- syscall (number 3) ---
7f7ff603c38a:

2511     2               xl   netbsd   43 netio        ffffa00001cd9190
2511     1               xl   netbsd   43 select       ffffa000006ed050

trace: pid 2511 lid 2 at 0xffffa0000e319cd0
sleepq_block() at sleepq_block+0x8e
cv_timedwait_sig() at cv_timedwait_sig+0x9c
sbwait() at sbwait+0x57
soreceive() at soreceive+0xd4d
dofileread() at dofileread+0x90
sys_read() at sys_read+0x5f
syscall() at syscall+0x9a
--- syscall (number 3) ---
7f7ff603c38a:

1397     2      xenconsoled   netbsd   43 netio        ffffa000013fc3d8
1397     1      xenconsoled   netbsd   43 select       ffffa000006ed050

trace: pid 1397 lid 2 at 0xffffa0000dd1ccd0
sleepq_block() at sleepq_block+0x8e
cv_timedwait_sig() at cv_timedwait_sig+0x9c
sbwait() at sbwait+0x57
soreceive() at soreceive+0xd4d
dofileread() at dofileread+0x90
sys_read() at sys_read+0x5f
syscall() at syscall+0x9a
--- syscall (number 3) ---
7f7ff643c38a:

1380     1        xenstored   netbsd   43 select       ffffa000006ed050

trace: pid 1380 lid 1 at 0xffffa0000ddbcc58
sleepq_block() at sleepq_block+0x8e
sel_do_scan() at sel_do_scan+0x4c4
pollcommon() at pollcommon+0x12c
sys_poll() at sys_poll+0x82
syscall() at syscall+0x9a
--- syscall (number 209) ---
7f7ff703c3ea:


Does any of that point to what the problem might be? How can I tell what
process holds the lock we're queuing up on? To my untrained eye, it
looks like some kind of filesystem lock, but I'm not familiar with
NetBSD's locks...

I'm afraid I'm going to need to reboot this machine ("vm4" is my
database server, so my blog's down until it's back), so I won't be able
to try anything else until it happens again :-)

Thanks,

-- 
Alaric Snell-Pym
http://www.snell-pym.org.uk/alaric/

Attachment: signature.asc
Description: OpenPGP digital signature



Home | Main Index | Thread Index | Old Index