On 09/11/16 21:16, Alaric Snell-Pym wrote: > On 17/10/16 18:43, Manuel Bouyer wrote: >> this means that they're all stuck on a lock. >> The first thing would be to use ddb (or crash(8), if this works on Xen) >> to find where in the kernel these processes are stuck. >> >> The next step would be to which which process is actaully holding the lock >> (this one is probably not in tstile state) >> > > > Thanks, Manuel... looks like crash(8) works, I'll check that out next > time it happens! Does the lock show as the "wait channel" as shown by > ps/w? How do I find what process currently holds it? And then I guess I > do a "trace/t" on the pids involved to see what's what? I'm a little > green with ddb. > It's happened again, during an xl create! I'm not sure how to interpret these results, but here's what I was doing: Pid 4406 is my hanging "xl create". -+= 00001 root init |--= 00526 root xl create /domU/domUs/vm2/xl.cfg |--= 00892 root /usr/sbin/syslogd -s |-+= 01180 root xl create /domU/domUs/vm4/xl.cfg | \--- 02832 root xl create /domU/domUs/vm4/xl.cfg |--- 01380 root /usr/pkg/sbin/xenstored --pid-file /var/run/xenstored.pid |--- 01397 root /usr/pkg/sbin/xenconsoled |--= 01492 root /usr/sbin/powerd |--= 01557 root xl create /domU/domUs/vm3/xl.cfg |--= 01637 root /usr/sbin/ntpd -p /var/run/ntpd.pid |-+= 01673 root /usr/sbin/sshd | |-+= 00379 root sshd: alaric [priv] | | \-+- 00301 alaric sshd: alaric@pts/1 (sshd) | | \-+= 00250 alaric -sh | | \-+= 02415 root su - | | \-+= 02493 root -sh | | \--= 05496 root /usr/pkg/libexec/xen/bin/xenconsole --num 0 --type pv 3 | |-+= 02815 root sshd: alaric [priv] | | \-+- 00423 alaric sshd: alaric@pts/0 (sshd) | | \-+= 02096 alaric -sh | | \-+= 00938 root su - | | \-+= 00210 root -sh | | \--= 04406 root xl create -c /domU/domUs/vm4/xl.cfg | \-+= 08541 root sshd: alaric [priv] | \-+- 05855 alaric sshd: alaric@pts/10 (sshd) | \-+= 06627 alaric -sh | \-+= 05133 root su - | \-+= 07688 root -sh | \-+= 16294 root /usr/pkg/bin/proctree | \-+- 09586 root sh -c ps -axwwo user,pid,ppid,pgid,command | \--- 17305 root ps -axwwo user,pid,ppid,pgid,command |--= 01931 root /usr/sbin/cron |-+= 02060 root /usr/libexec/postfix/master -w | |--- 01779 postfix qmgr -l -t unix -u | \--- 07732 postfix pickup -l -t unix -u |--= 02089 root xl create /domU/domUs/vm5/xl.cfg |--= 02206 root /usr/sbin/inetd -l |--= 02475 root xl create /domU/domUs/vm8/xl.cfg |--= 02511 root xl create /domU/domUs/vm1/xl.cfg |--= 02630 root xl create /domU/domUs/vm6/xl.cfg |--- 03217 root xenstore-read /local/domain/0/backend/vbd/4/3/params |--- 04535 root xenstore-read /local/domain/0/backend/vbd/7/1/params |--- 05442 root xenstore-rm /local/domain/0/backend/vif/4/0 |--= 05978 root xl create /domU/domUs/vm7/xl.cfg |--- 08280 root xenstore-rm /local/domain/0/backend/vif/7/1 |--= 00366 root /usr/libexec/getty Pc console |--= 02673 root /usr/libexec/getty Pc ttyE1 |--= 02617 root /usr/libexec/getty Pc ttyE2 \--= 00259 root /usr/libexec/getty Pc ttyE3 Here's what crash has to say: crash> ps/a PID COMMAND STRUCT PROC * UAREA * VMSPACE/VM_MAP 16638 crash ffffa00001feea70 ffffa0000e508000 ffffa00001ef88b8 7732 pickup ffffa00001fee030 ffffa0000e516000 ffffa00001ef8748 7688 sh ffffa00001560a58 ffffa0000e429000 ffffa00001c9d460 5133 su ffffa000020bea78 ffffa0000e377000 ffffa00001ef8188 6627 sh ffffa000020bed08 ffffa0000de60000 ffffa00002ae3d10 5855 sshd ffffa00001e5acf8 ffffa0000e519000 ffffa00001ef82f8 8541 sshd ffffa00001feed00 ffffa0000e4ad000 ffffa00001c9d5d0 4406 xl ffffa00001fee550 ffffa0000e505000 ffffa00001c9d8b0 2832 xl ffffa00001cff2b0 ffffa0000e41b000 ffffa00001ef85d8 5442 xenstore-r ffffa00001fee2c0 ffffa0000dd44000 ffffa00001ef8468 3217 xenstore-r ffffa000020be7e8 ffffa0000fff0000 ffffa00002ae35e0 5496 xenconsole ffffa00001560018 ffffa0000e314000 ffffa00001c9dd00 5978 xl ffffa00001cff7d0 ffffa0000e4be000 ffffa00001ef8018 8280 xenstore-r ffffa000020be558 ffffa0000e4eb000 ffffa00000fd4cf8 4535 xenstore-r ffffa000020be2c8 ffffa0000dfea000 ffffa00002ae3750 2475 xl ffffa00001fee7e0 ffffa0000e4e0000 ffffa00001c9d180 2630 xl ffffa00001e5a028 ffffa0000e487000 ffffa00001ef8a28 2089 xl ffffa00001e5aa68 ffffa0000e45a000 ffffa00001ef8b98 1180 xl ffffa00001e5a2b8 ffffa0000e422000 ffffa00001ef8d08 1557 xl ffffa00001e5a7d8 ffffa0000e3ca000 ffffa00001ef8e78 526 xl ffffa00001e5a548 ffffa0000e399000 ffffa00001c9d010 2511 xl ffffa00001cff020 ffffa0000e317000 ffffa00001c9d2f0 2493 sh ffffa00000f7e7b8 ffffa0000ddb0000 ffffa00000e455c0 2415 su ffffa00000e40520 ffffa0000dd1d000 ffffa00000e45a10 250 sh ffffa00000e407b0 ffffa0000dcd1000 ffffa00000e45b80 301 sshd ffffa00000f7ea48 ffffa0000dd9d000 ffffa00000e45450 379 sshd ffffa00001cff540 ffffa0000e30d000 ffffa00000e458a0 210 sh ffffa00000e40290 ffffa0000dd94000 ffffa00001c9d740 938 su ffffa000015602a8 ffffa0000e283000 ffffa00001c9db90 2096 sh ffffa00001cffcf0 ffffa0000e300000 ffffa00000fd4e68 423 sshd ffffa00001560538 ffffa0000e280000 ffffa00000fd4008 2815 sshd ffffa00000f7e528 ffffa0000ddb7000 ffffa00000e45170 366 getty ffffa00000e40a40 ffffa0000dcce000 ffffa00000e45cf0 259 getty ffffa0000130cce0 ffffa0000ddd6000 ffffa00000fd45c8 2617 getty ffffa00000f7ecd8 ffffa0000dd9a000 ffffa00000e452e0 2673 getty ffffa00000e40000 ffffa0000dd97000 ffffa00000e45730 1931 cron ffffa00001cffa60 ffffa0000e303000 ffffa00001c9da20 2206 inetd ffffa00000f7e008 ffffa0000ddd9000 ffffa00000fd4178 1779 qmgr ffffa0000130c7c0 ffffa0000dfcc000 ffffa00000fd42e8 2060 master ffffa000015607c8 ffffa0000dfed000 ffffa00001c9de70 1673 sshd ffffa00001560ce8 ffffa0000dfe7000 ffffa00000fd4a18 1492 powerd ffffa0000130c010 ffffa0000dfe4000 ffffa00000fd4458 1637 ntpd ffffa0000130c530 ffffa0000dfc9000 ffffa00000fd48a8 1397 xenconsole ffffa0000130c2a0 ffffa0000dd1a000 ffffa00000fd4738 1380 xenstored ffffa00000f7e298 ffffa0000ddba000 ffffa00000e45000 892 syslogd ffffa0000130ca50 ffffa0000ddd3000 ffffa00000fd4b88 1 init ffffa00000e40cd0 ffffa0000dcdd000 ffffa00000e45e60 0 system ffffffff80c60ea0 ffffa0000de6d000 ffffffff80cacf60 I took a look at the stacks of the ones that looked xen-related: 4406 1 xl netbsd 43 tstile ffffa00001a59c30 trace: pid 4406 lid 1 at 0xffffa0000e507960 sleepq_block() at sleepq_block+0x8e turnstile_block() at turnstile_block+0x2c3 rw_vector_enter() at rw_vector_enter+0x11f genfs_lock() at genfs_lock+0x7b VOP_LOCK() at VOP_LOCK+0x37 vn_lock() at vn_lock+0x11 kernfs_root() at kernfs_root+0x41 lookup_once() at lookup_once+0x26e namei_tryemulroot() at namei_tryemulroot+0x612 namei() at namei+0x43 vn_open() at vn_open+0x94 do_open() at do_open+0x111 do_sys_openat() at do_sys_openat+0x68 sys_open() at sys_open+0x24 syscall() at syscall+0x9a --- syscall (number 5) --- 7f7ff603c40a: 2832 1 xl netbsd 43 tstile ffffa00001a59c30 trace: pid 2832 lid 1 at 0xffffa0000e41d960 sleepq_block() at sleepq_block+0x8e turnstile_block() at turnstile_block+0x2c3 rw_vector_enter() at rw_vector_enter+0x11f genfs_lock() at genfs_lock+0x7b VOP_LOCK() at VOP_LOCK+0x37 vn_lock() at vn_lock+0x11 kernfs_root() at kernfs_root+0x41 lookup_once() at lookup_once+0x26e namei_tryemulroot() at namei_tryemulroot+0x612 namei() at namei+0x43 vn_open() at vn_open+0x94 do_open() at do_open+0x111 do_sys_openat() at do_sys_openat+0x68 sys_open() at sys_open+0x24 syscall() at syscall+0x9a --- syscall (number 5) --- 7f7ff603c40a: 5442 1 xenstore-rm netbsd 43 tstile ffffa00001e263e8 trace: pid 5442 lid 1 at 0xffffa0000dd46a38 sleepq_block() at sleepq_block+0x8e turnstile_block() at turnstile_block+0x2c3 rw_vector_enter() at rw_vector_enter+0x11f genfs_lock() at genfs_lock+0x7b VOP_LOCK() at VOP_LOCK+0x37 vn_lock() at vn_lock+0x11 lookup_once() at lookup_once+0x1f4 namei_tryemulroot() at namei_tryemulroot+0x612 namei() at namei+0x43 fd_nameiat.isra.0() at fd_nameiat.isra.0+0x54 do_sys_statat() at do_sys_statat+0x72 sys___stat50() at sys___stat50+0x28 syscall() at syscall+0x9a --- syscall (number 439) --- 7f7ff710809a: 3217 1 xenstore-read netbsd 43 tstile ffffa00001868dd8 trace: pid 3217 lid 1 at 0xffffa0000fff2a38 sleepq_block() at sleepq_block+0x8e turnstile_block() at turnstile_block+0x2c3 rw_vector_enter() at rw_vector_enter+0x11f genfs_lock() at genfs_lock+0x7b VOP_LOCK() at VOP_LOCK+0x37 vn_lock() at vn_lock+0x11 lookup_once() at lookup_once+0x1f4 namei_tryemulroot() at namei_tryemulroot+0x612 namei() at namei+0x43 fd_nameiat.isra.0() at fd_nameiat.isra.0+0x54 do_sys_statat() at do_sys_statat+0x72 sys___stat50() at sys___stat50+0x28 syscall() at syscall+0x9a --- syscall (number 439) --- 7f7ff710809a: 5496 2 xenconsole netbsd 43 netio ffffa00001e77f40 5496 1 xenconsole netbsd 43 select ffffa000006ed050 trace: pid 5496 lid 2 at 0xffffa0000e316cd0 sleepq_block() at sleepq_block+0x8e cv_timedwait_sig() at cv_timedwait_sig+0x9c sbwait() at sbwait+0x57 soreceive() at soreceive+0xd4d dofileread() at dofileread+0x90 sys_read() at sys_read+0x5f syscall() at syscall+0x9a --- syscall (number 3) --- 7f7ff6c3c38a: 5978 2 xl netbsd 43 netio ffffa000020a1628 5978 1 xl netbsd 43 select ffffa000006ed050 trace: pid 5978 lid 2 at 0xffffa0000e4c0cd0 sleepq_block() at sleepq_block+0x8e cv_timedwait_sig() at cv_timedwait_sig+0x9c sbwait() at sbwait+0x57 soreceive() at soreceive+0xd4d dofileread() at dofileread+0x90 sys_read() at sys_read+0x5f syscall() at syscall+0x9a --- syscall (number 3) --- 7f7ff603c38a: 8280 1 xenstore-rm netbsd 43 xbrd ffffa0000079e000 trace: pid 8280 lid 1 at 0xffffa0000e4edd08 sleepq_block() at sleepq_block+0x8e xenbus_dev_read() at xenbus_dev_read+0x47 kernfs_try_fileop() at kernfs_try_fileop+0x56 VOP_READ() at VOP_READ+0x3c vn_read() at vn_read+0x94 dofileread() at dofileread+0x90 sys_read() at sys_read+0x5f syscall() at syscall+0x9a --- syscall (number 3) --- 7f7ff703c38a: 4535 1 xenstore-read netbsd 43 tstile ffffa00001868dd8 trace: pid 4535 lid 1 at 0xffffa0000dfecc90 sleepq_block() at sleepq_block+0x8e turnstile_block() at turnstile_block+0x2c3 rw_vector_enter() at rw_vector_enter+0x11f genfs_lock() at genfs_lock+0x7b VOP_LOCK() at VOP_LOCK+0x37 vn_lock() at vn_lock+0x11 vn_write() at vn_write+0x9b dofilewrite() at dofilewrite+0x97 sys_write() at sys_write+0x5f syscall() at syscall+0x9a --- syscall (number 4) --- 7f7ff703c1ca: 2475 2 xl netbsd 43 netio ffffa0000203b870 2475 1 xl netbsd 43 select ffffa000006ed050 trace: pid 2475 lid 2 at 0xffffa0000e4e2cd0 sleepq_block() at sleepq_block+0x8e cv_timedwait_sig() at cv_timedwait_sig+0x9c sbwait() at sbwait+0x57 soreceive() at soreceive+0xd4d dofileread() at dofileread+0x90 sys_read() at sys_read+0x5f syscall() at syscall+0x9a --- syscall (number 3) --- 7f7ff603c38a: 2630 2 xl netbsd 43 netio ffffa00001fe2628 2630 1 xl netbsd 43 select ffffa000006ed050 trace: pid 2630 lid 2 at 0xffffa0000e489cd0 sleepq_block() at sleepq_block+0x8e cv_timedwait_sig() at cv_timedwait_sig+0x9c sbwait() at sbwait+0x57 soreceive() at soreceive+0xd4d dofileread() at dofileread+0x90 sys_read() at sys_read+0x5f syscall() at syscall+0x9a --- syscall (number 3) --- 7f7ff603c38a: 2089 2 xl netbsd 43 netio ffffa00001fe2f48 2089 1 xl netbsd 43 select ffffa000006ed050 trace: pid 2089 lid 2 at 0xffffa0000e45ccd0 sleepq_block() at sleepq_block+0x8e cv_timedwait_sig() at cv_timedwait_sig+0x9c sbwait() at sbwait+0x57 soreceive() at soreceive+0xd4d dofileread() at dofileread+0x90 sys_read() at sys_read+0x5f syscall() at syscall+0x9a --- syscall (number 3) --- 7f7ff603c38a: 1180 2 xl netbsd 43 netio ffffa00001f0acf8 1180 1 xl netbsd 43 select ffffa000006ed050 trace: pid 1180 lid 2 at 0xffffa0000e424cd0 sleepq_block() at sleepq_block+0x8e cv_timedwait_sig() at cv_timedwait_sig+0x9c sbwait() at sbwait+0x57 soreceive() at soreceive+0xd4d dofileread() at dofileread+0x90 sys_read() at sys_read+0x5f syscall() at syscall+0x9a --- syscall (number 3) --- 7f7ff603c38a: 1557 2 xl netbsd 43 netio ffffa00001ed5198 1557 1 xl netbsd 43 select ffffa000006ed050 trace: pid 1557 lid 2 at 0xffffa0000e3cccd0 sleepq_block() at sleepq_block+0x8e cv_timedwait_sig() at cv_timedwait_sig+0x9c sbwait() at sbwait+0x57 soreceive() at soreceive+0xd4d dofileread() at dofileread+0x90 sys_read() at sys_read+0x5f syscall() at syscall+0x9a --- syscall (number 3) --- 7f7ff603c38a: 526 2 xl netbsd 43 netio ffffa00001e77190 526 1 xl netbsd 43 select ffffa000006ed050 trace: pid 526 lid 2 at 0xffffa0000e39bcd0 sleepq_block() at sleepq_block+0x8e cv_timedwait_sig() at cv_timedwait_sig+0x9c sbwait() at sbwait+0x57 soreceive() at soreceive+0xd4d dofileread() at dofileread+0x90 sys_read() at sys_read+0x5f syscall() at syscall+0x9a --- syscall (number 3) --- 7f7ff603c38a: 2511 2 xl netbsd 43 netio ffffa00001cd9190 2511 1 xl netbsd 43 select ffffa000006ed050 trace: pid 2511 lid 2 at 0xffffa0000e319cd0 sleepq_block() at sleepq_block+0x8e cv_timedwait_sig() at cv_timedwait_sig+0x9c sbwait() at sbwait+0x57 soreceive() at soreceive+0xd4d dofileread() at dofileread+0x90 sys_read() at sys_read+0x5f syscall() at syscall+0x9a --- syscall (number 3) --- 7f7ff603c38a: 1397 2 xenconsoled netbsd 43 netio ffffa000013fc3d8 1397 1 xenconsoled netbsd 43 select ffffa000006ed050 trace: pid 1397 lid 2 at 0xffffa0000dd1ccd0 sleepq_block() at sleepq_block+0x8e cv_timedwait_sig() at cv_timedwait_sig+0x9c sbwait() at sbwait+0x57 soreceive() at soreceive+0xd4d dofileread() at dofileread+0x90 sys_read() at sys_read+0x5f syscall() at syscall+0x9a --- syscall (number 3) --- 7f7ff643c38a: 1380 1 xenstored netbsd 43 select ffffa000006ed050 trace: pid 1380 lid 1 at 0xffffa0000ddbcc58 sleepq_block() at sleepq_block+0x8e sel_do_scan() at sel_do_scan+0x4c4 pollcommon() at pollcommon+0x12c sys_poll() at sys_poll+0x82 syscall() at syscall+0x9a --- syscall (number 209) --- 7f7ff703c3ea: Does any of that point to what the problem might be? How can I tell what process holds the lock we're queuing up on? To my untrained eye, it looks like some kind of filesystem lock, but I'm not familiar with NetBSD's locks... I'm afraid I'm going to need to reboot this machine ("vm4" is my database server, so my blog's down until it's back), so I won't be able to try anything else until it happens again :-) Thanks, -- Alaric Snell-Pym http://www.snell-pym.org.uk/alaric/
Attachment:
signature.asc
Description: OpenPGP digital signature