tech-kern archive
WAPBL tstile/performance issues... [LONG]
[This email is a bit dense, long, and perhaps less than optimally
organized. If you like a challenge, and digging around in kernel
source code, this might be the puzzle for you. If you don't like
kernel source code, then I highly recommend you do something you like
instead of reading this email :) ]
Notes on the WAPBL tstile/performance issues...
Background
----------
Back when WAPBL was added to a branch in NetBSD, I made some
observations about performance noted here:
http://mail-index.netbsd.org/source-changes/2008/07/29/msg208256.html
In particular, that src/sys/miscfs/syncfs/sync_subr.c:sched_sync() was
effectively getting too far behind in processing items to sync, and
that's what was requiring us to use WAPBL_DEBUG_SERIALIZE.
Other users posted (e.g.:
http://mail-index.netbsd.org/current-users/2008/10/28/msg005466.html
) that they saw very poor performance, but I was never able to
replicate the issue on my test box -- until very recently.
It turns out that (I believe) WAPBL_DEBUG_SERIALIZE is somewhat of a
red herring, and the problem is much, much deeper... I also believe
that the underlying cause is the root of PR#39564 (WAPBL performance
issues with disk cache flushing), and that it is at least partially
responsible for the 'tstile' problems also seen with WAPBL.
Current Observations
--------------------
1) A single tar extraction is started: tar -xzplf src.tar.gz
on a filesystem mounted with '-o log'. Note that in the early days of
WAPBL, I needed to run multiple, simultaneous tar extractions to see
this problem. Not any more.
2) Some 30 seconds in, 600-700MB of data has been written to the
disk, at least as reported by df.
3) By adding some accounting to src/sys/miscfs/syncfs/sync_subr.c
it is observed that the total number of items in the
syncer_workitem_pending[] array very quickly climbs to 60000+.
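For the record, the accounting itself is nothing fancy. A minimal
sketch of the idea (not the actual diff; the helper name is made up,
and the queue/field names are as I recall them from sync_subr.c)
would be:

/*
 * Sketch only: count everything currently queued for the syncer by
 * walking every delay slot of syncer_workitem_pending[].  Meant to be
 * called from sched_sync() with the lock protecting the work queues
 * held.
 */
static int
count_syncer_workitems(void)
{
        struct vnode *vp;
        int slot, total = 0;

        for (slot = 0; slot < syncer_maxdelay; slot++) {
                TAILQ_FOREACH(vp, &syncer_workitem_pending[slot],
                    v_synclist)
                        total++;
        }
        return total;
}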
4) Sometimes the number of work items drops back down to ~2000.
Other times it does not. If sched_sync() is modified to be more
aggressive (sleeping for 1 second only if all the queues have been
dealt with in less than one second [code thanks to mlelstv]), the
queue total can be held to about 4000, but it does not solve the
problem.
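The modification is roughly along these lines -- my own
reconstruction of the idea rather than mlelstv's actual diff, with
the timing/sleep calls standing in for whatever the real code uses:

        /*
         * "More aggressive" syncer sketch: time each pass over the
         * work queues, and only sleep for the remainder of the second
         * if the pass finished early.  If a pass took a second or
         * more, start the next one immediately.
         */
        struct timeval start, end, elapsed;

        getmicrouptime(&start);
        /* ... process syncer_workitem_pending[] as usual ... */
        getmicrouptime(&end);
        timersub(&end, &start, &elapsed);

        if (elapsed.tv_sec == 0) {
                /* finished early: sleep out the rest of the second */
                kpause("syncer", false,
                    hz - (int)(elapsed.tv_usec * hz / 1000000), NULL);
        }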
5) At some point, the "silly behaviour" begins... exactly when this
happens, I'm not sure yet, but it's usually some 50 seconds after the tar
extraction has started. The "silly behaviour" can be characterised
as: real IO to the disk sits at maybe 1-2MB/sec, but the amount of new
data on the disk is in the neighbourhood of 20-100K/sec. Heads on the
disk go madly back and forth, likely between the little bit of new
data being written and the journal. System progress grinds to a halt,
and the tar process goes into 'tstile' wait... (more on that later..)
6) Observable during this time are incessant calls to wapbl_flush()
from here:
(gdb) list *0xc0352316
0xc0352316 is in ffs_full_fsync (/u1/devel/current2/src/sys/ufs/ffs/ffs_vnops.c:477).
472 * Don't flush the log if the vnode being flushed
473 * contains no dirty buffers that could be in the log.
474 */
475 if (!((flags & FSYNC_RECLAIM) &&
476 LIST_EMPTY(&vp->v_dirtyblkhd))) {
477 error = wapbl_flush(mp->mnt_wapbl, 0);
478 if (error)
479 return error;
480 }
481
7) When things are working "normally", the following (custom)
debugging output is observed:
wapbl_flush: 0 0xc095bab3
wapbl_truncate: 4922368 4898304 0 0xc095c4b3
That is:
a) wapbl_flush() is called with a 'waitfor' value of 0, and with the
return address 0xc095bab3 (the second field of the debug line).
b) wapbl_truncate() is called with a 'minfree' value of 4922368,
there are 4898304 bytes of free space in the log, the 'waitonly'
flag is 0, and we're called from here:
(gdb) list *0xc095c4b3
0xc095c4b3 is in wapbl_flush (/u1/devel/current2/src/sys/kern/vfs_wapbl.c:1387).
1382 * remaining to flush without the protection of the journal.
1383 */
1384 panic("wapbl_flush: current transaction too big to flush\n");
1385 }
1386
1387 error = wapbl_truncate(wl, flushsize, 0);
1388 if (error)
1389 goto out2;
1390
1391 off = wl->wl_head;
(gdb)
Since the 'minfree' is larger than the 'avail', a full flush of the
log is executed, and the log is completely emptied.
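For anyone wanting to reproduce the debugging output above: it's just
a couple of printf()s, roughly like the following (reconstructed
here -- the exact format strings and variable names aren't important,
and 'avail' is whatever holds the computed free space):

        /* near the top of wapbl_flush(): */
        printf("wapbl_flush: %d %p\n", waitfor,
            __builtin_return_address(0));

        /* in wapbl_truncate(), once the free space has been computed: */
        printf("wapbl_truncate: %zu %zu %d %p\n",
            (size_t)minfree, (size_t)avail, waitonly,
            __builtin_return_address(0));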
8) When things are *not* working normally, and we get into the "silly
syndrome" case, what is observed is:
...
wapbl_flush: 0 0xc0352316
wapbl_truncate: 21504 9791488 0 0xc095c4b3
wapbl_write_commit head=0x9362944 tail=0x9341440
wapbl_flush() done..
back on wl: 0xce9a7968
overdrive..66555 0xce9a78b0
wapbl_flush: 0 0xc0352316
wapbl_truncate: 21504 9807872 0 0xc095c4b3
wapbl_write_commit head=0x9384448 tail=0x9362944
wapbl_flush() done..
back on wl: 0xce9a78b0
overdrive..66555 0xce9a77f8
wapbl_flush: 0 0xc0352316
wapbl_truncate: 21504 9807872 0 0xc095c4b3
wapbl_write_commit head=0x9405952 tail=0x9384448
wapbl_flush() done..
back on wl: 0xce9a77f8
overdrive..66555 0xce9a7740
wapbl_flush: 0 0xc0352316
wapbl_truncate: 21504 9807872 0 0xc095c4b3
wapbl_write_commit head=0x9427456 tail=0x9405952
wapbl_flush() done..
...
Explaining my verbose debugging output, what we have above is:
a) wapbl_flush() is being called with a 'waitfor' value of 0, and
with the return address of 0xc0352316:
(gdb) list *0xc0352316
0xc0352316 is in ffs_full_fsync (/u1/devel/current2/src/sys/ufs/ffs/ffs_vnops.c:477).
472 * Don't flush the log if the vnode being flushed
473 * contains no dirty buffers that could be in the log.
474 */
475 if (!((flags & FSYNC_RECLAIM) &&
476 LIST_EMPTY(&vp->v_dirtyblkhd))) {
477 error = wapbl_flush(mp->mnt_wapbl, 0);
478 if (error)
479 return error;
480 }
481
(that is, wapbl_flush() is being called indirectly from sched_sync()
via ffs_full_fsync())
b) wapbl_truncate() is now only being asked for a minfree of 21504,
and since this is smaller than the space available (i.e. the journal
is big enough), wapbl_truncate() does not do anything.
c) wapbl_write_commit() has the given head/tail for the location of
the log.
d) 'back on wl' - the vnode pointer's address is printed here if
the following condition is true in sched_sync() and the item is added
back onto a worklist:
if (TAILQ_FIRST(slp) == vp) {
e) This line:
overdrive..66555 0xce9a78b0
tells me that at the time of printing, there were 66555 items on all
the work queues, and vnode pointer 0xce9a78b0 was the one we pulled
off for processing. You can see that we processed it, but then we
stuffed it right back onto a worklist. That seems to be a common
theme here, and one that I don't fully understand.
*** PERHAPS THIS IS WHERE THE BUG IS??? ***
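For reference, the piece of sched_sync() in question looks roughly
like this (condensed and paraphrased -- the locking and the actual
fsync call are elided):

        while ((vp = TAILQ_FIRST(slp)) != NULL) {
                /* ... grab a reference to vp and VOP_FSYNC() it with
                   FSYNC_LAZY, dropping the syncer lock meanwhile ... */
                if (TAILQ_FIRST(slp) == vp) {
                        /*
                         * The vnode is still at the head of this
                         * slot's list after the fsync, i.e. it did not
                         * come clean (or was dirtied again right
                         * away), so it goes straight back onto a
                         * worklist 'syncdelay' seconds out.  This is
                         * the "back on wl" case in the output above.
                         */
                        vn_syncer_add_to_worklist(vp, syncdelay);
                }
        }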
9) The 'tstile' issue in WAPBL shows up looking something like this:
root 598 2.7 0.1 752 1944 ttyp0 D+ 2:38PM 0:16.23 tar -xz
0 598 431 0 117 0 752 1944 c095bad9 D+ ttyp0 0:16.23 tar -xz
(where 'c095bad9' is the address of the function that called either
rw_vector_enter() or mutex_vector_enter(), instead of the usual
'tstile'. No, your kernel won't magically print those values without
you hacking the code too...
).
In this particular case, this leads us to:
(gdb) list *0xc095bad9
0xc095bad9 is in wapbl_begin (/u1/devel/current2/src/sys/kern/vfs_wapbl.c:847).
842 if (error)
843 return error;
844 }
845
846 rw_enter(&wl->wl_rwlock, op);
847 mutex_enter(&wl->wl_mtx);
848 wl->wl_lock_count++;
849 mutex_exit(&wl->wl_mtx);
850
851 #if defined(WAPBL_DEBUG_PRINT) && defined(WAPBL_DEBUG_SERIALIZE)
(gdb)
which is in wapbl_begin().
10) When the system is in its "silly state" (as described in 8) above)
it is trivial to observe 'tar' being stuck in tstile all the time.
11) Where the DIOCCACHESYNC bug shows up (PR#39564) is really in the
incessant calling of wapbl_write_commit() when no real work appears to
be getting done. I think that, if anything, all the cache flushing
does is change the timing of the problem... without the cache flush
perhaps the disks can stay ahead, and the underlying problem is not
brought to the forefront?
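For context on why each of those commits is expensive all by itself:
wapbl_write_commit() forces the drive's write cache out to the
platters with a DIOCCACHESYNC ioctl on the underlying device, roughly
like this (a paraphrase, not a verbatim quote of vfs_wapbl.c):

        /*
         * Ask the drive to flush its write cache so the journal
         * blocks just written are actually on stable storage before
         * the commit record claims they are.  With tiny commits like
         * the ones above, this flush (plus the seek back to the
         * journal) dominates the cost.
         */
        int force = 1;
        error = VOP_IOCTL(wl->wl_devvp, DIOCCACHESYNC, &force,
            FWRITE, FSCRED);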
12) A loop similar to this:
while (1)
        sync
        sleep 1
end
is sufficient to keep data flowing such that the "silly behaviour"
never occurs. The difference here, of course, is that a 'sync' call
results in wapbl_flush() being called with the 'waitfor' flag set,
which means that function does not return until all the IO triggered
by the sync has completed.
13) The test system is a 2.4GHz Q6600 quad-core w/ 2GB RAM and SATA
drives. The problem has been observed on a multitude of different
hardware and drive configurations.
If anyone has any ideas on where to go from here, I'm all ears.
Later...
Greg Oster