Subject: Re: FFS journal
To: Kirill Kuvaldin <kirill.kuvaldin@gmail.com>
From: Bill Studenmund <wrstuden@netbsd.org>
List: tech-kern
Date: 07/03/2006 10:16:50
On Sun, Jul 02, 2006 at 07:59:50PM +0400, Kirill Kuvaldin wrote:
> Hi guys,
>
> I'm about to add a journaling mechanism to FFS as part of my Summer of
> Code project (http://netbsd-soc.sourceforge.net/projects/jffs/).
>
> I'd like you to take a look at the preliminary version of the design
> document; I'd be glad to hear your feedback. ;)
>
> Thanks in advance.
>
> -----------------------------------------------------------------------------
>
> SUPPORT FOR JOURNALING FOR FFS: PROJECT PLAN
> Author: Kirill Kuvaldin (kirill.kuvaldin@gmail.com)
>
> I. PROJECT DESIGN
>
> * One of the main sources of confusion about journaling is what exactly a
> journal contains. In the vast majority of journaling filesystems the
> journal contains only the modifications made to filesystem metadata
> (i.e. changes to directories, inodes, inode and block bitmaps). The
> journal doesn't contain any user data stored in a file.
You missed the difference between logical and physical journals. You
really want to do a physical journal, as recovery is MUCH simpler.
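To make the difference concrete: in a physical journal a record is
basically a block address plus a verbatim copy of the block, so replay is
just a copy loop. A rough sketch (the names are made up for illustration,
not code from any existing journal):

    #include <sys/types.h>

    /*
     * Physical journal record: replay only has to copy jr_data back
     * to block jr_blkno, no filesystem logic has to be re-run.
     */
    struct jffs_phys_record {
            uint64_t jr_blkno;      /* on-disk address of the block */
            uint32_t jr_len;        /* length of the block image */
            uint8_t  jr_data[];     /* verbatim copy of the block */
    };

A logical journal instead records the operation ("add name N to directory
inode D"), which is more compact but means replay has to re-run filesystem
code against a possibly half-updated filesystem.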
> * The key point of the journaling mechanism is the notion of a
> transaction.
> A transaction is a complete set of modifications made to the on-disk
> metadata during one operation (e.g. creating a file or directory). The
> main role of a transaction from the high-level point of view is to be
> atomic with respect to system failures -- i.e. a transaction happens
> completely, or it doesn't happen at all.
>
> * The biggest benefit from journaling can be achieved when using it with
> asynchronous semantics. This means that the blocks forming one
> transaction do not have to be flushed to the disk as soon as the
> transaction is completely written to the journal area. When the data is
> later flushed from the cache, the cache manager can sort the list of
> blocks by their disk addresses, which minimizes the seek time when
> writing the blocks. An asynchronous journaling filesystem can be several
> times faster than a traditional synchronous FFS.
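Just to illustrate where the win comes from, the flush path only needs a
sort by disk address before issuing the writes; something like the sketch
below (dirty_buf and the rest are invented here, this is not our buffer
cache API):

    #include <stdlib.h>
    #include <sys/types.h>

    /* Illustrative only -- not the real buffer cache interface. */
    struct dirty_buf {
            daddr_t  db_blkno;      /* disk address of the buffer */
            void    *db_data;       /* cached block contents */
    };

    static int
    cmp_blkno(const void *a, const void *b)
    {
            const struct dirty_buf *x = a, *y = b;

            return (x->db_blkno < y->db_blkno) ? -1 :
                   (x->db_blkno > y->db_blkno);
    }

    /*
     * Sort the dirty metadata buffers by disk address before writing
     * them, so the disk sees one mostly-sequential sweep instead of a
     * series of random seeks.
     */
    static void
    flush_sorted(struct dirty_buf *bufs, size_t n)
    {
            qsort(bufs, n, sizeof(bufs[0]), cmp_blkno);
            /* ... issue the writes in this order ... */
    }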
>
> * The technique of batching multiple transactions into a single journal
> write is often known as *group commit*. Group commit can offer a
> significant speed advantage to a journaling filesystem because it
> amortizes the cost of writing to disk over many transactions.
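In code, group commit is little more than "queue transactions, then commit
the whole queue with one journal write". A rough sketch, all names
invented for illustration:

    #include <sys/queue.h>

    struct transaction {
            TAILQ_ENTRY(transaction) txn_entry;
            int txn_done;           /* set once the batch is on disk */
    };
    TAILQ_HEAD(txnlist, transaction);

    /*
     * Commit every queued transaction with a single write to the
     * journal, amortizing the write cost across the whole batch.
     */
    static void
    jffs_commit_group(struct txnlist *pending)
    {
            struct transaction *txn;

            /* journal_write_batch(pending);  -- hypothetical, one I/O */

            TAILQ_FOREACH(txn, pending, txn_entry)
                    txn->txn_done = 1;      /* wake the waiting callers */
    }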
>
> * The operations considered to be transactions are listed below:
> - create a file;
> - delete a file;
> - rename a file (including deletion of the existing name);
> - change the size of a file (growing or shrinking);
> - ... (anything else?)
As Matt pointed out, writing the inode out.
Also, allocating and deallocating blocks.
> * When a journaled filesystem reboots, if there are any journal entries
> that were not marked as completed, the system must *replay* the entries
> to bring the metadata structures up-to-date. Replaying the journal
> prevents partial updates because each journal entry is a complete,
> self-contained transaction.
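Replay of a physical journal really is that simple. Roughly (all names and
helpers here are hypothetical, just to show the shape of it):

    /*
     * Walk the journal from the oldest live entry to the newest.  A
     * transaction that is not completely present in the journal is
     * ignored, which is what gives the all-or-nothing behaviour.
     */
    static void
    jffs_replay(struct journal *jnl)
    {
            uint32_t idx = jnl->jnl_start_index;

            while (idx != jnl->jnl_end_index) {
                    struct jffs_txn *t = read_transaction(jnl, idx);

                    if (!transaction_is_complete(t))
                            break;          /* torn tail -- stop here */
                    copy_blocks_home(t);    /* write block images in place */
                    idx = next_index(jnl, idx);
            }
            mark_journal_clean(jnl);
    }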
>
> * The biggest bottleneck of the journaling mechanism is that all
> transactions write to a single journal. A single journal effectively
> forces the filesystem into a single-threaded model for updates. This is
> a serious disadvantage when concurrent semantics are required. Due to
> the time constraints of a Summer of Code project, support for multiple
> journals will not be implemented within the scope of this project.
Huh? With group commit, you get a lot of changes out in one flush, so you
aren't single-threaded. Yes, the journal is still a semi-synchronous block
point (all transactions in a group don't return until the journal's
written), but "single-threaded" doesn't sound right.
Also, I doubt you'll ever want more than one journal, for two reasons.
First, if your journal isn't processing writes fast enough, the
easiest/best thing to do is to put it on a beefier disk. Put it on a RAID
10 with a small stripe depth. Or buy real transaction disks.
Second, if you have multiple journals, you add an extra synchronization
point and/or other complexity. The big problem is what happens if we
reboot and two or more journals for the same file system have uncompleted
writes for the same blocks, i.e. we wrote "transaction complete" to one of
them, then wrote a new transaction for the same blocks to the other
journal, and the "transaction complete" markings didn't make it to disk.
The best I can think up is that you have a time stamp on the journal, and
you play the "older" write first. But even that has problems. You now have
to have the journal replayer sort operations.
> II. DELIVERABLES
>
> Mandatory:
>
> * Finishing the design document (expected to be delivered by the 7th of
> July):
> - resolve all open issues;
> - probably make several modifications to the design according to
> mentors' (or community) feedback.
>
> * Implementing the journaling core features (expected to be delivered by
> the 31st of July):
> - new source files (maybe "jffs.{c,h}"?) in the sys/ufs/ffs/ directory
> containing the implementations of journaling functions (described
> later in this document);
> - the set of patches providing modifications to existing FFS and UFS
> layers to add hooks for journaling code;
Since you're making transactions, you will probably need code in the exact
same places that the snapshotting code has transactions. Right now, that's
in the layers above the VFS/VOP calls.
> - the set of patches providing modifications to the NetBSD
> configuration to make it possible to use the journal as an option;
> - maybe provide a single cumulative patch containing all the
> modifications, to make it easier to apply them to the NetBSD source
> tree.
>
> * The set of unit tests (expected to be delivered by the 13th of August):
> - to ensure that the code works correctly;
> - to reveal possible bugs;
> - to investigate code behaviour in boundary cases;
> - to analyze the filesystem performance.
>
> Optional:
>
> * Support for batching transactions:
> - it may be a significant performance win.
Yes.
> * API Documentation:
> - it may be helpful for the developers to understand what the
> journaling code does and how to use it.
Yes.
> * Benchmarks:
> - to compare the filesystem performance with different mount options:
> with the journal, with soft updates, and with neither.
Sounds good. One thing to do in testing vs. softdeps is to make sure you
test enough so that softdeps actually starts flushing transactions to
disk. It will cache quite a lot, so you need to do enough work to fill up
its cache.
> - to compare the FFS performance against the performance of other
> journaling filesystems (NB: I don't know whether this makes sense,
> because these benchmarks can't be performed within NetBSD alone due
> to the absence of other journaling filesystems).
I don't think this makes much sense.
> * Profiling:
> - to explore the code bottlenecks, probably to optimize some parts of
> code;
> - to investigate code coverage.
>
> III. TECHNICAL DETAILS
>=20
> * Journal internals:
> - The journal area (or log area) used to write journal entries is a
> fixed-size area allocated at filesystem initialization. The filesystem
> superblock must maintain a reference to the journal area, which also
Check with Darrin Jewell (dbj@) about how exactly to do this in a
backwards-compatible manner.
Also, it would be nice to be able to retrofit journals into existing file
systems.
> contains its own superblock where the necessary bookkeeping
> information is stored. Two indices (start_index and end_index) point
> to the start and end of the active area of the journal, which is used
> in a circular fashion; they simply mark the bounds of the region that
> contains active transactions.
>
> +-------------+------------------------+------------------------+------+
> | Journal     |                        |                        |      |
> | superblock  | t r a n s a c t i o n  | t r a n s a c t i o n  |      |
> |+-----------+|+-------+  +-------+    |+-------+  +-------+    |      |
> ||start_index|||       |  |       |    ||       |  |       |    |      |
> ||end_index  |||       |  |       | ...||       |  |       | ...| .... |
> ||    ...    |||       |  |       |    ||       |  |       |    |      |
> |+-----------+|+-------+  +-------+    |+-------+  +-------+    |      |
> +-------------+------------------------+------------------------+------+
>          Figure 1: Journal area on-disk representation
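For what it's worth, the journal superblock in Figure 1 probably only
needs a handful of fields. A possible layout (field names invented here):

    #include <sys/types.h>

    struct jffs_journal_sb {
            uint32_t js_magic;       /* identifies the journal area */
            uint32_t js_version;
            uint64_t js_nblocks;     /* size of the journal area in blocks */
            uint64_t js_start_index; /* oldest transaction still live */
            uint64_t js_end_index;   /* first free slot in the area */
            uint64_t js_sequence;    /* ever-increasing transaction id */
    };

    /*
     * The area is used as a ring, so indices wrap modulo js_nblocks:
     *      next = (idx + 1) % js_nblocks;
     * and the journal is empty when js_start_index == js_end_index.
     */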
>
> * Journaling API:
> The following ideas are inspired by the BeFS book (see [2]). Although
> there are only 3 functions for journal management, this may be enough
> for the rest of the filesystem to interact with the journaling code.
>
> - jffs_start_transaction():
> o acquire the journal semaphore, holding it until the transaction
> completes;
> o ensure that there is enough space available in the journal to hold
> this transaction; if there is, perform the necessary preparation
> and allocate the transaction structures; otherwise, force flushing
> blocks out of the cache, preferably those that were part of
> previous transactions;
> o set the state of the transaction to *running*, allowing the
> filesystem code to add new blocks to form the transaction.
>
> - jffs_write_blocks():
> o during a transaction any code that modifies a block of the
> filesystem metadata must call this function on the modified data;
> o for the sake of performance it may be possible to modify only the
> in-memory journal structures and later flush them to the log.
>
> - jffs_end_transaction():
> o at first this function turns the transaction into the *locked* state,
> meaning that no more blocks can be added to the transaction;
> o write all in-memory transaction blocks to their appropriate places
> in the journal area. When the last block is written to the
> journal, the transaction is considered *finished*;
> o set the callback function that will change the transaction state to
> *completed* as soon as the journal entry has been completely flushed
> to disk;
> o release the journal semaphore.
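Just to check that I read the API the way you intend it, a call site on
the FFS side would look roughly like this (the jffs_* signatures are my
guess, everything else is illustrative pseudo-C):

    int
    example_create(/* ... */)
    {
            struct jffs_transaction *txn;
            int error;

            txn = jffs_start_transaction(jnl);

            /*
             * Modify the directory block, the new inode and the bitmaps,
             * handing every dirtied metadata buffer to the journal.
             */
            jffs_write_blocks(txn, dirbuf);
            jffs_write_blocks(txn, inodebuf);
            jffs_write_blocks(txn, bitmapbuf);

            /* Lock the transaction, log it, set the completion callback. */
            error = jffs_end_transaction(txn);
            return error;
    }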
>
> * Journaling constraints on the cache subsystem:
> - journaling code must be able to lock disk blocks in the cache to
> prevent them from being flushed.
> - journaling code must know when a disk block is flushed to disk. This
> may be achieved with callback functions if the cache subsystem
> supports them. When the last block forming the transaction is flushed
> to disk, the transaction is considered completed.
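The callback itself should be tiny; something like this, with the names
invented here (and in real code the counter would need to be protected by
a lock):

    enum txn_state { TXN_RUNNING, TXN_LOCKED, TXN_FINISHED, TXN_COMPLETED };

    struct jffs_transaction {
            int txn_blocks_outstanding;  /* journaled blocks not yet on disk */
            enum txn_state txn_state;
    };

    /*
     * Called by the cache when one of the transaction's metadata
     * blocks has actually reached the disk.
     */
    static void
    jffs_block_flushed(struct jffs_transaction *txn)
    {
            if (--txn->txn_blocks_outstanding == 0) {
                    /*
                     * Last block is on disk: the journal entry can be
                     * marked completed and its space reclaimed.
                     */
                    txn->txn_state = TXN_COMPLETED;
            }
    }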
>
> IV. OPEN ISSUES
>=20
> * Journal location: The journal can reside on the same device as the
> rest of the file system, but putting the journal on a different
> device may be a performance win. E.g., the Seltzer paper [3]
> describes an approach where the journal is managed by a write-ahead
> filesystem (wafs). It may require some additional time to implement
> similar functionality.
Leave the journal location as a changeable thing. At the end of a file
system is convenient, but some configurations may need it on a separate
device.
> * The maximum size of a transaction is to be determined.
> The jffs_start_transaction() function uses this information to ensure
> that there is enough space available to hold the new transaction.
You will never have enough space. So pick something sensible, like 64M or
256M, and adapt commands that would overflow the transaction to operate in
stages.
At present the only operation I know of that will have this issue is
truncating a large (say multi-TB) file.
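i.e. something along these lines for truncate (sketch only, names
invented; the point is just that every intermediate state is itself a
consistent, journaled one):

    /*
     * Shrink the file in bounded steps, so that no single transaction
     * has to log more than MAX_BLOCKS_PER_TXN freed blocks.
     */
    while (inode_size(ip) > new_size) {
            off_t step = inode_size(ip) -
                (off_t)MAX_BLOCKS_PER_TXN * block_size;

            if (step < new_size)
                    step = new_size;

            txn = jffs_start_transaction(jnl);
            truncate_to(ip, step);  /* frees a bounded number of blocks */
            jffs_end_transaction(txn);
    }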
> * The functionality for recovering after a crash (journal replay)
> needs to be clearly defined and added to this document.
fsck should note the info in the super block and replay the journal.
> * The list of operations treated as transactions will probably need to
> be extended.
It will. :-)
Take care,
Bill