Subject: FFS journal
To: None <tech-kern@netbsd.org>
From: Kirill Kuvaldin <kirill.kuvaldin@gmail.com>
List: tech-kern
Date: 07/02/2006 19:59:50
Hi guys,
I'm about to add journaling mechanism to FFS as a part of my summer of code
project (http://netbsd-soc.sourceforge.net/projects/jffs/).
I'd like you to look at the preliminary version of a design document
and hear your feedback. ;)
Thanks in advance.
-----------------------------------------------------------------------------
SUPPORT FOR JOURNALING FOR FFS: PROJECT PLAN
Author: Kirill Kuvaldin (kirill.kuvaldin@gmail.com)
I. PROJECT DESIGN
* One of the main sources of confusion about journaling is what exactly a
journal contains. In the vast majority of journaling filesystems
journal only contains modifications made to filesystem metadata
(i.e. changes to directories, inodes, inode and block bitmaps). The
journal doesn't contain any user data stored in a file.
* The key point of journaling mechanism is a notion of transaction.
A transaction is a complete set of modifications made to the on-disk
metadata during one operation (e.g. creating a file or directory). The
main role of a transaction from the high-level point of view is to be
atomic with respect to system failures -- i.e. a transaction happens
completely, or it doesn't happen at all.
* The biggest benefit from journaling can be achieved when using it with
the asynchronous semantics. It means that data blocks forming one
transaction are not flushed to the disk as soon as the transaction is
completely written to the journal area. When the data is later flushed
from the cache, the cache manager can sort the list of blocks by their
disk addresses, which minimize the seek time when writing the blocks.
Asynchronous journaling filesystem can be several times faster than a
traditional synchronous FFS.
* The technique of batching multiple transactions into a single
transaction is often known as *group commit*. Group commit can offer
significant speed advantage to a journaling filesystem because it
amortized the cost of writing to disk over many transactions.
* The operations considered to be transactions are listed below:
- create a file;
- delete a file;
- rename a file (including deletion of the existing name);
- change the size of a file (growing or shrinking);
- ... (anything else?)
* When a journaled filesystem reboots, if there are any journal entries
that were not marked as completed, the system must *replay* the entries
to bring the metadata structures up-to-date. Replaying the journal
prevents partial updates because each journal entry is a complete,
self-contained transaction.
* The biggest bottleneck of journaling mechanism is that all transactions
write to a single journal. A single journal effectively forces the
filesystem into a single-threaded model for updates. This is a serious
disadvantage when the concurrent semantics is required. Due to the time
constraints to a summer of code projects the support for multiple
journals will not be implemented under the scope of this project.
II. DELIVERABLES
Mandatory:
* Finishing design document (expected to deliver by the 7th July):
- resolve all open issues;
- probably make several modifications to desing according to mentors
(or community) feedback.
* Implementing journaling core features (expected to deliver by the 31st
July):
- new source files (maybe "jffs.{c,h}"?) in sys/ufs/ffs/ directory
containing the implementations of journaling functions (described
later in this document);
- the set of patches providing modifications to existing FFS and UFS
layers to add hooks for journaling code;
- the set of patches providing modification to the NetBSD
configuration to make it possible to use the journal as an option;
- maybe to provide single cummulative patch containing all the
modifications to make it easier to apply it to the NetBSD source
tree.
* The set of unit tests (expected to deliver by 13th August):
- to ensure that code works correctly;
- to reveal the possible bugs;
- to investigate code behaviour under some sort of boundary cases;
- to analyze the filesystem performance.
Optional:
* Support for batching transactions:
- it may be a significant performance win.
* API Documentation:
- it may be helpful for the developers to understand what the
journaling code does and how to use it.
* Benchmarks:
- to compare the filesystem performance with different mount options:
with journal, with soft updates, without them.
- to compare the FFS performance against the performance of other
journaling filesystems (NB: I don't know whether it makes sense
because these benchmarks can't be performed only within NetBSD due
to absence of journaling filesystems).
* Profiling:
- to explore the code bottlenecks, probably to optimize some parts of
code;
- to investigate code coverage.
III. TECHNICAL DETAILS
* Journal internals:
- The journal area (or log area) used to write journal entries is a
fixed data allocated at filesystem initialization. The filesystem
superblock must maintain a reference to the journal area which also
contains its own superblock where some sort of necessary information
is stored. Two indices (start_index and end_index) that point to the
start and end of the active area of the journal that is used in
circular fashion, simply mark the bounds of the journal that contain
active transactions.
+-------------+------------------------+------------------------+------+
| Journal | | | |
| superblock | t r a n s a c t i o n | t r a n s a c t i o n | |
|+-----------+|+-------+ +-------+ |+-------+ +-------+ | |
||start_index||| | | | || | | | | |
||end_index ||| | | | ... || | | | ... | .... |
|| ... ||| | | | || | | | | |
|+-----------+|+-------+ +-------+ |+-------+ +-------+ | |
+-------------+------------------------+------------------------+------+
Figure 1: Journal area on-disk representation
* Journaling API:
The following ideas inspired from the BeFS textbook (see [2]). Although,
there are only 3 functions for journal management, it may be enough for
the rest part of filesystem to interact with journaling code.
- jffs_start_transaction():
o acquire the journal semaphore, holding it under the transactions
completes;
o ensure that there is enough space available in the journal to hold
this transaction and in case there is - make some preparation
actions and allocate the necessary transaction structures;
otherwise, to force flushing blocks out of the cache, preferably
those that were part of previous transactions;
o set the state of transaction to *running* allowing the filesystem
code to add new blocks to form the transaction structure.
- jffs_write_blocks():
o during a transaction any code that modifies a block of the
filesystem metadata must call this function on the modified data;
o for the sake of performance it may be possible to modify only the
in-memory journal structures and later flush them to the log.
- jffs_end_transaction():
o at first this function turns a transaction into the *locked* state,
meaning that no more block can be added to the transaction;
o write all in-memory transaction blocks to their appropriate places
into the journal area. When the last block is written to the
journal, the transaction is considered to be *finished*;
o set the callback function that will change the transaction state to
*completed* as soon as the journal entry will be completely flushed
to disk;
o release the journal semaphore.
* Journaling constraints on the cache subsystem:
- journaling code must be able to lock disk blocks in the cache to
prevent them from being flushed.
- journaling code must know when a disk block is flushed to disk. It
may be achived with callback functions if cache subsystem supports
them. When the last block forming the transaction is flushed to
disk, the transaction considered to be completed.
IV. OPEN ISSUES
* Journal location: The journal can reside on the same device as the
rest part of file system does, but putting the journal on the
different device may be a performance win. E.g., Seltzer paper [3]
describes the approach when the journal is managed by write-ahead
filesystem (wafs). It may require some additional time to implement
the similar functionality.
* The maximum size of a transaction is to be determined.
The jffs_start_transaction() function uses this information to ensure
that there is enough space available to hold the new transaction.
* The functionality for recovering after the crashes (journal replaying)
need to be clearly defined and added to this document.
* Probably the list of transactions must be extended.
V. ADDITIONAL INFORMATION AND REFERENCES
[1] Marshall Kirk McKusick, Keith Bostic, Michael J. Karels,
John S. Quatermann
"The Design and Implementation of the 4.4 BSD Operating System".
Addison-Wesley Professional, ISBN 0201549794. 1996.
[2] Dominic Giampaolo.
"Practical File System Design with the Be File System".
Morgan Kaufmann Publishers, ISBN 1558604979. 1999.
[3] Seltzer et al.
"Journaling vs. Soft Updates: Asynchronous Meta-data protection in
File Systems".
USENIX Technical Conference. 2000.
[4] The NetBSD Developers. "NetBSD Internals".
Copyright (c) 2006 The NetBSD Foundation.
[5] NetBSD Source Code
http://cvsweb.netbsd.org/bsdweb.cgi/src/sys/