Subject: Re: Journaling for FFS
To: None <tech-kern@NetBSD.org>
From: Cary G. Gray <Cary.G.Gray@wheaton.edu>
List: tech-kern
Date: 10/05/2006 13:05:08
The "log near the inodes" idea misses one of the advantages of
journalling, which is to improve the locality of writes by concentrating
them in the log. Journalling just the metadata wins, because that's where
lots of small, carefully ordered, writes can be required.
Working out how to journal isn't trivial. Log analysis has to find the
last complete log record, and modern disk interfaces are likely to write
the blocks of even a large contiguous write out of order. So there is
considerable care required...
Fifteen years ago I worked for DEC, and I was in the middle of a project
that included adding metadata journalling to the FFS. We used Bob
Hagmann's paper on the Cedar File System as our roadmap. I left DEC
before the work was completed; Uresh Vahalia finished implementation of
the filesystem code and wrote a nice paper for USENIX.
Here are the references:
Robert B. Hagmann, Reimplementing the Cedar File System Using Logging and
Group Commit, Proc. 11th ACM Symposium on Operating Systems Principles,
pp. 155-162, Austin, TX, 1987, ACM.
(You can find it in the ACM Digital Library or through CiteSeer at
http://citeseer.ist.psu.edu/hagmann87reimplementing.html)
Uresh Vahalia, Cary G. Gray, and Dennis Ting. Metadata logging in an NFS
server. In Proceedings of the Winter 1995 USENIX Technical Conference,
pp. 265-76, New Orleans, LA, January 1995. USENIX.
(See
http://www.usenix.org/publications/library/proceedings/neworl/vahalia.html
for a link to it in PostScript.)
Hagmann's paper is especially good on log management and analysis of
possible failures, including what happens if you fail in the middle of
writing a log record. The way that bits get to a disk has changed a
lot in twenty years; so Hagmann's analysis would not apply, but he does
provide a good example of careful analysis of assumptions.
The approach we took was mostly-physical redo logging, as in Hagmann.
The biggest complication from the FFS data structures is that some blocks
change between being metadata and regular file contents--those containing
directory data and indirect blocks. Those transitions must somehow be
dealt with; what I figured out back then was a way to do it via multiple
passes during replay, one way for each level of indirection.
It would probably be better to log the transitions of blocks to/from
metadata status, so that decision about whether to replay modificiations
can be based on log analysis alone. If you do something like Hagmann's
transaction header/trailer record scheme, you can leave room in those
records to record the changes--you probably need to pad to keep blocks
aligned, anyway.
Doing so would also allow you to log and lazily reclaim the blocks pointed
to by indirect blocks: you can log to indicate that a particular
nth-indirect block should be freed. The actual freeing of the blocks
pointed to by it can be recorded later along with the actual freeing of
the block. That spares you from having huge transaction records for
delete/truncate--you end up with some number of later transactions to
handle the freeing. Recovery would need to enqueue any intend-to-free
indirect blocks it finds that haven't completed so that the freeing could
be completed.
Now that I think about it, such an "intent to free" record could be one
way to handle the removed-but-still-open inode problem. You can record
its removal from the directory and an "intend-to-free" for the inode. When
it is finally closed, you can then either synchronously log it to be freed
(including logging freeing the blocks it points to), or you could enqueue
it for lazy reclamation (possibly piggybacked on another transaction).
If a crash (or forced remount readonly, etc.) happens before it is
reclaimed (i.e., before the actual free is written to the log),
recovery--whenever it runs--would handle the actual freeing of the inode
and any associated blocks.
The mostly-physical logging scheme you get in this way includes logical
(operation) logging of just a few important state transitions for blocks
and inodes: allocation (as data), allocation (as metadata) [both of which
might include zeroing], intent-to-free (for inodes and blocks of indirect
pointers), freeing. I'd have to spend some time working on the details to
tell whether that list is complete...
As you can probably tell, I've wanted to get back to putting these ideas
into an implementation, but (as right now) I always seem to need to be
taking care of one of my classes.
Cary Gray