On Tue, Dec 09, 2008 at 12:12:32PM -0800, Jason Thorpe wrote:
>
> On Dec 9, 2008, at 11:00 AM, Manuel Bouyer wrote:
>
>> But I don't get why we want to have the journal on stable storage
>> *now*.
>> From what I understand, what we want is to have the journal entry
>> for this transaction to stable storage *before* the metadata changes
>> start to hit stable storage, and the journal cleanup hits stable
>> storage *after* all metadata changes are on stable storage. I can't
>> see what FUA brings us here, as long as writes to stable storage are
>> properly ordered.
>
> You want your journal to remain self-consistent, otherwise you can't
> trust it to replay it.

The performance part of the argument is separate from that, and is the
reason that using just ordering constraints is not "enough", in the
sense Manuel was after.

Ordering constraints would produce a self-consistent journal.  However,
using ordered journal writes would also force other (unrelated) pending
unordered writes out as well, adding latency to the journalling.

The journalling is already an overhead penalty (more total writes than
would otherwise be required if dependencies could be fully ordered) and
needs to jump the queue if it is to offer a latency reduction (a
reduced number of writes needed for *this* transaction to be
acknowledged to upper layers).

> Also, without explicit cache management, you can't be sure when the
> data gets written out of the drive's cache.  Again, command
> completion has nothing to do with how the writes are ordered to the
> oxide.

And getting to oxide specifically, versus other suitable stable storage
like NVRAM, has nothing to do with journalling, until you're at the
point of caring about visibility differences between the cache and the
oxide -- as clearly you are in the cluster case.  Even in the cluster
case, you're less concerned about oxide (if there even is any) than
you are about shared stable storage.

With a volatile cache, you know you need to get past it.  With a fast
non-volatile cache in the right place, you can get great performance
for fully ordered writes and forgo journalling entirely.  Trying to
design something generic for the mix of conditions in between gets us
into discussions like this :-)

The trouble is really knowing enough about the IO topology (of which
there may be several layers, each with their own cache - host
controller, SAN RAID controller, external JBOD disk, etc.) to be sure
what you're getting on completion: are the write caches volatile?  Are
there differing visibility or fault domains involved at each layer?
How far do you need to get through these layers before you have
"enough" commitment for your specific needs, and at what cost?

ZFS had lots of performance trouble with some SAN and external RAID
systems at one stage, because it was treating them like local disks
over whose cache contents it had full control, and issuing syncs to
complete and close one transaction in a way that could force cache
flushes for many other unrelated writes: writes belonging to future
transactions, committed-to-NVRAM writes from previous transactions, or
even totally unrelated activity in other filesystems.

That's been fixed through a mix of clearing up interpretation
differences about what is needed for stable storage and what cache
flushing does, and most importantly by allowing the admin and system
designer to be explicit about the IO topology, using dedicated read
and write cache devices that ZFS can manage directly.

--
Dan.
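
[For concreteness, the ordering Manuel describes above amounts to
something like the following user-level sketch, with fsync(2) standing
in for "reaches stable storage"; a real journalling layer would use
FUA writes or driver-level cache-flush commands instead.  The function
and record layout here are made up purely for illustration.]

#include <sys/types.h>
#include <unistd.h>

/*
 * Sketch of one journalled transaction:
 *   1. journal entry stable *before* any metadata is written
 *   2. metadata written and made stable
 *   3. journal cleanup record written *after* all metadata is stable
 */
static int
commit_transaction(int journal_fd, int data_fd,
    const void *jrec, size_t jlen,
    const void *meta, size_t mlen, off_t moff)
{
	/* 1. Journal entry must be stable before metadata hits disk. */
	if (write(journal_fd, jrec, jlen) != (ssize_t)jlen)
		return -1;
	if (fsync(journal_fd) == -1)
		return -1;

	/* 2. Now the metadata changes themselves may be written. */
	if (pwrite(data_fd, meta, mlen, moff) != (ssize_t)mlen)
		return -1;
	if (fsync(data_fd) == -1)
		return -1;

	/* 3. Only after that may the cleanup record go out. */
	static const char cleanup[] = "TXN-DONE";
	if (write(journal_fd, cleanup, sizeof(cleanup)) !=
	    (ssize_t)sizeof(cleanup))
		return -1;
	return fsync(journal_fd);
}

[Each fsync() here is exactly the contested cost: it can force the
drive to flush its entire write cache, including the unrelated pending
writes discussed above, whereas FUA applied to just the journal blocks
would not.]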