Swift Griggs <swiftgriggs%gmail.com@localhost> writes:

> I'm curious about something, probably due to ignorance of the full
> dynamics of the vfs(9) layer. Why is it that folks don't choose file
> system block sizes and partition offsets that are least-common-factors
> that they share with the hardware layer. Ie.. Let's say the hard disk
> uses 4K pages, the file system uses 8K blocks, and the vendor
> recommends that you stay aligned with a 1GB value. Wouldn't operating
> on 8K blocks still satisfy the underlying device (since 8K operations
> would always be divisible by a factor of 4K) and the 1GB alignment may
> not always be perfect, but the 8K ops below it would eventually stack
> to 1GB exactly, too.

Good questions, and it boils down to a few things:

  - Many devices don't have a way to report their underlying block
    sizes.  For example, if you buy a 2T spinning disk, it will very
    likely be one whose sectors are actually 4K but whose interface
    presents 512B sectors.  Reads are fine, because the drive pulls the
    whole 4K sector into its cache and hands you the piece you want.
    But if you write a 512-byte sector, the drive has to
    read-modify-write.  Worse, if you write 4K or 8K but not lined up
    (which you will if your fs has 8K blocks but the partition starts
    at sector 63), it has to read-modify-write 2 physical sectors per
    write.  (The sketch at the end of this message shows the
    arithmetic.)

  - SSDs are even harder to figure out, as Andreas's helpful references
    in response to my question show.

  - Filesystems sometimes get moved around, and higher up the stack
    it's even more disconnected from the actual hardware.

So there are two issues, alignment and filesystem block/frag size, and
both have to be ok.  For larger disks, UFS uses larger block sizes by
default (man newfs), so that part is ok, but alignment is messier.
We're now seeing smaller disks with 4K sectors, or larger flash erase
blocks, behind 512B interfaces.

There are also disks with native 4K sectors, where the interface to the
computer transfers 4K chunks.  That avoids the alignment issue, but
requires filesystem/kernel support.  I am pretty sure netbsd-7 is ok
with that, but I am not sure about earlier releases.

It would probably be possible to add a call into drivers to return this
info, propagate it up, and have newfs/fdisk query it.  I am not sure
that all disks report the info reliably, and there are probably a lot
of details.  But it's more work and doesn't necessarily do better than
"just start at 2048 and use big blocks".  Certainly you are welcome to
read the code and think about it if this interests you - I'm just
explaining why I think no one has done the work so far.

> Is it all about waste at the file system layer due to some block
> operations being optimized for large devices and buffers but not being
> as applicable (or being downright wasteful) on smaller block devices?

I guess you can put it that way, in that otherwise we would always
start at 2048 and use 32K or even 64K blocks.  But I think part of it
is inertia.  And the traditional start at sector 63 dates back to
disks whose geometry had 63-sector tracks - so it was actually aligned
back then.
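
To make the arithmetic concrete, here is a minimal sketch (nothing
NetBSD-specific, and the 512/4096/8192 figures are just the example
numbers from above) of why a partition starting at logical sector 63
forces read-modify-writes on a 4K-sector drive with a 512B interface,
while a start of 2048 lines up cleanly:

    /*
     * Alignment sketch: a filesystem block is only written without a
     * read-modify-write if the partition's byte offset is a multiple
     * of the physical sector size.
     */
    #include <stdio.h>

    #define LOGICAL_SECTOR   512u   /* what the interface presents */
    #define PHYSICAL_SECTOR  4096u  /* what the media actually uses */
    #define FS_BLOCK         8192u  /* filesystem block size */

    static void
    check(unsigned start_sector)
    {
        unsigned long long start_bytes =
            (unsigned long long)start_sector * LOGICAL_SECTOR;

        if (start_bytes % PHYSICAL_SECTOR == 0)
            printf("start %u: aligned; %u-byte fs blocks cover whole "
                "%u-byte physical sectors\n",
                start_sector, FS_BLOCK, PHYSICAL_SECTOR);
        else
            printf("start %u: misaligned by %llu bytes; every %u-byte "
                "write straddles physical sectors (read-modify-write)\n",
                start_sector, start_bytes % PHYSICAL_SECTOR, FS_BLOCK);
    }

    int
    main(void)
    {
        check(63);      /* traditional MBR start */
        check(2048);    /* "just start at 2048" */
        return 0;
    }

The same check explains the general rule: any start sector that is a
multiple of 8 (4096/512) keeps 4K-or-larger filesystem blocks aligned,
and 2048 also happens to be 1 MiB, which divides evenly into the larger
erase-block sizes people worry about on flash.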