Subject: Re: The demise of DEV_BSIZE
To: Bill Studenmund <wrstuden@nas.nasa.gov>
From: Chuck Silvers <chuq@chuq.com>
List: tech-kern
Date: 10/05/1999 15:14:03
cool. I had a start on this in the UBC branch, but I'm glad someone else
is doing the rest of it. I have a few comments:
1. do we really want to pretend to support non-power-of-two devices at all
in the device interface? we kinda went thru this when this whole subject
came up before, and I thought the consensus was that it wasn't worthwhile.
2. do we really need a (*d_bsize)() ? I recall that there was already a
device ioctl that returns this info... or maybe I was just thinking
that that's how I'd do it eventually.
3. I'm not sure what "swap blocks" are, but I would guess they should
be in pagesize units, since that's how swap space is managed.
4. I'd think the goal should be to get rid of DEF_BSIZE eventually too.
purely software devices like md can define their own constants.
5. perhaps the sector size info in the on-disk disklabel should be
ignored and replaced with the info from the device itself?
there are probably further disklabel implications to all this.
-Chuck
On Tue, Oct 05, 1999 at 02:17:29PM -0700, Bill Studenmund wrote:
> As part of a project here at NAS to test how *BSD systems deal with lots
> of disk, I've had to get NetBSD working with non-512-byte sector disks.
>
> To do this, I've worked up patches based on Koji Imada's third proposal
> (PR 3972) and with comments from this list the last time this topic was
> brought up. I want to thank Leo for giving me near-current patches for
> Koji's 3rd proposal.
>
> I'll repeat what I understand of his proposal here (so y'all can
> understand me even if I misunderstood the proposal :-) :
>
> Koji's third proposal: the block numbers in struct buf will be in units of
> the natural block size of the media. So on a 0.5 K sector device, they are
> in 512-byte blocks. On a 1 K device, they are in 1 K blocks. All routines
> which need to worry about block size will just deal with whatever size the
> media possesses. Also (and this is the difference from his 1st proposal),
> filesystems should be able to deal with the filesystem being on a
> different block size device than the one on which it was made. So say I
> have a filesystem made on a 512 byte device, I can dd it to a 1 K sector
> device, and it will just work.
>
> I also wanted to support media with a sector size which isn't a power of
> two. The i/o system should support it, but filesystems don't necessarily
> have to support non-power-of-2 sectors.
>
> What I've done: block numbers in struct buf are now in blocks on the media
> - the "natural" media size. ffs has been adjusted so that it will work as
> long as there's only one filesystem block (fragment, actually) per disk
> block. So I can take an 8k/1k ffs from a 512-byte disk to a 1 K byte disk,
> but not to a 2 K byte disk. Supporting more than one data block (ffs frag)
> per disk block would be hard. I've not touched msdosfs or cd9660fs with
> respect to this, so the diffs are whatever Koji & Leo have done. :-)
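
The ffs constraint described above (no more than one fragment per disk block) can be sketched as a small predicate. This is only an illustration under stated assumptions: the name ffs_frag_fits is hypothetical, and the requirement that the fragment size be a whole multiple of the sector size is assumed rather than taken from the patch.

```c
#include <assert.h>

/*
 * Hypothetical check: can a filesystem with fragment size `fsize`
 * live on a device whose natural block size is `secsize`?
 * Returns 1 if every fragment covers one or more whole disk blocks,
 * 0 otherwise.
 */
static int
ffs_frag_fits(int fsize, int secsize)
{
    assert(fsize > 0 && secsize > 0);

    /* A disk block larger than a fragment would hold several fragments. */
    if (secsize > fsize)
        return 0;

    /* Assumption: a fragment must span a whole number of disk blocks. */
    return (fsize % secsize) == 0;
}
```

With the 8k/1k filesystem from the mail, the 1 K fragment fits on 512-byte and 1 K disks but not on a 2 K disk.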
>
> I've also changed DEV_BSIZE & DEV_BSHIFT to DEF_BSIZE & DEF_BSHIFT.
> Unfortunately I can't just delete them yet... :-(
>
> The btodb and dbtob macros have changed. They now take a shift and size
> parameter. They are:
>
> #define dbtob(x, sh, bks) ((sh) ? ((x) << (sh)) : ((x) * (bks)))
> #define btodb(x, sh, bks) ((sh) ? ((x) >> (sh)) : ((x) / (bks)))
>
> x is the value to be shifted, sh is the device's shift value, and bks
> is the block size in bytes. For a power of 2 block size, sh is the log
> base 2 of the block size. So for 512-byte blocks, sh is 9. For 1 K
> sectors, it's 10, etc. So if the device's block size is a power of 2 (most
> of them), these macros keep shifting. We only multiply and divide if the
> block size isn't a power of 2. This feature is important as dividing is
> always slow, and a number of our architectures have to use a math
> subroutine for division, which is even slower.
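
To make the two code paths concrete, here is a minimal sketch that wraps the macros exactly as quoted. The helper names blk_to_bytes and bytes_to_blk are hypothetical, and the sketch assumes (as the macros imply) that sh is 0 whenever the block size is not a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* The macros as quoted above: shift when sh is non-zero, else multiply/divide. */
#define dbtob(x, sh, bks) ((sh) ? ((x) << (sh)) : ((x) * (bks)))
#define btodb(x, sh, bks) ((sh) ? ((x) >> (sh)) : ((x) / (bks)))

/* Hypothetical helper: byte offset of disk block `blkno`. */
static int64_t
blk_to_bytes(int64_t blkno, int sh, int bks)
{
    assert(blkno >= 0 && sh >= 0 && bks > 0);
    return dbtob(blkno, sh, bks);
}

/* Hypothetical helper: disk block containing byte offset `nbytes`. */
static int64_t
bytes_to_blk(int64_t nbytes, int sh, int bks)
{
    assert(nbytes >= 0 && sh >= 0 && bks > 0);
    return btodb(nbytes, sh, bks);
}
```

For a 512-byte disk (sh 9), blk_to_bytes(3, 9, 512) takes the shift path. For a 520-byte disk, sh is 0 and blk_to_bytes(2, 0, 520) falls back to the multiply.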
>
> Both character and block devices have gained a new function call, d_bsize:
>
> void (*d_bsize) __P((dev_t dev, int * bshift, int * bsize));
>
> which fills in the bshift and bsize values for a device. bshift == -1
> indicates that the device isn't configured.
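
As an illustration of what a d_bsize entry point might look like, here is a rough sketch for an imaginary "xx" driver. Everything except the three-argument signature and the bshift == -1 convention is an assumption: struct xx_softc, xx_units, size_to_shift, the use of dev directly as a unit index, and returning bsize 0 for an unconfigured unit are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>  /* dev_t */

/* Hypothetical per-unit state; a real driver would keep this in its softc. */
struct xx_softc {
    int sc_configured;  /* non-zero once the unit has attached */
    int sc_secsize;     /* sector size in bytes reported by the media */
};

static struct xx_softc xx_units[] = {
    { 1, 512 }, { 1, 2048 }, { 0, 0 }, { 1, 520 },
};

/* log2(size) for a power of two, else 0 so callers multiply/divide instead. */
static int
size_to_shift(int size)
{
    int sh = 0;

    if (size <= 0 || (size & (size - 1)) != 0)
        return 0;
    while ((1 << sh) < size)
        sh++;
    return sh;
}

/* Sketch of a d_bsize routine: fill in the shift and size for `dev`. */
static void
xx_bsize(dev_t dev, int *bshift, int *bsize)
{
    /* Assumption: dev is the unit number; real code would decode the minor. */
    size_t unit = (size_t)dev;
    size_t nunits = sizeof(xx_units) / sizeof(xx_units[0]);

    assert(bshift != NULL && bsize != NULL);
    if (unit >= nunits || !xx_units[unit].sc_configured) {
        *bshift = -1;   /* device isn't configured */
        *bsize = 0;     /* assumption: size is meaningless in this case */
        return;
    }
    *bsize = xx_units[unit].sc_secsize;
    *bshift = size_to_shift(*bsize);
}
```

A caller such as checkalias could then cache the two values, and would see bshift == -1 for the unconfigured unit 2 in this table.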
>
> struct specinfo has gained two new fields, si_bshift and si_bsize. They
> cache the block size info for the relevant device. They are initialized in
> checkalias when the new struct specinfo is being generated for the device
> node.
>
> struct mount has also gained shift & size fields (mnt_bshift & mnt_bsize),
> which reflect the values for the underlying device. The mount routines
> will now do a validity check on the device to make sure the filesystem is
> happy with the block size.
>
> physio has grown two additional parameters, for the block shift and block
> size values. The readdisklabel and writedisklabel routines have also
> gained shift and size values.
>
> I have modified the sd, cd, wd, and fd drivers to support these changes.
> For the moment, wd is using WD_DEF_BSIZE as I wasn't sure what to do with
> it at the time I made the change. The md driver uses DEF_BSIZE. The fd
> driver's support of the partition encoding the density has been extended
> so that it (on i386) can also encode the sector size. With changes to the
> format table, we should be able to support 256 byte or 1024 byte floppies
> (do they exist?).
>
> Open issues:
>
> We can't totally get rid of DEF_BSIZE. In addition to a few cases where we
> really need a DEF_BSIZE (md and memory disks come to mind - there's no
> underlying block size from which to determine values), there are a number
> of other uses layered on top of it. For instance, UFS keeps track of
> "blocks" allocated to a file in units of DEV_BSIZE. I've changed this to
> UFS_BSIZE & UFS_BSHIFT. ufs quotas are in the same unit.
>
> lfs is sprinkled with DEV_BSIZE. I changed them to DEF_BSIZE for now, but
> this needs fixing. Does struct lfs reflect the on-disk "superblock"? The
> problem I ran into is that it doesn't have fields for disk size (that I
> saw), and since it lacks a pointer to struct mount (which has disk block
> size info), it's hard for all the routines which are passed a struct lfs *
> to get the disk block size right.
>
> Swap "blocks" are in DEF_BSIZE units. Does that need to change?
>
> vnd, raidframe, and ccd haven't been updated to reflect these changes. I
> think that both raidframe and ccd should only aggregate like-sized devices.
> vnd obviously needs to be able to change block sizes.
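
The "like-sized devices" rule for ccd and raidframe could be enforced with a check along these lines when the components are configured. This is a sketch only: common_bsize is a hypothetical name, and the real drivers would read each component's block size from wherever it ends up being cached.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: given the block sizes of all components, return
 * the shared block size, or -1 if the list is empty or the sizes differ.
 */
static int
common_bsize(const int *bsizes, size_t n)
{
    size_t i;

    assert(bsizes != NULL || n == 0);
    if (n == 0)
        return -1;
    for (i = 1; i < n; i++) {
        if (bsizes[i] != bsizes[0])
            return -1;  /* mixed sector sizes: refuse to aggregate */
    }
    return bsizes[0];
}
```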
>
> So far only i386 has been fully changed. I've changed the disklabel entry
> points for other ports, but I'm not sure if I got all the calls to
> auxiliary disklabel routines.
>
> Other disk drivers need work, like rd, rz, xy, & xd. Are there others?
>
> Should tape drives do anything with block size? I've done nothing as I'm
> not exactly sure what we should do, nor how to do it (say in the face of
> variable block size tapes).
>
> disklabel writing needs work in that we shouldn't accept a disklabel whose
> block size we know doesn't match the device's. i.e. for sd & cd drives, we
> can query the device to see what its block size is. We shouldn't let you set
> a disklabel with a different block size. But on devices where we can't
> query the block size (I think xy, xd, rd, and non-ata wd), we need to be
> able to set the block size in the disklabel as it is the authority on the
> block size. :-) Also, if the block size of a drive changes (either we
> write a new disk label or we note that a probe-able device reports a
> different sector size), we need to update existing device nodes. Should we
> vgone them, or just update the size fields in their struct specinfo? I think
> vgone.
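
The disklabel policy described above might be sketched as follows. The names here are hypothetical, as is the convention that a hardware sector size of 0 means "the device cannot be queried".

```c
#include <assert.h>

enum label_verdict { LABEL_OK, LABEL_REJECT };

/*
 * Hypothetical policy check for writing a disklabel.
 * hw_secsize:    sector size reported by the device, or 0 if the device
 *                cannot be queried (assumed convention).
 * label_secsize: sector size recorded in the label being written.
 */
static enum label_verdict
check_label_secsize(int hw_secsize, int label_secsize)
{
    assert(hw_secsize >= 0);

    if (label_secsize <= 0)
        return LABEL_REJECT;    /* nonsensical label */
    if (hw_secsize == 0)
        return LABEL_OK;        /* label is the authority on block size */
    return (hw_secsize == label_secsize) ? LABEL_OK : LABEL_REJECT;
}
```

An sd or cd driver would pass the size it queried from the drive. A driver that cannot query would pass 0 and accept whatever the label says.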
>
> My current thought is to make these diffs (which I'm still assembling)
> into a branch. We should be able to merge them in fairly soon. :-)
>
> I have a system with both 512 and 2048 byte sector disks in it, and I've
> simultaneously used filesystems on devices of both sizes. :-)
>
> Thoughts? I think I covered everything I've done.
>
> Take care,
>
> Bill