Subject: Disklabel ioctls (was Re: nore on disk ...)
To: Gordon Ross <gwr@mc.com>
From: Jason Thorpe <thorpej@nas.nasa.gov>
List: tech-kern
Date: 11/14/1995 10:15:48
[ Time to change the subject, since the old one only shows off my
  poor typing :-) ]

On Tue, 14 Nov 95 11:15:19 EST 
 "Gordon W. Ross" <gwr@mc.com> wrote:

 > Or do it like SunOS: <sun/dkio.h>
 > 
 > /*
 >  * Disk io control commands
 >  */
 > #define	DKIOCGGEOM _IOR(d, 2, struct dk_geom)	/* Get geometry */
 > #define	DKIOCSGEOM _IOW(d, 3, struct dk_geom)	/* Set geometry */
 > #define	DKIOCGPART _IOR(d, 4, struct dk_map)	/* Get partition info */
 > #define	DKIOCSPART _IOW(d, 5, struct dk_map)	/* Set partition info */
 > #define	DKIOCINFO  _IOR(d, 8, struct dk_info)	/* Get info */

Yah, it would be nice to have these implemented, if only for the sake 
of binary compatibility.

However, I suggested the semantics I did given that we have:

	DIOCGDINFO	get disklabel
	DIOCSDINFO	set in-core disklabel
	DIOCWDINFO	set in-core disklabel, update disk
	DIOCGPART	get partition info; only really useful in kernel

Since our disklabel contains geometry (albeit potentially incorrect if 
misconfigured) and partition info, it seemed like something that just asks 
the disk for its geometry was more in order.  Sort of like a "generic" 
version of a SCSI rigid-geometry mode-sense and identify.  Maybe this 
could be generalized a bit more and provide vendor, model, and revision 
strings, if applicable.
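To make the "just ask the disk" idea a bit more concrete, here is a minimal sketch of what such a generic query might hand back, plus how a SCSI driver could fill in the identification strings from standard INQUIRY data (vendor at bytes 8-15, product at 16-31, revision at 32-35, blank-padded).  struct dk_ident and dk_ident_from_inquiry() are hypothetical names, not an existing interface; the geometry fields would come from mode sense and are simply left to the caller here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical result of a generic "ask the disk" ioctl. */
struct dk_ident {
	uint32_t dki_nsectors;		/* sectors/track, as reported */
	uint32_t dki_ntracks;		/* tracks/cylinder, as reported */
	uint32_t dki_ncylinders;	/* cylinders, as reported */
	uint64_t dki_secperunit;	/* total sectors on the unit */
	char	 dki_vendor[9];		/* NUL-terminated, blanks trimmed */
	char	 dki_model[17];
	char	 dki_revision[5];
};

/* Copy an n-byte blank-padded field into dst (n + 1 bytes), trimming blanks. */
static void
copy_field(char *dst, const unsigned char *src, size_t n)
{
	memcpy(dst, src, n);
	dst[n] = '\0';
	while (n > 0 && dst[n - 1] == ' ')
		dst[--n] = '\0';
}

/*
 * Fill the identification strings from standard INQUIRY data.
 * Returns 0 on success, -1 if the buffer is too short.
 */
int
dk_ident_from_inquiry(struct dk_ident *id, const unsigned char *inq, size_t len)
{
	if (len < 36)
		return -1;
	copy_field(id->dki_vendor, inq + 8, 8);
	copy_field(id->dki_model, inq + 16, 16);
	copy_field(id->dki_revision, inq + 32, 4);
	return 0;
}
```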

Actually, I've been thinking about disklabels lately; that 30-minute 
commute gives me that opportunity.  I've been getting pretty frustrated 
at some of the stupidity in our current disklabel semantics.  I think 
what bugs me the most is the fact that on some devices, notably SCSI, the 
total size of the disk is often a few to several blocks larger than:

	sectors/track * tracks/cylinder * cylinders

Indeed, I have a couple of disk manuals at work that flat out say that 
the disk's reported geometry is _estimated_ based on observed averages.
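The mismatch is easy to put a number on.  A minimal sketch, with made-up figures for a hypothetical drive: 64 sectors/track, 16 tracks/cylinder and 2000 cylinders account for 64 * 16 * 2000 = 2,048,000 sectors, so if the drive reports 2,048,017 sectors in total, 17 blocks are left over that no whole cylinder covers.  geometry_slack() is an illustrative name only.

```c
#include <stdint.h>

/*
 * Sectors the unit reports beyond sectors/track * tracks/cylinder * cylinders.
 * Positive: the drive is larger than its (estimated) geometry claims.
 * Negative: the geometry over-states the drive.
 */
int64_t
geometry_slack(uint32_t nsectors, uint32_t ntracks, uint32_t ncylinders,
    uint64_t secperunit)
{
	uint64_t chs = (uint64_t)nsectors * ntracks * ncylinders;

	return (int64_t)secperunit - (int64_t)chs;
}
```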

Disklabel(8)'s inability to deal with this is a problem.  For example, 
you have a new SCSI disk, you install it, and proceed to install the 
disklabel.  disklabel(8) chokes because partition `c' (or, whichever 
RAW_PART happens to be on your system... *ahem*) runs past end of unit.  
I.e. disklabel(8) did the math, and determined that your absolutely 
correct disklabel is in error.  (Remember: MI SCSI and other drivers set 
the size of RAW_PART and some geometry info in the disklabel if no label 
is read off disk).  So, wanting to label the disk, you shrink `c'.  BZZT!
Can't do that either; "open partition would move or shrink".
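The two refusals above can be modeled roughly like this.  It is a simplified sketch, not the actual disklabel(8) or kernel source: struct dl_label and its fields are hypothetical stand-ins for struct disklabel, the unit size is taken from the geometry product as described above, and the open-partition rule mirrors the "would move or shrink" behavior (an open partition may not change its offset or get smaller).

```c
#include <errno.h>
#include <stdint.h>

#define DL_MAXPART	8	/* hypothetical; the real limit is port-dependent */

struct dl_part {
	uint32_t p_offset;		/* first sector */
	uint32_t p_size;		/* length in sectors */
};

/* Simplified, hypothetical stand-in for struct disklabel. */
struct dl_label {
	uint32_t nsectors, ntracks, ncylinders;
	uint16_t npartitions;
	struct dl_part parts[DL_MAXPART];
};

/* Userland-style check: does partition p run past the geometry-derived end? */
int
past_end_of_unit(const struct dl_label *lp, int p)
{
	uint64_t unit = (uint64_t)lp->nsectors * lp->ntracks * lp->ncylinders;

	return (uint64_t)lp->parts[p].p_offset + lp->parts[p].p_size > unit;
}

/*
 * Kernel-style check when replacing the in-core label: every partition
 * whose bit is set in openmask must keep its offset and must not shrink.
 * Returns EBUSY ("open partition would move or shrink") or 0.
 */
int
check_open_partitions(const struct dl_label *olp, const struct dl_label *nlp,
    uint32_t openmask)
{
	for (int i = 0; i < DL_MAXPART; i++) {
		if ((openmask & (1u << i)) == 0)
			continue;
		if (i >= nlp->npartitions)
			return EBUSY;
		if (nlp->parts[i].p_offset != olp->parts[i].p_offset ||
		    nlp->parts[i].p_size < olp->parts[i].p_size)
			return EBUSY;
	}
	return 0;
}
```

With a hypothetical 64 x 16 x 2000 geometry (2,048,000 sectors) on a drive that really has 2,048,017 sectors, a `c' of the true size fails past_end_of_unit(), and shrinking `c' to 2,048,000 while it is open fails check_open_partitions(), which is exactly the trap described.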

So, the way I bypassed this in the hp300 install.sh is, well, sickening:

	* Increase the number of cylinders by one.  Add a dummy partition
	  (a in this example) of offset 0, size 1 (block).  Label disk, using 
	  `disklabel -r -R /dev/rsd?c labelfile'.
	* Run `disklabel -e /dev/rsd?a'.  Adjust cylinders back to proper
	  value, and adjust the size of `c' so that the math works out.
	* Run `disklabel -e /dev/rsd?c' again and actually edit the
	  partition map to your liking.

I would propose something along these lines:

	* DIOCWDINFO should _not_ return ESRCH in the event there's no
	  label on the disk.  Instead, it should update the disk, just
	  like you asked it to.
	* New ioctl: DIOCRDLABEL - attempt to do a raw read of the
	  label area _without_ actually updating the in-core copy.  This
	  would provide a mechanism for accurately testing for the presence
	  of a disklabel in a machine-independent manner.  Could return
	  ESRCH if there's not a suitable label.  Should be passed the
	  following as an argument:

		struct dioc_rdargs {
			char	drd_errbuf[128];	/* error message */
			struct	disklabel drd_label;	/* fill me in */
		};

	  The error message might be "magic number incorrect",
	  "checksum failure", or "disklabel not present".  The current
	  mechanism for determining the cause of an error leaves a bit
	  to be desired.

	* All knowledge of the on-disk format of the disklabel should
	  be pulled out of disklabel(8) and cursed at.  The on-disk
	  format should be dealt with exclusively in the kernel, with
	  the exception of those ports that require access to the label
	  area to install the boot block.  A program to handle that
	  should live in /usr/mdec.
	* disklabel(8) should be more lenient when slapping one's
	  hands for incorrect RAW_PART size.  In particular, it should
	  probably print a message to stdout if the extra blocks are less
	  than one cylinder, and maybe to stderr if it looks really bad.
	* The kernel _should_ allow an open partition to move or shrink
	  iff the following are true:

		* The only partition open is RAW_PART.
		* Only one flavor (char or block) of that partition is
		  open.
		* The partition is open for writing.
		* The "label is writable" flag for that disk is set.

	  Given the nature of RAW_PART, it's pretty difficult to
	  "move" it, but I think you get the idea :-)

	* Ports (like the i386, where it has to deal with the dos
	  partition foo) should implement the machine-dependent foo
	  in the kernel, with a generic interface:

	  DIOCMACHDEP - machine-dependent disk ioctls, read/write,
	  takes the following argument:

		struct dioc_machdep {
			int	dmd_op;		/* machdep operation */
			void	*dmd_data;	/* pointer to data/buffer */
			size_t	dmd_datalen;	/* size of data/buffer */
		};

	  This is somewhat analogous to how sys_sysarch() works.

	  Any program taking advantage of this interface is inherently
	  machine-dependent, and should be condemned to spend its days in
	  /usr/mdec.
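As a rough illustration of the proposed DIOCRDLABEL semantics, the sketch below validates a raw copy of the label area without touching any in-core label and reports the reason for failure in drd_errbuf.  DISKMAGIC is the real BSD disklabel magic (0x82564557), and the checksum follows the BSD convention that XORing the label's 16-bit words, stored checksum included, yields zero (here simplified to cover the whole reduced struct).  Everything else is assumed: struct xlabel is a much-reduced stand-in for struct disklabel (not its on-disk layout), rdlabel() is a hypothetical kernel-side helper, and the rule separating "disklabel not present" from "magic number incorrect" is invented for the example.

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define DISKMAGIC	0x82564557U	/* BSD disklabel magic number */
#define XL_MAXPART	8		/* hypothetical */

/* Much-reduced, hypothetical stand-in for struct disklabel. */
struct xlabel {
	uint32_t d_magic;
	uint16_t d_npartitions;
	uint16_t d_checksum;
	struct {
		uint32_t p_offset, p_size;
	} d_partitions[XL_MAXPART];
	uint32_t d_magic2;
};

/* Argument block proposed for DIOCRDLABEL. */
struct dioc_rdargs {
	char	drd_errbuf[128];	/* error message */
	struct	xlabel drd_label;	/* fill me in */
};

/* XOR of all 16-bit words; zero for a label with a correct checksum. */
static uint16_t
xl_cksum(const struct xlabel *lp)
{
	const unsigned char *p = (const unsigned char *)lp;
	uint16_t sum = 0, w;

	for (size_t i = 0; i + sizeof(w) <= sizeof(*lp); i += sizeof(w)) {
		memcpy(&w, p + i, sizeof(w));
		sum ^= w;
	}
	return sum;
}

/*
 * Hypothetical DIOCRDLABEL back end: inspect raw label-area bytes,
 * never the in-core label.  Returns 0, or ESRCH with drd_errbuf set.
 */
int
rdlabel(const void *raw, size_t rawlen, struct dioc_rdargs *a)
{
	const char *why = NULL;

	memset(a, 0, sizeof(*a));
	if (rawlen < sizeof(a->drd_label)) {
		why = "disklabel not present";
	} else {
		memcpy(&a->drd_label, raw, sizeof(a->drd_label));
		if (a->drd_label.d_magic == 0 && a->drd_label.d_magic2 == 0)
			why = "disklabel not present";
		else if (a->drd_label.d_magic != DISKMAGIC ||
		    a->drd_label.d_magic2 != DISKMAGIC)
			why = "magic number incorrect";
		else if (xl_cksum(&a->drd_label) != 0)
			why = "checksum failure";
	}
	if (why != NULL) {
		snprintf(a->drd_errbuf, sizeof(a->drd_errbuf), "%s", why);
		return ESRCH;
	}
	return 0;
}
```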
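The more lenient disklabel(8) behavior for an oversized RAW_PART might look something like this minimal sketch.  rawpart_verdict and check_rawpart() are hypothetical names; the one-cylinder threshold comes from the proposal, while treating "looks really bad" as "off by a cylinder or more" is an assumption.

```c
#include <stdint.h>
#include <stdio.h>

enum rawpart_verdict {
	RAWPART_OK,	/* RAW_PART fits inside the stated geometry */
	RAWPART_NOTE,	/* spills over by less than one cylinder: note on stdout */
	RAWPART_WARN	/* spills over by a cylinder or more: warn on stderr */
};

enum rawpart_verdict
check_rawpart(uint64_t rawpart_size, uint32_t secpercyl, uint32_t ncylinders)
{
	uint64_t geom = (uint64_t)secpercyl * ncylinders;
	uint64_t extra;

	if (rawpart_size <= geom)
		return RAWPART_OK;
	extra = rawpart_size - geom;
	if (extra < secpercyl) {
		printf("note: raw partition extends %llu sectors past the "
		    "last full cylinder\n", (unsigned long long)extra);
		return RAWPART_NOTE;
	}
	fprintf(stderr, "warning: raw partition exceeds geometry by %llu "
	    "sectors\n", (unsigned long long)extra);
	return RAWPART_WARN;
}
```

Neither verdict refuses the label here; whether RAWPART_WARN should still be fatal is left open, as it is in the proposal.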
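The four conditions for letting an open RAW_PART move or shrink boil down to a simple predicate.  This sketch assumes per-disk open masks for the character and block devices (one bit per partition), a flag recording that RAW_PART was opened for writing, and the "label is writable" flag; struct dk_openstate and may_resize_rawpart() are hypothetical, and RAW_PART is taken to be 2 (`c') purely for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define RAW_PART	2	/* assumption: `c'; some ports use another partition */

/* Hypothetical per-disk open state. */
struct dk_openstate {
	uint32_t copenmask;	/* partitions open as character devices */
	uint32_t bopenmask;	/* partitions open as block devices */
	bool	 rawpart_wopen;	/* RAW_PART is open for writing */
	bool	 label_writable;	/* the "label is writable" flag */
};

/* True iff an open RAW_PART may be moved or shrunk under the proposed rules. */
bool
may_resize_rawpart(const struct dk_openstate *d)
{
	uint32_t rawbit = 1u << RAW_PART;
	bool chr = (d->copenmask & rawbit) != 0;
	bool blk = (d->bopenmask & rawbit) != 0;

	if ((d->copenmask | d->bopenmask) != rawbit)
		return false;		/* something other than RAW_PART is open */
	if (chr && blk)
		return false;		/* both flavors are open */
	return d->rawpart_wopen && d->label_writable;
}
```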
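A port-side handler for the proposed DIOCMACHDEP ioctl would dispatch on dmd_op, much as sysarch() dispatches on its operation number.  In this sketch the operation codes (DMD_GETMBR, DMD_SETMBR), the 512-byte DOS boot-sector buffer, and machdep_ioctl() are all hypothetical; a real kernel would move the data with copyin()/copyout() and do actual disk I/O rather than memcpy() into a static buffer.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

struct dioc_machdep {
	int	dmd_op;		/* machdep operation */
	void	*dmd_data;	/* pointer to data/buffer */
	size_t	dmd_datalen;	/* size of data/buffer */
};

/* Hypothetical operation codes for an i386-style port. */
#define DMD_GETMBR	1	/* read the DOS boot sector */
#define DMD_SETMBR	2	/* replace the DOS boot sector */

#define MBR_SIZE	512

/* Stand-in for the on-disk sector; a real driver would do disk I/O. */
static unsigned char mbr_sector[MBR_SIZE];

int
machdep_ioctl(struct dioc_machdep *dmd)
{
	switch (dmd->dmd_op) {
	case DMD_GETMBR:
		if (dmd->dmd_datalen < MBR_SIZE)
			return EINVAL;
		memcpy(dmd->dmd_data, mbr_sector, MBR_SIZE);	/* copyout() in a kernel */
		return 0;
	case DMD_SETMBR:
		if (dmd->dmd_datalen != MBR_SIZE)
			return EINVAL;
		memcpy(mbr_sector, dmd->dmd_data, MBR_SIZE);	/* copyin() in a kernel */
		return 0;
	default:
		return EINVAL;		/* unknown operation for this port */
	}
}
```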

Any comments/discussion?  :-)

--------------------------------------------------------------------------
Jason R. Thorpe                                       thorpej@nas.nasa.gov
NASA Ames Research Center                               Home: 408.866.1912
NAS: M/S 258-6                                          Work: 415.604.0935
Moffett Field, CA 94035                                Pager: 415.428.6939