Subject: Re: Testing of sgivol etc.
To: Havard Eidnes <he@netbsd.org>
From: Scott G. Akmentins-Taylor <staylor@mrynet.com>
List: port-sgimips
Date: 11/10/2001 17:15:27
Hi, Håvard
I discovered this as well. What I've found is that using
"sgivol -i" is a Very Bad thing. With the disklabel mods made
to the kernel and libutil, it is completely unnecessary.
I just do the following for initialisation:
# disklabel -i sd1
(create the disklabel; it will write both the BSD and
the SGI disklabels)
# ./sgivol -w boot boot sd1
After this, the drive is completely writable, and bootable once
populated.
I have ALSO discovered that it is nearly impossible to rectify the
drive's problem after ONLY the SGI disklabel has been written --
i.e. no BSD disklabel exists. I had to hack together a kernel
that would not try to use the SGI disklabel if no BSD label was
found (the alternative being to boot a pre-disklabel-mod kernel
and use dd(1) to overwrite the disk blocks from /dev/zero).
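The label-selection change in that hacked kernel might look roughly like the sketch below. It is only an illustration: `choose_label()`, `enum label_source` and the `ignore_sgi_without_bsd` switch are hypothetical names, not the actual kernel code, and the sketch models only the decision, not the reading of either label from disk.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical: where the in-core label comes from. */
enum label_source { LABEL_BSD, LABEL_SGI, LABEL_DEFAULT };

/*
 * Hypothetical decision logic.  With ignore_sgi_without_bsd false this
 * models the behaviour described above (a disk carrying only an SGI
 * volume header gets its in-core label from that header); with it true
 * it models the hacked kernel, which falls back to a default label.
 */
static enum label_source
choose_label(bool have_bsd, bool have_sgi, bool ignore_sgi_without_bsd)
{
	if (have_bsd)
		return LABEL_BSD;	/* a BSD label always wins */
	if (have_sgi && !ignore_sgi_without_bsd)
		return LABEL_SGI;	/* stock behaviour: trust the SGI header */
	return LABEL_DEFAULT;		/* hack: ignore a lone SGI header */
}
```

With the switch set, a disk holding only an SGI header ends up with a default label, which leaves room to write a proper BSD label over it.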
Let me know if this isn't clear or if I can help you get around
the problem.
Please let me know if you have further insight or better means
for dealing with this.
-scott
> Hi,
>
> I just took some time to install a new disk on my SGI Indigo2,
> together with some more memory. I upgraded to today's kernel, and
> decided to test the new "sgivol" utility, and the first two steps
> worked as advertised:
>
> viola# ./sgivol sd1
> No SGI volumn header found, magic=6c6c6c6c
> viola# ./sgivol -i sd1
> disklabel shows 35843670 sectors
> checksum: 00000000
> root part: 0
> swap part: 1
> bootfile:
>
> Volume header files:
>
> SGI partitions:
> 0:a blocks 35840535 first 3135 type 7 (EFS)
> 8:i blocks 3135 first 0 type 0 (Volume Header)
> 10:k blocks 35843670 first 0 type 6 (Volume)
>
> Do you want to update volume (y/n)? y
> viola#
>
> However, this one didn't:
>
> viola# ./sgivol sd1
>
> On the console appeared:
>
> sd1: no disk label
> sd1: no disk label
> Stopped in pid 1245 (sgivol) at 0x8811bd4c: mfhi v0
> db>
>
> At this point the machine is still running diskless, of course. Some
> digging resulted in identification of where it crashed; here's the DDB
> trace with the subroutines identified by running gdb on the image
> afterwards:
>
> db> trace
> 8811bc68+e4 (200,5072df,a12,8ba75fd8) ra 880211a4 sz 32
> sdstrategy
> 0x8811bd4c <sdstrategy+228>: mfhi $v0
> 88020f40+264 (8811bc68,5072df,a12,100000) ra 8811c408 sz 80
> physio
> 8811c3d8+30 (8811bc68,d2d27e78,a12,100000) ra 88069ed4 sz 32
> sdread
> 88069dac+128 (8811bc68,d2d27e78,a12,100000) ra 880bcb20 sz 96
> spec_read
> 880bcaa4+7c (8811bc68,d2d27e78,a12,100000) ra 88060910 sz 24
> nfsspec_read
> 88060804+10c (8811bc68,d2d27e78,d2d27e78,100000) ra 88036f44 sz 64
> vn_read
> 88036e80+c4 (8811bc68,d2d27e78,d2d27e78,100003e0) ra 88036e64 sz 96
> dofileread
> 88036dc0+a4 (8811bc68,d2d27e78,d2d27e78,100003e0) ra 880f9b74 sz 56
> sys_read
> 880f9964+210 (8811bc68,d2d27e78,d2d27e78,100003e0) ra 8800305c sz 80
> syscall_plain
> mips3_SystemCall+b0 (8811bc68,d2d27e78,d2d27e78,100003e0) ra 3010c080 sz 0
> PC 0x3010c080: not in kernel space
> 0+3010c080 (8811bc68,d2d27e78,d2d27e78,100003e0) ra 0 sz 0
> User-level: pid 1245
> db>
>
> The offending line of code appears to be
>
> if (lp->d_secsize == DEV_BSIZE) {
> sector_aligned = (bp->b_bcount & (DEV_BSIZE - 1)) == 0;
> } else {
> >>> sector_aligned = (bp->b_bcount % lp->d_secsize) == 0; <<<
> }
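A guard against the zero divisor could look like this minimal sketch. `sector_aligned()` is a hypothetical stand-alone helper that takes `b_bcount` and `d_secsize` as plain integers rather than the real `struct buf` and `struct disklabel`; it keeps the test from the quoted fragment but treats a zero sector size as misaligned instead of dividing by it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEV_BSIZE 512	/* NetBSD's DEV_BSIZE */

/* The mask test below is only valid for a power-of-two DEV_BSIZE. */
static_assert((DEV_BSIZE & (DEV_BSIZE - 1)) == 0,
    "DEV_BSIZE must be a power of two");

/*
 * Hypothetical helper: same check as the quoted sdstrategy() code,
 * but an uninitialised (zero) sector size is reported as "not
 * aligned" so the request can be failed instead of trapping.
 */
static bool
sector_aligned(uint32_t bcount, uint32_t secsize)
{
	if (secsize == 0)
		return false;	/* missing label: never divide by zero */
	if (secsize == DEV_BSIZE)
		return (bcount & (DEV_BSIZE - 1)) == 0;
	return (bcount % secsize) == 0;
}
```

With the values from the register dump below (`a0` = 0x200 as the dividend, `v1` = 0 as the divisor), `sector_aligned(0x200, 0)` returns false, so the request would presumably be failed rather than reaching `break 0x7`.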
>
> I *think* lp->d_secsize is either initialized to 0 or read from the
> disk.
>
> The disassembly for the marked line above appears to be:
>
> 0x8811bd34 <sdstrategy+204>: lw $a0,48($s0)
> 0x8811bd38 <sdstrategy+208>: nop
> 0x8811bd3c <sdstrategy+212>: divu $zero,$a0,$v1
> 0x8811bd40 <sdstrategy+216>: bnez $v1,0x8811bd4c <sdstrategy+228>
> 0x8811bd44 <sdstrategy+220>: nop
> 0x8811bd48 <sdstrategy+224>: break 0x7
> 0x8811bd4c <sdstrategy+228>: mfhi $v0
>
> and sure enough, v1 is zero:
>
> db> show reg
> ...
> v1 0
> a0 0x200
> ...
>
>
> I decided that the problem was the missing or uninitialized disk
> label, and after some failed attempts I managed to wedge one in place.
> This could not be done through an operation which would try to read
> the missing disklabel, as that would hit the above problem as well, so
> I ended up modifying a proto-file from one of my other systems and
> doing
>
> # disklabel -R -r sd1 new-label.sd1
>
> whereafter the label became sufficiently initialized that I could
> proceed with tuning the contents of the disk label.
>
> The root cause for the problem may be insufficient provision of
> default values for the in-core disklabel when the label on the disk is
> missing.
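One way to provide such defaults, sketched under assumptions: `struct mini_label` is a hypothetical cut-down stand-in for `struct disklabel` (only the field names are borrowed), `default_label()` is a hypothetical function, and the 32 sectors / 64 tracks fallback geometry is arbitrary. The only point is that every field later used as a divisor ends up non-zero.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEV_BSIZE 512	/* NetBSD's DEV_BSIZE */

/* Hypothetical subset of the fields of struct disklabel. */
struct mini_label {
	uint32_t d_secsize;	/* bytes per sector */
	uint32_t d_nsectors;	/* sectors per track */
	uint32_t d_ntracks;	/* tracks per cylinder */
	uint32_t d_ncylinders;	/* cylinders on the unit */
	uint32_t d_secpercyl;	/* sectors per cylinder */
	uint32_t d_secperunit;	/* total sectors on the unit */
};

/*
 * Hypothetical: build an in-core label from whatever the drive
 * reports.  Any geometry value passed as 0 ("unknown") is replaced
 * by a fallback so that no divisor is left at zero.
 */
static void
default_label(struct mini_label *lp, uint32_t dev_secsize,
    uint32_t dev_totsec, uint32_t dev_nsectors, uint32_t dev_ntracks)
{
	memset(lp, 0, sizeof(*lp));
	lp->d_secsize = dev_secsize ? dev_secsize : DEV_BSIZE;
	lp->d_nsectors = dev_nsectors ? dev_nsectors : 32;	/* arbitrary */
	lp->d_ntracks = dev_ntracks ? dev_ntracks : 64;		/* arbitrary */
	lp->d_secpercyl = lp->d_nsectors * lp->d_ntracks;
	lp->d_secperunit = dev_totsec;
	lp->d_ncylinders = dev_totsec / lp->d_secpercyl;
	assert(lp->d_secsize != 0 && lp->d_secpercyl != 0);
}
```

For the disk above, `default_label(&l, 0, 35843670, 0, 0)` gives a 512-byte sector size and 2048 sectors per cylinder.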
>
>
> Regards,
>
> - Håvard