Re: fses with multiple entry points (was: Re: altroot)
On Sat, Nov 16, 2024 at 09:38:17PM +0000, David Holland wrote:
> On Wed, Nov 13, 2024 at 10:46:32PM +0100, Reinoud Zandijk wrote:
> > Speaking of which, does NetBSD VFS code grok multiple vnodes on a
> > device marked with VV_ROOT that are not connected to each other by
> > directories? I'd like to use that feature in my SFS code to
> > represent multiple mount points on a single device. I haven't tried
> > it out but at first glance it does seem to be legal.
> >
> > Has anyone tried this before?
>
> We don't support more than one mount point per filesystem volume (in
> the sense that struct mount is both the mount point and the anchor for
> the global volume state like the in-memory superblock and such)... so
> if you want to have multiple mount points they end up being separate
> volumes as far as the VFS logic is concerned. They'll need to share
> any joint state under the covers. This means each one will have its
> own vnodes, and therefore its own distinct root vnode. Each of those
> root vnodes belongs to its own struct mount, and nothing unusual
> happens. So it'll all work in that sense. Ultimately this is what
> nullfs does.
That is what I have currently implemented: on first mount it creates an extra
anonymous mountpoint to hold the system files and the overall state. Multiple
RO and RW mounts work as expected, with vnodes on their own mount points.
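Concretely, the sharing "under the covers" looks roughly like this (a sketch
only; the sfs_* names are illustrative, not my actual code):

    /* Per-volume state shared by every mount of the same device;
     * each struct mount's mnt_data ultimately points at it. */
    struct sfs_volume {
            kmutex_t                sv_lock;    /* protects shared state */
            TAILQ_HEAD(, sfs_mount) sv_mounts;  /* all mounts on this volume */
            struct mount           *sv_sysmp;   /* anonymous system-files mount */
            /* ... in-memory superblock, log, allocation state ... */
    };

    /* Per-mountpoint state; hangs off mp->mnt_data. */
    struct sfs_mount {
            struct sfs_volume      *sm_volume;  /* shared volume state */
            struct mount           *sm_mountp;  /* back pointer to this mount */
            TAILQ_ENTRY(sfs_mount)  sm_entries; /* link on sv_mounts */
    };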
Writing works fine, but issues come up when trying to sync the FS: each mount
point is synced one by one while the others remain open to further
modifications. Two writable mountpoints that are concurrently receiving logs
or other operations will thus never sync properly, since while cleaning the
second mountpoint the first (or both) start writing again, tainting the global
state once more.
I'm now experimenting with suspending the dependent mountpoints when a sync is
requested on the anonymous mountpoint. That turns out to be impossible for
now, because the vfs_suspend_lock mutex demands that only one mountpoint be
suspended at a time. It is not at all clear to me why that should be enforced.
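What I'm trying looks roughly like this (a sketch using the structures above;
vfs_suspend()/vfs_resume() are the existing <sys/fstrans.h> interfaces, the
sfs_* names are hypothetical):

    /* Suspend every dependent mountpoint, sync the shared state,
     * then resume them.  Today the second vfs_suspend() cannot
     * succeed because the global vfs_suspend_lock is still held
     * for the first suspended mount. */
    static int
    sfs_sync_volume(struct sfs_volume *sv)
    {
            struct sfs_mount *sm, *sm2;
            int error = 0;

            TAILQ_FOREACH(sm, &sv->sv_mounts, sm_entries) {
                    error = vfs_suspend(sm->sm_mountp, 0);
                    if (error)
                            break;
            }
            if (error == 0)
                    error = sfs_write_global_state(sv); /* hypothetical */

            /* Resume only the mounts we actually suspended. */
            TAILQ_FOREACH(sm2, &sv->sv_mounts, sm_entries) {
                    if (sm2 == sm)
                            break;
                    vfs_resume(sm2->sm_mountp);
            }
            return error;
    }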
> In principle we could improve the VFS code to distinguish mounts from
> volumes (there are various reasons to do this, in fact) but nobody's
> ever been all that enthusiastic about it and it's not likely to happen
> soon. And that doesn't by itself let you have fses with multiple entry
> points, unless the subvolumes (or at least the subvolume namespaces)
> are strictly disjoint.
In my FS the subvolumes/mountpoints are strictly disjoint by design; it is
logically impossible to link a file or directory from another mountpoint into
one's tree.
> (If you just tag multiple vnodes VV_ROOT, what happens if one is a
> subdir of another? E.g. if you have a /usr volume with two entry
> points, one for all of it (/usr) and one for the "pkg" subtree
> (/usr/pkg) so you can also mount your packages somewhere else... and
> if you mark both the /usr and /usr/pkg vnodes with VV_ROOT, it's not
> obvious exactly what'll happen, but one likely possibility is that cd
> /usr/pkg && cd .. will put you in /.)
The alternative implementation with multiple VV_ROOTs I was referring to in my
mail is one where everything behaves as basically one FS: the mountpoints
created by the mount system call are only used to obtain the relevant
VFS_ROOT() vnodes on the shared anonymous system-files mountpoint, and are
otherwise ignored.
All the user-mounted mountpoints would live on one partition/wedge but have
distinct and disjoint heads to mount (selected with a mount option). So if
there is a `usr' head and a `pkg' head, you could mount `pkg' on `usr/pkg',
and from the point of view of VFS both would have the same v_mount in their
vnodes. Blocks of a vnode can be recorded on the vnode itself, just like LFS
does, and would on disk be intertwined or even shared on the same device. Name
lookups would consult their directory vnode's key or v_data structure to see
in which head they ought to find the inode referenced by the numbers in
directories.
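Hypothetically, the vnode key could look something like this (a sketch;
SFS_VTOKEY() is a made-up accessor from a vnode to its key):

    /* The head id keeps keys from different heads disjoint even
     * though all vnodes share one struct mount.  Real code would
     * make sure the key has no padding, since the vnode cache
     * hashes the raw key bytes. */
    struct sfs_key {
            uint32_t sk_head;   /* which head/subvolume */
            uint64_t sk_ino;    /* inode number within that head */
    };

    /* A directory resolves a name within its own head: */
    struct sfs_key key = {
            .sk_head = SFS_VTOKEY(dvp)->sk_head, /* inherit head from dir */
            .sk_ino  = ino,                      /* from the directory entry */
    };
    error = vcache_get(dvp->v_mount, &key, sizeof(key), &vp);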
My reasoning is that lookups from any directory vnode inside the `pkg' or
`usr' head would return vnodes as normal (the directory knows from its vnode
key which head it is on, so it knows how and where to look), but that a lookup
of, say, '/usr/pkg/../bar' in the `usr/pkg' example would make the VFS lookup
code treat `pkg's VV_ROOT flag as a token: requesting '..' from it would yield
the same vnode, and the lookup would restart from the directory it was mounted
on. Renames would also check whether the referenced heads are equal and
otherwise reject the rename with EXDEV.
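The head check in rename would then be a one-liner (sketch; SFS_VTOKEY() as
above):

    /* In sfs_rename(): the source and target directories must be
     * in the same head, otherwise fail like a cross-mount rename. */
    if (SFS_VTOKEY(fdvp)->sk_head != SFS_VTOKEY(tdvp)->sk_head)
            return EXDEV;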
Isn't this mechanism already happening on every FFS mountpoint too, since they
also have '..' referring to itself and the VV_ROOT flag set on each mounted
FS's root? Isn't the code already assuming that if it gets the same vnode when
looking up '..' in a VV_ROOT-marked vnode it is crossing the FS boundary, and
resolving further lookups from the parent of the directory it was mounted on?
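As far as I can tell, the relevant logic in lookup_once() in
sys/kern/vfs_lookup.c is roughly the following (paraphrased from memory, not
verbatim):

    /* When looking up '..' from a directory that has VV_ROOT set,
     * namei hops to the vnode this mount covers and continues the
     * lookup from there, so the FS is never actually asked for
     * '..' at its own root. */
    if (cnp->cn_flags & ISDOTDOT) {
            while ((searchdir->v_vflag & VV_ROOT) != 0 &&
                (cnp->cn_flags & NOCROSSMOUNT) == 0) {
                    struct vnode *covered =
                        searchdir->v_mount->mnt_vnodecovered;
                    vref(covered);
                    vrele(searchdir);
                    searchdir = covered;
            }
    }

So if I read it right, the VV_ROOT flag itself is the token that triggers the
crossing, before the FS is ever asked for '..'.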
Having all vnodes on one shared anonymous system-files mountpoint would solve
lots of issues, not only the VFS_SYNC() issue!
Any thoughts or caveats on this idea? Would something need to be
enhanced/changed?
With regards and thanks in advance for any feedback :)
Reinoud