Subject: Re: Question: various bugs in sync()?
To: None <tech-kern@netbsd.org>
From: Thor Lancelot Simon <tls@rek.tjls.com>
List: tech-kern
Date: 01/15/1999 16:10:56
On Fri, Jan 15, 1999 at 03:35:39AM -0500, Thor Lancelot Simon wrote:
I am really *much* more concerned about the following bug, in which data may
never be scheduled to be written, period.
I'm hoping someone can fill in answers to the questions I couldn't
figure out for myself last night, particularly the no-sync/double-sync
question about block devices' vnodes.
> Bug #2: data for block devices without mounted filesystems is not
> flushed by sync(2).
>
> 	Because sys_sync walks only the list of mounted filesystems,
> 	data for block devices that aren't mounted never gets sync()ed.
>
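> 	For reference, here's roughly what the sys_sync() loop looks
> 	like (paraphrased from memory of vfs_syscalls.c, with the
> 	locking and error handling elided, so don't trust the details):
>
> 	for (mp = mountlist.cqh_last; mp != (void *)&mountlist;
> 	    mp = nmp) {
> 		nmp = mp->mnt_list.cqe_prev;
> 		/*
> 		 * Read-only mounts are skipped entirely; this
> 		 * is the MNT_RDONLY check discussed below.
> 		 */
> 		if ((mp->mnt_flag & MNT_RDONLY) == 0 &&
> 		    vfs_busy(mp, LK_NOWAIT, NULL) == 0) {
> 			asyncflag = mp->mnt_flag & MNT_ASYNC;
> 			mp->mnt_flag &= ~MNT_ASYNC;
> #if defined(UVM)
> 			uvm_vnp_sync(mp);	/* flush mmap()ed pages */
> #else
> 			vnode_pager_sync(mp);
> #endif
> 			VFS_SYNC(mp, MNT_NOWAIT, p->p_ucred, p);
> 			if (asyncflag)
> 				mp->mnt_flag |= MNT_ASYNC;
> 			vfs_unbusy(mp);
> 		}
> 	}
>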
> There are two sub-cases here.
>
> Case 1: block devices accessed with write()
>
> In this case, I don't know if data is flushed or not. We
> walk the list of mounted filesystems, flushing data
> for all their vnodes. This percolates down as given above.
> But does the VFS_SYNC->VOP_FSYNC->vflushbuf() chain actually
> 	catch the vnode for the block device? I guess it depends on
> 	whether that device vnode is on the mount-point's list of
> vnodes. If it is, it should, I *think*, get written --
> but then I don't understand why filesystems beneath the
> root don't get written multiple times. Someone, please
> help me understand this!
>
> 	In any event, if the filesystem the block device's node
> 	lives on is itself mounted read-only, the block device
> 	definitely doesn't get flushed, because of the check for
> 	MNT_RDONLY (vfs_syscalls.c line 520). So we can definitely
> 	lose this way.
>
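> 	To make that concrete, here's roughly what the per-mount
> 	flush looks like in ffs_sync() (again paraphrased from
> 	memory, this time of ffs_vfsops.c, with the revalidation
> 	and locking games elided):
>
> 	for (vp = mp->mnt_vnodelist.lh_first; vp != NULL;
> 	    vp = vp->v_mntvnodes.le_next) {
> 		/*
> 		 * If an unmounted block device's vnode is on
> 		 * this list, it gets VOP_FSYNC'd here like
> 		 * anything else; if it isn't, nothing ever
> 		 * flushes it.  That's the question.
> 		 */
> 		if (vp->v_dirtyblkhd.lh_first == NULL)
> 			continue;
> 		if (vget(vp, LK_EXCLUSIVE | LK_NOWAIT))
> 			continue;
> 		if ((error = VOP_FSYNC(vp, cred, waitfor, p)) != 0)
> 			allerror = error;
> 		vput(vp);
> 	}
> 	/*
> 	 * The device the filesystem itself sits on is then
> 	 * flushed separately, through its own vnode -- which
> 	 * is exactly what makes me wonder about double syncs:
> 	 */
> 	if ((error = VOP_FSYNC(ump->um_devvp, cred, waitfor, p)) != 0)
> 		allerror = error;
>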
> Case 2: block devices accessed with mmap()
>
> mmap()ed data is flushed by vnode_pager_sync(mp) or by
> uvm_vnp_sync(mp). These walk the list of uvn's (UVM)
> or vm_objects (Mach VM), checking the mount point of
> each corresponding vnode and flushing all dirty pages
> for those which match the given mp. (uvm_vnode.c line
> 	1984). If I'm correct that vp->v_mount for a device
> 	node is in fact the filesystem the device node lives
> 	in (and not NULL or something), then *usually* data
> 	gets synced this way. However, we still lose for
> 	device nodes that live on read-only filesystems.
>
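> 	Schematically, the mount-point filter in uvm_vnp_sync()
> 	is just this (from memory; if I'm remembering the layout
> 	right, the uvm_vnode sits at the front of the vnode,
> 	hence the cast):
>
> 	for (uvn = uvn_wlist.lh_first; uvn != NULL;
> 	    uvn = uvn->u_wlist.le_next) {
> 		vp = (struct vnode *)uvn;
> 		/*
> 		 * Skip vnodes on other mount points.  A device
> 		 * node on a read-only filesystem loses here,
> 		 * because sys_sync never passes in that mp
> 		 * at all.
> 		 */
> 		if (mp && vp->v_mount != mp)
> 			continue;
> 		/* ... flush all of this uvn's dirty pages ... */
> 	}
>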
> 	I'm pretty sure I know how to fix this: change the
> 	semantics of uvm_vnp_sync()/vnode_pager_sync() to remove
> 	the "mp" argument (and comparison), and move the call
> 	outside the per-mount-point loop, as sketched after this
> 	proposal. I don't *think* this needs to be protected
> 	with vfs_busy.
>
> 	If it does, I propose to vfs_busy all filesystems and
> 	call uvm_vnp_sync/vnode_pager_sync with the new
> 	interface, then either:
>
> 		(a) vfs_unbusy all filesystems and iterate over
> 		    them vfs_busy-ing, VFS_SYNC-ing, and
> 		    un-busying as before, or
>
> 		(b) leave them all vfs_busied, VFS_SYNC them
> 		    all, then vfs_unbusy them all.
>
> 	I'd like suggestions on which approach is better, as
> 	well as whether or not I need to bother protecting
> 	uvm_vnp_sync/vnode_pager_sync with vfs_busy anyway.
> 	(I think I do, since vfs_busy protects against
> 	unmounting while the sync is running.)
>
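> 	For concreteness, alternative (b) might look something
> 	like this (hypothetical, of course, since it assumes the
> 	interface change above; error returns, locking, and the
> 	mount list changing underneath are all glossed over):
>
> 	/* busy everything so nothing can be unmounted under us */
> 	for (mp = mountlist.cqh_last; mp != (void *)&mountlist;
> 	    mp = mp->mnt_list.cqe_prev)
> 		(void)vfs_busy(mp, 0, NULL);
>
> 	/* one global pager sync, with no mount-point filter */
> #if defined(UVM)
> 	uvm_vnp_sync(NULL);
> #else
> 	vnode_pager_sync(NULL);
> #endif
>
> 	/* now sync and unbusy each mount as before */
> 	for (mp = mountlist.cqh_last; mp != (void *)&mountlist;
> 	    mp = mp->mnt_list.cqe_prev) {
> 		if ((mp->mnt_flag & MNT_RDONLY) == 0)
> 			VFS_SYNC(mp, MNT_NOWAIT, p->p_ucred, p);
> 		vfs_unbusy(mp);
> 	}
>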
> I'm quite curious about the write() case and the question of
> whether device nodes' buffers are flushed when the filesystems
> they live on are flushed; if they are, I can't see why
> filesystems below the root aren't flushed multiple times, once
> from the mount list and once when their device's vnode is
> flushed along with the filesystem it lives in.
>
> I'm hoping someone more familiar with the VFS code and buffer cache
> can remove some of my Deep Confusion here.
>