Subject: Re: Raising NFS parameters for higher bandwith or "long fat pipe"
To: Manuel Bouyer <bouyer@antioche.lip6.fr>
From: Jonathan Stone <jonathan@DSG.Stanford.EDU>
List: tech-net
Date: 12/13/2003 14:58:58
Manuel confirms that NetBSD (-current and 1.6.2_RC2) has asymmmetric
NFS client performance: with a NetBSD NFS server, Solaris NFS clients
sustain 100Mbyte/sec.
For a NetBSD-current client, NFS reads throughput is about half of
Solaris, and with writes throughput about 80% of solaris. (.1.6.2_RC2
shows a similar spread, but is consistently slower than -current).
I found a FreeBSD 4.7 NFS client is somewhat better, and has roughly
symmetric read/write performance. Some digging in the code suggests
FreeBSD does better on reads becasue it issues explicit NFS read-ahead,
in addition to the `read one ahead' we do (which FreeBSD does too).
Below is an untested, almost certinaly broken, patch showing how
FreeBSD 4.x does NFS readahead. It' basically two chunks:
1. Adding a loop to sys/nfs/nfs_bio.c:nfs_bioread() to compute
offsets and issue additional read-aheads.
2. Changing the api to nfs_asyncio() so that the extra read-aheads
can pass appropriate credentials (and struct proc *?) to
nfs_async().
I dont know FreeBSD kernel internals well enough to rework these to
work correctly in NetBSD right off the bat. I'd appreciate some help,
as this should make a big difference to NFS client read performance.
* There must be a better way to compute the read-ahead block number!
* Should the vfs_busy_pages/vfs_unbusy_pages map directly to
uvm_page_busy/uvm_page_unbusy()? Or can we just ignore the FreeBSD
vfs_{,un}busy_page() since our nfs client code doesn't do anything
similar in in its other path?
* How about the change to pass a struct proc * and creds to nfs_async()?
I'm actively looking at NFS with RPSEC_GSS righ tnow, so passing
creds looks right to me. But does it break anything?
* What's up with the cr_hold() with no apparent matching decrement of
the credential refcount?
NB: if & when if I commit this I will credit FreeBSD. Do the FreeBSD
people here know of anyone specific that should get credit for it?
(Matt Dillon, maybe?)
Index: nfs_bio.c
===================================================================
RCS file: /cvsroot/src/sys/nfs/nfs_bio.c,v
retrieving revision 1.113
diff -u -r1.113 nfs_bio.c
--- nfs_bio.c 2003/11/17 02:02:31 1.113
+++ nfs_bio.c 2003/12/13 22:06:23
@@ -94,6 +94,7 @@
int enough = 0;
struct dirent *dp, *pdp;
off_t curoff = 0;
+ int nra;
#ifdef DIAGNOSTIC
if (uio->uio_rw != UIO_READ)
@@ -226,9 +227,45 @@
}
baddr = (caddr_t)0;
switch (vp->v_type) {
- case VREG:
+ case VREG: {
+ off_t lbn, rabn;
+ u_int biosize, seqcount;
+
nfsstats.biocache_reads++;
+ /* XXX jrs, 2003-12-08: issue more readaheads here */
+ lbn = uio->uio_offset / biosize;
+ on = uio->uio_offset & (biosize - 1);
+ biosize = vp->v_mount->mnt_stat.f_iosize;
+ seqcount = (u_int)((off_t)(ioflag >> 16) * biosize / 16384);
+ /*
+ * Start the read ahead(s), as required.
+ */
+ if (nfs_numasync > 0 && nmp->nm_readahead > 0) {
+ for (nra = 0; nra < nmp->nm_readahead && nra < seqcount &&
+ (off_t)(lbn + 1 + nra) * biosize < np->n_size;
+ nra++) {
+ rabn = lbn + 1 + nra;
+ if (!incore(vp, rabn)) {
+ rabp = nfs_getcacheblk(vp, rabn, biosize, p);
+ if (!rabp)
+ return (EINTR);
+ if ((rabp->b_flags & (B_CACHE|B_DELWRI)) == 0) {
+ rabp->b_flags |= (B_READ | B_ASYNC);
+ /* vfs_busy_pages(rabp, 0); */
+ if (nfs_asyncio(rabp, p->p_ucred, p)) {
+ rabp->b_flags |= B_INVAL|B_ERROR;
+ /* vfs_unbusy_pages(rabp); */
+ brelse(rabp);
+ break;
+ }
+ } else {
+ brelse(rabp);
+ }
+ }
+ }
+ }
+
error = 0;
if (uio->uio_offset >= np->n_size) {
break;
@@ -250,6 +287,7 @@
}
n = 0;
break;
+ }
case VLNK:
nfsstats.biocache_readlinks++;
@@ -454,7 +492,7 @@
if ((rabp->b_flags & (B_DONE | B_DELWRI)) == 0) {
rabp->b_dcookie = nndp->dc_cookie;
rabp->b_flags |= (B_READ | B_ASYNC);
- if (nfs_asyncio(rabp)) {
+ if (nfs_asyncio(rabp, curproc->p_ucred, curproc)) {
rabp->b_flags |= B_INVAL;
brelse(rabp);
}
@@ -780,22 +818,41 @@
*/
int
-nfs_asyncio(bp)
+nfs_asyncio(bp, cred, proc)
struct buf *bp;
+ struct ucred *cred;
+ struct proc *proc;
{
int i;
struct nfsmount *nmp;
int gotiod, slpflag = 0, slptimeo = 0, error;
+ /*
+ * If no async daemons then return EIO to force caller to run the rpc
+ * synchronously.
+ */
if (nfs_numasync == 0)
return (EIO);
nmp = VFSTONFS(bp->b_vp->v_mount);
+
+#ifdef notyet
+ /*
+ * Commits are usually short and sweet so lets save some cpu and
+ * leave the async daemons for more important rpc's (such as reads
+ * and writes).
+ */
+ if ((bp->b_flags & (B_READ|B_NEEDCOMMIT)) == B_NEEDCOMMIT &&
+ (nmp->nm_bufqiods > nfs_numasync / 2)) {
+ return(EIO);
+ }
+#endif
+
again:
if (nmp->nm_flag & NFSMNT_INT)
slpflag = PCATCH;
gotiod = FALSE;
-
+
/*
* Find a free iod to process this request.
*/