Subject: RE: namei caching of newly created files?
To: Luke Mewburn <tech-perform@NetBSD.org>
From: Gordon Waidhofer <gww@traakan.com>
List: tech-kern
Date: 01/19/2005 21:06:40
> -----Original Message-----
> From: tech-kern-owner@NetBSD.org [mailto:tech-kern-owner@NetBSD.org]On
> Behalf Of Luke Mewburn
> Sent: Wednesday, January 19, 2005 7:35 PM
> To: tech-perform@NetBSD.org; tech-kern@NetBSD.org
> Subject: Re: namei caching of newly created files?
>
>
> On Wed, Jan 19, 2005 at 05:47:31PM -0800, Bill Studenmund wrote:
> | Becasue we don't build in-kernel lists of files in the directory, we have
> | to do a full search (O(n)) of the directory to see if the name is or isn't
> | there already.
>
> FreeBSD implemented UFS_DIRHASH a while ago:
>
> http://docs.freebsd.org/cgi/getmsg.cgi?fetch=685987+0+archive/2001/cvs-all/2001071
> 5.cvs-all
>
> NetApp did the same thing for their WAFL file system over a decade ago:
> http://www.netapp.com/tech_library/3006.html
>
>
> We should just lift the working UFS_DIRHASH implementation from FreeBSD.
> Three years ago I did a work-in-progress port of UFS_DIRHASH from FreeBSD
> but I didn't finish it. I have the code around somewhere.
>
With all due respect to whom it may concern......YECH!!
We (Traakan) also did dir hashing circa 1994, but no paper.
I believe our software release predates NetApp's
(coincidence?). I'm reasonably sure it was a common customer's
requirement that drove both designs.
I think I'll go ahead and disclose this since it is
now 10 year old stuff.
Here's a better way to do it than the incore side table.
It is not backward compatible with existing large directories
but is for the normal, one block directories. A case can easily
be made for why it need not be backward compatible.
The code impact is negligible.
95+% of directories are a single block. The rest range
from a little big to REALLY REALLY REALLY big. This algorithm
is ellegance itself for the 95+% case because it results
in no real costs and is also backward compatible. It is a
HUGE win on big directories without a lot of incore side
table weirdness. It is an expanding hash so small stays
small, a little big stays a little big, and bite the bullet
for really really big.
int buckets[] = { 1, 8, 64, 256, 256, 256, 256, .... };
lookup (dir d, char name[], result *ret)
{
int key = hash(name);
int base = 0;
int i;
int blockno;
int class;
for (i = 0; buckets[i]; i++) {
class = key % buckets[i];
blockno = base + class;
rc = lookup_in_block (d, blockno, name, ret);
if (rc > 0) {
return 0;
}
base += buckets[i];
}
return ENOENT;
}
lookup_in_block (dir d, daddr_t blockno, char name[], result *ret)
{
// if blockno of d is empty (a hole), return 0
// read logical block blockno in directory d
// sequential search for name[], note any empty slot
// if found, fill in ret
}
A superblock option bit or distinct magic number
should control whether or not directories work this
way.
Regards,
-gww