Subject: Fast ffs (was Re: Request for testers.)
To: None <cg110@york.ac.uk>
From: None <olly@muscat.co.uk>
List: port-arm32
Date: 04/22/1997 09:29:32
There's an obvious tweak -- use ANDS and NE to avoid the LDRB for ffs(0).
Here's a rewritten version with APCS register names:
ffs
; Standard trick to isolate bottom bit in a0 or 0 if a0 = 0 on entry
rsb a1, a0, #0
ands a0, a0, a1
; now a0 has at most one set bit, call this X
; if X = 0, all further instructions are skipped
orrne a0, a0, a0, lsl #4 ; a0 = X * 0x11
orrne a0, a0, a0, lsl #6 ; a0 = X * 0x451
rsbne a0, a0, a0, lsl #16 ; a0 = X * 0x0450fbaf
; now lookup in table indexed on top 6 bits of a0
adrne a2, ffs_table
ldrneb a0, [ a2, a0, lsr #26 ]
mov pc, lr
ffs_table
<64 byte table>
Now we're only using 32 entries in the table, which makes me wonder if we
can use a smaller table (mind you, this isn't going to make much difference
to the speed, perhaps a small improvement from better caching).
We can pack 0 ... 31 into the 32 bit value 0x07c564dd (there are probably
others possibilities -- I found this one easily enough).
In binary 00000111110001010110010011011101 then 0000 from shifting, and we
get each of 0 ... 31 exactly once by considering sets of 5 consecutive bits.
However, I can't see how to multiply by this value in <= 3 instructions.
Looking at factors of the form 2^n +/- 1 we have 1025, 63, 15, 9, 7, 5, 3
(not all at once), but there's also a prime factor of 673 (1010100001 in
binary).
Cheers,
Olly
--
Usenet is not for answering questions but for questioning answers.