Subject: Re: ARM bswap optimizations
To: Jason R Thorpe <thorpej@wasabisystems.com>
From: Richard Earnshaw <rearnsha@buzzard.freeserve.co.uk>
List: port-arm
Date: 08/14/2002 00:06:01
> Hi folks...
>
> I'm wanting to shave some cycles out of the TCP/IP code on ARM. hton*()
> and ntoh*() are low-hanging fruit. The issues:
>
> * Constants are not byte-swapped at compile-time.
>
> * A function must be called to do the byte-swap. This costs
> 3 cycles to call the function (one to branch, 2 for the
> pipeline flush), and 3 cycles to return. This is significant
> overhead if you consider that it's 4 insns to byte-swap an int,
> and 3 insns to byte-swap a short.
>
> The following patch addresses these issues. I'd appreciate it if people
> would read it over to make sure that I didn't screw up the asm (mostly
> the constraints :-) I've booted it multi-user on an XScale.
Writing your inline as

inline u_int32_t
__byte_swap_long_var(u_int32_t v)
{
	u_int32_t t1, t2, t3;

	t1 = v ^ ((v << 16) | (v >> 16));
	t2 = t1 & 0xff00ffff;
	t3 = (v >> 8) | (v << 24);
	return t3 ^ (t2 >> 8);
}

enables gcc to generate a sequence that is only one instruction longer (5
rather than 4 instructions; a pattern to eliminate the fifth could
be added to gcc fairly easily). It has the added advantage that the
compiler will do any constant reduction for you. E.g.:

u_int32_t foo()
{
	return (__byte_swap_long_var(0x01234567));
}

compiles as:

foo:
	mov	ip, sp
	stmfd	sp!, {fp, ip, lr, pc}
	ldr	r0, .L5
	sub	fp, ip, #4
	ldmea	fp, {fp, sp, pc}
.L6:
	.align	0
.L5:
	.word	1732584193	@ = 0x67452301

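
To check that constant by hand, the four steps of the C version can be traced on 0x01234567. The following is only a sketch for illustration: ror32, swap32_traced and swap32_selfcheck are made-up names, and uint32_t from <stdint.h> stands in for u_int32_t so the fragment compiles outside the kernel tree.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate right by n bits (0 < n < 32); illustrative helper, not from the patch. */
static inline uint32_t
ror32(uint32_t v, unsigned n)
{
	return (v >> n) | (v << (32 - n));
}

/*
 * Same four steps as __byte_swap_long_var, with the intermediate
 * values for v = 0x01234567 shown in the comments.
 */
static inline uint32_t
swap32_traced(uint32_t v)
{
	uint32_t t1, t2, t3;

	t1 = v ^ ror32(v, 16);	/* 0x01234567 ^ 0x45670123 = 0x44444444 */
	t2 = t1 & 0xff00ffff;	/* clear bits 23..16:        0x44004444 */
	t3 = ror32(v, 8);	/* rotate right by 8:        0x67012345 */
	return t3 ^ (t2 >> 8);	/* 0x67012345 ^ 0x00440044 = 0x67452301 */
}

/* Cross-check against the .word value in the listing. */
static inline void
swap32_selfcheck(void)
{
	assert(swap32_traced(0x01234567) == 0x67452301);
}
```

Every operation here is a plain shift, AND, OR or XOR on values the compiler can see, which is why a constant argument can be folded down to a single literal.
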
The main advantage of leaving it as C code is that the compiler can
schedule the instructions individually.
Similar, simpler code can be written for half-words.
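
One possible shape for the half-word case, as a sketch only. The names byte_swap_word_var and byte_swap_word_selfcheck are hypothetical (modeled on __byte_swap_long_var, not taken from the patch), and uint16_t from <stdint.h> stands in for u_int16_t.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical half-word counterpart of __byte_swap_long_var.
 * Written as plain C so the compiler can schedule it and fold constants.
 */
static inline uint16_t
byte_swap_word_var(uint16_t v)
{
	/* v is promoted to int, so the cast discards bits shifted above bit 15. */
	return (uint16_t)((v << 8) | (v >> 8));
}

/* Small sanity check on a known value. */
static inline void
byte_swap_word_selfcheck(void)
{
	assert(byte_swap_word_var(0x0123) == 0x2301);
}
```
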
R.