Subject: Re: 25%+ improvement in in_cksum speed!
To: None <port-i386@netbsd.org>
From: David Laight <david@l8s.co.uk>
List: port-i386
Date: 09/22/2002 21:37:09
On Wed, Sep 18, 2002 at 12:28:58AM +0100, David Laight wrote:
> > It would be interesting to know what PIII, P4 and athlon XP get.
>
> The P4 figures are interesting!
> The best is a rolled up C loop! - even then it is still only a
> quarter of the speed of a similar athlon.
I've managed to write an SSE2 version for the P4, which gives:

32 bit C       sum f807 took 4185 usecs 0.498891 nsec/byte
32 bit C pair  sum f807 took 8373 usecs 0.998139 nsec/byte
sse2 test      sum f807 took 2205 usecs 0.262856 nsec/byte

(I think this is a 1.8GHz P4 - thanks to Greg Oster for testing
this for me.)
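
As a cross-check on the figures above: all three rows are consistent with each run summing 8388608 bytes (2^23) in total. That size is inferred from the numbers, it is not stated anywhere in the test output, so treat it as an assumption. A small C sketch of the arithmetic (the function names are made up for illustration):

```c
#include <assert.h>

/* Bytes summed per run: inferred from the figures, NOT stated in the post. */
#define TOTAL_BYTES 8388608.0	/* 2^23 */

/* Convert an elapsed time in microseconds to nanoseconds per byte. */
static double nsec_per_byte(double usecs)
{
	return usecs * 1000.0 / TOTAL_BYTES;
}

/* How many times faster the new routine is than the baseline. */
static double speedup(double base_usecs, double new_usecs)
{
	assert(new_usecs > 0.0);
	return base_usecs / new_usecs;
}
```

On those figures the SSE2 loop takes a little over half the time of the 32 bit C loop, i.e. it is roughly 1.9 times faster.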
I suspect that minor instruction re-ordering will give additional
benefit (I'd start with an empty loop and add the instructions
one by one in different places to see which order is best!)
Unrolling (to 64-byte blocks) is also a probable winner.
If I get bored tomorrow I might include the routines in the actual
checksum code...
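
sse2_mask:
	.word	0xffff,0xffff,0xffff,0	# keep the low 48 bits of each 64-bit lane
	.word	0xffff,0xffff,0xffff,0

/*
 * Expects a 16-byte aligned buffer (movdqa) and a byte count that is
 * a non-zero multiple of 32; there is no handling of a tail.
 */
ENTRY(sum_sse2)
	movl	4(%esp),%edx		# edx = buffer pointer
	movl	8(%esp),%ecx		# ecx = byte count
	pushl	%ebx
	pushl	%esi
	pushl	%edi
	pxor	%xmm0,%xmm0		# lane sums for the first 16 bytes
	pxor	%xmm2,%xmm2		# lane sums for the second 16 bytes
	movdqu	sse2_mask,%xmm7
	xorl	%eax,%eax		# running sum of word 3 of each 16 bytes
	xorl	%ebx,%ebx		# running sum of word 7 of each 16 bytes
1:
	movdqa	(%edx),%xmm1
	movdqa	16(%edx),%xmm3
	pextrw	$3,%xmm1,%esi		# pull out the words the mask drops
	pextrw	$7,%xmm1,%edi
	pand	%xmm7,%xmm1
	addl	%esi,%eax
	pextrw	$3,%xmm3,%esi
	addl	%edi,%ebx
	pextrw	$7,%xmm3,%edi
	paddq	%xmm1,%xmm0		# 48-bit adds, carries collect in the top 16 bits
	pand	%xmm7,%xmm3
	addl	%esi,%eax
	addl	%edi,%ebx
	paddq	%xmm3,%xmm2
	addl	$32,%edx
	subl	$32,%ecx
	jnz	1b
	paddq	%xmm2,%xmm0		# merge the two vector accumulators
	addl	%ebx,%eax		# merge the two scalar accumulators
	pshufd	$0xee,%xmm0,%xmm1	# abcd -> abab
	paddq	%xmm1,%xmm0		# xx(ab+cd)
	movd	%xmm0,%ebx		# low 32 bits of the 64-bit total
	pextrw	$2,%xmm0,%esi		# bits 32-47
	pextrw	$3,%xmm0,%edi		# bits 48-63
	addl	%esi,%edi
	addl	%ebx,%eax
	adcl	%edi,%eax		# fold the upper words in, with carry
	adcl	$0,%eax			# end-around carry
	popl	%edi
	popl	%esi
	popl	%ebx
	ret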
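
For anyone who finds the assembly hard to follow, here is a rough C model of the same idea: mask each 64-bit chunk down to 48 bits so the adds cannot lose a carry, sum the dropped top word separately, and fold at the end. It is not the routine above translated line for line. It uses one 64-bit accumulator instead of four SSE2 lanes and steps 8 bytes at a time, and the names sum_masked, sum_words and fold16 are invented for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fold a wide partial sum down to 16 bits with end-around carry. */
static uint16_t fold16(uint64_t s)
{
	while (s >> 16)
		s = (s & 0xffff) + (s >> 16);
	return (uint16_t)s;
}

/* Plain reference: add the buffer one 16-bit word at a time.
 * A trailing odd byte is ignored. */
static uint16_t sum_words(const unsigned char *p, size_t len)
{
	uint64_t s = 0;
	for (size_t i = 0; i + 1 < len; i += 2) {
		uint16_t w;
		memcpy(&w, p + i, 2);
		s += w;
	}
	return fold16(s);
}

/* Masked-lane model. len must be a multiple of 8 and small enough
 * (at most 65536 chunks) that the accumulators cannot overflow. */
static uint16_t sum_masked(const unsigned char *p, size_t len)
{
	uint64_t lane = 0;	/* plays the role of one 64-bit half of %xmm0 */
	uint32_t high = 0;	/* plays the role of %eax */

	assert(len % 8 == 0);
	for (size_t i = 0; i < len; i += 8) {
		uint64_t q;
		memcpy(&q, p + i, 8);
		high += (uint32_t)(q >> 48);		/* like pextrw $3 */
		lane += q & 0x0000ffffffffffffULL;	/* like pand + paddq */
	}
	return fold16(lane + high);
}
```

Both functions should return the same 16-bit value for any buffer whose length is a multiple of 8, because 2^16, 2^32 and 2^48 are all congruent to 1 modulo 0xffff, so it does not matter in which column of the wide sum a word ends up.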
David
--
David Laight: david@l8s.co.uk