On Mon, 2007-04-09 at 16:55 +0200, Tim Blechmann wrote:
The first batch of SSE code was written in pure assembly for two
reasons: a) i had fun learning it, and b) i could not get xmmintrin.h
working but did have asm working.
The second batch (find_peaks, for displaying waveforms) is done with
xmmintrin.h as i finally figured out how to use it :)
> beside that, if ardour is using a fixed block size, using compile-time
First, Ardour is not using a fixed block size; it uses the block size
from jackd. Second, Ardour does sample-accurate event handling, which
means that buffers might be divided up in any possible way, so we must
have code which works for non-aligned buffers and for numbers of frames
which are not divisible by 4. (Alignment here means the 16-byte
alignment required by aligned x86 SSE loads: 4 bytes per sample, so
16 bytes = 4 samples.)
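
To make the alignment arithmetic concrete, here is a minimal sketch of
how many samples have to be handled one at a time before a buffer
reaches a 16-byte boundary. The helper name frames_until_aligned is
made up for illustration and is not taken from the Ardour source; it
assumes the buffer holds 4-byte floats and is at least 4-byte aligned.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not from Ardour: number of float samples to
 * process one at a time before buf reaches a 16-byte boundary.
 * Assumes buf is at least 4-byte aligned (true for any valid float *). */
static size_t frames_until_aligned(const float *buf, size_t nframes)
{
    /* bytes past the previous 16-byte boundary: 0, 4, 8 or 12 */
    size_t misalign = (size_t)((uintptr_t)buf & 15);
    size_t lead = misalign ? (16 - misalign) / sizeof(float) : 0;

    /* never report more samples than the buffer actually holds */
    return lead < nframes ? lead : nframes;
}
```

So a buffer that starts 4 bytes past a boundary needs 3 scalar samples
first, and a buffer shorter than that is handled entirely by the scalar
path.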
The find_peaks algo works like this:
1) run one sample at a time until we reach alignment
2) run buffer in quads of quads (64 bytes or 16 samples in one loop)
   while there are >= 16 samples left
3) run buffer in quads (16 bytes or 4 samples in one loop)
   while there are >= 4 samples left
4) run one sample at a time until we run out of samples
So we have "conservative dynamic unrolling" :)
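
The four steps above can be sketched with xmmintrin.h roughly as
follows. This is only an illustration of the scheme, not the actual
Ardour find_peaks: the function name find_peaks_sketch, its signature
and the way min/max are seeded by the caller are assumptions, there is
no prefetching, and NaNs are not treated specially. It needs an x86
target with SSE enabled (the default on x86-64, -msse with gcc in
32 bit mode).

```c
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical sketch, not Ardour's code. *min and *max must be
 * seeded by the caller and are updated with the peaks found in buf. */
static void find_peaks_sketch(const float *buf, size_t nframes,
                              float *min, float *max)
{
    float lo = *min, hi = *max;
    float tmp[4];
    int i;

    /* 1) one sample at a time until buf is 16-byte aligned */
    while (nframes > 0 && ((uintptr_t)buf & 15) != 0) {
        if (*buf < lo) lo = *buf;
        if (*buf > hi) hi = *buf;
        ++buf; --nframes;
    }

    __m128 vlo = _mm_set1_ps(lo);
    __m128 vhi = _mm_set1_ps(hi);

    /* 2) quads of quads: 64 bytes = 16 samples per iteration */
    while (nframes >= 16) {
        __m128 a = _mm_load_ps(buf);
        __m128 b = _mm_load_ps(buf + 4);
        __m128 c = _mm_load_ps(buf + 8);
        __m128 d = _mm_load_ps(buf + 12);
        vlo = _mm_min_ps(vlo, _mm_min_ps(_mm_min_ps(a, b), _mm_min_ps(c, d)));
        vhi = _mm_max_ps(vhi, _mm_max_ps(_mm_max_ps(a, b), _mm_max_ps(c, d)));
        buf += 16; nframes -= 16;
    }

    /* 3) quads: 16 bytes = 4 samples per iteration */
    while (nframes >= 4) {
        __m128 a = _mm_load_ps(buf);
        vlo = _mm_min_ps(vlo, a);
        vhi = _mm_max_ps(vhi, a);
        buf += 4; nframes -= 4;
    }

    /* fold the four vector lanes back into the scalar results */
    _mm_storeu_ps(tmp, vlo);
    for (i = 0; i < 4; ++i) if (tmp[i] < lo) lo = tmp[i];
    _mm_storeu_ps(tmp, vhi);
    for (i = 0; i < 4; ++i) if (tmp[i] > hi) hi = tmp[i];

    /* 4) one sample at a time until we run out of samples */
    while (nframes > 0) {
        if (*buf < lo) lo = *buf;
        if (*buf > hi) hi = *buf;
        ++buf; --nframes;
    }

    *min = lo;
    *max = hi;
}
```

Because steps 1 and 4 are plain scalar loops, the same function copes
with any start address and any frame count, which is what the
sample-accurate event handling requires.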
But the benefits here are quite small and very architecture dependent.
The AMD 64 bit processors benefit a lot more from unrolling and memory
prefetching than my Core 2 Duo (in 32 bit mode) does.
I don't have the numbers here, and any numbers i would give you would be
from a testbench which can only measure raw performance, not real world