How to use SSE instruction set on C6678 DSP?-CodePudding

SSE can only be used on x86 x64 CPUs. I have a problem using the SPEEXDSP library on a TI C6678. I've never used the SSE instruction, I've tried many ways and can't get it to work on the DSP.

Is it possible to modify SSE instructions to normal C instructions? How to modify it? Looking forward to your reply. Example:

static inline double interpolate_product_double(const float* a, const float* b, unsigned int len, const spx_uint32_t oversample, float* frac) {
int i;
double ret;
__m128d sum;
__m128d sum1 = _mm_setzero_pd();
__m128d sum2 = _mm_setzero_pd();
__m128 f = _mm_loadu_ps(frac);
__m128d f1 = _mm_cvtps_pd(f);
__m128d f2 = _mm_cvtps_pd(_mm_movehl_ps(f, f));
__m128 t;
for (i = 0; i < len; i  = 2)
{
    t = _mm_mul_ps(_mm_load1_ps(a   i), _mm_loadu_ps(b   i * oversample));
    sum1 = _mm_add_pd(sum1, _mm_cvtps_pd(t));
    sum2 = _mm_add_pd(sum2, _mm_cvtps_pd(_mm_movehl_ps(t, t)));

    t = _mm_mul_ps(_mm_load1_ps(a   i   1), _mm_loadu_ps(b   (i   1) * oversample));
    sum1 = _mm_add_pd(sum1, _mm_cvtps_pd(t));
    sum2 = _mm_add_pd(sum2, _mm_cvtps_pd(_mm_movehl_ps(t, t)));
}
sum1 = _mm_mul_pd(f1, sum1);
sum2 = _mm_mul_pd(f2, sum2);
sum = _mm_add_pd(sum1, sum2);
sum = _mm_add_sd(sum, _mm_unpackhi_pd(sum, sum));
_mm_store_sd(&ret, sum);
return ret;

}

CodePudding user response：

Yes, you can use SIMD Everywhere (SIMDe). It provides portable implementations of many intrinsics, including all of the ones in your code. Full disclosure: I am the lead developer.

Edit: replying to phuclv here since it's a bit long for a comment.

SIMDe doesn't currently use the c6x instrinsics to implement functions like we often do for NEON, AltiVec/VSX, WASM SIMD, etc. There is nothing preventing it, and patches are very much welcome, but they're not there yet.

However, every function in SiMDe has fallback implementations all the way back to standard C. Usually things don't get that far, though; even discounting the architecture-specific implementations mentioned above, if the compiler supports it the operations are also implemented using GNU C vector extensions, and even the portable fallbacks are actually annotated with OpenMP SIMD directives. Conversion functions use compiler built-ins like __builtin_convertvector, and functions which require shuffling data around will use __builtin_shuffle / __builtin_shufflevector.

Basically, SIMDe goes to great lengths to get the compiler to vectorize the whenever possible, even if SIMDe doesn't actually know how to do it. The functions above are all pretty straightforward; I don't know enough about c6x SIMD to know what kind of operations are supported in hardware, but GCC and clang (which the TI compilers are based on) generally do a very good job with all the information SIMDe gives them. Honestly, the thing I'm most worried about here is whether the c6x supports double-precision floating point in SIMD (which the code above uses)... there is a pretty good chance it only supports single-precision floats.

CodePudding user response：

Is it possible to modify SSE instructions to normal C instructions?

There's no such thing as "C instructions" because C is a high level language with only statements and no instructions. But yes it's possible to convert SSE intrinsics to C expressions because they're simply multiple operations in parallel

SSE is one of the SIMD instruction sets so just convert it to the corresponding SIMD in the target architecture. In your case TI C6678 does have SIMD support:

The C64x and C674x DSPs support 2-way SIMD operations for 16-bit data and 4-way SIMD operations for 8-bit data. On the C66x DSP, the vector processing capability is improved by extending the width of the SIMD instructions. C66x DSPs can execute instructions that operate on 128-bit vectors.

CodePudding user response：

The C66x architecture indeed supports a number of SIMD instructions, somewhat comparable to those of Intel's SSE.

You need to be aware of the processor's register set in both architectures and compare the available instructions.

For example, _mm_add_ps performs four simultaneous additions of single-precision floats, contained four by four in the SSE registers. The DSP has a similar DADDSP instruction that only performs two such additions. Hence you will need to translate one _mm_add_ps by two DADDSP.

Read the manuals (these instruction sets are online), understand what the instructions are doing, and find the equivalences. In case of a dead-end, you still have the recourse to good old scalar operations, like C[0]= A[0] B[0]; C[1]= A[1] B[1];