
SIMD

Introduction

SIMD (Single Instruction, Multiple Data) is a computing technique used in modern CPUs and GPUs to perform the same operation on multiple pieces of data simultaneously. SIMD instructions are critical for optimizing tasks in data-parallel applications, such as multimedia processing, scientific computing, and machine learning.

What is SIMD?

SIMD allows a single instruction to operate on multiple data elements in parallel. It is a subset of parallel computing focused on data-level parallelism. Traditional instructions operate on a single data element (Single Instruction, Single Data).

Most modern CPUs have SIMD instruction sets built into their architecture. These include:

  • Intel/AMD x86:
    • MMX (legacy)
    • SSE (Streaming SIMD Extensions)
    • AVX (Advanced Vector Extensions)
    • AVX-512 (latest in Intel’s Xeon and some desktop processors)
  • ARM:
    • NEON
  • PowerPC:
    • AltiVec (also known as VMX)
  • RISC-V:
    • Vector extensions (RVV)

When to Use SIMD

SIMD is ideal for applications with:

  1. Data Parallelism: Repeated operations on arrays or vectors (e.g., adding two arrays).
  2. Heavy Computation:
    • Multimedia processing (e.g., video encoding/decoding, image manipulation).
    • Scientific simulations (e.g., matrix operations).
    • Machine learning (e.g., tensor computations).
  3. Regular Data Access Patterns: Data laid out in contiguous memory blocks.

SIMD support in your CPU provides vector registers that store multiple data elements (e.g., 4 floats in a 128-bit register), and vectorized instructions then operate on all of those elements simultaneously. SIMD requires aligned memory for optimal performance; misaligned data incurs penalties or falls back to scalar processing.
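
To make that concrete, here is a minimal sketch using SSE intrinsics (covered in more detail in the next section): four floats are loaded into a single 128-bit register and added with one instruction. The unaligned load/store variants are used so that no alignment setup is needed.

#include <xmmintrin.h>
#include <stdio.h>

int main() {
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float out[4];

    __m128 va   = _mm_loadu_ps(a);       // 4 floats packed into one 128-bit register
    __m128 vb   = _mm_loadu_ps(b);
    __m128 vsum = _mm_add_ps(va, vb);    // a single instruction adds all 4 lanes
    _mm_storeu_ps(out, vsum);

    for (int i = 0; i < 4; i++) {
        printf("%f ", out[i]);
    }
    printf("\n");

    return 0;
}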

How to use it

Intel Intrinsics for AVX

The following example simply adds two vectors together and prints the result out to the terminal. Note that _mm256_load_ps and _mm256_store_ps expect 32-byte aligned addresses, so the arrays are declared with GCC's aligned attribute.

#include <immintrin.h>
#include <stdio.h>
#include <stdlib.h>

// add two arrays of floats using AVX
void add_arrays(float* a, float* b, float* result) {
    __m256 vec_a = _mm256_load_ps(a);   // Load 8 floats into a vector register
    __m256 vec_b = _mm256_load_ps(b);   // Load 8 floats into another register
    __m256 vec_res = _mm256_add_ps(vec_a, vec_b); // SIMD addition
    _mm256_store_ps(result, vec_res);  // Store the result back to memory
}

int main() {
    // ensure array sizes match AVX requirements (8 floats) and that the
    // arrays are 32-byte aligned, as _mm256_load_ps/_mm256_store_ps require
    float a[8] __attribute__((aligned(32))) = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f};
    float b[8] __attribute__((aligned(32))) = {8.0f, 7.0f, 6.0f, 5.0f, 4.0f, 3.0f, 2.0f, 1.0f};
    float c[8] __attribute__((aligned(32)));

    add_arrays(a, b, c);

    // print the result
    printf("Answer: ");
    for (int i = 0; i < 8; i++) {
        printf("%f ", c[i]);
    }
    printf("\n");

    return 0;
}

In order to compile this you need to use:

gcc -mavx test.c -o test

When we disassemble this program, we can see evidence that the extended instruction set is being used:

    __m256 vec_a = _mm256_load_ps(a);   // Load 8 floats into a vector register
    118a:       c5 fc 29 44 24 c8       vmovaps %ymm0,-0x38(%rsp)
    1190:       48 8b 44 24 98          mov    -0x68(%rsp),%rax
    1195:       48 89 44 24 b8          mov    %rax,-0x48(%rsp)
    119a:       48 8b 44 24 b8          mov    -0x48(%rsp),%rax
    119f:       c5 fc 28 00             vmovaps (%rax),%ymm0
    __m256 vec_b = _mm256_load_ps(b);   // Load 8 floats into another register
    11a3:       c5 fc 29 44 24 e8       vmovaps %ymm0,-0x18(%rsp)
    11a9:       c5 fc 28 44 24 c8       vmovaps -0x38(%rsp),%ymm0
    11af:       c5 fc 29 44 24 48       vmovaps %ymm0,0x48(%rsp)
    11b5:       c5 fc 28 44 24 e8       vmovaps -0x18(%rsp),%ymm0
    11bb:       c5 fc 29 44 24 68       vmovaps %ymm0,0x68(%rsp)
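
The example above assumes exactly 8 floats in aligned arrays. For arbitrary lengths, a common pattern is to process 8 elements per iteration and finish the remainder with scalar code. The sketch below (the function name is just for illustration) uses the unaligned loadu/storeu variants so that it works with any pointers:

#include <immintrin.h>

// process n floats: 8 at a time with AVX, then a scalar tail for the remainder
void add_arrays_n(const float* a, const float* b, float* result, int n) {
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 vec_a = _mm256_loadu_ps(a + i);   // unaligned load of 8 floats
        __m256 vec_b = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(result + i, _mm256_add_ps(vec_a, vec_b));
    }
    for (; i < n; i++) {                         // last (n % 8) elements, one at a time
        result[i] = a[i] + b[i];
    }
}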

Compiler Auto-Vectorisation

SIMD is so common these days that compilers will vectorise code for you. If you wrote the code above in plain old C:

void add_arrays(float* a, float* b, float* result, int n) {
    for (int i = 0; i < n; i++) {
        result[i] = a[i] + b[i];
    }
}

If you compile this code with -O3 (or -O2 on newer versions of gcc), you'll find that vectorisation gets enabled.

Without any optimisation, we get the following:

void add_arrays(float* a, float* b, float* result, int n) {
    1159:       55                      push   %rbp
    115a:       48 89 e5                mov    %rsp,%rbp
    115d:       48 89 7d e8             mov    %rdi,-0x18(%rbp)
    1161:       48 89 75 e0             mov    %rsi,-0x20(%rbp)
    1165:       48 89 55 d8             mov    %rdx,-0x28(%rbp)
    1169:       89 4d d4                mov    %ecx,-0x2c(%rbp)
  for (int i = 0; i < n; i ++) {
    116c:       c7 45 fc 00 00 00 00    movl   $0x0,-0x4(%rbp)
    1173:       eb 50                   jmp    11c5 <add_arrays+0x6c>
    result[i] = a[i] + b[i];
    1175:       8b 45 fc                mov    -0x4(%rbp),%eax
    1178:       48 98                   cltq
    117a:       48 8d 14 85 00 00 00    lea    0x0(,%rax,4),%rdx
    1181:       00 
    1182:       48 8b 45 e8             mov    -0x18(%rbp),%rax
    1186:       48 01 d0                add    %rdx,%rax
    1189:       f3 0f 10 08             movss  (%rax),%xmm1
    118d:       8b 45 fc                mov    -0x4(%rbp),%eax
    1190:       48 98                   cltq
    1192:       48 8d 14 85 00 00 00    lea    0x0(,%rax,4),%rdx
    1199:       00 
    119a:       48 8b 45 e0             mov    -0x20(%rbp),%rax
    119e:       48 01 d0                add    %rdx,%rax
    11a1:       f3 0f 10 00             movss  (%rax),%xmm0
    11a5:       8b 45 fc                mov    -0x4(%rbp),%eax
    11a8:       48 98                   cltq
    11aa:       48 8d 14 85 00 00 00    lea    0x0(,%rax,4),%rdx
    11b1:       00 
    11b2:       48 8b 45 d8             mov    -0x28(%rbp),%rax
    11b6:       48 01 d0                add    %rdx,%rax
    11b9:       f3 0f 58 c1             addss  %xmm1,%xmm0
    11bd:       f3 0f 11 00             movss  %xmm0,(%rax)
  for (int i = 0; i < n; i ++) {
    11c1:       83 45 fc 01             addl   $0x1,-0x4(%rbp)
    11c5:       8b 45 fc                mov    -0x4(%rbp),%eax
    11c8:       3b 45 d4                cmp    -0x2c(%rbp),%eax
    11cb:       7c a8                   jl     1175 <add_arrays+0x1c>
  }
}

movss and addss are indeed SIMD instructions, but the ss suffix means they only operate on a single scalar value at a time.

Now, if we turn the optimisation up, you'll notice that the SIMD instructions start working on packed values:

    result[i] = a[i] + b[i];
    1260:       0f 10 04 07             movups (%rdi,%rax,1),%xmm0
    1264:       0f 10 14 06             movups (%rsi,%rax,1),%xmm2
    1268:       0f 58 c2                addps  %xmm2,%xmm0
    126b:       0f 11 04 02             movups %xmm0,(%rdx,%rax,1)

Packed instructions like addps add 4 floats at once in a 128-bit xmm register; the AVX and AVX-512 equivalents (vaddps on ymm and zmm registers) handle 8 or 16.
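
You can also help the compiler along. The sketch below uses restrict (standard C99) to promise that the arrays don't overlap, which removes a common obstacle to vectorisation; the flags shown afterwards are gcc-specific, with -march=native targeting your own CPU and -fopt-info-vec reporting which loops were vectorised.

// restrict tells the compiler the arrays don't alias each other,
// so it is free to reorder and vectorise the loads and stores
void add_arrays(const float* restrict a, const float* restrict b,
                float* restrict result, int n) {
    for (int i = 0; i < n; i++) {
        result[i] = a[i] + b[i];
    }
}

Compile with something like:

gcc -O3 -march=native -fopt-info-vec test.c -o test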

Assembly

If you really feel the need to get that extra bit of power, you can crack out the assembly language yourself and have a go.

movaps xmm0, [a]      ; Load 4 floats from a
movaps xmm1, [b]      ; Load 4 floats from b
addps  xmm0, xmm1     ; Add packed floats
movaps [result], xmm0 ; Store the result

For the work that it’s doing, this is very tidy code.
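
If you'd rather keep that assembly inside a C file, one option is GCC's extended inline assembly. Here's a sketch under that assumption (note it uses AT&T operand order, and unaligned moves to sidestep alignment requirements):

// add 4 floats from a and b into result using inline assembly (GCC, AT&T syntax)
void add4(const float* a, const float* b, float* result) {
    __asm__ volatile (
        "movups (%0), %%xmm0\n\t"    // load 4 floats from a
        "movups (%1), %%xmm1\n\t"    // load 4 floats from b
        "addps  %%xmm1, %%xmm0\n\t"  // packed add
        "movups %%xmm0, (%2)\n\t"    // store the result
        :
        : "r"(a), "r"(b), "r"(result)
        : "xmm0", "xmm1", "memory"
    );
}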

High Level Libraries

Finally, there are a number of high level libraries that industrialise the usage of SIMD instructions really well. Using them makes these operations much easier to write!

  • Eigen (C++): Matrix and vector math.
  • NumPy (Python): Uses SIMD internally via BLAS.
  • OpenCV (C++): SIMD-optimized image processing.

Challenges with SIMD

Branching can be an issue: every lane of a vector executes the same instruction, so SIMD struggles with divergent execution paths (e.g., if statements).
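
The usual workaround is to compute both sides and select per lane with a mask instead of branching. A minimal sketch with AVX intrinsics (the function name is just for illustration):

#include <immintrin.h>

// branchless equivalent of: out[i] = (a[i] > 0.0f) ? a[i] : b[i]
__m256 select_positive(__m256 a, __m256 b) {
    __m256 mask = _mm256_cmp_ps(a, _mm256_setzero_ps(), _CMP_GT_OQ); // per-lane a > 0
    return _mm256_blendv_ps(b, a, mask); // take a where the mask is set, b elsewhere
}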

The alignment requirements are quite strict for optimal performance. SIMD often requires data to be aligned to specific byte boundaries (e.g., 16 bytes for SSE, 32 bytes for AVX).
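
In practice that means either allocating aligned memory up front or falling back to the unaligned load/store variants. A small sketch using C11's aligned_alloc (note that the size must be a multiple of the alignment):

#include <immintrin.h>
#include <stdlib.h>

int main() {
    // 32-byte alignment satisfies AVX's _mm256_load_ps / _mm256_store_ps
    float* data = aligned_alloc(32, 8 * sizeof(float));
    if (!data) return 1;

    for (int i = 0; i < 8; i++) {
        data[i] = (float)i;
    }

    __m256 v = _mm256_load_ps(data);             // aligned load: safe here
    _mm256_store_ps(data, _mm256_add_ps(v, v));  // doubles every element

    free(data);
    return 0;
}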

SIMD processes a fixed number of elements per operation, determined by the vector register width, so arrays whose length isn't a multiple of that width need extra scalar (or masked) handling for the remainder.

Code written with specific intrinsics or assembly may not run on CPUs with different SIMD instruction sets. So, if you're not using one of those higher level libraries, portability can be an issue.
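
One common mitigation is runtime feature detection, so a single binary can pick a code path for the CPU it lands on. A minimal sketch using the __builtin_cpu_supports builtin (GCC/Clang, x86 only):

#include <stdio.h>

int main() {
    // __builtin_cpu_supports queries the running CPU's feature flags
    if (__builtin_cpu_supports("avx2")) {
        printf("AVX2 available: take the AVX2 code path\n");
    } else {
        printf("AVX2 not available: fall back to SSE or scalar code\n");
    }
    return 0;
}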

Conclusion

SIMD is a powerful tool for optimizing performance in data-parallel applications, allowing modern CPUs and GPUs to handle repetitive tasks more efficiently. By leveraging intrinsics, compiler optimizations, or high-level libraries, developers can unlock significant performance gains with relatively little effort.

However, like any optimization, SIMD has its challenges, such as branching, memory alignment, and portability. Understanding these limitations and balancing them with the benefits is key to effectively integrating SIMD into your projects.

Whether you’re working on scientific simulations, multimedia processing, or machine learning, SIMD offers a compelling way to accelerate your computations. Start small, experiment with intrinsics or auto-vectorization, and explore the high-level libraries to see how SIMD can transform your application’s performance.