A high-level C library providing SIMD-accelerated functions and safe data structures for the ESP32-S3 microcontroller.
The xTensa LX7 core on the ESP32-S3 includes custom SIMD instructions, but they are not emitted by the compiler and can only be accessed through inline assembly. This makes them hard to use safely, since they come with strict alignment requirements, unusual saturation semantics, and the usual type and memory safety pitfalls of handwritten assembly. esp_simd wraps these instructions in a safe, high-level API that abstracts away those concerns.
- High-level vector API with SIMD acceleration
- Hand-written branchless ASM functions using zero-overhead loops
- Up to 30× faster performance on certain tasks
- Type-safe handling of aligned data structures
- Currently supports signed integers (
int8,int16,int32) and 32-bit float types, with unsigned types planned - Future support: matrix and tensor data structures
We benchmarked a set of vector math operations (add, sub, mul, dot product, sum, and bitwise ops) across int8, int16, int32, and float32 data types. Runtimes were aggregated from 32 runs on random vectors of length 1–256.
SIMD (vectorized) execution consistently outperforms scalar execution, with speedups ranging from ~2× to nearly 10× depending on the operation and data type.
- Integer arithmetic (especially 32-bit add/sub) shows the largest gains (≈9–10×).
- Floating-point operations achieve more modest improvements (≈1.5–4×).
- Memory-bound ops (copy/fill/zero/ones) also benefit, though with smaller absolute times.
| Operation | DType | SIMD (μs) | Scalar (μs) | Speedup (Scalar ÷ SIMD) |
|---|---|---|---|---|
| Saturated Add | INT8 | 199 | 1227 | 6.2× |
| INT16 | 185 | 1214 | 6.6× | |
| INT32 | 193 | 1864 | 9.7× | |
| FLOAT32 | 267 | 957 | 3.6× | |
| Saturated Sub | INT8 | 192 | 1134 | 5.9× |
| INT32 | 184 | 1666 | 9.1× | |
| FLOAT32 | 262 | 890 | 3.4× | |
| Mul + shift | INT8 | 193 | 1167 | 6.0× |
| INT32 | 588 | 2275 | 3.9× | |
| FLOAT32 | 279 | 1057 | 3.8× | |
| Dot Product | INT8 | 186 | 923 | 5.0× |
| INT32 | 404 | 815 | 2.0× | |
| FLOAT32 | 367 | 583 | 1.6× | |
| Sum | INT8 | 147 | 662 | 4.5× |
| INT32 | 159 | 1163 | 7.3× | |
| FLOAT32 | 267 | 595 | 2.2× | |
| Bitwise AND | INT8 | 186 | 820 | 4.4× |
| INT32 | 207 | 864 | 4.2× | |
| FLOAT32 | 203 | 861 | 4.2× |
Using esp-idf:
Clone the repository into your project’s components directory.
If your project does not have components directory, create the folder on the top level directory, (i.e same level as 'main' and 'build').
e.g. project-name/components
git clone https://github.com/zliu43/esp_simd.git components/esp_simdAdd esp_simd to the REQUIRES field in the CMakeLists.txt in 'main': e.g. project-name/main/CMakeLists.txt
idf_component_register(SRCS "hello_world_main.c"
PRIV_REQUIRES spi_flash esp_simd
INCLUDE_DIRS "")Using Arduino: Go to https://github.com/zliu43/esp_simd and download the library as a zip folder.
Go to Sketch > Include Library > Add .ZIP Library...
#include <stdint.h>
#include <math.h>
#include "vector.h"
#define VEC_LENGTH 512
#define VEC_LSHIFT 9
/**
* Calculates the mean value, mean power, and standard deviation of a 512 length vector of int16_ts,
* representing the output of an accelerometer, for use for downstream ML classification
*/
// Initialize vectors
vector_t* accelerometer_x_data = vector_create(VEC_LENGTH, DTYPE_INT16);
vector_t* accelerometer_y_data = vector_create(VEC_LENGTH, DTYPE_INT16);
vector_t* accelerometer_z_data = vector_create(VEC_LENGTH, DTYPE_INT16);
// Check vector validity (memory allocation, alignment)
assert(vector_ok(accelerometer_x_data) == VECTOR_SUCCESS);
assert(vector_ok(accelerometer_y_data) == VECTOR_SUCCESS);
assert(vector_ok(accelerometer_z_data) == VECTOR_SUCCESS);
/*
data acquisition
*/
// Computes averages by calculating the sum and performing division using right shift
int32_t average_x;
int32_t average_y;
int32_t average_z;
vec_sum(accelerometer_x_data, &average_x);
vec_sum(accelerometer_y_data, &average_y);
vec_sum(accelerometer_z_data, &average_z);
average_x = average_x >> VEC_LSHIFT;
average_y = average_y >> VEC_LSHIFT;
average_z = average_z >> VEC_LSHIFT;
// Calculates energy by squaring the readings
vector_t* x_energy = vector_create(VEC_LENGTH, DTYPE_INT32); // DTYPE_INT32 otherwise overflow
assert(vector_ok(x_energy) == VECTOR_SUCCESS);
vec_mul_widen(accelerometer_x_data, accelerometer_x_data, x_energy); // Mul-widen for int16_t * int16_t -> int32_t
int32_t average_x_energy;
vec_sum(x_energy, &average_x_energy);
average_x_energy = average_x_energy >> VEC_LSHIFT;
// Calculates standard deviation by calculating variance, then using dotp to calculating sum of squares
int32_t sd_x;
vector_t* standard_dev_x = vector_create(VEC_LENGTH, DTYPE_INT16);
assert(vector_ok(standard_dev_x) == VECTOR_SUCCESS);
vec_copy(accelerometer_x_data, standard_dev_x);
vec_add_scalar(standard_dev_x, -average_x);
vec_dotp(standard_dev_x, standard_dev_x, &sd_x);
sd_x = sd_x >> VEC_LSHIFT;
sd_x = (int32_t)(sqrt((float)(sd_x)));
/*
computation for y and z data ...
*/
vector_destroy(accelerometer_x_data);
vector_destroy(standard_dev_x);
vector_destroy(x_energy);
int32_t features[] = {average_x, average_y, average_z, average_x_energy, average_y_energy, average_z_energy, sd_x, sd_y, sd_z};
/*
downstream inference
*/ - ESP32-S3 microcontroller
- Vector struct with SIMD acceleration
- Matrix struct
- Tensor struct
- Support for unsigned integer data types
Contributions are welcome! Feel free to submit issues or pull requests to improve functionality, documentation, or performance.
This project is licensed under the MIT License.