A low-level C11 machine learning framework built around a deterministic Memory Arena, explicit Tensor primitives, a compact Autograd engine, optimized GEMM backends, and an end-to-end quantization stack for edge-oriented workloads.
This repository focuses on correctness, reproducibility, and performance in a compact systems-style implementation.
- Arena-based memory management with checkpoint/restore semantics.
- Dense Tensor core with shape/stride metadata and guarded allocation paths.
- Reverse-topological Autograd traversal via parent links and operator callbacks.
- Runtime-dispatched matrix multiplication with SCALAR, AVX2/FMA, NEON, and optional CBLAS paths.
- Blocked/Packed GEMM with tunable tile configuration and optional OpenMP parallel execution.
- INT8 quantization features:
  - Per-tensor quantization
  - Per-channel quantization
  - Grouped activation calibration
  - Fully integer INT8 x INT8 -> INT32 -> INT8 requantization
- FP16 conversion utilities for compact representation.
- Versioned checksummed binary I/O with strict metadata and integrity validation.
- Broad validation, stability, fuzz/property, and benchmark tests.
include/
  ml_core.h       # Arena and Tensor definitions
  ml_autograd.h   # Backward pass API
  ml_math.h       # GEMM, backend dispatch, quantization, FP16 utilities
  ml_ops.h        # Differentiable ops (add/mul/matmul)
  ml_nn.h         # Activations and losses
  ml_layer.h      # Linear and quantized linear layers
  ml_optim.h      # SGD and gradient reset
  ml_io.h         # Weight serialization/deserialization
src/
  ml_core.c
  ml_autograd.c
  ml_math.c
  ml_ops.c
  ml_nn.c
  ml_layer.c
  ml_optim.c
  ml_io.c
test/
  test_step1.c
  test_step2.c
  test_step3_4.c
  test_step5_to_7.c
  test_step8_to_10.c
  test_validation_hardening.c
  test_io_hardening.c
  test_nn_stability.c
  test_quantization_backend.c
  test_int8_linear_inference.c
  test_fuzz_quant_io.c
  test_benchmark_limits.c
The runtime is built on a single-owner Memory Arena:
- arena_init initializes a fixed backing buffer.
- arena_alloc performs aligned bump allocation.
- arena_checkpoint and arena_restore allow scoped temporary allocations.
- arena_reset resets all transient state in O(1).
This model avoids frequent heap operations and enables deterministic allocation behavior.
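The arena operations listed above can be illustrated with a minimal sketch. The `SketchArena` type and every `sketch_*` function are hypothetical names written for this example; the real definitions and signatures live in ml_core.h and may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical arena layout, not the struct from ml_core.h. */
typedef struct {
    uint8_t *base;  /* caller-owned backing buffer */
    size_t   cap;   /* capacity in bytes */
    size_t   off;   /* current bump offset, always <= cap */
} SketchArena;

void sketch_arena_init(SketchArena *a, void *buf, size_t cap) {
    a->base = (uint8_t *)buf;
    a->cap  = cap;
    a->off  = 0;
}

/* Aligned bump allocation. Returns NULL when the request does not fit. */
void *sketch_arena_alloc(SketchArena *a, size_t size, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0); /* power of two */
    uintptr_t cur     = (uintptr_t)a->base + a->off;
    uintptr_t aligned = (cur + (align - 1)) & ~(uintptr_t)(align - 1);
    size_t    pad     = (size_t)(aligned - cur);
    size_t    room    = a->cap - a->off;
    if (pad > room || size > room - pad) return NULL;
    a->off += pad + size;
    return (void *)aligned;
}

/* A checkpoint is just the current offset. */
size_t sketch_arena_checkpoint(const SketchArena *a) { return a->off; }

/* Restore discards everything allocated after the checkpoint. */
void sketch_arena_restore(SketchArena *a, size_t mark) {
    if (mark <= a->off) a->off = mark;
}

/* O(1) reset: no per-allocation bookkeeping to walk. */
void sketch_arena_reset(SketchArena *a) { a->off = 0; }
```

Because a checkpoint is a single offset, scoped temporaries cost one store to create and one store to discard, which is what makes the allocation pattern deterministic.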
A Tensor stores:
- float* data
- float* grad
- shape[], strides[], ndim, size
- Autograd fields (requires_grad, parents, backward, visited)
Tensor creation validates dimensionality and guards overflow-sensitive size computations.
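As an illustration of overflow-guarded size computation, the sketch below derives row-major strides and the element count from a shape. The function name, the `SKETCH_MAX_DIMS` limit, and the decision to reject zero-sized dimensions are assumptions made for this example, not the behavior of tensor_create.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_DIMS 8 /* assumed limit, not taken from ml_core.h */

static_assert(sizeof(float) == 4, "sketch assumes 32-bit floats");

/* Fills row-major strides (in elements) and the total element count.
 * Returns 0 on success, -1 on an invalid shape or arithmetic overflow. */
int sketch_shape_to_strides(const size_t *shape, int ndim,
                            size_t *strides, size_t *out_size) {
    if (shape == NULL || strides == NULL || out_size == NULL) return -1;
    if (ndim <= 0 || ndim > SKETCH_MAX_DIMS) return -1;

    size_t total = 1;
    for (int i = ndim - 1; i >= 0; --i) {
        if (shape[i] == 0) return -1;               /* assumption: no empty dims */
        strides[i] = total;
        if (total > SIZE_MAX / shape[i]) return -1; /* total * shape[i] would wrap */
        total *= shape[i];
    }
    /* The byte size (total * sizeof(float)) must also be representable. */
    if (total > SIZE_MAX / sizeof(float)) return -1;

    *out_size = total;
    return 0;
}
```

For a shape of {2, 3, 4} this yields strides {12, 4, 1} and a size of 24; a shape whose product exceeds SIZE_MAX is rejected before any multiplication can wrap.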
tensor_backward executes reverse traversal over a topologically sorted node list and invokes registered backward callbacks.
Implemented operator-level gradient propagation includes:
- op_add with broadcast-aware gradient accumulation
- op_mul
- op_matmul
- op_relu
- op_sigmoid
- op_softmax
- loss_mse
- loss_crossentropy
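The traversal described above (parent links, a visited flag, and per-operator backward callbacks) can be sketched on scalar nodes. Everything below is a simplified, hypothetical model: `SketchNode` holds a single float instead of a tensor, and the fixed `SKETCH_MAX_NODES` bound is an assumption of the example.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_NODES 64 /* assumed upper bound on graph size */

typedef struct SketchNode SketchNode;
struct SketchNode {
    float value, grad;
    SketchNode *parents[2];
    int n_parents;
    void (*backward)(SketchNode *self); /* NULL for leaves */
    int visited;
};

void sketch_add_backward(SketchNode *n) {
    n->parents[0]->grad += n->grad;
    n->parents[1]->grad += n->grad;
}

void sketch_mul_backward(SketchNode *n) {
    n->parents[0]->grad += n->grad * n->parents[1]->value;
    n->parents[1]->grad += n->grad * n->parents[0]->value;
}

SketchNode sketch_leaf(float v) {
    SketchNode n = { .value = v };
    return n;
}

SketchNode sketch_add(SketchNode *a, SketchNode *b) {
    SketchNode n = { .value = a->value + b->value, .parents = { a, b },
                     .n_parents = 2, .backward = sketch_add_backward };
    return n;
}

SketchNode sketch_mul(SketchNode *a, SketchNode *b) {
    SketchNode n = { .value = a->value * b->value, .parents = { a, b },
                     .n_parents = 2, .backward = sketch_mul_backward };
    return n;
}

/* Depth-first post-order: every parent is appended before its consumer. */
void sketch_topo(SketchNode *n, SketchNode **order, int *count) {
    if (n->visited) return;
    n->visited = 1;
    for (int i = 0; i < n->n_parents; ++i)
        sketch_topo(n->parents[i], order, count);
    assert(*count < SKETCH_MAX_NODES);
    order[(*count)++] = n;
}

/* Seeds d(root)/d(root) = 1 and walks the sorted list in reverse. */
void sketch_backward(SketchNode *root) {
    SketchNode *order[SKETCH_MAX_NODES];
    int count = 0;
    sketch_topo(root, order, &count);
    root->grad = 1.0f;
    for (int i = count - 1; i >= 0; --i) {
        if (order[i]->backward) order[i]->backward(order[i]);
        order[i]->visited = 0; /* clear the flag so the graph can be reused */
    }
}
```

Walking the list in reverse guarantees that a node's gradient is fully accumulated before it is propagated to its parents, which matters when a value is used more than once (as `x` is in `x * y + x`).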
tensor_matmul_simd selects backend implementations at runtime.
Supported backend identifiers:
- ML_MATMUL_BACKEND_SCALAR
- ML_MATMUL_BACKEND_AVX2_FMA
- ML_MATMUL_BACKEND_NEON
- ML_MATMUL_BACKEND_CBLAS (when enabled)
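One plausible way to resolve a backend at runtime is sketched below. It reads the ML_MATMUL_BACKEND environment variable and falls back to the scalar path when the requested backend was not compiled in. The `SketchBackend` enum and `sketch_*` functions are hypothetical, the treatment of `avx2` as an alias of `avx2_fma` is an assumption, and availability is approximated with compile-time macros where a real dispatcher would more likely probe CPU features.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef enum {
    SKETCH_BACKEND_SCALAR = 0,
    SKETCH_BACKEND_AVX2_FMA,
    SKETCH_BACKEND_NEON,
    SKETCH_BACKEND_CBLAS,
    SKETCH_BACKEND_COUNT
} SketchBackend;

static const char *const sketch_names[SKETCH_BACKEND_COUNT] = {
    "scalar", "avx2_fma", "neon", "cblas"
};

const char *sketch_backend_name(SketchBackend b) {
    assert((int)b >= 0 && (int)b < SKETCH_BACKEND_COUNT);
    return sketch_names[b];
}

/* Unknown or missing names map to the scalar backend. */
SketchBackend sketch_parse_backend(const char *name) {
    if (name == NULL) return SKETCH_BACKEND_SCALAR;
    if (strcmp(name, "avx2") == 0) return SKETCH_BACKEND_AVX2_FMA; /* assumed alias */
    for (int i = 0; i < SKETCH_BACKEND_COUNT; ++i)
        if (strcmp(name, sketch_names[i]) == 0) return (SketchBackend)i;
    return SKETCH_BACKEND_SCALAR;
}

/* Compile-time approximation of "is this backend usable in this build". */
int sketch_backend_available(SketchBackend b) {
    switch (b) {
    case SKETCH_BACKEND_SCALAR:
        return 1;
    case SKETCH_BACKEND_AVX2_FMA:
#if defined(__AVX2__) && defined(__FMA__)
        return 1;
#else
        return 0;
#endif
    case SKETCH_BACKEND_NEON:
#if defined(__ARM_NEON)
        return 1;
#else
        return 0;
#endif
    case SKETCH_BACKEND_CBLAS:
#if defined(ML_USE_CBLAS)
        return 1;
#else
        return 0;
#endif
    default:
        return 0;
    }
}

/* Requested backend if usable, otherwise the always-available scalar path. */
SketchBackend sketch_select_backend(void) {
    SketchBackend want = sketch_parse_backend(getenv("ML_MATMUL_BACKEND"));
    return sketch_backend_available(want) ? want : SKETCH_BACKEND_SCALAR;
}
```

Falling back to the scalar path instead of failing keeps a binary usable when it is moved to a machine without the requested instruction set.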
Additional compute features:
- Cache-blocked GEMM (tensor_matmul_blocked)
- Packed panel strategy
- Microkernel dispatch (including 8x8 paths)
- Configurable tiling through MlGemmConfig, ml_gemm_autotune, and thread controls
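To show what cache blocking means in practice, here is a minimal tiled GEMM sketch over row-major, contiguous matrices. It omits panel packing, microkernels, and threading entirely. `SketchGemmTiles` is a hypothetical stand-in for a tile configuration and is not the layout of MlGemmConfig.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tile sizes along M, N and K. */
typedef struct { size_t bm, bn, bk; } SketchGemmTiles;

static size_t sketch_min(size_t a, size_t b) { return a < b ? a : b; }

/* C[M x N] += A[M x K] * B[K x N]; the caller zeroes C for a plain product. */
void sketch_gemm_blocked(const float *A, const float *B, float *C,
                         size_t M, size_t N, size_t K, SketchGemmTiles t) {
    assert(t.bm > 0 && t.bn > 0 && t.bk > 0);
    for (size_t i0 = 0; i0 < M; i0 += t.bm) {
        size_t i1 = sketch_min(i0 + t.bm, M);
        for (size_t k0 = 0; k0 < K; k0 += t.bk) {
            size_t k1 = sketch_min(k0 + t.bk, K);
            for (size_t j0 = 0; j0 < N; j0 += t.bn) {
                size_t j1 = sketch_min(j0 + t.bn, N);
                /* Inner loops touch one (bm x bk) tile of A and one
                 * (bk x bn) tile of B, which can stay cache-resident. */
                for (size_t i = i0; i < i1; ++i) {
                    for (size_t k = k0; k < k1; ++k) {
                        float a = A[i * K + k];
                        for (size_t j = j0; j < j1; ++j)
                            C[i * N + j] += a * B[k * N + j];
                    }
                }
            }
        }
    }
}
```

The `sketch_min` clamps handle matrices whose dimensions are not multiples of the tile sizes, so any positive tile configuration produces the same result as the naive triple loop (up to floating-point summation order).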
The project provides a full quantization surface:
- Per-tensor INT8 parameter generation (tensor_calc_qparams_i8)
- Per-channel INT8 parameter generation (tensor_calc_qparams_i8_per_channel)
- Activation calibration (global and grouped percentile)
- De/quantization APIs for INT8 tensors
- Integer kernel path: tensor_matmul_i8i8_i8pc
- FP16 conversion and tensor helpers:
  - ml_float_to_fp16
  - ml_fp16_to_float
  - tensor_quantize_fp16
  - tensor_dequantize_fp16
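For readers new to INT8 quantization, the following sketch shows one common way to derive per-tensor parameters: an asymmetric affine mapping from the observed min/max range onto [-128, 127]. It is an assumption that the library uses this scheme; the `SketchQParams` type and `sketch_*` functions are hypothetical and are not the signatures of tensor_calc_qparams_i8 or tensor_quantize_i8. Inputs are assumed to be finite.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { float scale; int32_t zero_point; } SketchQParams;

/* Rounds half away from zero and clamps to the INT8 range. */
int32_t sketch_round_clamp_i8(float r) {
    assert(r == r); /* reject NaN */
    if (r <= -128.0f) return -128;
    if (r >= 127.0f) return 127;
    return (int32_t)(r >= 0.0f ? r + 0.5f : r - 0.5f);
}

/* real = scale * (q - zero_point). The range is widened to include 0.0
 * so that zero is exactly representable after quantization. */
SketchQParams sketch_calc_qparams_i8(const float *x, size_t n) {
    assert(x != NULL && n > 0);
    float lo = 0.0f, hi = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        if (x[i] < lo) lo = x[i];
        if (x[i] > hi) hi = x[i];
    }
    SketchQParams q;
    q.scale = (hi - lo) / 255.0f;
    if (!(q.scale > 0.0f)) q.scale = 1.0f; /* all-zero input */
    q.zero_point = sketch_round_clamp_i8(-128.0f - lo / q.scale);
    return q;
}

int8_t sketch_quantize_i8(float v, SketchQParams q) {
    return (int8_t)sketch_round_clamp_i8(v / q.scale + (float)q.zero_point);
}

float sketch_dequantize_i8(int8_t v, SketchQParams q) {
    return q.scale * (float)((int32_t)v - q.zero_point);
}
```

A per-channel variant would run the same computation once per output channel, giving each channel its own scale; values outside the calibrated range saturate at -128 or 127.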
QuantizedLinearLayer provides native INT8 linear inference.
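The INT8 x INT8 -> INT32 -> INT8 flow can be sketched as an integer dot product followed by fixed-point requantization. This is an illustrative model only: it assumes symmetric weights (weight zero point of 0), one Q31 multiplier per output channel, and round-half-away-from-zero. None of the `sketch_*` names or parameters come from the library, and the actual kernel may use a different fixed-point scheme.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Converts a real multiplier in (0, 1) to Q31 fixed point. This is the only
 * floating-point step and happens once at setup time, not in the kernel. */
int32_t sketch_mult_to_q31(double m) {
    assert(m > 0.0 && m < 1.0);
    int64_t q = (int64_t)(m * 2147483648.0 + 0.5);
    if (q > INT32_MAX) q = INT32_MAX;
    return (int32_t)q;
}

/* Scales an INT32 accumulator by mult_q31 / 2^31, adds the output zero
 * point, and saturates to INT8. Uses integer arithmetic only. */
int8_t sketch_requantize(int32_t acc, int32_t mult_q31, int32_t zp_out) {
    const int64_t one  = (int64_t)1 << 31;
    const int64_t half = (int64_t)1 << 30;
    int64_t prod = (int64_t)acc * (int64_t)mult_q31;
    /* Handle the sign explicitly to avoid shifting negative values. */
    int64_t scaled = prod >= 0 ? (prod + half) / one : -((-prod + half) / one);
    int64_t out = scaled + zp_out;
    if (out < -128) out = -128;
    if (out > 127) out = 127;
    return (int8_t)out;
}

/* y[c] = requant(sum_k (x[k] - x_zp) * w[c][k]); w is row-major out_ch x in_ch. */
void sketch_linear_i8(const int8_t *x, int32_t x_zp,
                      const int8_t *w, const int32_t *mult_q31,
                      int32_t y_zp, int8_t *y,
                      size_t out_ch, size_t in_ch) {
    for (size_t c = 0; c < out_ch; ++c) {
        int32_t acc = 0; /* INT32 accumulator */
        for (size_t k = 0; k < in_ch; ++k)
            acc += ((int32_t)x[k] - x_zp) * (int32_t)w[c * in_ch + k];
        y[c] = sketch_requantize(acc, mult_q31[c], y_zp);
    }
}
```

In this model each channel's multiplier would be computed as (input scale x weight scale for that channel) / output scale, which is how per-channel weight scales fold into a single integer rescale step.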
io_save_weights and io_load_weights implement a versioned binary format with:
- Fixed header metadata
- Tensor descriptor validation
- Header checksum validation
- Data checksum validation
- Corruption rejection and strict mismatch handling
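The validation order listed above can be sketched with a fixed header carrying two checksums. The header layout, magic value, version number, error codes, and the choice of CRC-32 (reflected polynomial 0xEDB88320) are all assumptions made for this example; the source does not state which checksum or layout io_save_weights uses. The sketch checksums the in-memory struct directly, whereas a portable format would serialize each field with a fixed byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAGIC   0x4D4C5731u /* hypothetical magic value */
#define SKETCH_VERSION 1u          /* hypothetical format version */

enum { SKETCH_OK = 0, SKETCH_ERR_HEADER_CRC, SKETCH_ERR_MAGIC,
       SKETCH_ERR_VERSION, SKETCH_ERR_DATA_CRC };

typedef struct {
    uint32_t magic;
    uint32_t version;
    uint32_t tensor_count;
    uint32_t data_crc;   /* checksum of the payload bytes */
    uint32_t header_crc; /* checksum of all fields above */
} SketchHeader;

static_assert(sizeof(SketchHeader) == 20, "sketch assumes no struct padding");

/* Bitwise CRC-32, one bit per inner iteration. */
uint32_t sketch_crc32(const void *data, size_t len) {
    const uint8_t *p = (const uint8_t *)data;
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; ++i) {
        crc ^= p[i];
        for (int b = 0; b < 8; ++b)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Fills both checksums; call after the other header fields are final. */
void sketch_header_seal(SketchHeader *h, const void *payload, size_t len) {
    h->data_crc   = sketch_crc32(payload, len);
    h->header_crc = sketch_crc32(h, offsetof(SketchHeader, header_crc));
}

/* Checks the header checksum first so that corrupted metadata is never
 * trusted, then the metadata itself, then the payload checksum. */
int sketch_header_check(const SketchHeader *h, const void *payload, size_t len) {
    if (sketch_crc32(h, offsetof(SketchHeader, header_crc)) != h->header_crc)
        return SKETCH_ERR_HEADER_CRC;
    if (h->magic != SKETCH_MAGIC) return SKETCH_ERR_MAGIC;
    if (h->version != SKETCH_VERSION) return SKETCH_ERR_VERSION;
    if (sketch_crc32(payload, len) != h->data_crc) return SKETCH_ERR_DATA_CRC;
    return SKETCH_OK;
}
```

Separating the header checksum from the data checksum lets a loader reject a damaged header before it allocates memory based on the tensor descriptors.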
- C11-compatible compiler (GCC or Clang)
- Math library (-lm)
- Optional:
  - OpenMP (-fopenmp)
  - CBLAS (-DML_USE_CBLAS + BLAS link flags)
Compile one test target:
gcc -O3 -std=c11 -Wall -Wextra -Iinclude src/*.c test/test_step2.c -lm -o test_step2
Compile with OpenMP (optional):
gcc -O3 -std=c11 -Wall -Wextra -fopenmp -Iinclude src/*.c test/test_benchmark_limits.c -lm -o test_benchmark_limits
Compile with CBLAS (optional; platform-specific link flags may differ):
gcc -O3 -std=c11 -Wall -Wextra -DML_USE_CBLAS -Iinclude src/*.c test/test_quantization_backend.c -lblas -lm -o test_quantization_backend
Representative execution flow:
./test_step1
./test_step2
./test_step3_4
./test_step5_to_7
./test_step8_to_10
./test_validation_hardening
./test_io_hardening
./test_nn_stability
./test_quantization_backend
./test_int8_linear_inference
./test_fuzz_quant_io
On Windows, generated executables typically use the .exe suffix.
Benchmark executable:
./test_benchmark_limits
Key runtime variables:
- ML_MATMUL_BACKEND=scalar|avx2|avx2_fma|neon|cblas
- ML_GEMM_AUTOTUNE=1
- ML_GEMM_BM
- ML_GEMM_BN
- ML_GEMM_BK
- ML_GEMM_APACK_THRESHOLD
- ML_GEMM_NUM_THREADS
Backend introspection APIs:
- ml_matmul_last_backend()
- ml_matmul_backend_name()
- Step tests validate baseline arena, tensor, math, and training flow.
- Validation hardening tests verify invalid input rejection and non-finite update guards.
- I/O hardening tests verify corruption detection and shape consistency checks.
- NN stability tests stress Softmax and Cross-Entropy numerical behavior.
- Quantization/backend tests verify parity and error thresholds across quantized paths.
- Fuzz/property tests exercise randomized quantization and serialization invariants.
- Benchmark tests report throughput and latency observations and emit CSV-style metrics.
- tensor_create, tensor_set_requires_grad, tensor_backward, optim_sgd_step, optim_zero_grad
- tensor_add, tensor_matmul, tensor_matmul_blocked, tensor_matmul_simd, op_add, op_mul, op_matmul
- op_relu, op_sigmoid, op_softmax, loss_mse, loss_crossentropy
- layer_linear_create, layer_linear_forward, layer_linear_quantize, layer_linear_forward_int8
- tensor_calc_qparams_i8, tensor_quantize_i8, tensor_dequantize_i8, tensor_calc_qparams_i8_per_channel, tensor_quantize_i8_per_channel, tensor_calibrate_activation_qparams_i8, tensor_calibrate_activation_qparams_i8_grouped, tensor_matmul_int8, tensor_matmul_int8_per_channel, tensor_matmul_i8i8_i8pc, tensor_quantize_fp16, tensor_dequantize_fp16
- io_save_weights, io_load_weights
This repository currently targets dense tensor operations and related training/inference primitives.
Not in scope (at present):
- Convolution and pooling operator families
- Graph compilation/fusion passes
- Distributed training
- Multi-format model import/export ecosystem
- Full package/distribution pipeline
- The project is intentionally designed for explicit control over memory, numerics, and compute paths.
- Safety checks are integrated into allocation, validation, serialization, and optimizer flows.
- The implementation favors predictable behavior and direct inspectability over abstraction-heavy runtime layers.
Recommended contribution areas:
- Additional operator coverage
- Extended backend kernels
- Cross-platform CI and benchmark automation
- Expanded calibration algorithms
- Documentation and usage examples
The following references were used to guide implementation details and technical decisions:
- C language standard library references: https://en.cppreference.com/w/c
- GCC compiler options and target tuning: https://gcc.gnu.org/onlinedocs/
- Clang compiler documentation: https://clang.llvm.org/docs/
- Intel x86 Intrinsics Guide (AVX2/FMA): https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html
- Arm NEON intrinsics reference: https://developer.arm.com/architectures/instruction-sets/intrinsics/
- OpenMP specification: https://www.openmp.org/specifications/
- CBLAS reference (Netlib BLAS): https://www.netlib.org/blas/
- Valgrind user manual (Memcheck): https://valgrind.org/docs/manual/manual.html
- IEEE 754 floating-point standard overview: https://ieeexplore.ieee.org/document/8766229
- CRC background and polynomial references: https://reveng.sourceforge.io/crc-catalogue/all.htm