inventixcity/ML-Framework

ML Frame

A low-level C11 machine learning framework built around a deterministic Memory Arena, explicit Tensor primitives, a compact Autograd engine, optimized GEMM backends, and an end-to-end quantization stack for edge-oriented workloads.

This repository focuses on correctness, reproducibility, and performance in a compact systems-style implementation.

Technical Highlights

  • Arena-based memory management with checkpoint/restore semantics.
  • Dense Tensor core with shape/stride metadata and guarded allocation paths.
  • Reverse-topological Autograd traversal via parent links and operator callbacks.
  • Runtime-dispatched matrix multiplication with SCALAR, AVX2/FMA, NEON, and optional CBLAS paths.
  • Blocked/Packed GEMM with tunable tile configuration and optional OpenMP parallel execution.
  • INT8 quantization features:
    • Per-tensor quantization
    • Per-channel quantization
    • Grouped activation calibration
    • Fully integer INT8 x INT8 -> INT32 -> INT8 requantization
  • FP16 conversion utilities for compact representation.
  • Versioned checksummed binary I/O with strict metadata and integrity validation.
  • Broad validation, stability, fuzz/property, and benchmark tests.

Repository Layout

include/
  ml_core.h        # Arena and Tensor definitions
  ml_autograd.h    # Backward pass API
  ml_math.h        # GEMM, backend dispatch, quantization, FP16 utilities
  ml_ops.h         # Differentiable ops (add/mul/matmul)
  ml_nn.h          # Activations and losses
  ml_layer.h       # Linear and quantized linear layers
  ml_optim.h       # SGD and gradient reset
  ml_io.h          # Weight serialization/deserialization

src/
  ml_core.c
  ml_autograd.c
  ml_math.c
  ml_ops.c
  ml_nn.c
  ml_layer.c
  ml_optim.c
  ml_io.c

test/
  test_step1.c
  test_step2.c
  test_step3_4.c
  test_step5_to_7.c
  test_step8_to_10.c
  test_validation_hardening.c
  test_io_hardening.c
  test_nn_stability.c
  test_quantization_backend.c
  test_int8_linear_inference.c
  test_fuzz_quant_io.c
  test_benchmark_limits.c

Core Architecture

1. Memory Model

The runtime is built on a single-owner Memory Arena:

  • arena_init initializes a fixed backing buffer.
  • arena_alloc performs aligned bump allocation.
  • arena_checkpoint and arena_restore allow scoped temporary allocations.
  • arena_reset resets all transient state in O(1).

This model avoids frequent heap operations and enables deterministic allocation behavior.
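The arena operations listed above can be illustrated with a minimal sketch. The type and function names below (SketchArena, sketch_arena_*) are hypothetical stand-ins rather than the declarations in ml_core.h, and the checkpoint is assumed to be a saved offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch type; the real Arena is defined in ml_core.h. */
typedef struct {
    uint8_t *base;     /* caller-owned backing buffer */
    size_t   capacity; /* total bytes available */
    size_t   offset;   /* bump pointer: bytes handed out so far */
} SketchArena;

static void sketch_arena_init(SketchArena *a, void *buffer, size_t capacity) {
    a->base = (uint8_t *)buffer;
    a->capacity = capacity;
    a->offset = 0;
}

/* Aligned bump allocation; returns NULL when the arena is exhausted. */
static void *sketch_arena_alloc(SketchArena *a, size_t size, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0); /* power of two */
    uintptr_t start = (uintptr_t)a->base + a->offset;
    uintptr_t aligned = (start + (align - 1)) & ~(uintptr_t)(align - 1);
    size_t padding = (size_t)(aligned - start);
    size_t remaining = a->capacity - a->offset;
    if (padding > remaining || size > remaining - padding) return NULL;
    a->offset += padding + size;
    return (void *)aligned;
}

/* A checkpoint is assumed to be nothing more than the current offset. */
static size_t sketch_arena_checkpoint(const SketchArena *a) {
    return a->offset;
}

/* Restoring discards every allocation made after the checkpoint. */
static void sketch_arena_restore(SketchArena *a, size_t mark) {
    assert(mark <= a->offset);
    a->offset = mark;
}

/* O(1) reset: no per-allocation bookkeeping needs to be walked. */
static void sketch_arena_reset(SketchArena *a) {
    a->offset = 0;
}
```

In this sketch, a scratch computation takes a checkpoint, allocates temporaries, and restores the checkpoint when done; the next allocation then reuses the same bytes.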

2. Tensor Model

A Tensor stores:

  • float* data
  • float* grad
  • shape[], strides[], ndim, size
  • Autograd fields (requires_grad, parents, backward, visited)

Tensor creation validates dimensionality and guards overflow-sensitive size computations.
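One way to picture the shape/stride metadata and the overflow guard is the sketch below. It assumes row-major strides measured in elements and a maximum rank of 8; both are illustrative assumptions, and the function names are hypothetical rather than taken from ml_core.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_DIMS 8 /* assumed limit, not taken from ml_core.h */

/* Fills row-major strides (in elements) and the total element count.
   Returns false for an invalid rank, a zero-sized dimension, or overflow. */
static bool sketch_shape_to_strides(const size_t *shape, int ndim,
                                    size_t *strides, size_t *size_out) {
    if (shape == NULL || strides == NULL || size_out == NULL) return false;
    if (ndim < 1 || ndim > SKETCH_MAX_DIMS) return false;
    size_t total = 1;
    for (int i = ndim - 1; i >= 0; --i) {
        if (shape[i] == 0) return false;
        strides[i] = total;
        /* Guard the multiplication before performing it. */
        if (total > SIZE_MAX / shape[i]) return false;
        total *= shape[i];
    }
    /* The byte count for a float buffer must also fit in size_t. */
    if (total > SIZE_MAX / sizeof(float)) return false;
    *size_out = total;
    return true;
}

/* Flat element offset of a multi-dimensional index. */
static size_t sketch_offset(const size_t *strides, const size_t *index,
                            int ndim) {
    assert(ndim >= 1 && ndim <= SKETCH_MAX_DIMS);
    size_t off = 0;
    for (int i = 0; i < ndim; ++i) off += index[i] * strides[i];
    return off;
}
```

For a shape of {2, 3, 4} this yields strides {12, 4, 1} and a size of 24, so index (1, 2, 3) maps to the last element at offset 23.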

3. Autograd Model

tensor_backward executes reverse traversal over a topologically sorted node list and invokes registered backward callbacks.

Implemented operator-level gradient propagation includes:

  • op_add with broadcast-aware gradient accumulation
  • op_mul
  • op_matmul
  • op_relu
  • op_sigmoid
  • op_softmax
  • loss_mse
  • loss_crossentropy
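The traversal described in this section can be sketched on scalar nodes. The code below is an illustration under simplifying assumptions (scalar values, at most two parents, a fixed node limit); SketchNode and the sketch_* functions are hypothetical and do not mirror the Tensor struct or the tensor_backward signature.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_NODES 64 /* assumed graph size limit for this sketch */

typedef struct SketchNode SketchNode;
struct SketchNode {
    float value;
    float grad;
    SketchNode *parents[2];             /* inputs that produced this node */
    void (*backward)(SketchNode *self); /* pushes grad to the parents */
    int visited;
};

static void sketch_add_backward(SketchNode *n) {
    n->parents[0]->grad += n->grad;
    n->parents[1]->grad += n->grad;
}

static void sketch_mul_backward(SketchNode *n) {
    n->parents[0]->grad += n->grad * n->parents[1]->value;
    n->parents[1]->grad += n->grad * n->parents[0]->value;
}

static SketchNode sketch_leaf(float value) {
    SketchNode n = { value, 0.0f, { NULL, NULL }, NULL, 0 };
    return n;
}

static SketchNode sketch_add(SketchNode *a, SketchNode *b) {
    SketchNode n = { a->value + b->value, 0.0f, { a, b },
                     sketch_add_backward, 0 };
    return n;
}

static SketchNode sketch_mul(SketchNode *a, SketchNode *b) {
    SketchNode n = { a->value * b->value, 0.0f, { a, b },
                     sketch_mul_backward, 0 };
    return n;
}

/* Post-order DFS: every parent is appended before the node that uses it. */
static void sketch_topo(SketchNode *n, SketchNode **order, size_t *count) {
    if (n == NULL || n->visited) return;
    n->visited = 1;
    sketch_topo(n->parents[0], order, count);
    sketch_topo(n->parents[1], order, count);
    assert(*count < SKETCH_MAX_NODES);
    order[(*count)++] = n;
}

/* Walk the sorted list in reverse so a node's grad is complete
   before its callback distributes it to the parents. */
static void sketch_backward(SketchNode *root) {
    SketchNode *order[SKETCH_MAX_NODES];
    size_t count = 0;
    sketch_topo(root, order, &count);
    root->grad = 1.0f;
    for (size_t i = count; i-- > 0;) {
        if (order[i]->backward != NULL) order[i]->backward(order[i]);
        order[i]->visited = 0; /* leave the graph reusable */
    }
}
```

For z = x * y + x with x = 2 and y = 3, calling sketch_backward on z accumulates two contributions into x (one through the product, one through the sum), which is why gradients are added rather than assigned.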

4. Compute Backends

tensor_matmul_simd selects backend implementations at runtime.

Supported backend identifiers:

  • ML_MATMUL_BACKEND_SCALAR
  • ML_MATMUL_BACKEND_AVX2_FMA
  • ML_MATMUL_BACKEND_NEON
  • ML_MATMUL_BACKEND_CBLAS (when enabled)

Additional compute features:

  • Cache-blocked GEMM (tensor_matmul_blocked)
  • Packed panel strategy
  • Microkernel dispatch (including 8x8 paths)
  • Configurable tiling through MlGemmConfig
  • ml_gemm_autotune and thread controls
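As a rough illustration of cache blocking with tunable tiles, the sketch below multiplies row-major matrices tile by tile. SketchGemmConfig is a hypothetical stand-in: the fields of MlGemmConfig are not listed in this README, so the bm/bn/bk names are assumptions, and packing, microkernels, and threading are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tile configuration; not the real MlGemmConfig layout. */
typedef struct {
    size_t bm; /* rows of A per tile */
    size_t bn; /* columns of B per tile */
    size_t bk; /* shared dimension per tile */
} SketchGemmConfig;

static size_t sketch_min(size_t a, size_t b) { return a < b ? a : b; }

/* C (MxN) += A (MxK) * B (KxN), all row-major and non-overlapping.
   The three outer loops pick a tile; the inner loops stay inside it,
   so the working set is bounded by the tile sizes. */
static void sketch_gemm_blocked(const float *A, const float *B, float *C,
                                size_t M, size_t N, size_t K,
                                SketchGemmConfig cfg) {
    assert(cfg.bm > 0 && cfg.bn > 0 && cfg.bk > 0);
    for (size_t i0 = 0; i0 < M; i0 += cfg.bm) {
        size_t i1 = sketch_min(i0 + cfg.bm, M);
        for (size_t k0 = 0; k0 < K; k0 += cfg.bk) {
            size_t k1 = sketch_min(k0 + cfg.bk, K);
            for (size_t j0 = 0; j0 < N; j0 += cfg.bn) {
                size_t j1 = sketch_min(j0 + cfg.bn, N);
                for (size_t i = i0; i < i1; ++i) {
                    for (size_t k = k0; k < k1; ++k) {
                        float a = A[i * K + k];
                        for (size_t j = j0; j < j1; ++j) {
                            C[i * N + j] += a * B[k * N + j];
                        }
                    }
                }
            }
        }
    }
}
```

The sketch_min calls handle edge tiles, so the tile sizes do not need to divide the matrix dimensions. The caller is expected to zero C first when a plain product is wanted.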

5. Quantization and Reduced Precision

The quantization and reduced-precision API covers:

  • Per-tensor INT8 parameter generation (tensor_calc_qparams_i8)
  • Per-channel INT8 parameter generation (tensor_calc_qparams_i8_per_channel)
  • Activation calibration (global and grouped percentile)
  • De/quantization APIs for INT8 tensors
  • Integer kernel path:
    • tensor_matmul_i8i8_i8pc
  • FP16 conversion and tensor helpers:
    • ml_float_to_fp16
    • ml_fp16_to_float
    • tensor_quantize_fp16
    • tensor_dequantize_fp16

QuantizedLinearLayer provides native INT8 linear inference.
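The INT8 x INT8 -> INT32 -> INT8 flow can be sketched as follows. This is an illustration under stated assumptions: symmetric per-tensor quantization with a zero point of 0, a clamp range of [-127, 127], and a fixed-point requantization multiplier. The README does not specify which scheme tensor_calc_qparams_i8 or tensor_matmul_i8i8_i8pc actually use, and all sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static int8_t sketch_clamp_i8(int64_t v) {
    if (v > 127) return 127;
    if (v < -127) return -127;
    return (int8_t)v;
}

/* Round half away from zero without depending on libm. */
static int64_t sketch_round(float v) {
    return (int64_t)(v >= 0.0f ? v + 0.5f : v - 0.5f);
}

/* Symmetric per-tensor scale: the largest magnitude maps to 127. */
static float sketch_calc_scale(const float *x, size_t n) {
    float max_abs = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float a = x[i] < 0.0f ? -x[i] : x[i];
        if (a > max_abs) max_abs = a;
    }
    return max_abs > 0.0f ? max_abs / 127.0f : 1.0f;
}

static void sketch_quantize(const float *x, int8_t *q, size_t n,
                            float scale) {
    assert(scale > 0.0f);
    for (size_t i = 0; i < n; ++i) {
        q[i] = sketch_clamp_i8(sketch_round(x[i] / scale));
    }
}

/* INT8 x INT8 products accumulated in INT32. */
static int32_t sketch_dot_i8(const int8_t *a, const int8_t *b, size_t n) {
    int32_t acc = 0;
    for (size_t i = 0; i < n; ++i) acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}

/* Integer-only requantization: out = round(acc * mult / 2^shift),
   where mult / 2^shift approximates s_a * s_b / s_out. */
static int8_t sketch_requantize(int32_t acc, int32_t mult, int shift) {
    assert(mult >= 0 && shift >= 1 && shift < 63);
    int64_t prod = (int64_t)acc * mult;
    int64_t mag = prod < 0 ? -prod : prod;
    int64_t rounded = (mag + ((int64_t)1 << (shift - 1))) >> shift;
    return sketch_clamp_i8(prod < 0 ? -rounded : rounded);
}
```

In this sketch the floating-point scales are only needed offline, to derive mult and shift. The inference path itself touches nothing but integers, which is the property the integer kernel path is named for.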

6. Serialization and Integrity

io_save_weights and io_load_weights implement a versioned binary format with:

  • Fixed header metadata
  • Tensor descriptor validation
  • Header checksum validation
  • Data checksum validation
  • Corruption rejection and strict mismatch handling
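The validation order above can be sketched with an in-memory buffer. The header layout, the magic value, and the use of 32-bit FNV-1a as the checksum are all assumptions made for illustration; the on-disk format written by io_save_weights is not documented in this README, and the sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MAGIC   0x4D4C5746u /* arbitrary illustrative value */
#define SKETCH_VERSION 1u

/* Hypothetical fixed header; fields are written in native byte order. */
typedef struct {
    uint32_t magic;
    uint32_t version;
    uint32_t count;           /* number of float values in the payload */
    uint32_t data_checksum;   /* covers the payload bytes */
    uint32_t header_checksum; /* covers the four fields above */
} SketchHeader;

/* 32-bit FNV-1a, used here only as a stand-in checksum. */
static uint32_t sketch_fnv1a(const void *data, size_t len) {
    const uint8_t *p = (const uint8_t *)data;
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; ++i) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* Returns the number of bytes written, or 0 if `out` is too small. */
static size_t sketch_save(const float *w, uint32_t count,
                          uint8_t *out, size_t cap) {
    assert(w != NULL && out != NULL);
    size_t payload = (size_t)count * sizeof(float);
    if (cap < sizeof(SketchHeader)) return 0;
    if (payload > cap - sizeof(SketchHeader)) return 0;
    SketchHeader h = { SKETCH_MAGIC, SKETCH_VERSION, count,
                       sketch_fnv1a(w, payload), 0 };
    h.header_checksum =
        sketch_fnv1a(&h, offsetof(SketchHeader, header_checksum));
    memcpy(out, &h, sizeof h);
    memcpy(out + sizeof h, w, payload);
    return sizeof h + payload;
}

/* Returns 0 on success or a negative code naming the first failed check.
   Nothing is copied into `w` unless every check passes. */
static int sketch_load(const uint8_t *in, size_t len,
                       float *w, uint32_t expected_count) {
    SketchHeader h;
    if (len < sizeof h) return -1;
    memcpy(&h, in, sizeof h);
    if (h.magic != SKETCH_MAGIC || h.version != SKETCH_VERSION) return -2;
    if (h.header_checksum !=
        sketch_fnv1a(&h, offsetof(SketchHeader, header_checksum))) return -3;
    if (h.count != expected_count) return -4;
    if (len - sizeof h != (size_t)h.count * sizeof(float)) return -5;
    if (h.data_checksum != sketch_fnv1a(in + sizeof h, len - sizeof h))
        return -6;
    memcpy(w, in + sizeof h, len - sizeof h);
    return 0;
}
```

Checking the header checksum before trusting the count field means a corrupted length can never drive a read past the end of the buffer.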

Build and Toolchain

Prerequisites

  • C11-compatible compiler (GCC or Clang)
  • Math library (-lm)
  • Optional:
    • OpenMP (-fopenmp)
    • CBLAS (-DML_USE_CBLAS + BLAS link flags)

Example Build Commands

Compile one test target:

gcc -O3 -std=c11 -Wall -Wextra -Iinclude src/*.c test/test_step2.c -lm -o test_step2

Compile with OpenMP (optional):

gcc -O3 -std=c11 -Wall -Wextra -fopenmp -Iinclude src/*.c test/test_benchmark_limits.c -lm -o test_benchmark_limits

Compile with CBLAS (optional, platform-specific link flags may differ):

gcc -O3 -std=c11 -Wall -Wextra -DML_USE_CBLAS -Iinclude src/*.c test/test_quantization_backend.c -lblas -lm -o test_quantization_backend

Running Tests

Representative execution flow:

./test_step1
./test_step2
./test_step3_4
./test_step5_to_7
./test_step8_to_10
./test_validation_hardening
./test_io_hardening
./test_nn_stability
./test_quantization_backend
./test_int8_linear_inference
./test_fuzz_quant_io

On Windows, generated executables typically use the .exe suffix.

Benchmarking and Runtime Tuning

Benchmark executable:

./test_benchmark_limits

Key runtime variables:

  • ML_MATMUL_BACKEND=scalar|avx2|avx2_fma|neon|cblas
  • ML_GEMM_AUTOTUNE=1
  • ML_GEMM_BM
  • ML_GEMM_BN
  • ML_GEMM_BK
  • ML_GEMM_APACK_THRESHOLD
  • ML_GEMM_NUM_THREADS
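How the framework parses these variables is not described here. As a sketch of one defensive approach, the code below reads a positive integer from the environment and falls back to a default on anything malformed. The function names and the fallback policy are assumptions, not the framework's documented behavior.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Parses a strictly positive decimal integer; any other input
   (NULL, empty, trailing junk, zero, negative, out of range)
   yields the fallback. */
static size_t sketch_parse_positive(const char *text, size_t fallback) {
    if (text == NULL || *text == '\0') return fallback;
    char *end = NULL;
    errno = 0;
    long v = strtol(text, &end, 10);
    if (errno != 0 || *end != '\0' || v <= 0) return fallback;
    return (size_t)v;
}

/* Reads a tuning value such as ML_GEMM_BM from the environment. */
static size_t sketch_env_size(const char *name, size_t fallback) {
    assert(name != NULL);
    return sketch_parse_positive(getenv(name), fallback);
}
```

With this sketch, sketch_env_size("ML_GEMM_BM", 64) returns 128 when the variable is set to 128 and 64 when it is unset or set to something like 12x.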

Backend introspection APIs:

  • ml_matmul_last_backend()
  • ml_matmul_backend_name()

Test Coverage Summary

  • Step tests validate baseline arena, tensor, math, and training flow.
  • Validation hardening tests verify invalid input rejection and non-finite update guards.
  • I/O hardening tests verify corruption detection and shape consistency checks.
  • NN stability tests stress Softmax and Cross-Entropy numerical behavior.
  • Quantization/backend tests verify parity and error thresholds across quantized paths.
  • Fuzz/property tests exercise randomized quantization and serialization invariants.
  • Benchmark tests report throughput and latency and emit CSV-style metrics.

API Surface (High-Level)

Core

  • tensor_create
  • tensor_set_requires_grad
  • tensor_backward
  • optim_sgd_step
  • optim_zero_grad

Math/Ops

  • tensor_add
  • tensor_matmul
  • tensor_matmul_blocked
  • tensor_matmul_simd
  • op_add
  • op_mul
  • op_matmul

NN

  • op_relu
  • op_sigmoid
  • op_softmax
  • loss_mse
  • loss_crossentropy

Layer

  • layer_linear_create
  • layer_linear_forward
  • layer_linear_quantize
  • layer_linear_forward_int8

Quantization

  • tensor_calc_qparams_i8
  • tensor_quantize_i8
  • tensor_dequantize_i8
  • tensor_calc_qparams_i8_per_channel
  • tensor_quantize_i8_per_channel
  • tensor_calibrate_activation_qparams_i8
  • tensor_calibrate_activation_qparams_i8_grouped
  • tensor_matmul_int8
  • tensor_matmul_int8_per_channel
  • tensor_matmul_i8i8_i8pc
  • tensor_quantize_fp16
  • tensor_dequantize_fp16

I/O

  • io_save_weights
  • io_load_weights

Current Scope and Limitations

This repository currently targets dense tensor operations and related training/inference primitives.

Not in scope (at present):

  • Convolution and pooling operator families
  • Graph compilation/fusion passes
  • Distributed training
  • Multi-format model import/export ecosystem
  • Full package/distribution pipeline

Engineering Notes

  • The project is intentionally designed for explicit control over memory, numerics, and compute paths.
  • Safety checks are integrated into allocation, validation, serialization, and optimizer flows.
  • The implementation favors predictable behavior and direct inspectability over abstraction-heavy runtime layers.

Contributing

Recommended contribution areas:

  • Additional operator coverage
  • Extended backend kernels
  • Cross-platform CI and benchmark automation
  • Expanded calibration algorithms
  • Documentation and usage examples
