Siddhesh2377/README.md

Siddhesh Sonar

On-device AI + SDK engineer. I make models run fast on real hardware.

I work across the full inference stack. Forked llama.cpp, stripped it to CPU only, and rewrote the inference paths for Android with big.LITTLE thread scheduling and ARM micro-kernels. Built custom Hexagon DSP kernels in HMX/HVX assembly. Currently building cross-platform C++ SDK infrastructure at RunAnywhere (YC W26) with thin native bridges to Kotlin, Swift, Flutter, and React Native.

Not the "import tensorflow and call predict" kind of AI work. The kind where you write HMX assembly for a Crouton tile layout and then discover the matrix unit is fused off on your test silicon.

What I Built

ToolNeuron. Offline AI ecosystem for Android. 400 stars, 47 forks, 5K+ Play Store installs. Three repos: ToolNeuron (app), Ai-Systems-New (C++ SDK modules), llama.cpp-android (custom engine fork). Forked llama.cpp, stripped all GPU backends, and built 5 engine layers on top: GGMLEngine, ThreadEngine (big.LITTLE thermal-aware scheduling), VLM Engine (20+ vision/audio architectures), RAG Engine (late chunking, binary-quantized retrieval), and a callback logger. Four SDK modules: LLM inference, Stable Diffusion (QNN Hexagon DSP + MNN), TTS with 10 voices (ONNX Runtime), and emotional TTS with voice cloning (ONNX Runtime). ARM micro-kernels: NEON, i8mm, dotprod, fp16, bf16, KleidiAI. ~50 tokens/s on Cortex-X3. No cloud, no telemetry.
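The binary quantized retrieval mentioned for the RAG Engine can be illustrated with a minimal sketch. This is not the ToolNeuron code: it assumes sign-based binarization (one bit per embedding dimension) and Hamming distance as the similarity measure, and the names `binarize`, `hamming_distance`, and `nearest` are hypothetical.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One bit per embedding dimension, packed into 64-bit words.
using BinaryCode = std::vector<std::uint64_t>;

// Assumed scheme: bit i is set when component i is positive.
BinaryCode binarize(const std::vector<float>& embedding) {
    BinaryCode code((embedding.size() + 63) / 64, 0);
    for (std::size_t i = 0; i < embedding.size(); ++i) {
        if (embedding[i] > 0.0f) {
            code[i / 64] |= std::uint64_t{1} << (i % 64);
        }
    }
    return code;
}

// Number of differing bits between two codes of equal length.
std::size_t hamming_distance(const BinaryCode& a, const BinaryCode& b) {
    assert(a.size() == b.size());
    std::size_t distance = 0;
    for (std::size_t w = 0; w < a.size(); ++w) {
        distance += std::bitset<64>(a[w] ^ b[w]).count();
    }
    return distance;
}

// Index of the corpus entry closest to the query (first one wins on ties).
std::size_t nearest(const BinaryCode& query, const std::vector<BinaryCode>& corpus) {
    assert(!corpus.empty());
    std::size_t best = 0;
    std::size_t best_distance = hamming_distance(query, corpus[0]);
    for (std::size_t i = 1; i < corpus.size(); ++i) {
        const std::size_t d = hamming_distance(query, corpus[i]);
        if (d < best_distance) {
            best = i;
            best_distance = d;
        }
    }
    return best;
}
```

The appeal on a phone is memory and bandwidth: each dimension costs one bit instead of a 32-bit float, and the distance is an XOR plus a popcount per word.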

Android NPU (private). Custom GGUF inference engine that bypasses the QNN SDK and dispatches matmul directly to the Hexagon cDSP via FastRPC. Wrote the DSP kernel library from scratch in Hexagon assembly: FP16 and INT8 GEMM (HMX), FlashAttention, RMSNorm (HVX), and elementwise ops. Weights are pre-permuted into Crouton tile layout for HMX DMA. Zero-copy CPU-DSP buffer sharing through rpcmem (ION/dmabuf). Found and fixed 7 FastRPC/DMA/cache pipeline bugs. Diagnosed an HMX fuse-off on V73 (SM7635) through known-input verification: the benchmark was reporting 8 TFLOPS from instructions that executed as no-ops and wrote zeros. Blog post.
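The known input verification idea can be sketched on the CPU side. The following is an illustration under assumptions, not the private engine's code: `GemmFn`, `reference_gemm`, `zero_writing_gemm`, and `verify_with_known_inputs` are hypothetical names, and the kernel under test is modeled as a plain callable rather than a FastRPC call. The point is that a throughput number means nothing until the output is compared against a reference computed from deterministic, non-zero inputs.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical kernel signature: C (m x n) = A (m x k) * B (k x n), row-major.
using GemmFn = std::function<void(const std::vector<float>&, const std::vector<float>&,
                                  std::vector<float>&, std::size_t, std::size_t, std::size_t)>;

// Plain triple-loop reference used as ground truth.
void reference_gemm(const std::vector<float>& a, const std::vector<float>& b,
                    std::vector<float>& c, std::size_t m, std::size_t k, std::size_t n) {
    assert(a.size() == m * k && b.size() == k * n && c.size() == m * n);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p) {
                acc += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = acc;
        }
    }
}

// Stand-in for a kernel that "runs" quickly but only writes zeros.
void zero_writing_gemm(const std::vector<float>&, const std::vector<float>&,
                       std::vector<float>& c, std::size_t, std::size_t, std::size_t) {
    std::fill(c.begin(), c.end(), 0.0f);
}

// Runs the kernel on deterministic non-zero inputs and compares to the reference.
bool verify_with_known_inputs(const GemmFn& kernel, std::size_t m, std::size_t k,
                              std::size_t n, float tolerance = 1e-3f) {
    std::vector<float> a(m * k), b(k * n);
    for (std::size_t i = 0; i < a.size(); ++i) a[i] = 1.0f + static_cast<float>(i % 7);
    for (std::size_t i = 0; i < b.size(); ++i) b[i] = 0.5f * static_cast<float>(1 + i % 5);

    std::vector<float> expected(m * n, 0.0f);
    std::vector<float> actual(m * n, -1.0f);  // sentinel: untouched output is also detected
    reference_gemm(a, b, expected, m, k, n);
    kernel(a, b, actual, m, k, n);

    for (std::size_t i = 0; i < expected.size(); ++i) {
        if (std::fabs(expected[i] - actual[i]) > tolerance) return false;
    }
    return true;
}
```

All inputs are strictly positive, so every expected output element is non-zero and an all-zero result can never pass by accident.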

Edge AI Studio (private). DSL-driven inference engine with its own compiler. Write .edge scripts, compile them to an .egraph binary, and run on CPU/CUDA/OpenCL/Qualcomm QNN NPU with per-op backend dispatch. Each SoC has a hardware manifest, so the compiler assigns each op to a backend based on actual device capabilities. Built CLion and VS Code extensions with full LSP support (diagnostics, completion, hover, go-to-definition). Sub-2MB runtime binary. Runs Qwen3 0.6B on Snapdragon. Blog post.
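Per op backend assignment from a hardware manifest might look roughly like the sketch below. The real .edge compiler and manifest format are private, so everything here is an assumption: the `HardwareManifest` layout, the preference-ordered lookup, the CPU fallback, and the op names are all hypothetical.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

enum class Backend { CPU, CUDA, OpenCL, QnnNpu };

// Hypothetical manifest for one SoC: which backends exist, in order of
// preference, and which ops each of them can run.
struct HardwareManifest {
    std::vector<Backend> preference;                     // most preferred first
    std::map<Backend, std::set<std::string>> supported;  // ops per backend
};

// Picks the first preferred backend that supports the op.
// Assumption: the CPU can run every op, so it is the fallback.
Backend assign_backend(const HardwareManifest& hw, const std::string& op) {
    assert(!op.empty());
    for (Backend backend : hw.preference) {
        const auto it = hw.supported.find(backend);
        if (it != hw.supported.end() && it->second.count(op) > 0) {
            return backend;
        }
    }
    return Backend::CPU;
}

// Assigns a backend to every op of a (linearized) graph.
std::vector<Backend> assign_graph(const HardwareManifest& hw,
                                  const std::vector<std::string>& ops) {
    std::vector<Backend> plan;
    plan.reserve(ops.size());
    for (const std::string& op : ops) {
        plan.push_back(assign_backend(hw, op));
    }
    return plan;
}
```

Doing this at compile time keeps the runtime small: the binary only has to follow the plan, not rediscover device capabilities.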

ForgeAI. Desktop app for loading, inspecting, compressing, merging, and testing LLM model files offline. Rust + Tauri v2 + SvelteKit 5. Cross-platform.

What I Actually Understand

Hardware-level inference. GGML internals and compute graph construction, ML op scheduling across CPU/GPU/NPU, quantization behavior on real silicon (Q4_K_M through Q8_0, IQ variants), how different quant schemes perform on different ARM cores, big.LITTLE thread pinning, and thermal zone monitoring.
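As a rough illustration of big.LITTLE thread pinning, the sketch below selects the faster cores from per-core maximum frequencies and pins the calling thread to them. It is an assumption-laden example, not the ThreadEngine: `select_performance_cores` and `pin_current_thread` are hypothetical helpers, the "anything faster than the slowest cluster" rule is one simple heuristic among many, and thermal awareness is left out. On Linux and Android the frequencies would typically be read from `/sys/devices/system/cpu/cpuN/cpufreq/cpuinfo_max_freq`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

#if defined(__linux__)
#include <sched.h>
#endif

// Returns the indices of cores whose max frequency exceeds the slowest
// cluster's. On a homogeneous CPU every core is returned.
std::vector<int> select_performance_cores(const std::vector<long>& max_freq_khz) {
    assert(!max_freq_khz.empty());
    const long slowest = *std::min_element(max_freq_khz.begin(), max_freq_khz.end());
    const long fastest = *std::max_element(max_freq_khz.begin(), max_freq_khz.end());
    std::vector<int> cores;
    for (std::size_t i = 0; i < max_freq_khz.size(); ++i) {
        if (fastest == slowest || max_freq_khz[i] > slowest) {
            cores.push_back(static_cast<int>(i));
        }
    }
    return cores;
}

// Restricts the calling thread to the given cores.
// Returns false on failure, and always on non-Linux platforms.
bool pin_current_thread(const std::vector<int>& cores) {
#if defined(__linux__)
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int core : cores) {
        CPU_SET(core, &set);
    }
    // pid 0 means "the calling thread".
    return sched_setaffinity(0, sizeof(set), &set) == 0;
#else
    (void)cores;
    return false;
#endif
}
```

A worker thread would call `pin_current_thread(select_performance_cores(freqs))` once at startup and check the return value, since the OS may refuse cores that are offline or reserved.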

Qualcomm Hexagon DSP internals. HMX matrix extensions (systolic array, Crouton tile layout, accumulator patterns), HVX 1024-bit vector ops (vrmpy, vmpy, vgather), VTCM tightly coupled memory, FastRPC IPC over /dev/cdsprpc, rpcmem zero-copy buffer sharing (ION/dmabuf), fastrpc_mmap flags, FARF logging, and skel disassembly. Spent weeks debugging at the instruction level on V73.

Cross-platform SDK architecture. C++ shared core with thin Kotlin/JNI, Swift, Flutter, and React Native bridges. V-table dispatch for modality routing. Built and maintain this at RunAnywhere.
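V table dispatch for modality routing can be sketched as a table of function pointers indexed by modality. This is a generic illustration, not the RunAnywhere SDK: the `Modality` values, the `ModalityVTable` layout, and the toy `llm_run`/`tts_run` handlers are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>

enum class Modality : std::size_t { LLM = 0, TTS = 1, Count = 2 };

// One entry per modality: a name plus the functions the core calls through.
struct ModalityVTable {
    const char* name;
    std::string (*run)(const std::string& input);
};

// Toy handlers standing in for real engines.
std::string llm_run(const std::string& input) { return "llm:" + input; }
std::string tts_run(const std::string& input) { return "tts:" + input; }

// The table is indexed by the enum value, so routing is a single lookup.
const std::array<ModalityVTable, static_cast<std::size_t>(Modality::Count)> kVTables = {{
    {"llm", &llm_run},
    {"tts", &tts_run},
}};

const ModalityVTable& table_for(Modality modality) {
    const std::size_t index = static_cast<std::size_t>(modality);
    assert(index < kVTables.size());
    return kVTables[index];
}

// Routes a request to the handler registered for its modality.
std::string dispatch(Modality modality, const std::string& input) {
    return table_for(modality).run(input);
}
```

One plausible reason to prefer an explicit table over C++ virtual classes in this setting is that plain function pointers map more directly onto a C ABI, which is what JNI, Swift, and FFI-based bridges usually bind against.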

Production Android at the metal. NDK/JNI memory lifecycle, Jetpack Compose, plugin SDK architecture, secure IPC via AIDL, and ARM NEON intrinsics (i8mm, dotprod, KleidiAI micro-kernels).

Currently

Building a cross-platform inference SDK at RunAnywhere (YC W26). Maintain the thin platform bridges over a shared C++ core with V-table dispatch. Rewrote the SDK telemetry subsystem from scratch (V-table dispatch, per-modality event queues, background flush worker). Built a C++ CLI tool for fine-tuning ML models and LoRAs. Shipped LoRA adapter support across the full stack. 11 PRs across 3 repos, ~16K lines in 6 weeks.
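The per modality event queues with a background flush worker can be sketched with standard C++ threading primitives. This is a simplified illustration under assumptions, not the RunAnywhere implementation: `TelemetryQueue`, its `Sink` callback, and the string-typed events are hypothetical, and batching policy, size limits, and network upload are omitted.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <map>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

class TelemetryQueue {
public:
    // Called on the worker thread with one batch per modality.
    using Sink = std::function<void(const std::string& modality,
                                    const std::vector<std::string>& batch)>;

    explicit TelemetryQueue(Sink sink) : sink_(std::move(sink)) {
        assert(sink_ != nullptr);
        worker_ = std::thread([this] { run(); });
    }

    TelemetryQueue(const TelemetryQueue&) = delete;
    TelemetryQueue& operator=(const TelemetryQueue&) = delete;

    // Stops the worker after it has flushed everything still queued.
    ~TelemetryQueue() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        wake_.notify_one();
        worker_.join();
    }

    // Cheap for callers: append under the lock and wake the worker.
    void record(const std::string& modality, std::string event) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queues_[modality].push_back(std::move(event));
        }
        wake_.notify_one();
    }

private:
    void run() {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            wake_.wait(lock, [this] { return stop_ || !queues_.empty(); });
            // Take all queues at once so the sink runs without holding the lock.
            std::map<std::string, std::vector<std::string>> drained;
            drained.swap(queues_);
            lock.unlock();
            for (const auto& entry : drained) {
                sink_(entry.first, entry.second);
            }
            lock.lock();
            if (stop_ && queues_.empty()) return;
        }
    }

    Sink sink_;
    std::mutex mutex_;
    std::condition_variable wake_;
    std::map<std::string, std::vector<std::string>> queues_;
    bool stop_ = false;
    std::thread worker_;
};
```

Swapping the queues out under the lock keeps `record` fast on inference threads, because the potentially slow sink never runs while the mutex is held.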

Contact

siddheshsonar2377@gmail.com · LinkedIn · Blog

Open to roles in edge AI, on-device inference, mobile SDK infrastructure, and systems programming.

Pinned

  1. ToolNeuron (Public)

    Encrypted & Privacy First, On Android Device AI App

    Kotlin · 407 stars · 49 forks

  2. Ai-Systems-New (Public)

    On-device AI SDK powering ToolNeuron: LLM chat & tool calling (llama.cpp), Stable Diffusion image generation (QNN/MNN), image processing (upscale, segment, inpaint, depth, style), and TTS. Native …

    C++ · 23 stars · 3 forks

  3. ForgeAi (Public)

    ForgeAI: Your local model workshop. Load. Inspect. Merge. Ship.

    Rust · 13 stars · 1 fork

  4. llama.cpp-android (Public)

    Custom llama.cpp fork with character intelligence engine: control vectors, attention bias, head rescaling, attention temperature, fast weight memory

    C++ · 8 stars · 4 forks