μQ MicroQuant Book Evaluation →
TinyML Compiler Engineering · Consulting

Your model.
Our compiler.
Their microcontroller.

We design and integrate custom quantization frameworks directly into proprietary RTOS and bare-metal firmware. The playground below is our actual SPQ4 compiler running in your browser — a live sample of the engineering you're hiring.

87%flash reduction vs FP32
0.00 KBdynamic heap used
<1%fixed-point L1 error
bit-exactRTOS async parity
Live Showcase

The SPQ4 Compiler Playground

Upload a weights file — or synthesize one — and watch the real pipeline run: scale-protected INT4 quantization, SRAM tile solving, dead-block sparsity analysis, and production C++ codegen. Every number below is computed live from your actual data.

1 · Model Ingestion

Upload raw float32 weights, or configure a synthetic layer.

OR SYNTHESIZE
One fixed-point scale (m0, nb) per block
Drives the 2D weight-tile solver
30%
Zero blocks compile to m0 = 0 and are skipped free

2 · Quantization Map

Live block Dead block (skipped) Clipped outlier

Hover any weight to inspect its INT4 code, block scale, and packed byte.

Nibble Packing (2 × INT4 → 1 byte)

w0
····
+
w1
····
Packed Byte
········
Compiled flash
vs FP32
Dead blocks skipped
SRAM tile
MACs / inference
Reconstruction RMSE

3 · Generated C++ Assets

model_assets.h — flash-aligned static arrays, drops straight into an arm-none-eabi-gcc build.


// Run the SPQ4 compiler to generate C++ assets...
                        

This is a browser demo of the real pipeline. In an engagement we run the full compiler against your actual model, calibrate against your dataset, and hand-tune the kernels for your silicon.

Get this done for your product →
What You're Hiring

Compiler Engineering, Not a Boxed Tool

SPQ4 — Scale-Protected Quantization, 4-bit — is one method engineered end-to-end for microcontrollers. We adapt it to your hardware.

🎯

Scale-Protected INT4

Per-block MSE-optimal clip search. Outliers saturate at the INT4 rail instead of destroying scale resolution for every neighboring weight. Two weights per byte, ~87% smaller than FP32.

🧮

FPU-less, Division-free

Fixed-point requantization (acc·m0) >> nb with round-to-nearest. The hot loop is integer MACs, bitwise masks, and shifts — nothing else. Runs on Cortex-M0.

🧊

Zero-Allocation Runtime

Header-only C++11, standard library only. Every buffer statically allocated: 0.00 KB heap, no arena, no fragmentation, deterministic latency. Weights execute in place from flash.

⏱️

RTOS Cooperative Executor

Tile-by-tile stepping with ping-pong SRAM buffers and DMA prefetch. Never blocks a realtime thread, never trips a watchdog — and stays bit-identical to blocking execution.

🕳️

Free Structural Sparsity

All-zero weight blocks compile to m0 == 0 and are skipped with zero mask storage. The skip pattern is a compile-time constant — timing stays input-invariant for WCET analysis.

📐

SRAM-Perfect Tiling

The compiler solves 2D tile geometry against your SRAM/TCM budget and serializes weights in streaming order, so every tile DMA-transfers whole into a tightly-coupled buffer.

🏎️

Hardware-Specific Kernels

SWAR nibble unpacking everywhere; __SMUAD dual-MAC paths on Cortex-M4/M7; Helium, RISC-V RVV, and Tensilica DSP ports built to order in engagements.

🧾

Verification Evidence

Every delivery ships with a parity harness: bit-exact async-vs-blocking checks, fixed-vs-float error bounds, footprint and alignment audits you can cite in a safety file.

Measured, Not Marketed

Verified Harness Results

Numbers from the repository's own benchmark harness on the reference 2-layer demo model (64×128 → 32×64). Reproduce them yourself — the evaluation sandbox license covers it.

Benchmark Harness Output

MetricFP32 ReferenceMicroQuant SPQ4Delta
Weight flash footprint 40.00 KB 7.81 KB −80.5%
Dynamic heap allocation allocator-dependent 0.00 KB static memory model
Inference latency (host) 3.07 µs 1.74 µs 1.77× faster
Fixed-point accuracy baseline 0.28% relative L1 round-to-nearest requant
Async vs blocking parity bit-identical PASS
Host measurements (x86/Apple Silicon, -O3). On Cortex-M4/M7 the __SMUAD dual-MAC path and dead-block skipping widen the gap further — we provide cycle-accurate target numbers during an engagement.

Reproduce It Yourself

Three commands, no dependencies beyond Python 3 and a C++ compiler:

terminal
$ python3 compiler/test_compiler.py Ran 12 tests ... OK $ python3 compiler/main.py --out benchmark/model_assets.h SUCCESS: SPQ4 bare-metal assets written $ g++ -std=c++11 -O3 -Iruntime/include -Ibenchmark \ benchmark/benchmark.cpp -o benchmark_run && ./benchmark_run Async vs. Blocking exact bit parity : PASS (bit-identical) Relative L1 error : 0.2785 % Flash size reduction : 80.47% vs FP32 Dynamic heap allocation : 0.00 KB
Legal Clarity

One License. No Copyleft. No Surprises.

MicroQuant is commercial-only proprietary software under a single clean EULA — nothing in this stack can contaminate your codebase.

💎 Commercial Production License

For Shipping Products

Read the Commercial EULA
🧪 Evaluation & Recruitment Sandbox

For Reviewers & Hiring Teams

Read the Sandbox Waiver (§4)
Start Here

Request a Technical Architecture Evaluation

Tell us about your model and your silicon. A compiler engineer — not a salesperson — reviews every request.

What the First Call Covers

🔍

Feasibility, Honestly

We look at your model class, accuracy budget, and memory map and tell you what INT4 will and won't do for it — before any contract.

📉

Bottleneck Diagnosis

Flash pressure, SRAM contention, cycle budgets, watchdog constraints — we map where your current inference path actually hurts.

🗺️

A Concrete Integration Plan

You leave with a scoped proposal: target kernels, expected footprint and latency envelopes, verification deliverables, and timeline.

Schedule Evaluation

✔️

Evaluation Request Received

A TinyML compiler engineer will reach out to name@company.com within 12 hours with a booking link.