Stream is a design space exploration (DSE) and constraint-optimization framework for heterogeneous dataflow accelerators: accelerator systems built by combining cores that each have their own dataflow and performance model (AIE and TPU-like are two example core types among others). Scheduling is layer-fused, and the TETRA constraint optimization uses MILP (Mixed-Integer Linear Programming) to decide tensor placement and transfer paths across the cores of such a system. Stream builds on top of ZigZag for per-core cost estimation.
- Heterogeneous dataflow cores: compose an accelerator from cores that each carry their own dataflow and cost model (AIE, TPU-like, pooling, SIMD, and more).
- Layer-fused scheduling across the whole system of cores.
- TETRA constraint optimization: a MILP (TransferAndTensorAllocator) decides tensor placement and transfer-path routing.
- Pluggable solver backends: OR-Tools GSCIP (default, license-free), OR-Tools HiGHS, and Gurobi behind one unified SolverModel API.
- ONNX workloads with auto-generated or hand-written mappings.
- AMD AIE code generation: emit aie / aiex MLIR for the Ryzen AI NPU, ready for the mlir-aie / IRON toolchain.
- Built for AI agents: an MCP server and typed IR models expose the pipeline programmatically.
The pipeline runs as a chain of stages: parse → tile → cost → MILP allocation → memory estimation.
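As a rough mental model only (these are not Stream's actual stage classes), a chain of stages can be pictured as functions that each take a shared context and return an extended one. The stage names follow the chain above; `make_stage`, `run_chain`, and the `completed` key are hypothetical placeholders.

```python
from typing import Callable

# Hypothetical stand-ins for the five pipeline steps: each one only records
# that it ran. Stream's real stages do the actual parsing, tiling, and so on.
Context = dict[str, object]
Stage = Callable[[Context], Context]

STAGE_NAMES = ["parse", "tile", "cost", "milp_allocation", "memory_estimation"]


def make_stage(name: str) -> Stage:
    def stage(ctx: Context) -> Context:
        # Return a new context so earlier results are never mutated in place.
        done = list(ctx.get("completed", []))
        return {**ctx, "completed": done + [name]}

    return stage


def run_chain(ctx: Context, stages: list[Stage]) -> Context:
    # Feed the output context of each stage into the next one.
    for stage in stages:
        ctx = stage(ctx)
    return ctx


result = run_chain({"workload": "model.onnx"}, [make_stage(n) for n in STAGE_NAMES])
print(result["completed"])
```

The point of the sketch is the data flow: every later stage can read what earlier stages left in the context, which is also how the API results further down are retrieved with `ctx.get(...)`.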
Python >=3.12 is required.
Full install with MCP server support (from the repo root):
pip install -e ".[mcp]"
Base install (no MCP server):
pip install -e .
The authoritative dependency source is pyproject.toml (package stream-dse). The base install pulls in zigzag-dse, ortools>=9.15 (the default, license-free MILP backend), pydantic, pydot, and xdsl. Optional extras: [mcp] adds fastmcp (required for the MCP server); [gurobi] adds gurobipy (commercial solver, opt-in).
AIE-target MLIR codegen and tracing additionally need the AMD AIE toolchain (mlir_aie, llvm-aie, xdsl-aie, snax-mlir, aie-python-extras). These are git/URL installs that PyPI does not allow in package metadata, so a console script installs them after the base install rather than via an extra:
pip install -e . # or, once published: pip install stream-dse
stream-setup-aie # installs the AIE toolchain into the current environment
stream-setup-aie --dry-run prints exactly what it will install without making changes.
⚠️ Platform caveat: the AIE toolchain is Linux x86_64 only (manylinux wheels), CPython 3.12 or 3.13.
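A quick pre-flight check for the platform caveat can be sketched in plain Python. The helper name `aie_toolchain_supported` is hypothetical and not part of Stream; it only encodes the Linux x86_64 / CPython 3.12 or 3.13 constraint stated above.

```python
import platform
import sys


def aie_toolchain_supported(
    system: str, machine: str, implementation: str, version: tuple[int, int]
) -> bool:
    # Linux x86_64 with CPython 3.12 or 3.13, as stated in the platform caveat.
    return (
        system == "Linux"
        and machine == "x86_64"
        and implementation == "CPython"
        and version in {(3, 12), (3, 13)}
    )


supported = aie_toolchain_supported(
    platform.system(),
    platform.machine(),
    platform.python_implementation(),
    sys.version_info[:2],
)
print("AIE toolchain supported on this interpreter:", supported)
```

Running this before stream-setup-aie gives an early, readable answer instead of a wheel-resolution error.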
💡 Solver license note: OR-Tools (ortools_gscip, the default backend) is open-source and needs no license. Gurobi requires the [gurobi] extra (pip install -e ".[gurobi]") plus a separate commercial license; backend="gurobi" errors at solve time without a valid license.
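One way to stay on the license-free default unless Gurobi is at least installed is sketched below. `pick_backend` is a hypothetical helper, not a Stream function; the two backend strings are the ones named in the note above. An importable gurobipy does not prove that a valid license is present, so a solve can still fail with backend="gurobi".

```python
import importlib.util


def pick_backend(prefer_gurobi: bool = False) -> str:
    # Choose "gurobi" only when it was requested and gurobipy is importable;
    # otherwise fall back to the open-source default backend.
    if prefer_gurobi and importlib.util.find_spec("gurobipy") is not None:
        return "gurobi"
    return "ortools_gscip"


print(pick_backend(prefer_gurobi=True))
```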
Optional pre-commit setup:
pre-commit install
Run the CO pipeline on a small two-Conv workload (a committed test fixture) with an auto-generated mapping (approximately 11 seconds):
python scripts/main_stream_co.py \
--hardware stream/inputs/examples/hardware/tpu_like_quad_core.yaml \
--workload stream/inputs/testing/workload/2conv_1_8_32_32_16_32_3.onnx
Or simply run just co-2conv (this repo uses just as its task runner; the recipe defaults to tpu_like_quad_core, see the matrix below). --mapping is omitted, so the mapping is auto-generated by the pipeline; the hardware is a TPU-like quad-core system.
Expected output:
Total latency: 14344.0
Group 0: 14344 (100.0%, wall=9.4s)
A YAML summary is written to outputs/.../summary.yaml with total_latency: 14344.0, plus workload/tiling/schedule PNG visualizations.
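To collect results from several runs, the summaries can be scanned from Python. This is a minimal sketch that assumes total_latency appears as a top-level `total_latency: <number>` line, as in the run above; it uses a regular expression to stay dependency-free, and a real YAML parser would be more robust. `read_total_latencies` is a hypothetical helper name.

```python
import re
from pathlib import Path

# Matches a top-level "total_latency: 14344.0" line (assumed layout).
_TOTAL_LATENCY = re.compile(r"^total_latency:\s*([0-9.eE+-]+)\s*$", re.MULTILINE)


def read_total_latencies(output_root: str) -> dict[str, float]:
    """Map every summary.yaml found under output_root to its total_latency."""
    root = Path(output_root)
    results: dict[str, float] = {}
    if not root.is_dir():
        return results  # nothing has been run yet
    for summary in sorted(root.rglob("summary.yaml")):
        match = _TOTAL_LATENCY.search(summary.read_text())
        if match:  # skip summaries without a top-level total_latency key
            results[str(summary)] = float(match.group(1))
    return results


for path, latency in read_total_latencies("outputs").items():
    print(f"{path}: {latency}")
```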
An accelerator in Stream is described as a system of heterogeneous dataflow cores. Core roles include compute, memory, shim, and offchip; example dataflow core types include AIE, TPU-like, and pooling.
Hardware and mapping files are organized as follows:
- stream/inputs/examples/hardware/ - system-level hardware YAMLs (e.g. tpu_like_quad_core.yaml, eyeriss_like_*.yaml, simba*.yaml, fusemax.yaml).
- stream/inputs/examples/hardware/cores/ - per-core-type YAMLs (e.g. tpu_like.yaml, pooling.yaml, simd.yaml, offchip.yaml, eyeriss_like.yaml).
- stream/inputs/aie/hardware/ and stream/inputs/aie/hardware/cores/ - AMD AIE example core types (e.g. aie_tile.yaml, mem_tile_256KB.yaml, shim_dma.yaml).
- stream/inputs/examples/mapping/, stream/inputs/aie/mapping/, and stream/inputs/testing/mapping/ - mapping descriptions.
A mapping can be auto-generated (as in Quick Start above) or hand-written and passed via --mapping.
The generic CO pipeline runs any ONNX workload on any of the example hardware systems. The repo ships two small workloads and exercises them across all eight non-AIE example architectures, both from the scripts/main_stream_co.py entry point and from the pytest suite (tests/test_hardware_combinations.py).
Workloads - committed test fixtures under stream/inputs/testing/workload/ (weight values are cleared because only tensor shapes matter for cost estimation, so the ONNX files stay tiny; just gen-workloads regenerates them via the builders):
- 2-conv - two chained Conv layers (make_2_conv.py).
- swiglu - a 5-node SwiGLU block: two Gemms, SiLU, an elementwise Mul, and a down-projection Gemm (make_swiglu.py).
| Hardware (stream/inputs/examples/hardware/) | Description | 2-conv | swiglu |
|---|---|---|---|
| eyeriss_like_single_core | one Eyeriss-like compute core (+ pooling, SIMD, DRAM) | ✓ | ✓ |
| eyeriss_like_dual_core | two Eyeriss-like compute cores | ✓ | ✓ |
| eyeriss_like_quad_core | four Eyeriss-like compute cores | ✓ | ✓ |
| tpu_like_quad_core | four TPU-like compute cores | ✓ | ✓ |
| simba_small | small Simba chiplet mesh | ✓ | ✓ |
| simba | 36-core Simba chiplet mesh | ✓ | ✓ |
| fusemax | FuseMax array + vector + DRAM | ✓ | ✓ |
| meta_prototype_dual_core_simd_offchip | two Meta-prototype compute cores (+ pooling, SIMD, DRAM) | ✓ | ✓ |
✓ = completes through the generic CO pipeline. All combinations run in the default fast suite; on these small single-fusion-group workloads even the 36-core simba mesh finishes in seconds.
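The matrix can also be driven from Python by building one entry-point command per hardware stem. The sketch below only constructs and prints the commands for the 2-conv workload; the stems and paths come from this README, while `co_command` is a hypothetical helper. It assumes each system YAML is named `<stem>.yaml`, as in the examples shown here.

```python
import shlex
import sys

HARDWARE_DIR = "stream/inputs/examples/hardware"
WORKLOAD = "stream/inputs/testing/workload/2conv_1_8_32_32_16_32_3.onnx"

# Hardware stems copied from the table above.
HARDWARE_STEMS = [
    "eyeriss_like_single_core",
    "eyeriss_like_dual_core",
    "eyeriss_like_quad_core",
    "tpu_like_quad_core",
    "simba_small",
    "simba",
    "fusemax",
    "meta_prototype_dual_core_simd_offchip",
]


def co_command(hw_stem: str, workload: str = WORKLOAD) -> list[str]:
    # argv for one run of the generic CO entry point (auto-generated mapping).
    return [
        sys.executable,
        "scripts/main_stream_co.py",
        "--hardware", f"{HARDWARE_DIR}/{hw_stem}.yaml",
        "--workload", workload,
    ]


commands = [co_command(stem) for stem in HARDWARE_STEMS]
for argv in commands:
    # To execute instead of print: subprocess.run(argv, check=True) from the repo root.
    print(shlex.join(argv))
```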
Run one combination - the justfile wraps scripts/main_stream_co.py; hw is any hardware stem from the table (default tpu_like_quad_core):
just co-2conv fusemax # 2-conv on an architecture
just co-swiglu simba_small # swiglu on an architecture
Equivalently, the raw entry-point call:
python scripts/main_stream_co.py \
--hardware stream/inputs/examples/hardware/fusemax.yaml \
--workload stream/inputs/testing/workload/2conv_1_8_32_32_16_32_3.onnx
Run the whole matrix - the justfile wraps pytest tests/test_hardware_combinations.py, which runs 2-conv + swiglu over all eight architectures plus a parse-only check confirming every hardware definition loads:
just matrix # parse + 2-conv + swiglu over all 8 architectures (incl. simba)
All entry-point scripts live in scripts/ and are run from the repo root (so relative input paths resolve and stream imports as the installed package).
| Script | Purpose |
|---|---|
| scripts/main_stream_co.py | Generic CO pipeline for any workload + hardware pair; manual or auto-generated mapping; YAML summary output. General-purpose (non-AIE). |
| scripts/main_gemm.py | CO allocation + optional AIE MLIR codegen for GEMM workloads (AMD Strix AIE). |
| scripts/main_swiglu.py | CO allocation + optional AIE MLIR codegen for SwiGLU workloads (AMD Strix AIE). |
| scripts/main_swiglu_dse_single.py | Single-mapping SwiGLU DSE evaluation (AIE). |
| scripts/main_swiglu_dse.py | Multi-mapping SwiGLU DSE sweep over tile sizes (AIE). |
| scripts/main_aie_co.py | CO allocation for a hard-coded single AIE tile workload (no args; run as python scripts/main_aie_co.py). |
| scripts/main_gemm_codegen.py | Direct GEMM → AIE MLIR codegen via xDSL transforms (no CO pipeline); --M/--N/--K. |
scripts/main_stream_co.py is the general-purpose entry point. The others are AIE-specific: they hardwire AMD Strix or single-tile AIE hardware, and codegen requires NPU hardware. Note that scripts/main_aie_co.py takes no arguments (all paths are hard-coded). Plotting and trace post-processing utilities live in scripts/analysis/.
Full scripts/main_stream_co.py CLI syntax:
python scripts/main_stream_co.py \
--hardware PATH_TO_HW_YAML \
--workload PATH_TO_ONNX \
[--mapping PATH_TO_MAPPING_YAML] # omit for auto-generated mapping
[--output OUTPUT_DIR] # default: "outputs"
[--experiment-id ID]
[--skip-if-exists]
The public API lives in stream/api.py.
The primary entry point is optimize_allocation_co_generic, which auto-generates the mapping from the workload and hardware (no hand-written mapping YAML needed). This snippet is confirmed to run and print total_latency: 14344.0 (the 2-conv ONNX it references is produced by just gen-workloads):
import tempfile
from stream.api import configure_logging, optimize_allocation_co_generic
configure_logging()
with tempfile.TemporaryDirectory() as tmp:
    ctx = optimize_allocation_co_generic(
        hardware="stream/inputs/examples/hardware/tpu_like_quad_core.yaml",
        workload="stream/inputs/testing/workload/2conv_1_8_32_32_16_32_3.onnx",
        experiment_id="my-first-run",
        output_path=tmp,
    )
    print("total_latency:", ctx.get("total_latency"))
    print("group_latencies:", ctx.get("group_latencies"))
Expected output: total_latency: 14344.0.
The other two public functions:
- optimize_allocation_co_with_mapping(hardware, workload, mapping, experiment_id, output_path, ...) - runs CO with a hand-written mapping YAML. optimize_allocation_co is a backward-compatible alias for it (both names importable).
- optimize_mapping(hardware, workload, experiment_id, output_path, max_nb_mappings=20, ...) - DSE pipeline: enumerates mapping variants and runs CO for each.
All three return a StageContext. Useful keys: ctx.get("total_latency"), ctx.get("group_latencies"), ctx.get("scheduler"), ctx.get("workload"), ctx.get("accelerator").
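A small post-processing sketch for the group latencies follows. It assumes that ctx.get("group_latencies") yields a plain sequence of per-group latency numbers, which this README does not specify, and it reports each group's share of the sum of those numbers. `latency_shares` is a hypothetical helper.

```python
def latency_shares(group_latencies: list[float]) -> list[str]:
    # Share of each group relative to the sum of all group latencies
    # (assumption: groups are reported as a flat sequence of numbers).
    total = sum(group_latencies)
    lines = []
    for index, latency in enumerate(group_latencies):
        share = 100.0 * latency / total if total else 0.0
        lines.append(f"Group {index}: {latency:.0f} ({share:.1f}%)")
    return lines


# Value taken from the 2-conv run shown earlier (a single fusion group).
print("\n".join(latency_shares([14344.0])))  # Group 0: 14344 (100.0%)
```

If the key holds a mapping rather than a sequence, pass `list(mapping.values())` or adapt the loop accordingly.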
Stream ships an MCP server (stream/mcp/server.py, server name stream) that lets an AI agent submit and inspect TETRA CO jobs. Requires the [mcp] extra (pip install -e ".[mcp]").
⚠️ Install caveat: [mcp] does not currently resolve against the pinned PyPI xdsl 0.29.1 - fastmcp's dependency tree needs newer typing-extensions / pydantic than xdsl 0.29.1 permits. For now it installs only in the dev environment that uses the git build of xdsl; a clean fix awaits the xdsl upgrade.
Launch command (from the repo root):
python3 -c "from stream.mcp.server import mcp; mcp.run(transport='stdio')"
The server runs on STDIO (JSON-RPC) transport and blocks until the client disconnects.
The 6 tools:
| Tool | Purpose |
|---|---|
| run_optimization(hardware, workload, mapping, output_path, backend, ...) | Submit a TETRA CO job; returns a job_id immediately; solve runs in the background. |
| poll_optimization(job_id) | Check job status (pending / running / complete / failed / not_found). |
| get_workload_ir(workload=None, experiment_id=None) | Return the workload DAG as WorkloadIR JSON. |
| get_accelerator_ir(hardware=None, experiment_id=None) | Return the hardware model as AcceleratorIR JSON. |
| get_allocation_ir(job_id) | Return the TETRA allocation result as AllocationIR JSON (3 persona views). |
| get_solve_stats(job_id) | Return MILP solve statistics (objective, time, gap, node count, backend). |
Run / poll / inspect flow:
- run_optimization(...) returns {"job_id": "...", "status": "pending"}.
- Poll poll_optimization(job_id) until {"status": "complete"}.
- Inspect with get_allocation_ir(job_id) for the AllocationIR (algorithmic / hardware / compiler views) and get_solve_stats(job_id) for solve statistics.
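On the client side, the poll step can be wrapped in a small wait loop. The sketch below is transport-agnostic: `poll` is any callable that performs the poll_optimization tool call and returns its reply as a dict, and `wait_for_job` is a hypothetical helper rather than part of Stream. It treats complete, failed, and not_found as terminal, based on the status list in the tool table.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"complete", "failed", "not_found"}


def wait_for_job(
    poll: Callable[[str], dict],
    job_id: str,
    interval_s: float = 1.0,
    timeout_s: float = 600.0,
) -> dict:
    """Call poll(job_id) until the job reaches a terminal status or time runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        reply = poll(job_id)
        if reply.get("status") in TERMINAL_STATUSES:
            return reply
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} not finished after {timeout_s}s")
        time.sleep(interval_s)


# Demo with a fake poll function standing in for the real MCP tool call.
_fake_replies = iter(
    [{"status": "pending"}, {"status": "running"}, {"status": "complete"}]
)
final = wait_for_job(lambda job_id: next(_fake_replies), "demo-job", interval_s=0.0)
print(final)
```

Callers should check the returned status before requesting the AllocationIR, since the loop also returns on failed and not_found.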
Programmatic / IR API for structured JSON output:
from stream.ir import WorkloadIR, AcceleratorIR, AllocationIR
# After running optimize_allocation_co_generic(...)
workload_ir = WorkloadIR.from_internal(ctx.get("workload"))
accelerator_ir = AcceleratorIR.from_internal(ctx.get("accelerator"))
allocation_ir = AllocationIR.from_internal(ctx.get("scheduler"))
workload_data = workload_ir.model_dump() # JSON-compatible dict
hardware_data = accelerator_ir.model_dump()
allocation_data = allocation_ir.model_dump()
AllocationIR offers .algorithmic_view(), .hardware_view(), and .compiler_view() persona views.
- Hosted documentation site: kuleuven-micas.github.io/stream, the human-facing docs (installation, getting started, the workload/hardware/mapping input formats, and driving Stream from an AI agent via the MCP server and IR models), rebuilt from docs/ on every push to main.
- Stream paper (IEEE): A. Symons, L. Mei, S. Colleman, P. Houshmand, S. Karl and M. Verhelst, "Stream: Design Space Exploration of Layer-Fused DNNs on Heterogeneous Dataflow Accelerators".
- ZigZag: zigzag-project.github.io/zigzag, the per-core cost-estimation framework Stream builds on.