🌊 Stream

Stream is a design space exploration (DSE) and constraint-optimization framework for heterogeneous dataflow accelerators: accelerator systems built by combining cores that each have their own dataflow and performance model (AIE and TPU-like are two example core types among others). Scheduling is layer-fused, and the TETRA constraint optimization uses MILP (Mixed-Integer Linear Programming) to decide tensor placement and transfer paths across the cores of such a system. Stream builds on top of ZigZag for per-core cost estimation.

📖 Explore the Documentation

🚀 Getting Started Guide

✨ Key Features

✔ Heterogeneous dataflow cores: compose an accelerator from cores that each carry their own dataflow and cost model (AIE, TPU-like, pooling, SIMD, and more).

✔ Layer-fused scheduling across the whole system of cores.

✔ TETRA constraint optimization: a MILP (TransferAndTensorAllocator) decides tensor placement and transfer-path routing.

✔ Pluggable solver backends: OR-Tools GSCIP (default, license-free), OR-Tools HiGHS, and Gurobi behind one unified SolverModel API.

✔ ONNX workloads with auto-generated or hand-written mappings.

✔ AMD AIE code generation: emit aie / aiex MLIR for the Ryzen AI NPU, ready for the mlir-aie / IRON toolchain.

✔ Built for AI agents: an MCP server and typed IR models expose the pipeline programmatically.

The pipeline runs as a chain of stages: parse → tile → cost → MILP allocation → memory estimation.
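To make the stage chain concrete, here is a minimal sketch of the pattern, not Stream's actual stage classes. Only the stage order comes from the line above. The names `Context`, `make_stage`, `STAGES`, and `run_pipeline` are hypothetical: each placeholder stage takes a context dict, adds one entry, and passes it on.

```python
from typing import Any, Callable

# Hypothetical types for illustration; Stream's real stage classes may differ.
Context = dict[str, Any]
Stage = Callable[[Context], Context]


def make_stage(key: str) -> Stage:
    """Build a placeholder stage that stores a marker value under `key`."""
    def stage(ctx: Context) -> Context:
        return {**ctx, key: f"<{key} result>"}
    return stage


# Stage order as listed in this README; the bodies are stand-ins.
STAGES: list[tuple[str, Stage]] = [
    ("parse", make_stage("parsed")),
    ("tile", make_stage("tiled")),
    ("cost", make_stage("costs")),
    ("MILP allocation", make_stage("allocation")),
    ("memory estimation", make_stage("memory")),
]


def run_pipeline(ctx: Context) -> Context:
    """Run every stage in order, recording the stage names that ran."""
    for name, stage in STAGES:
        ctx = stage(ctx)
        ctx["trace"] = [*ctx.get("trace", []), name]
    return ctx


result = run_pipeline({"workload": "model.onnx"})
print(" -> ".join(result["trace"]))
# prints: parse -> tile -> cost -> MILP allocation -> memory estimation
```

Each stage only reads what earlier stages wrote into the context, which is why the order matters: the MILP allocation step needs the per-core costs before it can run.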

🚀 Installation

Python >=3.12 is required.

Full install with MCP server support (from the repo root):

pip install -e ".[mcp]"

Base install (no MCP server):

pip install -e .

The authoritative dependency source is pyproject.toml (package stream-dse). The base install pulls in zigzag-dse, ortools>=9.15 (the default, license-free MILP backend), pydantic, pydot, and xdsl. Optional extras: [mcp] adds fastmcp (required for the MCP server); [gurobi] adds gurobipy (commercial solver, opt-in).

AIE code generation

AIE-target MLIR codegen and tracing additionally need the AMD AIE toolchain (mlir_aie, llvm-aie, xdsl-aie, snax-mlir, aie-python-extras). These are git/URL installs that PyPI does not allow in package metadata, so a console script installs them after the base install rather than via an extra:

pip install -e .       # or, once published: pip install stream-dse
stream-setup-aie       # installs the AIE toolchain into the current environment

stream-setup-aie --dry-run prints exactly what it will install without making changes.

⚠️ Platform caveat: the AIE toolchain is Linux x86_64 only (manylinux wheels), CPython 3.12 or 3.13.

💡 Solver license note: OR-Tools (ortools_gscip, the default backend) is open-source and needs no license. Gurobi requires the [gurobi] extra (pip install -e ".[gurobi]") plus a separate commercial license; backend="gurobi" errors at solve time without a valid license.
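One way to guard against a missing Gurobi install is to choose the backend string up front. The sketch below uses only the standard library. The helper name `pick_backend` is hypothetical; the strings "ortools_gscip" and "gurobi" are the backend names mentioned above. An importable gurobipy does not prove that a valid license is present, so a solve can still fail.

```python
import importlib.util


def pick_backend(prefer_gurobi: bool = False) -> str:
    """Return "gurobi" only when requested and gurobipy is importable.

    Falls back to the license-free default otherwise. This checks the
    package only, not the license.
    """
    if prefer_gurobi and importlib.util.find_spec("gurobipy") is not None:
        return "gurobi"
    return "ortools_gscip"


backend = pick_backend(prefer_gurobi=True)
print("selected backend:", backend)
```

The result could then be passed wherever a backend name is accepted, for example the backend parameter of the run_optimization MCP tool described later in this README.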

Optional pre-commit setup:

pre-commit install

⚡ Quick Start

Run the constraint-optimization (CO) pipeline on a small two-Conv workload (a committed test fixture) with an auto-generated mapping (approximately 11 seconds):

python scripts/main_stream_co.py \
  --hardware stream/inputs/examples/hardware/tpu_like_quad_core.yaml \
  --workload stream/inputs/testing/workload/2conv_1_8_32_32_16_32_3.onnx

Alternatively, run just co-2conv (this repo uses just as its task runner; the recipe defaults to tpu_like_quad_core, see the matrix below). Because --mapping is omitted, the pipeline auto-generates the mapping; the hardware is a TPU-like quad-core system.

Expected output:

Total latency: 14344.0
  Group 0: 14344 (100.0%, wall=9.4s)

A YAML summary is written to outputs/.../summary.yaml with total_latency: 14344.0, plus workload/tiling/schedule PNG visualizations.
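For scripted runs, the printed report can be checked without opening the YAML file. The sketch below parses the two-line console output shown above with regular expressions. It assumes the format is exactly as printed there; the helper name `parse_report` is hypothetical and not part of Stream.

```python
import re

# Console output copied from the Quick Start example above.
SAMPLE = """Total latency: 14344.0
  Group 0: 14344 (100.0%, wall=9.4s)
"""

TOTAL_RE = re.compile(r"^Total latency:\s*([0-9.]+)", re.MULTILINE)
GROUP_RE = re.compile(r"^\s*Group (\d+):\s*([0-9.]+)\s*\(([0-9.]+)%", re.MULTILINE)


def parse_report(text: str) -> tuple[float, dict[int, float]]:
    """Return (total latency, {group index: group latency})."""
    match = TOTAL_RE.search(text)
    if match is None:
        raise ValueError("no 'Total latency' line found")
    total = float(match.group(1))
    groups = {int(idx): float(lat) for idx, lat, _pct in GROUP_RE.findall(text)}
    return total, groups


total, groups = parse_report(SAMPLE)
print(total, groups)
# prints: 14344.0 {0: 14344.0}
```

With a single fusion group, the group latency equals the total, which matches the 100.0% share in the printed report.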

🧩 Hardware and Core Types

An accelerator in Stream is described as a system of heterogeneous dataflow cores. Core roles include compute, memory, shim, and offchip; example dataflow core types include AIE, TPU-like, and pooling.

Hardware and mapping files are organized as follows:

stream/inputs/examples/hardware/ - system-level hardware YAMLs (e.g. tpu_like_quad_core.yaml, eyeriss_like_*.yaml, simba*.yaml, fusemax.yaml).
stream/inputs/examples/hardware/cores/ - per-core-type YAMLs (e.g. tpu_like.yaml, pooling.yaml, simd.yaml, offchip.yaml, eyeriss_like.yaml).
stream/inputs/aie/hardware/ and stream/inputs/aie/hardware/cores/ - AMD AIE example core types (e.g. aie_tile.yaml, mem_tile_256KB.yaml, shim_dma.yaml).
stream/inputs/examples/mapping/, stream/inputs/aie/mapping/, and stream/inputs/testing/mapping/ - mapping descriptions.

A mapping can be auto-generated (as in Quick Start above) or hand-written and passed via --mapping.

📊 Workload × Hardware Matrix

The generic CO pipeline runs any ONNX workload on any of the example hardware systems. The repo ships two small workloads and exercises them across all eight non-AIE example architectures, both from the scripts/main_stream_co.py entry point and from the pytest suite (tests/test_hardware_combinations.py).

Workloads - committed test fixtures under stream/inputs/testing/workload/. Weight values are cleared because only tensor shapes matter for cost estimation, which keeps the ONNX files tiny; just gen-workloads regenerates them via the builders:

2-conv - two chained Conv layers (make_2_conv.py).
swiglu - a 5-node SwiGLU block: two Gemms, SiLU, an elementwise Mul, and a down-projection Gemm (make_swiglu.py).

| Hardware (`stream/inputs/examples/hardware/`) | Description | 2-conv | swiglu |
| --- | --- | --- | --- |
| `eyeriss_like_single_core` | one Eyeriss-like compute core (+ pooling, SIMD, DRAM) | ✓ | ✓ |
| `eyeriss_like_dual_core` | two Eyeriss-like compute cores | ✓ | ✓ |
| `eyeriss_like_quad_core` | four Eyeriss-like compute cores | ✓ | ✓ |
| `tpu_like_quad_core` | four TPU-like compute cores | ✓ | ✓ |
| `simba_small` | small Simba chiplet mesh | ✓ | ✓ |
| `simba` | 36-core Simba chiplet mesh | ✓ | ✓ |
| `fusemax` | FuseMax array + vector + DRAM | ✓ | ✓ |
| `meta_prototype_dual_core_simd_offchip` | two Meta-prototype compute cores (+ pooling, SIMD, DRAM) | ✓ | ✓ |

✓ = completes through the generic CO pipeline. All combinations run in the default fast suite; on these small single-fusion-group workloads even the 36-core simba mesh finishes in seconds.

Run one combination - the justfile wraps scripts/main_stream_co.py; hw is any hardware stem from the table (default tpu_like_quad_core):

just co-2conv fusemax           # 2-conv on an architecture
just co-swiglu simba_small      # swiglu on an architecture

Equivalently, the raw entry-point call:

python scripts/main_stream_co.py \
  --hardware stream/inputs/examples/hardware/fusemax.yaml \
  --workload stream/inputs/testing/workload/2conv_1_8_32_32_16_32_3.onnx

Run the whole matrix - the justfile wraps pytest tests/test_hardware_combinations.py, which runs 2-conv + swiglu over all eight architectures plus a parse-only check confirming every hardware definition loads:

just matrix          # parse + 2-conv + swiglu over all 8 architectures (incl. simba)
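To run individual cells of the matrix rather than the whole pytest suite, the per-combination commands can be enumerated. The sketch below builds them from the eight hardware stems in the table and the two recipes shown above (co-2conv and co-swiglu). It only prints the commands; nothing is executed.

```python
from itertools import product

# Hardware stems from the table above.
HARDWARE_STEMS = [
    "eyeriss_like_single_core",
    "eyeriss_like_dual_core",
    "eyeriss_like_quad_core",
    "tpu_like_quad_core",
    "simba_small",
    "simba",
    "fusemax",
    "meta_prototype_dual_core_simd_offchip",
]

# just recipes named in this README, one per workload.
RECIPES = ["co-2conv", "co-swiglu"]

# One command per (hardware, workload) cell of the matrix.
commands = [f"just {recipe} {hw}" for hw, recipe in product(HARDWARE_STEMS, RECIPES)]

for command in commands:
    print(command)
print(len(commands))
# last line printed: 16
```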

🖥️ Command-Line Entry Points

All entry-point scripts live in scripts/ and are run from the repo root (so relative input paths resolve and stream imports as the installed package).

| Script | Purpose |
| --- | --- |
| `scripts/main_stream_co.py` | Generic CO pipeline for any workload + hardware pair; manual or auto-generated mapping; YAML summary output. General-purpose (non-AIE). |
| `scripts/main_gemm.py` | CO allocation + optional AIE MLIR codegen for GEMM workloads (AMD Strix AIE). |
| `scripts/main_swiglu.py` | CO allocation + optional AIE MLIR codegen for SwiGLU workloads (AMD Strix AIE). |
| `scripts/main_swiglu_dse_single.py` | Single-mapping SwiGLU DSE evaluation (AIE). |
| `scripts/main_swiglu_dse.py` | Multi-mapping SwiGLU DSE sweep over tile sizes (AIE). |
| `scripts/main_aie_co.py` | CO allocation for a hard-coded single AIE tile workload (no args; run as `python scripts/main_aie_co.py`). |
| `scripts/main_gemm_codegen.py` | Direct GEMM → AIE MLIR codegen via xDSL transforms (no CO pipeline); `--M/--N/--K`. |

scripts/main_stream_co.py is the general-purpose entry point. The others are AIE-specific: they hardwire AMD Strix or single-tile AIE hardware, and codegen requires NPU hardware. Note that scripts/main_aie_co.py takes no arguments (all paths are hard-coded). Plotting and trace post-processing utilities live in scripts/analysis/.

Full scripts/main_stream_co.py CLI syntax:

python scripts/main_stream_co.py \
  --hardware PATH_TO_HW_YAML \
  --workload PATH_TO_ONNX \
  [--mapping PATH_TO_MAPPING_YAML]  # omit for auto-generated mapping
  [--output OUTPUT_DIR]             # default: "outputs"
  [--experiment-id ID]
  [--skip-if-exists]

🐍 Public API

The public API lives in stream/api.py.

The primary entry point is optimize_allocation_co_generic, which auto-generates the mapping from the workload and hardware (no hand-written mapping YAML needed). This snippet is confirmed to run and print total_latency: 14344.0 (the 2-conv ONNX it references is produced by just gen-workloads):

import tempfile
from stream.api import configure_logging, optimize_allocation_co_generic

configure_logging()

with tempfile.TemporaryDirectory() as tmp:
    ctx = optimize_allocation_co_generic(
        hardware="stream/inputs/examples/hardware/tpu_like_quad_core.yaml",
        workload="stream/inputs/testing/workload/2conv_1_8_32_32_16_32_3.onnx",
        experiment_id="my-first-run",
        output_path=tmp,
    )
    print("total_latency:", ctx.get("total_latency"))
    print("group_latencies:", ctx.get("group_latencies"))

Expected output: total_latency: 14344.0.

The other two public functions:

optimize_allocation_co_with_mapping(hardware, workload, mapping, experiment_id, output_path, ...) - runs CO with a hand-written mapping YAML. optimize_allocation_co is a backward-compatible alias for it (both names importable).
optimize_mapping(hardware, workload, experiment_id, output_path, max_nb_mappings=20, ...) - DSE pipeline: enumerates mapping variants and runs CO for each.

All three return a StageContext. Useful keys: ctx.get("total_latency"), ctx.get("group_latencies"), ctx.get("scheduler"), ctx.get("workload"), ctx.get("accelerator").

🤖 MCP Server (for AI agents)

Stream ships an MCP server (stream/mcp/server.py, server name stream) that lets an AI agent submit and inspect TETRA CO jobs. It requires the [mcp] extra (pip install -e ".[mcp]").

⚠️ Install caveat: [mcp] does not currently resolve against the pinned PyPI xdsl 0.29.1 - fastmcp's dependency tree needs newer typing-extensions/pydantic than xdsl 0.29.1 permits. For now it installs only in the dev environment that uses the git build of xdsl; a clean fix awaits the xdsl upgrade.

Launch command (from the repo root):

python3 -c "from stream.mcp.server import mcp; mcp.run(transport='stdio')"

The server runs on STDIO (JSON-RPC) transport and blocks until the client disconnects.

The 6 tools:

| Tool | Purpose |
| --- | --- |
| `run_optimization(hardware, workload, mapping, output_path, backend, ...)` | Submit a TETRA CO job; returns a `job_id` immediately; solve runs in the background. |
| `poll_optimization(job_id)` | Check job status (`pending` / `running` / `complete` / `failed` / `not_found`). |
| `get_workload_ir(workload=None, experiment_id=None)` | Return the workload DAG as `WorkloadIR` JSON. |
| `get_accelerator_ir(hardware=None, experiment_id=None)` | Return the hardware model as `AcceleratorIR` JSON. |
| `get_allocation_ir(job_id)` | Return the TETRA allocation result as `AllocationIR` JSON (3 persona views). |
| `get_solve_stats(job_id)` | Return MILP solve statistics (objective, time, gap, node count, backend). |

Run / poll / inspect flow:

run_optimization(...) returns {"job_id": "...", "status": "pending"}.
Poll poll_optimization(job_id) until {"status": "complete"}.
Inspect with get_allocation_ir(job_id) for the AllocationIR (algorithmic / hardware / compiler views) and get_solve_stats(job_id) for solve statistics.
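The run / poll / inspect flow above amounts to a polling loop on the client side. The sketch below shows that loop under some assumptions: `wait_for_job` is a hypothetical helper, and `poll` stands for whatever callable your MCP client exposes for the poll_optimization tool. The status strings are the ones listed in the tool table. A fake poller stands in for the server so the example runs on its own.

```python
import time
from typing import Callable

# Statuses after which polling should stop (from the tool table above).
TERMINAL_STATUSES = {"complete", "failed", "not_found"}


def wait_for_job(
    poll: Callable[[str], dict],
    job_id: str,
    interval_s: float = 1.0,
    timeout_s: float = 600.0,
) -> dict:
    """Call `poll(job_id)` until a terminal status or the timeout is reached."""
    deadline = time.monotonic() + timeout_s
    while True:
        reply = poll(job_id)
        if reply.get("status") in TERMINAL_STATUSES:
            return reply
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {reply.get('status')!r}")
        time.sleep(interval_s)


def make_fake_poll() -> Callable[[str], dict]:
    """Stand-in for the real tool call: pending, then running, then complete."""
    statuses = iter(["pending", "running", "complete"])

    def poll(job_id: str) -> dict:
        return {"job_id": job_id, "status": next(statuses)}

    return poll


final = wait_for_job(make_fake_poll(), "demo-job", interval_s=0.0)
print(final["status"])
# prints: complete
```

Only a complete status means that get_allocation_ir and get_solve_stats have results to return; failed and not_found also end the loop so the caller can report the error instead of waiting.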

🧠 Working in This Repo (AI agents)

Programmatic / IR API for structured JSON output:

from stream.ir import WorkloadIR, AcceleratorIR, AllocationIR

# After running optimize_allocation_co_generic(...)
workload_ir = WorkloadIR.from_internal(ctx.get("workload"))
accelerator_ir = AcceleratorIR.from_internal(ctx.get("accelerator"))
allocation_ir = AllocationIR.from_internal(ctx.get("scheduler"))

workload_data = workload_ir.model_dump()      # JSON-compatible dict
hardware_data = accelerator_ir.model_dump()
allocation_data = allocation_ir.model_dump()

AllocationIR offers .algorithmic_view(), .hardware_view(), and .compiler_view() persona views.
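Because model_dump() returns JSON-compatible dicts, the three IR dumps can be written to disk with the standard library alone. The sketch below assumes that property. The helper name `write_ir_bundle` is hypothetical, and the dictionaries are placeholders rather than real IR content.

```python
import json
import tempfile
from pathlib import Path


def write_ir_bundle(out_dir: str, **ir_dicts: dict) -> list[Path]:
    """Write each named dict to <out_dir>/<name>.json and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, data in ir_dicts.items():
        path = out / f"{name}.json"
        path.write_text(json.dumps(data, indent=2, sort_keys=True))
        paths.append(path)
    return paths


# Placeholder dicts; in practice pass workload_ir.model_dump() and friends.
with tempfile.TemporaryDirectory() as tmp:
    written = write_ir_bundle(
        tmp,
        workload={"placeholder": True},
        accelerator={"placeholder": True},
        allocation={"placeholder": True},
    )
    names = [p.name for p in written]
    reloaded = json.loads(written[0].read_text())

print(names)
# prints: ['workload.json', 'accelerator.json', 'allocation.json']
```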

📚 Further Documentation

Hosted documentation site: kuleuven-micas.github.io/stream, the human-facing docs (installation, getting started, the workload/hardware/mapping input formats, and driving Stream from an AI agent via the MCP server and IR models), rebuilt from docs/ on every push to main.
Stream paper (IEEE): A. Symons, L. Mei, S. Colleman, P. Houshmand, S. Karl and M. Verhelst, "Stream: Design Space Exploration of Layer-Fused DNNs on Heterogeneous Dataflow Accelerators".
ZigZag: zigzag-project.github.io/zigzag, the per-core cost-estimation framework Stream builds on.
