This repository contains an implementation of scheduling algorithms for Processing-in-Memory (PIM) systems.
If you use our code, please cite our paper:
@inproceedings{kang2026nonclairvoyant,
author = {Kang, Hongbo and Zhao, Yiwei and Agrawal, Kunal and Wu, Yongwei and Gibbons, Phillip B.},
title = {Non-Clairvoyant Scheduling for Processing-in-Memory},
year = {2026},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
booktitle = {Proceedings of the 38th ACM Symposium on Parallelism in Algorithms and Architectures},
location = {London, UK},
series = {SPAA '26},
doi = {10.1145/3816782.3819200},
url = {https://doi.org/10.1145/3816782.3819200}
}
The code consists of two parts:
simulator/ -- a closed-form makespan simulator that runs on any Linux/Windows host. Implements clairvoyant (Sec. 4) and non-clairvoyant (Sec. 5) schedulers alongside four baselines, plus a parameter sweep harness.
upmem/ -- a real-hardware port targeting UPMEM DPUs. Same algorithm set, driven through libdpu and a shared DPU kernel.
third-party/ holds git submodules (parlaylib, UPMEM SDK). Initialise with
git submodule update --init --recursive before building.
C++17 + pthreads. parlaylib is fetched as a submodule; nothing else is required.
cmake -B simulator/build -S simulator -DCMAKE_BUILD_TYPE=Release
cmake --build simulator/build -j
Produces three binaries under simulator/build/:
pim_sim -- single-config run; prints per-algorithm makespan and ratios.
pim_sweep -- parallel regime x P x seed grid; emits CSV to stdout.
pim_tests -- unit + invariant tests.
Two-regime demo (Strong-CPU and Weak-CPU on the same workload):
./simulator/build/pim_sim
Single custom regime via --phi:
./simulator/build/pim_sim --phi 3 --target-dist zipf --zipf-theta 1.2 \
--work-dist pareto --pareto-alpha 1.5
Parameter sweep (default 5 seeds x {8,16,32,64} PIMs x 2 regimes):
./simulator/build/pim_sweep > results.csv
./simulator/build/pim_sweep --Ps "8,16,32,64" --seeds 10 > results.csv
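The sweep grid is a Cartesian product of regimes, PIM counts, and seeds. The sketch below only illustrates that enumeration; it is not the pim_sweep source. `SweepPoint`, `make_grid`, the regime labels, and the 0-based seed numbering are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical record for one sweep configuration (not taken from pim_sweep).
struct SweepPoint {
    std::string regime;  // regime label, e.g. "strong-cpu" (label is assumed)
    int P;               // number of PIM modules
    std::uint64_t seed;  // workload seed (0-based numbering is assumed)
};

// Enumerate every (regime, P, seed) combination in row-major order.
inline std::vector<SweepPoint> make_grid(const std::vector<std::string>& regimes,
                                         const std::vector<int>& Ps,
                                         int num_seeds) {
    assert(num_seeds >= 0);
    std::vector<SweepPoint> grid;
    grid.reserve(regimes.size() * Ps.size() * static_cast<std::size_t>(num_seeds));
    for (const auto& r : regimes)
        for (int P : Ps)
            for (int s = 0; s < num_seeds; ++s)
                grid.push_back({r, P, static_cast<std::uint64_t>(s)});
    return grid;
}
```

With two regimes, P in {8,16,32,64}, and 5 seeds, `make_grid` yields 2 x 4 x 5 = 40 points, which is the size of the default sweep.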
Tests:
./simulator/build/pim_tests
A larger sweep script lives at simulator/bench/run_param_sweep.sh and
prints a markdown ratio table over 26 settings of (P, phi, skew, work
distribution).
--P, --n, --m, --seed, --obj-size, --desc, --reply, --work, --mu, --B, --phi
--target-dist {uniform,zipf} --zipf-theta <f> (default 0.99)
--work-dist {constant,exp,pareto} --pareto-alpha <f> (default 1.5)
--Ps, --seeds sweep-only
./simulator/build/pim_sim --help lists every flag with its default. Without
--phi, both pim_sim and pim_sweep run the dual Strong-CPU + Weak-CPU
demo on the requested workload.
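For orientation, the zipf and pareto options refer to standard distributions. The sketch below shows their textbook definitions only. How the simulator maps --zipf-theta, --pareto-alpha, and --work onto these formulas is not stated here, so `zipf_probabilities`, `pareto_from_uniform`, and the `x_min` parameter are illustrative assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Zipf probabilities over n targets: p(k) is proportional to 1 / k^theta for
// k = 1..n. theta = 0 is uniform; a larger theta concentrates mass on low ranks.
inline std::vector<double> zipf_probabilities(std::size_t n, double theta) {
    assert(n > 0 && theta >= 0.0);
    std::vector<double> p(n);
    double norm = 0.0;
    for (std::size_t k = 0; k < n; ++k) {
        p[k] = 1.0 / std::pow(static_cast<double>(k + 1), theta);
        norm += p[k];
    }
    for (double& x : p) x /= norm;  // normalise so the probabilities sum to 1
    return p;
}

// Pareto sample via the inverse CDF: for u uniform in (0, 1],
// x = x_min / u^(1/alpha). A smaller alpha gives a heavier tail.
inline double pareto_from_uniform(double u, double x_min, double alpha) {
    assert(u > 0.0 && u <= 1.0 && x_min > 0.0 && alpha > 0.0);
    return x_min / std::pow(u, 1.0 / alpha);
}
```

A theta near 1 sends a large share of tasks to a few hot targets, and an alpha of 1.5 gives a work distribution with finite mean but infinite variance, so both settings stress load balancing.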
Host-side C++17 against UPMEM libdpu; the DPU kernel is built with
dpu-upmem-dpurte-clang. CMake locates both via pkg-config or the standard
UPMEM SDK paths.
upmem/
  include/                shared host headers
    pim_hw.h              hardware parameters (B, E, E_CPU, ...)
    model.h               Object, Task, Workload, SystemParams
    workload.h            generator interface
    schedule.h            Schedule = sequence of Phases (Push/Pull/CpuExec)
    runtime.h             DPU dispatcher + wall-clock measurement
    lower_bound.h         per-instance lower bounds
    algo.h                AlgoFn interface shared by baselines/Clv/NCV
  common/kernel_args.h    layout shared with the DPU kernel
  host/                   host-side C++17 implementation
    main.cpp              CLI entry point
    workload.cpp          deterministic hash-based generator
    runtime.cpp           libdpu wrapper, MRAM packing, phase execution
    lower_bound.cpp
    algos/                one .cpp per algorithm
  dpu/                    DPU-side C
    kernel_exec.c         single-binary execution kernel
  bench/                  smoke + calibration + sweep helpers
  CMakeLists.txt          host + DPU build
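The schedule.h entry describes a schedule as a sequence of phases. One way to picture that shape is sketched below. Only the three phase kinds come from the listing; the `Phase` fields, the meaning given to `bytes` and `work`, and `total_transfer_bytes` are hypothetical, and the real header may be organised differently.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// The three phase kinds named in the listing above.
enum class PhaseKind { Push, Pull, CpuExec };

// Hypothetical phase record; the fields in the real schedule.h may differ.
struct Phase {
    PhaseKind kind;
    std::uint64_t bytes;  // bytes transferred (assumed meaning for Push/Pull)
    std::uint64_t work;   // host work units (assumed meaning for CpuExec)
};

using Schedule = std::vector<Phase>;

// Sum the bytes moved by all transfer (Push/Pull) phases of a schedule.
inline std::uint64_t total_transfer_bytes(const Schedule& s) {
    std::uint64_t total = 0;
    for (const Phase& ph : s) {
        if (ph.kind != PhaseKind::CpuExec) {
            assert(ph.work == 0);  // transfer phases carry no host work here
            total += ph.bytes;
        }
    }
    return total;
}
```

Keeping the schedule as plain data like this would let each algorithm emit a `Schedule` and leave execution and timing to a single runtime, which matches the split between algo.h and runtime.h in the listing.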
The host binary and the compiled DPU kernel land side by side as
upmem/build/pim_upmem and upmem/build/kernel_exec.dpu, so the host can load
the kernel by relative path at runtime.
cmake -B upmem/build -S upmem -DCMAKE_BUILD_TYPE=Release
cmake --build upmem/build -j
To pick a non-default tasklet count (1..24, default 12):
cmake -B upmem/build -S upmem -DUPMEM_NR_TASKLETS=8
cmake --build upmem/build -j
cd upmem/build
./pim_upmem --n 1024 --m 100000 --target-dist zipf --zipf-theta 0.99 \
--work-dist pareto --pareto-alpha 1.5 --work 1000
--dpus N requests a specific DPU count; omit to use every DPU returned by
dpu_alloc(DPU_ALLOCATE_ALL, ...). The CLI prints one CSV row per algorithm
with wall-clock makespan and ratios against the paper-merged and strict lower
bounds. Run ./pim_upmem --help for all flags.
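Each reported ratio is a makespan divided by a lower bound. The helper below is a minimal sketch of that calculation; `ratio_to_bound` is a hypothetical name and is not taken from the CLI source.

```cpp
#include <cassert>

// Ratio of a measured makespan to a lower bound, both in the same time unit.
// A value close to 1 means the schedule is close to that bound.
inline double ratio_to_bound(double makespan, double lower_bound) {
    assert(makespan >= 0.0 && lower_bound > 0.0);
    return makespan / lower_bound;
}
```

For example, a makespan of 3.0 s against a bound of 2.0 s gives a ratio of 1.5.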
./upmem/bench/run_tests.sh # smoke test + small CSV checks
./upmem/bench/run_calibration.sh # bench the hardware, suggest pim_hw.h values
./upmem/bench/run_param_sweep.sh # 19-setting sweep over n/W/theta/work-dist
./upmem/bench/run_sweep_size.sh # size sweep over m
./upmem/bench/run_revolver_test.sh # NR_TASKLETS sweep using the calibrator
upmem/include/pim_hw.h carries bandwidths B_push / B_pull, DPU compute
rate E_DPU, host CPU rate E_CPU, DPU clock frequency, and per-launch
overhead. Re-calibrate via upmem/bench/run_calibration.sh and paste the
suggested constants into pim_hw.h whenever you change UPMEM_NR_TASKLETS,
the DPU work loop, or the host machine.
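To see why stale constants matter, consider how such parameters could feed a cost estimate. The sketch below is an illustration under stated assumptions, not the contents of pim_hw.h: the struct layout, the units, the transfer directions assigned to push and pull, and the linear cost model in `round_seconds` are all hypothetical.

```cpp
#include <cassert>

// Hypothetical parameter bundle. Names follow the description above;
// units and transfer directions are assumptions.
struct HwParams {
    double B_push;             // host -> DPU bandwidth, bytes per second (assumed)
    double B_pull;             // DPU -> host bandwidth, bytes per second (assumed)
    double E_DPU;              // DPU compute rate, work units per second
    double E_CPU;              // host CPU compute rate, work units per second
    double launch_overhead_s;  // fixed cost per DPU launch, seconds
};

// Estimated time of one push, DPU launch + execute, and pull round under a
// simple linear cost model (an assumption for illustration only).
inline double round_seconds(const HwParams& hw, double push_bytes,
                            double dpu_work, double pull_bytes) {
    assert(hw.B_push > 0.0 && hw.B_pull > 0.0 && hw.E_DPU > 0.0);
    return push_bytes / hw.B_push + hw.launch_overhead_s +
           dpu_work / hw.E_DPU + pull_bytes / hw.B_pull;
}
```

Under a model of this kind every estimate scales directly with the constants, so a tasklet count or host change that shifts the true rates would skew any decision based on them until the values are re-measured.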