This repository provides the open-source implementation of our paper “The Hidden Bloat in Machine Learning Systems.”
The project includes two components described in the paper: the kernel detector and the kernel locator. The compaction component is not included in this release and will be made public upon acceptance of our companion paper.
NEGATIVA_ML analyzes machine learning (ML) workloads to identify unused GPU code segments in shared libraries.
Given an ML workload, NEGATIVA_ML outputs:
- A list of shared libraries loaded during execution.
- The unused GPU code segments within those libraries.
See the Quick Start section below for usage instructions.
Tested environment: Ubuntu 20.04
Requirements: CUDA must be properly installed and configured. Refer to the official CUDA Installation Guide for Linux.
- Install Rust: https://rust-lang.org/tools/install/

- Install dependencies:

      sudo apt update
      sudo apt install libspdlog-dev cmake

- Set `LD_LIBRARY_PATH`:

      export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/extras/CUPTI/lib64

- Run tests:

      make test

  All tests should pass successfully.

- Install and verify:

      make install
      negativa_ml --help

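
Before building, it can help to confirm that the prerequisites from the steps above are in place. The following is a minimal sketch, not part of NEGATIVA_ML: the function names `missing_tools` and `on_library_path` are hypothetical, and checking for `nvcc` on `PATH` is only an assumed proxy for a working CUDA installation.

```shell
#!/usr/bin/env bash
# Hypothetical preflight check; not shipped with NEGATIVA_ML.

# Default CUPTI location, taken from the LD_LIBRARY_PATH step above.
CUPTI_DIR="${CUPTI_DIR:-/usr/local/cuda/extras/CUPTI/lib64}"

# Print each given command name that is not found on PATH, one per line.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Succeed if the given directory appears as an entry in LD_LIBRARY_PATH.
on_library_path() {
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Assumption: nvcc on PATH indicates that CUDA is installed and configured.
missing="$(missing_tools cargo cmake make nvcc)"
if [ -n "$missing" ]; then
    echo "Missing tools:" $missing
fi
[ -d "$CUPTI_DIR" ] || echo "CUPTI directory not found: $CUPTI_DIR"
on_library_path "$CUPTI_DIR" || echo "CUPTI directory is not on LD_LIBRARY_PATH"
```

The script only reports problems and changes nothing, so it is safe to rerun after each installation step.
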
Example ML workloads are available in the examples directory.
Here’s how to analyze the demo example:
- Build the demo:

      cd examples/demo && ./build.sh

  The `demo` example compiles a shared library `libdemo.so` and a main executable `main` that uses it. The shared library contains two GPU kernels: `matrixMulGPU` and `setScalarItems`.

- Verify the demo:

      cd build && ./main matmul

  This should output `SUCCESS`.

- Run NEGATIVA_ML analysis:

      negativa_ml debloat -- $PWD/main

  Note: The executable path must be absolute.

- View the results:

  The analysis results are stored in the `nml_workspace` directory:

      nml_workspace
      ├── trace.json          # Traced kernels and loaded shared libraries
      ├── spans/              # Located unused GPU code segments
      │   └── libdemo.so.json # Lists unused code segments in `libdemo.so`

  The `spans` directory contains the unused GPU code segments for each shared library. These can be used as input for the `compaction` component (not yet released).

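
Because the executable path passed to `negativa_ml debloat` must be absolute, a small wrapper can resolve relative paths first. This is a sketch under assumptions: `absolute_executable` and `debloat_workload` are hypothetical helper names, GNU `realpath` is assumed to be available (as on Ubuntu), and only the `negativa_ml debloat -- <path>` invocation comes from the steps above.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper; only `negativa_ml debloat -- <path>` comes from this README.

# Resolve a possibly relative path to an absolute one and check it is executable.
absolute_executable() {
    local resolved
    resolved="$(realpath -- "$1")" || return 1
    [ -x "$resolved" ] || { echo "not executable: $resolved" >&2; return 1; }
    printf '%s\n' "$resolved"
}

# Run the analysis on the given executable if negativa_ml is installed.
debloat_workload() {
    local exe
    exe="$(absolute_executable "$1")" || return 1
    command -v negativa_ml >/dev/null 2>&1 || { echo "negativa_ml not found on PATH" >&2; return 1; }
    negativa_ml debloat -- "$exe"
}

# Example, run from examples/demo/build:
# debloat_workload ./main
```

Resolving the path once up front means the same command works regardless of the directory it is launched from.
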
You can verify the detected code segments are unused by modifying them and rerunning the workload.
The following command reconstructs a shared library by overwriting unused code regions with 0x1:

    negativa_ml reconstruct \
        --span-path nml_workspace/spans/libdemo.so.json \
        --output-dir ./reconstructed

This will produce a reconstructed version of the shared library in `./reconstructed/`.
You may replace the original shared library with this version to verify correctness. Remember to back up the original file first.
For convenience, a helper script debloat.sh is provided under the demo example to automate this process.
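
The manual back-up-and-replace procedure described above can be sketched as two small shell helpers. This is an illustration only and not the contents of `debloat.sh`: the names `swap_in_library` and `restore_library` are hypothetical, and the example paths assume that the reconstructed file is named `libdemo.so` and sits in `./reconstructed/`.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; this is not the content of debloat.sh.

# Back up the original library next to itself, then overwrite it with the reconstructed copy.
swap_in_library() {
    local original="$1" reconstructed="$2"
    [ -f "$original" ] && [ -f "$reconstructed" ] || return 1
    # Refuse to clobber an existing backup from an earlier run.
    [ ! -e "$original.bak" ] || return 1
    cp -p -- "$original" "$original.bak" && cp -- "$reconstructed" "$original"
}

# Put the backed-up original back in place.
restore_library() {
    local original="$1"
    [ -f "$original.bak" ] || return 1
    mv -- "$original.bak" "$original"
}

# Example (paths are assumptions about the demo layout):
# swap_in_library ./libdemo.so ./reconstructed/libdemo.so
# ./main matmul        # should still print SUCCESS
# restore_library ./libdemo.so
```

If the workload still succeeds with the reconstructed library in place, the overwritten regions were not needed for that run.
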
The main executable is negativa_ml, which supports four subcommands:
| Command | Description |
|---|---|
| `trace` | Traces kernel launches and loaded shared libraries during the workload execution. |
| `locate` | Identifies unused GPU code segments in shared libraries based on trace results. |
| `debloat` | Runs `trace` and `locate` sequentially, producing final analysis results. |
| `reconstruct` | Rebuilds shared libraries with unused code segments set to 0x1. |

If you use this code in your research, please cite our paper:
@inproceedings{zhanghidden,
title={The Hidden Bloat in Machine Learning Systems},
author={Zhang, Huaifeng and Ali-Eldin, Ahmed},
booktitle={Eighth Conference on Machine Learning and Systems (MLSys)},
year={2025}
}