This repository contains the implementation of our 6D object pose estimation project on the LINEMOD dataset. The system combines object detection, keypoint heatmap regression, RGB-D feature fusion, and geometric pose recovery into one modular pipeline.
Our best configuration achieves a mean ADD-based accuracy of 92.41% by combining:
- YOLOv10 for object detection
- heatmap-based keypoint regression for 2D keypoint localization
- a cross-fusion RGB-D residual architecture for multimodal learning
- PnP + RANSAC for final 6D pose recovery
The repository includes:
- Modular pipeline covering data preparation, training, prediction, and evaluation
- RGB baseline and RGB-D cross-fusion model variants
- FPS and CPS 3D keypoint sampling strategies
- ADD / ADD-S evaluation on LINEMOD
- Both notebook-based and script-based workflows are included
The project follows a staged 6D pose estimation pipeline:
- Prepare the LINEMOD data into a unified train/test structure.
- Convert training annotations into YOLO format and train the detector.
- Generate YOLO bbox predictions and use them to crop RGB and depth object patches.
- Sample 3D object keypoints using FPS or CPS.
- Project 3D keypoints into image space and generate Gaussian heatmaps.
- Train a keypoint heatmap regressor: RGB baseline or RGB-D cross-fusion model.
- Decode predicted 2D keypoints and solve pose with PnP + RANSAC.
- Evaluate final pose quality with ADD / ADD-S.
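The keypoint sampling and projection steps above can be sketched as follows. This is a minimal illustration using only the standard library, not the code in `src/heatnet/`: the function names, the start index, and the camera intrinsics (`fx`, `fy`, `cx`, `cy`) are assumptions made for the example, and only FPS is shown.

```python
import math


def farthest_point_sampling(points, k, start_index=0):
    """Greedily pick k point indices that are mutually far apart (FPS)."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    selected = [start_index]
    # min_dist[i] = distance from point i to its nearest already-selected point
    min_dist = [math.dist(p, points[start_index]) for p in points]
    while len(selected) < k:
        # The next keypoint is the point farthest from the current selection.
        next_index = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(next_index)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[next_index]))
    return selected


def project_points(points_3d, rotation, translation, fx, fy, cx, cy):
    """Project model-frame 3D points to pixels with a pinhole camera model."""
    pixels = []
    for p in points_3d:
        # Camera-frame coordinates: X_cam = R @ X_model + t
        x, y, z = (sum(r * c for r, c in zip(row, p)) + t
                   for row, t in zip(rotation, translation))
        pixels.append((fx * x / z + cx, fy * y / z + cy))
    return pixels
```

In this sketch the sampled indices select the 3D keypoints from the object model, and the projected pixel coordinates become the centers of the Gaussian heatmap targets.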
The project compares a simple RGB baseline against an RGB-D extension that adds depth-aware feature fusion before heatmap prediction.
| Model | Inputs | Backbone | Fusion strategy | Output |
|---|---|---|---|---|
| Baseline | Cropped RGB patch | ResNet-18 | None | Keypoint heatmaps |
| Extension | Cropped RGB patch + cropped depth patch | Dual ResNet-18 streams | Residual cross-fusion between RGB and depth features | Keypoint heatmaps |
The baseline model uses a single RGB crop as input, extracts features with ResNet-18, and predicts one heatmap per keypoint through a lightweight convolutional upsampling head. This model is the clean reference point for measuring the benefit of depth.
The extension processes RGB and depth in parallel, builds separate feature maps, and then fuses them through a cross-fusion residual module before the final heatmap head. This gives the network access to both appearance cues and geometric structure, which likely explains why it outperforms the RGB-only baseline in the reported results.
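Both variants output one heatmap per keypoint, so the training target and the decoding step are shared. The sketch below shows one plausible way to build a Gaussian target and to map a predicted peak back to full-image coordinates. The heatmap resolution, the `sigma` value, the `(x_min, y_min, x_max, y_max)` bbox layout, and the function names are assumptions for illustration rather than values taken from this repository.

```python
import math


def gaussian_heatmap(height, width, center_xy, sigma=2.0):
    """Return a height x width grid holding a 2D Gaussian peaked at center_xy."""
    cx, cy = center_xy
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


def decode_keypoint(heatmap, bbox):
    """Map the heatmap peak back to full-image pixel coordinates.

    bbox is the crop (x_min, y_min, x_max, y_max) in the full image.
    """
    height, width = len(heatmap), len(heatmap[0])
    # Argmax over the grid gives the peak location in heatmap coordinates.
    peak_value, px, py = max(
        (value, x, y)
        for y, row in enumerate(heatmap)
        for x, value in enumerate(row)
    )
    x_min, y_min, x_max, y_max = bbox
    # Rescale from heatmap resolution to crop size, then add the crop origin.
    u = x_min + px * (x_max - x_min) / width
    v = y_min + py * (y_max - y_min) / height
    return (u, v), peak_value
```

The decoded 2D points, paired with the sampled 3D keypoints, are the correspondences that the PnP + RANSAC stage consumes. The peak value can serve as a rough confidence score.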
.
├── data/ # datasets, labels, keypoints, projected labels, heatmaps
├── docs/ # report and related documents
├── models/ # trained checkpoints and YOLO artifacts
├── notebooks/ # original notebook workflow kept for reference
├── src/heatnet/ # reusable Python code
├── scripts/ # task-based entrypoints: prepare_data, train, predict, evaluate
├── configs/ # example JSON configs for the scripts
├── outputs/ # generated checkpoints, histories, predictions, evaluations
├── pyproject.toml # package metadata and heatnet CLI entrypoint
├── requirements.txt # Python dependencies
└── README.md
git clone https://github.com/emirmasood/HeatNet.git
cd HeatNet
pip install -r requirements.txt
python3 -m pip install -e .

This enables the package-style commands:
python3 -m heatnet --help
heatnet --help

If editable install is blocked by your local Python setup, you can still use the package entrypoint directly from the repo root:
PYTHONPATH=src python3 -m heatnet --help

Due to GitHub's file size restrictions, download large files separately:
- Dataset: Google Drive Data Folder
- ResNet Checkpoints: ResNet Checkpoints
- YOLOv10m pretrained weights: YOLOv10m Checkpoint
Place downloaded files into their respective folders as indicated in the folder structure above.
This repository supports two workflows:
- Script-based workflow: best for a cleaner, modular project structure.
- Notebook-based workflow: best if you want to follow the original project development phase by phase.
If your assets are already prepared, the fastest way to explore the project is:
python3 -m heatnet predict --help
python3 -m heatnet evaluate --help

If you want to run the modular script workflow, the four main entrypoints are:
- scripts/prepare_data.py
- scripts/train.py
- scripts/predict.py
- scripts/evaluate.py

The same tasks are also available through the package entrypoint:
- python3 -m heatnet prepare-data
- python3 -m heatnet train
- python3 -m heatnet predict
- python3 -m heatnet evaluate

python3 -m heatnet prepare-data --help
python3 -m heatnet prepare-data --config configs/prepare_data.example.json bbox-predict
python3 -m heatnet prepare-data --config configs/prepare_data.example.json sample-3d

python3 -m heatnet train --help
python3 -m heatnet train --config configs/train.example.json

python3 -m heatnet predict --help
python3 -m heatnet predict --config configs/predict.example.json

python3 -m heatnet evaluate --help
python3 -m heatnet evaluate --config configs/evaluate.example.json

The intended chained flow is:
- prepare-data to build derived assets and YOLO bbox labels
- train to fit the keypoint model
- predict to export both keypoints_2d and final poses under outputs/predictions/
- evaluate to read that prediction JSON and compute ADD / ADD-S metrics

Generated artifacts are organized under:
- outputs/checkpoints/
- outputs/histories/
- outputs/predictions/
- outputs/evaluations/

The notebooks are still included and useful as:
- the original project workflow
- experiment history
- visual reference for how each phase was developed
The final integrated notebook is:
jupyter lab notebooks/end_to_end/ph5_01_end_to_end.ipynb

The earlier notebooks remain important if you want to fully reproduce the project from raw data, retrain models, or inspect each research phase separately.
The project report and experiments compare:
- RGB baseline vs RGB-D cross-fusion models
- FPS vs CPS keypoint sampling
- multiple activation functions and scheduler variants
The best reported result in this repository is:
- Mean ADD-based accuracy: 92.41% on LINEMOD
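For readers unfamiliar with the metric, the sketch below shows how ADD and ADD-S are commonly computed and how a per-pose distance becomes an accuracy figure. It is an illustration, not this repository's evaluation code: the function names are hypothetical, and the 10% of object diameter threshold is the convention usually applied on LINEMOD, assumed here rather than confirmed by this README.

```python
import math


def transform(points, rotation, translation):
    """Apply the rigid transform R @ p + t to each 3D point."""
    return [
        tuple(sum(r * c for r, c in zip(row, p)) + t
              for row, t in zip(rotation, translation))
        for p in points
    ]


def add_metric(points, pose_gt, pose_pred, symmetric=False):
    """Mean distance between model points under the true and predicted pose.

    symmetric=False gives ADD (index-matched points); symmetric=True gives
    ADD-S (each ground-truth point is matched to its closest predicted point).
    Each pose is a (rotation, translation) pair.
    """
    gt = transform(points, *pose_gt)
    pred = transform(points, *pose_pred)
    if symmetric:
        dists = [min(math.dist(g, p) for p in pred) for g in gt]
    else:
        dists = [math.dist(g, p) for g, p in zip(gt, pred)]
    return sum(dists) / len(dists)


def is_correct(add_value, diameter, threshold=0.1):
    """Assumed convention: correct if ADD is below threshold * object diameter."""
    return add_value < threshold * diameter
```

Under this convention, the accuracy for an object is the fraction of test frames for which `is_correct` holds, and the mean accuracy averages that fraction across objects.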
The table below summarizes the mean ADD-based accuracy of the RGB-D cross-fusion variants across activation functions and learning-rate schedulers.
| Scheduler | ReLU | SiLU | Mish |
|---|---|---|---|
| ConstantLR | 88.19% | 82.86% | 88.32% |
| OneCycleLR | 90.14% | 91.08% | 91.92% |
| PolynomialLR | 87.40% | 86.91% | 86.99% |
Key takeaways:
- OneCycleLR + Mish gives the best mean result in this table at 91.92%.
- Among the tested schedulers, OneCycleLR is consistently the strongest overall.
- Mish is the best-performing activation on average in this comparison.
- Relative to the RGB baseline, the RGB-D cross-fusion extension delivers the strongest overall performance in the project, showing that depth is beneficial when fused with RGB features through the residual cross-fusion design.
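The averages behind the scheduler and activation takeaways can be reproduced directly from the table. The short script below is only a convenience check: the numbers are copied from the table, while the variable names are illustrative.

```python
from statistics import mean

# Mean ADD-based accuracy (%) copied from the table above.
results = {
    "ConstantLR": {"ReLU": 88.19, "SiLU": 82.86, "Mish": 88.32},
    "OneCycleLR": {"ReLU": 90.14, "SiLU": 91.08, "Mish": 91.92},
    "PolynomialLR": {"ReLU": 87.40, "SiLU": 86.91, "Mish": 86.99},
}

# Average each scheduler over the three activations.
scheduler_means = {s: mean(row.values()) for s, row in results.items()}

# Average each activation over the three schedulers.
activation_means = {
    a: mean(row[a] for row in results.values())
    for a in ("ReLU", "SiLU", "Mish")
}

best_scheduler = max(scheduler_means, key=scheduler_means.get)
best_activation = max(activation_means, key=activation_means.get)
```

The scheduler averages come out at roughly 86.46% (ConstantLR), 91.05% (OneCycleLR), and 87.10% (PolynomialLR). The activation averages are roughly 88.58% (ReLU), 86.95% (SiLU), and 89.08% (Mish).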
For the full analysis, ablations, and qualitative results, see the report in docs.
Ismail Aljosevic (ismail.aljosevic@studenti.polito.it)
Amir Masoud Almasi (amirmasoud.almasi@studenti.polito.it)
Ana Parovic (ana.parovic@studenti.polito.it)
Ashkan Shafiei (ashkan.shafiei@studenti.polito.it)
We thank Prof. Barbara Caputo, Dr. Raffaele Camoriano, Stephany Chanelo, and Paolo Rabino for their foundational instruction and guidance in the 3D learning course at Politecnico di Torino, which inspired the initial direction of this work.

