Argus is an academic research prototype for detecting anomalies in network
activity and user behavior. The project is being prepared as the technical base
for a diploma thesis: it compares classical machine-learning baselines with a
sequence-based LSTM Autoencoder on synthetic security logs, and it is structured
so that public datasets can be added later.
The repository is not a production IDS/UEBA platform. It does not provide real-time ingestion, SIEM integrations, incident management, access control, or production monitoring. Its current purpose is reproducible experimentation: data generation, preprocessing, feature engineering, model training, metrics, plots, and short anomaly explanations.
- Synthetic log generation for controlled anomaly scenarios.
- Preprocessing and feature preparation for tabular and sequence models.
- Baseline models: Random Forest and Isolation Forest.
- Sequence model: LSTM Autoencoder in PyTorch.
- Evaluation with precision, recall, F1, ROC-AUC, and PR-AUC.
- Basic rule-based anomaly explanations.
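As an illustration of the evaluation step, the five metrics listed above can be computed with scikit-learn roughly as follows. This is a minimal sketch, not the code in src/evaluate.py: the function name `compute_metrics`, the fixed threshold, and the toy data are assumptions, and `average_precision_score` is used here as one common estimate of PR-AUC.

```python
import numpy as np
from sklearn.metrics import (
    average_precision_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)


def compute_metrics(y_true, scores, threshold):
    """Return the five metrics for binary labels and continuous anomaly scores.

    Hypothetical helper: precision, recall, and F1 need hard predictions,
    so scores are binarised at `threshold`; ROC-AUC and PR-AUC use raw scores.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    y_pred = (scores >= threshold).astype(int)
    return {
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
        "roc_auc": roc_auc_score(y_true, scores),
        "pr_auc": average_precision_score(y_true, scores),
    }


# Toy labels (1 = anomaly) and scores, invented for illustration only.
y_true = [0, 0, 0, 1, 0, 1]
scores = [0.10, 0.20, 0.15, 0.90, 0.60, 0.40]
metrics = compute_metrics(y_true, scores, threshold=0.5)
```

Threshold-free metrics (ROC-AUC, PR-AUC) are useful for comparing models whose score scales differ, such as classifier probabilities and reconstruction errors, while precision, recall, and F1 describe behavior at one chosen operating point.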
The next diploma stages are expected to add public datasets such as the
CERT Insider Threat Dataset, CICIDS2017, and CSE-CIC-IDS2018, but those
datasets are intentionally not stored in this repository.
argus/
├── docs/
│ ├── architecture.md
│ ├── datasets.md
│ └── roadmap.md
├── notebooks/
│ └── 01_exploration.ipynb
├── results/
│ ├── baseline_reference/
│ ├── plots/
│ ├── metrics_autoencoder.json
│ ├── metrics_baseline.json
│ └── metrics_summary.csv
├── src/
│ ├── evaluate.py
│ ├── explain_anomalies.py
│ ├── features.py
│ ├── generate_data.py
│ ├── preprocessing.py
│ ├── train_baseline.py
│ ├── train_lstm_autoencoder.py
│ └── utils.py
├── requirements.txt
├── run_all.sh
└── README.md
data/ and most files in results/ are generated locally and ignored by Git.
python3 -m pip install -r requirements.txt
Optional terminal formatting for the final pipeline report:
python3 -m pip install rich
Run the full synthetic pipeline:
bash run_all.sh
Useful variants:
bash run_all.sh --best-autoencoder
bash run_all.sh --skip-generate --best-autoencoder
bash run_all.sh --with-stage10
bash run_all.sh --skip-generate --with-stage10 --stage10-epochs 50
Manual stage-by-stage run:
python3 src/generate_data.py
python3 src/preprocessing.py
python3 src/train_baseline.py
python3 src/train_lstm_autoencoder.py
python3 src/evaluate.py
python3 src/explain_anomalies.py
Current practical LSTM Autoencoder configuration:
python3 src/train_lstm_autoencoder.py \
--sequence-length 5 \
--hidden-size 32 \
--epochs 10 \
--batch-size 128 \
--learning-rate 0.001 \
--threshold-percentiles 85 90 92 95 \
--early-stopping-patience 0 \
--selection-metric f1
python3 src/evaluate.py
The current pipeline uses synthetic network and user activity logs generated by
src/generate_data.py. The generated dataset includes normal events and several
anomaly types:
- brute_force_login
- unusual_night_activity
- data_exfiltration
- unusual_ip_change
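To make the idea of controlled anomaly scenarios concrete, here is a dependency-free sketch of how labelled events with these four anomaly types might be generated. It is not the logic of src/generate_data.py: the field names (`failed_logins`, `bytes_out`, `src_ip`, and so on), the value ranges, and the `anomaly_rate` parameter are all hypothetical.

```python
import random
from datetime import datetime, timedelta

# Anomaly labels from the list above; every field name below is hypothetical.
ANOMALY_TYPES = [
    "brute_force_login",
    "unusual_night_activity",
    "data_exfiltration",
    "unusual_ip_change",
]


def generate_events(n_events, anomaly_rate=0.05, seed=42):
    """Generate labelled toy events; a fixed seed keeps runs reproducible."""
    rng = random.Random(seed)
    start = datetime(2024, 1, 1, 9, 0, 0)
    events = []
    for i in range(n_events):
        # Start from a "normal" event, then optionally distort one field.
        event = {
            "timestamp": start + timedelta(minutes=i),
            "user": f"user_{rng.randint(1, 20):02d}",
            "src_ip": f"10.0.0.{rng.randint(2, 254)}",
            "failed_logins": 0,
            "bytes_out": rng.randint(1_000, 50_000),
            "label": 0,
            "anomaly_type": "normal",
        }
        if rng.random() < anomaly_rate:
            kind = rng.choice(ANOMALY_TYPES)
            event["label"] = 1
            event["anomaly_type"] = kind
            if kind == "brute_force_login":
                event["failed_logins"] = rng.randint(10, 50)
            elif kind == "unusual_night_activity":
                event["timestamp"] = event["timestamp"].replace(hour=3)
            elif kind == "data_exfiltration":
                event["bytes_out"] = rng.randint(5_000_000, 50_000_000)
            else:  # unusual_ip_change: address outside the usual 10.0.0.x range
                event["src_ip"] = f"203.0.113.{rng.randint(2, 254)}"
        events.append(event)
    return events


events = generate_events(1000)
```

Because each anomaly type distorts a known field, the ground-truth label and the reason for it are available for every event, which is what makes precision and recall measurable on synthetic data.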
Generated raw and processed files are written under data/ and are not tracked
by Git.
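The `--threshold-percentiles 85 90 92 95` and `--selection-metric f1` options in the LSTM Autoencoder configuration above suggest that candidate thresholds are taken as percentiles of reconstruction error and the best one is chosen by F1. A minimal sketch of that idea follows. It assumes the percentiles are computed over training errors and scored on a labelled validation split, which may differ from what src/train_lstm_autoencoder.py does; the function names and the simulated errors are hypothetical.

```python
import numpy as np


def f1_at_threshold(y_true, errors, threshold):
    """F1 score when items with error above the threshold are flagged."""
    y_pred = errors > threshold
    tp = np.sum(y_pred & (y_true == 1))
    fp = np.sum(y_pred & (y_true == 0))
    fn = np.sum(~y_pred & (y_true == 1))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0


def select_threshold(train_errors, val_errors, val_labels,
                     percentiles=(85, 90, 92, 95)):
    """Pick the percentile of training errors that maximises validation F1."""
    best = None
    for p in percentiles:
        threshold = float(np.percentile(train_errors, p))
        score = f1_at_threshold(val_labels, val_errors, threshold)
        if best is None or score > best["f1"]:
            best = {"percentile": p, "threshold": threshold, "f1": score}
    return best


# Simulated reconstruction errors, invented for illustration only:
# anomalous validation items get a larger error than normal ones.
rng = np.random.default_rng(0)
train_errors = rng.gamma(2.0, 0.05, 2000)
val_labels = (rng.random(500) < 0.1).astype(int)
val_errors = rng.gamma(2.0, 0.05, 500) + val_labels * 0.5
best = select_threshold(train_errors, val_errors, val_labels)
```

A lower percentile flags more sequences and tends to raise recall at the cost of precision; sweeping a small set of percentiles and selecting by F1 is one simple way to balance the two.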
The public repository keeps source code, dependencies, launch scripts, README, and a small reference snapshot of metrics/plots. It does not track:
- local planning notes such as future-development.md;
- practice reports, PDFs, DOCX/HTML exports, and local docs/materials;
- generated datasets under data/;
- model weights such as *.pt;
- full prediction outputs and intermediate experiment sweeps;
- temporary preview/build directories and dashboard screenshots.
If a future public dataset is required for an experiment, the repository should document how to obtain and prepare it instead of committing the dataset itself.
- Current results are based on synthetic data.
- Strong metrics on synthetic data are not evidence of production readiness.
- The LSTM Autoencoder is a research comparison point, not guaranteed to beat tabular baselines on every scenario.
- Public dataset support, richer feature engineering, explainability, and a demonstration dashboard are planned diploma-stage work.
- Python
- pandas
- numpy
- scikit-learn
- matplotlib
- PyTorch
- tqdm
This project is licensed under the MIT License. See LICENSE.