Argus

Argus is an academic research prototype for detecting anomalies in network activity and user behavior. The project is being prepared as the technical base for a diploma thesis: it compares classical machine-learning baselines with a sequence-based LSTM Autoencoder on synthetic security logs, and it is structured so that public datasets can be added later.

The repository is not a production IDS/UEBA platform. It does not provide real-time ingestion, SIEM integrations, incident management, access control, or production monitoring. Its current purpose is reproducible experimentation: data generation, preprocessing, feature engineering, model training, metrics, plots, and short anomaly explanations.

Current Scope

Synthetic log generation for controlled anomaly scenarios.
Preprocessing and feature preparation for tabular and sequence models.
Baseline models: Random Forest and Isolation Forest.
Sequence model: LSTM Autoencoder in PyTorch.
Evaluation with precision, recall, F1, ROC-AUC, and PR-AUC.
Basic rule-based anomaly explanations.
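
As an illustration of the evaluation step, here is a minimal standard-library sketch of precision, recall, F1, and ROC-AUC for a binary anomaly label (1 = anomaly). The function names `binary_metrics` and `roc_auc` are hypothetical and are not taken from src/evaluate.py; the repository itself may well rely on scikit-learn for these numbers. ROC-AUC is computed here through its pairwise-ranking definition, which is simple but quadratic in the number of samples.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Precision, recall, and F1 for the positive (anomaly = 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC as the share of (anomaly, normal) pairs ranked correctly.

    Ties between an anomaly score and a normal score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

Precision, recall, and F1 need hard 0/1 predictions, while ROC-AUC and PR-AUC are computed from continuous anomaly scores, so a threshold-based model such as the autoencoder contributes to both kinds of metric.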

The next diploma stages are expected to add public datasets such as the CERT Insider Threat Dataset, CICIDS2017, and CSE-CIC-IDS2018. Those datasets are intentionally not stored in this repository.

Documentation

Project documentation lives under docs/: architecture.md, datasets.md, and roadmap.md.

Project Structure

argus/
├── docs/
│   ├── architecture.md
│   ├── datasets.md
│   └── roadmap.md
├── notebooks/
│   └── 01_exploration.ipynb
├── results/
│   ├── baseline_reference/
│   ├── plots/
│   ├── metrics_autoencoder.json
│   ├── metrics_baseline.json
│   └── metrics_summary.csv
├── src/
│   ├── evaluate.py
│   ├── explain_anomalies.py
│   ├── features.py
│   ├── generate_data.py
│   ├── preprocessing.py
│   ├── train_baseline.py
│   ├── train_lstm_autoencoder.py
│   └── utils.py
├── requirements.txt
├── run_all.sh
└── README.md

data/ and most files in results/ are generated locally and ignored by Git.

Setup

python3 -m pip install -r requirements.txt

Optional terminal formatting for the final pipeline report:

python3 -m pip install rich

Run Pipeline

Run the full synthetic pipeline:

bash run_all.sh

Useful variants:

bash run_all.sh --best-autoencoder
bash run_all.sh --skip-generate --best-autoencoder
bash run_all.sh --with-stage10
bash run_all.sh --skip-generate --with-stage10 --stage10-epochs 50

Manual stage-by-stage run:

python3 src/generate_data.py
python3 src/preprocessing.py
python3 src/train_baseline.py
python3 src/train_lstm_autoencoder.py
python3 src/evaluate.py
python3 src/explain_anomalies.py

Current practical LSTM Autoencoder configuration:

python3 src/train_lstm_autoencoder.py \
  --sequence-length 5 \
  --hidden-size 32 \
  --epochs 10 \
  --batch-size 128 \
  --learning-rate 0.001 \
  --threshold-percentiles 85 90 92 95 \
  --early-stopping-patience 0 \
  --selection-metric f1
python3 src/evaluate.py
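
The `--threshold-percentiles` and `--selection-metric f1` options suggest that the anomaly threshold is chosen from a few percentiles of the reconstruction error. The sketch below shows one way such a selection could work; it is an assumption about the approach, not the code in src/train_lstm_autoencoder.py. The helper names (`percentile`, `f1_at`, `select_threshold`) are hypothetical, and the percentiles are taken over the same error array that is scored.

```python
from typing import Optional, Sequence, Tuple


def percentile(values: Sequence[float], q: float) -> float:
    """q-th percentile (0-100) with linear interpolation between ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def f1_at(errors: Sequence[float], labels: Sequence[int], threshold: float) -> float:
    """F1 when every error strictly above the threshold is flagged as an anomaly."""
    tp = sum(1 for e, y in zip(errors, labels) if e > threshold and y == 1)
    fp = sum(1 for e, y in zip(errors, labels) if e > threshold and y == 0)
    fn = sum(1 for e, y in zip(errors, labels) if e <= threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def select_threshold(
    errors: Sequence[float],
    labels: Sequence[int],
    percentiles: Sequence[float] = (85, 90, 92, 95),
) -> Tuple[float, float, float]:
    """Return (percentile, threshold, f1) with the best F1.

    Ties keep the earliest candidate in the list.
    """
    best: Optional[Tuple[float, float, float]] = None
    for q in percentiles:
        threshold = percentile(errors, q)
        score = f1_at(errors, labels, threshold)
        if best is None or score > best[2]:
            best = (q, threshold, score)
    if best is None:
        raise ValueError("percentiles must not be empty")
    return best
```

With a higher percentile the model flags fewer sequences, which usually trades recall for precision; scanning a short list of candidates is a cheap way to pick a reasonable operating point on labelled validation data.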

Synthetic Data

The current pipeline uses synthetic network and user activity logs generated by src/generate_data.py. The generated dataset includes normal events and several anomaly types:

brute_force_login
unusual_night_activity
data_exfiltration
unusual_ip_change

Generated raw and processed files are written under data/ and are not tracked by Git.
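
To make the idea of controlled anomaly scenarios concrete, here is a small sketch of how labelled synthetic events might be produced. It is illustrative only and does not reproduce src/generate_data.py: the field names (`user_id`, `bytes_sent`, `failed_logins`, `source_ip`, `label`), the value ranges, and the `generate_events` function are all assumptions. Only the four anomaly type names come from the list above.

```python
import random
from datetime import datetime, timedelta
from typing import Dict, List

ANOMALY_TYPES = [
    "brute_force_login",
    "unusual_night_activity",
    "data_exfiltration",
    "unusual_ip_change",
]


def generate_events(n_events: int = 1000, anomaly_rate: float = 0.05, seed: int = 42) -> List[Dict]:
    """Generate labelled events; every field and range here is hypothetical."""
    rng = random.Random(seed)  # seeded generator keeps runs reproducible
    start = datetime(2024, 1, 1)
    events = []
    for _ in range(n_events):
        is_anomaly = rng.random() < anomaly_rate
        anomaly_type = rng.choice(ANOMALY_TYPES) if is_anomaly else "normal"

        # Normal profile: working hours, small transfers, a fixed IP per user.
        user_id = rng.randint(1, 20)
        hour = rng.randint(9, 17)
        bytes_sent = rng.randint(200, 5000)
        failed_logins = rng.randint(0, 1)
        source_ip = f"10.0.0.{user_id}"

        # Each anomaly type distorts one aspect of the normal profile.
        if anomaly_type == "brute_force_login":
            failed_logins = rng.randint(10, 50)
        elif anomaly_type == "unusual_night_activity":
            hour = rng.randint(0, 4)
        elif anomaly_type == "data_exfiltration":
            bytes_sent = rng.randint(5_000_000, 50_000_000)
        elif anomaly_type == "unusual_ip_change":
            source_ip = f"203.0.113.{rng.randint(1, 254)}"

        timestamp = start + timedelta(
            days=rng.randint(0, 29), hours=hour, minutes=rng.randint(0, 59)
        )
        events.append({
            "timestamp": timestamp,
            "user_id": user_id,
            "source_ip": source_ip,
            "bytes_sent": bytes_sent,
            "failed_logins": failed_logins,
            "anomaly_type": anomaly_type,
            "label": int(is_anomaly),
        })
    return events
```

Because each scenario changes a single, known feature, it is easy to check whether a model or a rule-based explanation picks up the intended signal. The same property makes the data easier than real traffic, which is one reason the metrics should not be read as production evidence.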

Artifacts Policy

The public repository keeps source code, dependencies, launch scripts, README, and a small reference snapshot of metrics/plots. It does not track:

local planning notes such as future-development.md;
practice reports, PDFs, DOCX/HTML exports, and local docs/ materials;
generated datasets under data/;
model weights such as *.pt;
full prediction outputs and intermediate experiment sweeps;
temporary preview/build directories and dashboard screenshots.

If a future public dataset is required for an experiment, the repository should document how to obtain and prepare it instead of committing the dataset itself.

Limitations

Current results are based on synthetic data.
Strong metrics on synthetic data are not evidence of production readiness.
The LSTM Autoencoder is a research comparison point and is not guaranteed to beat tabular baselines in every scenario.
Public dataset support, richer feature engineering, explainability, and a demonstration dashboard are planned diploma-stage work.

Technologies

Python
pandas
numpy
scikit-learn
matplotlib
PyTorch
tqdm

License

This project is licensed under the MIT License. See LICENSE.
