
Eddie-dk1/argus

Argus

Argus is an academic research prototype for detecting anomalies in network activity and user behavior. The project is being prepared as the technical base for a diploma thesis: it compares classical machine-learning baselines with a sequence-based LSTM Autoencoder on synthetic security logs, and it is structured so that public datasets can be added later.

The repository is not a production IDS/UEBA platform. It does not provide real-time ingestion, SIEM integrations, incident management, access control, or production monitoring. Its current purpose is reproducible experimentation: data generation, preprocessing, feature engineering, model training, metrics, plots, and short anomaly explanations.

Current Scope

  • Synthetic log generation for controlled anomaly scenarios.
  • Preprocessing and feature preparation for tabular and sequence models.
  • Baseline models: Random Forest and Isolation Forest.
  • Sequence model: LSTM Autoencoder in PyTorch.
  • Evaluation with precision, recall, F1, ROC-AUC, and PR-AUC.
  • Basic rule-based anomaly explanations.
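
As a rough illustration of what "rule-based anomaly explanations" can mean, the sketch below maps simple feature checks to short reasons. It is not the code in src/explain_anomalies.py: the field names (failed_logins, hour, bytes_sent, ip_changed) and every threshold are hypothetical and chosen only for this example.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical feature names and thresholds, for illustration only.
Rule = Tuple[str, Callable[[float], bool], str]
RULES: List[Rule] = [
    ("failed_logins", lambda v: v >= 5, "many failed logins in a short window"),
    ("hour", lambda v: v < 6, "activity during night hours"),
    ("bytes_sent", lambda v: v > 50_000_000, "unusually large outbound transfer"),
    ("ip_changed", lambda v: bool(v), "login from a new IP address"),
]


def explain(event: Dict[str, float]) -> List[str]:
    """Return one short reason for every rule the event triggers."""
    reasons = []
    for field, predicate, message in RULES:
        # Missing fields are skipped rather than treated as violations.
        if field in event and predicate(event[field]):
            reasons.append(f"{message} ({field}={event[field]})")
    return reasons


event = {"failed_logins": 12, "hour": 3, "bytes_sent": 1_200, "ip_changed": 0}
for reason in explain(event):
    print(reason)
```

Keeping the rules in a plain list makes it easy to attach an explanation to an event that a model has already flagged, without tying the explanation to any particular model.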

The next diploma stages are expected to add public datasets such as the CERT Insider Threat Dataset, CICIDS2017, and CSE-CIC-IDS2018. Those datasets are intentionally not stored in this repository.

Documentation

Project Structure

argus/
├── docs/
│   ├── architecture.md
│   ├── datasets.md
│   └── roadmap.md
├── notebooks/
│   └── 01_exploration.ipynb
├── results/
│   ├── baseline_reference/
│   ├── plots/
│   ├── metrics_autoencoder.json
│   ├── metrics_baseline.json
│   └── metrics_summary.csv
├── src/
│   ├── evaluate.py
│   ├── explain_anomalies.py
│   ├── features.py
│   ├── generate_data.py
│   ├── preprocessing.py
│   ├── train_baseline.py
│   ├── train_lstm_autoencoder.py
│   └── utils.py
├── requirements.txt
├── run_all.sh
└── README.md

data/ and most files in results/ are generated locally and ignored by Git.

Setup

python3 -m pip install -r requirements.txt

Optional terminal formatting for the final pipeline report:

python3 -m pip install rich

Run Pipeline

Run the full synthetic pipeline:

bash run_all.sh

Useful variants:

bash run_all.sh --best-autoencoder
bash run_all.sh --skip-generate --best-autoencoder
bash run_all.sh --with-stage10
bash run_all.sh --skip-generate --with-stage10 --stage10-epochs 50

Manual stage-by-stage run:

python3 src/generate_data.py
python3 src/preprocessing.py
python3 src/train_baseline.py
python3 src/train_lstm_autoencoder.py
python3 src/evaluate.py
python3 src/explain_anomalies.py

Current practical LSTM Autoencoder configuration:

python3 src/train_lstm_autoencoder.py \
  --sequence-length 5 \
  --hidden-size 32 \
  --epochs 10 \
  --batch-size 128 \
  --learning-rate 0.001 \
  --threshold-percentiles 85 90 92 95 \
  --early-stopping-patience 0 \
  --selection-metric f1
python3 src/evaluate.py
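
The --threshold-percentiles and --selection-metric f1 flags suggest that the anomaly threshold is picked from a few candidate percentiles of the reconstruction error. The sketch below shows one plausible reading of that idea using only the standard library. It is not the logic of src/train_lstm_autoencoder.py: the function names, the nearest-rank percentile, and the choice to take percentiles over training errors are all assumptions.

```python
import math
from typing import Dict, Optional, Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """Nearest-rank percentile (q in 0..100) of a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for the positive class (1 = anomaly); 0.0 if nothing is positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def select_threshold(
    train_errors: Sequence[float],
    val_errors: Sequence[float],
    val_labels: Sequence[int],
    percentiles: Sequence[float],
) -> Optional[Dict[str, float]]:
    """Try each candidate percentile and keep the one with the best F1."""
    best = None
    for q in percentiles:
        # Assumption: the threshold is a percentile of the training errors.
        threshold = percentile(train_errors, q)
        preds = [1 if e > threshold else 0 for e in val_errors]
        score = f1_score(val_labels, preds)
        if best is None or score > best["f1"]:
            best = {"percentile": q, "threshold": threshold, "f1": score}
    return best


# Toy reconstruction errors, not real results from this project.
train_errors = [float(i) for i in range(1, 101)]
val_errors = [10.0, 50.0, 88.0, 91.0, 93.0, 97.0, 99.0]
val_labels = [0, 0, 0, 0, 1, 1, 1]
best = select_threshold(train_errors, val_errors, val_labels, [85, 90, 92, 95])
print(best)
```

Selecting the threshold with labels is only possible when a labelled validation split exists, which is the case for synthetic logs. On unlabelled data, a fixed percentile would have to be chosen in advance.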

Synthetic Data

The current pipeline uses synthetic network and user activity logs generated by src/generate_data.py. The generated dataset includes normal events and several anomaly types:

  • brute_force_login
  • unusual_night_activity
  • data_exfiltration
  • unusual_ip_change

Generated raw and processed files are written under data/ and are not tracked by Git.
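
To make the idea of controlled anomaly scenarios concrete, here is a minimal sketch of a seeded generator that injects the four anomaly types into otherwise normal events. It does not reproduce src/generate_data.py: the event fields, value ranges, and the 5% anomaly rate are hypothetical and exist only for this example.

```python
import random
from typing import Dict, List

ANOMALY_TYPES = [
    "brute_force_login",
    "unusual_night_activity",
    "data_exfiltration",
    "unusual_ip_change",
]


def make_event(rng: random.Random, anomaly_rate: float = 0.05) -> Dict[str, object]:
    """Create one event; field names and value ranges are illustrative."""
    # Start from a "normal" working-hours event.
    event: Dict[str, object] = {
        "hour": rng.randint(8, 18),
        "failed_logins": rng.randint(0, 1),
        "bytes_sent": rng.randint(1_000, 500_000),
        "ip_changed": 0,
        "label": 0,
        "anomaly_type": "normal",
    }
    if rng.random() < anomaly_rate:
        kind = rng.choice(ANOMALY_TYPES)
        event["label"] = 1
        event["anomaly_type"] = kind
        # Each scenario distorts exactly one field.
        if kind == "brute_force_login":
            event["failed_logins"] = rng.randint(10, 50)
        elif kind == "unusual_night_activity":
            event["hour"] = rng.randint(0, 4)
        elif kind == "data_exfiltration":
            event["bytes_sent"] = rng.randint(50_000_000, 500_000_000)
        else:  # unusual_ip_change
            event["ip_changed"] = 1
    return event


def generate(n: int, seed: int = 42, anomaly_rate: float = 0.05) -> List[Dict[str, object]]:
    """Generate n events reproducibly from a fixed seed."""
    rng = random.Random(seed)
    return [make_event(rng, anomaly_rate) for _ in range(n)]


events = generate(1000)
print(sum(e["label"] for e in events), "anomalies out of", len(events))
```

A dedicated random.Random instance with a fixed seed keeps runs reproducible without touching the global random state, which matters when several pipeline stages draw random numbers.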

Artifacts Policy

The public repository keeps source code, dependencies, launch scripts, README, and a small reference snapshot of metrics/plots. It does not track:

  • local planning notes such as future-development.md;
  • practice reports, PDFs, DOCX/HTML exports, and local docs/ materials;
  • generated datasets under data/;
  • model weights such as *.pt;
  • full prediction outputs and intermediate experiment sweeps;
  • temporary preview/build directories and dashboard screenshots.

If a future public dataset is required for an experiment, the repository should document how to obtain and prepare it instead of committing the dataset itself.

Limitations

  • Current results are based on synthetic data.
  • Strong metrics on synthetic data are not evidence of production readiness.
  • The LSTM Autoencoder is a research comparison point and is not guaranteed to beat tabular baselines in every scenario.
  • Public dataset support, richer feature engineering, explainability, and a demonstration dashboard are planned diploma-stage work.

Technologies

  • Python
  • pandas
  • numpy
  • scikit-learn
  • matplotlib
  • PyTorch
  • tqdm

License

This project is licensed under the MIT License. See LICENSE.
