MMT-RCA — Montimage Root Cause Analysis Platform

A generic, multi-method anomaly detection and root cause analysis platform. It connects to a monitoring data source (MQTT, Kafka, or HTTP) through a config file, so a new integration requires no code changes.

Architecture

Data Source (MQTT/Kafka/HTTP)
        │
        ▼
   [Collector]  ── reads config/mmt-rca.yml
        │           maps raw messages → observations
        ▼
  [Analysis API]  ── FastAPI (port 8000)
        │
        ├── Statistical anomaly detector (z-score)
        ├── Isolation Forest detector (unsupervised ML)
        ├── Similarity engine (adjusted cosine vs. knowledge base)
        ├── SHAP attribution (which attributes drove the result)
        └── LLM synthesis (Ollama + llama3.1 → root cause narrative)
        │
        ▼
  [PostgreSQL + TimescaleDB + pgvector]
  [Redis]  ── real-time pub/sub
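
To make the first detector in the diagram concrete, here is a minimal sketch of z-score anomaly detection in plain Python. It is not taken from the MMT-RCA code base: the function names (fit_baseline, zscores, is_anomalous), the threshold of 3.0, and the sample values are assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def fit_baseline(observations):
    """Compute per-attribute (mean, std) from a list of attribute dicts."""
    baseline = {}
    for name in observations[0].keys():
        values = [obs[name] for obs in observations]
        baseline[name] = (mean(values), pstdev(values))
    return baseline


def zscores(baseline, attributes):
    """Return the z-score of each attribute that has a baseline."""
    scores = {}
    for name, value in attributes.items():
        if name not in baseline:
            continue  # no baseline for this attribute: skip it
        mu, sigma = baseline[name]
        # A constant baseline has sigma == 0, so the z-score is undefined;
        # this sketch reports 0.0 in that case (an assumption, not a rule).
        scores[name] = 0.0 if sigma == 0 else (value - mu) / sigma
    return scores


def is_anomalous(scores, threshold=3.0):
    """Flag the observation if any attribute deviates beyond the threshold."""
    return any(abs(z) > threshold for z in scores.values())


# Hypothetical "normal" samples, then a suspicious observation.
normal = [
    {"cpu": 0.20, "ms_delay": 100},
    {"cpu": 0.30, "ms_delay": 120},
    {"cpu": 0.25, "ms_delay": 110},
]
baseline = fit_baseline(normal)
spike = zscores(baseline, {"cpu": 0.97, "ms_delay": 2850})
```

A per-attribute z-score keeps the result easy to explain, but it treats attributes independently, which is presumably why the platform pairs it with the Isolation Forest and similarity stages.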

Quick Start

Prerequisites

  • Docker + Docker Compose
  • 8 GB RAM minimum (for llama3.1; use llama3.2:3b for lighter machines)

1. Configure environment

cp .env.example .env
# Edit .env if needed (DB password, model choice)

On macOS, run Ollama natively to get GPU (Metal) acceleration:

brew install ollama
ollama serve          # runs on localhost:11434
ollama pull llama3.1

Then set in .env:

OLLAMA_URL=http://host.docker.internal:11434

Then remove the ollama and ollama-init services from the Docker Compose file, or override them with docker-compose.dev.yml.

2. Start the stack

make up
# or: docker compose up -d

The first start downloads the llama3.1 model (~4.7 GB). Monitor with:

make logs          # all services
docker compose logs -f ollama-init   # model download progress

3. Verify health

curl http://localhost:8000/health
# {"status":"ok","db":true,"ollama":true,"ollama_model":"llama3.1"}
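
Because the first start may take a while, a script can poll the health endpoint instead of checking it by hand. The sketch below assumes that readiness means "status" is "ok" and both "db" and "ollama" are true, which is inferred from the sample response above; the helper names, retry count, and delay are hypothetical.

```python
import json
import time
import urllib.request

HEALTH_URL = "http://localhost:8000/health"


def is_ready(body):
    """Interpret a /health response body (JSON text)."""
    data = json.loads(body)
    return (
        data.get("status") == "ok"
        and data.get("db") is True
        and data.get("ollama") is True
    )


def fetch_health(url=HEALTH_URL):
    """Fetch the raw /health response body as text."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def wait_until_ready(fetch=fetch_health, attempts=30, delay=2.0):
    """Poll until the service reports ready, or give up after `attempts` tries."""
    for _ in range(attempts):
        try:
            if is_ready(fetch()):
                return True
        except (OSError, ValueError):
            # Connection refused or a non-JSON body: the stack is still starting.
            pass
        time.sleep(delay)
    return False
```

Passing fetch as a parameter keeps the polling logic testable without a running stack.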

4. Test the analysis endpoint (no MQTT needed)

curl -X POST http://localhost:8000/analyze \
  -H "Content-Type: application/json" \
  -d '{
    "project_id": "default",
    "observation": {
      "timestamp": "2024-01-15T14:31:42Z",
      "source_id": "gateway-01",
      "attributes": {
        "cpu": 0.97,
        "ram": 0.04,
        "nb_conn": 450,
        "ms_delay": 2850,
        "recv_rate": 0.02
      }
    }
  }'

Response:

{
  "event_id": "...",
  "event_type": "UNKNOWN",
  "anomaly_score": 0.82,
  "best_match": null,
  "top_k_matches": [],
  "contributing_attributes": {"ms_delay": 0.71, "recv_rate": 0.18, ...},
  "rca_narrative": "High message delay and near-zero receive rate suggest upstream network congestion or link failure.",
  "rca_confidence": "MEDIUM",
  "rca_actions": ["Check upstream ISP status", "Inspect gateway-01 network interface", "..."],
  "detector_results": [...]
}
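
The same call can be made from Python using only the standard library. This is a sketch, not an official client: the helper names are hypothetical, and the request and response fields are copied from the example above.

```python
import json
import urllib.request

API_URL = "http://localhost:8000"


def build_analyze_request(project_id, source_id, timestamp, attributes,
                          base_url=API_URL):
    """Build the POST /analyze request shown in the curl example."""
    payload = {
        "project_id": project_id,
        "observation": {
            "timestamp": timestamp,
            "source_id": source_id,
            "attributes": attributes,
        },
    }
    return urllib.request.Request(
        f"{base_url}/analyze",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(request):
    """Send the request and decode the JSON report (needs a running stack)."""
    with urllib.request.urlopen(request, timeout=60) as resp:
        return json.load(resp)


def top_contributors(report, n=3):
    """Rank contributing attributes by absolute weight, largest first."""
    contrib = report.get("contributing_attributes", {})
    ranked = sorted(contrib.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return ranked[:n]
```

With the stack running, analyze(build_analyze_request("default", "gateway-01", "2024-01-15T14:31:42Z", {"cpu": 0.97})) would return the report as a dict, and top_contributors(report) lists the attributes that weighed most.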

Building the Knowledge Base

Step 1 — Record a normal baseline

# Start a learning session (type NORMAL)
SESSION=$(curl -s -X POST http://localhost:8000/learning/sessions \
  -H "Content-Type: application/json" \
  -d '{"project_id":"default","label":"Normal operation","event_type":"NORMAL"}' \
  | jq -r .id)

# Add observations manually, or let the collector run during normal operation
# Then stop the session — this triggers feature extraction and KB entry creation
curl -X POST http://localhost:8000/learning/sessions/$SESSION/stop

Step 2 — Record known incident patterns

# Trigger/simulate the incident on the monitored system, then:
SESSION=$(curl -s -X POST http://localhost:8000/learning/sessions \
  -H "Content-Type: application/json" \
  -d '{"project_id":"default","label":"DoS attack — HTTP flood","event_type":"INCIDENT",
       "description":"Multiple requests from several sources. Root cause: potential DDoS."}' \
  | jq -r .id)

# Wait while the incident runs, then stop:
curl -X POST http://localhost:8000/learning/sessions/$SESSION/stop
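
The start/stop pair in Steps 1 and 2 maps naturally onto a Python context manager, so the session is stopped even if the recording code raises. This is a sketch under assumptions: learning_session and post_json are hypothetical helpers, and the only response field relied on is "id", as in the jq calls above.

```python
import json
import urllib.request
from contextlib import contextmanager

API_URL = "http://localhost:8000"


def post_json(path, payload=None, base_url=API_URL):
    """POST to the API; sends a JSON body only when a payload is given."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    headers = {} if payload is None else {"Content-Type": "application/json"}
    req = urllib.request.Request(base_url + path, data=data,
                                 headers=headers, method="POST")
    with urllib.request.urlopen(req, timeout=30) as resp:
        body = resp.read()
        return json.loads(body) if body else {}


@contextmanager
def learning_session(label, event_type, project_id="default",
                     description=None, post=post_json):
    """Start a learning session, yield its id, and always stop it on exit."""
    payload = {"project_id": project_id, "label": label,
               "event_type": event_type}
    if description:
        payload["description"] = description
    session_id = post("/learning/sessions", payload)["id"]
    try:
        yield session_id
    finally:
        # Stopping triggers feature extraction and KB entry creation.
        post(f"/learning/sessions/{session_id}/stop")
```

Usage would look like `with learning_session("Normal operation", "NORMAL") as sid: time.sleep(300)`. Note that stopping on an exception still builds a KB entry from a partial recording, which may or may not be what you want.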

Step 3 — Monitor in real time

Configure config/mmt-rca.yml with your MQTT/Kafka source, then:

make restart-collector

The collector sends every observation to the analysis service, which now matches against the knowledge base.
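
As an illustration of what matching against the knowledge base can mean, the sketch below computes an adjusted cosine similarity in plain Python. It assumes "adjusted" means centering both vectors on the NORMAL baseline before taking the cosine; how MMT-RCA actually centers or scales its features is not stated here, and the KB entry shape ("label", "features") and the 0.8 cut-off are hypothetical.

```python
import math


def adjusted_cosine(a, b, center):
    """Cosine similarity of a and b after subtracting a common center."""
    keys = sorted(set(a) & set(b) & set(center))
    da = [a[k] - center[k] for k in keys]
    db = [b[k] - center[k] for k in keys]
    dot = sum(x * y for x, y in zip(da, db))
    norm_a = math.sqrt(sum(x * x for x in da))
    norm_b = math.sqrt(sum(y * y for y in db))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no deviation from the center: nothing to compare
    return dot / (norm_a * norm_b)


def best_match(observation, kb_entries, center, min_score=0.8):
    """Return (score, label) of the closest KB entry, or None if below min_score."""
    scored = [
        (adjusted_cosine(observation, entry["features"], center), entry["label"])
        for entry in kb_entries
    ]
    scored.sort(reverse=True)
    if scored and scored[0][0] >= min_score:
        return scored[0]
    return None


# Hypothetical baseline and knowledge base entries.
center = {"cpu": 0.2, "ms_delay": 100}
kb = [
    {"label": "cpu spike", "features": {"cpu": 0.8, "ms_delay": 100}},
    {"label": "delay", "features": {"cpu": 0.2, "ms_delay": 500}},
]
match = best_match({"cpu": 0.9, "ms_delay": 100}, kb, center)
```

Centering makes the score reflect the direction of the deviation from normal rather than the raw magnitudes. Attributes on very different scales (fractions versus milliseconds) would still need normalising first, which this sketch leaves out.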

Integrating a Data Source

Edit config/mmt-rca.yml:

project: my-client

inputs:
  - name: iot_sensors
    adapter: mqtt
    broker: "mqtt.client.example.com:1883"
    topics:
      - "sensors/+/data"
    feature_map:
      "$.cpu_pct": "cpu"
      "$.free_mem_mb": "mem_free"
      "$.rx_bytes_s": "recv_rate"
      "$.latency_ms": "ms_delay"
    group_by: "$.device_id"
    window_seconds: 30

Then run make restart-collector. No code changes are needed.
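
To show what the feature_map and group_by entries express, here is a sketch of the mapping step in plain Python. It handles only simple dotted paths such as "$.cpu_pct", not full JSONPath, and it assumes that group_by supplies the observation's source_id. Both points are assumptions about the collector, and the helper names are hypothetical.

```python
def resolve(path, message):
    """Resolve a simple dotted path like '$.a.b'; return None if missing."""
    if not path.startswith("$."):
        raise ValueError(f"unsupported path: {path!r}")
    node = message
    for part in path[2:].split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def to_observation(message, feature_map, group_by):
    """Map a raw message to an observation using the config entries."""
    attributes = {}
    for path, name in feature_map.items():
        value = resolve(path, message)
        if value is not None:  # fields absent from the message are skipped
            attributes[name] = value
    return {"source_id": resolve(group_by, message), "attributes": attributes}


# The feature_map from the YAML above, applied to a hypothetical MQTT payload.
feature_map = {
    "$.cpu_pct": "cpu",
    "$.free_mem_mb": "mem_free",
    "$.rx_bytes_s": "recv_rate",
    "$.latency_ms": "ms_delay",
}
raw = {"device_id": "gw-01", "cpu_pct": 0.97, "latency_ms": 2850}
observation = to_observation(raw, feature_map, "$.device_id")
```

The sketch leaves out window_seconds, which presumably controls how messages are aggregated per group before analysis.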

API Reference

Method  Path                           Description
GET     /health                        Service health + Ollama status
POST    /analyze                       Analyze one observation → RCA report
GET     /events/{project_id}           List detected events (paginated)
GET     /events/{project_id}/{id}      Get single event detail
POST    /projects                      Create a project
POST    /learning/sessions             Start a learning session
POST    /learning/sessions/{id}/stop   Stop + build KB entry
GET     /learning/kb/{project_id}      List knowledge base entries

Makefile Targets

make up              # start all services
make up-dev          # start with hot reload
make down            # stop all services
make logs            # tail all logs
make db-shell        # open psql
make ollama-pull     # manually pull a model
make restart-analysis
make restart-collector
make clean           # remove containers + build cache

Choosing a Model

Model           Size     Speed    Quality   Recommended for
llama3.1 (8B)   4.7 GB   Medium   High      Production
llama3.2:3b     2.0 GB   Fast     Medium    Development / low RAM
phi3:mini       2.3 GB   Fast     Medium    Edge deployment
mistral:7b      4.1 GB   Medium   High      Alternative to llama3.1

To change the model, set OLLAMA_MODEL in .env (for example, OLLAMA_MODEL=llama3.2:3b), then run make ollama-pull.
