38 changes: 36 additions & 2 deletions .env.example
@@ -144,6 +144,40 @@ LANGFUSE_ENABLED=true
LANGFUSE_PUBLIC_KEY=your-langfuse-public-key
LANGFUSE_SECRET_KEY=your-langfuse-secret-key
LANGFUSE_HOST=https://cloud.langfuse.com
# Required for Grafana dashboards: keep these trace fields enabled so dashboard
# filters and breakdowns can read ticket/project/workflow labels from Langfuse.
# Comma-separated list of trace attributes to set as Langfuse tags.
# Available values: ticket_key, ticket_type, project_id, workflow_step, repo, pr_number, ci_status, event_source, event_type, llm_model
LANGFUSE_TRACE_TAGS=ticket_key,ticket_type,project_id,workflow_step,repo,pr_number,ci_status,event_source,event_type,llm_model
# Required for Grafana dashboards: project_id, ticket_type, workflow_step, and
# ticket_key must be present in metadata because dashboard ClickHouse queries use
# metadata['project_id'], metadata['ticket_type'], and metadata['workflow_step'].
# Comma-separated list of trace attributes to set as Langfuse metadata.
# Available values: ticket_key, ticket_type, project_id, workflow_step, repo, pr_number, ci_status, event_source, event_type, retry_count, system_prompt_length, llm_model
LANGFUSE_TRACE_METADATA=ticket_key,ticket_type,project_id,workflow_step,repo,pr_number,ci_status,event_source,event_type,retry_count,system_prompt_length,llm_model

# Grafana dashboard stack. These are used by docker-compose.yml and
# devtools/grafana/compose.grafana.yml. Prometheus/Redis defaults match this
# repo's compose files.
#
# Local self-hosted Langfuse defaults:
# - Start Langfuse from its compose project.
# - Run Grafana with devtools/grafana/compose.langfuse-network.yml so Grafana
# joins the Langfuse compose network.
# - The ClickHouse service is then reachable as clickhouse:9000.
GRAFANA_PORT=3010
GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=grafana
LANGFUSE_DOCKER_NETWORK=langfuse_default
CLICKHOUSE_HOST=clickhouse
CLICKHOUSE_PORT=9000
CLICKHOUSE_DATABASE=default
CLICKHOUSE_USER=clickhouse
CLICKHOUSE_PASSWORD=clickhouse
PROMETHEUS_HOST=host.containers.internal
PROMETHEUS_PORT=9092
REDIS_HOST=host.containers.internal
REDIS_PORT=6380

# OpenTelemetry distributed tracing (separate from Langfuse LLM tracing above)
# Enable/disable OTLP trace export
@@ -164,8 +198,8 @@ OTEL_SDK_DISABLED=true
# - Production: your-registry.com/forge:v1.0.0
# Note: Use localhost/ prefix for local images to avoid podman short-name resolution prompts
CONTAINER_IMAGE=localhost/forge-dev:latest
-# Container execution timeout in seconds (default: 2 hours)
-CONTAINER_TIMEOUT=7200
+# Container execution timeout in seconds (default: 30 minutes)
+CONTAINER_TIMEOUT=1800
# Container resource limits
CONTAINER_MEMORY=4g
CONTAINER_CPUS=2
2 changes: 2 additions & 0 deletions README.md
@@ -376,6 +376,8 @@ tests/ # Unit and integration tests

- **API server**: `http://localhost:8000/metrics`
- **Worker**: `http://localhost:8001/metrics`
- **Prometheus UI**: `http://localhost:9092`
- **Grafana dashboards**: `http://localhost:3010` when the compose `grafana` service is running

Key metrics:
- `forge_workflows_started_total` - Workflows started by type
22 changes: 19 additions & 3 deletions devtools/README.md
@@ -1,18 +1,29 @@
# Forge Developer Tools

-Local development stack for running Forge services on the host with Prometheus scraping both the API and worker.
+Local development stack for running Forge services on the host with Prometheus scraping both the API and worker, plus Grafana dashboards for Forge observability.

## Usage

```bash
-# Start Redis + Prometheus (scrapes host-local processes)
-docker compose -f devtools/docker-compose.dev.yml up -d
+# Start Redis + Prometheus + Grafana (scrapes host-local processes)
+docker compose --env-file .env -f devtools/docker-compose.dev.yml up -d

# In separate terminals, start the local services:
uv run uvicorn forge.main:app --reload --port 8000 --host 0.0.0.0
uv run forge worker
```

**With self-hosted Langfuse** — add the network override so Grafana can reach ClickHouse:

```bash
docker compose --env-file .env \
-f devtools/docker-compose.dev.yml \
-f devtools/grafana/compose.langfuse-network.yml \
up -d
```

> `compose.langfuse-network.yml` requires the `langfuse_default` Docker network to already exist (i.e. Langfuse must be running). Without it the entire stack will fail to start. Omit this file if you are not using self-hosted Langfuse — the Prometheus and Redis datasources work without it.

## Endpoints

| Service | URL |
@@ -22,11 +33,16 @@ uv run forge worker
| Worker metrics | http://localhost:8001/metrics |
| Redis | redis://localhost:6380/0 |
| Prometheus | http://localhost:9092 |
| Grafana | http://localhost:3010 |

## How it works

`prometheus.dev.yml` targets `host.docker.internal` which resolves to the host machine from inside the Prometheus container. The `extra_hosts: host.docker.internal:host-gateway` entry in `docker-compose.dev.yml` enables this on Linux/Fedora.

Grafana provisions dashboards from `devtools/grafana/dashboards/` and datasources from `devtools/grafana/provisioning/`. The dashboards expect Forge to emit Langfuse trace tags and metadata for `ticket_key`, `ticket_type`, `project_id`, and `workflow_step`; `.env.example` enables those fields.

For local self-hosted Langfuse, the `compose.langfuse-network.yml` override joins Grafana to the Langfuse compose network so it can reach ClickHouse at `clickhouse:9000`. Set `LANGFUSE_DOCKER_NETWORK` if your Langfuse compose network is not `langfuse_default`.

To reload Prometheus config without restarting:
```bash
curl -X POST http://localhost:9092/-/reload
```
29 changes: 29 additions & 0 deletions devtools/docker-compose.dev.yml
@@ -11,6 +11,7 @@ version: "3.9"
# uv run forge worker
#
# Prometheus dashboard: http://localhost:9092
# Grafana dashboards: http://localhost:3010 (admin / grafana)

services:
redis:
@@ -41,6 +42,34 @@ services:
extra_hosts:
- "host.docker.internal:host-gateway"

grafana:
image: grafana/grafana:latest
ports:
- "${GRAFANA_PORT:-3010}:3000"
environment:
- GF_SECURITY_ADMIN_USER=${GRAFANA_ADMIN_USER:-admin}
- GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD:-grafana}
- GF_INSTALL_PLUGINS=grafana-clickhouse-datasource,redis-datasource
- CLICKHOUSE_HOST=${CLICKHOUSE_HOST:-clickhouse}
- CLICKHOUSE_PORT=${CLICKHOUSE_PORT:-9000}
- CLICKHOUSE_DATABASE=${CLICKHOUSE_DATABASE:-default}
- CLICKHOUSE_USER=${CLICKHOUSE_USER:-clickhouse}
- CLICKHOUSE_PASSWORD=${CLICKHOUSE_PASSWORD:-clickhouse}
- PROMETHEUS_HOST=prometheus
- PROMETHEUS_PORT=9090
- REDIS_HOST=redis
- REDIS_PORT=6379
volumes:
- grafana-data:/var/lib/grafana
- ./grafana/provisioning:/etc/grafana/provisioning:ro,z
- ./grafana/dashboards:/var/lib/grafana/dashboards:ro,z
depends_on:
- prometheus
- redis
extra_hosts:
- "host.containers.internal:host-gateway"

volumes:
redis-data:
prometheus-data:
grafana-data:
213 changes: 213 additions & 0 deletions devtools/grafana/README.md
@@ -0,0 +1,213 @@
# Forge Grafana Dashboards

Grafana provisioning for Forge observability dashboards. The stack is
preconfigured for Langfuse ClickHouse, Forge Prometheus, and Forge Redis.

## Start

When running the full Forge compose stack:

```bash
docker compose --env-file .env \
-f docker-compose.yml \
-f devtools/grafana/compose.langfuse-network.yml \
up -d grafana
```

When running Forge services locally on the host with the devtools stack:

```bash
docker compose --env-file .env \
-f devtools/docker-compose.dev.yml \
-f devtools/grafana/compose.langfuse-network.yml \
up -d
```

For dashboard-only development against already-running Prometheus/Redis/ClickHouse:

```bash
docker compose --env-file .env \
-f devtools/grafana/compose.grafana.yml \
-f devtools/grafana/compose.langfuse-network.yml \
up -d
```

UI (default port): <http://localhost:3010> - log in as **admin / grafana**.

## Required Forge Trace Fields

The dashboards use Forge's configurable Langfuse trace fields for filters and
workflow-step breakdowns. Configure these in `.env` before running Forge:

```bash
LANGFUSE_TRACE_TAGS=ticket_key,ticket_type,project_id,workflow_step,repo,pr_number,ci_status,event_source,event_type,llm_model
LANGFUSE_TRACE_METADATA=ticket_key,ticket_type,project_id,workflow_step,repo,pr_number,ci_status,event_source,event_type,retry_count,system_prompt_length,llm_model
```
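
As a quick sanity check before starting Forge, the following minimal sketch verifies that the four fields the dashboards depend on (`ticket_key`, `ticket_type`, `project_id`, `workflow_step`) appear in both variables. The helper names `parse_env` and `missing_trace_fields` are hypothetical and not part of Forge, and the parser only handles plain `KEY=value` lines.

```python
from pathlib import Path

# Fields the dashboards rely on, per the comments in .env.example.
REQUIRED_FIELDS = {"ticket_key", "ticket_type", "project_id", "workflow_step"}
CHECKED_VARS = ("LANGFUSE_TRACE_TAGS", "LANGFUSE_TRACE_METADATA")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blank lines and comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_trace_fields(env: dict[str, str]) -> dict[str, set[str]]:
    """Return, per variable, the required fields absent from its list."""
    missing = {}
    for var in CHECKED_VARS:
        present = {item.strip() for item in env.get(var, "").split(",") if item.strip()}
        gap = REQUIRED_FIELDS - present
        if gap:
            missing[var] = gap
    return missing


def check_env_file(path: str = ".env") -> dict[str, set[str]]:
    """Hypothetical convenience wrapper: read a file and report the gaps."""
    return missing_trace_fields(parse_env(Path(path).read_text(encoding="utf-8")))
```

An empty result from `check_env_file()` means both lists contain every required field. A missing variable is reported as lacking all four.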

## Local Self-Hosted Langfuse

The default ClickHouse settings in `.env.example` match the standard local
Langfuse compose stack:

```bash
LANGFUSE_DOCKER_NETWORK=langfuse_default
CLICKHOUSE_HOST=clickhouse
CLICKHOUSE_PORT=9000
CLICKHOUSE_DATABASE=default
CLICKHOUSE_USER=clickhouse
CLICKHOUSE_PASSWORD=clickhouse
```

Use `compose.langfuse-network.yml` in the `docker compose` command so Grafana
joins that network. If your Langfuse compose project uses a different network
name, set `LANGFUSE_DOCKER_NETWORK` accordingly.
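
Because the override expects the Langfuse network to exist already, it can help to check for it before running `docker compose`. This is a minimal sketch that shells out to `docker network inspect`. The function names are hypothetical, and the `run` parameter exists only so the check can be exercised without Docker.

```python
import os
import subprocess


def configured_network(env=os.environ) -> str:
    """Network name from LANGFUSE_DOCKER_NETWORK, defaulting to langfuse_default."""
    return env.get("LANGFUSE_DOCKER_NETWORK") or "langfuse_default"


def docker_network_exists(name: str, run=subprocess.run) -> bool:
    """Return True if `docker network inspect NAME` exits with status 0."""
    try:
        result = run(
            ["docker", "network", "inspect", name],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except FileNotFoundError:
        # The docker CLI is not installed or not on PATH.
        return False
    return result.returncode == 0
```

If `docker_network_exists(configured_network())` returns `False`, start Langfuse first or leave `compose.langfuse-network.yml` out of the command.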

## Tear down

```bash
docker compose --env-file .env -f devtools/grafana/compose.grafana.yml down -v
```

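The `-v` flag removes the `grafana-data` volume, wiping Grafana's internal state: service accounts, tokens, and any dashboard edits not yet saved back to the JSON files. Provisioned dashboards are restored from the repo on the next start.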

---

## Grafana MCP Server (Claude Code)

The [grafana/mcp-grafana](https://github.com/grafana/mcp-grafana) server lets
Claude Code query datasources, read and write dashboards, manage alerts, and
navigate your Grafana instance through natural language.

### 1. Create a service account token

After starting the stack, create a token via the API:

```bash
# Create a service account
SA_ID=$(curl -sf -u admin:grafana -X POST http://localhost:3010/api/serviceaccounts \
-H 'Content-Type: application/json' \
-d '{"name":"claude-code","role":"Editor"}' | python3 -c "import sys,json; print(json.load(sys.stdin)['id'])")

# Generate a token for it
curl -sf -u admin:grafana -X POST http://localhost:3010/api/serviceaccounts/$SA_ID/tokens \
-H 'Content-Type: application/json' \
-d '{"name":"claude-code-token"}' | python3 -c "import sys,json; print(json.load(sys.stdin)['key'])"
```

Save the printed token.
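
The same two calls can be made from Python with only the standard library. This is a sketch rather than a supported tool: `grafana_post` and `create_token` are hypothetical names, and the default credentials and port are the ones from this README.

```python
import base64
import json
import urllib.request


def grafana_post(base_url, path, payload, user="admin", password="grafana"):
    """Build (but do not send) a JSON POST request with HTTP basic auth."""
    credentials = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}{path}",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {credentials}",
        },
        method="POST",
    )


def create_token(base_url="http://localhost:3010", name="claude-code", role="Editor"):
    """Create a service account, then return a newly generated token for it."""
    request = grafana_post(base_url, "/api/serviceaccounts", {"name": name, "role": role})
    with urllib.request.urlopen(request) as response:
        account_id = json.load(response)["id"]

    request = grafana_post(
        base_url,
        f"/api/serviceaccounts/{account_id}/tokens",
        {"name": f"{name}-token"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["key"]
```

Calling `create_token()` against a running stack returns the token string. Like the `curl` version, it raises an error if Grafana rejects either request.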

### 2. Install mcp-grafana

`mcp-grafana` is distributed as a Go binary and via PyPI. If you already have
`uv` installed (which this project requires), the easiest path is:

```bash
uv tool install mcp-grafana
```

### 3. Add the MCP server to Claude Code

Replace `<your-token>` with the token from step 1.

**Local scope** (current project only):

```bash
claude mcp add-json "grafana" \
'{"command":"uvx","args":["mcp-grafana"],"env":{"GRAFANA_URL":"http://localhost:3010","GRAFANA_SERVICE_ACCOUNT_TOKEN":"<your-token>"}}'
```

**User scope** (available across all your projects):

```bash
claude mcp add-json "grafana" --scope user \
'{"command":"uvx","args":["mcp-grafana"],"env":{"GRAFANA_URL":"http://localhost:3010","GRAFANA_SERVICE_ACCOUNT_TOKEN":"<your-token>"}}'
```

Verify the server is registered:

```bash
claude mcp list
```
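
Hand-editing the JSON argument makes quoting mistakes easy. The sketch below builds the same command string programmatically. The helper names are hypothetical, and the `read_only` option simply appends the `--disable-write` argument mentioned in the notes further down.

```python
import json
import shlex


def mcp_server_config(token, url="http://localhost:3010", read_only=False):
    """Return the server definition passed to `claude mcp add-json`."""
    args = ["mcp-grafana"]
    if read_only:
        args.append("--disable-write")
    return {
        "command": "uvx",
        "args": args,
        "env": {
            "GRAFANA_URL": url,
            "GRAFANA_SERVICE_ACCOUNT_TOKEN": token,
        },
    }


def add_json_command(token, scope=None, **config_options):
    """Return a shell-quoted `claude mcp add-json` command line."""
    command = ["claude", "mcp", "add-json", "grafana"]
    if scope:
        command += ["--scope", scope]
    config = mcp_server_config(token, **config_options)
    command.append(json.dumps(config, separators=(",", ":")))
    return shlex.join(command)
```

For example, `print(add_json_command("<your-token>", scope="user"))` prints a command equivalent to the user-scope one above, ready to paste into a shell.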

### 4. Verify the connection

Start a Claude Code session and ask:

> List my Grafana dashboards.

Claude should respond with the list of the project's dashboards.

---

### Reconfigure with a new token

If you recreate the stack (`down -v`) or rotate the token, update the MCP
server with the new value:

**Local scope** (current project only):

```bash
claude mcp remove grafana
claude mcp add-json "grafana" \
'{"command":"uvx","args":["mcp-grafana"],"env":{"GRAFANA_URL":"http://localhost:3010","GRAFANA_SERVICE_ACCOUNT_TOKEN":"<new-token>"}}'
```

**User scope** (available across all your projects):

```bash
claude mcp remove grafana
claude mcp add-json "grafana" --scope user \
'{"command":"uvx","args":["mcp-grafana"],"env":{"GRAFANA_URL":"http://localhost:3010","GRAFANA_SERVICE_ACCOUNT_TOKEN":"<new-token>"}}'
```

### Notes

- The token is tied to the `grafana-data` volume. Running `down -v` destroys
the service account, so after recreating the stack repeat step 1 and then
follow the reconfigure step above.
- The **Editor** role is sufficient for most dashboard work. Use **Admin** if
you need to manage datasources or users through Claude.
- To restrict Claude to read-only operations, add `"--disable-write"` to the
`args` array in step 3.

---

## Dashboards

Dashboard JSON files live in `devtools/grafana/dashboards/` and are version controlled. On startup, Grafana provisions them automatically.

Sub-folders under `dashboards/` become Grafana folders.

### Dashboard Inventory

| Dashboard | Primary use | Datasources |
|-----------|-------------|-------------|
| Forge Operations Dashboard | Runtime health, webhooks, queues, retry/dead-letter depth | Prometheus, Redis |
| Forge Ticket Execution Dashboard | Single-ticket trace timeline, step cost, tokens, latency, agent calls | ClickHouse |
| Forge Cost Dashboard | Cost by project, ticket type, model, day, and expensive tickets | ClickHouse |
| Forge Agent Performance Dashboard | Agent invocation rate, duration, latency, prompt size, slow calls | Prometheus, ClickHouse |
| Forge Workflow Funnel Dashboard | Workflow starts/completions/failures, step coverage, rework loops | Prometheus, ClickHouse |
| Forge CI and Review Dashboard | CI fix attempts, approvals, revisions, PR/CI trace signals | Prometheus, ClickHouse |
| Forge Model Usage Dashboard | Calls, cost, tokens, and workflow-step usage by model | ClickHouse |
| Forge Observability Health Dashboard | Prometheus target health and trace metadata coverage | Prometheus, ClickHouse |
| Forge Business Dashboard | Business-level cost and throughput summaries | ClickHouse |
| Forge Engineering Dashboard | Mixed engineering views across runtime, alerts, cost, and latency | ClickHouse, Prometheus, Redis |
| Forge Issue Detail | Detailed per-issue trace and cost drill-down | ClickHouse, Prometheus |
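
To keep this table in step with the files, a short script can list each dashboard's title and the datasource types its panels reference. This is a sketch under an assumption: it expects the dashboard JSON to store datasources as objects with a `type` field (the newer Grafana export shape) and ignores legacy string references. The function names are hypothetical.

```python
import json
from pathlib import Path


def datasource_types(dashboard: dict) -> set[str]:
    """Collect datasource plugin types used by panels and their targets."""
    found = set()

    def visit(panels):
        for panel in panels:
            for holder in [panel, *panel.get("targets", [])]:
                datasource = holder.get("datasource")
                if isinstance(datasource, dict) and datasource.get("type"):
                    found.add(datasource["type"])
            # Row panels may nest further panels.
            visit(panel.get("panels", []))

    visit(dashboard.get("panels", []))
    return found


def inventory(directory="devtools/grafana/dashboards"):
    """Return (title, sorted datasource types) for every dashboard JSON file."""
    rows = []
    for path in sorted(Path(directory).rglob("*.json")):
        dashboard = json.loads(path.read_text(encoding="utf-8"))
        title = dashboard.get("title", path.stem)
        rows.append((title, sorted(datasource_types(dashboard))))
    return rows
```

The types reported are plugin IDs as stored in the JSON, so they may read differently from the friendly names used in the table.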

### Workflow

Dashboards edited in the Grafana UI need to be synced back to the repo. Grafana holds live edits in its internal database, but the local JSON files are the source of truth for version control. On a fresh stack (`down -v && up`), Grafana re-provisions the dashboards from the files, so any edits that were not synced back are lost.

**Iterate on a dashboard:**

1. Edit the dashboard in the Grafana UI - manually or through the MCP server.
2. When happy with the changes, ask Claude to sync it back - e.g. _"save the dashboard back to the source file"_.
3. Claude uses the MCP server to fetch the current dashboard JSON and overwrites the local file.
4. Commit the updated file.
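
The sync step can also be done without Claude. The sketch below is an alternative to the MCP flow, not what the MCP server does internally: it fetches a dashboard through Grafana's HTTP API (`GET /api/dashboards/uid/<uid>`) and writes the `dashboard` object to a file. The function names are hypothetical, and resetting `id` to `null` is an assumption made so the file does not carry an instance-specific database ID.

```python
import json
import urllib.request
from pathlib import Path


def provisioning_json(api_payload: dict) -> str:
    """Extract the dashboard body from the API response and serialize it."""
    dashboard = dict(api_payload["dashboard"])
    # Assumption: drop the numeric database id, which differs per instance.
    dashboard["id"] = None
    return json.dumps(dashboard, indent=2) + "\n"


def sync_dashboard(uid, target, token, base_url="http://localhost:3010"):
    """Fetch one dashboard by UID and overwrite the local JSON file."""
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/api/dashboards/uid/{uid}",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    Path(target).write_text(provisioning_json(payload), encoding="utf-8")
```

Here `token` is the service account token from step 1 and `target` is a path under `devtools/grafana/dashboards/`. Review the resulting diff before committing, since key order and formatting may differ from the existing file.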

**Create a new dashboard:**

1. Ask Claude to create it - e.g. _"create a dashboard called Trace Volume with a time series panel"_.
2. Claude creates it via the MCP server (appears in the UI immediately).
3. Ask Claude to save it to `devtools/grafana/dashboards/<name>.json`.
4. Commit the file - it will be provisioned automatically on next stack start.