XfLabs/CertSheriff

CertSheriff

A read-only cert-manager troubleshooting MCP server that diagnoses why your Kubernetes TLS certificates are not Ready.

What is CertSheriff?

When a cert-manager Certificate fails, you typically end up running kubectl describe on four different resources to find the cause. CertSheriff does that for you: it walks the cert-manager resource chain, collects events, and returns a structured JSON diagnosis that any AI agent can explain in plain English.

You:          "Why isn't my-cert ready?"
CertSheriff:  diagnose_certificate("default", "my-cert")
              → { "summary": { "most_likely_stage": "Issuer",
                                "primary_reason": "IssuerNotFound", ... } }
Agent:        "The Certificate references issuer 'missing-issuer' which doesn't exist.
               Create the Issuer or fix the issuerRef."

Architecture

┌──────────┐     ┌──────────────┐     ┌─────────────────────┐     ┌────────────────┐
│  You     │────▶│  kagent      │────▶│  CertSheriff MCP    │────▶│  Kubernetes    │
│  (chat)  │◀────│  Agent       │◀────│  Server             │◀────│  API Server    │
└──────────┘     └──────────────┘     └─────────────────────┘     └────────────────┘
                                              │
                                              │  reads (never writes)
                                              ▼
                                      cert-manager CRDs
                                      ├── Certificate
                                      ├── CertificateRequest
                                      ├── Order  (ACME)
                                      ├── Challenge (ACME)
                                      ├── Issuer
                                      └── ClusterIssuer
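
The chain in the diagram can be followed through Kubernetes ownerReferences: cert-manager sets each CertificateRequest's owner to its Certificate, each Order's owner to its CertificateRequest, and each Challenge's owner to its Order. The sketch below walks that chain over plain dictionaries. The names `walk_chain`, `owned_by`, and `make`, and the sample objects, are hypothetical; this is an illustration under those assumptions, not necessarily how CertSheriff links resources internally.

```python
from typing import Any, Optional

Resource = dict[str, Any]

# Order in which cert-manager resources own one another.
CHAIN = ["Certificate", "CertificateRequest", "Order", "Challenge"]


def owned_by(child: Resource, parent: Resource) -> bool:
    """True if child lists parent's UID in metadata.ownerReferences."""
    refs = child["metadata"].get("ownerReferences", [])
    return any(ref.get("uid") == parent["metadata"]["uid"] for ref in refs)


def walk_chain(cert: Resource, objects: list[Resource]) -> list[Resource]:
    """Follow Certificate -> CertificateRequest -> Order -> Challenge.

    Returns the resources found, stopping at the first missing link.
    If several children match, the first one is taken (a simplification).
    """
    found = [cert]
    parent = cert
    for kind in CHAIN[1:]:
        child = next(
            (o for o in objects if o["kind"] == kind and owned_by(o, parent)),
            None,
        )
        if child is None:
            break
        found.append(child)
        parent = child
    return found


def make(kind: str, uid: str, owner_uid: Optional[str] = None) -> Resource:
    """Build a minimal resource dict for the demonstration below."""
    meta: dict[str, Any] = {"name": f"{kind.lower()}-{uid}", "uid": uid}
    if owner_uid:
        meta["ownerReferences"] = [{"uid": owner_uid}]
    return {"kind": kind, "metadata": meta}


cert = make("Certificate", "u1")
cluster = [
    make("CertificateRequest", "u2", "u1"),
    make("Order", "u3", "u2"),
    make("Order", "u9", "other"),  # belongs to a different request
]
chain_kinds = [r["kind"] for r in walk_chain(cert, cluster)]
print(chain_kinds)
```

In this sample no Challenge exists yet, so the walk stops at the Order. The last resource reached is a reasonable first guess for where issuance is stuck.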

Tools

| Tool | Input | What it does |
|------|-------|--------------|
| diagnose_certificate | namespace, certificate_name | Fetches the full resource chain (Certificate → CertificateRequest → Order → Challenge), the referenced Issuer/ClusterIssuer, and events. Returns a JSON diagnosis with a computed summary block. |
| list_unready_certificates | namespace | Lists all Certificates where the Ready condition is missing or not True, with reason, message, and issuerRef. |
| list_expiring_certificates | namespace, within_days | Lists Certificates expiring within N days, sorted by urgency (soonest first). Returns notAfter, days remaining, Ready status, dnsNames, and secretName. |
| inspect_certificate_secret | namespace, secret_name | Decodes the X.509 certificate chain from a TLS Secret. Returns subject, SANs, issuer, expiry, key algorithm, and chain length. Never exposes tls.key. |
| check_issuer_health | namespace (optional) | Checks the Ready status of all Issuers and ClusterIssuers. Returns issuer type, Ready condition, and errors. Unhealthy issuers are listed first. |
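
The selection rules behind list_unready_certificates and list_expiring_certificates can be sketched over plain Certificate dictionaries. This is a minimal illustration, assuming the standard `status.conditions` and `status.notAfter` fields that cert-manager writes. The function names `unready` and `expiring` and the sample data are hypothetical, and the real tools return more fields than shown here.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Optional

Certificate = dict[str, Any]


def ready_condition(cert: Certificate) -> Optional[dict[str, Any]]:
    """Return the Ready condition from status.conditions, if present."""
    for cond in cert.get("status", {}).get("conditions", []):
        if cond.get("type") == "Ready":
            return cond
    return None


def unready(certs: list[Certificate]) -> list[str]:
    """Names of Certificates whose Ready condition is missing or not True."""
    names = []
    for cert in certs:
        cond = ready_condition(cert)
        if cond is None or cond.get("status") != "True":
            names.append(cert["metadata"]["name"])
    return names


def expiring(
    certs: list[Certificate], within_days: int, now: datetime
) -> list[tuple[str, int]]:
    """(name, days_remaining) pairs expiring within N days, soonest first."""
    cutoff = now + timedelta(days=within_days)
    hits = []
    for cert in certs:
        not_after = cert.get("status", {}).get("notAfter")
        if not not_after:
            continue  # never issued, so there is no expiry to report
        # Timestamps look like 2025-01-31T00:00:00Z (UTC).
        expiry = datetime.strptime(not_after, "%Y-%m-%dT%H:%M:%SZ").replace(
            tzinfo=timezone.utc
        )
        if expiry <= cutoff:
            hits.append((cert["metadata"]["name"], (expiry - now).days))
    return sorted(hits, key=lambda item: item[1])


def sample(name: str, ready: Optional[str], not_after: Optional[str]) -> Certificate:
    """Build a minimal Certificate dict for the demonstration below."""
    status: dict[str, Any] = {}
    if ready is not None:
        status["conditions"] = [{"type": "Ready", "status": ready}]
    if not_after is not None:
        status["notAfter"] = not_after
    return {"metadata": {"name": name}, "status": status}


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
certs = [
    sample("a", "True", "2025-01-11T00:00:00Z"),
    sample("b", "False", "2025-01-04T00:00:00Z"),
    sample("c", None, None),
    sample("d", "True", "2025-06-01T00:00:00Z"),
]
unready_names = unready(certs)
expiring_soon = expiring(certs, within_days=30, now=now)
print(unready_names, expiring_soon)
```

Passing `now` explicitly keeps the expiry calculation deterministic, which makes this kind of logic easy to unit test.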

Quick Start

Prerequisites

  • Python 3.11+
  • A Kubernetes cluster with cert-manager installed (install guide)
  • uv package manager (curl -LsSf https://astral.sh/uv/install.sh | sh)

Local dev (stdio — MCP Inspector)

cd certsheriff-mcp
uv sync
uv run python -m certsheriff_mcp.server

Then open MCP Inspector and connect to the stdio server.

Local dev (HTTP)

cd certsheriff-mcp
MCP_TRANSPORT_MODE=http uv run python -m certsheriff_mcp.server

The server listens on http://0.0.0.0:8000/mcp.
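
Over HTTP, an MCP client sends tool calls to that endpoint as JSON-RPC 2.0 messages with the method `tools/call`. The sketch below only builds the request body for one call. The helper name `build_tool_call` is hypothetical, and a real client also performs the MCP initialize handshake and manages session headers, which are omitted here.

```python
import json


def build_tool_call(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    return json.dumps(
        {
            "jsonrpc": "2.0",
            "id": request_id,
            "method": "tools/call",
            "params": {"name": tool, "arguments": arguments},
        }
    )


payload = build_tool_call(
    1,
    "diagnose_certificate",
    {"namespace": "default", "certificate_name": "demo-tls"},
)
print(payload)
```

In practice an MCP client library or MCP Inspector builds and sends these messages for you, so this is mainly useful for understanding what crosses the wire.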

Run tests

cd certsheriff-mcp
uv sync --dev
uv run pytest -v

3-Step Demo

Step 1: Create a failing Certificate

kubectl apply -f examples/02-demo-failing-cert.yaml

This creates a Certificate demo-tls that references a non-existent Issuer missing-issuer.

Step 2: Diagnose it

Call the diagnose_certificate tool (via MCP Inspector, kagent, or any MCP client):

diagnose_certificate(namespace="default", certificate_name="demo-tls")

You'll see:

{
  "summary": {
    "ready": false,
    "primary_reason": "IssuerNotFound",
    "primary_message": "Referenced \"Issuer\" not found: issuer.cert-manager.io \"missing-issuer\" not found",
    "most_likely_stage": "Issuer",
    "next_commands": [
      "kubectl describe certificate demo-tls -n default",
      "kubectl describe issuer missing-issuer -n default"
    ]
  }
}
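
If you are scripting against the tool rather than chatting with an agent, the summary block is easy to turn into a short report. The sketch below parses the output shown above. The function name `explain` and the report layout are hypothetical, and a full diagnosis contains more than the summary.

```python
import json

# The summary from Step 2, exactly as shown above.
raw = r'''
{
  "summary": {
    "ready": false,
    "primary_reason": "IssuerNotFound",
    "primary_message": "Referenced \"Issuer\" not found: issuer.cert-manager.io \"missing-issuer\" not found",
    "most_likely_stage": "Issuer",
    "next_commands": [
      "kubectl describe certificate demo-tls -n default",
      "kubectl describe issuer missing-issuer -n default"
    ]
  }
}
'''


def explain(diagnosis: dict) -> list[str]:
    """Turn the summary block into a few human-readable lines."""
    s = diagnosis["summary"]
    state = "Ready" if s["ready"] else "NOT Ready"
    lines = [
        f"Certificate is {state} "
        f"(stage: {s['most_likely_stage']}, reason: {s['primary_reason']})",
        s["primary_message"],
    ]
    # Suggested follow-up commands, one per line.
    lines.extend(f"  $ {cmd}" for cmd in s.get("next_commands", []))
    return lines


report = explain(json.loads(raw))
print("\n".join(report))
```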

Step 3: Apply the fix, then diagnose again

kubectl apply -f examples/03-demo-fix-issuer.yaml

Wait ~30 seconds for cert-manager to reconcile, then:

diagnose_certificate(namespace="default", certificate_name="demo-tls")

Now summary.ready should be true.
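
Instead of waiting a fixed 30 seconds, a script can poll until summary.ready turns true. The sketch below assumes a `diagnose` callable that returns the parsed diagnosis. The function `wait_until_ready` and the stand-in `fake_diagnose` are hypothetical, and wiring `diagnose` to a real MCP client is left out.

```python
import time
from typing import Callable


def wait_until_ready(
    diagnose: Callable[[], dict],
    timeout_s: float = 60.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call diagnose() until summary.ready is true or the timeout elapses."""
    waited = 0.0
    while True:
        if diagnose()["summary"]["ready"]:
            return True
        if waited >= timeout_s:
            return False
        sleep(interval_s)
        waited += interval_s


# Stand-in for the real tool call: not ready twice, then ready.
responses = iter([False, False, True])
calls = []


def fake_diagnose() -> dict:
    ready = next(responses)
    calls.append(ready)
    return {"summary": {"ready": ready}}


# A no-op sleep keeps the demonstration instant.
became_ready = wait_until_ready(fake_diagnose, sleep=lambda _seconds: None)
print(became_ready, len(calls))
```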

See examples/example-prompts.md for more ready-to-use prompts for each tool.

Kubernetes Deployment

Prerequisites

  • Docker
  • A Kubernetes cluster (kind, minikube, or remote)
  • kubectl configured
  • cert-manager installed (install guide)
  • kagent installed (install guide)
  • kmcp controller installed (kmcp install)

Step 1: Build the Docker image

cd certsheriff-mcp
docker build -t certsheriff-mcp:latest .

Step 2: Load image into your cluster

For kind:

kind load docker-image certsheriff-mcp:latest --name <your-kind-cluster-name>

For minikube:

minikube image load certsheriff-mcp:latest

Step 3: Deploy to Kubernetes

kubectl apply -f deploy/

This creates:

  • ClusterRole + ClusterRoleBinding — read-only access to cert-manager resources and events
  • MCPServer CR in the kagent namespace — the kmcp controller automatically creates the Deployment and Service
  • Agent CR in the kagent namespace — the CertSheriff agent with cert-manager troubleshooting system prompt
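
To confirm that a ClusterRole really is read-only, you can check that its rules use only the get, list, and watch verbs. The sketch below does that for a role loaded as a dictionary. The role shown is illustrative only (the actual rules live in deploy/certsheriff-rbac.yaml and are not reproduced here), and the name `certsheriff-readonly` and the function `write_verbs` are hypothetical.

```python
READ_ONLY_VERBS = {"get", "list", "watch"}


def write_verbs(cluster_role: dict) -> set[str]:
    """Return any verbs in the role's rules that are not read-only.

    A wildcard verb ("*") is reported too, since it grants write access.
    """
    found: set[str] = set()
    for rule in cluster_role.get("rules", []):
        found.update(set(rule.get("verbs", [])) - READ_ONLY_VERBS)
    return found


# Illustrative role; resource lists are an assumption, not the shipped manifest.
example_role = {
    "kind": "ClusterRole",
    "metadata": {"name": "certsheriff-readonly"},  # hypothetical name
    "rules": [
        {
            "apiGroups": ["cert-manager.io"],
            "resources": [
                "certificates",
                "certificaterequests",
                "issuers",
                "clusterissuers",
            ],
            "verbs": ["get", "list", "watch"],
        },
        {
            "apiGroups": ["acme.cert-manager.io"],
            "resources": ["orders", "challenges"],
            "verbs": ["get", "list", "watch"],
        },
        {"apiGroups": [""], "resources": ["events"], "verbs": ["get", "list", "watch"]},
    ],
}

violations = write_verbs(example_role)
print(violations or "read-only")
```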

Verify the deployment:

kubectl get mcpserver -n kagent
kubectl get pods -n kagent | grep certsheriff
kubectl get agents -n kagent | grep certsheriff

Project Structure

certsheriff/
├── certsheriff-mcp/                  # MCP server code
│   ├── src/certsheriff_mcp/
│   │   ├── server.py                # FastMCP entry point + five tools
│   │   ├── k8s.py                   # Kubernetes client helpers
│   │   ├── certmanager.py           # cert-manager resource fetching + diagnosis
│   │   └── events.py                # Event fetch + regex-based linking
│   ├── tests/
│   │   └── test_events_parse.py     # Unit tests for regex extraction
│   ├── pyproject.toml               # Dependencies (fastmcp, kubernetes)
│   ├── Dockerfile                   # Multi-stage build for K8s
│   └── kmcp.yaml                    # Tool configuration
├── deploy/                           # Kubernetes manifests
│   ├── certsheriff-rbac.yaml        # ClusterRole + ClusterRoleBinding (read-only)
│   ├── certsheriff-mcp-server.yaml  # MCPServer CR (kmcp controller manages the Deployment)
│   └── certsheriff-agent.yaml       # Agent CR
├── examples/                         # Demo manifests + guides
│   ├── 01-install-cert-manager.md
│   ├── 02-demo-failing-cert.yaml
│   ├── 03-demo-fix-issuer.yaml
│   ├── example-prompts.md
│   └── agentgateway-config.yaml
├── steps/                            # Build tutorial (step-by-step)
└── README.md                         # This file
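
As an illustration of the regex-based linking that events.py is described as doing, the sketch below pulls the resource type and name out of a "not found" message like the one in the demo. The pattern `NOT_FOUND` and the function `parse_not_found` are hypothetical; the actual expressions used in events.py are not shown in this README.

```python
import re
from typing import Optional

# Matches fragments such as: issuer.cert-manager.io "missing-issuer" not found
NOT_FOUND = re.compile(
    r'(?P<resource>[a-z]+)\.(?P<group>[a-z.-]+) "(?P<name>[^"]+)" not found'
)


def parse_not_found(message: str) -> Optional[dict]:
    """Extract resource, API group, and object name, or None if no match."""
    match = NOT_FOUND.search(message)
    return match.groupdict() if match else None


message = (
    'Referenced "Issuer" not found: '
    'issuer.cert-manager.io "missing-issuer" not found'
)
parsed = parse_not_found(message)
print(parsed)
```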

Safety Guarantees

  • READ-ONLY: CertSheriff never creates, updates, or deletes any Kubernetes resource.
  • Timeouts: All API calls have a 10-second timeout.
  • Event capping: Events are limited to the most recent 30 per object.
  • Graceful errors: Missing CRDs, RBAC failures, and 404s return clear JSON error messages.
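
Two of these guarantees are simple to sketch in isolation: capping events to the most recent 30, and turning an API failure into a JSON error instead of an exception. The code below assumes events carry a `lastTimestamp` string in a uniform UTC format, so they sort lexicographically. The function names and the error payload shape are hypothetical, not CertSheriff's actual output.

```python
import json

MAX_EVENTS = 30  # cap stated in the list above


def cap_events(events: list[dict], limit: int = MAX_EVENTS) -> list[dict]:
    """Keep the `limit` most recent events, newest first."""
    ordered = sorted(events, key=lambda e: e["lastTimestamp"], reverse=True)
    return ordered[:limit]


def error_payload(status: int, detail: str) -> str:
    """Describe an API failure as JSON rather than raising to the caller."""
    hints = {
        403: "RBAC denied the request; check the ClusterRole binding.",
        404: "Resource or CRD not found; is cert-manager installed?",
    }
    hint = hints.get(status, "Unexpected Kubernetes API error.")
    return json.dumps({"error": {"status": status, "hint": hint, "detail": detail}})


# Forty synthetic events, one per minute.
events = [{"lastTimestamp": f"2025-01-01T00:{i:02d}:00Z"} for i in range(40)]
capped = cap_events(events)
print(len(capped), capped[0]["lastTimestamp"], capped[-1]["lastTimestamp"])
print(error_payload(404, 'certificates.cert-manager.io "demo-tls" not found'))
```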

About

A Conversational cert-manager Troubleshooting Tool with AI Agents and MCP
