From 730e2fda244a71d53f353d058c8e8127977e749c Mon Sep 17 00:00:00 2001
From: Luca Consalvi
Date: Thu, 25 Jun 2026 13:57:55 +0200
Subject: [PATCH 1/4] feat(lvms): add run-integration-tests skill for autonomous LVMS testing on TNF

Adds /lvms:run-integration-tests, a re-entrant skill that automates the
full LVMS QE integration test pipeline on TNF clusters. Designed for
RC/EC builds where the redhat-operators OLM catalog does not include the
lvms-operator package.

The skill uses a 3-phase state machine, detecting cluster state on each
invocation:
- Phase 1 (fresh): deploy from source, build binary, start nohup test run
- Phase 2 (running): show progress, print monitoring command, exit
- Phase 3 (done): parse OTE JSON results, generate Markdown report, post to JIRA

Verified end-to-end on OCP 4.22.0 TNF cluster: 38/40 MNO tests passed
across two runs. The 2 failures are cleanup race conditions in new
upstream tests, not LVMS bugs.

Co-Authored-By: Claude Sonnet 4.6 (1M context)
---
 plugins/lvms/.claude-plugin/plugin.json       |   2 +-
 plugins/lvms/README.md                        |   8 +
 .../skills/run-integration-tests/SKILL.md     | 360 ++++++++++++++++++
 3 files changed, 369 insertions(+), 1 deletion(-)
 create mode 100644 plugins/lvms/skills/run-integration-tests/SKILL.md

diff --git a/plugins/lvms/.claude-plugin/plugin.json b/plugins/lvms/.claude-plugin/plugin.json
index 8350db56..4b7717a8 100644
--- a/plugins/lvms/.claude-plugin/plugin.json
+++ b/plugins/lvms/.claude-plugin/plugin.json
@@ -1,7 +1,7 @@
 {
   "name": "lvms",
   "description": "LVMS (Logical Volume Manager Storage) release, QE, operational workflows, and troubleshooting",
-  "version": "1.1.0",
+  "version": "1.2.0",
   "author": { "name": "sakbas" },
   "homepage": "https://github.com/openshift-eng/edge-tooling",
   "license": "Apache-2.0"
diff --git a/plugins/lvms/README.md b/plugins/lvms/README.md
index b59151b8..a667d4e2 100644
--- a/plugins/lvms/README.md
+++ b/plugins/lvms/README.md
@@ -17,6 +17,7 @@ LVMS (Logical Volume Manager Storage) release, QE, and operational workflows.
 | `/lvms:check-release-readiness` | Verify branches, dependencies, and configuration for an LVMS release |
 | `/lvms:z-stream-report` | Generate z-stream release urgency report for all supported versions |
 | `/lvms:setup-prereq` | Set up prerequisites to test unreleased LVMS operator builds |
+| `/lvms:run-integration-tests` | Run QE integration tests from lvm-operator repo — deploy from source, run, collect results, post to JIRA |
 
 ## Usage
 
@@ -53,6 +54,13 @@ LVMS (Logical Volume Manager Storage) release, QE, and operational workflows.
 /lvms:setup-prereq disconnected
 ```
 
+### Run integration tests (RC/EC builds)
+
+```text
+/lvms:run-integration-tests
+/lvms:run-integration-tests OCPEDGE-1995
+```
+
 ## Requirements
 
 - `oc` CLI (authenticated with cluster-admin)
diff --git a/plugins/lvms/skills/run-integration-tests/SKILL.md b/plugins/lvms/skills/run-integration-tests/SKILL.md
new file mode 100644
index 00000000..536075f9
--- /dev/null
+++ b/plugins/lvms/skills/run-integration-tests/SKILL.md
@@ -0,0 +1,360 @@
+---
+name: lvms:run-integration-tests
+argument-hint: "[JIRA-ID]"
+description: Run LVMS QE integration tests on a TNF cluster via SSH — deploys operator from source, runs tests, parses results, posts to JIRA. Use for RC/EC builds where OLM catalog is unavailable.
+user-invocable: true
+allowed-tools: Bash, Read, AskUserQuestion
+---
+
+# lvms:run-integration-tests
+
+## Synopsis
+
+```bash
+/lvms:run-integration-tests
+/lvms:run-integration-tests OCPEDGE-1995
+```
+
+## Description
+
+Automates the full LVMS QE integration test pipeline on a TNF cluster. Designed for RC/EC builds
+where the `redhat-operators` catalog does not include the `lvms-operator` package. For released
+builds, use `/lvms:setup-prereq` instead.
+
+The skill is **re-entrant**: invoke it once to start the run, then invoke it again when tests
+complete to collect results. State is detected automatically from the hypervisor.
+
+## Prerequisites
+
+| Requirement | Details |
+|-------------|---------|
+| TNF cluster | Fresh cluster with extra disks (`VM_EXTRADISKS_LIST="vda vdb vdc"`) |
+| SSH access | Hypervisor reachable by SSH with `oc` CLI and cluster-admin kubeconfig |
+| Go 1.24+ | Installed on the hypervisor (for building the test binary) |
+
+## Implementation
+
+### Step 1: Gather Inputs
+
+Parse `$ARGUMENTS` for an optional JIRA ticket ID (e.g. `OCPEDGE-1995`).
+
+Ask for the hypervisor SSH host:
+
+```
+Hypervisor SSH host? (default: ec2-user@52.29.221.136)
+```
+
+Set:
+```
+SSH_HOST = user input or default
+KUBECONFIG = /home/ec2-user/openshift-metal3/dev-scripts/ocp/ostest/auth/kubeconfig
+```
+
+### Step 2: Detect State
+
+Check process and log state on the hypervisor to decide which phase to enter:
+
+```bash
+ssh "$SSH_HOST" '
+  ps aux | grep "[i]ntegration-test run-suite" | grep -q . && echo RUNNING
+  ls ~/lvms-mno.log ~/lvms-sno.log 2>/dev/null | head -1
+' 2>/dev/null
+```
+
+| Result | Phase |
+|--------|-------|
+| Output contains `RUNNING` | → Phase 2: Tests In Progress |
+| Output contains a log path (no RUNNING) | → Phase 3: Results Ready |
+| No output | → Phase 1: Fresh Run |
+
+---
+
+## Phase 1: Fresh Run
+
+### Step 1a: Ask for suite
+
+```
+Which test suite?
+- mno: Multi-Node OpenShift (36 tests) — use for TNF clusters
+- sno: Single-Node OpenShift (30 tests) — use for SNO clusters
+- both: Run MNO then SNO sequentially (~3-4 hours)
+```
+
+Default to `mno` for TNF topology.
+
+### Step 1b: Check existing LVMS deployment
+
+```bash
+ssh "$SSH_HOST" "export KUBECONFIG=$KUBECONFIG; \
+  oc get deployment/lvms-operator -n openshift-lvm-storage 2>/dev/null"
+```
+
+- **Found**: Ask: "LVMS already deployed. Redeploy from source or skip to test run?"
+  - Redeploy: `ssh "$SSH_HOST" "cd ~/lvm-operator && make undeploy"` then continue
+  - Skip: jump to Step 1d
+- **Not found**: Continue
+
+### Step 1c: Deploy LVMS from source
+
+Patch image registry to Managed (some tests need it):
+
+```bash
+ssh "$SSH_HOST" "export KUBECONFIG=$KUBECONFIG; \
+  oc patch configs.imageregistry.operator.openshift.io cluster \
+  --type merge --patch '{\"spec\":{\"managementState\":\"Managed\",\"storage\":{\"emptyDir\":{}}}}'"
+```
+
+Clone or update lvm-operator:
+
+```bash
+ssh "$SSH_HOST" '
+  if [ -d ~/lvm-operator ]; then
+    cd ~/lvm-operator && git fetch origin && git checkout main && git pull origin main
+  else
+    git clone https://github.com/openshift/lvm-operator.git ~/lvm-operator
+  fi
+  echo "Commit: $(cd ~/lvm-operator && git rev-parse --short HEAD)"
+'
+```
+
+Deploy (creates namespace, CRDs, RBAC, operator via kustomize — no image build required):
+
+```bash
+ssh "$SSH_HOST" "cd ~/lvm-operator && make deploy"
+```
+
+Wait for operator:
+
+```bash
+ssh "$SSH_HOST" "export KUBECONFIG=$KUBECONFIG; \
+  oc -n openshift-lvm-storage wait deployment/lvms-operator \
+  --for=condition=Available --timeout=120s"
+```
+
+Apply LVMCluster CR (required for CSI driver registration):
+
+```bash
+ssh "$SSH_HOST" "export KUBECONFIG=$KUBECONFIG; \
+  oc apply -n openshift-lvm-storage \
+  -f ~/lvm-operator/config/samples/lvm_v1alpha1_lvmcluster.yaml"
+```
+
+Wait for Ready (poll every 10s, timeout 3m):
+
+```bash
+ssh "$SSH_HOST" "export KUBECONFIG=$KUBECONFIG; \
+  for i in \$(seq 1 18); do
+    STATE=\$(oc -n openshift-lvm-storage get lvmcluster my-lvmcluster \
+      -o jsonpath='{.status.state}' 2>/dev/null)
+    echo \"[\$i] \$STATE\"
+    [ \"\$STATE\" = \"Ready\" ] && break
+    sleep 10
+  done"
+```
+
+Verify CSI driver — if `topolvm.io` is not listed, do not proceed (tests will silently skip everything):
+
+```bash
+ssh "$SSH_HOST" "export KUBECONFIG=$KUBECONFIG; oc get csidrivers | grep topolvm"
+```
+
+### Step 1d: Build test binary and start tests
+
+```bash
+ssh "$SSH_HOST" "cd ~/lvm-operator/test/integration && make integration-build"
+```
+
+Remove stale logs, then start the run with `nohup`:
+
+```bash
+# For mno:
+ssh "$SSH_HOST" '
+  rm -f ~/lvms-mno.log
+  cd ~/lvm-operator/test/integration
+  nohup bash -c "./integration-test run-suite -c 1 \
+    openshift/lvm-operator/test/integration/qe_tests/mno > ~/lvms-mno.log 2>&1" &
+  echo "PID: $!"
+'
+
+# For sno:
+ssh "$SSH_HOST" '
+  rm -f ~/lvms-sno.log
+  cd ~/lvm-operator/test/integration
+  nohup bash -c "./integration-test run-suite -c 1 \
+    openshift/lvm-operator/test/integration/qe_tests/sno > ~/lvms-sno.log 2>&1" &
+  echo "PID: $!"
+'
+
+# For both (sequential):
+ssh "$SSH_HOST" '
+  rm -f ~/lvms-mno.log ~/lvms-sno.log
+  cd ~/lvm-operator/test/integration
+  nohup bash -c "
+    ./integration-test run-suite -c 1 \
+      openshift/lvm-operator/test/integration/qe_tests/mno > ~/lvms-mno.log 2>&1
+    ./integration-test run-suite -c 1 \
+      openshift/lvm-operator/test/integration/qe_tests/sno > ~/lvms-sno.log 2>&1
+  " &
+  echo "PID: $!"
+'
+```
+
+### Step 1e: Exit with monitoring instructions
+
+```
+Tests started on (PID above). Suite: — ~1-2 hours per suite.
+
+The OTE framework buffers output per-test, so the log stays empty while tests run.
+
+Monitor:
+  ssh 'ps aux | grep integration-test'
+  ssh 'tail -f ~/lvms-.log'
+
+When complete, re-invoke:
+  /lvms:run-integration-tests
+```
+
+**Stop here.** Do not proceed to Phase 2 or 3 in the same invocation.
+
+---
+
+## Phase 2: Tests In Progress
+
+Count completed tests from the partial log:
+
+```bash
+ssh "$SSH_HOST" "python3 -c \"
+import json, sys
+try:
+    raw = open('/root/lvms-mno.log').read()
+    # partial JSON — count result fields
+    passed = raw.count('\"result\": \"passed\"')
+    failed = raw.count('\"result\": \"failed\"')
+    print(f'passed: {passed}, failed: {failed}')
+except: print('log not yet written')
+\" 2>/dev/null || echo 'log not yet written'"
+```
+
+Report to the user:
+```
+Tests still running on .
+Progress: X passed, Y failed so far.
+
+Monitor: ssh 'tail -f ~/lvms-.log'
+
+Re-invoke /lvms:run-integration-tests when complete.
+```
+
+**Stop here.**
+
+---
+
+## Phase 3: Results Ready
+
+### Step 3a: Determine which logs exist
+
+```bash
+ssh "$SSH_HOST" "ls ~/lvms-mno.log ~/lvms-sno.log 2>/dev/null"
+```
+
+### Step 3b: Parse each log
+
+For each log file, use Python to parse the OTE JSON output (the log is a JSON array with a trailing
+`Error: N tests failed` line):
+
+```bash
+ssh "$SSH_HOST" "python3 -c \"
+import json, re
+raw = open('/root/lvms-.log').read()
+data = json.loads(raw[:raw.rfind(']')+1])
+passed = [t for t in data if t['result'] == 'passed']
+failed = [t for t in data if t['result'] == 'failed']
+print(f'TOTAL:{len(data)} PASSED:{len(passed)} FAILED:{len(failed)}')
+for t in failed:
+    m = re.search(r'-(\d{5,})-', t['name'])
+    tid = f'OCP-{m.group(1)}' if m else 'unknown'
+    print(f'FAIL|{tid}|{t[\"name\"][:80]}|{t.get(\"output\",\"\").strip().splitlines()[-1][:120]}')
+\""
+```
+
+### Step 3c: Get cluster version
+
+```bash
+ssh "$SSH_HOST" "export KUBECONFIG=$KUBECONFIG; oc version --short 2>/dev/null || oc version"
+```
+
+### Step 3d: Generate Markdown report
+
+Known flakes (flag but do not count as failures):
+
+| Test ID | Known Issue |
+|---------|-------------|
+| OCP-86156 | `pvmove` fails with `No data to move` — VG state sensitivity between tests |
+| OCP-71012 | ForceWipe — virtio partition naming on libvirt VMs |
+| OCP-69772 | RAID test picks `/dev/sr0` (virtual CD-ROM) |
+
+Report format:
+
+```markdown
+### LVMS Integration Test Results
+
+**OCP Version:**
+**lvm-operator:** main @
+**Suite:**
+**Date:**
+
+#### Summary
+
+| Suite | Passed | Failed | Total | Pass Rate |
+|-------|--------|--------|-------|-----------|
+| MNO | X | Y | Z | XX.X% |
+
+#### Failures
+
+| Test ID | Test Name | Failure Reason | Known Flake? |
+|---------|-----------|----------------|--------------|
+
+#### Conclusion
+**PASS** / **BLOCKED** —
+```
+
+Use **PASS** if pass rate ≥ 90% and no unexpected failures. Use **BLOCKED** otherwise.
+
+### Step 3e: Post to JIRA
+
+If a JIRA ticket was provided (from `$ARGUMENTS`), ask before posting:
+```
+Post results to ? (yes/no)
+```
+
+If yes, post using the `mcp__mcp-atlassian__jira_add_comment` MCP tool.
+
+### Step 3f: Offer log cleanup
+
+Ask: "Remove log files from the hypervisor? (yes/no)"
+
+If yes:
+```bash
+ssh "$SSH_HOST" "rm -f ~/lvms-mno.log ~/lvms-sno.log"
+```
+
+## Error Handling
+
+| Error | Action |
+|-------|--------|
+| SSH connection refused | Hypervisor down — start it and retry |
+| Go not found | Install Go 1.24+ on the hypervisor |
+| `make deploy` fails | Check `oc get events -n openshift-lvm-storage` |
+| LVMCluster not Ready after 3m | Check `oc describe lvmcluster -n openshift-lvm-storage` |
+| `topolvm.io` CSI driver missing | Check vg-manager logs: `oc logs -n openshift-lvm-storage -l app.kubernetes.io/name=vg-manager` |
+| `make integration-build` fails | Verify Go version — must be 1.24+ |
+| Log empty after process exits | SSH disconnect killed the run — restart with nohup |
+| JIRA post fails | Display report for manual copy |
+
+## Notes
+
+- **`-c 1` is mandatory** — Serial/Disruptive tests modify the LVMCluster CR; higher concurrency causes interference
+- **`nohup` is mandatory** — SSH disconnect kills the run otherwise
+- **Run from the hypervisor only** — test binary resolves cluster-internal DNS
+- **OTE log buffering** — log stays empty until each test completes; monitor via `ps aux`
+- **Free disks required** — check worker nodes with `lsblk`; disks with partitions/filesystems are skipped

From e1814a5d4323d96421c27aa400e0dcf4089ed14f Mon Sep 17 00:00:00 2001
From: Luca Consalvi
Date: Thu, 25 Jun 2026 14:45:23 +0200
Subject: [PATCH 2/4] fix: bump lvms version to 1.2.0 in marketplace.json

Co-Authored-By: Claude Sonnet 4.6 (1M context)
---
 .claude-plugin/marketplace.json | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/.claude-plugin/marketplace.json b/.claude-plugin/marketplace.json
index 96790cdb..1e35db18 100644
--- a/.claude-plugin/marketplace.json
+++ b/.claude-plugin/marketplace.json
@@ -52,7 +52,7 @@
       "name": "lvms",
       "source": "./plugins/lvms",
       "description": "LVMS (Logical Volume Manager Storage) release, QE, operational workflows, and troubleshooting",
-      "version": "1.1.0"
+      "version": "1.2.0"
     },
     {
       "name": "lvms-ci",

From 6352d35b511b9fabc5c28fce26f57931cacc21d7 Mon Sep 17 00:00:00 2001
From: Luca Consalvi
Date: Thu, 25 Jun 2026 14:47:59 +0200
Subject: [PATCH 3/4] fix: address CodeRabbit review findings

- Add mcp__mcp-atlassian__jira_add_comment to allowed-tools
- Replace hardcoded IP with generic placeholder in SSH host prompt
- Fix /root/ path to ~/ in Phase 2 and Phase 3 log reads
- Add explicit STOP condition when topolvm.io CSI driver is missing
- Add Go 1.24+ and TNF cluster requirements to README

Co-Authored-By: Claude Sonnet 4.6 (1M context)
---
 plugins/lvms/README.md                          |  2 ++
 .../lvms/skills/run-integration-tests/SKILL.md  | 17 +++++++++++------
 2 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/plugins/lvms/README.md b/plugins/lvms/README.md
index a667d4e2..a0daf903 100644
--- a/plugins/lvms/README.md
+++ b/plugins/lvms/README.md
@@ -69,6 +69,8 @@ LVMS (Logical Volume Manager Storage) release, QE, and operational workflows.
 - `skopeo` (for z-stream report registry queries)
 - Jira credentials (`JIRA_BASE_URL`, `JIRA_EMAIL`, `JIRA_API_TOKEN`) for z-stream report
 - Python 3 with PyYAML (for must-gather analysis)
+- Go 1.24+ on the hypervisor (for `run-integration-tests`)
+- TNF cluster with extra disks and SSH-accessible hypervisor (for `run-integration-tests`)
 - **Category:** operator
 
 ## Author
diff --git a/plugins/lvms/skills/run-integration-tests/SKILL.md b/plugins/lvms/skills/run-integration-tests/SKILL.md
index 536075f9..1f8cae9a 100644
--- a/plugins/lvms/skills/run-integration-tests/SKILL.md
+++ b/plugins/lvms/skills/run-integration-tests/SKILL.md
@@ -3,7 +3,7 @@ name: lvms:run-integration-tests
 argument-hint: "[JIRA-ID]"
 description: Run LVMS QE integration tests on a TNF cluster via SSH — deploys operator from source, runs tests, parses results, posts to JIRA. Use for RC/EC builds where OLM catalog is unavailable.
 user-invocable: true
-allowed-tools: Bash, Read, AskUserQuestion
+allowed-tools: Bash, Read, AskUserQuestion, mcp__mcp-atlassian__jira_add_comment
 ---
 
 # lvms:run-integration-tests
@@ -41,7 +41,7 @@ Parse `$ARGUMENTS` for an optional JIRA ticket ID (e.g. `OCPEDGE-1995`).
 Ask for the hypervisor SSH host:
 
 ```
-Hypervisor SSH host? (default: ec2-user@52.29.221.136)
+Hypervisor SSH host? (e.g. ec2-user@)
 ```
 
 Set:
@@ -152,12 +152,17 @@ ssh "$SSH_HOST" "export KUBECONFIG=$KUBECONFIG; \
   done"
 ```
 
-Verify CSI driver — if `topolvm.io` is not listed, do not proceed (tests will silently skip everything):
+Verify CSI driver — if `topolvm.io` is not listed, stop immediately:
 
 ```bash
 ssh "$SSH_HOST" "export KUBECONFIG=$KUBECONFIG; oc get csidrivers | grep topolvm"
 ```
 
+**CRITICAL:** If `topolvm.io` is absent, stop and report:
+> CSI driver `topolvm.io` not registered. Check vg-manager logs:
+> `oc logs -n openshift-lvm-storage -l app.kubernetes.io/name=vg-manager`
+> Do not start tests — all cases will silently skip.
+
 ### Step 1d: Build test binary and start tests
 
 ```bash
@@ -224,9 +229,9 @@ Count completed tests from the partial log:
 
 ```bash
 ssh "$SSH_HOST" "python3 -c \"
-import json, sys
+import json, sys, os
 try:
-    raw = open('/root/lvms-mno.log').read()
+    raw = open(os.path.expanduser('~/lvms-mno.log')).read()
     # partial JSON — count result fields
     passed = raw.count('\"result\": \"passed\"')
     failed = raw.count('\"result\": \"failed\"')
@@ -265,7 +270,7 @@ For each log file, use Python to parse the OTE JSON output (the log is a JSON ar
 ```bash
 ssh "$SSH_HOST" "python3 -c \"
 import json, re
-raw = open('/root/lvms-.log').read()
+import os; raw = open(os.path.expanduser('~/lvms-.log')).read()
 data = json.loads(raw[:raw.rfind(']')+1])
 passed = [t for t in data if t['result'] == 'passed']
 failed = [t for t in data if t['result'] == 'failed']

From 7e7d714dc450909b5643127214341da925762b7a Mon Sep 17 00:00:00 2001
From: Luca Consalvi
Date: Sat, 27 Jun 2026 10:43:32 +0200
Subject: [PATCH 4/4] fix: resolve markdownlint errors in SKILL.md

Add language specifiers to bare fenced code blocks (MD040) and blank
lines around fences (MD031).

Co-Authored-By: Claude Sonnet 4.6 (1M context)
---
 .../lvms/skills/run-integration-tests/SKILL.md | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/plugins/lvms/skills/run-integration-tests/SKILL.md b/plugins/lvms/skills/run-integration-tests/SKILL.md
index 1f8cae9a..1fbdc1fd 100644
--- a/plugins/lvms/skills/run-integration-tests/SKILL.md
+++ b/plugins/lvms/skills/run-integration-tests/SKILL.md
@@ -40,12 +40,13 @@ Parse `$ARGUMENTS` for an optional JIRA ticket ID (e.g. `OCPEDGE-1995`).
 
 Ask for the hypervisor SSH host:
 
-```
+```text
 Hypervisor SSH host? (e.g. ec2-user@)
 ```
 
 Set:
-```
+
+```text
 SSH_HOST = user input or default
 KUBECONFIG = /home/ec2-user/openshift-metal3/dev-scripts/ocp/ostest/auth/kubeconfig
 ```
@@ -73,7 +74,7 @@ ssh "$SSH_HOST" '
 
 ### Step 1a: Ask for suite
 
-```
+```text
 Which test suite?
 - mno: Multi-Node OpenShift (36 tests) — use for TNF clusters
 - sno: Single-Node OpenShift (30 tests) — use for SNO clusters
@@ -206,7 +207,7 @@ ssh "$SSH_HOST" '
 
 ### Step 1e: Exit with monitoring instructions
 
-```
+```text
 Tests started on (PID above). Suite: — ~1-2 hours per suite.
 
 The OTE framework buffers output per-test, so the log stays empty while tests run.
@@ -241,7 +242,8 @@ except: print('log not yet written')
 ```
 
 Report to the user:
-```
+
+```text
 Tests still running on .
 Progress: X passed, Y failed so far.
 
@@ -328,7 +330,8 @@ Use **PASS** if pass rate ≥ 90% and no unexpected failures. Use **BLOCKED** ot
 ### Step 3e: Post to JIRA
 
 If a JIRA ticket was provided (from `$ARGUMENTS`), ask before posting:
-```
+
+```text
 Post results to ? (yes/no)
 ```
 
@@ -339,6 +342,7 @@ If yes, post using the `mcp__mcp-atlassian__jira_add_comment` MCP tool.
 Ask: "Remove log files from the hypervisor? (yes/no)"
 
 If yes:
+
 ```bash
 ssh "$SSH_HOST" "rm -f ~/lvms-mno.log ~/lvms-sno.log"
 ```
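
The Step 3b log parsing in SKILL.md can also be sketched as a standalone Python function, which is easier to exercise locally than the inline `python3 -c` string. This is a minimal sketch, not part of the patch series: the helper name `summarize_ote_log` and the sample data are hypothetical, and it assumes the log layout SKILL.md describes (a JSON array of objects with `name`, `result`, and optional `output` keys, followed by a trailing `Error: N tests failed` line).

```python
import json
import re


def summarize_ote_log(raw: str) -> dict:
    """Summarize an OTE result log (hypothetical helper, not part of the patch)."""
    # Assumption from SKILL.md: the log is a JSON array, optionally followed by
    # a trailing "Error: N tests failed" line, so cut at the last closing bracket.
    end = raw.rfind("]")
    if end == -1:
        raise ValueError("no JSON array found in log")
    data = json.loads(raw[: end + 1])

    failures = []
    for test in data:
        if test.get("result") != "failed":
            continue
        name = test.get("name", "")
        # Test IDs appear in names as "-<5+ digits>-", as in the Step 3b regex.
        match = re.search(r"-(\d{5,})-", name)
        lines = test.get("output", "").strip().splitlines()
        failures.append({
            "id": f"OCP-{match.group(1)}" if match else "unknown",
            "name": name[:80],
            # Guard against empty output, where a bare [-1] lookup would raise.
            "reason": lines[-1][:120] if lines else "",
        })

    passed = sum(1 for test in data if test.get("result") == "passed")
    return {"total": len(data), "passed": passed, "failures": failures}


if __name__ == "__main__":
    # Made-up sample data in the assumed log layout.
    sample = json.dumps([
        {"name": "suite-12345-first case", "result": "passed"},
        {"name": "suite-86156-second case", "result": "failed",
         "output": "step one\npvmove failed"},
    ]) + "\nError: 1 tests failed\n"
    print(summarize_ote_log(sample))
```

Unlike the inline one-liner, the sketch tolerates failed tests with empty output; whether OTE ever emits those is not confirmed by the patch.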