8 changes: 8 additions & 0 deletions deploy/Makefile
@@ -52,6 +52,9 @@ full-clean:
clean-spoke:
	@./openshift-clusters/scripts/clean-spoke.sh

clean-mutable-topology:
	@./openshift-clusters/scripts/clean-mutable-topology.sh

ssh:
	@./aws-hypervisor/scripts/ssh.sh

@@ -92,6 +95,9 @@ fencing-assisted:
keep-instance:
	@../helpers/keep-instance.sh '$(DAYS)'

sno-to-3node:
	@./openshift-clusters/scripts/sno-to-3node.sh

patch-nodes:
	@./openshift-clusters/scripts/patch-nodes.sh
get-tnf-logs:
@@ -128,6 +134,7 @@ help:
	@echo " fencing-assisted - Deploy hub + spoke TNF cluster via assisted installer"
	@echo " sno-ipi - Deploy Single Node OpenShift (SNO) IPI cluster (non-interactive)"
	@echo " sno-agent - Deploy Single Node OpenShift (SNO) Agent cluster (non-interactive)"
	@echo " sno-to-3node - Transition existing SNO cluster to 3-node HA (platform:none)"
	@echo ""
	@echo "OpenShift Cluster Management:"
	@echo " redeploy-cluster - Redeploy OpenShift cluster using dev-scripts make redeploy"
@@ -136,6 +143,7 @@ help:
	@echo " clean - Clean OpenShift cluster using dev-scripts clean target"
	@echo " full-clean - Fully clean instance cache and OpenShift cluster using dev-scripts realclean target"
	@echo " clean-spoke - Clean spoke cluster resources (VMs, network, auth) from assisted installer"
	@echo " clean-mutable-topology - Remove master-1/2 VMs, disks, DHCP/DNS entries (sno-to-3node cleanup)"
	@echo " patch-nodes - Build resource-agents RPM and patch cluster nodes (default version: 4.11)"
	@echo ""
	@echo "Cluster Utilities:"
43 changes: 43 additions & 0 deletions deploy/openshift-clusters/clean-mutable-topology.yml
@@ -0,0 +1,43 @@
---
- hosts: metal_machine
  gather_facts: yes

  pre_tasks:
    - name: Confirm mutable topology cleanup
      ansible.builtin.pause:
        prompt: >-
          This will destroy the master-1 and master-2 VMs, their disks, DHCP reservations,
          and DNS entries. master-0 (if still present) will be unaffected.
          Press Enter to proceed or Ctrl+C to abort.
      delegate_to: localhost
      run_once: true
      when: interactive_mode | default(true) | bool

    - name: Detect cluster domain from kubeconfig (best-effort)
      shell: |
        KUBECONFIG={{ dev_scripts_path | default('openshift-metal3/dev-scripts') }}/ocp/{{ sno_cluster_name | default('ostest') }}/auth/kubeconfig \
          oc get infrastructure cluster -o jsonpath='{.status.apiServerInternalURI}' 2>/dev/null \
          | sed 's|^https://api-int\.||; s|:6443$||'
      register: detected_domain
      changed_when: false
      failed_when: false
      ignore_errors: true

    - name: Set cluster domain (use detected, variable override, or default)
      set_fact:
        sno_cluster_domain: >-
          {{ sno_cluster_domain
             if (sno_cluster_domain is defined and sno_cluster_domain)
             else (detected_domain.stdout | trim
                   if (detected_domain.stdout is defined and detected_domain.stdout | trim)
                   else 'ostest.test.metalkube.org') }}

  tasks:
    - name: Run mutable topology cleanup
      import_role:
        name: mutable-topology/sno-to-3node
        tasks_from: clean.yml
Comment on lines +35 to +39


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Refuse cleanup once the cluster is already HA unless the caller forces it.

This imports tasks/clean.yml unconditionally, and that role deletes master-1/master-2 plus their disks and DHCP state. On a successful 3-node cluster, removing two control-plane nodes here can drop etcd from 3 members to 1 and take the cluster down. Please gate this on the current control-plane topology, or require an explicit force_cleanup=true override before running the destructive role.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deploy/openshift-clusters/clean-mutable-topology.yml` around lines 35 - 39,
The cleanup playbook currently imports mutable-topology/sno-to-3node
tasks/clean.yml unconditionally, which can destroy a healthy HA control plane.
Add a topology check in clean-mutable-topology.yml before the import_role so it
only runs when the cluster is not already HA, or require an explicit
force_cleanup=true override to proceed. Use the existing import_role task as the
entry point and gate it with a condition or precheck that prevents destructive
cleanup unless the override is set.
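
One way the requested gate might be expressed is a small shell helper that a precheck task calls before `import_role`. This is only a sketch under assumptions: `count_control_plane_nodes` and `cleanup_allowed` are hypothetical names, the control-plane nodes are assumed to carry the `node-role.kubernetes.io/master` label, and `force_cleanup` is the override proposed in the comment above, not a variable that exists in the PR.

```shell
# Hypothetical helper: count nodes carrying the control-plane role label.
# Assumes `oc` is on PATH and KUBECONFIG points at the cluster.
count_control_plane_nodes() {
  oc get nodes -l node-role.kubernetes.io/master --no-headers 2>/dev/null | wc -l
}

# Hypothetical gate: succeed (exit 0) only when destructive cleanup is acceptable.
#   $1 = number of control-plane nodes currently in the cluster
#   $2 = force_cleanup override ("true" bypasses the topology check)
cleanup_allowed() {
  nodes="$1"
  force="${2:-false}"
  if [ "$force" = "true" ]; then
    return 0
  fi
  # With 3 or more control-plane nodes the cluster is already HA; removing
  # master-1/master-2 would leave a single etcd member.
  [ "$nodes" -lt 3 ]
}
```

A precheck could pass the output of `count_control_plane_nodes` and the override value to `cleanup_allowed` and abort the play on a non-zero exit. Note that an unreachable cluster counts as zero nodes in this sketch, so cleanup would still proceed; whether that is the right default is a judgment call for the PR author.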


    - name: Cleanup complete
      ansible.builtin.debug:
        msg: "Mutable topology cleanup complete. master-1 and master-2 VMs have been removed."
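
For reference, the domain detection in the `pre_tasks` of this playbook reduces to two `sed` substitutions on the `apiServerInternalURI` value. The sketch below isolates that step; `domain_from_api_int` is a hypothetical wrapper name, and the sample URL is built from the playbook's own default domain.

```shell
# Derive the cluster domain from an api-int URL, using the same sed
# expression as the "Detect cluster domain" task.
domain_from_api_int() {
  printf '%s\n' "$1" | sed 's|^https://api-int\.||; s|:6443$||'
}

domain_from_api_int 'https://api-int.ostest.test.metalkube.org:6443'
# prints: ostest.test.metalkube.org
```

When `oc` fails and the input is empty, the result is empty as well, which is what lets the subsequent `set_fact` fall through to the default domain.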
@@ -0,0 +1,38 @@
---
# Cluster identity (auto-detected from cluster if not set)
sno_cluster_name: ostest
sno_cluster_domain: ""
sno_infra_id: ""

# Existing master-0 (auto-detected from cluster)
sno_master0_ip: ""

# New node IPs (static assignments within the dev-scripts DHCP range)
sno_master1_ip: "192.168.111.21"
sno_master2_ip: "192.168.111.22"

# VM specs
sno_vm_vcpus: 6
sno_vm_ram_mb: 16384
sno_vm_disk_gb: 50

# Libvirt network (dev-scripts baremetal network)
sno_libvirt_network: ostestbm
sno_libvirt_bridge: ostestbm

# RHCOS live ISO path on hypervisor (auto-detected from release image if empty)
sno_rhcos_live_iso: ""

# Timeouts
sno_mco_timeout_minutes: 45
sno_node_join_timeout_minutes: 20
sno_etcd_timeout_minutes: 15

# Auto-fix MCO drain deadlock during topology transition
sno_auto_fix_drain: true

# Paths (override if dev-scripts is in a non-standard location)
sno_kubeconfig: ""

# VM image directory
sno_vm_image_dir: "/var/lib/libvirt/images"
@@ -0,0 +1,154 @@
---
- name: "[boot] Check if RHCOS live ISO exists on hypervisor"
  stat:
    path: "/var/lib/libvirt/images/rhcos-live.iso"
  register: iso_stat

- name: "[boot] Find RHCOS live ISO from dev-scripts cache or release payload"
  shell: |
    DEVSCRIPTS_ISO=$(find /var/lib/libvirt/images -name 'rhcos-*-live*.iso' 2>/dev/null | head -1)
    if [ -n "$DEVSCRIPTS_ISO" ]; then
      echo "Using existing ISO: $DEVSCRIPTS_ISO"
      sudo ln -sf "$DEVSCRIPTS_ISO" /var/lib/libvirt/images/rhcos-live.iso
      exit 0
    fi

    CACHE_ISO=$(find {{ sno_dev_scripts_path }}/ -name 'rhcos-*-live*.iso' 2>/dev/null | head -1)
    if [ -n "$CACHE_ISO" ]; then
      echo "Using dev-scripts cached ISO: $CACHE_ISO"
      sudo ln -sf "$CACHE_ISO" /var/lib/libvirt/images/rhcos-live.iso
      exit 0
    fi

    echo "No cached ISO found, pulling from release payload..."
    PULL_SECRET="{{ sno_dev_scripts_path }}/pull_secret.json"
    RELEASE_IMAGE=$(KUBECONFIG="{{ sno_kubeconfig_resolved }}" oc get clusterversion version \
      -o jsonpath='{.status.desired.image}')
    echo "Release image: $RELEASE_IMAGE"
    RHCOS_IMAGE=$(oc adm release info \
      --image-for=machine-os-images \
      -a "$PULL_SECRET" \
      "$RELEASE_IMAGE")
    echo "machine-os-images: $RHCOS_IMAGE"
    oc image extract "$RHCOS_IMAGE" \
      -a "$PULL_SECRET" \
      --path /coreos/coreos-x86_64.iso:/tmp/ \
      --confirm
    sudo mv /tmp/coreos-x86_64.iso /var/lib/libvirt/images/rhcos-live.iso
    echo "ISO pulled from release payload: /var/lib/libvirt/images/rhcos-live.iso"
  when: not iso_stat.stat.exists and sno_rhcos_live_iso == ""

- name: "[boot] Set ISO path"
  set_fact:
    sno_iso_path: "{{ sno_rhcos_live_iso if sno_rhcos_live_iso else '/var/lib/libvirt/images/rhcos-live.iso' }}"

- name: "[boot] Read master.ign content"
  slurp:
    src: /tmp/master.ign
  register: master_ign_content

- name: "[boot] Set base64-encoded master.ign"
  set_fact:
    sno_master_ign_b64: "{{ master_ign_content.content }}"

- name: "[boot] Generate auto-install ignition"
  template:
    src: auto-install.ign.j2
    dest: /tmp/auto-install.ign
    mode: '0644'

- name: "[boot] Ensure coreos-installer is available"
  shell: |
    if ! command -v coreos-installer &>/dev/null; then
      echo "coreos-installer not found, installing via dnf..."
      sudo dnf install -y coreos-installer
    else
      echo "coreos-installer already installed: $(command -v coreos-installer)"
    fi

- name: "[boot] Create per-node ISO with embedded ignition"
  shell: |
    sudo cp {{ sno_iso_path }} /var/lib/libvirt/images/rhcos-{{ item.hostname }}.iso
    sudo coreos-installer iso ignition embed -i /tmp/auto-install.ign /var/lib/libvirt/images/rhcos-{{ item.hostname }}.iso -f
  loop: "{{ sno_new_nodes }}"

- name: "[boot] Boot each VM with ignition-embedded ISO"
  shell: |
    VM_NAME="{{ item.name }}"
    MAC="{{ sno_node_macs[item.hostname] }}"

    sudo virsh destroy "$VM_NAME" 2>/dev/null || true
    # --nvram: these VMs use a pflash loader, and libvirt refuses to undefine
    # a domain that still has an NVRAM file unless this flag is given.
    sudo virsh undefine "$VM_NAME" --nvram 2>/dev/null || true

    sudo virt-install \
      --name "$VM_NAME" \
      --ram {{ sno_vm_ram_mb }} \
      --vcpus {{ sno_vm_vcpus }} \
      --disk {{ sno_vm_image_dir }}/${VM_NAME}.qcow2,bus=virtio \
      --network network={{ sno_libvirt_network }},model=virtio,mac=${MAC} \
      --cdrom /var/lib/libvirt/images/rhcos-{{ item.hostname }}.iso \
      --os-variant rhel9.0 \
      --graphics none \
      --noautoconsole \
      --boot loader=/usr/share/edk2/ovmf/OVMF_CODE.fd,loader_ro=yes,loader_type=pflash,nvram_template=/usr/share/edk2/ovmf/OVMF_VARS.fd,loader_secure=no \
      --boot hd,cdrom \
      --tpm none
  loop: "{{ sno_new_nodes }}"

- name: "[boot] Verify VMs are running (initial install boot)"
  shell: |
    sudo virsh domstate {{ item.name }}
  register: vm_state
  loop: "{{ sno_new_nodes }}"
  changed_when: false
  failed_when: "'running' not in vm_state.stdout"

- name: "[boot] Wait for coreos-installer to complete (VM will power off)"
  # coreos-installer runs ExecStartPost=systemctl reboot, but RHCOS live issues
  # an ACPI poweroff rather than a reset. libvirt fires on_poweroff=destroy so
  # the VM shuts off. We poll until shut off, then boot from disk below.
  shell: |
    for i in $(seq 50); do
      STATE=$(sudo virsh domstate {{ item.name }} 2>/dev/null || echo "unknown")
      if echo "$STATE" | grep -q "shut off"; then
        echo "{{ item.name }} shut off after $((i * 20))s - install complete"
        exit 0
      fi
      sleep 20
    done
    echo "Timeout: {{ item.name }} did not shut off within 1000s"
    exit 1
  loop: "{{ sno_new_nodes }}"
  changed_when: false

- name: "[boot] Remove CDROM from boot order after install"
  # Prevent coreos-installer loop: strip the cdrom boot entry so UEFI only
  # tries the hard disk on subsequent boots.
  shell: |
    TMPXML=$(mktemp /tmp/vm-XXXXXX.xml)
    sudo virsh dumpxml {{ item.name }} > "$TMPXML"
    sudo sed -i "/<boot dev='cdrom'\/>/d" "$TMPXML"
    sudo virsh define "$TMPXML"
    sudo rm -f "$TMPXML"
  loop: "{{ sno_new_nodes }}"
  changed_when: true

- name: "[boot] Start VMs to boot from installed RHCOS"
  shell: |
    sudo virsh start {{ item.name }}
  loop: "{{ sno_new_nodes }}"
  changed_when: true

- name: "[boot] Verify VMs are running from installed disk"
  shell: |
    sudo virsh domstate {{ item.name }}
  register: vm_state_disk
  loop: "{{ sno_new_nodes }}"
  changed_when: false
  failed_when: "'running' not in vm_state_disk.stdout"

- name: "[boot] VMs booted with RHCOS"
  debug:
    msg: >-
      {{ sno_new_nodes | length }} VMs installed and started from disk.
      Waiting for nodes to join the cluster...
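
The boot-order edit in the "Remove CDROM from boot order" task is a single `sed` line deletion, which can be tried in isolation. The following is a sketch only: `strip_cdrom_boot` is a hypothetical wrapper name and the XML is a trimmed, made-up `<os>` fragment, not real `virsh dumpxml` output.

```shell
# Remove the cdrom boot entry from domain XML read on stdin (same sed
# pattern as the task, minus the in-place flag).
strip_cdrom_boot() {
  sed "/<boot dev='cdrom'\/>/d"
}

# Hypothetical, trimmed-down <os> fragment for illustration only.
sample_xml="<os>
  <boot dev='hd'/>
  <boot dev='cdrom'/>
</os>"

printf '%s\n' "$sample_xml" | strip_cdrom_boot
```

The pattern matches only when the element is written exactly as `<boot dev='cdrom'/>` with single-quoted attributes on its own line, so it depends on the XML being formatted the way libvirt emits it.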
124 changes: 124 additions & 0 deletions deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/clean.yml
@@ -0,0 +1,124 @@
---
- name: "[clean] Stop and undefine new VMs"
  shell: |
    sudo virsh destroy {{ item.name }} 2>/dev/null || true
    # --nvram: the VMs are created with a pflash loader; without this flag
    # libvirt refuses to undefine a domain that still has an NVRAM file, and
    # the error would be hidden by the redirect below.
    sudo virsh undefine {{ item.name }} --nvram --remove-all-storage 2>/dev/null || true
  loop: "{{ sno_new_nodes }}"
  failed_when: false
  changed_when: true

- name: "[clean] Remove disk images"
  file:
    path: "{{ sno_vm_image_dir }}/{{ item.name }}.qcow2"
    state: absent
  loop: "{{ sno_new_nodes }}"
  become: true
  failed_when: false

- name: "[clean] Remove NVRAM files"
  file:
    path: "/var/lib/libvirt/qemu/nvram/{{ item.name }}_VARS.fd"
    state: absent
  loop: "{{ sno_new_nodes }}"
  become: true
  failed_when: false

- name: "[clean] Remove per-node RHCOS ISOs"
  file:
    path: "/var/lib/libvirt/images/rhcos-{{ item.hostname }}.iso"
    state: absent
  loop: "{{ sno_new_nodes }}"
  become: true
  failed_when: false

- name: "[clean] Remove DHCP reservations from libvirt network"
  shell: |
    NET_XML=$(sudo virsh net-dumpxml {{ sno_libvirt_network }} 2>/dev/null || true)
    EXISTING_MAC=$(echo "$NET_XML" | python3 -c "
    import xml.etree.ElementTree as ET, sys
    root = ET.parse(sys.stdin).getroot()
    for host in root.findall('.//dhcp/host'):
        if host.get('ip') == '{{ item.ip }}':
            print(host.get('mac',''))
    " 2>/dev/null || true)

    if [ -n "$EXISTING_MAC" ]; then
      sudo virsh net-update {{ sno_libvirt_network }} delete ip-dhcp-host \
        "<host mac='${EXISTING_MAC}' ip='{{ item.ip }}'/>" --live --config 2>/dev/null || true
      echo "Removed DHCP reservation for {{ item.ip }} (mac=${EXISTING_MAC})"
    else
      echo "No DHCP reservation found for {{ item.ip }}"
    fi
  loop: "{{ sno_new_nodes }}"
  failed_when: false
  changed_when: true

- name: "[clean] Remove DNS host entries from libvirt network"
  shell: |
    sudo virsh net-update {{ sno_libvirt_network }} delete dns-host \
      "<host ip='{{ item.ip }}'><hostname>{{ item.hostname }}.{{ sno_cluster_domain }}</hostname></host>" \
      --live --config 2>/dev/null || true
    sudo virsh net-update {{ sno_libvirt_network }} delete dns-host \
      "<host ip='{{ item.ip }}'><hostname>{{ item.hostname }}</hostname></host>" \
      --live --config 2>/dev/null || true
  loop: "{{ sno_new_nodes }}"
  failed_when: false
  changed_when: true

- name: "[clean] Remove api-int from libvirt network DNS"
  shell: |
    sudo virsh net-update {{ sno_libvirt_network }} delete dns-host \
      "<host ip='{{ sno_master0_ip }}'><hostname>api-int.{{ sno_cluster_domain }}</hostname></host>" \
      --live --config 2>/dev/null || true
  failed_when: false
  changed_when: true

- name: "[clean] Remove entries from addnhosts"
  lineinfile:
    path: "{{ sno_addnhosts_path }}"
    regexp: "^{{ item.ip }}\\s"
    state: absent
  loop: "{{ sno_new_nodes }}"
  become: true
  failed_when: false

- name: "[clean] Remove hostname entries from addnhosts"
  lineinfile:
    path: "{{ sno_addnhosts_path }}"
    regexp: "\\s{{ item.hostname }}\\.{{ sno_cluster_domain }}(\\s|$)"
    state: absent
  loop: "{{ sno_new_nodes }}"
  become: true
  failed_when: false

- name: "[clean] Flush DHCP leases for new node IPs"
  shell: |
    python3 -c "
    import json, os
    lf = '{{ sno_lease_file }}'
    if not os.path.exists(lf) or os.path.getsize(lf) == 0:
        exit(0)
    with open(lf) as f:
        leases = json.load(f)
    # Single-quoted literals: to_json would emit double quotes, which would
    # terminate the surrounding shell string and break the Python source.
    reserved = ['{{ sno_master1_ip }}', '{{ sno_master2_ip }}']
    before = len(leases)
    leases = [l for l in leases if l.get('ip-address') not in reserved]
    with open(lf, 'w') as f:
        json.dump(leases, f, indent=2)
    print(f'Flushed {before - len(leases)} leases')
    "
  become: true
  failed_when: false
  changed_when: true

- name: "[clean] Reload libvirt dnsmasq"
  shell: |
    sudo kill -HUP $(cat /run/libvirt/network/{{ sno_libvirt_bridge }}.pid) 2>/dev/null || true
  changed_when: true
  failed_when: false

- name: "[clean] Cleanup complete"
  debug:
    msg: >-
      Removed VMs: {{ sno_new_nodes | map(attribute='name') | list | join(', ') }}.
      Disk images, NVRAM, ISOs, DHCP reservations, and DNS entries removed.
Comment on lines +114 to +124


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Cleanup leaves the NetworkManager dnsmasq override behind.

update-dns.yml creates {{ sno_nm_dnsmasq_conf }} and restarts NetworkManager, but this cleanup only HUPs libvirt dnsmasq. After the VMs and libvirt DNS records are gone, the host can still resolve api, api-int, and *.apps from stale local overrides. Remove that file and reload/restart NetworkManager here as part of the DNS cleanup path.

Suggested cleanup addition
+- name: "[clean] Remove NM dnsmasq override"
+  file:
+    path: "{{ sno_nm_dnsmasq_conf }}"
+    state: absent
+  become: true
+  failed_when: false
+
+- name: "[clean] Restart NetworkManager to drop local DNS overrides"
+  systemd:
+    name: NetworkManager
+    state: restarted
+  become: true
+  failed_when: false
+
 - name: "[clean] Reload libvirt dnsmasq"
   shell: |
     sudo kill -HUP $(cat /run/libvirt/network/{{ sno_libvirt_bridge }}.pid) 2>/dev/null || true
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@deploy/openshift-clusters/roles/mutable-topology/sno-to-3node/tasks/clean.yml`
around lines 114 - 124, The cleanup in the `clean.yml` task only reloads libvirt
dnsmasq, but it leaves the NetworkManager dnsmasq override created by
`update-dns.yml` in place. Update the `[clean]` flow to also remove `{{
sno_nm_dnsmasq_conf }}` and then reload or restart `NetworkManager` so stale
local DNS entries for `api`, `api-int`, and `*.apps` are cleared. Use the
existing cleanup block near `[clean] Reload libvirt dnsmasq` and `[clean]
Cleanup complete` to add this DNS teardown step.
