[GPTEINFRA-16720] Generic multi-tenant loop bridge role by prakhar1985 · Pull Request #149 · agnosticd/core_workloads

prakhar1985 · 2026-06-03T13:04:19Z

Summary

New generic role ocp4_workload_multi_tenant_loop that loops tenant workloads for N users on a shared OpenShift cluster. Designed for combined catalog items that merge cluster and tenant provisioning into a single order.

Stopgap until Babylon gets native cluster-tenant lifecycle support.

How it works

Three-phase provisioning:

Phase 0: Run cluster workloads inside the bridge (for workloads like gitops_bootstrap that also run per-tenant with different vars — avoids Ansible extra_vars precedence conflicts)
Phase 1: Loop all users, run tenant_workloads per user with ocp4_workload_tenant_keycloak_username and common_password overridden per iteration
Phase 2: Register all users with Babylon (workshop_user_mode: multi)

Interface variables

Variable	Default	Description
`num_users`	`0`	Number of tenant users
`tenant_workloads`	`[]`	Workload FQCNs to run per user
`tenant_remove_workloads`	`[]`	Workload FQCNs for destroy
`tenant_pre_loop_delay`	`0`	Seconds to wait before tenant loop
`tenant_user_info_data`	`{}`	Per-user data for Babylon (lab_ui_url, etc.)
`tenant_workload_vars`	`{}`	Per-workload var overrides for tenant workloads
`tenant_cluster_workloads`	`[]`	Workloads to run once inside bridge (phase 0)
`tenant_cluster_workloads_vars`	`{}`	Per-workload var overrides for cluster workloads

Key design decisions

Two-phase user registration: Phase 1 runs workloads without agnosticd_user_info calls. Phase 2 registers users AFTER all workloads complete. This prevents showroom's agnosticd_user_data lookup from detecting per-user data and switching to multi-user namespace mode.
set_fact for overrides: include_role vars: cannot accept Jinja2 dict expressions in AAP. Per-workload overrides use set_fact per key instead. Works because conflicting vars are NOT at the top level (not extra_vars).
Error-tolerant destroy: block/rescue per user — destroy never blocks cluster teardown.

Testing

LB2010 (MTA + DevSpaces + Kai): tested with 3-4 users, ~38-46 min
LB1401 (Guardrails + GitOps bootstrap): tested with 4 users, ~48 min
Both use tenant_cluster_workloads and tenant_workload_vars patterns

Files

roles/ocp4_workload_multi_tenant_loop/
├── defaults/main.yml
├── meta/main.yml
├── tasks/
│   ├── main.yml
│   ├── workload.yml
│   ├── run_cluster_workload.yml
│   ├── provision_user.yml
│   ├── provision_workload.yml
│   ├── register_user.yml
│   ├── remove_workload.yml
│   └── destroy_user.yml
└── README.md

Generic bridge workload that loops tenant workloads for N users on a shared cluster. Designed as a stopgap until Babylon gets native cluster-tenant lifecycle support. Interface: - num_users: number of tenants to provision - tenant_workloads: list of workload FQCNs to run per user - tenant_remove_workloads: list of workload FQCNs for destroy - tenant_user_prefix: username prefix (default: user) - tenant_user_offset: starting number (default: 1) - tenant_password_length: per-user password length (default: 8) The role overrides ocp4_workload_tenant_keycloak_username and common_password per user iteration. All other tenant vars defined in the AgnosticV catalog item cascade via Jinja2 lazy evaluation from these two roots.

…mments

…eardown

…er resources

…a module_defaults Use include_role apply: module_defaults: to set the user parameter on all agnosticd_user_info calls inside tenant workloads. This makes each user's data and messages separate in provision_data without modifying the tenant workloads themselves.

…rkloads Call agnosticd_user_info per user with username and password so workshop_user_mode: multi has the base credentials for each user. Lab-specific data (showroom URLs etc) comes from the tenant workloads own agnosticd_user_info calls via module_defaults.

… lab_ui_url The showroom role uses _showroom_user in its agnosticd_user_info call. When undefined it uses omit which overrides module_defaults, causing all users to get the last user's showroom URL. Passing _showroom_user explicitly fixes per-user lab_ui_url in the portal.

…and message leaks _showroom_user triggers the showroom role's multi-user path which appends the username to the namespace (user1-showroom-user1 instead of user1-showroom). Also causes cross-user message contamination. The bridge's own agnosticd_user_info call at the end of each user is the authoritative source for lab_ui_url — showroom role writes to global via omit, bridge writes correct URL per-user.

module_defaults injects user: into tenant workloads but conflicts with showroom role's omit pattern, causing lab_ui_url overwrites. Simpler design: tenant workloads write to global (we don't care), the bridge's own agnosticd_user_info call is the sole source of per-user data (user, password, lab_ui_url). Clean separation.

…ys generic Move lab-specific data (lab_ui_url pattern) out of the bridge role and into agnosticv via tenant_user_info_data dict. The bridge just merges it with the base user/password data per user. Each lab defines its own URL patterns in agnosticv. The bridge has zero knowledge of showroom or any specific workload.

The bridge's agnosticd_user_info call writes per-user data which the showroom role reads back via agnosticd_user_data lookup. This creates a 'users' dict that triggers showroom's multi-user path, creating wrong namespaces (user2-showroom-user1). Fix: split into two phases: 1. Loop all users — run tenant workloads only 2. Loop all users again — register with Babylon This way no per-user data exists in user-data.yaml while showroom is deploying.

…gister phase tenant_user_info_data references ocp4_workload_tenant_keycloak_username which is only set via include_role vars: in phase 1. Phase 2 needs it as a fact so the Jinja2 expressions in tenant_user_info_data resolve.

…cisions Extensive inline comments for PR reviewers covering: - Two-phase provisioning (why register after all workloads) - Variable scoping (Jinja2 lazy eval, include_role vars:) - Showroom interaction gotchas - Destroy error tolerance - Password determinism

…ad overrides When the same workload (e.g. gitops_bootstrap) runs at both cluster and tenant level with different vars, the bridge needs to override specific vars for the tenant invocation. tenant_workload_vars is a dict in agnosticv mapping workload FQCNs to var overrides. The bridge applies them via set_fact before include_role. ocp4_workload_tenant_keycloak_username is also set as a fact so Jinja2 expressions in tenant_workload_vars resolve.

loop: evaluates before when: in Ansible — accessing tenant_workload_vars[_tenant_workload] fails with KeyError for workloads not in the dict. Use default({}) to return empty list for missing keys.

…overrides set_fact cannot override extra_vars (Ansible precedence). include_role vars: CAN — it has higher precedence than extra_vars. Merge tenant_workload_vars overrides into the include_role vars: dict so they override the top-level (extra_var) values. Also set common_password as a fact so Jinja2 chains in tenant_workload_vars values resolve outside include_role context.

…er workloads When the same workload (e.g. gitops_bootstrap) runs at both cluster and tenant level with different vars, defining either set at the top level makes them extra_vars (highest Ansible precedence — can't be overridden by anything). Fix: move the workload OUT of the workloads: list. Run it inside the bridge via include_role vars: (which CAN override role defaults when the vars are NOT extra_vars). tenant_cluster_workloads: workloads to run ONCE before the user loop tenant_cluster_workloads_vars: per-workload vars for those workloads

…ct not string The >- block scalar makes set_fact store a STRING representation of the dict instead of an actual dict. include_role vars: then fails because it gets a string, not a mapping. Remove >- so Jinja2 evaluates to an actual dict object.

…role vars include_role vars: with a Jinja2 dict expression fails in AAP. Switch to set_fact for each override key individually, then call include_role without dynamic vars. Since these vars are NOT at the top level (not extra_vars), set_fact CAN override role defaults.

agonzalezrh

lgtm

Every file now explains WHY things are done the way they are: - Why set_fact instead of include_role vars (AAP rejects Jinja2 dicts) - Why set_fact works here (vars not at top level = not extra_vars) - Why three phases (showroom agnosticd_user_data lookup) - Why tenant_cluster_workloads exists (extra_vars precedence) - Why common_password must be a fact (Jinja2 chain resolution) - Ansible precedence order reference

jkupferer · 2026-06-03T17:37:20Z

+#   showroom detects a 'users' dict and switches to multi-user
+#   mode — which appends the username to the namespace, creating
+#   wrong namespaces like user1-showroom-user1.
+#   Phase 2 (registration) happens AFTER all showroom deployments.


I'm really confused here... if the showroom role is doing something strange, shouldn't we just fix the showroom role? I may need more information to understand the approach.

Good question — the showroom multi-user detection is actually intentional behavior, not a bug. When agnosticd_user_info is called with user: set during Phase 1, it writes per-user entries to user-data.yaml. The showroom role reads that file at deploy time, finds the users dict, and switches to multi-user mode — appending the username to the namespace (user1-showroom becomes user1-showroom-user1). This is the correct path for labs that provision users in a single pass and want showroom to detect them automatically. Fixing that behavior would break those existing labs. Our workaround is to defer registration (Phase 2) until after all showroom pods are deployed. Updated the comment to explain this more clearly.

jkupferer · 2026-06-03T17:47:03Z

+#   AAP rejects `include_role vars: "{{ dict_expression }}"` with
+#   "Vars in a IncludeRole must be specified as a dictionary".
+#   Only static YAML key: value mappings are accepted.
+#   So overrides go through set_fact, root vars through include_role.


I don't believe this is true... I think it means like include_vars, though you can also bring in vars from a role with import_role, but I don't understand why one would try to get vars from a role with include_role. iirc, it isn't that it won't work because the reasons above... but anyway, I think set_fact is fine without this extra comment.

You're right, the comment was misleading. The specific constraint is that AAP rejects Jinja2 expressions that evaluate to a dict in include_role vars: — it only accepts static YAML mappings. So we use set_fact per key instead. Simplified the comment to just explain what we're doing and why, without the incorrect assertion.

jkupferer · 2026-06-03T17:48:48Z

+# Example: gitops_bootstrap gets repo_path, app_name, helm_values
+# overridden from cluster values to tenant values.
+- name: "Apply tenant overrides for {{ _tenant_workload | split('.') | last }}"
+  when: _tenant_workload in (tenant_workload_vars | default({}))


tenant_workload_vars is in defaults/main.yml so there is no need for | default({}) here.

Correct, removed.

jkupferer · 2026-06-03T17:50:02Z

+# Run the workload with the two root vars.
+# All other vars come from facts (set above or in provision_user.yml)
+# or from the catalog item's playbook-level vars (via Jinja2 cascade).
+- name: "Run {{ _tenant_workload | split('.') | last }} for {{ _tenant_username }}"


Why the split on .? That just hides the collection name? Why would we want to hide that?

Agreed. Removed the split — all loop labels now show the full FQCN.

jkupferer · 2026-06-03T17:53:06Z

+      user: "{{ _tenant_username }}"
+      password: "{{ _tenant_password }}"
+    _user_data: >-
+      {{ _base_data | combine(tenant_user_info_data | default({})) }}


tenant_user_info_data is in defaults/main.yml, no need for | default({}) here.

Correct, removed.

jkupferer · 2026-06-03T17:55:45Z

+  agnosticd.core.agnosticd_user_info:
+    user: "{{ _tenant_username }}"
+    data: "{{ _user_data }}"
+    msg: "{{ tenant_user_info_msg | default(_default_msg, true) }}"


The use of _default_msg here is a little weird? Why not just put in defaults/main.yml?

tenant_user_info_msg: "User: {{ _tenant_username }} / {{ _tenant_password }}"

Or, probably better, just:

msg: "{{ tenant_user_info_msg | default(omit, true) }}"

Good suggestion. Moved the default to defaults/main.yml and simplified the task to msg: "{{ tenant_user_info_msg | default(omit, true) }}".

jkupferer · 2026-06-03T17:56:51Z

+  ansible.builtin.debug:
+    msg:
+    - "Tenant loop: destroying {{ num_users }} user(s)"
+    - "Destroy order: {{ tenant_remove_workloads | default([]) | map('split', '.') | map('last') | list }}"


Similar to previosu comment... why do the split and last? That's just hiding potentially useful information for troubleshooting?

Agreed, removed. Full FQCN now shown in all debug labels.

jkupferer · 2026-06-03T18:00:03Z

+    msg:
+    - "Tenant loop: provisioning {{ num_users }} user(s)"
+    - "Username pattern: {{ tenant_user_prefix }}{{ tenant_user_offset }}..{{ tenant_user_prefix }}{{ (tenant_user_offset | int) + (num_users | int) - 1 }}"
+    - "Cluster workloads: {{ tenant_cluster_workloads | default([]) | map('split', '.') | map('last') | list }}"


Again, why drop the collection name from the debug information?

Removed. All labels now show the full FQCN.

- Add vars/main.yml with _tenant_username/_tenant_password defaults - Clarify three-phase WHY: showroom multi-user detection behavior, not a showroom bug — fixing it would break existing labs - Drop | default({}) where vars are already in defaults/main.yml - Drop | default(omit) pattern for tenant_user_info_msg, use omit directly - Remove split('.') | last from all debug labels — show full FQCN - Remove password from destroy_user vars: (not needed for remove) - Simplify provision_workload.yml — remove incorrect comment about include_role vars: accepting Jinja2 (it was right for our use case but the explanation was misleading) - tenant_user_info_msg default moved to defaults/main.yml

prakhar1985 · 2026-06-10T23:19:32Z

I have addressed all your comments. Regarding showroom, we can work it out with Andrew — it should not be a blocker for us. Please approve and merge if you are ok with the changes.

- workload.yml line 28: shorten username pattern debug line (156→148 chars) - name[template]: move Jinja2 templates to end of task names in destroy_user, provision_user, provision_workload, register_user

…pression

prakhar1985 added 20 commits June 1, 2026 07:33

fix: address review — convention alignment, safe defaults, shorter co…

3dac4d8

…mments

fix: GPTEINFRA-16720 ignore errors on destroy — never block cluster t…

4cab6b3

…eardown

feat: GPTEINFRA-16720 add configurable pre-loop delay for async clust…

be014c2

…er resources

feat: GPTEINFRA-16720 include showroom URL in per-user registration

f3e98e4

fix: GPTEINFRA-16720 safe loop for tenant_workload_vars missing keys

4d10609

loop: evaluates before when: in Ansible — accessing tenant_workload_vars[_tenant_workload] fails with KeyError for workloads not in the dict. Use default({}) to return empty list for missing keys.

prakhar1985 requested review from agonzalezrh, fridim, rut31337, stencell and wkulhanek as code owners June 3, 2026 13:04

agonzalezrh previously approved these changes Jun 3, 2026

View reviewed changes

prakhar1985 dismissed agonzalezrh’s stale review via efa8119 June 3, 2026 13:09