Skip to content

[GPTEINFRA-16720] Generic multi-tenant loop bridge role#149

Merged
prakhar1985 merged 24 commits into
mainfrom
GPTEINFRA-16720-multi-tenant-loop
Jun 11, 2026
Merged

[GPTEINFRA-16720] Generic multi-tenant loop bridge role#149
prakhar1985 merged 24 commits into
mainfrom
GPTEINFRA-16720-multi-tenant-loop

Conversation

@prakhar1985

Copy link
Copy Markdown
Contributor

Summary

New generic role ocp4_workload_multi_tenant_loop that loops tenant workloads for N users on a shared OpenShift cluster. Designed for combined catalog items that merge cluster and tenant provisioning into a single order.

Stopgap until Babylon gets native cluster-tenant lifecycle support.

How it works

Three-phase provisioning:

  • Phase 0: Run cluster workloads inside the bridge (for workloads like gitops_bootstrap that also run per-tenant with different vars — avoids Ansible extra_vars precedence conflicts)
  • Phase 1: Loop all users, run tenant_workloads per user with ocp4_workload_tenant_keycloak_username and common_password overridden per iteration
  • Phase 2: Register all users with Babylon (workshop_user_mode: multi)

Interface variables

Variable Default Description
num_users 0 Number of tenant users
tenant_workloads [] Workload FQCNs to run per user
tenant_remove_workloads [] Workload FQCNs for destroy
tenant_pre_loop_delay 0 Seconds to wait before tenant loop
tenant_user_info_data {} Per-user data for Babylon (lab_ui_url, etc.)
tenant_workload_vars {} Per-workload var overrides for tenant workloads
tenant_cluster_workloads [] Workloads to run once inside bridge (phase 0)
tenant_cluster_workloads_vars {} Per-workload var overrides for cluster workloads

Key design decisions

  • Two-phase user registration: Phase 1 runs workloads without agnosticd_user_info calls. Phase 2 registers users AFTER all workloads complete. This prevents showroom's agnosticd_user_data lookup from detecting per-user data and switching to multi-user namespace mode.
  • set_fact for overrides: include_role vars: cannot accept Jinja2 dict expressions in AAP. Per-workload overrides use set_fact per key instead. Works because conflicting vars are NOT at the top level (not extra_vars).
  • Error-tolerant destroy: block/rescue per user — destroy never blocks cluster teardown.

Testing

  • LB2010 (MTA + DevSpaces + Kai): tested with 3-4 users, ~38-46 min
  • LB1401 (Guardrails + GitOps bootstrap): tested with 4 users, ~48 min
  • Both use tenant_cluster_workloads and tenant_workload_vars patterns

Files

roles/ocp4_workload_multi_tenant_loop/
├── defaults/main.yml
├── meta/main.yml
├── tasks/
│   ├── main.yml
│   ├── workload.yml
│   ├── run_cluster_workload.yml
│   ├── provision_user.yml
│   ├── provision_workload.yml
│   ├── register_user.yml
│   ├── remove_workload.yml
│   └── destroy_user.yml
└── README.md

Generic bridge workload that loops tenant workloads for N users
on a shared cluster. Designed as a stopgap until Babylon gets
native cluster-tenant lifecycle support.

Interface:
- num_users: number of tenants to provision
- tenant_workloads: list of workload FQCNs to run per user
- tenant_remove_workloads: list of workload FQCNs for destroy
- tenant_user_prefix: username prefix (default: user)
- tenant_user_offset: starting number (default: 1)
- tenant_password_length: per-user password length (default: 8)

The role overrides ocp4_workload_tenant_keycloak_username and
common_password per user iteration. All other tenant vars
defined in the AgnosticV catalog item cascade via Jinja2 lazy
evaluation from these two roots.
…a module_defaults

Use include_role apply: module_defaults: to set the user parameter
on all agnosticd_user_info calls inside tenant workloads. This makes
each user's data and messages separate in provision_data without
modifying the tenant workloads themselves.
…rkloads

Call agnosticd_user_info per user with username and password so
workshop_user_mode: multi has the base credentials for each user.
Lab-specific data (showroom URLs etc) comes from the tenant
workloads own agnosticd_user_info calls via module_defaults.
… lab_ui_url

The showroom role uses _showroom_user in its agnosticd_user_info call.
When undefined it uses omit which overrides module_defaults, causing
all users to get the last user's showroom URL. Passing _showroom_user
explicitly fixes per-user lab_ui_url in the portal.
…and message leaks

_showroom_user triggers the showroom role's multi-user path which
appends the username to the namespace (user1-showroom-user1 instead
of user1-showroom). Also causes cross-user message contamination.

The bridge's own agnosticd_user_info call at the end of each user
is the authoritative source for lab_ui_url — showroom role writes
to global via omit, bridge writes correct URL per-user.
module_defaults injects user: into tenant workloads but conflicts
with showroom role's omit pattern, causing lab_ui_url overwrites.

Simpler design: tenant workloads write to global (we don't care),
the bridge's own agnosticd_user_info call is the sole source of
per-user data (user, password, lab_ui_url). Clean separation.
…ys generic

Move lab-specific data (lab_ui_url pattern) out of the bridge role
and into agnosticv via tenant_user_info_data dict. The bridge just
merges it with the base user/password data per user.

Each lab defines its own URL patterns in agnosticv.
The bridge has zero knowledge of showroom or any specific workload.
The bridge's agnosticd_user_info call writes per-user data which
the showroom role reads back via agnosticd_user_data lookup. This
creates a 'users' dict that triggers showroom's multi-user path,
creating wrong namespaces (user2-showroom-user1).

Fix: split into two phases:
1. Loop all users — run tenant workloads only
2. Loop all users again — register with Babylon

This way no per-user data exists in user-data.yaml while showroom
is deploying.
…gister phase

tenant_user_info_data references ocp4_workload_tenant_keycloak_username
which is only set via include_role vars: in phase 1. Phase 2 needs it
as a fact so the Jinja2 expressions in tenant_user_info_data resolve.
…cisions

Extensive inline comments for PR reviewers covering:
- Two-phase provisioning (why register after all workloads)
- Variable scoping (Jinja2 lazy eval, include_role vars:)
- Showroom interaction gotchas
- Destroy error tolerance
- Password determinism
…ad overrides

When the same workload (e.g. gitops_bootstrap) runs at both cluster
and tenant level with different vars, the bridge needs to override
specific vars for the tenant invocation.

tenant_workload_vars is a dict in agnosticv mapping workload FQCNs
to var overrides. The bridge applies them via set_fact before
include_role. ocp4_workload_tenant_keycloak_username is also set
as a fact so Jinja2 expressions in tenant_workload_vars resolve.
loop: evaluates before when: in Ansible — accessing
tenant_workload_vars[_tenant_workload] fails with KeyError
for workloads not in the dict. Use default({}) to return
empty list for missing keys.
…overrides

set_fact cannot override extra_vars (Ansible precedence).
include_role vars: CAN — it has higher precedence than extra_vars.

Merge tenant_workload_vars overrides into the include_role vars:
dict so they override the top-level (extra_var) values.

Also set common_password as a fact so Jinja2 chains in
tenant_workload_vars values resolve outside include_role context.
…er workloads

When the same workload (e.g. gitops_bootstrap) runs at both cluster
and tenant level with different vars, defining either set at the
top level makes them extra_vars (highest Ansible precedence — can't
be overridden by anything).

Fix: move the workload OUT of the workloads: list. Run it inside
the bridge via include_role vars: (which CAN override role defaults
when the vars are NOT extra_vars).

tenant_cluster_workloads: workloads to run ONCE before the user loop
tenant_cluster_workloads_vars: per-workload vars for those workloads
…ct not string

The >- block scalar makes set_fact store a STRING representation
of the dict instead of an actual dict. include_role vars: then
fails because it gets a string, not a mapping.

Remove >- so Jinja2 evaluates to an actual dict object.
…role vars

include_role vars: with a Jinja2 dict expression fails in AAP.
Switch to set_fact for each override key individually, then call
include_role without dynamic vars. Since these vars are NOT at
the top level (not extra_vars), set_fact CAN override role defaults.
agonzalezrh
agonzalezrh previously approved these changes Jun 3, 2026

@agonzalezrh agonzalezrh left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Every file now explains WHY things are done the way they are:
- Why set_fact instead of include_role vars (AAP rejects Jinja2 dicts)
- Why set_fact works here (vars not at top level = not extra_vars)
- Why three phases (showroom agnosticd_user_data lookup)
- Why tenant_cluster_workloads exists (extra_vars precedence)
- Why common_password must be a fact (Jinja2 chain resolution)
- Ansible precedence order reference
# showroom detects a 'users' dict and switches to multi-user
# mode — which appends the username to the namespace, creating
# wrong namespaces like user1-showroom-user1.
# Phase 2 (registration) happens AFTER all showroom deployments.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm really confused here... if the showroom role is doing something strange, shouldn't we just fix the showroom role? I may need more information to understand the approach.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question — the showroom multi-user detection is actually intentional behavior, not a bug. When agnosticd_user_info is called with user: set during Phase 1, it writes per-user entries to user-data.yaml. The showroom role reads that file at deploy time, finds the users dict, and switches to multi-user mode — appending the username to the namespace (user1-showroom becomes user1-showroom-user1). This is the correct path for labs that provision users in a single pass and want showroom to detect them automatically. Fixing that behavior would break those existing labs. Our workaround is to defer registration (Phase 2) until after all showroom pods are deployed. Updated the comment to explain this more clearly.

Comment thread roles/ocp4_workload_multi_tenant_loop/tasks/destroy_user.yml
Comment thread roles/ocp4_workload_multi_tenant_loop/tasks/destroy_user.yml
Comment thread roles/ocp4_workload_multi_tenant_loop/tasks/provision_user.yml
Comment thread roles/ocp4_workload_multi_tenant_loop/tasks/provision_user.yml
# AAP rejects `include_role vars: "{{ dict_expression }}"` with
# "Vars in a IncludeRole must be specified as a dictionary".
# Only static YAML key: value mappings are accepted.
# So overrides go through set_fact, root vars through include_role.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe this is true... I think it means like include_vars, though you can also bring in vars from a role with import_role, but I don't understand why one would try to get vars from a role with include_role. iirc, it isn't that it won't work because the reasons above... but anyway, I think set_fact is fine without this extra comment.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, the comment was misleading. The specific constraint is that AAP rejects Jinja2 expressions that evaluate to a dict in include_role vars: — it only accepts static YAML mappings. So we use set_fact per key instead. Simplified the comment to just explain what we're doing and why, without the incorrect assertion.

# Example: gitops_bootstrap gets repo_path, app_name, helm_values
# overridden from cluster values to tenant values.
- name: "Apply tenant overrides for {{ _tenant_workload | split('.') | last }}"
when: _tenant_workload in (tenant_workload_vars | default({}))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tenant_workload_vars is in defaults/main.yml so there is no need for | default({}) here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, removed.

# Run the workload with the two root vars.
# All other vars come from facts (set above or in provision_user.yml)
# or from the catalog item's playbook-level vars (via Jinja2 cascade).
- name: "Run {{ _tenant_workload | split('.') | last }} for {{ _tenant_username }}"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the split on .? That just hides the collection name? Why would we want to hide that?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Removed the split — all loop labels now show the full FQCN.

Comment thread roles/ocp4_workload_multi_tenant_loop/tasks/register_user.yml
user: "{{ _tenant_username }}"
password: "{{ _tenant_password }}"
_user_data: >-
{{ _base_data | combine(tenant_user_info_data | default({})) }}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tenant_user_info_data is in defaults/main.yml, no need for | default({}) here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, removed.

agnosticd.core.agnosticd_user_info:
user: "{{ _tenant_username }}"
data: "{{ _user_data }}"
msg: "{{ tenant_user_info_msg | default(_default_msg, true) }}"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of _default_msg here is a little weird? Why not just put in defaults/main.yml?

tenant_user_info_msg: "User: {{ _tenant_username }} / {{ _tenant_password }}"

Or, probably better, just:

    msg: "{{ tenant_user_info_msg | default(omit, true) }}"

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion. Moved the default to defaults/main.yml and simplified the task to msg: "{{ tenant_user_info_msg | default(omit, true) }}".

ansible.builtin.debug:
msg:
- "Tenant loop: destroying {{ num_users }} user(s)"
- "Destroy order: {{ tenant_remove_workloads | default([]) | map('split', '.') | map('last') | list }}"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to previosu comment... why do the split and last? That's just hiding potentially useful information for troubleshooting?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, removed. Full FQCN now shown in all debug labels.

Comment thread roles/ocp4_workload_multi_tenant_loop/tasks/remove_workload.yml
msg:
- "Tenant loop: provisioning {{ num_users }} user(s)"
- "Username pattern: {{ tenant_user_prefix }}{{ tenant_user_offset }}..{{ tenant_user_prefix }}{{ (tenant_user_offset | int) + (num_users | int) - 1 }}"
- "Cluster workloads: {{ tenant_cluster_workloads | default([]) | map('split', '.') | map('last') | list }}"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, why drop the collection name from the debug information?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed. All labels now show the full FQCN.

- Add vars/main.yml with _tenant_username/_tenant_password defaults
- Clarify three-phase WHY: showroom multi-user detection behavior,
  not a showroom bug — fixing it would break existing labs
- Drop | default({}) where vars are already in defaults/main.yml
- Drop | default(omit) pattern for tenant_user_info_msg, use omit directly
- Remove split('.') | last from all debug labels — show full FQCN
- Remove password from destroy_user vars: (not needed for remove)
- Simplify provision_workload.yml — remove incorrect comment about
  include_role vars: accepting Jinja2 (it was right for our use case
  but the explanation was misleading)
- tenant_user_info_msg default moved to defaults/main.yml
@prakhar1985

Copy link
Copy Markdown
Contributor Author

I have addressed all your comments. Regarding showroom, we can work it out with Andrew — it should not be a blocker for us. Please approve and merge if you are ok with the changes.

prakhar added 2 commits June 11, 2026 09:34
- workload.yml line 28: shorten username pattern debug line (156→148 chars)
- name[template]: move Jinja2 templates to end of task names in
  destroy_user, provision_user, provision_workload, register_user
@prakhar1985 prakhar1985 merged commit 328d6f6 into main Jun 11, 2026
1 check passed
@prakhar1985 prakhar1985 deleted the GPTEINFRA-16720-multi-tenant-loop branch June 11, 2026 07:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants