[GPTEINFRA-16720] Generic multi-tenant loop bridge role #149
Generic bridge workload that loops tenant workloads for N users on a shared cluster. Designed as a stopgap until Babylon gets native cluster-tenant lifecycle support.
Interface:
- num_users: number of tenants to provision
- tenant_workloads: list of workload FQCNs to run per user
- tenant_remove_workloads: list of workload FQCNs for destroy
- tenant_user_prefix: username prefix (default: user)
- tenant_user_offset: starting number (default: 1)
- tenant_password_length: per-user password length (default: 8)
The role overrides ocp4_workload_tenant_keycloak_username and common_password per user iteration. All other tenant vars defined in the AgnosticV catalog item cascade via Jinja2 lazy evaluation from these two roots.
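To make the interface concrete, a catalog item might set the variables roughly as in the sketch below. The variable names and defaults are the ones listed in the commit message; the workload FQCNs are hypothetical placeholders, not roles confirmed by this PR.

```yaml
# Sketch of catalog item vars for the bridge role.
# The FQCNs under tenant_workloads / tenant_remove_workloads are
# hypothetical placeholders chosen for illustration only.
num_users: 3                  # provision user1, user2, user3
tenant_user_prefix: user      # default per the commit message
tenant_user_offset: 1         # default per the commit message
tenant_password_length: 8     # default per the commit message
tenant_workloads:
  - example.collection.ocp4_workload_tenant_keycloak_user
  - example.collection.ocp4_workload_tenant_showroom
tenant_remove_workloads:
  - example.collection.ocp4_workload_tenant_showroom
  - example.collection.ocp4_workload_tenant_keycloak_user
```

With these values the role would iterate over three usernames built from the prefix and offset, running each listed workload once per user.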
…a module_defaults
Use include_role apply: module_defaults: to set the user parameter on all agnosticd_user_info calls inside tenant workloads. This keeps each user's data and messages separate in provision_data without modifying the tenant workloads themselves.
…rkloads
Call agnosticd_user_info per user with username and password so workshop_user_mode: multi has the base credentials for each user. Lab-specific data (showroom URLs, etc.) comes from the tenant workloads' own agnosticd_user_info calls via module_defaults.
… lab_ui_url
The showroom role uses _showroom_user in its agnosticd_user_info call. When it is undefined, the role falls back to omit, which overrides module_defaults and causes all users to get the last user's showroom URL. Passing _showroom_user explicitly fixes the per-user lab_ui_url in the portal.
…and message leaks
_showroom_user triggers the showroom role's multi-user path, which appends the username to the namespace (user1-showroom-user1 instead of user1-showroom). It also causes cross-user message contamination. The bridge's own agnosticd_user_info call at the end of each user is the authoritative source for lab_ui_url: the showroom role writes to global via omit, and the bridge writes the correct URL per user.
module_defaults injects user: into tenant workloads but conflicts with showroom role's omit pattern, causing lab_ui_url overwrites. Simpler design: tenant workloads write to global (we don't care), the bridge's own agnosticd_user_info call is the sole source of per-user data (user, password, lab_ui_url). Clean separation.
…ys generic
Move lab-specific data (the lab_ui_url pattern) out of the bridge role and into agnosticv via the tenant_user_info_data dict. The bridge just merges it with the base user/password data per user. Each lab defines its own URL patterns in agnosticv. The bridge has zero knowledge of showroom or any specific workload.
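As a rough illustration of that split, a lab's AgnosticV vars might carry its own URL pattern like the sketch below. Only tenant_user_info_data, lab_ui_url and ocp4_workload_tenant_keycloak_username come from the commits above; the hostname pattern and the cluster_apps_domain variable are hypothetical.

```yaml
# Hypothetical AgnosticV snippet. The URL pattern and the
# cluster_apps_domain variable are placeholders, not values from this PR.
tenant_user_info_data:
  lab_ui_url: "https://{{ ocp4_workload_tenant_keycloak_username }}-showroom.{{ cluster_apps_domain }}"
```

Because the expression is evaluated lazily, it resolves to a different URL for each user once the bridge sets the username for that iteration.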
The bridge's agnosticd_user_info call writes per-user data which the showroom role reads back via agnosticd_user_data lookup. This creates a 'users' dict that triggers showroom's multi-user path, creating wrong namespaces (user2-showroom-user1).
Fix: split into two phases:
1. Loop all users: run tenant workloads only
2. Loop all users again: register with Babylon
This way no per-user data exists in user-data.yaml while showroom is deploying.
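The two-pass structure could look roughly like the following. This is a sketch, not the role's actual task file: the task file names are borrowed from names mentioned later in this PR, and the _tenant_index loop variable is an assumption.

```yaml
# Sketch only: file layout and the _tenant_index loop variable are assumed.
- name: Phase 1 - run tenant workloads for every user
  ansible.builtin.include_tasks: provision_user.yml
  loop: "{{ range(tenant_user_offset | int, (tenant_user_offset | int) + (num_users | int)) | list }}"
  loop_control:
    loop_var: _tenant_index

# Registration starts only after the loop above has finished for all users,
# so no per-user entries exist in user-data.yaml while showroom deploys.
- name: Phase 2 - register every user with Babylon
  ansible.builtin.include_tasks: register_user.yml
  loop: "{{ range(tenant_user_offset | int, (tenant_user_offset | int) + (num_users | int)) | list }}"
  loop_control:
    loop_var: _tenant_index
```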
…gister phase
tenant_user_info_data references ocp4_workload_tenant_keycloak_username, which is only set via include_role vars: in phase 1. Phase 2 needs it as a fact so that the Jinja2 expressions in tenant_user_info_data resolve.
…cisions
Extensive inline comments for PR reviewers covering:
- Two-phase provisioning (why register after all workloads)
- Variable scoping (Jinja2 lazy eval, include_role vars:)
- Showroom interaction gotchas
- Destroy error tolerance
- Password determinism
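The PR summary describes destroy error tolerance as block/rescue per user. A minimal sketch of that pattern follows, assuming the loop variable names _tenant_workload and _tenant_username seen in the reviewed diff; the task layout itself is not taken from the role.

```yaml
# Sketch of per-user destroy tolerance; the task layout is assumed.
- name: Remove tenant workloads for one user, tolerating failures
  block:
    - name: Run each remove workload for this user
      ansible.builtin.include_role:
        name: "{{ _tenant_workload }}"
      vars:
        ocp4_workload_tenant_keycloak_username: "{{ _tenant_username }}"
      loop: "{{ tenant_remove_workloads }}"
      loop_control:
        loop_var: _tenant_workload
  rescue:
    # A failure for one user is reported but does not fail the play,
    # so the remaining users and the cluster teardown still run.
    - name: Report the failure and carry on
      ansible.builtin.debug:
        msg: "Destroy failed for {{ _tenant_username }}; continuing"
```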
…ad overrides
When the same workload (e.g. gitops_bootstrap) runs at both cluster and tenant level with different vars, the bridge needs to override specific vars for the tenant invocation. tenant_workload_vars is a dict in agnosticv mapping workload FQCNs to var overrides. The bridge applies them via set_fact before include_role. ocp4_workload_tenant_keycloak_username is also set as a fact so Jinja2 expressions in tenant_workload_vars resolve.
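For illustration, the mapping might be shaped like this. The FQCN and the variable names under it are hypothetical; the commit only states that the dict maps workload FQCNs to var overrides.

```yaml
# Hypothetical agnosticv snippet: FQCN and variable names are placeholders.
tenant_workload_vars:
  example.collection.ocp4_workload_gitops_bootstrap:
    ocp4_workload_gitops_bootstrap_repo_path: "tenant/{{ ocp4_workload_tenant_keycloak_username }}"
    ocp4_workload_gitops_bootstrap_app_name: "tenant-{{ ocp4_workload_tenant_keycloak_username }}"
```

Workloads that need no overrides are simply absent from the dict.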
loop: evaluates before when: in Ansible, so accessing
tenant_workload_vars[_tenant_workload] fails with KeyError
for workloads not in the dict. Use default({}) to return
an empty dict for missing keys, leaving the loop nothing to iterate.
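One way to write such a loop is sketched below. It assumes the overrides are applied with set_fact one key at a time; dict2items and the _override loop variable are choices made for this example, not details confirmed by the commit.

```yaml
# Sketch: the loop expression is templated before any `when:` is checked,
# so the lookup itself must tolerate a missing key.
- name: Apply tenant overrides for the current workload
  ansible.builtin.set_fact:
    "{{ _override.key }}": "{{ _override.value }}"
  # default({}) turns a missing entry into an empty dict, which
  # dict2items converts to an empty list, so the task runs zero times.
  loop: "{{ tenant_workload_vars[_tenant_workload] | default({}) | dict2items }}"
  loop_control:
    loop_var: _override
```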
…overrides set_fact cannot override extra_vars (Ansible precedence). include_role vars: CAN — it has higher precedence than extra_vars. Merge tenant_workload_vars overrides into the include_role vars: dict so they override the top-level (extra_var) values. Also set common_password as a fact so Jinja2 chains in tenant_workload_vars values resolve outside include_role context.
…er workloads
When the same workload (e.g. gitops_bootstrap) runs at both cluster and tenant level with different vars, defining either set at the top level makes them extra_vars (highest Ansible precedence, which nothing can override).
Fix: move the workload OUT of the workloads: list. Run it inside the bridge via include_role vars: (which CAN override role defaults when the vars are NOT extra_vars).
- tenant_cluster_workloads: workloads to run ONCE before the user loop
- tenant_cluster_workloads_vars: per-workload vars for those workloads
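A catalog item using these two variables might look roughly like this. The FQCN and the variable name under it are hypothetical placeholders; only the two top-level variable names come from the commit.

```yaml
# Hypothetical snippet: the workload is listed here instead of in the
# top-level workloads: list, and its vars are nested under its FQCN
# rather than defined at the top level.
tenant_cluster_workloads:
  - example.collection.ocp4_workload_gitops_bootstrap
tenant_cluster_workloads_vars:
  example.collection.ocp4_workload_gitops_bootstrap:
    ocp4_workload_gitops_bootstrap_app_name: cluster-bootstrap
```

Keeping the values nested means they never become top-level extra_vars, which is what leaves room for a different tenant-level value later.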
…ct not string
The >- block scalar makes set_fact store a STRING representation of the dict instead of an actual dict. include_role vars: then fails because it gets a string, not a mapping. Remove >- so Jinja2 evaluates to an actual dict object.
…role vars
include_role vars: with a Jinja2 dict expression fails in AAP. Switch to set_fact for each override key individually, then call include_role without dynamic vars. Since these vars are NOT at the top level (not extra_vars), set_fact CAN override role defaults.
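After the overrides have been set as facts, the include itself only needs a static mapping. The sketch below follows the "two root vars" comment in the reviewed diff; the exact task in the role may differ.

```yaml
# Sketch: vars: is a plain YAML mapping with templated values,
# not a single Jinja2 expression that evaluates to a dict.
- name: Run the tenant workload with the two root vars
  ansible.builtin.include_role:
    name: "{{ _tenant_workload }}"
  vars:
    ocp4_workload_tenant_keycloak_username: "{{ _tenant_username }}"
    common_password: "{{ _tenant_password }}"
```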
Every file now explains WHY things are done the way they are:
- Why set_fact instead of include_role vars (AAP rejects Jinja2 dicts)
- Why set_fact works here (vars not at top level = not extra_vars)
- Why three phases (showroom agnosticd_user_data lookup)
- Why tenant_cluster_workloads exists (extra_vars precedence)
- Why common_password must be a fact (Jinja2 chain resolution)
- Ansible precedence order reference
# showroom detects a 'users' dict and switches to multi-user
# mode, which appends the username to the namespace, creating
# wrong namespaces like user1-showroom-user1.
# Phase 2 (registration) happens AFTER all showroom deployments.
I'm really confused here... if the showroom role is doing something strange, shouldn't we just fix the showroom role? I may need more information to understand the approach.
Good question — the showroom multi-user detection is actually intentional behavior, not a bug. When agnosticd_user_info is called with user: set during Phase 1, it writes per-user entries to user-data.yaml. The showroom role reads that file at deploy time, finds the users dict, and switches to multi-user mode — appending the username to the namespace (user1-showroom becomes user1-showroom-user1). This is the correct path for labs that provision users in a single pass and want showroom to detect them automatically. Fixing that behavior would break those existing labs. Our workaround is to defer registration (Phase 2) until after all showroom pods are deployed. Updated the comment to explain this more clearly.
# AAP rejects `include_role vars: "{{ dict_expression }}"` with
# "Vars in a IncludeRole must be specified as a dictionary".
# Only static YAML key: value mappings are accepted.
# So overrides go through set_fact, root vars through include_role.
I don't believe this is true... I think it means like include_vars, though you can also bring in vars from a role with import_role, but I don't understand why one would try to get vars from a role with include_role. iirc, it isn't that it won't work because the reasons above... but anyway, I think set_fact is fine without this extra comment.
You're right, the comment was misleading. The specific constraint is that AAP rejects Jinja2 expressions that evaluate to a dict in include_role vars: — it only accepts static YAML mappings. So we use set_fact per key instead. Simplified the comment to just explain what we're doing and why, without the incorrect assertion.
# Example: gitops_bootstrap gets repo_path, app_name, helm_values
# overridden from cluster values to tenant values.
- name: "Apply tenant overrides for {{ _tenant_workload | split('.') | last }}"
  when: _tenant_workload in (tenant_workload_vars | default({}))
tenant_workload_vars is in defaults/main.yml so there is no need for | default({}) here.
Correct, removed.
# Run the workload with the two root vars.
# All other vars come from facts (set above or in provision_user.yml)
# or from the catalog item's playbook-level vars (via Jinja2 cascade).
- name: "Run {{ _tenant_workload | split('.') | last }} for {{ _tenant_username }}"
Why the split on .? That just hides the collection name? Why would we want to hide that?
Agreed. Removed the split, so all loop labels now show the full FQCN.
user: "{{ _tenant_username }}"
password: "{{ _tenant_password }}"
_user_data: >-
  {{ _base_data | combine(tenant_user_info_data | default({})) }}
tenant_user_info_data is in defaults/main.yml, no need for | default({}) here.
Correct, removed.
agnosticd.core.agnosticd_user_info:
  user: "{{ _tenant_username }}"
  data: "{{ _user_data }}"
  msg: "{{ tenant_user_info_msg | default(_default_msg, true) }}"
The use of _default_msg here is a little weird? Why not just put in defaults/main.yml?
tenant_user_info_msg: "User: {{ _tenant_username }} / {{ _tenant_password }}"
Or, probably better, just:
msg: "{{ tenant_user_info_msg | default(omit, true) }}"
Good suggestion. Moved the default to defaults/main.yml and simplified the task to msg: "{{ tenant_user_info_msg | default(omit, true) }}".
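Put together, the outcome of this thread might look like the two fragments below, shown as separate YAML documents. The default string and the task arguments are taken from the review comments and the reviewed diff; the register_user.yml file name and the task name are assumptions.

```yaml
---
# defaults/main.yml (sketch)
tenant_user_info_msg: "User: {{ _tenant_username }} / {{ _tenant_password }}"
---
# tasks/register_user.yml (sketch; file name assumed)
- name: Register the tenant user with Babylon
  agnosticd.core.agnosticd_user_info:
    user: "{{ _tenant_username }}"
    data: "{{ _user_data }}"
    # default(omit, true) drops the msg argument when the variable is
    # set to an empty value, instead of sending an empty message.
    msg: "{{ tenant_user_info_msg | default(omit, true) }}"
```

A catalog item can then override tenant_user_info_msg with its own text, or set it to an empty string to send no message at all.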
ansible.builtin.debug:
  msg:
    - "Tenant loop: destroying {{ num_users }} user(s)"
    - "Destroy order: {{ tenant_remove_workloads | default([]) | map('split', '.') | map('last') | list }}"
Similar to previous comment... why do the split and last? That's just hiding potentially useful information for troubleshooting?
Agreed, removed. Full FQCN now shown in all debug labels.
msg:
  - "Tenant loop: provisioning {{ num_users }} user(s)"
  - "Username pattern: {{ tenant_user_prefix }}{{ tenant_user_offset }}..{{ tenant_user_prefix }}{{ (tenant_user_offset | int) + (num_users | int) - 1 }}"
  - "Cluster workloads: {{ tenant_cluster_workloads | default([]) | map('split', '.') | map('last') | list }}"
Again, why drop the collection name from the debug information?
Removed. All labels now show the full FQCN.
- Add vars/main.yml with _tenant_username/_tenant_password defaults
- Clarify three-phase WHY: showroom multi-user detection behavior,
not a showroom bug — fixing it would break existing labs
- Drop | default({}) where vars are already in defaults/main.yml
- Drop | default(omit) pattern for tenant_user_info_msg, use omit directly
- Remove split('.') | last from all debug labels — show full FQCN
- Remove password from destroy_user vars: (not needed for remove)
- Simplify provision_workload.yml — remove incorrect comment about
include_role vars: accepting Jinja2 (it was right for our use case
but the explanation was misleading)
- tenant_user_info_msg default moved to defaults/main.yml
I have addressed all your comments. Regarding showroom, we can work it out with Andrew, so it should not be a blocker for us. Please approve and merge if you are ok with the changes.
- workload.yml line 28: shorten username pattern debug line (156→148 chars)
- name[template]: move Jinja2 templates to end of task names in destroy_user, provision_user, provision_workload, register_user
Summary
New generic role ocp4_workload_multi_tenant_loop that loops tenant workloads for N users on a shared OpenShift cluster. Designed for combined catalog items that merge cluster and tenant provisioning into a single order.
Stopgap until Babylon gets native cluster-tenant lifecycle support.
How it works
Three-phase provisioning:
- … tenant_workloads per user with ocp4_workload_tenant_keycloak_username and common_password overridden per iteration
- … workshop_user_mode: multi)
Interface variables
- num_users (default: 0)
- tenant_workloads (default: [])
- tenant_remove_workloads (default: [])
- tenant_pre_loop_delay (default: 0)
- tenant_user_info_data (default: {})
- tenant_workload_vars (default: {})
- tenant_cluster_workloads (default: [])
- tenant_cluster_workloads_vars (default: {})
Key design decisions
- … agnosticd_user_info calls. Phase 2 registers users AFTER all workloads complete. This prevents showroom's agnosticd_user_data lookup from detecting per-user data and switching to multi-user namespace mode.
- set_fact for overrides: include_role vars: cannot accept Jinja2 dict expressions in AAP. Per-workload overrides use set_fact per key instead. Works because conflicting vars are NOT at the top level (not extra_vars).
- block/rescue per user: destroy never blocks cluster teardown.
Testing
- … tenant_cluster_workloads and tenant_workload_vars patterns
Files