Skip to content

fix(orchestrator): fail DB creation job on actual errors instead of silently succeeding#407

Open
Fortune-Ndlovu wants to merge 20 commits into
redhat-developer:mainfrom
Fortune-Ndlovu:RHDHBUGS-2577-helm-orchestrator-db-creation-job-should-be-able-to-retry-properly-and-fail-if-needed-it-shoud-not-silently-succeed-if-there-are-errors
Open

fix(orchestrator): fail DB creation job on actual errors instead of silently succeeding#407
Fortune-Ndlovu wants to merge 20 commits into
redhat-developer:mainfrom
Fortune-Ndlovu:RHDHBUGS-2577-helm-orchestrator-db-creation-job-should-be-able-to-retry-properly-and-fail-if-needed-it-shoud-not-silently-succeed-if-there-are-errors

Conversation

@Fortune-Ndlovu
Copy link
Copy Markdown
Member

@Fortune-Ndlovu Fortune-Ndlovu commented May 19, 2026

Description of the change

The create-sonataflow-database Job was using || echo WARNING which swallowed all psql errors, causing the Job to always report success. This prevented Kubernetes from retrying via backoffLimit when there were real failures (e.g. wrong credentials).

This PR replaces the blanket error suppression with proper handling that tolerates "database already exists" but exits non-zero on real failures. Also make backoffLimit configurable via values.yaml.

Which issue(s) does this PR fix or relate to

https://redhat.atlassian.net/browse/RHDHBUGS-2577

How to test changes / Special notes to the reviewer

  1. fresh database creation
  # Deploy the chart with orchestrator enabled
  helm install rhdh charts/backstage \
    --namespace rhdh --create-namespace \
    --set orchestrator.enabled=true \
    --set global.clusterRouterBase=<your-cluster-apps-domain>

  # Check the job completed
  oc get job rhdh-create-sonataflow-database -n rhdh

  # Verify logs show "CREATE DATABASE"
  oc logs -n rhdh -l job-name=rhdh-create-sonataflow-database -c psql
  Expected: Job completes 1/1, logs show CREATE DATABASE.
  1. database already exists
  # Delete the job and re-run it (database already exists from step 1)
  oc delete job rhdh-create-sonataflow-database -n rhdh
  # Re-apply just the Job from the rendered template
  helm template rhdh charts/backstage \
    --namespace rhdh \
    --set orchestrator.enabled=true \
    --set global.clusterRouterBase=<your-cluster-apps-domain> \
    --show-only templates/sonataflows.yaml \
    | awk '/^apiVersion: batch\/v1$/,/^---/' | head -n -1 \
    | oc apply -n rhdh -f -

  # Check logs
  oc logs -n rhdh -l job-name=rhdh-create-sonataflow-database -c psql
  Expected: Job completes 1/1, logs show Database 'sonataflow' already exists, skipping creation.
  1. Failure and retry, wrong credentials
  # Delete the existing job
  oc delete job rhdh-create-sonataflow-database -n rhdh

  # Create a job with a bad password to simulate auth failure
  oc create secret generic bad-password -n rhdh --from-literal=password=wrong
  # Patch the rendered job to use the bad secret, then apply it
  helm template rhdh charts/backstage \
    --namespace rhdh \
    --set orchestrator.enabled=true \
    --set global.clusterRouterBase=<your-cluster-apps-domain> \
    --show-only templates/sonataflows.yaml \
    | awk '/^apiVersion: batch\/v1$/,/^---/' | head -n -1 \
    | sed 's/rhdh-postgresql-svcbind-postgres/bad-password/' \
    | oc apply -n rhdh -f -

  # Watch retries (backoffLimit: 2 means up to 2 retries)
  oc get pods -n rhdh -l job-name=rhdh-create-sonataflow-database -w

  # Check logs show the error
  oc logs -n rhdh -l job-name=rhdh-create-sonataflow-database -c psql

  # Verify job ultimately failed
  oc get job rhdh-create-sonataflow-database -n rhdh
  Expected: Job fails, retries twice (3 pods total), logs show ERROR: Failed to create database 'sonataflow'., job status shows Failed.
  1. Configurable backoffLimit
  # Verify schema rejects invalid values
  helm template rhdh charts/backstage \
    --set orchestrator.enabled=true \
    --set orchestrator.sonataflowPlatform.dbCreationJobBackoffLimit=-1
  Expected: Helm rejects with a schema validation error.

Checklist

  • For each Chart updated, version bumped in the corresponding Chart.yaml according to Semantic Versioning.
  • For each Chart updated, variables are documented in the values.yaml and added to the corresponding README.md. The pre-commit utility can be used to generate the necessary content. Run pre-commit run --all-files to run the hooks and then push any resulting changes. The pre-commit Workflow will enforce this and warn you if needed.
  • JSON Schema template updated and re-generated the raw schema via the pre-commit hook.
  • Tests pass using the Chart Testing tool and the ct lint command.
  • If you updated the orchestrator-infra chart, make sure the versions of the Knative CRDs are aligned with the versions of the CRDs installed by the OpenShift Serverless operators declared in the values.yaml file. See Installing Knative Eventing and Knative Serving CRDs for more details.

@Fortune-Ndlovu Fortune-Ndlovu requested a review from a team as a code owner May 19, 2026 07:17
@openshift-ci openshift-ci Bot requested review from subhashkhileri and zdrapela May 19, 2026 07:17
@rhdh-qodo-merge
Copy link
Copy Markdown

rhdh-qodo-merge Bot commented May 19, 2026

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (0) 📎 Requirement gaps (0)

Context used
✅ Tickets: RHDHBUGS-2577

Grey Divider


Action required

1. Wrong psql target DB ✓ Resolved 🐞 Bug ≡ Correctness
Description
In the external-DB branch of the create-sonataflow-database Job, the updated psql invocations no
longer pass -d {{ .Values.orchestrator.sonataflowPlatform.externalDBName }}, so psql will
connect to its default database instead of the documented existing database. This can cause the job
to fail (and the existence-check query to run in the wrong session context) even when the external
DB configuration is otherwise correct.
Code

charts/backstage/templates/sonataflows.yaml[R201-202]

Relevance

⭐⭐⭐ High

PR #173 explicitly accepted using -d {{ externalDBName }} for external DB psql in create-db Job.

PR-#173

ⓘ Recommendations generated based on similar findings in past PRs

Evidence
The external-DB job’s new psql commands omit any -d selection, while the chart documentation
explicitly defines externalDBName as the existing database to connect to for external DB setups;
the job then creates the sonataflow database on that server.

charts/backstage/templates/sonataflows.yaml[198-208]
charts/backstage/README.md[420-434]
charts/backstage/values.yaml[524-532]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The external-DB create-db job runs `psql` without specifying a database (`-d ...`). The chart’s README states `externalDBName` is the *existing* DB to connect to when using an external server, while the job creates the `sonataflow` DB.

### Issue Context
Without `-d`, `psql` connects to its default DB (often derived from the username/PGDATABASE), which may not exist or may not be the intended maintenance DB, causing the create step and/or the follow-up `SELECT 1 FROM pg_database ...` check to fail incorrectly.

### Fix Focus Areas
- charts/backstage/templates/sonataflows.yaml[201-202]

### Suggested change
In the external-DB branch, add `-d {{ .Values.orchestrator.sonataflowPlatform.externalDBName }}` (or a safe default like `postgres` if you prefer) to **both**:
- the `CREATE DATABASE sonataflow;` `psql` call
- the `SELECT 1 FROM pg_database ...` `psql` call

This aligns behavior with the README guidance and avoids relying on `psql` defaults.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


2. Schema blocks new value ✓ Resolved 🐞 Bug ≡ Correctness
Description
values.yaml introduces orchestrator.sonataflowPlatform.dbCreationJobBackoffLimit, but
values.schema.json has additionalProperties: false for sonataflowPlatform and does not define
this field, so schema validation/linting will fail and users cannot configure it.
Code

charts/backstage/values.yaml[R560-561]

Relevance

⭐⭐⭐ High

Schema/README kept in sync with new values in past (PRs #173/#225/#143); mismatch likely rejected by
lint.

PR-#173
PR-#225
PR-#143

ⓘ Recommendations generated based on similar findings in past PRs

Evidence
The default values introduce dbCreationJobBackoffLimit, but the schema explicitly forbids unknown
keys under sonataflowPlatform and does not list the new property; the README values table for the
same section also lacks the new key.

charts/backstage/values.yaml[548-565]
charts/backstage/templates/sonataflows.yaml[186-212]
charts/backstage/values.schema.json[557-680]
charts/backstage/README.md[208-226]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
A new Helm value `orchestrator.sonataflowPlatform.dbCreationJobBackoffLimit` was added and referenced by templates, but the chart’s `values.schema.json` does not define it while `sonataflowPlatform` is schema-locked (`additionalProperties: false`). This causes Helm schema validation / CI lint to fail and prevents users from setting the value.

## Issue Context
- `values.yaml` now contains the new key with a default.
- `templates/sonataflows.yaml` renders it into Job `backoffLimit`.
- `values.schema.json` defines `orchestrator.sonataflowPlatform` with `additionalProperties: false`, and the allowed `properties` list does not include `dbCreationJobBackoffLimit`.
- The generated `README.md` values table also does not list the new key, indicating docs/schema regeneration wasn’t done.

## Fix Focus Areas
- charts/backstage/values.schema.json[557-680]
- charts/backstage/values.yaml[548-565]
- charts/backstage/README.md[208-226]

### Implementation notes
- Add `dbCreationJobBackoffLimit` under `orchestrator.sonataflowPlatform.properties` with type `integer` (and ideally `minimum: 0`) and default `2`.
- Re-run the repo’s doc/schema generation (pre-commit hook) so `README.md` includes the new value.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended

3. Implicit grep dependency 🐞 Bug ☼ Reliability ⭐ New
Description
The create-sonataflow-database Job’s new failure-handling path pipes psql output to grep -q 1,
but the Job image is configurable via orchestrator.sonataflowPlatform.createDBJobImage and may not
include grep, causing the Job to fail even when psql is present and the database already exists.
Code

charts/backstage/templates/sonataflows.yaml[195]

Relevance

⭐⭐ Medium

No prior review pattern about avoiding grep in configurable Job images; related create-db Job
changes accepted in #164/#173.

PR-#164
PR-#173

ⓘ Recommendations generated based on similar findings in past PRs

Evidence
The Job’s fallback existence check explicitly invokes grep, and the Job container image is
templated from a user-configurable value, so there is no guarantee grep exists in the runtime
image.

charts/backstage/templates/sonataflows.yaml[145-213]
charts/backstage/values.yaml[532-538]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The create-db Job uses `| grep -q 1` to determine whether the `sonataflow` database already exists. Because `createDBJobImage` is user-overridable, users may supply an image that includes `psql` but not `grep`, causing a hard failure in the fallback path.

## Issue Context
The Job image is driven by `.Values.orchestrator.sonataflowPlatform.createDBJobImage`, which is explicitly configurable in `values.yaml`.

## Fix Focus Areas
- charts/backstage/templates/sonataflows.yaml[190-212]
- charts/backstage/values.yaml[532-537]

## Suggested change
Replace the `psql ... | grep -q 1` existence test with a shell-only check that does not rely on `grep`.

Example pattern:
```sh
if db_exists="$(psql ... -Atc "SELECT 1 FROM pg_database WHERE datname='sonataflow'" 2>/dev/null)"; then
 if [ "$db_exists" = "1" ]; then
   echo "Database 'sonataflow' already exists, skipping creation."
 else
   echo "ERROR: Failed to create database 'sonataflow'."
   exit 1
 fi
else
 echo "ERROR: Failed to verify whether database 'sonataflow' exists."
 exit 1
fi
```
Apply this in both the upstream and external DB branches.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


4. PR-specific doc committed 🐞 Bug ⚙ Maintainability
Description
The newly added pr-info.md is a large, ticket-specific and date-stamped troubleshooting/testing
write-up, which is likely to become stale and create long-term maintenance noise in the repo. It
appears intended for PR context rather than evergreen project documentation.
Code

pr-info.md[R1-4]

Relevance

⭐⭐⭐ High

Team often removes PR-temporary content and prefers evergreen docs (accepted in PRs #165, #183,
#280).

PR-#165
PR-#183
PR-#280

ⓘ Recommendations generated based on similar findings in past PRs

Evidence
The file content is explicitly scoped to a single Jira ticket/PR and includes date-stamped test
results, which makes it likely to become outdated as chart behavior and versions evolve.

pr-info.md[1-4]
pr-info.md[464-474]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`pr-info.md` is being added to the repository root as a 700+ line, ticket-specific and date-stamped explanation/testing guide. This kind of PR-context artifact tends to become stale quickly and adds maintenance overhead for future contributors.

## Issue Context
The file includes a Jira ticket reference and dated test results, indicating it is not evergreen documentation.

## Fix Focus Areas
- pr-info.md[1-4]
- pr-info.md[464-474]

## Suggested fix
- Prefer moving this content to the PR description / Jira ticket / internal wiki, or
- If the content is valuable long-term, extract the evergreen parts into the appropriate project docs location (e.g., `docs/`), and drop dated/PR-specific sections (like the dated test results table).

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


5. Job logs expire quickly ✓ Resolved 🐞 Bug ◔ Observability
Description
The create-sonataflow-database Job is now a Helm hook and also has ttlSecondsAfterFinished: 300,
so Kubernetes will garbage-collect the Job and its pods 5 minutes after it finishes (success or
failure), which can remove the primary failure evidence shortly after retries are exhausted. This is
especially risky now that the Job is expected to fail on real errors, because post-mortem debugging
may be time-boxed to a few minutes.
Code

charts/backstage/templates/sonataflows.yaml[R90-94]

Relevance

⭐⭐ Medium

No prior evidence on Job TTL/log retention; past sonataflows.yaml reviews focused on security/DB
wiring, not ttlSecondsAfterFinished.

PR-#166
PR-#173
PR-#235

ⓘ Recommendations generated based on similar findings in past PRs

Evidence
The Job is explicitly marked as a Helm hook and configured with a 300-second TTL, which instructs
Kubernetes to delete finished Jobs/pods after 5 minutes, limiting log availability after completion.

charts/backstage/templates/sonataflows.yaml[85-96]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`ttlSecondsAfterFinished: 300` causes the DB creation Job and its pods to be deleted 5 minutes after completion (including failed completion), which can make it hard to retrieve logs/events when the hook fails after retries.

### Issue Context
This Job is a Helm hook (`post-install,post-upgrade`) and is now intended to fail on real errors; fast TTL cleanup reduces the time window to investigate failures.

### Fix Focus Areas
- charts/backstage/templates/sonataflows.yaml[87-96]

### Suggested fix
- Prefer preserving failed Jobs for debugging:
 - Remove the hard-coded `ttlSecondsAfterFinished`, and instead set Helm hook deletion policy to clean up only on success (e.g., `before-hook-creation,hook-succeeded`).
 - Or make `ttlSecondsAfterFinished` configurable via values (and schema/README), with a safer default for troubleshooting (or document that centralized logging is expected).

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

Previous review results

Review updated until commit 333fd0c

Results up to commit 17a64a7


🐞 Bugs (0) 📘 Rule violations (0) 📎 Requirement gaps (0)


Action required
1. Schema blocks new value ✓ Resolved 🐞 Bug ≡ Correctness
Description
values.yaml introduces orchestrator.sonataflowPlatform.dbCreationJobBackoffLimit, but
values.schema.json has additionalProperties: false for sonataflowPlatform and does not define
this field, so schema validation/linting will fail and users cannot configure it.
Code

charts/backstage/values.yaml[R560-561]

Relevance

⭐⭐⭐ High

Schema/README kept in sync with new values in past (PRs #173/#225/#143); mismatch likely rejected by
lint.

PR-#173
PR-#225
PR-#143

ⓘ Recommendations generated based on similar findings in past PRs

Evidence
The default values introduce dbCreationJobBackoffLimit, but the schema explicitly forbids unknown
keys under sonataflowPlatform and does not list the new property; the README values table for the
same section also lacks the new key.

charts/backstage/values.yaml[548-565]
charts/backstage/templates/sonataflows.yaml[186-212]
charts/backstage/values.schema.json[557-680]
charts/backstage/README.md[208-226]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
A new Helm value `orchestrator.sonataflowPlatform.dbCreationJobBackoffLimit` was added and referenced by templates, but the chart’s `values.schema.json` does not define it while `sonataflowPlatform` is schema-locked (`additionalProperties: false`). This causes Helm schema validation / CI lint to fail and prevents users from setting the value.

## Issue Context
- `values.yaml` now contains the new key with a default.
- `templates/sonataflows.yaml` renders it into Job `backoffLimit`.
- `values.schema.json` defines `orchestrator.sonataflowPlatform` with `additionalProperties: false`, and the allowed `properties` list does not include `dbCreationJobBackoffLimit`.
- The generated `README.md` values table also does not list the new key, indicating docs/schema regeneration wasn’t done.

## Fix Focus Areas
- charts/backstage/values.schema.json[557-680]
- charts/backstage/values.yaml[548-565]
- charts/backstage/README.md[208-226]

### Implementation notes
- Add `dbCreationJobBackoffLimit` under `orchestrator.sonataflowPlatform.properties` with type `integer` (and ideally `minimum: 0`) and default `2`.
- Re-run the repo’s doc/schema generation (pre-commit hook) so `README.md` includes the new value.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Results up to commit 635fd18


🐞 Bugs (0) 📘 Rule violations (0) 📎 Requirement gaps (0)


Action required
1. Wrong psql target DB ✓ Resolved 🐞 Bug ≡ Correctness
Description
In the external-DB branch of the create-sonataflow-database Job, the updated psql invocations no
longer pass -d {{ .Values.orchestrator.sonataflowPlatform.externalDBName }}, so psql will
connect to its default database instead of the documented existing database. This can cause the job
to fail (and the existence-check query to run in the wrong session context) even when the external
DB configuration is otherwise correct.
Code

charts/backstage/templates/sonataflows.yaml[R201-202]

Relevance

⭐⭐⭐ High

PR #173 explicitly accepted using -d {{ externalDBName }} for external DB psql in create-db Job.

PR-#173

ⓘ Recommendations generated based on similar findings in past PRs

Evidence
The external-DB job’s new psql commands omit any -d selection, while the chart documentation
explicitly defines externalDBName as the existing database to connect to for external DB setups;
the job then creates the sonataflow database on that server.

charts/backstage/templates/sonataflows.yaml[198-208]
charts/backstage/README.md[420-434]
charts/backstage/values.yaml[524-532]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The external-DB create-db job runs `psql` without specifying a database (`-d ...`). The chart’s README states `externalDBName` is the *existing* DB to connect to when using an external server, while the job creates the `sonataflow` DB.

### Issue Context
Without `-d`, `psql` connects to its default DB (often derived from the username/PGDATABASE), which may not exist or may not be the intended maintenance DB, causing the create step and/or the follow-up `SELECT 1 FROM pg_database ...` check to fail incorrectly.

### Fix Focus Areas
- charts/backstage/templates/sonataflows.yaml[201-202]

### Suggested change
In the external-DB branch, add `-d {{ .Values.orchestrator.sonataflowPlatform.externalDBName }}` (or a safe default like `postgres` if you prefer) to **both**:
- the `CREATE DATABASE sonataflow;` `psql` call
- the `SELECT 1 FROM pg_database ...` `psql` call

This aligns behavior with the README guidance and avoids relying on `psql` defaults.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Results up to commit 13b5e0f


🐞 Bugs (0) 📘 Rule violations (0) 📎 Requirement gaps (0)


Remediation recommended
1. Job logs expire quickly ✓ Resolved 🐞 Bug ◔ Observability
Description
The create-sonataflow-database Job is now a Helm hook and also has ttlSecondsAfterFinished: 300,
so Kubernetes will garbage-collect the Job and its pods 5 minutes after it finishes (success or
failure), which can remove the primary failure evidence shortly after retries are exhausted. This is
especially risky now that the Job is expected to fail on real errors, because post-mortem debugging
may be time-boxed to a few minutes.
Code

charts/backstage/templates/sonataflows.yaml[R90-94]

Relevance

⭐⭐ Medium

No prior evidence on Job TTL/log retention; past sonataflows.yaml reviews focused on security/DB
wiring, not ttlSecondsAfterFinished.

PR-#166
PR-#173
PR-#235

ⓘ Recommendations generated based on similar findings in past PRs

Evidence
The Job is explicitly marked as a Helm hook and configured with a 300-second TTL, which instructs
Kubernetes to delete finished Jobs/pods after 5 minutes, limiting log availability after completion.

charts/backstage/templates/sonataflows.yaml[85-96]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`ttlSecondsAfterFinished: 300` causes the DB creation Job and its pods to be deleted 5 minutes after completion (including failed completion), which can make it hard to retrieve logs/events when the hook fails after retries.

### Issue Context
This Job is a Helm hook (`post-install,post-upgrade`) and is now intended to fail on real errors; fast TTL cleanup reduces the time window to investigate failures.

### Fix Focus Areas
- charts/backstage/templates/sonataflows.yaml[87-96]

### Suggested fix
- Prefer preserving failed Jobs for debugging:
 - Remove the hard-coded `ttlSecondsAfterFinished`, and instead set Helm hook deletion policy to clean up only on success (e.g., `before-hook-creation,hook-succeeded`).
 - Or make `ttlSecondsAfterFinished` configurable via values (and schema/README), with a safer default for troubleshooting (or document that centralized logging is expected).

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Results up to commit 41fbb3e


🐞 Bugs (1) 📘 Rule violations (0) 📎 Requirement gaps (0)


Remediation recommended
1. PR-specific doc committed 🐞 Bug ⚙ Maintainability
Description
The newly added pr-info.md is a large, ticket-specific and date-stamped troubleshooting/testing
write-up, which is likely to become stale and create long-term maintenance noise in the repo. It
appears intended for PR context rather than evergreen project documentation.
Code

pr-info.md[R1-4]

Relevance

⭐⭐⭐ High

Team often removes PR-temporary content and prefers evergreen docs (accepted in PRs #165, #183,
#280).

PR-#165
PR-#183
PR-#280

ⓘ Recommendations generated based on similar findings in past PRs

Evidence
The file content is explicitly scoped to a single Jira ticket/PR and includes date-stamped test
results, which makes it likely to become outdated as chart behavior and versions evolve.

pr-info.md[1-4]
pr-info.md[464-474]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`pr-info.md` is being added to the repository root as a 700+ line, ticket-specific and date-stamped explanation/testing guide. This kind of PR-context artifact tends to become stale quickly and adds maintenance overhead for future contributors.

## Issue Context
The file includes a Jira ticket reference and dated test results, indicating it is not evergreen documentation.

## Fix Focus Areas
- pr-info.md[1-4]
- pr-info.md[464-474]

## Suggested fix
- Prefer moving this content to the PR description / Jira ticket / internal wiki, or
- If the content is valuable long-term, extract the evergreen parts into the appropriate project docs location (e.g., `docs/`), and drop dated/PR-specific sections (like the dated test results table).

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Qodo Logo

Signed-off-by: Fortune-Ndlovu <fndlovu@redhat.com>
@rhdh-qodo-merge
Copy link
Copy Markdown

Review Summary by Qodo

Fix DB creation job error handling and make backoffLimit configurable

🐞 Bug fix ✨ Enhancement

Grey Divider

Walkthroughs

Description
• Replace blanket error suppression with proper database creation error handling
• Distinguish between "database already exists" (acceptable) and real failures (exit non-zero)
• Make job backoffLimit configurable via dbCreationJobBackoffLimit in values.yaml
• Enable Kubernetes to properly retry on actual database creation failures
Diagram
flowchart LR
  A["DB Creation Job"] -->|"Old: || echo WARNING"| B["Always Succeeds"]
  A -->|"New: Check if exists"| C{"Database exists?"}
  C -->|"Yes"| D["Skip & Succeed"]
  C -->|"No"| E["Exit 1"]
  E -->|"Retry via backoffLimit"| A
Loading

Grey Divider

File Changes

1. charts/backstage/templates/sonataflows.yaml 🐞 Bug fix +19/-3

Implement proper DB creation error handling

• Replaced || echo WARNING error suppression with conditional logic that checks if database
 already exists
• Added proper error handling that exits with code 1 on real failures while tolerating existing
 databases
• Updated both upstream PostgreSQL and external database code paths with identical error handling
 logic
• Changed hardcoded backoffLimit: 2 to use configurable `{{
 .Values.orchestrator.sonataflowPlatform.dbCreationJobBackoffLimit }}`

charts/backstage/templates/sonataflows.yaml


2. charts/backstage/values.yaml ⚙️ Configuration changes +2/-0

Add configurable backoffLimit for DB creation job

• Added new configuration parameter dbCreationJobBackoffLimit with default value of 2
• Documented the parameter as "Number of retries for the create-db job if it fails"
• Allows users to customize retry behavior for database creation job failures

charts/backstage/values.yaml


Grey Divider

Qodo Logo

@rhdh-qodo-merge rhdh-qodo-merge Bot added enhancement New feature or request bug_fix labels May 19, 2026
…job-should-be-able-to-retry-properly-and-fail-if-needed-it-shoud-not-silently-succeed-if-there-are-errors
@rm3l
Copy link
Copy Markdown
Member

rm3l commented May 19, 2026

/cherry-pick release-1.10

@openshift-cherrypick-robot
Copy link
Copy Markdown

@rm3l: once the present PR merges, I will cherry-pick it on top of release-1.10 in a new PR and assign it to you.

Details

In response to this:

/cherry-pick release-1.10

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@Fortune-Ndlovu
Copy link
Copy Markdown
Member Author

/review

@rhdh-qodo-merge
Copy link
Copy Markdown

PR Reviewer Guide 🔍

Warning

/review is deprecated. Use /agentic_review instead (removal date not yet scheduled).

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Config Validation

backoffLimit is now driven by orchestrator.sonataflowPlatform.dbCreationJobBackoffLimit; it would be good to ensure values are validated to be non-negative (and possibly set a sensible upper bound) in the schema to avoid invalid Job specs or runaway retries.

  backoffLimit: {{ .Values.orchestrator.sonataflowPlatform.dbCreationJobBackoffLimit }}
{{- end }}
📚 Focus areas based on broader codebase context

Possible Issue

The external-DB branch runs CREATE DATABASE sonataflow; while connected to -d {{ .Values.orchestrator.sonataflowPlatform.externalDBName }}. If that database is missing/unreachable (common in fresh setups), the psql connection can fail before the CREATE DATABASE is executed, and the follow-up existence check uses the same potentially-nonexistent DB, making error classification unreliable. (Ref 1)

          - |
            psql -h {{ .Release.Name }}-postgresql{{- if eq .Values.upstream.postgresql.architecture "replication" }}-primary{{- end }} -p 5432 -U postgres -c 'CREATE DATABASE sonataflow;' 2>&1 || {
              if psql -h {{ .Release.Name }}-postgresql{{- if eq .Values.upstream.postgresql.architecture "replication" }}-primary{{- end }} -p 5432 -U postgres -tc "SELECT 1 FROM pg_database WHERE datname='sonataflow'" | grep -q 1; then
                echo "Database 'sonataflow' already exists, skipping creation."
              else
                echo "ERROR: Failed to create database 'sonataflow'."
                exit 1
              fi
            }
{{- else }}
        args:
          - |
            psql -h ${POSTGRES_HOST} -p ${POSTGRES_PORT} -U ${POSTGRES_USER} -d {{ .Values.orchestrator.sonataflowPlatform.externalDBName }} -c 'CREATE DATABASE sonataflow;' 2>&1 || {
              if psql -h ${POSTGRES_HOST} -p ${POSTGRES_PORT} -U ${POSTGRES_USER} -d {{ .Values.orchestrator.sonataflowPlatform.externalDBName }} -tc "SELECT 1 FROM pg_database WHERE datname='sonataflow'" | grep -q 1; then
                echo "Database 'sonataflow' already exists, skipping creation."
              else
                echo "ERROR: Failed to create database 'sonataflow'."
                exit 1
              fi
            }

Reference reasoning: The reference implementation connects to a known maintenance database (postgres) when checking for existence and creating the target DB, avoiding dependence on the target DB being present. Adopting that pattern (connect to postgres for both the check and CREATE DATABASE) makes the job’s retry/fail behavior deterministic across fresh installs and misconfigurations.

📄 References
  1. redhat-developer/rhdh-operator/config/profile/rhdh/plugin-deps/sonataflow.yaml [99-105]
  2. redhat-developer/rhdh-operator/config/profile/rhdh/plugin-deps/sonataflow.yaml [68-98]
  3. redhat-developer/rhdh-operator/config/profile/rhdh/plugin-deps/sonataflow.yaml [107-139]
  4. redhat-developer/rhdh-chart/charts/backstage/ci/with-orchestrator-values.yaml [1-21]
  5. redhat-developer/rhdh-chart/charts/backstage/values.yaml [502-509]
  6. redhat-developer/rhdh-chart/charts/backstage/values.yaml [188-203]
  7. redhat-developer/rhdh/scripts/rhdh-openshift-setup/values.yaml [222-236]
  8. redhat-developer/rhdh-operator/bundle/rhdh/manifests/rhdh.redhat.com_backstages.yaml [277-291]

…json and values.schema.tmpl.json with minimum and maximum constraints. Update sonataflows.yaml to simplify database creation command.

Signed-off-by: Fortune-Ndlovu <fndlovu@redhat.com>
Signed-off-by: Fortune-Ndlovu <fndlovu@redhat.com>
@Fortune-Ndlovu
Copy link
Copy Markdown
Member Author

/agentic_review

@rhdh-qodo-merge
Copy link
Copy Markdown

rhdh-qodo-merge Bot commented May 19, 2026

Persistent review updated to latest commit 635fd18

…exists in PostgreSQL and matches the behavior of the internal branch

Signed-off-by: Fortune-Ndlovu <fndlovu@redhat.com>
@Fortune-Ndlovu
Copy link
Copy Markdown
Member Author

/agentic_review

@rhdh-qodo-merge
Copy link
Copy Markdown

rhdh-qodo-merge Bot commented May 19, 2026

Persistent review updated to latest commit 8f6b80f

Signed-off-by: Fortune-Ndlovu <fndlovu@redhat.com>
@Fortune-Ndlovu
Copy link
Copy Markdown
Member Author

cc/ @rm3l for review

…job-should-be-able-to-retry-properly-and-fail-if-needed-it-shoud-not-silently-succeed-if-there-are-errors
Comment thread charts/backstage/templates/sonataflows.yaml Outdated
Updated the Job name to use a shorter format and added a TTL of 300 seconds after the job finishes. Removed Helm hook annotations for cleaner configuration.
…job-should-be-able-to-retry-properly-and-fail-if-needed-it-shoud-not-silently-succeed-if-there-are-errors
Signed-off-by: Fortune-Ndlovu <fndlovu@redhat.com>
Signed-off-by: Fortune-Ndlovu <fndlovu@redhat.com>
…pgrade compatibility

The CI "Test Latest Release" check fails because helm upgrade tries to patch the existing Job's spec.template, which Kubernetes rejects as immutable. The old chart created the Job without ttlSecondsAfterFinished, so it persists indefinitely and blocks the upgrade. Adding helm.sh/hook and helm.sh/hook-delete-policy annotations makes Helm delete the old Job before creating the new one on upgrade.
…job-should-be-able-to-retry-properly-and-fail-if-needed-it-shoud-not-silently-succeed-if-there-are-errors
@Fortune-Ndlovu
Copy link
Copy Markdown
Member Author

/agentic_review

@rhdh-qodo-merge
Copy link
Copy Markdown

rhdh-qodo-merge Bot commented May 27, 2026

Code review by qodo was updated up to the latest commit 13b5e0f

Add hook-succeeded to the Helm hook delete policy so that successful. Jobs are cleaned up immediately while failed Jobs are kept for log   inspection. TTL still handles cleanup for ArgoCD users after 5 minutes.
@Fortune-Ndlovu
Copy link
Copy Markdown
Member Author

/agentic_review

@rhdh-qodo-merge
Copy link
Copy Markdown

rhdh-qodo-merge Bot commented May 27, 2026

Code review by qodo was updated up to the latest commit 41fbb3e

@sonarqubecloud
Copy link
Copy Markdown

@Fortune-Ndlovu
Copy link
Copy Markdown
Member Author

/agentic_review

@rhdh-qodo-merge
Copy link
Copy Markdown

rhdh-qodo-merge Bot commented May 27, 2026

Code review by qodo was updated up to the latest commit 333fd0c

args:
- "psql -h ${POSTGRES_HOST} -p ${POSTGRES_PORT} -U ${POSTGRES_USER} -d {{ .Values.orchestrator.sonataflowPlatform.externalDBName }} -c 'CREATE DATABASE sonataflow;' || echo WARNING: Could not create database"
- |
psql -h ${POSTGRES_HOST} -p ${POSTGRES_PORT} -U ${POSTGRES_USER} -d postgres -c 'CREATE DATABASE sonataflow;' 2>&1 || {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
psql -h ${POSTGRES_HOST} -p ${POSTGRES_PORT} -U ${POSTGRES_USER} -d postgres -c 'CREATE DATABASE sonataflow;' 2>&1 || {
psql -h ${POSTGRES_HOST} -p ${POSTGRES_PORT} -U ${POSTGRES_USER} -d {{ .Values.orchestrator.sonataflowPlatform.externalDBName }} -c 'CREATE DATABASE sonataflow;' 2>&1 || {

- "psql -h ${POSTGRES_HOST} -p ${POSTGRES_PORT} -U ${POSTGRES_USER} -d {{ .Values.orchestrator.sonataflowPlatform.externalDBName }} -c 'CREATE DATABASE sonataflow;' || echo WARNING: Could not create database"
- |
psql -h ${POSTGRES_HOST} -p ${POSTGRES_PORT} -U ${POSTGRES_USER} -d postgres -c 'CREATE DATABASE sonataflow;' 2>&1 || {
if psql -h ${POSTGRES_HOST} -p ${POSTGRES_PORT} -U ${POSTGRES_USER} -d postgres -tc "SELECT 1 FROM pg_database WHERE datname='sonataflow'" | grep -q 1; then
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if psql -h ${POSTGRES_HOST} -p ${POSTGRES_PORT} -U ${POSTGRES_USER} -d postgres -tc "SELECT 1 FROM pg_database WHERE datname='sonataflow'" | grep -q 1; then
if psql -h ${POSTGRES_HOST} -p ${POSTGRES_PORT} -U ${POSTGRES_USER} -d {{ .Values.orchestrator.sonataflowPlatform.externalDBName }} -tc "SELECT 1 FROM pg_database WHERE datname='sonataflow'" | grep -q 1; then

# Note that when this chart is published to https://github.com/openshift-helm-charts/charts
# it will follow the RHDH versioning 1.y.z
version: 5.14.0
version: 5.14.1
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
version: 5.14.1
version: 6.0.0

Let's denote the fact that it is a potentially breaking change.

Comment on lines +91 to +92
"helm.sh/hook": post-install,post-upgrade
"helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with the hooks here is that the helm install/upgrade command will hang by default until the Job is done, which might take some time. I think it would be safer to switch to the previous approach, where you changed the job name (might even be better to include a chart version in the name).
With hindsight, I think it should be fine in case of an upgrade even if the previous Job is still there, as it won't be running anyway.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants