
feat(ci): add unified deployment pipeline with preflight checks and rollback support #1982

Open
cal-id-actions[bot] wants to merge 13 commits into main from pipeline_risk_analysis


cal-id-actions bot commented on May 27, 2026

Summary

Introduces a comprehensive unified deployment pipeline that includes preflight checks, migration validation, rollback mechanisms, and flow hardening to improve deployment reliability and safety.

Changes

  • Added new deployment workflows: deploy-all, rollback, validate-migration
  • Enhanced existing deployment workflows with migration script inclusion and sparse checkout preservation
  • Implemented preflight space and environment checks before deployment
  • Added scripts for lock management, worker draining, migration, rollback, and notification
  • Updated Dockerfiles and entrypoint for deployment compatibility
  • Added extensive documentation for deployment fixes and unified deployment process
  • Hardened migration native builds and validated release SHA before checkout
  • Improved release output handling to avoid secret exposure

Testing Notes

  • Run the full deployment pipeline on a staging environment to verify preflight checks trigger correctly
  • Test migration validation workflow with both valid and invalid migration schemas
  • Perform rollback using the new rollback workflow and confirm system state restoration
  • Validate that deployment logs do not expose secrets
  • Confirm that worker draining and lock management scripts operate as expected during deployment
  • Review documentation for clarity and completeness
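The secret-exposure check in the notes above could be partly automated. The following is a minimal sketch, assuming the deployment log has been saved to a file and the secret values are available to the checking process; the function name `log_leaks_secret` is hypothetical and not part of this PR. Passing secrets as arguments makes them visible in the process list, so this only suits a trusted, short-lived environment.

```shell
#!/usr/bin/env bash
# Sketch only: check a saved deployment log for literal secret values.
# The function name and the way the log is saved are assumptions.

# Usage: log_leaks_secret <log_file> <secret>...
# Returns 0 (leak found) if any non-empty secret appears verbatim in the log,
# and 1 if none of them appear.
log_leaks_secret() {
  local log_file="$1" secret
  shift
  for secret in "$@"; do
    # Skip empty values, which would otherwise match every line.
    [ -n "$secret" ] || continue
    # -F: fixed string, -q: quiet, --: stop option parsing for values starting with "-".
    if grep -Fq -- "$secret" "$log_file"; then
      return 0
    fi
  done
  return 1
}
```

A staging run could then fail the job when `log_leaks_secret deploy.log "$SOME_SECRET"` returns 0, where `deploy.log` and `SOME_SECRET` are placeholder names.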

cal-id-actions bot added the label "main" (Auto PR target: main) on May 27, 2026

greptile-apps bot commented on May 27, 2026

Greptile Summary

This PR introduces a comprehensive unified deployment pipeline for all Cal-ID services (Web, API, Worker), replacing the per-service workflows with a single orchestrated deploy-all.yml. It also adds rollback, migration validation, and notification workflows, along with supporting infra scripts and a Dockerfile security improvement that moves NEXTAUTH_SECRET and CALENDSO_ENCRYPTION_KEY from build ARGs to BuildKit secret mounts.

  • deploy-all.yml: Multi-job pipeline covering preflight, parallel image builds, DB migration, NGINX promotion, worker handoff, auto-rollback on failure, and email notification.
  • rollback.yml: Manual app-only rollback workflow with dry-run mode, ECR image verification, and schema compatibility validation.
  • infra/scripts/: ~2,500 lines of new Bash scripts for locking, state recording, migration, staging, promotion, worker lifecycle, rollback, and notifications.

Confidence Score: 2/5

Not safe to merge — the migrate-db step will abort on every run due to $GITHUB_OUTPUT being unbound inside the SSH script, effectively breaking the entire deployment pipeline before any service is updated.

The migration step's SSH script references $GITHUB_OUTPUT (a GitHub runner environment variable) on the remote EC2 host where it is never set; with set -euo pipefail in force, this causes an immediate 'unbound variable' abort that blocks deploy-api and deploy-web. Separately, the rollback workflow writes IAM credentials permanently to the EC2 host's AWS config and silently validates the wrong codebase version when the target SHA is unavailable.

Files needing attention: .github/workflows/deploy-all.yml (migration output capture in SSH script) and .github/workflows/rollback.yml (credential persistence and schema validation fallback).

Security Review

  • Persistent credentials on EC2 host (rollback.yml lines 177–181): aws configure set writes AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY to ~/.aws/credentials on the remote host and is never cleaned up.
  • eval on secret-controlled input (infra/scripts/migrate.sh line 156): DB_BACKUP_COMMAND is executed via eval, allowing shell metacharacter injection.
  • Dockerfile secret handling improved: NEXTAUTH_SECRET and CALENDSO_ENCRYPTION_KEY correctly moved from build ARGs to BuildKit secret mounts — a positive security change.

Important Files Changed

| Filename | Overview |
| --- | --- |
| .github/workflows/deploy-all.yml | New 1870-line unified deployment pipeline; critical bug where $GITHUB_OUTPUT is used inside an SSH script but never forwarded via envs:, causing migration step failure on every run. |
| .github/workflows/rollback.yml | New manual rollback workflow; AWS credentials persist on EC2 host via aws configure set, and schema validation falls back to branch HEAD silently on SHA lookup failure. |
| infra/scripts/migrate.sh | New migration script with backup, idempotency, and timeout; uses eval for DB_BACKUP_COMMAND. |
| infra/scripts/acquire-lock.sh | New S3 conditional-write lock; dead expires_at variable (set to now instead of expiry) but lock payload is correct. |
| Dockerfile | Refactored to use BuildKit secret mounts for NEXTAUTH_SECRET and CALENDSO_ENCRYPTION_KEY, preventing leakage into image metadata. |
| packages/trpc/server/routers/viewer/webhook/testTrigger.handler.ts | Updated test webhook payload timezone and email to cal.id brand values. |

Reviews (1): Last reviewed commit: "Trigger auto pr" | Re-trigger Greptile

Comment on lines +482 to +527
  run: |
    set -euo pipefail
    if [ "$REBUILD" = "true" ]; then
      echo "image_exists=false" >> "$GITHUB_OUTPUT"
      echo "Skipping image existence check — rebuild requested"
    else
      if aws ecr describe-images \
        --repository-name "$REPO_NAME" \
        --image-ids imageTag="$GIT_SHA" >/dev/null 2>&1; then
        echo "image_exists=true" >> "$GITHUB_OUTPUT"
        echo "Image ${REPO_NAME}:${GIT_SHA} already exists — will skip build"
      else
        echo "image_exists=false" >> "$GITHUB_OUTPUT"
        echo "Image ${REPO_NAME}:${GIT_SHA} not found — will build"
      fi
    fi

- name: Build and push web image
  if: ${{ steps.image-exists.outputs.image_exists != 'true' }}
  id: build
  uses: docker/build-push-action@v5
  with:
    context: .
    file: ./Dockerfile
    platforms: linux/amd64
    push: true
    build-args: |
      NEXT_PUBLIC_GTM_ID=${{ needs.prepare-release.outputs.deploy_env == 'production' && secrets.NEXT_PUBLIC_GTM_ID_PROD || secrets.NEXT_PUBLIC_GTM_ID_STAG }}
      NEXT_PUBLIC_META_WHATSAPP_BUSINESS_APP_ID=${{ secrets.NEXT_PUBLIC_META_WHATSAPP_BUSINESS_APP_ID }}
      NEXT_PUBLIC_META_WHATSAPP_BUSINESS_CONFIG_ID=${{ secrets.NEXT_PUBLIC_META_WHATSAPP_BUSINESS_CONFIG_ID }}
      NEXT_PUBLIC_WEBAPP_URL=${{ format('https://{0}', needs.prepare-release.outputs.deploy_env == 'production' && secrets.DOMAIN_NAME_PROD || secrets.DOMAIN_NAME_STAG) }}
      NEXT_PUBLIC_WEBSITE_URL=${{ format('https://{0}', needs.prepare-release.outputs.deploy_env == 'production' && secrets.DOMAIN_NAME_PROD || secrets.DOMAIN_NAME_STAG) }}
      NEXT_PUBLIC_API_V2_URL=${{ secrets.NEXT_PUBLIC_API_V2_URL }}
      NEXT_PUBLIC_EMBED_LIB_URL=${{ format('https://{0}/embed-link/embed.js', needs.prepare-release.outputs.deploy_env == 'production' && secrets.DOMAIN_NAME_PROD || secrets.DOMAIN_NAME_STAG) }}
      NEXT_PUBLIC_ONEHASH_URL=${{ secrets.NEXT_PUBLIC_ONEHASH_URL }}
      NEXT_PUBLIC_SENDGRID_SENDER_NAME=${{ secrets.NEXT_PUBLIC_SENDGRID_SENDER_NAME }}
      NEXT_PUBLIC_SENTRY_DSN=${{ needs.prepare-release.outputs.deploy_env == 'production' && secrets.NEXT_PUBLIC_SENTRY_DSN_PROD || secrets.NEXT_PUBLIC_SENTRY_DSN_STAG }}
      NEXT_PUBLIC_LOGGER_LEVEL=${{ secrets.NEXT_PUBLIC_LOGGER_LEVEL }}
      NEXT_PUBLIC_TEAM_IMPERSONATION=${{ secrets.NEXT_PUBLIC_TEAM_IMPERSONATION }}
      NEXT_PUBLIC_APP_NAME=${{ secrets.NEXT_PUBLIC_APP_NAME }}
      NEXT_PUBLIC_COMPANY_NAME=${{ secrets.BRAND_NAME }}
      NEXT_PUBLIC_MINUTES_TO_BOOK=${{ secrets.NEXT_PUBLIC_MINUTES_TO_BOOK }}
      NEXT_PUBLIC_BOOKER_NUMBER_OF_DAYS_TO_LOAD=${{ secrets.NEXT_PUBLIC_BOOKER_NUMBER_OF_DAYS_TO_LOAD }}
      NEXT_PUBLIC_CALENDLY_OAUTH_URL=${{ secrets.NEXT_PUBLIC_CALENDLY_OAUTH_URL }}
      NEXT_PUBLIC_CALENDLY_API_BASE_URL=${{ secrets.NEXT_PUBLIC_CALENDLY_API_BASE_URL }}
      NEXT_PUBLIC_CALENDLY_CLIENT_ID=${{ needs.prepare-release.outputs.deploy_env == 'production' && secrets.NEXT_PUBLIC_CALENDLY_CLIENT_ID_PROD || secrets.NEXT_PUBLIC_CALENDLY_CLIENT_ID_STAG }}

P1 $GITHUB_OUTPUT unbound on remote host — migration step always fails

The script: block runs on the EC2 host via appleboy/ssh-action, but GITHUB_OUTPUT is not listed in envs: and therefore is not set on the remote host. With set -euo pipefail active, the -u flag causes bash to abort with GITHUB_OUTPUT: unbound variable as soon as the echo "migrations_applied=..." >> "$GITHUB_OUTPUT" line is reached. Even if GITHUB_OUTPUT were somehow non-empty, it would point to a runner-local file path that does not exist on the EC2 host. Either way, every deployment that performs a migration will fail at this step, blocking deploy-api and deploy-web.
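One possible shape for a fix, sketched under assumptions: the remote script prints a marker line to stdout instead of appending to GITHUB_OUTPUT, and a later runner-side step extracts the value from the captured log and writes the step output there. The marker format `MIGRATIONS_APPLIED=` and both function names are hypothetical, and how the SSH step's output is captured into a file is not shown in this PR.

```shell
#!/usr/bin/env bash
# Sketch only: this part runs on the GitHub runner, not on the EC2 host.
# Assumption: the remote script prints "MIGRATIONS_APPLIED=<value>" to stdout
# and the runner has that output saved in a file.

# Print the value from the last marker line in the given log file.
# Returns non-zero when no marker line is present.
extract_migrations_applied() {
  local log_file="$1" line
  line="$(grep -E '^MIGRATIONS_APPLIED=' "$log_file" | tail -n 1 || true)"
  [ -n "$line" ] || return 1
  printf '%s\n' "${line#MIGRATIONS_APPLIED=}"
}

# Append the step output on the runner, where GITHUB_OUTPUT is defined.
write_step_output() {
  local value="$1"
  printf 'migrations_applied=%s\n' "$value" >> "${GITHUB_OUTPUT:?must run on the runner}"
}
```

Keeping the write to GITHUB_OUTPUT on the runner avoids both problems named in the comment: the variable is bound there, and the file it points to exists.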


Comment on lines +177 to +181
git checkout "$TARGET_SHA" || git checkout "origin/$BRANCH_NAME"
aws configure set aws_access_key_id "$AWS_ACCESS_KEY_ID"
aws configure set aws_secret_access_key "$AWS_SECRET_ACCESS_KEY"
aws configure set default.region "$AWS_REGION"
aws ecr get-login-password --region "$AWS_REGION" \

P1 security AWS credentials written permanently to EC2 host config

aws configure set writes to ~/.aws/credentials on the EC2 host and is never cleaned up by the script. These long-lived IAM key credentials will persist on the instance between deployments and across any other sessions on that host. Prefer injecting credentials as ephemeral environment variables (the environment already contains AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_REGION via envs:) — the AWS CLI picks them up automatically without needing aws configure set, and they vanish when the SSH session ends.
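The environment-variable approach described above might look like the following sketch. It assumes AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_REGION are already present in the SSH session, as the comment states; the wrapper name `with_session_aws_env` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only. Assumption: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_REGION
# are already set in the SSH session (forwarded via envs:).

# Run one command with the credentials exported only inside a subshell,
# so nothing is written under ~/.aws and nothing outlives the call.
with_session_aws_env() {
  (
    # Abort the subshell with a non-zero status if any value is missing.
    : "${AWS_ACCESS_KEY_ID:?missing}" "${AWS_SECRET_ACCESS_KEY:?missing}" "${AWS_REGION:?missing}"
    export AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY
    export AWS_DEFAULT_REGION="$AWS_REGION"
    "$@"
  )
}

# Example (not executed here):
#   with_session_aws_env aws ecr get-login-password --region "$AWS_REGION"
```

Because the exports happen in a subshell, the three `aws configure set` lines can be dropped without replacing them with anything that persists on the host.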


Comment on lines +74 to +75
git fetch origin "$TARGET_SHA" --depth 1 || true
git checkout "$TARGET_SHA" || git checkout "origin/$BRANCH_NAME"

P1 Schema validation silently falls back to wrong codebase version

If the shallow clone cannot resolve TARGET_SHA (e.g. the commit is old enough to be absent from a --depth 1 fetch), git checkout "$TARGET_SHA" fails and the || git checkout "origin/$BRANCH_NAME" fallback silently runs the validation against the branch HEAD — a newer commit. validate-rollback-schema.sh may then report the schema as compatible even when TARGET_SHA's actual migrations would be incompatible. The fallback should be removed and replaced with a hard failure. This pattern appears identically at line 177 (in the rollback-app job).
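A hard-failing replacement for the two lines above could be sketched as follows. The function name `checkout_exact_sha` is hypothetical, and requiring a full 40-character hex SHA is an assumption about how TARGET_SHA is supplied, not something this PR confirms.

```shell
#!/usr/bin/env bash
# Sketch only: check out exactly the requested commit or fail the job.
# The remote name "origin" follows the snippet in the review comment.

checkout_exact_sha() {
  local target_sha="$1" resolved
  # Assumption: the caller passes a full 40-character lowercase hex SHA.
  case "$target_sha" in
    ''|*[!0-9a-f]*) echo "invalid SHA: $target_sha" >&2; return 1 ;;
  esac
  [ "${#target_sha}" -eq 40 ] || { echo "invalid SHA: $target_sha" >&2; return 1; }
  # No "|| true" here: a failed fetch must stop the job.
  git fetch --depth 1 origin "$target_sha" || { echo "cannot fetch $target_sha" >&2; return 1; }
  git checkout --detach "$target_sha" || { echo "cannot check out $target_sha" >&2; return 1; }
  # Confirm the working tree is really at the requested commit.
  resolved="$(git rev-parse HEAD)"
  [ "$resolved" = "$target_sha" ] || { echo "HEAD is $resolved, expected $target_sha" >&2; return 1; }
}
```

With no fallback to `origin/$BRANCH_NAME`, schema validation either runs against the intended commit or does not run at all.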


Comment on lines +69 to +71
expires_at="$(iso_from_epoch "$now_epoch")"
acquired_at="$(iso_from_epoch "$now_epoch")"
new_expires="$(iso_from_epoch "$expires_at_epoch")"

P2 The variable expires_at is assigned iso_from_epoch "$now_epoch" — that is the current time, not the lock expiry time. The jq payload correctly uses $new_expires for the expires_at field, so expires_at is never referenced after this line and the lock object records the right expiry. This is a dead variable that is misleading: it looks like the expiry time but contains "now", making the code harder to audit.

Suggested change
- expires_at="$(iso_from_epoch "$now_epoch")"
- acquired_at="$(iso_from_epoch "$now_epoch")"
- new_expires="$(iso_from_epoch "$expires_at_epoch")"
+ acquired_at="$(iso_from_epoch "$now_epoch")"
+ new_expires="$(iso_from_epoch "$expires_at_epoch")"
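To make the intended relationship between the two timestamps concrete, here is a minimal sketch. `iso_from_epoch` is named in the PR but its body is not shown, so the version below is a stand-in that assumes GNU date with a BSD fallback; `lock_timestamps` and `ttl_seconds` are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch only. Stand-in for the PR's iso_from_epoch helper (real body not shown):
# tries GNU date syntax first, then BSD date syntax.
iso_from_epoch() {
  date -u -d "@$1" +%Y-%m-%dT%H:%M:%SZ 2>/dev/null || date -u -r "$1" +%Y-%m-%dT%H:%M:%SZ
}

# Print acquired_at and expires_at for a lock taken at now_epoch
# that should live for ttl_seconds.
lock_timestamps() {
  local now_epoch="$1" ttl_seconds="$2" expires_at_epoch
  expires_at_epoch=$((now_epoch + ttl_seconds))
  # acquired_at is derived from "now"; only expires_at uses the later epoch.
  printf 'acquired_at=%s\n' "$(iso_from_epoch "$now_epoch")"
  printf 'expires_at=%s\n' "$(iso_from_epoch "$expires_at_epoch")"
}
```

Each timestamp is computed once from the epoch it describes, which leaves no room for a variable named after the expiry but holding the current time.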


Comment thread infra/scripts/migrate.sh
fail "ENABLE_DB_BACKUP=true but DB_BACKUP_COMMAND is not set — cannot run backup"
fi
log INFO "Backup enabled — executing: ${DB_BACKUP_COMMAND}"
eval "$DB_BACKUP_COMMAND" || fail "Backup step failed — aborting migration. Database may be in an inconsistent state."

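P2 eval "$DB_BACKUP_COMMAND" executes the secret value as a shell string, so embedded metacharacters (; rm -rf ..., command substitution $(...), etc.) can run arbitrary commands. DB_BACKUP_COMMAND is a repository secret today, but a safer invocation pattern is still preferable: pass the command as an argument to a pre-approved script, or at minimum use bash -c "$DB_BACKUP_COMMAND". Note that bash -c still parses the string as shell code, so it only keeps the command out of the migration script's own shell state (variables, functions, options); it does not prevent metacharacter injection.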

Suggested change
- eval "$DB_BACKUP_COMMAND" || fail "Backup step failed — aborting migration. Database may be in an inconsistent state."
+ bash -c "$DB_BACKUP_COMMAND" || fail "Backup step failed — aborting migration. Database may be in an inconsistent state."
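The "pre-approved script" option from the comment could be sketched as below. This assumes backups are shipped as executable scripts in a single approved directory and that the secret carries only a script name; neither is confirmed by this PR, and `run_approved_backup` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Sketch only. Assumption: backup scripts live in one approved directory
# and the configured value is a bare file name, not a command line.

# Run "<approved_dir>/<script_name>" directly, without eval or a shell re-parse.
run_approved_backup() {
  local approved_dir="$1" script_name="$2"
  shift 2
  # Accept only plain file names: letters, digits, dot, underscore, hyphen.
  # Names that are empty, start with a dot, or contain anything else are rejected.
  case "$script_name" in
    ''|.*|*[!A-Za-z0-9._-]*) echo "rejected backup script name" >&2; return 1 ;;
  esac
  [ -x "$approved_dir/$script_name" ] || { echo "backup script not found" >&2; return 1; }
  # Remaining arguments are passed as literal words; metacharacters are not interpreted.
  "$approved_dir/$script_name" "$@"
}
```

Unlike eval or bash -c, this never hands the configured value to a shell parser, so a value such as `backup.sh; rm -rf ...` is rejected rather than executed.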


cal-id-actions bot changed the title from "feat(ci): add unified deployment pipeline and rollback workflows" to "chore(ci): unify deployment pipeline and add rollback workflows" on May 27, 2026
cal-id-actions bot changed the title from "chore(ci): unify deployment pipeline and add rollback workflows" to "feat(ci): add unified deployment pipeline with rollback and migration validation" on May 27, 2026
cal-id-actions bot changed the title from "feat(ci): add unified deployment pipeline with rollback and migration validation" to "feat(pipeline): add unified deployment and rollback workflows with risk analysis" on May 27, 2026
cal-id-actions bot changed the title from "feat(pipeline): add unified deployment and rollback workflows with risk analysis" to "feat(ci): unify deployment pipeline with migration and rollback support" on May 27, 2026
Replace sparse checkout with full repository checkout in migrate-db step to
resolve workspace dependency resolution failures (@calcom/lib 404 errors).

Changes:
- migrate-db: clone full repo instead of sparse checkout
- migrate.sh: add DEFER_CLEANUP env support to defer cleanup until downstream stages complete
- deploy-all.yml: add cleanup steps to verify and rollback-after-promotion jobs
- migrate-db: set DEFER_CLEANUP=true to preserve checkout for deploy stages

This ensures all workspace packages resolve correctly during yarn install while
deferring cleanup until after deployment pipeline completes successfully.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
cal-id-actions bot changed the title from "feat(ci): unify deployment pipeline with migration and rollback support" to "feat(ci): unify deployment pipeline with risk analysis and rollback support" on May 27, 2026
cal-id-actions bot changed the title from "feat(ci): unify deployment pipeline with risk analysis and rollback support" to "feat(pipeline): add unified deployment pipeline with rollback and migration validation" on May 27, 2026
cal-id-actions bot changed the title from "feat(pipeline): add unified deployment pipeline with rollback and migration validation" to "feat(deploy): add unified deployment pipeline with migration and rollback support" on May 27, 2026
cal-id-actions bot changed the title from "feat(deploy): add unified deployment pipeline with migration and rollback support" to "feat(deploy): unify and harden deployment pipeline with migration support" on May 27, 2026
cal-id-actions bot changed the title from "feat(deploy): unify and harden deployment pipeline with migration support" to "feat(ci): add unified deployment pipeline with preflight checks and rollback support" on May 29, 2026