playbook: add pickupFirst mode for shared-certificate distribution (TPP) #650

Open
jmeldrum76 wants to merge 1 commit into Venafi:master from jmeldrum76:add-pickup-first-mode

@jmeldrum76

Add pickupFirst mode to vcert playbook for shared-certificate distribution (TPP)

BUSINESS PROBLEM

Many customers operate the "one cert, many endpoints" pattern: a single TLS certificate (often a wildcard) is installed on dozens to hundreds of heterogeneous endpoints — Apache servers, NGINX, F5/NetScaler load balancers, Imperva, etc. — all serving the same FQDN(s). When the cert is renewed in TPP (manually via Aperture, automatically via a renewal policy, or via vcert on a designated leader host), every follower needs to install that exact same cert + key during its own maintenance window, which may be days or weeks after the renewal happens.

vcert's current playbook (vcert run -f apache.yaml) is built around the assumption that the host running the playbook owns the enrollment — it always tries to enroll / renew through the request block. That means:

  • On a shared-wildcard scenario, every follower host running the playbook would attempt to enroll its own cert against TPP, each generating its own keypair. That's the opposite of "one wildcard everywhere."
  • Followers can't simply track the leader's renewal — the playbook has no mode for "fetch whatever cert TPP currently has at this object DN and install it locally if it's different."
  • Operators end up writing custom shell wrappers around vcert pickup to bridge this gap. We did exactly this for our customer: ~250 lines of bash that drive vcert pickup, compare thumbprints, and decide whether to install or defer to the existing renewal path.

Business impact: every customer with shared / wildcard certs across multiple endpoints either accepts staggered-renewal pain, builds bespoke distribution scripts, or pushes the cert manually. The pattern is common enough that vcert should support it natively.

PROPOSED SOLUTION

Add an opt-in pickupFirst mode to the playbook request block. With one new boolean field (and one optional override), a follower host's playbook becomes self-healing: each run converges on whatever certificate the platform currently holds at a given cert object.

certificateTasks:
  - name: apache-cert
    renewBefore: 30d
    request:
      csr: service
      pickupFirst: true                    # ← new, default false
      pickupId: '\VED\Policy\...\cert-dn'  # ← new, optional; defaults to zone\CN
      zone: '\VED\Policy\Demo\Apache'
      subject:
        commonName: 'shared.example.com'
        ...
    installations:
      - format: PEM
        file: /etc/pki/tls/certs/apache.crt
        keyFile: /etc/pki/tls/private/apache.key
        chainFile: /etc/pki/tls/certs/apache-chain.crt
        afterInstallAction: "systemctl reload httpd"
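
The `pickupId` default noted in the YAML comment above (`zone\CN`) could be resolved roughly as follows. This is a minimal sketch, not the code in the patch: the `Request` struct is a hypothetical subset of the playbook request, `effectivePickupID` is an illustrative helper name, and plain string joining of zone and common name is an assumption about how the default DN is built.

```go
package main

import (
	"fmt"
	"strings"
)

// Request is a hypothetical subset of the playbook request block,
// limited to the fields discussed in this PR description.
type Request struct {
	PickupFirst bool   `yaml:"pickupFirst,omitempty"`
	PickupID    string `yaml:"pickupId,omitempty"`
	Zone        string `yaml:"zone,omitempty"`
	CommonName  string `yaml:"commonName,omitempty"`
}

// effectivePickupID returns the explicit pickupId when one is set and
// otherwise falls back to zone\CN (assumed to be a simple join).
func effectivePickupID(r Request) string {
	if r.PickupID != "" {
		return r.PickupID
	}
	return strings.TrimRight(r.Zone, `\`) + `\` + r.CommonName
}

func main() {
	r := Request{
		PickupFirst: true,
		Zone:        `\VED\Policy\Demo\Apache`,
		CommonName:  "shared.example.com",
	}
	fmt.Println(effectivePickupID(r)) // \VED\Policy\Demo\Apache\shared.example.com
}
```

An explicit `pickupId` is useful when the shared cert object lives in a different folder from the zone a follower would otherwise enroll into.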

When pickupFirst: true:

  1. Locate: (TPP) RetrieveCertificateMetaData(dn), one cheap GET that returns thumbprint + ValidTo with no PEM / key payload.
  2. Compare the result against the installed cert's SHA-1 thumbprint and NotAfter.
  3. Decide:
| State | Action |
| --- | --- |
| Thumbprint matches installed | Defer to the existing renewBefore window check (normal playbook flow takes over). |
| Platform cert is newer | Full RetrieveCertificate for cert + chain + key, install at the playbook's paths via the existing installer chain, run afterInstallAction. No enrollment. |
| Platform cert is older than installed | Log "refusing downgrade", exit cleanly. |
| Platform cert not found | Fall through to the existing enroll flow (handles initial enrollment naturally). |

Backwards compatibility: absent pickupFirst (or pickupFirst: false), the playbook behaves byte-identically to today. Existing customer playbooks are unaffected.

Architectural notes from a working prototype

  • Implemented as a single new file pkg/playbook/app/service/pickup_first.go (133 lines) plus three small public helpers in vcertutil and installer. The patch is purely additive: zero deletions, zero modifications to existing logic. The new field defaults make every untouched code path identical to current behavior.
  • Hot path (thumbprint match) is ~50 ms on TPP — much cheaper than a full pickup. Doesn't touch the private-key vault. Doesn't exercise PKCS#8 decryption. Scales to any tenant size because RetrieveCertificateMetaData is O(1) by DN.
  • Reuses every existing component: runInstaller, CreateX509Cert (handles PKCS#8 encrypted-key decryption), afterInstallAction, backup / rollback. No new installer code.
  • The "platform older than installed" path is a genuine safety win — it prevents accidental downgrades when an admin imports an older cert into TPP by mistake.

Diffstat against v5.13.2

 pkg/playbook/app/domain/playbookRequest.go |   2 +
 pkg/playbook/app/installer/crypto.go       |   4 +
 pkg/playbook/app/service/pickup_first.go   | 133 +++++++++++++++++++++++++++
 pkg/playbook/app/service/service.go        |   8 ++
 pkg/playbook/app/vcertutil/vcertutil.go    | 143 +++++++++++++++++++++++++++++
 5 files changed, 290 insertions(+)

Scope for v1: TPP only

VCP support would require a different locator strategy. Its cert-object model is fundamentally different:

  • Multiple cert lineages per CN can coexist.
  • versionType (CURRENT / OLD) and certificateStatus (ACTIVE / RETIRED) are independent state machines.
  • managedCertificateId (the lineage identifier) is not currently a server-side searchable field.

The proposed implementation silently no-ops on non-TPP backends so VCP / Firefly / NGTS playbooks see zero behavior change and zero error noise. VCP-native support is a clean follow-up issue once the locator abstraction lands.

CURRENT ALTERNATIVES

For a customer in production today, we have evaluated or are using all of the following:

  1. Bespoke shell wrapper around vcert pickup and vcert run. Reads install paths and renewBefore from the playbook YAML, drives the four-branch decision tree (newer pickup → install / match in window → renew / match outside window → no-op / nothing in TPP → initial enroll), handles PKCS#8 key decryption before write (because Apache without SSLPassPhraseDialog can't load encrypted keys), filters stderr noise, writes timestamped backups. Roughly 250 lines of bash that every customer in this situation ends up writing variants of.
  2. vcert pickup driven by cron with custom diffing. Same pattern, different language.
  3. Cert pushed manually out of band (rsync from a leader, configuration-management drift). Skips vcert entirely; the cert object in TPP becomes informational rather than authoritative.
  4. Accept staggered downtime: run vcert run --force-renew on every host in a coordinated maintenance window, even though only one of them actually needs to enroll.

All four approaches reinvent the same logic and put the burden on the operator. Native support in vcert would replace all of them with one YAML flag.

VENAFI EXPERIENCE

  • Working with vcert v5 (currently v5.12.3 in the customer environment; verified that the proposed implementation also compiles and passes tests against v5.13.2 / master). Daily use of the playbook engine, vcert pickup, vcert run, and the standalone vcert enroll / vcert renew commands.
  • TPP customer for several years; both interactive Aperture use and API-driven via vcert. Mix of enrollment patterns: user-provided CSR, service-generated, mixed key-retrieval policies across folders.
  • Have prototyped this feature end-to-end on a live TPP lab and verified all seven decision scenarios:
    • backwards-compat (no pickupFirst field)
    • hot-path match
    • install-newer
    • refuse-downgrade
    • in-renew-window-defer-to-enroll
    • initial-enroll
    • VCP-silent-noop

A working prototype patch (pickupFirst.patch) is attached. Five files, +290 lines, zero deletions, zero modifications to existing code paths. Apply with git apply pickupFirst.patch from the vcert repo root.

Adds an opt-in `pickupFirst: true` field on the playbook `request:` block.
When enabled (TPP only for v1), `vcert run` queries the cert object's
current metadata first and installs whatever the platform holds rather
than enrolling a new cert on every follower host. This matches the
common "one cert, many endpoints" pattern (wildcards, load-balancer
pools) where one team renews centrally and many followers need to
converge to the same cert+key on their own maintenance windows.

Decision flow on each run:
  - locate (TPP RetrieveCertificateMetaData) - cheap O(1) metadata GET
  - thumbprint matches installed       -> defer to renewBefore check
  - platform cert newer than installed -> full pickup + install, no enroll
  - platform cert older than installed -> refuse downgrade (safety guard)
  - platform cert not found            -> fall through to existing enroll

The change is purely additive: 5 files, +290 lines, 0 deletions, 0
modifications to existing logic. Existing playbooks without
`pickupFirst` are byte-identical to current behavior. On VCP/Firefly/
NGTS the feature silently no-ops (ErrLocateNotSupported); VCP-native
support is a planned follow-up that needs a different locator strategy
(cert-object DN model differs).

Files:
  - pkg/playbook/app/domain/playbookRequest.go: PickupFirst, PickupID fields
  - pkg/playbook/app/vcertutil/vcertutil.go:    LocateLatestCN, locateTPP,
                                                 PickupCertificateByLocator
  - pkg/playbook/app/installer/crypto.go:       LoadInstalledPEM (export)
  - pkg/playbook/app/service/pickup_first.go:   orchestrator (new file)
  - pkg/playbook/app/service/service.go:        Execute() hook

Verified end-to-end against a live TPP lab across seven scenarios:
backwards-compat / hot-path match / install-newer-pickup / refuse-
downgrade / in-renew-window-defer-to-enroll / initial-enroll / non-TPP
silent-noop.

Signed-off-by: Jeremy Meldrum <21229220+jmeldrum76@users.noreply.github.com>
@jmeldrum76
Author

For context: I'm submitting this as a Venafi (CyberArk) colleague — happy to follow whatever internal review process the playbook engine maintainers want before this gets merged. The commit is from my personal GitHub identity but the work is internal to the company. Let me know if there's an internal Jira/design-doc step I should do, a reviewer to ping, or anything else (README-PLAYBOOK.md update, tests, CHANGELOG) you'd like added to this PR before review. Happy to push follow-up commits on the same branch.
