feat: implemented announce key reconciliation#143

Merged
NamanBalaji merged 4 commits into main from feat/announce-key-agent on Jun 5, 2026

Conversation

@NamanBalaji NamanBalaji commented May 26, 2026

No description provided.

@push-tags-from-workflow (bot) added the dependencies, tests, and feature labels May 26, 2026
@NamanBalaji NamanBalaji force-pushed the feat/announce-key-agent branch from f46c40c to 6b83fc5 Compare May 26, 2026 06:19
@NamanBalaji NamanBalaji self-assigned this May 26, 2026

@fabenan-f fabenan-f left a comment

I like the end-to-end approach; just a few remarks from my side.

Comment thread cmd/agent/main.go Outdated
func setupOperator(ctx context.Context) (*sql.DB, *rpc.Server, *orbital.Operator) {
dsn := os.Getenv("AGENT_DATABASE_URL")
if dsn == "" {
log.Println("AGENT_DATABASE_URL not set, operator disabled")

I know this is only an example agent, but I still wouldn't allow an agent to run without an operator.

}
}

// awaitKeyExists polls the keys table until a key with the given ID and tenant exists.

We can bring the tests even closer to true integration tests by using the gRPC client for key assertions.

Comment thread internal/agent/handler/announcekey.go Outdated
@@ -0,0 +1,42 @@
package handler

Imho having handler.NewAnnounceKey and announcekey.NewHandler is confusing. I think we should consolidate these into a single handler package that contains both the job handler (manager side) and the request handler (operator side). This would be beneficial because:

  1. Both handlers share common data structures
  2. Their interconnection would be immediately visible
  3. It would eliminate the current naming ambiguity

Comment thread internal/agent/handler/announcekey.go Outdated
key.State = model.KeyStatePreActivation

if _, err := keyStore.CreateKey(ctx, store.CreateKeyQuery{Key: key}); err != nil {
resp.Fail(fmt.Sprintf("store key: %v", err))

The default case should be to continue, only known terminal errors (like the one in the test case) should fail

Comment thread internal/keylifecycle/keylifecycle.go Outdated
model.KeyStateDestroyed: {},
model.KeyStateActive: {},
model.KeyStateCompromised: {},
model.KeyStateAnnounceFailed: {},

If we introduce a new state here, we'll deviate from the official NIST lifecycle definition. I'm uncertain whether this is advisable from either a compliance or signaling standpoint

Correct, we should track processing in a separate state.
As discussed, we should add a new field called KeyProcessingState, and the existing KeyState should be renamed to KeyLifeCycleState:

type Key struct {
	ID                 string             `json:"id"`
	Name               string             `json:"name"`
	TenantID           string             `json:"tenant_id"`
	Kind               KeyKind            `json:"kind"`
	ParentID           *string            `json:"parent_id"`
	ManagedBy          string             `json:"managed_by"`
	Labels             Labels             `json:"labels"`
	KeyLifeCycleState  KeyLifeCycleState  `json:"key_lifecycle_state"`
	KeyProcessingState KeyProcessingState `json:"key_processing_state"`
	CreatedAt          clock.UnixNano     `json:"created_at"`
	UpdatedAt          clock.UnixNano     `json:"updated_at"`
}

type KeyProcessingState struct {
	Status string `json:"status"`
	JobID  string `json:"job_id,omitempty"` 
}

Here JobID identifies the job that currently holds the lock on the key, which will be useful when we do a large key rotation.
But maybe we can start without it and extend it later.

Credit to @apatsap as well

)

type fakeKeyStore struct {
keys map[string]*model.Key

I'd rather use the real sql store implementation (unhappy paths can still be mocked with wrapper functions)

Comment thread pkg/api/v1/proto/admin/key_service.go Outdated
)
}

job := orbital.NewJob(announcekey.JobType, data).WithExternalID(key.ID)

Since key.ID is generated internally, a lost response could cause the client to retry, resulting in duplicate jobs performing the same action
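
To illustrate the concern, here is a sketch that derives the external ID from fields the client supplies and that stay the same across retries, so a repeated request maps to the job that already exists. The jobRegistry type, the externalID helper, and the tenant/name format are assumptions made for this example; they are not orbital's API.

```go
package main

import (
	"fmt"
	"sync"
)

// jobRegistry is an in-memory stand-in for a job store that enforces
// uniqueness on the external ID.
type jobRegistry struct {
	mu   sync.Mutex
	jobs map[string]string // external ID -> job ID
	next int
}

func newJobRegistry() *jobRegistry {
	return &jobRegistry{jobs: make(map[string]string)}
}

// submit creates a job for externalID, or returns the existing job ID
// when the same external ID was submitted before.
func (r *jobRegistry) submit(externalID string) (jobID string, created bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if id, ok := r.jobs[externalID]; ok {
		return id, false
	}
	r.next++
	id := fmt.Sprintf("job-%d", r.next)
	r.jobs[externalID] = id
	return id, true
}

// externalID is built only from client-supplied values, unlike an
// internally generated key ID that changes on every retry.
func externalID(tenantID, keyName string) string {
	return tenantID + "/" + keyName
}

func main() {
	reg := newJobRegistry()
	first, created := reg.submit(externalID("tenant-a", "signing-key"))
	fmt.Println(first, created)
	retry, createdAgain := reg.submit(externalID("tenant-a", "signing-key"))
	fmt.Println(retry, createdAgain)
}
```

With an internally generated ID the second call would carry a different external ID and create a second job for the same action.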


@jithinkunjachan jithinkunjachan left a comment

Nice job 👍🏽, just a few things.

Comment thread internal/agent/handler/announcekey.go Outdated
parentID = &data.ParentID
}

key := model.NewKey(data.TenantID, data.Name, data.Kind, parentID, data.Target, data.Labels)

@jithinkunjachan jithinkunjachan May 28, 2026

Not for this PR, but just a question: how will the tenant info be propagated to the agent DB, given that we have DB constraints? We might need to think about this topic later.

@NamanBalaji NamanBalaji force-pushed the feat/announce-key-agent branch from 6b83fc5 to 7ebc5cd Compare June 2, 2026 07:51
Signed-off-by: Naman Balaji <namanb487@gmail.com>
Signed-off-by: Naman Balaji <namanb487@gmail.com>
@NamanBalaji NamanBalaji force-pushed the feat/announce-key-agent branch from 7ebc5cd to 89d7a65 Compare June 2, 2026 07:54
Signed-off-by: Naman Balaji <namanb487@gmail.com>
Comment thread cmd/agent/main.go
defer agentDB.Close()

go func() {
if err := operator.ListenAndRespond(ctx); err != nil {

Can we add a log.Info before ListenAndRespond?

Comment thread examples/root.config.yaml

Can we put this in /examples/?

Comment thread internal/handler/announcekey/job.go Outdated
return orbital.CancelJobConfirmer(fmt.Sprintf("invalid job data: %v", err)), nil
}

_, err := h.keyStore.GetKeyByID(ctx, data.KeyID, data.TenantID)

We need to check that the current key state allows the target state, using keylifecycle.ValidateTransition.

Contributor Author

IMO keylifecycle.ValidateTransition is meant for checking state transitions. In this case we only create a key, and it stays in the pre-activation state for the whole announce operation, so effectively no transition takes place here.

// TaskData is the payload exchanged between the root job handler and the
// agent task handler. It is JSON-encoded into the orbital Job/Task data
// field.
type TaskData struct {

For now it's alright. But let's add a comment here noting that, when we add the next handler, certain parts of the data (e.g. Target, Labels, TenantID) might need to be refactored into something like common.TaskInfo.

key.ID = data.KeyID
key.LifeCycleState = model.KeyLifeCyclePreActivation

err := keyStore.CreateKey(ctx, key)

We need to make sure the key referenced by parentID exists and, from a lifecycle perspective, can be used to announce this key.

Contributor Author

Will be done as part of a subsequent PR

if data.ParentID != "" {
parentID = &data.ParentID
}

We need to check that data.TenantID exists.

Contributor Author

Will be done as part of a subsequent PR

@@ -37,16 +61,97 @@ func (s *KeyService) AnnounceKey(ctx context.Context, req *AnnounceKeyRequest) (
req.GetLabels(),
)

We need to check:

  • tenantID exists
  • Parent exists within the same tenant
  • If no parent exists, we need to make sure the key is a root key
  • If a parent exists, we need to check that the parent:
    • is in an allowed key lifecycle state
    • allows the keyKind of the new key to be attached to it according to the hierarchy definition

Contributor Author

For now, we only allow the active state and the new key being generated from its direct parent.

Comment thread pkg/api/v1/proto/admin/key_service.go Outdated
)

if err := s.keyStore.CreateKey(ctx, key); err != nil {
key, err := s.upsertKey(ctx, newKey)

For the double commit to work, we should create the job first and then check in the jobConfirm func whether the key exists and can be activated.

Contributor Author

Good catch

Comment thread pkg/api/v1/proto/admin/key_service.go Outdated
}

// If a job is already linked, the caller is retrying — return as-is.
if key.KeyProcessingState.JobID != "" {

What if the previous job failed (i.e. the key already has a job ID) and this call's intention is to retry the job?

Contributor Author

We look up the key by name. If we find a key in a failed state, we create a new job and use the old job ID concatenated with the key ID as its external ID. This is effectively our retry job, and using the old job ID deduplicates it.
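
A rough sketch of the retry rule described in this reply, under assumptions: the status value "failed", the ":" separator, and the function name externalIDFor are invented for the example, and the first-attempt case reuses the key ID as in the earlier WithExternalID(key.ID) call.

```go
package main

import "fmt"

// processingState mirrors the KeyProcessingState fields proposed in the review.
type processingState struct {
	Status string
	JobID  string
}

// statusFailed is a hypothetical status value for this sketch.
const statusFailed = "failed"

// externalIDFor decides whether a new job should be created and, if so,
// which external ID it gets.
//   - no job linked yet: first attempt, use the key ID
//   - linked job failed: retry, combine the old job ID with the key ID
//   - otherwise: a job is already in flight, create nothing
func externalIDFor(keyID string, ps processingState) (id string, create bool) {
	switch {
	case ps.JobID == "":
		return keyID, true
	case ps.Status == statusFailed:
		return ps.JobID + ":" + keyID, true
	default:
		return "", false
	}
}

func main() {
	id, create := externalIDFor("key-1", processingState{})
	fmt.Println(id, create)
	id, create = externalIDFor("key-1", processingState{Status: statusFailed, JobID: "job-7"})
	fmt.Println(id, create)
	id, create = externalIDFor("key-1", processingState{Status: "processing", JobID: "job-7"})
	fmt.Println(id, create)
}
```

Because the retry ID depends only on the failed job and the key, two concurrent retries of the same failure produce the same external ID, so a uniqueness check on that ID can reject the second one.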

Signed-off-by: Naman Balaji <namanb487@gmail.com>
@NamanBalaji NamanBalaji force-pushed the feat/announce-key-agent branch from 0917ea9 to 9d68fe0 Compare June 4, 2026 09:01
@NamanBalaji NamanBalaji requested a review from apatsap June 4, 2026 09:11

@fabenan-f fabenan-f left a comment

The uniqueness constraint on key.Name can detect concurrent jobs when used as an external ID. This might let us remove some of the conditional logic we currently have in the double commit. But the current approach should work as well

if err != nil {
if errors.Is(err, store.ErrKeyNotFound) {
// ConfirmJob is idempotent and orbital will eventually
// time out the job if the key never lands.

Jobs do not time out in the confirming phase but we can think of something in the future

return nil, vErr
}

existing, lookupErr := s.keyStore.GetKeyByName(ctx, store.GetKeyByNameQuery{

We could be more optimistic and try to create a key first and handle a potential existing key error later, but nothing that needs to be done now
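
The optimistic variant could look roughly like the sketch below: attempt the insert first and fall back to a lookup only when the store reports a conflict. memStore, errKeyExists, and createOrGet are hypothetical names; in a SQL-backed store the conflict would presumably come from the uniqueness constraint on the key name.

```go
package main

import (
	"errors"
	"fmt"
)

// errKeyExists is a hypothetical conflict error returned by the store.
var errKeyExists = errors.New("key already exists")

// memStore is an in-memory stand-in for the key store, unique by name.
type memStore struct {
	byName map[string]string // key name -> key ID
}

func (s *memStore) createKey(name, id string) error {
	if _, ok := s.byName[name]; ok {
		return errKeyExists
	}
	s.byName[name] = id
	return nil
}

func (s *memStore) getKeyByName(name string) (string, bool) {
	id, ok := s.byName[name]
	return id, ok
}

// createOrGet tries the insert first and only looks up the existing key
// when the insert reports a conflict.
func createOrGet(s *memStore, name, id string) (keyID string, created bool, err error) {
	err = s.createKey(name, id)
	switch {
	case err == nil:
		return id, true, nil
	case errors.Is(err, errKeyExists):
		existing, ok := s.getKeyByName(name)
		if !ok {
			return "", false, fmt.Errorf("key %q vanished after conflict: %w", name, err)
		}
		return existing, false, nil
	default:
		return "", false, err
	}
}

func main() {
	s := &memStore{byName: make(map[string]string)}
	id, created, err := createOrGet(s, "signing-key", "key-1")
	fmt.Println(id, created, err)
	id, created, err = createOrGet(s, "signing-key", "key-2")
	fmt.Println(id, created, err)
}
```

In the common case this costs a single store round trip, and it removes the window between a lookup and a later insert in which another request could create the same key.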


@jithinkunjachan jithinkunjachan left a comment

Nicely done ✅

@NamanBalaji NamanBalaji merged commit 45b4343 into main Jun 5, 2026
5 checks passed
@NamanBalaji NamanBalaji deleted the feat/announce-key-agent branch June 5, 2026 07:56
Labels

dependencies, feature, tests

4 participants