
Support for Redis AWS IAM auth#8078

Merged
dotansimha merged 12 commits into
graphql-hive:mainfrom
mish-elle:redis-iam-auth
Jun 28, 2026

Conversation

@mish-elle

@mish-elle mish-elle commented May 26, 2026

Contributor

Closes #8177

Background

Self-hosters running Hive on AWS with AWS ElastiCache Redis currently have no way to use IAM-based authentication for Redis connections, which forces them to use static passwords.

This PR adds opt-in AWS IAM authentication support for ElastiCache Redis across all services that communicate with Redis: api, schema, server, tokens, usage, and workflows. It also adds Redis Cluster mode support.

This PR is part of the following issue. We will open a separate PR for each IAM integration to keep the scope of each PR small.

Description

The ElastiCache IAM token generation logic lives in two new modules inside service-common:

  • service-common/src/iam-aws.ts: Generic AWS SigV4 pre-signed token generation and a reusable periodic token refresh timer with retry/backoff logic. This is necessary because ElastiCache does not have a dedicated signer like MSK or RDS.
  • service-common/src/iam-redis.ts: ElastiCache-specific helpers built on top of iam-aws. Handles token generation (generateIamAuthToken), in-place re-authentication for both standalone and cluster connections (refreshIamAuth), periodic token rotation (startIamTokenRefresh), and credential resolution (resolveRedisCredentials).
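
The description does not spell out the retry/backoff behaviour of the refresh timer. As a rough illustration only, one common shape is capped exponential backoff around the token generation call. The names `backoffDelayMs` and `withRetry`, and the 1 s base and 30 s cap, are assumptions for this sketch and are not taken from iam-aws.ts.

```typescript
// Hypothetical helpers: names and default values are assumptions for illustration.

/** Delay before retry number `attempt` (1-based): base * 2^(attempt-1), capped at maxMs. */
function backoffDelayMs(attempt: number, baseMs = 1000, maxMs = 30000): number {
  const exponential = baseMs * Math.pow(2, Math.max(0, attempt - 1));
  return Math.min(exponential, maxMs);
}

/**
 * Runs `task` up to `maxAttempts` times, sleeping with backoff between failures.
 * `sleep` is injected so callers (and tests) control how waiting happens.
 */
async function withRetry<T>(
  task: () => Promise<T>,
  maxAttempts: number,
  sleep: (ms: number) => Promise<void>,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error;
      // Only wait if another attempt will follow.
      if (attempt < maxAttempts) {
        await sleep(backoffDelayMs(attempt));
      }
    }
  }
  throw lastError;
}
```

In a refresh timer, `task` would be the token generation call, and a failure after the last attempt would be logged while the previous token stays in use until the next tick.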

When Redis IAM auth is enabled, each service:

  1. Generates an initial token at startup using resolveRedisCredentials()
  2. Connects to Redis using the token as the password and the configured Redis username
  3. Starts a background refresh timer startIamTokenRefresh() that rotates the token every ~12 minutes (15-min SigV4 TTL minus 3-min buffer), with jitter to prevent thundering-herd refreshes across instances
  4. On each refresh, issues AUTH commands to re-authenticate active connections (including per-node AUTH for cluster mode)
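
The refresh cadence in step 3 can be sketched as follows. The 15-minute TTL and 3-minute buffer come from the description above; the function name, the 30-second jitter window, and the choice to subtract jitter (so a refresh never happens later than TTL minus buffer) are assumptions for illustration.

```typescript
// Values from the PR description.
const TOKEN_TTL_MS = 15 * 60 * 1000; // SigV4 pre-signed token lifetime
const REFRESH_BUFFER_MS = 3 * 60 * 1000; // refresh this long before expiry

// Assumed jitter window; the actual value is not stated in the description.
const MAX_JITTER_MS = 30 * 1000;

/**
 * Returns the delay until the next token refresh.
 * `random` is injectable so the result is deterministic in tests.
 */
function computeRefreshDelayMs(random: () => number = Math.random): number {
  const base = TOKEN_TTL_MS - REFRESH_BUFFER_MS; // 12 minutes
  const jitter = Math.floor(random() * MAX_JITTER_MS);
  // Subtracting keeps every instance at or below the 12-minute mark
  // while spreading refreshes across instances.
  return base - jitter;
}
```

With `random` returning 0 the delay is exactly 720000 ms (12 minutes); larger random values move the refresh slightly earlier.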

Redis Cluster mode is added as an opt-in feature, enabled with REDIS_CLUSTER_MODE_ENABLED=1. When enabled, services connect using Redis.Cluster from ioredis with dnsLookup configured to pass addresses through directly (required for ElastiCache's DNS-based endpoint resolution).
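
A minimal sketch of what such cluster options could look like. `dnsLookup` and `redisOptions` are real ioredis cluster options; the `buildClusterOptions` helper and the local `ClusterOptionsSketch` type are hypothetical stand-ins so the snippet compiles without ioredis installed.

```typescript
// Structural stand-in for the subset of ioredis cluster options used here.
type DnsLookup = (
  address: string,
  callback: (err: Error | null, resolved: string) => void,
) => void;

interface ClusterOptionsSketch {
  dnsLookup: DnsLookup;
  redisOptions: {
    username?: string;
    password?: string;
    tls?: Record<string, unknown>;
  };
}

// Hypothetical helper, not the PR's actual createRedisClient().
function buildClusterOptions(input: {
  username?: string;
  password?: string;
  tlsEnabled: boolean;
}): ClusterOptionsSketch {
  return {
    // Hand the hostname back unchanged instead of resolving it to an IP.
    dnsLookup: (address, callback) => callback(null, address),
    redisOptions: {
      username: input.username,
      password: input.password,
      // Only add the tls key when TLS is enabled.
      ...(input.tlsEnabled ? { tls: {} } : {}),
    },
  };
}
```

With ioredis available, the result would be passed as the second argument to the cluster constructor, for example `new Redis.Cluster([{ host, port }], buildClusterOptions({ username, password: token, tlsEnabled: true }))`.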

Note that the underlying ioredis library does not support dynamic passwords, so the token refresh timer is a workaround until it does.

New environment variables introduced

| Variable | Required | Description |
| --- | --- | --- |
| AWS_REGION | No | Default AWS region for the service. Used as the fallback for all AWS connections. |
| REDIS_USERNAME | No | Redis Access Control List username. |
| REDIS_CLUSTER_MODE_ENABLED | No | Set to 1 to connect using Redis Cluster mode. |
| REDIS_AWS_IAM_AUTH_ENABLED | No | Set to 1 to enable IAM authentication for Redis. |
| REDIS_AWS_REGION | No | Optional override for the Redis region (defaults to AWS_REGION). |
| REDIS_AWS_IAM_CACHE_NAME | No | ElastiCache cache name used as the hostname for the SigV4 signer. Required when IAM is enabled. |

Environment Variable Validation

When REDIS_AWS_IAM_AUTH_ENABLED=1, the environment validation enforces:

  • REDIS_TLS_ENABLED=1 (ElastiCache IAM requires TLS)
  • REDIS_AWS_IAM_CACHE_NAME must be set
  • REDIS_AWS_REGION or AWS_REGION must be set
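
These three rules can be expressed as a small check. This is a plain-TypeScript sketch; the function name `validateRedisIamEnv` and the error messages are hypothetical, and the PR's real validation in service-common may be structured differently.

```typescript
type Env = Record<string, string | undefined>;

/** Returns a list of validation errors; an empty list means the config is acceptable. */
function validateRedisIamEnv(env: Env): string[] {
  const errors: string[] = [];

  // IAM auth is opt-in, so nothing is enforced unless it is switched on.
  if (env["REDIS_AWS_IAM_AUTH_ENABLED"] !== "1") {
    return errors;
  }
  if (env["REDIS_TLS_ENABLED"] !== "1") {
    errors.push("REDIS_TLS_ENABLED must be 1 when IAM auth is enabled");
  }
  if (!env["REDIS_AWS_IAM_CACHE_NAME"]) {
    errors.push("REDIS_AWS_IAM_CACHE_NAME must be set when IAM auth is enabled");
  }
  // The Redis-specific region wins; AWS_REGION is the fallback.
  if (!env["REDIS_AWS_REGION"] && !env["AWS_REGION"]) {
    errors.push("REDIS_AWS_REGION or AWS_REGION must be set when IAM auth is enabled");
  }
  return errors;
}
```

Collecting all errors instead of throwing on the first one lets a self-hoster fix a misconfigured deployment in a single pass.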

Pnpm-lock file generation

Like MSK, the CI pipelines will fail because of the pnpm-lock file. We're downgrading dependencies and not building a few packages, due to constraints in our environment. Is it possible for a code maintainer to help us generate the pnpm-lock file?

Checklist

  • Input validation
  • Output encoding
  • Authentication management
  • Session management
  • Access control
  • Cryptographic practices
  • Error handling and logging
  • Data protection
  • Communication security
  • System configuration
  • Database security
  • File management
  • Memory management
  • Testing

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces opt-in AWS IAM authentication for ElastiCache Redis connections and supports Redis Cluster mode across multiple services. The reviewer identified critical issues where the PubSub subscriber Redis clients in both the server and workflows services are not registered for IAM token refresh, which will cause subscription failures after the initial 15-minute token expires. Additionally, the reviewer noted that if the AUTH command fails during a refresh, the password is not updated, potentially leading to permanent authentication failures upon reconnection. Lastly, unreferencing the interval timer in startTokenRefreshTimer was suggested to prevent blocking clean process termination during graceful shutdowns.

Comment thread packages/services/server/src/index.ts Outdated
Comment thread packages/services/workflows/src/index.ts Outdated
Comment thread packages/services/service-common/src/iam-redis.ts
Comment thread packages/services/service-common/src/iam-aws.ts Outdated

@n1ru4l n1ru4l left a comment

Contributor

Thank you for the contribution, could you please address the comments I pointed out? Things are getting a bit too much duplicated and I would prefer a central place where we manage these.

Comment thread packages/services/tokens/src/environment.ts
Comment thread packages/services/schema/src/index.ts Outdated
@mish-elle

Contributor Author

Hey @n1ru4l sorry about the delay,

  • All Redis client creation is abstracted into the createRedisClient() function in service-common/redis-client.ts.
  • Redis validation happens in service-common/redis-config-validation.ts.
  • Additionally, while removing all direct ioredis imports, I found an ioredis version mismatch between the version bentocache expects (5.10.1) and the version the repo uses (5.8.2). At runtime this causes connection timeouts, with bentocache falling back to localhost. For now I've added a pnpm override, since I wasn't sure whether you wanted to handle the ioredis upgrade in a separate issue/PR.

@n1ru4l

n1ru4l commented Jun 5, 2026

Contributor

@mish-elle Please feel free to bump the ioredis version to 5.10.1 instead of using the override.

@n1ru4l n1ru4l left a comment

Contributor

Hey @mish-elle, I have some more feedback for you that I would like to see implemented. Aside from these points the implementation looks solid to me. 🙏

Comment thread packages/services/service-common/src/redis-config-validation.ts Outdated
Comment thread packages/services/workflows/src/index.ts Outdated
Comment thread packages/services/service-common/src/redis-client.ts Outdated
Comment thread packages/services/service-common/src/iam-redis.ts Outdated
Comment thread packages/services/service-common/src/redis-client.ts Outdated
Comment thread packages/services/service-common/src/iam-aws.spec.ts
Comment thread packages/services/service-common/src/iam-aws.ts
Comment thread packages/services/service-common/src/iam-redis.spec.ts
Comment thread packages/services/service-common/src/iam-redis.ts Outdated
@mish-elle mish-elle force-pushed the redis-iam-auth branch 2 times, most recently from 42658e7 to 6a59c87 Compare June 8, 2026 15:19
@mish-elle

Contributor Author

> Hey @mish-elle, I have some more feedback for you that I would like to see implemented. Aside from these points the implementation looks solid to me. 🙏

Thanks for the detailed feedback, @n1ru4l! I believe I addressed everything mentioned; please let me know if we need additional changes.

@n1ru4l n1ru4l mentioned this pull request Jun 8, 2026
@n1ru4l n1ru4l self-requested a review June 8, 2026 16:18
@n1ru4l

n1ru4l commented Jun 9, 2026

Contributor

@mish-elle Thank you, we are close to getting this in! 🥳 Could you please have a stab at the typescript and linting issues?

Comment thread packages/services/service-common/src/iam-redis.ts
Comment thread packages/services/service-common/src/iam-aws.ts Outdated
@n1ru4l

n1ru4l commented Jun 10, 2026

Contributor

@mish-elle I added two more comments:

  • one for cluster error handling
  • one alternative proposal for the abort logic

Once these two are addressed, we are good to merge 👍

@mish-elle mish-elle requested a review from n1ru4l June 15, 2026 21:11
@mish-elle

Contributor Author

Actually there seems to be a minor issue with the subscriber reconnecting with a stale token, causing WRONG user/pass pair error and the container crashes, looking into it 🤔

@mish-elle

mish-elle commented Jun 17, 2026

Contributor Author

> Actually there seems to be a minor issue with the subscriber reconnecting with a stale token, causing WRONG user/pass pair error and the container crashes, looking into it 🤔

Cluster.nodes('all') delegates to ConnectionPool.getNodes(), which only returns pool nodes discovered via CLUSTER SLOTS. It never includes the ClusterSubscriber's internal Redis instance, which is created separately with password: options.password (a value copy) and stored in this.subscriber.

This means refreshIamAuth's loop over nodes('all') never updates the subscriber's password, so after a disconnect it reconnects with the stale startup token -> WRONGPASS -> crash. We'll add getClusterSubscriberInstance() to access the internal subscriber and update its options.password directly (since AUTH can't be sent to a connection in SUBSCRIBE mode).
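
A minimal sketch of the refresh order described above, using structural stand-ins instead of real ioredis clients. `refreshAuthSketch` and the `RedisLike` and `HasPassword` types are hypothetical, and the subscriber is passed in as a parameter rather than obtained through the planned getClusterSubscriberInstance(). The stored password is updated on every connection before any AUTH is sent, so a later reconnect uses the fresh token even if AUTH fails.

```typescript
interface HasPassword {
  options: { password?: string };
}

// Mirrors the two ioredis members this sketch relies on: options.password and auth().
interface RedisLike extends HasPassword {
  auth(username: string, password: string): Promise<unknown>;
}

/** Returns the AUTH errors that occurred; an empty list means all nodes re-authenticated. */
async function refreshAuthSketch(
  nodes: RedisLike[],
  subscriber: HasPassword | null,
  username: string,
  token: string,
): Promise<Error[]> {
  // 1. Update the password used on reconnect, for pool nodes and the subscriber.
  for (const node of nodes) {
    node.options.password = token;
  }
  if (subscriber) {
    // No AUTH here: the connection may be in subscriber mode.
    subscriber.options.password = token;
  }

  // 2. Re-authenticate live connections; one failing node does not block the rest.
  const failures: Error[] = [];
  for (const node of nodes) {
    try {
      await node.auth(username, token);
    } catch (error) {
      failures.push(error instanceof Error ? error : new Error(String(error)));
    }
  }
  return failures;
}
```

Returning the failures rather than throwing leaves the logging and retry decision to the caller, for example the refresh timer.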

@dotansimha

Member

Thanks @mish-elle ! I approved the PR, it seems like it needs a slight rebase and then we can run ci and merge :)

mish-elle and others added 10 commits June 26, 2026 18:42
Add opt-in AWS IAM authentication for ElastiCache Redis connections and Redis Cluster mode support. When IAM is enabled, services authenticate to Redis using short-lived SigV4 pre-signed tokens instead of static passwords, with automatic token refresh before expiry.

New environment variables:

- REDIS_AWS_IAM_AUTH_ENABLED: enable IAM authentication for Redis

- REDIS_AWS_IAM_CACHE_NAME: ElastiCache cache instance name for the signer

- REDIS_AWS_REGION: optional override for the Redis region

- REDIS_CLUSTER_MODE_ENABLED: enable Redis Cluster mode

- REDIS_USERNAME: optional Redis username for ACL-based authentication
- Fix refreshIamAuth to set password BEFORE AUTH call (prevents auth failures)
- Add timer initialization for pubsub Redis client
- Enhance test coverage with unhappy paths and organized test structure
- Improve JSDoc comments for AWS IAM interfaces and functions
- Add IAM authentication support for AWS-managed Redis
- Refactor redis-config-validation to redis-config with enhanced schema
- Update all services to use centralized Redis config
- Add ClickHouse and feature flags support to workflows
- Implement tracing configuration across services
@dotansimha

Member

Thanks @mish-elle , I ran the full ci checks in #8177 , once cleared, i'll merge this one :)

@dotansimha dotansimha merged commit bd6cce7 into graphql-hive:main Jun 28, 2026
68 of 70 checks passed
3 participants