Skip to content

Add Ray cluster log persistence (proto + collector sidecar + log_url)#1124

Open
dragod812 wants to merge 1 commit intomainfrom
stack/raycluster-history-server/proto-log-persistence
Open

Add Ray cluster log persistence (proto + collector sidecar + log_url)#1124
dragod812 wants to merge 1 commit intomainfrom
stack/raycluster-history-server/proto-log-persistence

Conversation

@dragod812
Copy link
Copy Markdown
Contributor

@dragod812 dragod812 commented Apr 28, 2026

What type of PR is this? (check all applicable)

  • Refactor
  • Feature
  • Bug Fix
  • Optimization
  • Documentation Update

What changed?
Wires the KubeRay History Server log-collection path end-to-end on the
controllermgr side. KubeRay History Server reference:
https://github.com/ray-project/kuberay/tree/master/historyserver

  • Proto: add log_url to RayClusterStatus so callers can navigate
    to persisted logs after a cluster is gone.
  • Mapper (go/components/jobs/client/k8sengine):
    • Load LogPersistenceConfig via FX from jobs.k8sengine.mapper.logPersistence.
    • injectCollectorSidecar() — inject the KubeRay collector sidecar into
      head + worker pod templates (env-var S3 config, no ConfigMap),
      reuse an existing /tmp/ray volume when present, set
      ImagePullPolicy: IfNotPresent, add the postStart raylet-node-id hook.
    • Wire injection into mapRayCluster() when enabled: true.
    • Compute log_url in MapLocalClusterStatusToGlobal() — the mapper
      is the single source of truth (it has the config, the local cluster
      name, and knows the compute-cluster Ray namespace).
      Format: {LogUrlBaseUrl}/{Bucket}/{PathPrefix}/{cluster}_{rayLocalNamespace}/
  • Controller (go/components/ray/cluster): applyRayClusterStatus()
    copies clusterStatus.Ray.LogUrl through to RayCluster.Status.LogUrl.
    Controller intentionally owns no log-persistence config or formula.
  • Defaults (go/cmd/controllermgr/config/base.yaml): sandbox MinIO
    defaults under jobs.k8sengine.mapper.logPersistence. Production
    overlays flip these.

Why?
The collector sidecar streams Ray cluster logs into MinIO/S3 so the
KubeRay History Server can replay the Ray Dashboard for clusters that
are already torn down. log_url lets uniflow / Pipeline CR expose a
direct browse link.

How did you test it?
Local testing in the OSS k3d sandbox (next PR in the stack brings up
History Server + MinIO + sandbox wiring). Unit tests cover sidecar
injection (env vars, lifecycle preservation, config knobs) and the
controller log_url pass-through.

Potential risks

  • Requires kuberay-collector image to be published to GHCR before
    rollout beyond the sandbox.
  • enabled: false in config is the safe rollback — no sidecar gets
    injected and no log_url is written.

Documentation Changes
Follow-up PR: expose log_url through uniflow onto the Pipeline CR.

@CLAassistant
Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@github-actions
Copy link
Copy Markdown

🔍 Go Lint & TODO Tracking Results

⚠️ golangci-lint: Issues detected

level=warning msg="[config_reader] The configuration option `run.skip-dirs` is deprecated, please use `issues.exclude-dirs`."
level=warning msg="[config_reader] The configuration option `output.format` is deprecated, please use `output.formats`"
Error: can't load config: the Go language version (go1.23) used to build golangci-lint is lower than the targeted Go version (1.24.3)
Failed executing command with error: can't load config: the Go language version (go1.23) used to build golangci-lint is lower than the targeted Go version (1.24.3)

✅ TODO Tracking: All TODOs properly linked to issues

},
{
Name: "RAY_DASHBOARD_AGGREGATOR_AGENT_EVENTS_EXPORT_ADDR",
Value: fmt.Sprintf("http://localhost:%d/events", collectorPort),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we put the localhost into a configurable prefix?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't need to make it configurable, this is inside the sidecar and the Ray container and the collector share the network, so localhost is the only correct address. I can extract the string into a constant though.

@dragod812 dragod812 force-pushed the stack/raycluster-history-server/proto-log-persistence branch from b616436 to 292d38b Compare April 30, 2026 15:19
@github-actions
Copy link
Copy Markdown

🔍 Go Lint & TODO Tracking Results

⚠️ golangci-lint: Issues detected

level=warning msg="[config_reader] The configuration option `run.skip-dirs` is deprecated, please use `issues.exclude-dirs`."
level=warning msg="[config_reader] The configuration option `output.format` is deprecated, please use `output.formats`"
Error: can't load config: the Go language version (go1.23) used to build golangci-lint is lower than the targeted Go version (1.24.3)
Failed executing command with error: can't load config: the Go language version (go1.23) used to build golangci-lint is lower than the targeted Go version (1.24.3)

✅ TODO Tracking: All TODOs properly linked to issues

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 30, 2026

Go Coverage Report (Bazel)

Total Coverage: 63.1%

Coverage Policy:

  • Baseline (existing code): ≥60% (current coverage)
  • New/changed code: ≥90% ✅ STRICTLY ENFORCED
  • Long-term goal: Improve baseline to 90%

View detailed HTML report in artifacts

@dragod812 dragod812 force-pushed the stack/raycluster-history-server/proto-log-persistence branch from 292d38b to 4cdc4c2 Compare May 5, 2026 01:49
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 5, 2026

🔍 Go Lint & TODO Tracking Results

⚠️ golangci-lint: Issues detected

level=warning msg="[config_reader] The configuration option `run.skip-dirs` is deprecated, please use `issues.exclude-dirs`."
level=warning msg="[config_reader] The configuration option `output.format` is deprecated, please use `output.formats`"
Error: can't load config: the Go language version (go1.23) used to build golangci-lint is lower than the targeted Go version (1.24.3)
Failed executing command with error: can't load config: the Go language version (go1.23) used to build golangci-lint is lower than the targeted Go version (1.24.3)

✅ TODO Tracking: All TODOs properly linked to issues

Proto:
- Add LogPersistenceSpec message and log_url to RayClusterStatus
  (proto/api/v2/ray_cluster.proto)

Mapper (go/components/jobs/client/k8sengine):
- Load LogPersistenceConfig from controllers.rayCluster.logPersistence
  via FX
- injectCollectorSidecar(): inject the KubeRay History Server collector
  into Ray pod templates with env-var S3 config (no ConfigMap), reuse an
  existing /tmp/ray volume when present, set ImagePullPolicy: IfNotPresent,
  add the postStart raylet-node-id hook
- Wire sidecar injection into mapRayCluster() when logPersistence.enabled
- buildLogURL() and MapLocalClusterStatusToGlobal(): compute the
  human-browsable log URL during local→global status translation
  (mapper owns the formula since it knows config + local cluster name +
   compute-cluster Ray namespace = RayLocalNamespace)
  Format: {LogUrlBaseUrl}/{Bucket}/{PathPrefix}/{cluster}_{rayLocalNamespace}/

Controller (go/components/ray/cluster):
- applyRayClusterStatus(): copy clusterStatus.Ray.LogUrl onto
  rayCluster.Status.LogUrl so callers see the value the mapper computed
- Controller intentionally does NOT own the log_url formula or any
  LogPersistenceConfig — single source of truth lives in the mapper

Defaults (go/cmd/controllermgr/config/base.yaml):
- controllers.rayCluster.logPersistence:
    enabled: true
    storageEndpoint: "k3d-michelangelo-sandbox-agent-0:30007"
    bucket: ray-history
    pathPrefix: log
    region: us-east-1
    credentialsSecret: minio-credentials
    collectorImage: kuberay-collector:v0.1.0
    logUrlBaseUrl: "http://localhost:9090/browser"
- Sandbox MinIO defaults; production overlays flip these.

Tests cover: mapper sidecar injection (env vars, lifecycle preservation,
config knobs), controller log_url pass-through.
@dragod812 dragod812 force-pushed the stack/raycluster-history-server/proto-log-persistence branch from 4cdc4c2 to 6f3c998 Compare May 5, 2026 04:37
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 5, 2026

🔍 Go Lint & TODO Tracking Results

⚠️ golangci-lint: Issues detected

level=warning msg="[config_reader] The configuration option `run.skip-dirs` is deprecated, please use `issues.exclude-dirs`."
level=warning msg="[config_reader] The configuration option `output.format` is deprecated, please use `output.formats`"
Error: can't load config: the Go language version (go1.23) used to build golangci-lint is lower than the targeted Go version (1.24.3)
Failed executing command with error: can't load config: the Go language version (go1.23) used to build golangci-lint is lower than the targeted Go version (1.24.3)

✅ TODO Tracking: All TODOs properly linked to issues

@dragod812 dragod812 changed the title Add log persistence proto field and collector sidecar injection Add Ray cluster log persistence (proto + collector sidecar + log_url) May 5, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants