
Commit 40e7e82

zhoward-1 and claude authored
docs: add missing prerequisites, next steps, and troubleshooting to pipeline guides (#1086)
## Summary

Fills in missing structural sections across four pipeline guides that were flagged as incomplete.

- **`running-uniflow.md`** — Added Prerequisites (sandbox, Poetry, Docker, workflow definition) and Next Steps (file sync, caching, triggers, model registry)
- **`cache-and-pipelinerun-resume-form.md`** — Added Prerequisites (remote execution required, Ray/Spark only) and Next Steps (triggers, file sync, MA Studio monitoring)
- **`file-sync-testing-flow-runbook.md`** — Added Prerequisites (sandbox, existing Docker image, Git repo, cloud storage creds); expanded Troubleshooting from a single raw log snippet into three structured entries (missing credentials, unexpected files, changes not picked up); added Next Steps
- **`train-and-register-a-model.md`** — Added Prerequisites (sandbox, prepared dataset, SDK install, Docker for distributed runs)

## Test plan

- [ ] `cd website && bun run build` passes (verified locally)
- [ ] Prerequisites link to correct pages
- [ ] Next Steps links resolve correctly

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent 83d66c1 · commit 40e7e82

4 files changed

Lines changed: 66 additions & 6 deletions


docs/user-guides/ml-pipelines/cache-and-pipelinerun-resume-form.md

Lines changed: 11 additions & 0 deletions
````diff
@@ -6,6 +6,11 @@
 * How cache keys are determined
 * How to resume a pipeline run from a specific step
 
+## Prerequisites
+
+- **A working remote execution setup** — Caching and resume only apply to remote runs. See [Running Uniflow Pipelines](./running-uniflow.md) to get remote execution working first.
+- **Ray or Spark tasks** — Only Ray and Spark tasks support caching. Local execution does not cache results.
+
 ## Task caching
 
 For each task in a Uniflow Remote Run, we cache and index the task results after execution. Next time you execute the task, you have the option to skip execution by reusing the cached results.
@@ -55,3 +60,9 @@ ma pipeline run -n <namespace> --revision <pipeline-revision-name> --resume_from
 ```
 
 **Important:** To skip a step during resume, Uniflow requires that the input of the step has not changed.
+
+## Next Steps
+
+- **Run pipelines on a schedule** — See [Set Up Triggers](../set-up-triggers.md) to automate pipeline execution with cron triggers
+- **Test changes without rebuilding** — Use [file sync](./file-sync-testing-flow-runbook.md) to iterate faster during development
+- **Monitor pipeline runs** — Open MA Studio at `http://localhost:8090/<your-project>` to view run history, step status, and cached results
````
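
For reference, a minimal sketch of how the resume flag from the hunk header above might be invoked. The namespace, revision name, and step name are hypothetical placeholders, and the exact argument form of `--resume_from` is truncated in this diff, so treat it as an assumption.

```bash
# Hypothetical resume invocation based on the truncated command in the hunk
# header above. All values (namespace, revision, step name) are illustrative
# placeholders, and the --resume_from argument form is assumed.
ma pipeline run -n my-namespace \
  --revision my-pipeline-revision-1 \
  --resume_from train_step
# Per the note in the doc: a step is only skipped on resume if its inputs
# have not changed.
```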

docs/user-guides/ml-pipelines/file-sync-testing-flow-runbook.md

Lines changed: 31 additions & 3 deletions
````diff
@@ -6,6 +6,13 @@
 * When to use (and not use) file sync
 * What files get synced and the typical development flow
 
+## Prerequisites
+
+- **A running sandbox** — File sync pushes code to a remote cluster. Follow the [Sandbox Setup](../../getting-started/sandbox-setup.md) guide first.
+- **A Docker image already built for your workflow** — File sync patches an existing image; it does not build one. See [Running Uniflow Pipelines](./running-uniflow.md) for image build steps.
+- **A Git repository** — File sync uses Git metadata to determine which files changed since the image was built.
+- **Cloud storage credentials** — S3/MinIO access is required to upload the sync tarball (see Troubleshooting below).
+
 ## What is file sync?
 
 File Sync lets you test your local code changes on remote infrastructure **without rebuilding Docker images**. Instead of waiting 20+ minutes for image builds per task, you can sync your changes in 2-5 minutes.
@@ -89,11 +96,32 @@ ma pipeline dev-run --file-sync --file <path_to_pipeline.yaml>
 4. **Commit and rebuild image** only when ready for production
 
 ## Troubleshooting
-1. No fsspec credentials, once kicking off remote run, it failed with below error:
-```2026-03-23 09:14:44,722 | ERROR | michelangelo.uniflow.core.file_sync | Failed to upload tarball: Unable to locate credentials```
-setup credentials before starting remote run workflow
+
+### Missing cloud storage credentials
+
+If the sync fails with:
 ```
+Failed to upload tarball: Unable to locate credentials
+```
+
+Set credentials before running:
+
+```bash
 export AWS_ACCESS_KEY_ID=minioadmin
 export AWS_SECRET_ACCESS_KEY=minioadmin
 export AWS_ENDPOINT_URL=http://localhost:9091
 ```
+
+### Unexpected files being synced
+
+If more files are synced than expected, your Docker image may not have Git metadata. In that case, file sync sends all uncommitted changes rather than just the diff since the image was built. Ensure the image is built from your current branch with Git history included.
+
+### File sync not picking up changes
+
+File sync only includes files tracked by Git. If you added new files, make sure they are staged (`git add`) so Git is aware of them.
+
+## Next Steps
+
+- **Cache results between runs** — See [Uniflow caching and pipeline run resume](./cache-and-pipelinerun-resume-form.md) to skip unchanged steps and resume failed runs
+- **Run on a schedule** — See [Set Up Triggers](../set-up-triggers.md) to automate pipeline execution with cron triggers
+- **Build and register your model** — Once your code is validated, follow the [Model Registry Guide](../model-registry-guide.md) to package and version your model
````
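
The new Troubleshooting entries pair naturally with the dev-run command in the hunk header. Below is a sketch of the full iteration loop they imply, assuming the sandbox MinIO defaults shown in the diff; the file paths are illustrative placeholders.

```bash
# File-sync iteration loop assembled from commands shown in this diff.
# The MinIO credentials are the sandbox defaults from the Troubleshooting
# section; the file paths are illustrative placeholders.
export AWS_ACCESS_KEY_ID=minioadmin
export AWS_SECRET_ACCESS_KEY=minioadmin
export AWS_ENDPOINT_URL=http://localhost:9091

# Stage new files so file sync (which follows Git) picks them up
git add src/my_new_module.py

# Sync local changes onto the existing image and run remotely
ma pipeline dev-run --file-sync --file pipelines/my_pipeline.yaml
```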

docs/user-guides/ml-pipelines/running-uniflow.md

Lines changed: 14 additions & 0 deletions
````diff
@@ -8,6 +8,13 @@ This guide covers how to run Uniflow pipelines locally and remotely.
 * The differences between local and remote execution modes
 * How to debug workflows and container issues
 
+## Prerequisites
+
+- **A running sandbox environment** — Remote execution requires a local Kubernetes cluster. Follow the [Sandbox Setup](../../getting-started/sandbox-setup.md) guide if you haven't done this yet.
+- **Python 3.11+ and Poetry installed** — See the [Sandbox Setup prerequisites](../../getting-started/sandbox-setup.md#prerequisites).
+- **A Uniflow workflow defined** — See [Getting Started with ML Pipelines](./getting-started.md) for a walkthrough of defining tasks and workflows.
+- **Docker** — Required for building images used in remote execution.
+
 ## Environment setup
 
 Create Python virtual environment and install packages:
@@ -161,3 +168,10 @@ docker pull ghcr.io/michelangelo-ai/worker:latest
 docker images
 docker exec -it k3d-michelangelo-sandbox-server-0 crictl images
 ```
+
+## Next Steps
+
+- **Speed up iteration** — Use [file sync](./file-sync-testing-flow-runbook.md) to test local code changes on remote infrastructure without rebuilding Docker images
+- **Cache task results** — Learn how [Uniflow caching and pipeline run resume](./cache-and-pipelinerun-resume-form.md) can speed up repeated runs
+- **Run on a schedule** — See [Set Up Triggers](../set-up-triggers.md) to run your pipeline automatically on a cron schedule
+- **Register your model** — After a successful training run, follow the [Model Registry Guide](../model-registry-guide.md) to package and version your model
````
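
The debugging hunk above checks worker images both on the host and inside the k3d node. The same commands chain into a quick pre-run sanity check; only the grep filters are an illustrative addition.

```bash
# Pre-run image sanity check, using the commands from the hunk above.
# Only the grep filters are added for readability.
docker pull ghcr.io/michelangelo-ai/worker:latest
docker images | grep worker
# Confirm the image is also visible to the k3d sandbox node
docker exec -it k3d-michelangelo-sandbox-server-0 crictl images | grep worker
```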

docs/user-guides/train-and-register-a-model.md

Lines changed: 10 additions & 3 deletions
````diff
@@ -6,11 +6,18 @@ The focus is simplicity: **you control your training logic**, Michelangelo provi
 
 ## What You'll Learn
 
-* How datasets are passed to training tasks
-* How to load Ray, Pandas, or Spark datasets
-* How to scale training with Ray workers
+* How datasets are passed to training tasks
+* How to load Ray, Pandas, or Spark datasets
+* How to scale training with Ray workers
 * How to use the Lightning Trainer SDK for deep learning
 
+## Prerequisites
+
+- **A running sandbox** — Remote training runs require a local Kubernetes cluster. Follow the [Sandbox Setup](../getting-started/sandbox-setup.md) guide if you haven't done this yet.
+- **A prepared dataset** — Training tasks expect datasets passed as `DatasetVariable`. See [Data Preparation](./prepare-your-data.md) for how to produce them.
+- **Python 3.11+, Poetry, and the Michelangelo SDK installed** — Run `cd python && poetry install` from the repo root.
+- **For distributed training:** A Docker image with your workflow code. See [Running Uniflow Pipelines](./ml-pipelines/running-uniflow.md) for image build steps.
+
 ## Understanding Training Inputs
 
 Michelangelo workflows pass datasets using **DatasetVariable**.
````
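
The SDK-install prerequisite added here compresses several steps into one bullet. A sketch of the sequence it implies, with version checks added as illustrative assumptions:

```bash
# Environment setup implied by the new Prerequisites bullet. The version
# checks are illustrative additions; the install command is from the diff.
python --version    # expect 3.11 or newer
poetry --version
cd python && poetry install
```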
