[CICD] Add MetaX image build workflow #1203
Open
BrianPei wants to merge 12 commits into
Pull request overview
Adds first-class MetaX image build support and aligns CI/install/test utilities to better handle MetaX-specific environment/layout differences (Conda base env behavior, requirements includes, and log parsing robustness).
Changes:
- Added MetaX Dockerfiles and a dedicated `build_image_metax.yml` workflow to build/load/push images and run MetaX tests.
- Added MetaX install scripts and requirements sets (base/inference/train) and adjusted installer utilities for Conda base + requirements filtering.
- Improved functional-test log parsing to tolerate ANSI color codes and non-pipe log formats, with unit test coverage.
Reviewed changes
Copilot reviewed 22 out of 22 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| tools/install/utils/retry_utils.sh | Places filtered requirements temp file alongside original requirements to preserve -r relative includes. |
| tools/install/utils/pkg_utils.sh | Treats FLAGSCALE_ENV_NAME=base as base env pip path. |
| tools/install/metax/install_train.sh | Adds MetaX train task installer (requirements + TransformerEngine-FL source dep). |
| tools/install/metax/install_inference.sh | Adds MetaX inference task installer. |
| tools/install/metax/install_base.sh | Adds MetaX base phase installer. |
| tools/install/metax/env.sh | Introduces MetaX env var bootstrap for Docker/interactive shells. |
| tools/install/install.sh | Stops overriding pre-set env vars for conda/deps/downloads/uv venv paths. |
| tools/install/install_system.sh | Normalizes env_name=base to install into conda base environment. |
| tests/unit_tests/runner/test_check_results_parser.py | Adds unit tests for log metric extraction (pipe + ANSI/non-pipe formats). |
| tests/test_utils/runners/parse_benchmark_output.py | Makes benchmark metric extraction ANSI-tolerant and format-flexible. |
| tests/test_utils/runners/check_results.py | Makes training metric extraction ANSI-tolerant and format-flexible; formatting cleanups. |
| requirements/metax/train.txt | Adds MetaX train requirements (includes base). |
| requirements/metax/inference.txt | Adds MetaX inference requirements (includes base). |
| requirements/metax/base.txt | Adds MetaX base requirements (includes common). |
| docker/metax/Dockerfile.train | Adds MetaX train image build (deps/dev/release stages). |
| docker/metax/Dockerfile.inference | Adds MetaX inference image build (deps/dev/release stages). |
| docker/metax/Dockerfile.all | Adds MetaX all-in-one image build (deps/dev/release stages). |
| .github/workflows/build_image_metax.yml | New workflow to build MetaX images, save/load tar, push, and invoke common tests. |
| .github/workflows/build_image_cuda.yml | Updates CUDA workflow runner selection handling. |
| .github/workflows/all_tests_metax.yml | Excludes MetaX docker/workflow changes from triggering the generic MetaX test workflow. |
| .github/workflows/all_tests_common.yml | Makes checkout parameters resilient when PR context isn’t present. |
| .github/configs/metax.yml | Updates MetaX platform config (C550 naming, tar_dir, runner labels, env naming). |
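The `retry_utils.sh` row relies on the fact that pip resolves a nested `-r other.txt` relative to the requirements file that contains it, so a filtered copy written to a different directory would lose its includes. The idea can be sketched in Python rather than the script's shell. The helper name `write_filtered_requirements` and the prefix-based filter are hypothetical, chosen only for illustration; the actual filtering rule in the script is not shown in this PR page.

```python
import os
import tempfile
from pathlib import Path


def write_filtered_requirements(req_path, skip_prefixes):
    """Write a filtered copy of req_path in the SAME directory and return its path.

    Hypothetical helper: lines starting with any of skip_prefixes are dropped.
    Keeping the copy next to the original means a relative `-r base.txt`
    line inside it still points at the right file.
    """
    req_path = Path(req_path)
    prefixes = tuple(skip_prefixes)
    kept = [
        line
        for line in req_path.read_text().splitlines()
        if not line.strip().startswith(prefixes)
    ]
    fd, tmp_name = tempfile.mkstemp(
        prefix=req_path.stem + ".filtered.",
        suffix=".txt",
        dir=str(req_path.parent),  # alongside the original, not in /tmp
    )
    with os.fdopen(fd, "w") as handle:
        handle.write("\n".join(kept) + "\n")
    return Path(tmp_name)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        base = Path(workdir) / "base.txt"
        train = Path(workdir) / "train.txt"
        base.write_text("numpy\n")
        train.write_text("-r base.txt\nsome-package\nother-package\n")
        filtered = write_filtered_requirements(train, ["some-package"])
        print(filtered.read_text())
```

Writing the copy into the original's directory is the simplest way to keep relative includes valid; the alternative of rewriting each `-r` path to an absolute one would also work but touches more of the file.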
Comments suppressed due to low confidence (4)
.github/workflows/build_image_metax.yml:250
- Same `runs-on` expression issue as the build job: the `||` fallback returns a JSON string, and `fromJson(inputs.runs_on)` can error when the input is empty. Use `fromJson(inputs.runs_on || '["flagscale-metax-c550-gpu2-8c-256g"]')` to ensure a valid runner label array in all cases.
load_images:
  name: Load and push images
  needs: ['build', 'summary']
  runs-on: ${{ fromJson(inputs.runs_on) || '["flagscale-metax-c550-gpu2-8c-256g"]' }}
  steps:
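The ordering problem described in this comment can be modeled in Python, with `json.loads` standing in for `fromJson` and Python's `or` standing in for the expression `||`. This is only an analogy under those assumptions: Python truthiness and error behavior are not identical to GitHub Actions expressions, and the function names are hypothetical.

```python
import json

DEFAULT_RUNNER = '["flagscale-metax-c550-gpu2-8c-256g"]'


def parse_then_fallback(runs_on: str):
    """Models `fromJson(inputs.runs_on) || '<default>'` (the current form).

    The parse runs first, so an empty input fails before the fallback
    is considered, and the fallback itself is an unparsed JSON string.
    """
    return json.loads(runs_on) or DEFAULT_RUNNER


def fallback_then_parse(runs_on: str):
    """Models `fromJson(inputs.runs_on || '<default>')` (the suggested form).

    The fallback is chosen first, so the parser always receives valid
    JSON and the result is always a list of labels.
    """
    return json.loads(runs_on or DEFAULT_RUNNER)


if __name__ == "__main__":
    print(fallback_then_parse(""))            # default labels as a list
    print(fallback_then_parse('["custom"]'))  # caller-supplied labels
    try:
        parse_then_fallback("")
    except json.JSONDecodeError as exc:
        print("empty input fails before the fallback applies:", exc)
```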
.github/workflows/build_image_cuda.yml:320
- Same `runs-on` expression problem as the build job: ensure the fallback is parsed by `fromJson(...)` so it always evaluates to an array of runner labels and doesn’t fail when the input is empty.
load_images:
  name: Load and push images
  needs: ['build', 'summary']
  runs-on: ${{ fromJson(inputs.runs_on) || '["flagscale-nvidia-a100-gpu2-32c-128g"]' }}
  steps:
tests/test_utils/runners/check_results.py:78
- This now scans all log lines for metric keys (not just pipe-separated iteration lines) and strips ANSI codes via `_extract_metric_value`. Consider updating the `extract_metrics_from_log` docstring above to reflect the broader supported formats so callers don’t assume pipe-only logs.
results = {key: {"values": []} for key in metric_keys}
for line in lines:
    for key in metric_keys:
        value = _extract_metric_value(line, key)
        if value is not None:
            results[key]["values"].append(value)
tests/test_utils/runners/parse_benchmark_output.py:53
- The parsing logic no longer relies on pipe-separated `iteration ... |` formatting; it now extracts `key: value` anywhere in the line (after stripping ANSI). Please update the `extract_metrics_from_log` docstring to match this behavior.
def extract_metrics_from_log(lines, metric_keys):
    """Extract metrics from training log lines.
    Log format (pipe-separated):
    " [2026-01-15 09:13:30] iteration 4/10 | ... | lm loss: 1.161108E+01 | ... |"
    """
    results = {key: [] for key in metric_keys}
    for line in lines:
        for key in metric_keys:
            value = _extract_metric_value(line, key)
            if value is not None:
                results[key].append(value)
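One way the requested docstring update might read is sketched below. The docstring wording is a suggestion, not the author's text. The tail of `_extract_metric_value` (the `search` call and the `float` conversion) and the final `return results` are assumptions, since the quoted snippets stop before those lines.

```python
import re

ANSI_ESCAPE_RE = re.compile(r"\x1B\[[0-?]*[ -/]*[@-~]")


def _extract_metric_value(line, key):
    """Extract a metric value from a log line, tolerating formatting variations."""
    cleaned_line = ANSI_ESCAPE_RE.sub("", line)
    pattern = re.compile(
        rf"{re.escape(key.rstrip(':'))}\s*:\s*([+-]?\d+\.?\d*(?:[eE][+-]?\d+)?)",
        re.IGNORECASE,
    )
    # Assumed completion: the PR snippet ends at the compiled pattern.
    match = pattern.search(cleaned_line)
    return float(match.group(1)) if match else None


def extract_metrics_from_log(lines, metric_keys):
    """Extract metrics from training log lines.

    ANSI color codes are stripped from each line, then every line is
    searched for "<key>: <number>" anywhere in the text. Both of these
    forms are therefore accepted:

      pipe-separated: " [2026-01-15 09:13:30] iteration 4/10 | ... | lm loss: 1.161108E+01 | ... |"
      plain:          "lm loss: 1.161108E+01"

    Lines without a matching key are skipped.
    """
    results = {key: [] for key in metric_keys}
    for line in lines:
        for key in metric_keys:
            value = _extract_metric_value(line, key)
            if value is not None:
                results[key].append(value)
    return results  # assumed; the quoted snippet is cut off before this point


if __name__ == "__main__":
    sample_lines = [
        " [2026-01-15 09:13:30] iteration 4/10 | lm loss: 1.161108E+01 |",
        "\x1b[32mlm loss: 9.5\x1b[0m",  # colored, no pipes
        "no metrics on this line",
    ]
    print(extract_metrics_from_log(sample_lines, ["lm loss:"]))
```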
build:
  name: Build ${{ matrix.task }}
  needs: prepare
  runs-on: ${{ fromJson(inputs.runs_on) || '["flagscale-metax-c550-gpu2-8c-256g"]' }}
Comment on lines +246 to +258
load_images:
  name: Load and push images
  needs: ['build', 'summary']
  runs-on: ${{ fromJson(inputs.runs_on) || '["flagscale-metax-c550-gpu2-8c-256g"]' }}
  steps:
    - name: Load train image from tar and push
      run: |
        TAR="${{ needs.build.outputs.train_tar || needs.build.outputs.all_tar }}"
        TAG="${{ needs.build.outputs.train_tag || needs.build.outputs.all_tag }}"
        if [ -f "$TAR" ]; then
          sudo docker load -i "$TAR"
          sudo docker push "$TAG"
        else
Comment on lines +13 to +19
def _extract_metric_value(line, key):
    """Extract a metric value from a log line, tolerating formatting variations."""
    cleaned_line = ANSI_ESCAPE_RE.sub("", line)
    pattern = re.compile(
        rf"{re.escape(key.rstrip(':'))}\s*:\s*([+-]?\d+\.?\d*(?:[eE][+-]?\d+)?)",
        re.IGNORECASE,
    )
Comment on lines +15 to +23
ANSI_ESCAPE_RE = re.compile(r"\x1B\[[0-?]*[ -/]*[@-~]")


def _extract_metric_value(line, key):
    cleaned_line = ANSI_ESCAPE_RE.sub("", line)
    pattern = re.compile(
        rf"{re.escape(key.rstrip(':'))}\s*:\s*([+-]?\d+\.?\d*(?:[eE][+-]?\d+)?)",
        re.IGNORECASE,
    )
  name: Build ${{ matrix.task }}
  needs: prepare
- runs-on: [self-hosted, Linux, X64, nvidia-0, gpus-8]
+ runs-on: ${{ fromJson(inputs.runs_on) || '["flagscale-nvidia-a100-gpu2-32c-128g"]' }}
  runs_on: >-
-   ${{ inputs.runs_on ||
-   '["self-hosted", "Linux", "X64", "nvidia-0", "gpus-8"]' }}
+   ${{ fromJson(inputs.runs_on) || '["flagscale-nvidia-a100-gpu2-32c-128g"]' }}
Description
Add MetaX image build support and stabilize the related CI/test workflow by introducing MetaX-specific Dockerfiles, install scripts, and requirements, while fixing workflow input handling, Conda base environment behavior, requirements processing, and training log parsing.
Type of change
Changes
Checklist