[CICD] Add MetaX image build workflow by BrianPei · Pull Request #1203 · flagos-ai/FlagScale

BrianPei · 2026-05-20T04:05:09Z

Description

Add MetaX image build support and stabilize the related CI/test workflow by introducing MetaX-specific Dockerfiles, install scripts, and requirements, while fixing workflow input handling, Conda base environment behavior, requirements processing, and training log parsing.

Type of change

Infra/Build change (changes to CI/CD workflows or build scripts)
Bug fix
Code refactoring
New feature (non-breaking change which adds functionality)
Documentation change
Breaking change

Changes

Added MetaX build workflow and MetaX-specific Dockerfiles.
Added MetaX install scripts and requirements for image-based dependency setup.
Fixed workflow input handling and image fallback logic for test execution.
Fixed Conda base environment and requirements include handling in installer utilities.
Improved training and benchmark log parsing for MetaX functional tests.
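
The Conda base environment fix listed above can be pictured with a small Python model. It is only a sketch: the real logic lives in shell (`tools/install/utils/pkg_utils.sh`) and is not shown here, and both the function name `conda_pip_path` and the `/opt/conda` root are hypothetical. It relies on the standard Conda layout on Linux, where the base environment sits at the installation root and named environments sit under `envs/<name>`.

```python
from pathlib import Path


def conda_pip_path(conda_root, env_name):
    """Return the pip executable for a Conda environment (Linux layout).

    Hypothetical helper for illustration; not the code from this PR.
    """
    root = Path(conda_root)
    if env_name == "base":
        # The base environment lives at the Conda root, not under envs/base.
        return root / "bin" / "pip"
    return root / "envs" / env_name / "bin" / "pip"


print(conda_pip_path("/opt/conda", "base"))
print(conda_pip_path("/opt/conda", "flagscale-train"))
```

Without the `base` special case, a lookup under `envs/base` would point at a directory that a default Conda installation does not create.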

Checklist

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in CI workflow setup steps
My changes generate no new warnings
I have tested my feature on the MetaX platform

Copilot

Pull request overview

Adds first-class MetaX image build support and aligns CI/install/test utilities to better handle MetaX-specific environment/layout differences (Conda base env behavior, requirements includes, and log parsing robustness).

Changes:

Added MetaX Dockerfiles + a dedicated build_image_metax.yml workflow to build/load/push images and run MetaX tests.
Added MetaX install scripts and requirements sets (base/inference/train) and adjusted installer utilities for Conda base + requirements filtering.
Improved functional-test log parsing to tolerate ANSI color codes and non-pipe log formats, with unit test coverage.

Reviewed changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 6 comments.


File	Description
tools/install/utils/retry_utils.sh	Places filtered requirements temp file alongside original requirements to preserve `-r` relative includes.
tools/install/utils/pkg_utils.sh	Treats `FLAGSCALE_ENV_NAME=base` as base env pip path.
tools/install/metax/install_train.sh	Adds MetaX train task installer (requirements + TransformerEngine-FL source dep).
tools/install/metax/install_inference.sh	Adds MetaX inference task installer.
tools/install/metax/install_base.sh	Adds MetaX base phase installer.
tools/install/metax/env.sh	Introduces MetaX env var bootstrap for Docker/interactive shells.
tools/install/install.sh	Stops overriding pre-set env vars for conda/deps/downloads/uv venv paths.
tools/install/install_system.sh	Normalizes `env_name=base` to install into conda base environment.
tests/unit_tests/runner/test_check_results_parser.py	Adds unit tests for log metric extraction (pipe + ANSI/non-pipe formats).
tests/test_utils/runners/parse_benchmark_output.py	Makes benchmark metric extraction ANSI-tolerant and format-flexible.
tests/test_utils/runners/check_results.py	Makes training metric extraction ANSI-tolerant and format-flexible; formatting cleanups.
requirements/metax/train.txt	Adds MetaX train requirements (includes base).
requirements/metax/inference.txt	Adds MetaX inference requirements (includes base).
requirements/metax/base.txt	Adds MetaX base requirements (includes common).
docker/metax/Dockerfile.train	Adds MetaX train image build (deps/dev/release stages).
docker/metax/Dockerfile.inference	Adds MetaX inference image build (deps/dev/release stages).
docker/metax/Dockerfile.all	Adds MetaX all-in-one image build (deps/dev/release stages).
.github/workflows/build_image_metax.yml	New workflow to build MetaX images, save/load tar, push, and invoke common tests.
.github/workflows/build_image_cuda.yml	Updates CUDA workflow runner selection handling.
.github/workflows/all_tests_metax.yml	Excludes MetaX docker/workflow changes from triggering the generic MetaX test workflow.
.github/workflows/all_tests_common.yml	Makes checkout parameters resilient when PR context isn’t present.
.github/configs/metax.yml	Updates MetaX platform config (C550 naming, tar_dir, runner labels, env naming).
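
The `retry_utils.sh` entry in the table above turns on a pip detail: a `-r` include inside a requirements file is resolved relative to the file that contains it. A filtered copy written to another directory (for example `/tmp`) would therefore break an include such as `-r ../common.txt`. The following Python sketch illustrates the idea under stated assumptions: the function name, the `excluded` parameter, and the name-matching rule are hypothetical, and the actual utility is a shell script whose contents are not shown here.

```python
import os
import re
import tempfile
from pathlib import Path


def write_filtered_requirements(req_path, excluded):
    """Write a filtered copy of req_path next to the original and return its path.

    Hypothetical helper: `excluded` is a set of lowercase package names to drop.
    """
    req_path = Path(req_path)
    kept = []
    for line in req_path.read_text().splitlines():
        stripped = line.strip()
        # Keep blanks, comments, and option lines such as `-r ../common.txt`.
        if not stripped or stripped.startswith(("#", "-")):
            kept.append(line)
            continue
        # Take the text before any version specifier, extra, or marker.
        name = re.split(r"[<>=!~\[; ]", stripped, maxsplit=1)[0].lower()
        if name not in excluded:
            kept.append(line)
    # Same directory as the original, so relative includes still resolve.
    fd, tmp_name = tempfile.mkstemp(
        prefix=".filtered-", suffix=".txt", dir=req_path.parent, text=True
    )
    with os.fdopen(fd, "w") as handle:
        handle.write("\n".join(kept) + "\n")
    return Path(tmp_name)
```

The caller would pass the returned path to `pip install -r` and delete the file afterwards.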

Comments suppressed due to low confidence (4)

.github/workflows/build_image_metax.yml:250

Same runs-on expression issue as the build job: the || fallback returns a JSON string, and fromJson(inputs.runs_on) can error when the input is empty. Use fromJson(inputs.runs_on || '["flagscale-metax-c550-gpu2-8c-256g"]') to ensure a valid runner label array in all cases.

  load_images:
    name: Load and push images
    needs: ['build', 'summary']
    runs-on: ${{ fromJson(inputs.runs_on) || '["flagscale-metax-c550-gpu2-8c-256g"]' }}
    steps:
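
The ordering problem described in the comment above can be reproduced with a rough Python model. This is an analogy, not GitHub Actions itself: `json.loads` stands in for `fromJson`, Python's `or` stands in for `||`, and the two function names are hypothetical. The exact behavior of the Actions expression engine on empty input is as the review comment states it, not something this sketch proves.

```python
import json

DEFAULT_RUNNER = '["flagscale-metax-c550-gpu2-8c-256g"]'


def parse_then_fallback(runs_on_input):
    """Model of `fromJson(inputs.runs_on) || '<json string>'`."""
    # Parsing runs first, so an empty input fails before the fallback applies.
    # If the fallback were reached, it would still be an unparsed string.
    return json.loads(runs_on_input) or DEFAULT_RUNNER


def fallback_then_parse(runs_on_input):
    """Model of `fromJson(inputs.runs_on || '<json string>')`."""
    # The fallback is chosen first, so the parser always gets valid JSON.
    return json.loads(runs_on_input or DEFAULT_RUNNER)


print(fallback_then_parse(""))
print(fallback_then_parse('["custom-runner"]'))
try:
    parse_then_fallback("")
except ValueError as exc:
    print("parse_then_fallback failed on empty input:", exc)
```

In this model only `fallback_then_parse` returns a list of runner labels for every input, which is the shape the review comment recommends.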

.github/workflows/build_image_cuda.yml:320

Same runs-on expression problem as the build job: ensure the fallback is parsed by fromJson(...) so it always evaluates to an array of runner labels and doesn’t fail when the input is empty.

  load_images:
    name: Load and push images
    needs: ['build', 'summary']
    runs-on: ${{ fromJson(inputs.runs_on) || '["flagscale-nvidia-a100-gpu2-32c-128g"]' }}
    steps:

tests/test_utils/runners/check_results.py:78

This now scans all log lines for metric keys (not just pipe-separated iteration lines) and strips ANSI codes via _extract_metric_value. Consider updating the extract_metrics_from_log docstring above to reflect the broader supported formats so callers don’t assume pipe-only logs.

    results = {key: {"values": []} for key in metric_keys}

    for line in lines:
        for key in metric_keys:
            value = _extract_metric_value(line, key)
            if value is not None:
                results[key]["values"].append(value)

tests/test_utils/runners/parse_benchmark_output.py:53

The parsing logic no longer relies on pipe-separated iteration ... | formatting; it now extracts key: value anywhere in the line (after stripping ANSI). Please update the extract_metrics_from_log docstring to match this behavior.

def extract_metrics_from_log(lines, metric_keys):
    """Extract metrics from training log lines.

    Log format (pipe-separated):
        " [2026-01-15 09:13:30] iteration 4/10 | ... | lm loss: 1.161108E+01 | ... |"
    """
    results = {key: [] for key in metric_keys}

    for line in lines:
        for key in metric_keys:
            value = _extract_metric_value(line, key)
            if value is not None:
                results[key].append(value)


+  build:
+    name: Build ${{ matrix.task }}
+    needs: prepare
+    runs-on: ${{ fromJson(inputs.runs_on) || '["flagscale-metax-c550-gpu2-8c-256g"]' }}


+  load_images:
+    name: Load and push images
+    needs: ['build', 'summary']
+    runs-on: ${{ fromJson(inputs.runs_on) || '["flagscale-metax-c550-gpu2-8c-256g"]' }}
+    steps:
+      - name: Load train image from tar and push
+        run: |
+          TAR="${{ needs.build.outputs.train_tar || needs.build.outputs.all_tar }}"
+          TAG="${{ needs.build.outputs.train_tag || needs.build.outputs.all_tag }}"
+          if [ -f "$TAR" ]; then
+            sudo docker load -i "$TAR"
+            sudo docker push "$TAG"
+          else


+def _extract_metric_value(line, key):
+    """Extract a metric value from a log line, tolerating formatting variations."""
+    cleaned_line = ANSI_ESCAPE_RE.sub("", line)
+    pattern = re.compile(
+        rf"{re.escape(key.rstrip(':'))}\s*:\s*([+-]?\d+\.?\d*(?:[eE][+-]?\d+)?)",
+        re.IGNORECASE,
+    )


+ANSI_ESCAPE_RE = re.compile(r"\x1B\[[0-?]*[ -/]*[@-~]")
+
+
+def _extract_metric_value(line, key):
+    cleaned_line = ANSI_ESCAPE_RE.sub("", line)
+    pattern = re.compile(
+        rf"{re.escape(key.rstrip(':'))}\s*:\s*([+-]?\d+\.?\d*(?:[eE][+-]?\d+)?)",
+        re.IGNORECASE,
+    )
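
The hunk above is cut off before the pattern is used. The following self-contained sketch shows how such a helper might behave end to end. The regular expressions are copied from the diff; the final `search` and `float` conversion, the public function name, and the sample log lines are assumptions made for illustration and may differ from the PR's actual code.

```python
import re

# Matches ANSI CSI sequences such as "\x1b[32m" (pattern taken from the diff).
ANSI_ESCAPE_RE = re.compile(r"\x1B\[[0-?]*[ -/]*[@-~]")


def extract_metric_value(line, key):
    """Return the number following `key:` in `line` as a float, or None."""
    cleaned_line = ANSI_ESCAPE_RE.sub("", line)
    pattern = re.compile(
        rf"{re.escape(key.rstrip(':'))}\s*:\s*([+-]?\d+\.?\d*(?:[eE][+-]?\d+)?)",
        re.IGNORECASE,
    )
    # Assumed completion: take the first match anywhere in the cleaned line.
    match = pattern.search(cleaned_line)
    return float(match.group(1)) if match else None


# Hypothetical sample lines: one pipe-separated, one colored and without pipes.
pipe_line = " [2026-01-15 09:13:30] iteration 4/10 | lm loss: 1.161108E+01 | grad norm: 2.5 |"
ansi_line = "\x1b[32mINFO\x1b[0m step 4 lm loss: \x1b[1m1.161108E+01\x1b[0m"

for sample in (pipe_line, ansi_line):
    print(extract_metric_value(sample, "lm loss:"))
```

Stripping the escape sequences before matching is what lets the colored line parse: otherwise the bytes between `lm loss: ` and the number would keep `\s*` from reaching the digits.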


    name: Build ${{ matrix.task }}
    needs: prepare
-    runs-on: [self-hosted, Linux, X64, nvidia-0, gpus-8]
+    runs-on: ${{ fromJson(inputs.runs_on) || '["flagscale-nvidia-a100-gpu2-32c-128g"]' }}


      runs_on: >-
-        ${{ inputs.runs_on ||
-        '["self-hosted", "Linux", "X64", "nvidia-0", "gpus-8"]' }}
+        ${{ fromJson(inputs.runs_on) || '["flagscale-nvidia-a100-gpu2-32c-128g"]' }}


BrianPei added 12 commits May 18, 2026 15:08

add metax build image workflow

6f0781d

use input runner labels

b8f87b1

fix runs_on format

5cc686d

metax clean workspace add diagnose

47be47a

check metax clean workspace step status

87395c2

fix metax conda base

0178c6f

add curl install on metax

0ff7904

fix requirments temp file error

bbd27cc

remove debug steps

f4992c1

fix upload logic when build all images

7670535

fix tests after build runs_on parameter

dd4e22a

fix metax functional_tests_train loss empty

a3f812e

Copilot AI review requested due to automatic review settings May 20, 2026 04:05

BrianPei requested a review from aoyulong as a code owner May 20, 2026 04:05

Copilot started reviewing on behalf of BrianPei May 20, 2026 04:05

Copilot AI reviewed May 20, 2026

BrianPei wants to merge 12 commits into flagos-ai:main from BrianPei:PR-0520-MetaxBuild
