
[CICD] Add MetaX image build workflow #1203

Open
BrianPei wants to merge 12 commits into flagos-ai:main from BrianPei:PR-0520-MetaxBuild


@BrianPei
Contributor

Description

Add MetaX image build support and stabilize the related CI and test workflows. This PR introduces MetaX-specific Dockerfiles, install scripts, and requirements files, and fixes workflow input handling, Conda base environment behavior, requirements processing, and training log parsing.

Type of change

  • Infra/Build change (changes to CI/CD workflows or build scripts)
  • Bug fix
  • Code refactoring
  • New feature (non-breaking change which adds functionality)
  • Documentation change
  • Breaking change

Changes

  • Added MetaX build workflow and MetaX-specific Dockerfiles.
  • Added MetaX install scripts and requirements for image-based dependency setup.
  • Fixed workflow input handling and image fallback logic for test execution.
  • Fixed Conda base environment and requirements include handling in installer utilities.
  • Improved training and benchmark log parsing for MetaX functional tests.

Checklist

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in CI workflow setup steps
  • My changes generate no new warnings
  • I have tested my feature on the MetaX platform

Copilot AI review requested due to automatic review settings May 20, 2026 04:05
@BrianPei BrianPei requested a review from aoyulong as a code owner May 20, 2026 04:05
Contributor

Copilot AI left a comment


Pull request overview

Adds first-class MetaX image build support and aligns CI/install/test utilities to better handle MetaX-specific environment/layout differences (Conda base env behavior, requirements includes, and log parsing robustness).

Changes:

  • Added MetaX Dockerfiles + a dedicated build_image_metax.yml workflow to build/load/push images and run MetaX tests.
  • Added MetaX install scripts and requirements sets (base/inference/train) and adjusted installer utilities for Conda base + requirements filtering.
  • Improved functional-test log parsing to tolerate ANSI color codes and non-pipe log formats, with unit test coverage.

Reviewed changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| tools/install/utils/retry_utils.sh | Places filtered requirements temp file alongside original requirements to preserve -r relative includes. |
| tools/install/utils/pkg_utils.sh | Treats FLAGSCALE_ENV_NAME=base as base env pip path. |
| tools/install/metax/install_train.sh | Adds MetaX train task installer (requirements + TransformerEngine-FL source dep). |
| tools/install/metax/install_inference.sh | Adds MetaX inference task installer. |
| tools/install/metax/install_base.sh | Adds MetaX base phase installer. |
| tools/install/metax/env.sh | Introduces MetaX env var bootstrap for Docker/interactive shells. |
| tools/install/install.sh | Stops overriding pre-set env vars for conda/deps/downloads/uv venv paths. |
| tools/install/install_system.sh | Normalizes env_name=base to install into conda base environment. |
| tests/unit_tests/runner/test_check_results_parser.py | Adds unit tests for log metric extraction (pipe + ANSI/non-pipe formats). |
| tests/test_utils/runners/parse_benchmark_output.py | Makes benchmark metric extraction ANSI-tolerant and format-flexible. |
| tests/test_utils/runners/check_results.py | Makes training metric extraction ANSI-tolerant and format-flexible; formatting cleanups. |
| requirements/metax/train.txt | Adds MetaX train requirements (includes base). |
| requirements/metax/inference.txt | Adds MetaX inference requirements (includes base). |
| requirements/metax/base.txt | Adds MetaX base requirements (includes common). |
| docker/metax/Dockerfile.train | Adds MetaX train image build (deps/dev/release stages). |
| docker/metax/Dockerfile.inference | Adds MetaX inference image build (deps/dev/release stages). |
| docker/metax/Dockerfile.all | Adds MetaX all-in-one image build (deps/dev/release stages). |
| .github/workflows/build_image_metax.yml | New workflow to build MetaX images, save/load tar, push, and invoke common tests. |
| .github/workflows/build_image_cuda.yml | Updates CUDA workflow runner selection handling. |
| .github/workflows/all_tests_metax.yml | Excludes MetaX docker/workflow changes from triggering the generic MetaX test workflow. |
| .github/workflows/all_tests_common.yml | Makes checkout parameters resilient when PR context isn’t present. |
| .github/configs/metax.yml | Updates MetaX platform config (C550 naming, tar_dir, runner labels, env naming). |
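
The retry_utils.sh entry in the file summary describes writing the filtered requirements file next to the original so that relative -r includes keep resolving. The PR does this in shell; the following is only a minimal Python sketch of the same idea. The function name write_filtered_requirements and the skip_prefixes parameter are hypothetical and do not appear in the PR.

```python
import os
import tempfile


def write_filtered_requirements(req_path, skip_prefixes):
    """Write a filtered copy of req_path into the same directory.

    Hypothetical helper for illustration only; the name and the
    skip_prefixes parameter are assumptions, not taken from retry_utils.sh.
    """
    req_dir = os.path.dirname(os.path.abspath(req_path))
    prefixes = tuple(p.lower() for p in skip_prefixes)

    with open(req_path, encoding="utf-8") as src:
        # Drop every line whose requirement starts with a skipped prefix.
        kept = [
            line for line in src
            if not line.strip().lower().startswith(prefixes)
        ]

    # dir=req_dir keeps the copy beside the original, so a relative
    # "-r base.txt" line still points at the same file.
    fd, tmp_path = tempfile.mkstemp(
        prefix=".filtered-", suffix=".txt", dir=req_dir
    )
    with os.fdopen(fd, "w", encoding="utf-8") as dst:
        dst.writelines(kept)
    return tmp_path
```

pip resolves a nested -r include relative to the requirements file that contains it, so a filtered copy written to a system temp directory would no longer find base.txt. Creating the copy in the original directory avoids rewriting the include paths.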
Comments suppressed due to low confidence (4)

.github/workflows/build_image_metax.yml:250

  • Same runs-on expression issue as the build job: the || fallback returns a JSON string, and fromJson(inputs.runs_on) can error when the input is empty. Use fromJson(inputs.runs_on || '["flagscale-metax-c550-gpu2-8c-256g"]') to ensure a valid runner label array in all cases.
  load_images:
    name: Load and push images
    needs: ['build', 'summary']
    runs-on: ${{ fromJson(inputs.runs_on) || '["flagscale-metax-c550-gpu2-8c-256g"]' }}
    steps:
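
The difference between the two fallback placements can be sketched in Python as a loose analogy. Here json.loads stands in for fromJson and Python's `or` stands in for `||`; GitHub's expression semantics are not identical, and the function names are hypothetical.

```python
import json

# The default runner label array, kept as a JSON string as in the workflow.
DEFAULT_LABELS = '["flagscale-metax-c550-gpu2-8c-256g"]'


def fallback_outside(raw_input):
    """Analogue of: fromJson(inputs.runs_on) || '<default>'."""
    parsed = json.loads(raw_input)   # raises when raw_input is empty
    return parsed or DEFAULT_LABELS  # the fallback is never parsed


def fallback_inside(raw_input):
    """Analogue of: fromJson(inputs.runs_on || '<default>')."""
    # The fallback is chosen first, then parsed, so the result is a list.
    return json.loads(raw_input or DEFAULT_LABELS)
```

With an empty input, fallback_outside fails inside the parser before the fallback is ever considered, while fallback_inside parses the default string and returns a list of labels.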

.github/workflows/build_image_cuda.yml:320

  • Same runs-on expression problem as the build job: ensure the fallback is parsed by fromJson(...) so it always evaluates to an array of runner labels and doesn’t fail when the input is empty.
  load_images:
    name: Load and push images
    needs: ['build', 'summary']
    runs-on: ${{ fromJson(inputs.runs_on) || '["flagscale-nvidia-a100-gpu2-32c-128g"]' }}
    steps:

tests/test_utils/runners/check_results.py:78

  • This now scans all log lines for metric keys (not just pipe-separated iteration lines) and strips ANSI codes via _extract_metric_value. Consider updating the extract_metrics_from_log docstring above to reflect the broader supported formats so callers don’t assume pipe-only logs.
    results = {key: {"values": []} for key in metric_keys}

    for line in lines:
        for key in metric_keys:
            value = _extract_metric_value(line, key)
            if value is not None:
                results[key]["values"].append(value)

tests/test_utils/runners/parse_benchmark_output.py:53

  • The parsing logic no longer relies on pipe-separated iteration ... | formatting; it now extracts key: value anywhere in the line (after stripping ANSI). Please update the extract_metrics_from_log docstring to match this behavior.
def extract_metrics_from_log(lines, metric_keys):
    """Extract metrics from training log lines.

    Log format (pipe-separated):
        " [2026-01-15 09:13:30] iteration 4/10 | ... | lm loss: 1.161108E+01 | ... |"
    """
    results = {key: [] for key in metric_keys}

    for line in lines:
        for key in metric_keys:
            value = _extract_metric_value(line, key)
            if value is not None:
                results[key].append(value)


  build:
    name: Build ${{ matrix.task }}
    needs: prepare
    runs-on: ${{ fromJson(inputs.runs_on) || '["flagscale-metax-c550-gpu2-8c-256g"]' }}
Comment on lines +246 to +258
  load_images:
    name: Load and push images
    needs: ['build', 'summary']
    runs-on: ${{ fromJson(inputs.runs_on) || '["flagscale-metax-c550-gpu2-8c-256g"]' }}
    steps:
      - name: Load train image from tar and push
        run: |
          TAR="${{ needs.build.outputs.train_tar || needs.build.outputs.all_tar }}"
          TAG="${{ needs.build.outputs.train_tag || needs.build.outputs.all_tag }}"
          if [ -f "$TAR" ]; then
            sudo docker load -i "$TAR"
            sudo docker push "$TAG"
          else
Comment on lines +13 to +19
def _extract_metric_value(line, key):
    """Extract a metric value from a log line, tolerating formatting variations."""
    cleaned_line = ANSI_ESCAPE_RE.sub("", line)
    pattern = re.compile(
        rf"{re.escape(key.rstrip(':'))}\s*:\s*([+-]?\d+\.?\d*(?:[eE][+-]?\d+)?)",
        re.IGNORECASE,
    )
Comment on lines +15 to +23
ANSI_ESCAPE_RE = re.compile(r"\x1B\[[0-?]*[ -/]*[@-~]")


def _extract_metric_value(line, key):
    cleaned_line = ANSI_ESCAPE_RE.sub("", line)
    pattern = re.compile(
        rf"{re.escape(key.rstrip(':'))}\s*:\s*([+-]?\d+\.?\d*(?:[eE][+-]?\d+)?)",
        re.IGNORECASE,
    )
    name: Build ${{ matrix.task }}
    needs: prepare
-   runs-on: [self-hosted, Linux, X64, nvidia-0, gpus-8]
+   runs-on: ${{ fromJson(inputs.runs_on) || '["flagscale-nvidia-a100-gpu2-32c-128g"]' }}

    runs_on: >-
-     ${{ inputs.runs_on ||
-     '["self-hosted", "Linux", "X64", "nvidia-0", "gpus-8"]' }}
+     ${{ fromJson(inputs.runs_on) || '["flagscale-nvidia-a100-gpu2-32c-128g"]' }}

2 participants