[MPS Dev] 2026.06.11#4

Open
Manfredss wants to merge 98 commits into main from dev_2026_06_11

Manfredss (Owner) commented Jun 11, 2026

PR Category

Custom Device

PR Types

New features

Description

Implements the remaining basic operators in the MPS backend (39 new .mm kernel files, 6 new test files), bringing the backend from 54 to 96 registered ops.

| Family | Ops |
| --- | --- |
| Unary math | asinh, acosh, atanh, erf, expm1, log1p, trunc |
| Activations | elu, relu6, hardsigmoid, hardswish, softplus, softsign, logsigmoid, swish, hardtanh, mish, hard_shrink, softshrink, tanh_shrink, thresholded_relu, stanh, celu, selu, logit |
| Binary | floor_divide, remainder, fmax, fmin, heaviside, atan2 |
| Misc | scale, clip, where, pow, isnan, isinf, isfinite |
| Reductions | prod, any, all |

Implementation notes

  • All kernels follow the existing MPSGraph pattern: one op per file, float32-only registration, empty-tensor early return, @available(macOS 12.0, *) guard, CPU-fallback on
    missing Metal buffers. CMake picks the files up automatically via the mps/*.mm glob.
  • Kernel signatures match the phi header declarations exactly (incl. float vs double attrs); registration names verified against the CPU registrations (hardsigmoid vs
    hard_shrink vs logsigmoid etc.).
  • Semantics matched to the CPU functors:
    • remainder = Python-style modulo (result sign follows divisor); floor_divide = floor semantics
    • fmax/fmin ignore NaN (composed from isNaN + select)
    • logsigmoid uses the numerically stable min(x,0) − log(1+exp(−|x|)) form
    • prod/any/all fill empty-input outputs with the reduction identity (1 / False / True)
    • pow (scalar exponent) is separate from the existing elementwise_pow (tensor exponent)
    • isnan/isinf/isfinite, any, all set BOOL output dtype; any/all take bool input
  • where plumbs the bool condition tensor through selectWithPredicateTensor.
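
The semantics listed above can be illustrated with a scalar reference in plain Python. This is only a sketch of the stated behavior, not the MPSGraph kernels or the CPU functors themselves; the function names mirror the op names, and restricting the inputs to finite floats with a nonzero divisor is an assumption made for brevity.

```python
import math


def remainder(x: float, y: float) -> float:
    """Python-style modulo: a nonzero result takes the sign of the divisor y."""
    r = math.fmod(x, y)  # C-style remainder, sign follows the dividend x
    if r != 0.0 and (r < 0.0) != (y < 0.0):
        r += y  # shift the result so its sign matches the divisor
    return r


def floor_divide(x: float, y: float) -> int:
    """Quotient rounded toward negative infinity (assumes finite x, nonzero y)."""
    return math.floor(x / y)


def fmax(x: float, y: float) -> float:
    """Maximum that ignores a single NaN operand (isNaN + select)."""
    if math.isnan(x):
        return y
    if math.isnan(y):
        return x
    return max(x, y)


def fmin(x: float, y: float) -> float:
    """Minimum that ignores a single NaN operand."""
    if math.isnan(x):
        return y
    if math.isnan(y):
        return x
    return min(x, y)


def logsigmoid(x: float) -> float:
    """Stable form min(x, 0) - log(1 + exp(-|x|)); exp never sees a positive argument."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))
```

The stable `logsigmoid` form matters for large positive inputs: a naive `-math.log(1 + math.exp(-x))` is fine there, but the equivalent `x - math.log(1 + math.exp(x))` raises `OverflowError` at `x = 1000.0`, while the form above returns `0.0`.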

Testing

Six standalone test files (test/test_mps_*.py), each comparing MPS output against both a hand-written numpy reference and the CPU backend, over multiple shapes,
attribute variations, known-value tables, and edge cases (NaN/inf inputs, empty tensors, domain boundaries). No torch dependency.
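
A known-value table check of the kind described above might look like the following sketch. The helper names `close` and `check_table`, the tolerances, and the use of `math.log1p` as a stand-in for the kernel under test are all hypothetical; the actual test files are not shown in this PR description.

```python
import math


def close(a: float, b: float, rtol: float = 1e-5, atol: float = 1e-6) -> bool:
    """Compare floats, treating NaN as equal to NaN and matching infinities exactly."""
    if math.isnan(a) or math.isnan(b):
        return math.isnan(a) and math.isnan(b)
    if math.isinf(a) or math.isinf(b):
        return a == b
    return abs(a - b) <= atol + rtol * abs(b)


def check_table(fn, table):
    """Return the (input, expected, actual) rows that fn fails to reproduce."""
    return [(x, want, fn(x)) for x, want in table if not close(fn(x), want)]


# Known values for log1p, including inf and NaN inputs.
LOG1P_TABLE = [
    (0.0, 0.0),
    (math.e - 1.0, 1.0),
    (math.inf, math.inf),
    (math.nan, math.nan),
]

# math.log1p stands in for the backend output here (an assumption).
failures = check_table(math.log1p, LOG1P_TABLE)

# Empty-input reductions should return the reduction identity.
empty_identities = (math.prod([]), any([]), all([]))  # (1, False, True)
```

An empty `failures` list means every row matched; a non-empty one carries the input, the expected value, and the actual value for each mismatch, which keeps edge-case failures easy to read.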

Does this cause a precision change?

YuhanXu and others added 30 commits May 20, 2026 17:58
* [API Compat] Add alias movedim for moveaxis

* add param alias

* move alias to __init__
* Align torch.Tensor.index_copy_

* Modify index_copy_

* Modify manipulation.py for xpu & Add some test case

* XPU special cases have been added

* Simplify error messages

* Test dygraph backward
* [CI] add support for Python 3.14 in setup and CI scripts

* [CI] add support for Python 3.13 in run_mac_test.sh
…9057)

The _is_safe_class() function only checked the class's own __dict__ for
dangerous magic methods (__reduce__, __reduce_ex__, __getstate__, __setstate__).
It did NOT check the method resolution order (MRO), allowing classes that
inherit these methods from parent classes to bypass the safety check.

This fix checks all classes in the MRO (except object, whose default
__reduce__ is safe for user-defined classes) for dangerous method definitions.
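
The difference between the two checks can be sketched as follows. This is an illustration of the idea in the commit message, not the body of `_is_safe_class()`, which is not shown here; the helper names and the example classes are hypothetical.

```python
DANGEROUS = ("__reduce__", "__reduce_ex__", "__getstate__", "__setstate__")


def defines_dangerous_own(cls) -> bool:
    """Old behavior: look only at the class's own __dict__."""
    return any(name in cls.__dict__ for name in DANGEROUS)


def defines_dangerous_in_mro(cls) -> bool:
    """Fixed behavior: look at every class in the MRO except object."""
    return any(
        name in klass.__dict__
        for klass in cls.__mro__
        if klass is not object
        for name in DANGEROUS
    )


class Base:
    def __reduce__(self):
        # A custom __reduce__ lets unpickling call an arbitrary callable.
        return (print, ("side effect",))


class Child(Base):
    """Inherits __reduce__ without defining it in its own __dict__."""


class Plain:
    """Defines none of the magic methods."""
```

With these definitions, `defines_dangerous_own(Child)` is `False` while `defines_dangerous_in_mro(Child)` is `True`, which is the bypass the commit describes. `Plain` passes both checks because `object` is skipped.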
---------

Co-authored-by: Codex <noreply@openai.com>
…9035)

* [API Compat] Add aliases for apis in paddle.optimizer.lr

* add apis to __all__

* delete lr_scheduler module from __all__

* fix import
Co-authored-by: Codex <codex@openai.com>
…(#79117)

---------

Co-authored-by: Codex <noreply@openai.com>
---------

Co-authored-by: Codex <codex@openai.com>
…t 3 (#79118)

---------

Co-authored-by: SigureMo <sigure.qaq@gmail.com>
Co-authored-by: Codex <codex@openai.com>
---------

Co-authored-by: Codex <codex@openai.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* [XPU][PIR] Add conv2d+bn(+add)+act fusion passes for XPU

- Fix precision bug in conv2d_bn_xpu_fuse_pass:
  filter_max should be Max(Abs(folded_filter)) instead of Max(filter),
  to match XDNN int16 weight quantization expectation. Without this fix,
  conv layers whose weights cross zero after BN folding produce wrong
  scaling factors and hurt accuracy.

- Add conv2d_bn_act_xpu_fuse_pass:
  fuses {Conv2d | DepthwiseConv2d} x {batch_norm | batch_norm_} x
  {relu | swish | hardswish} into a single conv2d_xpu op (12 variants).

- Add conv2d_bn_add_act_xpu_fuse_pass:
  fuses the same chain plus a residual branch add (both residual_first
  and residual_last orderings) into conv2d_xpu with branch input
  (24 variants).

- Add depthwise_conv2d_xpu_fuse_pass as a fallback that converts any
  remaining bare depthwise_conv2d (not absorbed by earlier fusions)
  into conv2d_xpu(act=LINEAR, no_bias, no_branch) so the runtime
  always dispatches to the fused XPU kernel.

- Register the three new passes in passes.h and append them to
  kPirXpuPasses in paddle_pass_builder.cc, ordered after the existing
  conv2d_xpu fusion family.

Validated end-to-end on PP-OCRv5_server (det+rec) on Kunlun P800:
latency 184ms -> 21.19ms (-88%), no accuracy regression.
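
Why `Max(Abs(folded_filter))` matters can be seen in a toy example. The sketch assumes a symmetric int16 scheme with `scale = filter_max / 32767`; that scheme, the helper names, and the sample weights are assumptions for illustration and are not taken from XDNN or the pass itself.

```python
INT16_MAX = 32767


def quantize_int16(weights, filter_max):
    """Symmetric quantization: map [-filter_max, filter_max] onto int16 and clip."""
    scale = filter_max / INT16_MAX
    return [max(-INT16_MAX, min(INT16_MAX, round(w / scale))) for w in weights]


def dequantize(values, filter_max):
    """Inverse mapping back to floats."""
    scale = filter_max / INT16_MAX
    return [v * scale for v in values]


# Folded weights that cross zero, with the largest magnitude on the negative side.
folded = [-4.0, 0.5, 1.0]

max_plain = max(folded)                   # 1.0: ignores the -4.0 weight
max_abs = max(abs(w) for w in folded)     # 4.0: covers the full range

wrong = dequantize(quantize_int16(folded, max_plain), max_plain)
right = dequantize(quantize_int16(folded, max_abs), max_abs)
```

Under this assumed scheme, `wrong[0]` comes back near `-1.0` because the `-4.0` weight is clipped, while `right[0]` stays near `-4.0`.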

* [XPU][PIR][Test] Add unit tests for conv2d+bn(+add)+act fusion passes

- test_conv2d_bn_act_xpu_fuse_pass.py:
  6 cases = {conv2d, depthwise_conv2d} x {relu, swish, hardswish}
  expects conv2d_xpu == 1 and bn/relu/swish/hardswish == 0.

- test_conv2d_bn_add_act_xpu_fuse_pass.py:
  6 cases = {relu, swish, hardswish} x {residual_first, residual_last}
  expects conv2d_xpu == 1 (with branch input) and bn/add/act == 0.

- test_depthwise_conv2d_xpu_fuse_pass.py:
  bare depthwise_conv2d -> conv2d_xpu(act=LINEAR, no_bias, no_branch).

All tests follow the existing pass_test.PassTest framework under
test/ir/pir/fused_pass/xpu/, run only when XPU is compiled in, and
verify both fusion topology (valid_op_map) and numerical accuracy.

* [XPU][PIR] Consolidate conv2d+bn(+add)+act XPU fusion passes

Refactor the four separate conv2d_* XPU fusion passes
(`conv2d_bn_xpu_fuse_pass`, `conv2d_bn_act_xpu_fuse_pass`,
`conv2d_bn_add_act_xpu_fuse_pass`, `depthwise_conv2d_xpu_fuse_pass`)
into a single unified `conv2d_xpu_fuse_pass`, following the design of
`fc_xpu_fuse_pass`.

Highlights:
- One pass file holding four DRR patterns with explicit benefit scores
  so longer subgraphs (bn+add+act) win over shorter ones (bn).
- Bottom-up traversal (`use_top_down_traversal=false`) so patterns
  anchored on later ops (relu) match before patterns anchored on
  earlier ops (bn).
- Shared helpers (`BuildBnFoldSubgraph`, `BuildFilterMaxSubgraph`,
  `CheckConvBnConstraints`, ComputeAttr factories) eliminate the ~50%
  duplication that existed across the original four files.
- Pattern selection via `XPU_PADDLE_CONV2D_PATTERN` env var
  (default 0xff, bits gate individual pattern groups).

Tests are consolidated into a single `test_conv2d_xpu_fuse_pass.py`
matching the unified pass name, with four `PassTest` classes covering
depthwise-only, conv+bn, conv+bn+act, and conv+bn+add+act. The
add+act test expects `pd_op.add: 1` to account for the
`bn_var + epsilon` add emitted by the BN-fold subgraph (operates on
persistable constants; can be removed by a subsequent constant-folding
pass).

---------

Co-authored-by: mayang002 <mayang002@users.noreply.github.com>
---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* [MoE] Harden permute and unpermute validation

Add targeted boundary checks for MoE permute/unpermute inputs to fail early on invalid buffer sizes, shape mismatches, and unsafe FP8 scale usage.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* [MoE] Move safety coverage to gated legacy test

Keep AI-edited coverage tests stable on environments without the MoE GPU kernel while covering the new validation paths in the existing architecture-gated MoE test.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* [API compatibility] Align torch.kaiser_window

* [API Compatibility] Align kaiser_window
---------

Co-authored-by: Codex <noreply@openai.com>
Co-authored-by: Codex <codex@openai.com>
ShigureNyako and others added 30 commits June 5, 2026 11:49
* [Allocator] Fix VMM allocator hot paths and IPC slices

* [Allocator] Add VMM allocator hot-path stats

* [Allocator] Keep VMM block parts after compact

* [Allocator] Free multi-scale allocations by owner

* [Allocator] Share VMM block part helpers
* [CI] Fix empty C++ coverage reports

* [CI] Address coverage review comments

* test

(cherry picked from commit f9658edfdabef36b79ad1fb08fcd0f7963688587)

* Revert "test"

This reverts commit 9dd8dc28e8143a1d5f292b79c09e8e2e775da07f.

---------

Co-authored-by: gushiwei <gushiwei@baidu.com>
* [API Compatibility] Align torch module

* [API Compatibility] Fix

* [API Compatibility] Fix

* [API Compatibility] Fix

* [API Compatibility] fix bug and add test

* [API Compatibility] Fix

* [API Compatibility] Fix

* [API Compatibility] Fix
…tivariateNormal, torch.distributions.distribution.Distribution, torch.distributions.distribution.Distribution.icdf / entropy, torch.distributions.normal.Normal (#79124)

* Align MultivariateNormal

* Try to align multivariate

* [API Compatibility] Alias

* [API Compatibility] Fix

* [API Compatibility] Fix

* [API Compatibility] Fix
* temp fix

* fix bugs

add test

fix bugs

fix bugs

fix macos bug

fix bugs

* Update test_dataset_security.py

* fix bugs
…197)

* LRScheduler, ExponentialDecay

* CosineAnnealingDecay

* CosineAnnealingWarmRestarts

* MultiStepDecay

* ReduceOnPlateau

* StepDecay

* use decorator and overload to support optimizer arg

* resolve conflict

* move logic into decorator

* revert previous changes within methods

* fix name

* add network test

* move init function call before setting scheduler

* add lr value test
* [API Compatibility] state_dict_post_hook

* [API Compatibility] Fix
…for `paddle.io.DistributedBatchSampler` (#79268)

* add alias paddle.utils.data.DistributedSampler

* support arg seed
…sordot/tril_indices/triu_indices/vander/logaddexp/logspace/moveaxis/nan_to_num/nanmean/nansum/masked_fill/addmv/addr/fix/histc/trunc Edit By AI Agent (#79215)

* [API Compatibility] logaddexp/logspace/moveaxis/nan_to_num/nanmean/nansum/sgn/signbit/slice_scatter/take/tensordot/tril_indices/triu_indices/trunc/vander Edit By AI Agent

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [API Compatibility] logspace/tril_indices/triu_indices/tensordot/slice_scatter Edit By AI Agent

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [API Compatibility] addmv/addr/addmv_/addr_ Edit By AI Agent

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [API Compatibility] special.round Edit By AI Agent

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [API Compatibility] take Edit By AI Agent

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [API Compatibility] special.round/select_scatter/slice_scatter/col_indices/crow_indices Edit By AI Agent

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [API Compatibility] slice_scatter Edit By AI Agent

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [API Compatibility] slice_scatter Edit By AI Agent

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [API Compatibility] logspace/slice_scatter fix Edit By AI Agent

- logspace: add requires_grad support in PIR mode
- slice_scatter: fix overload signature default values

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [API Compatibility] slice_scatter multi-axis fix Edit By AI Agent

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [API Compatibility] skip sparse tests on XPU Edit By AI Agent

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [API Compatibility] add out/requires_grad tests and XPU skip conditions Edit By AI Agent

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [API Compatibility] skip test_api_compatibility_part1~5 on XPU Edit By AI Agent

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [API Compatibility] add test_api_compatibility_part5.py Edit By AI Agent

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [API Compatibility] export addmv/addr/histc/negative_/fix/mvlgamma APIs Edit By AI Agent

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: zhouwei25 <zhouwei25@baidu.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* [API Compatibility] Align torch.Tensor.transpose_/ reshape_as

* [API Compatibility] Fix
* fix UniformKernelImpl in gpu

* fix test
* [Operator Mechanism] Support mm out_dtype for BF16 CUDA

Add a narrow CUDA BF16 x BF16 -> FP32 path for paddle.mm(out_dtype=paddle.float32), including schema, infermeta, stride dispatch, fused cuBLAS GEMM, and focused tests.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* [Operator Mechanism] Fix mm out_dtype review issues

Use the canonical matmul path for static mm out_dtype handling and keep legacy compatibility attrs limited to supported types.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* [Operator Mechanism] Fix matmul out_dtype CI regressions

Keep matmul compatible with unknown symbolic dimensions and legacy matmul_v2 to PIR translation when out_dtype is unset.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* [Operator Mechanism] Harden matmul out_dtype static path

Preserve the legacy static mm path when out_dtype is unset and avoid rejecting unknown symbolic matmul dimensions during InferMeta.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* [Operator Mechanism] Fix mm out_dtype static BF16 test

Allow the explicit static out_dtype path to pass BF16 variables through Python validation and feed BF16 static test data using the existing uint16 encoding helper.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* [Operator Mechanism] Fix matmul out_dtype PIR compat

Add missing default/propagated out_dtype handling for legacy matmul translation, PIR serialization compatibility, and handwritten PIR/DRR matmul rewrites.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* [Operator Mechanism] Harden matmul out_dtype PIR fusions

Avoid fusing explicit matmul out_dtype paths in PIR rewrite passes, document BF16 GEMM lda/ldb narrowing safety, and add a legacy matmul_v2 translator regression.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* [Operator Mechanism] Fix matmul out_dtype static compat

Route static mm out_dtype through matmul_v2 so it reaches the phi matmul kernel, preserve user-provided out tensors, and let legacy matmul_v2 fusion pass compatibility accept only missing/default out_dtype while rejecting explicit output dtype paths.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* [Operator Mechanism] Prune mm out_dtype to dynamic path

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* [Operator Mechanism] Skip mm out_dtype tests on ROCm

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* fix add_n 0-size bug

* refine

* refine
* [CI] Check C++ coverage before diff filtering

* [CI] Add coverage guard smoke test change

* Revert "[CI] Add coverage guard smoke test change"

This reverts commit f9187c91242d92b399950859ab306f05e0257158.

* [CI] Fail fast on coverage diff filtering errors

* [CI] Skip diff filtering when coverage input is absent

* [CI] Add coverage-info smoke test marker

* Revert "[CI] Add coverage-info smoke test marker"

This reverts commit eaad8e571cbb37993280b3f142e99ccb2d6e70e9.
… 8.6.0 - part 5 (#79266)

* Upgrade Crypto++ to 8.6.0

* Fix Crypto++ CMake flag handling
---------

Co-authored-by: Codex <noreply@openai.com>
Co-authored-by: Nyakku Shigure <sigure.qaq@gmail.com>