[MPS Dev] 2026.06.11 by Manfredss · Pull Request #4 · Manfredss/Paddle-MPS-Dev

Manfredss · 2026-06-11T03:36:50Z

PR Category

Custom Device

PR Types

New features

Description

Implements the remaining basic operators in the MPS backend (39 new .mm kernel files, 6 new test files), bringing the backend from 54 to 96 registered ops.

Family	Ops
Unary math	asinh, acosh, atanh, erf, expm1, log1p, trunc
Activations	elu, relu6, hardsigmoid, hardswish, softplus, softsign, logsigmoid, swish, hardtanh, mish, hard_shrink, softshrink, tanh_shrink, thresholded_relu, stanh, celu, selu, logit
Binary	floor_divide, remainder, fmax, fmin, heaviside, atan2
Misc	scale, clip, where, pow, isnan, isinf, isfinite
Reductions	prod, any, all

Implementation notes

All kernels follow the existing MPSGraph pattern: one op per file, float32-only registration, empty-tensor early return, @available(macOS 12.0, *) guard, and CPU fallback when Metal buffers are missing. CMake picks the files up automatically via the mps/*.mm glob.
Kernel signatures match the phi header declarations exactly (including float vs double attrs); registration names were verified against the CPU registrations (hardsigmoid vs hard_shrink vs logsigmoid, etc.).
Semantics matched to the CPU functors:
- remainder = Python-style modulo (result sign follows divisor); floor_divide = floor semantics
- fmax/fmin ignore NaN (composed from isNaN + select)
- logsigmoid uses the numerically stable min(x,0) − log(1+exp(−|x|)) form
- prod/any/all fill empty-input outputs with the reduction identity (1 / False / True)
- pow (scalar exponent) is separate from the existing elementwise_pow (tensor exponent)
- isnan/isinf/isfinite, any, all set BOOL output dtype; any/all take bool input
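The binary-op and reduction semantics listed above can be illustrated with scalar reference functions. This is a minimal pure-Python sketch, not the PR's numpy references or the MPSGraph kernels; the function names are hypothetical, and `remainder_ref` and `floor_divide_ref` assume finite inputs and a nonzero divisor.

```python
import math


def remainder_ref(x: float, y: float) -> float:
    """Python-style modulo: a nonzero result takes the sign of the divisor y."""
    r = math.fmod(x, y)  # C-style remainder: sign follows the dividend x
    if r != 0.0 and (r < 0.0) != (y < 0.0):
        r += y  # shift the result so its sign matches the divisor
    return r


def floor_divide_ref(x: float, y: float) -> float:
    """Floor semantics: round the quotient toward negative infinity."""
    return float(math.floor(x / y))


def fmax_ref(x: float, y: float) -> float:
    """Maximum that ignores NaN, written as an isNaN test plus a select."""
    if math.isnan(x):
        return y
    if math.isnan(y):
        return x
    return max(x, y)


def fmin_ref(x: float, y: float) -> float:
    """Minimum that ignores NaN, written as an isNaN test plus a select."""
    if math.isnan(x):
        return y
    if math.isnan(y):
        return x
    return min(x, y)


print(remainder_ref(-7.0, 3.0))      # 2.0
print(remainder_ref(7.0, -3.0))      # -2.0
print(floor_divide_ref(-7.0, 2.0))   # -4.0
print(fmax_ref(math.nan, 1.0))       # 1.0

# Reduction identities for empty inputs: prod -> 1, any -> False, all -> True.
print(math.prod([]), any([]), all([]))  # 1 False True
```

If both operands are NaN, `fmax_ref` and `fmin_ref` return NaN, since there is no non-NaN value to select.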
where plumbs the bool condition tensor through selectWithPredicateTensor.
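The numerically stable logsigmoid form listed in the semantics above can be checked against the textbook definition. The sketch below is a scalar illustration in pure Python (function names are hypothetical, and this is not the kernel code); it shows why min(x,0) − log(1+exp(−|x|)) is preferred: the exponent is never positive, so `exp` cannot overflow.

```python
import math


def logsigmoid_stable(x: float) -> float:
    """log(sigmoid(x)) computed as min(x, 0) - log(1 + exp(-|x|))."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def logsigmoid_naive(x: float) -> float:
    """Textbook form -log(1 + exp(-x)); exp(-x) overflows for very negative x."""
    return -math.log(1.0 + math.exp(-x))


print(logsigmoid_stable(-1000.0))  # -1000.0
print(logsigmoid_stable(1000.0))   # 0.0

try:
    logsigmoid_naive(-1000.0)
except OverflowError:
    print("naive form overflows at x = -1000")
```

For moderate inputs the two forms agree to within rounding error; they differ only where the naive form overflows or loses precision.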

Testing

Six standalone test files (test/test_mps_*.py), each comparing MPS output against both a hand-written numpy reference and the CPU backend, over multiple shapes, attribute variations, known-value tables, and edge cases (NaN/inf inputs, empty tensors, domain boundaries). No torch dependency.
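The comparison strategy described above can be sketched without Paddle or numpy. In the PR's tests the candidate would be the MPS kernel output and the references would be numpy and the CPU backend; here every name is a hypothetical stand-in, `expm1` is used only as a convenient example op, and plain lists play the role of tensors.

```python
import math


def outputs_match(actual, expected, rtol=1e-5, atol=1e-6):
    """Elementwise comparison that treats NaN as equal to NaN and requires infs to match."""
    if len(actual) != len(expected):
        return False
    for a, e in zip(actual, expected):
        if math.isnan(a) or math.isnan(e):
            if not (math.isnan(a) and math.isnan(e)):
                return False
        elif math.isinf(a) or math.isinf(e):
            if a != e:
                return False
        elif abs(a - e) > atol + rtol * abs(e):
            return False
    return True


def failing_cases(candidate, reference, cases):
    """Return the indices of the input cases on which candidate and reference disagree."""
    return [i for i, xs in enumerate(cases)
            if not outputs_match(candidate(xs), reference(xs))]


def expm1_reference(xs):
    """Hand-written reference for the example op."""
    return [math.expm1(x) for x in xs]


def expm1_buggy(xs):
    """Deliberately wrong stand-in for a backend: maps NaN to 0 instead of propagating it."""
    return [0.0 if math.isnan(x) else math.expm1(x) for x in xs]


CASES = [
    [],                                # empty tensor
    [0.0, 1e-8, 0.5, -2.0],            # known values
    [math.nan, math.inf, -math.inf],   # NaN/inf inputs
]

print(failing_cases(expm1_reference, expm1_reference, CASES))  # []
print(failing_cases(expm1_buggy, expm1_reference, CASES))      # [2]
```

Reporting case indices rather than a single pass/fail flag makes it easier to see which edge case a backend mishandles.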

Does this cause a precision change?

No


…9057) The _is_safe_class() function only checked the class's own __dict__ for dangerous magic methods (__reduce__, __reduce_ex__, __getstate__, __setstate__). It did NOT check the method resolution order (MRO), allowing classes that inherit these methods from parent classes to bypass the safety check. This fix checks all classes in the MRO (except object, whose default __reduce__ is safe for user-defined classes) for dangerous method definitions.
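The difference between checking only a class's own `__dict__` and checking every class in its MRO can be sketched as follows. This is an illustration of the idea described above, not the code from the fix: `DANGEROUS_METHODS`, `own_dict_check`, `mro_check`, and the example classes are hypothetical names, and the real `_is_safe_class()` may apply additional rules.

```python
DANGEROUS_METHODS = ("__reduce__", "__reduce_ex__", "__getstate__", "__setstate__")


def own_dict_check(cls: type) -> bool:
    """Old behaviour: flag a class only if it defines a dangerous method itself."""
    return any(name in vars(cls) for name in DANGEROUS_METHODS)


def mro_check(cls: type) -> bool:
    """New behaviour: flag a class if any class in its MRO, except object, defines one."""
    for klass in cls.__mro__:
        if klass is object:
            continue  # object's default implementations are treated as safe
        if any(name in vars(klass) for name in DANGEROUS_METHODS):
            return True
    return False


class Base:
    def __reduce__(self):
        # Stand-in for a payload that would run during unpickling.
        return (print, ("side effect",))


class Child(Base):
    pass  # inherits __reduce__ without defining it


class Plain:
    pass


print(own_dict_check(Child))  # False: the inherited method is missed
print(mro_check(Child))       # True: found on Base via the MRO
print(mro_check(Plain))       # False
```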


* [XPU][PIR] Add conv2d+bn(+add)+act fusion passes for XPU
  - Fix precision bug in conv2d_bn_xpu_fuse_pass: filter_max should be Max(Abs(folded_filter)) instead of Max(filter), to match XDNN int16 weight quantization expectation. Without this fix, conv layers whose weights cross zero after BN folding produce wrong scaling factors and hurt accuracy.
  - Add conv2d_bn_act_xpu_fuse_pass: fuses {Conv2d | DepthwiseConv2d} x {batch_norm | batch_norm_} x {relu | swish | hardswish} into a single conv2d_xpu op (12 variants).
  - Add conv2d_bn_add_act_xpu_fuse_pass: fuses the same chain plus a residual branch add (both residual_first and residual_last orderings) into conv2d_xpu with branch input (24 variants).
  - Add depthwise_conv2d_xpu_fuse_pass as a fallback that converts any remaining bare depthwise_conv2d (not absorbed by earlier fusions) into conv2d_xpu(act=LINEAR, no_bias, no_branch) so the runtime always dispatches to the fused XPU kernel.
  - Register the three new passes in passes.h and append them to kPirXpuPasses in paddle_pass_builder.cc, ordered after the existing conv2d_xpu fusion family.
  Validated end-to-end on PP-OCRv5_server (det+rec) on Kunlun P800: latency 184ms -> 21.19ms (-88%), no accuracy regression.
* [XPU][PIR][Test] Add unit tests for conv2d+bn(+add)+act fusion passes
  - test_conv2d_bn_act_xpu_fuse_pass.py: 6 cases = {conv2d, depthwise_conv2d} x {relu, swish, hardswish}; expects conv2d_xpu == 1 and bn/relu/swish/hardswish == 0.
  - test_conv2d_bn_add_act_xpu_fuse_pass.py: 6 cases = {relu, swish, hardswish} x {residual_first, residual_last}; expects conv2d_xpu == 1 (with branch input) and bn/add/act == 0.
  - test_depthwise_conv2d_xpu_fuse_pass.py: bare depthwise_conv2d -> conv2d_xpu(act=LINEAR, no_bias, no_branch).
  All tests follow the existing pass_test.PassTest framework under test/ir/pir/fused_pass/xpu/, run only when XPU is compiled in, and verify both fusion topology (valid_op_map) and numerical accuracy.
* [XPU][PIR] Consolidate conv2d+bn(+add)+act XPU fusion passes
  Refactor the four separate conv2d_* XPU fusion passes (`conv2d_bn_xpu_fuse_pass`, `conv2d_bn_act_xpu_fuse_pass`, `conv2d_bn_add_act_xpu_fuse_pass`, `depthwise_conv2d_xpu_fuse_pass`) into a single unified `conv2d_xpu_fuse_pass`, following the design of `fc_xpu_fuse_pass`.
  Highlights:
  - One pass file holding four DRR patterns with explicit benefit scores so longer subgraphs (bn+add+act) win over shorter ones (bn).
  - Bottom-up traversal (`use_top_down_traversal=false`) so patterns anchored on later ops (relu) match before patterns anchored on earlier ops (bn).
  - Shared helpers (`BuildBnFoldSubgraph`, `BuildFilterMaxSubgraph`, `CheckConvBnConstraints`, ComputeAttr factories) eliminate the ~50% duplication that existed across the original four files.
  - Pattern selection via `XPU_PADDLE_CONV2D_PATTERN` env var (default 0xff, bits gate individual pattern groups).
  Tests are consolidated into a single `test_conv2d_xpu_fuse_pass.py` matching the unified pass name, with four `PassTest` classes covering depthwise-only, conv+bn, conv+bn+act, and conv+bn+add+act. The add+act test expects `pd_op.add: 1` to account for the `bn_var + epsilon` add emitted by the BN-fold subgraph (operates on persistable constants; can be removed by a subsequent constant-folding pass).
--------- Co-authored-by: mayang002 <mayang002@users.noreply.github.com>
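The filter_max precision fix described above can be illustrated with a small worked example. This sketch rests on assumptions that are not stated in the commit message: BN folding scales each weight by gamma / sqrt(var + eps), quantization is symmetric and maps filter_max to 32767, and the buggy value is read as the plain maximum of the folded weights. All names are hypothetical; the actual XDNN scheme may differ.

```python
import math

INT16_MAX = 32767


def fold_bn_into_filter(filter_vals, gamma, var, eps):
    """Assumed BN fold: scale every weight by gamma / sqrt(var + eps)."""
    scale = gamma / math.sqrt(var + eps)
    return [w * scale for w in filter_vals]


def quantize_int16(vals, filter_max):
    """Assumed symmetric quantization: filter_max maps to INT16_MAX."""
    return [round(v / filter_max * INT16_MAX) for v in vals]


# Weights that cross zero, with the largest magnitude on the negative side.
folded = fold_bn_into_filter([-0.5, -0.2, 0.1], gamma=2.0, var=1.0, eps=0.0)

max_plain = max(folded)                # 0.2: ignores the large negative weight
max_abs = max(abs(v) for v in folded)  # 1.0: covers the full weight range

bad_q = quantize_int16(folded, max_plain)
good_q = quantize_int16(folded, max_abs)

print(any(abs(q) > INT16_MAX for q in bad_q))    # True: values fall outside int16
print(all(abs(q) <= INT16_MAX for q in good_q))  # True
```

With the plain maximum, the scale is derived from a value that underestimates the weight range, so the largest-magnitude weight cannot be represented in int16.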


* [MoE] Harden permute and unpermute validation
  Add targeted boundary checks for MoE permute/unpermute inputs to fail early on invalid buffer sizes, shape mismatches, and unsafe FP8 scale usage.
  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* [MoE] Move safety coverage to gated legacy test
  Keep AI-edited coverage tests stable on environments without the MoE GPU kernel while covering the new validation paths in the existing architecture-gated MoE test.
  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
--------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>


* [Allocator] Fix VMM allocator hot paths and IPC slices * [Allocator] Add VMM allocator hot-path stats * [Allocator] Keep VMM block parts after compact * [Allocator] Free multi-scale allocations by owner * [Allocator] Share VMM block part helpers

* [CI] Fix empty C++ coverage reports * [CI] Address coverage review comments * test (cherry picked from commit f9658edfdabef36b79ad1fb08fcd0f7963688587) * Revert "test" This reverts commit 9dd8dc28e8143a1d5f292b79c09e8e2e775da07f. --------- Co-authored-by: gushiwei <gushiwei@baidu.com>

* [API Compatibility] Align torch module * [API Compatibility] Fix * [API Compatibility] Fix * [API Compatibility] Fix * [API Compatibility] fix bug and add test * [API Compatibility] Fix * [API Compatibility] Fix * [API Compatibility] Fix

…tivariateNormal、torch.distributions.distribution.Distribution、torch.distributions.distribution.Distribution.icdf / entropy、torch.distributions.normal.Normal (#79124) * Align MultivariateNormal * Try to align multivariate * [API Compatibility] Alias * [API Compatibility] Fix * [API Compatibility] Fix * [API Compatibility] Fix


…197) * LRScheduler, ExponentialDecay * CosineAnnealingDecay * CosineAnnealingWarmRestarts * MultiStepDecay * ReduceOnPlateau * StepDecay * use decorator and overload to support optimizer arg * resolve conflict * move logic into decorator * revert previous changes within methods * fix name * add network test * move init function call before setting scheduler * add lr value test


…sordot/tril_indices/triu_indices/vander/logaddexp/logspace/moveaxis/nan_to_num/nanmean/nansum/masked_fill/addmv/addr/fix/histc/trunc Edit By AI Agent (#79215)
* [API Compatibility] logaddexp/logspace/moveaxis/nan_to_num/nanmean/nansum/sgn/signbit/slice_scatter/take/tensordot/tril_indices/triu_indices/trunc/vander Edit By AI Agent Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* [API Compatibility] logspace/tril_indices/triu_indices/tensordot/slice_scatter Edit By AI Agent Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* [API Compatibility] addmv/addr/addmv_/addr_ Edit By AI Agent Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* [API Compatibility] special.round Edit By AI Agent Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* [API Compatibility] take Edit By AI Agent Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* [API Compatibility] special.round/select_scatter/slice_scatter/col_indices/crow_indices Edit By AI Agent Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* [API Compatibility] slice_scatter Edit By AI Agent Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* [API Compatibility] slice_scatter Edit By AI Agent Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* [API Compatibility] logspace/slice_scatter fix Edit By AI Agent
  - logspace: add requires_grad support in PIR mode
  - slice_scatter: fix overload signature default values
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* [API Compatibility] slice_scatter multi-axis fix Edit By AI Agent Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* [API Compatibility] skip sparse tests on XPU Edit By AI Agent Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* [API Compatibility] add out/requires_grad tests and XPU skip conditions Edit By AI Agent Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* [API Compatibility] skip test_api_compatibility_part1~5 on XPU Edit By AI Agent Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* [API Compatibility] add test_api_compatibility_part5.py Edit By AI Agent Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* [API Compatibility] export addmv/addr/histc/negative_/fix/mvlgamma APIs Edit By AI Agent Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
--------- Co-authored-by: zhouwei25 <zhouwei25@baidu.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>


* [Operator Mechanism] Support mm out_dtype for BF16 CUDA
  Add a narrow CUDA BF16 x BF16 -> FP32 path for paddle.mm(out_dtype=paddle.float32), including schema, infermeta, stride dispatch, fused cuBLAS GEMM, and focused tests.
  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* [Operator Mechanism] Fix mm out_dtype review issues
  Use the canonical matmul path for static mm out_dtype handling and keep legacy compatibility attrs limited to supported types.
  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* [Operator Mechanism] Fix matmul out_dtype CI regressions
  Keep matmul compatible with unknown symbolic dimensions and legacy matmul_v2 to PIR translation when out_dtype is unset.
  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* [Operator Mechanism] Harden matmul out_dtype static path
  Preserve the legacy static mm path when out_dtype is unset and avoid rejecting unknown symbolic matmul dimensions during InferMeta.
  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* [Operator Mechanism] Fix mm out_dtype static BF16 test
  Allow the explicit static out_dtype path to pass BF16 variables through Python validation and feed BF16 static test data using the existing uint16 encoding helper.
  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* [Operator Mechanism] Fix matmul out_dtype PIR compat
  Add missing default/propagated out_dtype handling for legacy matmul translation, PIR serialization compatibility, and handwritten PIR/DRR matmul rewrites.
  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* [Operator Mechanism] Harden matmul out_dtype PIR fusions
  Avoid fusing explicit matmul out_dtype paths in PIR rewrite passes, document BF16 GEMM lda/ldb narrowing safety, and add a legacy matmul_v2 translator regression.
  Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* [Operator Mechanism] Fix matmul out_dtype static compat
  Route static mm out_dtype through matmul_v2 so it reaches the phi matmul kernel, preserve user-provided out tensors, and let legacy matmul_v2 fusion pass compatibility accept only missing/default out_dtype while rejecting explicit output dtype paths.
  Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* [Operator Mechanism] Prune mm out_dtype to dynamic path
  Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* [Operator Mechanism] Skip mm out_dtype tests on ROCm
  Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
--------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>


* [CI] Check C++ coverage before diff filtering * [CI] Add coverage guard smoke test change * Revert "[CI] Add coverage guard smoke test change" This reverts commit f9187c91242d92b399950859ab306f05e0257158. * [CI] Fail fast on coverage diff filtering errors * [CI] Skip diff filtering when coverage input is absent * [CI] Add coverage-info smoke test marker * Revert "[CI] Add coverage-info smoke test marker" This reverts commit eaad8e571cbb37993280b3f142e99ccb2d6e70e9.


YuhanXu and others added 30 commits May 20, 2026 17:58

CINN CustomDevice support bf16 intrinsics. (#78968)

5de6578

[API Compatibility] Add alias movedim for moveaxis (#79033)

601e828

* [API Compat] Add alias movedim for moveaxis * add param alias * move alias to __init__

[API compatibility] Align torch.Tensor.index_copy_ (#78974)

776720a

* Align torch.Tensor.index_copy_ * Modify index_copy_ * Modify manipulation.py for xpu & Add some test case * XPU special cases have been added * Simplify error messages * Test dygragh backward

Refine matmul codes for ap. (#79092)

5f62f1c

[CI] add support for Python 3.14 in setup and CI scripts (#79082)

a754a90

* [CI] add support for Python 3.14 in setup and CI scripts * [CI] add support for Python 3.13 in run_mac_test.sh

Fix ChooseAlgoByWorkspace to use actual algorithm count instead of ve…

f0a51ba

…ctor size (#79088)

[CodeStyle] Add shell tabs remover hook (#79104)

e55b609

--------- Co-authored-by: Codex <noreply@openai.com>

[API Compatibility] Add aliases for apis in paddle.optimizer.lr (#7…

da6eb98

…9035) * [API Compat] Add aliases for apis in paddle.optimizer.lr * add apis to __all__ * delete lr_scheduler module from __all__ * fix import

fix FP8 quant ops (#79096)

67145cc

[CI] Reduce Model-Benchmark download log noise (#79112)

53d060e

[CodeStyle][Typos] Bump typos to v1.46.2 (#79116)

683cbb5

Co-authored-by: Codex <codex@openai.com>

[CI] Upgrade apache-tvm-ffi to v0.1.11 (#79114)

4bdb54f

[CI][H-Coverage] Use YAML timeouts for Fleet single/multi-card tests …

0b4ba60

…(#79117) --------- Co-authored-by: Codex <noreply@openai.com>

[CI] Restore self-hosted runners for GitHub workflows (#79123)

7ac7614

--------- Co-authored-by: Codex <codex@openai.com>

【Hackathon 10th Spring No.52】update gast to 0.7 for Python 3.14 - par…

6cd9619

…t 3 (#79118) --------- Co-authored-by: SigureMo <sigure.qaq@gmail.com>

[CI] Remove unused doc preview comment workflow (#79125)

8349845

Co-authored-by: Codex <codex@openai.com>

[ThirdParty] Bump pybind11 version to v3.0.4 (#79121)

fe25a47

[CI] Upgrade Linux-CPU and NPU CI to Python 3.10 (#79119)

de9bb94

--------- Co-authored-by: Codex <codex@openai.com>

[Infra] Use repository README.md for PyPI long_description (#79115)

8bbf96f

[CI] Disable IXUCA workflow (#79131)

6e4d84a

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

[XPU] update version of XHPC to 20660522 (#79053)

e06e49e

[AP] Rename namespace from cutlass to cutlass_patch and reorganize co…

845ec1c

…des. (#79142)

[XPU] update version of XHPC to 20260523 (#79137)

05e095f

[CUDA13.2] Environment Adaptation support Paddle on CUDA 13.2 (#78720)

d59f4f1

--------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Optimize squared_l2_norm GPU kernel: use new ReduceGpuKernel and prom…

e9b4982

…ote precision for fp16/bf16 (#79011)

[API compatibility] Align torch.kaiser_window (#79134)

d35d159

* [API compatibility] Align torch.kaiser_window * [API Compibility] Align kaiser_window

[PHI] Fix local index in strided elementwise kernel (#79162)

cad7a18

--------- Co-authored-by: Codex <noreply@openai.com> Co-authored-by: Codex <codex@openai.com>

ShigureNyako and others added 30 commits June 5, 2026 11:49

[API Compatibility] Add torch-compatible NVTX range context manager (…

29e37f9

…#79251)

[API Compatibility] Add cuda_stream to Paddle stream wrapper (#79250)

148b086

Co-authored-by: Codex <noreply@openai.com>

Fix docker build error handling (#79253)

29ae6bb

Optimize the compilation of sleef (#79242)

44b28a7

* fix * fix * fix

revert delete deepep (#79249)

15f9805

[Security] Harden CIFAR and Flowers dataset loading (#79210)

06d8af5

* temp fix * fix bugs add test fix bugs fix bugs fix macos bug fix bugs * Update test_dataset_security.py * fix bugs

[API compatibility] Align Module state_dict_post_hook (#79164)

722421e

* [API Compatibility] state_dict_post_hook * [API Compatibility] Fix

[API Compatibility] Add alias paddle.utils.data.DistributedSampler …

641cde7

…for `paddle.io.DistributedBatchSampler` (#79268) * add alias paddle.utils.data.DistributedSampler * support arg seed

add fleet time (#79278)

181bff1

[API Compatibility] Align torch.Tensor.transpose_/ reshape_as (#79228)

d34cf2e

* [API Compatibility] Align torch.Tensor.transpose_/ reshape_as * [API Compatibility] Fix

[Precision Depth Alignment] UniformKernel in GPU (#79130)

e313387

* fix UniformKernelImpl in gpu * fix test

fix add_n 0-size bug (#79276)

4a0da07

* fix add_n 0-size bug * refine * refine

[CodeStyle] Clean up Python 3.9 compatibility code (#79290)

255dd8e

[Typing] Upgrade mypy to 2.1.0 and target Python 3.10 (#79289)

83c656a

【Hackathon 10th Spring No.52】[ThirdParty][Crypto] Upgrade Crypto++ to…

6e6b2ee

… 8.6.0 - part 5 (#79266) * Upgrade Crypto++ to 8.6.0 * Fix Crypto++ CMake flag handling

[CodeStyle] Clean up Python 3.10 type-hint diagnostics (#79291)

00ece82

--------- Co-authored-by: Codex <noreply@openai.com> Co-authored-by: Nyakku Shigure <sigure.qaq@gmail.com>

[CI] Keep gcov tool selection only (#79300)

3824862

Merge branch 'develop' of https://github.com/paddlepaddle/paddle

5e5dca5

# Conflicts: # README.md

Add 42 basic operator kernels (unary/activation/binary/misc/reduce) …

6dc6321

…with tests

pre commit

b0e3c5e

expand support dtype

e8e5bc0

fix

2efb898

Manfredss wants to merge 98 commits into main from dev_2026_06_11


20 participants
