[MPS Dev] 2026.06.11 #4
Open
Manfredss wants to merge 98 commits into
* [API Compat] Add alias movedim for moveaxis * add param alias * move alias to __init__
* Align torch.Tensor.index_copy_ * Modify index_copy_ * Modify manipulation.py for xpu & Add some test case * XPU special cases have been added * Simplify error messages * Test dygraph backward
* [CI] add support for Python 3.14 in setup and CI scripts * [CI] add support for Python 3.13 in run_mac_test.sh
…9057) The _is_safe_class() function only checked the class's own __dict__ for dangerous magic methods (__reduce__, __reduce_ex__, __getstate__, __setstate__). It did NOT check the method resolution order (MRO), allowing classes that inherit these methods from parent classes to bypass the safety check. This fix checks all classes in the MRO (except object, whose default __reduce__ is safe for user-defined classes) for dangerous method definitions.
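The MRO check described in this commit message can be illustrated with a minimal Python sketch. The helper names `own_dict_only_check` and `defines_dangerous_method` are hypothetical and are not the actual `_is_safe_class()` implementation; the sketch only shows why inspecting a class's own `__dict__` misses inherited methods, while walking `__mro__` (skipping `object`) catches them.

```python
# Hypothetical illustration; not the real _is_safe_class() code.
DANGEROUS_METHODS = ("__reduce__", "__reduce_ex__", "__getstate__", "__setstate__")


def own_dict_only_check(cls):
    """Old behaviour (as described): look only at the class's own __dict__."""
    return any(name in cls.__dict__ for name in DANGEROUS_METHODS)


def defines_dangerous_method(cls):
    """Fixed behaviour (as described): look at every class in the MRO except object."""
    for klass in cls.__mro__:
        if klass is object:
            # object's default __reduce__ is treated as safe.
            continue
        if any(name in klass.__dict__ for name in DANGEROUS_METHODS):
            return True
    return False


class Base:
    def __reduce__(self):
        return (list, ())


class Child(Base):
    # Inherits __reduce__ from Base without defining it itself.
    pass


class Plain:
    pass


child_missed_by_old_check = not own_dict_only_check(Child)
child_caught_by_mro_check = defines_dangerous_method(Child)
plain_is_flagged = defines_dangerous_method(Plain)
```

Here `Child` passes the own-`__dict__` check even though pickling it would call `Base.__reduce__`, which is the bypass the commit closes.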
…ctor size (#79088)
--------- Co-authored-by: Codex <noreply@openai.com>
…9035) * [API Compat] Add aliases for apis in paddle.optimizer.lr * add apis to __all__ * delete lr_scheduler module from __all__ * fix import
Co-authored-by: Codex <codex@openai.com>
…(#79117) --------- Co-authored-by: Codex <noreply@openai.com>
--------- Co-authored-by: Codex <codex@openai.com>
…t 3 (#79118) --------- Co-authored-by: SigureMo <sigure.qaq@gmail.com>
Co-authored-by: Codex <codex@openai.com>
--------- Co-authored-by: Codex <codex@openai.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* [XPU][PIR] Add conv2d+bn(+add)+act fusion passes for XPU
- Fix precision bug in conv2d_bn_xpu_fuse_pass:
filter_max should be Max(Abs(folded_filter)) instead of Max(filter),
to match XDNN int16 weight quantization expectation. Without this fix,
conv layers whose weights cross zero after BN folding produce wrong
scaling factors and hurt accuracy.
- Add conv2d_bn_act_xpu_fuse_pass:
fuses {Conv2d | DepthwiseConv2d} x {batch_norm | batch_norm_} x
{relu | swish | hardswish} into a single conv2d_xpu op (12 variants).
- Add conv2d_bn_add_act_xpu_fuse_pass:
fuses the same chain plus a residual branch add (both residual_first
and residual_last orderings) into conv2d_xpu with branch input
(24 variants).
- Add depthwise_conv2d_xpu_fuse_pass as a fallback that converts any
remaining bare depthwise_conv2d (not absorbed by earlier fusions)
into conv2d_xpu(act=LINEAR, no_bias, no_branch) so the runtime
always dispatches to the fused XPU kernel.
- Register the three new passes in passes.h and append them to
kPirXpuPasses in paddle_pass_builder.cc, ordered after the existing
conv2d_xpu fusion family.
Validated end-to-end on PP-OCRv5_server (det+rec) on Kunlun P800:
latency 184ms -> 21.19ms (-88%), no accuracy regression.
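The `filter_max` fix can be illustrated with a small NumPy sketch. It assumes a symmetric int16 scheme in which weights are scaled by `32767 / filter_max`; that scheme and the helper `quantize_int16` are assumptions made for illustration, not the XDNN implementation. When the folded weights cross zero and the largest magnitude is negative, `Max(filter)` underestimates the range and the large negative weight is clipped.

```python
import numpy as np

INT16_MAX = 32767


def quantize_int16(weights, filter_max):
    """Quantize to a symmetric int16 grid and dequantize again (illustrative only)."""
    scale = INT16_MAX / filter_max
    q = np.clip(np.round(weights * scale), -INT16_MAX, INT16_MAX)
    return q / scale


# Folded weights that cross zero; the largest magnitude is a negative value.
folded = np.array([-2.0, -0.5, 0.25, 1.0], dtype=np.float32)

wrong_max = float(folded.max())          # Max(filter): ignores the -2.0 weight
right_max = float(np.abs(folded).max())  # Max(Abs(folded_filter))

# Worst-case round-trip error under each choice of filter_max.
err_wrong = float(np.abs(quantize_int16(folded, wrong_max) - folded).max())
err_right = float(np.abs(quantize_int16(folded, right_max) - folded).max())
```

With `wrong_max` the `-2.0` weight saturates at the int16 limit and comes back as roughly `-1.0`, while `right_max` keeps the round-trip error at the level of the quantization step.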
* [XPU][PIR][Test] Add unit tests for conv2d+bn(+add)+act fusion passes
- test_conv2d_bn_act_xpu_fuse_pass.py:
6 cases = {conv2d, depthwise_conv2d} x {relu, swish, hardswish}
expects conv2d_xpu == 1 and bn/relu/swish/hardswish == 0.
- test_conv2d_bn_add_act_xpu_fuse_pass.py:
6 cases = {relu, swish, hardswish} x {residual_first, residual_last}
expects conv2d_xpu == 1 (with branch input) and bn/add/act == 0.
- test_depthwise_conv2d_xpu_fuse_pass.py:
bare depthwise_conv2d -> conv2d_xpu(act=LINEAR, no_bias, no_branch).
All tests follow the existing pass_test.PassTest framework under
test/ir/pir/fused_pass/xpu/, run only when XPU is compiled in, and
verify both fusion topology (valid_op_map) and numerical accuracy.
* [XPU][PIR] Consolidate conv2d+bn(+add)+act XPU fusion passes
Refactor the four separate conv2d_* XPU fusion passes
(`conv2d_bn_xpu_fuse_pass`, `conv2d_bn_act_xpu_fuse_pass`,
`conv2d_bn_add_act_xpu_fuse_pass`, `depthwise_conv2d_xpu_fuse_pass`)
into a single unified `conv2d_xpu_fuse_pass`, following the design of
`fc_xpu_fuse_pass`.
Highlights:
- One pass file holding four DRR patterns with explicit benefit scores
so longer subgraphs (bn+add+act) win over shorter ones (bn).
- Bottom-up traversal (`use_top_down_traversal=false`) so patterns
anchored on later ops (relu) match before patterns anchored on
earlier ops (bn).
- Shared helpers (`BuildBnFoldSubgraph`, `BuildFilterMaxSubgraph`,
`CheckConvBnConstraints`, ComputeAttr factories) eliminate the ~50%
duplication that existed across the original four files.
- Pattern selection via `XPU_PADDLE_CONV2D_PATTERN` env var
(default 0xff, bits gate individual pattern groups).
Tests are consolidated into a single `test_conv2d_xpu_fuse_pass.py`
matching the unified pass name, with four `PassTest` classes covering
depthwise-only, conv+bn, conv+bn+act, and conv+bn+add+act. The
add+act test expects `pd_op.add: 1` to account for the
`bn_var + epsilon` add emitted by the BN-fold subgraph (operates on
persistable constants; can be removed by a subsequent constant-folding
pass).
---------
Co-authored-by: mayang002 <mayang002@users.noreply.github.com>
--------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* [MoE] Harden permute and unpermute validation Add targeted boundary checks for MoE permute/unpermute inputs to fail early on invalid buffer sizes, shape mismatches, and unsafe FP8 scale usage. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * [MoE] Move safety coverage to gated legacy test Keep AI-edited coverage tests stable on environments without the MoE GPU kernel while covering the new validation paths in the existing architecture-gated MoE test. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
…ote precision for fp16/bf16 (#79011)
* [API compatibility] Align torch.kaiser_window * [API Compatibility] Align kaiser_window
--------- Co-authored-by: Codex <noreply@openai.com> Co-authored-by: Codex <codex@openai.com>
Co-authored-by: Codex <noreply@openai.com>
* fix * fix * fix
* [Allocator] Fix VMM allocator hot paths and IPC slices * [Allocator] Add VMM allocator hot-path stats * [Allocator] Keep VMM block parts after compact * [Allocator] Free multi-scale allocations by owner * [Allocator] Share VMM block part helpers
* [CI] Fix empty C++ coverage reports * [CI] Address coverage review comments * test (cherry picked from commit f9658edfdabef36b79ad1fb08fcd0f7963688587) * Revert "test" This reverts commit 9dd8dc28e8143a1d5f292b79c09e8e2e775da07f. --------- Co-authored-by: gushiwei <gushiwei@baidu.com>
* [API Compatibility] Align torch module * [API Compatibility] Fix * [API Compatibility] Fix * [API Compatibility] Fix * [API Compatibility] fix bug and add test * [API Compatibility] Fix * [API Compatibility] Fix * [API Compatibility] Fix
…tivariateNormal, torch.distributions.distribution.Distribution, torch.distributions.distribution.Distribution.icdf / entropy, torch.distributions.normal.Normal (#79124) * Align MultivariateNormal * Try to align multivariate * [API Compatibility] Alias * [API Compatibility] Fix * [API Compatibility] Fix * [API Compatibility] Fix
* temp fix * fix bugs add test fix bugs fix bugs fix macos bug fix bugs * Update test_dataset_security.py * fix bugs
…197) * LRScheduler, ExponentialDecay * CosineAnnealingDecay * CosineAnnealingWarmRestarts * MultiStepDecay * ReduceOnPlateau * StepDecay * use decorator and overload to support optimizer arg * resolve conflict * move logic into decorator * revert previous changes within methods * fix name * add network test * move init function call before setting scheduler * add lr value test
* [API Compatibility] state_dict_post_hook * [API Compatibility] Fix
…for `paddle.io.DistributedBatchSampler` (#79268) * add alias paddle.utils.data.DistributedSampler * support arg seed
…sordot/tril_indices/triu_indices/vander/logaddexp/logspace/moveaxis/nan_to_num/nanmean/nansum/masked_fill/addmv/addr/fix/histc/trunc Edit By AI Agent (#79215) * [API Compatibility] logaddexp/logspace/moveaxis/nan_to_num/nanmean/nansum/sgn/signbit/slice_scatter/take/tensordot/tril_indices/triu_indices/trunc/vander Edit By AI Agent Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * [API Compatibility] logspace/tril_indices/triu_indices/tensordot/slice_scatter Edit By AI Agent Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * [API Compatibility] addmv/addr/addmv_/addr_ Edit By AI Agent Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * [API Compatibility] special.round Edit By AI Agent Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * [API Compatibility] take Edit By AI Agent Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * [API Compatibility] special.round/select_scatter/slice_scatter/col_indices/crow_indices Edit By AI Agent Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * [API Compatibility] slice_scatter Edit By AI Agent Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * [API Compatibility] slice_scatter Edit By AI Agent Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * [API Compatibility] logspace/slice_scatter fix Edit By AI Agent - logspace: add requires_grad support in PIR mode - slice_scatter: fix overload signature default values Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * [API Compatibility] slice_scatter multi-axis fix Edit By AI Agent Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * [API Compatibility] skip sparse tests on XPU Edit By AI Agent Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * [API Compatibility] add out/requires_grad tests and XPU skip conditions Edit By AI Agent Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * [API Compatibility] skip test_api_compatibility_part1~5 on XPU Edit By AI Agent Co-Authored-By: Claude Opus 4.6 
<noreply@anthropic.com> * [API Compatibility] add test_api_compatibility_part5.py Edit By AI Agent Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * [API Compatibility] export addmv/addr/histc/negative_/fix/mvlgamma APIs Edit By AI Agent Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: zhouwei25 <zhouwei25@baidu.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* [API Compatibility] Align torch.Tensor.transpose_/ reshape_as * [API Compatibility] Fix
* fix UniformKernelImpl in gpu * fix test
* [Operator Mechanism] Support mm out_dtype for BF16 CUDA Add a narrow CUDA BF16 x BF16 -> FP32 path for paddle.mm(out_dtype=paddle.float32), including schema, infermeta, stride dispatch, fused cuBLAS GEMM, and focused tests. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * [Operator Mechanism] Fix mm out_dtype review issues Use the canonical matmul path for static mm out_dtype handling and keep legacy compatibility attrs limited to supported types. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * [Operator Mechanism] Fix matmul out_dtype CI regressions Keep matmul compatible with unknown symbolic dimensions and legacy matmul_v2 to PIR translation when out_dtype is unset. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * [Operator Mechanism] Harden matmul out_dtype static path Preserve the legacy static mm path when out_dtype is unset and avoid rejecting unknown symbolic matmul dimensions during InferMeta. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * [Operator Mechanism] Fix mm out_dtype static BF16 test Allow the explicit static out_dtype path to pass BF16 variables through Python validation and feed BF16 static test data using the existing uint16 encoding helper. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * [Operator Mechanism] Fix matmul out_dtype PIR compat Add missing default/propagated out_dtype handling for legacy matmul translation, PIR serialization compatibility, and handwritten PIR/DRR matmul rewrites. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * [Operator Mechanism] Harden matmul out_dtype PIR fusions Avoid fusing explicit matmul out_dtype paths in PIR rewrite passes, document BF16 GEMM lda/ldb narrowing safety, and add a legacy matmul_v2 translator regression. 
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * [Operator Mechanism] Fix matmul out_dtype static compat Route static mm out_dtype through matmul_v2 so it reaches the phi matmul kernel, preserve user-provided out tensors, and let legacy matmul_v2 fusion pass compatibility accept only missing/default out_dtype while rejecting explicit output dtype paths. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * [Operator Mechanism] Prune mm out_dtype to dynamic path Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * [Operator Mechanism] Skip mm out_dtype tests on ROCm Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* fix add_n 0-size bug * refine * refine
* [CI] Check C++ coverage before diff filtering * [CI] Add coverage guard smoke test change * Revert "[CI] Add coverage guard smoke test change" This reverts commit f9187c91242d92b399950859ab306f05e0257158. * [CI] Fail fast on coverage diff filtering errors * [CI] Skip diff filtering when coverage input is absent * [CI] Add coverage-info smoke test marker * Revert "[CI] Add coverage-info smoke test marker" This reverts commit eaad8e571cbb37993280b3f142e99ccb2d6e70e9.
… 8.6.0 - part 5 (#79266) * Upgrade Crypto++ to 8.6.0 * Fix Crypto++ CMake flag handling
--------- Co-authored-by: Codex <noreply@openai.com> Co-authored-by: Nyakku Shigure <sigure.qaq@gmail.com>
# Conflicts: # README.md
PR Category
Custom Device
PR Types
New features
Description
Implements the remaining basic operators in the MPS backend (39 new `.mm` kernel files, 6 new test files), bringing the backend from 54 to 96 registered ops.

Implementation notes
- … `@available(macOS 12.0, *)` guard, CPU fallback on missing Metal buffers. CMake picks the files up automatically via the `mps/*.mm` glob.
- … `float` vs `double` attrs); registration names verified against the CPU registrations (`hardsigmoid` vs `hard_shrink` vs `logsigmoid` etc.).
- `remainder` = Python-style modulo (result sign follows divisor); `floor_divide` = floor semantics
- `fmax`/`fmin` ignore NaN (composed from `isNaN` + `select`)
- `logsigmoid` uses the numerically stable `min(x,0) − log(1+exp(−|x|))` form
- `prod`/`any`/`all` fill empty-input outputs with the reduction identity (1 / False / True)
- `pow` (scalar exponent) is separate from the existing `elementwise_pow` (tensor exponent)
- `isnan`/`isinf`/`isfinite`, `any`, `all` set BOOL output dtype; `any`/`all` take bool input
- `where` plumbs the bool condition tensor through `selectWithPredicateTensor`.

Testing
Six standalone test files (`test/test_mps_*.py`), each comparing MPS output against both a hand-written numpy reference and the CPU backend, over multiple shapes, attribute variations, known-value tables, and edge cases (NaN/inf inputs, empty tensors, domain boundaries). No torch dependency.
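For readers who want the operator semantics from the implementation notes spelled out, here is a minimal NumPy sketch of what such hand-written references might look like. The function names (`remainder_ref`, `fmax_ref`, `logsigmoid_ref`, and so on) are hypothetical and are not taken from the PR's test files; only the semantics they encode come from the description.

```python
import numpy as np


def remainder_ref(x, y):
    # np.mod is Python-style modulo: the result takes the sign of the divisor.
    return np.mod(x, y)


def floor_divide_ref(x, y):
    # Quotient rounded toward negative infinity.
    return np.floor_divide(x, y)


def fmax_ref(x, y):
    # NaN-ignoring maximum composed from an isnan mask plus select.
    return np.where(np.isnan(x), y, np.where(np.isnan(y), x, np.maximum(x, y)))


def logsigmoid_ref(x):
    # Stable form min(x, 0) - log(1 + exp(-|x|)): exp never sees a positive argument.
    return np.minimum(x, 0.0) - np.log1p(np.exp(-np.abs(x)))


x = np.array([-7.0, 7.0, -7.0, 7.0])
y = np.array([3.0, 3.0, -3.0, -3.0])
rem = remainder_ref(x, y)
quo = floor_divide_ref(x, y)

a = np.array([np.nan, 1.0, np.nan])
b = np.array([2.0, np.nan, np.nan])
fm = fmax_ref(a, b)  # NaN only where both inputs are NaN

ls = logsigmoid_ref(np.array([-1000.0, 0.0, 1000.0]))

# Reductions over empty inputs return the reduction identity.
empty_prod = np.prod(np.array([], dtype=np.float32))
empty_any = np.any(np.array([], dtype=bool))
empty_all = np.all(np.array([], dtype=bool))
```

The extreme inputs to `logsigmoid_ref` show the point of the stable form: a naive `-log(1 + exp(-x))` would overflow in `exp` for `x = -1000`, whereas this version returns a finite value.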
Does this cause a precision change?
No