FROMLIST: misc: fastrpc: fix use-after-free of fastrpc_user in workqueue context #594

Open
quic-anane wants to merge 2 commits into qualcomm-linux:qcom-6.18.y from quic-anane:wq_v4

@quic-anane

Changes in this PR
This PR contains two commits:

1. Revert previous patch

   - Reverts the earlier fastrpc_user reference counting change.
   - This is done to avoid carrying forward a partially updated implementation and to ensure a clean base.

2. Apply latest patch (v4) from mailing list

   - Applies the full, updated version of the fix.
   - Incorporates all revisions from earlier versions.
   - Ensures correct ordering of fastrpc_user_put() and fastrpc_channel_ctx_put().
   - Consolidates teardown logic into fastrpc_user_free().
   - Fixes use-after-free scenarios in workqueue and error paths.

CRs-Fixed: 4502232

Anandu Krishnan E added 2 commits May 19, 2026 02:35
…ser structure"

This reverts commit 14e526a.

This change corresponds to the initial (v1) version shared with the
upstream community.

Revert it to apply the complete v4 revision, which includes additional
fixes and updates not present in the earlier version. The v4 revision
also contains these changes.
…eue context

There is a race between fastrpc_device_release() and the workqueue
that processes DSP responses. When the user closes the file descriptor,
fastrpc_device_release() frees the fastrpc_user structure. Concurrently,
an in-flight DSP invocation can complete and fastrpc_rpmsg_callback()
schedules context cleanup via schedule_work(&ctx->put_work). If the
workqueue runs fastrpc_context_free() in parallel with or after
fastrpc_device_release() has freed the user structure, it dereferences
the freed fastrpc_user. Depending on the state of the context at the
time of the race, any one of the following accesses can be hit:

 1. fastrpc_buf_free() calls fastrpc_ipa_to_dma_addr(buf->fl->cctx, ...)
    to strip the SID bits from the stored IOVA before passing the
    physical address to dma_free_coherent().

 2. fastrpc_free_map() reads map->fl->cctx->vmperms[0].vmid to
    reconstruct the source permission bitmask needed for the
    qcom_scm_assign_mem() call that returns memory from the DSP VM
    back to HLOS.

 3. fastrpc_free_map() acquires map->fl->lock to safely remove the
    map node from the fl->maps list.

The resulting use-after-free manifests as:

  pc : fastrpc_buf_free+0x38/0x80 [fastrpc]
  lr : fastrpc_context_free+0xa8/0x1b0 [fastrpc]
  fastrpc_context_free+0xa8/0x1b0 [fastrpc]
  fastrpc_context_put_wq+0x78/0xa0 [fastrpc]
  process_one_work+0x180/0x450
  worker_thread+0x26c/0x388

Add kref-based reference counting to fastrpc_user. Have each invoke
context take a reference on the user at allocation time and release it
when the context is freed. Release the initial reference in
fastrpc_device_release() at file close. Move the teardown of the user
structure — freeing pending contexts, maps, mmaps, and the channel
context reference — into the kref release callback fastrpc_user_free(),
so that it runs only when the last reference is dropped, regardless of
whether that happens at device close or after the final in-flight
context completes.

Link: https://lore.kernel.org/all/20260518203507.3754994-1-anandu.e@oss.qualcomm.com/
Fixes: 6cffd79 ("misc: fastrpc: Add support for dmabuf exporter")
Cc: stable@kernel.org
Signed-off-by: Anandu Krishnan E <anandu.e@oss.qualcomm.com>
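
The lifetime rule described in the commit message above can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the driver code: `struct ref` and the `ref_*` helpers stand in for the kernel's `struct kref`, `kref_init()`, `kref_get()` and `kref_put()`, and `struct user`, `struct invoke_ctx`, `user_open()`, `user_release()`, `ctx_alloc()`, `ctx_free()` and `demo_close_then_complete()` are hypothetical names chosen for the example. It shows that when the file is closed while a context is still in flight, the user object is released only once the context drops its reference.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace stand-in for the kernel's struct kref; illustration only. */
struct ref {
	atomic_int count;
};

static void ref_init(struct ref *r) { atomic_init(&r->count, 1); }

static void ref_get(struct ref *r)
{
	/* Taking a reference on an already released object is a bug. */
	assert(atomic_load(&r->count) > 0);
	atomic_fetch_add(&r->count, 1);
}

/* Drop one reference; run release() only when the count reaches zero. */
static void ref_put(struct ref *r, void (*release)(struct ref *))
{
	if (atomic_fetch_sub(&r->count, 1) == 1)
		release(r);
}

/* Hypothetical, stripped-down model of fastrpc_user. */
struct user {
	struct ref refcount;
	int *free_count;	/* test hook: counts calls to user_free() */
};

/* Hypothetical model of an invoke context that pins its user. */
struct invoke_ctx {
	struct user *fl;
};

/* Release callback: the only place where the user is torn down. */
static void user_free(struct ref *r)
{
	struct user *u = (struct user *)((char *)r - offsetof(struct user, refcount));

	(*u->free_count)++;
	free(u);
}

static struct user *user_open(int *free_count)
{
	struct user *u = malloc(sizeof(*u));

	if (!u)
		return NULL;
	ref_init(&u->refcount);		/* initial reference, owned by the open file */
	u->free_count = free_count;
	return u;
}

/* Models file close: drops only the initial reference. */
static void user_release(struct user *u)
{
	ref_put(&u->refcount, user_free);
}

static struct invoke_ctx *ctx_alloc(struct user *u)
{
	struct invoke_ctx *ctx = malloc(sizeof(*ctx));

	if (!ctx)
		return NULL;
	ref_get(&u->refcount);		/* the context pins the user while in flight */
	ctx->fl = u;
	return ctx;
}

/* Models the deferred cleanup that may run after file close. */
static void ctx_free(struct invoke_ctx *ctx)
{
	struct user *u = ctx->fl;

	free(ctx);
	ref_put(&u->refcount, user_free);
}

/* Close the file while one context is still in flight, then finish it. */
int demo_close_then_complete(int *after_close, int *after_ctx_free)
{
	int free_count = 0;
	struct user *u = user_open(&free_count);
	struct invoke_ctx *ctx = u ? ctx_alloc(u) : NULL;

	if (!ctx) {
		if (u)
			user_release(u);
		return -1;
	}
	user_release(u);
	*after_close = free_count;	/* user is still alive: context holds a reference */
	ctx_free(ctx);
	*after_ctx_free = free_count;	/* last reference dropped: freed exactly once */
	return 0;
}
```

Calling `demo_close_then_complete()` reports the number of teardown calls observed right after the close and again after the context is freed. In this model the teardown is tied to the last `ref_put()`, so it does not matter whether the close or the context completion happens last.
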
quic-anane requested review from a team, idlethread, quic-tingweiz and sgaud-quic May 18, 2026 21:17
@qswat-orbit-external

Merge Check Failed: CR Not Eligible for Merge

CR 4502232 is not eligible for merge.

The parent software image for kernel.qli.2.0 is not development complete.

Entity: kernel.qli.2.0
CR: 4502232
Reason: CR_CANNOT_MERGE

Please ensure the CR passes both CCT (ComponentChangeTasks) and ICT (Integration Change Tasks) validations.

@qcomlnxci

Test Matrix

| Test Case | lemans-evk | monaco-evk | qcs615-ride | qcs6490-rb3gen2 | qcs8300-ride | qcs9100-ride-r3 | x1e80100-crd |
| --- | --- | --- | --- | --- | --- | --- | --- |
| BT_FW_KMD_Service | ✅ Pass | ◻️ | ✅ Pass | ◻️ | ✅ Pass | ✅ Pass | ◻️ |
| BT_ON_OFF | ✅ Pass | ◻️ | ✅ Pass | ◻️ | ✅ Pass | ✅ Pass | ◻️ |
| BT_SCAN | ✅ Pass | ◻️ | ✅ Pass | ◻️ | ✅ Pass | ✅ Pass | ◻️ |
| CPUFreq_Validation | ✅ Pass | ◻️ | ✅ Pass | ◻️ | ✅ Pass | ✅ Pass | ◻️ |
| CPU_affinity | ✅ Pass | ◻️ | ✅ Pass | ◻️ | ✅ Pass | ✅ Pass | ◻️ |
| DSP_AudioPD | ✅ Pass | ◻️ | ⚠️ skip | ◻️ | ✅ Pass | ⚠️ skip | ◻️ |
| Ethernet | ✅ Pass | ◻️ | ⚠️ skip | ◻️ | ⚠️ skip | ⚠️ skip | ◻️ |
| Freq_Scaling | ✅ Pass | ◻️ | ✅ Pass | ◻️ | ✅ Pass | ✅ Pass | ◻️ |
| GIC | ✅ Pass | ◻️ | ✅ Pass | ◻️ | ✅ Pass | ✅ Pass | ◻️ |
| IPA | ✅ Pass | ◻️ | ✅ Pass | ◻️ | ✅ Pass | ✅ Pass | ◻️ |
| Interrupts | ✅ Pass | ◻️ | ✅ Pass | ◻️ | ✅ Pass | ✅ Pass | ◻️ |
| OpenCV | ✅ Pass | ◻️ | ✅ Pass | ◻️ | ✅ Pass | ✅ Pass | ◻️ |
| PCIe | ✅ Pass | ◻️ | ✅ Pass | ◻️ | ✅ Pass | ✅ Pass | ◻️ |
| Probe_Failure_Check | ❌ Fail | ◻️ | ✅ Pass | ◻️ | ❌ Fail | ❌ Fail | ◻️ |
| RMNET | ✅ Pass | ◻️ | ✅ Pass | ◻️ | ✅ Pass | ✅ Pass | ◻️ |
| UFS_Validation | ✅ Pass | ◻️ | ✅ Pass | ◻️ | ✅ Pass | ✅ Pass | ◻️ |
| USBHost | ❌ Fail | ◻️ | ❌ Fail | ◻️ | ❌ Fail | ❌ Fail | ◻️ |
| WiFi_Firmware_Driver | ❌ Fail | ◻️ | ❌ Fail | ◻️ | ✅ Pass | ✅ Pass | ◻️ |
| WiFi_OnOff | ✅ Pass | ◻️ | ⚠️ skip | ◻️ | ✅ Pass | ✅ Pass | ◻️ |
| adsp_remoteproc | ✅ Pass | ◻️ | ✅ Pass | ◻️ | ✅ Pass | ❌ Fail | ◻️ |
| cdsp_remoteproc | ✅ Pass | ◻️ | ✅ Pass | ◻️ | ✅ Pass | ❌ Fail | ◻️ |
| gpdsp_remoteproc | ✅ Pass | ◻️ | ⚠️ skip | ◻️ | ✅ Pass | ❌ Fail | ◻️ |
| hotplug | ✅ Pass | ◻️ | ✅ Pass | ◻️ | ✅ Pass | ✅ Pass | ◻️ |
| irq | ✅ Pass | ◻️ | ✅ Pass | ◻️ | ✅ Pass | ✅ Pass | ◻️ |
| kaslr | ✅ Pass | ◻️ | ✅ Pass | ◻️ | ✅ Pass | ✅ Pass | ◻️ |
| pinctrl | ✅ Pass | ◻️ | ✅ Pass | ◻️ | ✅ Pass | ✅ Pass | ◻️ |
| qcom_hwrng | ✅ Pass | ◻️ | ✅ Pass | ◻️ | ✅ Pass | ✅ Pass | ◻️ |
| remoteproc | ✅ Pass | ◻️ | ✅ Pass | ◻️ | ✅ Pass | ❌ Fail | ◻️ |
| rngtest | ✅ Pass | ◻️ | ✅ Pass | ◻️ | ✅ Pass | ✅ Pass | ◻️ |
| shmbridge | ❌ Fail | ◻️ | ❌ Fail | ◻️ | ❌ Fail | ❌ Fail | ◻️ |
| smmu | ❌ Fail | ◻️ | ❌ Fail | ◻️ | ✅ Pass | ❌ Fail | ◻️ |
| watchdog | ✅ Pass | ◻️ | ✅ Pass | ◻️ | ✅ Pass | ✅ Pass | ◻️ |
| wpss_remoteproc | ✅ Pass | ◻️ | ⚠️ skip | ◻️ | ✅ Pass | ✅ Pass | ◻️ |

@knaveen-qc

LAVA Failed Case Triage Summary

PR: #594

Job 101881 | SoC qcs9100-ride

LAVA job: https://lava-oss.qualcomm.com/scheduler/job/101881

Failed test cases in LAVA job 101881 (SoC: qcs9100-ride).

  Case 1: Remoteproc Boot Failure: SCM PAS init returned -EINVAL (untested machine)
  1. Failed case: Remoteproc Boot Failure: SCM PAS init returned -EINVAL (untested machine)
  2. Root cause: qcom_scm skipped qseecom initialization for the qcs9100/sa8775p machine ID ("qseecom: untested machine, skipping"), causing qcom_scm_pas_init_image() to return -EINVAL (-22) for all DSP subsystems; both cdsp0 (remoteproc2) and cdsp1 (remoteproc3) fail at the PAS firmware-init SCM call and remain offline.
  3. Possible fix: Add the qcs9100/sa8775p machine ID to the qseecom-tested-machines allowlist in drivers/firmware/qcom/qcom_scm.c (or the equivalent qseecom machine table), then re-run the CI job to confirm all DSP remoteprocs reach running state. This failure is pre-existing and not introduced by the PR.
  4. Detail analysis attachment: failed_case_job101881_1_detailed.md
  Case 2: Remoteproc Boot Failure: PAS Firmware Initialization Error (EINVAL)
  1. Failed case: Remoteproc Boot Failure: PAS Firmware Initialization Error (EINVAL)
  2. Root cause: On qcs9100-ride (sa8775p), qcom_q6v5_pas returns error -22 (EINVAL) from qcom_scm_pas_init_image() for all DSP subsystems (ADSP remoteproc4, CDSP0/1, GPDSP0/1) at boot time (~7.6s), indicating the TrustZone PAS authentication SMC call is rejected. This is consistent with a kernel/DSP firmware image misalignment where the flashed .mbn images do not match the signing expectations of the running kernel build; the PR patch (drivers/misc/fastrpc.c only) is entirely unrelated.
  3. Possible fix: Re-flash the board with a fully aligned build (kernel image and DSP .mbn firmware from the same meta/build drop) and re-trigger the CI job; if the error persists on an aligned build, verify that qcom_scm_pas_supported() returns true for the ADSP/CDSP/GPDSP PAS IDs on sa8775p in the kernel's SCM call availability table.
  4. Detail analysis attachment: failed_case_job101881_2_detailed.md
  Case 3: Remoteproc Boot Failure: PAS firmware initialization error (-EINVAL)
  1. Failed case: Remoteproc Boot Failure: PAS firmware initialization error (-EINVAL)
  2. Root cause: Both gpdsp0 (remoteproc0, 20c00000.remoteproc) and gpdsp1 (remoteproc1, 21c00000.remoteproc) fail to boot on the qcs9100-ride (sa8775p) platform because qcom_q6v5_pas returns error -22 (EINVAL) during PAS/SCM firmware initialization of qcom/sa8775p/gpdsp0.mbn and gpdsp1.mbn; the same error affects all remoteprocs (cdsp0, cdsp1, adsp) and GPU/video firmware, indicating a platform-wide firmware authentication or memory-region layout mismatch between the flashed firmware package and the kernel's PAS expectations. This failure is pre-existing and unrelated to the PR under test (fastrpc only).
  3. Possible fix: This failure is not introduced by PR #594 (fastrpc changes only); re-trigger the CI job with a firmware package aligned to kernel 6.18.25-gf71491fcd9b4 for the sa8775p/qcs9100-ride platform, or investigate the PAS/SCM EINVAL path in qcom_q6v5_pas to identify the specific metadata field rejected by TrustZone for this firmware+kernel combination.
  4. Detail analysis attachment: failed_case_job101881_3_detailed.md
  Case 4: Remoteproc Boot Failure: PAS/SCM authentication returns EINVAL (qseecom untested machine)
  1. Failed case: Remoteproc Boot Failure: PAS/SCM authentication returns EINVAL (qseecom untested machine)
  2. Root cause: On qcs9100-ride (sa8775p), qcom_scm logs "qseecom: untested machine, skipping" at boot, causing qcom_scm_pas_init_image() to return -EINVAL (-22) for all 5 q6v5 PAS subsystems (gpdsp0, gpdsp1, cdsp, cdsp1, adsp); every remoteproc stays offline and the test finds 0 of 5 expected subsystems in running state.
  3. Possible fix: Add the qcs9100/sa8775p machine ID to the qseecom tested-machine allowlist in drivers/firmware/qcom/qcom_scm.c so that qcom_scm_pas_init_image() can authenticate DSP firmware images via TrustZone; verify by re-running the LAVA job and confirming all 5 remoteprocs reach running state.
  4. Detail analysis attachment: failed_case_job101881_4_detailed.md
  Case 5: Probe_Failure_Check: Driver Probe Failure (firmware dependency + DT mismatch)
  1. Failed case: Probe_Failure_Check: Driver Probe Failure (firmware dependency + DT mismatch)
  2. Root cause: Two pre-existing, PR-unrelated errors in dmesg triggered the Probe_Failure_Check scanner: (1) "faux_driver regulatory: Direct firmware load for regulatory.db failed with error -2" (regulatory.db is absent from the rootfs image); (2) "Aquantia AQR115C stmmac-0:08: probe with driver Aquantia AQR115C failed with error -22" (the firmware-name DT property is missing for the PHY at MDIO address 0x08 on the qcs9100-ride board, causing the Aquantia driver to return -EINVAL from probe).
  3. Possible fix: Add regulatory.db (from wireless-regdb) to the rootfs image and add/correct the firmware-name DT property in the qcs9100-ride DTS for the Aquantia PHY node at MDIO address 0x08; alternatively, add these two known-platform errors to the Probe_Failure_Check CI allowlist since they are pre-existing and unrelated to PR #594.
  4. Detail analysis attachment: failed_case_job101881_5_detailed.md
  Case 6: smmu
  1. Failed case: smmu
  2. Root cause: The smmu test script asserts that aa00000.video-codec (Video) and interconnect-lpass-ag-noc (Audio) are attached to IOMMU groups, but neither device appears in /sys/kernel/iommu_groups. aa00000.video-codec never attached because the qcom-iris driver failed firmware init (error -22 initializing firmware qcom/vpu/vpu30_p4_s6_16mb.mbn) before IOMMU attachment could complete, and interconnect-lpass-ag-noc is absent because the LPASS/ADSP subsystem is offline on this qcs9100-ride board; this is a pre-existing platform condition unrelated to the PR.
  3. Possible fix: This failure is pre-existing and not introduced by PR #594 (which only modifies drivers/misc/fastrpc.c for a use-after-free fix); update the smmu test script's critical-master checklist to either skip aa00000.video-codec and interconnect-lpass-ag-noc when their upstream dependencies (VPU firmware / ADSP remoteproc) are known-offline on qcs9100-ride, or mark them as non-fatal warnings rather than hard failures.
  4. Detail analysis attachment: failed_case_job101881_6_detailed.md
  Case 7: USBHost
  1. Failed case: USBHost
  2. Root cause: The USBHost test on qcs9100-ride failed because only USB root hubs (Linux Foundation 2.0/3.0 virtual hub devices) were enumerated on all three buses: no physical USB peripheral device was connected to the board's host-mode USB ports at the time of the test. The test script explicitly checks for non-hub devices and fails with "Only USB hubs detected, no functional USB devices."
  3. Possible fix: Attach a physical USB peripheral (e.g., USB mass storage or HID device) to one of the qcs9100-ride board's USB host ports in the LAVA lab setup, or mark this test as infrastructure-dependent and skip it when no USB device is wired to the DUT.
  4. Detail analysis attachment: failed_case_job101881_7_detailed.md
  Case 8: shmbridge: Suppressed (Known Benign: shmbridge CI noise)
  1. Failed case: shmbridge: Suppressed (Known Benign: shmbridge CI noise)
  2. Root cause: The shmbridge test unconditionally fails in this CI environment due to known infrastructure noise; per lava-known-benign-failures.md Rule 1, any shmbridge result:fail is always suppressed and does not indicate a kernel regression. The test script detected qcom_scm-related errors in the kernel log (triggered by the presence of qcom_scm.download_mode=1 in the kernel command line, which is normal boot configuration on qcs9100-ride), but this is a known false-positive trigger in the test harness.
  3. Possible fix: No fix required — suppress this failure as known benign CI noise; no kernel or PR change is needed.
  4. Detail analysis attachment: failed_case_job101881_8_detailed.md
Job 101883 | SoC qcs8300-ride

LAVA job: https://lava-oss.qualcomm.com/scheduler/job/101883

Failed test cases in LAVA job 101883 (SoC: qcs8300-ride).

  Case 1: Probe_Failure_Check
  1. Failed case: Probe_Failure_Check
  2. Root cause: Two pre-existing, PR-unrelated probe/firmware errors are caught by the test's dmesg scan: (1) "faux_driver regulatory: Direct firmware load for regulatory.db failed with error -2" (the regulatory.db firmware file is absent from the rootfs for this kernel version), and (2) "Aquantia AQR115C stmmac-0:08: probe with driver Aquantia AQR115C failed with error -22" (the Aquantia AQR115C PHY driver fails because the firmware-name DT property is missing or invalid (-EINVAL) for the stmmac-0:08 MDIO bus node on qcs8300-ride).
  3. Possible fix: No change to the PR is required; fix the qcs8300-ride board DT/rootfs: (1) add the regulatory.db firmware file to the test rootfs image, and (2) add or correct the firmware-name property in the stmmac-0:08 PHY node in the qcs8300 DTS so the Aquantia AQR115C driver can read it without returning -EINVAL.
  4. Detail analysis attachment: failed_case_job101883_1_detailed.md
  Case 2: USBHost
  1. Failed case: USBHost
  2. Root cause: No functional USB peripheral device is physically connected to the USB host port of the qcs8300-ride board in the LAVA lab; the xHCI controller at 0x0a400000 probed correctly and registered Bus 001 with one HS port, but lsusb at test time shows only the Linux virtual root hub (1d6b:0002), so no downstream device was enumerated.
  3. Possible fix: Attach a functional USB peripheral device (e.g., USB storage stick or USB-to-serial adapter) to the USB host port of the qcs8300-ride board in the LAVA lab; this is a board/lab infrastructure gap unrelated to the PR under test.
  4. Detail analysis attachment: failed_case_job101883_2_detailed.md
  Case 3: shmbridge
  1. Failed case: shmbridge
  2. Root cause: The shmbridge test is a known CI infrastructure false positive: it failed because the test script matched the literal string qcom_scm.download_mode=1 in the kernel command line as a "qcom_scm-related error", not because of any actual SCM or shmbridge fault; this failure is unconditionally suppressed per lava-known-benign-failures.md Rule 1.
  3. Possible fix: No fix required; suppress this result as known benign CI noise per Rule 1. No kernel change is needed and this failure must not block PR #594.
  4. Detail analysis attachment: failed_case_job101883_3_detailed.md
  Case 4: 0_qcom-next-ci-premerge-tests
  1. Failed case: 0_qcom-next-ci-premerge-tests
  2. Root cause: The top-level test case is marked failed by LAVA because three sub-tests reported FAIL: (1) Probe_Failure_Check: regulatory.db firmware missing and Aquantia AQR115C PHY probe failure (-EINVAL) on qcs8300-ride; (2) shmbridge: test script false-positively matched qcom_scm.download_mode=1 in the kernel command line as a qcom_scm error; (3) USBHost: only USB hubs detected, no functional USB devices connected in the lab. None of these failures are introduced by the PR (which only modifies drivers/misc/fastrpc.c).
  3. Possible fix: Suppress the three known false/infra failures in the qcom-linux-testkit test plan for qcs8300-ride: add regulatory.db and Aquantia AQR115C to the Probe_Failure_Check allowlist, fix the shmbridge script to exclude kernel cmdline matches when scanning for qcom_scm errors, and mark USBHost as skipped when no USB devices are present in the lab setup.
  4. Detail analysis attachment: failed_case_job101883_4_detailed.md
Job 101885 | SoC qcs6490-rb3gen2

LAVA job: https://lava-oss.qualcomm.com/scheduler/job/101885

Failed test cases in LAVA job 101885 (SoC: qcs6490-rb3gen2).

  Case 1: Build Load Failure: HTTP download timeout
  1. Failed case: Build Load Failure: HTTP download timeout
  2. Root cause: Result: Build Load Failure. The LAVA dispatcher's http-download action (stage 1.2.1) stalled mid-transfer at ~20% (186 MB of 932 MB) of the rootfs artifact from S3 (qcom-multimedia-image-rb3gen2-core-kit.rootfs.qcomflash.tar.gz) and was killed after exhausting the full 1797-second timeout; exact error: "http-download timed out after 1797 seconds" with error_type: Infrastructure.
  3. Possible fix: Re-trigger the CI job; if the timeout recurs, increase the http-download timeout beyond 1797 s and configure download-retry with at least 2–3 attempts in the LAVA job definition to survive transient S3/network stalls.
  4. Detail analysis attachment: failed_case_job101885_1_detailed.md
  Case 2: Build Load Failure: HTTP download timeout
  1. Failed case: Build Load Failure: HTTP download timeout
  2. Root cause: Result: Build Load Failure. The LAVA dispatcher's http-download action stalled mid-transfer at ~20% (186 MB of 932 MB) while downloading the rootfs artifact qcom-multimedia-image-rb3gen2-core-kit.rootfs.qcomflash.tar.gz from AWS S3, and was killed after exhausting the full 1797-second (≈30 min) timeout; error_type: Infrastructure confirms this is a lab-side network connectivity issue, not a kernel or PR defect.
  3. Possible fix: Re-trigger the CI job; if the timeout recurs, increase the http-download timeout and download-retry block timeout in the LAVA job definition (e.g. from ~30 min to 45–60 min), and investigate intermittent S3/network connectivity from the LAVA worker hosting the qcs6490-rb3gen2 board.
  4. Detail analysis attachment: failed_case_job101885_2_detailed.md
  Case 3: downloads
  1. Failed case: downloads
  2. Root cause: Could not be determined confidently from available logs.
  3. Possible fix: Re-trigger the CI job; if the timeout recurs, increase the http-download timeout beyond 1797 s and add a download-retry count > 1 in the LAVA job definition to handle transient S3 throughput degradation.
  4. Detail analysis attachment: failed_case_job101885_3_detailed.md
  Case 4: Build Load Failure: HTTP download timeout
  1. Failed case: Build Load Failure: HTTP download timeout
  2. Root cause: Result: Build Load Failure. The http-download stage (level 1.2.1) timed out after 1797 seconds while downloading the 932 MB rootfs artifact qcom-multimedia-image-rb3gen2-core-kit.rootfs.qcomflash.tar.gz from S3; the transfer stalled at ~20% (186 MB) and never resumed, exhausting the 30-minute download-retry budget with only 1 attempt configured. Exact error: "http-download timed out after 1797 seconds".
  3. Possible fix: Re-trigger the CI job; if the timeout recurs, increase the http-download timeout beyond 1797 s and configure download-retry with at least 2–3 attempts in the LAVA job definition to survive transient S3 stalls.
  4. Detail analysis attachment: failed_case_job101885_4_detailed.md
Job 101886 | SoC x1e80100

LAVA job: https://lava-oss.qualcomm.com/scheduler/job/101886

Failed test cases in LAVA job 101886 (SoC: x1e80100).

  Case 1: boot-fastboot
  1. Failed case: boot-fastboot
  2. Root cause: The x1e80100 CRD board's ABL (BOOT.MXF.2.4-00541-HAMOA-1) rejected the boot.img during fastboot RAM-boot with remote: 'Failed to load/authenticate boot image: Load Error'. The image was transferred successfully (234920 KB, OKAY) but the firmware's image authentication/verification step failed, indicating the boot.img is unsigned or signed with an untrusted key, or is structurally incompatible with the ABL version on this board.
  3. Possible fix: Verify that the CI pipeline signs boot.img with the correct key trusted by this board's ABL, or confirm the boot.img is built with the correct format (e.g., Android boot image v2/v3 header) expected by BOOT.MXF.2.4-00541-HAMOA-1; if the image format/signing is correct, re-trigger the job to rule out a transient build artifact corruption.
  4. Detail analysis attachment: failed_case_job101886_1_detailed.md
  Case 2: Build Load Failure: Fastboot boot image authentication rejected by firmware
  1. Failed case: Build Load Failure: Fastboot boot image authentication rejected by firmware
  2. Root cause: Result: Build Load Failure; the fastboot boot stage failed because the x1e80100 ABL/UEFI firmware (BOOT.MXF.2.4-00541-HAMOA-1) rejected the boot image with remote: 'Failed to load/authenticate boot image: Load Error'. The boot.img was built with mkbootimg --header_version 2, which is incompatible with the boot image header version expected by the x1e80100 (Snapdragon X Elite) ABL on this board.
  3. Possible fix: Update the postprocess.sh mkbootimg invocation in the LAVA job definition for x1e80100 from --header_version 2 to the correct header version required by the x1e80100 ABL (verify with fastboot getvar all on the board, typically --header_version 4 for Snapdragon X Elite); re-trigger the CI job after the fix to confirm the board boots successfully.
  4. Detail analysis attachment: failed_case_job101886_2_detailed.md
  Case 3: Build Load Failure: Fastboot boot image authentication rejected by ABL
  1. Failed case: Build Load Failure: Fastboot boot image authentication rejected by ABL
  2. Root cause: Result: Build Load Failure. The x1e80100 CRD board's ABL rejected the LAVA-assembled boot.img at the fastboot boot stage on all 3 attempts with FAILED (remote: 'Failed to load/authenticate boot image: Load Error'), so the kernel never booted; this is not caused by the PR's fastrpc driver changes.
  3. Possible fix: Re-trigger the CI job to rule out a transient ABL state; if the failure recurs, verify that the boot.img produced by LAVA's mkbootimg step is signed with the key accepted by this board's ABL secure-boot policy, or confirm the board is in an unlocked/test-signed state that permits unsigned fastboot boot images on the x1e80100 CRD platform.
  4. Detail analysis attachment: failed_case_job101886_3_detailed.md
Job 101887 | SoC monaco-evk

LAVA job: https://lava-oss.qualcomm.com/scheduler/job/101887

Failed test cases in LAVA job 101887 (SoC: monaco-evk).

  Case 1: Build Load Failure: HTTP download timeout
  1. Failed case: Build Load Failure: HTTP download timeout
  2. Root cause: Result: Build Load Failure. The initramfs artifact download (stage 1.4.1, initramfs-kerneltest-full-image-qcom-armv8a.cpio.gz, 146 MB from S3) stalled at 45% (65 MB) for ~2 min 36 s before the 300 s http-download timeout expired with "http-download timed out after 300 seconds"; the download-retry block had only 1 attempt, so no retry was made and the job was aborted before the monaco-evk board was ever powered on.
  3. Possible fix: Re-trigger the CI job; if the timeout recurs, increase the http-download timeout from 300 s to 600 s and the download-retry block timeout from ~9 min 45 s to 15 min in the LAVA job definition to accommodate the 146 MB initramfs over a congested S3 link.
  4. Detail analysis attachment: failed_case_job101887_1_detailed.md
  Case 2: Build Load Failure: HTTP download timeout
  1. Failed case: Build Load Failure: HTTP download timeout
  2. Root cause: Result: Build Load Failure, caused by a transient network throughput collapse during the download-retry (level 1.4) stage; the LAVA worker's HTTP download of initramfs-kerneltest-full-image-qcom-armv8a.cpio.gz (146 MB from AWS S3) stalled at ~50% after ~21 s of healthy transfer (~3.1 MB/s), then crawled to ~0.05 MB/s, exhausting the hard 300 s http-download timeout with the exact error "http-download timed out after 300 seconds".
  3. Possible fix: Re-trigger the CI job; if the timeout recurs, increase the http-download timeout from 300 s to 600 s and the download-retry block timeout from ~9 m 45 s to 15 min in the LAVA job definition, and configure download-retry with at least 2 attempts to survive transient S3 throttling.
  4. Detail analysis attachment: failed_case_job101887_2_detailed.md
  Case 3: Build Load Failure: HTTP download timeout
  1. Failed case: Build Load Failure: HTTP download timeout
  2. Root cause: Result: Build Load Failure. The http-download step 1.4.1 timed out after exactly 300 seconds while fetching the 146 MB initramfs ramdisk (initramfs-kerneltest-full-image-qcom-armv8a.cpio.gz) from the meta-qcom S3 bucket; transfer stalled at ~50% (73 MB) with a ~2.5-minute gap between progress ticks, indicating a transient network congestion or S3 throttle event on the LAVA worker's outbound connection. Error: "http-download timed out after 300 seconds".
  3. Possible fix: Re-trigger the CI job; if the timeout recurs, increase the http-download timeout from 300 s to 600 s and the download-retry block timeout from 00:09:45 to 00:15:00 in the LAVA job definition, and consider adding a retry count (retries: 2) for the initramfs download step.
  4. Detail analysis attachment: failed_case_job101887_3_detailed.md
  Case 4: Build Load Failure: HTTP download timeout
  1. Failed case: Build Load Failure: HTTP download timeout
  2. Root cause: Result: Build Load Failure. The http-download action (step 1.4.1) for the initramfs artifact initramfs-kerneltest-full-image-qcom-armv8a.cpio.gz (146 MB) stalled mid-transfer at ~50% after a 2m36s gap with no data, exhausting the 300 s per-attempt timeout; exact error: "http-download timed out after 300 seconds".
  3. Possible fix: Re-trigger the CI job; if the timeout recurs, increase the http-download timeout from 300 s to 600 s and the download-retry block timeout from the current ~10 min to 15 min in the LAVA job definition, and configure download-retry with at least 2 attempts to survive transient S3 connection drops.
  4. Detail analysis attachment: failed_case_job101887_4_detailed.md
Job 101888 | SoC qcs615-ride

LAVA job: https://lava-oss.qualcomm.com/scheduler/job/101888

Failed test cases in LAVA job 101888 (SoC: qcs615-ride).

  Case 1: smmu
  1. Failed case: smmu
  2. Root cause: The smmu test script asserts that aa00000.video-codec:video-decoder and aa00000.video-codec:video-encoder sub-devices each have their own IOMMU group entry, but on qcs615-ride with the non-legacy Venus binding these sub-devices are not registered as independent IOMMU clients, causing two "[FAIL] Critical master is missing iommu_group attachment" assertions and a final FAIL=2 result.
  3. Possible fix: Update the qcs615-ride DT (arch/arm64/boot/dts/qcom/qcs615-ride.dts) to add iommu-map entries for the video-decoder and video-encoder child nodes of aa00000.video-codec, or update the smmu test script to accept the non-legacy Venus binding where sub-device IOMMU groups are managed through the parent device node.
  4. Detail analysis attachment: failed_case_job101888_1_detailed.md
  Case 2: USBHost
  1. Failed case: USBHost
  2. Root cause: No physical USB device is connected to the qcs615-ride board's USB host port in the LAVA lab infrastructure: the xHCI controller (at 0x0a800000) probed successfully with 1 port detected, but lsusb enumeration returned only the Linux Foundation 2.0 root hub (1d6b:0002), indicating the test harness has no USB peripheral attached to the board.
  3. Possible fix: Attach a functional USB peripheral (e.g. USB storage or HID device) to the qcs615-ride board's USB host port in the LAVA lab; if the board is correctly cabled, re-trigger the CI job to confirm — this failure is a lab infrastructure/hardware setup issue unrelated to the PR under test.
  4. Detail analysis attachment: failed_case_job101888_2_detailed.md
  Case 3: shmbridge
  1. Failed case: shmbridge
  2. Root cause: The shmbridge test is a known CI infrastructure false positive on qcs615-ride: the test script flags qcom_scm-related errors found in the kernel log (recurring fastrpc glink-edge DSP errors: err: -1), but these are pre-existing background noise unrelated to the SCM/shmbridge subsystem and do not indicate any kernel regression introduced by this PR.
  3. Possible fix: No fix required — suppress this result per Known Benign Failure Rule 1; the shmbridge test always produces false failures in this CI environment regardless of kernel state.
  4. Detail analysis attachment: failed_case_job101888_3_detailed.md
  Case 4: WiFi_Firmware_Driver
  1. Failed case: WiFi_Firmware_Driver
  2. Root cause: The ath11k WiFi driver modules (ath11k, ath11k_pci, ath11k_ahb) could not be loaded because the running kernel (6.18.25-gf71491fcd9b4) does not match the module tree on disk (6.19.0-00717-ge3aded47f3e5), causing a kernel/module version mismatch; as a result no WiFi interface was created and WiFi_OnOff was skipped (not passed), so the known-benign suppression rule does not apply and the failure is genuine.
  3. Possible fix: Rebuild or redeploy the rootfs so that the installed kernel modules in /lib/modules/ match the kernel image version (6.18.25-gf71491fcd9b4) being booted, ensuring ath11k/ath11k_pci/ath11k_ahb modules load successfully on qcs615-ride.
  4. Detail analysis attachment: failed_case_job101888_4_detailed.md
  Case 5: 0_qcom-next-ci-premerge-tests
  1. Failed case: 0_qcom-next-ci-premerge-tests
  2. Root cause: The overall test run was marked failed by LAVA's "unfinished test run" mechanism because result_parse.sh did not emit a final LAVA_SIGNAL_TESTCASE for the suite; the underlying sub-test failures (smmu: video-codec IOMMU sub-device missing iommu_group; shmbridge: tz_armv8_smc_call failed TzStatus=0xFFFFFFFF at boot; USBHost: no functional USB devices; WiFi_Firmware_Driver: ath11k module absent) are all pre-existing platform issues on this qcs615-ride board, unrelated to the PR's fastrpc.c reference-counting changes.
  3. Possible fix: Re-trigger the CI job to confirm reproducibility; the failures are pre-existing board/platform issues (IOMMU sub-device registration, TZ SMC boot error, USB hardware, missing ath11k module) that are not introduced by this PR and require separate platform/infra investigation on the qcs615-ride LAVA worker.
  4. Detail analysis attachment: failed_case_job101888_5_detailed.md
Job 101889 | SoC lemans-evk

LAVA job: https://lava-oss.qualcomm.com/scheduler/job/101889

Failed test cases in LAVA job 101889 (SoC: lemans-evk).

  Case 1: Probe_Failure_Check - Firmware Dependency Failure (missing `regulatory.db`)
  1. Failed case: Probe_Failure_Check - Firmware Dependency Failure (missing regulatory.db)
  2. Root cause: The Probe_Failure_Check test script detected "faux_driver regulatory: Direct firmware load for regulatory.db failed with error -2" in dmesg, meaning the cfg80211 regulatory database is absent from /lib/firmware on the lemans-evk rootfs. The two Bluetooth firmware failures (qca/wcnhpbtfw21.tlv, qca/hpbtfw21.tlv) are suppressed as known-benign per Rule 3 (BT_ON_OFF PASS), leaving the regulatory.db miss as the sole genuine trigger. This is a pre-existing rootfs/infra gap unrelated to the PR's fastrpc.c changes.
  3. Possible fix: Add regulatory.db to the lemans-evk LAVA test rootfs under /lib/firmware/ (obtainable from the wireless-regdb package), or add a suppression rule in lava-known-benign-failures.md for regulatory.db firmware load failures if this is a known-benign condition on this board; the PR itself requires no changes.
  4. Detail analysis attachment: failed_case_job101889_1_detailed.md
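The triage in this case (collect firmware load failures from dmesg, drop the suppressed ones, report what is left) can be sketched as below. This is an illustration only: the function names, the Bluetooth log prefix in the sample, and the use of a `qca/` prefix to stand in for Rule 3 are assumptions, not the actual Probe_Failure_Check implementation.

```python
import re

# Firmware-loader message format, as quoted in the root cause above.
FW_FAIL_RE = re.compile(r"Direct firmware load for (\S+) failed with error (-?\d+)")


def failed_firmware(dmesg_lines):
    """Return (firmware_name, errno) for every failed direct firmware load."""
    failures = []
    for line in dmesg_lines:
        match = FW_FAIL_RE.search(line)
        if match:
            failures.append((match.group(1), int(match.group(2))))
    return failures


def genuine_failures(dmesg_lines, suppressed_prefixes):
    """Drop failures whose firmware path starts with a suppressed prefix."""
    prefixes = tuple(suppressed_prefixes)
    return [(name, err) for name, err in failed_firmware(dmesg_lines)
            if not name.startswith(prefixes)]


# Sample lines: the first is quoted from the analysis, the device prefix on
# the Bluetooth lines is made up for illustration.
SAMPLE = [
    "faux_driver regulatory: Direct firmware load for regulatory.db failed with error -2",
    "bluetooth hci0: Direct firmware load for qca/wcnhpbtfw21.tlv failed with error -2",
    "bluetooth hci0: Direct firmware load for qca/hpbtfw21.tlv failed with error -2",
]

if __name__ == "__main__":
    print(genuine_failures(SAMPLE, ["qca/"]))
```

With the Bluetooth firmware suppressed, only the regulatory.db failure remains, matching the conclusion that it is the sole genuine trigger.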
  Case 2: smmu
  1. Failed case: smmu
  2. Root cause: The smmu test script asserts that two devices, aa00000.video-codec (Video) and interconnect-lpass-ag-noc (Audio), appear in /sys/kernel/iommu_groups, but neither is present. The video codec exposes IOMMU groups via its iris sub-devices (iris_pixel.0, iris_non_pixel.0) rather than the parent DT node, and the LPASS audio NoC interconnect is a bus provider, not a DMA master, so it correctly has no IOMMU group attachment.
  3. Possible fix: Update the smmu test script's critical-master list for lemans-evk: replace aa00000.video-codec with iris_pixel.0 and iris_non_pixel.0, and remove interconnect-lpass-ag-noc (it is an interconnect provider, not a DMA master requiring IOMMU protection).
  4. Detail analysis attachment: failed_case_job101889_2_detailed.md
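The effect of the proposed master-list change can be sketched as follows. The helper names are hypothetical and the real smmu test script may be structured differently; the sketch only assumes the standard sysfs layout /sys/kernel/iommu_groups/<group>/devices/<device>.

```python
import glob
import os


def attached_devices(iommu_root="/sys/kernel/iommu_groups"):
    """Names of all devices attached to any IOMMU group."""
    paths = glob.glob(os.path.join(iommu_root, "*", "devices", "*"))
    return {os.path.basename(path) for path in paths}


def missing_masters(expected, attached):
    """Expected masters that are not attached to any IOMMU group."""
    return [name for name in expected if name not in attached]


# Master lists taken from the analysis above.
OLD_MASTERS = ["aa00000.video-codec", "interconnect-lpass-ag-noc"]
NEW_MASTERS = ["iris_pixel.0", "iris_non_pixel.0"]

if __name__ == "__main__":
    attached = attached_devices()
    print("missing (old list):", missing_masters(OLD_MASTERS, attached))
    print("missing (new list):", missing_masters(NEW_MASTERS, attached))
```

On a board where only the iris sub-devices are attached, the old list reports both entries missing while the new list reports none.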
  Case 3: USBHost
  1. Failed case: USBHost
  2. Root cause: The xHCI controller on the lemans-evk (USB bus 2, SuperSpeed port) failed to assign an address to the Genesys Logic USB3.1 hub ("usb 2-1: device not accepting address 2, error -62", with two xHCI setup-command timeouts at T+12s and T+18s). This left only USB hubs visible on the bus; the test script requires at least one non-hub functional USB device and fails with "Only USB hubs detected, no functional USB devices."
  3. Possible fix: Investigate the xHCI SuperSpeed port enumeration failure on the lemans-evk board. Check whether the device expected downstream of the USB3.1 hub (a functional USB peripheral) is physically connected and powered, and whether the xHCI "Timeout while waiting for setup device command" error is a pre-existing board/infra issue or a regression introduced by the PR. Re-trigger the CI job to determine whether this is a transient hardware enumeration glitch; if it recurs, inspect the DWC3/xHCI driver configuration for the a800000.usb controller on lemans-evk.
  4. Detail analysis attachment: failed_case_job101889_3_detailed.md
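The "only hubs detected" condition can be reproduced with a small check against sysfs. This is a sketch under assumptions: the function names are hypothetical, and it relies on the USB hub device class code 09 as exposed through the bDeviceClass attribute; the actual USBHost test may use a different method.

```python
import glob
import os

HUB_CLASS = "09"  # USB base class code for hubs


def read_device_classes(usb_root="/sys/bus/usb/devices"):
    """Map each USB device name to its bDeviceClass string.

    Interface entries (for example "2-1:1.0") have no bDeviceClass
    attribute and are therefore skipped by the glob.
    """
    classes = {}
    for path in glob.glob(os.path.join(usb_root, "*", "bDeviceClass")):
        name = os.path.basename(os.path.dirname(path))
        with open(path) as attr:
            classes[name] = attr.read().strip()
    return classes


def functional_devices(classes):
    """Device names whose class is anything other than hub."""
    return sorted(name for name, cls in classes.items()
                  if cls.lower() != HUB_CLASS)


if __name__ == "__main__":
    devices = functional_devices(read_device_classes())
    if devices:
        print("Functional USB devices:", devices)
    else:
        print("Only USB hubs detected, no functional USB devices.")
```

An empty result corresponds to the failure message quoted in the root cause, where root hubs and the external hub are the only enumerated devices.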
  Case 4: shmbridge - Suppressed (known benign: shmbridge CI noise)
  1. Failed case: shmbridge - Suppressed (known benign: shmbridge CI noise)
  2. Root cause: The shmbridge test script incorrectly flags the kernel command-line string qcom_scm.download_mode=1 as a qcom_scm-related error; this is a known false positive in the CI environment and is unconditionally suppressed per Rule 1 of the LAVA Known Benign Failure Suppressions policy.
  3. Possible fix: No action required — this failure is classified as known CI infrastructure noise and must not be treated as a kernel regression; the PR under test (drivers/misc/fastrpc.c kref refcount fix) is unrelated to qcom_scm or shmbridge.
  4. Detail analysis attachment: failed_case_job101889_4_detailed.md
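One way to avoid this false positive in the test itself is to strip the command-line parameter before matching. The sketch below assumes the test flags any log line mentioning qcom_scm; the actual pattern used by the shmbridge script is not shown in this analysis, and the function name and sample lines are made up for illustration.

```python
import re

# Kernel command-line parameter that must not count as an error.
CMDLINE_PARAM_RE = re.compile(r"\bqcom_scm\.download_mode=\S*")
# Hypothetical stand-in for the test's error pattern.
SCM_PATTERN_RE = re.compile(r"qcom_scm")


def scm_flagged_lines(log_lines):
    """Lines that still mention qcom_scm once the cmdline parameter is removed."""
    flagged = []
    for line in log_lines:
        cleaned = CMDLINE_PARAM_RE.sub("", line)
        if SCM_PATTERN_RE.search(cleaned):
            flagged.append(line)
    return flagged


# Made-up sample log lines.
SAMPLE_LOG = [
    "Kernel command line: console=ttyMSM0 qcom_scm.download_mode=1 root=/dev/sda1",
    "qcom_scm: example error line (made up)",
    "unrelated driver message",
]

if __name__ == "__main__":
    for line in scm_flagged_lines(SAMPLE_LOG):
        print(line)
```

Only the second sample line is flagged; the command-line echo no longer matches once the parameter token is removed.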
  Case 5: WiFi_Firmware_Driver - Suppressed (Known Benign: WiFi Firmware False Positive, WiFi ON/OFF Passed)
  1. Failed case: WiFi_Firmware_Driver - Suppressed (Known Benign: WiFi Firmware False Positive, WiFi ON/OFF Passed)
  2. Root cause: The WiFi_Firmware_Driver test failed because the ath12k/ath12k_wifi7/ath12k_pci/ath12k_ahb modules were not loaded when the firmware-driver check ran. However, the board's actual WiFi hardware (driven by ath11k/ath11k_pci) was fully functional: the subsequent WiFi_OnOff test passed with interface wlp1s0 toggling up and down successfully, confirming that firmware loaded correctly at runtime. This is a known false positive caused by the test script checking for ath12k-family modules that are not present on this lemans-evk board (which uses ath11k), not a kernel regression introduced by PR #594 (FROMLIST: misc: fastrpc: fix use-after-free of fastrpc_user in workqueue context).
  3. Possible fix: No fix required — suppress this failure per Rule 2 of lava-known-benign-failures.md; the WiFi_OnOff PASS confirms WiFi is fully functional. If the test script should be updated to recognise ath11k/ath11k_pci as valid transport modules for lemans-evk, update the WiFi_Firmware_Driver test's module list to include ath11k variants alongside ath12k variants.
  4. Detail analysis attachment: failed_case_job101889_5_detailed.md
  Case 6: 0_qcom-next-ci-premerge-tests
  1. Failed case: 0_qcom-next-ci-premerge-tests
  2. Root cause: The LAVA test definition was marked failed because multiple sub-tests reported FAIL: (1) smmu: aa00000.video-codec and interconnect-lpass-ag-noc are not attached to any IOMMU group on lemans-evk, indicating a DT or driver probe issue for Video/Audio masters; (2) WiFi_Firmware_Driver: a kernel/rootfs version mismatch (6.18.25-gf71491fcd9b4 running vs 6.19.0-00717-ge3aded47f3e5 modules on rootfs) prevents ath12k/ath12k_wifi7 from loading; (3) shmbridge: a false positive triggered by qcom_scm.download_mode=1 in the kernel cmdline being matched as an error; (4) USBHost: no functional USB device connected to the board in the lab; (5) Probe_Failure_Check: known benign WiFi/BT firmware-not-found messages. None of these failures is introduced by the PR's fastrpc.c changes.
  3. Possible fix: The smmu IOMMU group failure for aa00000.video-codec and interconnect-lpass-ag-noc on lemans-evk requires investigation of the DT iommus property for those nodes and whether the video/audio drivers probe correctly on this kernel; the kernel/rootfs version mismatch must be resolved by rebuilding the rootfs against kernel 6.18.25-gf71491fcd9b4 or updating the LAVA job to flash a matching rootfs image; the shmbridge test script should be fixed to exclude qcom_scm.download_mode= from its error pattern; and the USBHost test should be marked as infra-dependent or a USB device should be connected to the lemans-evk lab board.
  4. Detail analysis attachment: failed_case_job101889_6_detailed.md

@qswat-orbit-external

Merge Check Failed: CR Not Eligible for Merge

CR 4502232 is not eligible for merge.

The parent software image for kernel.qli.2.0 is not development complete.

Entity: kernel.qli.2.0
CR: 4502232
Reason: CR_CANNOT_MERGE

Please ensure the CR passes both CCT (ComponentChangeTasks) and ICT (Integration Change Tasks) validations.
