scst_user: Fix infinite cleanup loop caused by stale SGV pool reference by smileusd · Pull Request #378 · SCST-project/scst

smileusd · 2026-06-05T08:16:29Z

fix the issue #279

The full analize doc at https://smileusd.github.io/2026/06/01/2026-06-01-scst-infinite-loop-deadlock-analysis/

The device cleanup loop in dev_user_process_cleanup() spins at ~2 million iterations per second and never exits, ultimately triggering a kernel soft lockup. The previous workaround panicked the system after 10,000 iterations.

Root cause (confirmed by instrumentation):

A ucmd gets permanently stuck in ucmd_hash with:
state = UCMD_STATE_ON_FREE_SKIPPED (7)
cmd = NULL
ref = 1
sent_to_user = 0

The stuck ref=1 is the reference taken by dev_user_alloc_pages() via
ucmd_get() for the first scatter-gather page. It is released only by
dev_user_free_sg_entries() → ucmd_put(), which fires when the SGV pool
evicts a cached object. The sequence that prevents this eviction:

dev_user_unjam_dev() finds an EXECING command (sent_to_user=1, ref=2: alloc + alloc_pages), bumps ref to 3 via ucmd_get_check(), then calls dev_user_unjam_cmd().
dev_user_unjam_cmd() releases cmd_list_lock and calls scst_cmd_done(SCST_CONTEXT_THREAD), which synchronously runs the full SCST completion pipeline:

dev_user_on_free_cmd()
ucmd->cmd = NULL
ucmd->state = UCMD_STATE_ON_FREE_SKIPPED (type == IGNORE)
dev_user_process_reply_on_free()
dev_user_free_sgv()
sgv_pool_free(ucmd->sgv) ← returns SGV to pool LRU cache;
dev_user_free_sg_entries() NOT called;
alloc_pages ucmd_get() NOT balanced
ucmd->sgv = NULL
ucmd_put() ← ref: 3→2
Back in dev_user_unjam_dev(): ucmd_put() ← ref: 2→1. ref != 0, so dev_user_free_ucmd() / cmd_remove_hash() are NOT called. ucmd remains in ucmd_hash.
unjam_cmd also reset sent_to_user=0, so on every subsequent pass through dev_user_unjam_dev() the ucmd is counted (res++) but skipped (!sent_to_user → continue). dev_user_get_next_cmd() returns -EAGAIN (ucmd is not in ready_cmd_list). With cleanup_done=1 the while(1) loop has no exit condition.

The sgv_pool_flush() calls at the TOP of dev_user_unjam_dev() run
BEFORE any commands are unjammed. SGV objects cached during unjamming
are therefore never flushed; dev_user_free_sg_entries() never fires.

Fix:

Add sgv_pool_flush() for both pools at the BOTTOM of dev_user_unjam_dev(),
after the unjam loop exits and after the spinlock is released. This evicts
all SGV objects cached during unjamming, triggering:
dev_user_free_sg_entries() → ucmd_put() → dev_user_free_ucmd()
→ cmd_remove_hash()
removing the stuck ucmd from the hash. On the next cleanup-loop iteration
dev_user_unjam_dev() returns res=0 and dev_user_process_cleanup() breaks.

Also replace the panic() workaround with schedule() so the CPU yields
to let SGV free callbacks complete between iterations.

The device cleanup loop in dev_user_process_cleanup() spins at ~2 million iterations per second and never exits, ultimately triggering a kernel soft lockup. The previous workaround panicked the system after 10,000 iterations. Root cause (confirmed by instrumentation): A ucmd gets permanently stuck in ucmd_hash with: state = UCMD_STATE_ON_FREE_SKIPPED (7) cmd = NULL ref = 1 sent_to_user = 0 The stuck ref=1 is the reference taken by dev_user_alloc_pages() via ucmd_get() for the first scatter-gather page. It is released only by dev_user_free_sg_entries() → ucmd_put(), which fires when the SGV pool *evicts* a cached object. The sequence that prevents this eviction: 1. dev_user_unjam_dev() finds an EXECING command (sent_to_user=1, ref=2: alloc + alloc_pages), bumps ref to 3 via ucmd_get_check(), then calls dev_user_unjam_cmd(). 2. dev_user_unjam_cmd() releases cmd_list_lock and calls scst_cmd_done(SCST_CONTEXT_THREAD), which synchronously runs the full SCST completion pipeline: dev_user_on_free_cmd() ucmd->cmd = NULL ucmd->state = UCMD_STATE_ON_FREE_SKIPPED (type == IGNORE) dev_user_process_reply_on_free() dev_user_free_sgv() sgv_pool_free(ucmd->sgv) ← returns SGV to pool LRU cache; dev_user_free_sg_entries() NOT called; alloc_pages ucmd_get() NOT balanced ucmd->sgv = NULL ucmd_put() ← ref: 3→2 3. Back in dev_user_unjam_dev(): ucmd_put() ← ref: 2→1. ref != 0, so dev_user_free_ucmd() / cmd_remove_hash() are NOT called. ucmd remains in ucmd_hash. 4. unjam_cmd also reset sent_to_user=0, so on every subsequent pass through dev_user_unjam_dev() the ucmd is counted (res++) but skipped (!sent_to_user → continue). dev_user_get_next_cmd() returns -EAGAIN (ucmd is not in ready_cmd_list). With cleanup_done=1 the while(1) loop has no exit condition. The sgv_pool_flush() calls at the TOP of dev_user_unjam_dev() run BEFORE any commands are unjammed. SGV objects cached during unjamming are therefore never flushed; dev_user_free_sg_entries() never fires. Fix: Add sgv_pool_flush() for both pools at the BOTTOM of dev_user_unjam_dev(), after the unjam loop exits and after the spinlock is released. This evicts all SGV objects cached during unjamming, triggering: dev_user_free_sg_entries() → ucmd_put() → dev_user_free_ucmd() → cmd_remove_hash() removing the stuck ucmd from the hash. On the next cleanup-loop iteration dev_user_unjam_dev() returns res=0 and dev_user_process_cleanup() breaks. sgv_pool_flush() is fully synchronous (calls sgv_dtor_and_free() inline); by the time it returns the callbacks have already fired and the ucmd has already been removed from the hash. No schedule() or sleep is needed.

lnocturno · 2026-06-05T10:17:27Z

 	spin_unlock_irq(&dev->udev_cmd_threads.cmd_list_lock);

+	/*
+	 * Post-unjam SGV pool flush — fixes the infinite-loop root cause.


Please reduce the in-code comment significantly. The full lifecycle explanation is useful in the commit message, but the code comment should only describe why a second flush is needed.

smileusd force-pushed the upstream_scst_D_state_issue branch from 1b7c0d1 to 17a53bf Compare June 5, 2026 10:05

lnocturno requested changes Jun 5, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

scst_user: Fix infinite cleanup loop caused by stale SGV pool reference#378

scst_user: Fix infinite cleanup loop caused by stale SGV pool reference#378
smileusd wants to merge 1 commit into
SCST-project:masterfrom
smileusd:upstream_scst_D_state_issue

smileusd commented Jun 5, 2026

Uh oh!

lnocturno Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

smileusd commented Jun 5, 2026

Uh oh!

lnocturno Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants