Skip to content

providers/base: fix race condition in turn_down_nm_connections (BugFix)#2449

Draft
eugene-yujinwu wants to merge 7 commits into
mainfrom
fix/wifi-nmcli-turn-down-race-condition
Draft

providers/base: fix race condition in turn_down_nm_connections (BugFix)#2449
eugene-yujinwu wants to merge 7 commits into
mainfrom
fix/wifi-nmcli-turn-down-race-condition

Conversation

@eugene-yujinwu
Copy link
Copy Markdown
Contributor

@eugene-yujinwu eugene-yujinwu commented Apr 3, 2026

Problem

There is a race condition in turn_down_nm_connections() in wifi_nmcli_test.py.

The function calls _get_nm_wireless_connections() to list active connections, then runs nmcli c down <uuid> for each activated one. However, between the query and the nmcli c down call, the connection may have already become inactive (e.g., due to a network event or prior disconnect).

In this case nmcli returns exit code 10 ("not an active connection"), which caused sp.check_call() to raise CalledProcessError and crash the test — even though the connection being down is exactly the desired outcome.

Observed failure:

Turn down connection Canonical
+ nmcli c down cf0b6f04-db09-46f5-a8bb-e703f97c003a
Error: 'cf0b6f04-db09-46f5-a8bb-e703f97c003a' is not an active connection.
subprocess.CalledProcessError: Command returned non-zero exit status 10.

https://warthogs.atlassian.net/browse/SUTTON-4809
https://certification.canonical.com/hardware/202603-38441/submission/482952/test/84491/result/57704127/
https://certification.canonical.com/hardware/202602-38419/submission/481505/test/221519/result/57390085/

Fix

Replace sp.check_call() with sp.call() and explicitly handle the return code:

  • Exit code 0: success
  • Exit code 10: connection already inactive — goal achieved, treat as success
  • Any other non-zero code: still raises CalledProcessError

Tests

Re-run the test with the new way which use statce check other than ignore exit 10:
[CE-QA-PC]: Automatic (BugFix) #2449_use_state_check test for Paris-P-SIT-S15
https://certification.canonical.com/hardware/202603-38440/submission/483385/

nmcli returns exit code 10 when a connection is not active. There is
a race condition where _get_nm_wireless_connections() reports a
connection as 'activated', but by the time 'nmcli c down <uuid>' is
called the connection has already gone inactive. This caused
sp.check_call() to raise CalledProcessError, failing the test.

Replace sp.check_call() with sp.call() and treat exit code 10 as
success, since the goal (connection is down) has already been achieved.
Any other non-zero exit code still raises CalledProcessError.

Add a unit test to cover the race condition (exit code 10 must not
raise).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 58.78%. Comparing base (e647807) to head (c14a863).
⚠️ Report is 28 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2449      +/-   ##
==========================================
+ Coverage   58.74%   58.78%   +0.04%     
==========================================
  Files         475      476       +1     
  Lines       47936    48005      +69     
  Branches     8559     8569      +10     
==========================================
+ Hits        28159    28222      +63     
- Misses      18883    18887       +4     
- Partials      894      896       +2     
Flag Coverage Δ
provider-base 34.03% <100.00%> (+0.23%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@eugene-yujinwu eugene-yujinwu force-pushed the fix/wifi-nmcli-turn-down-race-condition branch from 738c5db to a8e6158 Compare April 3, 2026 10:00
@eugene-yujinwu eugene-yujinwu marked this pull request as ready for review April 14, 2026 03:03
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Fixes a race condition in turn_down_nm_connections() where nmcli c down can legitimately return exit code 10 (“not an active connection”) if the connection deactivates between query and teardown, preventing spurious test failures.

Changes:

  • Switch from subprocess.check_call() to subprocess.call() and treat return code 10 as a successful “already inactive” outcome.
  • Add a unit test for the “already inactive” case and update existing tests to mock sp.call().

Reviewed changes

Copilot reviewed 1 out of 2 changed files in this pull request and generated no comments.

File Description
providers/base/bin/wifi_nmcli_test.py Makes connection teardown resilient to nmcli’s rc=10 by handling return codes explicitly.
providers/base/tests/test_wifi_nmcli_test.py Updates mocks for sp.call() and adds coverage for the rc=10 “already inactive” scenario.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@hanhsuan
Copy link
Copy Markdown
Contributor

In this doc the exit code 10 means Connection, device, or access point does not exist. Do you mind to provide more information about your finding for the exit code 10 ?

Copy link
Copy Markdown
Contributor

@tomli380576 tomli380576 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See Hanhsuan's comment

Comment thread providers/base/bin/wifi_nmcli_test.py Outdated
Comment on lines +103 to +106
ret = sp.call(shlex.split(cmd))
# exit code 10: connection is not active — already down, that's fine
if ret not in (0, 10):
raise sp.CalledProcessError(ret, cmd)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

optional: I think you can use try-catch here to avoid manually constructing the error

try:
    sp.check_call(["nmcli", "connection", "down", uuid])
except sp.CalledProcessError as e:
    if e.returncode == 10:
        print("WARN: connection {} doesn't exist, ignoring".format(uuid))
    else:
        raise e

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion! We've actually taken it a step further in the latest commit. Rather than catching the error after the fact, we now pre-check the specific connection's state via UUID before issuing the down command:

nmcli -t -f GENERAL.STATE c show <uuid>

If the connection is no longer activated, we skip it. This removes the need to interpret exit code 10 entirely, and check_call reverts to its default behavior — any real error still raises as expected.

eugene-yujinwu and others added 3 commits April 14, 2026 13:42
Replace sp.call() + manual return code check with sp.check_call()
wrapped in a try/except block, as suggested in code review.

This is more Pythonic: check_call() expresses the expectation of
success, while the except clause clearly documents the one tolerated
exception (exit code 10: connection already inactive).

Also use bare 'raise' instead of 'raise e' to preserve the original
traceback.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…turn down

Instead of catching exit code 10 from 'nmcli c down', re-check the
specific connection's state via UUID right before issuing the down
command:

  nmcli -t -f GENERAL.STATE c show <uuid>

If the connection is no longer activated (e.g. it was deactivated by NM
automatically after another connection was brought down), skip it with a
warning instead of calling 'nmcli c down'. This avoids relying on exit
code interpretation and makes the intent explicit.

Updated tests mock sp.check_output for the per-UUID state check and
add a dedicated test case for the already-inactive scenario.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 2 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +187 to +197
def test_turn_down_connection_already_inactive(
self, get_connections_mock, sp_check_call_mock, sp_check_output_mock
):
# UUID state re-check shows connection already inactive — skip down
turn_down_nm_connections()
sp_check_output_mock.assert_called_once_with(
"nmcli -t -f GENERAL.STATE c show uuid1".split(),
universal_newlines=True,
)
sp_check_call_mock.assert_not_called()

Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new test test_turn_down_connection_already_inactive covers the pre-check path (state becomes deactivated before calling down), but it doesn’t cover the failure reported in the PR description where nmcli c down returns exit code 10. Add a unit test that simulates the down command returning/raising with rc=10 (e.g., check_call raising CalledProcessError(returncode=10, ...) or sp.call returning 10, depending on the chosen implementation) and assert turn_down_nm_connections() does not raise.

Copilot uses AI. Check for mistakes.
@eugene-yujinwu
Copy link
Copy Markdown
Contributor Author

Thanks for raising this — you're right that exit code 10 can mean multiple things ("connection/device/AP does not exist"), making it risky to ignore unconditionally.

Instead of relying on exit code interpretation, the latest commit takes a different approach: before calling nmcli c down, we now query the specific connection's state by UUID:

nmcli -t -f GENERAL.STATE c show <uuid>

If the state is no longer activated, we skip the down command entirely with a warning. This avoids the exit code 10 discussion altogether — check_call is back to its standard behavior and any unexpected error will still propagate normally.

The pre-check (nmcli -t -f GENERAL.STATE c show <uuid>) reduces the
race window, but a connection can still become inactive in the narrow
gap between the pre-check and the actual 'nmcli c down' call. Add a
try/except around check_call to treat exit code 10 as success in that
case, and add a dedicated test to cover this path.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@Hook25
Copy link
Copy Markdown
Collaborator

Hook25 commented Apr 14, 2026

@eugene-yujinwu two things:

  1. please verify the AI output before asking for a review
  2. please don't ask for a review on a draft PR, I've spent 20m checking if nmcli -t -f GENERAL.STATE c show works on xenial and you changed the implementation again. Use the draft status if you aren't ready
  3. writing a description to a PR is the most important part of the work. I don't want to see AI generated description please. At least make it conform to the template if you really want to do it this way..

If nmcli c down returns exit code 10 in the narrow window after the
pre-check, let it fail rather than silently ignoring it.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@eugene-yujinwu eugene-yujinwu marked this pull request as draft April 14, 2026 10:43
@eugene-yujinwu eugene-yujinwu marked this pull request as draft April 14, 2026 10:43
@eugene-yujinwu
Copy link
Copy Markdown
Contributor Author

@eugene-yujinwu two things:

  1. please verify the AI output before asking for a review
  2. please don't ask for a review on a draft PR, I've spent 20m checking if nmcli -t -f GENERAL.STATE c show works on xenial and you changed the implementation again. Use the draft status if you aren't ready
  3. writing a description to a PR is the most important part of the work. I don't want to see AI generated description please. At least make it conform to the template if you really want to do it this way..

Sorry, I forgot to put it in draft PR as I still modify it.
Yes, I will double check these descriptions and code before push it.
Sorry again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants