
Added lost node cleanup restriction option for GCP projects with multiple instances.#558

Open
tweirtx wants to merge 19 commits into
jenkinsci:developfrom
tweirtx:lost-node-cleanup

Conversation

@tweirtx

@tweirtx tweirtx commented Jun 17, 2026


My team's Jenkins setup has multiple hosts utilizing the same GCP project, and the changes in #503 have led to an uptick in incidents where one Jenkins host terminates another's build job in the lost node cleanup routine. I understand the original criticism that led to this change, and I propose that instead of using the previous behavior which could cause issues, we add an option to have a configurable label to restrict the cleanup job's targeted GCE VMs. This may also provide a solid solution to #157 and https://issues.jenkins.io/browse/JENKINS-68244 as well.

Testing done

Aside from the unit tests, I deployed this in our QA environment and thoroughly tested the behavior of the clean lost nodes routine with the changes I made. At this time, I have not run the integration tests due to security restrictions, but I will complete that task as soon as I can and then move this PR out of draft status. EDIT: Integration tests have been successfully completed.

UI before:
image

UI after:
image

Submitter checklist

  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • Link to relevant pull requests, esp. upstream and downstream changes
  • Ensure you have provided tests that demonstrate the feature works or the issue is fixed

@gbhat618

gbhat618 commented Jun 19, 2026

Contributor

Hi, have you checked the follow up PR that fixed the bug #514? The problem you describe shouldn't be happening if you are using the version with the follow-up that fixed the issue.
Which version of the plugin are you using?

@tweirtx

tweirtx commented Jun 22, 2026

Author

Yes, I did check that PR, but we're still seeing the issue on versions newer than the one that included that fix.

@gbhat618

Contributor

Yes, I did check that PR, but we're still seeing the issue on versions newer than the one that included that fix.

I think it's better to confirm where the bug is coming from. Some thoughts,

  1. Confirm that another controller is actually the one responsible for deletion of agent VMs of some other controller: if you have separate service accounts configured in each of the Jenkins controllers, could you validate from the GCP Cloud Logging that indeed the principalEmail was the other Jenkins controller account?

  2. Enable FINEST only on the class com.google.jenkins.plugins.computeengine.CleanLostNodesWork by adding a new log recorder in $JENKINS_URL/manage/log/ -> Add recorder. (Note: do not enable FINEST at the package level; that would start logging across the whole GCE plugin and produce far too much output.) Then confirm that the deleted VM of controller-A shows up in, say, the logs of controller-B. You would be looking for

    logger.log(
    Level.FINEST,
    () -> "Instance " + remote.getName() + " last_refresh label value: " + nodeLastRefresh + ", isOrphan: "
    + isOrphan);

    logger.log(Level.INFO, "Removing orphaned instance: " + instanceName);
    Note: the INFO message is logged without any extra configuration, so you can check for it right away.

The way the CleanLostNodesWork works is,

  • when an agent VM is created, the label jenkins_node_last_refresh is set to the creation time.
  • then, every 1hr, when the CleanLostNodesWork runs, it updates the jenkins_node_last_refresh timestamp for all of its connected agent VMs -> therefore any agent that is part of a long-running build (several hours) keeps getting this label timestamp refreshed. It only deletes the VMs whose jenkins_node_last_refresh is older than 3h.
    /**
     * Updates the label of the local instances to indicate they are still in use. The method makes N network calls
     * for N local instances, couldn't find any bulk update apis.
     */
    private void updateLocalInstancesLabel(
            ComputeClientV2 clientV2, Set<String> localInstances, List<Instance> remoteInstances) {
        var remoteInstancesByName =
                remoteInstances.stream().collect(Collectors.toMap(Instance::getName, instance -> instance));
        var labelToUpdate = ImmutableMap.of(NODE_IN_USE_LABEL_KEY, getLastRefreshLabelVal());
        for (String instanceName : localInstances) {
            var remoteInstance = remoteInstancesByName.get(instanceName);
            if (remoteInstance == null) {
                continue;
            }
            try {
                clientV2.updateInstanceLabels(remoteInstance, labelToUpdate);
                logger.log(Level.FINEST, () -> "Updated label for instance " + instanceName);
            } catch (IOException e) {
                logger.log(Level.WARNING, "Error updating label for instance " + instanceName, e);
            }
        }
    }

So unless the controller whose agent VMs are being deleted by other controllers is running an older version of the plugin that doesn't have the current CleanLostNodesWork code, I am not able to think of another scenario that would cause this wrong deletion.
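The refresh-and-expire rule described above can be sketched roughly as follows. This is only an illustration, not the plugin's code: the class `OrphanRule`, its method names, and the use of `Instant` values in place of the real label format are assumptions. Only the label semantics and the 3h threshold come from the description above.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of the refresh-and-expire rule; not the plugin's implementation. */
final class OrphanRule {
    // Threshold from the discussion: VMs not refreshed for more than 3h are treated as orphans.
    static final Duration ORPHAN_AFTER = Duration.ofHours(3);

    /** A VM counts as an orphan when its last refresh is strictly older than the threshold. */
    static boolean isOrphan(Instant lastRefresh, Instant now) {
        return Duration.between(lastRefresh, now).compareTo(ORPHAN_AFTER) > 0;
    }

    /**
     * One cleanup pass: refresh the timestamp of every VM this controller still tracks,
     * then return the names of all VMs whose timestamp is too old.
     * The map stands in for the jenkins_node_last_refresh labels and must be mutable.
     */
    static List<String> runOnce(Map<String, Instant> lastRefreshByVm, Set<String> localVms, Instant now) {
        for (String name : localVms) {
            // Only VMs that still exist remotely get their timestamp refreshed.
            lastRefreshByVm.computeIfPresent(name, (key, old) -> now);
        }
        List<String> orphans = new ArrayList<>();
        for (Map.Entry<String, Instant> entry : lastRefreshByVm.entrySet()) {
            if (isOrphan(entry.getValue(), now)) {
                orphans.add(entry.getKey());
            }
        }
        return orphans;
    }
}
```

Under this model, a VM that belongs to a long-running build survives as long as its own controller keeps running the pass, because the refresh happens before the age check.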

@gbhat618

Contributor

But if you want to disable the CleanLostNodesWork, beware that you will need to disable it on all controllers pointing to the same GCP project; otherwise whichever controller still has it enabled will mark the VMs whose jenkins_node_last_refresh is older than 3hr as orphans.

  1. A way to disable in the current Jenkins JVM - will not persist across restarts, could be,
    From script console - $JENKINS_URL/manage/script,
    import com.google.jenkins.plugins.computeengine.CleanLostNodesWork
    ExtensionList.lookupSingleton(CleanLostNodesWork.class).cancel()
  2. As a hack, if you just want to disable the CleanLostNodesWork, you can set this system property to a really large number, like 100yr or so: -Dcom.google.jenkins.plugins.computeengine.CleanLostNodesWork.recurrencePeriod=3153600000000 (unit is millis; 100yr = 3600L * 24 * 365 * 100 * 1000)

But as already mentioned if you are attempting to do either of these, it will need to be done in all controllers.
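As a quick check of the arithmetic behind the "100yr" value: the expression 3600L * 24 * 365 * 100 evaluates to 3153600000, which is a count of seconds. If the property is read in milliseconds, as stated above, the 100-year value is 1000 times larger. The helper class below is hypothetical and uses only the JDK.

```java
import java.time.Duration;

/** Hypothetical helper: works out a "100 years" period in seconds and in milliseconds. */
final class RecurrencePeriodMath {
    /** seconds per hour * hours per day * days per year * years */
    static long hundredYearsSeconds() {
        return 3600L * 24 * 365 * 100;
    }

    /** The same span expressed in milliseconds (365-day years, no leap days). */
    static long hundredYearsMillis() {
        return Duration.ofDays(365L * 100).toMillis();
    }

    public static void main(String[] args) {
        System.out.println("seconds: " + hundredYearsSeconds()); // 3153600000
        System.out.println("millis:  " + hundredYearsMillis());  // 3153600000000
        // How long 3153600000 lasts when interpreted as milliseconds (whole days).
        System.out.println("days:    " + Duration.ofMillis(3153600000L).toDays()); // 36
    }
}
```

Either value is far longer than the 3h orphan threshold, so both keep the periodic work from running in practice for weeks at a time; only the larger one corresponds to 100 years.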

@gbhat618

gbhat618 commented Jun 22, 2026

Contributor

Now, coming to how we can disable deletion of agent VMs created by a specific controller,

Perhaps instead of introducing a new label jenkins_node_cleanup, we can introduce a magic value for the existing label jenkins_node_last_refresh, such as 0, and then add a filter in the code that both excludes the VM from label updates and excludes any VM with label value 0 from deletion, with just a FINE log message. We could expose it via a system property -Dcom.google.jenkins.plugins.computeengine.CleanLostNodesWork.disableUpdatingNodeLastRefresh=true (false by default).

Edit: I will think a bit more about this, and get back. I think this will greatly reduce the diff for this PR.

@gbhat618

gbhat618 commented Jun 22, 2026

Contributor

Perhaps instead of introducing a new label [...] a magic value [...] 0 [...]

This has a problem, it won't delete the orphans of the current controller as well.

Then we need an identifier for the instances created by the current controller, which can be the instanceIds of the GCE clouds,

// Apply a label that associates an instance configuration with
// this cloud provider
configuration.appendLabel(CLOUD_ID_LABEL_KEY, getInstanceId());

And we can have a system property -Dcom.google.jenkins.plugins.computeengine.CleanLostNodesWork.deleteOnlyVMsOfThisController=true (or a similar name) and have the logic around that. We won't have to introduce a new label then also, and diff would be smaller too.
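A rough sketch of what "delete only the VMs of this controller" could look like is shown below. Everything here is hypothetical: the `Vm` stand-in type, the label key value, and the method name are placeholders for illustration, and the real plugin types (`Instance`, `CLOUD_ID_LABEL_KEY`) are not used.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical sketch; types, label key value, and method names are illustrative only. */
final class OwnedInstanceFilter {
    // Placeholder value: the actual value behind the plugin's CLOUD_ID_LABEL_KEY is not shown in this thread.
    static final String CLOUD_ID_LABEL_KEY = "example-cloud-id";

    /** Minimal stand-in for a GCE instance: a name plus its labels. */
    static final class Vm {
        final String name;
        final Map<String, String> labels;

        Vm(String name, Map<String, String> labels) {
            this.name = name;
            this.labels = labels;
        }
    }

    /**
     * Returns the VMs the cleanup may consider. With onlyOwn=false every remote VM is a candidate;
     * with onlyOwn=true only VMs labeled with one of this controller's cloud ids are.
     */
    static List<Vm> cleanupCandidates(List<Vm> remote, Set<String> localCloudIds, boolean onlyOwn) {
        if (!onlyOwn) {
            return remote;
        }
        return remote.stream()
                .filter(vm -> {
                    String cloudId = vm.labels.get(CLOUD_ID_LABEL_KEY);
                    // Unlabeled VMs are never treated as ours; the null check also avoids
                    // NullPointerException from immutable sets on contains(null).
                    return cloudId != null && localCloudIds.contains(cloudId);
                })
                .collect(Collectors.toList());
    }
}
```

The design choice in this sketch is that ownership is decided purely from an existing label, so no new label has to be written to the VMs.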

Note: a small gotcha (that only affects a CasC-based controller, though): if somebody doesn't set the instanceId field for a GCE cloud in their CasC configuration, then after a restart, when the CasC config is reapplied, it will compute a new random UUID. This can be avoided by setting a UUID in the CasC yaml file. Otherwise, after the restart the controller won't recognize the VMs it created before the restart and will leave them hanging around. We can document it, and if someone forgets to set this up, it is still acceptable: only VMs that existed at the time of that specific restart would be left behind, which is timing dependent as well, so it is likely to be rare -

public void setInstanceId(String instanceId) {
if (Strings.isNullOrEmpty(instanceId)) {
this.instanceId = UUID.randomUUID().toString();
} else {
.
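The gotcha above can be illustrated with a small stand-in class. This is a sketch under assumptions, not the plugin's `ComputeEngineCloud`: the class name and the `idAfterRestart` helper are hypothetical, and the `else` branch (keeping an explicitly provided id) is assumed because the quoted snippet is cut off before it.

```java
import java.util.UUID;

/** Hypothetical stand-in for a cloud configuration object; not the plugin's ComputeEngineCloud. */
final class CloudConfigSketch {
    private String instanceId;

    void setInstanceId(String instanceId) {
        if (instanceId == null || instanceId.isEmpty()) {
            // Mirrors the quoted behavior: no id supplied, so a random one is generated.
            this.instanceId = UUID.randomUUID().toString();
        } else {
            // Assumption: an explicitly provided id is kept as-is.
            this.instanceId = instanceId;
        }
    }

    String getInstanceId() {
        return instanceId;
    }

    /** Simulates configuration being reapplied from scratch after a controller restart. */
    static String idAfterRestart(String configuredId) {
        CloudConfigSketch cloud = new CloudConfigSketch();
        cloud.setInstanceId(configuredId);
        return cloud.getInstanceId();
    }
}
```

In this model, calling `idAfterRestart(null)` twice yields two different ids, so VMs labeled before the restart no longer match, whereas a fixed id in the configuration yields the same value every time.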

@gbhat618 gbhat618 left a comment

Contributor


  • #558 (comment) is a much simpler implementation leveraging the existing labels and implementation.
  • worth debugging to know if there is an actual bug or it's an infra problem - #558 (comment)

@tweirtx

tweirtx commented Jun 22, 2026

Author

We did extensive verification to ensure the plugin was the cause, including validation of running versions and logging active systems. We don't want to completely disable lost node detection, but we do want to ensure that the system is not picking up nodes belonging to another Jenkins host. We have a hotfix in place that sets the recurrence period to 24 hours, and that solves the problem, but the better fix is to remove the bad behavior in the first place.

Regarding the use of the instanceId label, I saw that you had mentioned in #503 the potential for a misconfiguration to occur based on that cloud ID. I took that into account when designing my solution, which originally sought to allow for the option to revert to the old behavior (versions prior to that PR did not exhibit the issue), however I decided a new implementation was in order to address the root cause of the original PR.

@tweirtx tweirtx marked this pull request as ready for review June 22, 2026 18:11
@tweirtx tweirtx requested a review from a team as a code owner June 22, 2026 18:11
@gbhat618

gbhat618 commented Jun 23, 2026

Contributor

extensive verification to ensure the plugin was the cause, including validation of running versions and logging active systems

Did you track down the bug?

Edit: I have gone through the code again and did some more analysis. I don't think there is a way for controller-A to delete the VMs of controller-B, unless:

  • controller-B faced exceptions while updating the timestamps
  • controller-B has a longer recurrencePeriod configured by system property, and thus it didn't refresh the timestamp.

The 24h recurrencePeriod hotfix also points towards this kind of asymmetric config, or exceptions, delaying the timestamp refresh.
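The asymmetric-period scenario can be checked with a toy timeline. The model below is an assumption-laden sketch: time advances in whole hours, the owning controller's refresh runs before the other controller's cleanup within the same hour, and the class and method names are hypothetical.

```java
/** Hypothetical toy model of two controllers with different recurrence periods. */
final class AsymmetricPeriodSimulation {
    /**
     * Returns the first hour at which the other controller would delete the VM,
     * or -1 if that never happens within the horizon.
     *
     * @param ownerRefreshEveryHours  how often the owning controller refreshes the label
     * @param otherCleanupEveryHours  how often the other controller runs its cleanup
     * @param orphanAfterHours        age (in hours) beyond which a VM is treated as an orphan
     * @param horizonHours            how many hours to simulate
     */
    static int firstDeletionHour(
            int ownerRefreshEveryHours, int otherCleanupEveryHours, int orphanAfterHours, int horizonHours) {
        int lastRefresh = 0; // the VM is created, and labeled, at hour 0
        for (int hour = 1; hour <= horizonHours; hour++) {
            if (hour % ownerRefreshEveryHours == 0) {
                lastRefresh = hour; // the owner's periodic work refreshes the timestamp
            }
            if (hour % otherCleanupEveryHours == 0 && hour - lastRefresh > orphanAfterHours) {
                return hour; // the other controller sees a stale timestamp
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstDeletionHour(1, 1, 3, 48));  // -1: symmetric 1h periods, never deleted
        System.out.println(firstDeletionHour(24, 1, 3, 48)); // 4: owner refreshes every 24h, deleted at hour 4
    }
}
```

In this simplified model, a controller that refreshes only every 24h leaves a window in which any other controller still on the 1h default sees the VM as older than 3h.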

Comment thread src/main/java/com/google/jenkins/plugins/computeengine/CleanLostNodesWork.java Outdated
}

@Test
public void testLostNodeCleanedUpBySecondController() throws Throwable {

Contributor


This test will now be failing due to

cloud.setLostNodeCleanupRestriction(true);

?

@gbhat618 gbhat618 left a comment

Contributor


Please help explain the need for a configurable label for the lost node restriction at the individual cloud level. Perhaps having a system property to decide, and hardcoding the label name, is easier. Also, please don't use mock-based tests; rather, use an integration test. For example, adding another test testSecondControllerDoesNotCleanUpLostNodeWhenRestrictionEnabled to the existing CleanLostNodeWorkIT would provide the coverage.

@gbhat618

Contributor

Given #558 (comment), I still think we should track down the actual bug and think of reproducible scenarios (preferably via an IT test) and arrive at the solution accordingly.

If we are unable to come up with a reproducer at all, I think it is better to just introduce a system property alone, rather than introducing configuration properties. Adding new configurable properties means we will need to support them, and they cannot be taken out in future releases without appropriate backward compatibility support, which will just complicate the whole thing.
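For reference, gating a behavior behind a JVM system property needs very little code. The sketch below uses only the JDK's `Boolean.getBoolean`; the class name is hypothetical, and the property name is the placeholder floated earlier in this thread ("or a similar name"), not a property the plugin is known to support.

```java
/** Hypothetical sketch of gating a behavior behind a JVM system property. */
final class CleanupFlags {
    // Placeholder name taken from the suggestion in this thread; not a confirmed plugin property.
    static final String ONLY_OWN_VMS_PROPERTY =
            "com.google.jenkins.plugins.computeengine.CleanLostNodesWork.deleteOnlyVMsOfThisController";

    /**
     * Returns true only when the property is set to "true" (case-insensitive).
     * An absent property, or any other value, yields false, so the default behavior is unchanged.
     */
    static boolean deleteOnlyOwnVms() {
        return Boolean.getBoolean(ONLY_OWN_VMS_PROPERTY);
    }
}
```

Because nothing is persisted in the cloud configuration, such a flag can later be renamed or dropped without a configuration migration.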

@tweirtx

tweirtx commented Jun 23, 2026

Author

I haven't been able to track down the root cause of the problem, but it was observed before the recurrence periods were reconfigured (which is why they were configured to 24 hours in the first place). I will make the requested changes to the tests, though I will say that the integration tests currently all pass. The log message was mistakenly left in from debugging and I will remove it.

The reasoning for a configurable label is to differentiate which Jenkins master node created the VMs, and consequently, filter out which ones actually apply to the master node when it runs cleanup. I will refactor the code to use a system property instead of a per-cloud configuration as requested.

@gbhat618

Contributor

#558 (comment), and let me fix the CleanLostNodesWorkIT to deflake it without so much sleep.

@gbhat618

gbhat618 commented Jun 24, 2026

Contributor

Noticed this commit cab5ddc - it won't deflake the ITs. The flakiness comes from relying on timeouts when waiting for agents, which should be done using SemaphoreStep. This pattern can be seen in more recent ITs, e.g. https://github.com/jenkinsci/google-compute-engine-plugin/blob/develop/src/test/java/com/google/jenkins/plugins/computeengine/integration/ComputeEngineCloudDiskMappingIT.java (and other recent tests).

Imo better to file a separate PR for fixing - I can do that later today or earlier tomorrow my time.
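The general idea of waiting for an explicit signal instead of sleeping can be shown with plain JDK primitives. This is only an analogy: it does not use the SemaphoreStep API or the Jenkins test harness, and the class name, method name, and simulated startup delay are all hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Plain-JDK analogy for signal-based waiting; this does not use the SemaphoreStep API. */
final class SignalInsteadOfSleep {
    /**
     * Starts a fake "agent" thread and blocks until it signals readiness or the timeout elapses.
     * Returns true if the signal arrived in time.
     */
    static boolean waitForAgent(long simulatedStartupMillis, long timeoutMillis) {
        CountDownLatch agentReady = new CountDownLatch(1);
        Thread agent = new Thread(() -> {
            try {
                Thread.sleep(simulatedStartupMillis); // stands in for VM provisioning
                agentReady.countDown();               // explicit "I am ready" signal
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        agent.setDaemon(true); // do not keep the JVM alive for the fake agent
        agent.start();
        try {
            // Returns as soon as the signal arrives; the timeout is only an upper bound.
            return agentReady.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

A test written this way finishes as soon as the agent is ready and fails only when the upper bound is exceeded, rather than depending on a fixed sleep being long enough.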

@tweirtx

tweirtx commented Jun 24, 2026

Author

Changed the behavior to use the Jenkins instance ID. I'll let you update the integration tests to what you want to see.


2 participants