Added lost node cleanup restriction option for GCP projects with multiple instances. by tweirtx · Pull Request #558 · jenkinsci/google-compute-engine-plugin

tweirtx · 2026-06-17T17:06:14Z

My team's Jenkins setup has multiple hosts utilizing the same GCP project, and the changes in #503 have led to an uptick in incidents where one Jenkins host terminates another's build job in the lost node cleanup routine. I understand the original criticism that led to this change, and I propose that instead of using the previous behavior which could cause issues, we add an option to have a configurable label to restrict the cleanup job's targeted GCE VMs. This may also provide a solid solution to #157 and https://issues.jenkins.io/browse/JENKINS-68244 as well.

Testing done

Aside from the unit tests, I deployed this in our QA environment and thoroughly tested the behavior of the clean lost nodes routine with the changes I made. At this time, I have not run the integration tests due to security restrictions, but I will complete that task as soon as I can (and then move this PR out of draft status). EDIT: Integration tests have been successfully completed.

UI before: (screenshot)

UI after: (screenshot)

Submitter checklist

Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
Ensure that the pull request title represents the desired changelog entry
Please describe what you did
Link to relevant issues in GitHub or Jira
Link to relevant pull requests, esp. upstream and downstream changes
Ensure you have provided tests that demonstrate the feature works or the issue is fixed

gbhat618 · 2026-06-19T07:18:20Z

Hi, have you checked the follow up PR that fixed the bug #514? The problem you describe shouldn't be happening if you are using the version with the follow-up that fixed the issue.
Which version of the plugin are you using?

tweirtx · 2026-06-22T14:00:26Z

Yes, I did check that PR, but we're still seeing the issue on versions newer than the one that included that fix.

gbhat618 · 2026-06-22T15:18:29Z

> Yes, I did check that PR, but we're still seeing the issue on versions newer than the one that included that fix.

I think it's better to confirm where the bug is coming from. Some thoughts,

Confirm that another controller is actually the one responsible for deletion of agent VMs of some other controller: if you have separate service accounts configured in each of the Jenkins controllers, could you validate from the GCP Cloud Logging that indeed the principalEmail was the other Jenkins controller account?

Enable FINEST only on the class com.google.jenkins.plugins.computeengine.CleanLostNodesWork by adding a new log recorder in $JENKINS_URL/manage/log/ -> Add recorder. (Note: do not enable FINEST at the package level; that would start logging across the whole GCE plugin and produce far too much output.) Then confirm that the deleted VM of controller-A shows up in the logs of, say, controller-B. You would be looking for these log statements:

google-compute-engine-plugin/src/main/java/com/google/jenkins/plugins/computeengine/CleanLostNodesWork.java

Lines 116 to 119 in 9ade970

    
    logger.log(
            Level.FINEST,
            () -> "Instance " + remote.getName() + " last_refresh label value: " + nodeLastRefresh + ", isOrphan: "
                    + isOrphan);

google-compute-engine-plugin/src/main/java/com/google/jenkins/plugins/computeengine/CleanLostNodesWork.java

Line 125 in 9ade970

logger.log(Level.INFO, "Removing orphaned instance: " + instanceName);

Note: the INFO log can be checked right away, since it does not need the FINEST recorder.
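To illustrate why the level should be set on the single class rather than the package, here is a minimal sketch using plain java.util.logging. It only shows how logger levels are scoped by name; it is not the Jenkins log recorder itself, and the sibling logger name is a made-up placeholder.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class SingleClassFinest {
    // Logger name of the class discussed above.
    public static final String CLEANUP_LOGGER =
            "com.google.jenkins.plugins.computeengine.CleanLostNodesWork";
    // HYPOTHETICAL sibling in the same package, used only for comparison.
    public static final String SIBLING_LOGGER =
            "com.google.jenkins.plugins.computeengine.SomeOtherClass";

    /** Sets FINEST on exactly one named logger and returns it. */
    public static Logger enableFinest(String loggerName) {
        Logger logger = Logger.getLogger(loggerName);
        logger.setLevel(Level.FINEST);
        return logger;
    }

    public static void main(String[] args) {
        Logger cleanup = enableFinest(CLEANUP_LOGGER);
        Logger sibling = Logger.getLogger(SIBLING_LOGGER);
        // The targeted logger now accepts FINEST records.
        System.out.println("cleanup FINEST: " + cleanup.isLoggable(Level.FINEST));
        // The sibling still inherits its level from its parent loggers,
        // so it stays at whatever the JVM logging configuration says.
        System.out.println("sibling FINEST: " + sibling.isLoggable(Level.FINEST));
    }
}
```

Setting the level on the package name instead would make every logger below it inherit FINEST, which is the noisy case the note above warns about.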

CleanLostNodesWork works as follows.

When an agent VM is created, its jenkins_node_last_refresh label is set to the creation time.

Then, every hour, when CleanLostNodesWork runs, it updates the jenkins_node_last_refresh timestamp on all of the controller's connected agent VMs. Any agent that is part of a long-running build (several hours) therefore keeps having this label refreshed. It only deletes VMs whose jenkins_node_last_refresh is more than 3 hours old.

google-compute-engine-plugin/src/main/java/com/google/jenkins/plugins/computeengine/CleanLostNodesWork.java

Lines 162 to 184 in 9ade970

    
        /**
         * Updates the label of the local instances to indicate they are still in use. The method makes N network calls
         * for N local instances, couldn't find any bulk update apis.
         */
        private void updateLocalInstancesLabel(
                ComputeClientV2 clientV2, Set<String> localInstances, List<Instance> remoteInstances) {
            var remoteInstancesByName =
                    remoteInstances.stream().collect(Collectors.toMap(Instance::getName, instance -> instance));
            var labelToUpdate = ImmutableMap.of(NODE_IN_USE_LABEL_KEY, getLastRefreshLabelVal());
            for (String instanceName : localInstances) {
                var remoteInstance = remoteInstancesByName.get(instanceName);
                if (remoteInstance == null) {
                    continue;
                }
                try {
                    clientV2.updateInstanceLabels(remoteInstance, labelToUpdate);
                    logger.log(Level.FINEST, () -> "Updated label for instance " + instanceName);
                } catch (IOException e) {
                    logger.log(Level.WARNING, "Error updating label for instance " + instanceName, e);
                }
            }
        }
    }

So unless the controller whose agent VMs are being deleted by other controllers is running an older version of the plugin that doesn't have the current CleanLostNodesWork code, I can't think of another scenario that would cause this wrong deletion.
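The refresh-and-expire rule described in this comment can be sketched roughly as follows. This is a simplified illustration, not the plugin's code: the class and method names are hypothetical, and the only values taken from the description are the hourly refresh and the 3-hour cutoff.

```java
import java.time.Duration;
import java.time.Instant;

public class OrphanRuleSketch {
    // Values from the description above: refresh every hour, delete after 3 hours.
    public static final Duration RECURRENCE = Duration.ofHours(1);
    public static final Duration LOST_THRESHOLD = Duration.ofHours(3);

    /** An instance counts as orphaned once its last refresh is older than the threshold. */
    public static boolean isOrphan(Instant lastRefresh, Instant now) {
        return Duration.between(lastRefresh, now).compareTo(LOST_THRESHOLD) > 0;
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2026-06-22T15:00:00Z");
        // Refreshed by its own controller one recurrence period ago: kept.
        System.out.println(isOrphan(now.minus(RECURRENCE), now)); // false
        // Not refreshed by any controller for four hours: treated as orphaned.
        System.out.println(isOrphan(now.minus(Duration.ofHours(4)), now)); // true
    }
}
```

Under this rule, a VM attached to a healthy controller should never get close to the cutoff, because its label is rewritten on every run.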

gbhat618 · 2026-06-22T15:34:27Z

But if you want to disable CleanLostNodesWork, beware: you will need to disable it on all controllers pointing to the same GCP project. Otherwise, whichever controller still has it enabled will mark VMs whose jenkins_node_last_refresh is more than 3 hours old as orphans.

One way to disable it in the running Jenkins JVM (this does not persist across restarts) is from the script console at $JENKINS_URL/manage/script:
```
import com.google.jenkins.plugins.computeengine.CleanLostNodesWork
ExtensionList.lookupSingleton(CleanLostNodesWork.class).cancel()
```
As a hack, if you just want to disable CleanLostNodesWork, you can set this system property to a really large number, like 100 years or so: -Dcom.google.jenkins.plugins.computeengine.CleanLostNodesWork.recurrencePeriod=3153600000000 (the unit is milliseconds; 100 years = 1000L * 3600 * 24 * 365 * 100).

But as already mentioned, if you attempt either of these, it needs to be done on all controllers.
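Since the property is stated to be in milliseconds, the unit arithmetic is worth double-checking, because it is easy to be off by a factor of 1000. A small sketch using only the JDK (the class name is illustrative):

```java
import java.util.concurrent.TimeUnit;

public class RecurrencePeriodMath {
    /** 100 years (365-day years) expressed in milliseconds. */
    public static long hundredYearsMillis() {
        return TimeUnit.DAYS.toMillis(365L * 100);
    }

    /** 100 years (365-day years) expressed in seconds. */
    public static long hundredYearsSeconds() {
        return TimeUnit.DAYS.toSeconds(365L * 100);
    }

    public static void main(String[] args) {
        System.out.println(hundredYearsMillis());  // 3153600000000
        System.out.println(hundredYearsSeconds()); // 3153600000
        // Read as milliseconds, the seconds figure is only about 36 days.
        System.out.println(TimeUnit.MILLISECONDS.toDays(hundredYearsSeconds())); // 36
    }
}
```

So 3600L * 24 * 365 * 100 gives the number of seconds in 100 years; as a millisecond value it would still leave the cleanup running roughly every 36.5 days.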

gbhat618 · 2026-06-22T15:40:17Z

Now, coming to how to disable deletion of agent VMs created by a specific controller:

Perhaps instead of introducing a new label jenkins_node_cleanup, we can introduce a magic value for the existing label jenkins_node_last_refresh, such as 0, and add a filter in the code that both excludes such a VM from label updates and excludes it from deletion, with just a FINE log entry. This could be exposed via a system property -Dcom.google.jenkins.plugins.computeengine.CleanLostNodesWork.disableUpdatingNodeLastRefresh=true (false by default).

Edit: I will think a bit more about this, and get back. I think this will greatly reduce the diff for this PR.

gbhat618 · 2026-06-22T16:01:39Z

> Perhaps instead of introducing a new label [...] a magic value [...] 0 [...]

This has a problem: it won't delete the orphans of the current controller either.

So we need an identifier for the instances created by the current controller, which can be the instanceIds of the GCE clouds:

google-compute-engine-plugin/src/main/java/com/google/jenkins/plugins/computeengine/ComputeEngineCloud.java

Lines 163 to 165 in 9ade970

    
    // Apply a label that associates an instance configuration with
    // this cloud provider
    configuration.appendLabel(CLOUD_ID_LABEL_KEY, getInstanceId());

And we can have a system property -Dcom.google.jenkins.plugins.computeengine.CleanLostNodesWork.deleteOnlyVMsOfThisController=true (or a similar name) and build the logic around that. We then won't have to introduce a new label, and the diff will be smaller too.
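A rough sketch of what such a filter could look like, assuming each remote VM carries the cloud ID label shown above. Everything here is illustrative: the literal label key, the class name, and the method name are assumptions, and remote instances are modeled as a plain map of instance name to labels rather than the real GCE client types.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class OwnInstancesFilterSketch {
    // ASSUMPTION: the literal key is a guess; the plugin refers to it as CLOUD_ID_LABEL_KEY.
    public static final String CLOUD_ID_LABEL_KEY = "jenkins_cloud_id";

    /** Returns the names of instances whose cloud ID label matches one of this controller's clouds. */
    public static List<String> ownedInstances(
            Set<String> localCloudIds, Map<String, Map<String, String>> labelsByInstance) {
        return labelsByInstance.entrySet().stream()
                .filter(e -> {
                    String cloudId = e.getValue().get(CLOUD_ID_LABEL_KEY);
                    // Unlabeled instances are never considered ours.
                    return cloudId != null && localCloudIds.contains(cloudId);
                })
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> remote = Map.of(
                "agent-a1", Map.of(CLOUD_ID_LABEL_KEY, "cloud-A"),
                "agent-b1", Map.of(CLOUD_ID_LABEL_KEY, "cloud-B"),
                "unlabeled", Map.<String, String>of());
        // Only the instance created by cloud-A is a cleanup candidate for this controller.
        System.out.println(ownedInstances(Set.of("cloud-A"), remote)); // [agent-a1]
    }
}
```

With a filter like this applied before the orphan check, a controller would only ever delete VMs carrying one of its own cloud IDs.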

Note: there is a small gotcha (it only affects CasC-based controllers). If somebody doesn't set the instanceId field for a GCE cloud in their CasC configuration, a new random UUID is computed when the CasC config is reapplied after a restart. The controller then won't recognize the VMs it created before the restart and will leave them hanging around. This can be avoided by setting a UUID in the CasC YAML file. We can document it, and if someone forgets to set this up, the outcome is still acceptable: occasionally some VMs that existed at the time of that specific restart would be left behind, which is timing dependent and therefore likely rare. The relevant code:

google-compute-engine-plugin/src/main/java/com/google/jenkins/plugins/computeengine/ComputeEngineCloud.java

Lines 181 to 184 in 9ade970

    
    public void setInstanceId(String instanceId) {
        if (Strings.isNullOrEmpty(instanceId)) {
            this.instanceId = UUID.randomUUID().toString();
        } else {
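The CasC gotcha can be illustrated with a small stand-in class. This is a sketch only: the class is hypothetical, the empty-value branch mirrors the snippet above, and the else branch (storing the given value as-is) is an assumption because the original snippet is cut off there.

```java
import java.util.UUID;

public class InstanceIdSketch {
    private String instanceId;

    public void setInstanceId(String instanceId) {
        if (instanceId == null || instanceId.isEmpty()) {
            // Same fallback as in the snippet above: no configured ID means a fresh random one.
            this.instanceId = UUID.randomUUID().toString();
        } else {
            // ASSUMPTION: a configured ID is kept unchanged.
            this.instanceId = instanceId;
        }
    }

    public String getInstanceId() {
        return instanceId;
    }

    /** Simulates the cloud object being rebuilt from configuration after a restart. */
    public static String idAfterRestart(String configuredId) {
        InstanceIdSketch cloud = new InstanceIdSketch();
        cloud.setInstanceId(configuredId);
        return cloud.getInstanceId();
    }

    public static void main(String[] args) {
        // Without a configured ID, every restart yields a different value,
        // so VMs labeled before the restart no longer match.
        System.out.println(idAfterRestart(null).equals(idAfterRestart(null))); // false
        // With a fixed ID in the configuration, the value is stable across restarts.
        String fixed = "3f2c1a9e-0000-4000-8000-000000000001";
        System.out.println(idAfterRestart(fixed).equals(idAfterRestart(fixed))); // true
    }
}
```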

gbhat618

#558 (comment) is a much simpler implementation that leverages the existing labels and implementation.
It is worth debugging to find out whether there is an actual bug or an infra problem - #558 (comment)

tweirtx · 2026-06-22T16:32:44Z

We did extensive verification to ensure the plugin was the cause, including validation of running versions and logging active systems. We don't want to completely disable lost node detection, but we do want to ensure that the system is not picking up nodes belonging to another Jenkins host. We have a hotfix in place that sets the recurrence period to 24 hours, and that solves the problem, but the better fix is to remove the bad behavior in the first place.

Regarding the use of the instanceId label, I saw that you had mentioned in #503 the potential for a misconfiguration to occur based on that cloud ID. I took that into account when designing my solution, which originally sought to allow for the option to revert to the old behavior (versions prior to that PR did not exhibit the issue), however I decided a new implementation was in order to address the root cause of the original PR.

gbhat618 · 2026-06-23T01:40:07Z

> extensive verification to ensure the plugin was the cause, including validation of running versions and logging active systems

Did you track down the bug?

Edit: I have gone through the code again and done some more analysis. I don't think there is a way for controller-A to delete the VMs of controller-B, unless:

controller-B faced exceptions while updating the timestamps, or
controller-B has a longer recurrencePeriod configured via system property, and thus didn't refresh the timestamp in time.

The 24h recurrencePeriod hotfix also points towards such an asymmetric config, or exceptions, delaying the timestamp refresh.
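The asymmetric-configuration scenario can be made concrete with a short sketch. It assumes that the deletion cutoff is three times the deleting controller's own recurrence period (consistent with the "1h refresh, 3h cutoff" figures earlier in the thread, but the multiplier and all names here are assumptions, not confirmed plugin behavior).

```java
import java.time.Duration;

public class AsymmetricRecurrenceSketch {
    // ASSUMPTION: cutoff = 3 x recurrence period of the controller doing the cleanup.
    public static final int LOST_MULTIPLIER = 3;

    /**
     * True when controller-B refreshes its VMs less often than
     * controller-A's cutoff, so B's labels can look stale to A.
     */
    public static boolean atRisk(Duration recurrenceA, Duration recurrenceB) {
        Duration cutoffA = recurrenceA.multipliedBy(LOST_MULTIPLIER);
        return recurrenceB.compareTo(cutoffA) > 0;
    }

    public static void main(String[] args) {
        // Both controllers on the default hourly schedule: B refreshes well within A's cutoff.
        System.out.println(atRisk(Duration.ofHours(1), Duration.ofHours(1))); // false
        // B set to 24h while A stays at 1h: B's VMs can exceed A's 3h cutoff between refreshes.
        System.out.println(atRisk(Duration.ofHours(1), Duration.ofHours(24))); // true
    }
}
```

If this assumption holds, mixing recurrence periods across controllers that share a project is itself a possible source of cross-controller deletions.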

gbhat618 · 2026-06-23T02:07:16Z

    }

    @Test
    public void testLostNodeCleanedUpBySecondController() throws Throwable {


Won't this test now fail due to cloud.setLostNodeCleanupRestriction(true)?

gbhat618

Please explain why a configurable label for the lost node restriction is needed at the individual cloud level. Perhaps a system property to toggle it, with a hardcoded label name, would be easier. Also, please don't use mock-based tests; use an integration test instead. For example, adding another test testSecondControllerDoesNotCleanUpLostNodeWhenRestrictionEnabled to the existing CleanLostNodeWorkIT would provide the coverage.

gbhat618 · 2026-06-23T02:39:04Z

Given #558 (comment), I still think we should track down the actual bug and think of reproducible scenarios (preferably via an IT test) and arrive at the solution accordingly.

If we are unable to come up with a reproducer at all, I think it is better to introduce only a system property rather than new configuration properties. Adding new configurable properties means we will need to support them, and they cannot be removed in future releases without appropriate backward compatibility support, which would just complicate the whole thing.

tweirtx · 2026-06-23T15:23:16Z

I haven't been able to track down the root cause of the problem, but it was observed before the recurrence periods were reconfigured (which is why they were configured to 24 hours in the first place). I will make the requested changes to the tests, though I will say that the integration tests currently all pass. The log message was mistakenly left in from debugging and I will remove it.

The reasoning for a configurable label is to differentiate which Jenkins master node created the VMs, and consequently, filter out which ones actually apply to the master node when it runs cleanup. I will refactor the code to use a system property instead of a per-cloud configuration as requested.

gbhat618 · 2026-06-24T03:33:54Z

See #558 (comment). Let me fix CleanLostNodesWorkIT and deflake it so that it doesn't rely on so much sleep.

gbhat618 · 2026-06-24T14:19:52Z

I noticed commit cab5ddc. It won't deflake the ITs: the flakiness comes from relying on timeouts while waiting for agents, which should be done using SemaphoreStep instead. This pattern can be seen in more recent ITs, e.g. https://github.com/jenkinsci/google-compute-engine-plugin/blob/develop/src/test/java/com/google/jenkins/plugins/computeengine/integration/ComputeEngineCloudDiskMappingIT.java (and other recent tests).

IMO it is better to file a separate PR for the fix. I can do that later today or early tomorrow, my time.

…ot updated yet)

tweirtx · 2026-06-24T18:05:16Z

Changed the behavior to use the Jenkins instance ID. I'll let you update the integration tests to what you want to see.

tweirtx added 8 commits June 11, 2026 14:52

Initial attempt to add lost node label restriction

38e1892

Enhance debug logging

39c5176

Change how we're doing lookups

4518724

Improve logging, rearrange order of operations

a1d2454

Re-enable deletions

01d3bdd

Spotless apply

44474c4

Generate unit and integration tests with copilot

0abafa4

Spotless test

dcf47d4

gbhat618 reviewed Jun 19, 2026


Comment thread src/test/java/com/google/jenkins/plugins/computeengine/CleanLostNodesWorkTest.java Outdated

tweirtx added 2 commits June 22, 2026 10:01

Remove copyright that doesn't need to be there

f719593

Merge branch 'develop' into lost-node-cleanup

65184fc

gbhat618 requested changes Jun 22, 2026


tweirtx marked this pull request as ready for review June 22, 2026 18:11

tweirtx requested a review from a team as a code owner June 22, 2026 18:11