Added lost node cleanup restriction option for GCP projects with multiple instances.#558
Added lost node cleanup restriction option for GCP projects with multiple instances.#558tweirtx wants to merge 19 commits into
Conversation
|
Hi, have you checked the follow up PR that fixed the bug #514? The problem you describe shouldn't be happening if you are using the version with the follow-up that fixed the issue. |
|
Yes, I did check that PR, but we're still seeing the issue on versions newer than the one that included that fix. |
I think it's better to confirm where the bug is coming from. Some thoughts,
The way the
So unless the controller which is facing the agent VM deletion by other controllers is running an older of the plugin that doesn't have the current version of the code of |
|
But if you wanted to disable the
But as already mentioned if you are attempting to do either of these, it will need to be done in all controllers. |
|
Now, coming to how can disable deletion of agent VMs created by a specific controller, Perhaps instead of introducing a new label Edit: I will think a bit more about this, and get back. I think this will greatly reduce the diff for this PR. |
This has a problem, it won't delete the orphans of the current controller as well. Then we need an identifier for the instances created by the current controller's, which can be the And we can have a system property Note: a small gotcha (that only affects a CasC based controller though) if somebody doesn't set the |
gbhat618
left a comment
There was a problem hiding this comment.
- #558 (comment) is much simpler implementation leveraging the existing labels and implementation.
- worth debugging to know if there is an actual bug or it's infra problem - #558 (comment)
|
We did extensive verification to ensure the plugin was the cause, including validation of running versions and logging active systems. We don't want to completely disable lost node detection, but we do want to ensure that the system is not picking up nodes belonging to another Jenkins host. We have a hotfix in place that sets the recurrence period to 24 hours, and that solves the problem, but the better fix is to remove the bad behavior in the first place. Regarding the use of the instanceId label, I saw that you had mentioned in #503 the potential for a misconfiguration to occur based on that cloud ID. I took that into account when designing my solution, which originally sought to allow for the option to revert to the old behavior (versions prior to that PR did not exhibit the issue), however I decided a new implementation was in order to address the root cause of the original PR. |
did you track down the bug ? Edit: I have gone through the code again, and did some more analysis. I don't think there is a way for controller-A to delete the VMs of controller-B, unless,
The |
| } | ||
|
|
||
| @Test | ||
| public void testLostNodeCleanedUpBySecondController() throws Throwable { |
There was a problem hiding this comment.
This test will now be failing due to
cloud.setLostNodeCleanupRestriction(true);
?
There was a problem hiding this comment.
Please help explain what would be the need for configurable label for lost node restriction at individual cloud level. Perhaps having a system property to decide and hardcoding the label name is easier. Also requesting to not use the mock based tests, rather use the IT test, something like - adding another test testSecondControllerDoesNotCleanUpLostNodeWhenRestrictionEnabled in the existing CleanLostNodeWorkIT would provide the coverage.
|
Given #558 (comment), I still think we should track down the actual bug and think of reproducible scenarios (preferably via an IT test) and arrive at the solution accordingly. If unable to come up with a reproducer at all, I think it is better to just introduce a system property alone, rather introducing configuration properties, adding new configurable properties means we will need to support it, and they cannot be taken out in future releases without appropriate backward compatibility support - which will just complicate the whole thing. |
|
I haven't been able to track down the root cause of the problem, but it was observed before the recurrence periods were reconfigured (which is why they were configured to 24 hours in the first place). I will make the requested changes to the tests, though I will say that the integration tests currently all pass. The log message was mistakenly left in from debugging and I will remove it. The reasoning for a configurable label is to differentiate which Jenkins master node created the VMs, and consequently, filter out which ones actually apply to the master node when it runs cleanup. I will refactor the code to use a system property instead of a per-cloud configuration as requested. |
|
#558 (comment) and let me fix the |
|
noticed this commit cab5ddc - it won't deflake the IT's - the flakyness comes from relying on timeouts in waiting for agents - which should be done using SemaphoreStep - this pattern can be seen in more recent ITs ex- https://github.com/jenkinsci/google-compute-engine-plugin/blob/develop/src/test/java/com/google/jenkins/plugins/computeengine/integration/ComputeEngineCloudDiskMappingIT.java (and other recent tests) Imo better to file a separate PR for fixing - I can do that later today or earlier tomorrow my time. |
|
Changed to the behavior of using the Jenkins instance ID, I'll let you update the integration tests to what you want to see. |
My team's Jenkins setup has multiple hosts utilizing the same GCP project, and the changes in #503 have led to an uptick in incidents where one Jenkins host terminates another's build job in the lost node cleanup routine. I understand the original criticism that led to this change, and I propose that instead of using the previous behavior which could cause issues, we add an option to have a configurable label to restrict the cleanup job's targeted GCE VMs. This may also provide a solid solution to #157 and https://issues.jenkins.io/browse/JENKINS-68244 as well.
Testing done
Aside from the unit tests, I deployed this in our QA environment and thoroughly tested the behavior of the clean lost nodes routine with the changes I made. At this time, I have not run the integration tests due to security restrictions, but I will be completing that task as soon as I can. (and moving this PR out of draft status). EDIT: Integration tests have been successfully completed.
UI before:

UI after:

Submitter checklist