coordinator does not handle exporter side cancel #1855

@sgross-emlix

Description

A friend of mine misconfigured their exporter.yaml.
After the exporter failed to start, the configuration was fixed and another attempt was started.
This, however, resulted in a failure with

status = StatusCode.ALREADY_EXISTS

System

Reproduction

Have a misconfigured exporter.yaml and try to start the exporter.
Fix the configuration after a failed attempt.

Observed Behaviour

The coordinator keeps an instance of the exporter and will refuse to accept a new instance of this exporter.

Coordinator

DEBUG:grpc._cython.cygrpc:[_cygrpc] Loaded running loop: id(loop)=139766749596624
INFO:root:exporter connected: ipv4:10.88.0.49:49302
DEBUG:root:exporter in_msg startup {
  version: "25.0+264-gc246fab8"
  name: "emlix-test"
}

ERROR:root:error in exporter message handler
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/labgrid/remote/coordinator.py", line 426, in request_task
    raise ExporterError(
labgrid.remote.coordinator.ExporterError: exporter with name 'emlix-test' is already connected from ipv4:10.88.0.49:46262

Exporter

ERROR:root:unexpected grpc error in coordinator message pump task
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/labgrid/remote/exporter.py", line 899, in message_pump
    async for out_message in self.stub.ExporterStream(queue_as_aiter(self.out_queue)):
  File "/usr/local/lib/python3.11/dist-packages/grpc/aio/_call.py", line 366, in _fetch_stream_responses
    await self._raise_for_status()
  File "/usr/local/lib/python3.11/dist-packages/grpc/aio/_call.py", line 274, in _raise_for_status
    raise _create_rpc_error(
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
	status = StatusCode.ALREADY_EXISTS
	details = "startup failed: exporter with name 'emlix-test' is already connected from ipv4:10.88.0.49:46262"
	debug_error_string = "UNKNOWN:Error received from peer ipv4:192.168.201.3:20409 {grpc_message:"startup failed: exporter with name \'emlix-test\' is already connected from ipv4:10.88.0.49:46262", grpc_status:6}"
>
DEBUG:root:pump task exited, shutting down exporter
DEBUG:asyncio:Close <_UnixSelectorEventLoop running=False closed=False debug=True>

Expected Behaviour

An exporter that fails to start up properly should not change the state of the coordinator.

Additional information

Even though the exporter clearly fails to start up, its return code after being shut down will be 0 if the configuration is correct.
This should be fixed by not masking errors in

except grpc.aio.AioRpcError as e:
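
A minimal sketch of what "not masking" could look like, under stated assumptions: `FakeRpcError`, `message_pump`, `run_exporter` and `main` are hypothetical stand-ins and not labgrid's actual code, and `FakeRpcError` takes the place of `grpc.aio.AioRpcError` so the example needs only the standard library. The idea is that the handler still logs the error but also turns it into a non-zero exit status.

```python
import asyncio


class FakeRpcError(Exception):
    """Hypothetical stand-in for grpc.aio.AioRpcError (keeps the sketch stdlib-only)."""


async def message_pump(fail: bool) -> None:
    # Stand-in for the coordinator message pump; raises the way a
    # rejected startup (e.g. ALREADY_EXISTS) would.
    await asyncio.sleep(0)
    if fail:
        raise FakeRpcError("startup failed: exporter is already connected")


async def run_exporter(fail: bool) -> int:
    """Return a process exit code instead of swallowing the pump error."""
    try:
        await message_pump(fail)
    except FakeRpcError as e:
        print(f"coordinator rejected exporter: {e}")
        return 1  # surface the failure to the caller
    return 0


def main(fail: bool) -> int:
    return asyncio.run(run_exporter(fail))
```

A real entry point would pass the result to `sys.exit()`, so that a service manager such as systemd sees the failure.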

Please also find the attached tarball

repro.tar.gz

for reproduction and logs.

The cleanup routine

session = self.exporters.pop(peer)
is entered only occasionally, so the error will not always be observed and multiple restarts may be required to trigger this behaviour.
The repeated restarts of the exporter are handled by systemd.
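
To illustrate the expected behaviour, here is a simplified model of a coordinator whose cleanup always runs, including when the exporter side cancels the stream. It is only a sketch: the `Coordinator` class, the `exporter_stream` method and its `work` parameter are assumptions made for this example and do not mirror the real coordinator code; only the names `exporters` and `get_exporter_by_name` are borrowed from the snippets quoted in this issue.

```python
import asyncio


class ExporterError(Exception):
    pass


class Coordinator:
    def __init__(self) -> None:
        self.exporters: dict[str, str] = {}  # peer -> exporter name

    def get_exporter_by_name(self, name: str):
        for peer, known in self.exporters.items():
            if known == name:
                return peer
        return None

    async def exporter_stream(self, peer: str, name: str, work) -> None:
        if existing := self.get_exporter_by_name(name):
            raise ExporterError(
                f"exporter with name '{name}' is already connected from {existing}"
            )
        self.exporters[peer] = name
        try:
            await work  # stand-in for processing the request stream
        finally:
            # Runs on normal exit, on errors and on asyncio.CancelledError.
            self.exporters.pop(peer, None)


async def demo() -> dict:
    coord = Coordinator()
    task = asyncio.create_task(
        coord.exporter_stream("ipv4:10.88.0.49:46262", "emlix-test", asyncio.sleep(3600))
    )
    await asyncio.sleep(0)  # let the handler register the exporter
    assert coord.get_exporter_by_name("emlix-test") is not None
    task.cancel()  # simulate the exporter side cancelling the stream
    try:
        await task
    except asyncio.CancelledError:
        pass
    # A reconnect under the same name is accepted again.
    await coord.exporter_stream("ipv4:10.88.0.49:49302", "emlix-test", asyncio.sleep(0))
    return coord.exporters
```

With the removal in a `finally` block, the stale entry cannot survive a cancelled stream, so the second connection under the same name is not rejected.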

There are several ways to address this issue:

  • fix the underlying grpc code, or actually propagate the cancelled status and handle it in
  • untangle
    async def add_resource(self, group_name, resource_name, cls, params):
    and
    async def run(self) -> None:
    , i.e. do a sanity check on the configuration first
  • provide a command for labgrid-client to clear a reference to a broken exporter
  • add an option to exporter and a field to the startup message to forcefully register the exporter in the coordinator even though a reference exists in
    if existing := self.get_exporter_by_name(name):
  • make the exporter disconnect properly from the coordinator if it fails due to configuration errors
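
Of the options above, checking the configuration before contacting the coordinator could be sketched as follows. Everything here is an assumption for illustration: the config shape (group -> resource -> params, with an optional `cls` key), the `KNOWN_CLASSES` subset, and the `check_config` and `start` helpers are hypothetical and not part of labgrid.

```python
class ConfigError(Exception):
    pass


# Illustrative subset only; the real exporter supports many more classes.
KNOWN_CLASSES = {"NetworkSerialPort", "USBSerialPort"}


def check_config(config: dict) -> list:
    """Validate everything up front; return (group, name, cls, params) tuples."""
    checked = []
    for group_name, group in config.items():
        if not isinstance(group, dict):
            raise ConfigError(f"{group_name}: expected a mapping of resources")
        for resource_name, params in group.items():
            if not isinstance(params, dict):
                raise ConfigError(f"{group_name}/{resource_name}: expected a mapping")
            params = dict(params)  # do not modify the caller's config
            cls = params.pop("cls", resource_name)
            if cls not in KNOWN_CLASSES:
                raise ConfigError(
                    f"{group_name}/{resource_name}: unknown resource class '{cls}'"
                )
            checked.append((group_name, resource_name, cls, params))
    return checked


def start(config: dict, connect) -> list:
    resources = check_config(config)  # raises before any coordinator contact
    connect()  # only reached with a valid configuration
    return resources
```

Because `check_config` raises before `connect` is called, a broken configuration never reaches the coordinator and therefore cannot leave a stale session behind.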

There is also an unhandled AttributeError in

def __del__(self):
since self.child was never set.
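
One possible way to make such a destructor robust, shown on a hypothetical `ManagedResource` class (not the actual labgrid class): assign `self.child` before anything in `__init__` can raise, and additionally guard the lookup in `__del__` with `getattr`.

```python
class ManagedResource:
    """Hypothetical stand-in for an export class that owns a child process."""

    def __init__(self, fail: bool = False) -> None:
        self.child = None  # assigned before anything that can raise
        if fail:
            raise ValueError("misconfigured resource")
        self.child = object()  # stand-in for a started subprocess

    def stop(self) -> None:
        self.child = None

    def __del__(self):
        # getattr guards against __init__ failing before self.child exists.
        if getattr(self, "child", None) is not None:
            self.stop()
```

Either measure alone avoids the AttributeError; using both keeps `__del__` safe even if a subclass overrides `__init__` without calling the parent.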
