worker and endpoint cleanup #118

Description

@angainor

Currently we destroy the endpoint with

ucs_status_ptr_t ret = ucp_ep_close_nb(m_ep, UCP_EP_CLOSE_MODE_FORCE);

We have to use UCP_EP_CLOSE_MODE_FORCE instead of UCP_EP_CLOSE_MODE_FLUSH because of a cleanup glitch. In summary: if one rank destroys its worker and endpoint, and another rank then tries to close its endpoint with FLUSH, the flush can try to communicate with the already destroyed remote worker. This causes a segfault. UCP_EP_CLOSE_MODE_FORCE avoids the segfault, but the solution suggested by the developers is to put a barrier after the endpoint destructors and only then destroy the workers.
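The suggested ordering can be sketched roughly as below. This is only an illustration, not the project's code: `teardown_in_order` and `demo_teardown_order` are hypothetical names, the three phases are passed in as callables so the sketch compiles without UCX, and the barrier is assumed to be something like `MPI_Barrier`. The UCX calls named in the comments mark where the real work would go.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical helper (not part of UCX or of this project): runs the three
// teardown phases in the order suggested in the issue.
inline void teardown_in_order(const std::function<void()>& close_endpoints,
                              const std::function<void()>& barrier,
                              const std::function<void()>& destroy_worker)
{
    // All three phases must be provided.
    assert(close_endpoints && barrier && destroy_worker);

    // Phase 1: close every endpoint on this rank, e.g. with
    // ucp_ep_close_nb(ep, UCP_EP_CLOSE_MODE_FLUSH), progressing the worker
    // until each returned request has completed.
    close_endpoints();

    // Phase 2: wait until all ranks have finished closing their endpoints
    // (assumption: an MPI-style barrier such as MPI_Barrier(MPI_COMM_WORLD)).
    barrier();

    // Phase 3: only now destroy the worker, e.g. ucp_worker_destroy(worker),
    // so that no remote flush can reach a worker that is already gone.
    destroy_worker();
}

// Stand-in run that records the order of the phases instead of calling UCX.
inline std::vector<std::string> demo_teardown_order()
{
    std::vector<std::string> log;
    teardown_in_order([&] { log.push_back("close_endpoints"); },
                      [&] { log.push_back("barrier"); },
                      [&] { log.push_back("destroy_worker"); });
    return log;
}
```

With this ordering every rank's worker is still alive while the other ranks flush their endpoints, which should make UCP_EP_CLOSE_MODE_FLUSH safe to use instead of UCP_EP_CLOSE_MODE_FORCE.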
