Safe error handling and fail-safes implementation #1031

Description

@aaronleder-elastic

Feature Request: Add maximum output size, runtime limits, and graceful connection failure handling to support-diagnostics

Situation

I have hit environment-specific errors when the support-diagnostics tool connects to localhost.

When such an error occurs, the generated ZIP archive can keep growing until it fills the available disk space. In my case, the tool produced a 13 GB ZIP archive that could not be read.

This can create several operational problems:

  • Exhausts available disk space
  • Produces an unusable diagnostics archive
  • Requires manual cleanup
  • Can create additional system instability on already impacted hosts

Proposed Solution

Add command-line options that cap the size of the generated diagnostics archive and the runtime of the collection.

For example:

./diagnostics.sh --maxSize 4g

This would limit the generated archive to 4 GB. Once the limit is reached, the tool should stop collecting data, close the archive cleanly if possible, and report that the maximum size was reached.
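A minimal sketch of how a value such as `4g` could be interpreted and enforced. This is illustrative Python, not code from support-diagnostics: `parse_size`, `SizeBudget`, and `SizeLimitReached` are hypothetical names, and treating `g` as 1024^3 bytes is an assumption, since the request does not say whether units are binary or decimal.

```python
import re

# Assumption: binary units (1g = 1024**3 bytes); the request does not specify.
_UNITS = {"": 1, "k": 1024, "m": 1024**2, "g": 1024**3, "t": 1024**4}


def parse_size(text: str) -> int:
    """Convert a size string such as '4g' or '512m' into a byte count."""
    match = re.fullmatch(r"(\d+)([kmgt]?)b?", text.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]


class SizeLimitReached(Exception):
    """Raised when a write would push the archive past its limit."""


class SizeBudget:
    """Tracks bytes written and refuses writes that exceed the limit."""

    def __init__(self, limit_bytes: int) -> None:
        self.limit_bytes = limit_bytes
        self.used_bytes = 0

    def charge(self, byte_count: int) -> None:
        # Check before accepting the write so the limit is never exceeded.
        if self.used_bytes + byte_count > self.limit_bytes:
            raise SizeLimitReached(
                f"maximum archive size was reached: {self.limit_bytes} bytes"
            )
        self.used_bytes += byte_count
```

Checking before each write, rather than after, keeps the archive at or under the limit and leaves the caller free to close it cleanly when `SizeLimitReached` is raised.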

Additionally, a runtime limit would be useful:

./diagnostics.sh --maxTime 90s

This would stop diagnostics collection after 90 seconds, even if collection has not finished.
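One way a runtime limit could work is to parse the value once and then poll a deadline between collection steps. The sketch below is illustrative Python with hypothetical names (`parse_duration`, `Deadline`); the accepted suffixes `s`, `m`, and `h` are an assumption, as the request only shows `90s`.

```python
import re
import time

# Assumption: only seconds, minutes, and hours are accepted.
_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text: str) -> int:
    """Convert a duration string such as '90s' or '5m' into seconds."""
    match = re.fullmatch(r"(\d+)([smh])", text.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    number, unit = match.groups()
    return int(number) * _SECONDS[unit]


class Deadline:
    """Reports whether a fixed number of seconds has elapsed."""

    def __init__(self, seconds: float, clock=time.monotonic) -> None:
        # A monotonic clock is unaffected by wall-clock adjustments.
        self._clock = clock
        self._expires_at = clock() + seconds

    def expired(self) -> bool:
        return self._clock() >= self._expires_at
```

A collection loop would call `expired()` before starting each step. Polling like this cannot interrupt a step that is already blocked, so each individual request would still need its own timeout.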

The tool should also fail gracefully if it cannot connect to the target host. If the connection to localhost or any specified host fails, the tool should stop collection, stop writing data into the archive, and return a clear error message explaining that the connection was unsuccessful.

Suggested Behavior

If the connection cannot be established, the tool should:

  1. Stop diagnostics collection immediately
  2. Avoid creating or continuing to grow a ZIP archive
  3. Emit a clear connection failure message
  4. Exit with a non-zero return code
  5. Include enough detail in the logs to identify the failed host and the reason the connection failed
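The steps above could be sketched as a preflight check that runs before any archive is created. This is illustrative Python, not the tool's code: `check_connection` and `preflight` are hypothetical names, the check is a plain TCP connect (a real check would likely also cover TLS and authentication), and the 5-second timeout is an arbitrary assumption.

```python
import socket
import sys


def check_connection(host: str, port: int, timeout: float = 5.0):
    """Return None if a TCP connection succeeds, else a short reason string."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return None
    except OSError as exc:  # covers refused, timed out, and DNS failures
        return f"{type(exc).__name__}: {exc}"


def _to_stderr(message: str) -> None:
    print(message, file=sys.stderr)


def preflight(host: str, port: int, emit=_to_stderr) -> int:
    """Return 0 if the host is reachable; otherwise report and return 1."""
    reason = check_connection(host, port)
    if reason is None:
        return 0
    emit(
        "Diagnostics collection stopped because a successful connection "
        f"to {host} could not be established."
    )
    # Log enough detail to identify the failed host and the cause.
    emit(f"host={host} port={port} reason={reason}")
    return 1
```

A caller would run `preflight` first and pass its result to `sys.exit` when it is non-zero, so no ZIP file is ever opened for an unreachable host.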

Example message:

Diagnostics collection stopped because a successful connection to localhost could not be established.

If either the size limit or the runtime limit is reached, the tool should:

  1. Stop the current collection process
  2. Finalize the archive if possible
  3. Emit a clear warning or error message
  4. Exit with a non-zero return code
  5. Include enough detail in the logs to show whether the stop was caused by the size limit or the runtime limit

Example message:

Diagnostics collection stopped because maximum archive size was reached: 4g

or:

Diagnostics collection stopped because maximum runtime was reached: 90s
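The limit-handling behavior above could look roughly like the following collection loop. It is a sketch under stated assumptions, not the tool's implementation: `collect` is a hypothetical function, the collected items are modeled as in-memory `(name, bytes)` pairs, and the size check conservatively counts each pending payload at its uncompressed length.

```python
import time
import zipfile


def collect(items, archive_path, max_bytes, max_seconds, clock=time.monotonic):
    """Write (name, payload) pairs into a ZIP until done or a limit is hit.

    Returns (exit_code, message); exit_code is non-zero if a limit stopped it.
    """
    started = clock()
    written = 0  # compressed bytes stored so far (excludes ZIP headers)
    stop_reason = None

    # Leaving the `with` block closes the archive and writes its central
    # directory, so a partial archive stays readable.
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, payload in items:
            if clock() - started >= max_seconds:
                stop_reason = f"maximum runtime was reached: {max_seconds}s"
                break
            # Conservative check: assume the payload does not compress.
            if written + len(payload) > max_bytes:
                stop_reason = (
                    f"maximum archive size was reached: {max_bytes} bytes"
                )
                break
            archive.writestr(name, payload)
            written += archive.getinfo(name).compress_size

    if stop_reason is None:
        return 0, "Diagnostics collection completed"
    return 1, f"Diagnostics collection stopped because {stop_reason}"
```

Because the stop reason is recorded before the archive is closed, the returned message states whether the size limit or the runtime limit ended the run, and the caller can pass the exit code straight to `sys.exit`.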

Use Case

This would help prevent runaway diagnostics collections in situations where the tool encounters an unexpected environment-specific issue, such as a failed or problematic localhost connection.

It would also make the tool safer to run in production or space-constrained environments.

Example

./diagnostics.sh \
  --type local \
  --ssl \
  --maxSize 4g \
  --maxTime 90s \
  --output /tmp/diagnostics

Benefit

Adding these safeguards would make support-diagnostics more resilient and safer to run by preventing excessive disk usage, avoiding unusable oversized archive files, and gracefully handling failed host connections.
