feat: dynamic mesh idle_timeout from patch.cfg without container restart #414

Open: rippleitinnz wants to merge 2 commits into main from feat/dynamic-mesh-idle-timeout-from-patch-cfg

@rippleitinnz rippleitinnz commented May 20, 2026

Problem

mesh.idle_timeout is only read at startup in p2p::init() and cached in
metric_thresholds[4]. It cannot be changed on a running node without a
container restart, which is not possible on leased Evernode instances.

consensus.roundtime can already be changed dynamically from patch.cfg (it is
read from the contract section each ledger). However, the effective maximum
roundtime is constrained by mesh.idle_timeout. There are 4 consensus stages, each
taking roundtime × stage_slice% (default 25%), so the longest any single stage
can wait is roundtime × 0.25. If this exceeds mesh.idle_timeout, peers
disconnect during the wait and their proposals are discarded as stale.

At the default mesh.idle_timeout of 120000ms this creates a hard ceiling:

safe_max_roundtime = mesh.idle_timeout / stage_slice% = 120000 / 0.25 = 480000ms

Exceeding 480000ms roundtime causes peer disconnections during stage waits,
leading to "Not enough stage X proposals" every round and permanent consensus
failure. The cluster cannot recover without terminating all nodes.

Without this fix, the dynamically-configurable roundtime range of 1000–3600000ms
is misleading — only 1000–480000ms is actually safe with default settings.
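The ceiling above can be checked with a few lines of arithmetic. The following is a minimal sketch, not code from hpcore: the helper names (stage_wait_ms, safe_max_roundtime_ms, roundtime_is_safe) are hypothetical, and the stage slice is assumed to be a whole percentage.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers for illustration only; none of these exist in hpcore.

// Longest wait of a single consensus stage: roundtime x stage_slice%.
inline std::uint64_t stage_wait_ms(std::uint64_t roundtime_ms, std::uint32_t stage_slice_pct)
{
    return roundtime_ms * stage_slice_pct / 100;
}

// Largest roundtime whose stage wait still fits within mesh.idle_timeout.
inline std::uint64_t safe_max_roundtime_ms(std::uint64_t idle_timeout_ms, std::uint32_t stage_slice_pct)
{
    assert(stage_slice_pct > 0); // Guard against division by zero.
    return idle_timeout_ms * 100 / stage_slice_pct;
}

// True when a stage wait does not exceed the idle timeout, i.e. peers are
// not expected to be dropped as idle while a stage is waiting.
inline bool roundtime_is_safe(std::uint64_t roundtime_ms, std::uint64_t idle_timeout_ms,
                              std::uint32_t stage_slice_pct)
{
    return stage_wait_ms(roundtime_ms, stage_slice_pct) <= idle_timeout_ms;
}
```

With the defaults, safe_max_roundtime_ms(120000, 25) gives 480000, and a roundtime of 485000ms produces a 121250ms stage wait, which is over the 120000ms idle timeout.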

Fix

Three changes:

comm_server.hpp — added for_each_session() template method to iterate
over all active sessions under mutex protection.

p2p.cpp — added update_idle_timeout() which updates metric_thresholds[4]
for future connections AND calls set_threshold(IDLE_CONNECTION_TIMEOUT) on all
existing active sessions via for_each_session().

conf.cpp — reads mesh.idle_timeout from patch.cfg in apply_patch_config(),
calls p2p::update_idle_timeout() when value changes.
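The shape of the first two changes can be sketched roughly as below. This is a simplified stand-in, not the code in this PR: the session and comm_server types, add_session(), the threshold array size, and the assumption that IDLE_CONNECTION_TIMEOUT is index 4 are all illustrative, and update_idle_timeout() takes the server as a parameter here only to keep the example self-contained.

```cpp
#include <cassert>
#include <cstdint>
#include <list>
#include <mutex>

// Simplified stand-ins; hpcore's real types and signatures may differ.
enum SESSION_THRESHOLDS { IDLE_CONNECTION_TIMEOUT = 4, THRESHOLD_COUNT = 5 };

// Cached thresholds applied to new connections (assumed layout, idle timeout at [4]).
std::uint64_t metric_thresholds[THRESHOLD_COUNT] = {0, 0, 0, 0, 120000};

struct session
{
    std::uint64_t thresholds[THRESHOLD_COUNT] = {};
    void set_threshold(int type, std::uint64_t value) { thresholds[type] = value; }
};

class comm_server
{
    std::list<session> sessions;
    std::mutex sessions_mutex;

public:
    // Hypothetical helper: registers a session using the cached thresholds.
    session &add_session()
    {
        std::lock_guard<std::mutex> lock(sessions_mutex);
        sessions.emplace_back();
        sessions.back().set_threshold(IDLE_CONNECTION_TIMEOUT,
                                      metric_thresholds[IDLE_CONNECTION_TIMEOUT]);
        return sessions.back();
    }

    // Invokes fn on every active session while holding the sessions mutex.
    template <typename F>
    void for_each_session(F &&fn)
    {
        std::lock_guard<std::mutex> lock(sessions_mutex);
        for (session &s : sessions)
            fn(s);
    }
};

// Updates the cached value for future connections, then every live session.
void update_idle_timeout(comm_server &server, std::uint64_t timeout_ms)
{
    metric_thresholds[IDLE_CONNECTION_TIMEOUT] = timeout_ms;
    server.for_each_session([timeout_ms](session &s)
                            { s.set_threshold(IDLE_CONNECTION_TIMEOUT, timeout_ms); });
}
```

Holding the mutex for the whole iteration keeps sessions from being added or removed mid-update. The callback should stay short and must not call back into the server, or it would deadlock on the non-recursive mutex. The write to metric_thresholds is unsynchronized in this sketch.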

Effect

Operators can now increase mesh.idle_timeout via patch.cfg alongside
consensus.roundtime, enabling roundtimes beyond 480000ms without peer
disconnections. Takes effect immediately on all active and future connections
without any container restart. The full 1000–3600000ms roundtime range becomes
safely usable.

Sibling PRs

This is part of a series making some hpcore config fields dynamically updatable from patch.cfg.

These three PRs should be reviewed and merged together. This PR and #415 share
the for_each_session() template added to comm_server.hpp.

Testing

Tested on a live 3-node Evernode cluster. Roundtime of 485000ms with default
mesh.idle_timeout=120000ms causes permanent consensus failure. With
mesh.idle_timeout updated dynamically to 200000ms alongside the roundtime
change, the cluster runs cleanly.

When log.log_level is present in patch.cfg, apply_patch_config() now
updates the live plog logger severity via plog::get()->setMaxSeverity()
in addition to persisting the change to hp.cfg and the runtime cfg struct.

Previously log level was only read at startup (hplog::init()) and could
not be changed on a running node without a container restart. This meant
operators had no way to change log verbosity on external Evernode hosts
where they don't control the container lifecycle.

The fix uses plog's built-in setMaxSeverity() API which is thread-safe
and takes effect immediately on the next log statement.

When mesh.idle_timeout is present in patch.cfg, apply_patch_config() now
updates all active peer sessions via a new p2p::update_idle_timeout() function
and also updates the cached metric_thresholds array for future connections.

Previously mesh.idle_timeout was only read at startup (p2p::init()) and could
not be changed on a running node without a container restart. This is critical
for operators who need to increase roundtime beyond mesh.idle_timeout * 4 —
without this fix, increasing roundtime past 480000ms (at default idle_timeout
of 120000ms) causes peer disconnections and permanent consensus failure.

Implementation:
- comm_server.hpp: added for_each_session() template to iterate live sessions
- p2p.cpp: added update_idle_timeout() which updates metric_thresholds[4] and
  calls set_threshold(IDLE_CONNECTION_TIMEOUT) on all active sessions
- conf.cpp: reads mesh.idle_timeout from patch.cfg in apply_patch_config(),
  calls p2p::update_idle_timeout() when value changes

Sibling PRs (part of dynamic config series):
- fix/dynamic-log-level-from-patch-cfg (already raised)
- feat/dynamic-user-idle-timeout-from-patch-cfg (to be raised)

Fixes: mesh connections dropping during long roundtimes