test: Add diagnostic logging to investigate intermittent electrs failures in CI #849
Draft
joostjager wants to merge 13 commits into lightningdevkit:main
👋 Hi! I see this is a draft PR.
443aa30 to 48a1bb2
tnull reviewed on Mar 26, 2026
Indeed, hypothesis failed.
Enable electrs stderr output in CI and log connection details at startup. Log errors that were previously silently discarded: the first block_headers_subscribe failure, generate_to_address failures, and ping errors across all polling helpers. This will help diagnose intermittent CI failures where electrs appears to crash or become unreachable mid-test. AI tools were used in preparing this commit.
48a1bb2 to 33927c3
Run integration tests 10 times in a loop with --nocapture to maximize the chance of hitting the intermittent electrs crash and to capture the new diagnostic logging output. AI tools were used in preparing this commit.
Increase resource pressure to reproduce intermittent electrs failures. Each of the 3 shards runs 5 iterations with 3 concurrent cargo test processes, for 45 total test runs with up to 9 simultaneous processes. AI tools were used in preparing this commit.
Read LDK_NODE_TEST_BASE_PORT env var to offset the listening port range, avoiding collisions when multiple test processes run simultaneously. Assign base ports 20000, 21000, 22000 to the three concurrent processes in the stress-test CI job. AI tools were used in preparing this commit.
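A minimal sketch of how a test helper might read the base port from the environment. Only the variable name `LDK_NODE_TEST_BASE_PORT` comes from the commit message; the helper names, the fallback value, and the parsing behavior are assumptions made for illustration and are not taken from the PR's diff.

```rust
use std::env;

/// Hypothetical fallback used when the variable is unset or cannot be parsed.
const DEFAULT_BASE_PORT: u16 = 20000;

/// Parses a base port from an optional raw string, falling back to the default.
/// Kept separate from the environment lookup so it can be checked in isolation.
fn parse_base_port(raw: Option<&str>) -> u16 {
    raw.and_then(|s| s.trim().parse::<u16>().ok())
        .unwrap_or(DEFAULT_BASE_PORT)
}

/// Reads LDK_NODE_TEST_BASE_PORT from the process environment.
fn base_port_from_env() -> u16 {
    parse_base_port(env::var("LDK_NODE_TEST_BASE_PORT").ok().as_deref())
}
```

With this shape, each concurrent CI process could export a different value (for example 20000, 21000, 22000) before invoking `cargo test`.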
Ports from previous iterations may still be in TIME_WAIT when the next iteration starts. Offset the base port by both iteration and process index to ensure no overlap. AI tools were used in preparing this commit.
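The offset arithmetic can be sketched as follows. The starting port 20000 and the spacing of 1000 mirror the values mentioned in the previous commit; the function name and the exact formula are assumptions, since the commit message only says the base is offset by both iteration and process index.

```rust
/// First base port, matching the 20000 used for the first process.
const FIRST_BASE_PORT: u16 = 20000;
/// Assumed width of the port block reserved for one (iteration, process) pair.
const PORTS_PER_SLOT: u16 = 1000;

/// Returns a base port that is unique per (iteration, process) pair,
/// or None if the index is invalid or the result would not fit in a u16.
fn slot_base_port(iteration: u16, process: u16, processes: u16) -> Option<u16> {
    if process >= processes {
        return None;
    }
    // Flatten the two indices into a single slot number.
    let slot = iteration.checked_mul(processes)?.checked_add(process)?;
    slot.checked_mul(PORTS_PER_SLOT)?.checked_add(FIRST_BASE_PORT)
}
```

Because every pair maps to a distinct slot, a port still in TIME_WAIT from an earlier iteration cannot fall inside a later iteration's block.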
Check whether the kernel OOM killer is responsible for electrs silently disappearing during tests. Dump relevant dmesg output after any test failure on ubuntu runners. AI tools were used in preparing this commit.
…ation Revert the deterministic port allocation approach from PR lightningdevkit#847 and instead use random ports with a retry loop around node.start(). This avoids collisions with ports allocated by electrsd/corepc_node via get_available_port(), which use the OS ephemeral port allocator and can land in any range. On InvalidSocketAddress, new random ports are selected and the node is rebuilt, up to 5 attempts. AI tools were used in preparing this commit.
When node.start() fails with InvalidSocketAddress and we retry with new random ports, also generate a fresh storage directory. Reusing the same directory causes the second build to fail with ReadFailed/Namespace not found since the first build already wrote data there. AI tools were used in preparing this commit.
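The retry behavior described in the two commits above might look roughly like this. It is a simplified model, not the code from the PR: `StartError`, `Attempt`, and `start_with_retry` are hypothetical names, and the closures stand in for port selection and for building and starting a real node.

```rust
/// Hypothetical stand-in for the node's start error type.
#[derive(Debug, PartialEq)]
enum StartError {
    InvalidSocketAddress,
    Other(String),
}

/// Configuration used for one start attempt.
#[derive(Debug)]
struct Attempt {
    port: u16,
    storage_dir: String,
}

/// Tries to start up to `max_attempts` times. Every attempt gets a new port
/// and a new storage directory, so a failed build never leaves stale data
/// behind for the next one. Only binding errors are retried.
fn start_with_retry<P, S>(
    mut pick_port: P,
    mut try_start: S,
    max_attempts: u32,
) -> Result<Attempt, StartError>
where
    P: FnMut() -> u16,
    S: FnMut(&Attempt) -> Result<(), StartError>,
{
    for attempt in 0..max_attempts {
        let candidate = Attempt {
            port: pick_port(),
            // Fresh directory per attempt (the naming scheme is an assumption).
            storage_dir: format!("node-storage-{attempt}"),
        };
        match try_start(&candidate) {
            Ok(()) => return Ok(candidate),
            Err(StartError::InvalidSocketAddress) => continue,
            Err(other) => return Err(other),
        }
    }
    Err(StartError::InvalidSocketAddress)
}
```

Errors other than `InvalidSocketAddress` are returned immediately in this sketch, so unrelated failures are not masked by the retry loop.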
Run lsof to identify what is using the port when node.start() fails with a binding error. This helps distinguish between collisions with electrsd/bitcoind, other test processes, or TIME_WAIT leftovers. AI tools were used in preparing this commit.
Read node_b's listening addresses from the node after setup instead of using the pre-retry variable, which may differ if setup_node retried with new ports. Simplify the stress test to run 1 process per shard with 10 iterations instead of 3 concurrent processes. The concurrent processes caused port collisions in code paths outside setup_node that don't have retry logic, which is noise unrelated to the electrs crash we're investigating. AI tools were used in preparing this commit.
Avoid intra-process port collisions between parallel tests by using an atomic counter that increments by 2 for each allocation. The base port is randomized once per process to reduce inter-process collisions. This eliminates the birthday-paradox collisions that occurred when every call independently picked a random port from the range. AI tools were used in preparing this commit.
fetch_update returns the previous value, so the first caller got port 0 instead of the random base. Use compare_exchange for one-time init followed by fetch_add, which correctly returns the base port to the first caller. AI tools were used in preparing this commit.
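The allocator described in the two commits above can be sketched with a single atomic. The static and function names are hypothetical, and the random base is passed in as a parameter here to keep the example self-contained; the sketch also assumes the base is never 0, since 0 is used as the "uninitialized" marker.

```rust
use std::sync::atomic::{AtomicU16, Ordering};

/// Next port to hand out. 0 means "not yet initialized".
static NEXT_PORT: AtomicU16 = AtomicU16::new(0);

/// Returns the first port of a fresh pair. `random_base` is only used by
/// whichever caller arrives first; later callers' values are ignored.
fn next_port_pair(random_base: u16) -> u16 {
    // One-time init: succeeds only while the counter is still 0.
    let _ = NEXT_PORT.compare_exchange(0, random_base, Ordering::SeqCst, Ordering::SeqCst);
    // fetch_add returns the value before the addition, so the first caller
    // receives the base itself and each later caller receives base + 2, + 4, ...
    NEXT_PORT.fetch_add(2, Ordering::SeqCst)
}
```

Because `fetch_add` is a single atomic read-modify-write, two parallel tests in the same process cannot receive the same pair. Note that `fetch_add` wraps on overflow, so a real helper would want a bound on the number of allocations.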
Restrict the random base port to 10000-30000, which is below the Linux ephemeral port range (32768-60999). This prevents collisions with OS-assigned ports used by electrsd and bitcoind. AI tools were used in preparing this commit.
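A sketch of drawing the base port from the restricted range. The bounds 10000 and 30000 come from the commit message; treating the upper bound as exclusive, the function names, and the use of `RandomState` as a dependency-free entropy source are assumptions (the PR may well use a proper RNG crate instead).

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hasher};

/// Lower bound of the base port range (inclusive).
const MIN_BASE_PORT: u16 = 10000;
/// Upper bound of the base port range (treated as exclusive here).
const MAX_BASE_PORT: u16 = 30000;

/// Maps an arbitrary 64-bit seed into [MIN_BASE_PORT, MAX_BASE_PORT).
fn clamp_to_base_range(seed: u64) -> u16 {
    let span = u64::from(MAX_BASE_PORT - MIN_BASE_PORT);
    // seed % span is below 20000, so the cast to u16 cannot truncate.
    MIN_BASE_PORT + (seed % span) as u16
}

/// Picks a base port using the randomly keyed hasher from the standard
/// library as a stand-in for a real random number generator.
fn random_base_port() -> u16 {
    let seed = RandomState::new().build_hasher().finish();
    clamp_to_base_range(seed)
}
```

Any base chosen this way starts below 32768, the lower end of the Linux ephemeral range cited above.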
Force Esplora chain source and run on macOS to reproduce the fee rate estimation failure caused by electrsd exposing 0.0.0.0 as the Esplora URL. On macOS, connecting to 0.0.0.0 as a destination address results in ConnectionRefused.
AI tools were used in preparing this commit.
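To illustrate why `0.0.0.0` is a problematic destination, the following sketch rewrites an unspecified address to loopback before connecting. This is not the change made in the commit above, which only forces the Esplora chain source to reproduce the failure; the function name and the rewrite itself are assumptions shown for illustration.

```rust
use std::net::{IpAddr, Ipv4Addr, Ipv6Addr, SocketAddr};

/// Returns an address suitable as a connection target. An unspecified
/// address (0.0.0.0 or ::) is meaningful for binding, not for connecting,
/// so it is replaced by the loopback address of the same family.
fn connectable_addr(addr: SocketAddr) -> SocketAddr {
    if !addr.ip().is_unspecified() {
        return addr;
    }
    let loopback = match addr.ip() {
        IpAddr::V4(_) => IpAddr::V4(Ipv4Addr::LOCALHOST),
        IpAddr::V6(_) => IpAddr::V6(Ipv6Addr::LOCALHOST),
    };
    SocketAddr::new(loopback, addr.port())
}
```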