test: Add diagnostic logging to investigate intermittent electrs failures in CI #849

Draft
joostjager wants to merge 13 commits into lightningdevkit:main from joostjager:repro-fee-estimation

@joostjager
Contributor

Force Esplora chain source and run on macOS to reproduce the fee rate estimation failure caused by electrsd exposing 0.0.0.0 as the Esplora URL. On macOS, connecting to 0.0.0.0 as a destination address results in ConnectionRefused.

AI tools were used in preparing this commit.

@ldk-reviews-bot

👋 Hi! I see this is a draft PR.
I'll wait to assign reviewers until you mark it as ready for review.
Just convert it out of draft status when you're ready for review!

@joostjager joostjager force-pushed the repro-fee-estimation branch from 443aa30 to 48a1bb2 on March 26, 2026 09:17
Collaborator

@tnull tnull left a comment

"All checks have passed" 🙃

@joostjager
Contributor Author

Indeed, hypothesis failed.

@joostjager joostjager closed this Mar 26, 2026
@joostjager joostjager reopened this Mar 26, 2026
Enable electrs stderr output in CI and log connection details at
startup. Log errors that were previously silently discarded: the
first block_headers_subscribe failure, generate_to_address failures,
and ping errors across all polling helpers. This will help diagnose
intermittent CI failures where electrs appears to crash or become
unreachable mid-test.

AI tools were used in preparing this commit.
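
To illustrate the "log instead of silently discard" pattern this commit describes, here is a minimal Rust sketch. The helper name `ok_or_log` and the context strings are hypothetical and not taken from the test suite; the only point is that a `Result` which used to be dropped is routed through a helper that prints the error to stderr first.

```rust
use std::fmt::Display;

/// Hypothetical helper: turn a Result into an Option, but print the
/// error to stderr instead of dropping it silently.
fn ok_or_log<T, E: Display>(context: &str, result: Result<T, E>) -> Option<T> {
    match result {
        Ok(value) => Some(value),
        Err(err) => {
            // With `cargo test -- --nocapture` this shows up in the CI log.
            eprintln!("{context} failed: {err}");
            None
        }
    }
}
```

A polling helper could then call something like `ok_or_log("ping", client.ping())` where it previously wrote `let _ = client.ping();` (`client` being whatever handle the helper already holds).
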
@joostjager joostjager force-pushed the repro-fee-estimation branch from 48a1bb2 to 33927c3 on March 26, 2026 14:39
@joostjager joostjager changed the title from "ci: Reproduce Esplora 0.0.0.0 connection failure on macOS" to "test: Add diagnostic logging to investigate intermittent electrs failures in CI" Mar 26, 2026
Run integration tests 10 times in a loop with --nocapture to maximize
the chance of hitting the intermittent electrs crash and to capture the
new diagnostic logging output.

AI tools were used in preparing this commit.
Increase resource pressure to reproduce intermittent electrs failures.
Each of the 3 shards runs 5 iterations with 3 concurrent cargo test
processes, for 45 total test runs with up to 9 simultaneous processes.

AI tools were used in preparing this commit.
Read LDK_NODE_TEST_BASE_PORT env var to offset the listening port range,
avoiding collisions when multiple test processes run simultaneously.
Assign base ports 20000, 21000, 22000 to the three concurrent processes
in the stress-test CI job.

AI tools were used in preparing this commit.
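
A minimal sketch of reading the base port from the environment, assuming a fallback default when the variable is unset or malformed. Only the variable name `LDK_NODE_TEST_BASE_PORT` comes from the commit message; `DEFAULT_BASE_PORT`, `parse_base_port` and `base_port_from_env` are hypothetical names, and the default value is an arbitrary choice for illustration.

```rust
use std::env;

/// Hypothetical fallback used when the env var is absent or invalid.
const DEFAULT_BASE_PORT: u16 = 20_000;

/// Parse an optional raw string into a port, falling back to the default.
/// Kept separate from the env lookup so it can be tested without
/// touching process-wide state.
fn parse_base_port(raw: Option<&str>) -> u16 {
    raw.and_then(|s| s.trim().parse::<u16>().ok())
        .unwrap_or(DEFAULT_BASE_PORT)
}

/// Read LDK_NODE_TEST_BASE_PORT from the environment.
fn base_port_from_env() -> u16 {
    parse_base_port(env::var("LDK_NODE_TEST_BASE_PORT").ok().as_deref())
}
```

Each concurrent test process would then be launched with a different value, for example `LDK_NODE_TEST_BASE_PORT=21000 cargo test`.
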
Ports from previous iterations may still be in TIME_WAIT when the next
iteration starts. Offset the base port by both iteration and process
index to ensure no overlap.

AI tools were used in preparing this commit.
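
The offset arithmetic can be sketched as follows. The constants are assumptions: the start port and the stride of 1000 are inferred from the base ports 20000, 21000, 22000 mentioned earlier, and the process count of 3 from the stress-test description. The function name `base_port_for` is hypothetical.

```rust
/// Assumed first base port (from the 20000/21000/22000 assignment).
const START_PORT: u16 = 20_000;
/// Assumed size of the port block reserved for one process in one iteration.
const STRIDE: u16 = 1_000;
/// Assumed number of concurrent test processes per shard.
const PROCESSES: u16 = 3;

/// Give every (iteration, process) pair its own block of ports so that
/// ports still in TIME_WAIT from an earlier iteration are never reused.
/// Returns None if the result would not fit in a u16.
fn base_port_for(iteration: u16, process_index: u16) -> Option<u16> {
    let block = iteration.checked_mul(PROCESSES)?.checked_add(process_index)?;
    START_PORT.checked_add(block.checked_mul(STRIDE)?)
}
```

With these assumed constants, iteration 0 yields 20000, 21000 and 22000 for the three processes, and iteration 1 continues at 23000.
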
Check whether the kernel OOM killer is responsible for electrs silently
disappearing during tests. Dump relevant dmesg output after any test
failure on Ubuntu runners.

AI tools were used in preparing this commit.
…ation

Revert the deterministic port allocation approach from PR lightningdevkit#847 and
instead use random ports with a retry loop around node.start(). This
avoids collisions with ports allocated by electrsd/corepc_node via
get_available_port(), which use the OS ephemeral port allocator and
can land in any range. On InvalidSocketAddress, new random ports are
selected and the node is rebuilt, up to 5 attempts.

AI tools were used in preparing this commit.
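
The retry loop described above can be sketched generically. This is not the code from the PR: `StartError` is a stand-in for the real error type, and `pick_port` and `build_and_start` are hypothetical closures standing in for random port selection and for rebuilding and starting the node. Only the limit of 5 attempts and the retry-on-`InvalidSocketAddress` rule come from the commit message.

```rust
/// Stand-in for the node's real error type.
#[derive(Debug, PartialEq)]
enum StartError {
    InvalidSocketAddress,
    Other(String),
}

const MAX_ATTEMPTS: u32 = 5;

/// Pick a port, build and start the node, and retry with a fresh port
/// only when the failure is InvalidSocketAddress.
fn start_with_retry<T>(
    mut pick_port: impl FnMut() -> u16,
    mut build_and_start: impl FnMut(u16) -> Result<T, StartError>,
) -> Result<T, StartError> {
    for attempt in 1..=MAX_ATTEMPTS {
        let port = pick_port();
        match build_and_start(port) {
            Ok(node) => return Ok(node),
            Err(StartError::InvalidSocketAddress) => {
                eprintln!("attempt {attempt}: port {port} unusable, retrying");
            }
            // Any other error is not a port problem, so give up at once.
            Err(other) => return Err(other),
        }
    }
    Err(StartError::InvalidSocketAddress)
}
```

Passing the port picker as a closure keeps the loop independent of how ports are chosen, which matters here because the allocation strategy changes several times over the course of this PR.
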
When node.start() fails with InvalidSocketAddress and we retry with new
random ports, also generate a fresh storage directory. Reusing the same
directory causes the second build to fail with ReadFailed/Namespace not
found since the first build already wrote data there.

AI tools were used in preparing this commit.
Run lsof to identify what is using the port when node.start() fails
with a binding error. This helps distinguish between collisions with
electrsd/bitcoind, other test processes, or TIME_WAIT leftovers.

AI tools were used in preparing this commit.
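
One way to invoke `lsof` from a Rust test helper is sketched below. The function names are hypothetical and the exact flags used in the PR are not shown in this conversation; `-n` and `-P` (skip host and port name lookups) and `-i :PORT` (select by port) are standard `lsof` options chosen here as an assumption.

```rust
use std::process::Command;

/// Arguments for listing whatever holds the given port.
fn lsof_args(port: u16) -> Vec<String> {
    vec![
        "-n".to_string(), // do not resolve host names
        "-P".to_string(), // do not translate port numbers to service names
        "-i".to_string(),
        format!(":{port}"),
    ]
}

/// Run lsof and return its output, or a short note if it found nothing
/// or could not be executed (for example when lsof is not installed).
fn describe_port_user(port: u16) -> String {
    match Command::new("lsof").args(lsof_args(port)).output() {
        Ok(out) if !out.stdout.is_empty() => {
            String::from_utf8_lossy(&out.stdout).into_owned()
        }
        Ok(_) => format!("lsof reported nothing on port {port}"),
        Err(err) => format!("could not run lsof: {err}"),
    }
}
```

Note that a socket in TIME_WAIT has no owning process, so an empty `lsof` result on a port that still fails to bind is itself a hint toward the TIME_WAIT explanation.
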
Read node_b's listening addresses from the node after setup instead of
using the pre-retry variable, which may differ if setup_node retried
with new ports.

Simplify the stress test to run 1 process per shard with 10 iterations
instead of 3 concurrent processes. The concurrent processes caused port
collisions in code paths outside setup_node that don't have retry logic,
which is noise unrelated to the electrs crash we're investigating.

AI tools were used in preparing this commit.
Avoid intra-process port collisions between parallel tests by using an
atomic counter that increments by 2 for each allocation. The base port
is randomized once per process to reduce inter-process collisions.
This eliminates the birthday-paradox collisions that occurred when every
call independently picked a random port from the range.

AI tools were used in preparing this commit.
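
The birthday-paradox effect mentioned above can be quantified with a short calculation. If `picks` ports are drawn independently and uniformly from a range of `range` ports, the probability that all are distinct is the product of `(1 - i / range)` for `i` from 0 to `picks - 1`. The function below is an illustrative sketch; the actual number of allocations and the size of the range used by the tests are not stated in this conversation.

```rust
/// Probability that at least two of `picks` independent uniform draws
/// from `range` possible ports coincide.
fn collision_probability(picks: u32, range: u32) -> f64 {
    if picks > range {
        return 1.0; // pigeonhole: a collision is certain
    }
    let mut p_all_distinct = 1.0_f64;
    for i in 0..picks {
        p_all_distinct *= 1.0 - f64::from(i) / f64::from(range);
    }
    1.0 - p_all_distinct
}
```

The classic instance is `collision_probability(23, 365)`, which exceeds one half. A sequential counter, by contrast, cannot hand out the same port twice within one process until it wraps around.
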
fetch_update returns the previous value, so the first caller got port 0
instead of the random base. Use compare_exchange for one-time init
followed by fetch_add, which correctly returns the base port to the
first caller.

AI tools were used in preparing this commit.
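
The difference between the two approaches can be sketched as follows. Both functions are hypothetical reconstructions, not the code from the PR: `allocate_buggy` shows one way a `fetch_update`-based initializer hands port 0 to the first caller, and `allocate` shows the `compare_exchange` plus `fetch_add` pattern the commit message describes. The counter is passed in as a parameter so the example does not depend on a global; the real code presumably uses a `static`.

```rust
use std::sync::atomic::{AtomicU16, Ordering};

/// Reconstruction of the bug: fetch_update returns the value that was
/// stored *before* the closure ran, which is 0 on the first call.
fn allocate_buggy(counter: &AtomicU16, base: u16) -> u16 {
    counter
        .fetch_update(Ordering::SeqCst, Ordering::SeqCst, |cur| {
            Some(if cur == 0 { base.wrapping_add(2) } else { cur.wrapping_add(2) })
        })
        .expect("closure always returns Some")
}

/// Fixed version: initialize once, then hand out ports in steps of 2.
fn allocate(counter: &AtomicU16, base: u16) -> u16 {
    // Only the first caller finds 0 and installs the base; every later
    // compare_exchange fails harmlessly and is ignored.
    let _ = counter.compare_exchange(0, base, Ordering::SeqCst, Ordering::SeqCst);
    // fetch_add returns the previous value, which is now the base (or a
    // later port), never the uninitialized 0.
    counter.fetch_add(2, Ordering::SeqCst)
}
```

This sketch uses 0 as the "uninitialized" sentinel, which works only because the base port is never 0.
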
Restrict the random base port to 10000-30000, which is below the Linux
ephemeral port range (32768-60999). This prevents collisions with
OS-assigned ports used by electrsd and bitcoind.

AI tools were used in preparing this commit.
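
A minimal sketch of picking the base port inside the restricted range, using only the standard library. The function names are hypothetical, and using `RandomState` as an entropy source is a stand-in for whatever random number generator the tests actually use. Whether the upper bound 30000 is inclusive in the PR is not stated; it is treated as exclusive here.

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hasher};

const BASE_MIN: u16 = 10_000;
/// Exclusive upper bound, below the Linux ephemeral range (32768-60999).
const BASE_MAX: u16 = 30_000;

/// Map an arbitrary 64-bit value into [BASE_MIN, BASE_MAX).
fn base_from_entropy(entropy: u64) -> u16 {
    let span = u64::from(BASE_MAX - BASE_MIN);
    // entropy % span is below 20000, so the cast to u16 cannot truncate.
    BASE_MIN + (entropy % span) as u16
}

/// Pick a random base port once per process.
fn random_base_port() -> u16 {
    // RandomState is seeded randomly by the standard library, so hashing
    // nothing still yields an unpredictable value.
    base_from_entropy(RandomState::new().build_hasher().finish())
}
```

Keeping the base below 30000 also leaves headroom for the counter to advance in steps of 2 before it would reach the ephemeral range.
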
3 participants