fix: ignore SIGPIPE to prevent server crash on S3 idle connection timeout #8768

Open
Its-Tanay wants to merge 1 commit into triton-inference-server:main from Its-Tanay:fix/sigpipe-crash

@Its-Tanay

What does the PR do?

This PR globally sets the SIGPIPE disposition to SIG_IGN inside the tritonserver executable.

This prevents an unhandled SIGPIPE from terminating the entire server with exit code 141 (128 + 13, where 13 is the signal number of SIGPIPE on Linux) when the AWS C++ SDK attempts to write to an S3 keep-alive connection that has timed out during a slow model initialization. With the signal ignored, the write() system call instead returns -1 with errno set to EPIPE, allowing the AWS SDK to handle the error and reconnect to S3.
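For reviewers who want to see the two behaviors side by side, here is a minimal standalone sketch. It is not the code in this PR: `BrokenPipeExitStatus` is a hypothetical helper, and a pipe with its read end closed stands in for the timed-out S3 connection. The write happens in a forked child so the caller's own signal disposition is untouched.

```cpp
#include <cerrno>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper (not part of this PR). Forks a child that writes to a
// pipe with no reader and returns the exit status a shell would report for
// that child: 128 + signal number if it was killed by a signal, otherwise
// its normal exit code. Returns -1 if fork() or waitpid() fails.
int BrokenPipeExitStatus(bool ignore_sigpipe)
{
  const pid_t pid = fork();
  if (pid < 0) {
    return -1;
  }

  if (pid == 0) {
    // Child: choose the SIGPIPE disposition under test.
    signal(SIGPIPE, ignore_sigpipe ? SIG_IGN : SIG_DFL);

    int fds[2];
    if (pipe(fds) != 0) {
      _exit(2);
    }
    close(fds[0]);  // no reader left, similar to a peer that dropped the link

    const char byte = 'x';
    const ssize_t n = write(fds[1], &byte, 1);

    // Reached only if SIGPIPE did not terminate the child.
    _exit((n == -1 && errno == EPIPE) ? 0 : 1);
  }

  // Parent: translate the wait status into a shell-style exit status.
  int status = 0;
  if (waitpid(pid, &status, 0) != pid) {
    return -1;
  }
  if (WIFSIGNALED(status)) {
    return 128 + WTERMSIG(status);
  }
  return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

On Linux, `BrokenPipeExitStatus(false)` should return 141, matching the crash in the linked issue, while `BrokenPipeExitStatus(true)` should return 0 because the child observes EPIPE and exits normally.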

Checklist

  • I have read the Contribution guidelines and signed the Contributor License Agreement
  • PR title reflects the change and is of format <commit_type>: <Title>
  • Changes are described in the pull request.
  • Related issues are referenced.
  • Populated github labels field
  • Added test plan and verified test passes.
  • Verified that the PR passes existing CI.
  • I ran pre-commit locally (pre-commit install, pre-commit run --all)
  • Verified copyright is correct on all changed files.
  • Added succinct git squash message before merging.
  • All template sections are filled out.
  • Optional: Additional screenshots for behavior/output changes with before/after.

Commit Type:

  • fix

Related PRs:

N/A

Where should the reviewer start?

src/triton_signal.cc -> RegisterSignalHandler()

Test plan:

Because this fix changes an OS-level signal disposition and the failure depends on a network idle timeout, standard unit tests cannot reliably cover it. I have verified the fix manually. Deterministic reproduction steps, using a dummy Python model that forces the required 300s S3 idle timeout, are documented in the linked GitHub issue.

Caveats:

None. The fix is explicitly placed in the standalone tritonserver executable (triton_signal.cc) rather than triton-core to ensure there are no global signal side-effects for users embedding Triton as a C++ library.
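Since signal dispositions are process-wide, an application embedding Triton as a library may want to confirm what its own process currently does with SIGPIPE. The sketch below is illustrative only: `SigpipeIsIgnored` is a hypothetical helper, not something this PR or Triton provides. It queries the disposition with `sigaction` without changing it.

```cpp
#include <signal.h>

// Hypothetical helper (not part of this PR or of Triton). Returns true if
// the calling process currently ignores SIGPIPE. Passing a null "new action"
// to sigaction() only reads the current disposition and changes nothing.
bool SigpipeIsIgnored()
{
  struct sigaction current = {};
  if (sigaction(SIGPIPE, nullptr, &current) != 0) {
    return false;  // query failed; treat as "not ignored"
  }
  return current.sa_handler == SIG_IGN;
}
```

In a process that has not touched SIGPIPE this is expected to return false, and after a call such as `signal(SIGPIPE, SIG_IGN)` it should return true.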

Background

This was discovered while loading a sequence of models from an S3 model repository. One model took more than 5 minutes to initialize on the GPU, which left the S3 connection idle until it was closed by the network layer.

Related Issues:

Development

Successfully merging this pull request may close these issues.

Bug: tritonserver crashes with Exit Code 141 (SIGPIPE) when loading models from S3 due to idle connection timeout

1 participant