fix: ignore SIGPIPE to prevent server crash on S3 idle connection timeout #8768
Open
Its-Tanay wants to merge 1 commit into triton-inference-server:main from
What does the PR do?
This PR globally sets the `SIGPIPE` disposition to `SIG_IGN` inside the `tritonserver` executable.
This prevents an unhandled `SIGPIPE` from crashing the entire server with Exit Code 141 when the AWS C++ SDK attempts to write to an S3 keep-alive connection that has timed out during a slow model initialization. With the signal ignored, the `write()` system call instead fails with `EPIPE`, allowing the AWS SDK to gracefully catch the error and reconnect to S3.
Checklist
`<commit_type>: <Title>`
`pre-commit install, pre-commit run --all`
Commit Type:
Related PRs:
N/A
Where should the reviewer start?
`src/triton_signal.cc` -> `RegisterSignalHandler()`
Test plan:
Because this is an OS-level signal handler fix for a network idle timeout, standard unit tests cannot reliably cover it. I have verified this manually, and the exact deterministic reproduction steps using a dummy Python model (which forces the required 300s S3 idle timeout) are thoroughly documented in the attached GitHub issue.
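The OS-level behaviour the fix relies on can be checked in isolation, without S3 or Triton. The following is a minimal sketch, not the code in this PR: it uses a local pipe with its read end closed as a stand-in for the timed-out S3 connection, and the helper names `IgnoreSigpipe` and `WriteToClosedPipe` are hypothetical. It assumes a POSIX system.

```cpp
#include <cerrno>
#include <csignal>
#include <unistd.h>

// Hypothetical helper: set the process-wide SIGPIPE disposition to SIG_IGN,
// which is the disposition this PR describes for the tritonserver executable.
// Returns false if the disposition could not be changed.
bool IgnoreSigpipe()
{
  return std::signal(SIGPIPE, SIG_IGN) != SIG_ERR;
}

// Hypothetical helper: write one byte to a pipe whose read end is already
// closed. A closed read end stands in for a peer that dropped the connection.
// Returns the errno from the failed write, or 0 if the write succeeded.
// With the default SIGPIPE disposition the process is terminated by the
// signal inside write() and this function never returns.
int WriteToClosedPipe()
{
  int fds[2];
  if (pipe(fds) != 0) {
    return errno;
  }
  close(fds[0]);  // no reader remains on the pipe

  const char byte = 'x';
  const ssize_t written = write(fds[1], &byte, 1);
  const int err = (written == -1) ? errno : 0;

  close(fds[1]);
  return err;
}
```

Calling `IgnoreSigpipe()` and then `WriteToClosedPipe()` from a small `main()` should yield `EPIPE`. Skipping the `IgnoreSigpipe()` call should instead terminate the process with `SIGPIPE`, which a shell reports as exit status 141 (128 + 13 on Linux).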
Caveats:
None. The fix is explicitly placed in the standalone `tritonserver` executable (`triton_signal.cc`) rather than `triton-core` to ensure there are no global signal side-effects for users embedding Triton as a C++ library.
Background
This was discovered while loading a sequence of models from an S3 model repository, where one model took > 5 minutes to initialize on the GPU, causing the AWS connection to sit idle and be killed by the network layer.
Related Issues: