This repository was archived by the owner on Feb 23, 2026. It is now read-only.
fix: Major refactoring of Polling, Retry and Timeout logic#462
Merged
parthea merged 12 commits intogoogleapis:mainfrom Nov 10, 2022
Merged
fix: Major refactoring of Polling, Retry and Timeout logic#462parthea merged 12 commits intogoogleapis:mainfrom
parthea merged 12 commits intogoogleapis:mainfrom
Conversation
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
If you are a user of python-api-core and experience polling timing out after ~15 minutes (900s) in your python code after this change, please make sure that instead of calling
PollingFuture.result()you call it with additonaltimeoutargument like this:PollingFuture.result(timeout = <desired timeout in seconds>)orPollingFuture.result(timeout = None)(for infinite timeout, but infinite timeouts are strongly discouraged) for your code to stop timing out at 900s.The core libraries cannot and should not have infitine polling as as default behavior. The fact that it had it like that in python for years (unlike other GCP-supported languages) was a bug, as it contradicts original cross-language LRO methods design in GAPIC libraries and in general users are not supposed to be put into an infinite-loop scenario implicitly. If an LRO can actually ran for hours or even days and it is Ok, then users are expected to acknowledge this fact by providing the huge timeouts explicitly.
This is in response to https://freeman.vc/notes/aws-vs-gcp-reliability-is-wildly-different, which triggered an investigation of the whole Polling/Retry/Timeout behavior in Python GAPIC clients and revealed many fundamental flaws in its implementation.
To properly describe the refactoring in this PR we need to stick to a rigorous terminology, as vague definitions of retries, timeouts, polling and related concepts seems to be the main source of the present bugs and overall confusion among both groups: users of the library and creators of the library. Please check the updated (in this PR) documentation of the
google.api_core.retry.Retryclass and thegoogle.api_core.future.polling.Polling.result()method for the proper definitions and context.Note, the overall semantics around Polling, Retry and Timeout remains quite confusing even after refactoring (although it is now more or less rigorously defined), but it was as clean as I could make it while still maintaining backward compatibility of the whole library.
The quick summary of the changes in this PR:
Properly define and fix the application of Deadline and Timeout concepts. Please check the updated documentation for the
google.api_core.retry.Retryclass for the actual definitions. Originally thedeadlinehas been used to represent timeouts conflating the two concepts. As result this PR replacesdeadlinearguments withtimeoutones in as backward-compatible manner as possible (i.e. backward compatible in all practical applications).Properly define RPC Timeout, Retry Timeout and Polling Timeout and how a generic Timeout concept (aka Logical Timeout) is mapped to one of those depending on the context. Please check
google.api_core.retry.Retryclass documentation for details.Properly define and fix the application of Retry and Polling concepts. Please check the updated documentation for
google.api_core.future.polling.PollingFuture.result()for details.Separate
retryandpollingconfigurations for Polling future, as these are two different concepts (although both operating onRetryclass). Originally both retry and polling configurations were controlled by a singleretryparameter, merging configuration regarding how "rpc error responses" and how "operation not completed" responses are supposed to be handled.For the following config properties -
Retry(includingRetry Timeout),Polling(includingPolling Timeout) andRPC Timeout- fix and properly define how each of the above properties gets configured and which config gets precedence in case of a conflict (checkPollingFuture.result()method documentation for details). Each of those properties can be specified as follows: directly provided by the user for each call, specified during gapic generation time from config values ingrpc_service_config.jsonfile (for Retry and RPC Timeout) andgapic.yamlfile (for Polling Timeout), or be provided as a hard-coded basic default values in python-api-core library itself. This alo includes fixing the per-call polling config propagation logic (the polling/retry configs supplied toPollingFuture.result()used to be ignored for actual call).Deprecate
ExponentialTimeout,ConstantTimeoutand related logic as those are outdated concepts and are not consistent with the other GAPIC Languages. Replace it withTimeToDeadlineTimeoutto be consistent with how the rest of the languages do it.Deprecate
google.api_core.operations_v1.configas it is an outdated concept and self-inconsistent (as all gapic clients provide configuraiton in code). The configs are directly provided in code instead.Switch randomized delay calculation from
delaybeing treated as expected value for randomized_delay todelaybeing treated as maximum value forrandomized_delay(i.e. the new expected valud forrandomized_delayisdelay / 2). See theexponential_sleep_generator()method implementation for details. This is needed to make Python implementation of retries and polling exponential backoff consistent with the rest of GAPIC languages. Also fix the uncontrollable growth ofdelayvalue (since it is a subject of exponential growth, thedelayvalue was quickly reaching "infinity" value, and the whole thing was not failing simply due to python being a very forgiving language which forgives multiplying "infinity" by a number (inf * number = inf) binstead of simply overflowing to a (most likely) negative number). Also essentially rollback the 52f12af change, since that is inconsistent with the other languages and damages uniform distibution of retry delays artificially shifting their concentration towards the end of timeout.Fix url construction in
OperationsRestTransport. Without this fix the polling logic for REST transport was completely broken (is not affecting Compute client, as that one has custom LRO).Last but not least: change the default values for Polling logic to be the following:
initial=1.0(same as before),maximum=20.0(was60),multiplier=1.5(was2.0),timeout=900(was120, but due to timeout resolution logic was actually None (i.e. infinity)). This, in conjunction with changed calculation of randomized delay (i.e. its expected value now beingdelay / 2) overall makes polling logic much less aggressive in terms of increasing delays between each polling iteration, making LRO return much earlier for users on average, but still keeping a healthy balance between strain put on both client and server by polling and responsiveness of LROs for user.*The design doc summarising all the changes and reasons for them is in progress.
In addition to the timeout/retry fixes, this PR has some other non-related technical fixes:
Python 3.10as the default python version used in CI.