feat: add retry mechanism with exponential backoff for transient API errors [SUPPORT-15536] #17
devin-ai-integration[bot] wants to merge 2 commits into master from …
…errors
- Add retry logic (3 attempts, exponential backoff) for HTTP 403/429/5xx
- Include restaurant ID in all error messages for debugging
- Add per-restaurant logging in the processing loop
- Separate rate limiting from retry logic in client

Co-Authored-By: Zora Jelínková <zora.jelinkova@keboola.com>
…nges

Co-Authored-By: Zora Jelínková <zora.jelinkova@keboola.com>
MAX_RETRIES = 3
RETRY_BASE_DELAY = 5
RETRYABLE_STATUS_CODES = {403, 429, 500, 502, 503, 504}
🟡 HTTP 403 (Forbidden) incorrectly included in retryable status codes
HTTP 403 (Forbidden) is included in RETRYABLE_STATUS_CODES but this status code indicates an authorization/permissions failure, not a transient error. The Toast API documentation (referenced at src/client.py:28) states rate-limited calls return 429, not 403. This causes unnecessary retries with exponential backoff (5s + 10s = 15s delay) for permission errors that will never succeed by retrying. This is especially impactful during the get_token authentication call (src/client.py:65), which is invoked from __init__ (src/client.py:25): if credentials are wrong, the user will wait 15 seconds for retries before seeing the UserException, rather than failing immediately.
- RETRYABLE_STATUS_CODES = {403, 429, 500, 502, 503, 504}
+ RETRYABLE_STATUS_CODES = {429, 500, 502, 503, 504}
403 is intentionally included here. The Datadog investigation showed that the Toast API intermittently returns 403 "You are not permitted to access this resource" for valid credentials — the same restaurants succeed on retry or on subsequent runs with identical config.
This was confirmed in production on job 1296848007 which ran with this branch: it hit 403 on both restaurants/v1/restaurants/{id} and orders/v2/ordersBulk, retried after 5s, and the retry succeeded — the job completed successfully with ~190 order batches.
The 15s worst-case delay for permanent auth failures (e.g. wrong credentials) is acceptable since that's a one-time configuration error, not a recurring production issue.
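The 15s figure in this thread and the ~35s figure in the PR description appear to correspond to two different readings of MAX_RETRIES. The sketch below works through both, assuming the delay doubles from RETRY_BASE_DELAY on each retry. The helper name `backoff_delays` is hypothetical and not taken from the branch; only the two constants come from the diff.

```python
MAX_RETRIES = 3        # value shown in the diff
RETRY_BASE_DELAY = 5   # seconds, value shown in the diff


def backoff_delays(sleeps: int, base: int = RETRY_BASE_DELAY) -> list[int]:
    """Hypothetical helper: delay before each retry, doubling from `base`."""
    return [base * 2 ** i for i in range(sleeps)]


# Reading 1: MAX_RETRIES counts total attempts, so only two sleeps happen.
as_total_attempts = backoff_delays(MAX_RETRIES - 1)

# Reading 2: MAX_RETRIES counts retries after the first attempt, so three sleeps.
as_retries = backoff_delays(MAX_RETRIES)

print(as_total_attempts, sum(as_total_attempts))  # [5, 10] 15
print(as_retries, sum(as_retries))                # [5, 10, 20] 35
```

Which reading applies depends on how the loop in client.py is written, which is not visible in this conversation.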
Summary
Adds retry logic with exponential backoff to the Toast API client to handle intermittent 403 Forbidden responses that have been causing irregular job failures (observed in project 10132, config 1178156880). Datadog investigation confirmed the same credentials and restaurants succeed on some runs and fail on others within hours, pointing to transient Toast API-side permission enforcement.

Changes:
- … (client.py): The original request() is renamed to _rate_limited_request() (retains rate-limit decorators). A new request() method wraps it with up to 3 retries and exponential backoff (5s → 10s → 20s) for status codes {403, 429, 500, 502, 503, 504}. Non-retryable errors and the final attempt are returned as-is for existing error handling.
- … (component.py): Logs restaurant count and each restaurant GUID at INFO level so that future failures are traceable to a specific restaurant.
- … (client.py): All UserException messages now include the restaurant_id (or restaurant_group_id) that triggered the error.
- … (list_restaurants previously said "List all orders").

Review & Testing Checklist for Human
- … RETRYABLE_STATUS_CODES. Worst case: a permanent 403 now takes ~35s longer to fail (3 attempts with backoff).
- … get_token() too: The auth call goes through request() → will also retry on 403/5xx. Confirm this is acceptable (it likely is, but adds delay on permanent auth failures).
- … runtime.tag: Deploy the branch build to a test configuration and confirm (a) successful runs still complete normally, (b) retry warnings appear in logs when transient errors occur.

Notes
- … "You are not permitted to access this resource" from different Toast endpoints (restaurants/v1/restaurants/{id}, orders/v2/ordersBulk). Successful runs process the same set of restaurants. The component currently has zero retry logic: any transient HTTP error is immediately fatal.
- … logging.warning() which will be shipped to Datadog via the GELF handler (unlike logging.debug() which is filtered).
- … except Exception catch for JSON parsing in the retry loop is broad but safe: it only affects log formatting, not control flow.

Devin Session | Requested by Zora Jelínková
Release Notes
Justification, description
Add retry mechanism (3 attempts, exponential backoff) for transient Toast API errors (403/429/5xx) to resolve irregular job failures.
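For readers without the diff at hand, here is a minimal sketch of what such a retry wrapper might look like. It is not the code from this branch. The names `request_with_retry`, `send`, and `sleep` are hypothetical, the constants are copied from the diff, and the sketch assumes MAX_RETRIES counts total attempts. The injected `sleep` lets the demo run without real delays.

```python
import logging
import time
from types import SimpleNamespace
from typing import Callable, Protocol

MAX_RETRIES = 3
RETRY_BASE_DELAY = 5
RETRYABLE_STATUS_CODES = {403, 429, 500, 502, 503, 504}


class ResponseLike(Protocol):
    """Anything exposing an HTTP status code (e.g. requests.Response)."""
    status_code: int


def request_with_retry(
    send: Callable[[], ResponseLike],
    sleep: Callable[[float], None] = time.sleep,
) -> ResponseLike:
    """Call send() up to MAX_RETRIES times, backing off on retryable statuses.

    Non-retryable responses and the final attempt are returned as-is so the
    caller's existing error handling still applies.
    """
    response = send()
    for attempt in range(1, MAX_RETRIES):
        if response.status_code not in RETRYABLE_STATUS_CODES:
            return response
        delay = RETRY_BASE_DELAY * 2 ** (attempt - 1)
        logging.warning(
            "HTTP %s on attempt %d/%d. Retrying in %ds...",
            response.status_code, attempt, MAX_RETRIES, delay,
        )
        sleep(delay)
        response = send()
    return response


# Demo with a fake transport: two transient 403s, then success.
statuses = iter([403, 403, 200])
slept: list[float] = []
result = request_with_retry(
    lambda: SimpleNamespace(status_code=next(statuses)),
    sleep=slept.append,
)
print(result.status_code, slept)  # 200 [5, 10]
```

In the actual client the wrapped callable would presumably be the rate-limited request method, so rate limiting still applies to every attempt, including retries.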
Plans for Customer Communication
N/A — transparent improvement, no user-facing config changes.
Impact Analysis
Low risk. Retry only activates on error responses that would have already caused job failure. Successful API calls are unaffected. Worst case for permanent errors: ~35s additional delay before failure.
Deployment Plan
Standard tag release after merge.
Rollback Plan
Revert to previous tag if retry behavior causes issues (e.g., excessive delays, unexpected side effects).
Post-Release Support Plan
Monitor Datadog logs for retry warnings ("Retrying in {delay}s...") to validate the fix resolves intermittent 403 issues. If 403 errors persist after 3 retries, investigate Toast API authentication scoping or rate limits further.