Skip to content

feat: add retry mechanism with exponential backoff for transient API errors [SUPPORT-15536]#17

Draft
devin-ai-integration[bot] wants to merge 2 commits intomasterfrom
SUPPORT-15536
Draft

feat: add retry mechanism with exponential backoff for transient API errors [SUPPORT-15536]#17
devin-ai-integration[bot] wants to merge 2 commits intomasterfrom
SUPPORT-15536

Conversation

@devin-ai-integration
Copy link
Copy Markdown

@devin-ai-integration devin-ai-integration Bot commented Mar 6, 2026

Summary

Adds retry logic with exponential backoff to the Toast API client to handle intermittent 403 Forbidden responses that have been causing irregular job failures (observed in project 10132, config 1178156880). Datadog investigation confirmed the same credentials and restaurants succeed on some runs and fail on others within hours, pointing to transient Toast API-side permission enforcement.

Changes:

  • Retry wrapper (client.py): The original request() is renamed to _rate_limited_request() (retains rate-limit decorators). A new request() method wraps it with up to 3 retries and exponential backoff (5s → 10s → 20s) for status codes {403, 429, 500, 502, 503, 504}. Non-retryable errors and the final attempt are returned as-is for existing error handling.
  • Per-restaurant logging (component.py): Logs restaurant count and each restaurant GUID at INFO level so that future failures are traceable to a specific restaurant.
  • Restaurant ID in error messages (client.py): All UserException messages now include the restaurant_id (or restaurant_group_id) that triggered the error.
  • Docstring corrections: Fixed misleading docstrings (e.g., list_restaurants previously said "List all orders").

Note: The logging and error-message changes were requested to be removed in review. Please verify the final diff only contains retry logic. If the revert commit is not reflected, a follow-up commit may be needed.

Review & Testing Checklist for Human

  • Verify diff scope matches intent: The request was to keep only the retry logic and remove per-restaurant logging and error message updates. Confirm the final diff on the branch reflects this — if not, the revert commit may need to be re-applied.
  • Verify 403 should be retryable: This is the core assumption. Toast API intermittently returns 403 for valid credentials. If 403 should instead always be fatal, remove it from RETRYABLE_STATUS_CODES. Worst case: a permanent 403 now takes ~35s longer to fail (3 attempts with backoff).
  • Retry applies to get_token() too: The auth call goes through request() → will also retry on 403/5xx. Confirm this is acceptable (it likely is, but adds delay on permanent auth failures).
  • No unit tests added: The retry logic has no test coverage. Consider whether tests mocking HTTP responses should be added before merge.
  • Test with runtime.tag: Deploy the branch build to a test configuration and confirm (a) successful runs still complete normally, (b) retry warnings appear in logs when transient errors occur.

Notes

  • Datadog findings: 15 error spans in the last 60 days for this config, all "You are not permitted to access this resource" from different Toast endpoints (restaurants/v1/restaurants/{id}, orders/v2/ordersBulk). Successful runs process the same set of restaurants. The component currently has zero retry logic — any transient HTTP error is immediately fatal.
  • The retry warning logs use logging.warning() which will be shipped to Datadog via the GELF handler (unlike logging.debug() which is filtered).
  • The except Exception catch for JSON parsing in the retry loop is broad but safe — it only affects log formatting, not control flow.

Devin Session | Requested by Zora Jelínková

Release Notes

Justification, description

Add retry mechanism (3 attempts, exponential backoff) for transient Toast API errors (403/429/5xx) to resolve irregular job failures.

Plans for Customer Communication

N/A — transparent improvement, no user-facing config changes.

Impact Analysis

Low risk. Retry only activates on error responses that would have already caused job failure. Successful API calls are unaffected. Worst case for permanent errors: ~35s additional delay before failure.

Deployment Plan

Standard tag release after merge.

Rollback Plan

Revert to previous tag if retry behavior causes issues (e.g., excessive delays, unexpected side effects).

Post-Release Support Plan

Monitor Datadog logs for retry warnings ("Retrying in {delay}s...") to validate the fix resolves intermittent 403 issues. If 403 errors persist after 3 retries, investigate Toast API authentication scoping or rate limits further.


Open with Devin

…errors

- Add retry logic (3 attempts, exponential backoff) for HTTP 403/429/5xx
- Include restaurant ID in all error messages for debugging
- Add per-restaurant logging in the processing loop
- Separate rate limiting from retry logic in client

Co-Authored-By: Zora Jelínková <zora.jelinkova@keboola.com>
@devin-ai-integration
Copy link
Copy Markdown
Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

…nges

Co-Authored-By: Zora Jelínková <zora.jelinkova@keboola.com>
Copy link
Copy Markdown
Author

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 1 potential issue.

View 4 additional findings in Devin Review.

Open in Devin Review

Comment thread src/client.py

MAX_RETRIES = 3
RETRY_BASE_DELAY = 5
RETRYABLE_STATUS_CODES = {403, 429, 500, 502, 503, 504}
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 HTTP 403 (Forbidden) incorrectly included in retryable status codes

HTTP 403 (Forbidden) is included in RETRYABLE_STATUS_CODES but this status code indicates an authorization/permissions failure, not a transient error. The Toast API documentation (referenced at src/client.py:28) states rate-limited calls return 429, not 403. This causes unnecessary retries with exponential backoff (5s + 10s = 15s delay) for permission errors that will never succeed by retrying. This is especially impactful during the get_token authentication call (src/client.py:65), which is invoked from __init__ (src/client.py:25): if credentials are wrong, the user will wait 15 seconds for retries before seeing the UserException, rather than failing immediately.

Suggested change
RETRYABLE_STATUS_CODES = {403, 429, 500, 502, 503, 504}
RETRYABLE_STATUS_CODES = {429, 500, 502, 503, 504}
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

403 is intentionally included here. The Datadog investigation showed that the Toast API intermittently returns 403 "You are not permitted to access this resource" for valid credentials — the same restaurants succeed on retry or on subsequent runs with identical config.

This was confirmed in production on job 1296848007 which ran with this branch: it hit 403 on both restaurants/v1/restaurants/{id} and orders/v2/ordersBulk, retried after 5s, and the retry succeeded — the job completed successfully with ~190 order batches.

The 15s worst-case delay for permanent auth failures (e.g. wrong credentials) is acceptable since that's a one-time configuration error, not a recurring production issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

0 participants