Skip to content

[slop]fix(workflow-engine): only commit step state after success#5010

Draft
abcxff wants to merge 1 commit into
graphite-base/5010from
05-09-fix_workflow-engine_only_commit_step_state_after_success
Draft

[slop]fix(workflow-engine): only commit step state after success#5010
abcxff wants to merge 1 commit into
graphite-base/5010from
05-09-fix_workflow-engine_only_commit_step_state_after_success

Conversation

@abcxff
Copy link
Copy Markdown
Contributor

@abcxff abcxff commented May 9, 2026

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@railway-app
Copy link
Copy Markdown

railway-app Bot commented May 9, 2026

🚅 Deployed to the rivet-pr-5010 environment in rivet-frontend

Service Status Web Updated (UTC)
website 😴 Sleeping (View Logs) Web May 9, 2026 at 3:51 pm
frontend-cloud ❌ Build Failed (View Logs) Web May 9, 2026 at 3:40 pm
frontend-inspector ❌ Build Failed (View Logs) Web May 9, 2026 at 3:40 pm
ladle ❌ Build Failed (View Logs) Web May 9, 2026 at 3:40 pm
mcp-hub ✅ Success (View Logs) Web May 9, 2026 at 3:40 pm
kitchen-sink ❌ Build Failed (View Logs) Web May 9, 2026 at 3:39 pm

Copy link
Copy Markdown
Contributor Author

abcxff commented May 9, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

@abcxff abcxff changed the title fix(workflow-engine): only commit step state after success [slop]fix(workflow-engine): only commit step state after success May 9, 2026
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 9, 2026

Code Review: fix(workflow-engine): only commit step state after success

Overview

This PR removes error data from step history entries on failure and eliminates intermediate flushStorage() calls in error paths. Error information now lives exclusively in EntryMetadata rather than StepEntry. The correctness relies on the outer workflow runner (setRetryState / setFailedState / "unrecoverable error" path in index.ts) calling flush() after catching any step-level error.


Correctness Analysis

The change is logically sound. All error-throw paths in tryStep are caught by the outer runner in index.ts, which unconditionally calls flush():

  • StepFailedError (willRetry) → setRetryStateflush()
  • StepExhaustedError (exhausted) → unrecoverable path → flush()
  • CriticalError / RollbackError / StepTimeoutError → unrecoverable path → flush()

Metadata is already marked dirty at the start of each attempt (metadata.dirty = true at line 902), so it will be picked up by any subsequent outer flush. The removal of the per-error await this.flushStorage() calls is safe.


Issues

Comment at line 828–831 is now misleading (Medium)

// Prefer step history error, but fall back to metadata since
// driver implementations may persist metadata without the history
// entry error (e.g. partial writes/crashes between attempts).
const lastError = stepData.error ?? metadata.error;

With this PR, stepData.error will never be set for new workflows — the only way it's non-null is for data persisted before this change. The comment describes the old design and will mislead future readers. The priority should be inverted in the comment: metadata.error is now canonical; stepData.error is a backward-compat fallback for pre-existing entries.

Suggested update:

// Prefer metadata error (canonical since errors were removed from step entries).
// Fall back to step-entry error only for entries persisted before this change.
const lastError = metadata.error ?? stepData.error;

Note this also flips the operands to match the documented priority.

StepEntry.error is now a ghost field (Low)

types.ts still declares error?: string on StepEntry. It will never be written by new code, but it must remain for deserialization of old KV data. A brief comment would help future readers understand why this field exists without being written:

export interface StepEntry {
    output?: unknown;
    /** @deprecated Error is persisted exclusively on EntryMetadata. Retained for backward compatibility with pre-existing KV entries. */
    error?: string;
}

Test doesn't verify the metadata error path (Low)

The new test confirms that entry.kind.data.error is undefined, which is good. But it doesn't verify that metadata.error is set correctly — that's the canonical error location now. Consider adding a check that persisted metadata entries with a failed/exhausted status do carry the error string.


Minor / Style

  • The PR title has a [slop] prefix which is non-standard for this repo. Consider dropping it before merge.
  • The PR description checklist is entirely unchecked and the rationale section is empty. Even a one-line "why" (e.g. "step entries should only reflect successful outputs; error data belongs in metadata and was causing unnecessary KV writes on failure") would help reviewers and future blame readers.

Summary

The core logic is correct and the test is a welcome addition. The two actionable items before merge are: (1) fix the now-misleading comment at line 828–831 and flip the operand order to match the new intended priority, and (2) add a deprecation note on StepEntry.error so the ghost field doesn't confuse future contributors.

@abcxff abcxff changed the base branch from 05-07-fix_pegboard_validate_drain_grace_period_request_lifespan to graphite-base/5010 May 11, 2026 03:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant