
code mode ideas #3002

Closed
tim-inkeep wants to merge 1 commit into main from tim/code-mode-spec


Conversation

@tim-inkeep
Contributor

No description provided.

@changeset-bot

changeset-bot bot commented Apr 3, 2026

⚠️ No Changeset found

Latest commit: 25c291e

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@vercel

vercel bot commented Apr 3, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| agents-api | Error | | Apr 3, 2026 5:11pm |
| agents-docs | Ready | Preview, Comment | Apr 3, 2026 5:11pm |
| agents-manage-ui | Ready | Preview, Comment | Apr 3, 2026 5:11pm |


@pullfrog
Contributor

pullfrog bot commented Apr 3, 2026

TL;DR — Adds a design spec for "Code Mode," a proposed runtime capability that lets agents write and execute code on the fly instead of requiring bespoke tools for every data operation. The spec covers an execute_code platform tool, a bundled standard library (grep, jq, schema, text, csv), agent-scoped shared function registries, and cross-session persistence — phased across three milestones.

Key changes

  • Add Code Mode design spec — New specs/2026-04-03-code-mode/SPEC.md capturing the problem statement, vision (code execution + stdlib + shared library + persistence), relationship to existing sandbox/artifact/platform-tool systems, open questions, and a three-phase rollout plan.

Summary | 1 file | 1 commit | base: main ← tim/code-mode-spec

Before: Agents that need to grep artifacts, extract JSON fields, or transform data require purpose-built tools — each one scoped, built, and maintained individually.
After: A design spec proposes giving agents a general-purpose execute_code sandbox tool backed by a curated stdlib, letting them write ad-hoc logic at runtime and share reusable functions across sub-agents and sessions.

The spec identifies four layered capabilities — sandbox code execution, a platform stdlib (grep, jq, schema, text, csv), an agent-scoped shared function registry for sub-agent collaboration, and cross-session persistence so agents accumulate battle-tested utilities over time. It explicitly builds on existing infrastructure (SandboxExecutorFactory, artifacts, platform tools) and calls out eight open questions around security boundaries, data access, governance, and token economics.

What are the three proposed phases?

Phase 1 ships execute_code + stdlib with pure-computation sandboxing. Phase 2 adds agent-scoped shared function registration and discovery across sub-agents. Phase 3 persists shared functions to the manage database with versioning, a curation model, and Agent Builder UI support.

specs/2026-04-03-code-mode/SPEC.md

Pullfrog | View workflow run | Triggered by Pullfrog | Using Claude Opus


@claude claude bot left a comment


PR Review Summary

(0) Total Issues | Risk: Low

This PR introduces an early-stage vision spec for "Code Mode" — a capability that would let agents write and execute code at runtime with access to a standard library. The document is well-structured as a brainstorming/ideation artifact capturing the problem space, core ideas, and open questions.

Document Assessment

Strengths:

  • Clear problem statement with concrete examples
  • Reasonable phasing strategy (code execution → shared library → persistence)
  • Good identification of open questions and security considerations
  • Appropriate scope limitation ("pure computation first")
  • Nice architectural diagram showing the execution flow

Observations (not blockers):

💭 1) structure Spec maturity level

Issue: This document is structured as an informal vision/ideation document rather than the more formal spec format used elsewhere in this repo (e.g., specs/2026-03-06-altcha-pow/SPEC.md with numbered sections, decision logs, requirements tables, acceptance criteria).

Why: Neither approach is wrong — early-stage ideation docs serve a different purpose than implementation-ready specs. However, consider whether this should mature into the formal format before implementation begins.

Refs: specs/2026-03-06-altcha-pow/SPEC.md — example of the formal spec structure

💭 2) lines:54-55 Scope clarification for "agent-scoped"

Issue: The spec mentions the shared library is "scoped to the agent (project + agentId)" but also that "sub-agents within the same agent graph can read/write."

Why: These two statements need reconciliation. Is the scope project + agentId (single agent) or project + root agent graph (parent + all sub-agents)? The latter seems intended but the phrasing is ambiguous.

Fix: Clarify whether the scope is per-agent or per-agent-graph.

💭 3) open-questions Security model depth

Issue: Question #4 mentions sandboxing guarantees but doesn't explore the LLM-prompt-injection → code-execution vector.

Why: If an attacker can inject malicious instructions into an agent's context (via user input, tool results, or external data), those instructions could cause the agent to write malicious code that then executes. This is distinct from the sandboxing question and worth calling out explicitly.

Fix: Consider adding an open question about protecting against prompt-injection-to-code-execution attacks.

💭 Consider

💭 1) Missing status metadata

Other specs include status/owner/last-updated headers. This helps track document maturity and ownership.

💭 2) Relationship to existing SandboxExecutorFactory

The spec mentions building on existing sandbox infrastructure but doesn't detail what changes would be needed. Consider whether the current sandbox supports the stdlib injection and session-scoped shared library patterns.

💭 3) Token accounting for stdlib

Question #5 asks about token economics, but there's also a related question: does stdlib usage need to be metered for billing purposes? If agents use grep() internally, does that count differently than raw code?


✅ APPROVE

Summary: This is a solid early-stage vision document for an interesting capability. The core ideas are well-articulated, the phasing is sensible, and the open questions capture the right concerns. No blocking issues — the "Consider" items are suggestions for strengthening the spec before implementation planning begins. Nice work capturing the ideation from the Slack discussion!

Discarded (0)

No findings discarded.

Reviewers (1)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
| --- | --- | --- | --- | --- | --- | --- | --- |
| orchestrator | 5 | 0 | 5 | 0 | 0 | 0 | 0 |
| Total | 5 | 0 | 5 | 0 | 0 | 0 | 0 |

Note: Single-file spec document — no code reviewers dispatched. Review performed directly by orchestrator.


@pullfrog pullfrog bot left a comment


Solid spec that correctly identifies the right abstraction — general code execution beats an ever-growing bespoke tool surface. A few technical accuracy issues and design gaps worth addressing before this becomes the reference document for implementation.

Pullfrog | View workflow run | Using Claude Opus


#### 1. Code Execution as a Platform Tool

Agents get an `execute_code` tool that accepts code (TypeScript/JS) and runs it in the existing sandbox infrastructure (native or Vercel). The agent writes whatever logic it needs on the fly — no need for us to anticipate every operation.


The spec says "existing sandbox infrastructure (native or Vercel)" — worth noting that the current SandboxExecutorFactory.executeFunctionTool() expects a functionId and args, and the code being executed is pulled from the functions table's execute_code column. An execute_code platform tool would need a different entry point that accepts arbitrary agent-authored code as input rather than looking up a pre-registered function. The sandbox infra (pooling, concurrency semaphore, dependency caching) is reusable, but the execution wrapper in sandbox-utils.ts assumes a fixed execute(args) contract that would need to be generalized.
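
A rough sketch of the generalized entry point this implies — all names here (ExecuteCodeRequest, runArbitraryCode, runInSandbox) are hypothetical illustrations, not the repo's actual API:

```ts
// Hypothetical sketch — not the repo's actual API.
interface ExecuteCodeRequest {
  code: string;       // agent-authored TS/JS source, evaluating to an async function
  args?: unknown;     // JSON-serializable input data
  timeoutMs?: number; // execution budget enforced by the sandbox
}

interface ExecuteCodeResult {
  ok: boolean;
  output?: unknown;
  error?: string;
}

async function runArbitraryCode(
  req: ExecuteCodeRequest,
  // Injected: the reusable sandbox plumbing (pooling, semaphore, dep caching).
  runInSandbox: (wrappedSource: string, timeoutMs: number) => Promise<string>,
): Promise<ExecuteCodeResult> {
  // Wrap the source at call time instead of reading it from the functions
  // table — this is the generalization of the fixed execute(args) contract.
  const wrapped = [
    `const __fn = ${req.code};`,
    `const __out = await __fn(${JSON.stringify(req.args ?? null)});`,
    `console.log(JSON.stringify(__out ?? null));`,
  ].join("\n");
  try {
    const stdout = await runInSandbox(wrapped, req.timeoutMs ?? 30_000);
    return { ok: true, output: JSON.parse(stdout) };
  } catch (err) {
    return { ok: false, error: err instanceof Error ? err.message : String(err) };
  }
}
```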

| `text` | Split, chunk, truncate, word count, diff | Common text processing for conversation/artifact data |
| `csv` | Parse, filter, transform tabular data | Common tool result format |

The stdlib ships with the platform — agents don't need to install or import anything. It's just available in the execution environment.


"agents don't need to install or import anything" — this implies the stdlib is injected as globals or as a pre-loaded module in the sandbox. The current NativeSandboxExecutor writes a fresh package.json per sandbox directory and runs npm install. The stdlib would either need to be:

  1. Published as an npm package included in the generated package.json (simplest, plays well with existing infra)
  2. Pre-bundled as files written into the sandbox directory before execution
  3. Injected as globals in the execution wrapper

Option 1 is worth calling out since it determines whether the stdlib can be versioned/updated independently of the platform release.
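
A minimal sketch of option 1, assuming the stdlib ships under an invented package name:

```ts
// Sketch of option 1. "@inkeep/agent-stdlib" is an invented package name.
import { writeFile } from "node:fs/promises";
import { join } from "node:path";

const STDLIB_VERSION = "0.1.0"; // pinned per platform release, upgradable independently

async function writeSandboxPackageJson(
  sandboxDir: string,
  userDeps: Record<string, string>,
): Promise<void> {
  const pkg = {
    name: "sandbox-execution",
    private: true,
    type: "module",
    dependencies: {
      ...userDeps,
      // The stdlib rides along with whatever the agent's code needs;
      // npm install then makes grep/jq/schema/text/csv importable.
      "@inkeep/agent-stdlib": STDLIB_VERSION,
    },
  };
  await writeFile(join(sandboxDir, "package.json"), JSON.stringify(pkg, null, 2));
}
```

Pinning the version in the generated package.json is what would let the stdlib be updated independently of the platform while keeping each sandbox run reproducible.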


This means agents can build up specialized utilities during a conversation. A sub-agent that figured out how to parse a specific API response format can share that parser with its siblings.

**Scope:** The shared library is scoped to the agent (project + agentId). Sub-agents within the same agent graph can read/write. Different agents in different projects cannot see each other's libraries.


The scoping here is project + agentId, but the sandbox executors are currently scoped by sessionId (for Vercel) or by a dependency hash (for native). A shared library that is readable by sibling sub-agents within the same conversation requires a new scoping dimension — the agent graph context — that doesn't exist in the sandbox layer today. Phase 2 should call out that this likely needs an in-memory or database-backed registry separate from the sandbox itself, since sub-agents may run in different sandbox instances.
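
A minimal in-memory sketch of such a registry, assuming a (projectId, graphId) key; every name here is illustrative:

```ts
// Illustrative in-memory registry; a Phase 3 variant would persist instead.
interface SharedFunction {
  name: string;
  source: string;       // agent-authored function source
  registeredBy: string; // sub-agent id, for provenance
}

class SharedFunctionRegistry {
  // Keyed by `${projectId}:${graphId}` — the agent-graph scoping dimension
  // the sandbox layer lacks; sub-agents in different sandboxes share one entry.
  private byGraph = new Map<string, Map<string, SharedFunction>>();

  register(projectId: string, graphId: string, fn: SharedFunction): void {
    const key = `${projectId}:${graphId}`;
    const fns = this.byGraph.get(key) ?? new Map<string, SharedFunction>();
    fns.set(fn.name, fn);
    this.byGraph.set(key, fns);
  }

  list(projectId: string, graphId: string): SharedFunction[] {
    return [...(this.byGraph.get(`${projectId}:${graphId}`)?.values() ?? [])];
  }
}
```

A database-backed variant could expose the same interface, which is what would let Phase 3 swap in persistence without changing callers.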

Session N: Library grows with battle-tested utilities
```

**Storage:** Persisted functions are stored in the manage database, versioned, and scoped to the agent. They can be viewed/managed in the Agent Builder UI.


"stored in the manage database" — the manage database is Doltgres (versioned). Storing generated code there means it participates in branch/merge semantics, which could be an advantage (rollback) or a footgun (merge conflicts on auto-generated code). Worth explicitly noting whether that's intentional or whether runtime-schema Postgres is more appropriate for generated artifacts.


## Relationship to Existing Systems

- **Sandbox Executors:** Code Mode builds on the existing `SandboxExecutorFactory` (native + Vercel). The sandbox infrastructure already handles execution, session scoping, and cleanup.


Minor accuracy: SandboxExecutorFactory does not handle "session scoping" itself — it delegates session-scoped sandbox pooling to getForSession(sessionId), which ensures Vercel sandbox instances are reused within a session. The cleanup is triggered externally by AgentSession.cleanupSession(). The spec's phrasing ("already handles execution, session scoping, and cleanup") slightly overstates the factory's responsibility.

- **Sandbox Executors:** Code Mode builds on the existing `SandboxExecutorFactory` (native + Vercel). The sandbox infrastructure already handles execution, session scoping, and cleanup.
- **Function Tools:** Existing function tools are pre-configured in the manage database. Code Mode is dynamic — the agent writes code at runtime. Persisted shared functions could eventually become function tools.
- **Artifacts:** Oversized tool results are already stored as artifacts. Code Mode gives agents a way to actually *work with* those artifacts (grep, query, transform) instead of just retrieving them whole.
- **Platform Tools:** `execute_code` would be a platform tool (like the search tools in the conversation history spec), auto-loaded for agents that have code mode enabled.


The spec references "platform tools" as an existing pattern ("like the search tools in the conversation history spec"). In practice, these are implemented as default-tools in agents-api/src/domains/run/agents/tools/default-tools.ts and conditionally loaded based on agent configuration checks (e.g., agentHasArtifactComponents()). There's no generic feature-flag or per-agent opt-in for platform tools today. Phase 1 needs to either:

  1. Add an enableCodeMode flag to the agent config (manage schema) and check it in getDefaultTools()
  2. Or build a more general platform-tool registry

Given Phase 1 scope, option 1 is likely sufficient — but should be explicit.
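
A sketch of option 1, modeled loosely on the agentHasArtifactComponents() pattern mentioned above; the flag name and function shapes are assumptions, not copied from the repo:

```ts
// Sketch of option 1; names are assumptions.
interface AgentConfig {
  enableCodeMode?: boolean; // proposed new flag in the manage schema
}

function agentHasCodeMode(config: AgentConfig): boolean {
  return config.enableCodeMode === true;
}

function getDefaultTools(config: AgentConfig, baseTools: string[]): string[] {
  // Conditionally append execute_code, mirroring how other default tools
  // are gated on per-agent configuration checks.
  return agentHasCodeMode(config) ? [...baseTools, "execute_code"] : baseTools;
}
```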


## Open Questions

1. **What can code access?** Can code in the sandbox make network requests? Access the database? Or is it purely computational (data in → data out)? Starting with pure computation is safest.


Open questions 1 and 4 overlap significantly (what can code access vs. security boundaries). Consider merging them and instead adding an open question about result size limits — the current NativeSandboxExecutor caps output at 1MB (FUNCTION_TOOL_SANDBOX_MAX_OUTPUT_SIZE_BYTES). For execute_code, the agent could produce large outputs that then need to flow back into context. Should oversized code results use the existing artifact storage path (like oversized tool results do today via detectOversizedArtifact)?
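
One way that routing could look, as a sketch — saveArtifact is a stand-in for whatever the real artifact storage path is, and only the 1MB figure comes from the comment above:

```ts
// Sketch only: saveArtifact stands in for the real artifact storage path.
const MAX_INLINE_OUTPUT_BYTES = 1024 * 1024; // the 1MB cap named above

async function packageResult(
  raw: string,
  saveArtifact: (data: string) => Promise<string>, // returns an artifact id
): Promise<{ inline?: string; artifactId?: string }> {
  if (Buffer.byteLength(raw, "utf8") <= MAX_INLINE_OUTPUT_BYTES) {
    return { inline: raw }; // small enough to flow back into context
  }
  // Oversized: store as an artifact and hand the agent a reference instead,
  // matching how oversized tool results are handled today.
  return { artifactId: await saveArtifact(raw) };
}
```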


1. **What can code access?** Can code in the sandbox make network requests? Access the database? Or is it purely computational (data in → data out)? Starting with pure computation is safest.

2. **How does code get data?** Does the agent pass data as arguments to `execute_code`, or can code directly reference artifacts/messages by ID? Passing data is simpler; referencing by ID is more powerful for large data.


This is arguably the most important open question for Phase 1 and deserves a concrete recommendation. Passing data as arguments has a hard limit — NativeSandboxExecutor serializes args as JSON in the execution wrapper (sandbox-utils.ts:createExecutionWrapper). For truly large data (oversized artifacts), this means deserializing potentially megabytes of JSON in the sandbox. A hybrid approach — pass small data as args, provide an artifacts.get(id) function in the stdlib for large data — would be more practical. This also connects to open question 6 (what data is accessible).
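
A sketch of the hybrid model from the sandbox's point of view — artifacts.get is an invented stdlib accessor shown for illustration, not an existing API:

```ts
// artifacts.get is an invented stdlib accessor, illustrating the hybrid
// model: small data as args, large data fetched inside the sandbox.
interface ArtifactsApi {
  get(id: string): Promise<unknown>; // resolves stored artifact content by id
}

// What agent-authored code might look like under this model:
async function topItemNames(artifacts: ArtifactsApi, artifactId: string) {
  const data = (await artifacts.get(artifactId)) as { items: { name: string }[] };
  // Transform in the sandbox instead of pulling megabytes into context.
  return data.items.slice(0, 10).map((item) => item.name);
}
```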

**Phase 1 — Code Execution + Stdlib**
- `execute_code` platform tool
- Sandbox execution (leverage existing infra)
- Ship stdlib with grep, jq, text basics


Phase 1 is missing a critical item: the execute_code tool schema definition (input/output Zod schemas). The tool description shown to the LLM is what determines whether agents actually use code mode effectively. The description needs to convey: what the stdlib offers, what the execution constraints are (no network, timeout, output limits), and when to prefer execute_code over requesting data directly. This is the highest-leverage design decision in Phase 1 and worth calling out explicitly.
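
To make this concrete, a sketch of what those schemas could look like in Zod (zod itself is real; the field names, description wording, and registration mechanism are all assumptions):

```ts
import { z } from "zod";

// Schema sketch; fields and wording are assumptions.
const executeCodeInput = z.object({
  code: z
    .string()
    .describe(
      "Self-contained TypeScript. Pure computation only: no network or " +
        "database access; hard timeout and output-size limit apply. The " +
        "stdlib (grep, jq, schema, text, csv) is preloaded. Prefer this " +
        "tool over requesting raw data when you need to filter or transform.",
    ),
  args: z.unknown().optional().describe("JSON-serializable input data."),
});

const executeCodeOutput = z.object({
  ok: z.boolean(),
  output: z.unknown().optional(),
  artifactId: z
    .string()
    .optional()
    .describe("Set instead of output when the result exceeded the inline size limit."),
  error: z.string().optional(),
});
```

The input description is doing the heavy lifting here: it is the only place the LLM learns the constraints, so it should enumerate them explicitly.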

@github-actions
Contributor

This pull request has been automatically marked as stale because it has not had recent activity.
It will be closed in 7 days if no further activity occurs.

If this PR is still relevant:

  • Rebase it on the latest main branch
  • Add a comment explaining its current status
  • Request a review if it's ready

Thank you for your contributions!

@github-actions github-actions bot added the stale label Apr 11, 2026
@github-actions
Contributor

This pull request has been automatically closed due to inactivity.

If you'd like to continue working on this, please:

  1. Create a new branch from the latest main
  2. Cherry-pick your commits or rebase your changes
  3. Open a new pull request

Thank you for your understanding!

@github-actions github-actions bot closed this Apr 18, 2026
@github-actions github-actions bot deleted the tim/code-mode-spec branch April 18, 2026 00:41