Conversation
TL;DR — Adds a design spec for "Code Mode," a proposed runtime capability that lets agents write and execute code on the fly instead of requiring bespoke tools for every data operation.

Key changes

The spec identifies four layered capabilities — sandbox code execution, a platform stdlib (grep, jq, text, csv), a shared library for sub-agents, and persisted functions.
PR Review Summary
(0) Total Issues | Risk: Low
This PR introduces an early-stage vision spec for "Code Mode" — a capability that would let agents write and execute code at runtime with access to a standard library. The document is well-structured as a brainstorming/ideation artifact capturing the problem space, core ideas, and open questions.
Document Assessment
Strengths:
- Clear problem statement with concrete examples
- Reasonable phasing strategy (code execution → shared library → persistence)
- Good identification of open questions and security considerations
- Appropriate scope limitation ("pure computation first")
- Nice architectural diagram showing the execution flow
Observations (not blockers):
💭 1) structure Spec maturity level
Issue: This document is structured as an informal vision/ideation document rather than the more formal spec format used elsewhere in this repo (e.g., specs/2026-03-06-altcha-pow/SPEC.md with numbered sections, decision logs, requirements tables, acceptance criteria).
Why: Neither approach is wrong — early-stage ideation docs serve a different purpose than implementation-ready specs. However, consider whether this should mature into the formal format before implementation begins.
Refs: specs/2026-03-06-altcha-pow/SPEC.md — example of the formal spec structure
💭 2) lines:54-55 Scope clarification for "agent-scoped"
Issue: The spec mentions the shared library is "scoped to the agent (project + agentId)" but also that "sub-agents within the same agent graph can read/write."
Why: These two statements need reconciliation. Is the scope project + agentId (single agent) or project + root agent graph (parent + all sub-agents)? The latter seems intended but the phrasing is ambiguous.
Fix: Clarify whether the scope is per-agent or per-agent-graph.
💭 3) open-questions Security model depth
Issue: Question #4 mentions sandboxing guarantees but doesn't explore the LLM-prompt-injection → code-execution vector.
Why: If an attacker can inject malicious instructions into an agent's context (via user input, tool results, or external data), those instructions could cause the agent to write malicious code that then executes. This is distinct from the sandboxing question and worth calling out explicitly.
Fix: Consider adding an open question about protecting against prompt-injection-to-code-execution attacks.
💭 Consider
💭 1) Missing status metadata
Other specs include status/owner/last-updated headers. This helps track document maturity and ownership.
💭 2) Relationship to existing SandboxExecutorFactory
The spec mentions building on existing sandbox infrastructure but doesn't detail what changes would be needed. Consider whether the current sandbox supports the stdlib injection and session-scoped shared library patterns.
💭 3) Token accounting for stdlib
Question #5 asks about token economics, but there's also a related question: does stdlib usage need to be metered for billing purposes? If agents use grep() internally, does that count differently than raw code?
✅ APPROVE
Summary: This is a solid early-stage vision document for an interesting capability. The core ideas are well-articulated, the phasing is sensible, and the open questions capture the right concerns. No blocking issues — the "Consider" items are suggestions for strengthening the spec before implementation planning begins. Nice work capturing the ideation from the Slack discussion!
Discarded (0)
No findings discarded.
Reviewers (1)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
| orchestrator | 5 | 0 | 5 | 0 | 0 | 0 | 0 |
| Total | 5 | 0 | 5 | 0 | 0 | 0 | 0 |
Note: Single-file spec document — no code reviewers dispatched. Review performed directly by orchestrator.
Solid spec that correctly identifies the right abstraction — general code execution beats an ever-growing bespoke tool surface. A few technical accuracy issues and design gaps worth addressing before this becomes the reference document for implementation.
— Claude Opus
> #### 1. Code Execution as a Platform Tool
>
> Agents get a `execute_code` tool that accepts code (TypeScript/JS) and runs it in the existing sandbox infrastructure (native or Vercel). The agent writes whatever logic it needs on the fly — no need for us to anticipate every operation.
The spec says "existing sandbox infrastructure (native or Vercel)" — worth noting that the current SandboxExecutorFactory.executeFunctionTool() expects a functionId and args, and the code being executed is pulled from the functions table's execute_code column. An execute_code platform tool would need a different entry point that accepts arbitrary agent-authored code as input rather than looking up a pre-registered function. The sandbox infra (pooling, concurrency semaphore, dependency caching) is reusable, but the execution wrapper in sandbox-utils.ts assumes a fixed execute(args) contract that would need to be generalized.
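A minimal sketch of what that generalized entry point might need — `createAdhocWrapper` and its input shape are hypothetical names for illustration, not code from sandbox-utils.ts:

```typescript
// Hypothetical sketch: a wrapper builder for agent-authored code, replacing
// the fixed execute(args) contract that assumes a pre-registered function.
interface AdhocExecutionInput {
  code: string;   // agent-authored TypeScript/JS body
  args: unknown;  // small inline data made available as `args`
}

// Builds the source file that would be written into the sandbox directory.
// The agent code is wrapped in an async IIFE so it can `return` a value and
// use `await`; the result is serialized to stdout for the executor to read.
function createAdhocWrapper(input: AdhocExecutionInput): string {
  return [
    "const __args = " + JSON.stringify(input.args) + ";",
    "(async (args) => {",
    input.code,
    "})(__args)",
    "  .then((result) => process.stdout.write(JSON.stringify({ ok: true, result })))",
    "  .catch((err) => process.stdout.write(JSON.stringify({ ok: false, error: String(err) })));",
  ].join("\n");
}
```

The sandbox pooling and dependency caching sit underneath this unchanged; only the wrapper generation differs from the pre-registered-function path.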
> | `text` | Split, chunk, truncate, word count, diff | Common text processing for conversation/artifact data |
> | `csv` | Parse, filter, transform tabular data | Common tool result format |
>
> The stdlib ships with the platform — agents don't need to install or import anything. It's just available in the execution environment.
"agents don't need to install or import anything" — this implies the stdlib is injected as globals or as a pre-loaded module in the sandbox. The current NativeSandboxExecutor writes a fresh package.json per sandbox directory and runs npm install. The stdlib would either need to be:

- Published as an npm package included in the generated `package.json` (simplest, plays well with existing infra)
- Pre-bundled as files written into the sandbox directory before execution
- Injected as globals in the execution wrapper

The choice is worth calling out explicitly, since it determines whether the stdlib can be versioned/updated independently of the platform release.
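For illustration, the globals-injection option might look roughly like this — `grep` is a toy stand-in for whatever the real stdlib ships, and `injectStdlib` is a hypothetical helper, not existing code:

```typescript
type StdlibFn = (...a: unknown[]) => unknown;

// A minimal grep of the kind the spec's stdlib table implies:
// filter lines matching a pattern.
function grep(lines: string[], pattern: RegExp): string[] {
  return lines.filter((line) => pattern.test(line));
}

// Globals injection: assign the stdlib onto globalThis before the agent
// code runs (or equivalently, prepend the definitions to the execution
// wrapper source). No npm install, no import statements needed.
function injectStdlib(globals: Record<string, StdlibFn>): void {
  for (const [name, fn] of Object.entries(globals)) {
    (globalThis as Record<string, unknown>)[name] = fn;
  }
}
```

The trade-off versus the npm-package option: globals are invisible to TypeScript tooling and can't be versioned per-agent, but they avoid any install step in the sandbox.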
> This means agents can build up specialized utilities during a conversation. A sub-agent that figured out how to parse a specific API response format can share that parser with its siblings.
>
> **Scope:** The shared library is scoped to the agent (project + agentId). Sub-agents within the same agent graph can read/write. Different agents in different projects cannot see each other's libraries.
The scoping here is project + agentId, but the sandbox executors are currently scoped by sessionId (for Vercel) or by a dependency hash (for native). A shared library that is readable by sibling sub-agents within the same conversation requires a new scoping dimension — the agent graph context — that doesn't exist in the sandbox layer today. Phase 2 should call out that this likely needs an in-memory or database-backed registry separate from the sandbox itself, since sub-agents may run in different sandbox instances.
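A sketch of what that new scoping dimension could look like — an in-memory registry keyed by project + agent-graph, shown here purely to make the shape concrete; a real version would be database-backed so sub-agents running in different sandbox instances see the same entries, and every name below is illustrative:

```typescript
interface SharedFunction {
  name: string;
  source: string; // the function's code, as written by an agent
}

// Registry keyed by (projectId, graphId) rather than by sandbox instance,
// so sibling sub-agents in the same agent graph share one namespace while
// different projects remain isolated.
class SharedLibraryRegistry {
  private byGraph = new Map<string, Map<string, SharedFunction>>();

  private key(projectId: string, graphId: string): string {
    return `${projectId}:${graphId}`;
  }

  put(projectId: string, graphId: string, fn: SharedFunction): void {
    const k = this.key(projectId, graphId);
    if (!this.byGraph.has(k)) this.byGraph.set(k, new Map());
    this.byGraph.get(k)!.set(fn.name, fn);
  }

  get(projectId: string, graphId: string, name: string): SharedFunction | undefined {
    return this.byGraph.get(this.key(projectId, graphId))?.get(name);
  }
}
```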
> Session N: Library grows with battle-tested utilities
>
> **Storage:** Persisted functions are stored in the manage database, versioned, and scoped to the agent. They can be viewed/managed in the Agent Builder UI.
"stored in the manage database" — the manage database is Doltgres (versioned). Storing generated code there means it participates in branch/merge semantics, which could be an advantage (rollback) or a footgun (merge conflicts on auto-generated code). Worth explicitly noting whether that's intentional or whether runtime-schema Postgres is more appropriate for generated artifacts.
> ## Relationship to Existing Systems
>
> - **Sandbox Executors:** Code Mode builds on the existing `SandboxExecutorFactory` (native + Vercel). The sandbox infrastructure already handles execution, session scoping, and cleanup.
Minor accuracy: SandboxExecutorFactory does not handle "session scoping" itself — it delegates session-scoped sandbox pooling to getForSession(sessionId), which ensures Vercel sandbox instances are reused within a session. The cleanup is triggered externally by AgentSession.cleanupSession(). The spec's phrasing ("already handles execution, session scoping, and cleanup") slightly overstates the factory's responsibility.
> - **Function Tools:** Existing function tools are pre-configured in the manage database. Code Mode is dynamic — the agent writes code at runtime. Persisted shared functions could eventually become function tools.
> - **Artifacts:** Oversized tool results are already stored as artifacts. Code Mode gives agents a way to actually *work with* those artifacts (grep, query, transform) instead of just retrieving them whole.
> - **Platform Tools:** `execute_code` would be a platform tool (like the search tools in the conversation history spec), auto-loaded for agents that have code mode enabled.
The spec references "platform tools" as an existing pattern ("like the search tools in the conversation history spec"). In practice, these are implemented as default tools in agents-api/src/domains/run/agents/tools/default-tools.ts and conditionally loaded based on agent configuration checks (e.g., agentHasArtifactComponents()). There's no generic feature-flag or per-agent opt-in for platform tools today. Phase 1 needs to either:

- Add an `enableCodeMode` flag to the agent config (manage schema) and check it in `getDefaultTools()`
- Build a more general platform-tool registry

Given Phase 1 scope, option 1 is likely sufficient — but it should be explicit.
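A minimal sketch of option 1, assuming an `enableCodeMode` flag on the agent config — the `AgentConfig` and `Tool` shapes here are simplified stand-ins, and the real `getDefaultTools()` takes richer context:

```typescript
interface AgentConfig {
  enableCodeMode?: boolean; // assumed new flag in the manage schema
}

interface Tool {
  name: string;
}

const EXECUTE_CODE_TOOL: Tool = { name: "execute_code" };

// Conditionally include the code-mode platform tool, mirroring how other
// default tools are gated on agent-configuration checks today.
function getDefaultTools(config: AgentConfig, baseTools: Tool[]): Tool[] {
  const tools = [...baseTools];
  if (config.enableCodeMode) {
    tools.push(EXECUTE_CODE_TOOL);
  }
  return tools;
}
```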
> ## Open Questions
>
> 1. **What can code access?** Can code in the sandbox make network requests? Access the database? Or is it purely computational (data in → data out)? Starting with pure computation is safest.
Open questions 1 and 4 overlap significantly (what can code access vs. security boundaries). Consider merging them and instead adding an open question about result size limits — the current NativeSandboxExecutor caps output at 1MB (FUNCTION_TOOL_SANDBOX_MAX_OUTPUT_SIZE_BYTES). For execute_code, the agent could produce large outputs that then need to flow back into context. Should oversized code results use the existing artifact storage path (like oversized tool results do today via detectOversizedArtifact)?
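The routing logic being suggested could look roughly like this — the 1MB cap matches the existing constant, but `routeCodeResult` and the `saveArtifact` callback are hypothetical:

```typescript
const MAX_OUTPUT_SIZE_BYTES = 1024 * 1024; // mirrors FUNCTION_TOOL_SANDBOX_MAX_OUTPUT_SIZE_BYTES

type CodeResult =
  | { kind: "inline"; data: string }
  | { kind: "artifact"; artifactId: string; sizeBytes: number };

// Small results flow back into context directly; oversized results are
// diverted to artifact storage and replaced with a reference, the same
// pattern used for oversized tool results today.
function routeCodeResult(
  output: string,
  saveArtifact: (data: string) => string, // stand-in: returns an artifact id
): CodeResult {
  const sizeBytes = Buffer.byteLength(output, "utf8");
  if (sizeBytes <= MAX_OUTPUT_SIZE_BYTES) {
    return { kind: "inline", data: output };
  }
  return { kind: "artifact", artifactId: saveArtifact(output), sizeBytes };
}
```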
> 2. **How does code get data?** Does the agent pass data as arguments to `execute_code`, or can code directly reference artifacts/messages by ID? Passing data is simpler; referencing by ID is more powerful for large data.
This is arguably the most important open question for Phase 1 and deserves a concrete recommendation. Passing data as arguments has a hard limit — NativeSandboxExecutor serializes args as JSON in the execution wrapper (sandbox-utils.ts:createExecutionWrapper). For truly large data (oversized artifacts), this means deserializing potentially megabytes of JSON in the sandbox. A hybrid approach — pass small data as args, provide an artifacts.get(id) function in the stdlib for large data — would be more practical. This also connects to open question 6 (what data is accessible).
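The hybrid approach could surface in the stdlib as something like the following — the `ArtifactStore` interface and `artifacts.get()` name are illustrative, not the repo's actual API:

```typescript
interface ArtifactStore {
  load(id: string): string | undefined;
}

// Builds the `artifacts` namespace exposed to sandboxed code: small data
// still arrives via execute_code args, but large data is resolved by ID
// inside the sandbox, avoiding megabytes of JSON in the args payload.
function makeArtifactsStdlib(store: ArtifactStore) {
  return {
    get(id: string): string {
      const data = store.load(id);
      if (data === undefined) throw new Error(`artifact not found: ${id}`);
      return data;
    },
  };
}
```

Note that exposing `artifacts.get()` inside the sandbox reopens open question 6 (what data is accessible): the store handed to the sandbox must be scoped to artifacts the agent is allowed to read.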
> **Phase 1 — Code Execution + Stdlib**
>
> - `execute_code` platform tool
> - Sandbox execution (leverage existing infra)
> - Ship stdlib with grep, jq, text basics
Phase 1 is missing a critical item: the execute_code tool schema definition (input/output Zod schemas). The tool description shown to the LLM is what determines whether agents actually use code mode effectively. The description needs to convey: what the stdlib offers, what the execution constraints are (no network, timeout, output limits), and when to prefer execute_code over requesting data directly. This is the highest-leverage design decision in Phase 1 and worth calling out explicitly.
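To make the point concrete, here is one possible shape for the tool definition — shown as a plain JSON-Schema-style object rather than the repo's Zod conventions, and the specific limits (30s timeout) are placeholders, not decided values:

```typescript
// Illustrative execute_code tool definition. The description is what the
// LLM actually sees, so it must state the stdlib surface, the execution
// constraints, and when to prefer code mode over pulling data into context.
const executeCodeTool = {
  name: "execute_code",
  description: [
    "Run TypeScript in a sandbox. Available stdlib: grep, jq, text, csv.",
    "Constraints: no network access, 30s timeout (placeholder), 1MB max output.",
    "Prefer this over requesting large data into context: fetch, filter,",
    "and return only what you need.",
  ].join(" "),
  inputSchema: {
    type: "object",
    properties: {
      code: { type: "string", description: "Code body; `return` the result." },
      args: { type: "object", description: "Small inline data passed as `args`." },
    },
    required: ["code"],
  },
};
```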
This pull request has been automatically marked as stale because it has not had recent activity. Thank you for your contributions!

This pull request has been automatically closed due to inactivity. Thank you for your understanding!

No description provided.