This guide shows how to create custom agents for RL training. AReaL supports any agent framework (OpenAI Agents SDK, LangChain, CAMEL-AI, etc.) with minimal integration.
Notes:
-
Agent workflows are supported on
localandslurmschedulers only. Therayscheduler is incompatible with the HTTP proxy architecture. -
For internal architecture details, see the Agent Workflow Reference.
An agent workflow is any class with an async def run(data, **extra_kwargs) method that
returns a reward. AReaL automatically wraps it for RL training.
class MyAgent:
async def run(self, data, **extra_kwargs):
# Get injected client and URL
http_client = extra_kwargs.get("http_client")
base_url = extra_kwargs.get("base_url") or os.getenv("OPENAI_BASE_URL")
api_key = extra_kwargs.get("api_key") or os.getenv("OPENAI_API_KEY")
# Use standard OpenAI SDK
client = AsyncOpenAI(
base_url=base_url,
api_key=api_key,
http_client=http_client,
max_retries=0,
)
response = await client.chat.completions.create(
model="default",
messages=data["messages"],
)
# Return reward (float or dict[str, float])
return compute_reward(response, data["answer"])Pass the agent to the trainer:
trainer.train(workflow="my_module.MyAgent")The run method must follow this signature:
async def run(self, data: dict, **extra_kwargs) -> float | dict[str, float]| Parameter | Description |
|---|---|
data |
A sample from your dataset (dict with your data keys) |
extra_kwargs |
AReaL-injected arguments (see below) |
| Return | float: reward for last completion |
dict[str, float]: maps completion IDs to rewards |
AReaL injects these arguments via extra_kwargs:
| Key | Type | Description |
|---|---|---|
base_url |
str |
URL to AReaL's proxy server |
api_key |
str |
Session-wise API key to AReaL's proxy server |
http_client |
httpx.AsyncClient |
Shared HTTP client (reduces overhead) |
AReaL supports two execution modes, configured via rollout.openai.mode:
The agent runs in the same process as the rollout worker. Recommended for most use cases.
rollout:
openai:
mode: inlineRequirements:
- The
runmethod must beasync - Use
extra_kwargs["base_url"]for LLM calls - Optionally use
extra_kwargs["http_client"]to reduce overhead
Advantages:
- No serialization overhead
- Direct access to shared HTTP client
- Lower latency
The agent runs in a separate process pool. Use this when your agent code is not async-compatible or uses libraries that conflict with the main process.
rollout:
openai:
mode: subproc
subproc_max_workers: 4 # Process pool sizeRequirements:
- The agent class must be picklable (serializable)
- Read
OPENAI_BASE_URLfrom environment instead ofextra_kwargs
Example:
import os
from openai import OpenAI # Sync client is OK
class MySyncAgent:
async def run(self, data, **extra_kwargs):
# In subproc mode, base_url and api_key come from environment
client = OpenAI(
base_url=os.getenv("OPENAI_BASE_URL"),
api_key=os.getenv("OPENAI_API_KEY"),
api_key="DUMMY", # Not used by AReaL
)
response = client.chat.completions.create(
model="default",
messages=data["messages"],
)
return compute_reward(response, data["answer"])Note: The method signature remains async def run(...) even in subprocess mode, but
AReaL wraps the call with asyncio.run() internally. You can use synchronous code
inside the method.
Trade-offs:
- Pickling overhead for agent and data
- No access to shared HTTP client
- Higher latency per call
- Useful for non-async libraries or process isolation
Return a single float to assign reward to the last LLM completion:
async def run(self, data, **extra_kwargs):
# ... agent logic ...
return 1.0 if is_correct else 0.0For multi-turn conversations, return a dict mapping completion IDs to rewards:
async def run(self, data, **extra_kwargs):
# ... multi-turn agent logic ...
return {
"completion-id-1": 0.5,
"completion-id-2": 1.0,
}Access completion IDs from the response:
response = await client.chat.completions.create(...)
completion_id = response.id # Use this ID for reward mappingAgent workflow settings are in rollout.openai:
rollout:
openai:
mode: inline # "inline" or "subproc"
turn_discount: 0.9 # Reward discount for earlier turns
export_style: individual # "individual" or "concat"
subproc_max_workers: 4 # Process pool size (subproc mode only)| Field | Default | Description |
|---|---|---|
mode |
inline |
Execution mode |
turn_discount |
1.0 |
Geometric discount for multi-turn rewards |
export_style |
individual |
How to export interactions for training |
subproc_max_workers |
4 |
Max worker processes for subprocess mode |
- Agentic RL Tutorial - End-to-end training examples
- Async Workflow Best Practices - Writing efficient inline async agent workflows
- Agent Workflow Reference - Internal architecture