
# Configurations

This page provides a comprehensive reference for all configuration parameters available in AReaL's command-line interface. These parameters are defined using dataclasses and can be specified in YAML configuration files or overridden via command line arguments.

## Usage

Configuration files are specified using the `--config` parameter:

```bash
python3 train.py --config path/to/config.yaml
```

You can override specific parameters from the command line:

```bash
python3 train.py --config path/to/config.yaml actor.lr=1e-4 seed=42
```

For detailed examples, see the experiment configurations in the examples/ directory.
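As a minimal sketch of what such a YAML file might contain (the field names come from the reference tables below; the paths and values are purely illustrative, not recommended defaults):

```yaml
# Hypothetical minimal config sketch; values are illustrative only.
experiment_name: my-experiment
trial_name: trial0
seed: 42
total_train_epochs: 1
tokenizer_path: /path/to/tokenizer
train_dataset:
  path: /path/to/dataset
  type: sft
  batch_size: 8
```

Command-line overrides such as `actor.lr=1e-4` address nested fields with the same dotted paths as the YAML keys.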

## Table of Contents

- Core Experiment Configurations
- Training Configurations
- Inference Configurations
- Dataset
- System and Cluster Configurations
- Logging and Monitoring
- Others


(section-base-experiment)=

## BaseExperiment Configuration

Base configuration class for all experiment types with common settings.

| Parameter | Type | Default | Description |
|---|---|---|---|
| experiment_name | string | Required | Name of the experiment (no '_' or '/'). Required. |
| trial_name | string | Required | Name of the trial (no '-' or '/'). Required. |
| cluster | ClusterSpecConfig | Required | Cluster specification. Mainly used by slurm. |
| allocation_mode | string | "" | Pattern-based GPU parallel strategy allocation mode. |
| seed | integer | 1 | Random seed for reproducibility. |
| enable_offload | boolean | False | Whether to enable training offload using torch_memory_saver. This requires setting up the environment for TMS (e.g., via LD_PRELOAD). |
| total_train_epochs | integer | 1 | Total number of epochs to train the model. |
| total_train_steps | integer \| None | None | Terminate training after this number of steps. For benchmarking purposes only. None indicates normal training. |
| total_train_n_seqs | integer \| None | None | Terminate training after consuming this number of samples. For benchmarking purposes only. None indicates normal training. |
| tokenizer_path | string | "" | Path to the tokenizer. |
| train_dataset | TrainDatasetConfig | Required | - |
| valid_dataset | ValidDatasetConfig \| None | None | - |
| saver | SaverConfig | Required | - |
| evaluator | EvaluatorConfig | Required | - |
| stats_logger | StatsLoggerConfig | Required | - |
| perf_tracer | PerfTracerConfig \| None | None | Performance tracer configuration. None means disabled. |
| recover | RecoverConfig | Required | - |
| sglang | SGLangConfig | Required | - |
| vllm | vLLMConfig | Required | - |
| scheduler | SchedulerConfig | Required | - |

(section-grpo)=

## GRPO Configuration

A placeholder GRPO configuration retained for backward compatibility; its fields mirror the PPO configuration.

| Parameter | Type | Default | Description |
|---|---|---|---|
| experiment_name | string | Required | Name of the experiment (no '_' or '/'). Required. |
| trial_name | string | Required | Name of the trial (no '-' or '/'). Required. |
| cluster | ClusterSpecConfig | Required | Cluster specification. Mainly used by slurm. |
| allocation_mode | string | "" | Pattern-based GPU parallel strategy allocation mode. |
| seed | integer | 1 | Random seed for reproducibility. |
| enable_offload | boolean | False | Whether to enable training offload using torch_memory_saver. This requires setting up the environment for TMS (e.g., via LD_PRELOAD). |
| total_train_epochs | integer | 1 | Total number of epochs to train the model. |
| total_train_steps | integer \| None | None | Terminate training after this number of steps. For benchmarking purposes only. None indicates normal training. |
| total_train_n_seqs | integer \| None | None | Terminate training after consuming this number of samples. For benchmarking purposes only. None indicates normal training. |
| tokenizer_path | string | "" | Path to the tokenizer. |
| train_dataset | TrainDatasetConfig | Required | - |
| valid_dataset | ValidDatasetConfig \| None | None | - |
| saver | SaverConfig | Required | - |
| evaluator | EvaluatorConfig | Required | - |
| stats_logger | StatsLoggerConfig | Required | - |
| perf_tracer | PerfTracerConfig \| None | None | Performance tracer configuration. None means disabled. |
| recover | RecoverConfig | Required | - |
| sglang | SGLangConfig | Required | - |
| vllm | vLLMConfig | Required | - |
| scheduler | SchedulerConfig | Required | - |
| gconfig | GenerationHyperparameters | Required | - |
| eval_gconfig | GenerationHyperparameters \| None | None | Generation hyperparameters for evaluation. If None, use gconfig. |
| rollout | InferenceEngineConfig | Required | - |
| actor | PPOActorConfig | Required | - |
| ref | PPOActorConfig \| None | None | - |
| critic | PPOCriticConfig \| None | None | - |
| dynamic_bs | boolean | False | Enable dynamic batch sizing in prepare_batch. When True, batch collection stops when (accepted + rejected) >= batch_size, returning only accepted results. This results in variable-sized batches of valid data. |

(section-ppo)=

## PPO Configuration

Configuration for Proximal Policy Optimization (PPO) reinforcement learning experiments.

| Parameter | Type | Default | Description |
|---|---|---|---|
| experiment_name | string | Required | Name of the experiment (no '_' or '/'). Required. |
| trial_name | string | Required | Name of the trial (no '-' or '/'). Required. |
| cluster | ClusterSpecConfig | Required | Cluster specification. Mainly used by slurm. |
| allocation_mode | string | "" | Pattern-based GPU parallel strategy allocation mode. |
| seed | integer | 1 | Random seed for reproducibility. |
| enable_offload | boolean | False | Whether to enable training offload using torch_memory_saver. This requires setting up the environment for TMS (e.g., via LD_PRELOAD). |
| total_train_epochs | integer | 1 | Total number of epochs to train the model. |
| total_train_steps | integer \| None | None | Terminate training after this number of steps. For benchmarking purposes only. None indicates normal training. |
| total_train_n_seqs | integer \| None | None | Terminate training after consuming this number of samples. For benchmarking purposes only. None indicates normal training. |
| tokenizer_path | string | "" | Path to the tokenizer. |
| train_dataset | TrainDatasetConfig | Required | - |
| valid_dataset | ValidDatasetConfig \| None | None | - |
| saver | SaverConfig | Required | - |
| evaluator | EvaluatorConfig | Required | - |
| stats_logger | StatsLoggerConfig | Required | - |
| perf_tracer | PerfTracerConfig \| None | None | Performance tracer configuration. None means disabled. |
| recover | RecoverConfig | Required | - |
| sglang | SGLangConfig | Required | - |
| vllm | vLLMConfig | Required | - |
| scheduler | SchedulerConfig | Required | - |
| gconfig | GenerationHyperparameters | Required | - |
| eval_gconfig | GenerationHyperparameters \| None | None | Generation hyperparameters for evaluation. If None, use gconfig. |
| rollout | InferenceEngineConfig | Required | - |
| actor | PPOActorConfig | Required | - |
| ref | PPOActorConfig \| None | None | - |
| critic | PPOCriticConfig \| None | None | - |
| dynamic_bs | boolean | False | Enable dynamic batch sizing in prepare_batch. When True, batch collection stops when (accepted + rejected) >= batch_size, returning only accepted results. This results in variable-sized batches of valid data. |

(section-rw)=

## RW Configuration

Configuration for Reward Model (RW) training experiments.

| Parameter | Type | Default | Description |
|---|---|---|---|
| experiment_name | string | Required | Name of the experiment (no '_' or '/'). Required. |
| trial_name | string | Required | Name of the trial (no '-' or '/'). Required. |
| cluster | ClusterSpecConfig | Required | Cluster specification. Mainly used by slurm. |
| allocation_mode | string | "" | Pattern-based GPU parallel strategy allocation mode. |
| seed | integer | 1 | Random seed for reproducibility. |
| enable_offload | boolean | False | Whether to enable training offload using torch_memory_saver. This requires setting up the environment for TMS (e.g., via LD_PRELOAD). |
| total_train_epochs | integer | 1 | Total number of epochs to train the model. |
| total_train_steps | integer \| None | None | Terminate training after this number of steps. For benchmarking purposes only. None indicates normal training. |
| total_train_n_seqs | integer \| None | None | Terminate training after consuming this number of samples. For benchmarking purposes only. None indicates normal training. |
| tokenizer_path | string | "" | Path to the tokenizer. |
| train_dataset | TrainDatasetConfig | Required | - |
| valid_dataset | ValidDatasetConfig \| None | None | - |
| saver | SaverConfig | Required | - |
| evaluator | EvaluatorConfig | Required | - |
| stats_logger | StatsLoggerConfig | Required | - |
| perf_tracer | PerfTracerConfig \| None | None | Performance tracer configuration. None means disabled. |
| recover | RecoverConfig | Required | - |
| sglang | SGLangConfig | Required | - |
| vllm | vLLMConfig | Required | - |
| scheduler | SchedulerConfig | Required | - |
| actor | TrainEngineConfig | Required | - |

(section-sft)=

## SFT Configuration

Configuration for Supervised Fine-Tuning (SFT) experiments.

| Parameter | Type | Default | Description |
|---|---|---|---|
| experiment_name | string | Required | Name of the experiment (no '_' or '/'). Required. |
| trial_name | string | Required | Name of the trial (no '-' or '/'). Required. |
| cluster | ClusterSpecConfig | Required | Cluster specification. Mainly used by slurm. |
| allocation_mode | string | "" | Pattern-based GPU parallel strategy allocation mode. |
| seed | integer | 1 | Random seed for reproducibility. |
| enable_offload | boolean | False | Whether to enable training offload using torch_memory_saver. This requires setting up the environment for TMS (e.g., via LD_PRELOAD). |
| total_train_epochs | integer | 1 | Total number of epochs to train the model. |
| total_train_steps | integer \| None | None | Terminate training after this number of steps. For benchmarking purposes only. None indicates normal training. |
| total_train_n_seqs | integer \| None | None | Terminate training after consuming this number of samples. For benchmarking purposes only. None indicates normal training. |
| tokenizer_path | string | "" | Path to the tokenizer. |
| train_dataset | TrainDatasetConfig | Required | - |
| valid_dataset | ValidDatasetConfig \| None | None | - |
| saver | SaverConfig | Required | - |
| evaluator | EvaluatorConfig | Required | - |
| stats_logger | StatsLoggerConfig | Required | - |
| perf_tracer | PerfTracerConfig \| None | None | Performance tracer configuration. None means disabled. |
| recover | RecoverConfig | Required | - |
| sglang | SGLangConfig | Required | - |
| vllm | vLLMConfig | Required | - |
| scheduler | SchedulerConfig | Required | - |
| actor | TrainEngineConfig | Required | - |

(section-fsdp-engine)=

## FSDPEngine Configuration

Configuration for the Fully Sharded Data Parallel (FSDP) training backend.

| Parameter | Type | Default | Description |
|---|---|---|---|
| wrap_policy | FSDPWrapPolicy \| None | None | FSDP wrap policy, specifying model layers to wrap. |
| offload_params | boolean | False | Whether to offload FSDP parameters to CPU. |
| memory_efficient_load | boolean | False | Enable memory-efficient model loading. When enabled, model weights are initialized on CPU and only rank 0 loads pretrained weights, which are then broadcast to all ranks after FSDP sharding. This reduces peak GPU memory during initialization for large models. Note: for VLMs, rank-0 broadcast is not used; each rank loads weights independently on CPU. |
| shard_vision_across_sp | boolean | False | Shard the vision encoder across SP ranks by image. Only effective when context_parallel_size > 1. |

(section-fsdp-wrap-policy)=

## FSDPWrapPolicy

Policy configuration for FSDP model layer wrapping. None defaults to wrapping the transformer decoder layers defined by transformers.

| Parameter | Type | Default | Description |
|---|---|---|---|
| transformer_layer_cls_to_wrap | list of string \| None | None | A list of transformer layer names for FSDP to wrap. |

(section-micro-batch)=

## MicroBatch Specification

Specification for splitting micro-batches during training.

| Parameter | Type | Default | Description |
|---|---|---|---|
| n_mbs | integer \| None | 1 | Number of micro-batches. Used as the exact count when max_tokens_per_mb is None, or as the minimum count otherwise. |
| granularity | integer | 1 | Granularity of each micro-batch. Adjacent sequences are grouped by this size when dividing micro-batches. |
| max_tokens_per_mb | integer \| None | None | Maximum tokens per micro-batch for each forward pass. When set, n_mbs becomes the minimum number of micro-batches. |
| n_mbs_divisor | integer | 1 | Divisor for the number of micro-batches. The final number of micro-batches is adjusted to be divisible by this value. |
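As a hedged illustration of how these fields interact, a spec that caps each forward pass at 16384 tokens while requiring at least two micro-batches, rounded to an even count, might look like this (values are illustrative only):

```yaml
mb_spec:
  n_mbs: 2                  # minimum micro-batch count once max_tokens_per_mb is set
  max_tokens_per_mb: 16384  # cap on tokens per forward pass
  n_mbs_divisor: 2          # final count adjusted to be divisible by 2
```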

(section-norm)=

## Norm Configuration

Configuration for reward/advantage normalization.

| Parameter | Type | Default | Description |
|---|---|---|---|
| mean_level | string \| None | "batch" | Mean level for normalization. None for no mean normalization. Choices: batch, group, None |
| mean_leave1out | boolean | False | Whether to use the leave-one-out average. |
| std_level | string \| None | "batch" | Standard deviation level for normalization. None for no std normalization. Choices: batch, group, None |
| std_unbiased | boolean | True | Whether to use unbiased standard deviation computation. Defaults to True (changed from False in v0.3.4). |
| eps | float | 1e-05 | Epsilon added when dividing by the standard deviation to avoid numerical issues. |
| group_size | integer | 1 | Group size for group-level normalization. |
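For example, group-level advantage normalization, where the completions sampled for the same prompt form one group, might be configured as follows (a sketch; the group_size value is illustrative and should match the number of samples per prompt):

```yaml
adv_norm:
  mean_level: group     # subtract the per-group mean
  mean_leave1out: true  # leave-one-out mean within each group
  std_level: null       # skip std normalization entirely
  group_size: 8         # e.g., equal to gconfig.n_samples
```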

(section-optimizer)=

## Optimizer Configuration

Configuration for model optimization during training.

| Parameter | Type | Default | Description |
|---|---|---|---|
| type | string | "adam" | Optimizer type. For the FSDP engine, adam_bf16 enables memory-efficient BF16 optimizer states. For the Megatron engine, adam_bf16 requires dtype=bfloat16 and is automatically converted to adam with the precision-aware optimizer enabled. Choices: adam, sgd, adam_bf16 |
| lr | float | 0.001 | Learning rate |
| weight_decay | float | 0.01 | Weight decay |
| beta1 | float | 0.9 | Adam beta1 parameter. Only effective when optimizer_type is adam/adam_bf16 |
| beta2 | float | 0.999 | Adam beta2 parameter. Only effective when optimizer_type is adam/adam_bf16 |
| eps | float | 1e-08 | Adam epsilon parameter. Only effective when optimizer_type is adam/adam_bf16 |
| min_lr_ratio | float | 0.0 | Minimum learning rate ratio after annealing |
| lr_scheduler_type | string | "constant" | Learning rate scheduler type. Choices: linear, cosine, constant |
| warmup_steps_proportion | float | 0.001 | Proportion of training steps for warmup |
| offload | boolean | False | Enable optimizer state offloading |
| initial_loss_scale | float | 4294967296 | Initial loss scaling factor |
| min_loss_scale | float | 1.0 | Minimum loss scaling factor |
| loss_scale_window | float | 5 | Window size for loss scaling adjustment |
| hysteresis | integer | 2 | Hysteresis (scaling factor) for loss scaling |
| gradient_clipping | float | 1.0 | Gradient clipping threshold |
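A typical optimizer block combining a cosine schedule with warmup might be sketched as follows (hypothetical values, not recommendations):

```yaml
optimizer:
  type: adam
  lr: 1.0e-5
  weight_decay: 0.01
  lr_scheduler_type: cosine
  warmup_steps_proportion: 0.01  # 1% of steps for warmup
  min_lr_ratio: 0.1              # anneal down to 10% of the peak lr
  gradient_clipping: 1.0
```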

(section-ppo-actor)=

## PPOActor Configuration

Configuration for the PPO actor model; a subclass of the TrainEngine configuration.

| Parameter | Type | Default | Description |
|---|---|---|---|
| experiment_name | string | Required | - |
| trial_name | string | Required | - |
| path | string | "" | Path to HuggingFace checkpoint |
| attn_impl | string | "flash_attention_2" | Attention implementation for the huggingface transformers model. Choices: flash_attention_2 |
| init_from_scratch | boolean | False | Initialize model weights randomly |
| is_critic | boolean | False | Whether to use a critic/reward model |
| temperature | float | 1.0 | Temperature during generation. |
| mb_spec | MicroBatchSpec | Required | - |
| pad_to_maximum | boolean | False | Whether to pad each micro-batch to the length upper bound specified by mb_spec. Can reduce memory fragmentation but slows down training. |
| disable_dropout | boolean | False | Disable dropout layers during training |
| gradient_checkpointing | boolean | False | Enable gradient checkpointing |
| dtype | string | "bfloat16" | Parameter data type. |
| grad_reduce_dtype | string | "float32" | Gradient reduction data type. |
| optimizer | OptimizerConfig \| None | None | Optimizer configuration. None means no training. |
| weight_update_mode | string | "xccl" | Weight update backend type. Choices: disk, xccl |
| fsdp | FSDPEngineConfig | Required | - |
| archon | ArchonEngineConfig | Required | - |
| megatron | MegatronEngineConfig | Required | - |
| use_lora | boolean | False | Whether to use LoRA. Only supported with FSDP. Should be enabled together with the corresponding vLLM/SGLang LoRA options. |
| lora_rank | integer | 32 | LoRA rank |
| lora_alpha | integer | 16 | LoRA alpha |
| target_modules | list of string | Required | LoRA target_modules. |
| peft_type | string | "lora" | PEFT method type. Only LoRA is supported for now. |
| enable_tree_training | boolean | False | Enable tree training with the flex attention module. |
| scheduling_spec | tuple | Required | Train engine schedule specs. Accepts 1 or 2 SchedulingSpec: if 1 spec is provided, it is used for both worker and engine, and the engine is embedded in the worker; if 2 specs are provided, the first is for the worker and the second for the engine. Currently only used by the TrainController. |
| scheduling_strategy | SchedulingStrategy | Required | The scheduling strategy of this TrainEngine, either separation or colocation. Currently only used by the TrainController. |
| ppo_n_minibatches | integer | 4 | Number of minibatches for each PPO update |
| eps_clip | float | 0.2 | Clipping factor for the policy ratio |
| eps_clip_higher | float \| None | None | Upper clipping factor for the policy ratio. Default is None. When eps_clip_higher is set (decoupled), eps_clip is used as the lower value. |
| c_clip | float \| None | None | Dual clipping factor for the policy ratio; must be > 1.0. None disables dual clipping. |
| m2_threshold | float \| None | None | The second-momentum threshold for M2PO. |
| reward_norm | NormConfig \| None | None | Normalization configuration for rewards |
| reward_scaling | float | 1.0 | Reward scaling factor |
| reward_bias | float | 0.0 | Reward bias |
| reward_clip | float | 20.0 | Maximum absolute value for reward clipping |
| overlong_reward_penalty | boolean | False | Penalty for overlong sequences. Used within DAPO. |
| overlong_tokens | integer \| None | None | Number of tokens in the tail that will receive a penalty |
| overlong_penalty_factor | float \| None | None | Penalty factor for tokens in the tail |
| mask_no_eos_with_zero | boolean | False | Mask truncated generations (no EOS token) and exclude them from training |
| discount | float | 1.0 | Discount factor for future rewards |
| gae_lambda | float | 1.0 | Lambda parameter for GAE |
| adv_norm | NormConfig \| None | None | Normalization configuration for advantages. |
| kl_ctl | float | 0.1 | KL divergence coefficient |
| kl_estimator | string | "k1" | KL divergence estimator. Choices: k1, k2, k3 |
| use_sapo_loss | boolean | False | Use the SAPO loss (mutually exclusive with PPO clipping) |
| sapo_tau_pos | float | 1.0 | SAPO temperature for positive advantages |
| sapo_tau_neg | float | 1.05 | SAPO temperature for negative advantages |
| recompute_logprob | boolean | False | Recompute log probabilities, replacing those returned by inference. |
| use_decoupled_loss | boolean | False | Use the decoupled loss. Implicitly enables recompute_logprob. |
| behave_imp_weight_cap | float \| None | 5.0 | Filter out tokens/sequences whose behave_imp_weight exceeds this cap when computing the loss. Only effective when use_decoupled_loss=True (decoupled/async training). Must be > 1.0 when the mode is not 'disabled'. The mode is controlled by behave_imp_weight_mode (mask/truncate/disabled). |
| behave_imp_weight_mode | string | "token_mask" | Mode for importance weight filtering. Only effective when use_decoupled_loss=True (decoupled/async training). 'token_truncate': clamp the token ratio to [0, cap]. 'token_mask': set the token ratio to 0 where ratio > cap. 'sequence_truncate': clamp the sequence ratio to [0, cap]. 'sequence_mask': set the sequence ratio to 0 where ratio > cap. 'disabled': disable importance weight correction. Choices: token_truncate, token_mask, sequence_truncate, sequence_mask, disabled |
| importance_sampling_level | string | "token" | Level at which to compute importance sampling ratios. 'token': per-token ratios (standard PPO). 'sequence': sequence-level geometric mean of per-token ratios (GSPO). Choices: token, sequence |
| prox_logp_method | string | "recompute" | Method for computing proximal-policy log probabilities in decoupled PPO. Only effective when use_decoupled_loss=True. 'recompute' (default): standard decoupled PPO, recompute the proximal policy via a forward pass. 'loglinear': use log-linear interpolation to approximate the proximal policy (skips the forward pass). 'metrics': like 'recompute', but also compute approximation metrics for evaluation. Choices: recompute, loglinear, metrics |
| log_agent_stats | boolean | False | Log statistics for agent trajectories |
| log_agent_stats_keys | list of string | Required | Keys for logging agent trajectory statistics |
| max_new_tokens | integer | 1024 | Maximum number of new tokens to generate |
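Putting a few of these fields together, an actor block for a clipped-PPO run might be sketched as follows (hypothetical values for illustration; the path is a placeholder):

```yaml
actor:
  path: /path/to/hf/checkpoint
  ppo_n_minibatches: 4
  eps_clip: 0.2        # standard PPO clipping
  kl_ctl: 0.0          # disable the KL penalty
  adv_norm:
    mean_level: group  # group-wise advantage normalization
    group_size: 8
  optimizer:
    type: adam
    lr: 1.0e-6
```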

(section-ppo-critic)=

## PPOCritic Configuration

Configuration for the PPO critic model; a subclass of the TrainEngine configuration.

| Parameter | Type | Default | Description |
|---|---|---|---|
| experiment_name | string | Required | - |
| trial_name | string | Required | - |
| path | string | "" | Path to HuggingFace checkpoint |
| attn_impl | string | "flash_attention_2" | Attention implementation for the huggingface transformers model. Choices: flash_attention_2 |
| init_from_scratch | boolean | False | Initialize model weights randomly |
| is_critic | boolean | False | Whether to use a critic/reward model |
| temperature | float | 1.0 | Temperature during generation. |
| mb_spec | MicroBatchSpec | Required | - |
| pad_to_maximum | boolean | False | Whether to pad each micro-batch to the length upper bound specified by mb_spec. Can reduce memory fragmentation but slows down training. |
| disable_dropout | boolean | False | Disable dropout layers during training |
| gradient_checkpointing | boolean | False | Enable gradient checkpointing |
| dtype | string | "bfloat16" | Parameter data type. |
| grad_reduce_dtype | string | "float32" | Gradient reduction data type. |
| optimizer | OptimizerConfig \| None | None | Optimizer configuration. None means no training. |
| weight_update_mode | string | "xccl" | Weight update backend type. Choices: disk, xccl |
| fsdp | FSDPEngineConfig | Required | - |
| archon | ArchonEngineConfig | Required | - |
| megatron | MegatronEngineConfig | Required | - |
| use_lora | boolean | False | Whether to use LoRA. Only supported with FSDP. Should be enabled together with the corresponding vLLM/SGLang LoRA options. |
| lora_rank | integer | 32 | LoRA rank |
| lora_alpha | integer | 16 | LoRA alpha |
| target_modules | list of string | Required | LoRA target_modules. |
| peft_type | string | "lora" | PEFT method type. Only LoRA is supported for now. |
| enable_tree_training | boolean | False | Enable tree training with the flex attention module. |
| scheduling_spec | tuple | Required | Train engine schedule specs. Accepts 1 or 2 SchedulingSpec: if 1 spec is provided, it is used for both worker and engine, and the engine is embedded in the worker; if 2 specs are provided, the first is for the worker and the second for the engine. Currently only used by the TrainController. |
| scheduling_strategy | SchedulingStrategy | Required | The scheduling strategy of this TrainEngine, either separation or colocation. Currently only used by the TrainController. |
| ppo_n_minibatches | integer | 4 | Number of minibatches for each PPO update |
| eps_clip | float | 0.5 | Clipping factor for the value loss |
| mask_no_eos_with_zero | boolean | False | Mask truncated generations (no EOS token) and exclude them from training |

(section-train-engine)=

## TrainEngine Configuration

Core configuration for model training, including optimization and backend settings.

| Parameter | Type | Default | Description |
|---|---|---|---|
| experiment_name | string | Required | - |
| trial_name | string | Required | - |
| path | string | "" | Path to HuggingFace checkpoint |
| attn_impl | string | "flash_attention_2" | Attention implementation for the huggingface transformers model. Choices: flash_attention_2 |
| init_from_scratch | boolean | False | Initialize model weights randomly |
| is_critic | boolean | False | Whether to use a critic/reward model |
| temperature | float | 1.0 | Temperature during generation. |
| mb_spec | MicroBatchSpec | Required | - |
| pad_to_maximum | boolean | False | Whether to pad each micro-batch to the length upper bound specified by mb_spec. Can reduce memory fragmentation but slows down training. |
| disable_dropout | boolean | False | Disable dropout layers during training |
| gradient_checkpointing | boolean | False | Enable gradient checkpointing |
| dtype | string | "bfloat16" | Parameter data type. |
| grad_reduce_dtype | string | "float32" | Gradient reduction data type. |
| optimizer | OptimizerConfig \| None | None | Optimizer configuration. None means no training. |
| weight_update_mode | string | "xccl" | Weight update backend type. Choices: disk, xccl |
| fsdp | FSDPEngineConfig | Required | - |
| archon | ArchonEngineConfig | Required | - |
| megatron | MegatronEngineConfig | Required | - |
| use_lora | boolean | False | Whether to use LoRA. Only supported with FSDP. Should be enabled together with the corresponding vLLM/SGLang LoRA options. |
| lora_rank | integer | 32 | LoRA rank |
| lora_alpha | integer | 16 | LoRA alpha |
| target_modules | list of string | Required | LoRA target_modules. |
| peft_type | string | "lora" | PEFT method type. Only LoRA is supported for now. |
| enable_tree_training | boolean | False | Enable tree training with the flex attention module. |
| scheduling_spec | tuple | Required | Train engine schedule specs. Accepts 1 or 2 SchedulingSpec: if 1 spec is provided, it is used for both worker and engine, and the engine is embedded in the worker; if 2 specs are provided, the first is for the worker and the second for the engine. Currently only used by the TrainController. |
| scheduling_strategy | SchedulingStrategy | Required | The scheduling strategy of this TrainEngine, either separation or colocation. Currently only used by the TrainController. |

(section-generation-hyperparameters)=

## GenerationHyperparameters

Controls text generation behavior for rollout.

| Parameter | Type | Default | Description |
|---|---|---|---|
| n_samples | integer | 1 | Number of sequences to generate per prompt. |
| max_new_tokens | integer | 16384 | Maximum number of tokens to generate. |
| min_new_tokens | integer | 0 | Minimum number of tokens to generate. |
| max_tokens | integer | 32768 | Maximum number of tokens including prompt and generated tokens. |
| greedy | boolean | False | Whether to use greedy decoding (max probability). |
| top_p | float | 1.0 | Nucleus sampling probability threshold (0.0, 1.0]. |
| top_k | integer | 100000000 | Number of highest-probability tokens to consider. |
| temperature | float | 1.0 | Sampling temperature. Higher values increase diversity. |
| stop_token_ids | list of integer | Required | Stop generation when encountering these token IDs. |
| ignore_eos | boolean | False | Do not stop generation when EOS is encountered. |
| skip_special_tokens | boolean | True | Skip special tokens when decoding/displaying outputs. |
| stop | list of string \| None | None | One or multiple stop words. Generation stops if one of these words is sampled. |
| frequency_penalty | float | 0.0 | Penalizes tokens based on their frequency in the generation so far. Must be between -2 and 2, where negative values encourage repetition. |
| lora_name | string | "default_lora" | LoRA name to be used for this generation. |
| use_beam_search | boolean | False | Enable beam search in the vLLM engine. When enabled, sampling parameters like temperature, top-p, and top-k are ignored. |
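A typical rollout generation block for group-based RL sampling might be sketched as follows (values illustrative only):

```yaml
gconfig:
  n_samples: 8          # completions per prompt (one group)
  max_new_tokens: 4096
  temperature: 1.0      # full-entropy sampling for exploration
  top_p: 1.0
  stop_token_ids: []
```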

(section-inference-engine)=

## InferenceEngine Configuration

Configuration for inference servers, including off-policyness control.

| Parameter | Type | Default | Description |
|---|---|---|---|
| experiment_name | string \| None | None | - |
| trial_name | string \| None | None | - |
| fileroot | string \| None | None | Root directory for logs and trajectory dumps. |
| max_concurrent_rollouts | integer \| None | None | Maximum number of concurrent rollouts to the inference engine. Defaults to consumer_batch_size. |
| queue_size | integer \| None | None | Input/output queue size for async rollout. |
| consumer_batch_size | integer | 1 | Batch size for consuming rollouts from the queue. |
| max_head_offpolicyness | integer | 0 | Maximum off-policyness for the head. If the current version is more than this many versions behind, the request will not be accepted. |
| enable_rollout_tracing | boolean | False | Whether to output verbose tracing messages for each generation request. |
| check_trajectory_format | boolean | False | Whether to check the format of trajectories produced by a customized workflow. Useful when debugging the workflow in isolation. Should be False during RL training. |
| schedule_policy | string | "round_robin" | Request scheduling policy. Choices: round_robin |
| tokenizer_path | string | "" | Path to the tokenizer for trajectory text decoding. |
| dump_to_file | boolean | False | Whether to dump trajectories to files under fileroot. |
| setup_timeout | float | 300.0 | Timeout in seconds for connecting to remote servers or launching local servers. |
| request_timeout | float | 3600 | Timeout for HTTP requests. |
| request_retries | integer | 3 | Number of retries for failed requests. |
| pause_grace_period | float | 0.0 | The grace period after calling /pause_generation. Wait until all requests have been dropped. |
| scheduling_spec | tuple | Required | Inference engine schedule specs. Accepts 1 or 2 SchedulingSpec: if 1 spec is provided, it is used for both worker and engine, and the engine is embedded in the worker; if 2 specs are provided, the first is for the worker and the second for the engine. Currently only used by the RolloutController. |
| scheduling_strategy | SchedulingStrategy | Required | The scheduling strategy of this inference engine, either separation or colocation. Currently only used by the RolloutController. |
| use_lora | boolean | False | Whether to use LoRA. Should match the actor's LoRA option. |
| openai | OpenAIProxyConfig \| None | None | OpenAI proxy configuration (used when the workflow is an agent workflow). |
| return_routed_experts | boolean | False | Return routed expert indices for MoE models. Effective only when using the SGLang engine with MoE models. |
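For asynchronous RL, the rollout block is where staleness is bounded. A sketch with hypothetical values:

```yaml
rollout:
  max_concurrent_rollouts: 64
  consumer_batch_size: 32
  max_head_offpolicyness: 2   # accept data at most 2 weight versions stale
  enable_rollout_tracing: false
```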

(section-sg-lang)=

## SGLang Configuration

Configuration for the SGLang runtime. Refer to https://github.com/sgl-project/sglang for detailed documentation.

| Parameter | Type | Default | Description |
|---|---|---|---|
| model_path | string | "" | - |
| random_seed | integer | 1 | - |
| skip_tokenizer_init | boolean | False | - |
| disable_cuda_graph | boolean | False | - |
| disable_radix_cache | boolean | True | - |
| disable_cuda_graph_padding | boolean | False | - |
| enable_nccl_nvls | boolean | False | - |
| disable_outlines_disk_cache | boolean | False | - |
| disable_custom_all_reduce | boolean | False | - |
| disable_overlap_schedule | boolean | False | - |
| enable_mixed_chunk | boolean | False | - |
| enable_dp_attention | boolean | False | - |
| enable_ep_moe | boolean | False | - |
| enable_torch_compile | boolean | False | - |
| torch_compile_max_bs | integer | 32 | - |
| cuda_graph_max_bs | integer \| None | None | - |
| cuda_graph_bs | list of integer \| None | None | - |
| torchao_config | string | "" | - |
| enable_nan_detection | boolean | False | - |
| enable_p2p_check | boolean | False | - |
| triton_attention_reduce_in_fp32 | boolean | False | - |
| triton_attention_num_kv_splits | integer | 8 | - |
| num_continuous_decode_steps | integer | 1 | - |
| enable_memory_saver | boolean | False | - |
| allow_auto_truncate | boolean | False | - |
| attention_backend | string \| None | "fa3" | - |
| enable_multimodal | boolean | False | - |
| sampling_backend | string \| None | None | - |
| context_length | integer \| None | 32768 | - |
| mem_fraction_static | float \| None | 0.9 | - |
| max_running_requests | integer \| None | None | - |
| chunked_prefill_size | integer \| None | -1 | - |
| max_prefill_tokens | integer | 32768 | - |
| schedule_policy | string | "lpm" | - |
| schedule_conservativeness | float | 1.0 | - |
| cpu_offload_gb | integer | 0 | - |
| dtype | string | "bfloat16" | - |
| kv_cache_dtype | string | "auto" | - |
| dp_size | integer | 1 | - |
| ep_size | integer | 1 | - |
| enable_lora | boolean \| None | None | - |
| max_lora_rank | integer \| None | None | - |
| max_loaded_loras | integer | 8 | - |
| lora_paths | list of string \| None | None | - |
| lora_backend | string | "triton" | - |
| log_level | string | "warning" | - |
| log_level_http | string \| None | "warning" | - |
| log_requests | boolean | False | - |
| log_requests_level | integer | 0 | - |
| show_time_cost | boolean | False | - |
| enable_metrics | boolean | True | - |
| decode_log_interval | integer | 1 | - |
| enable_multithread_load | boolean | False | - |
| enable_return_routed_experts | boolean | False | - |

(section-v-llm)=

## vLLM Configuration

Configuration for the vLLM runtime. Refer to https://docs.vllm.ai/en/stable/api/index.html for detailed documentation.

| Parameter | Type | Default | Description |
|---|---|---|---|
| model | string | "" | - |
| seed | integer | 1 | - |
| skip_tokenizer_init | boolean | False | - |
| enforce_eager | boolean | False | - |
| dtype | string | "bfloat16" | - |
| distributed_executor_backend | string | "mp" | - |
| max_num_seqs | integer | 256 | - |
| block_size | integer | 16 | - |
| swap_space | integer | 4 | - |
| cpu_offload_gb | float | 0 | - |
| disable_sliding_window | boolean | True | - |
| max_model_len | integer \| None | 32768 | - |
| no_enable_chunked_prefill | boolean | False | - |
| no_enable_prefix_caching | boolean | True | - |
| gpu_memory_utilization | float | 0.9 | - |
| worker_extension_cls | string | "areal.engine.vllm_ext.vllm_worker_extension.VLLMWorkerExtension" | - |
| enable_sleep_mode | boolean | False | - |
| uvicorn_log_level | string | "warning" | - |
| enable_lora | boolean | False | - |
| max_lora_rank | integer | 16 | - |
| max_loras | integer | 8 | - |
| lora_modules | list of string \| None | None | - |

(section-train-dataset)=

## TrainDataset Configuration

Configuration for training dataset loading and preprocessing.

| Parameter | Type | Default | Description |
|---|---|---|---|
| path | string | Required | Path to the dataset. Can be a local path or a HuggingFace dataset name. |
| type | string | Required | Type of training method, e.g., 'sft', 'rl', etc. |
| batch_size | integer | 1 | Batch size for the dataloader |
| shuffle | boolean | True | Whether to shuffle the dataset |
| pin_memory | boolean | False | Pin memory for faster data loading (set True for GPU training) |
| num_workers | integer | 0 | Number of worker processes for data loading |
| drop_last | boolean | True | Drop the last incomplete batch |
| max_length | integer \| None | None | Maximum token length of sequences in the dataset. Longer sequences are filtered out. |
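A dataset block might be sketched as follows (the dataset name is a hypothetical placeholder; values are illustrative):

```yaml
train_dataset:
  path: org/some-dataset   # hypothetical HuggingFace dataset name, or a local path
  type: rl
  batch_size: 32
  shuffle: true
  max_length: 2048         # filter out sequences longer than 2048 tokens
```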

(section-valid-dataset)=

## ValidDataset Configuration

Configuration for validation dataset loading and preprocessing.

It has different default values from TrainDatasetConfig: shuffle and drop_last default to False.

| Parameter | Type | Default | Description |
|---|---|---|---|
| path | string | Required | Path to the dataset. Can be a local path or a HuggingFace dataset name. |
| type | string | Required | Type of training method, e.g., 'sft', 'rl', etc. |
| batch_size | integer | 1 | Batch size for the dataloader |
| shuffle | boolean | False | Whether to shuffle the dataset |
| pin_memory | boolean | False | Pin memory for faster data loading (set True for GPU training) |
| num_workers | integer | 0 | Number of worker processes for data loading |
| drop_last | boolean | False | Drop the last incomplete batch |
| max_length | integer \| None | None | Maximum token length of sequences in the dataset. Longer sequences are filtered out. |

(section-cluster)=

Cluster Specification Configuration

Configuration for cluster specification and distributed computing setup.

Parameter Type Default Description
name_resolve NameResolveConfig Required Name resolving configuration.
cluster_name string "local" Name of the cluster. Used to set specific environs.
fileroot string "/tmp/areal/" Root for logs and checkpoints. Should be available on all nodes.
n_nodes integer 32 The size of the cluster. Used to decide slurm hostname suffix.
n_gpus_per_node integer 8 Number of GPUs per node (physical).

(section-name-resolve)=

NameResolve Configuration

Configuration for distributed name resolution and service discovery.

Parameter Type Default Description
type string "nfs" Type of the distributed KV store for name resolving. Choices: nfs, etcd3, ray
nfs_record_root string "/tmp/areal/name_resolve" Record root for NFS name resolving. Should be available on all nodes.
etcd3_addr string "localhost:2379" Address of the ETCD3 server.
ray_actor_name string "ray_kv_store" Name of the distributed Ray KV store.
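
Putting the two sections above together, a cluster specification with NFS-based name resolution might be sketched as follows (assuming name_resolve nests under cluster, per the ClusterSpecConfig table):

```yaml
cluster:
  cluster_name: local
  fileroot: /tmp/areal/       # must be available on all nodes
  n_nodes: 4
  n_gpus_per_node: 8
  name_resolve:
    type: nfs                 # choices: nfs, etcd3, ray
    nfs_record_root: /tmp/areal/name_resolve
```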

(section-evaluator)=

Evaluator Configuration

Configuration for model evaluation scheduling and timing.

Parameter Type Default Description
experiment_name string Required -
trial_name string Required -
fileroot string Required -
freq_epochs integer | None None Trigger frequency in epochs. None disables epoch-based saving.
freq_steps integer | None None Trigger frequency in steps. None disables step-based saving.
freq_secs integer | None None Trigger frequency in seconds. None disables time-based saving.

(section-recover)=

Recover Configuration

Configuration for experiment recovery and fault tolerance.

Parameter Type Default Description
experiment_name string Required -
trial_name string Required -
fileroot string Required -
freq_epochs integer | None None Trigger frequency in epochs. None disables epoch-based saving.
freq_steps integer | None None Trigger frequency in steps. None disables step-based saving.
freq_secs integer | None None Trigger frequency in seconds. None disables time-based saving.
mode string "disabled" Recovery mode for the launcher. Options: 'on' or 'auto': Automatically recover from previous runs if recover info and checkpoints are available. 'off' or 'disabled': Never recover from previous runs.
retries integer 3 Number of recovery retries when recovery is enabled.
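
For example, a recovery section that writes recovery info periodically and resumes automatically after a failure could look like this (key name assumed to be `recover`):

```yaml
recover:
  mode: auto       # resume when recover info and checkpoints are available
  retries: 3       # give up after 3 failed recovery attempts
  freq_steps: 100  # write recovery info every 100 steps
```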

(section-saver)=

Saver Configuration

Configuration for model checkpoint saving scheduling and timing.

Parameter Type Default Description
experiment_name string Required -
trial_name string Required -
fileroot string Required -
freq_epochs integer | None None Trigger frequency in epochs. None disables epoch-based saving.
freq_steps integer | None None Trigger frequency in steps. None disables step-based saving.
freq_secs integer | None None Trigger frequency in seconds. None disables time-based saving.
mode string "auto" Checkpoint save mode for HF saves. 'auto': use async for Archon engine, sync for others (default). 'sync': always synchronous. 'async': always process-based async with pinned memory staging, extra CPU pinned memory proportional to per-rank model shard size (e.g., ~17.5GB/rank for 70B model on 8 GPUs). Non-Archon engines fall back to sync with a warning. Choices: auto, sync, async
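
The trigger frequencies compose: any of the epoch, step, or time conditions can fire a save. A hedged sketch (key name assumed to be `saver`):

```yaml
saver:
  freq_epochs: 1    # save at the end of every epoch
  freq_secs: 3600   # additionally save at least every hour
  mode: auto        # async for the Archon engine, sync for others
```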

(section-stats-logger)=

StatsLogger Configuration

Configuration for experiment statistics logging and tracking services.

Parameter Type Default Description
experiment_name string Required -
trial_name string Required -
fileroot string Required -
wandb WandBConfig Required Weights & Biases configuration.
swanlab SwanlabConfig Required SwanLab configuration.
tensorboard TensorBoardConfig Required TensorBoard configuration. Only 'path' field required.

(section-swanlab)=

Swanlab Configuration

Configuration for SwanLab experiment tracking and monitoring.

Parameter Type Default Description
project string | None None -
name string | None None -
config dict | None None -
logdir string | None None -
mode string | None "disabled" -
api_key string | None None -

(section-tensor-board)=

TensorBoard Configuration

Configuration for TensorBoard logging and visualization.

Parameter Type Default Description
path string | None None -

(section-wand-b)=

WandB Configuration

Configuration for Weights & Biases experiment tracking.

Parameter Type Default Description
mode string "disabled" -
wandb_base_url string "" -
wandb_api_key string "" -
entity string | None None -
project string | None None -
name string | None None -
job_type string | None None -
group string | None None -
notes string | None None -
tags list of string | None None -
config dict | None None -
id_suffix string | None "train" -
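
For instance, enabling Weights & Biases tracking inside the stats logger might look like this (the nesting follows the StatsLogger table above; the project and run names are placeholders):

```yaml
stats_logger:
  wandb:
    mode: online
    project: my-areal-project    # example project name (assumption)
    name: qwen7b-grpo-trial      # example run name (assumption)
    tags: [rl, baseline]
```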

(section-archon-engine)=

ArchonEngine Configuration

Configuration for Archon Engine training backend.

Parameter Type Default Description
attn_type string "varlen" Attention backend type. Use 'tree' for tree training. Choices: varlen, sdpa, tree
offload_params boolean False Whether to offload FSDP parameters to CPU.
enable_compile boolean True Enable torch.compile for TransformerBlocks.
ac_mode string "selective" Activation checkpointing mode. 'memory_budget' requires enable_compile=True. Choices: none, full, selective, memory_budget
selective_ac_option string "op" Selective AC option: 'op' for op-level, or integer string (e.g., '2') for every Nth layer.
ac_memory_budget float 0.5 Memory budget for 'memory_budget' AC mode. 0.0 = minimum memory (max recompute), 1.0 = default behavior (no recompute).
ac_preserve_rng_state boolean False Preserve RNG state during checkpointing for deterministic output. Enabling this may slow down training.
ac_debug boolean False (Testing only) Capture AC debug information. Will be slower.
pp_schedule string "Interleaved1F1B" Pipeline parallel schedule type. Choices: 1F1B, Interleaved1F1B, InterleavedZeroBubble, ZBVZeroBubble
pp_layers_per_stage integer | None None Number of transformer layers per (virtual) pipeline stage. If set, num_virtual_stages is calculated from num_layers. If None, stages are inferred from schedule type (1 stage/rank for 1F1B, 2 stages/rank for Interleaved1F1B/InterleavedZeroBubble/ZBVZeroBubble).
pp_first_stage_less_layers integer 1 Number of layers to reduce in the first pipeline stage. Accounts for embedding layer overhead.
pp_last_stage_less_layers integer 1 Number of layers to reduce in the last pipeline stage. Accounts for output layer overhead.
reshard_after_forward_policy string "default" FSDP reshard policy after forward pass. 'default': reshard when pipeline parallelism is off; keep unsharded when on to avoid repeated all-gather per microbatch. 'always': always reshard after forward (saves memory). 'never': never reshard after forward. Choices: default, always, never
use_deterministic_algorithms boolean False Enable deterministic algorithms for training reproducibility. Sets torch.use_deterministic_algorithms(True, warn_only=True), CUBLAS_WORKSPACE_CONFIG, NCCL_ALGO, and TORCH_COMPILE_DETERMINISTIC. May reduce performance.
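
A sketch of an Archon engine section tuned for memory-budget activation checkpointing. The parent key (`actor.archon`) is an assumption about the experiment schema; note that `ac_mode: memory_budget` requires `enable_compile: true`, per the table above:

```yaml
actor:                      # hypothetical parent key for the training engine
  archon:
    attn_type: varlen
    enable_compile: true    # required by memory_budget AC mode
    ac_mode: memory_budget
    ac_memory_budget: 0.5   # 0.0 = max recompute, 1.0 = no recompute
```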

(section-distributed-data-parallel)=

DistributedDataParallel Configuration

Configuration for Megatron's DistributedDataParallel.

Refer to Megatron-LM documentation for details.

Parameter Type Default Description
grad_reduce_in_fp32 boolean True -
overlap_grad_reduce boolean False -
overlap_param_gather boolean False -
align_param_gather boolean False -
use_distributed_optimizer boolean True -
check_for_nan_in_grad boolean False -
bucket_size integer | None None -
average_in_collective boolean False -
fp8_param_gather boolean False -

(section-fp8-engine)=

FP8Engine Configuration

Configuration for FP8 (8-bit floating point) training.

This configuration encapsulates all FP8-related parameters and can be reused across different engines (e.g., Megatron, FSDP). When None in the parent config, FP8 training is disabled.

Parameter Type Default Description
mode string "e4m3" FP8 precision mode. Options: 'e4m3' (uniform e4m3), 'hybrid' (e4m3 for activations/weights, e5m2 for output activation gradients).
recipe string "delayed" FP8 scaling recipe. Options: 'tensorwise', 'delayed', 'mxfp8' (Blackwell only), 'blockwise'.
param boolean False Keep parameters in FP8 precision to save memory. Not all parameters will be converted to fp8; for example, biases will remain unchanged.
margin integer 0 Margin for FP8 scaling factor computation.
amax_history_len integer 1 Length of amax history window for scaling factor computation.
amax_compute_algo string "most_recent" Algorithm for choosing amax value. Options: 'max' (largest in history window), 'most_recent'.
wgrad boolean True When False, override FP8 config and compute weight gradients in higher precision.
dot_product_attention boolean False Use FP8 implementation of Dot Product Attention.
multi_head_attention boolean False Use FP8 implementation of Multi Head Attention.
tp_only_amax_red boolean False Reduce FP8 AMAX only in TP or TP-CP domain.
first_last_layers_bf16 boolean False Retain first and last N TransformerBlocks in BF16 instead of FP8.
num_layers_at_start_in_bf16 integer 1 Number of layers at start to keep in BF16 when first_last_layers_bf16 is True.
num_layers_at_end_in_bf16 integer 1 Number of layers at end to keep in BF16 when first_last_layers_bf16 is True.
direct_convert boolean True Whether to use direct FP8 conversion during weight updates and save/load. When True, FP8 parameters are directly converted between TE FP8 and PyTorch FP8 without intermediate dequantization/quantization.

(section-megatron-engine)=

MegatronEngine Configuration

Configuration for Megatron-LM training framework.

Refer to Megatron-LM documentation for implementation details.

Parameter Type Default Description
wrap_with_ddp boolean True -
use_torch_fsdp2 boolean False -
use_custom_fsdp boolean False -
ddp DistributedDataParallelConfig Required -
virtual_pipeline_parallel_size integer 1 Virtual pipeline parallel size for Megatron interleaved schedule. Set to >1 to enable VPP. Default is 1 (disabled).
overlap_param_gather_with_optimizer_step boolean False -
use_precision_aware_optimizer boolean False Enable precision-aware optimizer for Megatron. When using adam_bf16 optimizer type with Megatron Engine, this is automatically enabled with exp_avg_dtype=bfloat16 and exp_avg_sq_dtype=bfloat16.
main_grads_dtype string "float32" -
main_params_dtype string "float32" -
exp_avg_dtype string "float32" -
exp_avg_sq_dtype string "float32" -
async_save boolean False -
use_checkpoint_opt_param_scheduler boolean True -
use_deterministic_algorithms boolean False -
recompute_granularity string | None "full" -
recompute_method string | None "uniform" -
recompute_num_layers integer | None 1 -
distribute_saved_activations boolean | None None -
recompute_modules list of string | None None -
moe_router_dtype string | None "fp32" -
moe_shared_expert_overlap boolean False Enable overlapping between shared expert computations and dispatcher communications. Without this, the shared experts execute after the routed experts.
moe_enable_deepep boolean False -
moe_token_dispatcher_type string "alltoall" Type of token dispatcher. Options: 'allgather','alltoall' and 'flex'.
moe_permute_fusion boolean False Fuse token rearrangement ops during token dispatching.
fp8_config FP8EngineConfig | None None -

(section-open-ai-proxy)=

OpenAIProxy Configuration

Configuration for OpenAI proxy when using agent workflows.

Parameter Type Default Description
mode string "inline" OpenAI proxy mode: 'inline' (in-process), 'subproc' (subprocess), or 'online' (external user sessions for online RL training). inline mode runs the provided agent workflow directly in the same process. subproc mode launches a separate process to run the agent. online mode waits for external users to complete sessions via the proxy gateway URL, enabling online RL training. Choices: inline, subproc, online
tool_call_parser string "qwen" Parser for tool calls in model output.
reasoning_parser string "qwen3" Parser for reasoning content tags in model output.
chat_template_type string "hf" Chat template type: 'hf' (standard) or 'concat' (multi-turn concatenation). Choices: hf, concat
engine_max_tokens integer | None None Maximum total tokens for the engine (prompt + completion).
turn_discount float 1.0 Discount factor for multi-turn reward propagation.
export_style string "individual" Export style: 'individual' (all interactions) or 'concat' (leaf nodes only). The 'individual' style exports each interaction (input-output-reward) step separately, and treats them as independent samples to train the model. The 'concat' style exports only the final concatenated trajectory from the root. It is only suitable for linear conversation histories without token mismatching (whether valid depends on the tokenizer). Choices: individual, concat
subproc_max_workers integer 4 Maximum number of worker processes for subprocess mode execution pool.
session_timeout_seconds integer 3600 Session timeout in seconds. Sessions inactive longer than this will be garbage collected.
admin_api_key string "areal-admin-key" Admin API key for the proxy server. Used to authenticate management operations (grant_capacity, start_session). Cannot be used for chat completions. Each session gets a unique API key allocated via start_session. WARNING: Change this from the default for non-local deployments.
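
A proxy section for subprocess-based agent workflows might be sketched as follows (the `openai_proxy` key name is an assumption; remember to replace the admin key outside local deployments):

```yaml
openai_proxy:               # hypothetical key name
  mode: subproc             # run the agent workflow in a separate process
  tool_call_parser: qwen
  reasoning_parser: qwen3
  export_style: individual  # train on each interaction step independently
  admin_api_key: change-me  # never keep the default for non-local deployments
```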

(section-perf-tracer)=

PerfTracer Configuration

Configuration for perf tracer emission.

Parameter Type Default Description
experiment_name string Required -
trial_name string Required -
fileroot string Required -
enabled boolean False Explicitly enable or disable perf tracing. Set to true to capture perf traces.
save_interval integer 1 Flush trace events to disk every N calls to save(step=...). A value of 1 writes on every step; values <= 0 fall back to 1.
profile_steps list of integer | None None List of step numbers at which to capture detailed profiling traces. If None, no detailed profiling traces are captured.
session_tracer SessionTracerConfig | None None Session tracing configuration.

(section-scheduler)=

Scheduler Configuration

Configuration for worker scheduling. Used in the single-controller mode. Experimental.

Parameter Type Default Description
type string | None None -
endpoint string "http://localhost:8081" -
deploy_mode string "separation" -
functioncall_service_domain string "http://localhost:8080" -
reward_functioncall_config dict Required -
reward_model_path string "" -
reward_model_service_url string "http://localhost:30000/classify" -

(section-scheduling)=

Scheduling Specification

Configuration class: SchedulingSpec

Parameter Type Default Description
cpu integer 8 Number of CPU cores required per GPU
gpu integer 0 Number of GPU units required. Used only when allocating pods.
mem integer 32 Amount of memory (GB) required per GPU
port_count integer 2 Number of ports to expose
image string "/storage/openpsi/images/areal-latest.sif" Docker/Singularity container image to use. Currently only used by Slurm. Will be potentially used by Kubernetes in the future.
task_type string "worker" Task type (e.g., worker, engine). Choices: worker, engine
env_vars dict Required Environment variables for the container
cmd string | None None Command to execute inside the container. Defaults to AReaL's RPC server.
srun_additional_args string "--unbuffered --mpi=pmi2 -K --chdir $PWD" Additional arguments to pass to the srun command. Only used by slurm.
additional_bash_cmds list of string | None None Additional bash commands to setup the container before running the torchrun command. Only used by slurm.
container_type string "apptainer" Type of containers used in slurm. Choices: apptainer, none
mount string "/storage:/storage" Mount path for slurm.
nodelist string | None None sbatch/srun's --nodelist option for slurm.
exclude string | None None sbatch/srun's --exclude option for slurm.
ray_placement_strategy string "shared" Which placement strategy to use for Ray scheduling. Shared will produce 1 placement group for all workers in the role (training). Separate will produce 1 placement group per worker (rollout). Deferred will do the same as separate but defers accelerator scheduling (multinode rollout). Choices: shared, separate, deferred
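
A Slurm-oriented scheduling sketch (key name assumed; the nodelist value is only an example):

```yaml
scheduling:                  # hypothetical key name
  cpu: 8                     # CPU cores per GPU
  mem: 64                    # memory (GB) per GPU
  container_type: apptainer
  image: /storage/openpsi/images/areal-latest.sif
  nodelist: node[01-04]      # slurm --nodelist (example value)
```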

(section-scheduling-strategy)=

SchedulingStrategy

Configuration class: SchedulingStrategy

Parameter Type Default Description
type string "separation" - Choices: separation, colocation
target string | None None The target role to be colocated with
fork boolean True When True with colocation, the target worker spawns a new process on the same node/GPUs instead of sharing its process. Provides process isolation while sharing GPU resources.

(section-session-tracer)=

SessionTracer Configuration

Configuration for per-session lifecycle tracing.

Parameter Type Default Description
enabled boolean False Enable per-session lifecycle tracing alongside perf events. When true, session metadata is captured to sessions.jsonl.
flush_threshold integer 256 Flush session trace records once this many entries are ready. Values <= 0 fall back to 1.
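
Tying the last two sections together: session tracing nests under the perf tracer via its `session_tracer` field. A hedged sketch (the `perf_tracer` key name is an assumption):

```yaml
perf_tracer:
  enabled: true
  save_interval: 1           # flush trace events on every save(step=...) call
  profile_steps: [10, 100]   # capture detailed profiling traces at these steps
  session_tracer:
    enabled: true            # write session metadata to sessions.jsonl
    flush_threshold: 256     # flush once 256 records are ready
```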