
Commit 524d1d9

HwVanICI authored
VLM Training on NPU (#746)
* VLM training example on npu

* Update examples/vlm_npu/geometry3k_grpo.sh

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Update examples/vlm_npu/README.md

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

---------

Co-authored-by: HwVanIC <root@devserver-hps-e81c1f83-00002.novalocal>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: HwVanICI <m00926705@china.huawei.com>
1 parent e157d2f commit 524d1d9

4 files changed

Lines changed: 283 additions & 0 deletions

File tree

examples/vlm_npu/README.md
examples/vlm_npu/geometry3k_grpo.py
examples/vlm_npu/geometry3k_grpo.sh
examples/vlm_npu/geometry3k_grpo.yaml

examples/vlm_npu/README.md

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
# Training VLMs with GRPO on NPU

In this guide, we introduce how to train VLMs with GRPO on NPU.
### Hardware

The following hardware configuration has been extensively tested:

- **NPU**: 16x NPU per node
- **CPU**: 64 cores per node
- **Memory**: 1TB per node
- **Network**: RoCE 3.2 Tbps
- **Storage**:
  - 1TB local storage for single-node experiments
  - 10TB shared storage (NAS) for distributed experiments
### Key Contributions

- Trained the Qwen2.5-VL-3B-Instruct model for up to 70 epochs with a (4 cards + 4 cards) train-infer configuration; the full training run took around 19 hours.
- Evaluated the trained model on multiple benchmarks using VLMEvalKit.
### System configuration

- vllm==0.11.0; vllm-ascend==0.11.0rc0
- torch==2.7.1+cpu; torch_npu==2.7.1.dev20250724
- AReaL==0.4.1
- CANN==8.1.RC1; 910C NPUs (65 GB x 16)
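As a quick sanity check before launching, the stack above can be verified by printing the installed versions and confirming the NPU backend is visible. This is a minimal sketch assuming `torch_npu` and `vllm-ascend` are installed as listed; it is not part of the example itself:

```python
# Minimal environment sanity check (assumes torch_npu and vllm-ascend are installed).
import importlib.metadata as md

import torch
import torch_npu  # noqa: F401 -- importing registers the "npu" device backend with torch

print("torch:", torch.__version__, "| torch_npu:", torch_npu.__version__)
print("vllm:", md.version("vllm"), "| vllm-ascend:", md.version("vllm-ascend"))
print("NPU available:", torch.npu.is_available(), "| devices:", torch.npu.device_count())
```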
### Results

We trained Qwen2.5-VL-3B for 70 epochs on Geometry3K and evaluated the checkpoints using VLMEvalKit on out-of-distribution tasks such as MathVision, MathVista, and LogicVista. The training was performed on both NPU and GPU, with the following results:

| Method | LogicVista | MathVision_mini | MathVista_mini | Avg. |
|------------|------------|------------------|----------------|-------|
| Base Model | 31.0 | 18.3 | 52.3 | 33.8 |
| GRPO-GPU | 35.4 | 20.9 | 55.9 | **37.4** |
| GRPO-NPU | 35.3 | 20.5 | 54.7 | **36.8** |
examples/vlm_npu/geometry3k_grpo.py

Lines changed: 84 additions & 0 deletions
@@ -0,0 +1,84 @@
import os
import re
import sys

from mathruler.grader import extract_boxed_content, grade_answer

from areal.api.cli_args import GRPOConfig, load_expr_config
from areal.dataset import get_custom_dataset
from areal.experimental.trainer import PPOTrainer
from areal.utils.hf_utils import load_hf_processor_and_tokenizer
from areal.utils.stats_logger import StatsLogger
from areal.workflow.vision_rlvr import VisionRLVRWorkflow


def format_reward(predict_str: str) -> float:
    pattern = re.compile(r"<think>.*</think>.*\\boxed\{.*\}.*", re.DOTALL)
    match_result = re.fullmatch(pattern, predict_str)
    return 1.0 if match_result else 0.0


def acc_reward(predict_str: str, ground_truth: str) -> float:
    answer = extract_boxed_content(predict_str)
    return 1.0 if grade_answer(answer, ground_truth) else 0.0


def geometry3k_reward_fn(
    prompt, completions, prompt_ids, completion_ids, answer, **kwargs
):
    format_reward_val = format_reward(completions)
    acc_reward_val = acc_reward(completions, answer)
    format_score = 0.1
    score = (1.0 - format_score) * (acc_reward_val) + format_score * format_reward_val
    return score


def main(args):
    config, _ = load_expr_config(args, GRPOConfig)
    processor, tokenizer = load_hf_processor_and_tokenizer(config.tokenizer_path)

    train_dataset = get_custom_dataset(
        split="train",
        dataset_config=config.train_dataset,
        tokenizer=tokenizer,
        processor=processor,
    )

    valid_dataset = get_custom_dataset(
        split="test",
        dataset_config=config.valid_dataset,
        tokenizer=tokenizer,
        processor=processor,
    )

    with PPOTrainer(
        config,
        train_dataset=train_dataset,
        valid_dataset=valid_dataset,
    ) as trainer:
        workflow = VisionRLVRWorkflow(
            reward_fn=geometry3k_reward_fn,
            gconfig=config.gconfig,
            tokenizer=trainer.tokenizer,
            processor=trainer.processor,
            enable_thinking=False,
            dump_dir=os.path.join(
                StatsLogger.get_log_path(config.stats_logger), "generated"
            ),
        )
        eval_workflow = VisionRLVRWorkflow(
            reward_fn=geometry3k_reward_fn,
            gconfig=config.gconfig.new(temperature=0.6),
            tokenizer=trainer.tokenizer,
            processor=trainer.processor,
            enable_thinking=False,
            rollout_stat_scope="eval-rollout",
            dump_dir=os.path.join(
                StatsLogger.get_log_path(config.stats_logger), "generated-eval"
            ),
        )
        trainer.train(workflow, eval_workflow)


if __name__ == "__main__":
    main(sys.argv[1:])
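As a quick illustration of the reward shaping above, the snippet below sketches a hypothetical call to `geometry3k_reward_fn` (the completion string and answer are made up, and the definitions from `geometry3k_grpo.py` are assumed to be in scope):

```python
# Hypothetical reward check; assumes geometry3k_reward_fn from the file above is in scope.
completion = "<think>Apply the Pythagorean theorem: 5^2 + 12^2 = 169.</think> \\boxed{13}"
score = geometry3k_reward_fn(
    prompt=None,
    completions=completion,
    prompt_ids=None,
    completion_ids=None,
    answer="13",
)
# acc_reward == 1.0 and format_reward == 1.0, so score == 0.9 * 1.0 + 0.1 * 1.0 == 1.0.
# A correct \boxed answer without the <think>...</think> tags would score 0.9 instead.
print(score)
```

The accuracy term dominates (weight 0.9), while a well-formed `<think>...</think>` plus `\boxed{}` response earns a small format bonus (weight 0.1).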
examples/vlm_npu/geometry3k_grpo.sh

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
export USE_OPTIMIZED_MODEL=0
# Some models are optimized by vllm-ascend. In some cases, e.g. RLHF training,
# the optimized model may not be suitable; set this value to 0 to disable it.

python -m areal.launcher.local \
    examples/vlm_npu/geometry3k_grpo.py --config examples/vlm_npu/geometry3k_grpo.yaml
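The `USE_OPTIMIZED_MODEL=0` export has to be visible to the vllm-ascend processes the launcher spawns. If it is more convenient to drive the launch from Python rather than the shell script above, a hedged equivalent (a sketch, not part of the example) is:

```python
# Hypothetical Python equivalent of the shell script above: the spawned launcher and
# inference-server processes inherit USE_OPTIMIZED_MODEL=0 from the environment passed here.
import os
import subprocess

env = dict(os.environ, USE_OPTIMIZED_MODEL="0")  # disable vllm-ascend's optimized model path
subprocess.run(
    [
        "python", "-m", "areal.launcher.local",
        "examples/vlm_npu/geometry3k_grpo.py",
        "--config", "examples/vlm_npu/geometry3k_grpo.yaml",
    ],
    env=env,
    check=True,
)
```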
examples/vlm_npu/geometry3k_grpo.yaml

Lines changed: 159 additions & 0 deletions
@@ -0,0 +1,159 @@
experiment_name: geometry3k-grpo
trial_name: trial1

seed: 1
enable_offload: false
total_train_epochs: 3
tokenizer_path: ${actor.path}

cluster:
  n_nodes: 1
  n_gpus_per_node: 8
  fileroot: /tmp/areal/experiments
  name_resolve:
    type: nfs
    nfs_record_root: /tmp/areal/name_resolve

allocation_mode: vllm:d4p1t1+d4p1t1

rollout:
  experiment_name: ${experiment_name}
  trial_name: ${trial_name}
  max_concurrent_rollouts: 256
  queue_size: null
  consumer_batch_size: ${train_dataset.batch_size}
  max_head_offpolicyness: 4
  enable_rollout_tracing: false

gconfig:
  n_samples: 4
  min_new_tokens: 0
  max_new_tokens: 512
  greedy: false
  temperature: 1.0

actor:
  experiment_name: ${experiment_name}
  trial_name: ${trial_name}
  path: Qwen/Qwen2.5-VL-3B-Instruct
  init_from_scratch: false
  disable_dropout: true
  gradient_checkpointing: true
  dtype: bfloat16
  mb_spec:
    max_tokens_per_mb: 4096
  optimizer:
    type: adam
    lr: 2e-6
    weight_decay: 0.01
    beta1: 0.9
    beta2: 0.999
    eps: 1e-8
    lr_scheduler_type: constant
    gradient_clipping: 1.0
    warmup_steps_proportion: 0.001
  group_size: ${gconfig.n_samples}
  eps_clip: 0.4
  temperature: ${gconfig.temperature}
  reward_scaling: 10.0
  reward_bias: -0.5
  kl_ctl: 0.0
  ppo_n_minibatches: 1
  recompute_logprob: true
  use_decoupled_loss: true
  behav_imp_weight_cap: 5.0
  dynamic_sampling: false
  reward_norm:
    mean_level: group
    std_level: group
    group_size: ${gconfig.n_samples}
  adv_norm:
    mean_level: batch
    std_level: batch
  max_new_tokens: ${gconfig.max_new_tokens}

ref:
  experiment_name: ${experiment_name}
  trial_name: ${trial_name}
  path: ${actor.path}
  init_from_scratch: false
  disable_dropout: true
  dtype: ${actor.dtype}
  mb_spec:
    max_tokens_per_mb: 4096
  optimizer: null

# SGLang
sglang:
  model_path: ${actor.path}
  random_seed: ${seed}
  skip_tokenizer_init: true
  dtype: ${actor.dtype}
  max_running_requests: 64
  context_length: 32768
  mem_fraction_static: 0.8
  enable_multimodal: true

vllm:
  model: ${actor.path}
  seed: ${seed}
  skip_tokenizer_init: false
  dtype: ${actor.dtype}
  max_model_len: 32768
  gpu_memory_utilization: 0.9
  disable_sliding_window: false

# datasets
train_dataset:
  batch_size: 32
  pin_memory: true
  num_workers: 4
  path: hiyouga/geometry3k
  type: rl

valid_dataset:
  batch_size: 32
  pin_memory: true
  num_workers: 4
  path: hiyouga/geometry3k
  type: rl

# Utilities
saver:
  experiment_name: ${experiment_name}
  trial_name: ${trial_name}
  fileroot: ${cluster.fileroot}
  freq_epochs: 1
  freq_steps: null
  freq_secs: null

recover:
  mode: disabled
  experiment_name: ${experiment_name}
  trial_name: ${trial_name}
  fileroot: ${cluster.fileroot}
  freq_epochs: 1
  freq_steps: null
  freq_secs: 3600

evaluator:
  experiment_name: ${experiment_name}
  trial_name: ${trial_name}
  fileroot: ${cluster.fileroot}
  freq_epochs: 1
  freq_steps: null
  freq_secs: null

stats_logger:
  experiment_name: ${experiment_name}
  trial_name: ${trial_name}
  fileroot: ${cluster.fileroot}
  wandb:
    mode: disabled

launcher:
  inference_server_cpus_per_gpu: 4
  inference_server_mem_per_gpu: 32768
  trainer_cpus_per_gpu: 4
  trainer_mem_per_gpu: 32768
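For a quick look at what the trainer will actually receive, the config can presumably be loaded standalone through `load_expr_config`, mirroring `main()` in `geometry3k_grpo.py` above. This is a hedged sketch; it assumes the function accepts the same `--config` argument that the launcher forwards to the training script:

```python
# Hypothetical standalone check of the YAML above; mirrors the config loading in main().
from areal.api.cli_args import GRPOConfig, load_expr_config

config, _ = load_expr_config(
    ["--config", "examples/vlm_npu/geometry3k_grpo.yaml"], GRPOConfig
)

# 4 inference cards + 4 training cards, matching the (4 + 4) setup described in the README.
print("allocation_mode:", config.allocation_mode)             # vllm:d4p1t1+d4p1t1
print("rollouts per prompt:", config.gconfig.n_samples)       # 4
print("train batch size:", config.train_dataset.batch_size)   # 32
```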
