modelscope · hjh0119 · Nov 14, 2025 · Aug 29, 2025 · Sep 1, 2025 · Sep 1, 2025
diff --git a/docs/source/Instruction/Command-line-parameters.md b/docs/source/Instruction/Command-line-parameters.md
@@ -606,6 +606,8 @@ reward模型参数将在PPO、GRPO中使用。
 - top_entropy_quantile: 仅对熵值处于前指定分位的 token 参与损失计算，默认为1.0，即不过滤低熵 token，具体参考[文档](./GRPO/AdvancedResearch/entropy_mask.md)
 - log_entropy: 记录训练中的熵值变化动态，默认为False，具体参考[文档](./GRPO/GetStarted/GRPO.md#logged-metrics)
 
+##### 奖励函数参数
+内置的奖励函数参考[文档](./GRPO/DeveloperGuide/reward_function.md)
 cosine 奖励参数
 - cosine_min_len_value_wrong：cosine 奖励函数参数，生成错误答案时，最小长度对应的奖励值。默认值为-0.5。
 - cosine_max_len_value_wrong：生成错误答案时，最大长度对应的奖励值。默认值为0.0。

diff --git a/docs/source/Instruction/GRPO/AdvancedResearch/GSPO.md b/docs/source/Instruction/GRPO/AdvancedResearch/GSPO.md
@@ -54,7 +54,7 @@ importance_weights = torch.exp(log_importance_weights)
 - `importance_sampling_level sequence` （GSPO）
 - `importance_sampling_level sequence_token` （GSPO-token）
 
-其中 sequence_token 要求 ms-swift > 3.7 （源码安装）
+其中 sequence_token 要求 ms-swift >= 3.8
 
 论文其他超参
 ```bash

diff --git a/docs/source/Instruction/Use-tuners.md b/docs/source/Instruction/Use-tuners.md
@@ -15,7 +15,7 @@ tuner是指附加在模型上的额外结构部分，用于减少训练参数量
 - Adapter: [Parameter-Efficient Transfer Learning for NLP](http://arxiv.org/abs/1902.00751)
 - Vision Prompt Tuning: [Visual Prompt Tuning](https://arxiv.org/abs/2203.12119)
 - Side: [Side-Tuning: A Baseline for Network Adaptation via Additive Side Networks](https://arxiv.org/abs/1912.13503)
-- Res-Tuning: [Res-Tuning: A Flexible and Efficient Tuning Paradigm via Unbinding Tuner from Backbone](https://arxiv.org/abs/2310.19859)  < [arXiv](https://arxiv.org/abs/2310.19859)  |  [Project Page](https://res-tuning.github.io/)  |  [Usage](ResTuning.md) >
+- Res-Tuning: [Res-Tuning: A Flexible and Efficient Tuning Paradigm via Unbinding Tuner from Backbone](https://arxiv.org/abs/2310.19859)  < [arXiv](https://arxiv.org/abs/2310.19859)  |  [Project Page](https://res-tuning.github.io/) >
 - [PEFT](https://github.com/huggingface/peft)提供的tuners, 如AdaLoRA、DoRA、Fourierft等
 
 ## 接口列表

diff --git a/docs/source/Megatron-SWIFT/Command-line-parameters.md b/docs/source/Megatron-SWIFT/Command-line-parameters.md
@@ -246,29 +246,6 @@ lora训练：
 - lora_bias: 默认为`'none'`，可以选择的值: 'none'、'all'。如果你要将bias全都设置为可训练，你可以设置为`'all'`。
 - use_rslora: 默认为`False`，是否使用`RS-LoRA`。
 
-
-**DPO参数**:
-- ref_load: ref_model的加载路径。采用DPO/KTO算法且使用全参数训练时需要传入。默认为None，即设置为`load`。
-- ref_adapter_load: 加载ref_adapter的权重路径，默认为None。若你要使用SFT产生的LoRA权重进行DPO，请使用"ms-swift>=3.8"，并在训练时设置`--adapter_load sft_ckpt --ref_adapter_load sft_ckpt --finetune true`。若是此场景的断点续训，则设置`--adapter_load rlhf_ckpt --ref_adapter_load sft_ckpt --finetune false`。
-- beta: 含义与[TRL](https://huggingface.co/docs/trl/main/en/dpo_trainer#trl.DPOConfig)相同。控制与参考模型偏差程度的参数。beta值越高，表示与参考模型的偏差越小。对于 IPO 损失函数 (loss_type="ipo")，beta是[论文](https://huggingface.co/papers/2310.12036)中所指的正则化参数。默认为0.1。
-- 🔥rpo_alpha: 来自[RPO 论文](https://huggingface.co/papers/2404.19733)中的参数，用于控制损失函数中NLL项的权重（即SFT损失），`loss = dpo_loss + rpo_alpha * sft_loss`，论文中推荐设置为`1.`。默认为`None`，即默认不引入sft_loss。
-  - **注意**：在"ms-swift<3.8"，其默认值为`1.`。在"ms-swift>=3.8"该默认值修改为`None`。
-- reference_free: 是否忽略提供的参考模型，并隐式地使用一个对所有响应赋予相等概率的参考模型。默认为False。
-- label_smoothing: 默认为0.。
-- f_divergence_type: 默认为`reverse_kl`。可选值参考[TRL文档](https://huggingface.co/docs/trl/main/en/dpo_trainer)。
-- loss_type: 默认为'sigmoid'。可选值参考[TRL文档](https://huggingface.co/docs/trl/main/en/dpo_trainer#loss-functions)。
-
-**KTO参数**:
-- ref_load: 含义同DPO。
-- ref_adapter_load: 含义同DPO。
-- beta: 控制与 ref_model 偏离程度的参数。较高的 beta 表示与 ref_model 偏离更小。默认为`0.1`。
-- loss_type: 默认为'kto'。可选值参考[TRL文档](https://huggingface.co/docs/trl/main/en/kto_trainer#trl.KTOConfig.loss_type)。
-- desirable_weight: 抵消 desirable 和 undesirable 数量不均衡的影响，对 desirable 损失按该系数进行加权，默认为`1.`。
-- undesirable_weight: 抵消 desirable 和 undesirable 数量不均衡的影响，对 undesirable 损失按该系数进行加权，默认为`1.`。
-
-**RM参数**:
-- center_rewards_coefficient: 用于激励奖励模型输出均值为零的奖励的系数，具体查看这篇[论文](https://huggingface.co/papers/2312.09244)。推荐值：0.01。
-
 **Mcore-Bridge参数**
 - 🔥load_safetensors: 默认为False，是否直接从safetensors加载权重。
 - 🔥save_safetensors: 默认为False，是否直接保存成safetensors权重。注意，若该参数设置为True，则不会存储优化器权重、随机数状态等断点续训内容。
@@ -323,6 +300,74 @@ Megatron训练参数继承自Megatron参数和基本参数（**与ms-swift共用
 - calculate_per_token_loss: 覆盖Megatron参数，默认为False。
 
 
+### DPO参数
+- ref_load: ref_model的加载路径。采用DPO/GRPO/KTO算法且使用全参数训练时需要传入。默认为None，即设置为`load`。
+- ref_adapter_load: 加载ref_adapter的权重路径，默认为None。若你要使用SFT产生的LoRA权重进行DPO，请使用"ms-swift>=3.8"，并在训练时设置`--adapter_load sft_ckpt --ref_adapter_load sft_ckpt --finetune true`。若是此场景的断点续训，则设置`--adapter_load rlhf_ckpt --ref_adapter_load sft_ckpt --finetune false`。
+- beta: 含义与[TRL](https://huggingface.co/docs/trl/main/en/dpo_trainer#trl.DPOConfig)相同。控制与参考模型偏差程度的参数。beta值越高，表示与参考模型的偏差越小。对于 IPO 损失函数 (loss_type="ipo")，beta是[论文](https://huggingface.co/papers/2310.12036)中所指的正则化参数。默认为0.1。
+- 🔥rpo_alpha: 来自[RPO 论文](https://huggingface.co/papers/2404.19733)中的参数，用于控制损失函数中NLL项的权重（即SFT损失），`loss = dpo_loss + rpo_alpha * sft_loss`，论文中推荐设置为`1.`。默认为`None`，即默认不引入sft_loss。
+  - **注意**：在"ms-swift<3.8"，其默认值为`1.`。在"ms-swift>=3.8"该默认值修改为`None`。
+- reference_free: 是否忽略提供的参考模型，并隐式地使用一个对所有响应赋予相等概率的参考模型。默认为False。
+- label_smoothing: 默认为0.。
+- f_divergence_type: 默认为`reverse_kl`。可选值参考[TRL文档](https://huggingface.co/docs/trl/main/en/dpo_trainer)。
+- loss_type: 默认为'sigmoid'。可选值参考[TRL文档](https://huggingface.co/docs/trl/main/en/dpo_trainer#loss-functions)。
+
+### KTO参数
+- ref_load: 含义同DPO。
+- ref_adapter_load: 含义同DPO。
+- beta: 控制与 ref_model 偏离程度的参数。较高的 beta 表示与 ref_model 偏离更小。默认为`0.1`。
+- loss_type: 默认为'kto'。可选值参考[TRL文档](https://huggingface.co/docs/trl/main/en/kto_trainer#trl.KTOConfig.loss_type)。
+- desirable_weight: 抵消 desirable 和 undesirable 数量不均衡的影响，对 desirable 损失按该系数进行加权，默认为`1.`。
+- undesirable_weight: 抵消 desirable 和 undesirable 数量不均衡的影响，对 undesirable 损失按该系数进行加权，默认为`1.`。
+
+### RM参数
+- center_rewards_coefficient: 用于激励奖励模型输出均值为零的奖励的系数，具体查看这篇[论文](https://huggingface.co/papers/2312.09244)。推荐值：0.01。
+
+### GRPO参数
+- ref_load: 含义同DPO。
+- ref_adapter_load: 含义同DPO。
+- beta: KL正则系数，默认为0.04，设置为0时不加载ref model。
+- micro_batch_size: 每个device的批次大小，默认为1。
+- global_batch_size: 总批次大小，等价于`micro_batch_size*数据并行大小*梯度累加步数`。默认为16。
+- steps_per_generation：每轮生成的优化步数，即采样批量大小相对global_batch_size的倍数，默认为1。
+- generation_batch_size: 采样批量大小，需要是global_batch_size的倍数，默认等于global_batch_size*steps_per_generation。
+- num_generations: 每个prompt采样的数量，论文中的G值，默认为8。
+- reward_funcs: GRPO算法奖励函数，可选项为`accuracy`、`format`、`cosine`、`repetition`和`soft_overlong`，见swift/plugin/orm.py。你也可以在plugin中自定义自己的奖励函数。默认为`[]`。
+- reward_weights: 每个奖励函数的权重。必须与奖励函数和奖励模型的总数量匹配。默认为 None，即所有奖励的权重都相等，为`1.0`。
+  - 提示：如果GRPO训练中包含`--reward_model`，则其加在奖励函数的最后位置。
+- loss_type: loss 归一化的类型，可选项为['grpo', 'bnpo', 'dr_grpo'], 默认为'grpo', 具体查看该[pr](https://github.com/huggingface/trl/pull/3256#discussion_r2033213348)。
+- log_completions: 是否记录训练中的模型生成内容，默认为False。
+- vllm_mode: vLLM 集成模式，可选项为 `server` 和 `colocate`。server 模式使用 `swift rollout` 拉起的 vLLM 服务器进行采样，colocate 模式在程序内部署 vLLM。使用server端时，
+- vllm_mode server 参数
+  - vllm_server_base_url: vLLM server的Base URL(比如 http://local_host:8000), 默认为None。设置后，忽略host和port设置。
+  - vllm_server_host：vLLM server host地址，默认为None。
+  - vllm_server_port vLLM server 服务端口，默认为8000。
+  - vllm_server_timeout 连接vLLM server的超时时间，默认为 240s。
+  - vllm_server_pass_dataset: 透传额外的数据集信息到vLLM server，用于多轮训练。
+  - async_generate: 异步rollout以提高训练速度，注意开启时采样会使用上一轮更新的模型进行采样，不支持多轮场景。默认`false`.
+  - SWIFT_UPDATE_WEIGHTS_BUCKET_SIZE：环境变量，用于控制权重同步时的传输桶大小（bucket size），适用于 Server Mode 下的全参数训练，单位为 MB，默认值为 512 MB。
+- vllm_mode colocate 参数（更多参数支持参考[vLLM参数](#vLLM参数)。）
+  - vllm_gpu_memory_utilization: vllm透传参数，默认为0.9。
+  - vllm_max_model_len: vllm透传参数，默认为None。
+  - vllm_enforce_eager: vllm透传参数，默认为False。
+  - vllm_limit_mm_per_prompt: vllm透传参数，默认为None。
+  - vllm_enable_prefix_caching: vllm透传参数，默认为True。
+  - vllm_tensor_parallel_size: tp并行数，默认为`1`。
+  - vllm_enable_lora: 支持vLLM Engine 加载 LoRA adapter，默认为False。用于加速LoRA训练的权重同步，具体参考[文档](../Instruction/GRPO/GetStarted/GRPO.md#权重同步加速)。
+  - sleep_level: 训练时释放 vLLM 显存，可选项为[0, 1, 2], 默认为0，不释放。
+  - offload_optimizer: 是否在vLLM推理时offload optimizer参数，默认为False。
+  - offload_model: 是否在vLLM推理时 offload 模型，默认为False。
+- num_iterations: 每条数据的更新次数，[GRPO论文](https://arxiv.org/abs/2402.03300)中的 $\mu$ 值，默认为1。
+- epsilon: clip 系数，默认为0.2。
+- epsilon_high: upper clip 系数，默认为None，设置后与epsilon共同构成[epsilon, epsilon_high]裁剪范围。
+- dynamic_sample：筛除group内奖励标准差为0的数据，额外采样新数据，默认为False。
+- max_resample_times：dynamic_sample设置下限制重采样次数，默认3次。
+- overlong_filter：跳过超长截断的样本，不参与loss计算，默认为False。
+- delta: [INTELLECT-2 tech report](https://huggingface.co/papers/2505.07291)中双侧 GRPO 上界裁剪值。若设置，建议大于 1 + epsilon。默认为None。
+- importance_sampling_level: 控制重要性采样比计算，可选项为 `token` 和 `sequence`，`token` 模式下保留原始的每个 token 的对数概率比，`sequence` 模式下则会对序列中所有有效 token 的对数概率比进行平均。[GSPO论文](https://www.arxiv.org/abs/2507.18071)中使用sequence级别计算来稳定训练，默认为`token`。
+- scale_rewards：指定奖励的缩放策略。可选值包括 `group`（按组内标准差缩放）、`batch`（按整个批次的标准差缩放）、`none`（不进行缩放）。在 ms-swift < 3.10 版本中，该参数为布尔类型，`true` 对应 `group`，`false` 对应 `none`。默认值与 `advantage_estimator` 绑定：`grpo` 对应 `group`，`rloo` 对应 `none`，`reinforce_plus_plus` 对应 `batch`。
+
+内置奖励函数参数参考[文档](../Instruction/Command-line-parameters.md#奖励函数参数)
+
 ## 导出参数
 这里介绍`megatron export`的参数（需"ms-swift>=3.10"），若要使用`swift export`导出命令，请参考[ms-swift命令行参数文档](../Instruction/Command-line-parameters.md#导出参数)。`megatron export`相比`swift export`，支持分布式和多机导出。Megatron导出参数继承自Megatron参数和基本参数。
 - 🔥to_mcore: HF格式权重转成Megatron格式。默认为False。

diff --git a/docs/source/Megatron-SWIFT/GRPO.md b/docs/source/Megatron-SWIFT/GRPO.md
@@ -0,0 +1,61 @@
+# GRPO
+
+**版本依赖**：ms-swift >= 3.11
+
+如果你是首次使用 GRPO，请先参考 [GRPO文档](../Instruction/GRPO/GetStarted/GRPO.md)。
+
+Megatron GRPO 当前已支持以下功能：
+
+- **训练模式**：全参数训练与 LoRA 微调
+- **并行策略**：支持上下文并行（CP）、流水线并行（PP）、张量并行（TP）和专家并行（EP）
+- **推理加速**：支持 vLLM 的 colocate 模式和 server 模式
+- **模型支持**：兼容 Megatron Swift 中的 LLM 及 MLLM（多模态大模型）
+- **算法支持**：涵盖 swift GRPO 的大部分功能
+
+以下参数或功能将在后续版本中逐步支持：
+
+- **Entropy 相关配置**：如 `top_entropy_quantile`、`log_entropy`
+- **Reward Model / Reward Model Plugin**
+- **多轮 Rollout 调度机制**（`multi_turn_scheduler`）：实现多轮对话策略优化
+- **优势估计器**（`advantage_estimator`）：支持更复杂的策略梯度估计方法
+- **KL 散度计入奖励**（`kl_in_reward`）
+- **虚拟流水线并行**（VPP）
+- **参考模型同步更新**（`sync_ref_model`）
+- **Async Generate** (`async_generate`)
+- **num_iterations**
+- **日志同步 SwanLab**
+
+⚠️ 注意：以下参数在 Megatron GRPO 中不生效：
+
+- **`use_vllm`**：Megatron GRPO 暂不支持使用 PTEngine 进行 Rollout 推理。
+- **`move_model_batches`**：该参数专用于 DeepSpeed ZeRO-3 优化，在 Megatron 架构下无效。
+
+与 ms-swift GRPO 相同，Megatron GRPO batch size 相关的参数均以 **completion-level** 为单位，即表示模型生成的 completion 数量，而非 prompt 数量。
+
+#### 参数对比
+
+下表对比了 ms-swift 和 Megatron-SWIFT 中批量相关参数的对应关系：
+
+| ms-swift 参数 | Megatron-SWIFT 参数 | 说明 |
+|---------------|---------------------|------|
+| `per_device_train_batch_size` | `micro_batch_size` | 每张 GPU 的训练批次大小（completion-level） |
+| `gradient_accumulation_steps` | - | 梯度累积步数，在 Megatron-SWIFT 中已包含在 `global_batch_size` 的计算中 |
+| - | `global_batch_size` | 全局批次大小（completion-level）<br/>**Megatron-SWIFT**: `micro_batch_size × dp_size × gradient_accumulation_steps`<br/>**ms-swift**: `per_device_train_batch_size × world_size × gradient_accumulation_steps` |
+| `num_generations` | `num_generations` | 每个 prompt 生成的 completion 数量 |
+| `steps_per_generation` | `steps_per_generation` | Rollout 批次大小相对于训练批次大小的倍数<br/>**注意**：在 ms-swift 中需为 `gradient_accumulation_steps` 的整数倍 |
+| `generation_batch_size` | `generation_batch_size` | Rollout 阶段的批次大小（completion-level），需为 `global_batch_size` 的整数倍 |
+
+以下公式用于计算 Megatron GRPO 中的批量：
+
+- **数据并行大小**：`dp_size = world_size / (TP × PP × CP)`
+- **全局批次大小**：`global_batch_size = micro_batch_size × dp_size × gradient_accumulation_steps`
+- **生成批次大小**：`generation_batch_size = global_batch_size × steps_per_generation`
+- **Rollout Prompt 数量**：`num_rollout_prompts = generation_batch_size / num_generations`
+- **训练 Prompt 数量**：`num_train_prompts = global_batch_size / num_generations`
+- **每个 DP group 的训练 Prompt 数量**：`num_prompts_per_dp_group = global_batch_size / num_generations / dp_size`
+
+注意：在 Megatron GRPO 中，每个 DP group 的训练 Prompt 数量须满足 `num_prompts_per_dp_group` 是 `micro_batch_size`的整数倍，以确保训练批次能够正确分配。
+
+更多参数请参考[命令行文档](./Command-line-parameters.md#grpo参数)
+
+训练脚本请参考[Megatron GRPO 脚本](https://github.com/modelscope/ms-swift/blob/main/examples/megatron/grpo)
diff --git a/docs/source/index.rst b/docs/source/index.rst
@@ -42,6 +42,7 @@ Swift DOCUMENTATION
    Megatron-SWIFT/LoRA-Training.md
    Megatron-SWIFT/Multimodal-Model.md
    Megatron-SWIFT/Mcore-Bridge.md
+   Megatron-SWIFT/GRPO.md
 
 
 .. toctree::

diff --git a/docs/source_en/Instruction/Command-line-parameters.md b/docs/source_en/Instruction/Command-line-parameters.md
@@ -621,6 +621,8 @@ The hyperparameters for the reward function can be found in the [Built-in Reward
 - log_entropy: Logs the entropy values during training. The default is False. For more information, refer to the [documentation](./GRPO/GetStarted/GRPO.md#logged-metrics).
 
 
+##### Reward function parameters
+Refer to the [documentation](./GRPO/DeveloperGuide/reward_function.md) for built-in reward functions.
 
 cosine reward function arguments
 - cosine_min_len_value_wrong (default: -0.5): Reward value corresponding to the minimum length when the answer is incorrect.

diff --git a/docs/source_en/Instruction/GRPO/AdvancedResearch/GSPO.md b/docs/source_en/Instruction/GRPO/AdvancedResearch/GSPO.md
@@ -1,6 +1,6 @@
 # Group Sequence Policy Optimization
 
-**Version Requirement**: ms-swift>=3.7
+**Version Requirement**: ms-swift>=3.8
 
 In [Group Sequence Policy Optimization](https://www.arxiv.org/abs/2507.18071), it is pointed out that GRPO computes importance sampling weights at the token level. However, this approach is problematic: since each token is only sampled once, it cannot realize effective distribution correction, and instead introduces high-variance noise during training, which can easily lead to unstable gradient estimates and even training collapse. Therefore, the paper argues that the unit of the objective function should be consistent with that of the reward. Since the reward is typically given at the sequence level (i.e., for the entire generated response), it is more reasonable to perform off-policy correction and optimization at the sequence level rather than the token level.
 

diff --git a/docs/source_en/Instruction/Use-tuners.md b/docs/source_en/Instruction/Use-tuners.md
@@ -15,7 +15,7 @@ Tuners refer to additional structural components attached to a model, aimed at r
 - Adapter: [Parameter-Efficient Transfer Learning for NLP](http://arxiv.org/abs/1902.00751)
 - Vision Prompt Tuning: [Visual Prompt Tuning](https://arxiv.org/abs/2203.12119)
 - Side: [Side-Tuning: A Baseline for Network Adaptation via Additive Side Networks](https://arxiv.org/abs/1912.13503)
-- Res-Tuning: [Res-Tuning: A Flexible and Efficient Tuning Paradigm via Unbinding Tuner from Backbone](https://arxiv.org/abs/2310.19859)  < [arXiv](https://arxiv.org/abs/2310.19859)  |  [Project Page](https://res-tuning.github.io/)  |  [Usage](ResTuning.md) >
+- Res-Tuning: [Res-Tuning: A Flexible and Efficient Tuning Paradigm via Unbinding Tuner from Backbone](https://arxiv.org/abs/2310.19859)  < [arXiv](https://arxiv.org/abs/2310.19859)  |  [Project Page](https://res-tuning.github.io/) >
 - Tuners provided by [PEFT](https://github.com/huggingface/peft), such as AdaLoRA, DoRA, Fourierft, etc.
 
 ## Interface List