GPG: A Simple and Strong Reinforcement Learning Baseline for Model Reasoning

Xiangxiang Chu, Hailang Huang, Xiao Zhang, Fei Wei, Yong Wang
AMAP, Alibaba Group
{chuxiangxiang.cxx, huanghailang.hhl, xavier.zx,
xixia.wf, wangyong.lz}@alibaba-inc.com
Abstract

Reinforcement Learning (RL) can directly enhance the reasoning capabilities of large language models without extensive reliance on Supervised Fine-Tuning (SFT). In this work, we revisit the traditional Policy Gradient (PG) mechanism and propose a minimalist RL approach termed Group Policy Gradient (GPG). Unlike conventional methods, GPG directly optimizes the original RL objective, thus obviating the need for surrogate loss functions. As illustrated in Fig. 1, by eliminating both the critic and reference models and avoiding KL divergence constraints, our approach significantly simplifies the training process compared to Group Relative Policy Optimization (GRPO). Our approach achieves superior performance without relying on auxiliary techniques or adjustments. Extensive experiments demonstrate that our method not only reduces computational costs but also consistently outperforms GRPO across various unimodal and multimodal tasks. Our code is available at https://github.com/AMAP-ML/GPG.

Figure 1: Performance comparison of different RL methods and the base model across diverse tasks. The left panel presents results on four visual reasoning tasks, while the right one evaluates a unimodal task. GPG achieves superior performance on visual reasoning benchmarks, outperforming existing methods by a significant margin. Notably, GPG also maintains competitive results on the unimodal task, demonstrating its robustness and adaptability across multimodal and unimodal scenarios.

1 Introduction

Large Language Models (LLMs) have achieved substantial advancements, progressively narrowing the gap towards achieving Artificial General Intelligence (AGI) [36; 16; 2; 50; 53; 7]. Recently, LLMs exemplified by OpenAI o1 [36] and DeepSeek R1 [16] have adopted a strategy of generating intermediate reasoning steps before producing final answers. This approach has markedly improved their efficacy in domain-specific tasks [22; 13; 20; 25; 28; 21], such as mathematical reasoning. The remarkable success of this technology is mainly attributed to the Reinforcement Fine-Tuning (RFT) method [42; 43; 55; 27; 19]. Through the application of RFT, the models allocate additional time to “deliberate” prior to generating answers, thereby constructing intricate reasoning chains and subsequently enhancing overall model performance.

In contrast to Supervised Fine-Tuning (SFT), which trains models on fixed input-output pairs to mimic correct responses, RFT introduces an iterative process that incentivizes models to generate coherent and logically structured reasoning paths. RFT leverages RL techniques, such as Proximal Policy Optimization (PPO) [42] and GRPO [43], to optimize decision-making during the generation of intermediate steps. Specifically, PPO ensures stability by constraining policy updates, preventing new policies from deviating significantly from established behaviours. In contrast, GRPO enhances this process by evaluating performance across groups of actions, encouraging consistent improvements in reasoning quality. This dynamic, feedback-driven approach enables models to engage in deeper and more adaptive thinking, resulting in nuanced answers that better handle complex reasoning tasks compared to the more rigid, label-dependent training of SFT.

Despite the significant success of PPO in enhancing reasoning quality, it still suffers from the enormous resource consumption required during training. PPO necessitates the development and integration of both a critic model and a reference model, which not only complicates the training process but also substantially increases computational demands. Consequently, there is a growing trend toward simplifying the PPO method. For instance, ReMax [27] removes the critic model by introducing a baseline value, which reduces training GPU memory usage and accelerates training. Besides, GRPO eliminates the need for a critic model and utilizes normalized rewards within a sample group. Furthermore, DAPO [55] removes the KL divergence from GRPO and enhances the algorithm by introducing techniques such as “clip-higher” and “dynamic sampling”. Additionally, REINFORCE++ [19] incorporates key optimization strategies from PPO while discarding the critic network, thus simplifying the training process and improving stability. A very recent and concurrent work [30] studies the details of reward and loss normalization and observes that GRPO tends to generate more tokens.

A thorough examination of the evolution within the Reinforcement Learning (RL) community reveals that the widely adopted PPO algorithm essentially functions as a conservative surrogate for the original RL problem [42]. Existing enhancements to PPO continue to focus on optimizing the surrogate loss function. Consequently, two fundamental questions remain unresolved:

  • Is it feasible to transcend this intermediate strategy and directly optimize the original problem?

  • If this is achievable, to what extent can the learning strategy be streamlined?

This paper endeavors to provide a comprehensive exploration of these critical questions. In summary, our key contributions are as follows:

  • We revisit the design of the policy gradient algorithm [44] and propose a simple RL method that keeps minimal RL components and directly optimizes the objective instead of a surrogate loss.

  • Our approach eschews the necessity for both a critic model and a reference model. Moreover, it imposes no distributional constraints. These characteristics confer substantial advantages for potential scalability.

  • Extensive experiments demonstrate that GPG achieves superior performance to GRPO across various unimodal and multimodal visual tasks, while significantly reducing computational costs.

Our code and implementation details are open-sourced, contributing to the development of the community.

2 Related Work

Large Model Reasoning. Recent advancements in both LLM and Multimodal Large Language Model (MLLM) have increasingly focused on enabling models to simulate human-like, stepwise reasoning processes. In the field of LLMs, researchers have pioneered methods such as Chain-of-Thought (CoT) prompting [36; 48; 23; 39; 56; 33; 54], Tree-of-Thought [52], Monte Carlo Tree Search [12; 51; 46], and the construction of complex SFT datasets [33], to enhance performance in reasoning tasks. Notably, approaches such as DeepSeek-R1 [16] have employed large-scale RL with format-specific and result-oriented reward functions, guiding LLMs toward self-emerging, human-like, complex CoT reasoning with significant performance improvements in challenging reasoning tasks. Meanwhile, MLLMs, which convert inputs from various modalities into a unified LLM vocabulary representation space for processing, have exhibited superior performance in vision understanding tasks [2; 50; 53; 7; 29; 5; 14; 35; 47]. Building on the advancements in LLM reasoning, there has been a collective effort within the research community to apply the DeepSeek-R1 methodology to MLLMs to enhance their visual reasoning capabilities, yielding remarkable progress [57; 31; 4; 59].

Reinforcement Learning. RL has driven significant progress in sequential decision-making, with policy gradient methods being fundamental to optimizing stochastic policies. The REINFORCE algorithm [49] established early principles for gradient-based policy updates in trajectory-driven tasks. However, its high variance has posed challenges for scalability. To address this, subsequent research has focused on stabilizing policy optimization processes. Trust Region Policy Optimization (TRPO) [40] introduced constrained updates via quadratic approximations to ensure monotonic improvement. This was further refined by PPO [42], which employed clipped objective functions to simplify the optimization process. Subsequent studies have sought to enhance the PPO algorithm [6; 58] or elaborate on its implementation [10]. PPO has achieved widespread use in language model alignment and robotic control. However, the algorithm’s dependence on conservative policy updates or heuristic clipping thresholds can undermine its exploration potential in favor of stability, which poses a significant challenge in complex domains requiring dynamic strategy adaptation.

3 Method

3.1 Preliminary and Task Formulation

RL is a computational approach to learning through interaction, where an agent seeks to maximize cumulative rewards by selecting optimal actions within an environment. The RL problem is typically defined by a policy $\pi_{\theta}$, which maps states to actions, and aims to optimize the expected return. The core idea behind policy gradient methods is to use gradient ascent to iteratively adjust the policy parameters. The learning objective is maximizing the return $\mathcal{J}(\theta)$,

\mathcal{J}(\theta) = \max_{\theta} \mathbb{E}_{\pi_{\theta}}\left[\sum_{t=0}^{T} r_{t}\right]. \quad (1)

The policy gradient theorem [44] proves that the above problem can be converted into estimating the gradient,

\nabla_{\theta}\mathcal{J}(\theta) = \mathbb{E}_{\pi_{\theta}}\left[\nabla_{\theta}\log\pi_{\theta}(a_{t}\mid s_{t})\, Q^{\pi_{\theta}}(s_{t},a_{t})\right], \quad (2)

where $Q^{\pi_{\theta}}(s_{t},a_{t})$ is the action-value function, representing the expected return when taking action $a_{t}$ in state $s_{t}$ and following policy $\pi_{\theta}$ thereafter.

To reduce the variance, the advantage function $A^{\pi_{\theta}}(s_{t},a_{t})$ is often used, leading to the policy gradient update rule:

\nabla_{\theta}\mathcal{J}(\theta) = \mathbb{E}_{\pi_{\theta}}\left[\nabla_{\theta}\log\pi_{\theta}(a_{t}\mid s_{t})\, A^{\pi_{\theta}}(s_{t},a_{t})\right]. \quad (3)

One-step advantage estimation can be mathematically formulated as:

A^{\pi_{\theta}}(s_{t},a_{t}) = Q^{\pi_{\theta}}(s_{t},a_{t}) - V^{\pi_{\theta}}(s_{t}), \quad (4)

where $V^{\pi_{\theta}}(s_{t})$ denotes the value function of the critic model, which represents the expected return when starting from state $s_{t}$ and following policy $\pi_{\theta}$. While GAE [41] offers a more sophisticated approach to balance bias and variance in advantage estimation, we find that in the context of model reasoning, one-step estimation is sufficiently effective for achieving good performance. This simplicity is particularly advantageous in scenarios where computational efficiency is paramount.
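For concreteness, Eq. (3) and Eq. (4) correspond to the following minimal PyTorch-style sketch; the function names and tensor shapes are illustrative assumptions rather than part of our implementation.

```python
import torch

def one_step_advantage(q_values: torch.Tensor, state_values: torch.Tensor) -> torch.Tensor:
    """Eq. (4): A(s_t, a_t) = Q(s_t, a_t) - V(s_t), with V predicted by a critic."""
    return q_values - state_values

def policy_gradient_loss(log_probs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """Monte-Carlo estimate of Eq. (3). Minimizing this loss performs gradient
    ascent on E[log pi_theta(a_t | s_t) * A(s_t, a_t)].
    log_probs:  (N,) log pi_theta(a_t | s_t) of the sampled actions
    advantages: (N,) advantage estimates, treated as constants (no gradient)
    """
    return -(log_probs * advantages.detach()).mean()
```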

Given a sequence of questions and instructions, the model is tasked with generating corresponding answers. Subsequently, rewards are returned based on predefined reward models or hand-crafted rules. Our objective is to leverage these reward signals to optimize our policy, thereby enhancing the model’s ability to generate accurate and contextually appropriate responses.

However, designing or obtaining accurate rewards for intermediate steps is nontrivial [16]. To address this challenge, we simplify our problem as follows. Given a question and prompt $s$, we sample an action $a$ from policy $\pi_{\theta}$ and obtain a final reward signal $r$. Note that the policy distribution $\pi_{\theta}$ is modeled in an autoregressive manner. In this setting, we can leverage policy gradient methods to optimize the policy.

3.2 Group Policy Gradient

Our proposed method, Group Policy Gradient (GPG), is designed to address the issue of high variance in policy gradient estimation in the absence of a value model. By leveraging group-level rewards, GPG stabilizes learning and enhances the robustness of reinforcement learning training. Specifically, GPG normalizes the rewards within each group using the group-level statistics, thereby effectively reducing variance. This approach eliminates the need for a traditional value model, simplifying the training process and enhancing computational efficiency. The name "Group Policy Gradient" reflects our method’s core mechanism of utilizing group-level rewards to stabilize and optimize learning.

The core objective of GPG is defined as:

\mathcal{J}_{\text{GPG}}(\theta) = \mathbb{E}_{(q,a)\sim\mathcal{D},\,\{o_{i}\}_{i=1}^{G}}\left[\frac{1}{\sum_{i=1}^{G}|o_{i}|}\sum_{i=1}^{G}\sum_{t=1}^{|o_{i}|}\left(-\log\pi_{\theta}(o_{i,t}\mid q,o_{i,<t})\,\hat{A}_{i,t}\right)\right], \quad (5)

where $o_{i}$ represents the individual responses within a group of size $G$, and the advantage of the $i$-th response is calculated by normalizing the group-level rewards $\{R_{i}\}_{i=1}^{G}$:

\hat{A}_{i,t} = \frac{R_{i} - \operatorname{mean}(\{R_{i}\}_{i=1}^{G})}{\operatorname{std}(\{R_{i}\}_{i=1}^{G})}. \quad (6)

This normalization technique plays a pivotal role in reducing variance by ensuring that individual rewards are considered in the context of group dynamics, thereby fostering more stable and efficient learning even without a value model. The GPG approach demonstrates that leveraging structured reward systems can yield significant improvements in reinforcement learning performance, highlighting the potential for future developments that minimize dependency on traditional models while maintaining robust training efficacy.
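As a concrete illustration of Eq. (6), the group-level normalization amounts to a few tensor operations. The sketch below is a minimal PyTorch rendering under an assumed (B, G) reward layout, not the exact released code.

```python
import torch

def group_normalized_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Eq. (6): normalize each reward by the mean and std of its own group.
    rewards: (B, G) final rewards for G sampled responses per question.
    Returns a (B, G) tensor; every token of response i reuses the scalar A_hat_i.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    # eps guards against groups whose responses all receive the same reward.
    return (rewards - mean) / (std + eps)
```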

Method | Value Models | Reference Models | Surrogate Loss | Policy Constraint
PPO | ✓ | ✓ | ✓ | ✓
GRPO | ✗ | ✓ | ✓ | ✓
TRPO | ✓ | ✗ | ✓ | ✓
GPG | ✗ | ✗ | ✗ | ✗
Table 1: Comparison of reinforcement learning algorithms with various components.

RL algorithms vary significantly in their approaches to tackling variance and optimizing policies. Two key components in many RL algorithms are surrogate loss and policy constraints. Surrogate loss functions help stabilize training by providing a proxy objective that approximates the original goal but is easier to optimize. For example, in PPO [42], the surrogate loss is designed to prevent large updates to the policy, making training more stable by maintaining a balance between exploration and exploitation. (See Appendix B for detailed derivations.) However, surrogate loss functions can sometimes limit the flexibility of the policy updates, potentially leading to sub-optimal solutions. Policy constraints, such as KL divergence limits, ensure that the policy does not change too drastically between updates. These constraints are critical for maintaining stability in training, as they prevent erratic behavior and large deviations from known good policies. However, overly restrictive policy constraints can hinder the agent’s ability to explore new, potentially better strategies.
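For reference, the clipped surrogate loss that PPO-style methods optimize, and that GPG dispenses with, can be sketched as follows; the tensor layout and the default clipping threshold are illustrative assumptions.

```python
import torch

def ppo_clipped_loss(log_probs: torch.Tensor,
                     old_log_probs: torch.Tensor,
                     advantages: torch.Tensor,
                     clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate objective (cf. Eq. (9) in Appendix B), negated for minimization."""
    ratio = torch.exp(log_probs - old_log_probs)                    # pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```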

GRPO [43] introduces the use of reference models and policy constraints to stabilize training. The reference model serves as a foundation to provide additional guidance and regularization, ensuring more consistent and robust learning by minimizing variance in policy updates. Policy constraints further aid in preventing drastic policy changes, contributing to the stability of the learning process. Full derivations can be found in the Appendix B. Conversely, the introduction of GPG aims to offer greater flexibility by removing both the reference model and policy constraints. This approach fosters a more open exploration of the solution space, allowing the RL agent to develop novel strategies to achieve optimal results without the constraints imposed by prior models or divergence measures.

Overall, GPG represents an innovative step forward, integrating the benefits of reward normalization to enhance learning efficiency and adaptability, setting the stage for future advances in reinforcement learning strategies that emphasize flexible exploration and learning dynamics.

4 Experiments

All experimental settings were meticulously controlled to ensure fair comparisons. We adhered closely to the hyperparameters employed by GRPO, despite their suboptimality for our approach. Notably, our method consistently outperformed GRPO across all tasks, achieving superior performance with clear margins. These results underscore the robustness and efficacy of our proposed method.

4.1 Experimental Setup

4.1.1 Dataset and Benchmarks

For the unimodal scenario, we utilize open-s1, open-rs [9] and MATH-lighteval [17] datasets as our training data. These datasets encompass a wide range of problem types and difficulty levels. To assess the reasoning capabilities of the models, we employ five distinct mathematics-focused benchmark datasets: AIME24, MATH-500 [28; 17], AMC23, Minerva [26], and OlympiadBench [21].

In the multimodal case, we handle a variety of tasks. Specifically, for the visual reasoning task, we utilize approximately 12,000 samples from the SAT dataset [38] for training and perform evaluations on the CV-Bench dataset [45]. For the geometry reasoning task, following R1-V [4], we train on around 8,000 samples from the GEOQA training set [4] and subsequently evaluate performance on the GEOQA test set [3]. For the classification and reasoning grounding tasks, we follow Visual-RFT [31] to conduct few-shot classification training on Flower102 [34], Pets37 [37], FGVCAircraft [32], and Cars196 [24]. Additionally, training is conducted on 239 samples from the LISA training set [25]. All evaluations are carried out using the corresponding test sets associated with these training sets.

4.1.2 Implementation Details

Our approach is broadly applicable across a wide range of reinforcement learning tasks. To demonstrate its versatility and efficacy, we conduct experiments encompassing both unimodal and multimodal scenarios. These experiments are performed on NVIDIA H20 96G GPUs. For each experiment, we adhere strictly to the implementation of the original code base, ensuring consistent training and evaluation procedures. The GPG method is summarized in Algorithm 1, and more detailed settings are provided in Appendix A.

Algorithm 1 Group Policy Gradient (GPG)
Input: model outputs $o$ [shape: $(B, G, C, dim)$], rewards $r$
1: Compute per-token log-probabilities $\log\pi_{\theta}(o)$ from $o$ and the model $\pi_{\theta}$
2: Compute advantages $\hat{A}$ from the rewards via Eq. (6)
3: $loss \leftarrow -\log\pi_{\theta}(o) \cdot \hat{A}$
Output: $loss$
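Algorithm 1 and Eq. (5) translate into a short loss routine. The sketch below assumes flattened (B·G, T) completion tensors and a padding mask; the variable names are ours and not taken verbatim from the released code.

```python
import torch

def gpg_loss(per_token_logps: torch.Tensor,
             advantages: torch.Tensor,
             completion_mask: torch.Tensor) -> torch.Tensor:
    """Eq. (5): average of -log pi_theta(o_{i,t}) * A_hat_i over all generated tokens.
    per_token_logps: (B*G, T) log-probabilities of the sampled completion tokens
    advantages:      (B*G,)  group-normalized advantages from Eq. (6)
    completion_mask: (B*G, T) 1 for generated tokens, 0 for padding
    """
    per_token_loss = -per_token_logps * advantages.detach().unsqueeze(1)  # broadcast A_hat_i
    per_token_loss = per_token_loss * completion_mask
    # Normalize by the total number of generated tokens, i.e. sum_i |o_i|.
    return per_token_loss.sum() / completion_mask.sum().clamp(min=1)
```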

4.2 Prompt and Reward Function

Prompt for Reasoning. In the process of reinforcement fine-tuning, specific instructions are incorporated into the system prompt. These instructions encourage the model to generate intermediate reasoning steps, thereby facilitating the reasoning capabilities of the model. An example of this approach is provided below [31]:

System Prompt for Reasoning A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer>

Reward Function. For most tasks, we use the accuracy and formatting reward functions. For the grounding task, the Intersection over Union (IoU) reward function is utilized. Illustrative sketches of these rules follow the list below.

  • Accuracy: If the model’s output is consistent with the ground truth, a reward of 1.0 is awarded.

  • Formatting: If the format of the model output is “<think></think> <answer></answer>”, a reward of 1.0 is granted.

  • IoU: Consistent with Visual-RFT [31], the reward value is derived from the calculated scores of the bounding boxes generated by the model.
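The following Python sketches illustrate these three reward rules. They are hedged approximations: the exact answer extraction and matching logic in the released code base may differ, and the function names are our own.

```python
import re

def accuracy_reward(output: str, ground_truth: str) -> float:
    """1.0 if the answer extracted from <answer>...</answer> matches the ground truth."""
    match = re.search(r"<answer>(.*?)</answer>", output, re.DOTALL)
    prediction = match.group(1).strip() if match else output.strip()
    return 1.0 if prediction == ground_truth.strip() else 0.0

def format_reward(output: str) -> float:
    """1.0 if the output follows the <think></think> <answer></answer> format."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, output.strip(), re.DOTALL) else 0.0

def iou_reward(pred_box, gt_box) -> float:
    """Intersection over Union of two (x1, y1, x2, y2) boxes, used for grounding."""
    ix1, iy1 = max(pred_box[0], gt_box[0]), max(pred_box[1], gt_box[1])
    ix2, iy2 = min(pred_box[2], gt_box[2]), min(pred_box[3], gt_box[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (pred_box[2] - pred_box[0]) * (pred_box[3] - pred_box[1])
    area_g = (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1])
    union = area_p + area_g - inter
    return inter / union if union > 0 else 0.0
```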

Models Average AIME24 MATH-500 AMC23 Minerva OlympiadBench
Base Model 48.9 28.8 82.8 62.9 26.5 43.3
+ GRPO 53.1 33.3 83.8 67.5 29.8 50.9
+ GPG 55.7 33.3 87.6 77.5 29.4 50.5
Table 2: Results on the DeepSeek-R1-Distill-Qwen-1.5B model. The GPG method achieves an additional average performance gain of 2.6% over GRPO.
Models Average AIME24 MATH-500 AMC23 Minerva OlympiadBench
Qwen2.5-Math-7B 30.9 13.3 57.6 45.0 14.7 23.7
+ GRPO 43.7 16.7 73.4 62.5 30.2 35.7
+ Dr. GRPO [30] 43.7 26.7 74.6 50.0 30.1 37.3
+ GPG 45.3 23.3 73.6 60.0 30.5 39.3
Table 3: Results on the Qwen2.5-Math-7B model. Compared with GRPO, the GPG method demonstrates superior performance on four of the five datasets.
Model Average AIME24 MATH-500 AMC23 Minerva OlympiadBench
General Models
Llama-3.1-70B-Instruct 35.7 16.7 64.6 30.1 35.3 31.9
rStar-Math-7B - 26.7 78.4 47.5 - 47.1
Eurus-2-7B-PRIME 48.9 26.7 79.2 57.8 38.6 42.1
1.5B Models
DeepSeek-R1-Distill-Qwen-1.5B 48.9 28.8 82.8 62.9 26.5 43.3
Still-3-1.5B-Preview 51.6 32.5 84.4 66.7 29.0 45.4
Open-RS1 * 53.1 33.3 83.8 67.5 29.8 50.9
Open-RS3 * 52.0 26.7 85.4 70.0 27.9 50.2
GPG-RS1 55.7 33.3 87.6 77.5 29.4 50.5
GPG-RS3 55.5 33.3 85.0 80.0 26.8 52.4
Table 4: Zero-shot pass@1 performance across benchmarks. Dashes (–) denote unavailable official scores. Asterisk (*) indicates reproduced results.

4.3 Main Results

4.3.1 Unimodal Task

Mathematical Reasoning. We systematically evaluate the performance of GPG across five benchmark datasets, as presented in Table 2. Initially, we evaluate the DeepSeek-R1-Distill-Qwen-1.5B model, establishing it as our baseline with an average accuracy of 48.9%. Compared to GRPO [43], our GPG approach exhibits a substantial performance enhancement, achieving an average improvement of 2.6%. Overall, the GPG model attains an average accuracy of 55.7%. Significant improvements over GRPO are evidenced in the MATH-500 and AMC23 benchmarks, showing accuracy improvements of 3.8% and 10.0%, respectively. These results highlight the effectiveness of GPG in enhancing mathematical reasoning capabilities.

Additionally, the performance of GPG was evaluated in comparison to GRPO [43] using the larger Qwen2.5-Math-7B model, as detailed in Table 3. GPG achieves a higher average accuracy of 45.3%, an improvement of 1.6% over GRPO. This enhancement illustrates the improved reasoning capabilities provided by GPG, particularly notable in benchmarks such as AIME24 (+6.6%) and OlympiadBench [21] (+3.6%). Moreover, GPG achieves an average improvement of 1.6% over Dr. GRPO, with a 10.0% increase on AMC23 and a 2.0% gain on OlympiadBench [21]. These advancements highlight the utility of GPG in RL training for critical mathematical reasoning tasks.

As illustrated in Table 4, our models exhibit superior performance compared to most baselines [15; 8; 9], with GPG-RS1 and GPG-RS3 achieving average scores of 55.7% and 55.5%, versus 53.1% for Open-RS1 and 52.0% for Open-RS3. Additionally, GPG-RS1 and GPG-RS3 show strong results on AMC23 with scores of 77.5% and 80.0%, clearly surpassing Open-RS1 (67.5%) and Open-RS3 (70.0%). Both GPG-RS1 and GPG-RS3 demonstrate competitive performance across various benchmarks, particularly excelling in MATH-500 [28; 17] with scores of 87.6% and 85.0%, and OlympiadBench [21] with scores of 50.5% and 52.4%.

Models Total Count Relation Depth Distance
Qwen2-VL-2B 31.38 54.69 22.46 0.16 31.66
+ SFT 57.84 60.02 68.92 55.00 45.83
+ GRPO 59.47 59.64 66.76 54.16 56.66
+ GPG 69.11 65.36 82.62 67.83 60.67
Table 5: Visual reasoning results on CV-Bench [45], showing that GPG training on the base model yields overall better performance than both GRPO training and the base model.
Models GEOQA (Test)
Qwen2.5-VL-3B-Instruct 35.41
+ GRPO 47.48
+ GPG 50.80
Table 6: Geometry reasoning results on GEOQA [3]. GPG is better than GRPO.
Models mIoU (test) mIoU (val) gIoU (test)
Qwen2-VL-2B 26.9 30.1 25.3
+ SFT 28.3 29.7 25.3
+ GRPO 37.6 34.4 34.4
+ GPG 51.5 53.4 49.5
Table 7: Reasoning grounding results on LISA [25]. GPG surpasses GRPO in reasoning grounding with 239 training images.
Models Average Flower102 [34] Pets37 [37] FGVC [32] Cars196 [24]
Qwen2-VL-2B 56.0 54.8 66.4 45.9 56.8
+ SFT 55.6 58.5 55.5 67.9 40.5
+ GRPO 81.9 71.4 86.1 74.8 95.3
+ GPG 86.0 73.0 87.1 86.8 97.1
Table 8: 4-shot results on four fine-grained classification datasets. GPG shows consistently better results than GRPO on all four classification datasets.

4.3.2 Multimodal Task

Visual Reasoning. We initially evaluate the GPG method using the CV-Bench [45] visual reasoning dataset, strictly adhering to the parameter settings of VisualThinker-R1-Zero [59]. As illustrated in Table 5, the GPG method demonstrates a significant improvement in performance. Specifically, it attains a score of 69.11% on CV-Bench, an increase of 9.64 percentage points over the 59.47% achieved by GRPO.

Geometry Reasoning. In addition to visual reasoning, MLLMs exhibit notable proficiency in geometry reasoning. To evaluate the efficacy of the GPG method in this domain, we employ an experimental setup similar to that used in R1-V [4] on the GEOQA [3] dataset. The results, presented in Table 6, indicate that the GPG method achieves a score of 50.80%, surpassing GRPO's score of 47.48% by 3.32 percentage points. This demonstrates the superior performance of the GPG method in addressing complex geometric reasoning tasks.

Classification. Beyond the evaluation of reasoning tasks, we also assess the enhancement of the GPG method over GRPO in image perception tasks. As shown in Table 8, the GPG method achieves an average score of 86.0% across four classification datasets, surpassing GRPO by 4.1 percentage points. Additionally, our method consistently produces improvements across all four classification datasets, underscoring its superiority in image perception tasks.

Reasoning Grounding. The final critical aspect of evaluating MLLMs involves precisely identifying objects according to user requirements. To this end, we employ the Qwen2-VL-2B model for grounding tasks using the LISA dataset [25], with the results presented in Table 7. In comparison to the GRPO method, the GPG approach demonstrates a substantial enhancement, improving all metrics by over 10 percentage points. This significant improvement underscores the superiority of the GPG method in object localization, leading to considerable advancements in reasoning and perception capabilities.

Figure 2: Comparison of GPG and GRPO on a mathematical reasoning task, based on the DeepSeek-R1-Distill-Qwen-1.5B model trained on the Open-r1 dataset: a test case from the AIME24 dataset.
Figure 3: Comparison of GPG (blue curves) and GRPO (gray curves) in terms of training loss, rewards, and completion length. Experiments are based on DeepSeek-R1-Distill-Qwen-1.5B, the same setting as in Table 2.

4.4 Ablation Study and Discussion

Case Study and Training Analysis. We present the reasoning processes of GPG and GRPO, as illustrated in Fig. 2. Compared to GRPO, the GPG approach demonstrates a more comprehensive and accurate reasoning capability, whereas GRPO exhibits errors in formula analysis. Consequently, GPG arrived at the correct solution, while GRPO produced an incorrect result. In Fig. 3, we present a range of real-time training metrics to illustrate the effectiveness of GPG as a straightforward yet strong RL algorithm.

Sensitivity on Group Size. We study the effect of the number of generations within a group. As shown in Table 9, increasing the group size from 2 to 16 leads to progressive improvements across most metrics. Specifically, the Average performance improves steadily with larger group sizes. We choose 8 to achieve a good tradeoff between training cost and performance.

Group Number Average AIME24 MATH-500 AMC23 Minerva OlympiadBench
2 41.9 16.7 71.6 60.0 25.0 36.0
4 43.3 20.0 73.2 55.0 29.8 38.5
8 45.3 23.3 73.6 60.0 30.5 39.3
16 47.3 26.7 74.6 65.0 32.4 37.8
Table 9: Ablation on different group sizes using Qwen2.5-Math-7B.

Reward Normalization. We study the role of reward normalization and show the results in Table 10. Normalization within a batch is common practice in RL training [1]. The experimental results show that reward normalization within a group is better than normalization within the batch; a sketch after Table 10 makes the two strategies explicit.

Strategy Average AIME24 MATH-500 AMC23 Minerva OlympiadBench
RN Group 45.3 23.3 73.6 60.0 30.5 39.3
RN Batch 44.9 23.3 72.2 55.0 35.3 38.5
+ KL (β=0.04) 43.7 20.0 73.8 57.5 29.4 37.6
Table 10: Ablation on reward normalization (RN) using Qwen2.5-Math-7B.
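The two strategies compared above differ only in the axis over which reward statistics are computed. Below is a minimal sketch, assuming a (B, G) reward layout with B questions and G responses per question; it is illustrative rather than the released implementation.

```python
import torch

def normalize_rewards(rewards: torch.Tensor, scope: str = "group") -> torch.Tensor:
    """rewards: (B, G) with B questions and G sampled responses per question.
    scope="group": per-question statistics (Eq. (6), the GPG default).
    scope="batch": statistics over all B*G rewards in the batch.
    """
    if scope == "group":
        mean = rewards.mean(dim=1, keepdim=True)
        std = rewards.std(dim=1, keepdim=True)
    else:
        mean = rewards.mean()
        std = rewards.std()
    return (rewards - mean) / (std + 1e-6)
```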

KL constraint. In principle, our method is designed to optimize the original reinforcement learning (RL) problem directly, without imposing any distribution constraints. Nevertheless, we conducted an ablation study to evaluate the impact of adding such a constraint. The results are presented in Table 10. Our findings indicate that incorporating the constraint negatively impacts performance.

Comparison with Various RL Methods. We summarize the differences between GPG and other RL methods in the simplest possible form. As shown in Table 11, the GPG loss includes neither the clip term nor the KL divergence. Its form and computation are the simplest, and, as discussed in Section 4.3, its performance is better than that of the other methods.

RL Method | Loss Function | Advantage Function
PPO [42] | $\mathcal{L}_{\text{PPO}} = -\min\left[\frac{\pi_{\theta}(o)}{\pi_{\theta_{old}}(o)}\cdot A,\ \underbrace{\operatorname{clip}\left(\frac{\pi_{\theta}(o)}{\pi_{\theta_{old}}(o)},1-\epsilon,1+\epsilon\right)}_{\text{CLIP}}\cdot A\right]$ | $A$ is computed by applying GAE [41] based on rewards and the critic model.
GRPO [43] | $\mathcal{L}_{\text{GRPO}} = -\left(\min\left[\frac{\pi_{\theta}(o)}{\pi_{\theta_{old}}(o)}\cdot A,\ \text{CLIP}\cdot A\right]-\beta\,\mathbb{D}_{KL}\left[\pi_{\theta}\|\pi_{ref}\right]\right)$ | $A=\frac{R(o)-\operatorname{mean}\{R(o)\}}{\operatorname{std}\{R(o)\}}$
Dr. GRPO [30] | $\mathcal{L}_{\text{Dr.GRPO}}=\mathcal{L}_{\text{PPO}}$ | $A=R(o)-\operatorname{mean}\{R(o)\}$
DAPO [55] | $\mathcal{L}_{\text{DAPO}} = -\min\left[\frac{\pi_{\theta}(o)}{\pi_{\theta_{old}}(o)}\cdot A,\ \operatorname{clip}\left(\frac{\pi_{\theta}(o)}{\pi_{\theta_{old}}(o)},1-\epsilon_{\text{low}},1+\epsilon_{\text{high}}\right)\cdot A\right]$ | $A=\frac{R(o)-\operatorname{mean}\{R(o)\}}{\operatorname{std}\{R(o)\}}$
GPG | $\mathcal{L}_{\text{GPG}}=-\log\pi_{\theta}(o)\cdot A$ | $A=\frac{R(o)-\operatorname{mean}\{R(o)\}}{\operatorname{std}\{R(o)\}}$
Table 11: Comparison of various RL methods, presented in their simplest form.

4.5 Broader Impact

Achieving advanced general intelligence critically depends on augmenting the reasoning capabilities of models, with efficient and scalable reinforcement learning methods serving as a cornerstone. Our proposed approach investigates a minimalist strategy that aims to enhance reasoning capacity through simplicity and efficiency, thereby potentially facilitating the development of scalable systems. However, given the constraints of our computational budget, we have not evaluated our method on extremely large models.

5 Conclusion

In this paper, we introduce GPG, which effectively addresses the critical challenges inherent in reinforcement fine-tuning approaches such as PPO and GRPO. By directly incorporating group-based decision dynamics into the standard PG method, GPG simplifies the training process and significantly reduces computational overhead without sacrificing reasoning quality. This breakthrough provides a more efficient framework for training advanced LLMs capable of complex reasoning, thereby contributing to more resource-effective and scalable artificial intelligence systems.

References

  • [1] Marcin Andrychowicz, Anton Raichuk, Piotr Stańczyk, Manu Orsini, Sertan Girgin, Raphaël Marinier, Leonard Hussenot, Matthieu Geist, Olivier Pietquin, Marcin Michalski, et al. What matters for on-policy deep actor-critic methods? a large-scale study. In International conference on learning representations, 2021.
  • [2] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
  • [3] Jiaqi Chen, Jianheng Tang, Jinghui Qin, Xiaodan Liang, Lingbo Liu, Eric P. Xing, and Liang Lin. Geoqa: A geometric question answering benchmark towards multimodal numerical reasoning, 2022.
  • [4] Liang Chen, Lei Li, Haozhe Zhao, Yifan Song, and Vinci. R1-v: Reinforcing super generalization ability in vision-language models with less than $3. https://github.com/Deep-Agent/R1-V, 2025. Accessed: 2025-02-02.
  • [5] Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024.
  • [6] Xiangxiang Chu. Policy optimization with penalized point probability distance: An alternative to proximal policy optimization. arXiv preprint arXiv:1807.00442, 2018.
  • [7] Xiangxiang Chu, Limeng Qiao, Xinyu Zhang, Shuang Xu, Fei Wei, Yang Yang, Xiaofei Sun, Yiming Hu, Xinyang Lin, Bo Zhang, et al. Mobilevlm v2: Faster and stronger baseline for vision language model. arXiv preprint, 2024.
  • [8] Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, et al. Process reinforcement through implicit rewards. arXiv preprint arXiv:2502.01456, 2025.
  • [9] Quy-Anh Dang and Chris Ngo. Reinforcement learning for reasoning in small llms: What works and what doesn’t, 2025.
  • [10] Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry Rudolph, and Aleksander Madry. Implementation matters in deep rl: A case study on ppo and trpo. In International conference on learning representations, 2019.
  • [11] Hugging Face. Open r1: A fully open reproduction of deepseek-r1. https://github.com/huggingface/open-r1, 2025. Accessed: January 2025.
  • [12] Xidong Feng, Ziyu Wan, Muning Wen, Stephen Marcus McAleer, Ying Wen, Weinan Zhang, and Jun Wang. Alphazero-like tree-search can guide large language model decoding and training, 2024.
  • [13] Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Chenghao Ma, Liang Chen, Runxin Xu, et al. Omni-math: A universal olympiad level mathematic benchmark for large language models. arXiv preprint arXiv:2410.07985, 2024.
  • [14] Google. Gemini: A family of highly capable multimodal models, 2023.
  • [15] Xinyu Guan, Li Lyna Zhang, Yifei Liu, Ning Shang, Youran Sun, Yi Zhu, Fan Yang, and Mao Yang. rstar-math: Small llms can master math reasoning with self-evolved deep thinking. arXiv preprint arXiv:2501.04519, 2025.
  • [16] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
  • [17] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021.
  • [18] Chloe Ching-Yun Hsu, Celestine Mendler-Dünner, and Moritz Hardt. Revisiting design choices in proximal policy optimization. arXiv preprint arXiv:2009.10897, 2020.
  • [19] Jian Hu. Reinforce++: A simple and efficient approach for aligning large language models, 2025.
  • [20] Hailang Huang, Yong Wang, Zixuan Huang, Huaqiu Li, Tongwen Huang, Xiangxiang Chu, and Richong Zhang. Mmgenbench: Fully automatically evaluating lmms from the text-to-image generation perspective, 2025.
  • [21] Zhen Huang, Zengzhi Wang, Shijie Xia, Xuefeng Li, Haoyang Zou, Ruijie Xu, Run-Ze Fan, Lyumanshan Ye, Ethan Chern, Yixin Ye, et al. Olympicarena: Benchmarking multi-discipline cognitive reasoning for superintelligent ai. Advances in Neural Information Processing Systems, 37:19209–19253, 2024.
  • [22] LI Jia, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Costa Huang, Kashif Rasul, Longhui Yu, Albert Jiang, Ziju Shen, et al. Numinamath, 2024.
  • [23] Takeshi Kojima, Shixiang (Shane) Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 22199–22213. Curran Associates, Inc., 2022.
  • [24] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In Proceedings of the IEEE international conference on computer vision workshops, pages 554–561, 2013.
  • [25] Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9579–9589, 2024.
  • [26] Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35:3843–3857, 2022.
  • [27] Ziniu Li, Tian Xu, Yushun Zhang, Zhihang Lin, Yang Yu, Ruoyu Sun, and Zhi-Quan Luo. Remax: A simple, effective, and efficient reinforcement learning method for aligning large language models, 2024.
  • [28] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, 2023.
  • [29] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023.
  • [30] Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783, 2025.
  • [31] Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning. arXiv preprint arXiv:2503.01785, 2025.
  • [32] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.
  • [33] Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025.
  • [34] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian conference on computer vision, graphics & image processing, pages 722–729. IEEE, 2008.
  • [35] OpenAI. Gpt-4v(ision) system card, 2023.
  • [36] OpenAI. Learning to reason with llms, 2024.
  • [37] Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In 2012 IEEE conference on computer vision and pattern recognition, pages 3498–3505. IEEE, 2012.
  • [38] Arijit Ray, Jiafei Duan, Reuben Tan, Dina Bashkirova, Rose Hendrix, Kiana Ehsani, Aniruddha Kembhavi, Bryan A. Plummer, Ranjay Krishna, Kuo-Hao Zeng, and Kate Saenko. Sat: Spatial aptitude training for multimodal language models, 2024.
  • [39] Laria Reynolds and Kyle McDonell. Prompt programming for large language models: Beyond the few-shot paradigm. In Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, CHI EA ’21, New York, NY, USA, 2021. Association for Computing Machinery.
  • [40] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International conference on machine learning, pages 1889–1897. PMLR, 2015.
  • [41] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation, 2018.
  • [42] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • [43] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
  • [44] Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.
  • [45] Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, Ziteng Wang, Rob Fergus, Yann LeCun, and Saining Xie. Cambrian-1: A fully open, vision-centric exploration of multimodal llms, 2024.
  • [46] Trieu Trinh, Yuhuai Wu, Quoc Le, He He, and Thang Luong. Solving olympiad geometry without human demonstrations. Nature, 2024.
  • [47] Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, et al. Visionllm: Large language model is also an open-ended decoder for vision-centric tasks. arXiv preprint, 2023.
  • [48] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.
  • [49] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8:229–256, 1992.
  • [50] Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, et al. Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding. arXiv preprint arXiv:2412.10302, 2024.
  • [51] Huajian Xin, Z. Z. Ren, Junxiao Song, Zhihong Shao, Wanjia Zhao, Haocheng Wang, Bo Liu, Liyue Zhang, Xuan Lu, Qiushi Du, Wenjun Gao, Qihao Zhu, Dejian Yang, Zhibin Gou, Z. F. Wu, Fuli Luo, and Chong Ruan. Deepseek-prover-v1.5: Harnessing proof assistant feedback for reinforcement learning and monte-carlo tree search, 2024.
  • [52] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in neural information processing systems, 36:11809–11822, 2023.
  • [53] Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800, 2024.
  • [54] Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, and Pengfei Liu. Limo: Less is more for reasoning, 2025.
  • [55] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Weinan Dai, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, Wei-Ying Ma, Ya-Qin Zhang, Lin Yan, Mu Qiao, Yonghui Wu, and Mingxuan Wang. Dapo: An open-source llm reinforcement learning system at scale, 2025.
  • [56] Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. STar: Bootstrapping reasoning with reasoning. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022.
  • [57] Jingyi Zhang, Jiaxing Huang, Huanjin Yao, Shunyu Liu, Xikun Zhang, Shijian Lu, and Dacheng Tao. R1-vl: Learning to reason with multimodal large language models via step-wise group relative policy optimization. arXiv preprint arXiv:2503.12937, 2025.
  • [58] Rui Zheng, Shihan Dou, Songyang Gao, Wei Shen, Binghai Wang, Yan Liu, Senjie Jin, Qin Liu, Limao Xiong, Lu Chen, Zhiheng Xi, Yuhao Zhou, Nuo Xu, Wenbin Lai, Minghao Zhu, Rongxiang Weng, Wensen Cheng, Cheng Chang, Zhangyue Yin, Yuan Hua, Haoran Huang, Tianxiang Sun, Hang Yan, Tao Gui, Qi Zhang, Xipeng Qiu, and Xuanjing Huang. Secrets of rlhf in large language models part i: Ppo, 2023.
  • [59] Hengguang Zhou, Xirui Li, Ruochen Wang, Minhao Cheng, Tianyi Zhou, and Cho-Jui Hsieh. R1-zero’s "aha moment" in visual reasoning on a 2b non-sft model, 2025.

Appendix A More Implementation Details

To evaluate the unimodal reasoning capabilities of our proposed method, we utilize two publicly available code repositories: Open-r1 [11] and Open-rs [9]. These repositories are selected due to their extensive coverage of various reasoning scenarios and their ability to present substantial challenges that effectively assess the reasoning capabilities of advanced models. The DeepSeek-R1-Distill-Qwen-1.5B model is trained for 100 and 50 global steps using the open-s1 and open-rs datasets, as reported in the repository [9], resulting in the GPG-RS1 and GPG-RS3 models, respectively. Subsequently, GPG-RS1 and GPG-RS3 are evaluated across five established benchmarks. Additionally, the Qwen2.5-Math-7B model is trained on 7,500 samples from the MATH-lighteval dataset, based on the Open-r1 [11] code base.

For multimodal tasks, we select three well-known frameworks as our code bases: VisualThinker-R1-Zero [59], R1-V [4], and Visual-RFT [31]. These frameworks cover a variety of tasks, including visual reasoning, geometric reasoning, and image perception; using distinct code bases enables a comprehensive assessment of the performance gains achieved by our method across different tasks. Specifically, with the VisualThinker-R1-Zero framework, we evaluate the GPG approach on CV-Bench [45]. We further evaluate GPG on the GEOQA dataset [3] based on R1-V. Finally, for image-perception tasks such as classification [34, 37, 32, 24] and reasoning grounding [25], we examine the performance of GPG using the Visual-RFT framework.

Appendix B More Related Work

Proximal Policy Optimization. PPO [42] addresses the inherent optimization instability of Trust Region Policy Optimization (TRPO) [40] through a clipped surrogate objective. Formally, let the probability ratio between the updated policy $\pi_{\theta}$ and the previous policy $\pi_{\theta_{\text{old}}}$ be defined as

$$r_t(\theta)=\frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}, \tag{7}$$

where $a_t$ and $s_t$ denote the action and state at timestep $t$, respectively. While TRPO maximizes the surrogate objective

$$\mathcal{J}^{\text{TRPO}}(\theta)=\mathbb{E}_t\!\left[r_t(\theta)\,\hat{A}_t\right] \tag{8}$$

under a Kullback-Leibler (KL) divergence constraint, PPO reformulates this via a clipped mechanism. Here, $\hat{A}_t$ represents the estimated advantage function quantifying the relative value of action $a_t$ in state $s_t$. The PPO objective is defined as:

$$\mathcal{J}^{\text{CLIP}}(\theta)=\mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \text{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\right)\right], \tag{9}$$

where the clip operator restricts $r_t(\theta)$ to the interval $[1-\epsilon,\,1+\epsilon]$, with $\epsilon$ being a hyperparameter controlling the policy update magnitude. This constraint prevents excessive policy deviations that could degrade performance.
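To make Eq. (9) concrete, a minimal PyTorch-style sketch of the clipped surrogate is given below. The tensor names and the default $\epsilon=0.2$ are our assumptions for illustration, not part of any reference implementation.

```python
import torch

def ppo_clip_objective(logp_new: torch.Tensor,
                       logp_old: torch.Tensor,
                       advantages: torch.Tensor,
                       epsilon: float = 0.2) -> torch.Tensor:
    """Clipped surrogate of Eq. (9), averaged over a batch of timesteps."""
    # Probability ratio r_t(theta) of Eq. (7), computed in log-space for stability.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Element-wise minimum, then the empirical expectation over timesteps.
    return torch.min(unclipped, clipped).mean()
```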

To further stabilize training and promote exploration, the composite objective incorporates three components: 1) the clipped policy-gradient term $\mathcal{J}^{\text{CLIP}}(\theta)$; 2) the value-function loss

$$\mathcal{L}^{\text{VF}}=\mathbb{E}_t\!\left[\big(V_{\theta}(s_t)-V_{\text{target}}(s_t)\big)^2\right], \tag{10}$$

where $V_{\theta}(s_t)$ is the state-value function estimator and $V_{\text{target}}(s_t)$ denotes the target value computed via temporal-difference methods; and 3) the entropy regularization

$$\mathcal{H}(s_t,\pi_{\theta})=-\sum_{a\in\mathcal{A}}\pi_{\theta}(a \mid s_t)\log\pi_{\theta}(a \mid s_t), \tag{11}$$

with $\mathcal{A}$ being the action space, which prevents premature policy convergence by encouraging stochasticity.

The complete objective integrates these terms as:

$$\mathcal{J}^{\text{PPO}}(\theta)=\mathbb{E}_t\!\left[\mathcal{J}^{\text{CLIP}}(\theta)-c_1\,\mathcal{L}^{\text{VF}}+c_2\,\mathcal{H}(s_t,\pi_{\theta})\right], \tag{12}$$

where $c_1>0$ and $c_2>0$ are coefficients balancing policy optimization, value-estimation accuracy, and exploration. Crucially, PPO replaces TRPO's computationally intensive second-order KL constraints with first-order gradient clipping, enabling efficient large-scale implementations while preserving monotonic policy-improvement guarantees, as rigorously established through surrogate-objective monotonicity analysis [18].
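A compact sketch of the composite loss in Eq. (12) is shown below, again in PyTorch-style pseudocode. The coefficient defaults ($\epsilon=0.2$, $c_1=0.5$, $c_2=0.01$) are common choices that we assume here for illustration; they are not prescribed by the text.

```python
import torch
import torch.nn.functional as F

def ppo_total_loss(logp_new, logp_old, advantages,
                   values, value_targets, logits,
                   epsilon=0.2, c1=0.5, c2=0.01):
    """Negative of the composite objective in Eq. (12): clipped policy term,
    value-function loss (Eq. 10), and entropy bonus (Eq. 11)."""
    ratio = torch.exp(logp_new - logp_old)
    policy_obj = torch.min(
        ratio * advantages,
        torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages,
    ).mean()
    value_loss = F.mse_loss(values, value_targets)  # Eq. (10)
    # Entropy over the action distribution at each timestep, Eq. (11);
    # `logits` is assumed to have shape (timesteps, num_actions).
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * torch.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
    # Minimizing this loss maximizes the objective of Eq. (12).
    return -(policy_obj - c1 * value_loss + c2 * entropy)
```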

Group Relative Policy Optimization. GRPO [43] establishes a policy-gradient framework that eliminates the dependency on explicit value-function approximation through comparative advantage estimation within response groups. The method samples multiple candidate outputs for each input question and constructs advantage signals from the relative rewards within these groups. For a given question $q\sim P(Q)$, the algorithm generates $G$ responses $\{o_1,\ldots,o_G\}$ from the current policy $\pi_{\theta_{\text{old}}}$, then computes token-level advantages using intra-group reward comparisons.

The advantage term $\hat{A}_{i,t}$ for the $t$-th token of the $i$-th response is defined as the deviation from the group-average reward:

$$\hat{A}_{i,t}=R(o_i)-\frac{1}{G}\sum_{j=1}^{G}R(o_j), \tag{13}$$

where $R(\cdot)$ denotes the reward model's evaluation. This design inherently aligns with the comparative training paradigm of reward models, which typically learn from pairwise response rankings.
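Eq. (13) amounts to a one-line mean-baseline computation. The sketch below assumes `rewards` holds one scalar reward per response in the group; note that some GRPO implementations additionally normalize by the group standard deviation, whereas this sketch follows Eq. (13) as written.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Eq. (13): subtract the group-mean reward; `rewards` has shape (G,).
    The resulting scalar advantage is broadcast to every token of the
    corresponding response when forming the token-level objective."""
    return rewards - rewards.mean()
```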

The optimization objective integrates clipped probability ratios with explicit KL regularization. Defining the token-level probability ratio as

$$r_{i,t}(\theta)=\frac{\pi_{\theta}(o_{i,t} \mid q,\,o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q,\,o_{i,<t})}, \tag{14}$$

the clipped surrogate objective constrains policy updates through

$$\mathcal{J}^{\text{clip}}_{i,t}(\theta)=\min\!\left(r_{i,t}(\theta)\,\hat{A}_{i,t},\ \text{clip}\big(r_{i,t}(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_{i,t}\right). \tag{15}$$

Diverging from PPO's implicit KL control via reward shaping, GRPO directly regularizes policy divergence using an unbiased KL estimator:

$$\mathbb{D}_{\text{KL}}\!\left[\pi_{\theta}\,\|\,\pi_{\text{ref}}\right]=\frac{\pi_{\text{ref}}(o_{i,t} \mid q,\,o_{i,<t})}{\pi_{\theta}(o_{i,t} \mid q,\,o_{i,<t})}-\log\frac{\pi_{\text{ref}}(o_{i,t} \mid q,\,o_{i,<t})}{\pi_{\theta}(o_{i,t} \mid q,\,o_{i,<t})}-1. \tag{16}$$

The complete objective combines these components with a regularization coefficient $\beta$:

$$\mathcal{J}^{\text{GRPO}}(\theta)=\mathbb{E}_{q,\{o_i\}}\!\left[\frac{1}{G\,|o_i|}\sum_{i,t}\left(\mathcal{J}^{\text{clip}}_{i,t}(\theta)-\beta\,\mathbb{D}_{\text{KL}}\!\left[\pi_{\theta}\,\|\,\pi_{\text{ref}}\right]\right)\right]. \tag{17}$$
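Putting Eqs. (14)-(17) together, the per-token GRPO objective can be sketched as below. We assume flattened tensors of per-token log-probabilities (padding already masked out) and a default $\beta=0.04$; both are illustrative assumptions rather than values prescribed here. The plain `.mean()` matches the $\tfrac{1}{G|o_i|}$ normalization exactly when all responses share the same length, and approximates it otherwise.

```python
import torch

def grpo_token_objective(logp_new, logp_old, logp_ref, advantages,
                         epsilon=0.2, beta=0.04):
    """Per-token GRPO objective of Eq. (17): clipped surrogate (Eq. 15)
    minus beta times the unbiased KL estimator (Eq. 16), averaged over
    all tokens of all responses in the group."""
    ratio = torch.exp(logp_new - logp_old)                 # Eq. (14)
    clipped = torch.min(
        ratio * advantages,
        torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages,
    )                                                      # Eq. (15)
    # Unbiased KL estimator of Eq. (16): pi_ref/pi_theta - log(pi_ref/pi_theta) - 1.
    ref_ratio = torch.exp(logp_ref - logp_new)
    kl = ref_ratio - (logp_ref - logp_new) - 1.0
    return (clipped - beta * kl).mean()
```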