GPG: A Simple and Strong Reinforcement Learning Baseline for Model Reasoning
Abstract
Reinforcement Learning (RL) can directly enhance the reasoning capabilities of large language models without extensive reliance on Supervised Fine-Tuning (SFT). In this work, we revisit the traditional Policy Gradient (PG) mechanism and propose a minimalist RL approach termed Group Policy Gradient (GPG). Unlike conventional methods, GPG directly optimizes the original RL objective, thus obviating the need for surrogate loss functions. As illustrated in Fig. 1, by eliminating both the critic and reference models and by avoiding KL divergence constraints, our approach significantly simplifies the training process compared to Group Relative Policy Optimization (GRPO). Our approach achieves superior performance without relying on auxiliary techniques or adjustments. Extensive experiments demonstrate that our method not only reduces computational costs but also consistently outperforms GRPO across various unimodal and multimodal tasks. Our code is available at https://github.com/AMAP-ML/GPG.

1 Introduction
Large Language Models (LLMs) have achieved substantial advancements, progressively narrowing the gap towards achieving Artificial General Intelligence (AGI) [36; 16; 2; 50; 53; 7]. Recently, LLMs exemplified by OpenAI o1 [36] and DeepSeek R1 [16], have adopted a strategy of generating intermediate reasoning steps before producing final answers. This approach has markedly improved their efficacy in domain-specific tasks [22; 13; 20; 25; 28; 21], such as mathematical reasoning. The remarkable success of this technology is mainly attributed to the Reinforcement Fine-Tuning (RFT) method [42; 43; 55; 27; 19]. Through the application of RFT, the models allocate additional time to “deliberate” prior to generating answers, thereby constructing intricate reasoning chains and subsequently enhancing overall model performance.
In contrast to Supervised Fine-Tuning (SFT), which trains models on fixed input-output pairs to mimic correct responses, RFT introduces an iterative process that incentivizes models to generate coherent and logically structured reasoning paths. RFT leverages RL techniques, such as Proximal Policy Optimization (PPO) [42] and GRPO [43], to optimize decision-making during the generation of intermediate steps. Specifically, PPO ensures stability by constraining policy updates, preventing new strategies from deviating significantly from established behaviours. In contrast, GRPO enhances this process by evaluating performance across groups of actions, encouraging consistent improvements in reasoning quality. This dynamic, feedback-driven approach enables models to think more deeply and adaptively, producing nuanced answers that better handle complex reasoning tasks than the more rigid, label-dependent training of SFT.
Despite its significant success in enhancing reasoning quality, PPO suffers severely from the enormous resource consumption required during training. PPO necessitates the development and integration of both a critic model and a reference model, which not only complicates the training process but also substantially increases computational demands. Consequently, there is a growing trend toward simplifying the PPO method. For instance, ReMax [27] removes the critic model by introducing a baseline value, which reduces training GPU memory usage and accelerates the training process. GRPO likewise eliminates the need for a critic model and instead utilizes normalized rewards within a sample group. Furthermore, DAPO [55] removes the KL divergence from GRPO and enhances the algorithm by introducing techniques such as “clip-higher” and “dynamic sampling”. Additionally, REINFORCE++ [19] incorporates key optimization strategies from PPO while discarding the critic network, thus simplifying the training process and improving stability. A very recent and concurrent work [30] studies the details of reward and loss normalization and observes that GRPO tends to generate more tokens.
A thorough examination of the evolution of the Reinforcement Learning (RL) community reveals that the widely adopted PPO algorithm essentially functions as a conservative surrogate for the original RL problem [42]. All existing enhancements to PPO continue to focus on optimizing the surrogate loss function. Consequently, two fundamental questions remain unresolved:
- Is it feasible to transcend this intermediate strategy and directly optimize the original problem?
- If this is achievable, to what extent can the learning strategy be streamlined?
This paper endeavors to provide a comprehensive exploration of these critical questions. In summary, our key contributions are as follows:
- We revisit the design of the policy gradient algorithm [44] and propose a simple RL method that keeps minimal RL components and directly optimizes the objective instead of a surrogate loss.
- Our approach eschews the necessity for both a critic model and a reference model. Moreover, it imposes no distributional constraints. These characteristics confer substantial advantages for potential scalability.
- Extensive experiments demonstrate that GPG achieves superior performance to GRPO across various unimodal and multimodal visual tasks, while significantly reducing computational costs.
Our code and implementation details are open-sourced, contributing to the development of the community.
2 Related Work
Large Model Reasoning. Recent advancements in both LLM and Multimodal Large Language Model (MLLM) have increasingly focused on enabling models to simulate human-like, stepwise reasoning processes. In the field of LLMs, researchers have pioneered methods such as Chain-of-Thought (CoT) prompting [36; 48; 23; 39; 56; 33; 54], Tree-of-Thought [52], Monte Carlo Tree Search [12; 51; 46], and the construction of complex SFT datasets [33], to enhance performance in reasoning tasks. Notably, approaches such as DeepSeek-R1 [16] have employed large-scale RL with format-specific and result-oriented reward functions, guiding LLMs toward self-emerging, human-like, complex CoT reasoning with significant performance improvements in challenging reasoning tasks. Meanwhile, MLLMs, which convert inputs from various modalities into a unified LLM vocabulary representation space for processing, have exhibited superior performance in vision understanding tasks [2; 50; 53; 7; 29; 5; 14; 35; 47]. Building on the advancements in LLM reasoning, there has been a collective effort within the research community to apply the DeepSeek-R1 methodology to MLLMs to enhance their visual reasoning capabilities, yielding remarkable progress [57; 31; 4; 59].
Reinforcement Learning. RL has driven significant progress in sequential decision-making, with policy gradient methods being fundamental to optimizing stochastic policies. The REINFORCE algorithm [49] established early principles for gradient-based policy updates in trajectory-driven tasks. However, its high variance has posed challenges for scalability. To address this, subsequent research has focused on stabilizing policy optimization processes. Trust Region Policy Optimization (TRPO) [40] introduced constrained updates via quadratic approximations to ensure monotonic improvement. This was further refined by PPO [42], which employed clipped objective functions to simplify the optimization process. Subsequent studies have sought to enhance the PPO algorithm [6; 58] or elaborate on its implementation [10]. PPO has achieved widespread use in language model alignment and robotic control. However, the algorithm’s dependence on conservative policy updates or heuristic clipping thresholds can undermine its exploration potential in favor of stability, which poses a significant challenge in complex domains requiring dynamic strategy adaptation.
3 Method
3.1 Preliminary and Task Formulation
RL is a computational approach to learning through interaction, where an agent seeks to maximize cumulative rewards by selecting optimal actions within an environment. The RL problem is typically defined by a policy $\pi_\theta$, which maps states to actions, and aims to optimize the expected return. The core idea behind policy gradient methods is to use gradient ascent to iteratively adjust the policy parameters $\theta$. The learning objective is to maximize the return $J(\theta)$,

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[ R(\tau) \big]. \tag{1}$$
The policy gradient theorem [44] proves that the above problem can be converted into estimating the gradient,

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[ \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^{\pi_\theta}(s_t, a_t) \big], \tag{2}$$

where $Q^{\pi_\theta}(s_t, a_t)$ is the action-value function, representing the expected return when taking action $a_t$ in state $s_t$ and following policy $\pi_\theta$ thereafter.
To reduce the variance, the advantage function $A^{\pi_\theta}(s_t, a_t) = Q^{\pi_\theta}(s_t, a_t) - V^{\pi_\theta}(s_t)$ is often used instead, leading to the policy gradient update rule:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[ \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^{\pi_\theta}(s_t, a_t) \big]. \tag{3}$$
One-step advantage estimation can be mathematically formulated as:

$$\hat{A}_t = r_t + \gamma V^{\pi_\theta}(s_{t+1}) - V^{\pi_\theta}(s_t), \tag{4}$$

where $V^{\pi_\theta}(s)$ denotes the value function of the critic model, which represents the expected return when starting from state $s$ and following policy $\pi_\theta$. While GAE [41] offers a more sophisticated approach to balancing bias and variance in advantage estimation, we find that in the context of model reasoning, one-step estimation is sufficiently effective for achieving good performance. This simplicity is particularly advantageous in scenarios where computational efficiency is paramount.
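As a worked illustration with assumed values: if the reward at step $t$ is $r_t = 1$, the critic predicts $V^{\pi_\theta}(s_t) = 0.4$ and $V^{\pi_\theta}(s_{t+1}) = 0.6$, and $\gamma = 0.99$, then

$$\hat{A}_t = 1 + 0.99 \times 0.6 - 0.4 = 1.194,$$

so the sampled action is reinforced because it performed better than the critic expected.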
Given a sequence of questions and instructions, the model is tasked with generating corresponding answers. Subsequently, rewards are returned based on predefined reward models or hand-crafted rules. Our objective is to leverage these reward signals to optimize our policy, thereby enhancing the model’s ability to generate accurate and contextually appropriate responses.
However, designing or obtaining accurate rewards for intermediate steps is nontrivial [16]. To address this challenge, we simplify our problem as follows. Given a question $q$ and its prompt, we sample an action $o$ (a complete response) from the policy $\pi_\theta(\cdot \mid q)$ and obtain a final reward signal $R(q, o)$. Note that the policy distribution is modeled in an autoregressive manner. In this setting, we can leverage policy gradient methods to optimize the policy.
3.2 Group Policy Gradient
Our proposed method, Group Policy Gradient (GPG), is designed to address the high variance of policy gradient estimation in the absence of a value model. By leveraging group-level rewards, GPG stabilizes learning and enhances the robustness of reinforcement learning training. Specifically, GPG utilizes the reward statistics within each group to normalize individual rewards, effectively reducing variance. This eliminates the need for a traditional value model, simplifying the training process and improving computational efficiency. The name "Group Policy Gradient" reflects our method’s core mechanism of using group-level rewards to stabilize and optimize learning.
The core objective of GPG is defined as:

$$\mathcal{L}_{\mathrm{GPG}}(\theta) = \mathbb{E}_{q \sim P(Q),\, \{o_i\}_{i=1}^{G} \sim \pi_\theta(\cdot \mid q)} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \Big( -\hat{A}_i \log \pi_\theta(o_{i,t} \mid q, o_{i,<t}) \Big) \right], \tag{5}$$

where $o_i$ represents the $i$-th individual response in the group $\{o_1, \dots, o_G\}$, and the advantage $\hat{A}_i$ of the $i$-th response is calculated by normalizing the group-level rewards $\{R_1, \dots, R_G\}$:

$$\hat{A}_i = \frac{R_i - \operatorname{mean}\!\big(\{R_1, \dots, R_G\}\big)}{\operatorname{std}\!\big(\{R_1, \dots, R_G\}\big)}. \tag{6}$$
This normalization technique plays a pivotal role in reducing variance by ensuring that individual rewards are considered in the context of group dynamics, thereby fostering more stable and efficient learning even without a value model. The GPG approach demonstrates that leveraging structured reward systems can yield significant improvements in reinforcement learning performance, highlighting the potential for future developments that minimize dependency on traditional models while maintaining robust training efficacy.
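A minimal PyTorch-style sketch of this objective is given below, assuming that per-token log-probabilities of the sampled responses are available. The names (`gpg_loss`, `logprobs`, `rewards`), the per-response token averaging, and the small `eps` in the denominator are our own illustrative choices; the released code is authoritative for the exact implementation.

```python
import torch

def gpg_loss(logprobs: list[torch.Tensor], rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Group Policy Gradient loss for one group of G sampled responses.

    logprobs: list of G tensors; logprobs[i] holds log pi_theta(o_{i,t} | q, o_{i,<t})
              for every token t of the i-th sampled response.
    rewards:  shape (G,), the scalar reward R_i of each response.
    """
    # Group-normalized advantage (Eq. 6): compare each reward to the group statistics.
    advantages = (rewards - rewards.mean()) / (rewards.std() + eps)

    # Plain policy-gradient term (Eq. 5): no clipping, no KL penalty, no critic model.
    per_response = torch.stack([
        -(adv * lp).mean()  # -A_i * average token log-probability of response i
        for adv, lp in zip(advantages, logprobs)
    ])
    return per_response.mean()
```

Since `rewards` carries no gradient, backpropagating this loss yields the policy gradient of Eq. (5) with the group-normalized advantages acting as fixed weights.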
Method | Value Models | Reference Models | Surrogate Loss | Policy Constraint
PPO | ✓ | ✓ | ✓ | ✓ |
GRPO | ✗ | ✓ | ✓ | ✓ |
TRPO | ✓ | ✗ | ✓ | ✓ |
GPG | ✗ | ✗ | ✗ | ✗ |
RL algorithms vary significantly in their approaches to tackling variance and optimizing policies. Two key components in many RL algorithms are surrogate loss and policy constraints. Surrogate loss functions help stabilize training by providing a proxy objective that approximates the original goal but is easier to optimize. For example, in PPO [42], the surrogate loss is designed to prevent large updates to the policy, making training more stable by maintaining a balance between exploration and exploitation. (See Appendix B for detailed derivations.) However, surrogate loss functions can sometimes limit the flexibility of the policy updates, potentially leading to sub-optimal solutions. Policy constraints, such as KL divergence limits, ensure that the policy does not change too drastically between updates. These constraints are critical for maintaining stability in training, as they prevent erratic behavior and large deviations from known good policies. However, overly restrictive policy constraints can hinder the agent’s ability to explore new, potentially better strategies.
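To make these two ingredients concrete, the sketch below combines a clipped surrogate term with a KL penalty toward a reference policy. It illustrates the generic pattern discussed here (with the k3-style KL estimator as an assumed choice), not the exact loss of any specific implementation.

```python
import torch

def clipped_surrogate_with_kl(logp_new, logp_old, logp_ref, advantages,
                              clip_eps: float = 0.2, kl_coef: float = 0.04) -> torch.Tensor:
    """Generic PPO/GRPO-style loss: clipped surrogate plus a KL-based policy constraint."""
    ratio = torch.exp(logp_new - logp_old)                      # pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    surrogate = torch.min(unclipped, clipped).mean()            # conservative proxy objective

    # Unbiased (k3) estimate of KL(pi_theta || pi_ref), acting as the policy constraint.
    log_ratio_ref = logp_ref - logp_new
    kl = (torch.exp(log_ratio_ref) - log_ratio_ref - 1).mean()

    return -(surrogate - kl_coef * kl)                          # loss = negative objective
```

GPG removes both ingredients and keeps only the product of the group-normalized advantage and the log-probability.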
GRPO [43] introduces the use of reference models and policy constraints to stabilize training. The reference model serves as a foundation to provide additional guidance and regularization, ensuring more consistent and robust learning by minimizing variance in policy updates. Policy constraints further aid in preventing drastic policy changes, contributing to the stability of the learning process. Full derivations can be found in the Appendix B. Conversely, the introduction of GPG aims to offer greater flexibility by removing both the reference model and policy constraints. This approach fosters a more open exploration of the solution space, allowing the RL agent to develop novel strategies to achieve optimal results without the constraints imposed by prior models or divergence measures.
Overall, GPG represents an innovative step forward, integrating the benefits of reward normalization to enhance learning efficiency and adaptability, setting the stage for future advances in reinforcement learning strategies that emphasize flexible exploration and learning dynamics.
4 Experiments
All experimental settings were meticulously controlled to ensure fair comparisons. We adhered closely to the hyperparameters employed by GRPO, despite their suboptimality for our approach. Notably, our method consistently outperformed GRPO across all tasks, achieving superior performance with clear margins. These results underscore the robustness and efficacy of our proposed method.
4.1 Experimental Setup
4.1.1 Dataset and Benchmarks
For the unimodal scenario, we utilize open-s1, open-rs [9] and MATH-lighteval [17] datasets as our training data. These datasets encompass a wide range of problem types and difficulty levels. To assess the reasoning capabilities of the models, we employ five distinct mathematics-focused benchmark datasets: AIME24, MATH-500 [28; 17], AMC23, Minerva [26], and OlympiadBench [21].
In the multimodal case, we handle a variety of tasks. Specifically, for the visual reasoning task, we train on samples from the SAT dataset [38] and evaluate on the CV-Bench dataset [45]. For the geometry reasoning task, following R1-V [4], we train on samples from the GEOQA training set [4] and evaluate on the GEOQA test set [3]. For the classification and reasoning grounding tasks, we follow Visual-RFT: we conduct few-shot classification training on Flower102 [34], Pets37 [37], FGVCAircraft [32], and Cars196 [24], and additionally train on samples from the LISA training set [25]. All evaluations are carried out on the test sets associated with these training sets.
4.1.2 Implementation Details
Our approach is broadly applicable across a wide range of reinforcement learning tasks. To demonstrate its versatility and efficacy, we conduct experiments covering both unimodal and multimodal scenarios. All experiments are performed on NVIDIA H20 96G GPUs. For each experiment, we adhere strictly to the implementation of the original code base, ensuring consistent training and evaluation procedures. The GPG procedure is summarized in Algorithm 1, and more detailed settings are provided in Appendix A.
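As a rough outline of such a training step (Algorithm 1 is the authoritative description), the sketch below uses hypothetical `policy.sample`, `policy.logprobs`, and `reward_fn` interfaces in place of the actual code base:

```python
import torch

def gpg_training_step(policy, optimizer, questions, reward_fn, group_size: int = 8) -> float:
    """One GPG update: sample a group per question, score, normalize, and take a PG step."""
    losses = []
    for q in questions:
        # Roll out G responses per question and score each with the terminal reward only.
        responses = [policy.sample(q) for _ in range(group_size)]
        rewards = torch.tensor([reward_fn(q, o) for o in responses], dtype=torch.float32)
        advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-4)  # group normalization

        for o, adv in zip(responses, advantages):
            token_logprobs = policy.logprobs(q, o)  # log pi_theta(o_t | q, o_<t), with gradients
            losses.append(-(adv * token_logprobs).mean())

    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```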
4.2 Prompt and Reward Function
Prompt for Reasoning. In the process of reinforcement fine-tuning, specific instructions are incorporated into the system prompt. These instructions encourage the model to generate intermediate reasoning steps, thereby facilitating the reasoning capabilities of the model. An example of this approach is provided below [31]:
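A representative prompt of this kind, paraphrasing the R1-style template popularized by DeepSeek-R1 [16] and adopted by Visual-RFT [31] (the exact wording may differ across experiments), reads:

"A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The Assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>."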
Reward Function. For most tasks, we use accuracy and formatting reward functions. For the grounding task, an Intersection over Union (IoU) reward function is utilized. Illustrative sketches of these rewards follow the list below.
- Accuracy: If the model’s output is consistent with the ground truth, a reward of 1.0 is awarded.
- Formatting: If the format of the model output is “<think></think> <answer></answer>”, a reward of 1.0 is granted.
- IoU: Consistent with Visual-RFT [31], the reward value is derived from the calculated scores of the bounding boxes generated by the model.
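The sketches below illustrate how such rewards could be implemented; the answer-extraction and matching rules are simplified stand-ins for the ones used in the actual code.

```python
import re

def format_reward(output: str) -> float:
    """1.0 if the output wraps reasoning and answer in the expected tags, else 0.0."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, output.strip(), flags=re.DOTALL) else 0.0

def accuracy_reward(output: str, ground_truth: str) -> float:
    """1.0 if the extracted answer matches the ground truth (exact string match as a stand-in)."""
    match = re.search(r"<answer>(.*?)</answer>", output, flags=re.DOTALL)
    prediction = match.group(1).strip() if match else output.strip()
    return 1.0 if prediction == ground_truth.strip() else 0.0

def iou_reward(pred_box, gt_box) -> float:
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(pred_box[0], gt_box[0]), max(pred_box[1], gt_box[1])
    ix2, iy2 = min(pred_box[2], gt_box[2]), min(pred_box[3], gt_box[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (pred_box[2] - pred_box[0]) * (pred_box[3] - pred_box[1])
    area_g = (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1])
    union = area_p + area_g - inter
    return inter / union if union > 0 else 0.0
```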
Models | Average | AIME24 | MATH-500 | AMC23 | Minerva | OlympiadBench |
Base Model | 48.9 | 28.8 | 82.8 | 62.9 | 26.5 | 43.3 |
+ GRPO | 53.1 | 33.3 | 83.8 | 67.5 | 29.8 | 50.9 |
+ GPG | 55.7 | 33.3 | 87.6 | 77.5 | 29.4 | 50.5 |
Models | Average | AIME24 | MATH-500 | AMC23 | Minerva | OlympiadBench |
Qwen2.5-Math-7B | 30.9 | 13.3 | 57.6 | 45.0 | 14.7 | 23.7 |
+ GRPO | 43.7 | 16.7 | 73.4 | 62.5 | 30.2 | 35.7 |
+ Dr. GRPO [30] | 43.7 | 26.7 | 74.6 | 50.0 | 30.1 | 37.3 |
+ GPG | 45.3 | 23.3 | 73.6 | 60.0 | 30.5 | 39.3 |
Model | Average | AIME24 | MATH-500 | AMC23 | Minerva | OlympiadBench |
General Models | ||||||
Llama-3.1-70B-Instruct | 35.7 | 16.7 | 64.6 | 30.1 | 35.3 | 31.9 |
rStar-Math-7B | - | 26.7 | 78.4 | 47.5 | - | 47.1
Eurus-2-7B-PRIME | 48.9 | 26.7 | 79.2 | 57.8 | 38.6 | 42.1
1.5B Models | ||||||
DeepSeek-R1-Distill-Qwen-1.5B | 48.9 | 28.8 | 82.8 | 62.9 | 26.5 | 43.3 |
Still-3-1.5B-Preview | 51.6 | 32.5 | 84.4 | 66.7 | 29.0 | 45.4 |
Open-RS1 * | 53.1 | 33.3 | 83.8 | 67.5 | 29.8 | 50.9 |
Open-RS3 * | 52.0 | 26.7 | 85.4 | 70.0 | 27.9 | 50.2 |
GPG-RS1 | 55.7 | 33.3 | 87.6 | 77.5 | 29.4 | 50.5 |
GPG-RS3 | 55.5 | 33.3 | 85.0 | 80.0 | 26.8 | 52.4 |
4.3 Main Results
4.3.1 Unimodal Task
Mathematical Reasoning. We systematically evaluate the performance of GPG across five benchmark datasets, as presented in Table 2. Initially, we evaluate the DeepSeek-R1-Distill-Qwen-1.5B model, establishing it as our baseline with an average accuracy of 48.9%. Compared to GRPO [43], our GPG approach exhibits a substantial performance enhancement, achieving an average improvement of 2.6%. Overall, the GPG model attains an average accuracy of 55.7%. Significant improvements over GRPO are evident on the MATH-500 and AMC23 benchmarks, with accuracy gains of 3.8% and 10.0%, respectively. These results highlight the effectiveness of GPG in enhancing mathematical reasoning capabilities.
Additionally, we compare GPG against GRPO [43] on the larger Qwen2.5-Math-7B model, as detailed in Table 3. GPG achieves a higher average accuracy of 45.3%, an improvement of 1.6% over GRPO. This enhancement illustrates the improved reasoning capabilities provided by GPG, which is particularly notable on benchmarks such as AIME24 (+6.6%) and OlympiadBench [21] (+3.6%). Moreover, GPG achieves an average improvement of 1.6% over Dr. GRPO, with a 10.0% increase on AMC23 and a 2.0% gain on OlympiadBench [21]. These advancements highlight the utility of GPG in RL training for critical mathematical reasoning tasks.
As illustrated in Table 4, our models outperform most baselines [15; 8; 9]: GPG-RS1 and GPG-RS3 achieve average scores of 55.7% and 55.5%, compared to 53.1% for Open-RS1 and 52.0% for Open-RS3. GPG-RS1 and GPG-RS3 also show strong results on AMC23, with scores of 77.5% and 80.0%, clearly surpassing Open-RS1 (67.5%) and Open-RS3 (70.0%). Both models demonstrate competitive performance across the benchmarks, particularly excelling on MATH-500 [28; 17] with scores of 87.6% and 85.0%, and on OlympiadBench [21] with scores of 50.5% and 52.4%.
Models | Total | Count | Relation | Depth | Distance |
Qwen2-VL-2B | 31.38 | 54.69 | 22.46 | 0.16 | 31.66 |
SFT | 57.84 | 60.02 | 68.92 | 55.00 | 45.83 |
GRPO | 59.47 | 59.64 | 66.76 | 54.16 | 56.66 |
GPG | 69.11 | 65.36 | 82.62 | 67.83 | 60.67 |
Models | GEOQA |
Qwen2.5-VL-3B-Instruct | 35.41 |
GRPO | 47.48 |
GPG | 50.80 |
Models | mIoU | mIoU | gIoU |
Qwen2-VL-2B | 26.9 | 30.1 | 25.3 |
SFT | 28.3 | 29.7 | 25.3 |
GRPO | 37.6 | 34.4 | 34.4 |
GPG | 51.5 | 53.4 | 49.5 |
Models | Average | Flower102 [34] | Pets37 [37] | FGVC [32] | Cars196 [24] |
Qwen2-VL-2B | 56.0 | 54.8 | 66.4 | 45.9 | 56.8 |
SFT | 55.6 | 58.5 | 55.5 | 67.9 | 40.5 |
GRPO | 81.9 | 71.4 | 86.1 | 74.8 | 95.3 |
GPG | 86.0 | 73.0 | 87.1 | 86.8 | 97.1 |
4.3.2 Multimodal Task
Visual Reasoning. We first evaluate the GPG method on the CV-Bench [45] visual reasoning dataset, strictly adhering to the parameter settings of VisualThinker-R1-Zero. As illustrated in Table 5, the GPG method demonstrates a significant improvement in performance: it attains a score of 69.11% on CV-Bench, an increase of 9.64 percentage points over the 59.47% achieved by GRPO.
Geometry Reasoning. In addition to visual reasoning, MLLMs exhibit notable proficiency in geometry reasoning. To evaluate the efficacy of the GPG method in this domain, we adopt an experimental setup similar to that of R1-V [4] on the GEOQA [3] dataset. The results, presented in Table 7, indicate that the GPG method achieves a score of 50.80%, surpassing GRPO's score of 47.48% by 3.32 percentage points. This demonstrates the superior performance of the GPG method in addressing complex geometric reasoning tasks.
Classification. Beyond reasoning tasks, we also assess the improvement of the GPG method over GRPO on image perception tasks. As shown in Table 8, the GPG method achieves an average score of 86.0% across four classification datasets, surpassing GRPO by 4.1 percentage points. Moreover, our method consistently produces improvements on all four classification datasets, underscoring its superiority in image perception tasks.
Reasoning Grounding. The final critical aspect of evaluating MLLMs involves precisely identifying objects according to user requirements. To this end, we employ the Qwen2-VL-2B model for grounding tasks on the LISA dataset [25], with the results presented in Table 7. Compared to GRPO, the GPG approach improves all metrics by more than 10 percentage points. This significant improvement underscores the superiority of the GPG method in object localization, leading to considerable advancements in reasoning and perception capabilities.
[Figure 2: Comparison of the reasoning processes of GPG and GRPO on a case study.]
[Figure 3: Real-time training metrics during GPG training.]
4.4 Ablation Study and Discussion
Case Study and Training Analysis. We present the reasoning processes of GPG and GRPO, as illustrated in Fig. 2. Compared to GRPO, the GPG approach demonstrates a more comprehensive and accurate reasoning capability, whereas GRPO exhibits errors in formula analysis. Consequently, GPG arrived at the correct solution, while GRPO produced an incorrect result. In Fig. 3, we present a range of real-time training metrics to illustrate the effectiveness of GPG as a straightforward yet strong RL algorithm.
Sensitivity on Group Size. We study the effect of the number of generations within a group. As shown in Table 9, increasing the group size from 2 to 16 leads to progressive improvements across most metrics. Specifically, the Average performance improves steadily with larger group sizes. We choose 8 to achieve a good tradeoff between training cost and performance.
Group Number | Average | AIME24 | MATH-500 | AMC23 | Minerva | OlympiadBench |
2 | 41.9 | 16.7 | 71.6 | 60.0 | 25.0 | 36.0 |
4 | 43.3 | 20.0 | 73.2 | 55.0 | 29.8 | 38.5 |
8 | 45.3 | 23.3 | 73.6 | 60.0 | 30.5 | 39.3 |
16 | 47.3 | 26.7 | 74.6 | 65.0 | 32.4 | 37.8 |
Reward Normalization. We study the role of reward normalization and report the results in Table 10. Normalizing rewards within a batch is common practice in RL training [1]. The experimental results show that reward normalization within a group is better than normalization across the batch; a sketch contrasting the two strategies follows Table 10.
Strategy | Average | AIME24 | MATH-500 | AMC23 | Minerva | OlympiadBench
RN Group | 45.3 | 23.3 | 73.6 | 60.0 | 30.5 | 39.3
RN Batch | 44.9 | 23.3 | 72.2 | 55.0 | 35.3 | 38.5
+ KL (β=0.04) | 43.7 | 20.0 | 73.8 | 57.5 | 29.4 | 37.6
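A minimal sketch contrasting the two normalization strategies in Table 10, with illustrative tensor shapes and names:

```python
import torch

def normalize_rewards(rewards: torch.Tensor, per_group: bool = True, eps: float = 1e-4) -> torch.Tensor:
    """rewards has shape (num_groups, group_size); each row is one question's group."""
    if per_group:
        # "RN Group": statistics computed within each group of responses.
        mean = rewards.mean(dim=1, keepdim=True)
        std = rewards.std(dim=1, keepdim=True)
    else:
        # "RN Batch": statistics shared across the whole batch of responses.
        mean = rewards.mean()
        std = rewards.std()
    return (rewards - mean) / (std + eps)
```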
KL Constraint. In principle, our method is designed to optimize the original reinforcement learning (RL) objective directly, without imposing any distributional constraint, which may at first seem unusual. We therefore conducted an ablation study to evaluate the impact of adding such a constraint; the results are presented in Table 10. Our findings indicate that incorporating the constraint negatively impacts performance.
Comparison with Various RL Methods. We summarize the differences between GPG and other RL methods as simply as possible. As shown in Table 11, the loss of GPG includes neither the “CLIP term” nor the “KL divergence”; its form and computation are the simplest, and, as discussed in Section 4.3, its performance surpasses that of the other methods.
4.5 Broader Impact
Achieving advanced general intelligence critically depends on augmenting the reasoning capabilities of models, with efficient and scalable reinforcement learning methods serving as a cornerstone. Our proposed approach investigates a minimalist strategy that aims to enhance reasoning capacity through simplicity and efficiency, thereby potentially facilitating the development of scalable systems. However, given the constraints of our computational budget, we have not evaluated our method on extremely large models.
5 Conclusion
In this paper, we introduce GPG, which effectively addresses the critical challenges inherent in reinforcement fine-tuning approaches such as PPO and GRPO. By directly incorporating group-based decision dynamics into the standard PG method, GPG simplifies the training process and significantly reduces computational overhead without sacrificing reasoning quality. This breakthrough provides a more efficient framework for training advanced LLMs capable of complex reasoning, thereby contributing to more resource-effective and scalable artificial intelligence systems.
References
- [1] Marcin Andrychowicz, Anton Raichuk, Piotr Stańczyk, Manu Orsini, Sertan Girgin, Raphaël Marinier, Leonard Hussenot, Matthieu Geist, Olivier Pietquin, Marcin Michalski, et al. What matters for on-policy deep actor-critic methods? a large-scale study. In International conference on learning representations, 2021.
- [2] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
- [3] Jiaqi Chen, Jianheng Tang, Jinghui Qin, Xiaodan Liang, Lingbo Liu, Eric P. Xing, and Liang Lin. Geoqa: A geometric question answering benchmark towards multimodal numerical reasoning, 2022.
- [4] Liang Chen, Lei Li, Haozhe Zhao, Yifan Song, and Vinci. R1-v: Reinforcing super generalization ability in vision-language models with less than $3. https://github.com/Deep-Agent/R1-V, 2025. Accessed: 2025-02-02.
- [5] Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024.
- [6] Xiangxiang Chu. Policy optimization with penalized point probability distance: An alternative to proximal policy optimization. arXiv preprint arXiv:1807.00442, 2018.
- [7] Xiangxiang Chu, Limeng Qiao, Xinyu Zhang, Shuang Xu, Fei Wei, Yang Yang, Xiaofei Sun, Yiming Hu, Xinyang Lin, Bo Zhang, et al. Mobilevlm v2: Faster and stronger baseline for vision language model. arXiv preprint, 2024.
- [8] Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, et al. Process reinforcement through implicit rewards. arXiv preprint arXiv:2502.01456, 2025.
- [9] Quy-Anh Dang and Chris Ngo. Reinforcement learning for reasoning in small llms: What works and what doesn’t, 2025.
- [10] Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry Rudolph, and Aleksander Madry. Implementation matters in deep rl: A case study on ppo and trpo. In International conference on learning representations, 2019.
- [11] Hugging Face. Open r1: A fully open reproduction of deepseek-r1. https://github.com/huggingface/open-r1, 2025. Accessed: January 2025.
- [12] Xidong Feng, Ziyu Wan, Muning Wen, Stephen Marcus McAleer, Ying Wen, Weinan Zhang, and Jun Wang. Alphazero-like tree-search can guide large language model decoding and training, 2024.
- [13] Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Chenghao Ma, Liang Chen, Runxin Xu, et al. Omni-math: A universal olympiad level mathematic benchmark for large language models. arXiv preprint arXiv:2410.07985, 2024.
- [14] Google. Gemini: A family of highly capable multimodal models, 2023.
- [15] Xinyu Guan, Li Lyna Zhang, Yifei Liu, Ning Shang, Youran Sun, Yi Zhu, Fan Yang, and Mao Yang. rstar-math: Small llms can master math reasoning with self-evolved deep thinking. arXiv preprint arXiv:2501.04519, 2025.
- [16] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- [17] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021.
- [18] Chloe Ching-Yun Hsu, Celestine Mendler-Dünner, and Moritz Hardt. Revisiting design choices in proximal policy optimization. arXiv preprint arXiv:2009.10897, 2020.
- [19] Jian Hu. Reinforce++: A simple and efficient approach for aligning large language models, 2025.
- [20] Hailang Huang, Yong Wang, Zixuan Huang, Huaqiu Li, Tongwen Huang, Xiangxiang Chu, and Richong Zhang. Mmgenbench: Fully automatically evaluating lmms from the text-to-image generation perspective, 2025.
- [21] Zhen Huang, Zengzhi Wang, Shijie Xia, Xuefeng Li, Haoyang Zou, Ruijie Xu, Run-Ze Fan, Lyumanshan Ye, Ethan Chern, Yixin Ye, et al. Olympicarena: Benchmarking multi-discipline cognitive reasoning for superintelligent ai. Advances in Neural Information Processing Systems, 37:19209–19253, 2024.
- [22] LI Jia, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Costa Huang, Kashif Rasul, Longhui Yu, Albert Jiang, Ziju Shen, et al. Numinamath, 2024.
- [23] Takeshi Kojima, Shixiang (Shane) Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 22199–22213. Curran Associates, Inc., 2022.
- [24] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In Proceedings of the IEEE international conference on computer vision workshops, pages 554–561, 2013.
- [25] Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9579–9589, 2024.
- [26] Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35:3843–3857, 2022.
- [27] Ziniu Li, Tian Xu, Yushun Zhang, Zhihang Lin, Yang Yu, Ruoyu Sun, and Zhi-Quan Luo. Remax: A simple, effective, and efficient reinforcement learning method for aligning large language models, 2024.
- [28] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, 2023.
- [29] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023.
- [30] Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783, 2025.
- [31] Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning. arXiv preprint arXiv:2503.01785, 2025.
- [32] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.
- [33] Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025.
- [34] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian conference on computer vision, graphics & image processing, pages 722–729. IEEE, 2008.
- [35] OpenAI. Gpt-4v(ision) system card, 2023.
- [36] OpenAI. Learning to reason with llms, 2024.
- [37] Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In 2012 IEEE conference on computer vision and pattern recognition, pages 3498–3505. IEEE, 2012.
- [38] Arijit Ray, Jiafei Duan, Reuben Tan, Dina Bashkirova, Rose Hendrix, Kiana Ehsani, Aniruddha Kembhavi, Bryan A. Plummer, Ranjay Krishna, Kuo-Hao Zeng, and Kate Saenko. Sat: Spatial aptitude training for multimodal language models, 2024.
- [39] Laria Reynolds and Kyle McDonell. Prompt programming for large language models: Beyond the few-shot paradigm. In Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, CHI EA ’21, New York, NY, USA, 2021. Association for Computing Machinery.
- [40] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International conference on machine learning, pages 1889–1897. PMLR, 2015.
- [41] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation, 2018.
- [42] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- [43] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- [44] Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.
- [45] Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, Ziteng Wang, Rob Fergus, Yann LeCun, and Saining Xie. Cambrian-1: A fully open, vision-centric exploration of multimodal llms, 2024.
- [46] Trieu Trinh, Yuhuai Wu, Quoc Le, He He, and Thang Luong. Solving olympiad geometry without human demonstrations. Nature, 2024.
- [47] Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, et al. Visionllm: Large language model is also an open-ended decoder for vision-centric tasks. arXiv preprint, 2023.
- [48] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.
- [49] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8:229–256, 1992.
- [50] Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, et al. Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding. arXiv preprint arXiv:2412.10302, 2024.
- [51] Huajian Xin, Z. Z. Ren, Junxiao Song, Zhihong Shao, Wanjia Zhao, Haocheng Wang, Bo Liu, Liyue Zhang, Xuan Lu, Qiushi Du, Wenjun Gao, Qihao Zhu, Dejian Yang, Zhibin Gou, Z. F. Wu, Fuli Luo, and Chong Ruan. Deepseek-prover-v1.5: Harnessing proof assistant feedback for reinforcement learning and monte-carlo tree search, 2024.
- [52] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in neural information processing systems, 36:11809–11822, 2023.
- [53] Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800, 2024.
- [54] Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, and Pengfei Liu. Limo: Less is more for reasoning, 2025.
- [55] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Weinan Dai, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, Wei-Ying Ma, Ya-Qin Zhang, Lin Yan, Mu Qiao, Yonghui Wu, and Mingxuan Wang. Dapo: An open-source llm reinforcement learning system at scale, 2025.
- [56] Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. STar: Bootstrapping reasoning with reasoning. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022.
- [57] Jingyi Zhang, Jiaxing Huang, Huanjin Yao, Shunyu Liu, Xikun Zhang, Shijian Lu, and Dacheng Tao. R1-vl: Learning to reason with multimodal large language models via step-wise group relative policy optimization. arXiv preprint arXiv:2503.12937, 2025.
- [58] Rui Zheng, Shihan Dou, Songyang Gao, Wei Shen, Binghai Wang, Yan Liu, Senjie Jin, Qin Liu, Limao Xiong, Lu Chen, Zhiheng Xi, Yuhao Zhou, Nuo Xu, Wenbin Lai, Minghao Zhu, Rongxiang Weng, Wensen Cheng, Cheng Chang, Zhangyue Yin, Yuan Hua, Haoran Huang, Tianxiang Sun, Hang Yan, Tao Gui, Qi Zhang, Xipeng Qiu, and Xuanjing Huang. Secrets of rlhf in large language models part i: Ppo, 2023.
- [59] Hengguang Zhou, Xirui Li, Ruochen Wang, Minhao Cheng, Tianyi Zhou, and Cho-Jui Hsieh. R1-zero’s "aha moment" in visual reasoning on a 2b non-sft model, 2025.
Appendix A More Implementation Details
To evaluate the unimodal reasoning capabilities of our proposed method, we utilize two publicly available code repositories: Open-r1 [11] and Open-rs [9]. These repositories are selected for their extensive coverage of reasoning scenarios and the substantial challenges they pose, which effectively assess the reasoning capabilities of advanced models. The DeepSeek-R1-Distill-Qwen-1.5B model is trained for 100 and 50 global steps on the open-s1 and open-rs datasets, as reported in the repository [9], resulting in the GPG-RS1 and GPG-RS3 models, respectively. Both models are then evaluated across the five established benchmarks. Additionally, the Qwen2.5-Math-7B model is trained on 7,500 samples from the MATH-lighteval dataset, based on the Open-r1 [11] code base.
For multimodal tasks, we select three well-known frameworks as our code base: VisualThinker-R1-Zero [59], R1-V [4], and Visual-RFT [31]. These frameworks cover a variety of tasks, including visual reasoning, geometric reasoning, and image perception. Using distinct code bases enables a comprehensive assessment of the performance gains achieved by our method across different tasks. Specifically, within the VisualThinker-R1-Zero framework, we evaluate the GPG approach on CV-Bench [45]. We further evaluate GPG on the GEOQA dataset [3] based on R1-V. Finally, for image perception tasks such as classification [34; 37; 32; 24] and reasoning grounding [25], we examine the performance of GPG using the Visual-RFT framework.
Appendix B More Related Work
Proximal Policy Optimization. PPO [42] addresses the inherent optimization instability of Trust Region Policy Optimization (TRPO) [40] through a clipped surrogate objective. Formally, let the probability ratio between the updated policy $\pi_\theta$ and the previous policy $\pi_{\theta_{\mathrm{old}}}$ be defined as

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}, \tag{7}$$

where $a_t$ and $s_t$ denote the action and state at timestep $t$, respectively. While TRPO maximizes the surrogate objective

$$L^{\mathrm{TRPO}}(\theta) = \mathbb{E}_t\big[ r_t(\theta)\, \hat{A}_t \big] \tag{8}$$

under a Kullback-Leibler (KL) divergence constraint, PPO reformulates this via a clipped mechanism. Here, $\hat{A}_t$ represents the estimated advantage function quantifying the relative value of action $a_t$ in state $s_t$. The PPO objective is defined as:

$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\Big[ \min\!\big( r_t(\theta)\, \hat{A}_t,\; \operatorname{clip}\!\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_t \big) \Big], \tag{9}$$

where the clip operator restricts $r_t(\theta)$ to the interval $[1-\epsilon, 1+\epsilon]$, with $\epsilon$ being a hyperparameter controlling the policy update magnitude. This constraint prevents excessive policy deviations that could degrade performance.
To further stabilize training and promote exploration, the composite objective incorporates three components: 1) the clipped policy gradient term $L^{\mathrm{CLIP}}(\theta)$; 2) a value function loss:

$$L^{\mathrm{VF}}(\theta) = \mathbb{E}_t\Big[ \big( V_\theta(s_t) - V_t^{\mathrm{target}} \big)^2 \Big], \tag{10}$$

where $V_\theta(s_t)$ is the state-value function estimator and $V_t^{\mathrm{target}}$ denotes the target value computed via temporal-difference methods; and 3) entropy regularization:

$$S[\pi_\theta](s_t) = -\sum_{a \in \mathcal{A}} \pi_\theta(a \mid s_t) \log \pi_\theta(a \mid s_t), \tag{11}$$

with $\mathcal{A}$ being the action space, which prevents premature policy convergence by encouraging stochasticity.

The complete objective integrates these terms as:

$$L^{\mathrm{PPO}}(\theta) = \mathbb{E}_t\Big[ L^{\mathrm{CLIP}}(\theta) - c_1 L^{\mathrm{VF}}(\theta) + c_2 S[\pi_\theta](s_t) \Big], \tag{12}$$

where $c_1$ and $c_2$ are coefficients balancing policy optimization, value estimation accuracy, and exploration. Crucially, PPO replaces TRPO’s computationally intensive second-order KL constraints with first-order gradient clipping, enabling efficient large-scale implementations while preserving monotonic policy improvement guarantees, as rigorously established through surrogate objective monotonicity analysis [18].
Group Relative Policy Optimization. GRPO [43] establishes a policy gradient framework that eliminates dependency on explicit value function approximation through comparative advantage estimation within response groups. The method operates by sampling multiple candidate outputs for each input question and constructing advantage signals based on relative rewards within these groups. For a given question $q$, the algorithm generates $G$ responses $\{o_1, \dots, o_G\}$ from the current policy $\pi_{\theta_{\mathrm{old}}}$, then computes token-level advantages using intra-group reward comparisons.

The advantage term for the $t$-th token of the $i$-th response is defined as the deviation from the group average reward:

$$\hat{A}_{i,t} = \frac{r_i - \operatorname{mean}\!\big(\{r_1, \dots, r_G\}\big)}{\operatorname{std}\!\big(\{r_1, \dots, r_G\}\big)}, \tag{13}$$

where $r_i$ denotes the reward model’s evaluation of response $o_i$. This design inherently aligns with the comparative training paradigm of reward models, which typically learn from pairwise response rankings.
The optimization objective integrates clipped probability ratios with explicit KL regularization. Defining the token-level probability ratio as:

$$r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q, o_{i,<t})}, \tag{14}$$

the clipped surrogate objective constrains policy updates through:

$$\mathcal{J}_{i,t}^{\mathrm{clip}}(\theta) = \min\!\Big( r_{i,t}(\theta)\, \hat{A}_{i,t},\; \operatorname{clip}\!\big(r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_{i,t} \Big). \tag{15}$$
Diverging from PPO’s implicit KL control via reward shaping, GRPO directly regularizes policy divergence using an unbiased KL estimator:

$$\mathbb{D}_{\mathrm{KL}}\big[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big] = \frac{\pi_{\mathrm{ref}}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} - \log \frac{\pi_{\mathrm{ref}}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} - 1. \tag{16}$$

The complete objective combines these components with a regularization coefficient $\beta$:

$$\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \Big( \mathcal{J}_{i,t}^{\mathrm{clip}}(\theta) - \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big] \Big) \right]. \tag{17}$$
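For contrast with the GPG objective in Section 3.2, the sketch below expresses Eqs. (14)–(17) as a simplified token-level loss over flattened tensors; it is an illustration of the formulation above, not the reference GRPO implementation.

```python
import torch

def grpo_token_loss(logp_new, logp_old, logp_ref, advantages,
                    clip_eps: float = 0.2, beta: float = 0.04) -> torch.Tensor:
    """Token-level GRPO objective: clipped surrogate minus a KL penalty (Eqs. 14-17).

    All inputs are 1-D tensors over the tokens of a group of responses;
    `advantages` repeats the group-normalized advantage of each response over its tokens.
    """
    ratio = torch.exp(logp_new - logp_old)                                   # Eq. (14)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    surrogate = torch.min(ratio * advantages, clipped * advantages)          # Eq. (15)

    log_ratio_ref = logp_ref - logp_new
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1                        # Eq. (16)

    return -(surrogate - beta * kl).mean()                                   # maximize Eq. (17)
```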