Reasoning Under 1 Billion: Memory-Augmented Reinforcement Learning for Large Language Models

Hung Le, Dai Do, Dung Nguyen, and Svetha Venkatesh
Applied AI Institute, Deakin University, Australia
{thai.le,v.do,dung.nguyen,svetha.venkatesh}@deakin.edu.au
Abstract

Recent advances in fine-tuning large language models (LLMs) with reinforcement learning (RL) have shown promising improvements in complex reasoning tasks, particularly when paired with chain-of-thought (CoT) prompting. However, these successes have been largely demonstrated on large-scale models with billions of parameters, where a strong pretraining foundation ensures effective initial exploration. In contrast, RL remains challenging for tiny LLMs with 1 billion parameters or fewer because they lack the necessary pretraining strength to explore effectively, often leading to suboptimal reasoning patterns. This work introduces a novel intrinsic motivation approach that leverages episodic memory to address this challenge, improving tiny LLMs on CoT reasoning tasks. Inspired by human memory-driven learning, our method exploits successful reasoning patterns stored in memory while allowing controlled exploration to generate novel responses. Intrinsic rewards are computed efficiently with a kNN-based episodic memory, allowing the model to discover new reasoning strategies while quickly adapting to effective past solutions. Experiments on fine-tuning with the GSM8K and AI-MO datasets demonstrate that our approach significantly enhances smaller LLMs’ sample efficiency and generalization capability, making RL-based reasoning improvements more accessible in low-resource settings.

1 Introduction

Large Language Models (LLMs) have demonstrated remarkable advancements in reasoning and problem-solving, driven by innovations in scaling strategies and training techniques Google (2024); OpenAI (2024a). Despite its foundational role in defining LLM capability, scaling pre-training is prohibitively expensive and tends to plateau Xia et al. (2023); Hong et al. (2023). As a result, post-training has become increasingly important, offering improvements in alignment, reasoning depth, and downstream task efficiency Kumar et al. (2025). Among post-training approaches, reinforcement learning (RL) fine-tuning is a promising alternative to expensive test-time search methods for LLMs, such as MCTS or Beam Search Yao et al. (2023); Feng et al. (2023); Snell et al. (2024). RL directly instills chain-of-thought (CoT) reasoning strategies into the model, enabling efficient deployment. Recent works like DeepSeek-R1 Guo et al. (2025) show that RL with simple outcome rewards can enhance reasoning without relying on heavy inference-time compute OpenAI (2024a, b) or complicated process-based rewards Lightman et al. (2023); Zhang et al.

However, these benefits have been observed mainly in large models (8B–670B) Guo et al. (2025); Arora and Zanette (2025); Yeo et al. (2025). In contrast, RL remains challenging for tiny LLMs, which we define as models with 1B parameters or fewer. These weak models frequently produce incorrect outputs during training, failing to receive any outcome reward. For example, a 0.5B model may repeatedly generate improperly formatted answers to math questions, failing to produce any valid outputs that qualify for a reward. As a result, reward signals are extremely sparse. A common mitigation is to use heuristic format-based rewards Guo et al. (2025). However, we argue that relying on format-based rewards can cause training collapse in tiny LLMs, as they may overfit to simple format patterns while neglecting the main task. Worse, exploration is ineffective—not only because small models choose poor actions but also because they lack an explicit exploration mechanism. Unlike RL agents, LLMs do not actively explore or exploit; they passively sample from learned distributions. As noted in Krishnamurthy et al. (2024), even large models struggle with exploration and exploitation; this issue becomes acute for small models. Finally, the lack of quality data in downstream tasks poses an additional challenge for training tiny LLMs with RL.

Drawing inspiration from the human brain’s episodic memory, which stores and retrieves experiences to guide learning McClelland et al. (1995), we introduce Memory-R+, a memory-augmented reinforcement learning framework designed to enhance CoT reasoning in tiny LLMs. To address the challenges of reward sparsity and insufficient exploration, we implement an intrinsic motivation mechanism that emulates the brain’s drive to seek successful outcomes (exploit) and avoid repeated errors (explore). This mechanism guides reasoning trajectories by leveraging two distinct episodic memory modules: one dedicated to storing successful reasoning traces and the other to capturing failed attempts. By employing nearest-neighbor estimation within a shared representation space, Memory-R+ derives performance-driven intrinsic rewards from the memory. This process mirrors how humans learn from near-correct attempts, allowing LLMs to refine their reasoning by aligning with successful patterns while avoiding detrimental exploration paths. This intrinsic motivation effectively addresses the limitations of sparse external rewards, providing a continuous learning signal based on past experiences.

Unlike traditional episodic control methods that rely on state-action-return associations for discrete action spaces Pritzel et al. (2017); Le et al. (2021, 2022); Do et al. (2024), Memory-R+ simplifies memory storage to input-output pairs, making it more suitable for LLM’s textual reasoning. Upon receiving a new query, the framework retrieves outputs from similar past instances by first encoding the query and searching for the top-k most similar queries in memory using cosine similarity. The corresponding response sets from the success and failure memories are then retrieved. The exploitation reward is computed by measuring the Euclidean distance between the generated response and the centroid of the successful response set, encouraging the model to align with generalizable successful patterns rather than memorizing specific past responses. In contrast, the exploration reward is derived from the maximum cosine similarity between the generated response and the stored failure responses, ensuring that the model discovers novel outputs differing from incorrect reasoning. To maintain stability in training, both rewards are normalized within a sliding window, adapting to the model’s recent performance trends.

To evaluate our approach, we conduct extensive experiments across several tiny LLMs on mathematical problem-solving. Our results demonstrate that Memory-R+ significantly improves reasoning accuracy and robustness compared to baseline RL and other handcrafted rewards. Moreover, analytical studies provide insights into training collapse and the impact of different memory configurations, highlighting the role of episodic memory in enhancing reasoning performance. In summary, our key contributions are as follows: (1) We pioneer an RL fine-tuning approach for tiny LLMs. (2) We introduce a memory-based intrinsic reward mechanism that teaches LLMs to explore and exploit. (3) We empirically identify and analyze training collapse issues when fine-tuning tiny LLMs with RL. (4) Extensive experiments on CoT reasoning tasks show that Memory-R+ outperforms other RL methods, significantly enhancing reasoning in small models.

The significance of our approach is that our method enables effective RL fine-tuning for models as small as 500M parameters—orders of magnitude smaller than current state-of-the-art LLMs used in RL-based reasoning research Guo et al. (2025). This dramatically lowers the barrier to entry for small research labs, academic groups, and companies with limited computing resources, making advanced reasoning capabilities more accessible.

2 Method

2.1 Intrinsic Reward Formulation for CoT Reasoning

When the LLM generates a response to a given query, it receives two forms of feedback: an outcome reward $R$ from an Answer Verifier that judges the correctness of the final answer extracted from the response, and an intrinsic reward $R_{\text{mem}}$ from memory that reflects how the response aligns with past successes and failures. We note that the Answer Verifier can only assess the final answer and cannot evaluate the quality of the reasoning chains in the response. Therefore, the intrinsic reward is expected to complement the Answer Verifier in providing useful training signals.

Our intrinsic reward balances exploration and exploitation by rewarding responses that resemble past successful reasoning trajectories while penalizing those similar to previously failed responses. This is achieved through a kNN-based memory system that quantifies the novelty and similarity of generated responses. Fig. 1 illustrates the overall design of Memory-R+.

Figure 1: Memory-R+ Architecture. Left: The LLM receives a query $q$ from the training dataset $D$ and generates multiple responses. For each response $a$, in addition to the outcome reward $R$ from an Answer Verifier, Memory-R+ introduces an intrinsic reward $R_{\text{mem}}$ based on episodic memory. Right: The query $q$ is used to query the failure memory $\mathcal{M}_f$ and the success memory $\mathcal{M}_s$ using kNN (red arrows), resulting in corresponding retrieved responses. The intrinsic reward $R_{\text{mem}}$ is computed by comparing the current response $a$ to the retrieved ones, encouraging novelty against failed responses (e.g., $a_{1,1}$, $a_{3,1}$, $a_{3,2}$) and rewarding similarity to successful ones (e.g., $a_{5,1}$, $a_{5,2}$, $a_{6,1}$, $a_{6,2}$).

2.2 Episodic Memory

Memory Formulation

We construct an episodic memory module $\mathcal{M}$ to store past reasoning trajectories, facilitating efficient retrieval of relevant experiences. To enable this retrieval, both queries and responses are encoded into a shared high-dimensional vector space using Enc, implemented as a pre-trained Sentence Transformer Reimers and Gurevych (2019):

$\mathbf{q}_i = \text{Enc}(q_i) \in \mathbb{R}^{d}, \quad \mathbf{a}_{i,j} = \text{Enc}(a_{i,j}) \in \mathbb{R}^{d}$ (1)

Here, each entry in the memory consists of the embeddings of a query $q_i$ and a set of associated responses $\{a_{i,j}\}$: $\mathcal{M} = \{(\mathbf{q}_i, \{\mathbf{a}_{i,j}\}_{j=1}^{L})\}_{i=1}^{N}$, where $N$ is the maximum number of stored queries and each query $q_i$ maintains at most $L$ associated responses. Memory retrieval is denoted as $\mathcal{M}[q_i] = \{\mathbf{a}_{i,j}\}_{j=1}^{L}$.
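For concreteness, the memory layout can be sketched in a few lines of Python. The encoder checkpoint (all-MiniLM-L6-v2), the container choices, and the class name below are illustrative assumptions rather than the exact implementation:

```python
from collections import OrderedDict

import numpy as np
from sentence_transformers import SentenceTransformer


class EpisodicMemory:
    """Stores at most N queries, each with at most L response embeddings (Eq. 1)."""

    def __init__(self, capacity_n: int, responses_l: int,
                 encoder_name: str = "all-MiniLM-L6-v2"):  # encoder choice is an assumption
        self.N, self.L = capacity_n, responses_l
        self.enc = SentenceTransformer(encoder_name)       # Enc(.) from Eq. 1
        self.store = OrderedDict()                         # query text -> (query emb., deque of response emb.)

    def encode(self, texts):
        # Unit-normalized embeddings so cosine similarity reduces to a dot product.
        return self.enc.encode(texts, normalize_embeddings=True)

    def retrieve(self, query: str) -> np.ndarray:
        # M[q_i] = {a_{i,j}}: embeddings of the responses stored for this query.
        _, responses = self.store[query]
        return np.stack(list(responses))
```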

Memory Writing

During training, we sample $G$ responses from the LLM for a given query. New query-response pairs are incorporated into memory following an update rule: (1) If $q$ is a novel query (i.e., not present in $\mathcal{M}$) and $|\mathcal{M}| \geq N$, the oldest stored query-response pair is evicted to maintain a fixed memory capacity. The new query and its corresponding responses are then inserted. (2) If $q$ already exists in $\mathcal{M}$, the new response set $\{a_j\}$ is merged with the existing responses. If the total number of responses exceeds $L$, the oldest responses are discarded to preserve memory constraints.

For guiding reinforcement learning, we maintain two episodic memory modules: one for storing successful responses, $\mathcal{M}^{s}$, and another for failed responses, $\mathcal{M}^{f}$. Given a query $q$ and a set of generated responses $\{a_j\}_{j=1}^{G}$, we update the memories as follows:

$\mathcal{M}^{s}[q] \leftarrow \mathcal{M}^{s}[q] \cup \{\mathbf{a}_{j} \mid R(q, a_{j}) > \tau^{s}\}, \quad \mathcal{M}^{f}[q] \leftarrow \mathcal{M}^{f}[q] \cup \{\mathbf{a}_{j} \mid R(q, a_{j}) \leq \tau^{f}\}$ (2)

where $\tau^{s}$ and $\tau^{f}$ are reward thresholds for classifying successful and failed responses, respectively. For instance, in mathematical problem-solving, where the outcome reward is defined as $R(q, a) = 1$ for a correct final answer and $0$ otherwise, we can set the thresholds as $\tau^{s} = \tau^{f} = 0.5$.
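A possible writing routine, continuing the EpisodicMemory sketch above, is shown below; it applies the FIFO eviction rule and the thresholds of Eq. 2 (the helper name and threshold handling are our own illustration):

```python
from collections import deque


def write(memory, query: str, responses, rewards, tau: float, keep_above: bool):
    """Insert responses whose outcome reward passes the threshold (Eq. 2).
    keep_above=True implements M^s (R > tau^s); keep_above=False implements M^f (R <= tau^f)."""
    kept = [a for a, r in zip(responses, rewards) if (r > tau) == keep_above]
    if not kept:
        return
    if query not in memory.store:
        if len(memory.store) >= memory.N:                      # capacity reached: evict the oldest query
            memory.store.popitem(last=False)
        q_emb = memory.encode([query])[0]
        memory.store[query] = (q_emb, deque(maxlen=memory.L))  # deque drops the oldest responses beyond L
    _, stored_responses = memory.store[query]
    for a_emb in memory.encode(kept):
        stored_responses.append(a_emb)

# With binary outcome rewards and tau^s = tau^f = 0.5:
# write(success_mem, q, responses, rewards, tau=0.5, keep_above=True)
# write(failure_mem, q, responses, rewards, tau=0.5, keep_above=False)
```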

2.3 Memory-based Intrinsic Reward

Memory Read

Given a new query $q$ and a response $a$, we compute their embeddings $\mathbf{q} = \text{Enc}(q)$ and $\mathbf{a} = \text{Enc}(a)$ and retrieve the top-$K$ nearest queries from a memory $\mathcal{M}$ based on cosine similarity (CS): $\{q_{k}^{\prime}\}_{k=1}^{K} = \text{top-}K\left(\arg\max_{\mathbf{q}^{\prime} \in \mathcal{M}} \text{CS}(\mathbf{q}, \mathbf{q}^{\prime})\right)$. Then, the set of relevant responses from the memory is computed as:

$B(\mathcal{M}, q) = \bigcup_{k=1}^{K} \mathcal{M}[q_{k}^{\prime}]$ (3)

where $K$ is the number of nearest neighbors considered in memory retrieval. For simplicity, the same $K$ is used for both $\mathcal{M}^{s}$ and $\mathcal{M}^{f}$.
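The retrieval step of Eq. 3 amounts to a brute-force kNN search over the stored query embeddings followed by pooling the attached responses; a minimal sketch (the function name is ours) is:

```python
import numpy as np


def retrieve_neighborhood(memory, query: str, k: int):
    """B(M, q) in Eq. 3: pool the responses of the top-K stored queries most similar to q."""
    if not memory.store:
        return None
    keys = list(memory.store.keys())
    query_embs = np.stack([memory.store[key][0] for key in keys])
    q_emb = memory.encode([query])[0]
    sims = query_embs @ q_emb                          # cosine similarity (embeddings are unit-normalized)
    top_k = np.argsort(-sims)[:k]
    pooled = [emb for i in top_k for emb in memory.store[keys[i]][1]]
    return np.stack(pooled) if pooled else None
```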

Exploitation Reward $R_{\text{exploit}}$

To reinforce successful reasoning patterns, we compute the exploitation reward using responses stored in the success memory $\mathcal{M}^{s}$. The model is rewarded for generating responses similar to those that previously led to correct final answers. We first compute the centroid of the retrieved response embeddings from the success memory: $\mathbf{c}(\mathcal{M}^{s}, q) = \frac{1}{|B(\mathcal{M}^{s}, q)|} \sum_{\mathbf{a}_{j} \in B(\mathcal{M}^{s}, q)} \mathbf{a}_{j}$. The exploitation reward is then the negative Euclidean distance between the response embedding $\mathbf{a}$ and this centroid:

$R_{\text{exploit}}(q, a) = -\|\mathbf{a} - \mathbf{c}(\mathcal{M}^{s}, q)\|$ (4)

By measuring the distance to the centroid, we encourage the model to align with the general distribution of successful reasoning patterns rather than overfitting to specific past answers. This provides a smoother optimization signal, mitigating the reward sparsity problem and capturing structural commonalities in effective reasoning paths.
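Given the pooled success responses $B(\mathcal{M}^{s}, q)$, Eq. 4 reduces to a centroid distance; a minimal sketch:

```python
import numpy as np


def exploit_reward(a_emb: np.ndarray, success_pool: np.ndarray) -> float:
    """Eq. 4: negative Euclidean distance to the centroid of retrieved successful responses."""
    centroid = success_pool.mean(axis=0)
    return -float(np.linalg.norm(a_emb - centroid))
```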

Exploration Reward $R_{\text{explore}}$

To encourage novel reasoning paths, we compute the exploration reward using responses stored in the failure memory $\mathcal{M}^{f}$, ensuring that the model avoids repeating past mistakes. Specifically, novelty is measured as one minus the cosine similarity between the generated response embedding $\mathbf{a}$ and its closest retrieved embedding from the failure memory:

$R_{\text{explore}}(q, a) = 1 - \max_{\mathbf{a}_{j} \in B(\mathcal{M}^{f}, q)} \text{CS}(\mathbf{a}, \mathbf{a}_{j})$ (5)

This formulation penalizes responses that closely resemble previously failed attempts while rewarding novel outputs that deviate from incorrect reasoning. Importantly, the design of this intrinsic reward creates a natural curriculum: early in training, when most outputs are wrong, it encourages broad exploration—generating anything unlike previous attempts. As correct responses accumulate in the success memory $\mathcal{M}^{s}$, the failure memory becomes more selective, guiding the model to avoid bad patterns without discouraging similarity to correct ones. As the model improves, exploration thus shifts naturally from broad to focused. In practice, we can set a warm-up period (e.g., 50 training steps) to collect initial data before applying the exploration reward.
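Eq. 5 can likewise be computed directly on the pooled failure responses $B(\mathcal{M}^{f}, q)$; the explicit norms in the sketch below keep it correct even for non-normalized embeddings:

```python
import numpy as np


def explore_reward(a_emb: np.ndarray, failure_pool: np.ndarray) -> float:
    """Eq. 5: one minus the highest cosine similarity to any retrieved failed response."""
    sims = failure_pool @ a_emb / (
        np.linalg.norm(failure_pool, axis=1) * np.linalg.norm(a_emb) + 1e-8)
    return float(1.0 - sims.max())
```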

Reward Normalization

The intrinsic rewards are normalized using running min-max scaling to ensure they are evaluated relative to recent performance trends. For example, given a sliding window of past intrinsic rewards $\{R_{\text{explore},i}\}_{i=t-w}^{t}$, where $w$ is the window size, the normalized intrinsic reward is computed as:

$\hat{R}_{\text{exploit/explore}}(q, a) = \dfrac{R_{\text{exploit/explore}}(q, a) - \min_{t-w \leq i \leq t} R_{\text{exploit/explore},i}}{\max_{t-w \leq i \leq t} R_{\text{exploit/explore},i} - \min_{t-w \leq i \leq t} R_{\text{exploit/explore},i} + \epsilon}$ (6)

where $\epsilon$ is a small constant to prevent division by zero. Here, rewards are interpreted relative to recent performance, allowing the model to adapt dynamically. We argue that since this is an intrinsic reward, its value should be assessed relative to the model’s past performance rather than on an absolute scale. A response is considered more rewarding if it demonstrates improvement over recent historical performance, ensuring that the model continuously refines its reasoning rather than converging prematurely.
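A simple way to realize Eq. 6 is a small stateful normalizer that tracks the last $w$ raw values of each intrinsic reward (class name and defaults are illustrative):

```python
from collections import deque


class RunningMinMax:
    """Eq. 6: min-max normalization over a sliding window of the last w raw intrinsic rewards."""

    def __init__(self, window: int = 100, eps: float = 1e-8):
        self.history = deque(maxlen=window)
        self.eps = eps

    def __call__(self, r: float) -> float:
        self.history.append(r)
        lo, hi = min(self.history), max(self.history)
        return (r - lo) / (hi - lo + self.eps)
```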

Final Reward Signal

The final memory-based intrinsic reward $R_{\text{mem}}$ is computed as a weighted sum of the normalized components:

$R_{\text{mem}} = \beta_{s} \hat{R}_{\text{exploit}} + \beta_{e} \hat{R}_{\text{explore}},$ (7)

where $\hat{R}_{\text{exploit}}$ and $\hat{R}_{\text{explore}}$ are the normalized rewards, ensuring that the model evaluates improvements relative to its recent history. The weighting factors $\beta_{s}$ and $\beta_{e}$ determine the balance between reinforcing past successes and encouraging novel reasoning, providing explicit control over the exploitation-exploration trade-off. For simplicity, we do not tune them in this paper and set $\beta_{s} = \beta_{e} = 1$ throughout the experiments. With $R_{\text{mem}}$, dense performance-driven signals are incorporated into the training rewards. We hypothesize that this facilitates learning performance-based rewards, narrows the difficulty gap between performance-based and format-based learning, and mitigates training collapse issues (see more in Sec. 4.3).
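Putting the pieces above together, the per-response intrinsic reward of Eq. 7 could be assembled as follows (one normalizer per reward stream, window $w = 100$ as in our experiments):

```python
# One sliding-window normalizer per intrinsic reward stream.
norm_exploit, norm_explore = RunningMinMax(window=100), RunningMinMax(window=100)


def memory_reward(a_emb, success_pool, failure_pool,
                  beta_s: float = 1.0, beta_e: float = 1.0) -> float:
    """Eq. 7: R_mem as a weighted sum of the normalized exploitation and exploration rewards."""
    r_s = norm_exploit(exploit_reward(a_emb, success_pool)) if success_pool is not None else 0.0
    r_e = norm_explore(explore_reward(a_emb, failure_pool)) if failure_pool is not None else 0.0
    return beta_s * r_s + beta_e * r_e
```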

Training with RL

We train the model using a reinforcement learning objective, where the task outcome reward $R$ and the intrinsic reward $R_{\text{mem}}$ are used to update the policy $\pi_{\theta}(a \mid q)$. Specifically, we adopt Group Relative Policy Optimization (GRPO; Shao et al. (2024)), a variant of policy gradient methods designed for improved stability and efficiency in RL fine-tuning of language models. The training objective maximizes the expected total reward:

$\max_{\theta} \, \mathbb{E}_{q \sim D,\, a \sim \pi_{\theta}}\left[R + R_{\text{mem}}\right]$ (8)

where $D$ is the training dataset and $\pi_{\theta}$ is the LLM with trainable parameters $\theta$.
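In GRPO, the combined reward $R + R_{\text{mem}}$ of each of the $G$ responses to a query is standardized within the group to form the advantage used by the policy-gradient update; a sketch of this step (independent of any particular trainer implementation) is:

```python
import numpy as np


def group_relative_advantages(outcome_rewards, intrinsic_rewards):
    """Standardize the total reward R + R_mem across the G responses to the same query (GRPO)."""
    total = np.asarray(outcome_rewards, dtype=float) + np.asarray(intrinsic_rewards, dtype=float)
    return (total - total.mean()) / (total.std() + 1e-8)
```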

3 Experimental Setup

Training

We use three tiny LLMs with at most 1 billion parameters: Qwen2.5-0.5B-Instruct, Falcon3-1B-Instruct, and Llama3.2-1B-Instruct, and fine-tune them on a single NVIDIA H100 GPU. Training is conducted on two datasets: (1) the "easy-math" GSM8K dataset Cobbe et al. (2021), using the training set of 7,473 samples, and (2) the "hard-math" AI-MO dataset Jia LI and Polu (2024), from which we randomly select only 2,000 samples to reflect real-world high-quality data scarcity. We run training with three seeds to account for the inherent randomness in RL training, ensuring that our results are stable and not dependent on a specific initialization. We implement and execute the training using the Open-R1 codebase Face (2025) (see more details in Appendix A).

Evaluation

We evaluate our approach on three representative mathematical reasoning benchmarks—GSM8K, MATH-500, and AIME24 Cobbe et al. (2021); Lightman et al. (2023)—which increase in difficulty in that order. If not stated otherwise, we follow the zero-shot setting for all evaluations. Our evaluation framework is based on Lighteval Fourrier et al. (2023), and we employ its extractive match metric, which rigorously applies regex-based conditions to precisely extract and parse generated answers. We note that answers must adhere to a strict, predefined format to be successfully extracted for evaluation; if the model fails to generate an answer that follows the specified format, the answer will not be extracted and will be counted as incorrect.
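To illustrate how strict the extraction is, the check below is a simplified stand-in for the extractive match metric (the actual Lighteval regexes are considerably richer); an answer that cannot be parsed from the expected tags is simply scored as incorrect:

```python
import re


def extractive_match(completion: str, gold: str) -> bool:
    """Simplified illustration: parse the answer from <answer>...</answer> and compare it to the gold answer."""
    m = re.search(r"<answer>\s*(.+?)\s*</answer>", completion, flags=re.DOTALL)
    if m is None:
        return False                      # nothing extractable counts as incorrect
    return m.group(1).strip() == gold.strip()
```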

Baselines

We define R1 as the RL baseline trained with the standard GRPO algorithm, following the DeepSeek-R1 paper Guo et al. (2025), using correctness outcome and format-based rewards. Cosine incorporates the response length as a reward signal Yeo et al. (2025). Our proposed method, Memory-R+, introduces two performance-driven reward strategies: $\hat{R}_{\text{exploit}}$ and $\hat{R}_{\text{explore}}$.

4 Experimental Results

| LLM | Baseline | GSM8K Last | GSM8K Best | MATH-500 Last | MATH-500 Best | AIME24 Last | AIME24 Best |
|---|---|---|---|---|---|---|---|
| Qwen2.5-0.5B | Base | 27.8 | – | 20.0 | – | 0.0 | – |
| | R1 | 27.5±6.3 | 28.8±7.5 | 18.7±3.6 | 18.9±3.8 | 0.0±0.0 | 1.1±1.9 |
| | Cosine | 29.4±1.4 | 31.2±0.7 | 22.7±1.0 | 22.7±1.0 | 0.0±0.0 | 0.0±0.0 |
| | Memory-R | 33.0±1.1 | 36.0±2.6 | 21.4±1.9 | 23.7±1.3 | 0.0±0.0 | 0.0±0.0 |
| | Memory-R+ | 33.7±2.5 | 34.0±2.3 | 22.3±0.6 | 24.4±0.6 | 0.0±0.0 | 1.1±1.9 |
| Falcon3-1B | Base | 32.9 | – | 12.2 | – | 0.0 | – |
| | R1 | 10.9±4.6 | 16.3±1.7 | 6.5±1.7 | 10.8±0.4 | 0.0±0.0 | 0.0±0.0 |
| | Cosine | 35.3±0.2 | 37.4±1.3 | 16.2±1.4 | 17.0±0.9 | 0.0±0.0 | 0.0±0.0 |
| | Memory-R | 34.6±1.7 | 36.3±0.6 | 12.9±0.3 | 14.1±0.9 | 0.0±0.0 | 2.2±1.9 |
| | Memory-R+ | 34.0±0.6 | 34.8±0.5 | 14.0±2.8 | 16.9±1.3 | 0.0±0.0 | 2.2±1.9 |
| Llama3.2-1B | Base | 26.3 | – | 17.4 | – | 0.0 | – |
| | R1 | 36.2±0.3 | 37.2±1.3 | 15.2±1.0 | 19.0±0.5 | 3.3±0.0 | 2.2±1.9 |
| | Cosine | 37.6±1.3 | 38.1±1.8 | 18.5±0.3 | 21.2±0.3 | 0.0±0.0 | 3.3±3.3 |
| | Memory-R | 38.7±0.8 | 39.9±1.2 | 20.3±1.4 | 21.1±0.7 | 3.3±0.0 | 4.4±1.9 |
| | Memory-R+ | 40.5±1.1 | 40.7±1.1 | 20.0±1.2 | 20.6±1.4 | 0.0±0.0 | 2.2±1.9 |

Table 1: Results over different LLMs and datasets. Extractive match (mean ± std.) at the last and best training checkpoints, averaged over 3 seeds (for Base, a single run suffices). The best average results are highlighted in bold, and the second-best results are underlined; if two or more statistically identical best results occur (Cohen effect size < 0.5), all are bold without underlining, and settings with uniformly zero performance are left unformatted.

4.1 Results with GSM8K Training

For GSM8K training, we use a zero-shot setting for all base LLMs except Llama3.2-1B-Instruct, which requires a single in-context example per training sample; without this, it fails to produce any valid correctness rewards, preventing learning. All baselines share the outcome correctness rewards and format-based rewards, including the integer reward (rewarding responses that contain integers, tailored for the GSM8K task) and the XML reward (ensuring responses match a specific XML structure, e.g., <answer> ... </answer>).
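The two auxiliary format rewards can be approximated by simple regular-expression checks; the reward magnitudes (0.5) and exact patterns below are our assumptions, and the Open-R1 implementations may differ in detail:

```python
import re


def integer_reward(completion: str) -> float:
    """Bonus if the extracted answer is an integer (GSM8K answers are integers); magnitude is an assumption."""
    m = re.search(r"<answer>\s*(.+?)\s*</answer>", completion, flags=re.DOTALL)
    return 0.5 if m and re.fullmatch(r"-?\d+", m.group(1).strip()) else 0.0


def xml_reward(completion: str) -> float:
    """Bonus if the response contains the expected <answer> ... </answer> structure."""
    return 0.5 if re.search(r"<answer>.*?</answer>", completion, flags=re.DOTALL) else 0.0
```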

We tune GRPO's hyperparameters using the R1 baseline and find that the optimal response length is consistently below 200. Thus, we set the maximum length to 200 for all baselines. Typically, a higher number of generations per step helps stabilize training but demands more memory; to balance resource constraints, we set this value to $G = 16$. Other hyperparameter values are provided in Appendix A. For our method, we set the episodic memory capacity equal to the dataset size, i.e., $N = |D|$, ensuring no memory overflow. This design choice is suitable for our setting, where the training dataset is relatively small, aligning with real-world conditions. The maximum number of stored responses per query is fixed at $L = 100$, and the reward normalization window size is set to $w = 100$ across all experiments. A key hyperparameter that may require tuning is $K$, the number of neighbors used for memory retrieval. For simplicity, we set $K = 1$ in this section as a proof of concept, without further hyperparameter optimization.

After one training epoch, we evaluate the baselines on mathematical reasoning benchmarks of varying difficulty. To analyze the contributions of the exploitation and exploration rewards in our method, we also report results for Memory-R, which utilizes only $\hat{R}_{\text{exploit}}$ as the intrinsic reward for RL training. In contrast, Memory-R+ incorporates both $\hat{R}_{\text{exploit}}$ and $\hat{R}_{\text{explore}}$, providing a more comprehensive reward structure.

Main Results: Table 1 presents test accuracy across multiple training checkpoints, reporting the best and last checkpoint’s results. Memory-R+ emerges as the strongest performer, ranking highest in 10 cases, followed by Memory-R with 8. Cosine and R1 achieve top rankings 7 and 2 times, respectively. Compared to the Base model, our Memory-R variants yield performance improvements ranging from 2% to 14%, depending on the setting. Notably, some RL methods, such as R1, occasionally underperform the Base model due to training collapse. Across all runs, seeds, and settings, we observe that training collapse does not occur with Memory-R+, whereas it does affect the other methods. We provide a detailed analysis of this phenomenon in Sec. 4.3.

Analysis on Intrinsic Reward: We also visualize the learning curves of $\hat{R}_{\text{exploit}}$ and $\hat{R}_{\text{explore}}$ over training steps in Appendix Fig. 6. Both rewards generally show an upward trend, suggesting that the RL algorithm effectively optimizes them. It is important to note that these rewards are normalized to reflect relative improvements. For the exploration reward, there is a warm-up period during which $\hat{R}_{\text{explore}}$ remains zero while initial data is collected to estimate novelty. After this phase, a spike in exploration occurs when $\hat{R}_{\text{explore}}$ is first applied, followed by stabilization and a gradual increase. The rise in $\hat{R}_{\text{explore}}$ indicates that the LLM's outputs become more diverse as training progresses.

4.2 Results with AI-MO Training

In this task, we focus on Qwen2.5-0.5B-Instruct, the smallest LLM in our study. The format-based rewards include the XML reward and a heuristic reward that assesses the clarity of reasoning, dubbed the reasoning step reward Face (2025). We exclude the integer reward because answers in this task are not limited to integers. We keep the training hyperparameters the same as in Sec. 4.1, except for $K$, which we vary to study the impact of memory retrieval on the performance of our method. Also, as the data is limited (only 2,000 samples), we fine-tune the LLMs for 4 epochs to ensure convergence.

Figure 2: Performance of fine-tuning Qwen2.5-0.5B-Instruct on AI-MO data. Test accuracy is evaluated at multiple checkpoints during training (mean ± std. over 3 runs).

Main Results: Fig. 2 reports the test accuracy of 3 RL methods: R1, Cosine, and Memory-R+ ($K = 20$), on 3 test datasets. The results consistently show that Memory-R+ surpasses all baselines by notable margins of approximately 5% on GSM8K, 4% on MATH-500, and 1% on AIME24. In this task, Cosine performs poorly due to response length collapse (see Sec. 4.3.2), whereas Memory-R+ and R1 remain unaffected. However, R1 exhibits significantly slower learning (see Appendix Fig. 5b) and achieves lower test accuracy compared to Memory-R+.

Hyperparameter Selection: Table 2 presents Memory-R+'s performance with different values of $K$, showing how varying $K$ influences test accuracy across multiple datasets. As observed, increasing $K$ generally improves performance. This trend holds for all three datasets, where the highest accuracy is achieved with $K > 1$. However, the improvement varies by dataset, indicating that the model's behavior may differ based on task complexity or data characteristics. These findings suggest that tuning $K$ can be crucial and can further improve Memory-R+'s performance in downstream tasks. Of all settings tested, $K = 20$ demonstrates the highest performance for this task, consistently ranking at the top.

| Dataset | Checkpoint | $K=1$ | $K=10$ | $K=20$ | $K=30$ |
|---|---|---|---|---|---|
| GSM8K | Best | 37.6±0.8 | 38.7±1.2 | 38.2±0.9 | 38.4±0.6 |
| | Last | 36.5±0.2 | 37.8±1.4 | 37.5±0.3 | 36.8±0.4 |
| MATH-500 | Best | 25.5±0.1 | 25.9±0.3 | 25.8±0.5 | 25.8±0.3 |
| | Last | 22.7±0.1 | 23.7±0.5 | 25.3±0.6 | 25.0±0.4 |
| AIME24 | Best | 0.0±0.0 | 0.7±1.0 | 2.2±1.9 | 0.0±0.0 |
| | Last | 0.0±0.0 | 0.3±0.8 | 2.2±1.9 | 0.0±0.0 |

Table 2: Memory-R+ test accuracy with $K = 1, 10, 20,$ and $30$ (mean ± std. over 3 runs). The best results are highlighted in bold; if two or more statistically identical best results occur (Cohen effect size < 0.5), all are bold.

Exploration Analysis: Furthermore, we evaluate response diversity after fine-tuning LLMs with our method and other approaches, selecting a seed that ensures no training collapse occurs in the other methods. We compute diversity scores on 3 randomly sampled responses from the LLMs, given inputs from a subset of 100 geometry questions (Hendrycks et al. (2021)). We utilize the Language Model Evaluation Harness library Gao et al. (2024) to generate model responses using the "hendrycks_math_geometry" task. The results in Appendix Table 6 demonstrate that Memory-R+ enhances the diversity of the base model, significantly surpassing R1 in terms of diversity. Further details and examples are provided in Appendix B.3.

Emergence of Self-reflection: Finally, we investigate the outputs of LLMs trained with our method to analyze the self-verification behavior discussed in Guo et al. (2025). To this end, we read through the responses of the LLMs on the geometry task mentioned above. Interestingly, the tiny LLMs trained with Memory-R+ also exhibit self-verification behaviors, as indicated by phrases like "let's re-evaluate" and "let's consider an alternative approach". Among 100 observed cases, 26 instances demonstrated such behavior, compared to only 6 in the Base model. This highlights the emerging capability of smaller models to perform self-verification, a form of reasoning previously thought to be exclusive to larger, more complex models. These instances suggest that, with the right training and mechanisms, small models can not only generate outputs but also evaluate and refine them. More examples are given in Appendix B.3.

4.3 Training Collapse in Tiny LLMs

When training tiny LLMs for reasoning, we observe that incorporating multiple reward signals (e.g., format, accuracy, etc.) can enhance performance, particularly in boosting specific aspects such as accuracy under precise format requirements. However, tiny LLMs can easily converge to local optima (training collapse) when exposed to multiple reward signals, resulting in suboptimal performance. In this section, we discuss these collapse cases, highlighting the nuances of reward design and exploring how these challenges can be addressed with our method.

4.3.1 Reward Mode Collapse

In this section, we investigate the reward mode collapse phenomenon, where (1) LLMs prioritize learning a simpler, typically format-based, reward, or (2) LLMs become confused by multiple rewards, struggling to learn any effectively. We observe and report this issue using Falcon3-1B-Instruct, though it is not exclusive to this model.

Figure 3: Reward Mode Collapse in Falcon3-1B-Instruct.

The values of the main reward (correctness reward) and the easier format-based rewards (e.g., integer reward and XML reward) are shown in Fig. 3. Here, Memory-R and Memory-R+ enhance both accuracy and integer rewards while trading off the XML reward. We hypothesize that incorporating intrinsic content- and performance-based rewards, such as $\hat{R}_{\text{exploit}}$ and $\hat{R}_{\text{explore}}$, facilitates correctness optimization and prevents the model from overfitting to easy format-based rewards. In contrast, with R1, which lacks intrinsic rewards, the model immediately focuses on the easier format-based rewards without any improvement in the correctness reward. Additionally, Cosine, which uses length-based intrinsic rewards, fails to learn any reward, resulting in mediocre performance across all criteria. This suggests that relying on naive intrinsic rewards may hinder learning. Between Memory-R and Memory-R+, the latter shows better performance, likely due to its more diverse intrinsic reward structure, which supports both format-based and correctness-based rewards.

4.3.2 Response Length Collapse

We observed two distinct types of collapse in response length. In one scenario, depending on the setup, the LLM struggles to generate meaningful tokens, resulting in unusually brief responses—sometimes as short as 10 tokens. In the other, the LLM fails to halt its generation (the overthinking problem), producing output sequences that keep expanding until they reach the maximum allowed length. We use Qwen2.5-0.5B-Instruct trained on GSM8K to illustrate both collapse cases. We note that response length collapse also occurs in other settings (see more in Appendix 4.3.2).

Figure 4: Response Length Collapse in Qwen2.5-0.5B-Instruct.

We present an example of response length collapse in Fig. 4. In this case, Memory-R and Memory-R+ successfully avoid this collapse, achieving high correctness rewards while maintaining reasonable completion lengths and balanced metrics. Interestingly, Cosine, despite focusing on optimizing lengths, leads the model to generate the maximum number of tokens early in training, yet none of its corresponding rewards increase. This suggests that the model fails to optimize for any meaningful objective despite the excessive token generation (in some other cases, Memory-R can also suffer similar issues). On the other hand, R1 drastically shortens responses. This results in the integer reward spiking as the response length drops significantly, indicating that the model is being guided to generate only a minimal number of tokens containing only digits. While this satisfies the integer reward, it is detrimental to correctness, which should be the primary optimization objective. Additionally, the XML reward for R1 remains unchanged, indicating a complicated relationship between the response length collapse and the reward mode collapse mentioned above.

5 Related Works

Enhancing LLM Reasoning

Recent advancements in LLM reasoning have focused on scaling test-time computation to improve accuracy in complex tasks. Test-time search strategies, such as beam search Gao et al. (2023) and majority vote Wang et al. (2022), aggregate predictions from multiple inference traces to refine accuracy. While these methods are effective, they come with the drawback of significantly increasing computational costs. More sophisticated techniques, using Monte Carlo Tree Search Feng et al. (2023) and Tree-of-Thoughts Yao et al. (2023), adopt structured search approaches to explore possible reasoning paths more systematically. However, these methods often require bespoke implementations tailored to specific tasks, and they still lead to high inference costs, making them unsuitable for low-resource devices. In addition, alternative approaches, such as process reward models (PRM) Lightman et al. (2023), aim to address particular aspects of reasoning by modeling rewards during inference. While these methods can improve performance in specific domains, they face several limitations. For instance, Guo et al. (2025) highlights that process-reward models are costly and not universally applicable. These issues underscore the trade-offs between reasoning accuracy and computational efficiency. Automated annotation often fails to provide satisfactory results, and manual annotation is not scalable. Additionally, introducing a model-based PRM leads to reward hacking Gao et al. (2023) and requires extra resources for retraining, complicating the training pipeline.

Reinforcement Learning for Reasoning Enhancement

Recent research, starting with DeepSeek-R1 Guo et al. (2025), has shown the effectiveness of pure RL training with outcome-based rewards in significantly improving reasoning performance, eliminating the need for costly inference-time searches. However, these methods often depend on verifiable ground truth or domain-specific heuristics, such as using response length as a reward Yeo et al. (2025). For instance, Kimi k1.5 Team et al. (2025) introduces a method to shorten chain-of-thought using a length penalty in the reward function during online RL, while Luo et al. (2025); Arora and Zanette (2025) propose RL objectives aimed at minimizing tokens while maintaining accuracy. Other works, such as Chen et al. (2024), explore the overthinking phenomenon and suggest generating data for offline policy optimization using first-correct solutions and diversity criteria. We argue that relying on heuristic reward functions restricts the generalization of LLM reasoning across a wide range of datasets. Additionally, small LLMs face challenges in generating long sentences, so emphasizing sentence length may not be beneficial for these models. In contrast, our approach leverages episodic memory to derive intrinsic rewards, making it more adaptable and widely applicable. Furthermore, while most existing methods target large LLMs, our work is the first to improve this approach for smaller models (1B parameters or fewer).

Episodic Memory For LLMs

Several works have explored episodic memory to enhance LLMs’ outputs, but they primarily focus on improving retrieval-based prompting rather than upgrading the model’s inherent reasoning capabilities. Experiential learning methods like REMEMBERER Zhang et al. (2024) store past observation-action pairs and retrieve high-value trajectories to guide LLMs’ action during inference. Similarly, Reflexion Shinn et al. (2024) and ExpeL Zhang et al. (2024) use memory to extract insights from past successes and failures, integrating them into prompts to improve decision-making. However, these methods rely on strong LLMs like GPT-4 OpenAI (2024a) without altering their internal reasoning process, using memory solely for explicit retrieval during inference. They still treat memory as an external knowledge base rather than an intrinsic driver of learning. In contrast, our method embeds memory-driven intrinsic motivation directly into the learning process. Rather than relying on explicit retrieval for in-context learning, our approach finetunes the model by leveraging intrinsic rewards derived from past successes and failures, enabling adaptive and self-improving reasoning.

6 Conclusion

We present Memory-R+, a novel memory-augmented reinforcement learning framework that equips tiny LLMs with intrinsic motivation for effective chain-of-thought reasoning. By leveraging episodic memory to compute exploration and exploitation rewards from past successes and failures, our method mitigates problems such as reward sparsity and poor exploration. Experimental results on math reasoning tasks demonstrate that Memory-R+ significantly boosts reasoning performance in small models, making RL fine-tuning more accessible and effective in low-resource settings.

References

  • Arora and Zanette [2025] Daman Arora and Andrea Zanette. Training language models to reason efficiently. arXiv preprint arXiv:2502.04463, 2025.
  • Chen et al. [2024] X. Chen, J. Xu, T. Liang, Z. He, J. Pang, D. Yu, L. Song, Q. Liu, M. Zhou, Z. Zhang, R. Wang, Z. Tu, H. Mi, and D. Yu. Do not think that much for 2+3=? on the overthinking of o1-like llms. 2024. URL https://arxiv.org/abs/2412.21187.
  • Cobbe et al. [2021] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021. URL https://arxiv.org/abs/2110.14168.
  • Do et al. [2024] Dai Do, Quan Tran, Svetha Venkatesh, and Hung Le. Large language model prompting with episodic memory. In ECAI, 2024.
  • Face [2025] Hugging Face. Open r1: A fully open reproduction of deepseek-r1, January 2025. URL https://github.com/huggingface/open-r1.
  • Feng et al. [2023] Xidong Feng, Ziyu Wan, Muning Wen, Stephen Marcus McAleer, Ying Wen, Weinan Zhang, and Jun Wang. Alphazero-like tree-search can guide large language model decoding and training. arXiv preprint arXiv:2309.17179, 2023.
  • Fourrier et al. [2023] Clémentine Fourrier, Nathan Habib, Hynek Kydlíček, Thomas Wolf, and Lewis Tunstall. Lighteval: A lightweight framework for llm evaluation, 2023. URL https://github.com/huggingface/lighteval.
  • Gao et al. [2023] L. Gao, J. Schulman, and J. Hilton. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, pages 10835–10866. PMLR, 2023.
  • Gao et al. [2024] Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework for few-shot language model evaluation, 07 2024. URL https://zenodo.org/records/12608602.
  • Google [2024] Google. Introducing gemini 2.0: Our new ai model for the agentic era, 2024. URL https://blog.google/technology/google-deepmind/. Accessed: 2025-03-05.
  • Guo et al. [2025] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
  • Hendrycks et al. [2021] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021.
  • Hong et al. [2023] Zhi Hong, Aswathy Ajith, James Pauloski, Eamon Duede, Kyle Chard, and Ian Foster. The diminishing returns of masked language models to science. In Findings of the Association for Computational Linguistics: ACL 2023, pages 1270–1283, 2023.
  • Jia LI and Polu [2024] Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Costa Huang, Kashif Rasul, Longhui Yu, Albert Jiang, Ziju Shen, Zihan Qin, Bin Dong, Li Zhou, Yann Fleureau, Guillaume Lample, and Stanislas Polu. NuminaMath TIR. https://huggingface.co/AI-MO/NuminaMath-TIR, 2024.
  • Krishnamurthy et al. [2024] Akshay Krishnamurthy, Keegan Harris, Dylan J Foster, Cyril Zhang, and Aleksandrs Slivkins. Can large language models explore in-context? arXiv preprint arXiv:2403.15371, 2024.
  • Kumar et al. [2025] Komal Kumar, Tajamul Ashraf, Omkar Thawakar, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, Phillip HS Torr, Salman Khan, and Fahad Shahbaz Khan. Llm post-training: A deep dive into reasoning large language models. arXiv preprint arXiv:2502.21321, 2025.
  • Le et al. [2021] Hung Le, Thommen Karimpanal George, Majid Abdolshah, Truyen Tran, and Svetha Venkatesh. Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems, 34:30313–30325, 2021.
  • Le et al. [2022] Hung Le, Majid Abdolshah, Thommen K George, Kien Do, Dung Nguyen, and Svetha Venkatesh. Episodic policy gradient training. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 7317–7325, 2022.
  • Lightman et al. [2023] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, 2023.
  • Luo et al. [2025] H. Luo, L. Shen, H. He, Y. Wang, S. Liu, W. Li, N. Tan, X. Cao, and D. Tao. O1-pruner: Length-harmonizing fine-tuning for o1-like reasoning pruning. 2025. URL https://arxiv.org/abs/2501.12570.
  • Manning et al. [2008] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
  • McClelland et al. [1995] J. L. McClelland, B. L. McNaughton, and R. C. O’Reilly. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological Review, 102(3):419–457, Jul 1995. doi: 10.1037/0033-295X.102.3.419.
  • OpenAI [2024a] OpenAI. Hello gpt-4o, 2024a. URL https://openai.com/index/hello-gpt-4o/.
  • OpenAI [2024b] OpenAI. Learning to reason with llms, 2024b. URL https://openai.com/index/learning-to-reason-with-llms/.
  • Pritzel et al. [2017] Alexander Pritzel, Benigno Uria, Sriram Srinivasan, Adria Puigdomenech Badia, Oriol Vinyals, Demis Hassabis, Daan Wierstra, and Charles Blundell. Neural episodic control. In International conference on machine learning, pages 2827–2836. PMLR, 2017.
  • Reimers and Gurevych [2019] Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, 2019.
  • Shao et al. [2024] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
  • Shinn et al. [2024] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Enhancing language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems, volume 36, 2024.
  • Snell et al. [2024] Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024.
  • Team et al. [2025] K. Team, A. Du, B. Gao, B. Xing, C. Jiang, C. Chen, C. Li, C. Xiao, C. Du, C. Liao, C. Tang, C. Wang, D. Zhang, E. Yuan, E. Lu, F. Tang, F. Sung, G. Wei, G. Lai, H. Guo, H. Zhu, H. Ding, H. Hu, H. Yang, H. Zhang, H. Yao, H. Zhao, H. Lu, H. Li, H. Yu, H. Gao, H. Zheng, H. Yuan, J. Chen, J. Guo, J. Su, J. Wang, J. Zhao, J. Zhang, J. Liu, J. Yan, J. Wu, L. Shi, L. Ye, L. Yu, M. Dong, N. Zhang, N. Ma, Q. Pan, Q. Gong, S. Liu, S. Ma, S. Wei, S. Cao, S. Huang, T. Jiang, W. Gao, W. Xiong, W. He, W. Huang, W. Wu, W. He, X. Wei, X. Jia, X. Wu, X. Xu, X. Zu, X. Zhou, X. Pan, Y. Charles, Y. Li, Y. Hu, Y. Liu, Y. Chen, Y. Wang, Y. Liu, Y. Qin, Y. Liu, Y. Yang, Y. Bao, Y. Du, Y. Wu, Y. Wang, Z. Zhou, Z. Wang, Z. Li, Z. Zhu, Z. Zhang, Z. Wang, Z. Yang, Z. Huang, Z. Huang, Z. Xu, and Z. Yang. Kimi k1.5: Scaling reinforcement learning with llms. 2025. URL https://arxiv.org/abs/2501.12599.
  • Wang et al. [2022] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, 2022.
  • Xia et al. [2023] Mengzhou Xia, Mikel Artetxe, Chunting Zhou, Xi Victoria Lin, Ramakanth Pasunuru, Danqi Chen, Luke Zettlemoyer, and Veselin Stoyanov. Training trajectories of language models across scales. In The 61st Annual Meeting Of The Association For Computational Linguistics, 2023.
  • Yao et al. [2023] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in neural information processing systems, 36:11809–11822, 2023.
  • Yeo et al. [2025] Edward Yeo, Yuxuan Tong, Morry Niu, Graham Neubig, and Xiang Yue. Demystifying long chain-of-thought reasoning in llms. arXiv preprint arXiv:2502.03373, 2025.
  • [35] Dan Zhang, Sining Zhoubian, Ziniu Hu, Yisong Yue, Yuxiao Dong, and Jie Tang. Rest-mcts*: Llm self-training via process reward guided tree search. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.
  • Zhang et al. [2024] Danyang Zhang, Lu Chen, Situo Zhang, Hongshen Xu, Zihan Zhao, and Kai Yu. Large language models are semi-parametric reinforcement learning agents. In Advances in Neural Information Processing Systems, volume 36, 2024.

Appendix A Experiment Details

A.1 System Prompt

Following Guo et al. [2025] and Face [2025], the system prompt is designed as a CoT prompt with explicit requirements for the reasoning and answer format, as shown in Table 3.

SYSTEM PROMPT: A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The Assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer>
Table 3: System prompt used in our experiments.

The use of <think> and <answer> tags ensures a clear distinction between the internal reasoning process and the final output.
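For illustration, completions in this format can be checked with a lightweight parser such as the sketch below. The regular expressions and the helper name are ours for exposition only and are not part of the training code.

```python
import re

# Illustrative patterns for the <think>/<answer> format described above.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def parse_completion(text: str):
    """Return (reasoning, answer) if the completion follows the format, else (None, None)."""
    think = THINK_RE.search(text)
    answer = ANSWER_RE.search(text)
    if think is None or answer is None:
        return None, None
    return think.group(1).strip(), answer.group(1).strip()

# Example: a well-formatted completion yields both parts.
reasoning, answer = parse_completion(
    "<think> 2 + 3 = 5 </think><answer> 5 </answer>"
)
assert answer == "5"
```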

A.2 Training Hyperparameters

The model is trained using the GRPO optimization framework with hyperparameters selected to ensure stable convergence within our computing resources. The key hyperparameters for GSM8K and AI-MO are listed in Table 4; the two settings differ slightly so that training remains efficient and well suited to each dataset.

Hyperparameter GSM8K AI-MO
Learning Rate $5\times 10^{-6}$ $5\times 10^{-6}$
Adam $\beta_1$ 0.9 0.9
Adam $\beta_2$ 0.99 0.99
Weight Decay 0.1 0.1
Warmup Ratio 0.1 0.1
Learning Rate Scheduler Cosine Cosine
Batch Size 2 4
Gradient Accumulation Steps 8 16
Number of GRPO Generations 16 16
Maximum Prompt Length 256 512
Maximum Completion Length 200 300
Training Epochs 1 4
Maximum Gradient Norm 0.1 0.1
Mixed Precision BF16 BF16
Table 4: Key training hyperparameters for GSM8K and AI-MO.
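For readers reproducing the setup, the GSM8K column of Table 4 maps roughly onto a TRL-style GRPO configuration like the sketch below. This is a sketch rather than our exact training script: the argument names follow the publicly documented trl GRPOConfig interface and may vary across library versions, and the output directory name is only a placeholder.

```python
from trl import GRPOConfig  # assumes a recent trl release with GRPO support

# Sketch of the GSM8K settings from Table 4; exact argument names may vary by trl version.
# Note: some trl versions impose divisibility constraints between the effective
# batch size and num_generations, which may require adapting these values.
gsm8k_config = GRPOConfig(
    output_dir="qwen2.5-0.5b-gsm8k-grpo",  # placeholder path
    learning_rate=5e-6,
    adam_beta1=0.9,
    adam_beta2=0.99,
    weight_decay=0.1,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_generations=16,          # GRPO group size
    max_prompt_length=256,
    max_completion_length=200,
    num_train_epochs=1,
    max_grad_norm=0.1,
    bf16=True,
)
```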

We list the links to the LLM models and datasets in Table 5.

Models/Datasets URL
Qwen2.5-0.5B-Instruct https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct
Llama3-1B-Instruct https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct
Falcon3-1B-Instruct https://huggingface.co/tiiuae/Falcon3-1B-Instruct
Sentence Transformer https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
GSM8K https://huggingface.co/datasets/openai/gsm8k
MATH-500 https://huggingface.co/datasets/HuggingFaceH4/MATH-500
AIME24 https://huggingface.co/datasets/math-ai/aime24
AI-MO https://huggingface.co/datasets/AI-MO/NuminaMath-TIR
Table 5: Models and Datasets Details.

Appendix B Training Collapse Examples

B.1 Reward Mode Collapse

We present the reward mode collapse phenomenon in Figure 3. Here, we show the values of the correctness reward, integer reward (ensuring the output is an integer), and XML reward (ensuring the output contains correctly formatted XML tags for parsing) when optimized simultaneously under different reward schemes with the same random seed. It is evident that while the accuracy rewards of Memory-R and Memory-R+ increase steadily, those of R1 and Cosine do not. Instead, these schemes lead the tiny LLMs to latch onto the reward patterns that are easier to learn, such as the format-based ones.
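For reference, the three reward signals plotted in Figure 3 can be approximated by simple per-completion checks such as the sketch below. These helpers are illustrative reconstructions for exposition, not the exact reward functions used in training; the thresholds and bonus values are assumptions.

```python
import re

def correctness_reward(answer: str, ground_truth: str) -> float:
    """1.0 if the extracted answer matches the reference answer exactly, else 0."""
    return 1.0 if answer.strip() == ground_truth.strip() else 0.0

def integer_reward(answer: str) -> float:
    """Small bonus if the answer parses as an integer (e.g. for GSM8K-style answers)."""
    return 0.5 if re.fullmatch(r"-?\d+", answer.strip()) else 0.0

def xml_reward(completion: str) -> float:
    """Small bonus if the completion contains well-formed <think>/<answer> tags."""
    ok = re.search(r"<think>.*?</think>\s*<answer>.*?</answer>", completion, re.DOTALL)
    return 0.5 if ok else 0.0
```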

B.2 Collapse in Response Length

In Fig. 4, we present different rewards and the corresponding completion lengths when training Qwen2.5-0.5B-Instruct with the same random seed. These figures reveal two distinct types of response length collapse. Memory-R and Memory-R+ show robust behavior with good correctness rewards. Cosine causes the model to generate excessively, reaching the maximum token limit early in training, while its integer and XML rewards remain close to zero. R1 shortens responses while optimizing for the integer reward. Both failure modes result in low correctness rewards, highlighting their suboptimal behavior.

Fig. 5 presents additional examples of response length collapse observed during the training of Qwen-2.5-0.5B-Instruct on the GSM8K and AI-MO datasets. The Cosine method exhibits severe lengthening collapse on GSM8K while experiencing shortening collapse on AI-MO. R1 suffers from shortening collapse on GSM8K. Although it does not exhibit collapse on AI-MO, it converges more slowly and underperforms Memory-R+, the only method capable of overcoming training collapse.

Figure 5: More Training Collapses in Qwen2.5-0.5B-Instruct during fine-tuning GSM8K (a) and AI-MO datasets (b). The results have been smoothed to improve clarity and visual appeal.

B.3 Details on Model Outputs

Memory-based Intrinsic Rewards

In Fig. 6, we report the memory-based intrinsic rewards ($R_{exploit}$ and $R_{explore}$) over training steps while fine-tuning Qwen2.5-0.5B-Instruct with Memory-R+ on GSM8K.

Figure 6: Memory-based Intrinsic Reward on GSM8K and Qwen2.5-0.5B-Instruct. The results have been smoothed to improve clarity and visual appeal.
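As a rough illustration of how kNN-based exploit and explore scores can be derived from an episodic memory of response embeddings, consider the sketch below. It conveys the general idea only; the exact Memory-R+ formulation, the value of $k$, and the similarity metric are assumptions here.

```python
import numpy as np

def knn_intrinsic_rewards(query_emb: np.ndarray,
                          memory_embs: np.ndarray,
                          k: int = 5):
    """Illustrative kNN scores over an episodic memory of response embeddings.

    R_exploit rewards similarity to stored (successful) responses, while
    R_explore rewards distance from them (novelty). This is a sketch of the
    general idea, not the exact Memory-R+ formulation.
    """
    if len(memory_embs) == 0:
        return 0.0, 1.0  # empty memory: nothing to exploit, everything is novel
    # Cosine similarity between the query and every memory entry.
    sims = memory_embs @ query_emb / (
        np.linalg.norm(memory_embs, axis=1) * np.linalg.norm(query_emb) + 1e-8
    )
    topk = np.sort(sims)[-k:]             # k nearest neighbours by similarity
    r_exploit = float(topk.mean())        # close to past successes -> exploit
    r_explore = float(1.0 - topk.mean())  # far from memory -> explore
    return r_exploit, r_explore
```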
Diversity Evaluation

To assess the diversity of responses generated by Qwen2.5-0.5B-Instruct fine-tuned with our method, we employ a pairwise similarity analysis. Specifically, for each question, we sample three responses from the model using a temperature of 0.1, and compute the pairwise cosine similarity between them. This process is repeated for a set of 100 questions to obtain a comprehensive measure of response diversity.

To capture both lexical and semantic similarities, we utilize two different embedding models: TF-IDF Manning et al. [2008] and Sentence Transformer Reimers and Gurevych [2019]. The TF-IDF model provides a measure of lexical overlap, while the Sentence Transformer captures deeper semantic relationships between responses.

For each question, we compute the average pairwise cosine similarity for the three sampled responses using both embedding models. The final results, reported in Table 6, reflect the overall diversity of the model’s responses across the dataset. We report 1 - similarity as the diversity score, where a higher value indicates greater diversity, suggesting more variation in the generated outputs.
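A minimal sketch of this diversity computation is given below, assuming scikit-learn's TfidfVectorizer and the sentence-transformers package; minor details may differ from our exact evaluation script.

```python
import numpy as np
from itertools import combinations
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

def mean_pairwise_similarity(vectors) -> float:
    """Average cosine similarity over all response pairs for one question."""
    sims = [cosine_similarity(vectors[i:i + 1], vectors[j:j + 1])[0, 0]
            for i, j in combinations(range(len(vectors)), 2)]
    return float(np.mean(sims))

def diversity_scores(responses):
    """Return (lexical, semantic) diversity = 1 - mean pairwise similarity."""
    tfidf = TfidfVectorizer().fit_transform(responses).toarray()
    st_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    embeddings = np.asarray(st_model.encode(responses))
    lexical = 1.0 - mean_pairwise_similarity(tfidf)       # TF-IDF: lexical overlap
    semantic = 1.0 - mean_pairwise_similarity(embeddings)  # Sentence Transformer: semantics
    return lexical, semantic
```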

The results confirm that our method effectively encourages the LLM to explore more. Compared to R1, the final model trained with our approach demonstrates a clear improvement in diversity. We present several sampled responses from the Base model, R1, and Memory-R+ in Tables 7, 8, and 9, respectively. The question for these responses is listed below:

Problem: In right triangle $ABC$, shown below, $\cos B=\frac{6}{10}$. What is $\tan C$? [Diagram: $B=(0,0)$, $C=(10,0)$, $A=(3.6,4.8)$, with the right angle at $A$ and side $BC$ labeled $10$.]
Self-Verification Behaviors

Examples of our method’s output, showcasing self-verification behaviors (highlighted in red), can be found at the end of the paper.

Method Lexical Diversity Semantic Diversity
Base 27.3 9.0
R1 24.2 8.6
Memory-R+ 27.6 9.3
Table 6: Diversity scores ($\times 100$) for different baselines. Bold denotes best results.
Response 1: $-3$. Given a right triangle $ABC$ with $\cos B=\frac{6}{10}$, we need to find $\tan C$. Using the Pythagorean theorem: $BC=\sqrt{AB^{2}-AC^{2}}$. We express cosine as: $\cos B=\frac{\text{adjacent}}{\text{hypotenuse}}$. Since $\cos B=\frac{3}{5}$: $\cos B=\frac{AC}{c}$
Response 2: $\frac{1}{9}$. To solve for $\tan C$, we use: $\cos B=\frac{6}{10}$. Since $\cos B=\frac{\text{adjacent}}{\text{hypotenuse}}=\frac{x}{c}$, we find $x=6$, hypotenuse $c=10$. Verifying: $\frac{x}{10}=\frac{6}{10}$. By cross-multiplication: $x=6$. Next, using sine: $\sin B=\ldots$
Response 3: To find $\tan C$ given $\cos B=\frac{6}{10}$: Step-by-step: $\cos B=\frac{6}{10}$. Adjacent side: $a$, hypotenuse: $c$. Using the ratio: $\cos B=\frac{\text{adjacent}}{\text{hypotenuse}}$
Table 7: Qwen2.5-0.5B-Instruct's responses.
Response 1: To solve for $\tan C$, given $\cos B=\frac{6}{10}$ in $\triangle ABC$: Identify sides: $AB$ = hypotenuse, $BC$ = one leg, $AC$ = other leg. Step 1: Find $\sin B$: $\cos B=\frac{\text{adjacent}}{\text{hypotenuse}}$. For $\cos B=\frac{6}{10}$: $\frac{BC}{AB}=\frac{6}{10}$. Let $BC=x$: $\frac{x}{AB}=\frac{6}{10}$. Since $AB$ is the hypotenuse: $AB=\frac{10x}{6}=\frac{5x}{3}$. Next, we find $\sin B$
Response 2: To determine $\tan C$ given $\cos B=\frac{6}{10}$ in $\triangle ABC$: Trigonometric definitions: $\cos B=\frac{\text{adjacent}}{\text{hypotenuse}}$, $\tan B=\frac{\text{opposite}}{\text{adjacent}}$. Step 1: Find the hypotenuse using $\cos B$: $\cos B=\frac{\text{adjacent}}{\text{hypotenuse}}$. Given: $\cos B=\frac{6}{10}$. Let $b$ be the adjacent side and $h$ be...
Response 3: To solve for $\tan C$, analyze $\triangle ABC$: Given: $\cos B=\frac{6}{10}$. Step 1: Find $\sin B$: $\cos B=\frac{\text{adjacent}}{\text{hypotenuse}}$. Since $\cos B=\frac{6}{10}$: $\frac{\text{adjacent}}{\text{hypotenuse}}=\frac{6}{10}$. Let hypotenuse $=h$: $\text{adjacent}=6k$, where $k$ is a positive number. Next, find: $\sin B=\frac{\text{opposite}}{\text{hypotenuse}}$
Table 8: Qwen2.5-0.5B-Instruct fine-tuned with R1's responses.
Response 1: To solve for $\tan C$, given $\cos B=\frac{6}{10}$: Step 1: Recall trigonometric definitions: $\cos B=\frac{\text{adjacent}}{\text{hypotenuse}}$, $\sin B=\frac{\text{opposite}}{\text{hypotenuse}}$. Step 2: Use the Pythagorean identity $\sin^{2}B+\cos^{2}B=1$. Substituting $\cos B=\frac{6}{10}$: $\sin^{2}B+\left(\frac{6}{10}\right)^{2}=1$. Calculate $\sin B$
Response 2: To determine $\tan C$, first find $\sin B$ and $\cos B$ for $\triangle ABC$. Given: $\cos B=\frac{6}{10}$. Use the Pythagorean theorem to find the hypotenuse: $AB^{2}+BC^{2}=AC^{2}$. Substituting known values: $10^{2}+6^{2}=AC^{2}$, $100+36=AC^{2}$, $AC=\sqrt{136}=2\sqrt{34}$. Next, calculate $\sin B$ using: $\sin B=\frac{\text{opposite}}{\text{hypotenuse}}$
Response 3: To find $\tan C$ in $\triangle ABC$: Given: $\cos B=\frac{6}{10}=\frac{3}{5}$. Trigonometric definition: $\cos B=\frac{\text{adjacent}}{\text{hypotenuse}}$. Denote sides: $AB=c$, $BC=b$, $AC=a$. Given: $\frac{b}{a}=\frac{3}{5}\Rightarrow b=\frac{3}{5}a$. Using the Pythagorean theorem: $c^{2}+b^{2}=a^{2}$. Solve for remaining sides...
Table 9: Qwen2.5-0.5B-Instruct fine-tuned with Memory-R+'s responses.

Memory-R+'s Output Example 1. Question: Coplanar squares $ABGH$ and $BCDF$ are adjacent, with $CD=10$ units and $AH=5$ units. Point $E$ is on segments $AD$ and $GB$. What is the area of triangle $ABE$, in square units? Express your answer as a common fraction. Response: To find the area of $\triangle ABE$ where $ABGH$ and $BCDF$ are coplanar squares with $CD=10$ units and $AH=5$ units, we need to determine the coordinates of points $E$ and $B$. First, let's place the squares on the coordinate plane for easier calculation: Square $ABCD$ has vertices at $A(0,0)$, $B(5,0)$, $C(5,5)$, and $D(0,5)$. Square $BCDF$ has vertices at $B(5,0)$, $C(15,0)$, $D(15,10)$, and $F(5,10)$. Next, we need to find the coordinates of point $E$. Since $E$ lies on segment $AD$ and $GB$, we can use the fact that $AD$ and $GB$ are parallel lines (both have a slope of 1). Step 1: Determine the coordinates of $E$. Since $E$ lies on $AD$: $E$ divides $AD$ into two equal parts. The equation of line $AD$ can be found using the points $A(0,0)$ and $D(0,5)$: $y=-x+5$. To find the intersection with $AD$, set $x=0$: $y=-0+5=5$. Thus, $E$ is at $(0,5)$. Step 2: Determine the coordinates of $B$. Since $B$ lies on $BC$ and $DF$: $B$ divides $BC$ into two equal parts. The distance from $A$ to $BE$ is simply the absolute value of the $y$-coordinate of $A$: $\text{Height}=|0-0|=0$. However, this interpretation seems incorrect because $A$ should be above $BE$. Let's re-evaluate: Correctly, the correct height should be the vertical distance from $A$ to the line $BE$. Now, calculate the area of $\triangle ABE$: $\text{Area}=\frac{1}{2}\times 5\times 5=\frac{25}{2}=12.5$. Thus, the area of triangle $ABE$ is: $\boxed{\frac{25}{2}}$

Memory-R+'s Output Example 2. Question: In the diagram, points $X$, $Y$ and $Z$ are on the sides of $\triangle UVW$, as shown. Line segments $UY$, $VZ$ and $WX$ intersect at $P$. Point $Y$ is on $VW$ such that $VY:YW=4:3$. If $\triangle PYW$ has an area of 30 and $\triangle PZW$ has an area of 35, determine the area of $\triangle UXP$. Model Response: $\boxed{8}$ To solve for the area of $\triangle UXP$, we need to use the given information about the areas of other triangles in the diagram. First, let's find the ratio of the lengths $VY$ to $YW$. We know: the ratio $VY:YW=4:3$. Given that $\triangle PZW$ has an area of 35, we can find the area of $\triangle UXP$ by subtracting the areas of $\triangle PYW$ and $\triangle PZW$ from the area of $\triangle UXW$. First, we find the area of $\triangle UXW$: $\text{Area of }\triangle UXW=\text{Area of }\triangle UYW+\text{Area of }\triangle UXW$. Finally, we find the area of $\triangle UXP$: $\text{Area of }\triangle UXP=\text{Area of }\triangle UXW-(\text{Area of }\triangle PYW+\text{Area of }\triangle PZW)$, so $\text{Area of }\triangle UXP=30-(15+35)=30-50=-20$. However, this result seems incorrect because the area cannot be negative. Let's re-evaluate the problem using the correct approach. Given: Area of $\triangle PYW=30$, Area of $\triangle PZW=35$. The correct calculation should be: $\text{Area of }\triangle UXP=\text{Area of }\triangle UXW-(\text{Area of }\triangle PYW+\text{Area of }\triangle PZW)=30-(15+35)=30-50=-20$. This indicates a mistake in the interpretation or calculation. Revisiting the problem, we realize the correct approach involves recognizing the geometric properties and relationships between the triangles. Given the correct geometric properties and solving step-by-step correctly, we find: $\boxed{8}$