InfLoRA: Interference-Free Low-Rank Adaptation for Continual Learning

Yan-Shuo Liang and Wu-Jun Li
National Key Laboratory for Novel Software Technology,
Department of Computer Science and Technology, Nanjing University, P. R. China
[email protected],[email protected]
Wu-Jun Li is the corresponding author.
Abstract

Continual learning requires the model to learn multiple tasks sequentially. In continual learning, the model should possess the ability to maintain its performance on old tasks (stability) and the ability to adapt to new tasks continuously (plasticity). Recently, parameter-efficient fine-tuning (PEFT), which involves freezing a pre-trained model and injecting a small number of learnable parameters to adapt to downstream tasks, has gained increasing popularity in continual learning. Although existing continual learning methods based on PEFT have demonstrated superior performance compared to those not based on PEFT, most of them do not consider how to eliminate the interference of the new task on the old tasks, which inhibits the model from making a good trade-off between stability and plasticity. In this work, we propose a new PEFT method, called interference-free low-rank adaptation (InfLoRA), for continual learning. InfLoRA injects a small number of parameters to reparameterize the pre-trained weights and shows that fine-tuning these injected parameters is equivalent to fine-tuning the pre-trained weights within a subspace. Furthermore, InfLoRA designs this subspace to eliminate the interference of the new task on the old tasks, making a good trade-off between stability and plasticity. Experimental results show that InfLoRA outperforms existing state-of-the-art continual learning methods on multiple datasets. Code is available at https://github.com/liangyanshuo/InfLoRA.

1 Introduction

Continual learning requires the model to learn multiple tasks sequentially [33]. To achieve continual learning, the model should possess two essential abilities: the ability to maintain its performance on the old tasks (stability) and the ability to adapt to the new tasks continuously (plasticity) [33]. Furthermore, two different scenarios are often considered in continual learning: the task-incremental scenario [32] and the class-incremental scenario [41]. The task-incremental scenario allows the model to get task identities during inference. On the contrary, the class-incremental scenario does not provide task identities during inference, requiring the model to distinguish all the classes across all the tasks.

Recently, parameter-efficient fine-tuning (PEFT) [16, 15, 18], which involves freezing a pre-trained model and injecting a small number of learnable parameters to adapt to downstream tasks, has gained increasing popularity in continual learning [44, 38, 12], especially in the class-incremental scenario. More specifically, existing continual learning methods based on PEFT [21, 43] inject the learnable parameters into a pre-trained model using some popular PEFT methods such as prompt-tuning [25] or low-rank adaptation (LoRA) [16]. Subsequently, these methods freeze the pre-trained weights and sequentially fine-tune the injected parameters on multiple tasks throughout the continual learning process.

Although continual learning methods based on PEFT have demonstrated superior performance compared to those not based on PEFT [44], most of them do not consider how to eliminate the interference of the new task on the old tasks, which inhibits the model from making a good trade-off between stability and plasticity. Specifically, when learning a new task, existing continual learning methods based on PEFT either reuse the previously learned parameters to adapt to the new task [44, 12] or randomly expand some parameters first and then adapt to the new task [38, 43, 42]. During this process, the interference of the new task on the old tasks exists due to the shared parameters between new and old tasks, which means fine-tuning a pre-trained model on a new task may interfere with the model’s performance on the old tasks. As a result, it is hard for the model to make a good trade-off between stability and plasticity.

In this work, we propose a new PEFT method, called interference-free low-rank adaptation (InfLoRA), for continual learning. The contributions of this work are listed as follows:

  • InfLoRA injects a small number of parameters to reparameterize the pre-trained weights and shows that fine-tuning these injected parameters is equivalent to fine-tuning the pre-trained weights within a subspace.

  • InfLoRA designs this subspace to eliminate the interference of the new task on the old tasks, making a good trade-off between stability and plasticity.

  • Experimental results show that InfLoRA outperforms existing state-of-the-art continual learning methods on multiple datasets.

2 Related Work and Preliminaries

2.1 Related Work

Parameter-Efficient Fine-Tuning  Parameter-efficient fine-tuning (PEFT) methods freeze a pre-trained model and inject a small number of learnable parameters to adapt to downstream tasks. In this way, PEFT methods avoid the inefficiency of full fine-tuning, which tunes all the parameters of a pre-trained model to learn downstream tasks. For example, Adapter [15] adds small modules in different layers of Transformers and only tunes these added modules to adapt to downstream tasks. Prompt-tuning [25] and Prefix-tuning [27] insert a set of learnable tokens into the input of the Transformer layers and only tune these tokens to adapt to downstream tasks. Low-rank adaptation (LoRA) [16] reparameterizes the pre-trained weights with low-rank branches and only tunes these branches to adapt to downstream tasks. Although these methods tune far fewer learnable parameters than full fine-tuning, they often show comparable or even superior performance compared with full fine-tuning [45, 11, 16, 31]. Early PEFT methods focus on natural language processing (NLP). Recently, PEFT methods have also been proposed for computer vision (CV). For example, visual prompt tuning (VPT) [18] and AdaptFormer [6] apply prompt-tuning and Adapter techniques to CV tasks, respectively. Both exhibit performance comparable to full fine-tuning.

Continual Learning  Early continual learning was usually considered in the context of learning from scratch. Three types of continual learning methods have been proposed: regularization-based methods [46, 20, 1, 23], memory-based methods [2, 7, 3, 39, 28], and expansion-based methods [35, 17, 26]. Regularization-based methods employ a penalty loss (regularization) to prevent important parameters of the old tasks from changing too much. Memory-based methods maintain a memory buffer to store information about the old tasks. Expansion-based methods dynamically expand the model's architecture for each new task.

Recently, with the advancement of pre-trained models [13, 10, 9], using pre-trained models for continual learning has gained increasing popularity. Some continual learning methods fully fine-tune the pre-trained models [4, 49], which has been shown to be inefficient. Other methods explore PEFT in continual learning. For instance, some existing continual learning methods [38, 44, 21, 43] introduce prompt-tuning into continual learning, achieving much higher performance than previous methods that learn from scratch, especially in the class-incremental scenario. The method in [12] introduces a continual learning framework that can be combined with many existing PEFT methods, such as prompt-tuning, LoRA and Adapter. However, none of these methods consider how to eliminate the interference of the new task on the old tasks, which inhibits the model from making a good trade-off between stability and plasticity.

2.2 Preliminaries

We first introduce low-rank adaptation (LoRA) [16], a popular PEFT method related to our method. Then, we give the problem definition for continual learning.

Low-Rank Adaptation  LoRA [16] is one of the most popular PEFT methods. It assumes that the changes of parameters lie in a low-rank space when the model is fully fine-tuned on a downstream task. Specifically, for a linear layer with input dimension $d_I$ and output dimension $d_O$, we represent its weight with $\bm{W}\in\mathbb{R}^{d_O\times d_I}$. Then, LoRA reparameterizes the pre-trained weight $\bm{W}$ by expanding a branch with two matrices, $\bm{A}\in\mathbb{R}^{d_O\times r}$ and $\bm{B}\in\mathbb{R}^{r\times d_I}$. Typically, $r$ is much smaller than the input dimension $d_I$ and the output dimension $d_O$, making $\bm{A}$ a dimensionality increasing matrix and $\bm{B}$ a dimensionality reduction matrix. Finally, LoRA modifies the forward propagation in this linear layer as $\bm{e}=\bm{W}\bm{h}+\bm{A}\bm{B}\bm{h}$. Here, $\bm{h}$ and $\bm{e}$ denote the input and output of this layer, respectively. LoRA initializes $\bm{A}$ as $\bm{0}$ and initializes $\bm{B}$ using a Gaussian distribution. During the learning of the downstream tasks, LoRA freezes the pre-trained weight $\bm{W}$ and only fine-tunes the parameters $\bm{A}$ and $\bm{B}$.
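To make the reparameterization concrete, the following minimal PyTorch-style sketch (our own illustration, not code released with LoRA or InfLoRA; the class name and the Gaussian initialization scale are assumptions) implements a linear layer with a single LoRA branch.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal sketch of a LoRA-reparameterized linear layer: e = W h + A B h."""
    def __init__(self, d_in: int, d_out: int, r: int = 4):
        super().__init__()
        # Frozen pre-trained weight W (random here only for illustration).
        self.W = nn.Parameter(torch.randn(d_out, d_in), requires_grad=False)
        # A (dimensionality increasing matrix) starts at 0, so the branch is initially a no-op.
        self.A = nn.Parameter(torch.zeros(d_out, r))
        # B (dimensionality reduction matrix) is drawn from a Gaussian distribution.
        self.B = nn.Parameter(0.02 * torch.randn(r, d_in))

    def forward(self, h: torch.Tensor) -> torch.Tensor:  # h: (..., d_in)
        return h @ self.W.T + h @ self.B.T @ self.A.T
```

Because only $\bm{A}$ and $\bm{B}$ carry gradients, an optimizer built over `model.parameters()` updates just the low-rank branch.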

Figure 1: (a) The architecture of our InfLoRA in a certain linear layer of a Transformer. During the learning of the $t$-th task, the pre-trained weight and all the old branches are frozen, and only $\bm{A}_t$ is fine-tuned. (b) The pipeline of designing the dimensionality reduction matrix $\bm{B}_t$.

Problem Definition  In continual learning, there is a sequence of tasks with different distributions. We define the task sequence as $\mathcal{D}=\{\mathcal{D}_1,\dots,\mathcal{D}_T\}$, where the $t$-th task $\mathcal{D}_t=\{(\bm{x}_{i,t},y_{i,t})\}_{i=1}^{n_t}$. Here, $\bm{x}_{i,t}$ denotes an input sample and $y_{i,t}$ denotes its label. The objective of continual learning is to train a model sequentially on these tasks and ensure that the model performs well on all of them.

We follow existing continual learning methods [43, 44] based on PEFT and assume the model is a pre-trained Vision Transformer (ViT) [10]. Specifically, the model is $h_{\bm{\Phi}}(f_{\bm{\Theta}}(\cdot))$, where $h_{\bm{\Phi}}(\cdot)$ is the classifier with parameters $\bm{\Phi}$ and $f_{\bm{\Theta}}(\cdot)$ is the pre-trained ViT backbone with pre-trained parameters $\bm{\Theta}$. Similar to existing work [43], our focus is primarily on the class-incremental scenario, where task identities are unknown during inference. Furthermore, we concentrate on the exemplar-free setting [43, 51], where no historical data can be fetched for rehearsal.

3 Methodology

Figure 1 (a) illustrates the architecture of our InfLoRA within a linear layer. Before learning the $t$-th new task, our InfLoRA expands a LoRA-like branch, which includes a dimensionality reduction matrix $\bm{B}_t\in\mathbb{R}^{r\times d_I}$ and a dimensionality increasing matrix $\bm{A}_t\in\mathbb{R}^{d_O\times r}$. Then, the forward propagation of this linear layer is modified as

$\bm{e}=\bm{W}\bm{h}+\sum_{j=1}^{t}\bm{A}_j\bm{B}_j\bm{h}=\bm{W}_{t-1}\bm{h}+\bm{A}_t\bm{B}_t\bm{h}=\bm{W}_t\bm{h}.$   (1)

Here, $\bm{W}_t=\bm{W}_{t-1}+\bm{A}_t\bm{B}_t=\bm{W}+\sum_{i=1}^{t}\bm{A}_i\bm{B}_i$. Similar to LoRA, our InfLoRA also initializes the dimensionality increasing matrix $\bm{A}_t$ as $\bm{0}$. However, different from LoRA, which initializes the dimensionality reduction matrix $\bm{B}$ using a Gaussian distribution, our InfLoRA designs the dimensionality reduction matrix $\bm{B}_t$ before learning the $t$-th task. During the learning of the $t$-th task, InfLoRA fine-tunes $\bm{A}_t$ to learn the new task while keeping the pre-trained weight $\bm{W}$, all the old branches and the matrix $\bm{B}_t$ frozen. After learning the $t$-th task, for any given test sample belonging to the learned tasks, the model uses $\bm{W}_t$ and (1) to infer its label. This design ensures that our method is compatible with the class-incremental scenario, where task identities are unknown during inference.
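A sketch of how such a layer could be organized is given below (our own PyTorch-style illustration; names such as `InfLoRALinear` and `add_branch` are ours). Only the current $\bm{A}_t$ is trainable, while $\bm{W}$, all old branches and the pre-designed $\bm{B}_t$ stay frozen. In practice the finished branches can be merged into the weight, as described in Section 3.3, so the lists stay short.

```python
import torch
import torch.nn as nn

class InfLoRALinear(nn.Module):
    """Sketch of a linear layer with frozen pre-trained W plus LoRA-like branches (Eq. 1)."""
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.W = nn.Parameter(torch.randn(d_out, d_in), requires_grad=False)  # frozen pre-trained weight
        self.As, self.Bs = nn.ParameterList(), nn.ParameterList()

    def add_branch(self, B_t: torch.Tensor):
        """B_t: (r, d_in), designed before training on task t (Eq. 8)."""
        for A in self.As:                 # freeze all old branches
            A.requires_grad_(False)
        r = B_t.shape[0]
        self.As.append(nn.Parameter(torch.zeros(self.W.shape[0], r)))      # A_t = 0, trainable
        self.Bs.append(nn.Parameter(B_t.clone(), requires_grad=False))     # B_t fixed

    def forward(self, h: torch.Tensor) -> torch.Tensor:  # h: (..., d_in)
        e = h @ self.W.T
        for A, B in zip(self.As, self.Bs):
            e = e + h @ B.T @ A.T
        return e
```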

In the following subsections, we first build the relationship between our InfLoRA and the method that fine-tunes the pre-trained weight. Specifically, we show that fine-tuning the parameters $\bm{A}_t$ is equivalent to fine-tuning the pre-trained weight $\bm{W}$ within a subspace spanned by the rows of $\bm{B}_t$. Note that $\bm{B}_t$ is designed before learning the $t$-th task, making this subspace pre-designed. Then, building upon this relationship, we introduce how our InfLoRA designs this subspace to eliminate the interference of the new task on the old tasks and make a good trade-off between stability and plasticity.

3.1 Relationship between InfLoRA and Fine-Tuning the Pre-Trained Weight

When the $t$-th task arrives and our method has expanded a new branch, the forward propagation in this layer can be represented by (1). At this time, we can prove the following proposition:

Proposition 1.

When learning the $t$-th task with forward propagation represented by (1), fine-tuning $\bm{A}_t$ is equivalent to fine-tuning the pre-trained weight $\bm{W}$ within the subspace ${\rm span}\{\bm{b}_1^t,\dots,\bm{b}_r^t\}$. Here, $\bm{b}_i^t$ ($1\leq i\leq r$) denotes the $i$-th row vector of $\bm{B}_t$.

Proof.

When tuning the pre-trained weight $\bm{W}$ to learn the $t$-th task, we can compute the gradient of $\bm{W}$ based on the chain rule:

$\frac{\partial\mathcal{L}}{\partial\bm{W}}=\frac{\partial\mathcal{L}}{\partial\bm{e}}\frac{\partial\bm{e}}{\partial\bm{W}}=\frac{\partial\mathcal{L}}{\partial\bm{e}}\bm{h}^T.$   (2)

Here, $\mathcal{L}$ denotes the loss function. At this time, the change of $\bm{W}$ can be denoted as $\Delta\bm{W}=-\alpha\frac{\partial\mathcal{L}}{\partial\bm{W}}$, where $\alpha$ is the learning rate. Then, we can compute the change of the composed matrix $\bm{W}_t=\bm{W}+\sum_{j=1}^{t}\bm{A}_j\bm{B}_j$:

$\Delta_{\bm{W}}\bm{W}_t=[\bm{W}+\Delta\bm{W}+\sum_{j=1}^{t}\bm{A}_j\bm{B}_j]-(\bm{W}+\sum_{j=1}^{t}\bm{A}_j\bm{B}_j)=\Delta\bm{W}=-\alpha\frac{\partial\mathcal{L}}{\partial\bm{W}_t}=-\alpha\frac{\partial\mathcal{L}}{\partial\bm{e}}\bm{h}^T.$   (3)

Here, we use $\Delta_{\bm{W}}\bm{W}_t$ to denote the change of the composed matrix $\bm{W}_t$ caused by the change of $\bm{W}$.

Similarly, when tuning the expanded weight $\bm{A}_t$, we can get the gradient of $\bm{A}_t$ based on the chain rule:

$\frac{\partial\mathcal{L}}{\partial\bm{A}_t}=\frac{\partial\mathcal{L}}{\partial\bm{e}}\frac{\partial\bm{e}}{\partial\bm{A}_t}=\frac{\partial\mathcal{L}}{\partial\bm{e}}\bm{h}^T\bm{B}_t^T.$   (4)

At this time, the change of $\bm{A}_t$ can be denoted as $\Delta\bm{A}_t=-\alpha\frac{\partial\mathcal{L}}{\partial\bm{A}_t}$. Then, we can compute the change of the composed matrix $\bm{W}_t=\bm{W}_{t-1}+\bm{A}_t\bm{B}_t$:

$\Delta_{\bm{A}_t}\bm{W}_t=[\bm{W}_{t-1}+(\bm{A}_t+\Delta\bm{A}_t)\bm{B}_t]-(\bm{W}_{t-1}+\bm{A}_t\bm{B}_t)=\Delta\bm{A}_t\bm{B}_t=-\alpha\frac{\partial\mathcal{L}}{\partial\bm{A}_t}\bm{B}_t=-\alpha\frac{\partial\mathcal{L}}{\partial\bm{e}}\bm{h}^T\bm{B}_t^T\bm{B}_t=\Delta_{\bm{W}}\bm{W}_t\bm{B}_t^T\bm{B}_t.$   (5)

Here, we use $\Delta_{\bm{A}_t}\bm{W}_t$ to denote the change of the composed matrix $\bm{W}_t$ caused by the change of $\bm{A}_t$. The fourth equality in (5) holds because of (4), and the fifth equality in (5) holds because of (2). (5) shows that $\Delta_{\bm{A}_t}\bm{W}_t$ is equal to $\Delta_{\bm{W}}\bm{W}_t$ multiplied by a projection matrix $\bm{B}_t^T\bm{B}_t$. Since $\bm{B}_t^T\bm{B}_t$ projects each row vector of $\Delta_{\bm{W}}\bm{W}_t$ into the subspace ${\rm span}\{\bm{b}_1^t,\dots,\bm{b}_r^t\}$, Proposition 1 holds. ∎

Proposition 1 has demonstrated that using our InfLoRA to train the model is equivalent to directly fine-tuning the pre-trained weight $\bm{W}$ within the subspace ${\rm span}\{\bm{b}_1^t,\dots,\bm{b}_r^t\}$. Therefore, before learning the $t$-th task, we can design the matrix $\bm{B}_t$ such that learning the $t$-th task in the subspace ${\rm span}\{\bm{b}_1^t,\dots,\bm{b}_r^t\}$ will not interfere with the performance of the model on the old tasks.
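As a sanity check of Proposition 1, the short snippet below (our own illustration with toy dimensions and an arbitrary scalar loss) compares the change of the composed weight caused by a gradient step on $\bm{A}_t$ with the projected change caused by the same step on $\bm{W}$; the two coincide, matching (5).

```python
import torch

torch.manual_seed(0)
d_I, d_O, r, lr = 8, 6, 3, 0.1
W = torch.randn(d_O, d_I, requires_grad=True)   # stands in for the pre-trained/composed weight
B = torch.randn(r, d_I)                         # pre-designed B_t (fixed)
A = torch.zeros(d_O, r, requires_grad=True)     # trainable A_t, initialized to 0
h = torch.randn(d_I)                            # one input vector

e = W @ h + A @ (B @ h)                         # forward propagation (1)
loss = e.pow(2).sum()                           # any scalar loss
loss.backward()

delta_from_A = (-lr * A.grad) @ B               # change of W_t caused by updating A_t
delta_from_W = (-lr * W.grad) @ B.T @ B         # change of W_t from updating W, projected by B_t^T B_t
print(torch.allclose(delta_from_A, delta_from_W, atol=1e-6))   # True
```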

3.2 Eliminating the Interference of the New Task on the Old Tasks

We first introduce the characteristics that InfLoRA aims to impose on the subspace ${\rm span}\{\bm{b}_1^t,\dots,\bm{b}_r^t\}$. With these characteristics, InfLoRA can eliminate the interference of the new task on the old tasks and make a good trade-off between stability and plasticity. Then, we introduce how to design the dimensionality reduction matrix $\bm{B}_t$ so that the subspace ${\rm span}\{\bm{b}_1^t,\dots,\bm{b}_r^t\}$ has these characteristics.

3.2.1 Desired Characteristics

First, InfLoRA aims to make the subspace ${\rm span}\{\bm{b}_1^t,\dots,\bm{b}_r^t\}$ orthogonal to the gradients of all the old tasks. In this way, according to Proposition 1, the update of InfLoRA, which can be represented as $\Delta_{\bm{A}_t}\bm{W}_t$, will also be orthogonal to the gradients of the old tasks. Note that the idea of making the update for the new task orthogonal to the gradients of the old tasks to eliminate the interference of the new task on the old tasks has been proposed in many existing continual learning methods [36, 30]. However, these existing methods are designed for continual learning from scratch and involve updating all the parameters of the model, which is incompatible with the PEFT setting. On the contrary, our method is a PEFT method, which only tunes the parameters in $\bm{A}_t$.
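To see why this characteristic removes interference, recall (used again in Section 3.2.2) that the gradients of a linear layer lie in the span of its inputs. The toy snippet below (our own construction with made-up dimensions, not the paper's procedure) builds a $\bm{B}_t$ whose rows are orthogonal to the old-task inputs and checks that any update of the form $\Delta\bm{A}_t\bm{B}_t$ leaves the layer output on old-task inputs unchanged.

```python
import torch

torch.manual_seed(0)
d_I, d_O, r = 16, 8, 2
H_old = torch.randn(d_I, 5)                 # columns: inputs seen on old tasks
Q, _ = torch.linalg.qr(H_old)               # orthonormal basis of the old-task input (gradient) space
cand = torch.randn(r, d_I)
B_t = cand - cand @ Q @ Q.T                 # rows of B_t made orthogonal to that space
delta_A = torch.randn(d_O, r)               # an arbitrary update of A_t

# The composed-weight update delta_A @ B_t has no effect on old-task inputs:
print((delta_A @ B_t @ H_old).abs().max())  # ~0, i.e. no interference with the old tasks
```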

Besides eliminating the interference of the new task on the old tasks, our InfLoRA further makes the subspace ${\rm span}\{\bm{b}_1^t,\dots,\bm{b}_r^t\}$ lie in a subspace in which the gradient of the new task lies, so as to make a good trade-off between stability and plasticity. Specifically, existing work [19] has shown that during fine-tuning, the weight increments of a pre-trained ViT exhibit redundancy in terms of weight rank. Therefore, the gradients of the new task lie in a low-dimensional subspace. Our method makes ${\rm span}\{\bm{b}_1^t,\dots,\bm{b}_r^t\}$ not only orthogonal to the gradients of the old tasks but also lie in the subspace in which the gradients of the new task $t$ lie. By doing so, our method makes the model focus on the new task while eliminating the interference of the new task on the old tasks, thereby making a good trade-off between stability and plasticity. The experiments in Section 4 verify the effectiveness of these two characteristics.

3.2.2 Designing Dimensionality Reduction Matrix

InfLoRA first approximates the gradient spaces of the new task and the old tasks. Here, we use $\mathcal{N}_t$ to represent the gradient space of the new task approximated by InfLoRA. Similarly, we use $\mathcal{M}_t$ to represent the gradient space of the previous $t-1$ old tasks approximated by InfLoRA. We also use $\mathcal{M}_t^{\bot}$ to denote the residual gradient space, which is orthogonal to the space $\mathcal{M}_t$. Then, in order to satisfy the characteristics described in Section 3.2.1, InfLoRA ensures that each row of $\bm{B}_t$ lies in $\mathcal{N}_t\cap\mathcal{M}_t^{\bot}$. In other words, InfLoRA makes ${\rm span}\{\bm{b}_1^t,\dots,\bm{b}_r^t\}\subseteq\mathcal{N}_t\cap\mathcal{M}_t^{\bot}$.

Existing works [36, 29] have shown that the gradient update of a linear layer lies in the span of its inputs. Please refer to the supplementary material for a detailed explanation of this proposition. Therefore, InfLoRA uses the input matrix of the new task $t$ to approximate the gradient space of the new task. Specifically, InfLoRA computes the input matrix $\bm{H}_t=[\bm{h}_1^t,\dots,\bm{h}_n^t]$, with each column of $\bm{H}_t$ representing an input vector of the $t$-th task. Then, InfLoRA considers $\mathcal{N}_t$ as the subspace spanned by the columns of $\bm{H}_t$.

However, InfLoRA cannot use the input matrix of the old tasks to approximate their gradient space, since the data from the old tasks is not available when the model learns the new task. Instead, existing methods such as gradient projection memory (GPM) [36] and dual gradient projection memory (DualGPM) [29] can learn a matrix to preserve information about the gradients of the old tasks, and InfLoRA incorporates DualGPM to preserve this gradient information. With the assistance of DualGPM, the model maintains either a matrix $\bm{M}_t\in\mathbb{R}^{d_I\times k_t}$ or a matrix $\bm{M}_t^{\bot}\in\mathbb{R}^{d_I\times(d_I-k_t)}$. Here, the columns of $\bm{M}_t$ form orthonormal bases of $\mathcal{M}_t$, the columns of $\bm{M}_t^{\bot}$ form orthonormal bases of $\mathcal{M}_t^{\bot}$, and $k_t$ denotes the dimension of $\mathcal{M}_t$. For details on how DualGPM maintains the orthonormal bases $\bm{M}_t$ or $\bm{M}_t^{\bot}$, please refer to the supplementary material or the original paper [29].

After approximating the gradient spaces of the new task and the old tasks, InfLoRA gets the component of $\mathcal{N}_t$ which lies in $\mathcal{M}_t^{\bot}$. Specifically, when the model maintains $\mathcal{M}_t$, InfLoRA performs the operation

$\hat{\bm{H}}_t=\bm{H}_t-\bm{M}_t\bm{M}_t^T\bm{H}_t.$   (6)

Similarly, when the model maintains $\mathcal{M}_t^{\bot}$, InfLoRA performs the operation

$\hat{\bm{H}}_t=\bm{M}_t^{\bot}(\bm{M}_t^{\bot})^T\bm{H}_t.$   (7)

Note that when $t=1$, $\mathcal{M}_t$ is a null space and $\hat{\bm{H}}_t=\bm{H}_t$. Obviously, each column of $\hat{\bm{H}}_t$ lies in $\mathcal{N}_t\cap\mathcal{M}_t^{\bot}$. However, since $(\hat{\bm{H}}_t)^T\in\mathbb{R}^{n\times d_I}$ and $\bm{B}_t\in\mathbb{R}^{r\times d_I}$ have different shapes, InfLoRA cannot directly define $\bm{B}_t$ as $(\hat{\bm{H}}_t)^T$. Since $n\gg r$, InfLoRA uses the principal components of $(\hat{\bm{H}}_t)^T$ to set $\bm{B}_t$. Specifically, singular value decomposition (SVD) is performed on $(\hat{\bm{H}}_t)^T=\bm{V}_t\bm{\Sigma}_t\bm{U}_t$. Then, InfLoRA designs $\bm{B}_t$ by

$\bm{B}_t=(\bm{U}_t)_r.$   (8)

Here, $(\bm{U}_t)_r$ denotes the rows of $\bm{U}_t$ corresponding to the top-$r$ singular values. Figure 1 (b) illustrates the pipeline of designing the matrix $\bm{B}_t$.
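A compact sketch of this design step is given below (our own illustration; the function name and interface are assumptions, and the DualGPM bookkeeping that produces $\bm{M}_t$ is not shown).

```python
import torch

def design_B(H_t: torch.Tensor, M_t, r: int) -> torch.Tensor:
    """Sketch of the B_t design in Fig. 1(b).

    H_t: (d_I, n) input matrix of the new task, one input per column.
    M_t: (d_I, k_t) orthonormal basis of the old-task gradient space, or None when t = 1.
    Returns B_t: (r, d_I), whose rows lie in N_t ∩ M_t^⊥.
    """
    if M_t is None:                       # t = 1: M_t is a null space, so H_hat = H_t
        H_hat = H_t
    else:                                 # Eq. (6): remove the components lying in M_t
        H_hat = H_t - M_t @ (M_t.T @ H_t)
    # Eq. (8): SVD of H_hat^T; keep the right singular vectors of the top-r singular values
    _, _, Uh = torch.linalg.svd(H_hat.T, full_matrices=False)
    return Uh[:r]                         # B_t = (U_t)_r
```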

Note that DualGPM expands the subspace $\mathcal{M}_t$ and reduces the subspace $\mathcal{M}_t^{\bot}$ when the number of tasks increases. Since InfLoRA constrains the update of the model within the subspace $\mathcal{N}_t\cap\mathcal{M}_t^{\bot}\subseteq\mathcal{M}_t^{\bot}$, the space for learning a new task reduces when the number of tasks increases. However, by adjusting the approximation error of the gradients of the old tasks, DualGPM can expand $\mathcal{M}_t$ slowly and reduce $\mathcal{M}_t^{\bot}$ slowly. Therefore, the constraints imposed by InfLoRA do not excessively affect the model's learning of new tasks. Please refer to the supplementary material for a detailed explanation.

Algorithm 1 InfLoRA for Continual Learning
1:  Input: The data of different tasks $\{\mathcal{D}_t\}_{t=1}^{T}$, a pre-trained ViT model $f_{\bm{\Theta}}(\cdot)$.
2:  Output: Network $f_{\bm{\Theta}}(\cdot)$ with learned parameters $\bm{W}_t$.
3:  for $t$ in $1:T$ do
4:     Design $\bm{B}_t$ through (8);
5:     Expand a new branch for the $t$-th task;
6:     for $\mathcal{B}_t$ sampled from $\mathcal{D}_t$ do
7:        Compute the loss $\mathcal{L}(f_{\bm{\Theta}}(\mathcal{B}_t))$ through (9) and update the parameters;
8:     end for
9:     Preserve the information about the gradient of the $t$-th task through DualGPM;
10:  end for

3.3 Whole Process of InfLoRA

Algorithm 1 outlines the whole process of InfLoRA in continual learning. When the $t$-th new task arrives, InfLoRA first designs $\bm{B}_t$ through (8) and expands a new branch. Then, InfLoRA learns the $t$-th task by fine-tuning the newly expanded branch. Please note that, based on empirical findings from existing methods [38, 12], we employ the local cross-entropy (CE) loss as the learning objective, as it usually performs better than the global CE loss in continual learning methods based on PEFT. The local CE is the CE loss constrained to the classes of the current new task, which can be denoted as

$\mathcal{L}(\mathcal{D}_t)=\frac{1}{|\mathcal{D}_t|}\sum_{(\bm{x},y)\in\mathcal{D}_t}\mathcal{L}_{ce}({\rm mask}(h_{\bm{\Phi}}(f_{\bm{\Theta}}(\bm{x}))),y).$   (9)

Here, ${\rm mask}(\cdot)$ is a function that filters out the logits of the old classes and $\mathcal{L}_{ce}$ denotes the standard CE loss. After learning the $t$-th new task, InfLoRA follows DualGPM to preserve the information about the gradient of the $t$-th task.
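One possible implementation of the masked local CE loss is sketched below (our own illustration; `task_classes` is an assumed tensor holding the global indices of the classes in the current task, and all targets are assumed to belong to it).

```python
import torch
import torch.nn.functional as F

def local_ce(logits: torch.Tensor, targets: torch.Tensor, task_classes: torch.Tensor) -> torch.Tensor:
    """Local CE of Eq. (9): logits of classes outside the current task are masked out,
    so the loss is effectively restricted to the new task's classes."""
    mask = torch.full_like(logits, float('-inf'))
    mask[:, task_classes] = 0.0            # keep only the current task's logits
    return F.cross_entropy(logits + mask, targets)
```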

Note that the branch corresponding to the $t$-th task is frozen once the model has learned the $t$-th task. Since the expanded branches are linear transformations, we can integrate the old branches into the pre-trained weight to reduce the number of expanded parameters. Specifically, after learning the first task, InfLoRA integrates the first branch into the pre-trained weight and obtains the weight $\bm{W}_1=\bm{W}+\bm{A}_1\bm{B}_1$. Before learning the $t$-th new task ($t>1$), InfLoRA maintains the weight $\bm{W}_{t-1}$. After learning the $t$-th task, InfLoRA integrates the $t$-th branch into $\bm{W}_{t-1}$ and obtains $\bm{W}_t=\bm{W}_{t-1}+\bm{A}_t\bm{B}_t$. In this way, the parameters in $\bm{A}_t$ and $\bm{B}_t$ do not need to be maintained during the learning of subsequent tasks. Therefore, throughout the whole learning process, the number of parameters expanded by InfLoRA equals the number of parameters in a single branch. Since a single branch contains $(d_I+d_O)r$ parameters, the number of parameters expanded by InfLoRA is always $(d_I+d_O)r$.
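The merging step itself is just a matrix sum, as in the sketch below (our own illustration with example dimensions; the ViT-B/16 hidden size and the rank are assumptions).

```python
import torch

d_I, d_O, r = 768, 768, 10        # example dimensions (ViT-B/16 hidden size, assumed rank r)
W_prev = torch.randn(d_O, d_I)    # W_{t-1}: pre-trained weight with all old branches merged in
A_t = torch.randn(d_O, r)         # learned on task t
B_t = torch.randn(r, d_I)         # designed for task t

W_t = W_prev + A_t @ B_t          # W_t = W_{t-1} + A_t B_t; A_t and B_t can now be discarded
```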

Table 1: Results (%) on ImageNet-R. Results are included for 5 tasks, 10 tasks, and 20 tasks. We report results averaged over 5 trials.
Method          | $ACC_5$ (↑)  | $\overline{ACC}_5$ (↑) | $ACC_{10}$ (↑) | $\overline{ACC}_{10}$ (↑) | $ACC_{20}$ (↑) | $\overline{ACC}_{20}$ (↑)
joint           | 81.14 ± 0.34 | -            | 81.14 ± 0.34 | -            | 81.14 ± 0.34 | -
sequential      | 58.74 ± 1.28 | 72.91 ± 0.28 | 46.07 ± 1.15 | 62.91 ± 0.68 | 34.62 ± 0.85 | 51.15 ± 1.50
L2P [44]        | 64.13 ± 0.78 | 68.66 ± 0.41 | 62.54 ± 0.24 | 67.98 ± 0.27 | 57.92 ± 0.28 | 64.57 ± 0.29
DualPrompt [43] | 67.88 ± 0.17 | 71.16 ± 0.31 | 65.41 ± 0.52 | 69.39 ± 0.43 | 61.00 ± 0.72 | 65.80 ± 0.67
CODA-P [38]     | 73.09 ± 0.21 | 76.91 ± 0.21 | 71.47 ± 0.35 | 75.82 ± 0.29 | 67.28 ± 0.30 | 72.34 ± 0.17
C-LoRA [37]     | 75.85 ± 0.31 | 78.85 ± 0.34 | 71.89 ± 0.45 | 75.33 ± 0.28 | 65.71 ± 0.60 | 70.63 ± 0.85
LAE [12]        | 73.84 ± 0.14 | 77.29 ± 0.45 | 71.70 ± 0.39 | 76.71 ± 0.10 | 66.98 ± 0.35 | 73.72 ± 0.05
InfLoRA-b5      | 75.28 ± 0.01 | 78.95 ± 0.08 | 74.13 ± 0.18 | 78.54 ± 0.14 | 68.41 ± 0.29 | 74.00 ± 0.50
InfLoRA         | 77.52 ± 0.37 | 82.01 ± 0.12 | 75.65 ± 0.14 | 80.82 ± 0.24 | 71.01 ± 0.45 | 77.28 ± 0.45

4 Experiments

4.1 Experimental Settings

Datasets and Evaluation Metric  Similar to existing continual learning methods [12, 44] based on PEFT, we use ImageNet-R [14], CIFAR100 [24], and DomainNet [34] to train and evaluate the models. ImageNet-R is generated through artistic processing of 200 classes from ImageNet [8]. This dataset was introduced to continual learning by existing work [43] and has become a standard benchmark for continual learning methods based on PEFT. CIFAR100 is a dataset commonly used in existing continual learning works. DomainNet contains 345 classes and is introduced by some existing works [38, 42] for continual learning. Following existing continual learning work [38], we split ImageNet-R into 5, 10, and 20 tasks, with each task containing 40, 20, and 10 classes, respectively. We split CIFAR100 into 10 tasks, each containing 10 classes. We split DomainNet into 5 tasks, each containing 69 classes.

Following existing continual learning methods [12, 44], we evaluate the performance of the model through two popular metrics, including the final accuracy $ACC_{T}$ and the averaged accuracy $\overline{ACC}_{T}=\frac{1}{T}\sum_{i=1}^{T}ACC_{i}$, where $T$ denotes the total number of tasks and $ACC_{i}$ is defined as

$ACC_{i}=\frac{1}{i}\sum_{j=1}^{i}a_{i,j}.$  (10)

Here, $a_{i,j}$ denotes the accuracy of the $j$-th task once the model has learned the $i$-th task.
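As a small worked example, the two metrics can be computed from the accuracy matrix $a_{i,j}$ as follows. This is our own helper with 0-based indexing and hypothetical numbers, not part of the evaluation code:

```python
import numpy as np

def final_and_average_accuracy(a):
    """Compute ACC_T and the averaged accuracy from an accuracy matrix,
    where a[i, j] is the accuracy on task j (j <= i) after learning task i."""
    T = a.shape[0]
    acc = np.array([a[i, : i + 1].mean() for i in range(T)])   # ACC_1, ..., ACC_T
    return acc[-1], acc.mean()                                 # ACC_T and the mean of ACC_i

# Hypothetical numbers for T = 3 tasks (only the lower triangle is used).
a = np.array([[90.0,  0.0,  0.0],
              [85.0, 88.0,  0.0],
              [80.0, 84.0, 87.0]])
final_acc, avg_acc = final_and_average_accuracy(a)
```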

Baselines  We compare our InfLoRA with state-of-the-art continual learning methods based on PEFT, including learn to prompt (L2P) [44], DualPrompt [43], continual decomposed attention-based prompt (CODA-P) [38], learning accumulation ensemble (LAE) [12], and continual low-rank adaptation (C-LoRA) [37]. For LAE, we implement it with LoRA [16]. Following existing works [38, 12], we also include two reference methods, joint and sequential, in the comparison. Here, joint denotes the method that learns all the tasks jointly, while sequential denotes the method that learns all the tasks sequentially without any operation to overcome forgetting. The accuracy of joint can be treated as an accuracy upper bound, and the accuracy of sequential can be treated as an accuracy lower bound.

Architecture and Training Details  We follow existing works [12, 43] to perform experiments. Specifically, we use a ViT-B/16 backbone [10] pre-trained on ImageNet-21K in a supervised manner as the pre-trained model.

For all the methods, we follow existing works [38, 44, 12] and use the Adam optimizer [22] with running averages of the gradient and its square ($\beta_1=0.9$, $\beta_2=0.999$). Each task is trained for 50 epochs on ImageNet-R, 20 epochs on CIFAR100, and 5 epochs on DomainNet. The batch size is set to 128 for all the experiments. Since our InfLoRA shares a similar architecture with LoRA, we follow existing work [12] and insert the architecture of our InfLoRA into the key and value projections of the attention modules. Furthermore, the existing method DualPrompt [43] treats the positions of the inserted blocks as hyperparameters and searches for the best positions for its prompts. On the contrary, we insert the architecture of InfLoRA into all the Transformer blocks to avoid this search. We also implement a variant of our method that inserts InfLoRA only into the bottom 5 Transformer blocks, like the existing methods DualPrompt and CODA-P. We call this variant InfLoRA-b5. As for the hyperparameter $r$, we determine its value through a grid search on a validation dataset.
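To make the insertion concrete, the following is a minimal PyTorch sketch of a LoRA-style branch added to a frozen linear projection, in the spirit of the description above: the dimensionality reduction matrix $\bm{B}_t$ is fixed and only the expansion matrix $\bm{A}_t$ is trained. The module and attribute names (LowRankBranch, k_proj, v_proj) and the placeholder initialization of B are our own assumptions; the way InfLoRA actually constructs $\bm{B}_t$ from the subspaces is described in the method section and is omitted here.

```python
import torch
import torch.nn as nn

class LowRankBranch(nn.Module):
    """Sketch: a frozen pre-trained linear layer plus a low-rank branch.

    Mirrors the description in the text (B_t is designed and then frozen,
    only A_t is learned); it is not the authors' implementation.
    """

    def __init__(self, base_linear: nn.Linear, r: int):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad_(False)                   # freeze the pre-trained weights
        d_in, d_out = base_linear.in_features, base_linear.out_features
        # B_t: dimensionality reduction matrix, fixed after it is designed.
        self.B = nn.Parameter(torch.randn(r, d_in) / d_in ** 0.5, requires_grad=False)
        # A_t: expansion matrix, initialized to zero so training starts from W.
        self.A = nn.Parameter(torch.zeros(d_out, r))

    def forward(self, x):
        # (W x) + A_t (B_t x): fine-tuning A_t updates the weights within
        # the subspace spanned by the rows of B_t.
        return self.base(x) + (x @ self.B.t()) @ self.A.t()

# Hypothetical usage: wrap the key and value projections of an attention module.
# attn.k_proj = LowRankBranch(attn.k_proj, r=10)
# attn.v_proj = LowRankBranch(attn.v_proj, r=10)
```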

Table 2: Results (%) on CIFAR100 and DomainNet. We report results over 5 trials.
Tasks              CIFAR100 (10 tasks)                                     DomainNet (5 tasks)
Method             $ACC_{10}$ ($\uparrow$)  $\overline{ACC}_{10}$ ($\uparrow$)  $ACC_{5}$ ($\uparrow$)  $\overline{ACC}_{5}$ ($\uparrow$)
joint              91.92±0.05               -                                   77.72±0.04              -
sequential         62.18±3.59               80.42±0.23                          53.44±1.21              69.09±0.33
L2P [44]           82.48±0.20               87.64±0.25                          70.16±0.05              75.60±0.03
DualPrompt [43]    84.42±0.30               90.06±0.07                          72.14±0.05              77.71±0.06
CODA-P [38]        86.62±0.11               91.08±0.28                          73.23±0.13              78.72±0.07
C-LoRA [37]        82.97±0.47               88.81±0.34                          69.34±0.13              75.25±0.11
LAE [12]           84.15±0.10               89.84±0.03                          66.85±0.40              75.01±0.17
InfLoRA-b5         87.06±0.25               91.59±0.13                          73.26±0.50              78.82±0.34
InfLoRA            86.51±0.73               91.70±0.32                          74.53±0.23              79.57±0.57
Figure 2: Variation of the performance of different methods during the learning of ImageNet-R and CIFAR100.

4.2 Experimental Results

Accuracy  Table 1 shows the results of different methods on ImageNet-R with different numbers of tasks. Table 2 shows the results of different methods on CIFAR100 and DomainNet. We can find that our methods InfLoRA and InfLoRA-b5 outperform existing continual learning methods.

Figure 2 shows the variation of the accuracy of different continual learning methods on ImageNet-R and CIFAR100. We can find that our method outperforms existing methods not only at the end of learning but also throughout the whole learning process. This indicates that our InfLoRA eliminates the interference of the new task on the old tasks, and thus the accuracy of our method decreases more slowly than that of other methods.

Analysis of Expanded Parameters  Figure 3 shows the number of expanded parameters and the accuracy of different methods on ImageNet-R and CIFAR100. For L2P, DualPrompt, and CODA-P, the expanded parameters are the added prompts and the corresponding keys. For LAE, the expanded parameters are the inserted LoRA modules and an additional copy of them. For C-LoRA, the expanded parameters are the inserted LoRA modules. For our method, the expanded parameters are $\bm{B}_t$ and $\bm{A}_t$. The details of computing the number of expanded parameters for different methods are given in the supplementary material. We can find that CODA-P and C-LoRA expand many more parameters than the other methods. Furthermore, our methods InfLoRA and InfLoRA-b5 expand a number of parameters comparable to L2P, DualPrompt, and LAE but perform better than these methods.
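Since the exact counting is deferred to the supplementary material, the helper below is only a back-of-the-envelope sketch of our own for InfLoRA's expanded parameters, under assumed values (12 Transformer blocks and hidden size 768 for ViT-B/16, key and value projections wrapped, and a hypothetical rank r = 10):

```python
def inflora_expanded_params(num_blocks=12, d_model=768, r=10, projections=2):
    """Rough count (assumption-based): one B_t (r x d) and one A_t (d x r)
    per wrapped projection, in every Transformer block."""
    return num_blocks * projections * 2 * d_model * r

print(inflora_expanded_params())  # ~0.37M parameters per task under these assumptions
```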

Table 3: Results of different variants on ImageNet-R with different numbers of tasks.
Tasks                                                      5                                                   10                                                   20
Variant                                                    $ACC_{5}$ ($\uparrow$)  $\overline{ACC}_{5}$ ($\uparrow$)  $ACC_{10}$ ($\uparrow$)  $\overline{ACC}_{10}$ ($\uparrow$)  $ACC_{20}$ ($\uparrow$)  $\overline{ACC}_{20}$ ($\uparrow$)
Random $\rightarrow\bm{B}_t$                               72.49±0.38              79.40±0.29                         67.38±0.41               76.62±0.06                          56.17±0.29               69.24±0.35
$\mathcal{N}_t\rightarrow\bm{B}_t$                         67.01±0.11              76.09±0.04                         57.91±0.30               70.23±0.59                          40.73±0.29               59.68±0.52
$\mathcal{M}_t^{\bot}\rightarrow\bm{B}_t$                  75.94±0.53              80.69±0.27                         74.61±0.62               79.67±0.27                          68.79±0.42               75.74±0.26
$\mathcal{N}_t\cap\mathcal{M}_t^{\bot}\rightarrow\bm{B}_t$ (InfLoRA)  77.52±0.37   82.01±0.12                         75.65±0.14               80.82±0.24                          71.01±0.45               77.28±0.45

Ablation Study  We perform experiments to verify the effectiveness of designing the dimensionality reduction matrix $\bm{B}_t$ by (8). Specifically, we explore three different variants for designing $\bm{B}_t$. The first variant designs $\bm{B}_t$ randomly using a Gaussian distribution. We call this variant ‘Random $\rightarrow\bm{B}_t$’. The second variant discards the operation in (6) or (7) and directly sets $\hat{\bm{H}}_t=\bm{H}_t$. In this way, this variant ensures that each row of $\bm{B}_t$ lies in $\mathcal{N}_t$ while ignoring $\mathcal{M}_t^{\bot}$. We call this variant ‘$\mathcal{N}_t\rightarrow\bm{B}_t$’. The third variant does not compute the input matrix but initializes $\bm{H}_t$ using a Gaussian distribution before applying the operation in (6) or (7). In this way, this variant ensures that each row of $\bm{B}_t$ lies in $\mathcal{M}_t^{\bot}$ while ignoring $\mathcal{N}_t$. We call this variant ‘$\mathcal{M}_t^{\bot}\rightarrow\bm{B}_t$’. Since our method considers both $\mathcal{M}_t^{\bot}$ and $\mathcal{N}_t$, we use ‘$\mathcal{N}_t\cap\mathcal{M}_t^{\bot}\rightarrow\bm{B}_t$’ to represent our method.

Figure 3: The number of expanded parameters and the accuracy of different methods on ImageNet-R and CIFAR100.

Table 3 shows the results of our method and its variants. We can find that all these variants fail to perform as well as our method. To further demonstrate the behavior of the different variants, Figure 4 shows the relative accuracy on each task after the model has learned all of them. Here, relative accuracy is the accuracy of a variant minus the accuracy of our InfLoRA. Note that in Figure 4, the last task is the new task and the other tasks are old tasks. As we can see, ‘Random $\rightarrow\bm{B}_t$’ and ‘$\mathcal{N}_t\rightarrow\bm{B}_t$’ outperform ‘$\mathcal{M}_t^{\bot}\rightarrow\bm{B}_t$’ on the new task but show much lower accuracy than ‘$\mathcal{M}_t^{\bot}\rightarrow\bm{B}_t$’ and our InfLoRA on the old tasks. This means these two variants fail to eliminate the interference of the new task on the old tasks, making the model suffer from low stability. On the contrary, ‘$\mathcal{M}_t^{\bot}\rightarrow\bm{B}_t$’ shows the lowest performance on the new task, which means it ignores the plasticity of the model. Our method outperforms all the variants on most of the tasks. This shows that our method can eliminate the interference of the new task on the old tasks and make a better trade-off between stability and plasticity than these variants.

Varying the Pre-Trained Model  We also follow the existing method [40] and perform experiments using ViT-B/16 backbones pre-trained with two different self-supervised methods, DINO [5] and iBOT [50]. All experimental settings, except for the choice of the pre-trained model, are kept consistent with the details outlined in Section 4.1.

Figure 4: Relative accuracy of different tasks. Relative accuracy is the accuracy of different variants minus the accuracy of InfLoRA.
Table 4: Results (%) of different methods on ImageNet-R (10 tasks) using various self-supervised pre-trained models. Here, DINO-1k and iBOT-1k indicate that the ViT is pre-trained on ImageNet-1k using these respective methods.
Pre-training   Method            $ACC_{10}$ ($\uparrow$)   $\overline{ACC}_{10}$ ($\uparrow$)
DINO-1k        L2P [44]          56.71±0.12                63.59±0.21
               DualPrompt [43]   60.23±0.42                66.57±0.25
               CODA-P [38]       64.02±0.68                71.50±0.42
               C-LoRA [37]       63.07±0.36                68.09±0.41
               LAE [12]          61.03±0.27                69.89±0.15
               InfLoRA-b5        66.16±0.14                73.01±0.17
               InfLoRA           68.31±0.28                76.15±0.05
iBOT-1k        L2P [44]          60.80±0.35                66.58±0.28
               DualPrompt [43]   63.78±0.38                68.88±0.16
               CODA-P [38]       68.02±0.48                74.28±0.47
               C-LoRA [37]       68.60±0.07                73.47±0.28
               LAE [12]          64.14±0.29                72.59±0.22
               InfLoRA-b5        69.72±0.44                76.11±0.13
               InfLoRA           71.84±0.09                78.29±0.09

Table 4 shows the results of different methods on ImageNet-R when using these pre-trained models. Comparing these results to those in Table 1, we can find that all methods perform worse with self-supervised pre-trained models than with the supervised pre-trained model. However, our methods still outperform all the other methods.

Combining with Classifier Alignment  Slow learner with classifier alignment (SLCA) [48] utilizes feature statistics to align the classifier, demonstrating superior performance compared to methods without classifier alignment. Our InfLoRA can be combined with classifier alignment (CA) to achieve better performance. Specifically, after learning the $t$-th task with parameters $\bm{A}_t$ and $\bm{B}_t$ and loss (9), we collect the features $\bm{F}_t=\{\bm{r}_{i,t}\}_{i=1}^{n_t}$ of the $t$-th task. Here, $\bm{r}_{i,t}=f_{\bm{\Theta}}(\bm{x}_{i,t})$ denotes the features extracted by the backbone $f_{\bm{\Theta}}(\cdot)$. Then, the mean and covariance of the features of each class are computed and saved. After that, for each class $c$ the model has seen during continual learning, $S$ samples are drawn from the Gaussian distribution $\mathcal{N}(\bm{\mu}_c,\bm{\Sigma}_c)$, where $\bm{\mu}_c$ and $\bm{\Sigma}_c$ denote the mean and covariance of class $c$. Finally, we align the classifier using the standard cross-entropy loss on these samples. The details of this experiment are given in the supplementary material.
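As a rough illustration of this procedure (not the authors' implementation; the function names, the covariance jitter, and the optimizer choice are our own assumptions), the following PyTorch sketch collects per-class feature statistics and then aligns the classifier on samples drawn from the saved Gaussians:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def collect_class_statistics(backbone, loader, device="cuda"):
    """Collect per-class feature mean and covariance (sketch).
    Assumes `loader` yields (images, labels) and `backbone` returns features."""
    feats, labels = [], []
    for x, y in loader:
        feats.append(backbone(x.to(device)).cpu())
        labels.append(y)
    feats, labels = torch.cat(feats), torch.cat(labels)
    stats = {}
    for c in labels.unique().tolist():
        fc = feats[labels == c]
        stats[c] = (fc.mean(0), torch.cov(fc.t()))    # mu_c, Sigma_c
    return stats

def align_classifier(classifier, stats, samples_per_class=256, steps=100, lr=1e-3):
    """Align the classifier with features sampled from the saved Gaussians."""
    opt = torch.optim.SGD(classifier.parameters(), lr=lr)
    dists = {}
    for c, (mu, cov) in stats.items():
        jitter = 1e-4 * torch.eye(cov.shape[0])       # keep the covariance positive definite
        dists[c] = torch.distributions.MultivariateNormal(mu, covariance_matrix=cov + jitter)
    classes = list(stats.keys())
    for _ in range(steps):
        xs = torch.cat([dists[c].sample((samples_per_class,)) for c in classes])
        ys = torch.cat([torch.full((samples_per_class,), c) for c in classes])
        loss = F.cross_entropy(classifier(xs), ys)
        opt.zero_grad()
        loss.backward()
        opt.step()
```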

Table 5 shows that our method InfLoRA+CA outperforms SLCA. Note that SLCA tunes all the parameters of the model, while our method InfLoRA only tunes the parameters in $\bm{A}_t$. Therefore, our InfLoRA+CA is much more efficient than SLCA.

Table 5: Results (%) of different methods on ImageNet-R (10 tasks) and CIFAR100 using the classifier alignment (CA) technique.
Dataset      Method         $ACC_{10}$ ($\uparrow$)   $\overline{ACC}_{10}$ ($\uparrow$)
CIFAR100     SLCA [48]      91.06±0.24                93.65±0.19
             InfLoRA+CA     91.59±0.08                94.39±0.05
ImageNet-R   SLCA [48]      77.34±0.25                81.35±0.16
             InfLoRA+CA     79.78±0.25                83.38±0.19

5 Conclusion

In this work, we propose a new method, called interference-free low-rank adaptation (InfLoRA), for continual learning. InfLoRA injects a small number of parameters to reparameterize the pre-trained weights and shows that fine-tuning these injected parameters is equivalent to fine-tuning the pre-trained weights within a subspace. Furthermore, InfLoRA designs this subspace to eliminate the interference of the new task on the old tasks, making a good trade-off between stability and plasticity. Experimental results show that InfLoRA outperforms existing state-of-the-art continual learning methods on multiple datasets.

Acknowledgment

This work is supported by NSFC (No.62192783), National Key R&D Program of China (No.2020YFA0713901), and Fundamental Research Funds for the Central Universities (No.020214380108).

References

  • Aljundi et al. [2018] Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European Conference on Computer Vision, pages 139–154, 2018.
  • Aljundi et al. [2019a] Rahaf Aljundi, Eugene Belilovsky, Tinne Tuytelaars, Laurent Charlin, Massimo Caccia, Min Lin, and Lucas Page-Caccia. Online continual learning with maximal interfered retrieval. In Advances in Neural Information Processing Systems, pages 11849–11860, 2019a.
  • Aljundi et al. [2019b] Rahaf Aljundi, Min Lin, Baptiste Goujaud, and Yoshua Bengio. Gradient based sample selection for online continual learning. In Advances in Neural Information Processing Systems, pages 11816–11825, 2019b.
  • Boschini et al. [2022] Matteo Boschini, Lorenzo Bonicelli, Angelo Porrello, Giovanni Bellitto, Matteo Pennisi, Simone Palazzo, Concetto Spampinato, and Simone Calderara. Transfer without forgetting. In Proceedings of the European Conference on Computer Vision, pages 692–709, 2022.
  • Caron et al. [2021] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660, 2021.
  • Chen et al. [2022] Shoufa Chen, Chongjian Ge, Zhan Tong, Jiangliu Wang, Yibing Song, Jue Wang, and Ping Luo. Adaptformer: Adapting vision transformers for scalable visual recognition. Advances in Neural Information Processing Systems, pages 16664–16678, 2022.
  • Chrysakis and Moens [2020] Aristotelis Chrysakis and Marie-Francine Moens. Online continual learning from imbalanced data. In Proceedings of the International Conference on Machine Learning, pages 1952–1961, 2020.
  • Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
  • Devlin et al. [2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, pages 4171–4186, 2019.
  • Dosovitskiy et al. [2021] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
  • Fu et al. [2022] Chin-Lun Fu, Zih-Ching Chen, Yun-Ru Lee, and Hung-Yi Lee. Adapterbias: Parameter-efficient token-dependent representation shift for adapters in nlp tasks. In Findings of the Association for Computational Linguistics, pages 2608–2621, 2022.
  • Gao et al. [2023] Qiankun Gao, Chen Zhao, Yifan Sun, Teng Xi, Gang Zhang, Bernard Ghanem, and Jian Zhang. A unified continual learning framework with general parameter-efficient tuning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11449–11459, 2023.
  • He et al. [2022] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009, 2022.
  • Hendrycks et al. [2021] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8349, 2021.
  • Houlsby et al. [2019] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In Proceedings of the International Conference on Machine Learning, pages 2790–2799, 2019.
  • Hu et al. [2022] Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.
  • Hung et al. [2019] Steven C. Y. Hung, Cheng-Hao Tu, Cheng-En Wu, Chien-Hung Chen, Yi-Ming Chan, and Chu-Song Chen. Compacting, picking and growing for unforgetting continual learning. In Advances in Neural Information Processing Systems, pages 13647–13657, 2019.
  • Jia et al. [2022] Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge J. Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. In Proceedings of the European Conference on Computer Vision, pages 709–727, 2022.
  • Jie and Deng [2023] Shibo Jie and Zhi-Hong Deng. Fact: Factor-tuning for lightweight adaptation on vision transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1060–1068, 2023.
  • Jung et al. [2020] Sangwon Jung, Hongjoon Ahn, Sungmin Cha, and Taesup Moon. Continual learning with node-importance based adaptive group sparse regularization. Advances in Neural Information Processing Systems, pages 3647–3658, 2020.
  • Khan et al. [2023] Muhammad Gul Zain Ali Khan, Muhammad Ferjad Naeem, Luc Van Gool, Didier Stricker, Federico Tombari, and Muhammad Zeshan Afzal. Introducing language guidance in prompt-based continual learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11463–11473, 2023.
  • Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Kirkpatrick et al. [2017] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, pages 3521–3526, 2017.
  • Krizhevsky [2009] A Krizhevsky. Learning multiple layers of features from tiny images. Master’s thesis, University of Toronto, 2009.
  • Lester et al. [2021] Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 3045–3059, 2021.
  • Li et al. [2019] Xilai Li, Yingbo Zhou, Tianfu Wu, Richard Socher, and Caiming Xiong. Learn to grow: A continual structure learning framework for overcoming catastrophic forgetting. In Proceedings of the International Conference on Machine Learning, pages 3925–3934, 2019.
  • Li and Liang [2021] Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 4582–4597, 2021.
  • Liang and Li [2023a] Yan-Shuo Liang and Wu-Jun Li. Loss decoupling for task-agnostic continual learning. In Advances in Neural Information Processing Systems, 2023a.
  • Liang and Li [2023b] Yan-Shuo Liang and Wu-Jun Li. Adaptive plasticity improvement for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7816–7825, 2023b.
  • Lin et al. [2022] Sen Lin, Li Yang, Deliang Fan, and Junshan Zhang. Trgp: Trust region gradient projection for continual learning. In International Conference on Learning Representations, 2022.
  • Mahabadi et al. [2021] Rabeeh Karimi Mahabadi, James Henderson, and Sebastian Ruder. Compacter: Efficient low-rank hypercomplex adapter layers. In Advances in Neural Information Processing Systems, pages 1022–1035, 2021.
  • Masana et al. [2021] Marc Masana, Joost Van de Weijer, Bartłomiej Twardowski, et al. On the importance of cross-task features for class-incremental learning. arXiv preprint arXiv:2106.11930, 2021.
  • Parisi et al. [2019] German I Parisi, Ronald Kemker, Jose L Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review. Neural Networks, pages 54–71, 2019.
  • Peng et al. [2019] Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. Moment matching for multi-source domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1406–1415, 2019.
  • Rusu et al. [2016] Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.
  • Saha et al. [2021] Gobinda Saha, Isha Garg, and Kaushik Roy. Gradient projection memory for continual learning. In International Conference on Learning Representations, 2021.
  • Smith et al. [2023a] James Seale Smith, Yen-Chang Hsu, Lingyu Zhang, Ting Hua, Zsolt Kira, Yilin Shen, and Hongxia Jin. Continual diffusion: Continual customization of text-to-image diffusion with c-lora. CoRR, 2023a.
  • Smith et al. [2023b] James Seale Smith, Leonid Karlinsky, Vyshnavi Gutta, Paola Cascante-Bonilla, Donghyun Kim, Assaf Arbelle, Rameswar Panda, Rogerio Feris, and Zsolt Kira. Coda-prompt: Continual decomposed attention-based prompting for rehearsal-free continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11909–11919, 2023b.
  • Sun et al. [2022] Qing Sun, Fan Lyu, Fanhua Shang, Wei Feng, and Liang Wan. Exploring example influence in continual learning. Advances in Neural Information Processing Systems, pages 27075–27086, 2022.
  • Wang et al. [2023a] Liyuan Wang, Jingyi Xie, Xingxing Zhang, Mingyi Huang, Hang Su, and Jun Zhu. Hierarchical decomposition of prompt-based continual learning: Rethinking obscured sub-optimality. arXiv preprint arXiv:2310.07234, 2023a.
  • Wang et al. [2023b] Liyuan Wang, Xingxing Zhang, Hang Su, and Jun Zhu. A comprehensive survey of continual learning: Theory, method and application. arXiv preprint arXiv:2302.00487, 2023b.
  • Wang et al. [2022a] Yabin Wang, Zhiwu Huang, and Xiaopeng Hong. S-prompts learning with pre-trained transformers: An occam’s razor for domain incremental learning. Advances in Neural Information Processing Systems, pages 5682–5695, 2022a.
  • Wang et al. [2022b] Zifeng Wang, Zizhao Zhang, Sayna Ebrahimi, Ruoxi Sun, Han Zhang, Chen-Yu Lee, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer G. Dy, and Tomas Pfister. Dualprompt: Complementary prompting for rehearsal-free continual learning. In Proceedings of the European Conference on Computer Vision, pages 631–648, 2022b.
  • Wang et al. [2022c] Zifeng Wang, Zizhao Zhang, Chen-Yu Lee, Han Zhang, Ruoxi Sun, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, and Tomas Pfister. Learning to prompt for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 139–149, 2022c.
  • Zaken et al. [2022] Elad Ben Zaken, Yoav Goldberg, and Shauli Ravfogel. Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (Short Papers), pages 1–9, 2022.
  • Zenke et al. [2017] Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In Proceedings of the International Conference on Machine Learning, pages 3987–3995, 2017.
  • Zhang et al. [2021] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, pages 107–115, 2021.
  • Zhang et al. [2023] Gengwei Zhang, Liyuan Wang, Guoliang Kang, Ling Chen, and Yunchao Wei. SLCA: slow learner with classifier alignment for continual learning on a pre-trained model. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19091–19101, 2023.
  • Zheng et al. [2023] Zangwei Zheng, Mingyuan Ma, Kai Wang, Ziheng Qin, Xiangyu Yue, and Yang You. Preventing zero-shot transfer degradation in continual learning of vision-language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19068–19079, 2023.
  • Zhou et al. [2022] Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. Image bert pre-training with online tokenizer. In International Conference on Learning Representations, 2022.
  • Zhu et al. [2021] Fei Zhu, Xu-Yao Zhang, Chuang Wang, Fei Yin, and Cheng-Lin Liu. Prototype augmentation and self-supervision for incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5871–5880, 2021.

Supplementary Material

A Details of GPM and DualGPM

GPM and DualGPM are built on the fact that the gradient updates lie in the span of the input data points [47].

For a linear layer, we denote its forward propagation as

$\bm{e}=\bm{W}\bm{h}+\bm{b},$  (11)

where $\bm{W}\in\mathbb{R}^{d_I\times d_O}$, $\bm{h}\in\mathbb{R}^{d_I}$, and $\bm{e}\in\mathbb{R}^{d_O}$. Here, $d_I$ and $d_O$ denote the input and output dimensions, respectively. We further denote the loss function as $\mathcal{L}$. Through the chain rule, we can get the gradient of $\bm{W}$:

$\frac{\partial\mathcal{L}}{\partial\bm{W}}=\frac{\partial\mathcal{L}}{\partial\bm{e}}\frac{\partial\bm{e}}{\partial\bm{W}}=\frac{\partial\mathcal{L}}{\partial\bm{e}}\bm{h}^{T}=\begin{bmatrix}a_{1}\bm{h}^{T}\\ a_{2}\bm{h}^{T}\\ \vdots\\ a_{d_{O}}\bm{h}^{T}\end{bmatrix}.$  (12)

Here, $[a_{1},a_{2},...,a_{d_{O}}]^{T}$ denotes the vector $\frac{\partial\mathcal{L}}{\partial\bm{e}}$. From (12), we can find that each column of $\frac{\partial\mathcal{L}}{\partial\bm{W}}$ can be represented as the input $\bm{h}$ multiplied by a real value $a_{k}$ ($1\leq k\leq d_{O}$). Therefore, in the linear layer, each column of the gradient $\frac{\partial\mathcal{L}}{\partial\bm{W}}$ lies in the span of the input.
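A quick numerical check of this fact is given below. This is a standalone PyTorch snippet of our own; note that it uses PyTorch's $(d_O, d_I)$ weight layout, so the scaled copies of the input $\bm{h}$ appear along the rows of the gradient:

```python
import torch

# Verify that the weight gradient of a linear layer is an outer product
# involving the input h (here W follows the (d_O, d_I) layout).
d_in, d_out = 5, 3
W = torch.randn(d_out, d_in, requires_grad=True)
b = torch.randn(d_out, requires_grad=True)
h = torch.randn(d_in)

e = W @ h + b
loss = (e ** 2).sum()
loss.backward()

a = 2 * e.detach()                          # dL/de for this particular loss
outer = a.unsqueeze(1) * h.unsqueeze(0)     # (dL/de) h^T
print(torch.allclose(W.grad, outer))        # True: each row of W.grad is a scaled copy of h
```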

A.1 Gradient Projection Memory

GPM learns a subspace $\mathcal{M}_t$ with orthogonal bases $\bm{M}_t$ to approximate the gradient space of the old tasks. Here, the columns of $\bm{M}_t$ constitute a set of orthogonal bases of $\mathcal{M}_t$. GPM expands the bases of $\mathcal{M}_t$ to the bases of $\mathcal{M}_{t+1}$ after learning the $t$-th new task. Specifically, GPM computes the input matrix $\bm{H}_t$ such that each column of $\bm{H}_t$ represents an input of this layer. Then, the part of $\bm{H}_t$ that already lies in $\mathcal{M}_t$ is removed by

$\hat{\bm{H}}_t=\bm{H}_t-\bm{M}_t(\bm{M}_t)^{T}\bm{H}_t=\bm{H}_t-\bm{H}_{t,proj}.$  (13)

Please note that when $t=1$, ${\rm dim}(\mathcal{M}_t)=0$ and hence $\bm{H}_{t,proj}$ is a zero matrix. After that, singular value decomposition (SVD) is performed on $\hat{\bm{H}}_t=\hat{\bm{U}}\hat{\bm{\Sigma}}\hat{\bm{V}}^{T}$. Then, $u$ new orthogonal bases are chosen from the columns of $\hat{\bm{U}}$ for a minimum of $u$ satisfying the following criterion for a given threshold $\epsilon_{th}$:

$||(\hat{\bm{H}}_t)_u||_F^2+||\bm{H}_{t,proj}||_F^2\geq\epsilon_{th}||\bm{H}_t||_F^2.$  (14)

Here, $(\hat{\bm{H}}_t)_u=[\bm{u}_1,...,\bm{u}_u]$ denotes the components of $\hat{\bm{H}}_t$ that correspond to the top-$u$ singular values. Then, the subspace $\mathcal{M}_{t+1}$ is obtained with the bases $\bm{M}_{t+1}=[\bm{M}_t,\bm{u}_1,...,\bm{u}_u]$.
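The expansion step can be summarized by the following NumPy sketch, a simplified illustration of (13) and (14) rather than the official GPM code:

```python
import numpy as np

def expand_gpm_bases(M, H, eps_th):
    """One GPM expansion step (simplified sketch of Eqs. (13)-(14)).

    M:      (d, k) orthonormal bases of M_t (k = 0 for the first task).
    H:      (d, n) input matrix of the new task (columns are inputs).
    eps_th: threshold epsilon_th.
    Returns the expanded bases M_{t+1}.
    """
    H_proj = M @ (M.T @ H)                       # part of H already captured by M_t
    H_hat = H - H_proj                           # Eq. (13)
    U, S, _ = np.linalg.svd(H_hat, full_matrices=False)
    total = np.linalg.norm(H) ** 2
    kept = np.linalg.norm(H_proj) ** 2
    u = 0
    while kept < eps_th * total and u < S.shape[0]:
        kept += S[u] ** 2                        # add top singular directions until Eq. (14) holds
        u += 1
    return np.hstack([M, U[:, :u]])              # bases of M_{t+1}
```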

A.2 Dual Gradient Projection Memory

Different from GPM, which learns a subspace $\mathcal{M}_t$ with orthogonal bases $\bm{M}_t$ to approximate the gradient space of the old tasks, DualGPM either learns a subspace $\mathcal{M}_t$ with orthogonal bases $\bm{M}_t$ to approximate the gradient space of the old tasks, or learns a subspace $\mathcal{M}_t^{\bot}$ with orthogonal bases $\bm{M}_t^{\bot}$ to approximate the orthogonal complement of the gradient space of the old tasks.

DualGPM decides whether to keep $\bm{M}_t$ or $\bm{M}_t^{\bot}$ in memory according to ${\rm dim}(\mathcal{M}_t)$ and ${\rm dim}(\mathcal{M}_t^{\bot})$. Specifically, during the learning of the first several tasks, ${\rm dim}(\mathcal{M}_t)\leq{\rm dim}(\mathcal{M}_t^{\bot})$. At this time, DualGPM maintains $\bm{M}_t$ and expands $\bm{M}_t$ to $\bm{M}_{t+1}$ after each task. When ${\rm dim}(\mathcal{M}_t)$ increases and exceeds ${\rm dim}(\mathcal{M}_t^{\bot})$, DualGPM obtains $\bm{M}_t^{\bot}$ through some transformations of $\bm{M}_t$. After that, DualGPM only maintains $\bm{M}_t^{\bot}$ in memory and reduces $\bm{M}_t^{\bot}$ to $\bm{M}_{t+1}^{\bot}$ after each task. In this way, the number of bases kept for each layer is ${\rm min}\{{\rm dim}(\mathcal{M}_t),{\rm dim}(\mathcal{M}_t^{\bot})\}$.

There are three key operations in DualGPM: expanding the bases of $\mathcal{M}_t$, obtaining the bases of $\mathcal{M}_t^{\bot}$ from the bases of $\mathcal{M}_t$, and reducing the bases of $\mathcal{M}_t^{\bot}$.

Expanding the Bases of $\mathcal{M}_t$

The expansion of $\mathcal{M}_t$ is the same as that in GPM.

Transforming $\mathcal{M}_t$ to $\mathcal{M}_t^{\bot}$

DualGPM transforms $\mathcal{M}_t$ to $\mathcal{M}_t^{\bot}$ by performing SVD on the matrix $\bm{M}_t$. Specifically, let $\bm{M}_t=\bm{U}\bm{\Sigma}\bm{V}^{T}$; the column vectors of $\bm{U}$ that correspond to the zero singular values form a set of orthogonal bases of $\mathcal{M}_t^{\bot}$. Please refer to the DualGPM paper [29] for a detailed explanation.
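A minimal NumPy sketch of this transformation follows; it is our own illustration, assuming $\bm{M}_t$ has orthonormal columns and using a simple numerical tolerance:

```python
import numpy as np

def orthogonal_complement(M, tol=1e-10):
    """Return orthonormal bases of the orthogonal complement of span(M).

    M: (d, k) matrix whose columns span M_t. The columns of U associated
    with (numerically) zero singular values span M_t^perp in R^d.
    """
    U, S, _ = np.linalg.svd(M, full_matrices=True)   # U is (d, d)
    rank = int((S > tol).sum())
    return U[:, rank:]                               # (d, d - rank)
```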

Reducing the Bases of $\mathcal{M}_t^{\bot}$

DualGPM reduces the space $\mathcal{M}_t^{\bot}$ by removing the part of $\mathcal{M}_t^{\bot}$ that contains the gradient of the $t$-th task. Specifically, DualGPM first computes the input matrix $\bm{R}_t$. Then, the part of $\bm{R}_t$ that lies in $\mathcal{M}_t^{\bot}$ can be computed through

$\hat{\bm{R}}_t^{\bot}=\bm{M}_t^{\bot}(\bm{M}_t^{\bot})^{T}\bm{R}_t=\bm{R}_{t,proj}^{\bot}.$  (15)

After that, SVD is performed on $\hat{\bm{R}}_t^{\bot}=\hat{\bm{U}}^{\bot}\hat{\bm{\Sigma}}^{\bot}(\hat{\bm{V}}^{\bot})^{T}$. Then, $k$ new orthogonal bases are chosen from the columns of $\hat{\bm{U}}^{\bot}$ for a maximum of $k$ satisfying the following criterion for the given threshold $\epsilon_{th}$ (the same $\epsilon_{th}$ as in (14)):

$||(\hat{\bm{R}}_t^{\bot})_k||_F^2\leq(1-\epsilon_{th})||\bm{R}_t||_F^2.$  (16)

Let $\bm{Z}=(\hat{\bm{R}}_t^{\bot})_k=[\bm{u}_1^{\bot},...,\bm{u}_k^{\bot}]$ and $\mathcal{Z}={\rm span}\{\bm{u}_1^{\bot},...,\bm{u}_k^{\bot}\}$. Here, $\mathcal{Z}$ is the subspace of $\mathcal{M}_t^{\bot}$ that contains the gradient of the $t$-th task. DualGPM removes $\mathcal{Z}$ from $\mathcal{M}_t^{\bot}$ to get $\mathcal{M}_{t+1}^{\bot}$. Specifically, let $\hat{\bm{M}}_t^{\bot}=\bm{M}_t^{\bot}-\bm{Z}\bm{Z}^{T}\bm{M}_t^{\bot}$. DualGPM performs a second SVD on $\hat{\bm{M}}_t^{\bot}=\widetilde{\bm{U}}^{\bot}\widetilde{\bm{\Sigma}}^{\bot}(\widetilde{\bm{V}}^{\bot})^{T}$. The columns of $\widetilde{\bm{U}}^{\bot}$ that correspond to the non-zero singular values form the bases $\bm{M}_{t+1}^{\bot}$. Please refer to the DualGPM paper [29] for a detailed explanation.
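The reduction step can be sketched in NumPy as follows, again as a simplified illustration of (15) and (16) rather than the official DualGPM code:

```python
import numpy as np

def reduce_complement_bases(M_perp, R, eps_th, tol=1e-10):
    """One DualGPM reduction step (simplified sketch of Eqs. (15)-(16)).

    M_perp: (d, m) orthonormal bases of M_t^perp.
    R:      (d, n) input matrix of the t-th task (columns are inputs).
    eps_th: threshold epsilon_th (the same one used in the expansion step).
    Returns the reduced bases M_{t+1}^perp.
    """
    R_proj = M_perp @ (M_perp.T @ R)                 # Eq. (15): part of R inside M_t^perp
    U, S, _ = np.linalg.svd(R_proj, full_matrices=False)
    budget = (1.0 - eps_th) * np.linalg.norm(R) ** 2
    k, kept = 0, 0.0
    while k < S.shape[0] and kept + S[k] ** 2 <= budget:
        kept += S[k] ** 2                            # largest k allowed by Eq. (16)
        k += 1
    Z = U[:, :k]                                     # directions carrying the new task's gradient
    M_hat = M_perp - Z @ (Z.T @ M_perp)              # remove span(Z) from M_t^perp
    U2, S2, _ = np.linalg.svd(M_hat, full_matrices=False)
    return U2[:, S2 > tol]                           # bases of M_{t+1}^perp
```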

Figure 5: Change of the dimension of the subspace $\mathcal{M}_{t}^{\bot}$ throughout the whole learning process.

A.3 Approximation Error in DualGPM

DualGPM either learns a subspace $\mathcal{M}_{t}$ to approximate the gradient space of the old tasks or learns a subspace $\mathcal{M}_{t}^{\bot}$ to represent the orthogonal complement of the gradients of the old tasks. From Section A.2, we can see that the approximation error is controlled by the hyperparameter $\epsilon_{th}$ in (14) and (16). Specifically, as the value of $\epsilon_{th}$ in (14) and (16) increases, the approximation error decreases. As a result, the dimension of the subspace $\mathcal{M}_{t}$ becomes larger, while the dimension of $\mathcal{M}_{t}^{\bot}$ becomes smaller. Note that our InfLoRA constrains the update of the model to lie within the subspace $\mathcal{N}_{t}\cap\mathcal{M}_{t}^{\bot}\subseteq\mathcal{M}_{t}^{\bot}$. Therefore, we can adjust the value of $\epsilon_{th}$ to control the space available for learning the new task. For all the experiments, we set

$$\epsilon_{th} = \epsilon + \frac{(1-\epsilon)\, t}{T}, \tag{17}$$

where $t$ denotes the task id and $T$ denotes the total number of tasks. In other words, we gradually increase the value of $\epsilon_{th}$ as the number of learned tasks increases throughout the whole learning process. Table 6 lists the setting of $\epsilon$ in our InfLoRA.
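For illustration, the schedule in (17) can be computed as in the minimal sketch below; the task id $t$ is assumed to start from 1.

```python
def epsilon_th(epsilon: float, t: int, T: int) -> float:
    """Threshold schedule from Eq. (17): grows linearly from epsilon toward 1 as t approaches T."""
    return epsilon + (1.0 - epsilon) * t / T

# Example with epsilon = 0.98 (ImageNet-R) and T = 10 tasks:
# epsilon_th(0.98, 1, 10) == 0.982, epsilon_th(0.98, 10, 10) == 1.0
```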

Figure 5 illustrates how the dimension of the subspace $\mathcal{M}_{t}^{\bot}$ varies across different Transformer layers of ViT-B/16 throughout the whole learning process. We can see that this dimension always remains much larger than zero, which means the space for learning the new task always exists throughout the whole learning process.

Table 6: List of hyperparameters for different methods. The meaning of the different hyperparameters is given in Section B.2. The hyperparameter $\epsilon$ in InfLoRA is explained in Section A.3.

Method       | Hyper-Parameters
L2P          | lr: 0.001; $l$: 1; $p$: 30; $e$: 20 (ImageNet-R, DomainNet, CIFAR100)
DualPrompt   | lr: 0.001; $l_{E}$: 3; $l_{S}$: 2; $e_{E}$: 20; $e_{S}$: 6 (ImageNet-R, DomainNet, CIFAR100)
CODA-P       | lr: 0.001; $l$: 5; $p$: 100; $e$: 8 (ImageNet-R, DomainNet, CIFAR100)
LAE          | lr: 0.001; $r$: 5 (ImageNet-R, DomainNet, CIFAR100)
C-LoRA       | lr: 0.001; $r$: 64; $\lambda$: 0.5 (ImageNet-R, DomainNet, CIFAR100)
InfLoRA-b5   | lr: 0.001 (CIFAR100), 0.0005 (ImageNet-R, DomainNet); $r$: 10 (ImageNet-R, CIFAR100), 20 (DomainNet); $\epsilon$: 0.99 (ImageNet-R), 0.95 (CIFAR100, DomainNet)
InfLoRA      | lr: 0.0005 (ImageNet-R, DomainNet, CIFAR100); $r$: 10 (ImageNet-R, DomainNet, CIFAR100); $\epsilon$: 0.98 (ImageNet-R), 0.95 (CIFAR100, DomainNet)

B More Experimental Details

B.1 Training Details

For all the methods in all the experiments except for the comparison with SLCA, the batch size is set to 128, following many existing continual learning methods based on PEFT [38, 40]. Hyperparameters for different methods are selected based on the experimental settings in existing works [38, 44, 12] or through hyperparameter search. Specifically, Adam is used as the optimizer with running averages of the gradient and its square ($\beta_{1}=0.9$, $\beta_{2}=0.999$). The learning rate is searched over [5e-4, 1e-3, 2e-3, 1e-2] for all the methods on validation sets split from the training sets. For the hyperparameter $r$ in our InfLoRA, we search over [1, 5, 10, 20, 30] on the same validation sets. Table 6 shows the hyperparameters of different methods.
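As a concrete illustration of this setup, a minimal PyTorch sketch is given below; the tiny linear module stands in for the actual injected parameters, and the candidate grids are the ones listed above.

```python
import torch

# Minimal sketch of the optimizer configuration described above; the linear
# layer is only a placeholder for the injected learnable parameters.
model = torch.nn.Linear(768, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

# Candidate grids searched on validation sets split from the training sets.
lr_grid = [5e-4, 1e-3, 2e-3, 1e-2]
r_grid = [1, 5, 10, 20, 30]
```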

When compared with SLCA, our method is combined with classifier alignment (CA). In this setting, we follow SLCA and train the expanded LoRA branches and classifiers using the SGD optimizer. Each task is trained for 50 epochs on ImageNet-R, 20 epochs on CIFAR100, and 5 epochs on DomainNet. The batch size is set to 128.

B.2 Expanded Parameters

For L2P [44], the expanded parameters consist of the inserted prompts and their corresponding keys. Let $d$ denote the embedding dimension, $e$ the prompt length, $p$ the number of prompts, and $l$ the number of layers in which prompts are inserted. The total number of expanded parameters is $dlp(e+1)$.

For DualPrompt [43], the expanded parameters also consist of the inserted prompts and their corresponding keys. However, DualPrompt contains both expert prompts and shared prompts. Let $d$ denote the embedding dimension, $T$ the number of tasks, $e_{E}$ the expert prompt length, $e_{S}$ the shared prompt length, $l_{E}$ the number of layers in which expert prompts are inserted, and $l_{S}$ the number of layers in which shared prompts are inserted. The total number of expanded parameters is $d[Tl_{E}(e_{E}+1)+e_{S}l_{S}]$.

For CODA-Prompt [38], the expanded parameters consist of the inserted prompts, their corresponding keys, and the attention parameters. Let $d$ denote the embedding dimension, $e$ the prompt length, $p$ the number of prompts, and $l$ the number of layers in which prompts are inserted. The total number of expanded parameters is $dlp(e+2)$.

For LAE [12], we implement it with LoRA. Therefore, the expanded parameters consist of the inserted LoRA modules and the corresponding ensemble modules. Let $d$ denote the embedding dimension, $r$ the rank, and $l$ the number of layers in which LoRA modules are inserted. Since LAE inserts LoRA modules into the key and value projections in multi-head attention, the number of expanded parameters is $8ldr$.

For C-LoRA [37], the expanded parameters consist of the inserted LoRA modules. Let $d$ denote the embedding dimension, $r$ the rank, and $l$ the number of layers in which LoRA modules are inserted. Since C-LoRA inserts LoRA modules into the query, key, and value projections in multi-head attention, the number of expanded parameters is $6ldr$.

For our method, since we integrate the branches of the old tasks when the model learns a new task, the number of expanded parameters equals the number of parameters in a single branch. Let $d$ denote the embedding dimension, $r$ the rank, and $l$ the number of layers in which InfLoRA modules are inserted. Since we also insert InfLoRA modules into the key and value projections in multi-head attention, the number of expanded parameters is $4ldr$.
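To make the formulas in this subsection easier to compare, the sketch below evaluates them in Python. The example values of $l$ and $r$ at the end are illustrative assumptions only (ViT-B/16 has $d=768$), not necessarily the exact configuration used in the experiments.

```python
# Parameter-count formulas from Section B.2; d is the embedding dimension,
# and the remaining arguments follow the notation used in the text.

def l2p_params(d, l, p, e):            # prompts + keys
    return d * l * p * (e + 1)

def dualprompt_params(d, T, l_E, e_E, l_S, e_S):   # expert + shared prompts and keys
    return d * (T * l_E * (e_E + 1) + e_S * l_S)

def coda_prompt_params(d, l, p, e):    # prompts + keys + attention parameters
    return d * l * p * (e + 2)

def lae_params(d, l, r):               # LoRA on key/value projections + ensemble copy
    return 8 * l * d * r

def c_lora_params(d, l, r):            # LoRA on query/key/value projections
    return 6 * l * d * r

def inflora_params(d, l, r):           # a single InfLoRA branch on key/value projections
    return 4 * l * d * r

# Illustrative example (l = 12 and r = 10 are assumptions for illustration only):
print(inflora_params(d=768, l=12, r=10))  # 368,640 expanded parameters
```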

C More Experimental Results

Table 7: The comparison between our InfLoRA and more methods on ImageNet-R.

Tasks            | 5                                            | 10                                             | 20
Method           | $ACC_{5}$ (↑)   | $\overline{ACC}_{5}$ (↑)   | $ACC_{10}$ (↑)   | $\overline{ACC}_{10}$ (↑)   | $ACC_{20}$ (↑)   | $\overline{ACC}_{20}$ (↑)
SeqLoRA          | 70.96 ± 0.25    | 79.14 ± 0.32               | 64.32 ± 0.09     | 74.78 ± 0.29                | 56.98 ± 0.29     | 69.29 ± 0.26
HiDe-Prompt [40] | 76.82 ± 0.91    | 77.19 ± 0.34               | 75.06 ± 0.12     | 76.60 ± 0.01                | 66.88 ± 1.29     | 76.71 ± 0.23
InfLoRA          | 77.52 ± 0.37    | 82.01 ± 0.12               | 75.65 ± 0.14     | 80.82 ± 0.24                | 71.01 ± 0.45     | 77.28 ± 0.45
Table 8: The comparison between our InfLoRA and more methods on DomainNet.

Method           | $ACC_{5}$ (↑)   | $\overline{ACC}_{5}$ (↑)
SeqLoRA          | 71.69 ± 0.13    | 78.68 ± 0.12
HiDe-Prompt [40] | 71.48 ± 0.10    | 76.15 ± 0.05
InfLoRA          | 74.53 ± 0.23    | 79.57 ± 0.57
Table 9: Results (%) of different methods on ImageNet-R (10 tasks) using various self-supervised pre-trained models. Here, DINO-1k and iBOT-1k indicate that the ViT is pre-trained on ImageNet-1k using the respective method.

Pre-training | Method           | $ACC_{10}$ (↑)   | $\overline{ACC}_{10}$ (↑)
DINO-1k      | SeqLoRA          | 60.67 ± 0.11     | 66.29 ± 0.21
             | HiDe-Prompt [40] | 68.11 ± 0.18     | 71.70 ± 0.01
             | InfLoRA          | 68.31 ± 0.28     | 76.15 ± 0.05
iBOT-1k      | SeqLoRA          | 66.87 ± 0.40     | 71.80 ± 0.28
             | HiDe-Prompt [40] | 71.33 ± 0.21     | 73.62 ± 0.13
             | InfLoRA          | 71.84 ± 0.09     | 78.29 ± 0.09

C.1 Compare with More Methods

We compare with SeqLoRA, which initializes LoRA modules and fine-tunes them on multiple tasks sequentially without any operation to overcome forgetting. The results are given in Table 7, Table 8, and Table 9. We can see that our method outperforms SeqLoRA.

A recent continual learning PEFT method, hierarchical decomposition prompt (HiDe-Prompt) [40], performs continual learning hierarchically. This method maintains a set of task-specific prompts for each task and involves two stages during training and inference: given an input sample, HiDe-Prompt first infers the prompt index and then uses the corresponding prompt to infer the label. We also compare our method with HiDe-Prompt; the results are given in Table 7, Table 8, and Table 9. Our method outperforms HiDe-Prompt overall. Although HiDe-Prompt achieves performance comparable to ours in terms of final accuracy $ACC_{T}$ on ImageNet-R, there is a notable gap between the two methods in terms of averaged accuracy $\overline{ACC}_{T}$. Note that the averaged accuracy $\overline{ACC}_{T}$ is more important than the final accuracy $ACC_{T}$, since $\overline{ACC}_{T}$ reflects the performance of the model over the whole learning process.
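For reference, the two metrics can be computed as in the sketch below, assuming the usual continual-learning convention that $a_{i,j}$ denotes the accuracy on task $j$ after training on task $i$ (the paper's formal definitions appear in the main text).

```python
import numpy as np

def final_and_averaged_acc(a: np.ndarray):
    """a[i, j] = accuracy on task j after training on task i (j <= i), shape (T, T).

    Returns (ACC_T, averaged ACC), where ACC_t is the mean accuracy over the
    first t tasks after learning task t, and the averaged accuracy is the mean
    of ACC_t over all t.
    """
    T = a.shape[0]
    acc_t = np.array([a[t, : t + 1].mean() for t in range(T)])
    return acc_t[-1], acc_t.mean()
```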

C.2 Hyperparameter Analysis

We perform a hyperparameter analysis for our method InfLoRA. Two hyperparameters are involved. The first is $r$, which controls the number of expanded parameters in InfLoRA. The second is $\epsilon$, which is not specific to InfLoRA but is introduced by DualGPM; it controls the components maintained in the matrix $\bm{M}_{t}$.

Figure 6 shows the results of our method with different values of $r$ or $\epsilon$. We can see that the performance of InfLoRA first increases and then decreases as $r$ or $\epsilon$ increases.

Figure 6: (a) Analysis of the hyperparameter $r$. (b) Analysis of the hyperparameter $\epsilon$.

C.3 Domain Incremental Setting

InfLoRA can be extended to the domain-incremental setting. Specifically, DomainNet contains six domains, and InfLoRA learns these domains sequentially. Table 10 shows that InfLoRA outperforms the other baselines.

Table 10: Results on DomainNet in the domain-incremental setting.

Method          | $ACC_{6}$ (↑)   | $\overline{ACC}_{6}$ (↑)
L2P [44]        | 34.15 ± 0.10    | 49.84 ± 0.03
DualPrompt [43] | 35.24 ± 0.12    | 48.44 ± 0.13
CODA-P [38]     | 56.89 ± 0.04    | 57.56 ± 0.03
C-LoRA [37]     | 44.96 ± 0.01    | 52.95 ± 0.08
InfLoRA         | 68.44 ± 0.04    | 67.46 ± 0.03

C.4 Inference Efficiency

Existing methods often involve multiple forward propagations through the pre-trained backbone. Specifically, prompt-based continual learning methods, including L2P, DualPrompt, and CODA-P, require an extra forward propagation to generate instance-specific prompts, and LAE requires an extra forward propagation for ensembling. In contrast, our InfLoRA requires only a single forward propagation through the pre-trained backbone. Figure 7 compares the inference time of the different methods. We can see that our method consistently outperforms existing methods in terms of time efficiency.

Figure 7: The time of inferring one task for different methods.