InfLoRA: Interference-Free Low-Rank Adaptation for Continual Learning

Yan-Shuo Liang and Wu-Jun Li
National Key Laboratory for Novel Software Technology,
Department of Computer Science and Technology, Nanjing University, P. R. China
[email protected],[email protected]
Wu-Jun Li is the corresponding author.
Abstract

Continual learning requires the model to learn multiple tasks sequentially. In continual learning, the model should possess the ability to maintain its performance on old tasks (stability) and the ability to adapt to new tasks continuously (plasticity). Recently, parameter-efficient fine-tuning (PEFT), which involves freezing a pre-trained model and injecting a small number of learnable parameters to adapt to downstream tasks, has gained increasing popularity in continual learning. Although existing continual learning methods based on PEFT have demonstrated superior performance compared to those not based on PEFT, most of them do not consider how to eliminate the interference of the new task on the old tasks, which inhibits the model from making a good trade-off between stability and plasticity. In this work, we propose a new PEFT method, called interference-free low-rank adaptation (InfLoRA), for continual learning. InfLoRA injects a small number of parameters to reparameterize the pre-trained weights and shows that fine-tuning these injected parameters is equivalent to fine-tuning the pre-trained weights within a subspace. Furthermore, InfLoRA designs this subspace to eliminate the interference of the new task on the old tasks, making a good trade-off between stability and plasticity. Experimental results show that InfLoRA outperforms existing state-of-the-art continual learning methods on multiple datasets. Code is available at https://github.com/liangyanshuo/InfLoRA.

1 Introduction

Continual learning requires the model to learn multiple tasks sequentially [33]. To achieve continual learning, the model should possess two essential abilities: the ability to maintain its performance on the old tasks (stability) and the ability to adapt to the new tasks continuously (plasticity) [33]. Furthermore, two different scenarios are often considered in continual learning: the task-incremental scenario [32] and the class-incremental scenario [41]. The task-incremental scenario allows the model to get task identities during inference. On the contrary, the class-incremental scenario does not provide task identities during inference, requiring the model to distinguish all the classes across all the tasks.

Recently, parameter-efficient fine-tuning (PEFT) [16, 15, 18], which involves freezing a pre-trained model and injecting a small number of learnable parameters to adapt to downstream tasks, has gained increasing popularity in continual learning [44, 38, 12], especially in the class-incremental scenario. More specifically, existing continual learning methods based on PEFT [21, 43] inject the learnable parameters into a pre-trained model using some popular PEFT methods such as prompt-tuning [25] or low-rank adaptation (LoRA) [16]. Subsequently, these methods freeze the pre-trained weights and sequentially fine-tune the injected parameters on multiple tasks throughout the continual learning process.

Although continual learning methods based on PEFT have demonstrated superior performance compared to those not based on PEFT [44], most of them do not consider how to eliminate the interference of the new task on the old tasks, which inhibits the model from making a good trade-off between stability and plasticity. Specifically, when learning a new task, existing continual learning methods based on PEFT either reuse the previously learned parameters to adapt to the new task [44, 12] or randomly expand some parameters first and then adapt to the new task [38, 43, 42]. During this process, the interference of the new task on the old tasks exists due to the shared parameters between new and old tasks, which means fine-tuning a pre-trained model on a new task may interfere with the model’s performance on the old tasks. As a result, it is hard for the model to make a good trade-off between stability and plasticity.

In this work, we propose a new PEFT method, called interference-free low-rank adaptation (InfLoRA), for continual learning. The contributions of this work are listed as follows:

  • InfLoRA injects a small number of parameters to reparameterize the pre-trained weights and shows that fine-tuning these injected parameters is equivalent to fine-tuning the pre-trained weights within a subspace.

  • InfLoRA designs this subspace to eliminate the interference of the new task on the old tasks, making a good trade-off between stability and plasticity.

  • Experimental results show that InfLoRA outperforms existing state-of-the-art continual learning methods on multiple datasets.

2 Related Work and Preliminaries

2.1 Related Work

Parameter-Efficient Fine-Tuning  Parameter-efficient fine-tuning (PEFT) methods freeze a pre-trained model and inject a small number of learnable parameters to adapt to downstream tasks. In this way, PEFT methods avoid the inefficiency of full fine-tuning, which tunes all the parameters of a pre-trained model to learn downstream tasks. For example, Adapter [15] adds small modules in different layers of Transformers and only tunes these added modules to adapt to downstream tasks. Prompt-tuning [25] and Prefix-tuning [27] insert a set of learnable tokens into the input of the Transformer layers and only tune these tokens to adapt to downstream tasks. Low-rank adaptation (LoRA) [16] reparameterizes the pre-trained weights with low-rank branches and only tunes these branches to adapt to downstream tasks. Although these methods tune far fewer learnable parameters than full fine-tuning, they often show comparable or even superior performance compared with full fine-tuning [45, 11, 16, 31]. Early PEFT methods focus on natural language processing (NLP). Recently, PEFT methods have also been proposed for computer vision (CV). For example, visual prompt tuning (VPT) [18] and AdaptFormer [6] apply prompt-tuning and Adapter techniques to CV tasks, respectively. Both exhibit performance comparable to full fine-tuning.

Continual Learning  Early continual learning was usually considered in the context of learning from scratch. Three types of continual learning methods have been proposed: regularization-based methods [46, 20, 1, 23], memory-based methods [2, 7, 3, 39, 28], and expansion-based methods [35, 17, 26]. Regularization-based methods employ a penalty loss (regularization) to prevent important parameters of the old tasks from changing too much. Memory-based methods maintain a memory buffer to store information about the old tasks. Expansion-based methods dynamically expand the model's architecture for each new task.

Recently, with the advancement of pre-trained models [13, 10, 9], using pre-trained models for continual learning has gained increasing popularity. Some continual learning methods fully fine-tune the pre-trained models [4, 49], which has been shown to be inefficient. Other methods explore PEFT in continual learning. For instance, some existing continual learning methods [38, 44, 21, 43] introduce prompt-tuning into continual learning, achieving much higher performance than previous methods that learn from scratch, especially in the class-incremental scenario. The method in [12] introduces a continual learning framework that can be combined with many existing PEFT methods, such as prompt-tuning, LoRA and Adapter. However, none of these methods consider how to eliminate the interference of the new task on the old tasks, which inhibits the model from making a good trade-off between stability and plasticity.

2.2 Preliminaries

We first introduce low-rank adaptation (LoRA) [16], a popular PEFT method related to our method. Then, we give the problem definition for continual learning.

Low-Rank Adaptation  LoRA [16] is one of the most popular PEFT methods. It assumes that the changes of parameters lie in a low-rank space when the model is fully fine-tuned on a downstream task. Specifically, for a linear layer with input dimension $d_I$ and output dimension $d_O$, we represent its weight with $\bm{W}\in\mathbb{R}^{d_O\times d_I}$. Then, LoRA reparameterizes the pre-trained weight $\bm{W}$ by expanding a branch with two matrices, $\bm{A}\in\mathbb{R}^{d_O\times r}$ and $\bm{B}\in\mathbb{R}^{r\times d_I}$. Typically, $r$ is much smaller than the input dimension $d_I$ and the output dimension $d_O$, making $\bm{A}$ a dimensionality increasing matrix and $\bm{B}$ a dimensionality reduction matrix. Finally, LoRA modifies the forward propagation in this linear layer as $\bm{e}=\bm{W}\bm{h}+\bm{A}\bm{B}\bm{h}$. Here, $\bm{h}$ and $\bm{e}$ denote the input and output of this layer, respectively. LoRA initializes $\bm{A}$ as $\bm{0}$ and initializes $\bm{B}$ using a Gaussian distribution. During the learning of the downstream tasks, LoRA freezes the pre-trained weight $\bm{W}$ and only fine-tunes the parameters $\bm{A}$ and $\bm{B}$.
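To make the reparameterization concrete, the following minimal PyTorch-style sketch (our own illustration, not code released with LoRA or InfLoRA; the class name and the Gaussian initialization scale are assumptions) implements a linear layer with a single LoRA branch.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal sketch of a LoRA-reparameterized linear layer: e = W h + A B h."""
    def __init__(self, d_in: int, d_out: int, r: int = 4):
        super().__init__()
        # Frozen pre-trained weight W (random here only for illustration).
        self.W = nn.Parameter(torch.randn(d_out, d_in), requires_grad=False)
        # A (dimensionality increasing matrix) starts at 0, so the branch is initially a no-op.
        self.A = nn.Parameter(torch.zeros(d_out, r))
        # B (dimensionality reduction matrix) is drawn from a Gaussian distribution.
        self.B = nn.Parameter(0.02 * torch.randn(r, d_in))

    def forward(self, h: torch.Tensor) -> torch.Tensor:  # h: (..., d_in)
        return h @ self.W.T + h @ self.B.T @ self.A.T
```

Because only $\bm{A}$ and $\bm{B}$ carry gradients, an optimizer built over `model.parameters()` updates just the low-rank branch.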

Figure 1: (a) The architecture of our InfLoRA in a certain linear layer of a Transformer. During the learning of the $t$-th task, the pre-trained weight and all the old branches are frozen, and only $\bm{A}_t$ is fine-tuned. (b) The pipeline of designing the dimensionality reduction matrix $\bm{B}_t$.

Problem Definition  In continual learning, there is a sequence of tasks with different distributions. We define the task sequence as $\mathcal{D}=\{\mathcal{D}_1,\dots,\mathcal{D}_T\}$, where the $t$-th task $\mathcal{D}_t=\{(\bm{x}_{i,t},y_{i,t})\}_{i=1}^{n_t}$. Here, $\bm{x}_{i,t}$ denotes an input sample and $y_{i,t}$ denotes its label. The objective of continual learning is to train a model sequentially on these tasks and ensure that the model performs well on all of them.

We follow existing continual learning methods [43, 44] based on PEFT and assume the model is a pre-trained Vision Transformer (ViT) [10]. Specifically, the model is $h_{\bm{\Phi}}(f_{\bm{\Theta}}(\cdot))$, where $h_{\bm{\Phi}}(\cdot)$ is the classifier with parameters $\bm{\Phi}$ and $f_{\bm{\Theta}}(\cdot)$ is the pre-trained ViT backbone with pre-trained parameters $\bm{\Theta}$. Similar to existing work [43], our focus is primarily on the class-incremental scenario, where task identities are unknown during inference. Furthermore, we concentrate on the exemplar-free setting [43, 51], where no historical data can be fetched for rehearsal.

3 Methodology

Figure 1 (a) illustrates the architecture of our InfLoRA within a linear layer. Before learning the $t$-th new task, our InfLoRA expands a LoRA-like branch, which includes a dimensionality reduction matrix $\bm{B}_t\in\mathbb{R}^{r\times d_I}$ and a dimensionality increasing matrix $\bm{A}_t\in\mathbb{R}^{d_O\times r}$. Then, the forward propagation of this linear layer is modified as

$\bm{e}=\bm{W}\bm{h}+\sum_{j=1}^{t}\bm{A}_j\bm{B}_j\bm{h}=\bm{W}_{t-1}\bm{h}+\bm{A}_t\bm{B}_t\bm{h}=\bm{W}_t\bm{h}.$   (1)

Here, $\bm{W}_t=\bm{W}_{t-1}+\bm{A}_t\bm{B}_t=\bm{W}+\sum_{i=1}^{t}\bm{A}_i\bm{B}_i$. Similar to LoRA, our InfLoRA also initializes the dimensionality increasing matrix $\bm{A}_t$ as $\bm{0}$. However, different from LoRA, which initializes the dimensionality reduction matrix $\bm{B}$ using a Gaussian distribution, our InfLoRA designs the dimensionality reduction matrix $\bm{B}_t$ before learning the $t$-th task. During the learning of the $t$-th task, InfLoRA fine-tunes $\bm{A}_t$ to learn the new task while keeping the pre-trained weight $\bm{W}$, all the old branches and the matrix $\bm{B}_t$ frozen. After learning the $t$-th task, for any given test sample belonging to the learned tasks, the model uses $\bm{W}_t$ and (1) to infer its label. This design ensures that our method is compatible with the class-incremental scenario, where task identities are unknown during inference.
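A sketch of how such a layer could be organized is given below (our own PyTorch-style illustration; names such as `InfLoRALinear` and `add_branch` are ours). Only the current $\bm{A}_t$ is trainable, while $\bm{W}$, all old branches and the pre-designed $\bm{B}_t$ stay frozen. In practice the finished branches can be merged into the weight, as described in Section 3.3, so the lists stay short.

```python
import torch
import torch.nn as nn

class InfLoRALinear(nn.Module):
    """Sketch of a linear layer with frozen pre-trained W plus LoRA-like branches (Eq. 1)."""
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.W = nn.Parameter(torch.randn(d_out, d_in), requires_grad=False)  # frozen pre-trained weight
        self.As, self.Bs = nn.ParameterList(), nn.ParameterList()

    def add_branch(self, B_t: torch.Tensor):
        """B_t: (r, d_in), designed before training on task t (Eq. 8)."""
        for A in self.As:                 # freeze all old branches
            A.requires_grad_(False)
        r = B_t.shape[0]
        self.As.append(nn.Parameter(torch.zeros(self.W.shape[0], r)))      # A_t = 0, trainable
        self.Bs.append(nn.Parameter(B_t.clone(), requires_grad=False))     # B_t fixed

    def forward(self, h: torch.Tensor) -> torch.Tensor:  # h: (..., d_in)
        e = h @ self.W.T
        for A, B in zip(self.As, self.Bs):
            e = e + h @ B.T @ A.T
        return e
```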

In the following subsections, we first build the relationship between our InfLoRA and the method that fine-tunes the pre-trained weight. Specifically, we show that fine-tuning the parameters $\bm{A}_t$ is equivalent to fine-tuning the pre-trained weight $\bm{W}$ within a subspace spanned by the rows of $\bm{B}_t$. Note that $\bm{B}_t$ is designed before learning the $t$-th task, making this subspace pre-designed. Then, building upon this relationship, we introduce how our InfLoRA designs this subspace to eliminate the interference of the new task on the old tasks and make a good trade-off between stability and plasticity.

3.1 Relationship between InfLoRA and Fine-Tuning the Pre-Trained Weight

When the $t$-th task arrives and our method has expanded a new branch, the forward propagation in this layer can be represented by (1). At this time, we can prove the following proposition:

Proposition 1.

When learning the $t$-th task with forward propagation represented by (1), fine-tuning $\bm{A}_t$ is equivalent to fine-tuning the pre-trained weight $\bm{W}$ within the subspace ${\rm span}\{\bm{b}_1^t,\dots,\bm{b}_r^t\}$. Here, $\bm{b}_i^t$ ($1\leq i\leq r$) denotes the $i$-th row vector of $\bm{B}_t$.

Proof.

When tuning the pre-trained weight $\bm{W}$ to learn the $t$-th task, we can compute the gradient of $\bm{W}$ based on the chain rule:

$\frac{\partial\mathcal{L}}{\partial\bm{W}}=\frac{\partial\mathcal{L}}{\partial\bm{e}}\frac{\partial\bm{e}}{\partial\bm{W}}=\frac{\partial\mathcal{L}}{\partial\bm{e}}\bm{h}^T.$   (2)

Here, $\mathcal{L}$ denotes the loss function. At this time, the change of $\bm{W}$ can be denoted as $\Delta\bm{W}=-\alpha\frac{\partial\mathcal{L}}{\partial\bm{W}}$, where $\alpha$ is the learning rate. Then, we can compute the change of the composed matrix $\bm{W}_t=\bm{W}+\sum_{j=1}^{t}\bm{A}_j\bm{B}_j$:

$\Delta_{\bm{W}}\bm{W}_t=[\bm{W}+\Delta\bm{W}+\sum_{j=1}^{t}\bm{A}_j\bm{B}_j]-(\bm{W}+\sum_{j=1}^{t}\bm{A}_j\bm{B}_j)=\Delta\bm{W}=-\alpha\frac{\partial\mathcal{L}}{\partial\bm{W}_t}=-\alpha\frac{\partial\mathcal{L}}{\partial\bm{e}}\bm{h}^T.$   (3)

Here, we use $\Delta_{\bm{W}}\bm{W}_t$ to denote the change of the composed matrix $\bm{W}_t$ caused by the change of $\bm{W}$.

Similarly, when tuning the expanded weight $\bm{A}_t$, we can get the gradient of $\bm{A}_t$ based on the chain rule:

$\frac{\partial\mathcal{L}}{\partial\bm{A}_t}=\frac{\partial\mathcal{L}}{\partial\bm{e}}\frac{\partial\bm{e}}{\partial\bm{A}_t}=\frac{\partial\mathcal{L}}{\partial\bm{e}}\bm{h}^T\bm{B}_t^T.$   (4)

At this time, the change of $\bm{A}_t$ can be denoted as $\Delta\bm{A}_t=-\alpha\frac{\partial\mathcal{L}}{\partial\bm{A}_t}$. Then, we can compute the change of the composed matrix $\bm{W}_t=\bm{W}_{t-1}+\bm{A}_t\bm{B}_t$:

$\Delta_{\bm{A}_t}\bm{W}_t=[\bm{W}_{t-1}+(\bm{A}_t+\Delta\bm{A}_t)\bm{B}_t]-(\bm{W}_{t-1}+\bm{A}_t\bm{B}_t)=\Delta\bm{A}_t\bm{B}_t=-\alpha\frac{\partial\mathcal{L}}{\partial\bm{A}_t}\bm{B}_t=-\alpha\frac{\partial\mathcal{L}}{\partial\bm{e}}\bm{h}^T\bm{B}_t^T\bm{B}_t=\Delta_{\bm{W}}\bm{W}_t\bm{B}_t^T\bm{B}_t.$   (5)

Here, we use $\Delta_{\bm{A}_t}\bm{W}_t$ to denote the change of the composed matrix $\bm{W}_t$ caused by the change of $\bm{A}_t$. The fourth equality in (5) holds because of (4), and the fifth equality in (5) holds because of (2). (5) shows that $\Delta_{\bm{A}_t}\bm{W}_t$ is equal to $\Delta_{\bm{W}}\bm{W}_t$ multiplied by a projection matrix $\bm{B}_t^T\bm{B}_t$. Since $\bm{B}_t^T\bm{B}_t$ projects each row vector of $\Delta_{\bm{W}}\bm{W}_t$ into the subspace ${\rm span}\{\bm{b}_1^t,\dots,\bm{b}_r^t\}$, Proposition 1 holds. ∎

Proposition 1 has demonstrated that using our InfLoRA to train the model is equivalent to directly fine-tuning the pre-trained weight $\bm{W}$ within the subspace ${\rm span}\{\bm{b}_1^t,\dots,\bm{b}_r^t\}$. Therefore, before learning the $t$-th task, we can design the matrix $\bm{B}_t$ such that learning the $t$-th task in the subspace ${\rm span}\{\bm{b}_1^t,\dots,\bm{b}_r^t\}$ will not interfere with the performance of the model on the old tasks.
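As a sanity check of Proposition 1, the short snippet below (our own illustration with toy dimensions and an arbitrary scalar loss) compares the change of the composed weight caused by a gradient step on $\bm{A}_t$ with the projected change caused by the same step on $\bm{W}$; the two coincide, matching (5).

```python
import torch

torch.manual_seed(0)
d_I, d_O, r, lr = 8, 6, 3, 0.1
W = torch.randn(d_O, d_I, requires_grad=True)   # stands in for the pre-trained/composed weight
B = torch.randn(r, d_I)                         # pre-designed B_t (fixed)
A = torch.zeros(d_O, r, requires_grad=True)     # trainable A_t, initialized to 0
h = torch.randn(d_I)                            # one input vector

e = W @ h + A @ (B @ h)                         # forward propagation (1)
loss = e.pow(2).sum()                           # any scalar loss
loss.backward()

delta_from_A = (-lr * A.grad) @ B               # change of W_t caused by updating A_t
delta_from_W = (-lr * W.grad) @ B.T @ B         # change of W_t from updating W, projected by B_t^T B_t
print(torch.allclose(delta_from_A, delta_from_W, atol=1e-6))   # True
```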

3.2 Eliminating the Interference of the New Task on the Old Tasks

We first introduce the characteristics that InfLoRA aims to impose on the subspace ${\rm span}\{\bm{b}_1^t,\dots,\bm{b}_r^t\}$. With these characteristics, InfLoRA can eliminate the interference of the new task on the old tasks and make a good trade-off between stability and plasticity. Then, we introduce how to design the dimensionality reduction matrix $\bm{B}_t$ so that the subspace ${\rm span}\{\bm{b}_1^t,\dots,\bm{b}_r^t\}$ has these characteristics.

3.2.1 Desired Characteristics

First, InfLoRA aims to make the subspace ${\rm span}\{\bm{b}_1^t,\dots,\bm{b}_r^t\}$ orthogonal to the gradients of all the old tasks. In this way, according to Proposition 1, the update of InfLoRA, which can be represented as $\Delta_{\bm{A}_t}\bm{W}_t$, will also be orthogonal to the gradients of the old tasks. Note that the idea of making the update for the new task orthogonal to the gradients of the old tasks to eliminate the interference of the new task on the old tasks has been proposed in many existing continual learning methods [36, 30]. However, these existing methods are designed for continual learning from scratch and involve updating all the parameters of the model, which is incompatible with the PEFT setting. On the contrary, our method is a PEFT method, which only tunes the parameters in $\bm{A}_t$.
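To see why this characteristic removes interference, recall (used again in Section 3.2.2) that the gradients of a linear layer lie in the span of its inputs. The toy snippet below (our own construction with made-up dimensions, not the paper's procedure) builds a $\bm{B}_t$ whose rows are orthogonal to the old-task inputs and checks that any update of the form $\Delta\bm{A}_t\bm{B}_t$ leaves the layer output on old-task inputs unchanged.

```python
import torch

torch.manual_seed(0)
d_I, d_O, r = 16, 8, 2
H_old = torch.randn(d_I, 5)                 # columns: inputs seen on old tasks
Q, _ = torch.linalg.qr(H_old)               # orthonormal basis of the old-task input (gradient) space
cand = torch.randn(r, d_I)
B_t = cand - cand @ Q @ Q.T                 # rows of B_t made orthogonal to that space
delta_A = torch.randn(d_O, r)               # an arbitrary update of A_t

# The composed-weight update delta_A @ B_t has no effect on old-task inputs:
print((delta_A @ B_t @ H_old).abs().max())  # ~0, i.e. no interference with the old tasks
```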

Besides eliminating the interference of the new task on the old tasks, our InfLoRA further makes the subspace ${\rm span}\{\bm{b}_1^t,\dots,\bm{b}_r^t\}$ lie in a subspace in which the gradient of the new task lies, so as to make a good trade-off between stability and plasticity. Specifically, existing work [19] has shown that during fine-tuning, the weight increments of a pre-trained ViT exhibit redundancy in terms of weight rank. Therefore, the gradients of the new task lie in a low-dimensional subspace. Our method makes ${\rm span}\{\bm{b}_1^t,\dots,\bm{b}_r^t\}$ not only orthogonal to the gradients of the old tasks but also lie in the subspace in which the gradients of the new task $t$ lie. By doing so, our method makes the model focus on the new task while eliminating the interference of the new task on the old tasks, thereby making a good trade-off between stability and plasticity. The experiments in Section 4 verify the effectiveness of these two characteristics.

3.2.2 Designing Dimensionality Reduction Matrix

InfLoRA first approximates the gradient spaces of the new task and the old tasks. Here, we use $\mathcal{N}_t$ to represent the gradient space of the new task approximated by InfLoRA. Similarly, we use $\mathcal{M}_t$ to represent the gradient space of the previous $t-1$ old tasks approximated by InfLoRA. We also use $\mathcal{M}_t^{\bot}$ to denote the residual gradient space, which is orthogonal to the space $\mathcal{M}_t$. Then, in order to satisfy the characteristics described in Section 3.2.1, InfLoRA ensures that each row of $\bm{B}_t$ lies in $\mathcal{N}_t\cap\mathcal{M}_t^{\bot}$. In other words, InfLoRA makes ${\rm span}\{\bm{b}_1^t,\dots,\bm{b}_r^t\}\subseteq\mathcal{N}_t\cap\mathcal{M}_t^{\bot}$.

Existing works [36, 29] have shown that the gradient update of a linear layer lies in the span of its inputs. Please refer to the supplementary material for a detailed explanation of this proposition. Therefore, InfLoRA uses the input matrix of the new task $t$ to approximate the gradient space of the new task. Specifically, InfLoRA computes the input matrix $\bm{H}_t=[\bm{h}_1^t,\dots,\bm{h}_n^t]$, with each column of $\bm{H}_t$ representing an input vector of the $t$-th task. Then, InfLoRA considers $\mathcal{N}_t$ as the subspace spanned by the columns of $\bm{H}_t$.

However, InfLoRA cannot use the input matrix of the old tasks to approximate their gradient space, since the data from the old tasks is not available when the model learns the new task. Instead, existing methods such as gradient projection memory (GPM) [36] and dual gradient projection memory (DualGPM) [29] can learn a matrix to preserve information about the gradients of the old tasks, and InfLoRA incorporates DualGPM to preserve this gradient information. With the assistance of DualGPM, the model maintains either a matrix $\bm{M}_t\in\mathbb{R}^{d_I\times k_t}$ or a matrix $\bm{M}_t^{\bot}\in\mathbb{R}^{d_I\times(d_I-k_t)}$. Here, the columns of $\bm{M}_t$ form orthonormal bases of $\mathcal{M}_t$, the columns of $\bm{M}_t^{\bot}$ form orthonormal bases of $\mathcal{M}_t^{\bot}$, and $k_t$ denotes the dimension of $\mathcal{M}_t$. For details on how DualGPM maintains the orthonormal bases $\bm{M}_t$ or $\bm{M}_t^{\bot}$, please refer to the supplementary material or the original paper [29].

After approximating the gradient spaces of the new task and the old tasks, InfLoRA gets the component of $\mathcal{N}_t$ which lies in $\mathcal{M}_t^{\bot}$. Specifically, when the model maintains $\mathcal{M}_t$, InfLoRA performs the operation

$\hat{\bm{H}}_t=\bm{H}_t-\bm{M}_t\bm{M}_t^T\bm{H}_t.$   (6)

Similarly, when the model maintains $\mathcal{M}_t^{\bot}$, InfLoRA performs the operation

$\hat{\bm{H}}_t=\bm{M}_t^{\bot}(\bm{M}_t^{\bot})^T\bm{H}_t.$   (7)

Note that when $t=1$, $\mathcal{M}_t$ is a null space and $\hat{\bm{H}}_t=\bm{H}_t$. Obviously, each column of $\hat{\bm{H}}_t$ lies in $\mathcal{N}_t\cap\mathcal{M}_t^{\bot}$. However, since $(\hat{\bm{H}}_t)^T\in\mathbb{R}^{n\times d_I}$ and $\bm{B}_t\in\mathbb{R}^{r\times d_I}$ have different shapes, InfLoRA cannot directly define $\bm{B}_t$ as $(\hat{\bm{H}}_t)^T$. Since $n\gg r$, InfLoRA uses the principal components of $(\hat{\bm{H}}_t)^T$ to set $\bm{B}_t$. Specifically, singular value decomposition (SVD) is performed on $(\hat{\bm{H}}_t)^T=\bm{V}_t\bm{\Sigma}_t\bm{U}_t$. Then, InfLoRA designs $\bm{B}_t$ by

$\bm{B}_t=(\bm{U}_t)_r.$   (8)

Here, $(\bm{U}_t)_r$ denotes the rows of $\bm{U}_t$ corresponding to the top-$r$ singular values. Figure 1 (b) illustrates the pipeline of designing the matrix $\bm{B}_t$.
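A compact sketch of this design step is given below (our own illustration; the function name and interface are assumptions, and the DualGPM bookkeeping that produces $\bm{M}_t$ is not shown).

```python
import torch

def design_B(H_t: torch.Tensor, M_t, r: int) -> torch.Tensor:
    """Sketch of the B_t design in Fig. 1(b).

    H_t: (d_I, n) input matrix of the new task, one input per column.
    M_t: (d_I, k_t) orthonormal basis of the old-task gradient space, or None when t = 1.
    Returns B_t: (r, d_I), whose rows lie in N_t ∩ M_t^⊥.
    """
    if M_t is None:                       # t = 1: M_t is a null space, so H_hat = H_t
        H_hat = H_t
    else:                                 # Eq. (6): remove the components lying in M_t
        H_hat = H_t - M_t @ (M_t.T @ H_t)
    # Eq. (8): SVD of H_hat^T; keep the right singular vectors of the top-r singular values
    _, _, Uh = torch.linalg.svd(H_hat.T, full_matrices=False)
    return Uh[:r]                         # B_t = (U_t)_r
```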

Note that DualGPM expands the subspace $\mathcal{M}_t$ and reduces the subspace $\mathcal{M}_t^{\bot}$ when the number of tasks increases. Since InfLoRA constrains the update of the model within the subspace $\mathcal{N}_t\cap\mathcal{M}_t^{\bot}\subseteq\mathcal{M}_t^{\bot}$, the space for learning a new task reduces when the number of tasks increases. However, by adjusting the approximation error of the gradients of the old tasks, DualGPM can expand $\mathcal{M}_t$ slowly and reduce $\mathcal{M}_t^{\bot}$ slowly. Therefore, the constraints imposed by InfLoRA do not excessively affect the model's learning of new tasks. Please refer to the supplementary material for a detailed explanation.

Algorithm 1 InfLoRA for Continual Learning
1:  Input: The data of different tasks $\{\mathcal{D}_t\}_{t=1}^{T}$, a pre-trained ViT model $f_{\bm{\Theta}}(\cdot)$.
2:  Output: Network $f_{\bm{\Theta}}(\cdot)$ with learned parameters $\bm{W}_t$.
3:  for $t$ in $1:T$ do
4:     Design $\bm{B}_t$ through (8);
5:     Expand a new branch for the $t$-th task;
6:     for $\mathcal{B}_t$ sampled from $\mathcal{D}_t$ do
7:        Compute the loss $\mathcal{L}(f_{\bm{\Theta}}(\mathcal{B}_t))$ through (9) and update the parameters;
8:     end for
9:     Preserve the information about the gradient of the $t$-th task through DualGPM;
10:  end for

3.3 Whole Process of InfLoRA

Algorithm 1 outlines the whole process of InfLoRA in continual learning. When the $t$-th new task arrives, InfLoRA first designs $\bm{B}_t$ through (8) and expands a new branch. Then, InfLoRA learns the $t$-th task by fine-tuning the newly expanded branch. Please note that, based on empirical findings from existing methods [38, 12], we employ the local cross-entropy (CE) loss as the learning objective, as it usually performs better than the global CE loss in continual learning methods based on PEFT. The local CE is the CE loss constrained to the classes of the current new task, which can be denoted as

$\mathcal{L}(\mathcal{D}_t)=\frac{1}{|\mathcal{D}_t|}\sum_{(\bm{x},y)\in\mathcal{D}_t}\mathcal{L}_{ce}({\rm mask}(h_{\bm{\Phi}}(f_{\bm{\Theta}}(\bm{x}))),y).$   (9)

Here, ${\rm mask}(\cdot)$ is a function that filters out the logits of the old classes and $\mathcal{L}_{ce}$ denotes the standard CE loss. After learning the $t$-th new task, InfLoRA follows DualGPM to preserve the information about the gradient of the $t$-th task.
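One possible implementation of the masked local CE loss is sketched below (our own illustration; `task_classes` is an assumed tensor holding the global indices of the classes in the current task, and all targets are assumed to belong to it).

```python
import torch
import torch.nn.functional as F

def local_ce(logits: torch.Tensor, targets: torch.Tensor, task_classes: torch.Tensor) -> torch.Tensor:
    """Local CE of Eq. (9): logits of classes outside the current task are masked out,
    so the loss is effectively restricted to the new task's classes."""
    mask = torch.full_like(logits, float('-inf'))
    mask[:, task_classes] = 0.0            # keep only the current task's logits
    return F.cross_entropy(logits + mask, targets)
```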

Note that the branch corresponding to the $t$-th task is frozen once the model has learned the $t$-th task. Since the expanded branches are linear transformations, we can integrate the old branches into the pre-trained weight to reduce the number of expanded parameters. Specifically, after learning the first task, InfLoRA integrates the first branch into the pre-trained weight and obtains the weight $\bm{W}_1=\bm{W}+\bm{A}_1\bm{B}_1$. Before learning the $t$-th new task ($t>1$), InfLoRA maintains the weight $\bm{W}_{t-1}$. After learning the $t$-th task, InfLoRA integrates the $t$-th branch into $\bm{W}_{t-1}$ and obtains $\bm{W}_t=\bm{W}_{t-1}+\bm{A}_t\bm{B}_t$. In this way, the parameters in $\bm{A}_t$ and $\bm{B}_t$ do not need to be maintained during the learning of subsequent tasks. Therefore, throughout the whole learning process, the number of parameters expanded by InfLoRA equals the number of parameters in a single branch. Since a single branch contains $(d_I+d_O)r$ parameters, the number of parameters expanded by InfLoRA is always $(d_I+d_O)r$.
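The merging step itself is just a matrix sum, as in the sketch below (our own illustration with example dimensions; the ViT-B/16 hidden size and the rank are assumptions).

```python
import torch

d_I, d_O, r = 768, 768, 10        # example dimensions (ViT-B/16 hidden size, assumed rank r)
W_prev = torch.randn(d_O, d_I)    # W_{t-1}: pre-trained weight with all old branches merged in
A_t = torch.randn(d_O, r)         # learned on task t
B_t = torch.randn(r, d_I)         # designed for task t

W_t = W_prev + A_t @ B_t          # W_t = W_{t-1} + A_t B_t; A_t and B_t can now be discarded
```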

Table 1: Results (%) on ImageNet-R. Results are included for 5 tasks, 10 tasks, and 20 tasks. We report results averaged over 5 trials.
Method          | $ACC_5$ (↑)  | $\overline{ACC}_5$ (↑) | $ACC_{10}$ (↑) | $\overline{ACC}_{10}$ (↑) | $ACC_{20}$ (↑) | $\overline{ACC}_{20}$ (↑)
joint           | 81.14 ± 0.34 | -            | 81.14 ± 0.34 | -            | 81.14 ± 0.34 | -
sequential      | 58.74 ± 1.28 | 72.91 ± 0.28 | 46.07 ± 1.15 | 62.91 ± 0.68 | 34.62 ± 0.85 | 51.15 ± 1.50
L2P [44]        | 64.13 ± 0.78 | 68.66 ± 0.41 | 62.54 ± 0.24 | 67.98 ± 0.27 | 57.92 ± 0.28 | 64.57 ± 0.29
DualPrompt [43] | 67.88 ± 0.17 | 71.16 ± 0.31 | 65.41 ± 0.52 | 69.39 ± 0.43 | 61.00 ± 0.72 | 65.80 ± 0.67
CODA-P [38]     | 73.09 ± 0.21 | 76.91 ± 0.21 | 71.47 ± 0.35 | 75.82 ± 0.29 | 67.28 ± 0.30 | 72.34 ± 0.17
C-LoRA [37]     | 75.85 ± 0.31 | 78.85 ± 0.34 | 71.89 ± 0.45 | 75.33 ± 0.28 | 65.71 ± 0.60 | 70.63 ± 0.85
LAE [12]        | 73.84 ± 0.14 | 77.29 ± 0.45 | 71.70 ± 0.39 | 76.71 ± 0.10 | 66.98 ± 0.35 | 73.72 ± 0.05
InfLoRA-b5      | 75.28 ± 0.01 | 78.95 ± 0.08 | 74.13 ± 0.18 | 78.54 ± 0.14 | 68.41 ± 0.29 | 74.00 ± 0.50
InfLoRA         | 77.52 ± 0.37 | 82.01 ± 0.12 | 75.65 ± 0.14 | 80.82 ± 0.24 | 71.01 ± 0.45 | 77.28 ± 0.45

4 Experiments

4.1 Experimental Settings

Datasets and Evaluation Metric  Similar to existing continual learning methods [12, 44] based on PEFT, we use ImageNet-R [14], CIFAR100 [24], and DomainNet [34] to train and evaluate the models. ImageNet-R is generated through artistic processing of 200 classes from ImageNet [8]. This dataset was introduced to continual learning by existing work [43] and has become a standard benchmark for continual learning methods based on PEFT. CIFAR100 is a dataset commonly used in existing continual learning works. DomainNet contains 345 classes and is introduced by some existing works [38, 42] for continual learning. Following existing continual learning work [38], we split ImageNet-R into 5, 10, and 20 tasks, with each task containing 40, 20, and 10 classes, respectively. We split CIFAR100 into 10 tasks, each containing 10 classes. We split DomainNet into 5 tasks, each containing 69 classes.

Following existing continual learning methods [12, 44], we evaluate the performance of the model through two popular metrics, including the final accuracy $ACC_{T}$ and the averaged accuracy $\overline{ACC}_{T}=\frac{1}{T}\sum_{i=1}^{T}ACC_{i}$, where $T$ denotes the total number of tasks and $ACC_{i}$ is defined as

$ACC_{i}=\frac{1}{i}\sum_{j=1}^{i}a_{i,j}.$  (10)

Here, $a_{i,j}$ denotes the accuracy of the $j$-th task once the model has learned the $i$-th task.
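As a small worked example, the two metrics can be computed from the accuracy matrix $a_{i,j}$ as follows. This is our own helper with 0-based indexing and hypothetical numbers, not part of the evaluation code:

```python
import numpy as np

def final_and_average_accuracy(a):
    """Compute ACC_T and the averaged accuracy from an accuracy matrix,
    where a[i, j] is the accuracy on task j (j <= i) after learning task i."""
    T = a.shape[0]
    acc = np.array([a[i, : i + 1].mean() for i in range(T)])   # ACC_1, ..., ACC_T
    return acc[-1], acc.mean()                                 # ACC_T and the mean of ACC_i

# Hypothetical numbers for T = 3 tasks (only the lower triangle is used).
a = np.array([[90.0,  0.0,  0.0],
              [85.0, 88.0,  0.0],
              [80.0, 84.0, 87.0]])
final_acc, avg_acc = final_and_average_accuracy(a)
```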

Baselines  We compare our InfLoRA with state-of-the-art continual learning methods based on PEFT, including learn to prompt (L2P) [44], DualPrompt [43], continual decomposed attention-based prompt (CODA-P) [38], learning accumulation ensemble (LAE) [12], and continual low-rank adaptation (C-LoRA) [37]. For LAE, we implement it with LoRA [16]. Following existing works [38, 12], we also include two reference methods, joint and sequential, in the comparison. Here, joint denotes the method that learns all the tasks jointly, while sequential denotes the method that learns all the tasks sequentially without any operation to overcome forgetting. The accuracy of joint can be treated as an accuracy upper bound, and the accuracy of sequential can be treated as an accuracy lower bound.

Architecture and Training Details  We follow existing works [12, 43] to perform experiments. Specifically, we use a ViT-B/16 backbone [10] pre-trained on ImageNet-21K in a supervised manner as the pre-trained model.

For all the methods, we follow existing works [38, 44, 12] and use the Adam optimizer [22] with running averages of the gradient and its square ($\beta_1=0.9$, $\beta_2=0.999$). Each task is trained for 50 epochs on ImageNet-R, 20 epochs on CIFAR100, and 5 epochs on DomainNet. The batch size is set to 128 for all the experiments. Since our InfLoRA shares a similar architecture with LoRA, we follow existing work [12] and insert the architecture of our InfLoRA into the key and value projections of the attention modules. Furthermore, the existing method DualPrompt [43] treats the positions of the inserted blocks as hyperparameters and searches for the best positions for its prompts. On the contrary, we insert the architecture of InfLoRA into all the Transformer blocks to avoid this search. We also implement a variant of our method that inserts InfLoRA only into the bottom 5 Transformer blocks, like the existing methods DualPrompt and CODA-P. We call this variant InfLoRA-b5. As for the hyperparameter $r$, we determine its value through a grid search on a validation dataset.
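To make the insertion concrete, the following is a minimal PyTorch sketch of a LoRA-style branch added to a frozen linear projection, in the spirit of the description above: the dimensionality reduction matrix $\bm{B}_t$ is fixed and only the expansion matrix $\bm{A}_t$ is trained. The module and attribute names (LowRankBranch, k_proj, v_proj) and the placeholder initialization of B are our own assumptions; the way InfLoRA actually constructs $\bm{B}_t$ from the subspaces is described in the method section and is omitted here.

```python
import torch
import torch.nn as nn

class LowRankBranch(nn.Module):
    """Sketch: a frozen pre-trained linear layer plus a low-rank branch.

    Mirrors the description in the text (B_t is designed and then frozen,
    only A_t is learned); it is not the authors' implementation.
    """

    def __init__(self, base_linear: nn.Linear, r: int):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad_(False)                   # freeze the pre-trained weights
        d_in, d_out = base_linear.in_features, base_linear.out_features
        # B_t: dimensionality reduction matrix, fixed after it is designed.
        self.B = nn.Parameter(torch.randn(r, d_in) / d_in ** 0.5, requires_grad=False)
        # A_t: expansion matrix, initialized to zero so training starts from W.
        self.A = nn.Parameter(torch.zeros(d_out, r))

    def forward(self, x):
        # (W x) + A_t (B_t x): fine-tuning A_t updates the weights within
        # the subspace spanned by the rows of B_t.
        return self.base(x) + (x @ self.B.t()) @ self.A.t()

# Hypothetical usage: wrap the key and value projections of an attention module.
# attn.k_proj = LowRankBranch(attn.k_proj, r=10)
# attn.v_proj = LowRankBranch(attn.v_proj, r=10)
```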

Table 2: Results (%) on CIFAR100 and DomainNet. We report results over 5 trials.
Tasks              CIFAR100 (10 tasks)                                     DomainNet (5 tasks)
Method             $ACC_{10}$ ($\uparrow$)  $\overline{ACC}_{10}$ ($\uparrow$)  $ACC_{5}$ ($\uparrow$)  $\overline{ACC}_{5}$ ($\uparrow$)
joint              91.92±0.05               -                                   77.72±0.04              -
sequential         62.18±3.59               80.42±0.23                          53.44±1.21              69.09±0.33
L2P [44]           82.48±0.20               87.64±0.25                          70.16±0.05              75.60±0.03
DualPrompt [43]    84.42±0.30               90.06±0.07                          72.14±0.05              77.71±0.06
CODA-P [38]        86.62±0.11               91.08±0.28                          73.23±0.13              78.72±0.07
C-LoRA [37]        82.97±0.47               88.81±0.34                          69.34±0.13              75.25±0.11
LAE [12]           84.15±0.10               89.84±0.03                          66.85±0.40              75.01±0.17
InfLoRA-b5         87.06±0.25               91.59±0.13                          73.26±0.50              78.82±0.34
InfLoRA            86.51±0.73               91.70±0.32                          74.53±0.23              79.57±0.57
Figure 2: Variation of the performance of different methods during the learning of ImageNet-R and CIFAR100.

4.2 Experimental Results

Accuracy  Table 1 shows the results of different methods on ImageNet-R with different numbers of tasks. Table 2 shows the results of different methods on CIFAR100 and DomainNet. We can find that our methods InfLoRA and InfLoRA-b5 outperform existing continual learning methods.

Figure 2 shows the variation of the accuracy of different continual learning methods on ImageNet-R and CIFAR100. We can find that our method outperforms existing methods not only at the end of learning but also throughout the whole learning process. This indicates that our InfLoRA eliminates the interference of the new task on the old tasks, and thus the accuracy of our method decreases more slowly than that of other methods.

Analysis of Expanded Parameters  Figure 3 shows the number of expanded parameters and the accuracy of different methods on ImageNet-R and CIFAR100. For L2P, DualPrompt, and CODA-P, the expanded parameters are the added prompts and the corresponding keys. For LAE, the expanded parameters are the inserted LoRA modules and an additional copy of them. For C-LoRA, the expanded parameters are the inserted LoRA modules. For our method, the expanded parameters are $\bm{B}_t$ and $\bm{A}_t$. The details of computing the number of expanded parameters for different methods are given in the supplementary material. We can find that CODA-P and C-LoRA expand many more parameters than the other methods. Furthermore, our methods InfLoRA and InfLoRA-b5 expand a number of parameters comparable to L2P, DualPrompt, and LAE but perform better than these methods.
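Since the exact counting is deferred to the supplementary material, the helper below is only a back-of-the-envelope sketch of our own for InfLoRA's expanded parameters, under assumed values (12 Transformer blocks and hidden size 768 for ViT-B/16, key and value projections wrapped, and a hypothetical rank r = 10):

```python
def inflora_expanded_params(num_blocks=12, d_model=768, r=10, projections=2):
    """Rough count (assumption-based): one B_t (r x d) and one A_t (d x r)
    per wrapped projection, in every Transformer block."""
    return num_blocks * projections * 2 * d_model * r

print(inflora_expanded_params())  # ~0.37M parameters per task under these assumptions
```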

Table 3: Results of different variants on ImageNet-R with different numbers of tasks.
Tasks                                                      5                                                   10                                                   20
Variant                                                    $ACC_{5}$ ($\uparrow$)  $\overline{ACC}_{5}$ ($\uparrow$)  $ACC_{10}$ ($\uparrow$)  $\overline{ACC}_{10}$ ($\uparrow$)  $ACC_{20}$ ($\uparrow$)  $\overline{ACC}_{20}$ ($\uparrow$)
Random $\rightarrow\bm{B}_t$                               72.49±0.38              79.40±0.29                         67.38±0.41               76.62±0.06                          56.17±0.29               69.24±0.35
$\mathcal{N}_t\rightarrow\bm{B}_t$                         67.01±0.11              76.09±0.04                         57.91±0.30               70.23±0.59                          40.73±0.29               59.68±0.52
$\mathcal{M}_t^{\bot}\rightarrow\bm{B}_t$                  75.94±0.53              80.69±0.27                         74.61±0.62               79.67±0.27                          68.79±0.42               75.74±0.26
$\mathcal{N}_t\cap\mathcal{M}_t^{\bot}\rightarrow\bm{B}_t$ (InfLoRA)  77.52±0.37   82.01±0.12                         75.65±0.14               80.82±0.24                          71.01±0.45               77.28±0.45

Ablation Study  We perform experiments to verify the effectiveness of designing the dimensionality reduction matrix $\bm{B}_t$ by (8). Specifically, we explore three different variants for designing $\bm{B}_t$. The first variant designs $\bm{B}_t$ randomly using a Gaussian distribution. We call this variant ‘Random $\rightarrow\bm{B}_t$’. The second variant discards the operation in (6) or (7) and directly sets $\hat{\bm{H}}_t=\bm{H}_t$. In this way, this variant ensures that each row of $\bm{B}_t$ lies in $\mathcal{N}_t$ while ignoring $\mathcal{M}_t^{\bot}$. We call this variant ‘$\mathcal{N}_t\rightarrow\bm{B}_t$’. The third variant does not compute the input matrix but initializes $\bm{H}_t$ using a Gaussian distribution before applying the operation in (6) or (7). In this way, this variant ensures that each row of $\bm{B}_t$ lies in $\mathcal{M}_t^{\bot}$ while ignoring $\mathcal{N}_t$. We call this variant ‘$\mathcal{M}_t^{\bot}\rightarrow\bm{B}_t$’. Since our method considers both $\mathcal{M}_t^{\bot}$ and $\mathcal{N}_t$, we use ‘$\mathcal{N}_t\cap\mathcal{M}_t^{\bot}\rightarrow\bm{B}_t$’ to represent our method.

Figure 3: The number of expanded parameters and the accuracy of different methods on ImageNet-R and CIFAR100.

Table 3 shows the results of our method and its variants. We can find that all these variants fail to perform as well as our method. To further demonstrate the behavior of the different variants, Figure 4 shows the relative accuracy on each task after the model has learned all of them. Here, relative accuracy is the accuracy of a variant minus the accuracy of our InfLoRA. Note that in Figure 4, the last task is the new task and the other tasks are old tasks. As we can see, ‘Random $\rightarrow\bm{B}_t$’ and ‘$\mathcal{N}_t\rightarrow\bm{B}_t$’ outperform ‘$\mathcal{M}_t^{\bot}\rightarrow\bm{B}_t$’ on the new task but show much lower accuracy than ‘$\mathcal{M}_t^{\bot}\rightarrow\bm{B}_t$’ and our InfLoRA on the old tasks. This means these two variants fail to eliminate the interference of the new task on the old tasks, making the model suffer from low stability. On the contrary, ‘$\mathcal{M}_t^{\bot}\rightarrow\bm{B}_t$’ shows the lowest performance on the new task, which means it ignores the plasticity of the model. Our method outperforms all the variants on most of the tasks. This shows that our method can eliminate the interference of the new task on the old tasks and make a better trade-off between stability and plasticity than these variants.

Varying the Pre-Trained Model  We also follow the existing method [40] and perform experiments using ViT-B/16 backbones pre-trained with two different self-supervised methods, DINO [5] and iBOT [50]. All experimental settings, except for the choice of the pre-trained model, are kept consistent with the details outlined in Section 4.1.

Figure 4: Relative accuracy of different tasks. Relative accuracy is the accuracy of different variants minus the accuracy of InfLoRA.
Table 4: Results (%) of different methods on ImageNet-R (10 tasks) using various self-supervised pre-trained models. Here, DINO-1k and iBOT-1k indicate that the ViT is pre-trained on ImageNet-1k using these respective methods.
Pre-training   Method            $ACC_{10}$ ($\uparrow$)   $\overline{ACC}_{10}$ ($\uparrow$)
DINO-1k        L2P [44]          56.71±0.12                63.59±0.21
               DualPrompt [43]   60.23±0.42                66.57±0.25
               CODA-P [38]       64.02±0.68                71.50±0.42
               C-LoRA [37]       63.07±0.36                68.09±0.41
               LAE [12]          61.03±0.27                69.89±0.15
               InfLoRA-b5        66.16±0.14                73.01±0.17
               InfLoRA           68.31±0.28                76.15±0.05
iBOT-1k        L2P [44]          60.80±0.35                66.58±0.28
               DualPrompt [43]   63.78±0.38                68.88±0.16
               CODA-P [38]       68.02±0.48                74.28±0.47
               C-LoRA [37]       68.60±0.07                73.47±0.28
               LAE [12]          64.14±0.29                72.59±0.22
               InfLoRA-b5        69.72±0.44                76.11±0.13
               InfLoRA           71.84±0.09                78.29±0.09

Table 4 shows the results of different methods on ImageNet-R when using these pre-trained models. Comparing these results to those in Table 1, we can find that all methods perform worse with self-supervised pre-trained models than with the supervised pre-trained model. However, our methods still outperform all the other methods.

Combining with Classifier Alignment  Slow learner with classifier alignment (SLCA) [48] utilizes feature statistics to align the classifier, demonstrating superior performance compared to methods without classifier alignment. Our InfLoRA can be combined with classifier alignment (CA) to achieve better performance. Specifically, after learning the $t$-th task with parameters $\bm{A}_t$ and $\bm{B}_t$ and loss (9), we collect the features $\bm{F}_t=\{\bm{r}_{i,t}\}_{i=1}^{n_t}$ of the $t$-th task. Here, $\bm{r}_{i,t}=f_{\bm{\Theta}}(\bm{x}_{i,t})$ denotes the features extracted by the backbone $f_{\bm{\Theta}}(\cdot)$. Then, the mean and covariance of the features of each class are computed and saved. After that, for each class $c$ the model has seen during continual learning, $S$ samples are drawn from the Gaussian distribution $\mathcal{N}(\bm{\mu}_c,\bm{\Sigma}_c)$, where $\bm{\mu}_c$ and $\bm{\Sigma}_c$ denote the mean and covariance of class $c$. Finally, we align the classifier using the standard cross-entropy loss on these samples. The details of this experiment are given in the supplementary material.
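As a rough illustration of this procedure (not the authors' implementation; the function names, the covariance jitter, and the optimizer choice are our own assumptions), the following PyTorch sketch collects per-class feature statistics and then aligns the classifier on samples drawn from the saved Gaussians:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def collect_class_statistics(backbone, loader, device="cuda"):
    """Collect per-class feature mean and covariance (sketch).
    Assumes `loader` yields (images, labels) and `backbone` returns features."""
    feats, labels = [], []
    for x, y in loader:
        feats.append(backbone(x.to(device)).cpu())
        labels.append(y)
    feats, labels = torch.cat(feats), torch.cat(labels)
    stats = {}
    for c in labels.unique().tolist():
        fc = feats[labels == c]
        stats[c] = (fc.mean(0), torch.cov(fc.t()))    # mu_c, Sigma_c
    return stats

def align_classifier(classifier, stats, samples_per_class=256, steps=100, lr=1e-3):
    """Align the classifier with features sampled from the saved Gaussians."""
    opt = torch.optim.SGD(classifier.parameters(), lr=lr)
    dists = {}
    for c, (mu, cov) in stats.items():
        jitter = 1e-4 * torch.eye(cov.shape[0])       # keep the covariance positive definite
        dists[c] = torch.distributions.MultivariateNormal(mu, covariance_matrix=cov + jitter)
    classes = list(stats.keys())
    for _ in range(steps):
        xs = torch.cat([dists[c].sample((samples_per_class,)) for c in classes])
        ys = torch.cat([torch.full((samples_per_class,), c) for c in classes])
        loss = F.cross_entropy(classifier(xs), ys)
        opt.zero_grad()
        loss.backward()
        opt.step()
```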

Table 5 shows that our method InfLoRA+CA outperforms SLCA. Note that SLCA tunes all the parameters of the model, while our method InfLoRA only tunes the parameters in $\bm{A}_t$. Therefore, our InfLoRA+CA is much more efficient than SLCA.

Table 5: Results (%) of different methods on ImageNet-R (10 tasks) and CIFAR100 using the classifier alignment (CA) technique.
Dataset      Method         $ACC_{10}$ ($\uparrow$)   $\overline{ACC}_{10}$ ($\uparrow$)
CIFAR100     SLCA [48]      91.06±0.24                93.65±0.19
             InfLoRA+CA     91.59±0.08                94.39±0.05
ImageNet-R   SLCA [48]      77.34±0.25                81.35±0.16
             InfLoRA+CA     79.78±0.25                83.38±0.19

5 Conclusion

In this work, we propose a new method, called interference-free low-rank adaptation (InfLoRA), for continual learning. InfLoRA injects a small number of parameters to reparameterize the pre-trained weights and shows that fine-tuning these injected parameters is equivalent to fine-tuning the pre-trained weights within a subspace. Furthermore, InfLoRA designs this subspace to eliminate the interference of the new task on the old tasks, making a good trade-off between stability and plasticity. Experimental results show that InfLoRA outperforms existing state-of-the-art continual learning methods on multiple datasets.

Acknowledgment

This work is supported by NSFC (No.62192783), National Key R&D Program of China (No.2020YFA0713901), and Fundamental Research Funds for the Central Universities (No.020214380108).

References

  • Aljundi et al. [2018] Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European Conference on Computer Vision, pages 139–154, 2018.
  • Aljundi et al. [2019a] Rahaf Aljundi, Eugene Belilovsky, Tinne Tuytelaars, Laurent Charlin, Massimo Caccia, Min Lin, and Lucas Page-Caccia. Online continual learning with maximal interfered retrieval. In Advances in Neural Information Processing Systems, pages 11849–11860, 2019a.
  • Aljundi et al. [2019b] Rahaf Aljundi, Min Lin, Baptiste Goujaud, and Yoshua Bengio. Gradient based sample selection for online continual learning. In Advances in Neural Information Processing Systems, pages 11816–11825, 2019b.
  • Boschini et al. [2022] Matteo Boschini, Lorenzo Bonicelli, Angelo Porrello, Giovanni Bellitto, Matteo Pennisi, Simone Palazzo, Concetto Spampinato, and Simone Calderara. Transfer without forgetting. In Proceedings of the European Conference on Computer Vision, pages 692–709, 2022.
  • Caron et al. [2021] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660, 2021.
  • Chen et al. [2022] Shoufa Chen, Chongjian Ge, Zhan Tong, Jiangliu Wang, Yibing Song, Jue Wang, and Ping Luo. Adaptformer: Adapting vision transformers for scalable visual recognition. Advances in Neural Information Processing Systems, pages 16664–16678, 2022.
  • Chrysakis and Moens [2020] Aristotelis Chrysakis and Marie-Francine Moens. Online continual learning from imbalanced data. In Proceedings of the International Conference on Machine Learning, pages 1952–1961, 2020.
  • Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
  • Devlin et al. [2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, pages 4171–4186, 2019.
  • Dosovitskiy et al. [2021] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
  • Fu et al. [2022] Chin-Lun Fu, Zih-Ching Chen, Yun-Ru Lee, and Hung-Yi Lee. Adapterbias: Parameter-efficient token-dependent representation shift for adapters in nlp tasks. In Findings of the Association for Computational Linguistics, pages 2608–2621, 2022.
  • Gao et al. [2023] Qiankun Gao, Chen Zhao, Yifan Sun, Teng Xi, Gang Zhang, Bernard Ghanem, and Jian Zhang. A unified continual learning framework with general parameter-efficient tuning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11449–11459, 2023.
  • He et al. [2022] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009, 2022.
  • Hendrycks et al. [2021] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8349, 2021.
  • Houlsby et al. [2019] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In Proceedings of the International Conference on Machine Learning, pages 2790–2799, 2019.
  • Hu et al. [2022] Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.
  • Hung et al. [2019] Steven C. Y. Hung, Cheng-Hao Tu, Cheng-En Wu, Chien-Hung Chen, Yi-Ming Chan, and Chu-Song Chen. Compacting, picking and growing for unforgetting continual learning. In Advances in Neural Information Processing Systems, pages 13647–13657, 2019.
  • Jia et al. [2022] Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge J. Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. In Proceedings of the European Conference on Computer Vision, pages 709–727, 2022.
  • Jie and Deng [2023] Shibo Jie and Zhi-Hong Deng. Fact: Factor-tuning for lightweight adaptation on vision transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1060–1068, 2023.
  • Jung et al. [2020] Sangwon Jung, Hongjoon Ahn, Sungmin Cha, and Taesup Moon. Continual learning with node-importance based adaptive group sparse regularization. Advances in Neural Information Processing Systems, pages 3647–3658, 2020.
  • Khan et al. [2023] Muhammad Gul Zain Ali Khan, Muhammad Ferjad Naeem, Luc Van Gool, Didier Stricker, Federico Tombari, and Muhammad Zeshan Afzal. Introducing language guidance in prompt-based continual learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11463–11473, 2023.
  • Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Kirkpatrick et al. [2017] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, pages 3521–3526, 2017.
  • Krizhevsky [2009] A Krizhevsky. Learning multiple layers of features from tiny images. Master’s thesis, University of Toronto, 2009.
  • Lester et al. [2021] Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 3045–3059, 2021.
  • Li et al. [2019] Xilai Li, Yingbo Zhou, Tianfu Wu, Richard Socher, and Caiming Xiong. Learn to grow: A continual structure learning framework for overcoming catastrophic forgetting. In Proceedings of the International Conference on Machine Learning, pages 3925–3934, 2019.
  • Li and Liang [2021] Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 4582–4597, 2021.
  • Liang and Li [2023a] Yan-Shuo Liang and Wu-Jun Li. Loss decoupling for task-agnostic continual learning. In Advances in Neural Information Processing Systems, 2023a.
  • Liang and Li [2023b] Yan-Shuo Liang and Wu-Jun Li. Adaptive plasticity improvement for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7816–7825, 2023b.
  • Lin et al. [2022] Sen Lin, Li Yang, Deliang Fan, and Junshan Zhang. Trgp: Trust region gradient projection for continual learning. In International Conference on Learning Representations, 2022.
  • Mahabadi et al. [2021] Rabeeh Karimi Mahabadi, James Henderson, and Sebastian Ruder. Compacter: Efficient low-rank hypercomplex adapter layers. In Advances in Neural Information Processing Systems, pages 1022–1035, 2021.
  • Masana et al. [2021] Marc Masana, Joost Van de Weijer, Bartłomiej Twardowski, et al. On the importance of cross-task features for class-incremental learning. arXiv preprint arXiv:2106.11930, 2021.
  • Parisi et al. [2019] German I Parisi, Ronald Kemker, Jose L Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review. Neural Networks, pages 54–71, 2019.
  • Peng et al. [2019] Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. Moment matching for multi-source domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1406–1415, 2019.
  • Rusu et al. [2016] Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.
  • Saha et al. [2021] Gobinda Saha, Isha Garg, and Kaushik Roy. Gradient projection memory for continual learning. In International Conference on Learning Representations, 2021.
  • Smith et al. [2023a] James Seale Smith, Yen-Chang Hsu, Lingyu Zhang, Ting Hua, Zsolt Kira, Yilin Shen, and Hongxia Jin. Continual diffusion: Continual customization of text-to-image diffusion with c-lora. CoRR, 2023a.
  • Smith et al. [2023b] James Seale Smith, Leonid Karlinsky, Vyshnavi Gutta, Paola Cascante-Bonilla, Donghyun Kim, Assaf Arbelle, Rameswar Panda, Rogerio Feris, and Zsolt Kira. Coda-prompt: Continual decomposed attention-based prompting for rehearsal-free continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11909–11919, 2023b.
  • Sun et al. [2022] Qing Sun, Fan Lyu, Fanhua Shang, Wei Feng, and Liang Wan. Exploring example influence in continual learning. Advances in Neural Information Processing Systems, pages 27075–27086, 2022.
  • Wang et al. [2023a] Liyuan Wang, Jingyi Xie, Xingxing Zhang, Mingyi Huang, Hang Su, and Jun Zhu. Hierarchical decomposition of prompt-based continual learning: Rethinking obscured sub-optimality. arXiv preprint arXiv:2310.07234, 2023a.
  • Wang et al. [2023b] Liyuan Wang, Xingxing Zhang, Hang Su, and Jun Zhu. A comprehensive survey of continual learning: Theory, method and application. arXiv preprint arXiv:2302.00487, 2023b.
  • Wang et al. [2022a] Yabin Wang, Zhiwu Huang, and Xiaopeng Hong. S-prompts learning with pre-trained transformers: An occam’s razor for domain incremental learning. Advances in Neural Information Processing Systems, pages 5682–5695, 2022a.
  • Wang et al. [2022b] Zifeng Wang, Zizhao Zhang, Sayna Ebrahimi, Ruoxi Sun, Han Zhang, Chen-Yu Lee, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer G. Dy, and Tomas Pfister. Dualprompt: Complementary prompting for rehearsal-free continual learning. In Proceedings of the European Conference on Computer Vision, pages 631–648, 2022b.
  • Wang et al. [2022c] Zifeng Wang, Zizhao Zhang, Chen-Yu Lee, Han Zhang, Ruoxi Sun, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, and Tomas Pfister. Learning to prompt for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 139–149, 2022c.
  • Zaken et al. [2022] Elad Ben Zaken, Yoav Goldberg, and Shauli Ravfogel. Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (Short Papers), pages 1–9, 2022.
  • Zenke et al. [2017] Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In Proceedings of the International Conference on Machine Learning, pages 3987–3995, 2017.
  • Zhang et al. [2021] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, pages 107–115, 2021.
  • Zhang et al. [2023] Gengwei Zhang, Liyuan Wang, Guoliang Kang, Ling Chen, and Yunchao Wei. SLCA: slow learner with classifier alignment for continual learning on a pre-trained model. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19091–19101, 2023.
  • Zheng et al. [2023] Zangwei Zheng, Mingyuan Ma, Kai Wang, Ziheng Qin, Xiangyu Yue, and Yang You. Preventing zero-shot transfer degradation in continual learning of vision-language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19068–19079, 2023.
  • Zhou et al. [2022] Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. Image bert pre-training with online tokenizer. In International Conference on Learning Representations, 2022.
  • Zhu et al. [2021] Fei Zhu, Xu-Yao Zhang, Chuang Wang, Fei Yin, and Cheng-Lin Liu. Prototype augmentation and self-supervision for incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5871–5880, 2021.

Supplementary Material

A Details of GPM and DualGPM

GPM and DualGPM are built on the fact that the gradient updates lie in the span of the input data points [47].

For a linear layer, we denote its forward propagation as

$\bm{e}=\bm{W}\bm{h}+\bm{b},$  (11)

where $\bm{W}\in\mathbb{R}^{d_I\times d_O}$, $\bm{h}\in\mathbb{R}^{d_I}$, and $\bm{e}\in\mathbb{R}^{d_O}$. Here, $d_I$ and $d_O$ denote the input and output dimensions, respectively. We further denote the loss function as $\mathcal{L}$. Through the chain rule, we can get the gradient of $\bm{W}$:

$\frac{\partial\mathcal{L}}{\partial\bm{W}}=\frac{\partial\mathcal{L}}{\partial\bm{e}}\frac{\partial\bm{e}}{\partial\bm{W}}=\frac{\partial\mathcal{L}}{\partial\bm{e}}\bm{h}^{T}=\begin{bmatrix}a_{1}\bm{h}^{T}\\ a_{2}\bm{h}^{T}\\ \vdots\\ a_{d_{O}}\bm{h}^{T}\end{bmatrix}.$  (12)

Here, $[a_{1},a_{2},...,a_{d_{O}}]^{T}$ denotes the vector $\frac{\partial\mathcal{L}}{\partial\bm{e}}$. From (12), we can find that each column of $\frac{\partial\mathcal{L}}{\partial\bm{W}}$ can be represented as the input $\bm{h}$ multiplied by a real value $a_{k}$ ($1\leq k\leq d_{O}$). Therefore, in the linear layer, each column of the gradient $\frac{\partial\mathcal{L}}{\partial\bm{W}}$ lies in the span of the input.
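A quick numerical check of this fact is given below. This is a standalone PyTorch snippet of our own; note that it uses PyTorch's $(d_O, d_I)$ weight layout, so the scaled copies of the input $\bm{h}$ appear along the rows of the gradient:

```python
import torch

# Verify that the weight gradient of a linear layer is an outer product
# involving the input h (here W follows the (d_O, d_I) layout).
d_in, d_out = 5, 3
W = torch.randn(d_out, d_in, requires_grad=True)
b = torch.randn(d_out, requires_grad=True)
h = torch.randn(d_in)

e = W @ h + b
loss = (e ** 2).sum()
loss.backward()

a = 2 * e.detach()                          # dL/de for this particular loss
outer = a.unsqueeze(1) * h.unsqueeze(0)     # (dL/de) h^T
print(torch.allclose(W.grad, outer))        # True: each row of W.grad is a scaled copy of h
```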

A.1 Gradient Projection Memory

GPM learns a subspace $\mathcal{M}_t$ with orthogonal bases $\bm{M}_t$ to approximate the gradient space of the old tasks. Here, the columns of $\bm{M}_t$ constitute a set of orthogonal bases of $\mathcal{M}_t$. GPM expands the bases of $\mathcal{M}_t$ to the bases of $\mathcal{M}_{t+1}$ after learning the $t$-th new task. Specifically, GPM computes the input matrix $\bm{H}_t$ such that each column of $\bm{H}_t$ represents an input of this layer. Then, the part of $\bm{H}_t$ that already lies in $\mathcal{M}_t$ is removed by

$\hat{\bm{H}}_t=\bm{H}_t-\bm{M}_t(\bm{M}_t)^{T}\bm{H}_t=\bm{H}_t-\bm{H}_{t,proj}.$  (13)

Please note that when $t=1$, ${\rm dim}(\mathcal{M}_t)=0$ and hence $\bm{H}_{t,proj}$ is a zero matrix. After that, singular value decomposition (SVD) is performed on $\hat{\bm{H}}_t=\hat{\bm{U}}\hat{\bm{\Sigma}}\hat{\bm{V}}^{T}$. Then, $u$ new orthogonal bases are chosen from the columns of $\hat{\bm{U}}$ for a minimum of $u$ satisfying the following criterion for a given threshold $\epsilon_{th}$:

$||(\hat{\bm{H}}_t)_u||_F^2+||\bm{H}_{t,proj}||_F^2\geq\epsilon_{th}||\bm{H}_t||_F^2.$  (14)

Here, $(\hat{\bm{H}}_t)_u=[\bm{u}_1,...,\bm{u}_u]$ denotes the components of $\hat{\bm{H}}_t$ that correspond to the top-$u$ singular values. Then, the subspace $\mathcal{M}_{t+1}$ is obtained with the bases $\bm{M}_{t+1}=[\bm{M}_t,\bm{u}_1,...,\bm{u}_u]$.
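The expansion step can be summarized by the following NumPy sketch, a simplified illustration of (13) and (14) rather than the official GPM code:

```python
import numpy as np

def expand_gpm_bases(M, H, eps_th):
    """One GPM expansion step (simplified sketch of Eqs. (13)-(14)).

    M:      (d, k) orthonormal bases of M_t (k = 0 for the first task).
    H:      (d, n) input matrix of the new task (columns are inputs).
    eps_th: threshold epsilon_th.
    Returns the expanded bases M_{t+1}.
    """
    H_proj = M @ (M.T @ H)                       # part of H already captured by M_t
    H_hat = H - H_proj                           # Eq. (13)
    U, S, _ = np.linalg.svd(H_hat, full_matrices=False)
    total = np.linalg.norm(H) ** 2
    kept = np.linalg.norm(H_proj) ** 2
    u = 0
    while kept < eps_th * total and u < S.shape[0]:
        kept += S[u] ** 2                        # add top singular directions until Eq. (14) holds
        u += 1
    return np.hstack([M, U[:, :u]])              # bases of M_{t+1}
```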

A.2 Dual Gradient Projection Memory

Different from GPM, which learns a subspace $\mathcal{M}_t$ with orthogonal bases $\bm{M}_t$ to approximate the gradient space of the old tasks, DualGPM either learns a subspace $\mathcal{M}_t$ with orthogonal bases $\bm{M}_t$ to approximate the gradient space of the old tasks, or learns a subspace $\mathcal{M}_t^{\bot}$ with orthogonal bases $\bm{M}_t^{\bot}$ to approximate the orthogonal complement of the gradient space of the old tasks.

DualGPM decides whether to keep $\bm{M}_t$ or $\bm{M}_t^{\bot}$ in memory according to ${\rm dim}(\mathcal{M}_t)$ and ${\rm dim}(\mathcal{M}_t^{\bot})$. Specifically, during the learning of the first several tasks, ${\rm dim}(\mathcal{M}_t)\leq{\rm dim}(\mathcal{M}_t^{\bot})$. At this time, DualGPM maintains $\bm{M}_t$ and expands $\bm{M}_t$ to $\bm{M}_{t+1}$ after each task. When ${\rm dim}(\mathcal{M}_t)$ increases and exceeds ${\rm dim}(\mathcal{M}_t^{\bot})$, DualGPM obtains $\bm{M}_t^{\bot}$ through some transformations of $\bm{M}_t$. After that, DualGPM only maintains $\bm{M}_t^{\bot}$ in memory and reduces $\bm{M}_t^{\bot}$ to $\bm{M}_{t+1}^{\bot}$ after each task. In this way, the number of bases kept for each layer is ${\rm min}\{{\rm dim}(\mathcal{M}_t),{\rm dim}(\mathcal{M}_t^{\bot})\}$.

There are three key operations in DualGPM: expanding the bases of $\mathcal{M}_t$, obtaining the bases of $\mathcal{M}_t^{\bot}$ from the bases of $\mathcal{M}_t$, and reducing the bases of $\mathcal{M}_t^{\bot}$.

Expanding the Bases of $\mathcal{M}_t$

The expansion of $\mathcal{M}_t$ is the same as that in GPM.

Transforming $\mathcal{M}_t$ to $\mathcal{M}_t^{\bot}$

DualGPM transforms $\mathcal{M}_t$ to $\mathcal{M}_t^{\bot}$ by performing SVD on the matrix $\bm{M}_t$. Specifically, let $\bm{M}_t=\bm{U}\bm{\Sigma}\bm{V}^{T}$; the column vectors of $\bm{U}$ that correspond to the zero singular values form a set of orthogonal bases of $\mathcal{M}_t^{\bot}$. Please refer to the DualGPM paper [29] for a detailed explanation.
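A minimal NumPy sketch of this transformation follows; it is our own illustration, assuming $\bm{M}_t$ has orthonormal columns and using a simple numerical tolerance:

```python
import numpy as np

def orthogonal_complement(M, tol=1e-10):
    """Return orthonormal bases of the orthogonal complement of span(M).

    M: (d, k) matrix whose columns span M_t. The columns of U associated
    with (numerically) zero singular values span M_t^perp in R^d.
    """
    U, S, _ = np.linalg.svd(M, full_matrices=True)   # U is (d, d)
    rank = int((S > tol).sum())
    return U[:, rank:]                               # (d, d - rank)
```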

Reducing the Bases of $\mathcal{M}_t^{\bot}$

DualGPM reduces the space $\mathcal{M}_t^{\bot}$ by removing the part of $\mathcal{M}_t^{\bot}$ that contains the gradient of the $t$-th task. Specifically, DualGPM first computes the input matrix $\bm{R}_t$. Then, the part of $\bm{R}_t$ that lies in $\mathcal{M}_t^{\bot}$ can be computed through

$\hat{\bm{R}}_t^{\bot}=\bm{M}_t^{\bot}(\bm{M}_t^{\bot})^{T}\bm{R}_t=\bm{R}_{t,proj}^{\bot}.$  (15)

After that, SVD is performed on $\hat{\bm{R}}_t^{\bot}=\hat{\bm{U}}^{\bot}\hat{\bm{\Sigma}}^{\bot}(\hat{\bm{V}}^{\bot})^{T}$. Then, $k$ new orthogonal bases are chosen from the columns of $\hat{\bm{U}}^{\bot}$ for a maximum of $k$ satisfying the following criterion for the given threshold $\epsilon_{th}$ (the same $\epsilon_{th}$ as in (14)):

$||(\hat{\bm{R}}_t^{\bot})_k||_F^2\leq(1-\epsilon_{th})||\bm{R}_t||_F^2.$  (16)

Let $\bm{Z}=(\hat{\bm{R}}_t^{\bot})_k=[\bm{u}_1^{\bot},...,\bm{u}_k^{\bot}]$ and $\mathcal{Z}={\rm span}\{\bm{u}_1^{\bot},...,\bm{u}_k^{\bot}\}$. Here, $\mathcal{Z}$ is the subspace of $\mathcal{M}_t^{\bot}$ that contains the gradient of the $t$-th task. DualGPM removes $\mathcal{Z}$ from $\mathcal{M}_t^{\bot}$ to get $\mathcal{M}_{t+1}^{\bot}$. Specifically, let $\hat{\bm{M}}_t^{\bot}=\bm{M}_t^{\bot}-\bm{Z}\bm{Z}^{T}\bm{M}_t^{\bot}$. DualGPM performs a second SVD on $\hat{\bm{M}}_t^{\bot}=\widetilde{\bm{U}}^{\bot}\widetilde{\bm{\Sigma}}^{\bot}(\widetilde{\bm{V}}^{\bot})^{T}$. The columns of $\widetilde{\bm{U}}^{\bot}$ that correspond to the non-zero singular values form the bases $\bm{M}_{t+1}^{\bot}$. Please refer to the DualGPM paper [29] for a detailed explanation.
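The reduction step can be sketched in NumPy as follows, again as a simplified illustration of (15) and (16) rather than the official DualGPM code:

```python
import numpy as np

def reduce_complement_bases(M_perp, R, eps_th, tol=1e-10):
    """One DualGPM reduction step (simplified sketch of Eqs. (15)-(16)).

    M_perp: (d, m) orthonormal bases of M_t^perp.
    R:      (d, n) input matrix of the t-th task (columns are inputs).
    eps_th: threshold epsilon_th (the same one used in the expansion step).
    Returns the reduced bases M_{t+1}^perp.
    """
    R_proj = M_perp @ (M_perp.T @ R)                 # Eq. (15): part of R inside M_t^perp
    U, S, _ = np.linalg.svd(R_proj, full_matrices=False)
    budget = (1.0 - eps_th) * np.linalg.norm(R) ** 2
    k, kept = 0, 0.0
    while k < S.shape[0] and kept + S[k] ** 2 <= budget:
        kept += S[k] ** 2                            # largest k allowed by Eq. (16)
        k += 1
    Z = U[:, :k]                                     # directions carrying the new task's gradient
    M_hat = M_perp - Z @ (Z.T @ M_perp)              # remove span(Z) from M_t^perp
    U2, S2, _ = np.linalg.svd(M_hat, full_matrices=False)
    return U2[:, S2 > tol]                           # bases of M_{t+1}^perp
```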

Figure 5: Change of the dimension of the subspace $\mathcal{M}_{t}^{\bot}$ throughout the whole learning process.

A.3 Approximation Error in DualGPM

DualGPM either learns a subspace $\mathcal{M}_{t}$ to approximate the gradient space of the old tasks or learns a subspace $\mathcal{M}_{t}^{\bot}$ to represent the orthogonal complement of the gradients of the old tasks. From Section A.2, we can see that the approximation error is controlled by the hyperparameter $\epsilon_{th}$ in (14) and (16). Specifically, as the value of $\epsilon_{th}$ in (14) and (16) increases, the approximation error decreases. As a result, the dimension of the subspace $\mathcal{M}_{t}$ becomes larger, while the dimension of $\mathcal{M}_{t}^{\bot}$ becomes smaller. Note that our InfLoRA constrains the update of the model to lie within the subspace $\mathcal{N}_{t}\cap\mathcal{M}_{t}^{\bot}\subseteq\mathcal{M}_{t}^{\bot}$. Therefore, we can adjust the value of $\epsilon_{th}$ to control the space available for learning the new task. For all the experiments, we set

$$\epsilon_{th} = \epsilon + \frac{(1-\epsilon)\, t}{T}, \tag{17}$$

where $t$ denotes the task id and $T$ denotes the total number of tasks. In other words, we gradually increase the value of $\epsilon_{th}$ as the number of learned tasks increases throughout the whole learning process. Table 6 lists the setting of $\epsilon$ in our InfLoRA.
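For illustration, the schedule in (17) can be computed as in the minimal sketch below; the task id $t$ is assumed to start from 1.

```python
def epsilon_th(epsilon: float, t: int, T: int) -> float:
    """Threshold schedule from Eq. (17): grows linearly from epsilon toward 1 as t approaches T."""
    return epsilon + (1.0 - epsilon) * t / T

# Example with epsilon = 0.98 (ImageNet-R) and T = 10 tasks:
# epsilon_th(0.98, 1, 10) == 0.982, epsilon_th(0.98, 10, 10) == 1.0
```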

Figure 5 illustrates how the dimension of the subspace $\mathcal{M}_{t}^{\bot}$ varies across different Transformer layers of ViT-B/16 throughout the whole learning process. We can see that this dimension always remains much larger than zero, which means the space for learning the new task always exists throughout the whole learning process.

Table 6: List of hyperparameters for different methods. The meaning of the different hyperparameters is given in Section B.2. The hyperparameter $\epsilon$ in InfLoRA is explained in Section A.3.

Method       | Hyper-Parameters
L2P          | lr: 0.001; $l$: 1; $p$: 30; $e$: 20 (ImageNet-R, DomainNet, CIFAR100)
DualPrompt   | lr: 0.001; $l_{E}$: 3; $l_{S}$: 2; $e_{E}$: 20; $e_{S}$: 6 (ImageNet-R, DomainNet, CIFAR100)
CODA-P       | lr: 0.001; $l$: 5; $p$: 100; $e$: 8 (ImageNet-R, DomainNet, CIFAR100)
LAE          | lr: 0.001; $r$: 5 (ImageNet-R, DomainNet, CIFAR100)
C-LoRA       | lr: 0.001; $r$: 64; $\lambda$: 0.5 (ImageNet-R, DomainNet, CIFAR100)
InfLoRA-b5   | lr: 0.001 (CIFAR100), 0.0005 (ImageNet-R, DomainNet); $r$: 10 (ImageNet-R, CIFAR100), 20 (DomainNet); $\epsilon$: 0.99 (ImageNet-R), 0.95 (CIFAR100, DomainNet)
InfLoRA      | lr: 0.0005 (ImageNet-R, DomainNet, CIFAR100); $r$: 10 (ImageNet-R, DomainNet, CIFAR100); $\epsilon$: 0.98 (ImageNet-R), 0.95 (CIFAR100, DomainNet)

B More Experimental Details

B.1 Training Details

For all the methods in all the experiments except for the comparison with SLCA, the batch size is set to 128, following many existing continual learning methods based on PEFT [38, 40]. Hyperparameters for different methods are selected based on the experimental settings in existing works [38, 44, 12] or through hyperparameter search. Specifically, Adam is used as the optimizer with running averages of the gradient and its square ($\beta_{1}=0.9$, $\beta_{2}=0.999$). The learning rate is searched over [5e-4, 1e-3, 2e-3, 1e-2] for all the methods on validation sets split from the training sets. For the hyperparameter $r$ in our InfLoRA, we search over [1, 5, 10, 20, 30] on the same validation sets. Table 6 shows the hyperparameters of different methods.
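As a concrete illustration of this setup, a minimal PyTorch sketch is given below; the tiny linear module stands in for the actual injected parameters, and the candidate grids are the ones listed above.

```python
import torch

# Minimal sketch of the optimizer configuration described above; the linear
# layer is only a placeholder for the injected learnable parameters.
model = torch.nn.Linear(768, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

# Candidate grids searched on validation sets split from the training sets.
lr_grid = [5e-4, 1e-3, 2e-3, 1e-2]
r_grid = [1, 5, 10, 20, 30]
```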

When compared with SLCA, our method is combined with classifier alignment (CA). In this setting, we follow SLCA and train the expanded LoRA branches and classifiers using the SGD optimizer. Each task is trained for 50 epochs on ImageNet-R, 20 epochs on CIFAR100, and 5 epochs on DomainNet. The batch size is set to 128.

B.2 Expanded Parameters

For L2P [44], the expanded parameters consist of the inserted prompts and their corresponding keys. Let $d$ denote the embedding dimension, $e$ the prompt length, $p$ the number of prompts, and $l$ the number of layers in which prompts are inserted. The total number of expanded parameters is $dlp(e+1)$.

For DualPrompt [43], the expanded parameters also consist of the inserted prompts and their corresponding keys. However, DualPrompt contains both expert prompts and shared prompts. Let $d$ denote the embedding dimension, $T$ the number of tasks, $e_{E}$ the expert prompt length, $e_{S}$ the shared prompt length, $l_{E}$ the number of layers in which expert prompts are inserted, and $l_{S}$ the number of layers in which shared prompts are inserted. The total number of expanded parameters is $d[Tl_{E}(e_{E}+1)+e_{S}l_{S}]$.

For CODA-Prompt [38], the expanded parameters consist of the inserted prompts, their corresponding keys, and the attention parameters. Let $d$ denote the embedding dimension, $e$ the prompt length, $p$ the number of prompts, and $l$ the number of layers in which prompts are inserted. The total number of expanded parameters is $dlp(e+2)$.

For LAE [12], we implement it with LoRA. Therefore, the expanded parameters consist of the inserted LoRA modules and the corresponding ensemble modules. Let $d$ denote the embedding dimension, $r$ the rank, and $l$ the number of layers in which LoRA modules are inserted. Since LAE inserts LoRA modules into the key and value projections in multi-head attention, the number of expanded parameters is $8ldr$.

For C-LoRA [37], the expanded parameters consist of the inserted LoRA modules. Let $d$ denote the embedding dimension, $r$ the rank, and $l$ the number of layers in which LoRA modules are inserted. Since C-LoRA inserts LoRA modules into the query, key, and value projections in multi-head attention, the number of expanded parameters is $6ldr$.

For our method, since we integrate the branches of the old tasks when the model learns a new task, the number of expanded parameters equals the number of parameters in a single branch. Let $d$ denote the embedding dimension, $r$ the rank, and $l$ the number of layers in which InfLoRA modules are inserted. Since we also insert InfLoRA modules into the key and value projections in multi-head attention, the number of expanded parameters is $4ldr$.
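To make the formulas in this subsection easier to compare, the sketch below evaluates them in Python. The example values of $l$ and $r$ at the end are illustrative assumptions only (ViT-B/16 has $d=768$), not necessarily the exact configuration used in the experiments.

```python
# Parameter-count formulas from Section B.2; d is the embedding dimension,
# and the remaining arguments follow the notation used in the text.

def l2p_params(d, l, p, e):            # prompts + keys
    return d * l * p * (e + 1)

def dualprompt_params(d, T, l_E, e_E, l_S, e_S):   # expert + shared prompts and keys
    return d * (T * l_E * (e_E + 1) + e_S * l_S)

def coda_prompt_params(d, l, p, e):    # prompts + keys + attention parameters
    return d * l * p * (e + 2)

def lae_params(d, l, r):               # LoRA on key/value projections + ensemble copy
    return 8 * l * d * r

def c_lora_params(d, l, r):            # LoRA on query/key/value projections
    return 6 * l * d * r

def inflora_params(d, l, r):           # a single InfLoRA branch on key/value projections
    return 4 * l * d * r

# Illustrative example (l = 12 and r = 10 are assumptions for illustration only):
print(inflora_params(d=768, l=12, r=10))  # 368,640 expanded parameters
```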

C More Experimental Results

Table 7: The comparison between our InfLoRA and more methods on ImageNet-R.

Tasks            | 5                                            | 10                                             | 20
Method           | $ACC_{5}$ (↑)   | $\overline{ACC}_{5}$ (↑)   | $ACC_{10}$ (↑)   | $\overline{ACC}_{10}$ (↑)   | $ACC_{20}$ (↑)   | $\overline{ACC}_{20}$ (↑)
SeqLoRA          | 70.96 ± 0.25    | 79.14 ± 0.32               | 64.32 ± 0.09     | 74.78 ± 0.29                | 56.98 ± 0.29     | 69.29 ± 0.26
HiDe-Prompt [40] | 76.82 ± 0.91    | 77.19 ± 0.34               | 75.06 ± 0.12     | 76.60 ± 0.01                | 66.88 ± 1.29     | 76.71 ± 0.23
InfLoRA          | 77.52 ± 0.37    | 82.01 ± 0.12               | 75.65 ± 0.14     | 80.82 ± 0.24                | 71.01 ± 0.45     | 77.28 ± 0.45
Table 8: The comparison between our InfLoRA and more methods on DomainNet.

Method           | $ACC_{5}$ (↑)   | $\overline{ACC}_{5}$ (↑)
SeqLoRA          | 71.69 ± 0.13    | 78.68 ± 0.12
HiDe-Prompt [40] | 71.48 ± 0.10    | 76.15 ± 0.05
InfLoRA          | 74.53 ± 0.23    | 79.57 ± 0.57
Table 9: Results (%) of different methods on ImageNet-R (10 tasks) using various self-supervised pre-trained models. Here, DINO-1k and iBOT-1k indicate that the ViT is pre-trained on ImageNet-1k using the respective method.

Pre-training | Method           | $ACC_{10}$ (↑)   | $\overline{ACC}_{10}$ (↑)
DINO-1k      | SeqLoRA          | 60.67 ± 0.11     | 66.29 ± 0.21
             | HiDe-Prompt [40] | 68.11 ± 0.18     | 71.70 ± 0.01
             | InfLoRA          | 68.31 ± 0.28     | 76.15 ± 0.05
iBOT-1k      | SeqLoRA          | 66.87 ± 0.40     | 71.80 ± 0.28
             | HiDe-Prompt [40] | 71.33 ± 0.21     | 73.62 ± 0.13
             | InfLoRA          | 71.84 ± 0.09     | 78.29 ± 0.09

C.1 Compare with More Methods

We compare with SeqLoRA, which initializes LoRA modules and fine-tunes them on multiple tasks sequentially without any operation to overcome forgetting. The results are given in Table 7, Table 8, and Table 9. We can see that our method outperforms SeqLoRA.

A recent continual learning PEFT method, hierarchical decomposition prompt (HiDe-Prompt) [40], performs continual learning hierarchically. This method maintains a set of task-specific prompts for each task and involves two stages during training and inference: given an input sample, HiDe-Prompt first infers the prompt index and then uses the corresponding prompt to infer the label. We also compare our method with HiDe-Prompt; the results are given in Table 7, Table 8, and Table 9. Our method outperforms HiDe-Prompt overall. Although HiDe-Prompt achieves performance comparable to ours in terms of final accuracy $ACC_{T}$ on ImageNet-R, there is a notable gap between the two methods in terms of averaged accuracy $\overline{ACC}_{T}$. Note that the averaged accuracy $\overline{ACC}_{T}$ is more important than the final accuracy $ACC_{T}$, since $\overline{ACC}_{T}$ reflects the performance of the model over the whole learning process.
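For reference, the two metrics can be computed as in the sketch below, assuming the usual continual-learning convention that $a_{i,j}$ denotes the accuracy on task $j$ after training on task $i$ (the paper's formal definitions appear in the main text).

```python
import numpy as np

def final_and_averaged_acc(a: np.ndarray):
    """a[i, j] = accuracy on task j after training on task i (j <= i), shape (T, T).

    Returns (ACC_T, averaged ACC), where ACC_t is the mean accuracy over the
    first t tasks after learning task t, and the averaged accuracy is the mean
    of ACC_t over all t.
    """
    T = a.shape[0]
    acc_t = np.array([a[t, : t + 1].mean() for t in range(T)])
    return acc_t[-1], acc_t.mean()
```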

C.2 Hyperparameter Analysis

We perform a hyperparameter analysis for our method InfLoRA. Two hyperparameters are involved. The first is $r$, which controls the number of expanded parameters in InfLoRA. The second is $\epsilon$, which is not specific to InfLoRA but is introduced by DualGPM; it controls the components maintained in the matrix $\bm{M}_{t}$.

Figure 6 shows the results of our method with different values of $r$ or $\epsilon$. We can see that the performance of InfLoRA first increases and then decreases as $r$ or $\epsilon$ increases.

Figure 6: (a) Analysis of the hyperparameter $r$. (b) Analysis of the hyperparameter $\epsilon$.

C.3 Domain Incremental Setting

InfLoRA can be extended to the domain-incremental setting. Specifically, DomainNet contains six domains, and InfLoRA learns these domains sequentially. Table 10 shows that InfLoRA outperforms the other baselines.

Table 10: Results on DomainNet in the domain-incremental setting.

Method          | $ACC_{6}$ (↑)   | $\overline{ACC}_{6}$ (↑)
L2P [44]        | 34.15 ± 0.10    | 49.84 ± 0.03
DualPrompt [43] | 35.24 ± 0.12    | 48.44 ± 0.13
CODA-P [38]     | 56.89 ± 0.04    | 57.56 ± 0.03
C-LoRA [37]     | 44.96 ± 0.01    | 52.95 ± 0.08
InfLoRA         | 68.44 ± 0.04    | 67.46 ± 0.03

C.4 Inference Efficiency

Existing methods often involve multiple forward propagations through the pre-trained backbone. Specifically, prompt-based continual learning methods, including L2P, DualPrompt, and CODA-P, require an extra forward propagation to generate instance-specific prompts, and LAE requires an extra forward propagation for ensembling. In contrast, our InfLoRA requires only a single forward propagation through the pre-trained backbone. Figure 7 compares the inference time of the different methods. We can see that our method consistently outperforms existing methods in terms of time efficiency.

Figure 7: The time of inferring one task for different methods.