License: CC BY-NC-ND 4.0
arXiv:2404.00505v2 [cs.LG] 12 Apr 2024

Transfer Learning with Reconstruction Loss

Wei Cui and Wei Yu

Manuscript submitted to IEEE Transactions on Machine Learning in Communications and Networking on April 6, 2023, revised on October 18, 2023, and accepted on March 24, 2024. The materials in this paper have been presented in part at the IEEE Global Communications Conference (Globecom), Rio de Janeiro, Brazil, December 2022 [1]. This work is supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada via the Canada Research Chairs Program. The authors are with The Edward S. Rogers Sr. Department of Electrical and Computer Engineering, University of Toronto, Toronto, ON M5S 3G4, Canada (e-mails: {cuiwei2, weiyu}@ece.utoronto.ca).
Abstract

In most applications of neural networks to mathematical optimization, a dedicated model is trained for each specific optimization objective. However, in many scenarios, several distinct yet correlated objectives or tasks often need to be optimized on the same set of problem inputs. Instead of independently training a different neural network for each problem, it is more efficient to exploit the correlations between these objectives and to train multiple neural network models with shared model parameters and feature representations. To achieve this, this paper first establishes the concept of common information: the shared knowledge required for solving the correlated tasks. It then proposes a novel approach for model training by adding to the model an additional reconstruction stage associated with a new reconstruction loss, which reconstructs the common information from a selected hidden layer in the model. The proposed approach encourages the learned features to be general and transferable, and therefore can be readily used for efficient transfer learning. For numerical simulations, three applications are studied: transfer learning on classifying MNIST handwritten digits, device-to-device wireless network power allocation, and multiple-input-single-output network downlink beamforming and localization. Simulation results suggest that the proposed approach is highly efficient in data and model complexity, is resilient to over-fitting, and achieves competitive performance.

Index Terms:
Transfer learning, feature learning, mathematical optimization, wireless communications, information flow

I Introduction

Deep learning has gained increasing popularity as a flexible and computationally efficient approach for solving a great variety of mathematical optimization problems, such as resource allocation [2, 3, 4, 5], detection and sensing [6, 7, 8, 9], and so on. In most of the literature on applying deep learning to optimization problems, a specialized neural network is trained from scratch for each individual optimization task. Such an approach requires a large amount of training data to obtain satisfactory performance on each task, and lacks scalability when multiple objectives need to be optimized. However, in many scenarios, there are often multiple optimization problems that are based on the same set of inputs and differ from each other only in terms of their objective functions. In this paper, we exploit the similarities between these optimization tasks and propose a novel deep learning approach to train neural networks for different tasks in a highly efficient way, in terms of both data and model complexity.

In the machine learning literature, researchers have explored the transfer of knowledge between machine learning models tackling similar tasks, known as transfer learning [10]. In these works, a model is first fully trained from scratch with abundant training data and computation resources for one task (i.e., the source task). When a new task correlated to the source task is presented (i.e., the target task) with only a small amount of data available, the trained model is further fine-tuned on the limited available data to solve this new task. Transfer learning is popular in computer vision (CV) [11, 12, 13, 14, 15], natural language processing (NLP) [16, 17, 18, 19], and so on.

To better understand transfer learning with neural networks, we interpret the neural network input-to-output computation flow as a two-stage process, i.e. a feature learning stage followed by an optimization stage:

  1. Feature learning stage: the stage where the high-level feature representations are learned.

  2. Optimization stage: the stage where the final task-specific outputs are computed based on the learned features.

Many transfer learning approaches can be viewed as transferring the feature learning stage across neural network models, while each model learns its own task-specific optimization stage. In the mainstream transfer learning research in application fields such as CV and NLP, the inputs in the problems are typically highly structured, while the outputs (or targets) are in much lower dimensions. Consequently, it is clear where the feature learning stage and the optimization stage are within the neural network computation flow. However, conducting transfer learning on general mathematical optimization problems is different. Here, the inputs, the outputs, and their mappings often lack discernible structures, resulting in no clear distinction between the feature learning stage and the optimization stage in the neural network. Consequently, it is difficult to determine where within the trained model the transferable knowledge (in the form of features) is computed, or even if such transferable knowledge exists at all.

I-A Main Contributions

In this paper, we propose a novel transfer learning approach to explicitly enforce the learning of transferable features at a specific location in the neural network computation flow. (The code for this paper is available at: https://github.com/willtop/Transfer_Learning_with_Reconstruction_Loss.) Firstly, we establish the following concept for transfer learning:

Common Information: the information required for specifying and for solving both the source task and the target task. (The term "common information" has also been used in information theory as a similarity measure between two correlated random variables [20], which is not to be confused with the definition in this paper.)

Although common information may be difficult to identify in general, in this paper we make a key observation that when the source task and the target task share the same input, the problem input itself always forms a (possibly non-strict) superset of the common information. Therefore, in this paper, we consider transfer learning for problems that share the same input distribution and propose to use the problem input as a general choice of common information. Moreover, this paper also goes beyond using the input as the common information and further considers applications where we can extract specific common information with lower-dimensional representations. In this case, we can apply the proposed approach to these specific representations.

With the concept of common information established, the proposed transfer learning approach can be described as follows. When training the neural network model on the source task, besides the task-specific loss, we introduce an additional reconstruction loss to be minimized jointly: i.e., we let the neural network reconstruct the common information using features from a specific hidden layer (referred to as the feature layer) and compute the common information reconstruction loss. Through minimizing this reconstruction loss, we encourage the features learned at the feature layer to be informative about all the correlated tasks that take the same input distribution. To perform transfer learning on a target task, we fix the trained model parameters up to this feature layer as the already-optimized feature learning stage for the model and further train the remaining model parameters on the target task.

When the proposed approach utilizes a choice of common information that is generic (e.g., the problem inputs), the features learned in the model can be used for all target tasks that have the same input. Essentially, the proposed transfer learning approach is target-task agnostic. This is in contrast to some of the prior transfer learning works [21, 22, 17, 23] where the training approach is dedicated to a given source-task and target-task pair. We note that several works also explore the similar idea of encouraging input reconstruction from the model's internal features, e.g., in the fields of semi-supervised learning [24, 25], multi-task learning [26], and domain adaptation (a sub-category of transfer learning where the input distribution changes from the source to the target domain) [22, 27, 28, 29, 30, 31]. Nonetheless, these works deal with different problem setups: in semi-supervised learning, the focus is on obtaining quality features and latent representations of the inputs with only limited ground-truth labels; in multi-task learning, the model is trained on multiple tasks simultaneously using ample training data; in domain adaptation, the task stays the same while the input distribution changes. While the input reconstruction techniques in these works all aim to encourage the neural network models to extract salient features, the present paper differs in that we focus on transfer learning and specifically on quick adaptation to new unseen tasks, after the initial training stage, with only limited further adjustments of model parameters. Moreover, this paper goes one step further in expanding the possibilities for the reconstruction target beyond the problem inputs, through introducing the concept of common information.

For numerical simulations, we first demonstrate the proposed approach through a classical machine learning application: the MNIST handwritten digits classification [32]. We formulate the source task and the target task as correlated classification tasks and adopt fully-connected neural networks. Furthermore, we treat several important and challenging classes of mathematical optimization problems in wireless communications. Although a few works have explored transfer learning on certain wireless communication problems [33, 34, 23], the techniques used are specific to the application settings or objective characteristics. In contrast, our approach is more general and readily adapts to different problems or objectives. We illustrate this by experimenting over two application scenarios: the power control utility optimization for device-to-device (D2D) wireless networks, and the downlink beamforming and localization problems for multiple-input-single-output (MISO) wireless networks. Specifically, for each application scenario, we explore transfer learning between a pair of distinct yet correlated objectives: the min-rate and sum-rate objectives in the D2D networks; and the beamforming gain and the localization accuracy in the MISO networks. Optimization results suggest that the proposed approach achieves better knowledge transfer and mitigates over-fitting on limited target-task data more effectively than the conventional transfer learning method.

I-B Paper Organization and Notations

The remainder of the paper is organized as follows. Section II formulates the general transfer learning problem and establishes the concept of common information. Section III introduces the three applications to be studied in detail: the MNIST classification problem, the D2D network power control problems, and the downlink MISO network beamforming and localization problems. Section IV proposes the novel transfer learning approach for general neural networks, along with the proper selections of the common information for the three applications. The performances of the proposed method over the three applications are presented and analyzed in Section V. Finally, conclusions are drawn in Section VI.

For mathematical symbols, we use lower-case letters for scalar variables, lower-case bold-face letters for vector variables, and upper-case bold-face letters for matrix variables. By default, we regard any vector as a single-column matrix. We use the superscript $(\cdot)^{\mathrm{H}}$ to denote the Hermitian transpose of a matrix (or a vector regarded as a single-column matrix). We use $\mathcal{CN}(0,\Sigma)$ to denote the zero-mean circularly-symmetric complex normal distribution, with $\Sigma$ being the covariance matrix. We use $[\cdot]_k$ to denote the $k$-th element of a vector. We use $|\cdot|$ to denote the absolute value of a complex number, and $\|\cdot\|_2$ to denote the $L^2$ norm of a vector. Lastly, we use the operator $\leftarrow$ to denote the assignment to a specified variable.

II Transfer Learning Formulation

We first present the general transfer learning formulation studied in this paper. This transfer learning formulation is not restricted to any specific application and is applicable to various application domains and scenarios. Specifically, we introduce the novel concept and the precise definition of common information. At the end of this section, we provide a discussion on the fundamental learning objectives and requirements for efficient transfer learning.

II-A General Setup: Source Task and Target Task Optimization

Among many variants of transfer learning formulations, we focus on the setting where the source task and the target task share the same input distribution but differ in their respective objectives. Let $\mathcal{S}$ denote the source task and $\mathcal{T}$ denote the target task. Consider the optimization problems summarized by the following components:

  • Input parameters $\mathbf{p}$ summarizing all environment information essential for optimization, which follow the same distribution in both $\mathcal{S}$ and $\mathcal{T}$;

  • Optimization variables for $\mathcal{S}$: $\mathbf{x}_s$;

  • Objective (or utility) for $\mathcal{S}$: $u_s(\mathbf{x}_s)$;

  • Optimization variables for $\mathcal{T}$: $\mathbf{x}_t$;

  • Objective (or utility) for $\mathcal{T}$: $u_t(\mathbf{x}_t)$.

If $\mathcal{S}$ and $\mathcal{T}$ are supervised learning tasks, $u_s(\mathbf{x}_s)$ and $u_t(\mathbf{x}_t)$ also depend on the ground-truth labels, which we omit in our notations as they are not variables to be optimized.

To optimize for $\mathcal{S}$ and $\mathcal{T}$, we utilize neural networks to compute the optimal values of the optimization variables. Specifically:

  • Let $\mathcal{F}_{\Theta_s}$ with trainable parameters $\Theta_s$ denote the neural network mapping for optimizing $\mathcal{S}$:

    $\mathcal{F}_{\Theta_s}(\mathbf{p}) = \mathbf{x}_s\,.$  (1)

  • Let $\mathcal{F}_{\Theta_t}$ with trainable parameters $\Theta_t$ denote the neural network mapping for optimizing $\mathcal{T}$:

    $\mathcal{F}_{\Theta_t}(\mathbf{p}) = \mathbf{x}_t\,.$  (2)

II-B Common Information

Under this transfer learning setting, $\mathcal{S}$ and $\mathcal{T}$ are correlated in the sense that for optimizing $u_s(\mathbf{x}_s)$ and $u_t(\mathbf{x}_t)$, the information extracted from $\mathbf{p}$ should be similar. To mathematically formalize this concept, we introduce the concept of common information, which we denote by $\mathcal{I}(\mathbf{p})$, with the following definition:

Definition (Common Information).

Let the input parameters $\mathbf{p}$ follow the same distribution for the source task $\mathcal{S}$ and the target task $\mathcal{T}$. Let $\mathbf{x}_s^*(\mathbf{p})$ and $\mathbf{x}_t^*(\mathbf{p})$ denote the optimal solutions for $\mathcal{S}$ and $\mathcal{T}$ respectively, as follows (note that by (1) and (2), both $\mathbf{x}_s$ and $\mathbf{x}_t$ are functions of the input $\mathbf{p}$; thus, the optimal $\mathbf{x}_s^*$ and $\mathbf{x}_t^*$ are also functions of $\mathbf{p}$):

$\mathbf{x}_s^*(\mathbf{p}) = \operatorname*{argmax}_{\mathbf{x}_s} u_s(\mathbf{x}_s)$  (3)

$\mathbf{x}_t^*(\mathbf{p}) = \operatorname*{argmax}_{\mathbf{x}_t} u_t(\mathbf{x}_t)\,.$  (4)

The common information $\mathcal{I}(\mathbf{p})$ is a function of $\mathbf{p}$ such that it satisfies the following:

$\exists\; \mathcal{F}_1(\cdot),\, \mathcal{F}_2(\cdot)$ such that

$\mathbf{x}_s^*(\mathbf{p}) = \mathcal{F}_1\bigl(\mathcal{I}(\mathbf{p})\bigr)$  (5)

$\mathbf{x}_t^*(\mathbf{p}) = \mathcal{F}_2\bigl(\mathcal{I}(\mathbf{p})\bigr)\,.$  (6)

Essentially, the common information is the information required by both $\mathcal{S}$ and $\mathcal{T}$. We note that for every ($\mathcal{S}$, $\mathcal{T}$) pair, there exists at least one choice of the common information: the problem input $\mathbf{p}$ itself.

II-C Transfer Learning Principles

Despite the correlations, $\mathcal{S}$ and $\mathcal{T}$ are still two different tasks. For a neural network model exclusively trained on one task, its learned features are not optimal for the other: some features learned from optimizing $\mathcal{S}$ could be totally irrelevant or even counter-productive for optimizing $\mathcal{T}$. For successful transfer learning between $\mathcal{S}$ and $\mathcal{T}$, the features need to be general, i.e., they should contain more knowledge than is required for optimizing just a single task.

Moreover, due to reasons such as the cost and overhead of data acquisition, the data available for training $\Theta_t$ on $\mathcal{T}$ is assumed to be highly limited. This assumption is particularly relevant for scenarios where the target task $\mathcal{T}$ is adapted on the fly after model training and deployment. To effectively learn $\mathcal{F}_{\Theta_t}$ with limited data, we need to utilize the correlation between $\mathcal{S}$ and $\mathcal{T}$ and transfer over the knowledge already learned in $\mathcal{F}_{\Theta_s}$ that is also useful for solving $\mathcal{T}$. In essence, the features and representations computed both in $\mathcal{F}_{\Theta_s}$ and in $\mathcal{F}_{\Theta_t}$ should contain the common information $\mathcal{I}(\mathbf{p})$ as defined in Section II-B.

III Applications of Transfer Learning

We describe in detail three applications under the transfer learning setting of Section II-A. First, in Section III-A, we modify the canonical machine learning problem of MNIST handwritten digits classification into a transfer learning problem. The MNIST classification task has long been regarded as a benchmark problem in the machine learning literature. By exploring a transfer learning task for this application, we can reveal the potential of the proposed method for learning rich and complex transferable features. Then, we present two wireless communication applications: the D2D power control problem in Section III-B, and the MISO beamforming and localization problem in Section III-C. These two wireless communication problems are examples of non-convex mathematical optimization problems, which lack specific structures in the inputs and solution mappings and are therefore challenging for conventional transfer learning techniques.

III-A MNIST Handwritten Digits Classification

MNIST handwritten digits classification [32] is a classical problem explored by many machine learning algorithms. The MNIST dataset consists of images of handwritten single digits from 0 to 9. The original classification task requires predicting the correct digit from each handwritten image.

We formulate the following transfer learning problem based on the original MNIST digit classification problem. We modify the classification objectives to obtain a source task and a target task. Specifically, let $\mathcal{S}$ be the task of identifying whether the input image represents the digit 1 or not, and let $\mathcal{T}$ be the task of identifying whether the input image represents the digit 8 or not. We select the input images (which are the problem inputs $\mathbf{p}$) to only include handwritten samples of 0, 1, or 8, so that for both $\mathcal{S}$ and $\mathcal{T}$, the positive and negative samples are relatively well balanced with a ratio of around 1-to-2.

The reason for this specific selection of digits is as follows: as described in Section II-C, $\mathcal{S}$ and $\mathcal{T}$ should be similar and at the same time have significant differences. Both the task of identifying the digit 1 and the task of identifying the digit 8 require the common information of the complete pixel patterns, such as edges and corners, over the entire input image. Nonetheless, the difference is also apparent: handwritten samples of the digit 1 resemble a linear pattern of pixels throughout the image, while handwritten samples of the digit 8 resemble local circular patterns of pixels. As the defined input set only includes handwritten samples of 0, 1, and 8, a machine learning model exclusively trained for solving $\mathcal{S}$ only needs to discover and rely on a simple high-level feature: whether the pixels form a global linear pattern or not. Needless to say, the knowledge in this model would not transfer well to solving $\mathcal{T}$, as the learned feature is not informative for distinguishing handwritten samples of 8 from those of 0. Therefore, an effective transfer learning approach is needed for obtaining features transferable from $\mathcal{S}$ to $\mathcal{T}$.

Corresponding to the notation in Section II-A, $\mathbf{x}_s$ and $\mathbf{x}_t$ are the neural network predictions of the probabilities that the handwritten input $\mathbf{p}$ represents the digit specified by the corresponding task. Specifically, with $\mathcal{S}$ and $\mathcal{T}$ both being binary classification tasks, we have $\mathbf{x}_s \in [0,1]$ being the probability of $\mathbf{p}$ representing the digit 1, and $\mathbf{x}_t \in [0,1]$ being the probability of $\mathbf{p}$ representing the digit 8. Let $\mathbf{p}_{\text{digit}}$ be the actual digit the handwritten input $\mathbf{p}$ represents. The source task and target task objectives are respectively:

$u_s(\mathbf{x}_s) = \begin{cases} 1, & \text{if } \mathbf{x}_s \geq 0.5 \text{ and } \mathbf{p}_{\text{digit}} = 1 \\ 1, & \text{if } \mathbf{x}_s < 0.5 \text{ and } \mathbf{p}_{\text{digit}} \neq 1 \\ 0, & \text{otherwise,} \end{cases}$  (7)

$u_t(\mathbf{x}_t) = \begin{cases} 1, & \text{if } \mathbf{x}_t \geq 0.5 \text{ and } \mathbf{p}_{\text{digit}} = 8 \\ 1, & \text{if } \mathbf{x}_t < 0.5 \text{ and } \mathbf{p}_{\text{digit}} \neq 8 \\ 0, & \text{otherwise.} \end{cases}$  (8)

The definitions in (7) and (8) capture the classification accuracy: the average values of $u_s(\mathbf{x}_s)$ and $u_t(\mathbf{x}_t)$ over a set of samples equal the percentages of correct predictions across that sample set.
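To make the task construction concrete, the following sketch (a minimal illustration assuming PyTorch and torchvision are available; the dataset path and variable names are our own choices, not the paper's code) filters the MNIST training set to the digits 0, 1, and 8 and derives the binary labels for $\mathcal{S}$ and $\mathcal{T}$:

```python
import torch
from torchvision import datasets, transforms

# Load MNIST and keep only images of the digits 0, 1, and 8.
mnist = datasets.MNIST(root="./data", train=True, download=True,
                       transform=transforms.ToTensor())
keep = [i for i, (_, y) in enumerate(mnist) if y in (0, 1, 8)]

images = torch.stack([mnist[i][0].view(-1) for i in keep])  # flattened inputs p
digits = torch.tensor([mnist[i][1] for i in keep])          # actual digits p_digit

# Source task S: is the image a 1?  Target task T: is the image an 8?
labels_source = (digits == 1).float()
labels_target = (digits == 8).float()
```

Since roughly one third of the retained images are 1's and one third are 8's, both binary tasks keep a positive-to-negative ratio of about 1-to-2, as stated above.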

III-B D2D Wireless Network Power Control

For the first wireless communication application, we study the power control problem for D2D wireless networks. Consider a wireless network with $N$ D2D links that transmit independently over a frequency band of bandwidth $w$ with full frequency reuse. Let $\mathbf{G} = \{g_{ij}\}_{i,j \in \{1 \dots N\}}$ denote the set of channel gains, with $g_{ij}$ being the channel gain from the $j$-th transmitter to the $i$-th receiver. The power control problem takes the wireless channel state information as the input, so we have:

$\mathbf{p} \leftarrow \mathbf{G} = \{g_{ij}\}_{i,j \in \{1 \dots N\}}\,.$  (9)

Let $P_i$ denote the maximum transmission power of the $i$-th transmitter. The problem of power control is to find the optimal values of the variables $\mathbf{x} = \{x_i\}_{i \in \{1 \dots N\}}$, where $x_i \in [0,1]$ denotes the fraction of the maximum power at which the $i$-th transmitter transmits. Under a specific power control solution $\mathbf{x}$, the $i$-th link has the following achievable rate:

$r_i = w \log\left(1 + \dfrac{g_{ii} P_i x_i}{\sum_{j \neq i} g_{ij} P_j x_j + \sigma^2}\right)\,,$  (10)

where $\sigma^2$ denotes the background noise power level.

For the transfer learning problem, we consider two link rate utility functions that are important under different application scenarios:

  • The sum-rate optimization as $\mathcal{S}$:

    $u_s(\mathbf{x}_s) = \sum_{i=1}^{N} r_i$  (11)

  • The min-rate optimization as $\mathcal{T}$:

    $u_t(\mathbf{x}_t) = \min_{i=1 \dots N} r_i$  (12)

where $\mathbf{x}_s$ and $\mathbf{x}_t$ are the power control solutions for the sum-rate optimization $\mathcal{S}$ and the min-rate optimization $\mathcal{T}$ respectively.

Examining (11) and (12), the two objectives are correlated in the sense that a set of higher rates over all links leads to a higher objective value, and both objectives benefit from proper interference mitigation. On the other hand, these two objectives differ significantly in terms of fairness among links: (12) ensures complete fairness by optimizing the worst link rate, while (11) largely ignores fairness since the optimal sum rate might be achieved by heavily utilizing strong links. With this distinction, conducting transfer learning between (11) and (12) is challenging, as certain features crucial for optimizing (11) could potentially lead to degraded performance on (12).
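As an illustration of (10)-(12), the following sketch (a minimal NumPy example; the function names, the base-2 logarithm, and the uniform-power toy usage are our own assumptions, not the paper's experimental setup) evaluates the per-link rates and the two utilities from a channel gain matrix and a power-fraction vector:

```python
import numpy as np

def link_rates(G, x, P_max, noise_power, bandwidth=1.0):
    """Achievable rates r_i of eq. (10) for channel gains G (N x N),
    power fractions x in [0, 1]^N, and per-link maximum powers P_max."""
    tx_power = P_max * x                           # actual transmit powers P_i * x_i
    signal = np.diag(G) * tx_power                 # g_ii * P_i * x_i
    interference = G @ tx_power - signal           # sum_{j != i} g_ij * P_j * x_j
    sinr = signal / (interference + noise_power)
    return bandwidth * np.log2(1.0 + sinr)

def sum_rate(rates):   # source-task utility, eq. (11)
    return rates.sum()

def min_rate(rates):   # target-task utility, eq. (12)
    return rates.min()

# Toy usage with random channel gains and full power on every link.
N = 5
G = np.abs(np.random.randn(N, N)) ** 2
rates = link_rates(G, x=np.ones(N), P_max=np.ones(N), noise_power=1e-3)
print(sum_rate(rates), min_rate(rates))
```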

III-C MISO Wireless Network Beamforming and Localization

As a second application in wireless communications, we study transfer learning from the downlink MISO beamforming task to the localization task. The source task aims to design the optimal downlink beamformers based on the received uplink pilots, while the target task aims to find the location of the user. These two seemingly unrelated tasks nevertheless share a common input: the uplink pilot measurements, which carry the channel state information.

Consider a downlink MISO network of $M$ base stations (BSs) collectively serving a single user equipment (UE), where each BS is equipped with $K$ antennas while the UE is equipped with a single antenna. We assume that the locations of all BSs are fixed. We denote the UE location by the coordinates $(x_{\text{UE}}, y_{\text{UE}}, z_{\text{UE}})$, with $x_{\text{UE}}$ and $y_{\text{UE}}$ being unknown and $z_{\text{UE}}$ being fixed, which is usually the case in practical indoor scenarios where the UE is located on the ground or at a known height. We assume reciprocity of the uplink and downlink channels, with the set of channel coefficients denoted by $\mathbf{H} = \{\mathbf{h}_m\}_{m \in \{1 \dots M\}}$, where $\mathbf{h}_m \in \mathbb{C}^K$ is the vector of channel coefficients from the $K$ antennas of the $m$-th BS to the UE. We use the Rician fading model for the wireless channels, with the channel from the $m$-th BS to the UE modelled as:

$\mathbf{h}_m = \rho(d_m)\left(\sqrt{\dfrac{\epsilon}{1+\epsilon}}\,\mathbf{h}_m^{\text{LOS}} + \sqrt{\dfrac{1}{1+\epsilon}}\,\mathbf{h}_m^{\text{NLOS}}\right)$  (13)

where $d_m$ is the distance from the UE to the $m$-th BS and $\rho(d_m)$ is the associated pathloss as a function of this distance, $\mathbf{h}_m^{\text{LOS}} \in \mathbb{C}^K$ and $\mathbf{h}_m^{\text{NLOS}} \in \mathbb{C}^K$ are the channel coefficients for the line-of-sight (LOS) path and the non-line-of-sight (NLOS) paths respectively, and $\epsilon$ is the Rician factor, i.e., the ratio of power between the LOS and NLOS channel components. Specifically, we model the LOS channel component as:

$\mathbf{h}_m^{\text{LOS}} = \mathbf{a}_m(\theta_m, \phi_m)\,,$  (14)

where $\theta_m$ and $\phi_m$ are the azimuth and elevation angles-of-arrival (AoA) respectively from the UE to the $m$-th BS, and $\mathbf{a}_m \in \mathbb{C}^K$ is the steering vector, which is a function of $\theta_m$ and $\phi_m$. Let $\delta$ and $\lambda$ denote the antenna spacing and the signal wavelength respectively. The $k$-th component of $\mathbf{a}_m$, which corresponds to the $k$-th antenna of the $m$-th BS, is computed as follows:

$[\mathbf{a}_m(\theta_m, \phi_m)]_k = e^{\,\mathrm{j}\frac{2\pi\delta}{\lambda}\left[i_r(k)\sin(\theta_m)\cos(\phi_m) + i_c(k)\cos(\theta_m)\cos(\phi_m)\right]}\,,$  (15)

where $i_r(k)$ and $i_c(k)$ are the row and column indices of the $k$-th antenna in the antenna array respectively. As for the NLOS paths, we use the Rayleigh fading model to model these path components collectively:

$\mathbf{h}_m^{\text{NLOS}} \sim \mathcal{CN}(0, \mathbf{I})\,.$  (16)

We assume a maximum transmission power level of $P_{\text{BS}}$ for each BS and $P_{\text{UE}}$ for the UE, and a noise level of $\sigma^2$ both at the BSs and at the UE. We use an uplink pilot stage for the BSs to infer the wireless channel, thereby allowing them to perform the downlink beamforming or localization tasks. Specifically, the UE sends $T$ uplink pilot signals, which each BS measures by applying a sequence of sensing vectors over its $K$ antennas. Let $\mathbf{c} = \{c^t\}_{t \in \{1 \dots T\}}$ be the set of uplink pilot signals sent by the UE, and $\mathbf{V} = \{\mathbf{v}_m^t\}_{m \in \{1 \dots M\},\, t \in \{1 \dots T\}}$ be the set of sensing vectors employed by all the BSs, where $\mathbf{v}_m^t \in \mathbb{C}^K$ is the sensing vector adopted by the $m$-th BS for receiving the $t$-th pilot. According to the power constraint, we have $|c^t|^2 = P_{\text{UE}},\ \forall t$. Also, we require $\|\mathbf{v}_m^t\|_2^2 = 1,\ \forall m, t$. For the $t$-th uplink pilot the UE transmits, the $m$-th BS obtains the measurement $r_m^t \in \mathbb{C}$:

$r_m^t = (\mathbf{v}_m^t)^{\mathrm{H}} \mathbf{h}_m c^t + n_m^t\,,$  (17)

where $n_m^t \sim \mathcal{CN}(0, \sigma^2)$ is the noise experienced at the $m$-th BS when receiving the $t$-th uplink pilot transmission through $\mathbf{v}_m^t$.
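For concreteness, the following sketch (a minimal NumPy illustration of (13)-(17) for a single BS; the planar-array indexing, the half-wavelength antenna spacing, and all function names are our own assumptions) generates a Rician channel and the corresponding pilot measurements:

```python
import numpy as np

def steering_vector(theta, phi, rows, cols, delta_over_lambda=0.5):
    """Steering vector of eq. (15) for a rows-x-cols planar array;
    the spacing-to-wavelength ratio is an assumed value."""
    i_r, i_c = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    phase = 2 * np.pi * delta_over_lambda * (
        i_r * np.sin(theta) * np.cos(phi) + i_c * np.cos(theta) * np.cos(phi))
    return np.exp(1j * phase).reshape(-1)          # length K = rows * cols

def rician_channel(theta, phi, pathloss, rician_eps, rows, cols):
    """Channel of eq. (13): pathloss-scaled mix of the LOS steering vector
    and a Rayleigh-faded NLOS component, eq. (16)."""
    K = rows * cols
    h_los = steering_vector(theta, phi, rows, cols)
    h_nlos = (np.random.randn(K) + 1j * np.random.randn(K)) / np.sqrt(2)
    return pathloss * (np.sqrt(rician_eps / (1 + rician_eps)) * h_los +
                       np.sqrt(1 / (1 + rician_eps)) * h_nlos)

def pilot_measurements(h, V, pilots, noise_power):
    """Received measurements r^t of eq. (17) at one BS, for sensing
    vectors V (T x K) and pilot symbols of length T."""
    T = len(pilots)
    noise = np.sqrt(noise_power / 2) * (np.random.randn(T) + 1j * np.random.randn(T))
    return (V.conj() @ h) * pilots + noise         # (v^t)^H h c^t + n^t
```

Stacking such measurement vectors over all $M$ BSs gives the problem input $\mathbf{p}$ defined next in (18).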

We study the transfer learning problem between two tasks: the downlink beamforming task $\mathcal{S}$ and the UE localization task $\mathcal{T}$, both of which take the uplink pilot measurements as the input. Specifically, we assume both the UE uplink pilots $\mathbf{c} = \{c^t\}_{t \in \{1 \dots T\}}$ and the sensing vectors adopted at each BS, $\mathbf{V} = \{\mathbf{v}_m^t\}_{m \in \{1 \dots M\},\, t \in \{1 \dots T\}}$, are fixed. Correspondingly, the problem inputs are:

$\mathbf{p} \leftarrow \{r_m^t\}_{m \in \{1 \dots M\},\, t \in \{1 \dots T\}}\,.$  (18)

We now provide detailed descriptions of the source task and the target task as follows.

III-C1 Source Task — Downlink Beamforming

The downlink beamforming task focuses on finding the optimal digital downlink beamformers at all BSs to collaboratively maximize the signal power received at the UE, or equivalently, the signal-to-noise ratio (SNR) at the UE. Specifically, let $\mathbf{B} = \{\mathbf{b}_m\}_{m \in \{1 \dots M\}}$ denote the set of digital beamformers to be optimized, where $\mathbf{b}_m \in \mathbb{C}^K$ is the downlink beamformer employed by the $m$-th BS, with $\|\mathbf{b}_m\|_2^2 = 1$. Corresponding to the definitions in Section II-A, the source-task optimization variables $\mathbf{x}_s$ are the set of beamformers $\mathbf{B}$:

$\mathbf{x}_s^m \leftarrow \mathbf{b}_m \quad \forall m$  (19)

$\mathbf{x}_s \leftarrow \{\mathbf{x}_s^m\}_{m \in \{1 \dots M\}}\,.$  (20)

Given the set of optimized beamformers $\mathbf{B}$, the SNR at the UE for downlink transmission, which is the objective for $\mathcal{S}$, is computed as:

$u_s(\mathbf{x}_s) = \dfrac{P_{\text{BS}} \left|\sum_{m=1}^{M} \mathbf{b}_m^{\mathrm{H}} \mathbf{h}_m\right|^2}{\sigma^2}\,.$  (21)

III-C2 Target Task — Localization

The UE localization task focuses on estimating the unknown UE location based on the collection of uplink pilot measurements from all BSs. Specifically, given the problem inputs $\{r_m^t\}_{m \in \{1 \dots M\},\, t \in \{1 \dots T\}}$, the optimization variables for $\mathcal{T}$, i.e., $\mathbf{x}_t$, are the estimates of the x-coordinate and y-coordinate of the UE location, which we denote by $\hat{x}_{\text{UE}}$ and $\hat{y}_{\text{UE}}$. Therefore, we have:

$\mathbf{x}_t \leftarrow (\hat{x}_{\text{UE}},\, \hat{y}_{\text{UE}})\,.$  (22)

Naturally, the objective of the localization task is to minimize the location estimation error. Using the Euclidean distance as the metric, the utility function is:

$u_t(\mathbf{x}_t) = -\sqrt{(x_{\text{UE}} - \hat{x}_{\text{UE}})^2 + (y_{\text{UE}} - \hat{y}_{\text{UE}})^2}\,.$  (23)

Note that we define the utility function as the negative value of the estimation error in Euclidean distance. With this definition, we will aim to maximize the utility function utsubscript𝑢𝑡u_{t}italic_u start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, which in turn would minimize the localization error and lead to a high localization accuracy.
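Both task utilities can be evaluated directly from the beamformers and the location estimate. A minimal sketch follows (it assumes the magnitude-squared form of the coherent beamforming gain in (21) as reconstructed above, and uses hypothetical function names):

```python
import numpy as np

def beamforming_snr(B, H, P_bs, noise_power):
    """Source-task utility of eq. (21): downlink SNR at the UE for
    unit-norm beamformers B (M x K) and channels H (M x K)."""
    coherent_gain = np.sum(np.sum(B.conj() * H, axis=1))   # sum_m b_m^H h_m
    return P_bs * np.abs(coherent_gain) ** 2 / noise_power

def localization_utility(xy_true, xy_est):
    """Target-task utility of eq. (23): negative Euclidean estimation error."""
    return -np.linalg.norm(np.asarray(xy_true) - np.asarray(xy_est))
```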

IV Transfer Learning with Reconstruction Loss

We now present the proposed transfer learning method, which is effective for a wide range of problem domains and is applicable to a variety of deep neural network architectures. The proposed method is target-task agnostic as long as the target task shares the same problem input. As a result, it can be applied to arbitrary target tasks on the fly with minimal additional training. Furthermore, compared to conventional transfer learning approaches, the proposed method introduces only a relatively low amount of additional parameter and computational complexity, and only during source task training. During target task training, testing, and deployment, the proposed method does not introduce any additional parameters or computational complexity.

IV-A Information within Neural Network Computation Flow

A neural network consists of consecutive hidden layers of neurons computing non-linear functions (i.e. activations), forming a computation flow. For a regular neural network learning one specific input-to-output mapping, the features computed at each hidden layer follow a general information flow pattern: from the input to the output, the amount of information describing the input gradually reduces layer by layer, while only the information necessary for predicting the output is maintained [35, 36, 37]. While being efficient for learning a single mapping, this information flow pattern may not be desirable in the transfer learning setting. Instead, we desire the features learned by the model to be generalizable and retain sufficient information for being transferable to new target tasks. In other words, the learned model features need to contain the common information among tasks of interest in order to be effective for transfer learning.

For transfer learning in CV or NLP applications, it is relatively clear which features are likely to hold the common information. Specifically, with highly structured inputs, the neural network computation flow learned under regular training is likely to have structure already: the entire flow can be divided into a feature learning stage and an optimization stage. Taking convolutional neural networks for CV tasks as an example, the feature learning stage consists of the convolutional layers that compute general high-level features containing the common information for many relevant CV tasks (such as edge patterns, pixel intensities, or color gradients over the input image), followed by the optimization stage consisting of fully connected layers that process these high-level features and compute the task-specific outputs (e.g., classification class scores). To conduct transfer learning over correlated CV tasks, the convolutional layers are shared among the models as the feature learning stage, while the fully connected layers of each model are further trained on a per-task basis [12, 14, 15].

However, for general mathematical optimization problems, the inputs lack clear structures in most cases. Under regular training methods, the resulting neural network computation flow is not clearly divided into stages, with internal features gradually becoming more and more task-specific layer by layer. As a result, it is difficult to identify or explicitly encourage the learning of transferable features, or equivalently, to obtain features that incorporate the desired common information.

IV-B Source Task Training with Added Reconstruction Loss

To tackle the challenges of transfer learning for mathematical optimization problems, we propose a novel transfer learning approach to encourage the learning of transferable features. Specifically, we first identify a proper selection of the common information for the source task and the target task. When training the neural network model on the source task, on top of the regular task-based loss, we introduce in addition a loss term for the common information reconstruction.

Figure 1: Transfer Learning with Reconstruction Loss.

Fig. 1 illustrates the proposed transfer learning approach. We adopt the most general fully-connected neural network architecture. Within the neural network, we select a hidden layer as the feature layer where we encourage the transferable features to be computed. Correspondingly, the part of the neural network computation flow from the input layer to the feature layer forms the feature learning stage, while the part of the computation flow from the feature layer to the output layer forms the optimization stage. There is no particular constraint on which hidden layer to select as the feature layer, as long as there is sufficient transformation capacity (by having a sufficient number of hidden layers) both from the input layer to the selected layer and from the selected layer to the output layer.

From the selected feature layer, we add a reconstruction stage as a separate branch in the computation flow, in parallel to the optimization stage. The reconstruction stage aims to reconstruct the common information $\mathcal{I}(\mathbf{p})$ using the features in the feature layer. We denote the corresponding reconstruction loss by $\mathcal{L}_R$. Together with the original loss associated with optimizing the source task utility $u_s(\mathbf{x}_s)$, which we denote by $\mathcal{L}_S$, the loss function $\mathcal{L}$ that we use to train $\mathcal{F}_{\Theta_s}$ is:

$\mathcal{L} = \mathcal{L}_S + \alpha \mathcal{L}_R$  (24)

where $\alpha$ is the relative weighting scalar between the two loss terms.

With $\mathcal{F}_{\Theta_s}$ trained on $\mathcal{L}$ as in (24), the feature layer computes features that are pertinent to the source task optimization, while also containing knowledge of the common information $\mathcal{I}(\mathbf{p})$. Therefore, these features are highly general and transferable to different tasks that are relevant to the source task. We emphasize that when the choice of $\mathcal{I}(\mathbf{p})$ is generic (e.g., by choosing the problem inputs $\mathbf{p}$ as the common information), the proposed approach is target-task agnostic, since no prior knowledge of the target task is needed throughout the training procedure.
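To make the training procedure concrete, the following is a minimal PyTorch sketch of source-task training with the added reconstruction branch and the combined loss in (24). The layer sizes, optimizer settings, and the placeholder source-task loss are illustrative assumptions, not the exact configurations used in our experiments.

```python
import torch
import torch.nn as nn

class SourceModel(nn.Module):
    def __init__(self, in_dim, feat_dim, out_dim, info_dim):
        super().__init__()
        # Feature learning stage: input layer -> feature layer
        self.feature_stage = nn.Sequential(
            nn.Linear(in_dim, 2 * feat_dim), nn.ReLU(),
            nn.Linear(2 * feat_dim, feat_dim), nn.ReLU())
        # Optimization stage: feature layer -> source-task output x_s
        self.optimization_stage = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, out_dim))
        # Reconstruction stage: parallel branch reconstructing I(p)
        self.reconstruction_stage = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, info_dim))

    def forward(self, p):
        features = self.feature_stage(p)
        return self.optimization_stage(features), self.reconstruction_stage(features)

model = SourceModel(in_dim=100, feat_dim=25, out_dim=1, info_dim=100)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
alpha = 5.0                          # weighting between L_S and L_R, as in (24)

p = torch.randn(64, 100)             # a batch of problem inputs (placeholder data)
common_info = p.clone()              # here I(p) <- p, as in (25) and (26)
x_s, info_hat = model(p)
loss_S = x_s.pow(2).mean()           # placeholder standing in for the task-based loss L_S
loss_R = nn.functional.mse_loss(info_hat, common_info)
loss = loss_S + alpha * loss_R       # total source-task training loss (24)
loss.backward()
optimizer.step()
```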

IV-C Transfer Learning by Sharing Feature Layer

After training the neural network $\mathcal{F}_{\Theta_s}$ on the source task $\mathcal{S}$ as described in Section IV-B, transfer learning on the target task $\mathcal{T}$ is straightforward. We first transfer the subset of the neural network parameters $\Theta_s$ up to the feature layer, as the shared feature learning stage, to the neural network parameters $\Theta_t$, and leave the remaining parameters in $\Theta_t$ unassigned. When training $\mathcal{F}_{\Theta_t}$, we freeze all the transferred parameters so that they remain unchanged as the already-optimized feature learning stage. With the proposed transfer learning approach, through the transferred feature learning stage, the features computed in the feature layer of $\mathcal{F}_{\Theta_t}$ are already valuable for optimizing $\mathcal{T}$ before any target task training begins.

For the parameters after the feature layer in $\Theta_t$, we train them with the regular loss associated with the target task utility $u_t(\mathbf{x}_t)$, which we denote by $\mathcal{L}_T$. This further training produces the task-specific optimization stage in $\Theta_t$. With the number of trainable parameters greatly reduced, along with the fact that the features from the feature layer are already effective for $\mathcal{T}$, only a small amount of additional training data is needed to obtain a well-performing model $\mathcal{F}_{\Theta_t}$. In essence, the computational complexity of target-task training is significantly reduced compared to the conventional transfer learning approach.

We emphasize that the selection of $\mathcal{L}_T$ is not required during the source task training phase specified in Section IV-B. Instead, the proper $\mathcal{L}_T$ only needs to be decided after the target task $\mathcal{T}$ arises. Therefore, the specification of $\mathcal{L}_T$ does not affect the target-task-agnostic property of the proposed approach.

For the actual testing or deployment on either $\mathcal{S}$ or $\mathcal{T}$, only the feature learning stage and the corresponding optimization stage of the neural network model need to be executed. Therefore, the proposed approach does not introduce any additional parameters or computational complexity into the model for actual inference. In fact, our approach only increases the model complexity during the source task training. Since in practice, as the initial stage of model development, training the model on $\mathcal{S}$ is usually done offline with sufficient time and data resources, the extra complexity from our approach is likely negligible.
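A minimal sketch of the transfer step follows, continuing the SourceModel sketch above: the trained feature learning stage is reused and frozen, and only the target-task optimization stage is trained. The names and dimensions are again illustrative assumptions.

```python
import torch
import torch.nn as nn

class TargetModel(nn.Module):
    def __init__(self, feature_stage, feat_dim, out_dim):
        super().__init__()
        self.feature_stage = feature_stage        # transferred, shared stage
        self.optimization_stage = nn.Sequential(  # task-specific stage, trained from scratch
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, out_dim))

    def forward(self, p):
        return self.optimization_stage(self.feature_stage(p))

# Reuse the trained feature learning stage from the source model and freeze it.
target_model = TargetModel(model.feature_stage, feat_dim=25, out_dim=1)
for param in target_model.feature_stage.parameters():
    param.requires_grad = False

# Only the small optimization stage is updated on the target-task loss L_T.
optimizer_t = torch.optim.Adam(target_model.optimization_stage.parameters(), lr=1e-3)
```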

IV-D Selecting the Common Information

According to the definition of the common information, selecting the proper common information for a given set of tasks is non-trivial. Fortunately, as mentioned in Section II-B, we can always resort to using the problem inputs $\mathbf{p}$ as the common information when no alternative selection is apparent (which also makes the proposed approach target-task agnostic, as previously discussed). In the following, we examine the three applications studied in the numerical simulations and propose a proper choice of the common information for each application.

IV-D1 MNIST Digits Classification

As described in Section III-A, $\mathcal{S}$ focuses on identifying the digit 1 while $\mathcal{T}$ focuses on identifying the digit 8, over images of the handwritten digits 0, 1, and 8. With the inputs $\mathbf{p}$ being high-dimensional images, the common information among tasks of identifying different digits is highly complex, involving the detection of various pixel patterns and their relative locations in the images. Summarizing this common information in a concise form is difficult if not impossible. Therefore, we select the original problem inputs $\mathbf{p}$, i.e. the handwritten digit images, as the common information for this application:

$\mathcal{I}(\mathbf{p}) \leftarrow \mathbf{p}$  (25)

IV-D2 D2D Wireless Network Power Control

With $\mathcal{S}$ focusing on sum-rate maximization and $\mathcal{T}$ focusing on min-rate maximization, both power control tasks largely depend on the mutual interference among all $N$ D2D links. Therefore, the entire set of $N^2$ wireless channels (with $N$ direct-link channels and $N \times (N-1)$ cross-link channels) needs to be considered when optimizing both $\mathcal{S}$ and $\mathcal{T}$. Correspondingly, we select all the channel gains $\mathbf{G} = \{g_{ij}\}_{i,j \in \{1 \dots N\}}$, which are the problem inputs $\mathbf{p}$, as the common information for this application:

$\mathcal{I}(\mathbf{p}) \leftarrow \mathbf{p}$  (26)

IV-D3 MISO Wireless Network Beamforming and Localization

Unlike the two applications above, the common information between $\mathcal{S}$ and $\mathcal{T}$ for this application can be summarized in a concise form. Both downlink beamforming and UE localization implicitly require channel estimation. With the channel model as in (13), apart from the unknown stochastic NLOS paths as in (16), the downlink channels are largely determined by the following geometric parameters:

  (i) The set of distances from the UE to all $M$ BSs, $\{d_m\}_{m \in \{1 \dots M\}}$.

  (ii) The set of azimuth AoAs from the UE to all $M$ BSs, $\{\theta_m\}_{m \in \{1 \dots M\}}$.

  (iii) The set of elevation AoAs from the UE to all $M$ BSs, $\{\phi_m\}_{m \in \{1 \dots M\}}$.

For the source task $\mathcal{S}$ on downlink beamforming, the optimal beamformers that maximize (21) are those perfectly aligned with the downlink channels. Since the geometric parameters (i)-(iii) completely determine the deterministic components of the channels as in (13)-(15), the optimization of $\mathbf{x}_s$ also largely depends on the knowledge of these geometric parameters. On the other hand, for the target task $\mathcal{T}$ on UE localization, the UE location estimate ($\hat{x}_{\text{UE}}$, $\hat{y}_{\text{UE}}$, $\hat{z}_{\text{UE}}$) can be obtained through triangulation from the fixed BS locations, again using the geometric parameters (i)-(iii). Therefore, these geometric parameters collectively serve as a good choice for the common information:

$\mathcal{I}(\mathbf{p}) \leftarrow \{d_m, \theta_m, \phi_m\}_{m \in \{1 \dots M\}}\,.$  (27)

As explained previously, the common information is only required during the source task training stage: it serves as the target for the reconstruction loss $\mathcal{L}_R$ in (24), as shown in Fig. 1. For the target task training, and more importantly the actual testing of the model, there is no need to collect the common information. Correspondingly, for the MISO application, we only need to prepare the geometric parameters (i)-(iii) in the source task training set.

V Numerical Simulations

We present the numerical simulation results of the proposed transfer learning method on each of the three applications introduced in Section III. For each application, we introduce two neural network based benchmarks. To allow fair comparisons that illustrate the effectiveness of the proposed method, the neural network models trained under the two benchmark methods use the same network architecture and hidden-neuron specification for the feature learning stage and the optimization stage as the model trained with the proposed approach. The two neural network based benchmark methods are as follows:

  • Conventional Transfer Learning: Train $\mathcal{F}_{\Theta_s}$ for the source task on the loss $\mathcal{L}_S$, then transfer all the parameters in $\Theta_s$ up to the feature layer to $\Theta_t$, followed by training the parameters after the feature layer in $\Theta_t$ on $\mathcal{L}_T$.

  • Regular Learning: Train $\mathcal{F}_{\Theta_s}$ and $\mathcal{F}_{\Theta_t}$ on the losses $\mathcal{L}_S$ and $\mathcal{L}_T$ respectively, without any knowledge transfer (i.e. parameter sharing) in between.

In the following, we refer to the proposed transfer learning method as Transfer Learning with Reconstruction. Note that as discussed earlier in this paper, the first benchmark method listed above, which we refer to as the conventional transfer learning method, is the most popular transfer learning method adopted in the literature [12, 14, 15, 16, 17, 18, 19].

We emphasize that, by definition, both the conventional transfer learning method and the regular learning method lead to the same training updates on $\Theta_s$ during source task training (since both methods update the entire set of $\Theta_s$ through gradient descent solely on $\mathcal{L}_S$). Thus, the model performances on the source task utility $u_s$ are the same for these two methods. Correspondingly, in the simulation results below, we present a single value of $u_s$ as the performance achieved by both benchmark methods.

One notable advantage of applying transfer learning, and especially the proposed transfer learning approach, is to mitigate over-fitting when training on limited target task data. On the other hand, early stopping [38] based on validation loss is also a popular technique for combating over-fitting, which however requires a sufficiently large validation set that is often not available for the target task. Nonetheless, for our simulations, we reserve large validation sets on $\mathcal{T}$ and employ early stopping when training each competing neural network model. Correspondingly, each model is evaluated at its best possible performance for each training method, and the performance margins therefore fully demonstrate the effectiveness of the knowledge transfer achieved by the proposed method.
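For reference, the validation-based early stopping used in the simulations can be sketched as follows; the train_one_epoch and evaluate helpers are placeholders (assumptions), not code from our implementation.

```python
import copy

def train_with_early_stopping(model, train_one_epoch, evaluate, patience=20, max_epochs=1000):
    """Keep the parameters with the best validation loss; stop after `patience`
    epochs without improvement. `train_one_epoch` and `evaluate` are placeholders."""
    best_val, best_state, epochs_since_best = float("inf"), None, 0
    for _ in range(max_epochs):
        train_one_epoch(model)
        val_loss = evaluate(model)
        if val_loss < best_val:
            best_val = val_loss
            best_state = copy.deepcopy(model.state_dict())
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                break
    model.load_state_dict(best_state)   # restore the best validated parameters
    return model
```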

V-A Transfer Learning MNIST Digits Classification

V-A1 Simulation Settings

We take the original MNIST training and evaluation data sets and keep only the images of the handwritten digits 0, 1, and 8, as discussed in Section III-A. To better understand the target task training data efficiency of each method (which reflects how effective the transfer learning is), we conduct training and testing under two data-set specifications: one with highly limited target task training data, referred to as Data-Spec A, and one with significantly more target task training data (around 10 times the size of that in Data-Spec A), referred to as Data-Spec B. For Data-Spec A, we divide all the training data according to a 95%-5% split between $\mathcal{S}$ and $\mathcal{T}$, and adopt a 70%-30% training-validation split for the data in $\mathcal{S}$ and a 10%-90% training-validation split for the data in $\mathcal{T}$. For Data-Spec B, we divide all the training data according to a 90%-10% split between $\mathcal{S}$ and $\mathcal{T}$, with a 70%-30% training-validation split for the data in $\mathcal{S}$ and a 50%-50% training-validation split for the data in $\mathcal{T}$. These split ratios result in the data set sizes shown in Table I. The unusual training-validation split ratios on $\mathcal{T}$ in both data specifications are chosen for two reasons: to reflect the assumption that training data is usually limited for the target task, and to ensure a sufficiently large validation set for accurate early stopping in the target task training.

TABLE I: MNIST Data Set Specifications

Data Set Specification | Task | Training Set Samples | Validation Set Samples
Data-Spec A | $\mathcal{S}$ | 12313 | 5277
Data-Spec A | $\mathcal{T}$ | 92 | 834*
Data-Spec B | $\mathcal{S}$ | 11664 | 5000
Data-Spec B | $\mathcal{T}$ | 926 | 926*

* A large validation set is used here to ensure accurate early stopping.

As already mentioned in Section IV-A, with image inputs, convolutional neural networks naturally form computation flows with a clear structural distinction between the feature learning stage and the optimization stage. However, for more general mathematical optimization problems, there is no established neural network architecture that provides computation flows with such clear distinctions, which is the challenge that the proposed method aims to address. As this MNIST digit classification problem is used for illustrative purposes, the image inputs are treated as general inputs without directly exploiting their spatial structure. Specifically, we flatten the image inputs into one-dimensional vectors (we down-sample each image to 10×10 pixels before flattening, to keep the dimension manageable; simulation results suggest that such down-sampling only mildly affects the classification performance of all methods) and adopt the fully-connected neural network architecture for all neural network models.

The same overall neural network specification is used by all the competing methods, as shown in Table II. We use the same specification for the feature learning stage in $\Theta_t$ and $\Theta_s$ to support the transfer of trained parameters. Furthermore, in this application, since $\mathbf{x}_s$ and $\mathbf{x}_t$ have the same dimension (i.e. a single scalar output), we also use the same specification for the optimization stage in $\Theta_s$ and $\Theta_t$. We note that 25 features are used in the feature layer of each neural network model. Furthermore, each neural network outputs a single value between 0 and 1 (enforced by the sigmoid non-linearity), as the probability of the input handwritten image representing the digit 1 for $\mathcal{S}$ or the digit 8 for $\mathcal{T}$.

In terms of the selection of the source-task loss function $\mathcal{L}_S$ and the target-task loss function $\mathcal{L}_T$, directly optimizing $u_s$ or $u_t$ as in (7) or (8) is not feasible, since (7) (or (8)) is not a differentiable function of $\mathbf{x}_s$ (or $\mathbf{x}_t$), and therefore no gradient can be derived for the neural network parameter updates. Instead, we use the popular cross-entropy function as the task-based loss. With $\mathbf{p}_{\text{digit}}$ being the true digit for $\mathbf{p}$, the (supervised-learning) loss functions for the two tasks are as follows:

$\mathcal{L}_S = -\mathds{1}(\mathbf{p}_{\text{digit}} = 1)\log(\mathbf{x}_s) - \mathds{1}(\mathbf{p}_{\text{digit}} \neq 1)\log(1 - \mathbf{x}_s)\,,$  (28)
$\mathcal{L}_T = -\mathds{1}(\mathbf{p}_{\text{digit}} = 8)\log(\mathbf{x}_t) - \mathds{1}(\mathbf{p}_{\text{digit}} \neq 8)\log(1 - \mathbf{x}_t)\,,$  (29)

where $\mathds{1}(\cdot)$ denotes the standard binary indicator function. We use $\alpha = 5$ in (24) for the loss $\mathcal{L}$, which provides the best performance for the proposed approach during hyper-parameter tuning.
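As an illustration, the per-sample losses (28) and (29) can be written as the following sketch, where the model output is interpreted as the probability that the image shows the task's digit (digit 1 for $\mathcal{S}$, digit 8 for $\mathcal{T}$); the tensor names are assumptions.

```python
import torch

def task_loss(prob, true_digit, task_digit):
    # prob: model outputs in (0, 1); true_digit: integer labels of the images
    is_task_digit = (true_digit == task_digit).float()   # indicator 1(p_digit = task digit)
    return -(is_task_digit * torch.log(prob)
             + (1.0 - is_task_digit) * torch.log(1.0 - prob)).mean()

# L_S in (28) uses task_digit=1 with the source model output x_s;
# L_T in (29) uses task_digit=8 with the target model output x_t.
```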

TABLE II: MNIST Neural Network Architecture

Stage | Layer | Number of Neurons
Feature Learning Stage ($\Theta_s$ or $\Theta_t$) | 1st | 50
Feature Learning Stage ($\Theta_s$ or $\Theta_t$) | Feature Layer | 25
Optimization Stage ($\Theta_s$ or $\Theta_t$) | 1st | 10
Optimization Stage ($\Theta_s$ or $\Theta_t$) | Output Layer | 1
Reconstruction Stage (only $\Theta_s$) | 1st | 50
Reconstruction Stage (only $\Theta_s$) | Reconstruct Layer | 100*

* The output dimension corresponds to the down-sampled image dimension.

V-A2 Transfer Learning Performances

We present the classification accuracies of all the competing methods, trained and evaluated under both data-set specifications, in Table III. First, the performances of all the methods are about the same on $\mathcal{S}$ (under both data-set specifications), indicating that with the proposed method, the source task performance trade-off due to optimizing the extra reconstruction loss $\mathcal{L}_R$ is minimal. At this very small cost, the performance advantages that our approach achieves on the target task are significant, as evidenced by the noticeable margins in the classification accuracies on $\mathcal{T}$.

We next examine the evaluation performances on $\mathcal{T}$ under Data-Spec A: the proposed approach achieves the best performance at 97.4% prediction accuracy, with 4.4% and 2.4% margins over the two benchmark methods. To understand the significance of these margins, we further compare the evaluation results on $\mathcal{T}$ across the two data-set specifications. As shown in Table III, the performance of the proposed approach under Data-Spec A matches that of the two benchmark methods under Data-Spec B. With the number of training samples in $\mathcal{T}$ being 92 and 926 under Data-Spec A and B respectively, the results show that to reach the prediction accuracy that our approach achieves with fewer than 100 samples, the two benchmark methods need a 10-fold increase in target task training data. This comparison validates the significance of the reported performance margins under Data-Spec A, and suggests that our approach indeed achieves high data efficiency in training as a result of effective knowledge transfer from $\mathcal{S}$ to $\mathcal{T}$. Lastly, focusing on the classification accuracies on $\mathcal{T}$ under Data-Spec B, with more training data available, the conventional transfer learning approach has already lost its edge over the regular learning method, while our proposed approach still produces the best result. Overall, these results indicate that the proposed approach effectively addresses the challenge of transfer learning when the neural network lacks structure in its information flow, by explicitly enforcing the learning of transferable features.

TABLE III: MNIST Transfer Learning Performances

Task | Method | Accuracy under Data-Spec A | Accuracy under Data-Spec B
$\mathcal{S}$: Digit 1 | Regular Learning / Conventional Transfer Learning | 99.7% | 99.7%
$\mathcal{S}$: Digit 1 | Transfer Learning with Reconstruction | 99.7% | 99.6%
$\mathcal{T}$: Digit 8 | Regular Learning | 93.0% | 97.8%
$\mathcal{T}$: Digit 8 | Conventional Transfer Learning | 95.0% | 97.1%
$\mathcal{T}$: Digit 8 | Transfer Learning with Reconstruction | 97.4% | 98.6%

V-B Transfer Learning on D2D Wireless Network Power Control

V-B1 Simulation Settings

We simulate wireless networks each containing $N = 10$ D2D links randomly deployed in a 150m×150m region. We first generate the transmitter locations uniformly within the region, and then generate the receiver locations such that the direct-channel transceiver distances follow a uniform distribution over the interval 5m~25m. We impose a minimum distance of 5m between any interfering transmitter and receiver. We assume each transmitter has a maximum transmission power of 30dBm with a direct-channel antenna gain of 6dB, while the noise level is -150dBm/Hz. We assume an available bandwidth of 5MHz with full frequency reuse across the entire wireless network.

To simulate wireless channels, we assume that the channel gain of each channel is determined by three components:

  • Path-Loss: modeled by the short-range outdoor model ITU-1411.

  • Shadowing: modeled by the log-normal distribution with 8dB standard deviation.

  • Fast Fading: modeled by Rayleigh fading with i.i.d circular Gaussian distribution of unit variance.

We collect the $N^2$ channel gains of each layout into an $N^2$-dimensional vector as the input $\mathbf{p}$. Similar to Section V-A, we use the same neural network specification in $\Theta_s$ and $\Theta_t$ for both the feature learning stage and the optimization stage (since $\mathbf{x}_s$ and $\mathbf{x}_t$ have the same dimension, equal to the number of links $N$). The same overall neural network specification is used by all the competing methods, as summarized in Table IV. With $N = 10$, the total numbers of trainable parameters (including both weights and biases) for the three stages of the neural network computation flow are as follows:

  • Feature Learning Stage ($\Theta_s$ or $\Theta_t$): 52900 parameters;

  • Optimization Stage ($\Theta_s$ or $\Theta_t$): 5070 parameters;

  • Reconstruction Stage (only for $\Theta_s$ trained by our proposed approach): 40300 parameters.

As the optimization stage has only a small number of parameters, training models on $\mathcal{T}$ via transfer learning requires little data.

TABLE IV: D2D Transfer Learning Neural Network Architecture ($N$: number of links)

Stage | Layer | Number of Neurons
Feature Learning Stage ($\Theta_s$ or $\Theta_t$) | 1st | $1.5N^2$
Feature Learning Stage ($\Theta_s$ or $\Theta_t$) | 2nd | $1.5N^2$
Feature Learning Stage ($\Theta_s$ or $\Theta_t$) | Feature Layer | $N^2$
Optimization Stage ($\Theta_s$ or $\Theta_t$) | 1st | $4N$
Optimization Stage ($\Theta_s$ or $\Theta_t$) | 2nd | $2N$
Optimization Stage ($\Theta_s$ or $\Theta_t$) | Output Layer | $N$
Reconstruction Stage (only $\Theta_s$) | 1st | $2N^2$
Reconstruction Stage (only $\Theta_s$) | Reconstruct Layer | $N^2$

To train each neural network, we formulate the (unsupervised-learning) loss functions $\mathcal{L}_S$ and $\mathcal{L}_T$ directly from the task utility values:

$\mathcal{L}_S = -u_s(\mathbf{x}_s)\,,$  (30)
$\mathcal{L}_T = -u_t(\mathbf{x}_t)\,,$  (31)

with $u_s(\mathbf{x}_s)$ and $u_t(\mathbf{x}_t)$ defined as in (11) and (12) respectively. Note that we define the loss functions by negating the utility functions, so that minimizing the losses maximizes the utilities. Furthermore, to train $\mathcal{F}_{\Theta_s}$ with our approach, we use $\alpha = 3$ in (24) for the loss $\mathcal{L}$, which provides the best performance during hyper-parameter tuning.
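For illustration, a sketch of the utility-based losses (30) and (31) is given below, assuming that the per-link rates in (11) and (12) take the usual $\log_2(1 + \text{SINR})$ form implied by the system model; the tensor shapes, noise power, and bandwidth value are assumptions made only for this sketch.

```python
import torch

def link_rates(G, power, noise_power=1e-12, bandwidth=5e6):
    # G: (batch, N, N) channel gains, with G[:, i, j] the gain from transmitter j to receiver i
    # power: (batch, N) transmit powers produced by the model
    signal = torch.diagonal(G, dim1=1, dim2=2) * power
    interference = (G * power.unsqueeze(1)).sum(dim=2) - signal
    sinr = signal / (interference + noise_power)
    return bandwidth * torch.log2(1.0 + sinr)              # per-link rates, shape (batch, N)

def loss_sum_rate(G, power):    # L_S in (30): negated sum rate
    return -link_rates(G, power).sum(dim=1).mean()

def loss_min_rate(G, power):    # L_T in (31): negated minimum rate
    return -link_rates(G, power).min(dim=1).values.mean()
```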

In terms of the data used for training under all competing methods, we use the data set specification listed in Table V. Each D2D wireless network sample is generated according to the wireless network simulation settings mentioned earlier. We note that the data set sizes in Table V are smaller than the number of trainable neural network parameters, especially for the training data on $\mathcal{T}$. Similar to Section V-A, we select these small data sets to illustrate that new target tasks can be adapted on the fly with minimal training overhead, as well as to show that the proposed approach is effective in knowledge transfer and robust to over-fitting. We use relatively large validation sets to ensure accurate early stopping when training the model under each method. However, sufficient validation data may not be available when training the model on $\mathcal{T}$ in realistic scenarios. Therefore, in Section V-B2, we also include simulation results where no early stopping is performed during target task training under each method.

TABLE V: D2D Wireless Networks Data Set Specifications

Task | Training Set Samples | Validation Set Samples
$\mathcal{S}$ | $5\times10^5$ | 5000
$\mathcal{T}$ | 1000 | 5000*

* A large validation set is used to ensure accurate early stopping. It may not be available in a realistic scenario in which the target task data is limited.

For the test data set (on which both utilities $u_s$ and $u_t$ are evaluated), we generate 2000 new samples of D2D wireless networks to obtain performance statistics for all the methods.

V-B2 Transfer Learning Performances

Besides the two neural network based benchmarks, we also include the following traditional mathematical optimization algorithms as performance upper-bound baselines (at the cost of much higher computational complexity):

  • Geometric Programming (GP) [39]: mathematical algorithm for solving the min-rate optimization.

  • Fractional Programming (FP) [40]: mathematical algorithm for solving the sum-rate optimization.

We train and evaluate each neural network based method under the transfer learning direction Sum Rate ($\mathcal{S}$) → Min Rate ($\mathcal{T}$), while evaluating FP only on $\mathcal{S}$ and GP only on $\mathcal{T}$.

We present both the $\mathcal{S}$ and $\mathcal{T}$ performances, averaged over all 2000 test wireless networks, in Table VI. As a first observation, the conventional transfer learning approach performs worse than even the regular learning method, indicating that this D2D power control application indeed poses a challenging problem for transfer learning, and that the proposed approach is much needed for achieving knowledge transfer on such general mathematical optimization problems. We then focus on the results with early stopping, which require additional data reserved as the validation set on $\mathcal{T}$. As shown by the numerical results, when the target task training data is limited (1000 samples as in Table V), the transfer learning with reconstruction approach achieves the best target-task performance among the neural network based methods, with an 11% improvement over the regular learning approach and a 17% improvement over the conventional transfer learning approach. The proposed approach achieves these improvements while sacrificing minimal source-task performance as the trade-off: a 1% reduction in the sum-rate results compared to both neural network based benchmarks. This slight loss of performance on $\mathcal{S}$ is expected, since our approach uses the training loss in (24), which does not exclusively target the source-task utility.

TABLE VI: D2D Wireless Networks Transfer Learning Performances

Task | Method | Result with Early Stopping (Mbps) | Result without Early Stopping (Mbps)
$\mathcal{S}$: Sum-Rate | Regular Learning / Conventional Transfer Learning | 155.69 | N/A
$\mathcal{S}$: Sum-Rate | Transfer Learning with Reconstruction | 153.96 | N/A
$\mathcal{S}$: Sum-Rate | FP | 157.45 | N/A
$\mathcal{T}$: Min-Rate | Regular Learning | 5.39 | 5.32
$\mathcal{T}$: Min-Rate | Conventional Transfer Learning | 5.13 | 4.80
$\mathcal{T}$: Min-Rate | Transfer Learning with Reconstruction | 6.00 | 6.00
$\mathcal{T}$: Min-Rate | GP | 9.16 | N/A

V-B3 Learning Dynamics and Over-fitting

To understand the training dynamics behind the presented results, and to visualize if and how over-fitting occurs for each method, we provide the training curves on $\mathcal{S}$ and $\mathcal{T}$ for all the methods in Fig. 2. Note that for the proposed approach, we plot two losses on $\mathcal{S}$: the source task based loss $\mathcal{L}_S$ in (24), shown by the solid line, and the total loss $\mathcal{L}$ in (24), shown by the dotted line. As evidenced by the validation curves on $\mathcal{T}$ (bottom-right figure), while both the conventional transfer learning and the regular learning approaches plateau early in validation loss and then regress noticeably due to over-fitting, our approach enables the model to learn at a much more sustainable pace from the very limited training set of 1000 samples, without any noticeable over-fitting.

The effects of over-fitting are shown more clearly by the simulation results when no early stopping is performed during training on $\mathcal{T}$. Examining Table VI again, as shown by the results without early stopping, our approach maintains its performance on $u_t$ (indicating that the model does not over-fit throughout training) and enjoys larger margins on $\mathcal{T}$, with 13% and 25% improvements over the regular learning and conventional transfer learning approaches respectively. As this set of results is achieved without needing a large validation set on $\mathcal{T}$, the comparison is more relevant to practical implementations.

Figure 2: Training Curves for Transfer Learning in D2D Wireless Network Optimizations (the top two figures provide training and validation curves for the source task; the bottom two figures provide training and validation curves for the target task).

V-C Transfer Learning on MISO Downlink Wireless Networks

V-C1 Simulation Settings

For the MISO downlink wireless network application, we simulate each wireless network within a 3-dimensional confined region of 100m×100m×50m. We set up $M = 3$ BSs at fixed locations at the top level of the region: (0m, 0m, 50m), (0m, 100m, 50m), and (100m, 0m, 50m). Each BS is equipped with $K = 16$ antennas, arranged in a 4×4 two-dimensional array parallel to the x-y plane. These BS configurations stay the same across all generated MISO wireless network instances. In each MISO network, the UE is located at an unknown location uniformly generated within the planar region (50±40m, 50±40m, 0m). We assume a maximum transmission power of $P_{\text{BS}} = 40$ dBm for each BS and $P_{\text{UE}} = 30$ dBm for the UE, and a background noise level of -150 dBm/Hz.

For the wireless channel specifications, we use the following pathloss model in the decibel (dB) scale to model the pathloss components of the channels ($\rho(d_m)$ in (13)):

$\rho(d_m)_{\text{dB}} = -32.6 - 36.7\log_{10}(d_m)\,.$  (32)

Without loss of generality, we assume antenna spacing $\delta$ and signal wavelength $\lambda$ values such that $\frac{2\pi\delta}{\lambda} = 1$, and a Rician factor of $\epsilon = 10$. For channel estimation, the UE transmits $T = 4$ uplink pilots, and each BS employs four sensing vectors to receive them. The uplink pilots and the digital sensing vectors are all randomly generated by sampling from the circularly-symmetric complex normal distribution and then normalizing to the proper power levels. Specifically, the pilots are generated as follows:

$c^{t'} \sim \mathcal{CN}(0, \mathbf{I})$  (33)
$c^t = \sqrt{P_{\text{UE}}}\,\frac{c^{t'}}{|c^{t'}|} \quad \forall t\,,$  (34)

while the sensing vectors are generated as follows:

$\mathbf{v}_m' \sim \mathcal{CN}(0, \mathbf{I})$  (35)
$\mathbf{v}_m = \frac{\mathbf{v}_m'}{||\mathbf{v}_m'||_2} \quad \forall m, t\,.$  (36)
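A small NumPy sketch of the pilot and sensing-vector generation in (33)-(36) is given below; the array shapes and variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, K, M = 4, 16, 3                     # pilots, antennas per BS, number of BSs
P_UE = 10 ** (30 / 10) * 1e-3          # 30 dBm expressed in watts

# Uplink pilots: complex Gaussian draws normalized to the UE power, as in (33)-(34)
c_raw = (rng.standard_normal(T) + 1j * rng.standard_normal(T)) / np.sqrt(2)
pilots = np.sqrt(P_UE) * c_raw / np.abs(c_raw)

# Sensing vectors per BS and pilot slot, normalized to unit norm, as in (35)-(36)
v_raw = (rng.standard_normal((M, T, K)) + 1j * rng.standard_normal((M, T, K))) / np.sqrt(2)
sensing_vectors = v_raw / np.linalg.norm(v_raw, axis=-1, keepdims=True)
```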

As mentioned in Section III-C, we use a fixed set of pilots $\{c^t\}_{t \in \{1 \dots T\}}$ and sensing vectors $\{\mathbf{v}_m\}_{m \in \{1 \dots M\}}$ across all training and testing MISO wireless network instances. The models can therefore implicitly learn the knowledge of these pilots and sensing vectors and estimate the varying wireless channels $\mathbf{H}$ using only the inputs $\{r_m^t\}_{m \in \{1 \dots M\},\, t \in \{1 \dots T\}}$ as in (17).

As specified in (27), we select the three sets of geometric parameters as the common information: the distances $\{d_m\}_{m \in \{1 \dots M\}}$, the azimuth AoAs $\{\theta_m\}_{m \in \{1 \dots M\}}$, and the elevation AoAs $\{\phi_m\}_{m \in \{1 \dots M\}}$. In the actual implementation, we observe that in (15), $\{\theta_m\}$ and $\{\phi_m\}$ appear only through the terms $\sin(\theta_m)\cos(\phi_m)$ and $\cos(\theta_m)\cos(\phi_m)$. Therefore, in our simulations, we collect the following values as the common information:

$\mathcal{I}(\mathbf{p}) \leftarrow \{d_m,\, \sin(\theta_m)\cos(\phi_m),\, \cos(\theta_m)\cos(\phi_m)\}_{m \in \{1 \dots M\}}\,.$  (37)

Although the common information in (37) is not as concise as in (27), the simulation results suggest that using (37) as the common information in our proposed approach is still effective in obtaining highly competitive transfer learning performances.
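As an illustration, the common-information targets in (37) can be assembled from the BS and UE coordinates of a training sample as sketched below; the particular azimuth/elevation convention is an assumption used only for this sketch.

```python
import numpy as np

bs_locations = np.array([[0.0, 0.0, 50.0],
                         [0.0, 100.0, 50.0],
                         [100.0, 0.0, 50.0]])      # fixed BS locations (m)
ue_location = np.array([62.0, 41.0, 0.0])           # one sampled UE location (m)

diff = ue_location - bs_locations                   # (M, 3) displacement vectors
d = np.linalg.norm(diff, axis=1)                    # distances d_m
theta = np.arctan2(diff[:, 1], diff[:, 0])          # azimuth AoAs theta_m (assumed convention)
phi = np.arcsin(diff[:, 2] / d)                     # elevation AoAs phi_m (assumed convention)

# Reconstruction targets as in (37): [d_m, sin(theta_m)cos(phi_m), cos(theta_m)cos(phi_m)] per BS
common_info = np.stack(
    [d, np.sin(theta) * np.cos(phi), np.cos(theta) * np.cos(phi)], axis=1).reshape(-1)  # length 3M
```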

We provide in Table VII and Table VIII the data set specifications and the neural network architecture used by all the competing methods. We reiterate that for the MISO downlink application, $\mathbf{x}_s$ and $\mathbf{x}_t$ are the solutions for the optimal beamformers and the UE location estimate respectively. Therefore, unlike the previous two applications in Sections V-A and V-B, the optimization variables $\mathbf{x}_s$ and $\mathbf{x}_t$ have different dimensions: $\mathbf{x}_s \in \mathbb{R}^{2MK}$ contains the real and imaginary parts of the $M$ beamformers $\{\mathbf{b}_m\}_{m \in \{1 \dots M\}}$, while $\mathbf{x}_t \in \mathbb{R}^2$ contains the location coordinate estimate ($\hat{x}_{\text{UE}}$, $\hat{y}_{\text{UE}}$) of the UE. Correspondingly, we construct the optimization stages in $\Theta_s$ and $\Theta_t$ with different specifications, as shown in Table VIII. We also note that for the common information in (37), three values are collected for each of the $M$ BSs. Therefore, the dimension of the common information, which is also the output dimension of the reconstruction stage, is $3M = 9$. Specifically, the numbers of trainable parameters (including weights and biases) for the stages of $\Theta_s$ and $\Theta_t$ are as follows:

  • Feature Learning Stage ($\Theta_s$ or $\Theta_t$): 64150 parameters;

  • Optimization Stage in $\Theta_s$: 29896 parameters;

  • Optimization Stage in $\Theta_t$: 36702 parameters;

  • Reconstruction Stage (only for $\Theta_s$ trained by our proposed approach): 8059 parameters.

As shown above, with carefully crafted low-dimensional common information, the number of extra parameters introduced by our approach as the reconstruction stage is relatively low in the source-task training.

TABLE VII: MISO Wireless Networks Data Set Specifications

Task | Training Set Samples | Validation Set Samples
$\mathcal{S}$ | $1\times10^5$ | 5000
$\mathcal{T}$ | 500 | 5000*

* A large validation set is used to ensure accurate early stopping. It may not be available in a realistic scenario in which the target task data is limited.

TABLE VIII: MISO Transfer Learning Neural Network Architecture ($M$: Number of BSs; $K$: Number of Antennas per BS)

Stage | Layer | Number of Neurons
Feature Learning Stage ($\Theta_s$ or $\Theta_t$) | 1st | 150
Feature Learning Stage ($\Theta_s$ or $\Theta_t$) | 2nd | 150
Feature Learning Stage ($\Theta_s$ or $\Theta_t$) | 3rd | 150
Feature Learning Stage ($\Theta_s$ or $\Theta_t$) | Feature Layer | 100
Optimization Stage for $\Theta_s$ | 1st | 100
Optimization Stage for $\Theta_s$ | 2nd | 100
Optimization Stage for $\Theta_s$ | Output Layer | $2MK$
Optimization Stage for $\Theta_t$ | 1st | 125
Optimization Stage for $\Theta_t$ | 2nd | 100
Optimization Stage for $\Theta_t$ | 3rd | 75
Optimization Stage for $\Theta_t$ | 4th | 50
Optimization Stage for $\Theta_t$ | Output Layer | 2
Reconstruction Stage (only $\Theta_s$) | 1st | 50
Reconstruction Stage (only $\Theta_s$) | 2nd | 50
Reconstruction Stage (only $\Theta_s$) | Reconstruct Layer | $3M$

In terms of the selection of $\mathcal{L}_S$, unlike in Section V-B, we have found through simulations that formulating the source-task loss directly from the utility value $u_s$ (as in (21)) does not perform as well as training with supervised learning targets. For the source task $\mathcal{S}$ on downlink beamforming, the training targets are the perfect beamformers, i.e. the beamformers perfectly aligned with the wireless channels from each BS to the UE, which we denote by $\mathbf{B}^* = \{\mathbf{b}_m^*\}_{m \in \{1 \dots M\}}$ and obtain as follows:

$\mathbf{b}_m^* = \frac{\mathbf{h}_m}{||\mathbf{h}_m||_2} \quad \forall m\,,$  (38)

where $\mathbf{h}_m$ is the vector of actual channels from the $m$-th BS to the UE, as described in Section III-C. With $\mathbf{B}^*$ as the target labels, we formulate the source-task loss function $\mathcal{L}_S$ as the squared-error loss (a popular choice of loss function for regression tasks):

$\mathcal{L}_S = \sum_{m=1}^{M} ||\mathbf{b}_m^* - \mathbf{x}_s^m||_2^2\,,$  (39)

where 𝐱_s^m is the optimized beamformer for the m-th BS as defined in (19). The target task 𝒯, downlink localization, is naturally formulated as a regression task with the true UE location as the target. Correspondingly, we use the squared-error loss function ℒ_𝒯:

\mathcal{L}_\mathcal{T} = (x_\text{UE} - \hat{x}_\text{UE})^2 + (y_\text{UE} - \hat{y}_\text{UE})^2,   (40)

with x̂_UE and ŷ_UE being the UE location estimates forming the model's output 𝐱_t, as specified in (22). To train ℱ_{Θ_s} with our approach, we use α = 4 in (24) for the loss ℒ, which provided the best performance during hyper-parameter tuning.
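
As an illustration of how the loss terms in (38)-(40) and the α-weighted combination in (24) could be computed, below is a hedged PyTorch sketch; the tensor shapes, the stacked real/imaginary representation of the beamformers, and the function names are our own assumptions.

```python
# Hedged sketch of the losses in (38)-(40); shapes and names are illustrative assumptions.
import torch

def perfect_beamformers(h):
    # Eq. (38): align each BS's beamformer with its channel direction.
    # h: complex channels of shape (batch, M, N).
    b = h / torch.linalg.norm(h, dim=-1, keepdim=True)
    # Return real/imaginary parts concatenated so the labels match a real-valued network output.
    return torch.cat([b.real, b.imag], dim=-1)          # shape (batch, M, 2N)

def source_loss(b_star, x_s):
    # Eq. (39): squared error between the perfect beamformers and the optimized
    # beamformers x_s output by the model, summed over the M BSs.
    return torch.sum((b_star - x_s) ** 2, dim=(-2, -1)).mean()

def target_loss(loc_true, loc_est):
    # Eq. (40): squared Euclidean error between true and estimated UE coordinates (x, y).
    return torch.sum((loc_true - loc_est) ** 2, dim=-1).mean()

def source_training_loss(l_s, recon, common_info, alpha=4.0):
    # Combined source-task objective in the spirit of (24): source loss plus an
    # alpha-weighted reconstruction loss of the common information (alpha = 4 here).
    l_r = torch.sum((recon - common_info) ** 2, dim=-1).mean()
    return l_s + alpha * l_r
```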

V-C2 Transfer Learning Performances

For the downlink beamforming task 𝒮, we provide two additional baselines besides the two neural-network-based benchmarks, as follows:

  • Perfect Beamformers: assuming accurate knowledge of all wireless channels 𝐇 is available, the beamformers are designed to align with the channels from each BS to the UE, computed as in (38).

  • Random Beamformers: beamformers are generated by first drawing entries from a circularly-symmetric complex normal distribution and then normalizing each beamformer to unit power (a minimal sketch follows this list).
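
A minimal sketch of these two baselines is given below; the array shapes and the unit-variance draw are illustrative assumptions (the scale of the random draw is irrelevant after normalization).

```python
# Hedged sketch of the two beamforming baselines; shapes and names are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def perfect_baseline(h):
    # Perfect Beamformers: align each BS's beamformer with its channel, as in (38).
    # h: complex channels of shape (M, N).
    return h / np.linalg.norm(h, axis=-1, keepdims=True)

def random_baseline(M, N):
    # Random Beamformers: draw circularly-symmetric complex normal entries,
    # then normalize each beamformer to unit power.
    b = rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))
    return b / np.linalg.norm(b, axis=-1, keepdims=True)
```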

We test all the methods over 2000 newly generated testing MISO wireless networks, and present the results on 𝒮 and 𝒯 in Table IX, where we report the average SNR value as u_s(𝐱_s) in (21) and the average localization error as u_t(𝐱_t) in (23). Similar to Section V-B2, we also include in Table IX the target-task performances of each method when no early stopping is performed during training. Furthermore, we present in Fig. 4 the CDF curves of the localization errors u_t over the 2000 testing MISO wireless networks.

TABLE IX: MISO Downlink Wireless Networks Transfer Learning Performances
Task | Method | Result with Early Stopping | Result without Early Stopping
𝒮: Beamforming | Regular Learning / Conventional Transfer Learning | 36.47 dB | N/A
 | Transfer Learning with Reconstruction | 36.43 dB | N/A
 | Random Beamformers | 23.79 dB | N/A
 | Perfect Beamformers | 36.83 dB | N/A
𝒯: Localization | Regular Learning | 6.56 m | 7.79 m
 | Conventional Transfer Learning | 6.06 m | 6.07 m
 | Transfer Learning with Reconstruction | 5.59 m | 5.60 m
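
The CDF curves in Fig. 4 are simply empirical distributions of the per-sample localization errors over the 2000 test networks; a minimal sketch of how such a curve could be computed (the function name is our own) is:

```python
# Empirical CDF of per-sample localization errors (illustrative sketch).
import numpy as np

def empirical_cdf(errors):
    x = np.sort(np.asarray(errors))          # sorted localization errors (meters)
    y = np.arange(1, x.size + 1) / x.size    # cumulative fraction of test samples
    return x, y
```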
Figure 3: Training curves for transfer learning in MISO wireless network optimizations (the top two figures show the training and validation curves for the source task; the bottom two figures show the training and validation curves for the target task).

Figure 4: CDF of MISO network downlink localization errors (the further left a curve lies, the lower the localization errors).

Figure 5: Visualization of localization results on two randomly selected testing samples of MISO networks: (a) localization on MISO network sample #1; (b) localization on MISO network sample #2.

As suggested by the performance statistics in Table IX, the proposed transfer learning with reconstruction loss approach achieves the best target-task localization performance, with a 15% improvement over the regular learning method and an 8% improvement over the conventional transfer learning method. To visualize these performance margins, we provide in Fig. 5 the localization results of all the methods on two testing MISO wireless network samples.
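
Specifically, using the early-stopping values in Table IX, the gain over regular learning is (6.56 m - 5.59 m)/6.56 m ≈ 15%, and the gain over conventional transfer learning is (6.06 m - 5.59 m)/6.06 m ≈ 8%.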

V-C3 Learning Dynamics and Over-fitting

To show how the proposed approach excels, we provide in Fig. 3 the training and validation curves on both 𝒮 and 𝒯 for all the methods. Under limited target-task training data (500 samples, as specified in Table VII), the validation-loss curves in Fig. 3 clearly show that the model trained with the proposed approach exhibits little over-fitting and sustains a much more effective learning progress than the other two methods. These observed target-task learning dynamics confirm that the proposed approach effectively mitigates over-fitting in transfer learning.

The effects of over-fitting are further reflected in the comparison between the localization performances with and without early stopping under each method, as shown in Table IX. Similar to the previous two applications, the proposed method achieves the best target-task performance with a minimal performance tradeoff on the source task.
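
For reference, early stopping here means halting target-task training (and keeping the best checkpoint) once the validation loss stops improving. A minimal patience-based sketch, which is one common variant and not necessarily the exact criterion used in our experiments, is:

```python
# Minimal patience-based early-stopping loop (one common variant; illustrative only).
def train_with_early_stopping(model, train_one_epoch, validation_loss,
                              max_epochs=500, patience=20):
    best_val, best_state, stale_epochs = float("inf"), None, 0
    for _ in range(max_epochs):
        train_one_epoch(model)                    # one pass over the target-task training data
        val = validation_loss(model)              # loss on the held-out validation set
        if val < best_val:
            best_val, stale_epochs = val, 0
            best_state = {k: v.detach().clone() for k, v in model.state_dict().items()}
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break                             # validation loss has stagnated
    if best_state is not None:
        model.load_state_dict(best_state)         # restore the best validation checkpoint
    return model
```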

V-C4 Concise Representation of Common Information

We would also like to emphasize that these performance gains are achieved with the common information in a concise representation tailored specifically to the tasks, as in (27) or (37). Benefiting from such low-dimensional common information, the additional reconstruction stage, which the proposed approach introduces into Θ_s only during training for 𝒮, also enjoys low parameter and computational complexity relative to the feature learning stage and the optimization stage of the model. This application illustrates the effectiveness of the proposed method when domain knowledge is applied to selecting the task-specific common information.

VI Conclusion

Transfer learning has great potential and wide applicability in general mathematical optimization problems. However, when neural networks are used to learn such optimization mappings, it is challenging to learn or to identify the transferable features within the neural network computation flows, due to the lack of structure in the inputs and in the computation flows. This paper proposes a novel transfer learning approach for learning general and transferable features at specified locations within neural networks. We first establish the concept of common information in correlated tasks, the choice of which can be generic or problem-specific. We then introduce a reconstruction stage into the neural network model starting from a pre-specified hidden layer, i.e., the feature layer. By enforcing the reconstruction of the common information from the features at the feature layer, the learned features become generally descriptive of all the correlated tasks and can therefore be transferred among multiple task optimizations. Simulation results on the MNIST classification problem and on two wireless network utility optimization problems suggest that the proposed approach consistently outperforms the conventional transfer learning method and is robust against over-fitting under limited training data. We hope this work helps open up further exploration on bridging transfer learning and general mathematical optimization.

References

  • [1] W. Cui and W. Yu, “Transfer learning with input reconstruction loss,” in IEEE Global Commun. Conf. (Globecom), Dec. 2022.
  • [2] H. Sun, X. Chen, Q. Shi, M. Hong, X. Fu, and N. D. Sidiropoulos, “Learning to optimize: Training deep neural networks for interference management,” IEEE Trans. Signal Process., vol. 66, no. 20, pp. 5438–5453, Aug. 2018.
  • [3] W. Cui, K. Shen, and W. Yu, “Spatial deep learning for wireless scheduling,” IEEE J. Sel. Areas Commun., vol. 37, pp. 1248–1261, June 2019.
  • [4] F. Liang, C. Shen, W. Yu, and F. Wu, “Towards optimal power control via ensembling deep neural networks,” IEEE Trans. Commun. (TCOM), vol. 68, no. 3, pp. 1760–1776, Mar. 2020.
  • [5] W. Cui, K. Shen, and W. Yu, “Deep learning for robust power control for wireless networks,” in IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), May 2020.
  • [6] K. Gregor and Y. LeCun, “Learning fast approximations of sparse coding,” in Int. Conf. Mach. Learn. (ICML), Jun. 2010.
  • [7] S. Khobahi and M. Soltanalian, “Model-based deep learning for one-bit compressive sensing,” IEEE Trans. Signal Process., vol. 68, pp. 5292–5307, Sep. 2020.
  • [8] K. M. Attiah, F. Sohrabi, and W. Yu, “Deep learning for channel sensing and hybrid precoding in TDD massive MIMO OFDM systems,” IEEE Trans. Wireless Commun., vol. 21, no. 12, pp. 10 839–10 853, Dec. 2022.
  • [9] F. Sohrabi, T. Jiang, W. Cui, and W. Yu, “Active sensing for communications by learning,” IEEE J. Sel. Areas Commun., vol. 40, no. 6, pp. 1780–1794, Jun. 2022.
  • [10] F. Zhuang, Z. Qi, K. Duan, D. Xi, Y. Zhu, H. Zhu, H. Xiong, and Q. He, “A comprehensive survey on transfer learning,” Proc. IEEE, vol. 109, no. 1, pp. 43–76, Jul. 2020.
  • [11] M. Long, Y. Cao, J. Wang, and M. I. Jordan, “Learning transferable features with deep adaptation networks,” in Int. Conf. Mach. Learn. (ICML), Jul. 2015.
  • [12] Y. Ganin and V. Lempitsky, “Unsupervised domain adaptation by backpropagation,” in Int. Conf. Mach. Learn. (ICML), Jul. 2015.
  • [13] B. Sun and K. Saenko, “Deep CORAL: Correlation alignment for deep domain adaptation,” in Eur. Conf. Comput. Vision (ECCV), Oct. 2016.
  • [14] K. Gopalakrishnan, S. K. Khaitan, A. Choudhary, and A. Agrawal, “Deep convolutional neural networks with transfer learning for computer vision-based data-driven pavement distress detection,” Construction Building Mater., vol. 157, pp. 322–330, Dec. 2017.
  • [15] S. Tammina, “Transfer learning using VGG-16 with deep convolutional neural network for classifying images,” Int. J. Sci. Res. Publications, vol. 9, no. 10, pp. 143–150, Oct. 2019.
  • [16] F. Zhuang, P. Luo, H. Xiong, Q. He, Y. Xiong, and Z. Shi, “Exploiting associations between word clusters and document classes for cross-domain text categorization,” Statist. Anal. Data Mining, vol. 4, no. 1, pp. 100–114, Feb. 2011.
  • [17] J. Deng, Z. Zhang, E. Marchi, and B. Schuller, “Sparse autoencoder-based feature transfer learning for speech emotion recognition,” in Humaine Assoc. Conf. Affect. Comput. Intell. Interact., Sep. 2013.
  • [18] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Annu. Conf. North Amer. Chapter Assoc. Comput. Linguistics (NAACL-HLT), Jun. 2019.
  • [19] T. Brown et al., “Language models are few-shot learners,” in Conf. Neural Inf. Process. Syst. (NeurIPS), Dec. 2020.
  • [20] L. Yu and V. Y. Tan, “Common information, noise stability, and their extensions,” Found. Trends Commun. Inf. Theory, vol. 19, no. 2, pp. 107–389, Apr. 2022.
  • [21] S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira, “Analysis of representations for domain adaptation,” in Conf. Neural Inf. Process. Syst. (NeurIPS), Dec. 2006.
  • [22] F. Zhuang, X. Cheng, P. Luo, S. J. Pan, and Q. He, “Supervised representation learning: Transfer learning with deep autoencoders,” in Int. Joint Conf. Artif. Intell. (IJCAI), Jul. 2015.
  • [23] M. Wang, Y. Lin, Q. Tian, and G. Si, “Transfer learning promotes 6G wireless communications: Recent advances and future challenges,” IEEE Trans. Reliability, vol. 20, no. 3, pp. 1742–1755, Mar. 2021.
  • [24] M. A. Ranzato and M. Szummer, “Semi-supervised learning of compact document representations with deep networks,” in Int. Conf. Mach. Learn. (ICML), Jul. 2008.
  • [25] T. Robert, N. Thome, and M. Cord, “HybridNet: Classification and reconstruction cooperation for semi-supervised learning,” in Eur. Conf. Comput. Vision (ECCV), Sep. 2018.
  • [26] G. Lu, X. Zhao, J. Yin, W. Yang, and B. Li, “Multi-task learning using variational auto-encoder for sentiment classification,” Pattern Recognit. Lett., vol. 132, pp. 115–122, Apr. 2020.
  • [27] X. Glorot, A. Bordes, and Y. Bengio, “Domain adaptation for large-scale sentiment classification: A deep learning approach,” in Int. Conf. Mach. Learn. (ICML), Jun. 2011.
  • [28] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, “Domain-adversarial training of neural networks,” J. Mach. Learn. Res., vol. 17, no. 59, pp. 1–35, 2016.
  • [29] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan, “Unsupervised pixel-level domain adaptation with generative adversarial networks,” in IEEE Conf. Comput. Vision Pattern Recognit. (CVPR), Jul. 2017.
  • [30] J. Hou, X. Ding, J. D. Deng, and S. Cranefield, “Deep adversarial transition learning using cross-grafted generative stacks,” Neural Netw., vol. 149, pp. 172–183, May 2022.
  • [31] M. Ghifary, W. B. Kleijn, M. Zhang, and D. Balduzzi, “Domain generalization for object recognition with multi-task autoencoders,” in Int. Conf. Comput. Vision (ICCV), Dec. 2015.
  • [32] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
  • [33] H. Yang, J. Jee, G. Kwon, and H. Park, “Deep transfer learning-based adaptive beamforming for realistic communication channels,” in Int. Conf. Inf. Commun. Technol. Convergence (ICTC), Oct. 2020.
  • [34] Y. Yuan, G. Zheng, K. Wong, B. Ottersten, and Z. Luo, “Transfer learning and meta learning-based fast downlink beamforming adaptation,” IEEE Trans. Wireless Commun., vol. 20, no. 3, pp. 1742–1755, Mar. 2021.
  • [35] N. Tishby and N. Zaslavsky, “Deep learning and the information bottleneck principle,” in IEEE Inf. Theory Workshop (ITW), Apr. 2015.
  • [36] R. Shwartz-Ziv and N. Tishby, “Opening the black box of deep neural networks via information,” Mar. 2017, [Online] Available: https://arxiv.org/abs/1703.00810.
  • [37] A. M. Saxe, Y. Bansal, J. Dapello, M. Advani, A. Kolchinsky, B. D. Tracey, and D. D. Cox, “On the information bottleneck theory of deep learning,” in Int. Conf. Learn. Representations (ICLR), Feb. 2018.
  • [38] R. Caruana, S. Lawrence, and L. Giles, “Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping,” in Conf. Neural Inf. Process. Syst. (NeurIPS), Dec. 2000.
  • [39] M. Chiang, C. W. Tan, D. P. Palomar, D. O’Neill, and D. Julian, “Power control by geometric programming,” IEEE Trans. Wireless Commun., vol. 6, no. 7, pp. 2640–2651, Jul. 2007.
  • [40] K. Shen and W. Yu, “FPLinQ: A cooperative spectrum sharing strategy for device-to-device communications,” in IEEE Int. Symp. Inf. Theory (ISIT), Jun. 2017, pp. 2323–2327.