Computation and Communication Efficient Lightweighting Vertical Federated Learning

Heqiang Wang1, Jieming Bian1, Lei Wang1
1Department of Electrical and Computer Engineering,
University of Miami, Coral Gables, FL 33146, USA

Abstract

The exploration of computational and communication efficiency within Federated Learning (FL) has emerged as a prominent and crucial field of study. While most existing efforts to enhance these efficiencies have focused on Horizontal FL, the distinct processes and model structures of Vertical FL preclude the direct application of Horizontal FL-based techniques. In response, we introduce the concept of Lightweight Vertical Federated Learning (LVFL), targeting both computational and communication efficiencies. This approach involves separate lightweighting strategies for the feature model, to improve computational efficiency, and for feature embedding, to enhance communication efficiency. Moreover, we establish a convergence bound for our LVFL algorithm, which accounts for both communication and computational lightweighting ratios. Our evaluation of the algorithm on a image classification dataset reveals that LVFL significantly alleviates computational and communication demands while preserving robust learning performance. This work effectively addresses the gaps in communication and computational efficiency within Vertical FL.

Index Terms:

Federated Learning, Model Pruning

I Introduction

Federated learning (FL) is a distributed machine learning paradigm that enables a set of clients with decentralized data to collaborate and learn a shared model under the coordination of a centralized server. In FL, data is stored on edge devices in a distributed manner, which reduces the amount of data that needs to be uploaded and decreases the risk of user privacy leakage. The majority of prior research focuses on horizontal federated learning (HFL), where participants share the same feature space but have distinct sample spaces. There’s growing interest in vertical federated learning (VFL) in recent years. In VFL, clients possess different feature spaces but share a common sample space. An example of VFL implementation is Webank’s collaboration with an invoice company to construct financial risk models for enterprise customers [1]. In this context, both entities possess distinct user feature datasets that, due to information security concerns, cannot be shared or transmitted. During VFL training, each client develops a feature model, converting raw data features into a vector representation known as ‘feature embedding’. Instead of transmitting their feature models, clients in VFL send these feature embeddings to the server. The server then integrates these embeddings into a head model to determine the final loss. A typical workflow of VFL is illustrated in Fig. 1. Clearly, VFL’s methodology diverges from HFL, introducing its own set of unique challenges. Many of these challenges cannot be addressed merely by adapting solutions from HFL. There is a need for an independent approach to address the VFL problem.

During FL training, the diversity among clients results in variations in computational and communication capabilities, raising the need for synchronization solutions. While previous studies within HFL have addressed this, in HFL, local and global models are uniform due to identical feature spaces across clients, necessitating only local model pruning to meet specific client requirements. Conversely, in VFL, feature spaces differ among clients, leading to individual training on local feature models and subsequent uploading of trained feature embeddings to the server. This approach imposes distinct computational and communication demands for each client. Although prior research in VFL has examined feature embedding compression to reduce communication load [2, 3], the critical issue of computational burden has been largely neglected. Addressing this computational load is arguably a more pressing challenge for VFL deployment.

To tackle the diverse computational burdens encountered in the deployment of VFL, this paper introduces the concept of Lightweight Vertical Federated Learning (LVFL), designed to mitigate challenges in both computational and communication efficiency. The principal contributions of this study are summarized as follows: (1) This study introduces the LVFL framework, tailored to accommodate the varied communication and computation capabilities of heterogeneous workers. LVFL dynamically adjusts computational expenses for training feature models and communication costs for updating feature embeddings, ensuring efficiency across diverse system architectures. (2) We establish the comprehensive convergence analysis of the LVFL algorithm and present the convergence bound that elucidates the relationship between the cumulative feature model and feature embedding lightweighting error, as well as the communication and computation lightweighting ratio. (3) Our experiments, conducted on the CIFAR-10 dataset, validate the capability of the LVFL algorithm to adjust the computation and communication lightweighting ratio both constantly and dynamically. The proposed LVFL algorithm yields comparable test accuracy levels while significantly reduced communication and computation burden.

The rest of this paper is organized as follows. In Section II, we discuss related works on VFL and model pruning. Section III presents the system model and formulates the learning objective. In Section IV, we introduce the detail of LVFL algorithm. Section V presents the convergence analysis of LVFL. The experimental results of LVFL are presented in Section VI. Finally, Section VII concludes the paper. The theoretical proof details are accessible in the online supplementary material with the following link: https://github.com/ystex/LVFL/tree/main.

II Related Work

In recent years, VFL has attracted significant attention. The concept of Federated Learning with vertically partitioned data was introduced in [4]. Comprehensive surveys on VFL have been presented in [5, 6, 7]. However, VFL, distinct from HFL, poses its own set of challenges. Some research, such as [8, 9], aims to optimize data utilization to enhance the joint model’s effectiveness in VFL. In contrast, studies like [10] focus on implementing privacy-preserving protocols to counter potential data leakage threats. Beyond these challenges, there has been exploration into improving training efficiency in VFL. These endeavors primarily target reducing communication overhead, either by allowing participants to conduct multiple local updates in each iteration [11, 12] or by compressing the data exchanged between parties [13, 2]. While these studies primarily emphasize reducing communication overhead, a gap in these studies is the oversight of computational efficiency as a means to enhance training efficiency.

Modern DNNs often comprise tens of millions of parameters [14]. Storing and training such highly parameterized models on resource-constrained devices within FL can be challenging. Therefore, the lightweighting of models during training is an indispensable topic [15]. Considering the diverse computational capabilities of training devices, a viable strategy involves dynamically adjusting the size of the local model and diminishing its parameters, primarily via model pruning. Generally, model pruning techniques fall into two categories: unstructured pruning [16, 17] and structured pruning [18, 19, 20]. Unstructured pruning eliminates non-essential weights, specifically the inter-neuronal connections, in neural networks, resulting in significant parameter sparsity [21]. However, the ensuing sparse matrix’s irregularity poses challenges in parameter compression in memory, requiring specialized hardware or software libraries for efficient training [22]. In contrast, structured pruning aims to discard redundant model structures, like convolutional filters [19], without introducing sparsity. Consequently, the derived model can be considered as a subset or sub-configuration of the initial neural network, encompassing fewer parameters, thereby facilitating a reduction in computation overhead.

While a substantial portion of model pruning research centers on centralized learning contexts, recent studies have ventured into its application within FL settings [23, 24]. The work in [23] presents PruneFL, an adaptive FL strategy that commences with initial pruning at a selected client and continues with subsequent pruning throughout the FL process, aiming to reduce both communication and computational overheads. The research outlined in [24] introduces FedMP, a framework that employs adaptive pruning and recovery techniques to enhance both communication and computation efficiency, leveraging a multi-armed bandit algorithm for the selection of pruning ratios. Notably, the aforementioned model pruning techniques predominantly originate from HFL scenarios. However, due to the inherent differences between HFL and VFL, these techniques cannot be seamlessly applied to the VFL setting. Therefore driving the motivation for our study on model and feature embedding pruning in the VFL scenarios.

III System Model

To begin, we define several foundational concepts of the VFL. The VFL system consists of $K$ clients and a central server. The dataset, denoted as $\textbf{x}\in\mathbb{R}^{N\times D}$ , has $N$ representing the total data samples and $D$ indicating the number of features. The row indexed by $i$ in x corresponds to a data sample $x^{i}$ . Each sample, $x^{i}$ , possesses a unique set of features, $x^{i}_{k}$ retained by client $k$ , and identified as a distinct subset. Every instance $x^{i}$ is paired with a label $y^{i}$ . The vector $\textbf{y}\in\mathbb{R}^{N\times 1}$ represents all sample labels. Client $k$ maintains a dataset of local features, $\textbf{x}_{k}\in\mathbb{R}^{N\times D_{k}}$ , with $D_{k}$ denoting the number of features for client $k$ and $i$ -th row signifying the respective features $x^{i}_{k}$ . In this context, we assume that both the server and all clients retain a copy of the labels y, consistent with previous studies.

Each client, represented by $k\in K$ has a unique set of feature model parameters, $\theta_{k}$ . The server maintains the head model, $\theta_{0}$ . The function $h_{k}(\theta_{k},x_{k}^{i})$ denotes the feature embedding extracted from sample $x_{k}^{i}$ by client $k$ . This feature embedding operation transforms the high-dimensional raw data into a lower-dimensional representation through multiple layers of DNN, effectively capturing essential information from the input data while significantly reducing its dimension. The overall model’s parameters are collectively denoted as $\Theta=[\theta^{\top}_{0},\theta^{\top}_{1},...,\theta^{\top}_{K}]^{\top}$ . We can then express the learning objective as minimizing the subsequent equation:

\displaystyle F(\Theta;\textbf{x};\textbf{y}):=\frac{1}{N}\sum_{i=1}^{N}l(% \theta_{0},\{h_{k}(\theta_{k};x^{i}_{k})\}_{k=1}^{K};y^{i})

(1)

where $l(\cdot)$ denotes the loss function for a single data sample. Within the server model framework, the equation $h_{0}(\theta_{0},x^{i})=\theta_{0}$ consistently applies. To enhance clarity and ensure consistency in notation throughout this paper, we implement several simplifications in our subsequent discussions. Firstly, the feature embedding for dataset $\textbf{x}_{k}$ , is represented by $h_{k}(\theta;\textbf{x}_{k})$ , which we frequently abbreviate to $h_{k}(\theta_{k};\textbf{x}_{k})=h_{k}(\textbf{x}_{k})$ . Secondly, we allocate $k=0$ to denote the head model in FC, where $h_{0}(\theta_{0})=\theta_{0}$ for convenience. Finally, we often represent $F(\Theta)=F(h_{0}(\theta_{0}),h_{1}(\theta_{1}),...,h_{K}(\theta_{K}))$ , particularly in the subsequent algorithm details and convergence analysis. In the following section, we will elaborate on the specific algorithmic workflow of LVFL.

IV Lightweighting Vertical Federated Learning

In this section, we introduce the detail of LVFL algorithm. Considering the unique structural attributes of VFL, where computational demands are determined by the model size and communication requirements correlate with feature embedding size, the lightweighting methods derived from HFL are not directly applicable to VFL. To tackle this limitation, we adopt a progressive structured pruning strategy tailored for each client’s feature model and a unstructured pruning strategy tailored for each client’s feature embedding. Those approach is guided by the distinct computational and communication capacities of each client. Through these pruning techniques, we ensure consistent synchronization among all clients and facilitate efficient updates to the server.

In this scenario, we introduce a dual lightweighting ratio mechanism to distinctly optimize computation and communication. We denote $\alpha_{k}^{t}$ as the computation lightweighting ratio and $\beta_{k}^{t}$ as the communication lightweighting ratio. A higher $\alpha_{k}^{t}$ implies reduced demand for CPU cycles when processing a data sample. Correspondingly, a larger $\beta_{k}^{t}$ signifies a reduced need for embedding size during uploads. Suppose that each client performs $E$ local iterations before transmitting the model to the server, and the training process spans over a total of $R$ rounds, resulting in a cumulative duration of $T=RE$ local iterations. Algorithm 1 showcases the workflow of LVFL, integrating the dual lightweighting ratio mechanism.

During each global round, when $t\mid E=0$ , every client will decide the unstructured pruned feature embedding $\hat{h}_{k}^{t}(\hat{\theta}_{k}^{t};\textbf{x}_{k})$ based on the communication lightweighting ratio $\beta_{k}^{t}$ (Line 5), and then transmit this pruned feature embedding to the server for the global update (Line 6). Once the server has gathered all embeddings, it proceeds to get the model representation $\hat{\Phi}^{t_{0}}$ . Subsequently, it distributes the model representation to all clients (Lines 8-9), and $t_{0}$ is the start of the most recent global round when embeddings were shared. After receiving the model representation, each client needs to prune the feature model $\hat{\theta}_{k}^{t}$ according to its computation lightweighting ratio $\alpha_{k}^{t}$ (Line 11). Following this, each client will engage in $E$ rounds of local iterations using both the pruned embeddings received during iteration $t_{0}$ and its own unpruned embedding $h_{k}(\hat{\theta}_{k}^{t};\textbf{x}_{k})$ . The collection of embeddings from other clients, excluding client $k$ , is represented as $\hat{\Phi}^{t_{0}}_{-k}$ , and all embeddings are collectively denoted as $\hat{\Phi}^{t}_{k}$ . During each local iteration, client $k$ will update the feature model $\hat{\theta}_{k}$ using mini-batch SGD with step size $\eta^{t_{0}}$ (Lines 15-16).

Algorithm 1 LVFL

1:Initialize: The initial local model

\theta_{k}^{0}

for all clients

k

and server model

\theta_{0}^{0}

2:for

t=1,2,...,T-1

3: if

t\mid E=0

then

4: for

k=1,2,...,K

in parallel do

5: Determine

\hat{h}_{k}^{t}(\hat{\theta}_{k}^{t};\textbf{x}_{k})

according to

\beta_{k}^{t}

6: Send

\hat{h}_{k}^{t}(\hat{\theta}_{k}^{t};\textbf{x}_{k})

to server

7: end for

8: Server collects model representation

\hat{\Phi}^{t_{0}}

9: Server sends

\hat{\Phi}^{t_{0}}

to all clients

10: for

k=1,2,...,K

in parallel do

11: Prune the feature model

\hat{\theta}_{k}^{t}

according to

\alpha_{k}^{t}

12: end for

13: end if

14: for

k=0,1,2,...,K

in parallel do

15: Obtain

\hat{\Phi}^{t}_{k}\leftarrow\left\{\hat{\Phi}^{t_{0}}_{-k},h_{k}(\hat{\theta}_% {k}^{t};\textbf{x}_{k})\right\}

16: Update feature or head model

\hat{\theta}_{k}^{t+1}

17: end for

18:end for

We utilize an element-wise product to express the pruned feature model and feature embedding. Specifically, the pruned feature model can be represented as $\hat{\theta}_{k}^{t}=\theta_{k}^{t}\odot\textbf{m}_{k}^{t}$ , where $\theta_{k}^{t}$ denotes the original structure of client $k$ ’s local model without pruning, and $\textbf{m}_{k}^{t}$ signifies the mask vector, containing zeros for parameters in $\theta_{k}^{t}$ that are pruned. Similarly, for the pruned feature embedding, we express it as $\hat{h}_{k}^{t}(\hat{\theta}_{k}^{t})=h_{k}^{t}(\hat{\theta}_{k}^{t})\odot% \textbf{l}_{k}^{t}$ , where $h_{k}^{t}(\hat{\theta}_{k}^{t})$ represents the initial embedding of client $k$ without pruning, and $\textbf{l}_{k}^{t}$ denotes the mask vector with zeros for parameters in $h_{k}^{t}(\hat{\theta}_{k}^{t})$ that have been pruned. During each global round, the client’s feature model and feature embedding undergo adjustment based on the computation lightweighting ratio $\alpha_{k}^{t}$ and the communication lightweighting ratio $\beta_{k}^{t}$ . Next we explain in detail how to prune feature model and feature embedding to achieve lightweight separately.

Feature Model Lightweighting. In the feature model, we adopt a structured model pruning method to adjust the feature model from ${\theta}_{k}^{t}$ to $\hat{\theta}_{k}^{t}$ , guided by the computation lightweighting ratio $\alpha_{k}^{t}$ . This approach is consistent with existing research. To simplify the model and avoid introducing layer-specific hyper-parameters, we apply a uniform pruning ratio across all layers, as suggested by previous studies [19]. Within each layer, filters or neurons are ranked based on their importance scores, and those with lower scores are pruned based on the predetermined lightweighting ratio. We recommend employing the $l_{1}$ norm to calculate these importance scores.

Feature Embedding Lightweighting. For feature embedding, we employ the unstructured model pruning method to refine the feature embedding from $h_{k}^{t}(\hat{\theta}_{k}^{t})$ to $\hat{h}_{k}^{t}(\hat{\theta}_{k}^{t})$ , guided by the communication lightweighting ratio $\beta_{k}^{t}$ . This approach involves nullifying weights of the lowest absolute values to meet the pruning criteria. Although this method preserves the structure of the embedding, it substantially reduces communication costs by transmitting only the non-zero values.

Leveraging the aforementioned mechanism, clients can streamline both the feature model and feature embedding, directed by the parameters $\alpha_{k}^{t}$ and $\beta_{k}^{t}$ . Our approach seeks to optimally diminish the model’s complexity based on demand. Nevertheless, given that structured pruning is applied to the feature model, it is imperative to reassess the necessity for further pruning of the feature model prior to the training begin of each global round. Conversely, for feature embedding, which utilizes unstructured pruning, no additional verification step is required. Subsequently, we elucidate the mechanism for determining the computation lightweighting ratio of the feature model in detail.

1.

If $\alpha_{k}^{t}>{\alpha}_{k}^{t-}$ . Where ${\alpha}_{k}^{t-}=\max\left\{{\alpha}_{k}^{1},\cdots,{\alpha}_{k}^{t-1}\right\}$ denotes the maximum computation lightweighting ratio previously attained by client $k$ ’s feature model before global round $t$ . In this case further computation lightweighting is executed to meet the required lightweighting ratio for the current round.
2.

If $\alpha_{k}^{t}\leq{\alpha}_{k}^{t-}$ . Since the computation lightweighting ratio has been attained, the existing feature model was sufficiently streamlined, negating the need for further lightweighting in the current round.

This iterative process is repeated until the training converges or until predetermined termination conditions are satisfied. The subsequent section will delve into the analysis of the convergence bound of the LVFL algorithm.

V Convergence Analysis

In this section, we will delve into the convergence analysis of our LVFL algorithm. To begin, we need to establish some notations and definitions that will be utilized in the subsequent discussion. Specifically, we will define two types of errors that arise from the lightweighting mechanisms: communication lightweighting error and computation lightweighting error.

Communication Lightweight Error: This error quantifies the degree to which the lightweighting feature embedding is an accurate approximation of the original feature embedding. It is mathematically expressed as $\epsilon_{k}^{t}:=h_{k}^{t}({\theta}_{k}^{t})-\hat{h}_{k}^{t}({\theta}_{k}^{t})$ . The squared communication lightweighting error from client $k$ at round $t$ is denoted as $\Omega_{k}^{t}\triangleq\mathbb{E}\|\epsilon_{k}^{t}\|^{2}$ .

Computation Lightweight Error: This error assesses how closely the lightweighting feature model resembles the original feature model. It is defined as $\varphi_{k}^{t}:=h_{k}^{t}({\theta}_{k}^{t})-h_{k}^{t}(\hat{\theta}_{k}^{t})$ . The squared computation lightweighting error from client $k$ at round $t$ as $\Psi_{k}^{t}\triangleq\mathbb{E}\|\varphi_{k}^{t}\|^{2}$ .

Let $\hat{\textbf{G}}^{t}$ be the stacked partial derivatives at iteration $t$ :

\displaystyle\hat{\textbf{G}}^{t}:=\left[(\triangledown_{0}F(\hat{\Phi}^{t}_{0% };\textbf{y}))^{T},\dots,(\triangledown_{K}F(\hat{\Phi}^{t}_{K};\textbf{y}))^{% T}\right]^{T}

(2)

Then the global model updates as: $\Theta^{t+1}=\Theta^{t}-\eta^{t_{0}}\hat{\textbf{G}}^{t}$ .

We define ${\Phi}^{t_{0}}$ to be the set of embeddings that would be received by each client at iteration $t_{0}$ if no computation and communication lightweight error were applied: ${\Phi}^{t_{0}}\leftarrow\left\{\theta_{0}^{t},{h}_{1}^{t}({\theta}_{1}^{t}),% \cdots{h}_{k}^{t}({\theta}_{k}^{t})\right\}$ . Here we define ${\Phi}^{t_{0}}_{-k}$ be the set of embeddings from other parties $j\neq k$ , and we have ${\Phi}^{t_{0}}_{k}:=\left\{{\Phi}^{t_{0}}_{-k},h_{k}({\theta}_{k}^{t};\textbf{% x}_{k})\right\}$ . Our convergence analysis will utilize the following standard assumptions about the VFL.

Assumption 1 (Smoothness).

There exists positive constants $L<\infty$ and $L_{k}<\infty$ , such that for all $\Theta_{1},\Theta_{2}$ , the objective function satisfies: $\left\|\triangledown F(\Theta_{1})-\triangledown F(\Theta_{2})\right\|\leq L% \left\|\Theta_{1}-\Theta_{2}\right\|$ and $\left\|\triangledown_{k}F(\Theta_{1})-\triangledown_{k}F(\Theta_{2})\right\|% \leq L_{k}\left\|\Theta_{1}-\Theta_{2}\right\|$ .

Assumption 2 (Bounded Hessian).

There exists positive constants $H_{k}$ for $k=0,\dots,K$ such that for all $\Theta$ , the second partial derivatives of $F$ satisfy: $\left\|\triangledown^{2}_{h_{k}}F(\Theta)\right\|_{\mathcal{F}}\leq H_{k}$ . Where $\left\|X\right\|_{\mathcal{F}}$ is the Frobenius norm of a matrix $X$ .

Assumption 3 (Bounded Embedding Gradients).

There exists positive constants $G_{k}$ for $k=0,\dots,K$ such that for all $\theta_{k}$ , the embedding gradients are bounded: $\left\|\triangledown_{\theta_{k}}h_{k}(\theta_{k};x_{k})\right\|_{\mathcal{F}}% \leq G_{k}$ .

Assumption 1 guarantees that the function’s slope changes smoothly without any sudden or drastic alterations. Assumption 2 controls the curvature of the function, preventing it from displaying extreme or erratic behavior. Assumption 3 manages the changes in the embedding, ensuring that they don’t result in severe fluctuations, thereby aiding in the stabilization of the learning process. With those assumptions, we can get the following Lemma.

Lemma 1.

The norm of the difference between the objective function value with computation and communication lightweighting and without computation and communication lightweighting is bounded as:

		$\displaystyle\mathbb{E}\\|\nabla_{k}F(\hat{\Phi}^{t}_{k})-\nabla_{k}F({\Phi}^{t% }_{k})\\|^{2}$
	$\displaystyle\leq$	$\displaystyle 2G_{k}^{2}H_{k}^{2}\sum_{j\neq 0}^{K}\Psi_{k}^{t_{0}}+2G_{k}^{2}% H_{k}^{2}\sum_{j\neq 0,j\neq k}^{K}\Omega_{k}^{t_{0}}$		(3)

Using Lemma 1, we can bound the effect of lightweighting errors in detail. Afterwards we can derive the Theorem 1:

Theorem 1.

Under Assumption 1-3, the average squared gradient over $R$ global rounds of the LVFL is bounded as:

	$\displaystyle\frac{1}{R}\sum_{t_{0}=0}^{R-1}\mathbb{E}\\|\nabla F(\Theta^{t_{0}% })\\|^{2}\leq\frac{3\left[F(\Theta^{0})-\mathbb{E}[F(\Theta^{T})]\right]}{\eta T}$
	$\displaystyle+\frac{108E^{2}}{R}\sum_{t_{0}=0}^{R-1}\sum_{k=0}^{K}H_{k}^{2}G_{% k}^{2}\sum_{j\neq 0}^{K}\Psi_{j}^{t_{0}}$
	$\displaystyle+\frac{108E^{2}}{R}\sum_{t_{0}=0}^{R-1}\sum_{k=0}^{K}G_{k}^{2}H_{% k}^{2}\sum_{j\neq 0,j\neq k}^{K}\Omega_{j}^{t_{0}}$		(4)

The proof can be found in Appendix -D of online supplementary materials. The convergence bound detailed above unveils several insights: The first term illustrates the difference between the initial and final models. The second term highlights the error stemming from the use of the lightweighting feature model, while the third term represents an error caused by lightweighting feature embedding. This analysis reveals that the primary elements influencing the bound are the errors related to communication lightweighting and computation lightweighting. In other words, a larger error will result in a greater bound in the end. To delve deeper into the connection between error and lightweighting ratios, we must turn to the subsequent additional assumptions.

Assumption 4 (Bounded Embedding and Model).

Positive constants $\delta>0$ and $\mu>0$ exist such that for $k=0,\dots,K$ , the following conditions are met: $\mathbb{E}\left\|h_{k}(\theta_{k};x_{k})\right\|^{2}\leq\delta^{2},\quad% \mathbb{E}\left\|\theta_{k}\right\|^{2}\leq\mu^{2}$ .

Assumption 5 (Lipschitz Continuous).

There also exists positive constants and $M_{k}<\infty$ , such that for the all $\theta_{1}$ and $\theta_{2}$ satisfies: $\left\|h_{k}(\theta_{1})-h_{k}(\theta_{2})\right\|\leq M_{k}\left\|\theta_{1}-% \theta_{2}\right\|$ .

Assumption 4 and Assumption 5 are required to build the connection between the lightweighting ratios and the lightweighting error. With these two assumptions, we are able to derive the subsequent result, referred to as Corollary 1.

Corollary 1.

Under Assumption 1-5, with the computation lightweighting ratio $\alpha_{k}^{t}$ and communication lightweighting ratio $\beta_{k}^{t}$ , we can further obtain the following bound:

	$\displaystyle\frac{1}{R}\sum_{t_{0}=0}^{R-1}\mathbb{E}\\|\nabla F(\Theta^{t_{0}% })\\|^{2}\leq\frac{3\left[F(\Theta^{0})-\mathbb{E}[F(\Theta^{T})]\right]}{\eta T}$
	$\displaystyle+\frac{108E^{2}\mu^{2}}{R}\sum_{t_{0}=0}^{R-1}\sum_{k=0}^{K}H_{k}% ^{2}G_{k}^{2}M_{k}^{2}\sum_{j\neq 0}^{K}\alpha_{j}^{t_{0}}$
	$\displaystyle+\frac{108E^{2}\delta^{2}}{R}\sum_{t_{0}=0}^{R-1}\sum_{k=0}^{K}G_% {k}^{2}H_{k}^{2}\sum_{j\neq 0,j\neq k}^{K}\beta_{j}^{t_{0}}$		(5)

The detailed proof for this result is located in Appendix -E within the online supplementary materials. As derived from Corollary 1, a key observation to emphasize is that the communication lightweighting error is governed by the communication lightweighting ratio $\beta_{k}^{t}$ , while the computation lightweighting error is influenced by the computation lightweighting ratio $\alpha_{k}^{t}$ . As a result, higher lightweighting ratios tend to result in more lenient convergence bounds, whereas lower ratios lead to more stringent convergence bounds.

VI Experiment

In this section, we carry out experiments to assess the performance of LVFL. The experiments were run on an Ubuntu 18.04 machine with an Intel Core i7-10700KF 3.8GHz CPU and GeForce RTX 3070 GPU. Specifically, we explore the effectiveness of the algorithm we’ve proposed by utilizing the VFL-based datasets: CIFAR-10. Subsequently, we will provide detailed information about the datasets and the corresponding models used in the experiment.

CIFAR-10: CIFAR-10 is a dataset used for object classification in images. In a specific training setup, there are 4 parties involved, with each party responsible for a different quadrant of each image. The number of training data samples $N=10000$ , and the batch size $b_{s}=256$ . Every party (client) utilizes VGG16 as feature model for training, while the server focuses on training a 3-layer fully connected networks (FCN) as head model.

In the experiment, the following approaches are used for performance comparison. No Lightweighting (NL), where no computational or communication lightweighting mechanisms are implemented. Computational Lightweighting Only (PL), applying solely computational lightweighting. Communication Lightweighting Only (ML), utilizing exclusively communication lightweighting. Lightweighting (L), incorporating both computational and communication lightweighting strategies to enhance efficiency. Next we first show the performance comparison of the above approaches with dynamic computational and communication lightweighting ratios.

VI-A Performance Comparison

This section presents the learning performance optimized for computational and communication efficiency. At each round each client will be assigned the computational lightweighting ratio $\alpha_{k}^{t}$ and communication lightweighting ratio $\beta_{k}^{t}$ . These ratios are then applied to individually tailor the lightweighting process to each client’s unique requirements. For illustrative purposes, we assume a uniform set of ratios across all clients, with the specific values depicted in Fig. 2b. The learning performance associated with our approaches are illustrated in Fig. 2a, from which several key observations can be derived. Primarily, computational lightweighting exerts a more significant influence on learning performance compared to communication lightweighting. Notably, computational lightweighting results in a marked decrease in test accuracy at the point of implementation, followed by a gradual recovery. Furthermore, the frequency and intensity of computational lightweighting are directly proportional to its impact on test accuracy.

VI-B Effect of Ratio on Learning Performance

In this section we further investigate the effect of the choice of $\alpha_{k}^{t}$ and $\beta_{k}^{t}$ on learning performance. Here we explore the effects of computation and communication separately. In the scenario of computational lightweighting, as depicted in Fig. 3a, adjustments to the feature model are made at $t=40$ with $\alpha^{t}\in[0.2,0.4,0.6]$ . The outcomes indicate that higher $\alpha^{t}$ values result in more significant reductions in test accuracy, consequently necessitating a longer recovery period. In the context of communication lightweighting, as shown in Fig. 3b, varying communication lightweighting ratios $\beta^{t}\in[0.2,0.4,0.6]$ were applied to the feature embedding. The findings illustrate that lower $\beta^{t}$ values are associated with quicker convergence rates, particularly noticeable within the range $t\in[10,40]$ . However, the impact of communication lightweighting is not as substantial as that observed with computational lightweighting.

VII Conclusion

Our paper introduces the concept of Lightweight Vertical Federated Learning (LVFL). Owing to structural distinctions between VFL and HFL, algorithm design and convergence analysis for VFL-based lightweight challenges significantly differ. Our convergence proofs elucidate the correlation between convergence bounds and the ratios of communication and computational lightweighting. Furthermore, the section on experimental results underscores the benefits of employing lightweight mechanisms. In the subsequent research, we intend to extend LVFL to be a more practical problem. We plan to formulate a long-term optimization problem that accurately represents the computational and communication difficulties clients face throughout the training phase. This will facilitate the determination of optimal lightweighting ratios specifically.

References

[1] Y. Liu, T. Fan, T. Chen, Q. Xu, and Q. Yang, “Fate: An industrial grade platform for collaborative learning with data protection,” Journal of Machine Learning Research, vol. 22, no. 226, pp. 1–6, 2021.
[2] T. J. Castiglia, A. Das, S. Wang, and S. Patterson, “Compressed-vfl: Communication-efficient learning with vertically partitioned data,” in International Conference on Machine Learning. PMLR, 2022, pp. 2738–2766.
[3] H. Wang and J. Xu, “Online vertical federated learning for cooperative spectrum sensing,” arXiv preprint arXiv:2312.11363, 2023.
[4] S. Hardy, W. Henecka, H. Ivey-Law, R. Nock, G. Patrini, G. Smith, and B. Thorne, “Private federated learning on vertically partitioned data via entity resolution and additively homomorphic encryption,” arXiv preprint arXiv:1711.10677, 2017.
[5] L. Yang, D. Chai, J. Zhang, Y. Jin, L. Wang, H. Liu, H. Tian, Q. Xu, and K. Chen, “A survey on vertical federated learning: From a layered perspective,” arXiv preprint arXiv:2304.01829, 2023.
[6] K. Wei, J. Li, C. Ma, M. Ding, S. Wei, F. Wu, G. Chen, and T. Ranbaduge, “Vertical federated learning: Challenges, methodologies and experiments,” arXiv preprint arXiv:2202.04309, 2022.
[7] Y. Liu, Y. Kang, T. Zou, Y. Pu, Y. He, X. Ye, Y. Ouyang, Y.-Q. Zhang, and Q. Yang, “Vertical federated learning,” arXiv preprint arXiv:2211.12814, 2022.
[8] S. Feng, “Vertical federated learning-based feature selection with non-overlapping sample utilization,” Expert Systems with Applications, vol. 208, p. 118097, 2022.
[9] Y. Kang, Y. Liu, and X. Liang, “Fedcvt: Semi-supervised vertical federated learning with cross-view training,” ACM Transactions on Intelligent Systems and Technology (TIST), vol. 13, no. 4, pp. 1–16, 2022.
[10] J. Sun, Y. Yao, W. Gao, J. Xie, and C. Wang, “Defending against reconstruction attack in vertical federated learning,” arXiv preprint arXiv:2107.09898, 2021.
[11] Y. Liu, X. Zhang, Y. Kang, L. Li, T. Chen, M. Hong, and Q. Yang, “Fedbcd: A communication-efficient collaborative learning framework for distributed features,” IEEE Transactions on Signal Processing, vol. 70, pp. 4277–4290, 2022.
[12] T. Castiglia, S. Wang, and S. Patterson, “Flexible vertical federated learning with heterogeneous parties,” arXiv preprint arXiv:2208.12672, 2022.
[13] M. Li, Y. Chen, Y. Wang, and Y. Pan, “Efficient asynchronous vertical federated learning via gradient prediction and double-end sparse compression,” in 2020 16th international conference on control, automation, robotics and vision (ICARCV). IEEE, 2020, pp. 291–296.
[14] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[15] Z. Wei, Q. Pei, N. Zhang, X. Liu, C. Wu, and A. Taherkordi, “Lightweight federated learning for large-scale iot devices with privacy guarantee,” IEEE Internet of Things Journal, vol. 10, no. 4, pp. 3179–3191, 2021.
[16] X. Dong, S. Chen, and S. Pan, “Learning to prune deep neural networks via layer-wise optimal brain surgeon,” Advances in neural information processing systems, vol. 30, 2017.
[17] V. Sanh, T. Wolf, and A. Rush, “Movement pruning: Adaptive sparsity by fine-tuning,” Advances in Neural Information Processing Systems, vol. 33, pp. 20 378–20 389, 2020.
[18] X. Ding, G. Ding, Y. Guo, and J. Han, “Centripetal sgd for pruning very deep convolutional networks with complicated structure,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 4943–4953.
[19] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, “Pruning filters for efficient convnets,” arXiv preprint arXiv:1608.08710, 2016.
[20] Z. You, K. Yan, J. Ye, M. Ma, and P. Wang, “Gate decorator: Global filter pruning method for accelerating deep convolutional neural networks,” Advances in neural information processing systems, vol. 32, 2019.
[21] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” Advances in neural information processing systems, vol. 28, 2015.
[22] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “Eie: Efficient inference engine on compressed deep neural network,” ACM SIGARCH Computer Architecture News, vol. 44, no. 3, pp. 243–254, 2016.
[23] Y. Jiang, S. Wang, V. Valls, B. J. Ko, W.-H. Lee, K. K. Leung, and L. Tassiulas, “Model pruning enables efficient federated learning on edge devices,” IEEE Transactions on Neural Networks and Learning Systems, 2022.
[24] Z. Jiang, Y. Xu, H. Xu, Z. Wang, J. Liu, Q. Chen, and C. Qiao, “Computation and communication efficient federated learning with adaptive model pruning,” IEEE Transactions on Mobile Computing, 2023.

In this section, we provide the details of proofs. Let’s start by defining a new series of notations.

-A Additional Notation

In this section, we present various notations that will be utilized in subsequent proofs. During each local round $t$ , every client $k$ trains the local model with the embedding $\hat{\Phi}^{t_{0}}_{k}$ . This process can also be interpreted as the client $k$ training the model ${\theta}_{k}^{t}$ and ${\theta}_{j}^{t_{0}}$ for all $j\neq k$ , where $t_{0}$ is the last communication iteration when client $k$ received the embeddings. Then we have:

\displaystyle\gamma_{k,j}^{t}=\left\{\begin{array}[]{ll}{\theta}_{j}^{t},&k=j% \\ {\theta}_{j}^{t_{0}},&\text{otherwise}\end{array}\right.

(8)

Which represents the real model that client $k$ utilized in the round $t$ . We define the column vector $\Gamma_{k}^{t}=\left[(\gamma_{k,0}^{t})^{T};\dots;(\gamma_{k,K}^{t})^{T}\right]$ to be the client $k$ ’s view of global model in the round $t$ . Then we define $\hat{F}({\Gamma}^{t}_{k})$ to be the loss with pruning error by client $k$ at round t:

\displaystyle\hat{F}({\Gamma}^{t}_{k})=F\left({\theta}_{0}^{t_{0}},h_{1}^{t_{0% }}({\theta}_{1}^{t_{0}})+\varphi_{1}^{t_{0}}+\epsilon_{1}^{t_{0}},\dots,h_{k}^% {t}({\theta}_{k}^{t})+\varphi_{k}^{t_{0}},\dots,h_{K}^{t_{0}}({\theta}_{K}^{t_% {0}})+\varphi_{K}^{t_{0}}+\epsilon_{K}^{t_{0}}\right)

(9)

It is crucial to emphasize that ${\theta}_{0}^{t_{0}}$ is free from computation and communication pruning errors, while the expression $h_{k}^{t}({\theta}_{k}^{t})$ is unaffected by communication pruning errors. Building on this understanding, we can now reform the definition of $\hat{\textbf{G}}^{t}$ as follows::

\displaystyle\hat{\textbf{G}}^{t}:=\left[(\triangledown_{0}\hat{F}({\Gamma}^{t% }_{0}))^{T},\dots,(\triangledown_{K}\hat{F}({\Gamma}^{t}_{K}))^{T}\right]^{T}

(10)

In the subsequent proof, we will employ both variations of $\hat{\textbf{G}}^{t}$ . Following that, we will define the computation pruning error associated with each embedding utilized in client $k$ ’s gradient calculation during round $t$ :

\displaystyle E_{k}^{t_{0}}:=\left[(\varphi_{1}^{t_{0}}+\epsilon_{1}^{t_{0}})^% {T},\dots,(\varphi_{k}^{t_{0}})^{T},\dots,(\varphi_{K}^{t_{0}}+\epsilon_{K}^{t% _{0}})^{T}\right]^{T}

(11)

Given the error, we can derive the following:

\displaystyle\nabla_{k}{F}(\Phi^{t}_{k}+E_{k}^{t_{0}})=\nabla_{k}\hat{F}({% \Gamma}^{t}_{k})

(12)

We apply the chain rule to $\nabla_{k}\hat{F}({\Gamma}^{t}_{k})$ to get:

\displaystyle\nabla_{k}\hat{F}({\Gamma}^{t}_{k})=\nabla_{{\theta}_{k}}h_{k}({% \theta}_{k}^{t})\nabla_{h_{k}({\theta}_{k})}F(\Phi^{t}_{k}+E_{k}^{t_{0}})

(13)

Using the Taylor series expansion to further expand $\nabla_{h_{k}({\theta}_{k})}F(\Phi^{t}_{k}+E_{k}^{t_{0}})$ around the point $\Phi^{t}_{k}$ :

\displaystyle\nabla_{h_{k}({\theta}_{k})}F(\Phi^{t}_{k}+E_{k}^{t_{0}})=\nabla_% {h_{k}({\theta}_{k})}F(\Phi^{t}_{k})+\nabla_{h_{k}({\theta}_{k})}^{2}F(\Phi^{t% }_{k})^{T}E_{k}^{t_{0}}+\dots

(14)

To convince, we let the infinite sum of all terms from the second partial derivatives as $R_{0}^{k}(\Phi^{t}_{k}+E_{k}^{t_{0}})$ . Then we can get:

\displaystyle\nabla_{h_{k}({\theta}_{k})}F(\Phi^{t}_{k}+E_{k}^{t_{0}})=\nabla_% {h_{k}({\theta}_{k})}F(\Phi^{t}_{k})+R_{0}^{k}(\Phi^{t}_{k}+E_{k}^{t_{0}})

(15)

Building on the definition provided above, we will now move to the theoretical proof.

-B Proof of Lemma 1

By definition, we know that

	$\displaystyle\nabla_{k}\hat{F}({\Gamma}^{t}_{k})=\nabla_{k}{F}(\Phi^{t}_{k}+E_% {k}^{t_{0}})$		(16)
	$\displaystyle=\nabla_{{\theta}_{k}}h_{k}({\theta}_{k}^{t})\nabla_{h_{k}({% \theta}_{k})}F(\Phi^{t}_{k}+E_{k}^{t_{0}})$		(17)
	$\displaystyle=\nabla_{{\theta}_{k}}h_{k}({\theta}_{k}^{t})\left(\nabla_{h_{k}(% {\theta}_{k})}F(\Phi^{t}_{k})+R_{0}^{k}(\Phi^{t}_{k}+E_{k}^{t_{0}})\right)$		(18)
	$\displaystyle=\nabla_{k}{F}({\Gamma}^{t}_{k})+\nabla_{{\theta}_{k}}h_{k}({% \theta}_{k}^{t})R_{0}^{k}(\Phi^{t}_{k}+E_{k}^{t_{0}})$		(19)

Where Eq. 18 uses the Taylor series expansion. Then we can get

\displaystyle\mathbb{E}\left\|\nabla_{k}\hat{F}({\Gamma}^{t}_{k})-\nabla_{k}{F% }({\Gamma}^{t}_{k})\right\|^{2}=\mathbb{E}\left\|\nabla_{{\theta}_{k}}h_{k}({% \theta}_{k}^{t})R_{0}^{k}(\Phi^{t}_{k}+E_{k}^{t_{0}})\right\|^{2}

(20)

Thus we need to get the bound of the right side first.

\displaystyle\left\|\nabla_{{\theta}_{k}}h_{k}({\theta}_{k}^{t})R_{0}^{k}(\Phi% ^{t}_{k}+E_{k}^{t_{0}})\right\|^{2}\leq\left\|\nabla_{{\theta}_{k}}h_{k}({% \theta}_{k}^{t})\right\|^{2}_{\mathcal{F}}\left\|R_{0}^{k}(\Phi^{t}_{k}+E_{k}^% {t_{0}})\right\|^{2}_{\mathcal{F}}\leq G_{k}^{2}H_{k}^{2}\left\|E_{k}^{t_{0}}% \right\|^{2}_{\mathcal{F}}

(21)

Then applying expectation we can finally get:

	$\displaystyle\mathbb{E}\left\\|\nabla_{{\theta}_{k}}h_{k}({\theta}_{k}^{t})R_{0% }^{k}(\Phi^{t}_{k}+E_{k}^{t_{0}})\right\\|^{2}\leq G_{k}^{2}H_{k}^{2}\mathbb{E}% \left\\|E_{k}^{t_{0}}\right\\|^{2}_{\mathcal{F}}$		(22)
	$\displaystyle\leq G_{k}^{2}H_{k}^{2}\mathbb{E}\left[\sum_{j\neq 0,j\neq k}^{K}% \left\\|\varphi_{j}^{t_{0}}+\epsilon_{j}^{t_{0}}\right\\|^{2}_{\mathcal{F}}+% \left\\|\varphi_{k}^{t_{0}}\right\\|^{2}_{\mathcal{F}}\right]$		(23)
	$\displaystyle\leq G_{k}^{2}H_{k}^{2}\mathbb{E}\left[\sum_{j\neq 0,j\neq k}^{K}% \left(2\left\\|\varphi_{j}^{t_{0}}\right\\|^{2}_{\mathcal{F}}+2\left\\|\epsilon_{% j}^{t_{0}}\right\\|^{2}_{\mathcal{F}}\right)+\left\\|\varphi_{k}^{t_{0}}\right\\|% ^{2}_{\mathcal{F}}\right]$		(24)
	$\displaystyle\leq 2G_{k}^{2}H_{k}^{2}\sum_{j\neq 0}^{K}\mathbb{E}\left\\|% \varphi_{j}^{t_{0}}\right\\|^{2}_{\mathcal{F}}+2G_{k}^{2}H_{k}^{2}\sum_{j\neq 0% ,j\neq k}^{K}\mathbb{E}\left\\|\epsilon_{j}^{t_{0}}\right\\|^{2}_{\mathcal{F}}$		(25)
	$\displaystyle=2G_{k}^{2}H_{k}^{2}\sum_{j\neq 0}^{K}\Psi_{j}^{t_{0}}+2G_{k}^{2}% H_{k}^{2}\sum_{j\neq 0,j\neq k}^{K}\Omega_{j}^{t_{0}}$		(26)

-C Proof of Lemma 2

In this section, we will get the bound of $\sum_{t=t_{0}}^{t_{0}+E-1}\mathbb{E}_{t_{0}}\left\|\hat{\textbf{G}}^{t}-{% \textbf{G}}^{t_{0}}\right\|^{2}$ .

	$\displaystyle\mathbb{E}_{t_{0}}\left\\|\hat{\textbf{G}}^{t}-{\textbf{G}}^{t_{0}% }\right\\|^{2}=\sum_{k=0}^{K}\mathbb{E}_{t_{0}}\left[\left\\|\nabla_{k}\hat{F}({% \Gamma}^{t}_{k})-\nabla_{k}{F}({\Gamma}^{t_{0}}_{k})\right\\|^{2}\right]$		(27)
	$\displaystyle=\sum_{k=0}^{K}\mathbb{E}_{t_{0}}\left[\left\\|\nabla_{k}\hat{F}({% \Gamma}^{t}_{k})-\nabla_{k}\hat{F}({\Gamma}^{t-1}_{k})+\nabla_{k}\hat{F}({% \Gamma}^{t-1}_{k})-\nabla_{k}{F}({\Gamma}^{t_{0}}_{k})\right\\|^{2}\right]$		(28)
	$\displaystyle\leq(1+n)\sum_{k=0}^{K}\mathbb{E}_{t_{0}}\left[\left\\|\nabla_{k}% \hat{F}({\Gamma}^{t}_{k})-\nabla_{k}\hat{F}({\Gamma}^{t-1}_{k})\right\\|^{2}% \right]+(1+\frac{1}{n})\sum_{k=0}^{K}\mathbb{E}_{t_{0}}\left[\left\\|\nabla_{k}% \hat{F}({\Gamma}^{t-1}_{k})-\nabla_{k}{F}({\Gamma}^{t_{0}}_{k})\right\\|^{2}\right]$		(29)
	$\displaystyle\leq 2(1+n)\sum_{k=0}^{K}\mathbb{E}_{t_{0}}\left[\left\\|\nabla_{k% }{F}({\Gamma}^{t}_{k})-\nabla_{k}{F}({\Gamma}^{t-1}_{k})\right\\|^{2}\right]+(1% +\frac{1}{n})\sum_{k=0}^{K}\mathbb{E}_{t_{0}}\left[\left\\|\nabla_{k}\hat{F}({% \Gamma}^{t-1}_{k})-\nabla_{k}{F}({\Gamma}^{t_{0}}_{k})\right\\|^{2}\right]$		(30)
	$\displaystyle+2(1+n)\sum_{k=0}^{K}\mathbb{E}_{t_{0}}\left[\left\\|\nabla_{{% \theta}_{k}}h_{k}({\theta}_{k}^{t})R_{0}^{k}(\Phi^{t}_{k}+E_{k}^{t_{0}})-% \nabla_{{\theta}_{k}}h_{k}({\theta}_{k}^{t-1})R_{0}^{k}(\Phi^{t-1}_{k}+E_{k}^{% t-1})\right\\|^{2}\right]$		(31)
	$\displaystyle\leq 2(1+n)\sum_{k=0}^{K}\mathbb{E}_{t_{0}}\left[\left\\|\nabla_{k% }{F}({\Gamma}^{t}_{k})-\nabla_{k}{F}({\Gamma}^{t-1}_{k})\right\\|^{2}\right]+(1% +\frac{1}{n})\sum_{k=0}^{K}\mathbb{E}_{t_{0}}\left[\left\\|\nabla_{k}\hat{F}({% \Gamma}^{t-1}_{k})-\nabla_{k}{F}({\Gamma}^{t_{0}}_{k})\right\\|^{2}\right]$		(32)
	$\displaystyle+8(1+n)\sum_{k=0}^{K}G_{k}^{2}H_{k}^{2}\left\\|E_{k}^{t_{0}}\right% \\|^{2}_{\mathcal{F}}$		(33)

Where the last term follows the Eq. 21. Then we have:

	$\displaystyle\mathbb{E}_{t_{0}}\left\\|\hat{\textbf{G}}^{t}-{\textbf{G}}^{t_{0}% }\right\\|^{2}$
	$\displaystyle\leq 2(1+n)(\eta^{t_{0}})^{2}\sum_{k=0}^{K}(L_{k})^{2}\mathbb{E}_% {t_{0}}\left[\left\\|\nabla_{k}\hat{F}({\Gamma}^{t-1}_{k})\right\\|^{2}\right]+(% 1+\frac{1}{n})\sum_{k=0}^{K}\mathbb{E}_{t_{0}}\left[\left\\|\nabla_{k}\hat{F}({% \Gamma}^{t-1}_{k})-\nabla_{k}{F}({\Gamma}^{t_{0}}_{k})\right\\|^{2}\right]$		(34)
	$\displaystyle+8(1+n)\sum_{k=0}^{K}G_{k}^{2}H_{k}^{2}\left\\|E_{k}^{t_{0}}\right% \\|^{2}_{\mathcal{F}}$		(35)
	$\displaystyle\leq 2(1+n)(\eta^{t_{0}})^{2}\sum_{k=0}^{K}(L_{k})^{2}\mathbb{E}_% {t_{0}}\left[\left\\|\nabla_{k}\hat{F}({\Gamma}^{t-1}_{k})-\nabla_{k}{F}({% \Gamma}^{t_{0}}_{k})+\nabla_{k}{F}({\Gamma}^{t_{0}}_{k})\right\\|^{2}\right]$		(36)
	$\displaystyle+(1+\frac{1}{n})\sum_{k=0}^{K}\mathbb{E}_{t_{0}}\left[\left\\|% \nabla_{k}\hat{F}({\Gamma}^{t-1}_{k})-\nabla_{k}{F}({\Gamma}^{t_{0}}_{k})% \right\\|^{2}\right]+8(1+n)\sum_{k=0}^{K}G_{k}^{2}H_{k}^{2}\left\\|E_{k}^{t_{0}}% \right\\|^{2}_{\mathcal{F}}$		(37)
	$\displaystyle\leq 4(1+n)(\eta^{t_{0}})^{2}\sum_{k=0}^{K}(L_{k})^{2}\mathbb{E}_% {t_{0}}\left[\left\\|\nabla_{k}\hat{F}({\Gamma}^{t-1}_{k})-\nabla_{k}{F}({% \Gamma}^{t_{0}}_{k})\right\\|^{2}\right]$		(38)
	$\displaystyle+4(1+n)(\eta^{t_{0}})^{2}\sum_{k=0}^{K}(L_{k})^{2}\mathbb{E}_{t_{% 0}}\left[\left\\|\nabla_{k}{F}({\Gamma}^{t_{0}}_{k})\right\\|^{2}\right]$		(39)
	$\displaystyle+(1+\frac{1}{n})\sum_{k=0}^{K}\mathbb{E}_{t_{0}}\left[\left\\|% \nabla_{k}\hat{F}({\Gamma}^{t-1}_{k})-\nabla_{k}{F}({\Gamma}^{t_{0}}_{k})% \right\\|^{2}\right]+8(1+n)\sum_{k=0}^{K}G_{k}^{2}H_{k}^{2}\left\\|E_{k}^{t_{0}}% \right\\|^{2}_{\mathcal{F}}$		(40)
	$\displaystyle\leq\sum_{k=0}^{K}\left(4(1+n)(\eta^{t_{0}})^{2}(L_{k})^{2}+(1+% \frac{1}{n})\right)\mathbb{E}_{t_{0}}\left[\left\\|\nabla_{k}\hat{F}({\Gamma}^{% t-1}_{k})-\nabla_{k}{F}({\Gamma}^{t_{0}}_{k})\right\\|^{2}\right]$		(41)
	$\displaystyle+4(1+n)(\eta^{t_{0}})^{2}\sum_{k=0}^{K}(L_{k})^{2}\mathbb{E}_{t_{% 0}}\left[\left\\|\nabla_{k}{F}({\Gamma}^{t_{0}}_{k})\right\\|^{2}\right]+8(1+n)% \sum_{k=0}^{K}G_{k}^{2}H_{k}^{2}\left\\|E_{k}^{t_{0}}\right\\|^{2}_{\mathcal{F}}$		(42)

Where Eq. 34 follows the assumption 1 and update rule $\Gamma^{t}_{k}-{\Gamma}^{t-1}_{k}=\eta^{t_{0}}\nabla_{k}\hat{F}({\Gamma}^{t-1}% _{k})$ .
Then by set $n=E$ and $\eta^{t_{0}}\leq\frac{1}{4E\max_{k}L_{k}^{a}}$ , we can get:

	$\displaystyle\mathbb{E}_{t_{0}}\left\\|\hat{\textbf{G}}^{t}-{\textbf{G}}^{t_{0}% }\right\\|^{2}$		(43)
	$\displaystyle\leq\sum_{k=0}^{K}\left(\frac{(1+E)}{4E^{2}}+(1+\frac{1}{E})% \right)\mathbb{E}_{t_{0}}\left[\left\\|\nabla_{k}\hat{F}({\Gamma}^{t-1}_{k})-% \nabla_{k}{F}({\Gamma}^{t_{0}}_{k})\right\\|^{2}\right]$		(44)
	$\displaystyle+4(1+E)(\eta^{t_{0}})^{2}\sum_{k=0}^{K}(L_{k})^{2}\mathbb{E}_{t_{% 0}}\left[\left\\|\nabla_{k}{F}({\Gamma}^{t_{0}}_{k})\right\\|^{2}\right]+8(1+n)% \sum_{k=0}^{K}G_{k}^{2}H_{k}^{2}\left\\|E_{k}^{t_{0}}\right\\|^{2}_{\mathcal{F}}$		(45)
	$\displaystyle\leq\sum_{k=0}^{K}\left(1+\frac{2}{E}\right)\mathbb{E}_{t_{0}}% \left[\left\\|\nabla_{k}\hat{F}({\Gamma}^{t-1}_{k})-\nabla_{k}{F}({\Gamma}^{t_{% 0}}_{k})\right\\|^{2}\right]$		(46)
	$\displaystyle+4(1+E)(\eta^{t_{0}})^{2}\sum_{k=0}^{K}(L_{k})^{2}\mathbb{E}_{t_{% 0}}\left[\left\\|\nabla_{k}{F}({\Gamma}^{t_{0}}_{k})\right\\|^{2}\right]+8(1+n)% \sum_{k=0}^{K}G_{k}^{2}H_{k}^{2}\left\\|E_{k}^{t_{0}}\right\\|^{2}_{\mathcal{F}}$		(47)

Here, we set the new notations for each term as follows:

	$\displaystyle A^{t}=\sum_{k=0}^{K}\mathbb{E}_{t_{0}}\left[\left\\|\nabla_{k}% \hat{F}({\Gamma}^{t}_{k})-\nabla_{k}{F}({\Gamma}^{t_{0}}_{k})\right\\|^{2}\right]$		(48)
	$\displaystyle B_{0}=4(1+E)(\eta^{t_{0}})^{2}\sum_{k=0}^{K}(L_{k})^{2}\mathbb{E% }_{t_{0}}\left[\left\\|\nabla_{k}{F}({\Gamma}^{t_{0}}_{k})\right\\|^{2}\right]$		(49)
	$\displaystyle B_{1}=8(1+n)\sum_{k=0}^{K}G_{k}^{2}H_{k}^{2}\left\\|E_{k}^{t_{0}}% \right\\|^{2}_{\mathcal{F}}$		(50)
	$\displaystyle C=\left(1+\frac{2}{E}\right)$		(51)

We can get $A^{t}\leq CA^{t-1}+B_{0}+B_{1}$ . By induction, we can obtain $A^{t}\leq C^{t-t_{0}-1}A^{t_{0}}+(B_{0}+B_{1})\frac{C^{t-t_{0}-1}-1}{C-1}$ .

\displaystyle A^{t_{0}}

\displaystyle=\sum_{k=0}^{K}\mathbb{E}_{t_{0}}\left[\left\|\nabla_{k}\hat{F}({% \Gamma}^{t_{0}}_{k})-\nabla_{k}{F}({\Gamma}^{t_{0}}_{k})\right\|^{2}\right]% \leq\sum_{k=0}^{K}G_{k}^{2}H_{k}^{2}\left\|E_{k}^{t_{0}}\right\|^{2}

(52)

Then we get the bounds of $C^{t-t_{0}-1}$ and $\frac{C^{t-t_{0}-1}-1}{C-1}$ with summing over the set of local iterations $t_{0}$ to $t_{0}+E-1$ respectively:

	$\displaystyle\sum_{t=t_{0}}^{t_{0}+E-1}C^{t-t_{0}-1}=\frac{C^{E}-1}{C-1}$		(53)
	$\displaystyle=\frac{\left(1+\frac{2}{E}\right)^{E}-1}{\left(1+\frac{2}{E}% \right)-1}\leq\frac{{\boldsymbol{e}}^{2}-1}{2}E\leq 4E$		(54)

We can also get

	$\displaystyle\sum_{t=t_{0}}^{t_{0}+E-1}\frac{C^{t-t_{0}-1}-1}{C-1}=\frac{1}{C-% 1}\left(\sum_{t=t_{0}}^{t_{0}+E-1}C^{t-t_{0}-1}-E\right)$		(55)
	$\displaystyle=\frac{1}{C-1}\left(\frac{C^{E}-1}{C-1}-E\right)=\frac{E}{2}\left% (\frac{E\left[(1+\frac{2}{E})^{E}-1\right]}{2}-E\right)$		(56)
	$\displaystyle=\frac{E^{2}}{2}\left(\frac{\left[(1+\frac{2}{E})^{E}-1\right]}{2% }-1\right)\leq\frac{E^{2}}{2}\left(\frac{{\boldsymbol{e}}^{2}-1}{2}-1\right)% \leq 2E^{2}$		(57)

Plugging all values into, we can get:

$\displaystyle\sum_{t=t_{0}}^{t_{0}+E-1}A^{t}$	$\displaystyle\leq\sum_{t=t_{0}}^{t_{0}+E-1}C^{t-t_{0}-1}A^{t_{0}}+(B_{0}+B_{1}% )\sum_{t=t_{0}}^{t_{0}+E-1}\frac{C^{t-t_{0}-1}-1}{C-1}$	(58)
	$\displaystyle\leq 4E\sum_{k=0}^{K}G_{k}^{2}H_{k}^{2}\left\\|E_{k}^{t_{0}}\right% \\|^{2}_{\mathcal{F}}$	(59)
	$\displaystyle+8E^{2}(1+E)(\eta^{t_{0}})^{2}\sum_{k=0}^{K}(L_{k})^{2}\mathbb{E}% _{t_{0}}\left[\left\\|\nabla_{k}{F}({\Theta}^{t_{0}}_{k})\right\\|^{2}\right]$	(60)
	$\displaystyle+16E^{2}(1+E)\sum_{k=0}^{K}G_{k}^{2}H_{k}^{2}\left\\|E_{k}^{t_{0}}% \right\\|^{2}_{\mathcal{F}}$	(61)
	$\displaystyle\leq 16E^{3}(\eta^{t_{0}})^{2}\sum_{k=0}^{K}(L_{k})^{2}\mathbb{E}% _{t_{0}}\left[\left\\|\nabla_{k}{F}({\Theta}^{t_{0}})\right\\|^{2}\right]$	(62)
	$\displaystyle+4(4E^{2}(1+E)+E)\sum_{k=0}^{K}G_{k}^{2}H_{k}^{2}\left\\|E_{k}^{t_% {0}}\right\\|^{2}_{\mathcal{F}}$	(63)
	$\displaystyle\leq 16E^{3}(\eta^{t_{0}})^{2}\sum_{k=0}^{K}(L_{k})^{2}\mathbb{E}% _{t_{0}}\left[\left\\|\nabla_{k}{F}({\Theta}^{t_{0}})\right\\|^{2}\right]$	(64)
	$\displaystyle+36E^{3}\sum_{k=0}^{K}G_{k}^{2}H_{k}^{2}\left\\|E_{k}^{t_{0}}% \right\\|^{2}_{\mathcal{F}}$	(65)

-D Proof of Theorem 1

Here we will derive Theorem 1, first, we set $t_{0}^{+}=t_{0}+E-1$ , then we get:

	$\displaystyle{F}({\Theta}^{t_{0}^{+}})-{F}({\Theta}^{t_{0}})\leq\left\langle% \nabla{F}({\Theta}^{t_{0}}),{\Theta}^{t_{0}^{+}}-{\Theta}^{t_{0}}\right\rangle% +\frac{L}{2}\left\\|{\Theta}^{t_{0}^{+}}-{\Theta}^{t_{0}}\right\\|^{2}$		(66)
	$\displaystyle=-\left\langle\nabla{F}({\Theta}^{t_{0}}),\sum_{t=t_{0}}^{t_{0}^{% +}}\eta^{t_{0}}\hat{\textbf{G}}^{t}\right\rangle+\frac{L}{2}\left\\|\sum_{t=t_{% 0}}^{t_{0}^{+}}\eta^{t_{0}}\hat{\textbf{G}}^{t}\right\\|^{2}$		(67)
	$\displaystyle\leq-\sum_{t=t_{0}}^{t_{0}^{+}}\eta^{t_{0}}\left\langle\nabla{F}(% {\Theta}^{t_{0}}),\hat{\textbf{G}}^{t}\right\rangle+\frac{LE}{2}\sum_{t=t_{0}}% ^{t_{0}^{+}}(\eta^{t_{0}})^{2}\left\\|\hat{\textbf{G}}^{t}\right\\|^{2}$		(68)
	$\displaystyle\leq\sum_{t=t_{0}}^{t_{0}^{+}}\eta^{t_{0}}\left\langle-\nabla{F}(% {\Theta}^{t_{0}}),\hat{\textbf{G}}^{t}-{\textbf{G}}^{t_{0}}\right\rangle-\sum_% {t=t_{0}}^{t_{0}^{+}}\eta^{t_{0}}\left\langle\nabla{F}({\Theta}^{t_{0}}),{% \textbf{G}}^{t_{0}}\right\rangle$		(69)
	$\displaystyle+LE\sum_{t=t_{0}}^{t_{0}^{+}}(\eta^{t_{0}})^{2}\left\\|\hat{% \textbf{G}}^{t}-{\textbf{G}}^{t_{0}}\right\\|^{2}+LE\sum_{t=t_{0}}^{t_{0}^{+}}(% \eta^{t_{0}})^{2}\left\\|{\textbf{G}}^{t_{0}}\right\\|^{2}$		(70)
	$\displaystyle\leq\frac{1}{2}\sum_{t=t_{0}}^{t_{0}^{+}}\eta^{t_{0}}\left\\|% \nabla{F}({\Theta}^{t_{0}})\right\\|^{2}+\frac{1}{2}\sum_{t=t_{0}}^{t_{0}^{+}}% \eta^{t_{0}}\left\\|\hat{\textbf{G}}^{t}-{\textbf{G}}^{t_{0}}\right\\|^{2}-\sum_% {t=t_{0}}^{t_{0}^{+}}\eta^{t_{0}}\left\langle\nabla{F}({\Theta}^{t_{0}}),{% \textbf{G}}^{t_{0}}\right\rangle$		(71)
	$\displaystyle+LE\sum_{t=t_{0}}^{t_{0}^{+}}(\eta^{t_{0}})^{2}\left\\|\hat{% \textbf{G}}^{t}-{\textbf{G}}^{t_{0}}\right\\|^{2}+LE\sum_{t=t_{0}}^{t_{0}^{+}}(% \eta^{t_{0}})^{2}\left\\|{\textbf{G}}^{t_{0}}\right\\|^{2}$		(72)

where the last term follows $A\cdot B=\frac{1}{2}A^{2}+\frac{1}{2}B^{2}-\frac{1}{2}(A-B)^{2}$ . Then we apply the expectation $\mathbb{E}_{t_{0}}$ to both sides of the last term.

	$\displaystyle\mathbb{E}_{t_{0}}[{F}({\Theta}^{t_{0}^{+}})]-{F}({\Theta}^{t_{0}})$		(73)
	$\displaystyle\leq-\frac{1}{2}\sum_{t=t_{0}}^{t_{0}^{+}}\eta^{t_{0}}\left\\|% \nabla{F}({\Theta}^{t_{0}})\right\\|^{2}+\frac{1}{2}\sum_{t=t_{0}}^{t_{0}^{+}}% \eta^{t_{0}}(1+2LE\eta^{t_{0}})\mathbb{E}_{t_{0}}\left\\|\hat{\textbf{G}}^{t}-{% \textbf{G}}^{t_{0}}\right\\|^{2}+LE\sum_{t=t_{0}}^{t_{0}^{+}}(\eta^{t_{0}})^{2}% \mathbb{E}_{t_{0}}\left\\|{\textbf{G}}^{t_{0}}\right\\|^{2}$		(74)
	$\displaystyle\leq-\frac{1}{2}\sum_{t=t_{0}}^{t_{0}^{+}}\eta^{t_{0}}(1-2LE\eta^% {t_{0}})\left\\|\nabla{F}({\Theta}^{t_{0}})\right\\|^{2}+\frac{1}{2}\sum_{t=t_{0% }}^{t_{0}^{+}}\eta^{t_{0}}(1+2LE\eta^{t_{0}})\mathbb{E}_{t_{0}}\left\\|\hat{% \textbf{G}}^{t}-{\textbf{G}}^{t_{0}}\right\\|^{2}$		(75)
	$\displaystyle=-\frac{E}{2}\eta^{t_{0}}(1-2LE\eta^{t_{0}})\left\\|\nabla{F}({% \Theta}^{t_{0}})\right\\|^{2}+\frac{1}{2}\sum_{t=t_{0}}^{t_{0}^{+}}\eta^{t_{0}}% (1+2LE\eta^{t_{0}})\mathbb{E}_{t_{0}}\left\\|\hat{\textbf{G}}^{t}-{\textbf{G}}^% {t_{0}}\right\\|^{2}$		(76)

where the Eq.74 follows ${\textbf{G}}^{t_{0}}=\nabla{F}({\Theta}^{t_{0}})$ since we consider the full batch participation in each training round. If mini-batch is considered here there will have additional terms. Then with Lemma 2, we have:

	$\displaystyle\mathbb{E}_{t_{0}}[{F}({\Theta}^{t_{0}^{+}})]-{F}({\Theta}^{t_{0}})$		(77)
	$\displaystyle\leq-\frac{E}{2}\eta^{t_{0}}(1-2LE\eta^{t_{0}})\left\\|\nabla{F}({% \Theta}^{t_{0}})\right\\|^{2}+8E^{3}(\eta^{t_{0}})^{3}(1+2LE\eta^{t_{0}})\sum_{% k=0}^{K}(L_{k})^{2}\left\\|\nabla_{k}{F}({\Theta}^{t_{0}}_{k})\right\\|^{2}$		(78)
	$\displaystyle+18E^{3}\eta^{t_{0}}(1+2LE\eta^{t_{0}})\sum_{k=0}^{K}H_{k}^{2}G_{% k}^{2}\left\\|E^{t_{0}}_{k}\right\\|^{2}_{\mathcal{F}}$		(79)

Then by let $\eta^{t_{0}}\leq\frac{1}{16E\max\left\{L,\max_{k}L_{k}\right\}}$ , we can get the further bound:

	$\displaystyle\mathbb{E}_{t_{0}}[{F}({\Theta}^{t_{0}^{+}})]-{F}({\Theta}^{t_{0}% })\leq-\frac{E}{2}\sum_{k=0}^{K}\eta^{t_{0}}(1-2LE\eta^{t_{0}}-16E^{2}(L_{k})^% {2}(\eta^{t_{0}})^{2}-16E^{3}(L_{k})^{2}L(\eta^{t_{0}})^{3})\left\\|\nabla_{k}{% F}({\Theta}^{t_{0}}_{k})\right\\|^{2}$
	$\displaystyle+18E^{3}\eta^{t_{0}}(1+2L^{a}E\eta^{t_{0}})\sum_{k=0}^{K}H_{k}^{2% }G_{k}^{2}\left\\|E^{t_{0}}_{k}\right\\|^{2}_{\mathcal{F}}$		(80)
	$\displaystyle\leq-\frac{3E}{8}\eta^{t_{0}}\left\\|\nabla{F}({\Theta}^{t_{0}})% \right\\|^{2}+18E^{3}\eta^{t_{0}}(1+2LE\eta^{t_{0}})\sum_{k=0}^{K}H_{k}^{2}G_{k% }^{2}\left\\|E^{t_{0}}_{k}\right\\|^{2}_{\mathcal{F}}$		(81)

Then we rearranged the terms:

\displaystyle\eta^{t_{0}}\left\|\nabla{F}({\Theta}^{t_{0}})\right\|^{2}

\displaystyle\leq\frac{3\left[{F}({\Theta}^{t_{0}})-\mathbb{E}_{t_{0}}[{F}({% \Theta}^{t_{0}^{+}})]\right]}{E}+48E^{2}\eta^{t_{0}}(1+2LE\eta^{t_{0}})\sum_{k% =0}^{K}H_{k}^{2}G_{k}^{2}\left\|E^{t_{0}}_{k}\right\|^{2}_{\mathcal{F}}

(82)

summing over all global round $t_{0}=0,\dots,R-1$ with expectation:

\displaystyle\sum_{t_{0}=0}^{R-1}\eta^{t_{0}}\mathbb{E}[\left\|\nabla{F}({% \Theta}^{t_{0}})\right\|^{2}]

\displaystyle\leq\frac{3\left[{F}({\Theta}^{0})-\mathbb{E}_{t_{0}}[{F}({\Theta% }^{T})]\right]}{E}+48E^{2}\eta^{t_{0}}(1+2LE\eta^{t_{0}})\sum_{t_{0}=0}^{R-1}% \sum_{k=0}^{K}H_{k}^{2}G_{k}^{2}\mathbb{E}[\left\|E^{t_{0}}_{k}\right\|^{2}_{% \mathcal{F}}]

(83)

Then with $\eta^{t_{0}}=\eta$ and averaging over R global rounds, we get:

	$\displaystyle\frac{1}{R}\sum_{t_{0}=0}^{R-1}\mathbb{E}[\left\\|\nabla{F}({% \Theta}^{t_{0}})\right\\|^{2}]\leq\frac{3\left[{F}({\Theta}^{0})-\mathbb{E}_{t_% {0}}[{F}({\Theta}^{T})]\right]}{\eta RE}$		(84)
	$\displaystyle+\frac{96E^{2}}{R}(1+2LE\eta)\sum_{t_{0}=0}^{R-1}\sum_{k=0}^{K}H_% {k}^{2}G_{k}^{2}\sum_{j\neq 0}^{K}\Psi_{j}^{t_{0}}+\frac{96E^{2}}{R}(1+2LE\eta% )\sum_{t_{0}=0}^{R-1}\sum_{k=0}^{K}G_{k}^{2}H_{k}^{2}\sum_{j\neq 0,j\neq k}^{K% }\Omega_{j}^{t_{0}}$		(85)
	$\displaystyle\leq\frac{3\left[{F}({\Theta}^{0})-\mathbb{E}_{t_{0}}[{F}({\Theta% }^{T})]\right]}{\eta RE}+\frac{108E^{2}}{R}\sum_{t_{0}=0}^{R-1}\sum_{k=0}^{K}H% _{k}^{2}G_{k}^{2}\sum_{j\neq 0}^{K}\Psi_{j}^{t_{0}}+\frac{108E^{2}}{R}\sum_{t_% {0}=0}^{R-1}\sum_{k=0}^{K}G_{k}^{2}H_{k}^{2}\sum_{j\neq 0,j\neq k}^{K}\Omega_{% j}^{t_{0}}$		(86)

where the last term is from $\eta=\eta^{t_{0}}\leq\frac{1}{16E\max\left\{L,\max_{k}L_{k}\right\}}$ . This finishes the proof of Theorem 1.

-E Proof of Corollary 1

To get the corollary 1, we need further bound the $\Omega_{k}^{t}$ and $\Psi_{k}^{t}$ respectively. We begin by giving the bound of $\Omega_{k}^{t}$ .

$\displaystyle\Omega_{k}^{t}$	$\displaystyle=\mathbb{E}\\|h_{k}^{t}(\hat{\theta}_{k}^{t})-\hat{h}_{k}^{t}(\hat% {\theta}_{k}^{t})\\|^{2}$	(87)
	$\displaystyle=\mathbb{E}\\|h_{k}^{t}(\hat{\theta}_{k}^{t})\odot\textbf{1}-h_{k}% ^{t}(\hat{\theta}_{k}^{t})\odot\textbf{l}_{k}^{t}\\|^{2}=\mathbb{E}\\|h_{k}^{t}(% \hat{\theta}_{k}^{t})\odot(\textbf{1}-\textbf{l}_{k}^{t})\\|^{2}$	(88)
	$\displaystyle\leq\mathbb{E}\left\\|\left\langle h_{k}^{t}(\hat{\theta}_{k}^{t})% ,(\textbf{1}-\textbf{l}_{k}^{t})\right\rangle\right\\|^{2}\leq\mathbb{E}\left\\|% h_{k}^{t}(\hat{\theta}_{k}^{t})\right\\|^{2}\mathbb{E}\left\\|(\textbf{1}-% \textbf{l}_{k}^{t})\right\\|^{2}$	(89)
	$\displaystyle\leq\delta^{2}(\beta_{k}^{t})^{2}\leq\delta^{2}\beta_{k}^{t}$	(90)

The last term is derived under the Assumption 4 and $\beta_{k}^{t}$ are valued within the $(0,1)$ . Then we will show the bound of $\Psi_{k}^{t}$ as:

$\displaystyle\Psi_{k}^{t}$	$\displaystyle=\mathbb{E}\\|h_{k}^{t}({\theta}_{k}^{t})-h_{k}^{t}(\hat{\theta}_{% k}^{t})\\|^{2}$	(91)
	$\displaystyle\leq M_{k}^{2}\mathbb{E}\\|{\theta}_{k}^{t}-\hat{\theta}_{k}^{t}\\|% ^{2}\leq M_{k}^{2}\mathbb{E}\\|{\theta}_{k}^{t}\odot\textbf{1}-{\theta}_{k}^{t}% \odot\textbf{m}_{k}^{t}\\|^{2}$	(92)
	$\displaystyle\leq M_{k}^{2}\mathbb{E}\\|{\theta}_{k}^{t}\odot(\textbf{1}-% \textbf{m}_{k}^{t})\\|^{2}\leq M_{k}^{2}\mathbb{E}\left\\|\left\langle{\theta}_{% k}^{t},(\textbf{1}-\textbf{m}_{k}^{t})\right\rangle\right\\|^{2}$	(93)
	$\displaystyle\leq M_{k}^{2}\mathbb{E}\left\\|{\theta}_{k}^{t}\right\\|^{2}% \mathbb{E}\left\\|(\textbf{1}-\textbf{m}_{k}^{t})\right\\|^{2}$	(94)
	$\displaystyle\leq M_{k}^{2}\mu^{2}(\alpha_{k}^{t})^{2}\leq M_{k}^{2}\mu^{2}% \alpha_{k}^{t}$	(95)

Where Eq. 92 is due to Assumption 5. And the last term is also derived under the Assumption 4 and $\alpha_{k}^{t}$ are valued within the $(0,1)$ . This finishes the proof of Corollary 1.

	$\displaystyle\mathbb{E}_{t_{0}}\left\\|\hat{\textbf{G}}^{t}-{\textbf{G}}^{t_{0}% }\right\\|^{2}$
	$\displaystyle\leq 2(1+n)(\eta^{t_{0}})^{2}\sum_{k=0}^{K}(L_{k})^{2}\mathbb{E}_% {t_{0}}\left[\left\\|\nabla_{k}\hat{F}({\Gamma}^{t-1}_{k})\right\\|^{2}\right]+(% 1+\frac{1}{n})\sum_{k=0}^{K}\mathbb{E}_{t_{0}}\left[\left\\|\nabla_{k}\hat{F}({% \Gamma}^{t-1}_{k})-\nabla_{k}{F}({\Gamma}^{t_{0}}_{k})\right\\|^{2}\right]$		(34)
	$\displaystyle+8(1+n)\sum_{k=0}^{K}G_{k}^{2}H_{k}^{2}\left\\|E_{k}^{t_{0}}\right% \\|^{2}_{\mathcal{F}}$		(35)
	$\displaystyle\leq 2(1+n)(\eta^{t_{0}})^{2}\sum_{k=0}^{K}(L_{k})^{2}\mathbb{E}_% {t_{0}}\left[\left\\|\nabla_{k}\hat{F}({\Gamma}^{t-1}_{k})-\nabla_{k}{F}({% \Gamma}^{t_{0}}_{k})+\nabla_{k}{F}({\Gamma}^{t_{0}}_{k})\right\\|^{2}\right]$		(36)
	$\displaystyle+(1+\frac{1}{n})\sum_{k=0}^{K}\mathbb{E}_{t_{0}}\left[\left\\|% \nabla_{k}\hat{F}({\Gamma}^{t-1}_{k})-\nabla_{k}{F}({\Gamma}^{t_{0}}_{k})% \right\\|^{2}\right]+8(1+n)\sum_{k=0}^{K}G_{k}^{2}H_{k}^{2}\left\\|E_{k}^{t_{0}}% \right\\|^{2}_{\mathcal{F}}$		(37)
	$\displaystyle\leq 4(1+n)(\eta^{t_{0}})^{2}\sum_{k=0}^{K}(L_{k})^{2}\mathbb{E}_% {t_{0}}\left[\left\\|\nabla_{k}\hat{F}({\Gamma}^{t-1}_{k})-\nabla_{k}{F}({% \Gamma}^{t_{0}}_{k})\right\\|^{2}\right]$		(38)
	$\displaystyle+4(1+n)(\eta^{t_{0}})^{2}\sum_{k=0}^{K}(L_{k})^{2}\mathbb{E}_{t_{% 0}}\left[\left\\|\nabla_{k}{F}({\Gamma}^{t_{0}}_{k})\right\\|^{2}\right]$		(39)
	$\displaystyle+(1+\frac{1}{n})\sum_{k=0}^{K}\mathbb{E}_{t_{0}}\left[\left\\|% \nabla_{k}\hat{F}({\Gamma}^{t-1}_{k})-\nabla_{k}{F}({\Gamma}^{t_{0}}_{k})% \right\\|^{2}\right]+8(1+n)\sum_{k=0}^{K}G_{k}^{2}H_{k}^{2}\left\\|E_{k}^{t_{0}}% \right\\|^{2}_{\mathcal{F}}$		(40)
	$\displaystyle\leq\sum_{k=0}^{K}\left(4(1+n)(\eta^{t_{0}})^{2}(L_{k})^{2}+(1+% \frac{1}{n})\right)\mathbb{E}_{t_{0}}\left[\left\\|\nabla_{k}\hat{F}({\Gamma}^{% t-1}_{k})-\nabla_{k}{F}({\Gamma}^{t_{0}}_{k})\right\\|^{2}\right]$		(41)
	$\displaystyle+4(1+n)(\eta^{t_{0}})^{2}\sum_{k=0}^{K}(L_{k})^{2}\mathbb{E}_{t_{% 0}}\left[\left\\|\nabla_{k}{F}({\Gamma}^{t_{0}}_{k})\right\\|^{2}\right]+8(1+n)% \sum_{k=0}^{K}G_{k}^{2}H_{k}^{2}\left\\|E_{k}^{t_{0}}\right\\|^{2}_{\mathcal{F}}$		(42)

	$\displaystyle\mathbb{E}_{t_{0}}[{F}({\Theta}^{t_{0}^{+}})]-{F}({\Theta}^{t_{0}% })\leq-\frac{E}{2}\sum_{k=0}^{K}\eta^{t_{0}}(1-2LE\eta^{t_{0}}-16E^{2}(L_{k})^% {2}(\eta^{t_{0}})^{2}-16E^{3}(L_{k})^{2}L(\eta^{t_{0}})^{3})\left\\|\nabla_{k}{% F}({\Theta}^{t_{0}}_{k})\right\\|^{2}$
	$\displaystyle+18E^{3}\eta^{t_{0}}(1+2L^{a}E\eta^{t_{0}})\sum_{k=0}^{K}H_{k}^{2% }G_{k}^{2}\left\\|E^{t_{0}}_{k}\right\\|^{2}_{\mathcal{F}}$		(80)
	$\displaystyle\leq-\frac{3E}{8}\eta^{t_{0}}\left\\|\nabla{F}({\Theta}^{t_{0}})% \right\\|^{2}+18E^{3}\eta^{t_{0}}(1+2LE\eta^{t_{0}})\sum_{k=0}^{K}H_{k}^{2}G_{k% }^{2}\left\\|E^{t_{0}}_{k}\right\\|^{2}_{\mathcal{F}}$		(81)