Computation and Communication Efficient Lightweighting Vertical Federated Learning

Heqiang Wang1, Jieming Bian1, Lei Wang1
1Department of Electrical and Computer Engineering,
University of Miami, Coral Gables, FL 33146, USA
Abstract

The exploration of computational and communication efficiency within Federated Learning (FL) has emerged as a prominent and crucial field of study. While most existing efforts to enhance these efficiencies have focused on Horizontal FL, the distinct processes and model structures of Vertical FL preclude the direct application of Horizontal FL-based techniques. In response, we introduce the concept of Lightweight Vertical Federated Learning (LVFL), targeting both computational and communication efficiencies. This approach involves separate lightweighting strategies for the feature model, to improve computational efficiency, and for feature embedding, to enhance communication efficiency. Moreover, we establish a convergence bound for our LVFL algorithm, which accounts for both communication and computational lightweighting ratios. Our evaluation of the algorithm on a image classification dataset reveals that LVFL significantly alleviates computational and communication demands while preserving robust learning performance. This work effectively addresses the gaps in communication and computational efficiency within Vertical FL.

Index Terms:
Federated Learning, Model Pruning

I Introduction

Federated learning (FL) is a distributed machine learning paradigm that enables a set of clients with decentralized data to collaborate and learn a shared model under the coordination of a centralized server. In FL, data is stored on edge devices in a distributed manner, which reduces the amount of data that needs to be uploaded and decreases the risk of user privacy leakage. The majority of prior research focuses on horizontal federated learning (HFL), where participants share the same feature space but have distinct sample spaces. There’s growing interest in vertical federated learning (VFL) in recent years. In VFL, clients possess different feature spaces but share a common sample space. An example of VFL implementation is Webank’s collaboration with an invoice company to construct financial risk models for enterprise customers [1]. In this context, both entities possess distinct user feature datasets that, due to information security concerns, cannot be shared or transmitted. During VFL training, each client develops a feature model, converting raw data features into a vector representation known as ‘feature embedding’. Instead of transmitting their feature models, clients in VFL send these feature embeddings to the server. The server then integrates these embeddings into a head model to determine the final loss. A typical workflow of VFL is illustrated in Fig. 1. Clearly, VFL’s methodology diverges from HFL, introducing its own set of unique challenges. Many of these challenges cannot be addressed merely by adapting solutions from HFL. There is a need for an independent approach to address the VFL problem.

Refer to caption
(a)
Figure 1: Workflow of Vertical Federated Learning. Raw data and the feature model reside on the client side, while the server model is maintained on the server.

During FL training, the diversity among clients results in variations in computational and communication capabilities, raising the need for synchronization solutions. While previous studies within HFL have addressed this, in HFL, local and global models are uniform due to identical feature spaces across clients, necessitating only local model pruning to meet specific client requirements. Conversely, in VFL, feature spaces differ among clients, leading to individual training on local feature models and subsequent uploading of trained feature embeddings to the server. This approach imposes distinct computational and communication demands for each client. Although prior research in VFL has examined feature embedding compression to reduce communication load [2, 3], the critical issue of computational burden has been largely neglected. Addressing this computational load is arguably a more pressing challenge for VFL deployment.

To tackle the diverse computational burdens encountered in the deployment of VFL, this paper introduces the concept of Lightweight Vertical Federated Learning (LVFL), designed to mitigate challenges in both computational and communication efficiency. The principal contributions of this study are summarized as follows: (1) This study introduces the LVFL framework, tailored to accommodate the varied communication and computation capabilities of heterogeneous workers. LVFL dynamically adjusts computational expenses for training feature models and communication costs for updating feature embeddings, ensuring efficiency across diverse system architectures. (2) We establish the comprehensive convergence analysis of the LVFL algorithm and present the convergence bound that elucidates the relationship between the cumulative feature model and feature embedding lightweighting error, as well as the communication and computation lightweighting ratio. (3) Our experiments, conducted on the CIFAR-10 dataset, validate the capability of the LVFL algorithm to adjust the computation and communication lightweighting ratio both constantly and dynamically. The proposed LVFL algorithm yields comparable test accuracy levels while significantly reduced communication and computation burden.

The rest of this paper is organized as follows. In Section II, we discuss related works on VFL and model pruning. Section III presents the system model and formulates the learning objective. In Section IV, we introduce the detail of LVFL algorithm. Section V presents the convergence analysis of LVFL. The experimental results of LVFL are presented in Section VI. Finally, Section VII concludes the paper. The theoretical proof details are accessible in the online supplementary material with the following link: https://github.com/ystex/LVFL/tree/main.

II Related Work

In recent years, VFL has attracted significant attention. The concept of Federated Learning with vertically partitioned data was introduced in [4]. Comprehensive surveys on VFL have been presented in [5, 6, 7]. However, VFL, distinct from HFL, poses its own set of challenges. Some research, such as [8, 9], aims to optimize data utilization to enhance the joint model’s effectiveness in VFL. In contrast, studies like [10] focus on implementing privacy-preserving protocols to counter potential data leakage threats. Beyond these challenges, there has been exploration into improving training efficiency in VFL. These endeavors primarily target reducing communication overhead, either by allowing participants to conduct multiple local updates in each iteration [11, 12] or by compressing the data exchanged between parties [13, 2]. While these studies primarily emphasize reducing communication overhead, a gap in these studies is the oversight of computational efficiency as a means to enhance training efficiency.

Modern DNNs often comprise tens of millions of parameters [14]. Storing and training such highly parameterized models on resource-constrained devices within FL can be challenging. Therefore, the lightweighting of models during training is an indispensable topic [15]. Considering the diverse computational capabilities of training devices, a viable strategy involves dynamically adjusting the size of the local model and diminishing its parameters, primarily via model pruning. Generally, model pruning techniques fall into two categories: unstructured pruning [16, 17] and structured pruning [18, 19, 20]. Unstructured pruning eliminates non-essential weights, specifically the inter-neuronal connections, in neural networks, resulting in significant parameter sparsity [21]. However, the ensuing sparse matrix’s irregularity poses challenges in parameter compression in memory, requiring specialized hardware or software libraries for efficient training [22]. In contrast, structured pruning aims to discard redundant model structures, like convolutional filters [19], without introducing sparsity. Consequently, the derived model can be considered as a subset or sub-configuration of the initial neural network, encompassing fewer parameters, thereby facilitating a reduction in computation overhead.

While a substantial portion of model pruning research centers on centralized learning contexts, recent studies have ventured into its application within FL settings [23, 24]. The work in [23] presents PruneFL, an adaptive FL strategy that commences with initial pruning at a selected client and continues with subsequent pruning throughout the FL process, aiming to reduce both communication and computational overheads. The research outlined in [24] introduces FedMP, a framework that employs adaptive pruning and recovery techniques to enhance both communication and computation efficiency, leveraging a multi-armed bandit algorithm for the selection of pruning ratios. Notably, the aforementioned model pruning techniques predominantly originate from HFL scenarios. However, due to the inherent differences between HFL and VFL, these techniques cannot be seamlessly applied to the VFL setting. Therefore driving the motivation for our study on model and feature embedding pruning in the VFL scenarios.

III System Model

To begin, we define several foundational concepts of the VFL. The VFL system consists of K𝐾Kitalic_K clients and a central server. The dataset, denoted as xN×Dxsuperscript𝑁𝐷\textbf{x}\in\mathbb{R}^{N\times D}x ∈ blackboard_R start_POSTSUPERSCRIPT italic_N × italic_D end_POSTSUPERSCRIPT, has N𝑁Nitalic_N representing the total data samples and D𝐷Ditalic_D indicating the number of features. The row indexed by i𝑖iitalic_i in x corresponds to a data sample xisuperscript𝑥𝑖x^{i}italic_x start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT. Each sample, xisuperscript𝑥𝑖x^{i}italic_x start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT, possesses a unique set of features, xkisubscriptsuperscript𝑥𝑖𝑘x^{i}_{k}italic_x start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT retained by client k𝑘kitalic_k, and identified as a distinct subset. Every instance xisuperscript𝑥𝑖x^{i}italic_x start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT is paired with a label yisuperscript𝑦𝑖y^{i}italic_y start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT. The vector yN×1ysuperscript𝑁1\textbf{y}\in\mathbb{R}^{N\times 1}y ∈ blackboard_R start_POSTSUPERSCRIPT italic_N × 1 end_POSTSUPERSCRIPT represents all sample labels. Client k𝑘kitalic_k maintains a dataset of local features, xkN×Dksubscriptx𝑘superscript𝑁subscript𝐷𝑘\textbf{x}_{k}\in\mathbb{R}^{N\times D_{k}}x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_N × italic_D start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUPERSCRIPT, with Dksubscript𝐷𝑘D_{k}italic_D start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT denoting the number of features for client k𝑘kitalic_k and i𝑖iitalic_i-th row signifying the respective features xkisubscriptsuperscript𝑥𝑖𝑘x^{i}_{k}italic_x start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT. In this context, we assume that both the server and all clients retain a copy of the labels y, consistent with previous studies.

Each client, represented by kK𝑘𝐾k\in Kitalic_k ∈ italic_K has a unique set of feature model parameters, θksubscript𝜃𝑘\theta_{k}italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT. The server maintains the head model, θ0subscript𝜃0\theta_{0}italic_θ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT. The function hk(θk,xki)subscript𝑘subscript𝜃𝑘superscriptsubscript𝑥𝑘𝑖h_{k}(\theta_{k},x_{k}^{i})italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ) denotes the feature embedding extracted from sample xkisuperscriptsubscript𝑥𝑘𝑖x_{k}^{i}italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT by client k𝑘kitalic_k. This feature embedding operation transforms the high-dimensional raw data into a lower-dimensional representation through multiple layers of DNN, effectively capturing essential information from the input data while significantly reducing its dimension. The overall model’s parameters are collectively denoted as Θ=[θ0,θ1,,θK]Θsuperscriptsubscriptsuperscript𝜃top0subscriptsuperscript𝜃top1subscriptsuperscript𝜃top𝐾top\Theta=[\theta^{\top}_{0},\theta^{\top}_{1},...,\theta^{\top}_{K}]^{\top}roman_Θ = [ italic_θ start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_θ start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_θ start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT ] start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT. We can then express the learning objective as minimizing the subsequent equation:

F(Θ;x;y):=1Ni=1Nl(θ0,{hk(θk;xki)}k=1K;yi)assign𝐹Θxy1𝑁superscriptsubscript𝑖1𝑁𝑙subscript𝜃0superscriptsubscriptsubscript𝑘subscript𝜃𝑘subscriptsuperscript𝑥𝑖𝑘𝑘1𝐾superscript𝑦𝑖\displaystyle F(\Theta;\textbf{x};\textbf{y}):=\frac{1}{N}\sum_{i=1}^{N}l(% \theta_{0},\{h_{k}(\theta_{k};x^{i}_{k})\}_{k=1}^{K};y^{i})italic_F ( roman_Θ ; x ; y ) := divide start_ARG 1 end_ARG start_ARG italic_N end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_l ( italic_θ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , { italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ; italic_x start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) } start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT ; italic_y start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ) (1)

where l()𝑙l(\cdot)italic_l ( ⋅ ) denotes the loss function for a single data sample. Within the server model framework, the equation h0(θ0,xi)=θ0subscript0subscript𝜃0superscript𝑥𝑖subscript𝜃0h_{0}(\theta_{0},x^{i})=\theta_{0}italic_h start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_x start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ) = italic_θ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT consistently applies. To enhance clarity and ensure consistency in notation throughout this paper, we implement several simplifications in our subsequent discussions. Firstly, the feature embedding for dataset xksubscriptx𝑘\textbf{x}_{k}x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT, is represented by hk(θ;xk)subscript𝑘𝜃subscriptx𝑘h_{k}(\theta;\textbf{x}_{k})italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_θ ; x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ), which we frequently abbreviate to hk(θk;xk)=hk(xk)subscript𝑘subscript𝜃𝑘subscriptx𝑘subscript𝑘subscriptx𝑘h_{k}(\theta_{k};\textbf{x}_{k})=h_{k}(\textbf{x}_{k})italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ; x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) = italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ). Secondly, we allocate k=0𝑘0k=0italic_k = 0 to denote the head model in FC, where h0(θ0)=θ0subscript0subscript𝜃0subscript𝜃0h_{0}(\theta_{0})=\theta_{0}italic_h start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) = italic_θ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT for convenience. Finally, we often represent F(Θ)=F(h0(θ0),h1(θ1),,hK(θK))𝐹Θ𝐹subscript0subscript𝜃0subscript1subscript𝜃1subscript𝐾subscript𝜃𝐾F(\Theta)=F(h_{0}(\theta_{0}),h_{1}(\theta_{1}),...,h_{K}(\theta_{K}))italic_F ( roman_Θ ) = italic_F ( italic_h start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) , italic_h start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) , … , italic_h start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT ) ), particularly in the subsequent algorithm details and convergence analysis. In the following section, we will elaborate on the specific algorithmic workflow of LVFL.

IV Lightweighting Vertical Federated Learning

In this section, we introduce the detail of LVFL algorithm. Considering the unique structural attributes of VFL, where computational demands are determined by the model size and communication requirements correlate with feature embedding size, the lightweighting methods derived from HFL are not directly applicable to VFL. To tackle this limitation, we adopt a progressive structured pruning strategy tailored for each client’s feature model and a unstructured pruning strategy tailored for each client’s feature embedding. Those approach is guided by the distinct computational and communication capacities of each client. Through these pruning techniques, we ensure consistent synchronization among all clients and facilitate efficient updates to the server.

In this scenario, we introduce a dual lightweighting ratio mechanism to distinctly optimize computation and communication. We denote αktsuperscriptsubscript𝛼𝑘𝑡\alpha_{k}^{t}italic_α start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT as the computation lightweighting ratio and βktsuperscriptsubscript𝛽𝑘𝑡\beta_{k}^{t}italic_β start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT as the communication lightweighting ratio. A higher αktsuperscriptsubscript𝛼𝑘𝑡\alpha_{k}^{t}italic_α start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT implies reduced demand for CPU cycles when processing a data sample. Correspondingly, a larger βktsuperscriptsubscript𝛽𝑘𝑡\beta_{k}^{t}italic_β start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT signifies a reduced need for embedding size during uploads. Suppose that each client performs E𝐸Eitalic_E local iterations before transmitting the model to the server, and the training process spans over a total of R𝑅Ritalic_R rounds, resulting in a cumulative duration of T=RE𝑇𝑅𝐸T=REitalic_T = italic_R italic_E local iterations. Algorithm 1 showcases the workflow of LVFL, integrating the dual lightweighting ratio mechanism.

During each global round, when tE=0conditional𝑡𝐸0t\mid E=0italic_t ∣ italic_E = 0, every client will decide the unstructured pruned feature embedding h^kt(θ^kt;xk)superscriptsubscript^𝑘𝑡superscriptsubscript^𝜃𝑘𝑡subscriptx𝑘\hat{h}_{k}^{t}(\hat{\theta}_{k}^{t};\textbf{x}_{k})over^ start_ARG italic_h end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ( over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ; x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) based on the communication lightweighting ratio βktsuperscriptsubscript𝛽𝑘𝑡\beta_{k}^{t}italic_β start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT (Line 5), and then transmit this pruned feature embedding to the server for the global update (Line 6). Once the server has gathered all embeddings, it proceeds to get the model representation Φ^t0superscript^Φsubscript𝑡0\hat{\Phi}^{t_{0}}over^ start_ARG roman_Φ end_ARG start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT. Subsequently, it distributes the model representation to all clients (Lines 8-9), and t0subscript𝑡0t_{0}italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT is the start of the most recent global round when embeddings were shared. After receiving the model representation, each client needs to prune the feature model θ^ktsuperscriptsubscript^𝜃𝑘𝑡\hat{\theta}_{k}^{t}over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT according to its computation lightweighting ratio αktsuperscriptsubscript𝛼𝑘𝑡\alpha_{k}^{t}italic_α start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT (Line 11). Following this, each client will engage in E𝐸Eitalic_E rounds of local iterations using both the pruned embeddings received during iteration t0subscript𝑡0t_{0}italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT and its own unpruned embedding hk(θ^kt;xk)subscript𝑘superscriptsubscript^𝜃𝑘𝑡subscriptx𝑘h_{k}(\hat{\theta}_{k}^{t};\textbf{x}_{k})italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ; x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ). The collection of embeddings from other clients, excluding client k𝑘kitalic_k, is represented as Φ^kt0subscriptsuperscript^Φsubscript𝑡0𝑘\hat{\Phi}^{t_{0}}_{-k}over^ start_ARG roman_Φ end_ARG start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT - italic_k end_POSTSUBSCRIPT, and all embeddings are collectively denoted as Φ^ktsubscriptsuperscript^Φ𝑡𝑘\hat{\Phi}^{t}_{k}over^ start_ARG roman_Φ end_ARG start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT. During each local iteration, client k𝑘kitalic_k will update the feature model θ^ksubscript^𝜃𝑘\hat{\theta}_{k}over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT using mini-batch SGD with step size ηt0superscript𝜂subscript𝑡0\eta^{t_{0}}italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT (Lines 15-16).

Algorithm 1 LVFL
1:Initialize: The initial local model θk0superscriptsubscript𝜃𝑘0\theta_{k}^{0}italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT for all clients k𝑘kitalic_k and server model θ00superscriptsubscript𝜃00\theta_{0}^{0}italic_θ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT.
2:for t=1,2,,T1𝑡12𝑇1t=1,2,...,T-1italic_t = 1 , 2 , … , italic_T - 1 do
3:     if tE=0conditional𝑡𝐸0t\mid E=0italic_t ∣ italic_E = 0 then
4:         for k=1,2,,K𝑘12𝐾k=1,2,...,Kitalic_k = 1 , 2 , … , italic_K in parallel do
5:              Determine h^kt(θ^kt;xk)superscriptsubscript^𝑘𝑡superscriptsubscript^𝜃𝑘𝑡subscriptx𝑘\hat{h}_{k}^{t}(\hat{\theta}_{k}^{t};\textbf{x}_{k})over^ start_ARG italic_h end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ( over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ; x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) according to βktsuperscriptsubscript𝛽𝑘𝑡\beta_{k}^{t}italic_β start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT
6:              Send h^kt(θ^kt;xk)superscriptsubscript^𝑘𝑡superscriptsubscript^𝜃𝑘𝑡subscriptx𝑘\hat{h}_{k}^{t}(\hat{\theta}_{k}^{t};\textbf{x}_{k})over^ start_ARG italic_h end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ( over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ; x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) to server
7:         end for
8:         Server collects model representation Φ^t0superscript^Φsubscript𝑡0\hat{\Phi}^{t_{0}}over^ start_ARG roman_Φ end_ARG start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT
9:         Server sends Φ^t0superscript^Φsubscript𝑡0\hat{\Phi}^{t_{0}}over^ start_ARG roman_Φ end_ARG start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT to all clients
10:         for k=1,2,,K𝑘12𝐾k=1,2,...,Kitalic_k = 1 , 2 , … , italic_K in parallel do
11:              Prune the feature model θ^ktsuperscriptsubscript^𝜃𝑘𝑡\hat{\theta}_{k}^{t}over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT according to αktsuperscriptsubscript𝛼𝑘𝑡\alpha_{k}^{t}italic_α start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT
12:         end for
13:     end if
14:     for k=0,1,2,,K𝑘012𝐾k=0,1,2,...,Kitalic_k = 0 , 1 , 2 , … , italic_K in parallel do
15:         Obtain Φ^kt{Φ^kt0,hk(θ^kt;xk)}subscriptsuperscript^Φ𝑡𝑘subscriptsuperscript^Φsubscript𝑡0𝑘subscript𝑘superscriptsubscript^𝜃𝑘𝑡subscriptx𝑘\hat{\Phi}^{t}_{k}\leftarrow\left\{\hat{\Phi}^{t_{0}}_{-k},h_{k}(\hat{\theta}_% {k}^{t};\textbf{x}_{k})\right\}over^ start_ARG roman_Φ end_ARG start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ← { over^ start_ARG roman_Φ end_ARG start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT - italic_k end_POSTSUBSCRIPT , italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ; x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) }
16:         Update feature or head model θ^kt+1superscriptsubscript^𝜃𝑘𝑡1\hat{\theta}_{k}^{t+1}over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t + 1 end_POSTSUPERSCRIPT.
17:     end for
18:end for

We utilize an element-wise product to express the pruned feature model and feature embedding. Specifically, the pruned feature model can be represented as θ^kt=θktmktsuperscriptsubscript^𝜃𝑘𝑡direct-productsuperscriptsubscript𝜃𝑘𝑡superscriptsubscriptm𝑘𝑡\hat{\theta}_{k}^{t}=\theta_{k}^{t}\odot\textbf{m}_{k}^{t}over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT = italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ⊙ m start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT, where θktsuperscriptsubscript𝜃𝑘𝑡\theta_{k}^{t}italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT denotes the original structure of client k𝑘kitalic_k’s local model without pruning, and mktsuperscriptsubscriptm𝑘𝑡\textbf{m}_{k}^{t}m start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT signifies the mask vector, containing zeros for parameters in θktsuperscriptsubscript𝜃𝑘𝑡\theta_{k}^{t}italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT that are pruned. Similarly, for the pruned feature embedding, we express it as h^kt(θ^kt)=hkt(θ^kt)lktsuperscriptsubscript^𝑘𝑡superscriptsubscript^𝜃𝑘𝑡direct-productsuperscriptsubscript𝑘𝑡superscriptsubscript^𝜃𝑘𝑡superscriptsubscriptl𝑘𝑡\hat{h}_{k}^{t}(\hat{\theta}_{k}^{t})=h_{k}^{t}(\hat{\theta}_{k}^{t})\odot% \textbf{l}_{k}^{t}over^ start_ARG italic_h end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ( over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) = italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ( over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) ⊙ l start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT, where hkt(θ^kt)superscriptsubscript𝑘𝑡superscriptsubscript^𝜃𝑘𝑡h_{k}^{t}(\hat{\theta}_{k}^{t})italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ( over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) represents the initial embedding of client k𝑘kitalic_k without pruning, and lktsuperscriptsubscriptl𝑘𝑡\textbf{l}_{k}^{t}l start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT denotes the mask vector with zeros for parameters in hkt(θ^kt)superscriptsubscript𝑘𝑡superscriptsubscript^𝜃𝑘𝑡h_{k}^{t}(\hat{\theta}_{k}^{t})italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ( over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) that have been pruned. During each global round, the client’s feature model and feature embedding undergo adjustment based on the computation lightweighting ratio αktsuperscriptsubscript𝛼𝑘𝑡\alpha_{k}^{t}italic_α start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT and the communication lightweighting ratio βktsuperscriptsubscript𝛽𝑘𝑡\beta_{k}^{t}italic_β start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT. Next we explain in detail how to prune feature model and feature embedding to achieve lightweight separately.

Feature Model Lightweighting. In the feature model, we adopt a structured model pruning method to adjust the feature model from θktsuperscriptsubscript𝜃𝑘𝑡{\theta}_{k}^{t}italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT to θ^ktsuperscriptsubscript^𝜃𝑘𝑡\hat{\theta}_{k}^{t}over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT, guided by the computation lightweighting ratio αktsuperscriptsubscript𝛼𝑘𝑡\alpha_{k}^{t}italic_α start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT. This approach is consistent with existing research. To simplify the model and avoid introducing layer-specific hyper-parameters, we apply a uniform pruning ratio across all layers, as suggested by previous studies [19]. Within each layer, filters or neurons are ranked based on their importance scores, and those with lower scores are pruned based on the predetermined lightweighting ratio. We recommend employing the l1subscript𝑙1l_{1}italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT norm to calculate these importance scores.

Feature Embedding Lightweighting. For feature embedding, we employ the unstructured model pruning method to refine the feature embedding from hkt(θ^kt)superscriptsubscript𝑘𝑡superscriptsubscript^𝜃𝑘𝑡h_{k}^{t}(\hat{\theta}_{k}^{t})italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ( over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) to h^kt(θ^kt)superscriptsubscript^𝑘𝑡superscriptsubscript^𝜃𝑘𝑡\hat{h}_{k}^{t}(\hat{\theta}_{k}^{t})over^ start_ARG italic_h end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ( over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ), guided by the communication lightweighting ratio βktsuperscriptsubscript𝛽𝑘𝑡\beta_{k}^{t}italic_β start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT. This approach involves nullifying weights of the lowest absolute values to meet the pruning criteria. Although this method preserves the structure of the embedding, it substantially reduces communication costs by transmitting only the non-zero values.

Leveraging the aforementioned mechanism, clients can streamline both the feature model and feature embedding, directed by the parameters αktsuperscriptsubscript𝛼𝑘𝑡\alpha_{k}^{t}italic_α start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT and βktsuperscriptsubscript𝛽𝑘𝑡\beta_{k}^{t}italic_β start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT. Our approach seeks to optimally diminish the model’s complexity based on demand. Nevertheless, given that structured pruning is applied to the feature model, it is imperative to reassess the necessity for further pruning of the feature model prior to the training begin of each global round. Conversely, for feature embedding, which utilizes unstructured pruning, no additional verification step is required. Subsequently, we elucidate the mechanism for determining the computation lightweighting ratio of the feature model in detail.

  1. 1.

    If αkt>αktsuperscriptsubscript𝛼𝑘𝑡superscriptsubscript𝛼𝑘limit-from𝑡\alpha_{k}^{t}>{\alpha}_{k}^{t-}italic_α start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT > italic_α start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t - end_POSTSUPERSCRIPT. Where αkt=max{αk1,,αkt1}superscriptsubscript𝛼𝑘limit-from𝑡superscriptsubscript𝛼𝑘1superscriptsubscript𝛼𝑘𝑡1{\alpha}_{k}^{t-}=\max\left\{{\alpha}_{k}^{1},\cdots,{\alpha}_{k}^{t-1}\right\}italic_α start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t - end_POSTSUPERSCRIPT = roman_max { italic_α start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT , ⋯ , italic_α start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT } denotes the maximum computation lightweighting ratio previously attained by client k𝑘kitalic_k’s feature model before global round t𝑡titalic_t. In this case further computation lightweighting is executed to meet the required lightweighting ratio for the current round.

  2. 2.

    If αktαktsuperscriptsubscript𝛼𝑘𝑡superscriptsubscript𝛼𝑘limit-from𝑡\alpha_{k}^{t}\leq{\alpha}_{k}^{t-}italic_α start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ≤ italic_α start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t - end_POSTSUPERSCRIPT. Since the computation lightweighting ratio has been attained, the existing feature model was sufficiently streamlined, negating the need for further lightweighting in the current round.

This iterative process is repeated until the training converges or until predetermined termination conditions are satisfied. The subsequent section will delve into the analysis of the convergence bound of the LVFL algorithm.

V Convergence Analysis

In this section, we will delve into the convergence analysis of our LVFL algorithm. To begin, we need to establish some notations and definitions that will be utilized in the subsequent discussion. Specifically, we will define two types of errors that arise from the lightweighting mechanisms: communication lightweighting error and computation lightweighting error.

Communication Lightweight Error: This error quantifies the degree to which the lightweighting feature embedding is an accurate approximation of the original feature embedding. It is mathematically expressed as ϵkt:=hkt(θkt)h^kt(θkt)assignsuperscriptsubscriptitalic-ϵ𝑘𝑡superscriptsubscript𝑘𝑡superscriptsubscript𝜃𝑘𝑡superscriptsubscript^𝑘𝑡superscriptsubscript𝜃𝑘𝑡\epsilon_{k}^{t}:=h_{k}^{t}({\theta}_{k}^{t})-\hat{h}_{k}^{t}({\theta}_{k}^{t})italic_ϵ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT := italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) - over^ start_ARG italic_h end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ). The squared communication lightweighting error from client k𝑘kitalic_k at round t𝑡titalic_t is denoted as Ωkt𝔼ϵkt2superscriptsubscriptΩ𝑘𝑡𝔼superscriptnormsuperscriptsubscriptitalic-ϵ𝑘𝑡2\Omega_{k}^{t}\triangleq\mathbb{E}\|\epsilon_{k}^{t}\|^{2}roman_Ω start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ≜ blackboard_E ∥ italic_ϵ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT.

Computation Lightweight Error: This error assesses how closely the lightweighting feature model resembles the original feature model. It is defined as φkt:=hkt(θkt)hkt(θ^kt)assignsuperscriptsubscript𝜑𝑘𝑡superscriptsubscript𝑘𝑡superscriptsubscript𝜃𝑘𝑡superscriptsubscript𝑘𝑡superscriptsubscript^𝜃𝑘𝑡\varphi_{k}^{t}:=h_{k}^{t}({\theta}_{k}^{t})-h_{k}^{t}(\hat{\theta}_{k}^{t})italic_φ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT := italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) - italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ( over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ). The squared computation lightweighting error from client k𝑘kitalic_k at round t𝑡titalic_t as Ψkt𝔼φkt2superscriptsubscriptΨ𝑘𝑡𝔼superscriptnormsuperscriptsubscript𝜑𝑘𝑡2\Psi_{k}^{t}\triangleq\mathbb{E}\|\varphi_{k}^{t}\|^{2}roman_Ψ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ≜ blackboard_E ∥ italic_φ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT.

Let G^tsuperscript^G𝑡\hat{\textbf{G}}^{t}over^ start_ARG G end_ARG start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT be the stacked partial derivatives at iteration t𝑡titalic_t:

G^t:=[(0F(Φ^0t;y))T,,(KF(Φ^Kt;y))T]Tassignsuperscript^G𝑡superscriptsuperscriptsubscript0𝐹subscriptsuperscript^Φ𝑡0y𝑇superscriptsubscript𝐾𝐹subscriptsuperscript^Φ𝑡𝐾y𝑇𝑇\displaystyle\hat{\textbf{G}}^{t}:=\left[(\triangledown_{0}F(\hat{\Phi}^{t}_{0% };\textbf{y}))^{T},\dots,(\triangledown_{K}F(\hat{\Phi}^{t}_{K};\textbf{y}))^{% T}\right]^{T}over^ start_ARG G end_ARG start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT := [ ( ▽ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT italic_F ( over^ start_ARG roman_Φ end_ARG start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ; y ) ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT , … , ( ▽ start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT italic_F ( over^ start_ARG roman_Φ end_ARG start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT ; y ) ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ] start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT (2)

Then the global model updates as: Θt+1=Θtηt0G^tsuperscriptΘ𝑡1superscriptΘ𝑡superscript𝜂subscript𝑡0superscript^G𝑡\Theta^{t+1}=\Theta^{t}-\eta^{t_{0}}\hat{\textbf{G}}^{t}roman_Θ start_POSTSUPERSCRIPT italic_t + 1 end_POSTSUPERSCRIPT = roman_Θ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT - italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT over^ start_ARG G end_ARG start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT.

We define Φt0superscriptΦsubscript𝑡0{\Phi}^{t_{0}}roman_Φ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT to be the set of embeddings that would be received by each client at iteration t0subscript𝑡0t_{0}italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT if no computation and communication lightweight error were applied: Φt0{θ0t,h1t(θ1t),hkt(θkt)}superscriptΦsubscript𝑡0superscriptsubscript𝜃0𝑡superscriptsubscript1𝑡superscriptsubscript𝜃1𝑡superscriptsubscript𝑘𝑡superscriptsubscript𝜃𝑘𝑡{\Phi}^{t_{0}}\leftarrow\left\{\theta_{0}^{t},{h}_{1}^{t}({\theta}_{1}^{t}),% \cdots{h}_{k}^{t}({\theta}_{k}^{t})\right\}roman_Φ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ← { italic_θ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT , italic_h start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ( italic_θ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) , ⋯ italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) }. Here we define Φkt0subscriptsuperscriptΦsubscript𝑡0𝑘{\Phi}^{t_{0}}_{-k}roman_Φ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT - italic_k end_POSTSUBSCRIPT be the set of embeddings from other parties jk𝑗𝑘j\neq kitalic_j ≠ italic_k, and we have Φkt0:={Φkt0,hk(θkt;xk)}assignsubscriptsuperscriptΦsubscript𝑡0𝑘subscriptsuperscriptΦsubscript𝑡0𝑘subscript𝑘superscriptsubscript𝜃𝑘𝑡subscriptx𝑘{\Phi}^{t_{0}}_{k}:=\left\{{\Phi}^{t_{0}}_{-k},h_{k}({\theta}_{k}^{t};\textbf{% x}_{k})\right\}roman_Φ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT := { roman_Φ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT - italic_k end_POSTSUBSCRIPT , italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ; x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) }. Our convergence analysis will utilize the following standard assumptions about the VFL.

Assumption 1 (Smoothness).

There exists positive constants L<𝐿L<\inftyitalic_L < ∞ and Lk<subscript𝐿𝑘L_{k}<\inftyitalic_L start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT < ∞, such that for all Θ1,Θ2subscriptΘ1subscriptΘ2\Theta_{1},\Theta_{2}roman_Θ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , roman_Θ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT, the objective function satisfies: F(Θ1)F(Θ2)LΘ1Θ2norm𝐹subscriptΘ1𝐹subscriptΘ2𝐿normsubscriptΘ1subscriptΘ2\left\|\triangledown F(\Theta_{1})-\triangledown F(\Theta_{2})\right\|\leq L% \left\|\Theta_{1}-\Theta_{2}\right\|∥ ▽ italic_F ( roman_Θ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) - ▽ italic_F ( roman_Θ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ∥ ≤ italic_L ∥ roman_Θ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT - roman_Θ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∥ and kF(Θ1)kF(Θ2)LkΘ1Θ2normsubscript𝑘𝐹subscriptΘ1subscript𝑘𝐹subscriptΘ2subscript𝐿𝑘normsubscriptΘ1subscriptΘ2\left\|\triangledown_{k}F(\Theta_{1})-\triangledown_{k}F(\Theta_{2})\right\|% \leq L_{k}\left\|\Theta_{1}-\Theta_{2}\right\|∥ ▽ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_F ( roman_Θ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) - ▽ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_F ( roman_Θ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ∥ ≤ italic_L start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∥ roman_Θ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT - roman_Θ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∥.

Assumption 2 (Bounded Hessian).

There exists positive constants Hksubscript𝐻𝑘H_{k}italic_H start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT for k=0,,K𝑘0𝐾k=0,\dots,Kitalic_k = 0 , … , italic_K such that for all ΘΘ\Thetaroman_Θ, the second partial derivatives of F𝐹Fitalic_F satisfy: hk2F(Θ)Hksubscriptnormsubscriptsuperscript2subscript𝑘𝐹Θsubscript𝐻𝑘\left\|\triangledown^{2}_{h_{k}}F(\Theta)\right\|_{\mathcal{F}}\leq H_{k}∥ ▽ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_F ( roman_Θ ) ∥ start_POSTSUBSCRIPT caligraphic_F end_POSTSUBSCRIPT ≤ italic_H start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT. Where Xsubscriptnorm𝑋\left\|X\right\|_{\mathcal{F}}∥ italic_X ∥ start_POSTSUBSCRIPT caligraphic_F end_POSTSUBSCRIPT is the Frobenius norm of a matrix X𝑋Xitalic_X.

Assumption 3 (Bounded Embedding Gradients).

There exists positive constants Gksubscript𝐺𝑘G_{k}italic_G start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT for k=0,,K𝑘0𝐾k=0,\dots,Kitalic_k = 0 , … , italic_K such that for all θksubscript𝜃𝑘\theta_{k}italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT, the embedding gradients are bounded: θkhk(θk;xk)Gksubscriptnormsubscriptsubscript𝜃𝑘subscript𝑘subscript𝜃𝑘subscript𝑥𝑘subscript𝐺𝑘\left\|\triangledown_{\theta_{k}}h_{k}(\theta_{k};x_{k})\right\|_{\mathcal{F}}% \leq G_{k}∥ ▽ start_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ; italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ∥ start_POSTSUBSCRIPT caligraphic_F end_POSTSUBSCRIPT ≤ italic_G start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT.

Assumption 1 guarantees that the function’s slope changes smoothly without any sudden or drastic alterations. Assumption 2 controls the curvature of the function, preventing it from displaying extreme or erratic behavior. Assumption 3 manages the changes in the embedding, ensuring that they don’t result in severe fluctuations, thereby aiding in the stabilization of the learning process. With those assumptions, we can get the following Lemma.

Lemma 1.

The norm of the difference between the objective function value with computation and communication lightweighting and without computation and communication lightweighting is bounded as:

𝔼kF(Φ^kt)kF(Φkt)2𝔼superscriptnormsubscript𝑘𝐹subscriptsuperscript^Φ𝑡𝑘subscript𝑘𝐹subscriptsuperscriptΦ𝑡𝑘2\displaystyle\mathbb{E}\|\nabla_{k}F(\hat{\Phi}^{t}_{k})-\nabla_{k}F({\Phi}^{t% }_{k})\|^{2}blackboard_E ∥ ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_F ( over^ start_ARG roman_Φ end_ARG start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) - ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_F ( roman_Φ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
\displaystyle\leq 2Gk2Hk2j0KΨkt0+2Gk2Hk2j0,jkKΩkt02superscriptsubscript𝐺𝑘2superscriptsubscript𝐻𝑘2superscriptsubscript𝑗0𝐾superscriptsubscriptΨ𝑘subscript𝑡02superscriptsubscript𝐺𝑘2superscriptsubscript𝐻𝑘2superscriptsubscriptformulae-sequence𝑗0𝑗𝑘𝐾superscriptsubscriptΩ𝑘subscript𝑡0\displaystyle 2G_{k}^{2}H_{k}^{2}\sum_{j\neq 0}^{K}\Psi_{k}^{t_{0}}+2G_{k}^{2}% H_{k}^{2}\sum_{j\neq 0,j\neq k}^{K}\Omega_{k}^{t_{0}}2 italic_G start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_H start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_j ≠ 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT roman_Ψ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT + 2 italic_G start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_H start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_j ≠ 0 , italic_j ≠ italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT roman_Ω start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT (3)

Using Lemma 1, we can bound the effect of lightweighting errors in detail. Afterwards we can derive the Theorem 1:

Theorem 1.

Under Assumption 1-3, the average squared gradient over R𝑅Ritalic_R global rounds of the LVFL is bounded as:

1Rt0=0R1𝔼F(Θt0)23[F(Θ0)𝔼[F(ΘT)]]ηT1𝑅superscriptsubscriptsubscript𝑡00𝑅1𝔼superscriptnorm𝐹superscriptΘsubscript𝑡023delimited-[]𝐹superscriptΘ0𝔼delimited-[]𝐹superscriptΘ𝑇𝜂𝑇\displaystyle\frac{1}{R}\sum_{t_{0}=0}^{R-1}\mathbb{E}\|\nabla F(\Theta^{t_{0}% })\|^{2}\leq\frac{3\left[F(\Theta^{0})-\mathbb{E}[F(\Theta^{T})]\right]}{\eta T}divide start_ARG 1 end_ARG start_ARG italic_R end_ARG ∑ start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_R - 1 end_POSTSUPERSCRIPT blackboard_E ∥ ∇ italic_F ( roman_Θ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ≤ divide start_ARG 3 [ italic_F ( roman_Θ start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT ) - blackboard_E [ italic_F ( roman_Θ start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) ] ] end_ARG start_ARG italic_η italic_T end_ARG
+108E2Rt0=0R1k=0KHk2Gk2j0KΨjt0108superscript𝐸2𝑅superscriptsubscriptsubscript𝑡00𝑅1superscriptsubscript𝑘0𝐾superscriptsubscript𝐻𝑘2superscriptsubscript𝐺𝑘2superscriptsubscript𝑗0𝐾superscriptsubscriptΨ𝑗subscript𝑡0\displaystyle+\frac{108E^{2}}{R}\sum_{t_{0}=0}^{R-1}\sum_{k=0}^{K}H_{k}^{2}G_{% k}^{2}\sum_{j\neq 0}^{K}\Psi_{j}^{t_{0}}+ divide start_ARG 108 italic_E start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_R end_ARG ∑ start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_R - 1 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_H start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_G start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_j ≠ 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT roman_Ψ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT
+108E2Rt0=0R1k=0KGk2Hk2j0,jkKΩjt0108superscript𝐸2𝑅superscriptsubscriptsubscript𝑡00𝑅1superscriptsubscript𝑘0𝐾superscriptsubscript𝐺𝑘2superscriptsubscript𝐻𝑘2superscriptsubscriptformulae-sequence𝑗0𝑗𝑘𝐾superscriptsubscriptΩ𝑗subscript𝑡0\displaystyle+\frac{108E^{2}}{R}\sum_{t_{0}=0}^{R-1}\sum_{k=0}^{K}G_{k}^{2}H_{% k}^{2}\sum_{j\neq 0,j\neq k}^{K}\Omega_{j}^{t_{0}}+ divide start_ARG 108 italic_E start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_R end_ARG ∑ start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_R - 1 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_G start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_H start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_j ≠ 0 , italic_j ≠ italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT roman_Ω start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT (4)

The proof can be found in Appendix -D of online supplementary materials. The convergence bound detailed above unveils several insights: The first term illustrates the difference between the initial and final models. The second term highlights the error stemming from the use of the lightweighting feature model, while the third term represents an error caused by lightweighting feature embedding. This analysis reveals that the primary elements influencing the bound are the errors related to communication lightweighting and computation lightweighting. In other words, a larger error will result in a greater bound in the end. To delve deeper into the connection between error and lightweighting ratios, we must turn to the subsequent additional assumptions.

Assumption 4 (Bounded Embedding and Model).

Positive constants δ>0𝛿0\delta>0italic_δ > 0 and μ>0𝜇0\mu>0italic_μ > 0 exist such that for k=0,,K𝑘0𝐾k=0,\dots,Kitalic_k = 0 , … , italic_K, the following conditions are met: 𝔼hk(θk;xk)2δ2,𝔼θk2μ2formulae-sequence𝔼superscriptnormsubscript𝑘subscript𝜃𝑘subscript𝑥𝑘2superscript𝛿2𝔼superscriptnormsubscript𝜃𝑘2superscript𝜇2\mathbb{E}\left\|h_{k}(\theta_{k};x_{k})\right\|^{2}\leq\delta^{2},\quad% \mathbb{E}\left\|\theta_{k}\right\|^{2}\leq\mu^{2}blackboard_E ∥ italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ; italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ≤ italic_δ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , blackboard_E ∥ italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ≤ italic_μ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT.

Assumption 5 (Lipschitz Continuous).

There also exists positive constants and Mk<subscript𝑀𝑘M_{k}<\inftyitalic_M start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT < ∞, such that for the all θ1subscript𝜃1\theta_{1}italic_θ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and θ2subscript𝜃2\theta_{2}italic_θ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT satisfies: hk(θ1)hk(θ2)Mkθ1θ2normsubscript𝑘subscript𝜃1subscript𝑘subscript𝜃2subscript𝑀𝑘normsubscript𝜃1subscript𝜃2\left\|h_{k}(\theta_{1})-h_{k}(\theta_{2})\right\|\leq M_{k}\left\|\theta_{1}-% \theta_{2}\right\|∥ italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) - italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ∥ ≤ italic_M start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∥ italic_θ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT - italic_θ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∥.

Assumption 4 and Assumption 5 are required to build the connection between the lightweighting ratios and the lightweighting error. With these two assumptions, we are able to derive the subsequent result, referred to as Corollary 1.

Corollary 1.

Under Assumption 1-5, with the computation lightweighting ratio αktsuperscriptsubscript𝛼𝑘𝑡\alpha_{k}^{t}italic_α start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT and communication lightweighting ratio βktsuperscriptsubscript𝛽𝑘𝑡\beta_{k}^{t}italic_β start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT, we can further obtain the following bound:

1Rt0=0R1𝔼F(Θt0)23[F(Θ0)𝔼[F(ΘT)]]ηT1𝑅superscriptsubscriptsubscript𝑡00𝑅1𝔼superscriptnorm𝐹superscriptΘsubscript𝑡023delimited-[]𝐹superscriptΘ0𝔼delimited-[]𝐹superscriptΘ𝑇𝜂𝑇\displaystyle\frac{1}{R}\sum_{t_{0}=0}^{R-1}\mathbb{E}\|\nabla F(\Theta^{t_{0}% })\|^{2}\leq\frac{3\left[F(\Theta^{0})-\mathbb{E}[F(\Theta^{T})]\right]}{\eta T}divide start_ARG 1 end_ARG start_ARG italic_R end_ARG ∑ start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_R - 1 end_POSTSUPERSCRIPT blackboard_E ∥ ∇ italic_F ( roman_Θ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ≤ divide start_ARG 3 [ italic_F ( roman_Θ start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT ) - blackboard_E [ italic_F ( roman_Θ start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) ] ] end_ARG start_ARG italic_η italic_T end_ARG
+108E2μ2Rt0=0R1k=0KHk2Gk2Mk2j0Kαjt0108superscript𝐸2superscript𝜇2𝑅superscriptsubscriptsubscript𝑡00𝑅1superscriptsubscript𝑘0𝐾superscriptsubscript𝐻𝑘2superscriptsubscript𝐺𝑘2superscriptsubscript𝑀𝑘2superscriptsubscript𝑗0𝐾superscriptsubscript𝛼𝑗subscript𝑡0\displaystyle+\frac{108E^{2}\mu^{2}}{R}\sum_{t_{0}=0}^{R-1}\sum_{k=0}^{K}H_{k}% ^{2}G_{k}^{2}M_{k}^{2}\sum_{j\neq 0}^{K}\alpha_{j}^{t_{0}}+ divide start_ARG 108 italic_E start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_μ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_R end_ARG ∑ start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_R - 1 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_H start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_G start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_M start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_j ≠ 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_α start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT
+108E2δ2Rt0=0R1k=0KGk2Hk2j0,jkKβjt0108superscript𝐸2superscript𝛿2𝑅superscriptsubscriptsubscript𝑡00𝑅1superscriptsubscript𝑘0𝐾superscriptsubscript𝐺𝑘2superscriptsubscript𝐻𝑘2superscriptsubscriptformulae-sequence𝑗0𝑗𝑘𝐾superscriptsubscript𝛽𝑗subscript𝑡0\displaystyle+\frac{108E^{2}\delta^{2}}{R}\sum_{t_{0}=0}^{R-1}\sum_{k=0}^{K}G_% {k}^{2}H_{k}^{2}\sum_{j\neq 0,j\neq k}^{K}\beta_{j}^{t_{0}}+ divide start_ARG 108 italic_E start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_δ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_R end_ARG ∑ start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_R - 1 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_G start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_H start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_j ≠ 0 , italic_j ≠ italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_β start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT (5)

The detailed proof for this result is located in Appendix -E within the online supplementary materials. As derived from Corollary 1, a key observation to emphasize is that the communication lightweighting error is governed by the communication lightweighting ratio βktsuperscriptsubscript𝛽𝑘𝑡\beta_{k}^{t}italic_β start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT, while the computation lightweighting error is influenced by the computation lightweighting ratio αktsuperscriptsubscript𝛼𝑘𝑡\alpha_{k}^{t}italic_α start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT. As a result, higher lightweighting ratios tend to result in more lenient convergence bounds, whereas lower ratios lead to more stringent convergence bounds.

VI Experiment

In this section, we carry out experiments to assess the performance of LVFL. The experiments were run on an Ubuntu 18.04 machine with an Intel Core i7-10700KF 3.8GHz CPU and GeForce RTX 3070 GPU. Specifically, we explore the effectiveness of the algorithm we’ve proposed by utilizing the VFL-based datasets: CIFAR-10. Subsequently, we will provide detailed information about the datasets and the corresponding models used in the experiment.

CIFAR-10: CIFAR-10 is a dataset used for object classification in images. In a specific training setup, there are 4 parties involved, with each party responsible for a different quadrant of each image. The number of training data samples N=10000𝑁10000N=10000italic_N = 10000, and the batch size bs=256subscript𝑏𝑠256b_{s}=256italic_b start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT = 256. Every party (client) utilizes VGG16 as feature model for training, while the server focuses on training a 3-layer fully connected networks (FCN) as head model.

In the experiment, the following approaches are used for performance comparison. No Lightweighting (NL), where no computational or communication lightweighting mechanisms are implemented. Computational Lightweighting Only (PL), applying solely computational lightweighting. Communication Lightweighting Only (ML), utilizing exclusively communication lightweighting. Lightweighting (L), incorporating both computational and communication lightweighting strategies to enhance efficiency. Next we first show the performance comparison of the above approaches with dynamic computational and communication lightweighting ratios.

VI-A Performance Comparison

This section presents the learning performance optimized for computational and communication efficiency. At each round each client will be assigned the computational lightweighting ratio αktsuperscriptsubscript𝛼𝑘𝑡\alpha_{k}^{t}italic_α start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT and communication lightweighting ratio βktsuperscriptsubscript𝛽𝑘𝑡\beta_{k}^{t}italic_β start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT. These ratios are then applied to individually tailor the lightweighting process to each client’s unique requirements. For illustrative purposes, we assume a uniform set of ratios across all clients, with the specific values depicted in Fig. 2b. The learning performance associated with our approaches are illustrated in Fig. 2a, from which several key observations can be derived. Primarily, computational lightweighting exerts a more significant influence on learning performance compared to communication lightweighting. Notably, computational lightweighting results in a marked decrease in test accuracy at the point of implementation, followed by a gradual recovery. Furthermore, the frequency and intensity of computational lightweighting are directly proportional to its impact on test accuracy.

Refer to caption
(a) Test Accuracy
Refer to caption
(b) Ratio By Rounds
Figure 2: Performance Comparison By Round With CIFAR-10

VI-B Effect of Ratio on Learning Performance

In this section we further investigate the effect of the choice of αktsuperscriptsubscript𝛼𝑘𝑡\alpha_{k}^{t}italic_α start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT and βktsuperscriptsubscript𝛽𝑘𝑡\beta_{k}^{t}italic_β start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT on learning performance. Here we explore the effects of computation and communication separately. In the scenario of computational lightweighting, as depicted in Fig. 3a, adjustments to the feature model are made at t=40𝑡40t=40italic_t = 40 with αt[0.2,0.4,0.6]superscript𝛼𝑡0.20.40.6\alpha^{t}\in[0.2,0.4,0.6]italic_α start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ∈ [ 0.2 , 0.4 , 0.6 ]. The outcomes indicate that higher αtsuperscript𝛼𝑡\alpha^{t}italic_α start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT values result in more significant reductions in test accuracy, consequently necessitating a longer recovery period. In the context of communication lightweighting, as shown in Fig. 3b, varying communication lightweighting ratios βt[0.2,0.4,0.6]superscript𝛽𝑡0.20.40.6\beta^{t}\in[0.2,0.4,0.6]italic_β start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ∈ [ 0.2 , 0.4 , 0.6 ] were applied to the feature embedding. The findings illustrate that lower βtsuperscript𝛽𝑡\beta^{t}italic_β start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT values are associated with quicker convergence rates, particularly noticeable within the range t[10,40]𝑡1040t\in[10,40]italic_t ∈ [ 10 , 40 ]. However, the impact of communication lightweighting is not as substantial as that observed with computational lightweighting.

Refer to caption
(a) Computation Lightweighting
Refer to caption
(b) Communication Lightweighting
Figure 3: Effect of Ratio on Learning Performance

VII Conclusion

Our paper introduces the concept of Lightweight Vertical Federated Learning (LVFL). Owing to structural distinctions between VFL and HFL, algorithm design and convergence analysis for VFL-based lightweight challenges significantly differ. Our convergence proofs elucidate the correlation between convergence bounds and the ratios of communication and computational lightweighting. Furthermore, the section on experimental results underscores the benefits of employing lightweight mechanisms. In the subsequent research, we intend to extend LVFL to be a more practical problem. We plan to formulate a long-term optimization problem that accurately represents the computational and communication difficulties clients face throughout the training phase. This will facilitate the determination of optimal lightweighting ratios specifically.

References

  • [1] Y. Liu, T. Fan, T. Chen, Q. Xu, and Q. Yang, “Fate: An industrial grade platform for collaborative learning with data protection,” Journal of Machine Learning Research, vol. 22, no. 226, pp. 1–6, 2021.
  • [2] T. J. Castiglia, A. Das, S. Wang, and S. Patterson, “Compressed-vfl: Communication-efficient learning with vertically partitioned data,” in International Conference on Machine Learning.   PMLR, 2022, pp. 2738–2766.
  • [3] H. Wang and J. Xu, “Online vertical federated learning for cooperative spectrum sensing,” arXiv preprint arXiv:2312.11363, 2023.
  • [4] S. Hardy, W. Henecka, H. Ivey-Law, R. Nock, G. Patrini, G. Smith, and B. Thorne, “Private federated learning on vertically partitioned data via entity resolution and additively homomorphic encryption,” arXiv preprint arXiv:1711.10677, 2017.
  • [5] L. Yang, D. Chai, J. Zhang, Y. Jin, L. Wang, H. Liu, H. Tian, Q. Xu, and K. Chen, “A survey on vertical federated learning: From a layered perspective,” arXiv preprint arXiv:2304.01829, 2023.
  • [6] K. Wei, J. Li, C. Ma, M. Ding, S. Wei, F. Wu, G. Chen, and T. Ranbaduge, “Vertical federated learning: Challenges, methodologies and experiments,” arXiv preprint arXiv:2202.04309, 2022.
  • [7] Y. Liu, Y. Kang, T. Zou, Y. Pu, Y. He, X. Ye, Y. Ouyang, Y.-Q. Zhang, and Q. Yang, “Vertical federated learning,” arXiv preprint arXiv:2211.12814, 2022.
  • [8] S. Feng, “Vertical federated learning-based feature selection with non-overlapping sample utilization,” Expert Systems with Applications, vol. 208, p. 118097, 2022.
  • [9] Y. Kang, Y. Liu, and X. Liang, “Fedcvt: Semi-supervised vertical federated learning with cross-view training,” ACM Transactions on Intelligent Systems and Technology (TIST), vol. 13, no. 4, pp. 1–16, 2022.
  • [10] J. Sun, Y. Yao, W. Gao, J. Xie, and C. Wang, “Defending against reconstruction attack in vertical federated learning,” arXiv preprint arXiv:2107.09898, 2021.
  • [11] Y. Liu, X. Zhang, Y. Kang, L. Li, T. Chen, M. Hong, and Q. Yang, “Fedbcd: A communication-efficient collaborative learning framework for distributed features,” IEEE Transactions on Signal Processing, vol. 70, pp. 4277–4290, 2022.
  • [12] T. Castiglia, S. Wang, and S. Patterson, “Flexible vertical federated learning with heterogeneous parties,” arXiv preprint arXiv:2208.12672, 2022.
  • [13] M. Li, Y. Chen, Y. Wang, and Y. Pan, “Efficient asynchronous vertical federated learning via gradient prediction and double-end sparse compression,” in 2020 16th international conference on control, automation, robotics and vision (ICARCV).   IEEE, 2020, pp. 291–296.
  • [14] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [15] Z. Wei, Q. Pei, N. Zhang, X. Liu, C. Wu, and A. Taherkordi, “Lightweight federated learning for large-scale iot devices with privacy guarantee,” IEEE Internet of Things Journal, vol. 10, no. 4, pp. 3179–3191, 2021.
  • [16] X. Dong, S. Chen, and S. Pan, “Learning to prune deep neural networks via layer-wise optimal brain surgeon,” Advances in neural information processing systems, vol. 30, 2017.
  • [17] V. Sanh, T. Wolf, and A. Rush, “Movement pruning: Adaptive sparsity by fine-tuning,” Advances in Neural Information Processing Systems, vol. 33, pp. 20 378–20 389, 2020.
  • [18] X. Ding, G. Ding, Y. Guo, and J. Han, “Centripetal sgd for pruning very deep convolutional networks with complicated structure,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 4943–4953.
  • [19] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, “Pruning filters for efficient convnets,” arXiv preprint arXiv:1608.08710, 2016.
  • [20] Z. You, K. Yan, J. Ye, M. Ma, and P. Wang, “Gate decorator: Global filter pruning method for accelerating deep convolutional neural networks,” Advances in neural information processing systems, vol. 32, 2019.
  • [21] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” Advances in neural information processing systems, vol. 28, 2015.
  • [22] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “Eie: Efficient inference engine on compressed deep neural network,” ACM SIGARCH Computer Architecture News, vol. 44, no. 3, pp. 243–254, 2016.
  • [23] Y. Jiang, S. Wang, V. Valls, B. J. Ko, W.-H. Lee, K. K. Leung, and L. Tassiulas, “Model pruning enables efficient federated learning on edge devices,” IEEE Transactions on Neural Networks and Learning Systems, 2022.
  • [24] Z. Jiang, Y. Xu, H. Xu, Z. Wang, J. Liu, Q. Chen, and C. Qiao, “Computation and communication efficient federated learning with adaptive model pruning,” IEEE Transactions on Mobile Computing, 2023.

In this section, we provide the details of proofs. Let’s start by defining a new series of notations.

-A Additional Notation

In this section, we present various notations that will be utilized in subsequent proofs. During each local round t𝑡titalic_t, every client k𝑘kitalic_k trains the local model with the embedding Φ^kt0subscriptsuperscript^Φsubscript𝑡0𝑘\hat{\Phi}^{t_{0}}_{k}over^ start_ARG roman_Φ end_ARG start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT. This process can also be interpreted as the client k𝑘kitalic_k training the model θktsuperscriptsubscript𝜃𝑘𝑡{\theta}_{k}^{t}italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT and θjt0superscriptsubscript𝜃𝑗subscript𝑡0{\theta}_{j}^{t_{0}}italic_θ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT for all jk𝑗𝑘j\neq kitalic_j ≠ italic_k, where t0subscript𝑡0t_{0}italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT is the last communication iteration when client k𝑘kitalic_k received the embeddings. Then we have:

γk,jt={θjt,k=jθjt0,otherwisesuperscriptsubscript𝛾𝑘𝑗𝑡casessuperscriptsubscript𝜃𝑗𝑡𝑘𝑗superscriptsubscript𝜃𝑗subscript𝑡0otherwise\displaystyle\gamma_{k,j}^{t}=\left\{\begin{array}[]{ll}{\theta}_{j}^{t},&k=j% \\ {\theta}_{j}^{t_{0}},&\text{otherwise}\end{array}\right.italic_γ start_POSTSUBSCRIPT italic_k , italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT = { start_ARRAY start_ROW start_CELL italic_θ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT , end_CELL start_CELL italic_k = italic_j end_CELL end_ROW start_ROW start_CELL italic_θ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT , end_CELL start_CELL otherwise end_CELL end_ROW end_ARRAY (8)

Which represents the real model that client k𝑘kitalic_k utilized in the round t𝑡titalic_t. We define the column vector Γkt=[(γk,0t)T;;(γk,Kt)T]superscriptsubscriptΓ𝑘𝑡superscriptsuperscriptsubscript𝛾𝑘0𝑡𝑇superscriptsuperscriptsubscript𝛾𝑘𝐾𝑡𝑇\Gamma_{k}^{t}=\left[(\gamma_{k,0}^{t})^{T};\dots;(\gamma_{k,K}^{t})^{T}\right]roman_Γ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT = [ ( italic_γ start_POSTSUBSCRIPT italic_k , 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ; … ; ( italic_γ start_POSTSUBSCRIPT italic_k , italic_K end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ] to be the client k𝑘kitalic_k’s view of global model in the round t𝑡titalic_t. Then we define F^(Γkt)^𝐹subscriptsuperscriptΓ𝑡𝑘\hat{F}({\Gamma}^{t}_{k})over^ start_ARG italic_F end_ARG ( roman_Γ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) to be the loss with pruning error by client k𝑘kitalic_k at round t:

F^(Γkt)=F(θ0t0,h1t0(θ1t0)+φ1t0+ϵ1t0,,hkt(θkt)+φkt0,,hKt0(θKt0)+φKt0+ϵKt0)^𝐹subscriptsuperscriptΓ𝑡𝑘𝐹superscriptsubscript𝜃0subscript𝑡0superscriptsubscript1subscript𝑡0superscriptsubscript𝜃1subscript𝑡0superscriptsubscript𝜑1subscript𝑡0superscriptsubscriptitalic-ϵ1subscript𝑡0superscriptsubscript𝑘𝑡superscriptsubscript𝜃𝑘𝑡superscriptsubscript𝜑𝑘subscript𝑡0superscriptsubscript𝐾subscript𝑡0superscriptsubscript𝜃𝐾subscript𝑡0superscriptsubscript𝜑𝐾subscript𝑡0superscriptsubscriptitalic-ϵ𝐾subscript𝑡0\displaystyle\hat{F}({\Gamma}^{t}_{k})=F\left({\theta}_{0}^{t_{0}},h_{1}^{t_{0% }}({\theta}_{1}^{t_{0}})+\varphi_{1}^{t_{0}}+\epsilon_{1}^{t_{0}},\dots,h_{k}^% {t}({\theta}_{k}^{t})+\varphi_{k}^{t_{0}},\dots,h_{K}^{t_{0}}({\theta}_{K}^{t_% {0}})+\varphi_{K}^{t_{0}}+\epsilon_{K}^{t_{0}}\right)over^ start_ARG italic_F end_ARG ( roman_Γ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) = italic_F ( italic_θ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT , italic_h start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( italic_θ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) + italic_φ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT + italic_ϵ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT , … , italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) + italic_φ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT , … , italic_h start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) + italic_φ start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT + italic_ϵ start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) (9)

It is crucial to emphasize that θ0t0superscriptsubscript𝜃0subscript𝑡0{\theta}_{0}^{t_{0}}italic_θ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT is free from computation and communication pruning errors, while the expression hkt(θkt)superscriptsubscript𝑘𝑡superscriptsubscript𝜃𝑘𝑡h_{k}^{t}({\theta}_{k}^{t})italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) is unaffected by communication pruning errors. Building on this understanding, we can now reform the definition of G^tsuperscript^G𝑡\hat{\textbf{G}}^{t}over^ start_ARG G end_ARG start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT as follows::

G^t:=[(0F^(Γ0t))T,,(KF^(ΓKt))T]Tassignsuperscript^G𝑡superscriptsuperscriptsubscript0^𝐹subscriptsuperscriptΓ𝑡0𝑇superscriptsubscript𝐾^𝐹subscriptsuperscriptΓ𝑡𝐾𝑇𝑇\displaystyle\hat{\textbf{G}}^{t}:=\left[(\triangledown_{0}\hat{F}({\Gamma}^{t% }_{0}))^{T},\dots,(\triangledown_{K}\hat{F}({\Gamma}^{t}_{K}))^{T}\right]^{T}over^ start_ARG G end_ARG start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT := [ ( ▽ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT over^ start_ARG italic_F end_ARG ( roman_Γ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT , … , ( ▽ start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT over^ start_ARG italic_F end_ARG ( roman_Γ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT ) ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ] start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT (10)

In the subsequent proof, we will employ both variations of G^tsuperscript^G𝑡\hat{\textbf{G}}^{t}over^ start_ARG G end_ARG start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT. Following that, we will define the computation pruning error associated with each embedding utilized in client k𝑘kitalic_k’s gradient calculation during round t𝑡titalic_t:

Ekt0:=[(φ1t0+ϵ1t0)T,,(φkt0)T,,(φKt0+ϵKt0)T]Tassignsuperscriptsubscript𝐸𝑘subscript𝑡0superscriptsuperscriptsuperscriptsubscript𝜑1subscript𝑡0superscriptsubscriptitalic-ϵ1subscript𝑡0𝑇superscriptsuperscriptsubscript𝜑𝑘subscript𝑡0𝑇superscriptsuperscriptsubscript𝜑𝐾subscript𝑡0superscriptsubscriptitalic-ϵ𝐾subscript𝑡0𝑇𝑇\displaystyle E_{k}^{t_{0}}:=\left[(\varphi_{1}^{t_{0}}+\epsilon_{1}^{t_{0}})^% {T},\dots,(\varphi_{k}^{t_{0}})^{T},\dots,(\varphi_{K}^{t_{0}}+\epsilon_{K}^{t% _{0}})^{T}\right]^{T}italic_E start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT := [ ( italic_φ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT + italic_ϵ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT , … , ( italic_φ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT , … , ( italic_φ start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT + italic_ϵ start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ] start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT (11)

Given the error, we can derive the following:

kF(Φkt+Ekt0)=kF^(Γkt)subscript𝑘𝐹subscriptsuperscriptΦ𝑡𝑘superscriptsubscript𝐸𝑘subscript𝑡0subscript𝑘^𝐹subscriptsuperscriptΓ𝑡𝑘\displaystyle\nabla_{k}{F}(\Phi^{t}_{k}+E_{k}^{t_{0}})=\nabla_{k}\hat{F}({% \Gamma}^{t}_{k})∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_F ( roman_Φ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT + italic_E start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) = ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT over^ start_ARG italic_F end_ARG ( roman_Γ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) (12)

We apply the chain rule to kF^(Γkt)subscript𝑘^𝐹subscriptsuperscriptΓ𝑡𝑘\nabla_{k}\hat{F}({\Gamma}^{t}_{k})∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT over^ start_ARG italic_F end_ARG ( roman_Γ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) to get:

kF^(Γkt)=θkhk(θkt)hk(θk)F(Φkt+Ekt0)subscript𝑘^𝐹subscriptsuperscriptΓ𝑡𝑘subscriptsubscript𝜃𝑘subscript𝑘superscriptsubscript𝜃𝑘𝑡subscriptsubscript𝑘subscript𝜃𝑘𝐹subscriptsuperscriptΦ𝑡𝑘superscriptsubscript𝐸𝑘subscript𝑡0\displaystyle\nabla_{k}\hat{F}({\Gamma}^{t}_{k})=\nabla_{{\theta}_{k}}h_{k}({% \theta}_{k}^{t})\nabla_{h_{k}({\theta}_{k})}F(\Phi^{t}_{k}+E_{k}^{t_{0}})∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT over^ start_ARG italic_F end_ARG ( roman_Γ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) = ∇ start_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) ∇ start_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT italic_F ( roman_Φ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT + italic_E start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) (13)

Using the Taylor series expansion to further expand hk(θk)F(Φkt+Ekt0)subscriptsubscript𝑘subscript𝜃𝑘𝐹subscriptsuperscriptΦ𝑡𝑘superscriptsubscript𝐸𝑘subscript𝑡0\nabla_{h_{k}({\theta}_{k})}F(\Phi^{t}_{k}+E_{k}^{t_{0}})∇ start_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT italic_F ( roman_Φ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT + italic_E start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) around the point ΦktsubscriptsuperscriptΦ𝑡𝑘\Phi^{t}_{k}roman_Φ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT:

hk(θk)F(Φkt+Ekt0)=hk(θk)F(Φkt)+hk(θk)2F(Φkt)TEkt0+subscriptsubscript𝑘subscript𝜃𝑘𝐹subscriptsuperscriptΦ𝑡𝑘superscriptsubscript𝐸𝑘subscript𝑡0subscriptsubscript𝑘subscript𝜃𝑘𝐹subscriptsuperscriptΦ𝑡𝑘superscriptsubscriptsubscript𝑘subscript𝜃𝑘2𝐹superscriptsubscriptsuperscriptΦ𝑡𝑘𝑇superscriptsubscript𝐸𝑘subscript𝑡0\displaystyle\nabla_{h_{k}({\theta}_{k})}F(\Phi^{t}_{k}+E_{k}^{t_{0}})=\nabla_% {h_{k}({\theta}_{k})}F(\Phi^{t}_{k})+\nabla_{h_{k}({\theta}_{k})}^{2}F(\Phi^{t% }_{k})^{T}E_{k}^{t_{0}}+\dots∇ start_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT italic_F ( roman_Φ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT + italic_E start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) = ∇ start_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT italic_F ( roman_Φ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) + ∇ start_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_F ( roman_Φ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_E start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT + … (14)

To convince, we let the infinite sum of all terms from the second partial derivatives as R0k(Φkt+Ekt0)superscriptsubscript𝑅0𝑘subscriptsuperscriptΦ𝑡𝑘superscriptsubscript𝐸𝑘subscript𝑡0R_{0}^{k}(\Phi^{t}_{k}+E_{k}^{t_{0}})italic_R start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ( roman_Φ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT + italic_E start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ). Then we can get:

hk(θk)F(Φkt+Ekt0)=hk(θk)F(Φkt)+R0k(Φkt+Ekt0)subscriptsubscript𝑘subscript𝜃𝑘𝐹subscriptsuperscriptΦ𝑡𝑘superscriptsubscript𝐸𝑘subscript𝑡0subscriptsubscript𝑘subscript𝜃𝑘𝐹subscriptsuperscriptΦ𝑡𝑘superscriptsubscript𝑅0𝑘subscriptsuperscriptΦ𝑡𝑘superscriptsubscript𝐸𝑘subscript𝑡0\displaystyle\nabla_{h_{k}({\theta}_{k})}F(\Phi^{t}_{k}+E_{k}^{t_{0}})=\nabla_% {h_{k}({\theta}_{k})}F(\Phi^{t}_{k})+R_{0}^{k}(\Phi^{t}_{k}+E_{k}^{t_{0}})∇ start_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT italic_F ( roman_Φ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT + italic_E start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) = ∇ start_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT italic_F ( roman_Φ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) + italic_R start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ( roman_Φ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT + italic_E start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) (15)

Building on the definition provided above, we will now move to the theoretical proof.

-B Proof of Lemma 1

By definition, we know that

kF^(Γkt)=kF(Φkt+Ekt0)subscript𝑘^𝐹subscriptsuperscriptΓ𝑡𝑘subscript𝑘𝐹subscriptsuperscriptΦ𝑡𝑘superscriptsubscript𝐸𝑘subscript𝑡0\displaystyle\nabla_{k}\hat{F}({\Gamma}^{t}_{k})=\nabla_{k}{F}(\Phi^{t}_{k}+E_% {k}^{t_{0}})∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT over^ start_ARG italic_F end_ARG ( roman_Γ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) = ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_F ( roman_Φ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT + italic_E start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) (16)
=θkhk(θkt)hk(θk)F(Φkt+Ekt0)absentsubscriptsubscript𝜃𝑘subscript𝑘superscriptsubscript𝜃𝑘𝑡subscriptsubscript𝑘subscript𝜃𝑘𝐹subscriptsuperscriptΦ𝑡𝑘superscriptsubscript𝐸𝑘subscript𝑡0\displaystyle=\nabla_{{\theta}_{k}}h_{k}({\theta}_{k}^{t})\nabla_{h_{k}({% \theta}_{k})}F(\Phi^{t}_{k}+E_{k}^{t_{0}})= ∇ start_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) ∇ start_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT italic_F ( roman_Φ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT + italic_E start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) (17)
=θkhk(θkt)(hk(θk)F(Φkt)+R0k(Φkt+Ekt0))absentsubscriptsubscript𝜃𝑘subscript𝑘superscriptsubscript𝜃𝑘𝑡subscriptsubscript𝑘subscript𝜃𝑘𝐹subscriptsuperscriptΦ𝑡𝑘superscriptsubscript𝑅0𝑘subscriptsuperscriptΦ𝑡𝑘superscriptsubscript𝐸𝑘subscript𝑡0\displaystyle=\nabla_{{\theta}_{k}}h_{k}({\theta}_{k}^{t})\left(\nabla_{h_{k}(% {\theta}_{k})}F(\Phi^{t}_{k})+R_{0}^{k}(\Phi^{t}_{k}+E_{k}^{t_{0}})\right)= ∇ start_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) ( ∇ start_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT italic_F ( roman_Φ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) + italic_R start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ( roman_Φ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT + italic_E start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) ) (18)
=kF(Γkt)+θkhk(θkt)R0k(Φkt+Ekt0)absentsubscript𝑘𝐹subscriptsuperscriptΓ𝑡𝑘subscriptsubscript𝜃𝑘subscript𝑘superscriptsubscript𝜃𝑘𝑡superscriptsubscript𝑅0𝑘subscriptsuperscriptΦ𝑡𝑘superscriptsubscript𝐸𝑘subscript𝑡0\displaystyle=\nabla_{k}{F}({\Gamma}^{t}_{k})+\nabla_{{\theta}_{k}}h_{k}({% \theta}_{k}^{t})R_{0}^{k}(\Phi^{t}_{k}+E_{k}^{t_{0}})= ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_F ( roman_Γ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) + ∇ start_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) italic_R start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ( roman_Φ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT + italic_E start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) (19)

Where Eq. 18 uses the Taylor series expansion. Then we can get

𝔼kF^(Γkt)kF(Γkt)2=𝔼θkhk(θkt)R0k(Φkt+Ekt0)2𝔼superscriptnormsubscript𝑘^𝐹subscriptsuperscriptΓ𝑡𝑘subscript𝑘𝐹subscriptsuperscriptΓ𝑡𝑘2𝔼superscriptnormsubscriptsubscript𝜃𝑘subscript𝑘superscriptsubscript𝜃𝑘𝑡superscriptsubscript𝑅0𝑘subscriptsuperscriptΦ𝑡𝑘superscriptsubscript𝐸𝑘subscript𝑡02\displaystyle\mathbb{E}\left\|\nabla_{k}\hat{F}({\Gamma}^{t}_{k})-\nabla_{k}{F% }({\Gamma}^{t}_{k})\right\|^{2}=\mathbb{E}\left\|\nabla_{{\theta}_{k}}h_{k}({% \theta}_{k}^{t})R_{0}^{k}(\Phi^{t}_{k}+E_{k}^{t_{0}})\right\|^{2}blackboard_E ∥ ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT over^ start_ARG italic_F end_ARG ( roman_Γ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) - ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_F ( roman_Γ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = blackboard_E ∥ ∇ start_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) italic_R start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ( roman_Φ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT + italic_E start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT (20)

Thus we need to get the bound of the right side first.

θkhk(θkt)R0k(Φkt+Ekt0)2θkhk(θkt)2R0k(Φkt+Ekt0)2Gk2Hk2Ekt02superscriptnormsubscriptsubscript𝜃𝑘subscript𝑘superscriptsubscript𝜃𝑘𝑡superscriptsubscript𝑅0𝑘subscriptsuperscriptΦ𝑡𝑘superscriptsubscript𝐸𝑘subscript𝑡02subscriptsuperscriptnormsubscriptsubscript𝜃𝑘subscript𝑘superscriptsubscript𝜃𝑘𝑡2subscriptsuperscriptnormsuperscriptsubscript𝑅0𝑘subscriptsuperscriptΦ𝑡𝑘superscriptsubscript𝐸𝑘subscript𝑡02superscriptsubscript𝐺𝑘2superscriptsubscript𝐻𝑘2subscriptsuperscriptnormsuperscriptsubscript𝐸𝑘subscript𝑡02\displaystyle\left\|\nabla_{{\theta}_{k}}h_{k}({\theta}_{k}^{t})R_{0}^{k}(\Phi% ^{t}_{k}+E_{k}^{t_{0}})\right\|^{2}\leq\left\|\nabla_{{\theta}_{k}}h_{k}({% \theta}_{k}^{t})\right\|^{2}_{\mathcal{F}}\left\|R_{0}^{k}(\Phi^{t}_{k}+E_{k}^% {t_{0}})\right\|^{2}_{\mathcal{F}}\leq G_{k}^{2}H_{k}^{2}\left\|E_{k}^{t_{0}}% \right\|^{2}_{\mathcal{F}}∥ ∇ start_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) italic_R start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ( roman_Φ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT + italic_E start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ≤ ∥ ∇ start_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT caligraphic_F end_POSTSUBSCRIPT ∥ italic_R start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ( roman_Φ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT + italic_E start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT caligraphic_F end_POSTSUBSCRIPT ≤ italic_G start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_H start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∥ italic_E start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT caligraphic_F end_POSTSUBSCRIPT (21)

Then applying expectation we can finally get:

𝔼θkhk(θkt)R0k(Φkt+Ekt0)2Gk2Hk2𝔼Ekt02𝔼superscriptnormsubscriptsubscript𝜃𝑘subscript𝑘superscriptsubscript𝜃𝑘𝑡superscriptsubscript𝑅0𝑘subscriptsuperscriptΦ𝑡𝑘superscriptsubscript𝐸𝑘subscript𝑡02superscriptsubscript𝐺𝑘2superscriptsubscript𝐻𝑘2𝔼subscriptsuperscriptnormsuperscriptsubscript𝐸𝑘subscript𝑡02\displaystyle\mathbb{E}\left\|\nabla_{{\theta}_{k}}h_{k}({\theta}_{k}^{t})R_{0% }^{k}(\Phi^{t}_{k}+E_{k}^{t_{0}})\right\|^{2}\leq G_{k}^{2}H_{k}^{2}\mathbb{E}% \left\|E_{k}^{t_{0}}\right\|^{2}_{\mathcal{F}}blackboard_E ∥ ∇ start_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) italic_R start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ( roman_Φ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT + italic_E start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ≤ italic_G start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_H start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT blackboard_E ∥ italic_E start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT caligraphic_F end_POSTSUBSCRIPT (22)
Gk2Hk2𝔼[j0,jkKφjt0+ϵjt02+φkt02]absentsuperscriptsubscript𝐺𝑘2superscriptsubscript𝐻𝑘2𝔼delimited-[]superscriptsubscriptformulae-sequence𝑗0𝑗𝑘𝐾subscriptsuperscriptnormsuperscriptsubscript𝜑𝑗subscript𝑡0superscriptsubscriptitalic-ϵ𝑗subscript𝑡02subscriptsuperscriptnormsuperscriptsubscript𝜑𝑘subscript𝑡02\displaystyle\leq G_{k}^{2}H_{k}^{2}\mathbb{E}\left[\sum_{j\neq 0,j\neq k}^{K}% \left\|\varphi_{j}^{t_{0}}+\epsilon_{j}^{t_{0}}\right\|^{2}_{\mathcal{F}}+% \left\|\varphi_{k}^{t_{0}}\right\|^{2}_{\mathcal{F}}\right]≤ italic_G start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_H start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT blackboard_E [ ∑ start_POSTSUBSCRIPT italic_j ≠ 0 , italic_j ≠ italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT ∥ italic_φ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT + italic_ϵ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT caligraphic_F end_POSTSUBSCRIPT + ∥ italic_φ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT caligraphic_F end_POSTSUBSCRIPT ] (23)
Gk2Hk2𝔼[j0,jkK(2φjt02+2ϵjt02)+φkt02]absentsuperscriptsubscript𝐺𝑘2superscriptsubscript𝐻𝑘2𝔼delimited-[]superscriptsubscriptformulae-sequence𝑗0𝑗𝑘𝐾2subscriptsuperscriptnormsuperscriptsubscript𝜑𝑗subscript𝑡022subscriptsuperscriptnormsuperscriptsubscriptitalic-ϵ𝑗subscript𝑡02subscriptsuperscriptnormsuperscriptsubscript𝜑𝑘subscript𝑡02\displaystyle\leq G_{k}^{2}H_{k}^{2}\mathbb{E}\left[\sum_{j\neq 0,j\neq k}^{K}% \left(2\left\|\varphi_{j}^{t_{0}}\right\|^{2}_{\mathcal{F}}+2\left\|\epsilon_{% j}^{t_{0}}\right\|^{2}_{\mathcal{F}}\right)+\left\|\varphi_{k}^{t_{0}}\right\|% ^{2}_{\mathcal{F}}\right]≤ italic_G start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_H start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT blackboard_E [ ∑ start_POSTSUBSCRIPT italic_j ≠ 0 , italic_j ≠ italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT ( 2 ∥ italic_φ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT caligraphic_F end_POSTSUBSCRIPT + 2 ∥ italic_ϵ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT caligraphic_F end_POSTSUBSCRIPT ) + ∥ italic_φ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT caligraphic_F end_POSTSUBSCRIPT ] (24)
2Gk2Hk2j0K𝔼φjt02+2Gk2Hk2j0,jkK𝔼ϵjt02absent2superscriptsubscript𝐺𝑘2superscriptsubscript𝐻𝑘2superscriptsubscript𝑗0𝐾𝔼subscriptsuperscriptnormsuperscriptsubscript𝜑𝑗subscript𝑡022superscriptsubscript𝐺𝑘2superscriptsubscript𝐻𝑘2superscriptsubscriptformulae-sequence𝑗0𝑗𝑘𝐾𝔼subscriptsuperscriptnormsuperscriptsubscriptitalic-ϵ𝑗subscript𝑡02\displaystyle\leq 2G_{k}^{2}H_{k}^{2}\sum_{j\neq 0}^{K}\mathbb{E}\left\|% \varphi_{j}^{t_{0}}\right\|^{2}_{\mathcal{F}}+2G_{k}^{2}H_{k}^{2}\sum_{j\neq 0% ,j\neq k}^{K}\mathbb{E}\left\|\epsilon_{j}^{t_{0}}\right\|^{2}_{\mathcal{F}}≤ 2 italic_G start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_H start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_j ≠ 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT blackboard_E ∥ italic_φ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT caligraphic_F end_POSTSUBSCRIPT + 2 italic_G start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_H start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_j ≠ 0 , italic_j ≠ italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT blackboard_E ∥ italic_ϵ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT caligraphic_F end_POSTSUBSCRIPT (25)
=2Gk2Hk2j0KΨjt0+2Gk2Hk2j0,jkKΩjt0absent2superscriptsubscript𝐺𝑘2superscriptsubscript𝐻𝑘2superscriptsubscript𝑗0𝐾superscriptsubscriptΨ𝑗subscript𝑡02superscriptsubscript𝐺𝑘2superscriptsubscript𝐻𝑘2superscriptsubscriptformulae-sequence𝑗0𝑗𝑘𝐾superscriptsubscriptΩ𝑗subscript𝑡0\displaystyle=2G_{k}^{2}H_{k}^{2}\sum_{j\neq 0}^{K}\Psi_{j}^{t_{0}}+2G_{k}^{2}% H_{k}^{2}\sum_{j\neq 0,j\neq k}^{K}\Omega_{j}^{t_{0}}= 2 italic_G start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_H start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_j ≠ 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT roman_Ψ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT + 2 italic_G start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_H start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_j ≠ 0 , italic_j ≠ italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT roman_Ω start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT (26)

-C Proof of Lemma 2

In this section, we will get the bound of t=t0t0+E1𝔼t0G^tGt02superscriptsubscript𝑡subscript𝑡0subscript𝑡0𝐸1subscript𝔼subscript𝑡0superscriptnormsuperscript^G𝑡superscriptGsubscript𝑡02\sum_{t=t_{0}}^{t_{0}+E-1}\mathbb{E}_{t_{0}}\left\|\hat{\textbf{G}}^{t}-{% \textbf{G}}^{t_{0}}\right\|^{2}∑ start_POSTSUBSCRIPT italic_t = italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_E - 1 end_POSTSUPERSCRIPT blackboard_E start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∥ over^ start_ARG G end_ARG start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT - G start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT.

𝔼t0G^tGt02=k=0K𝔼t0[kF^(Γkt)kF(Γkt0)2]subscript𝔼subscript𝑡0superscriptnormsuperscript^G𝑡superscriptGsubscript𝑡02superscriptsubscript𝑘0𝐾subscript𝔼subscript𝑡0delimited-[]superscriptnormsubscript𝑘^𝐹subscriptsuperscriptΓ𝑡𝑘subscript𝑘𝐹subscriptsuperscriptΓsubscript𝑡0𝑘2\displaystyle\mathbb{E}_{t_{0}}\left\|\hat{\textbf{G}}^{t}-{\textbf{G}}^{t_{0}% }\right\|^{2}=\sum_{k=0}^{K}\mathbb{E}_{t_{0}}\left[\left\|\nabla_{k}\hat{F}({% \Gamma}^{t}_{k})-\nabla_{k}{F}({\Gamma}^{t_{0}}_{k})\right\|^{2}\right]blackboard_E start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∥ over^ start_ARG G end_ARG start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT - G start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT blackboard_E start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ ∥ ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT over^ start_ARG italic_F end_ARG ( roman_Γ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) - ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_F ( roman_Γ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] (27)
=k=0K𝔼t0[kF^(Γkt)kF^(Γkt1)+kF^(Γkt1)kF(Γkt0)2]absentsuperscriptsubscript𝑘0𝐾subscript𝔼subscript𝑡0delimited-[]superscriptnormsubscript𝑘^𝐹subscriptsuperscriptΓ𝑡𝑘subscript𝑘^𝐹subscriptsuperscriptΓ𝑡1𝑘subscript𝑘^𝐹subscriptsuperscriptΓ𝑡1𝑘subscript𝑘𝐹subscriptsuperscriptΓsubscript𝑡0𝑘2\displaystyle=\sum_{k=0}^{K}\mathbb{E}_{t_{0}}\left[\left\|\nabla_{k}\hat{F}({% \Gamma}^{t}_{k})-\nabla_{k}\hat{F}({\Gamma}^{t-1}_{k})+\nabla_{k}\hat{F}({% \Gamma}^{t-1}_{k})-\nabla_{k}{F}({\Gamma}^{t_{0}}_{k})\right\|^{2}\right]= ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT blackboard_E start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ ∥ ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT over^ start_ARG italic_F end_ARG ( roman_Γ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) - ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT over^ start_ARG italic_F end_ARG ( roman_Γ start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) + ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT over^ start_ARG italic_F end_ARG ( roman_Γ start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) - ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_F ( roman_Γ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] (28)
(1+n)k=0K𝔼t0[kF^(Γkt)kF^(Γkt1)2]+(1+1n)k=0K𝔼t0[kF^(Γkt1)kF(Γkt0)2]absent1𝑛superscriptsubscript𝑘0𝐾subscript𝔼subscript𝑡0delimited-[]superscriptnormsubscript𝑘^𝐹subscriptsuperscriptΓ𝑡𝑘subscript𝑘^𝐹subscriptsuperscriptΓ𝑡1𝑘211𝑛superscriptsubscript𝑘0𝐾subscript𝔼subscript𝑡0delimited-[]superscriptnormsubscript𝑘^𝐹subscriptsuperscriptΓ𝑡1𝑘subscript𝑘𝐹subscriptsuperscriptΓsubscript𝑡0𝑘2\displaystyle\leq(1+n)\sum_{k=0}^{K}\mathbb{E}_{t_{0}}\left[\left\|\nabla_{k}% \hat{F}({\Gamma}^{t}_{k})-\nabla_{k}\hat{F}({\Gamma}^{t-1}_{k})\right\|^{2}% \right]+(1+\frac{1}{n})\sum_{k=0}^{K}\mathbb{E}_{t_{0}}\left[\left\|\nabla_{k}% \hat{F}({\Gamma}^{t-1}_{k})-\nabla_{k}{F}({\Gamma}^{t_{0}}_{k})\right\|^{2}\right]≤ ( 1 + italic_n ) ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT blackboard_E start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ ∥ ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT over^ start_ARG italic_F end_ARG ( roman_Γ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) - ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT over^ start_ARG italic_F end_ARG ( roman_Γ start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] + ( 1 + divide start_ARG 1 end_ARG start_ARG italic_n end_ARG ) ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT blackboard_E start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ ∥ ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT over^ start_ARG italic_F end_ARG ( roman_Γ start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) - ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_F ( roman_Γ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] (29)
2(1+n)k=0K𝔼t0[kF(Γkt)kF(Γkt1)2]+(1+1n)k=0K𝔼t0[kF^(Γkt1)kF(Γkt0)2]absent21𝑛superscriptsubscript𝑘0𝐾subscript𝔼subscript𝑡0delimited-[]superscriptnormsubscript𝑘𝐹subscriptsuperscriptΓ𝑡𝑘subscript𝑘𝐹subscriptsuperscriptΓ𝑡1𝑘211𝑛superscriptsubscript𝑘0𝐾subscript𝔼subscript𝑡0delimited-[]superscriptnormsubscript𝑘^𝐹subscriptsuperscriptΓ𝑡1𝑘subscript𝑘𝐹subscriptsuperscriptΓsubscript𝑡0𝑘2\displaystyle\leq 2(1+n)\sum_{k=0}^{K}\mathbb{E}_{t_{0}}\left[\left\|\nabla_{k% }{F}({\Gamma}^{t}_{k})-\nabla_{k}{F}({\Gamma}^{t-1}_{k})\right\|^{2}\right]+(1% +\frac{1}{n})\sum_{k=0}^{K}\mathbb{E}_{t_{0}}\left[\left\|\nabla_{k}\hat{F}({% \Gamma}^{t-1}_{k})-\nabla_{k}{F}({\Gamma}^{t_{0}}_{k})\right\|^{2}\right]≤ 2 ( 1 + italic_n ) ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT blackboard_E start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ ∥ ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_F ( roman_Γ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) - ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_F ( roman_Γ start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] + ( 1 + divide start_ARG 1 end_ARG start_ARG italic_n end_ARG ) ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT blackboard_E start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ ∥ ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT over^ start_ARG italic_F end_ARG ( roman_Γ start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) - ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_F ( roman_Γ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] (30)
+2(1+n)k=0K𝔼t0[θkhk(θkt)R0k(Φkt+Ekt0)θkhk(θkt1)R0k(Φkt1+Ekt1)2]21𝑛superscriptsubscript𝑘0𝐾subscript𝔼subscript𝑡0delimited-[]superscriptnormsubscriptsubscript𝜃𝑘subscript𝑘superscriptsubscript𝜃𝑘𝑡superscriptsubscript𝑅0𝑘subscriptsuperscriptΦ𝑡𝑘superscriptsubscript𝐸𝑘subscript𝑡0subscriptsubscript𝜃𝑘subscript𝑘superscriptsubscript𝜃𝑘𝑡1superscriptsubscript𝑅0𝑘subscriptsuperscriptΦ𝑡1𝑘superscriptsubscript𝐸𝑘𝑡12\displaystyle+2(1+n)\sum_{k=0}^{K}\mathbb{E}_{t_{0}}\left[\left\|\nabla_{{% \theta}_{k}}h_{k}({\theta}_{k}^{t})R_{0}^{k}(\Phi^{t}_{k}+E_{k}^{t_{0}})-% \nabla_{{\theta}_{k}}h_{k}({\theta}_{k}^{t-1})R_{0}^{k}(\Phi^{t-1}_{k}+E_{k}^{% t-1})\right\|^{2}\right]+ 2 ( 1 + italic_n ) ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT blackboard_E start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ ∥ ∇ start_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) italic_R start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ( roman_Φ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT + italic_E start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) - ∇ start_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT ) italic_R start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ( roman_Φ start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT + italic_E start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] (31)
2(1+n)k=0K𝔼t0[kF(Γkt)kF(Γkt1)2]+(1+1n)k=0K𝔼t0[kF^(Γkt1)kF(Γkt0)2]absent21𝑛superscriptsubscript𝑘0𝐾subscript𝔼subscript𝑡0delimited-[]superscriptnormsubscript𝑘𝐹subscriptsuperscriptΓ𝑡𝑘subscript𝑘𝐹subscriptsuperscriptΓ𝑡1𝑘211𝑛superscriptsubscript𝑘0𝐾subscript𝔼subscript𝑡0delimited-[]superscriptnormsubscript𝑘^𝐹subscriptsuperscriptΓ𝑡1𝑘subscript𝑘𝐹subscriptsuperscriptΓsubscript𝑡0𝑘2\displaystyle\leq 2(1+n)\sum_{k=0}^{K}\mathbb{E}_{t_{0}}\left[\left\|\nabla_{k% }{F}({\Gamma}^{t}_{k})-\nabla_{k}{F}({\Gamma}^{t-1}_{k})\right\|^{2}\right]+(1% +\frac{1}{n})\sum_{k=0}^{K}\mathbb{E}_{t_{0}}\left[\left\|\nabla_{k}\hat{F}({% \Gamma}^{t-1}_{k})-\nabla_{k}{F}({\Gamma}^{t_{0}}_{k})\right\|^{2}\right]≤ 2 ( 1 + italic_n ) ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT blackboard_E start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ ∥ ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_F ( roman_Γ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) - ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_F ( roman_Γ start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] + ( 1 + divide start_ARG 1 end_ARG start_ARG italic_n end_ARG ) ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT blackboard_E start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ ∥ ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT over^ start_ARG italic_F end_ARG ( roman_Γ start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) - ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_F ( roman_Γ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] (32)
+8(1+n)k=0KGk2Hk2Ekt0281𝑛superscriptsubscript𝑘0𝐾superscriptsubscript𝐺𝑘2superscriptsubscript𝐻𝑘2subscriptsuperscriptnormsuperscriptsubscript𝐸𝑘subscript𝑡02\displaystyle+8(1+n)\sum_{k=0}^{K}G_{k}^{2}H_{k}^{2}\left\|E_{k}^{t_{0}}\right% \|^{2}_{\mathcal{F}}+ 8 ( 1 + italic_n ) ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_G start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_H start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∥ italic_E start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT caligraphic_F end_POSTSUBSCRIPT (33)

Where the last term follows the Eq. 21. Then we have:

𝔼t0G^tGt02subscript𝔼subscript𝑡0superscriptnormsuperscript^G𝑡superscriptGsubscript𝑡02\displaystyle\mathbb{E}_{t_{0}}\left\|\hat{\textbf{G}}^{t}-{\textbf{G}}^{t_{0}% }\right\|^{2}blackboard_E start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∥ over^ start_ARG G end_ARG start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT - G start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
2(1+n)(ηt0)2k=0K(Lk)2𝔼t0[kF^(Γkt1)2]+(1+1n)k=0K𝔼t0[kF^(Γkt1)kF(Γkt0)2]absent21𝑛superscriptsuperscript𝜂subscript𝑡02superscriptsubscript𝑘0𝐾superscriptsubscript𝐿𝑘2subscript𝔼subscript𝑡0delimited-[]superscriptnormsubscript𝑘^𝐹subscriptsuperscriptΓ𝑡1𝑘211𝑛superscriptsubscript𝑘0𝐾subscript𝔼subscript𝑡0delimited-[]superscriptnormsubscript𝑘^𝐹subscriptsuperscriptΓ𝑡1𝑘subscript𝑘𝐹subscriptsuperscriptΓsubscript𝑡0𝑘2\displaystyle\leq 2(1+n)(\eta^{t_{0}})^{2}\sum_{k=0}^{K}(L_{k})^{2}\mathbb{E}_% {t_{0}}\left[\left\|\nabla_{k}\hat{F}({\Gamma}^{t-1}_{k})\right\|^{2}\right]+(% 1+\frac{1}{n})\sum_{k=0}^{K}\mathbb{E}_{t_{0}}\left[\left\|\nabla_{k}\hat{F}({% \Gamma}^{t-1}_{k})-\nabla_{k}{F}({\Gamma}^{t_{0}}_{k})\right\|^{2}\right]≤ 2 ( 1 + italic_n ) ( italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT ( italic_L start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT blackboard_E start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ ∥ ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT over^ start_ARG italic_F end_ARG ( roman_Γ start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] + ( 1 + divide start_ARG 1 end_ARG start_ARG italic_n end_ARG ) ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT blackboard_E start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ ∥ ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT over^ start_ARG italic_F end_ARG ( roman_Γ start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) - ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_F ( roman_Γ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] (34)
+8(1+n)k=0KGk2Hk2Ekt0281𝑛superscriptsubscript𝑘0𝐾superscriptsubscript𝐺𝑘2superscriptsubscript𝐻𝑘2subscriptsuperscriptnormsuperscriptsubscript𝐸𝑘subscript𝑡02\displaystyle+8(1+n)\sum_{k=0}^{K}G_{k}^{2}H_{k}^{2}\left\|E_{k}^{t_{0}}\right% \|^{2}_{\mathcal{F}}+ 8 ( 1 + italic_n ) ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_G start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_H start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∥ italic_E start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT caligraphic_F end_POSTSUBSCRIPT (35)
2(1+n)(ηt0)2k=0K(Lk)2𝔼t0[kF^(Γkt1)kF(Γkt0)+kF(Γkt0)2]absent21𝑛superscriptsuperscript𝜂subscript𝑡02superscriptsubscript𝑘0𝐾superscriptsubscript𝐿𝑘2subscript𝔼subscript𝑡0delimited-[]superscriptnormsubscript𝑘^𝐹subscriptsuperscriptΓ𝑡1𝑘subscript𝑘𝐹subscriptsuperscriptΓsubscript𝑡0𝑘subscript𝑘𝐹subscriptsuperscriptΓsubscript𝑡0𝑘2\displaystyle\leq 2(1+n)(\eta^{t_{0}})^{2}\sum_{k=0}^{K}(L_{k})^{2}\mathbb{E}_% {t_{0}}\left[\left\|\nabla_{k}\hat{F}({\Gamma}^{t-1}_{k})-\nabla_{k}{F}({% \Gamma}^{t_{0}}_{k})+\nabla_{k}{F}({\Gamma}^{t_{0}}_{k})\right\|^{2}\right]≤ 2 ( 1 + italic_n ) ( italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT ( italic_L start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT blackboard_E start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ ∥ ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT over^ start_ARG italic_F end_ARG ( roman_Γ start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) - ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_F ( roman_Γ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) + ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_F ( roman_Γ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] (36)
+(1+1n)k=0K𝔼t0[kF^(Γkt1)kF(Γkt0)2]+8(1+n)k=0KGk2Hk2Ekt0211𝑛superscriptsubscript𝑘0𝐾subscript𝔼subscript𝑡0delimited-[]superscriptnormsubscript𝑘^𝐹subscriptsuperscriptΓ𝑡1𝑘subscript𝑘𝐹subscriptsuperscriptΓsubscript𝑡0𝑘281𝑛superscriptsubscript𝑘0𝐾superscriptsubscript𝐺𝑘2superscriptsubscript𝐻𝑘2subscriptsuperscriptnormsuperscriptsubscript𝐸𝑘subscript𝑡02\displaystyle+(1+\frac{1}{n})\sum_{k=0}^{K}\mathbb{E}_{t_{0}}\left[\left\|% \nabla_{k}\hat{F}({\Gamma}^{t-1}_{k})-\nabla_{k}{F}({\Gamma}^{t_{0}}_{k})% \right\|^{2}\right]+8(1+n)\sum_{k=0}^{K}G_{k}^{2}H_{k}^{2}\left\|E_{k}^{t_{0}}% \right\|^{2}_{\mathcal{F}}+ ( 1 + divide start_ARG 1 end_ARG start_ARG italic_n end_ARG ) ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT blackboard_E start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ ∥ ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT over^ start_ARG italic_F end_ARG ( roman_Γ start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) - ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_F ( roman_Γ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] + 8 ( 1 + italic_n ) ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_G start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_H start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∥ italic_E start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT caligraphic_F end_POSTSUBSCRIPT (37)
4(1+n)(ηt0)2k=0K(Lk)2𝔼t0[kF^(Γkt1)kF(Γkt0)2]absent41𝑛superscriptsuperscript𝜂subscript𝑡02superscriptsubscript𝑘0𝐾superscriptsubscript𝐿𝑘2subscript𝔼subscript𝑡0delimited-[]superscriptnormsubscript𝑘^𝐹subscriptsuperscriptΓ𝑡1𝑘subscript𝑘𝐹subscriptsuperscriptΓsubscript𝑡0𝑘2\displaystyle\leq 4(1+n)(\eta^{t_{0}})^{2}\sum_{k=0}^{K}(L_{k})^{2}\mathbb{E}_% {t_{0}}\left[\left\|\nabla_{k}\hat{F}({\Gamma}^{t-1}_{k})-\nabla_{k}{F}({% \Gamma}^{t_{0}}_{k})\right\|^{2}\right]≤ 4 ( 1 + italic_n ) ( italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT ( italic_L start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT blackboard_E start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ ∥ ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT over^ start_ARG italic_F end_ARG ( roman_Γ start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) - ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_F ( roman_Γ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] (38)
+4(1+n)(ηt0)2k=0K(Lk)2𝔼t0[kF(Γkt0)2]41𝑛superscriptsuperscript𝜂subscript𝑡02superscriptsubscript𝑘0𝐾superscriptsubscript𝐿𝑘2subscript𝔼subscript𝑡0delimited-[]superscriptnormsubscript𝑘𝐹subscriptsuperscriptΓsubscript𝑡0𝑘2\displaystyle+4(1+n)(\eta^{t_{0}})^{2}\sum_{k=0}^{K}(L_{k})^{2}\mathbb{E}_{t_{% 0}}\left[\left\|\nabla_{k}{F}({\Gamma}^{t_{0}}_{k})\right\|^{2}\right]+ 4 ( 1 + italic_n ) ( italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT ( italic_L start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT blackboard_E start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ ∥ ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_F ( roman_Γ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] (39)
+(1+1n)k=0K𝔼t0[kF^(Γkt1)kF(Γkt0)2]+8(1+n)k=0KGk2Hk2Ekt0211𝑛superscriptsubscript𝑘0𝐾subscript𝔼subscript𝑡0delimited-[]superscriptnormsubscript𝑘^𝐹subscriptsuperscriptΓ𝑡1𝑘subscript𝑘𝐹subscriptsuperscriptΓsubscript𝑡0𝑘281𝑛superscriptsubscript𝑘0𝐾superscriptsubscript𝐺𝑘2superscriptsubscript𝐻𝑘2subscriptsuperscriptnormsuperscriptsubscript𝐸𝑘subscript𝑡02\displaystyle+(1+\frac{1}{n})\sum_{k=0}^{K}\mathbb{E}_{t_{0}}\left[\left\|% \nabla_{k}\hat{F}({\Gamma}^{t-1}_{k})-\nabla_{k}{F}({\Gamma}^{t_{0}}_{k})% \right\|^{2}\right]+8(1+n)\sum_{k=0}^{K}G_{k}^{2}H_{k}^{2}\left\|E_{k}^{t_{0}}% \right\|^{2}_{\mathcal{F}}+ ( 1 + divide start_ARG 1 end_ARG start_ARG italic_n end_ARG ) ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT blackboard_E start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ ∥ ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT over^ start_ARG italic_F end_ARG ( roman_Γ start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) - ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_F ( roman_Γ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] + 8 ( 1 + italic_n ) ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_G start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_H start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∥ italic_E start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT caligraphic_F end_POSTSUBSCRIPT (40)
k=0K(4(1+n)(ηt0)2(Lk)2+(1+1n))𝔼t0[kF^(Γkt1)kF(Γkt0)2]absentsuperscriptsubscript𝑘0𝐾41𝑛superscriptsuperscript𝜂subscript𝑡02superscriptsubscript𝐿𝑘211𝑛subscript𝔼subscript𝑡0delimited-[]superscriptnormsubscript𝑘^𝐹subscriptsuperscriptΓ𝑡1𝑘subscript𝑘𝐹subscriptsuperscriptΓsubscript𝑡0𝑘2\displaystyle\leq\sum_{k=0}^{K}\left(4(1+n)(\eta^{t_{0}})^{2}(L_{k})^{2}+(1+% \frac{1}{n})\right)\mathbb{E}_{t_{0}}\left[\left\|\nabla_{k}\hat{F}({\Gamma}^{% t-1}_{k})-\nabla_{k}{F}({\Gamma}^{t_{0}}_{k})\right\|^{2}\right]≤ ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT ( 4 ( 1 + italic_n ) ( italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( italic_L start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + ( 1 + divide start_ARG 1 end_ARG start_ARG italic_n end_ARG ) ) blackboard_E start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ ∥ ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT over^ start_ARG italic_F end_ARG ( roman_Γ start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) - ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_F ( roman_Γ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] (41)
+4(1+n)(ηt0)2k=0K(Lk)2𝔼t0[kF(Γkt0)2]+8(1+n)k=0KGk2Hk2Ekt0241𝑛superscriptsuperscript𝜂subscript𝑡02superscriptsubscript𝑘0𝐾superscriptsubscript𝐿𝑘2subscript𝔼subscript𝑡0delimited-[]superscriptnormsubscript𝑘𝐹subscriptsuperscriptΓsubscript𝑡0𝑘281𝑛superscriptsubscript𝑘0𝐾superscriptsubscript𝐺𝑘2superscriptsubscript𝐻𝑘2subscriptsuperscriptnormsuperscriptsubscript𝐸𝑘subscript𝑡02\displaystyle+4(1+n)(\eta^{t_{0}})^{2}\sum_{k=0}^{K}(L_{k})^{2}\mathbb{E}_{t_{% 0}}\left[\left\|\nabla_{k}{F}({\Gamma}^{t_{0}}_{k})\right\|^{2}\right]+8(1+n)% \sum_{k=0}^{K}G_{k}^{2}H_{k}^{2}\left\|E_{k}^{t_{0}}\right\|^{2}_{\mathcal{F}}+ 4 ( 1 + italic_n ) ( italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT ( italic_L start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT blackboard_E start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ ∥ ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_F ( roman_Γ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] + 8 ( 1 + italic_n ) ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_G start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_H start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∥ italic_E start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT caligraphic_F end_POSTSUBSCRIPT (42)

Where Eq. 34 follows the assumption 1 and update rule ΓktΓkt1=ηt0kF^(Γkt1)subscriptsuperscriptΓ𝑡𝑘subscriptsuperscriptΓ𝑡1𝑘superscript𝜂subscript𝑡0subscript𝑘^𝐹subscriptsuperscriptΓ𝑡1𝑘\Gamma^{t}_{k}-{\Gamma}^{t-1}_{k}=\eta^{t_{0}}\nabla_{k}\hat{F}({\Gamma}^{t-1}% _{k})roman_Γ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT - roman_Γ start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT over^ start_ARG italic_F end_ARG ( roman_Γ start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ).
Then by set n=E𝑛𝐸n=Eitalic_n = italic_E and ηt014EmaxkLkasuperscript𝜂subscript𝑡014𝐸subscript𝑘superscriptsubscript𝐿𝑘𝑎\eta^{t_{0}}\leq\frac{1}{4E\max_{k}L_{k}^{a}}italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ≤ divide start_ARG 1 end_ARG start_ARG 4 italic_E roman_max start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_L start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_a end_POSTSUPERSCRIPT end_ARG, we can get:

𝔼t0G^tGt02subscript𝔼subscript𝑡0superscriptnormsuperscript^G𝑡superscriptGsubscript𝑡02\displaystyle\mathbb{E}_{t_{0}}\left\|\hat{\textbf{G}}^{t}-{\textbf{G}}^{t_{0}% }\right\|^{2}blackboard_E start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∥ over^ start_ARG G end_ARG start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT - G start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT (43)
k=0K((1+E)4E2+(1+1E))𝔼t0[kF^(Γkt1)kF(Γkt0)2]absentsuperscriptsubscript𝑘0𝐾1𝐸4superscript𝐸211𝐸subscript𝔼subscript𝑡0delimited-[]superscriptnormsubscript𝑘^𝐹subscriptsuperscriptΓ𝑡1𝑘subscript𝑘𝐹subscriptsuperscriptΓsubscript𝑡0𝑘2\displaystyle\leq\sum_{k=0}^{K}\left(\frac{(1+E)}{4E^{2}}+(1+\frac{1}{E})% \right)\mathbb{E}_{t_{0}}\left[\left\|\nabla_{k}\hat{F}({\Gamma}^{t-1}_{k})-% \nabla_{k}{F}({\Gamma}^{t_{0}}_{k})\right\|^{2}\right]≤ ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT ( divide start_ARG ( 1 + italic_E ) end_ARG start_ARG 4 italic_E start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG + ( 1 + divide start_ARG 1 end_ARG start_ARG italic_E end_ARG ) ) blackboard_E start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ ∥ ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT over^ start_ARG italic_F end_ARG ( roman_Γ start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) - ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_F ( roman_Γ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] (44)
+4(1+E)(ηt0)2k=0K(Lk)2𝔼t0[kF(Γkt0)2]+8(1+n)k=0KGk2Hk2Ekt0241𝐸superscriptsuperscript𝜂subscript𝑡02superscriptsubscript𝑘0𝐾superscriptsubscript𝐿𝑘2subscript𝔼subscript𝑡0delimited-[]superscriptnormsubscript𝑘𝐹subscriptsuperscriptΓsubscript𝑡0𝑘281𝑛superscriptsubscript𝑘0𝐾superscriptsubscript𝐺𝑘2superscriptsubscript𝐻𝑘2subscriptsuperscriptnormsuperscriptsubscript𝐸𝑘subscript𝑡02\displaystyle+4(1+E)(\eta^{t_{0}})^{2}\sum_{k=0}^{K}(L_{k})^{2}\mathbb{E}_{t_{% 0}}\left[\left\|\nabla_{k}{F}({\Gamma}^{t_{0}}_{k})\right\|^{2}\right]+8(1+n)% \sum_{k=0}^{K}G_{k}^{2}H_{k}^{2}\left\|E_{k}^{t_{0}}\right\|^{2}_{\mathcal{F}}+ 4 ( 1 + italic_E ) ( italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT ( italic_L start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT blackboard_E start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ ∥ ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_F ( roman_Γ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] + 8 ( 1 + italic_n ) ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_G start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_H start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∥ italic_E start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT caligraphic_F end_POSTSUBSCRIPT (45)
k=0K(1+2E)𝔼t0[kF^(Γkt1)kF(Γkt0)2]absentsuperscriptsubscript𝑘0𝐾12𝐸subscript𝔼subscript𝑡0delimited-[]superscriptnormsubscript𝑘^𝐹subscriptsuperscriptΓ𝑡1𝑘subscript𝑘𝐹subscriptsuperscriptΓsubscript𝑡0𝑘2\displaystyle\leq\sum_{k=0}^{K}\left(1+\frac{2}{E}\right)\mathbb{E}_{t_{0}}% \left[\left\|\nabla_{k}\hat{F}({\Gamma}^{t-1}_{k})-\nabla_{k}{F}({\Gamma}^{t_{% 0}}_{k})\right\|^{2}\right]≤ ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT ( 1 + divide start_ARG 2 end_ARG start_ARG italic_E end_ARG ) blackboard_E start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ ∥ ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT over^ start_ARG italic_F end_ARG ( roman_Γ start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) - ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_F ( roman_Γ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] (46)
+4(1+E)(ηt0)2k=0K(Lk)2𝔼t0[kF(Γkt0)2]+8(1+n)k=0KGk2Hk2Ekt0241𝐸superscriptsuperscript𝜂subscript𝑡02superscriptsubscript𝑘0𝐾superscriptsubscript𝐿𝑘2subscript𝔼subscript𝑡0delimited-[]superscriptnormsubscript𝑘𝐹subscriptsuperscriptΓsubscript𝑡0𝑘281𝑛superscriptsubscript𝑘0𝐾superscriptsubscript𝐺𝑘2superscriptsubscript𝐻𝑘2subscriptsuperscriptnormsuperscriptsubscript𝐸𝑘subscript𝑡02\displaystyle+4(1+E)(\eta^{t_{0}})^{2}\sum_{k=0}^{K}(L_{k})^{2}\mathbb{E}_{t_{% 0}}\left[\left\|\nabla_{k}{F}({\Gamma}^{t_{0}}_{k})\right\|^{2}\right]+8(1+n)% \sum_{k=0}^{K}G_{k}^{2}H_{k}^{2}\left\|E_{k}^{t_{0}}\right\|^{2}_{\mathcal{F}}+ 4 ( 1 + italic_E ) ( italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT ( italic_L start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT blackboard_E start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ ∥ ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_F ( roman_Γ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] + 8 ( 1 + italic_n ) ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_G start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_H start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∥ italic_E start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT caligraphic_F end_POSTSUBSCRIPT (47)

Here, we set the new notations for each term as follows:

At=k=0K𝔼t0[kF^(Γkt)kF(Γkt0)2]superscript𝐴𝑡superscriptsubscript𝑘0𝐾subscript𝔼subscript𝑡0delimited-[]superscriptnormsubscript𝑘^𝐹subscriptsuperscriptΓ𝑡𝑘subscript𝑘𝐹subscriptsuperscriptΓsubscript𝑡0𝑘2\displaystyle A^{t}=\sum_{k=0}^{K}\mathbb{E}_{t_{0}}\left[\left\|\nabla_{k}% \hat{F}({\Gamma}^{t}_{k})-\nabla_{k}{F}({\Gamma}^{t_{0}}_{k})\right\|^{2}\right]italic_A start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT = ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT blackboard_E start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ ∥ ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT over^ start_ARG italic_F end_ARG ( roman_Γ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) - ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_F ( roman_Γ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] (48)
B0=4(1+E)(ηt0)2k=0K(Lk)2𝔼t0[kF(Γkt0)2]subscript𝐵041𝐸superscriptsuperscript𝜂subscript𝑡02superscriptsubscript𝑘0𝐾superscriptsubscript𝐿𝑘2subscript𝔼subscript𝑡0delimited-[]superscriptnormsubscript𝑘𝐹subscriptsuperscriptΓsubscript𝑡0𝑘2\displaystyle B_{0}=4(1+E)(\eta^{t_{0}})^{2}\sum_{k=0}^{K}(L_{k})^{2}\mathbb{E% }_{t_{0}}\left[\left\|\nabla_{k}{F}({\Gamma}^{t_{0}}_{k})\right\|^{2}\right]italic_B start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = 4 ( 1 + italic_E ) ( italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT ( italic_L start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT blackboard_E start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ ∥ ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_F ( roman_Γ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] (49)
B1=8(1+n)k=0KGk2Hk2Ekt02subscript𝐵181𝑛superscriptsubscript𝑘0𝐾superscriptsubscript𝐺𝑘2superscriptsubscript𝐻𝑘2subscriptsuperscriptnormsuperscriptsubscript𝐸𝑘subscript𝑡02\displaystyle B_{1}=8(1+n)\sum_{k=0}^{K}G_{k}^{2}H_{k}^{2}\left\|E_{k}^{t_{0}}% \right\|^{2}_{\mathcal{F}}italic_B start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = 8 ( 1 + italic_n ) ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_G start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_H start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∥ italic_E start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT caligraphic_F end_POSTSUBSCRIPT (50)
C=(1+2E)𝐶12𝐸\displaystyle C=\left(1+\frac{2}{E}\right)italic_C = ( 1 + divide start_ARG 2 end_ARG start_ARG italic_E end_ARG ) (51)

We can get AtCAt1+B0+B1superscript𝐴𝑡𝐶superscript𝐴𝑡1subscript𝐵0subscript𝐵1A^{t}\leq CA^{t-1}+B_{0}+B_{1}italic_A start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ≤ italic_C italic_A start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT + italic_B start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_B start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT. By induction, we can obtain AtCtt01At0+(B0+B1)Ctt011C1superscript𝐴𝑡superscript𝐶𝑡subscript𝑡01superscript𝐴subscript𝑡0subscript𝐵0subscript𝐵1superscript𝐶𝑡subscript𝑡011𝐶1A^{t}\leq C^{t-t_{0}-1}A^{t_{0}}+(B_{0}+B_{1})\frac{C^{t-t_{0}-1}-1}{C-1}italic_A start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ≤ italic_C start_POSTSUPERSCRIPT italic_t - italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT - 1 end_POSTSUPERSCRIPT italic_A start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT + ( italic_B start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_B start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) divide start_ARG italic_C start_POSTSUPERSCRIPT italic_t - italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT - 1 end_POSTSUPERSCRIPT - 1 end_ARG start_ARG italic_C - 1 end_ARG.

At0superscript𝐴subscript𝑡0\displaystyle A^{t_{0}}italic_A start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT =k=0K𝔼t0[kF^(Γkt0)kF(Γkt0)2]k=0KGk2Hk2Ekt02absentsuperscriptsubscript𝑘0𝐾subscript𝔼subscript𝑡0delimited-[]superscriptnormsubscript𝑘^𝐹subscriptsuperscriptΓsubscript𝑡0𝑘subscript𝑘𝐹subscriptsuperscriptΓsubscript𝑡0𝑘2superscriptsubscript𝑘0𝐾superscriptsubscript𝐺𝑘2superscriptsubscript𝐻𝑘2superscriptnormsuperscriptsubscript𝐸𝑘subscript𝑡02\displaystyle=\sum_{k=0}^{K}\mathbb{E}_{t_{0}}\left[\left\|\nabla_{k}\hat{F}({% \Gamma}^{t_{0}}_{k})-\nabla_{k}{F}({\Gamma}^{t_{0}}_{k})\right\|^{2}\right]% \leq\sum_{k=0}^{K}G_{k}^{2}H_{k}^{2}\left\|E_{k}^{t_{0}}\right\|^{2}= ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT blackboard_E start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ ∥ ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT over^ start_ARG italic_F end_ARG ( roman_Γ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) - ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_F ( roman_Γ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] ≤ ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_G start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_H start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∥ italic_E start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT (52)

Then we get the bounds of Ctt01superscript𝐶𝑡subscript𝑡01C^{t-t_{0}-1}italic_C start_POSTSUPERSCRIPT italic_t - italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT - 1 end_POSTSUPERSCRIPT and Ctt011C1superscript𝐶𝑡subscript𝑡011𝐶1\frac{C^{t-t_{0}-1}-1}{C-1}divide start_ARG italic_C start_POSTSUPERSCRIPT italic_t - italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT - 1 end_POSTSUPERSCRIPT - 1 end_ARG start_ARG italic_C - 1 end_ARG with summing over the set of local iterations t0subscript𝑡0t_{0}italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT to t0+E1subscript𝑡0𝐸1t_{0}+E-1italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_E - 1 respectively:

t=t0t0+E1Ctt01=CE1C1superscriptsubscript𝑡subscript𝑡0subscript𝑡0𝐸1superscript𝐶𝑡subscript𝑡01superscript𝐶𝐸1𝐶1\displaystyle\sum_{t=t_{0}}^{t_{0}+E-1}C^{t-t_{0}-1}=\frac{C^{E}-1}{C-1}∑ start_POSTSUBSCRIPT italic_t = italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_E - 1 end_POSTSUPERSCRIPT italic_C start_POSTSUPERSCRIPT italic_t - italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT - 1 end_POSTSUPERSCRIPT = divide start_ARG italic_C start_POSTSUPERSCRIPT italic_E end_POSTSUPERSCRIPT - 1 end_ARG start_ARG italic_C - 1 end_ARG (53)
=(1+2E)E1(1+2E)1𝒆212E4Eabsentsuperscript12𝐸𝐸112𝐸1superscript𝒆212𝐸4𝐸\displaystyle=\frac{\left(1+\frac{2}{E}\right)^{E}-1}{\left(1+\frac{2}{E}% \right)-1}\leq\frac{{\boldsymbol{e}}^{2}-1}{2}E\leq 4E= divide start_ARG ( 1 + divide start_ARG 2 end_ARG start_ARG italic_E end_ARG ) start_POSTSUPERSCRIPT italic_E end_POSTSUPERSCRIPT - 1 end_ARG start_ARG ( 1 + divide start_ARG 2 end_ARG start_ARG italic_E end_ARG ) - 1 end_ARG ≤ divide start_ARG bold_italic_e start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT - 1 end_ARG start_ARG 2 end_ARG italic_E ≤ 4 italic_E (54)

We can also get

t=t0t0+E1Ctt011C1=1C1(t=t0t0+E1Ctt01E)superscriptsubscript𝑡subscript𝑡0subscript𝑡0𝐸1superscript𝐶𝑡subscript𝑡011𝐶11𝐶1superscriptsubscript𝑡subscript𝑡0subscript𝑡0𝐸1superscript𝐶𝑡subscript𝑡01𝐸\displaystyle\sum_{t=t_{0}}^{t_{0}+E-1}\frac{C^{t-t_{0}-1}-1}{C-1}=\frac{1}{C-% 1}\left(\sum_{t=t_{0}}^{t_{0}+E-1}C^{t-t_{0}-1}-E\right)∑ start_POSTSUBSCRIPT italic_t = italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_E - 1 end_POSTSUPERSCRIPT divide start_ARG italic_C start_POSTSUPERSCRIPT italic_t - italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT - 1 end_POSTSUPERSCRIPT - 1 end_ARG start_ARG italic_C - 1 end_ARG = divide start_ARG 1 end_ARG start_ARG italic_C - 1 end_ARG ( ∑ start_POSTSUBSCRIPT italic_t = italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_E - 1 end_POSTSUPERSCRIPT italic_C start_POSTSUPERSCRIPT italic_t - italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT - 1 end_POSTSUPERSCRIPT - italic_E ) (55)
=1C1(CE1C1E)=E2(E[(1+2E)E1]2E)absent1𝐶1superscript𝐶𝐸1𝐶1𝐸𝐸2𝐸delimited-[]superscript12𝐸𝐸12𝐸\displaystyle=\frac{1}{C-1}\left(\frac{C^{E}-1}{C-1}-E\right)=\frac{E}{2}\left% (\frac{E\left[(1+\frac{2}{E})^{E}-1\right]}{2}-E\right)= divide start_ARG 1 end_ARG start_ARG italic_C - 1 end_ARG ( divide start_ARG italic_C start_POSTSUPERSCRIPT italic_E end_POSTSUPERSCRIPT - 1 end_ARG start_ARG italic_C - 1 end_ARG - italic_E ) = divide start_ARG italic_E end_ARG start_ARG 2 end_ARG ( divide start_ARG italic_E [ ( 1 + divide start_ARG 2 end_ARG start_ARG italic_E end_ARG ) start_POSTSUPERSCRIPT italic_E end_POSTSUPERSCRIPT - 1 ] end_ARG start_ARG 2 end_ARG - italic_E ) (56)
=E22([(1+2E)E1]21)E22(𝒆2121)2E2absentsuperscript𝐸22delimited-[]superscript12𝐸𝐸121superscript𝐸22superscript𝒆21212superscript𝐸2\displaystyle=\frac{E^{2}}{2}\left(\frac{\left[(1+\frac{2}{E})^{E}-1\right]}{2% }-1\right)\leq\frac{E^{2}}{2}\left(\frac{{\boldsymbol{e}}^{2}-1}{2}-1\right)% \leq 2E^{2}= divide start_ARG italic_E start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 2 end_ARG ( divide start_ARG [ ( 1 + divide start_ARG 2 end_ARG start_ARG italic_E end_ARG ) start_POSTSUPERSCRIPT italic_E end_POSTSUPERSCRIPT - 1 ] end_ARG start_ARG 2 end_ARG - 1 ) ≤ divide start_ARG italic_E start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 2 end_ARG ( divide start_ARG bold_italic_e start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT - 1 end_ARG start_ARG 2 end_ARG - 1 ) ≤ 2 italic_E start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT (57)

Plugging all values into, we can get:

t=t0t0+E1Atsuperscriptsubscript𝑡subscript𝑡0subscript𝑡0𝐸1superscript𝐴𝑡\displaystyle\sum_{t=t_{0}}^{t_{0}+E-1}A^{t}∑ start_POSTSUBSCRIPT italic_t = italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_E - 1 end_POSTSUPERSCRIPT italic_A start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT t=t0t0+E1Ctt01At0+(B0+B1)t=t0t0+E1Ctt011C1absentsuperscriptsubscript𝑡subscript𝑡0subscript𝑡0𝐸1superscript𝐶𝑡subscript𝑡01superscript𝐴subscript𝑡0subscript𝐵0subscript𝐵1superscriptsubscript𝑡subscript𝑡0subscript𝑡0𝐸1superscript𝐶𝑡subscript𝑡011𝐶1\displaystyle\leq\sum_{t=t_{0}}^{t_{0}+E-1}C^{t-t_{0}-1}A^{t_{0}}+(B_{0}+B_{1}% )\sum_{t=t_{0}}^{t_{0}+E-1}\frac{C^{t-t_{0}-1}-1}{C-1}≤ ∑ start_POSTSUBSCRIPT italic_t = italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_E - 1 end_POSTSUPERSCRIPT italic_C start_POSTSUPERSCRIPT italic_t - italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT - 1 end_POSTSUPERSCRIPT italic_A start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT + ( italic_B start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_B start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ∑ start_POSTSUBSCRIPT italic_t = italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_E - 1 end_POSTSUPERSCRIPT divide start_ARG italic_C start_POSTSUPERSCRIPT italic_t - italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT - 1 end_POSTSUPERSCRIPT - 1 end_ARG start_ARG italic_C - 1 end_ARG (58)
4Ek=0KGk2Hk2Ekt02absent4𝐸superscriptsubscript𝑘0𝐾superscriptsubscript𝐺𝑘2superscriptsubscript𝐻𝑘2subscriptsuperscriptnormsuperscriptsubscript𝐸𝑘subscript𝑡02\displaystyle\leq 4E\sum_{k=0}^{K}G_{k}^{2}H_{k}^{2}\left\|E_{k}^{t_{0}}\right% \|^{2}_{\mathcal{F}}≤ 4 italic_E ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_G start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_H start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∥ italic_E start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT caligraphic_F end_POSTSUBSCRIPT (59)
+8E2(1+E)(ηt0)2k=0K(Lk)2𝔼t0[kF(Θkt0)2]8superscript𝐸21𝐸superscriptsuperscript𝜂subscript𝑡02superscriptsubscript𝑘0𝐾superscriptsubscript𝐿𝑘2subscript𝔼subscript𝑡0delimited-[]superscriptnormsubscript𝑘𝐹subscriptsuperscriptΘsubscript𝑡0𝑘2\displaystyle+8E^{2}(1+E)(\eta^{t_{0}})^{2}\sum_{k=0}^{K}(L_{k})^{2}\mathbb{E}% _{t_{0}}\left[\left\|\nabla_{k}{F}({\Theta}^{t_{0}}_{k})\right\|^{2}\right]+ 8 italic_E start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( 1 + italic_E ) ( italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT ( italic_L start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT blackboard_E start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ ∥ ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_F ( roman_Θ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] (60)
+16E2(1+E)k=0KGk2Hk2Ekt0216superscript𝐸21𝐸superscriptsubscript𝑘0𝐾superscriptsubscript𝐺𝑘2superscriptsubscript𝐻𝑘2subscriptsuperscriptnormsuperscriptsubscript𝐸𝑘subscript𝑡02\displaystyle+16E^{2}(1+E)\sum_{k=0}^{K}G_{k}^{2}H_{k}^{2}\left\|E_{k}^{t_{0}}% \right\|^{2}_{\mathcal{F}}+ 16 italic_E start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( 1 + italic_E ) ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_G start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_H start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∥ italic_E start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT caligraphic_F end_POSTSUBSCRIPT (61)
16E3(ηt0)2k=0K(Lk)2𝔼t0[kF(Θt0)2]absent16superscript𝐸3superscriptsuperscript𝜂subscript𝑡02superscriptsubscript𝑘0𝐾superscriptsubscript𝐿𝑘2subscript𝔼subscript𝑡0delimited-[]superscriptnormsubscript𝑘𝐹superscriptΘsubscript𝑡02\displaystyle\leq 16E^{3}(\eta^{t_{0}})^{2}\sum_{k=0}^{K}(L_{k})^{2}\mathbb{E}% _{t_{0}}\left[\left\|\nabla_{k}{F}({\Theta}^{t_{0}})\right\|^{2}\right]≤ 16 italic_E start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT ( italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT ( italic_L start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT blackboard_E start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ ∥ ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_F ( roman_Θ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] (62)
+4(4E2(1+E)+E)k=0KGk2Hk2Ekt0244superscript𝐸21𝐸𝐸superscriptsubscript𝑘0𝐾superscriptsubscript𝐺𝑘2superscriptsubscript𝐻𝑘2subscriptsuperscriptnormsuperscriptsubscript𝐸𝑘subscript𝑡02\displaystyle+4(4E^{2}(1+E)+E)\sum_{k=0}^{K}G_{k}^{2}H_{k}^{2}\left\|E_{k}^{t_% {0}}\right\|^{2}_{\mathcal{F}}+ 4 ( 4 italic_E start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( 1 + italic_E ) + italic_E ) ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_G start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_H start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∥ italic_E start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT caligraphic_F end_POSTSUBSCRIPT (63)
16E3(ηt0)2k=0K(Lk)2𝔼t0[kF(Θt0)2]absent16superscript𝐸3superscriptsuperscript𝜂subscript𝑡02superscriptsubscript𝑘0𝐾superscriptsubscript𝐿𝑘2subscript𝔼subscript𝑡0delimited-[]superscriptnormsubscript𝑘𝐹superscriptΘsubscript𝑡02\displaystyle\leq 16E^{3}(\eta^{t_{0}})^{2}\sum_{k=0}^{K}(L_{k})^{2}\mathbb{E}% _{t_{0}}\left[\left\|\nabla_{k}{F}({\Theta}^{t_{0}})\right\|^{2}\right]≤ 16 italic_E start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT ( italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT ( italic_L start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT blackboard_E start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ ∥ ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_F ( roman_Θ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] (64)
+36E3k=0KGk2Hk2Ekt0236superscript𝐸3superscriptsubscript𝑘0𝐾superscriptsubscript𝐺𝑘2superscriptsubscript𝐻𝑘2subscriptsuperscriptnormsuperscriptsubscript𝐸𝑘subscript𝑡02\displaystyle+36E^{3}\sum_{k=0}^{K}G_{k}^{2}H_{k}^{2}\left\|E_{k}^{t_{0}}% \right\|^{2}_{\mathcal{F}}+ 36 italic_E start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_G start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_H start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∥ italic_E start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT caligraphic_F end_POSTSUBSCRIPT (65)

-D Proof of Theorem 1

Here we will derive Theorem 1, first, we set t0+=t0+E1superscriptsubscript𝑡0subscript𝑡0𝐸1t_{0}^{+}=t_{0}+E-1italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT = italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_E - 1, then we get:

F(Θt0+)F(Θt0)F(Θt0),Θt0+Θt0+L2Θt0+Θt02𝐹superscriptΘsuperscriptsubscript𝑡0𝐹superscriptΘsubscript𝑡0𝐹superscriptΘsubscript𝑡0superscriptΘsuperscriptsubscript𝑡0superscriptΘsubscript𝑡0𝐿2superscriptnormsuperscriptΘsuperscriptsubscript𝑡0superscriptΘsubscript𝑡02\displaystyle{F}({\Theta}^{t_{0}^{+}})-{F}({\Theta}^{t_{0}})\leq\left\langle% \nabla{F}({\Theta}^{t_{0}}),{\Theta}^{t_{0}^{+}}-{\Theta}^{t_{0}}\right\rangle% +\frac{L}{2}\left\|{\Theta}^{t_{0}^{+}}-{\Theta}^{t_{0}}\right\|^{2}italic_F ( roman_Θ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT ) - italic_F ( roman_Θ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) ≤ ⟨ ∇ italic_F ( roman_Θ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) , roman_Θ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT - roman_Θ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ⟩ + divide start_ARG italic_L end_ARG start_ARG 2 end_ARG ∥ roman_Θ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT - roman_Θ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT (66)
=F(Θt0),t=t0t0+ηt0G^t+L2t=t0t0+ηt0G^t2absent𝐹superscriptΘsubscript𝑡0superscriptsubscript𝑡subscript𝑡0superscriptsubscript𝑡0superscript𝜂subscript𝑡0superscript^G𝑡𝐿2superscriptnormsuperscriptsubscript𝑡subscript𝑡0superscriptsubscript𝑡0superscript𝜂subscript𝑡0superscript^G𝑡2\displaystyle=-\left\langle\nabla{F}({\Theta}^{t_{0}}),\sum_{t=t_{0}}^{t_{0}^{% +}}\eta^{t_{0}}\hat{\textbf{G}}^{t}\right\rangle+\frac{L}{2}\left\|\sum_{t=t_{% 0}}^{t_{0}^{+}}\eta^{t_{0}}\hat{\textbf{G}}^{t}\right\|^{2}= - ⟨ ∇ italic_F ( roman_Θ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) , ∑ start_POSTSUBSCRIPT italic_t = italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT over^ start_ARG G end_ARG start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ⟩ + divide start_ARG italic_L end_ARG start_ARG 2 end_ARG ∥ ∑ start_POSTSUBSCRIPT italic_t = italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT over^ start_ARG G end_ARG start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT (67)
t=t0t0+ηt0F(Θt0),G^t+LE2t=t0t0+(ηt0)2G^t2absentsuperscriptsubscript𝑡subscript𝑡0superscriptsubscript𝑡0superscript𝜂subscript𝑡0𝐹superscriptΘsubscript𝑡0superscript^G𝑡𝐿𝐸2superscriptsubscript𝑡subscript𝑡0superscriptsubscript𝑡0superscriptsuperscript𝜂subscript𝑡02superscriptnormsuperscript^G𝑡2\displaystyle\leq-\sum_{t=t_{0}}^{t_{0}^{+}}\eta^{t_{0}}\left\langle\nabla{F}(% {\Theta}^{t_{0}}),\hat{\textbf{G}}^{t}\right\rangle+\frac{LE}{2}\sum_{t=t_{0}}% ^{t_{0}^{+}}(\eta^{t_{0}})^{2}\left\|\hat{\textbf{G}}^{t}\right\|^{2}≤ - ∑ start_POSTSUBSCRIPT italic_t = italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ⟨ ∇ italic_F ( roman_Θ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) , over^ start_ARG G end_ARG start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ⟩ + divide start_ARG italic_L italic_E end_ARG start_ARG 2 end_ARG ∑ start_POSTSUBSCRIPT italic_t = italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT ( italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∥ over^ start_ARG G end_ARG start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT (68)
t=t0t0+ηt0F(Θt0),G^tGt0t=t0t0+ηt0F(Θt0),Gt0absentsuperscriptsubscript𝑡subscript𝑡0superscriptsubscript𝑡0superscript𝜂subscript𝑡0𝐹superscriptΘsubscript𝑡0superscript^G𝑡superscriptGsubscript𝑡0superscriptsubscript𝑡subscript𝑡0superscriptsubscript𝑡0superscript𝜂subscript𝑡0𝐹superscriptΘsubscript𝑡0superscriptGsubscript𝑡0\displaystyle\leq\sum_{t=t_{0}}^{t_{0}^{+}}\eta^{t_{0}}\left\langle-\nabla{F}(% {\Theta}^{t_{0}}),\hat{\textbf{G}}^{t}-{\textbf{G}}^{t_{0}}\right\rangle-\sum_% {t=t_{0}}^{t_{0}^{+}}\eta^{t_{0}}\left\langle\nabla{F}({\Theta}^{t_{0}}),{% \textbf{G}}^{t_{0}}\right\rangle≤ ∑ start_POSTSUBSCRIPT italic_t = italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ⟨ - ∇ italic_F ( roman_Θ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) , over^ start_ARG G end_ARG start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT - G start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ⟩ - ∑ start_POSTSUBSCRIPT italic_t = italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ⟨ ∇ italic_F ( roman_Θ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) , G start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ⟩ (69)
+LEt=t0t0+(ηt0)2G^tGt02+LEt=t0t0+(ηt0)2Gt02𝐿𝐸superscriptsubscript𝑡subscript𝑡0superscriptsubscript𝑡0superscriptsuperscript𝜂subscript𝑡02superscriptnormsuperscript^G𝑡superscriptGsubscript𝑡02𝐿𝐸superscriptsubscript𝑡subscript𝑡0superscriptsubscript𝑡0superscriptsuperscript𝜂subscript𝑡02superscriptnormsuperscriptGsubscript𝑡02\displaystyle+LE\sum_{t=t_{0}}^{t_{0}^{+}}(\eta^{t_{0}})^{2}\left\|\hat{% \textbf{G}}^{t}-{\textbf{G}}^{t_{0}}\right\|^{2}+LE\sum_{t=t_{0}}^{t_{0}^{+}}(% \eta^{t_{0}})^{2}\left\|{\textbf{G}}^{t_{0}}\right\|^{2}+ italic_L italic_E ∑ start_POSTSUBSCRIPT italic_t = italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT ( italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∥ over^ start_ARG G end_ARG start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT - G start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_L italic_E ∑ start_POSTSUBSCRIPT italic_t = italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT ( italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∥ G start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT (70)
12t=t0t0+ηt0F(Θt0)2+12t=t0t0+ηt0G^tGt02t=t0t0+ηt0F(Θt0),Gt0absent12superscriptsubscript𝑡subscript𝑡0superscriptsubscript𝑡0superscript𝜂subscript𝑡0superscriptnorm𝐹superscriptΘsubscript𝑡0212superscriptsubscript𝑡subscript𝑡0superscriptsubscript𝑡0superscript𝜂subscript𝑡0superscriptnormsuperscript^G𝑡superscriptGsubscript𝑡02superscriptsubscript𝑡subscript𝑡0superscriptsubscript𝑡0superscript𝜂subscript𝑡0𝐹superscriptΘsubscript𝑡0superscriptGsubscript𝑡0\displaystyle\leq\frac{1}{2}\sum_{t=t_{0}}^{t_{0}^{+}}\eta^{t_{0}}\left\|% \nabla{F}({\Theta}^{t_{0}})\right\|^{2}+\frac{1}{2}\sum_{t=t_{0}}^{t_{0}^{+}}% \eta^{t_{0}}\left\|\hat{\textbf{G}}^{t}-{\textbf{G}}^{t_{0}}\right\|^{2}-\sum_% {t=t_{0}}^{t_{0}^{+}}\eta^{t_{0}}\left\langle\nabla{F}({\Theta}^{t_{0}}),{% \textbf{G}}^{t_{0}}\right\rangle≤ divide start_ARG 1 end_ARG start_ARG 2 end_ARG ∑ start_POSTSUBSCRIPT italic_t = italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∥ ∇ italic_F ( roman_Θ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + divide start_ARG 1 end_ARG start_ARG 2 end_ARG ∑ start_POSTSUBSCRIPT italic_t = italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∥ over^ start_ARG G end_ARG start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT - G start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT - ∑ start_POSTSUBSCRIPT italic_t = italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ⟨ ∇ italic_F ( roman_Θ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) , G start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ⟩ (71)
+LEt=t0t0+(ηt0)2G^tGt02+LEt=t0t0+(ηt0)2Gt02𝐿𝐸superscriptsubscript𝑡subscript𝑡0superscriptsubscript𝑡0superscriptsuperscript𝜂subscript𝑡02superscriptnormsuperscript^G𝑡superscriptGsubscript𝑡02𝐿𝐸superscriptsubscript𝑡subscript𝑡0superscriptsubscript𝑡0superscriptsuperscript𝜂subscript𝑡02superscriptnormsuperscriptGsubscript𝑡02\displaystyle+LE\sum_{t=t_{0}}^{t_{0}^{+}}(\eta^{t_{0}})^{2}\left\|\hat{% \textbf{G}}^{t}-{\textbf{G}}^{t_{0}}\right\|^{2}+LE\sum_{t=t_{0}}^{t_{0}^{+}}(% \eta^{t_{0}})^{2}\left\|{\textbf{G}}^{t_{0}}\right\|^{2}+ italic_L italic_E ∑ start_POSTSUBSCRIPT italic_t = italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT ( italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∥ over^ start_ARG G end_ARG start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT - G start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_L italic_E ∑ start_POSTSUBSCRIPT italic_t = italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT ( italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∥ G start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT (72)

where the last term follows AB=12A2+12B212(AB)2𝐴𝐵12superscript𝐴212superscript𝐵212superscript𝐴𝐵2A\cdot B=\frac{1}{2}A^{2}+\frac{1}{2}B^{2}-\frac{1}{2}(A-B)^{2}italic_A ⋅ italic_B = divide start_ARG 1 end_ARG start_ARG 2 end_ARG italic_A start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + divide start_ARG 1 end_ARG start_ARG 2 end_ARG italic_B start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT - divide start_ARG 1 end_ARG start_ARG 2 end_ARG ( italic_A - italic_B ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT. Then we apply the expectation 𝔼t0subscript𝔼subscript𝑡0\mathbb{E}_{t_{0}}blackboard_E start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT to both sides of the last term.

𝔼t0[F(Θt0+)]F(Θt0)subscript𝔼subscript𝑡0delimited-[]𝐹superscriptΘsuperscriptsubscript𝑡0𝐹superscriptΘsubscript𝑡0\displaystyle\mathbb{E}_{t_{0}}[{F}({\Theta}^{t_{0}^{+}})]-{F}({\Theta}^{t_{0}})blackboard_E start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ italic_F ( roman_Θ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT ) ] - italic_F ( roman_Θ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) (73)
12t=t0t0+ηt0F(Θt0)2+12t=t0t0+ηt0(1+2LEηt0)𝔼t0G^tGt02+LEt=t0t0+(ηt0)2𝔼t0Gt02absent12superscriptsubscript𝑡subscript𝑡0superscriptsubscript𝑡0superscript𝜂subscript𝑡0superscriptnorm𝐹superscriptΘsubscript𝑡0212superscriptsubscript𝑡subscript𝑡0superscriptsubscript𝑡0superscript𝜂subscript𝑡012𝐿𝐸superscript𝜂subscript𝑡0subscript𝔼subscript𝑡0superscriptnormsuperscript^G𝑡superscriptGsubscript𝑡02𝐿𝐸superscriptsubscript𝑡subscript𝑡0superscriptsubscript𝑡0superscriptsuperscript𝜂subscript𝑡02subscript𝔼subscript𝑡0superscriptnormsuperscriptGsubscript𝑡02\displaystyle\leq-\frac{1}{2}\sum_{t=t_{0}}^{t_{0}^{+}}\eta^{t_{0}}\left\|% \nabla{F}({\Theta}^{t_{0}})\right\|^{2}+\frac{1}{2}\sum_{t=t_{0}}^{t_{0}^{+}}% \eta^{t_{0}}(1+2LE\eta^{t_{0}})\mathbb{E}_{t_{0}}\left\|\hat{\textbf{G}}^{t}-{% \textbf{G}}^{t_{0}}\right\|^{2}+LE\sum_{t=t_{0}}^{t_{0}^{+}}(\eta^{t_{0}})^{2}% \mathbb{E}_{t_{0}}\left\|{\textbf{G}}^{t_{0}}\right\|^{2}≤ - divide start_ARG 1 end_ARG start_ARG 2 end_ARG ∑ start_POSTSUBSCRIPT italic_t = italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∥ ∇ italic_F ( roman_Θ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + divide start_ARG 1 end_ARG start_ARG 2 end_ARG ∑ start_POSTSUBSCRIPT italic_t = italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( 1 + 2 italic_L italic_E italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) blackboard_E start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∥ over^ start_ARG G end_ARG start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT - G start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_L italic_E ∑ start_POSTSUBSCRIPT italic_t = italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT ( italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT blackboard_E start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∥ G start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT (74)
12t=t0t0+ηt0(12LEηt0)F(Θt0)2+12t=t0t0+ηt0(1+2LEηt0)𝔼t0G^tGt02absent12superscriptsubscript𝑡subscript𝑡0superscriptsubscript𝑡0superscript𝜂subscript𝑡012𝐿𝐸superscript𝜂subscript𝑡0superscriptnorm𝐹superscriptΘsubscript𝑡0212superscriptsubscript𝑡subscript𝑡0superscriptsubscript𝑡0superscript𝜂subscript𝑡012𝐿𝐸superscript𝜂subscript𝑡0subscript𝔼subscript𝑡0superscriptnormsuperscript^G𝑡superscriptGsubscript𝑡02\displaystyle\leq-\frac{1}{2}\sum_{t=t_{0}}^{t_{0}^{+}}\eta^{t_{0}}(1-2LE\eta^% {t_{0}})\left\|\nabla{F}({\Theta}^{t_{0}})\right\|^{2}+\frac{1}{2}\sum_{t=t_{0% }}^{t_{0}^{+}}\eta^{t_{0}}(1+2LE\eta^{t_{0}})\mathbb{E}_{t_{0}}\left\|\hat{% \textbf{G}}^{t}-{\textbf{G}}^{t_{0}}\right\|^{2}≤ - divide start_ARG 1 end_ARG start_ARG 2 end_ARG ∑ start_POSTSUBSCRIPT italic_t = italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( 1 - 2 italic_L italic_E italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) ∥ ∇ italic_F ( roman_Θ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + divide start_ARG 1 end_ARG start_ARG 2 end_ARG ∑ start_POSTSUBSCRIPT italic_t = italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( 1 + 2 italic_L italic_E italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) blackboard_E start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∥ over^ start_ARG G end_ARG start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT - G start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT (75)
=E2ηt0(12LEηt0)F(Θt0)2+12t=t0t0+ηt0(1+2LEηt0)𝔼t0G^tGt02absent𝐸2superscript𝜂subscript𝑡012𝐿𝐸superscript𝜂subscript𝑡0superscriptnorm𝐹superscriptΘsubscript𝑡0212superscriptsubscript𝑡subscript𝑡0superscriptsubscript𝑡0superscript𝜂subscript𝑡012𝐿𝐸superscript𝜂subscript𝑡0subscript𝔼subscript𝑡0superscriptnormsuperscript^G𝑡superscriptGsubscript𝑡02\displaystyle=-\frac{E}{2}\eta^{t_{0}}(1-2LE\eta^{t_{0}})\left\|\nabla{F}({% \Theta}^{t_{0}})\right\|^{2}+\frac{1}{2}\sum_{t=t_{0}}^{t_{0}^{+}}\eta^{t_{0}}% (1+2LE\eta^{t_{0}})\mathbb{E}_{t_{0}}\left\|\hat{\textbf{G}}^{t}-{\textbf{G}}^% {t_{0}}\right\|^{2}= - divide start_ARG italic_E end_ARG start_ARG 2 end_ARG italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( 1 - 2 italic_L italic_E italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) ∥ ∇ italic_F ( roman_Θ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + divide start_ARG 1 end_ARG start_ARG 2 end_ARG ∑ start_POSTSUBSCRIPT italic_t = italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( 1 + 2 italic_L italic_E italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) blackboard_E start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∥ over^ start_ARG G end_ARG start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT - G start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT (76)

where the Eq.74 follows Gt0=F(Θt0)superscriptGsubscript𝑡0𝐹superscriptΘsubscript𝑡0{\textbf{G}}^{t_{0}}=\nabla{F}({\Theta}^{t_{0}})G start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT = ∇ italic_F ( roman_Θ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) since we consider the full batch participation in each training round. If mini-batch is considered here there will have additional terms. Then with Lemma 2, we have:

𝔼t0[F(Θt0+)]F(Θt0)subscript𝔼subscript𝑡0delimited-[]𝐹superscriptΘsuperscriptsubscript𝑡0𝐹superscriptΘsubscript𝑡0\displaystyle\mathbb{E}_{t_{0}}[{F}({\Theta}^{t_{0}^{+}})]-{F}({\Theta}^{t_{0}})blackboard_E start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ italic_F ( roman_Θ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT ) ] - italic_F ( roman_Θ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) (77)
E2ηt0(12LEηt0)F(Θt0)2+8E3(ηt0)3(1+2LEηt0)k=0K(Lk)2kF(Θkt0)2absent𝐸2superscript𝜂subscript𝑡012𝐿𝐸superscript𝜂subscript𝑡0superscriptnorm𝐹superscriptΘsubscript𝑡028superscript𝐸3superscriptsuperscript𝜂subscript𝑡0312𝐿𝐸superscript𝜂subscript𝑡0superscriptsubscript𝑘0𝐾superscriptsubscript𝐿𝑘2superscriptnormsubscript𝑘𝐹subscriptsuperscriptΘsubscript𝑡0𝑘2\displaystyle\leq-\frac{E}{2}\eta^{t_{0}}(1-2LE\eta^{t_{0}})\left\|\nabla{F}({% \Theta}^{t_{0}})\right\|^{2}+8E^{3}(\eta^{t_{0}})^{3}(1+2LE\eta^{t_{0}})\sum_{% k=0}^{K}(L_{k})^{2}\left\|\nabla_{k}{F}({\Theta}^{t_{0}}_{k})\right\|^{2}≤ - divide start_ARG italic_E end_ARG start_ARG 2 end_ARG italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( 1 - 2 italic_L italic_E italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) ∥ ∇ italic_F ( roman_Θ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + 8 italic_E start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT ( italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT ( 1 + 2 italic_L italic_E italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT ( italic_L start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∥ ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_F ( roman_Θ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT (78)
+18E3ηt0(1+2LEηt0)k=0KHk2Gk2Ekt0218superscript𝐸3superscript𝜂subscript𝑡012𝐿𝐸superscript𝜂subscript𝑡0superscriptsubscript𝑘0𝐾superscriptsubscript𝐻𝑘2superscriptsubscript𝐺𝑘2subscriptsuperscriptnormsubscriptsuperscript𝐸subscript𝑡0𝑘2\displaystyle+18E^{3}\eta^{t_{0}}(1+2LE\eta^{t_{0}})\sum_{k=0}^{K}H_{k}^{2}G_{% k}^{2}\left\|E^{t_{0}}_{k}\right\|^{2}_{\mathcal{F}}+ 18 italic_E start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( 1 + 2 italic_L italic_E italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_H start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_G start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∥ italic_E start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT caligraphic_F end_POSTSUBSCRIPT (79)

Then by let ηt0116Emax{L,maxkLk}superscript𝜂subscript𝑡0116𝐸𝐿subscript𝑘subscript𝐿𝑘\eta^{t_{0}}\leq\frac{1}{16E\max\left\{L,\max_{k}L_{k}\right\}}italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ≤ divide start_ARG 1 end_ARG start_ARG 16 italic_E roman_max { italic_L , roman_max start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_L start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT } end_ARG, we can get the further bound:

𝔼t0[F(Θt0+)]F(Θt0)E2k=0Kηt0(12LEηt016E2(Lk)2(ηt0)216E3(Lk)2L(ηt0)3)kF(Θkt0)2subscript𝔼subscript𝑡0delimited-[]𝐹superscriptΘsuperscriptsubscript𝑡0𝐹superscriptΘsubscript𝑡0𝐸2superscriptsubscript𝑘0𝐾superscript𝜂subscript𝑡012𝐿𝐸superscript𝜂subscript𝑡016superscript𝐸2superscriptsubscript𝐿𝑘2superscriptsuperscript𝜂subscript𝑡0216superscript𝐸3superscriptsubscript𝐿𝑘2𝐿superscriptsuperscript𝜂subscript𝑡03superscriptnormsubscript𝑘𝐹subscriptsuperscriptΘsubscript𝑡0𝑘2\displaystyle\mathbb{E}_{t_{0}}[{F}({\Theta}^{t_{0}^{+}})]-{F}({\Theta}^{t_{0}% })\leq-\frac{E}{2}\sum_{k=0}^{K}\eta^{t_{0}}(1-2LE\eta^{t_{0}}-16E^{2}(L_{k})^% {2}(\eta^{t_{0}})^{2}-16E^{3}(L_{k})^{2}L(\eta^{t_{0}})^{3})\left\|\nabla_{k}{% F}({\Theta}^{t_{0}}_{k})\right\|^{2}blackboard_E start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ italic_F ( roman_Θ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT ) ] - italic_F ( roman_Θ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) ≤ - divide start_ARG italic_E end_ARG start_ARG 2 end_ARG ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( 1 - 2 italic_L italic_E italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT - 16 italic_E start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( italic_L start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT - 16 italic_E start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT ( italic_L start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_L ( italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT ) ∥ ∇ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_F ( roman_Θ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
+18E3ηt0(1+2LaEηt0)k=0KHk2Gk2Ekt0218superscript𝐸3superscript𝜂subscript𝑡012superscript𝐿𝑎𝐸superscript𝜂subscript𝑡0superscriptsubscript𝑘0𝐾superscriptsubscript𝐻𝑘2superscriptsubscript𝐺𝑘2subscriptsuperscriptnormsubscriptsuperscript𝐸subscript𝑡0𝑘2\displaystyle+18E^{3}\eta^{t_{0}}(1+2L^{a}E\eta^{t_{0}})\sum_{k=0}^{K}H_{k}^{2% }G_{k}^{2}\left\|E^{t_{0}}_{k}\right\|^{2}_{\mathcal{F}}+ 18 italic_E start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( 1 + 2 italic_L start_POSTSUPERSCRIPT italic_a end_POSTSUPERSCRIPT italic_E italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_H start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_G start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∥ italic_E start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT caligraphic_F end_POSTSUBSCRIPT (80)
3E8ηt0F(Θt0)2+18E3ηt0(1+2LEηt0)k=0KHk2Gk2Ekt02absent3𝐸8superscript𝜂subscript𝑡0superscriptnorm𝐹superscriptΘsubscript𝑡0218superscript𝐸3superscript𝜂subscript𝑡012𝐿𝐸superscript𝜂subscript𝑡0superscriptsubscript𝑘0𝐾superscriptsubscript𝐻𝑘2superscriptsubscript𝐺𝑘2subscriptsuperscriptnormsubscriptsuperscript𝐸subscript𝑡0𝑘2\displaystyle\leq-\frac{3E}{8}\eta^{t_{0}}\left\|\nabla{F}({\Theta}^{t_{0}})% \right\|^{2}+18E^{3}\eta^{t_{0}}(1+2LE\eta^{t_{0}})\sum_{k=0}^{K}H_{k}^{2}G_{k% }^{2}\left\|E^{t_{0}}_{k}\right\|^{2}_{\mathcal{F}}≤ - divide start_ARG 3 italic_E end_ARG start_ARG 8 end_ARG italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∥ ∇ italic_F ( roman_Θ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + 18 italic_E start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( 1 + 2 italic_L italic_E italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_H start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_G start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∥ italic_E start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT caligraphic_F end_POSTSUBSCRIPT (81)

Then we rearranged the terms:

ηt0F(Θt0)2superscript𝜂subscript𝑡0superscriptnorm𝐹superscriptΘsubscript𝑡02\displaystyle\eta^{t_{0}}\left\|\nabla{F}({\Theta}^{t_{0}})\right\|^{2}italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∥ ∇ italic_F ( roman_Θ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT 3[F(Θt0)𝔼t0[F(Θt0+)]]E+48E2ηt0(1+2LEηt0)k=0KHk2Gk2Ekt02absent3delimited-[]𝐹superscriptΘsubscript𝑡0subscript𝔼subscript𝑡0delimited-[]𝐹superscriptΘsuperscriptsubscript𝑡0𝐸48superscript𝐸2superscript𝜂subscript𝑡012𝐿𝐸superscript𝜂subscript𝑡0superscriptsubscript𝑘0𝐾superscriptsubscript𝐻𝑘2superscriptsubscript𝐺𝑘2subscriptsuperscriptnormsubscriptsuperscript𝐸subscript𝑡0𝑘2\displaystyle\leq\frac{3\left[{F}({\Theta}^{t_{0}})-\mathbb{E}_{t_{0}}[{F}({% \Theta}^{t_{0}^{+}})]\right]}{E}+48E^{2}\eta^{t_{0}}(1+2LE\eta^{t_{0}})\sum_{k% =0}^{K}H_{k}^{2}G_{k}^{2}\left\|E^{t_{0}}_{k}\right\|^{2}_{\mathcal{F}}≤ divide start_ARG 3 [ italic_F ( roman_Θ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) - blackboard_E start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ italic_F ( roman_Θ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT ) ] ] end_ARG start_ARG italic_E end_ARG + 48 italic_E start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( 1 + 2 italic_L italic_E italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_H start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_G start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∥ italic_E start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT caligraphic_F end_POSTSUBSCRIPT (82)

summing over all global round t0=0,,R1subscript𝑡00𝑅1t_{0}=0,\dots,R-1italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = 0 , … , italic_R - 1 with expectation:

t0=0R1ηt0𝔼[F(Θt0)2]superscriptsubscriptsubscript𝑡00𝑅1superscript𝜂subscript𝑡0𝔼delimited-[]superscriptnorm𝐹superscriptΘsubscript𝑡02\displaystyle\sum_{t_{0}=0}^{R-1}\eta^{t_{0}}\mathbb{E}[\left\|\nabla{F}({% \Theta}^{t_{0}})\right\|^{2}]∑ start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_R - 1 end_POSTSUPERSCRIPT italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT blackboard_E [ ∥ ∇ italic_F ( roman_Θ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] 3[F(Θ0)𝔼t0[F(ΘT)]]E+48E2ηt0(1+2LEηt0)t0=0R1k=0KHk2Gk2𝔼[Ekt02]absent3delimited-[]𝐹superscriptΘ0subscript𝔼subscript𝑡0delimited-[]𝐹superscriptΘ𝑇𝐸48superscript𝐸2superscript𝜂subscript𝑡012𝐿𝐸superscript𝜂subscript𝑡0superscriptsubscriptsubscript𝑡00𝑅1superscriptsubscript𝑘0𝐾superscriptsubscript𝐻𝑘2superscriptsubscript𝐺𝑘2𝔼delimited-[]subscriptsuperscriptnormsubscriptsuperscript𝐸subscript𝑡0𝑘2\displaystyle\leq\frac{3\left[{F}({\Theta}^{0})-\mathbb{E}_{t_{0}}[{F}({\Theta% }^{T})]\right]}{E}+48E^{2}\eta^{t_{0}}(1+2LE\eta^{t_{0}})\sum_{t_{0}=0}^{R-1}% \sum_{k=0}^{K}H_{k}^{2}G_{k}^{2}\mathbb{E}[\left\|E^{t_{0}}_{k}\right\|^{2}_{% \mathcal{F}}]≤ divide start_ARG 3 [ italic_F ( roman_Θ start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT ) - blackboard_E start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ italic_F ( roman_Θ start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) ] ] end_ARG start_ARG italic_E end_ARG + 48 italic_E start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( 1 + 2 italic_L italic_E italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) ∑ start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_R - 1 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_H start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_G start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT blackboard_E [ ∥ italic_E start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT caligraphic_F end_POSTSUBSCRIPT ] (83)

Then with ηt0=ηsuperscript𝜂subscript𝑡0𝜂\eta^{t_{0}}=\etaitalic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT = italic_η and averaging over R global rounds, we get:

1Rt0=0R1𝔼[F(Θt0)2]3[F(Θ0)𝔼t0[F(ΘT)]]ηRE1𝑅superscriptsubscriptsubscript𝑡00𝑅1𝔼delimited-[]superscriptnorm𝐹superscriptΘsubscript𝑡023delimited-[]𝐹superscriptΘ0subscript𝔼subscript𝑡0delimited-[]𝐹superscriptΘ𝑇𝜂𝑅𝐸\displaystyle\frac{1}{R}\sum_{t_{0}=0}^{R-1}\mathbb{E}[\left\|\nabla{F}({% \Theta}^{t_{0}})\right\|^{2}]\leq\frac{3\left[{F}({\Theta}^{0})-\mathbb{E}_{t_% {0}}[{F}({\Theta}^{T})]\right]}{\eta RE}divide start_ARG 1 end_ARG start_ARG italic_R end_ARG ∑ start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_R - 1 end_POSTSUPERSCRIPT blackboard_E [ ∥ ∇ italic_F ( roman_Θ start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] ≤ divide start_ARG 3 [ italic_F ( roman_Θ start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT ) - blackboard_E start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ italic_F ( roman_Θ start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) ] ] end_ARG start_ARG italic_η italic_R italic_E end_ARG (84)
+96E2R(1+2LEη)t0=0R1k=0KHk2Gk2j0KΨjt0+96E2R(1+2LEη)t0=0R1k=0KGk2Hk2j0,jkKΩjt096superscript𝐸2𝑅12𝐿𝐸𝜂superscriptsubscriptsubscript𝑡00𝑅1superscriptsubscript𝑘0𝐾superscriptsubscript𝐻𝑘2superscriptsubscript𝐺𝑘2superscriptsubscript𝑗0𝐾superscriptsubscriptΨ𝑗subscript𝑡096superscript𝐸2𝑅12𝐿𝐸𝜂superscriptsubscriptsubscript𝑡00𝑅1superscriptsubscript𝑘0𝐾superscriptsubscript𝐺𝑘2superscriptsubscript𝐻𝑘2superscriptsubscriptformulae-sequence𝑗0𝑗𝑘𝐾superscriptsubscriptΩ𝑗subscript𝑡0\displaystyle+\frac{96E^{2}}{R}(1+2LE\eta)\sum_{t_{0}=0}^{R-1}\sum_{k=0}^{K}H_% {k}^{2}G_{k}^{2}\sum_{j\neq 0}^{K}\Psi_{j}^{t_{0}}+\frac{96E^{2}}{R}(1+2LE\eta% )\sum_{t_{0}=0}^{R-1}\sum_{k=0}^{K}G_{k}^{2}H_{k}^{2}\sum_{j\neq 0,j\neq k}^{K% }\Omega_{j}^{t_{0}}+ divide start_ARG 96 italic_E start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_R end_ARG ( 1 + 2 italic_L italic_E italic_η ) ∑ start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_R - 1 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_H start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_G start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_j ≠ 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT roman_Ψ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT + divide start_ARG 96 italic_E start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_R end_ARG ( 1 + 2 italic_L italic_E italic_η ) ∑ start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_R - 1 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_G start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_H start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_j ≠ 0 , italic_j ≠ italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT roman_Ω start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT (85)
3[F(Θ0)𝔼t0[F(ΘT)]]ηRE+108E2Rt0=0R1k=0KHk2Gk2j0KΨjt0+108E2Rt0=0R1k=0KGk2Hk2j0,jkKΩjt0absent3delimited-[]𝐹superscriptΘ0subscript𝔼subscript𝑡0delimited-[]𝐹superscriptΘ𝑇𝜂𝑅𝐸108superscript𝐸2𝑅superscriptsubscriptsubscript𝑡00𝑅1superscriptsubscript𝑘0𝐾superscriptsubscript𝐻𝑘2superscriptsubscript𝐺𝑘2superscriptsubscript𝑗0𝐾superscriptsubscriptΨ𝑗subscript𝑡0108superscript𝐸2𝑅superscriptsubscriptsubscript𝑡00𝑅1superscriptsubscript𝑘0𝐾superscriptsubscript𝐺𝑘2superscriptsubscript𝐻𝑘2superscriptsubscriptformulae-sequence𝑗0𝑗𝑘𝐾superscriptsubscriptΩ𝑗subscript𝑡0\displaystyle\leq\frac{3\left[{F}({\Theta}^{0})-\mathbb{E}_{t_{0}}[{F}({\Theta% }^{T})]\right]}{\eta RE}+\frac{108E^{2}}{R}\sum_{t_{0}=0}^{R-1}\sum_{k=0}^{K}H% _{k}^{2}G_{k}^{2}\sum_{j\neq 0}^{K}\Psi_{j}^{t_{0}}+\frac{108E^{2}}{R}\sum_{t_% {0}=0}^{R-1}\sum_{k=0}^{K}G_{k}^{2}H_{k}^{2}\sum_{j\neq 0,j\neq k}^{K}\Omega_{% j}^{t_{0}}≤ divide start_ARG 3 [ italic_F ( roman_Θ start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT ) - blackboard_E start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ italic_F ( roman_Θ start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) ] ] end_ARG start_ARG italic_η italic_R italic_E end_ARG + divide start_ARG 108 italic_E start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_R end_ARG ∑ start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_R - 1 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_H start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_G start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_j ≠ 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT roman_Ψ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT + divide start_ARG 108 italic_E start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_R end_ARG ∑ start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_R - 1 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_k = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_G start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_H start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_j ≠ 0 , italic_j ≠ italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT roman_Ω start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT (86)

where the last term is from η=ηt0116Emax{L,maxkLk}𝜂superscript𝜂subscript𝑡0116𝐸𝐿subscript𝑘subscript𝐿𝑘\eta=\eta^{t_{0}}\leq\frac{1}{16E\max\left\{L,\max_{k}L_{k}\right\}}italic_η = italic_η start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ≤ divide start_ARG 1 end_ARG start_ARG 16 italic_E roman_max { italic_L , roman_max start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_L start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT } end_ARG. This finishes the proof of Theorem 1.

-E Proof of Corollary 1

To get the corollary 1, we need further bound the ΩktsuperscriptsubscriptΩ𝑘𝑡\Omega_{k}^{t}roman_Ω start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT and ΨktsuperscriptsubscriptΨ𝑘𝑡\Psi_{k}^{t}roman_Ψ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT respectively. We begin by giving the bound of ΩktsuperscriptsubscriptΩ𝑘𝑡\Omega_{k}^{t}roman_Ω start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT.

ΩktsuperscriptsubscriptΩ𝑘𝑡\displaystyle\Omega_{k}^{t}roman_Ω start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT =𝔼hkt(θ^kt)h^kt(θ^kt)2absent𝔼superscriptnormsuperscriptsubscript𝑘𝑡superscriptsubscript^𝜃𝑘𝑡superscriptsubscript^𝑘𝑡superscriptsubscript^𝜃𝑘𝑡2\displaystyle=\mathbb{E}\|h_{k}^{t}(\hat{\theta}_{k}^{t})-\hat{h}_{k}^{t}(\hat% {\theta}_{k}^{t})\|^{2}= blackboard_E ∥ italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ( over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) - over^ start_ARG italic_h end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ( over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT (87)
=𝔼hkt(θ^kt)1hkt(θ^kt)lkt2=𝔼hkt(θ^kt)(1lkt)2absent𝔼superscriptnormdirect-productsuperscriptsubscript𝑘𝑡superscriptsubscript^𝜃𝑘𝑡1direct-productsuperscriptsubscript𝑘𝑡superscriptsubscript^𝜃𝑘𝑡superscriptsubscriptl𝑘𝑡2𝔼superscriptnormdirect-productsuperscriptsubscript𝑘𝑡superscriptsubscript^𝜃𝑘𝑡1superscriptsubscriptl𝑘𝑡2\displaystyle=\mathbb{E}\|h_{k}^{t}(\hat{\theta}_{k}^{t})\odot\textbf{1}-h_{k}% ^{t}(\hat{\theta}_{k}^{t})\odot\textbf{l}_{k}^{t}\|^{2}=\mathbb{E}\|h_{k}^{t}(% \hat{\theta}_{k}^{t})\odot(\textbf{1}-\textbf{l}_{k}^{t})\|^{2}= blackboard_E ∥ italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ( over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) ⊙ 1 - italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ( over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) ⊙ l start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = blackboard_E ∥ italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ( over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) ⊙ ( 1 - l start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT (88)
𝔼hkt(θ^kt),(1lkt)2𝔼hkt(θ^kt)2𝔼(1lkt)2absent𝔼superscriptnormsuperscriptsubscript𝑘𝑡superscriptsubscript^𝜃𝑘𝑡1superscriptsubscriptl𝑘𝑡2𝔼superscriptnormsuperscriptsubscript𝑘𝑡superscriptsubscript^𝜃𝑘𝑡2𝔼superscriptnorm1superscriptsubscriptl𝑘𝑡2\displaystyle\leq\mathbb{E}\left\|\left\langle h_{k}^{t}(\hat{\theta}_{k}^{t})% ,(\textbf{1}-\textbf{l}_{k}^{t})\right\rangle\right\|^{2}\leq\mathbb{E}\left\|% h_{k}^{t}(\hat{\theta}_{k}^{t})\right\|^{2}\mathbb{E}\left\|(\textbf{1}-% \textbf{l}_{k}^{t})\right\|^{2}≤ blackboard_E ∥ ⟨ italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ( over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) , ( 1 - l start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) ⟩ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ≤ blackboard_E ∥ italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ( over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT blackboard_E ∥ ( 1 - l start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT (89)
δ2(βkt)2δ2βktabsentsuperscript𝛿2superscriptsuperscriptsubscript𝛽𝑘𝑡2superscript𝛿2superscriptsubscript𝛽𝑘𝑡\displaystyle\leq\delta^{2}(\beta_{k}^{t})^{2}\leq\delta^{2}\beta_{k}^{t}≤ italic_δ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( italic_β start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ≤ italic_δ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_β start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT (90)

The last term is derived under the Assumption 4 and βktsuperscriptsubscript𝛽𝑘𝑡\beta_{k}^{t}italic_β start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT are valued within the (0,1)01(0,1)( 0 , 1 ). Then we will show the bound of ΨktsuperscriptsubscriptΨ𝑘𝑡\Psi_{k}^{t}roman_Ψ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT as:

ΨktsuperscriptsubscriptΨ𝑘𝑡\displaystyle\Psi_{k}^{t}roman_Ψ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT =𝔼hkt(θkt)hkt(θ^kt)2absent𝔼superscriptnormsuperscriptsubscript𝑘𝑡superscriptsubscript𝜃𝑘𝑡superscriptsubscript𝑘𝑡superscriptsubscript^𝜃𝑘𝑡2\displaystyle=\mathbb{E}\|h_{k}^{t}({\theta}_{k}^{t})-h_{k}^{t}(\hat{\theta}_{% k}^{t})\|^{2}= blackboard_E ∥ italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) - italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ( over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT (91)
Mk2𝔼θktθ^kt2Mk2𝔼θkt1θktmkt2absentsuperscriptsubscript𝑀𝑘2𝔼superscriptnormsuperscriptsubscript𝜃𝑘𝑡superscriptsubscript^𝜃𝑘𝑡2superscriptsubscript𝑀𝑘2𝔼superscriptnormdirect-productsuperscriptsubscript𝜃𝑘𝑡1direct-productsuperscriptsubscript𝜃𝑘𝑡superscriptsubscriptm𝑘𝑡2\displaystyle\leq M_{k}^{2}\mathbb{E}\|{\theta}_{k}^{t}-\hat{\theta}_{k}^{t}\|% ^{2}\leq M_{k}^{2}\mathbb{E}\|{\theta}_{k}^{t}\odot\textbf{1}-{\theta}_{k}^{t}% \odot\textbf{m}_{k}^{t}\|^{2}≤ italic_M start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT blackboard_E ∥ italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT - over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ≤ italic_M start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT blackboard_E ∥ italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ⊙ 1 - italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ⊙ m start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT (92)
Mk2𝔼θkt(1mkt)2Mk2𝔼θkt,(1mkt)2absentsuperscriptsubscript𝑀𝑘2𝔼superscriptnormdirect-productsuperscriptsubscript𝜃𝑘𝑡1superscriptsubscriptm𝑘𝑡2superscriptsubscript𝑀𝑘2𝔼superscriptnormsuperscriptsubscript𝜃𝑘𝑡1superscriptsubscriptm𝑘𝑡2\displaystyle\leq M_{k}^{2}\mathbb{E}\|{\theta}_{k}^{t}\odot(\textbf{1}-% \textbf{m}_{k}^{t})\|^{2}\leq M_{k}^{2}\mathbb{E}\left\|\left\langle{\theta}_{% k}^{t},(\textbf{1}-\textbf{m}_{k}^{t})\right\rangle\right\|^{2}≤ italic_M start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT blackboard_E ∥ italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ⊙ ( 1 - m start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ≤ italic_M start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT blackboard_E ∥ ⟨ italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT , ( 1 - m start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) ⟩ ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT (93)
Mk2𝔼θkt2𝔼(1mkt)2absentsuperscriptsubscript𝑀𝑘2𝔼superscriptnormsuperscriptsubscript𝜃𝑘𝑡2𝔼superscriptnorm1superscriptsubscriptm𝑘𝑡2\displaystyle\leq M_{k}^{2}\mathbb{E}\left\|{\theta}_{k}^{t}\right\|^{2}% \mathbb{E}\left\|(\textbf{1}-\textbf{m}_{k}^{t})\right\|^{2}≤ italic_M start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT blackboard_E ∥ italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT blackboard_E ∥ ( 1 - m start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT (94)
Mk2μ2(αkt)2Mk2μ2αktabsentsuperscriptsubscript𝑀𝑘2superscript𝜇2superscriptsuperscriptsubscript𝛼𝑘𝑡2superscriptsubscript𝑀𝑘2superscript𝜇2superscriptsubscript𝛼𝑘𝑡\displaystyle\leq M_{k}^{2}\mu^{2}(\alpha_{k}^{t})^{2}\leq M_{k}^{2}\mu^{2}% \alpha_{k}^{t}≤ italic_M start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_μ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( italic_α start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ≤ italic_M start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_μ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_α start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT (95)

Where Eq. 92 is due to Assumption 5. And the last term is also derived under the Assumption 4 and αktsuperscriptsubscript𝛼𝑘𝑡\alpha_{k}^{t}italic_α start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT are valued within the (0,1)01(0,1)( 0 , 1 ). This finishes the proof of Corollary 1.