Tree-based Models for Vertical Federated Learning: A Survey
Abstract.
Tree-based models have achieved great success in a wide range of real-world applications due to their effectiveness, robustness, and interpretability, which has inspired their application to vertical federated learning (VFL) scenarios in recent years. In this paper, we conduct a comprehensive study to give an overall picture of applying tree-based models in VFL, from the perspective of their communication and computation protocols. We categorize tree-based models in VFL into two types, i.e., feature-gathering models and label-scattering models, and provide a detailed discussion regarding their characteristics, advantages, privacy protection mechanisms, and applications. This study also focuses on the implementation of tree-based models in VFL, summarizing several design principles for better satisfying various requirements from both academic research and industrial deployment. We conduct a series of experiments to provide empirical observations on the differences and advantages of different types of tree-based models.
1. Introduction
Although machine learning models have made remarkable progress over the past few years (1; 2; 3; 4; 5; 6; 7), driven by large-scale data and powerful computing capabilities (8; 9), two factors hinder the broader application of machine learning models in real-world scenarios: data privacy and data decentralization. On the one hand, the public is becoming increasingly cautious about privacy leakage related to the usage of personal data, while more and more comprehensive regulations (e.g., GDPR (https://gdpr-info.eu/), CCPA (https://oag.ca.gov/privacy/ccpa), FISMA (https://www.cisa.gov/federal-information-security-modernization-act), etc.) are being promulgated to ensure that data sharing among different organizations/departments is kept privacy-preserving. On the other hand, the data required for a real-world application are often inevitably scattered across multiple data owners. Such decentralization makes centralized storage and usage nearly impossible, due to the unaffordable costs of data collection and the issues of multi-party authorization for data access.
As a result, how to cooperate across multiple parties (i.e., data owners) to train machine learning models while protecting data privacy and keeping data decentralized has attracted rapidly growing attention in both academia and industry, and has motivated the proposal of Federated Learning (FL) (10; 11). Concretely, participants involved in FL locally train machine learning models based on their private data, and then exchange the learned knowledge (e.g., updated models, gradients, etc.) with each other to produce models that are more effective and robust than isolated-trained models. According to the form of data partition, FL can be roughly categorized into Horizontal Federated Learning (HFL) and Vertical Federated Learning (VFL) (12). HFL refers to the setting where parties share the same feature space but their sets of data samples are almost non-intersecting, while VFL refers to the setting where parties’ data samples overlap but their feature spaces are different and complementary.

The applications of VFL include financial risk management (13; 14; 15; 16), joint marketing (17; 18; 19; 20; 21), smart city (22; 23; 24), and so on (25; 26; 27; 28; 29). Compared to the HFL scenario where multiple parties train their local models independently, in the VFL scenario, the training process of the entire model is separated into different sub-tasks according to the feature spaces of parties, and thus a party might need to wait for the intermediate results provided by other parties. As illustrated in Figure 1, such characteristics of the VFL scenario lead to a phenomenon that a party’s computation behaviors (e.g., forward propagating and residuals calculating) and communication behaviors (e.g., results sending and receiving) are naturally fragmented and alternately executed in each training round.
Various types of methods and techniques have been proposed for the VFL scenario, including kernel-based models (30; 31), linear models (11; 32), deep neural networks (33; 34; 29), and tree-based models (35; 36; 37; 38; 39; 40; 41; 42; 43; 44; 45; 46; 47; 48; 49). Among these methods, tree-based models are of particular importance for the following reasons. Firstly, tree-based models have been proven to be effective and robust on tabular data that might consist of both numerical and categorical features (50), which makes tree-based models feasible solutions in scenarios where tabular data is widely available, e.g., cross-silo VFL. Secondly, the interpretability of tree-based models satisfies the demand in many industries that need to ensure the model predictions are extremely reliable, including healthcare, education, finance, and so on. Moreover, compared to deep neural networks, tree-based models usually need fewer computation resources and less hardware support to achieve competitive performance. Last but not least, the architectures of tree-based models are compatible with rich types of privacy protection mechanisms, e.g., differential privacy (51; 52; 53), homomorphic encryption (54; 55; 56; 57; 58; 59), and secure multi-party computation (60; 61; 62; 63; 64; 65), which further broadens the usage of tree-based models in the VFL scenario.
Compared to traditional VFL, tree-based models in VFL exhibit significant differences, including the nature of the heterogeneous messages that must be exchanged among participants (including both feature-related and label-related information), the unique behaviors involved in generating and handling these heterogeneous messages, and rich and robust privacy protection algorithms that need to be implemented to safeguard the sensitive information during these exchanges. As a result, tree-based models in VFL necessitate unique designs and classifications that are different from those typically employed in traditional VFL.
In recent years, researchers have conducted surveys on federated learning (12; 66; 11; 67; 68; 69; 70; 71; 72) to draw an overall picture of FL from the perspectives of definitions, challenges, architectures, approaches, applications, and future directions. Recent studies (73; 74) review the development of tree-based models in FL, and highlight the necessity and advantages of tree-based models compared to other models such as neural networks. However, to the best of our knowledge, there is currently no dedicated survey focusing on tree-based models for VFL, which motivates us to provide a comprehensive and up-to-date survey that highlights this promising research direction and aims to further inspire the community.
In this study, as illustrated in Figure 2, we give an overview of tree-based models (TBMs) in vertical federated learning (VFL), focusing on the communication and computation protocols, and the privacy protection mechanisms for the information that is required to be shared during the training and inference procedures. To be more specific, TBMs in VFL can be divided into feature-gathering TBMs and label-scattering TBMs according to their communication and computation protocols. In order to determine the splitting rules at the nodes of decision trees, feature-gathering TBMs propose that the feature-related information (e.g., the ordinal numbers) is sent from the feature owners to the parties who hold labels for calculating the maximum splitting gain, while label-scattering TBMs suggest that the label-related information (e.g., the first-order and second-order gradients) is broadcast from the label owners to other parties. Furthermore, different privacy protection mechanisms are preferred in feature-gathering TBMs and label-scattering TBMs according to the types of shared messages, causing differences in their performance from the perspectives of communication and computation cost, and protection strength. Please refer to Section 3 for more details on the characteristics, differences, and pros and cons of feature-gathering and label-scattering TBMs.
Moreover, this study is also concerned with the implementations of TBMs in the VFL scenario, for both applying existing works and developing new algorithms. Previous studies (68; 78) have pointed out that several open-source FL platforms allow users to apply TBMs in the VFL scenario, such as FATE (79), Fedlearner, FedTree (80), SecretFlow, FederatedScope (81), and so on. Taking a step forward, we provide discussions on the key challenges of developing TBMs, and summarize several design principles for making FL platforms more comprehensive and extendable. We conduct a series of experiments to show the trade-off among model utility, protection effect, and resource cost when applying TBMs in the VFL scenario, which can serve as a reference for users to choose suitable types of TBMs according to their applications.
Contributions. The main contributions can be summarized as follows:
• We propose to categorize the TBMs in the VFL scenario according to the communication and computation protocols, which results in feature-gathering TBMs and label-scattering TBMs. For these two types of TBMs, we provide a detailed description of their training and inference procedures to highlight their differences and advantages.
• Based on different computation and communication protocols, we discuss various privacy protection mechanisms that are designed to protect different types of shared information when applying TBMs, and further take a close look at some advanced algorithms.
• Towards a better implementation, we summarize the open-source FL platforms that allow users to apply TBMs in the VFL scenario, and point out several design principles to inspire the community in both academia and industry.
• Last but not least, we conduct a series of experiments on widely-used datasets to provide an empirical understanding of the characteristics of tree-based models.
Paper Organization. The rest of this paper is organized as follows. In Section 2, we provide some preliminaries, including the concepts of VFL, TBMs, and privacy protection. Then in Section 3, we introduce the details of two different types of TBMs in the VFL scenario, i.e., feature-gathering TBMs and label-scattering TBMs, providing discussions on communication and computation protocols, privacy protection mechanisms, and representative algorithms. After that, we review the open-source FL platforms and summarize several design principles for supporting TBMs in the VFL scenario, as described in Section 4. In Section 5, we conduct experiments to provide empirical observations and insights. Finally, we present real-world applications and highlight future directions in Section 6, and provide conclusions in Section 7.
2. Backgrounds
2.1. Vertical Federated Learning
Vertical federated learning (VFL) involves multiple parties collaboratively training machine learning models on decentralized data, where different parties have different feature spaces but their sample spaces are aligned. These parties can be divided into two categories according to whether they own labels or not. The party who locally keeps the labels is called the task party (a.k.a., task client), while the others are called data parties (a.k.a., data clients). In some cases, there also exist one or more coordinators (a.k.a., servers) who play the role of trusted third parties. Different from the server in the Horizontal Federated Learning (HFL) scenario, the coordinators in the VFL scenario mostly focus on distributing the necessary information (such as public keys for encryption algorithms) and are hardly involved in the training process.
Formally, assume that $K$ parties are involved in a VFL course, where the $k$-th party locally keeps the dataset $\mathcal{D}_k = \{X_k, Y\}$ for the task party and $\mathcal{D}_k = \{X_k\}$ for a data party. Here $X_k$ denotes the feature space of the $k$-th party, $Y$ denotes the label space, and $\mathcal{I}$ denotes the sample space. Note that all the participating parties adopt the same sample space through aligning techniques like private set intersection (82; 83; 84; 85), and their feature spaces are always different and complementary, i.e., $X_i \neq X_j$ and $X_i \cap X_j = \emptyset$ for any $i \neq j$. Without sharing private data directly, these parties aim to train a model $f$ parameterized by $\theta$, with the loss function $\ell$. The loss function in VFL can be given as:

(1)  $\min_{\theta} \mathcal{L}(\theta) = \sum_{i \in \mathcal{I}} \ell\big(f(\theta;\, x_i^1, \ldots, x_i^K),\, y_i\big)$
During the training process, the model parameters (such as the splitting rules in tree-based models) are stored and optimized in multiple parties.
2.2. Tree-based Models
Tree-based models (TBMs) in the VFL scenario are mainly constructed with multiple decision trees (86; 87; 88), which have been successfully used in a large range of real-world applications for solving classification and regression tasks (89; 90; 91; 92; 93).
For a decision tree, an internal node (including the root node) represents a splitting rule consisting of a splitting feature and a splitting value, and the branches attached to this internal node indicate the results according to the splitting rule. For example, for a binary decision tree, given an internal node with splitting feature $a$ and splitting value $v$, the branches represent instances with $a \leq v$ and $a > v$ (for numerical features) or $a = v$ and $a \neq v$ (for categorical features), respectively. Besides the internal nodes, a decision tree includes several leaf nodes to represent the prediction results. The input data are fed into the root of a decision tree, partitioned into different subsets according to the splitting rules of the traversed nodes, and finally reach the leaf nodes to derive corresponding predictions.
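The traversal described above can be sketched as follows. This is a minimal illustration (not the API of any cited system); the nested-dict tree layout and field names are our own choices.

```python
# Minimal sketch of decision-tree inference: internal nodes hold a
# splitting feature and value, leaf nodes hold predictions, and a
# sample is routed from the root down to a leaf.

def predict(node, x):
    """Recursively route a sample dict `x` through the tree."""
    if "prediction" in node:                     # leaf node reached
        return node["prediction"]
    if x[node["feature"]] <= node["value"]:      # numerical splitting rule
        return predict(node["left"], x)
    return predict(node["right"], x)

# A toy tree: split on "age" at 45, then on "income" at 3000.
tree = {
    "feature": "age", "value": 45,
    "left": {"prediction": 0},
    "right": {
        "feature": "income", "value": 3000,
        "left": {"prediction": 0},
        "right": {"prediction": 1},
    },
}

print(predict(tree, {"age": 30, "income": 5000}))  # -> 0
print(predict(tree, {"age": 50, "income": 5000}))  # -> 1
```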
Several tree-based models used in the VFL scenarios are introduced in the rest of this section, including Random Forest, Gradient-Boosted Decision Trees (GBDT), and eXtreme Gradient Boosting (XGBoost).
2.2.1. Random forest
Random forest (94) is a bagging-based ensemble supervised machine learning technique, which usually uses Classification and Regression Trees (CARTs) as weak learners. The randomization of a random forest can be presented in two different ways: random selection of samples and random selection of features. After building several CARTs, the predicted results are derived by either majority voting (for classification tasks) or averaging operation (for regression tasks).
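The two aggregation rules mentioned above can be sketched as follows; the function name and interface are ours, purely for illustration.

```python
from collections import Counter

# Sketch of random-forest prediction aggregation: majority voting over
# per-tree class labels for classification, averaging for regression.

def forest_predict(tree_outputs, task="classification"):
    if task == "classification":
        return Counter(tree_outputs).most_common(1)[0][0]  # majority vote
    return sum(tree_outputs) / len(tree_outputs)           # averaging

print(forest_predict([1, 0, 1, 1]))                    # -> 1
print(forest_predict([2.0, 4.0], task="regression"))   # -> 3.0
```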
2.2.2. GBDT
Given a dataset with $n$ samples and $m$ features, GBDT (95) predicts the output by using $T$ regression trees, which can be formulated as:

(2)  $\hat{y}_i = \sum_{t=1}^{T} f_t(x_i), \quad f_t \in \mathcal{F}$

where $\mathcal{F} = \{f(x) = w_{q(x)}\}$ is a collection of CARTs. The function $q$ maps each input feature to the corresponding leaf index, and $w \in \mathbb{R}^{L}$ is the weight vector, where $L$ is the number of leaves in the tree. GBDT tries to minimize the following regularized loss function (3):
(3)  $\mathcal{L} = \sum_{i} \ell(\hat{y}_i, y_i) + \sum_{t} \Omega(f_t)$

where $\Omega(f_t) = \frac{1}{2}\lambda \lVert w \rVert^2$ is a regularization term, and $\lambda$ is a hyperparameter to control the strength of regularization.
In order to optimize the above loss function, GBDT minimizes the following function at the $t$-th iteration (96):

(4)  $\mathcal{L}^{(t)} = \sum_{i=1}^{n} \ell\big(y_i,\, \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t) \simeq \sum_{i=1}^{n} \Big[ g_i f_t(x_i) + \frac{1}{2} f_t^2(x_i) \Big] + \Omega(f_t)$

where $\mathcal{L}^{(t)}$ denotes the loss at the $t$-th iteration of the training process, $\ell\big(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\big)$ is the loss of the $i$-th sample between the prediction of the $t$-th iteration and the target value, and $g_i = \partial_{\hat{y}^{(t-1)}} \ell\big(y_i, \hat{y}^{(t-1)}\big)$ is the first-order gradient statistic on the loss function.
Finally, the optimal weight $w_j^*$ of the $j$-th leaf node can be given as:

(5)  $w_j^* = -\dfrac{\sum_{i \in I_j} g_i}{|I_j| + \lambda}$

where $I_j$ denotes the sample indices that belong to the $j$-th leaf node. To find the best splitting threshold for an internal node, GBDT greedily maximizes the following gain score:

(6)  $\mathcal{L}_{split} = \dfrac{1}{2}\left[ \dfrac{\big(\sum_{i \in I_L} g_i\big)^2}{|I_L| + \lambda} + \dfrac{\big(\sum_{i \in I_R} g_i\big)^2}{|I_R| + \lambda} - \dfrac{\big(\sum_{i \in I} g_i\big)^2}{|I| + \lambda} \right]$

where $I_L$ and $I_R$ are the instance sets of the left and right nodes after the split, and $I = I_L \cup I_R$.
2.2.3. XGBoost
XGBoost (3) is an efficient implementation of GBDT. One of the most important improvements made by XGBoost is that a second-order Taylor expansion is used to approximate the loss function, as defined by:

(7)  $\mathcal{L}^{(t)} \simeq \sum_{i=1}^{n}\Big[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \Big] + \gamma L + \frac{1}{2}\lambda \lVert w \rVert^2$

where $h_i = \partial^2_{\hat{y}^{(t-1)}} \ell\big(y_i, \hat{y}^{(t-1)}\big)$ is the second-order gradient statistic on the loss function and $\Omega(f_t) = \gamma L + \frac{1}{2}\lambda \lVert w \rVert^2$. Here, $\gamma$ and $\lambda$ represent the regularizers to adjust the number and weights of leaves, respectively.
As a result, the optimal weight $w_j^*$ of the $j$-th leaf node can be calculated as:

(8)  $w_j^* = -\dfrac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$

where $I_j$ denotes the sample indices that belong to the $j$-th leaf node. In order to find the best splitting threshold for the internal nodes, XGBoost greedily maximizes the following gain score:

(9)  $\mathcal{L}_{split} = \dfrac{1}{2}\left[ \dfrac{\big(\sum_{i \in I_L} g_i\big)^2}{\sum_{i \in I_L} h_i + \lambda} + \dfrac{\big(\sum_{i \in I_R} g_i\big)^2}{\sum_{i \in I_R} h_i + \lambda} - \dfrac{\big(\sum_{i \in I} g_i\big)^2}{\sum_{i \in I} h_i + \lambda} \right] - \gamma$

where $I_L$ and $I_R$ are the instance sets of the left and right nodes after the splitting operation, with $I = I_L \cup I_R$.
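Equations (8) and (9) can be computed directly from the per-sample gradient statistics. The following sketch (our own helper names) evaluates the leaf weight and split gain given lists of first-order gradients `g` and second-order gradients `h`:

```python
# Illustrative computation of the XGBoost leaf weight (Eq. 8) and split
# gain (Eq. 9) from per-sample gradient statistics. `lam` and `gamma`
# are the leaf-weight and leaf-count regularizers.

def leaf_weight(g, h, lam):
    return -sum(g) / (sum(h) + lam)

def split_gain(g_left, h_left, g_right, h_right, lam, gamma):
    def score(g, h):
        return sum(g) ** 2 / (sum(h) + lam)
    g_all, h_all = g_left + g_right, h_left + h_right
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                  - score(g_all, h_all)) - gamma

# Example: a split that separates positive from negative gradients.
print(leaf_weight([1.0, 1.0], [1.0, 1.0], lam=1.0))                       # -> -2/3
print(split_gain([1.0, 1.0], [1.0, 1.0],
                 [-1.0, -1.0], [1.0, 1.0], lam=1.0, gamma=0.0))           # -> 4/3
```

In a federated setting, the sums over $I_L$ and $I_R$ are exactly the quantities that parties exchange in aggregated (and possibly protected) form.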
2.3. Privacy Protection
In the VFL scenario, the task party and data parties need to exchange some intermediate results during the process of building decision trees, e.g., the feature-related information and the label-related information, which might cause privacy leakage without protection (38; 36). Therefore, privacy protection techniques are necessary to consider in VFL algorithms and applications, including Differential Privacy (DP), Homomorphic Encryption (HE), Secure Multi-Party Computation (SMPC), and Trusted Execution Environments (TEE).
2.3.1. Differential privacy
Differential privacy (51; 52; 53) is a technique that enables researchers to obtain useful information from databases containing people’s personal information without divulging the identities of individuals. Local DP (LDP) (97; 98; 99; 100) is one type of DP, in which clients add the noise to their data themselves. Distance-based LDP (101; 102; 103) is a special kind of LDP, which measures the level of privacy assurance between any pair of sensitive data based on their distance from each other.
2.3.2. Homomorphic encryption
Homomorphic encryption (54; 55; 56; 57; 58; 59) is a special encryption method that allows computation (e.g., addition and multiplication) to be performed directly on ciphertexts, yielding results that remain encrypted. However, it usually requires more computation resources and storage costs. A partially homomorphic encryption (PHE) scheme is a probabilistic asymmetric encryption scheme that supports a restricted set of computations over the ciphertexts.
2.3.3. Secure multi-party computation
Secure Multi-Party Computation (60; 61; 62; 63; 64; 65) is proposed to solve the problem of how to safely compute a conventional function in the absence of a trusted third party. Among the different protocols in SMPC, one of the widely-used techniques is Secret Sharing (104). Secret sharing partitions a secret value into several shares and sends one share to each party, such that only a certain number of parties can jointly complete the reconstruction process to exactly recover the secret value. Depending on the scheme, different secret-sharing techniques might allow secret addition, secret multiplication, secret division, and other mathematical operations.
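The additive flavor of secret sharing can be sketched as follows (a toy illustration with our own function names, not a production protocol): a secret is split into random shares that sum to it modulo a prime, and addition can be carried out share-wise without any party seeing the other secrets.

```python
import random

# Minimal additive secret sharing over integers modulo a prime: all n
# shares are needed to reconstruct, and sums can be computed on shares.

P = 2**61 - 1  # a Mersenne prime as the modulus

def share(secret, n):
    shares = [random.randrange(P) for _ in range(n - 1)]
    shares.append((secret - sum(shares)) % P)  # shares sum to the secret
    return shares

def reconstruct(shares):
    return sum(shares) % P

a_shares, b_shares = share(12, 3), share(30, 3)
# Each party adds its two shares locally; reconstruction yields the sum.
sum_shares = [(x + y) % P for x, y in zip(a_shares, b_shares)]
print(reconstruct(sum_shares))  # -> 42
```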
2.3.4. Trusted execution environments
Trusted Execution Environments (105; 106) is a separate processing environment with computing and storage capabilities that provide security and integrity protection. The basic idea is that a separate isolated memory is allocated in hardware for sensitive data, all calculations of sensitive data are performed in this memory, and no other part of the hardware can access the information in this isolated memory except for authorized interfaces. In this way, private computation of sensitive data can be achieved.
3. Tree-based Models for VFL
Different from the backward propagation used to train deep neural networks, the basic optimization step of tree-based models is to find a splitting rule that can achieve the best gain at each internal node, which can be formulated as a function $r = \mathrm{SplitFind}(X, Y)$, where $X$ denotes the features, $Y$ denotes the labels, and $r$ denotes the learned splitting rule consisting of a splitting feature and a splitting value.
In centralized training, the computation of $\mathrm{SplitFind}$ can be completed without communication since both features and labels are owned by a single party locally. However, in the VFL scenario, features are distributed among multiple parties, and labels are only kept by one task party. Neither features nor labels can be shared directly due to privacy concerns.
To complete the training process in the VFL scenario, the computation of $\mathrm{SplitFind}$ is transformed into several sub-tasks. These sub-tasks are assigned to different parties accordingly. The results of these sub-tasks are aggregated, after being protected to avoid privacy leakage if necessary, to learn the optimal splitting rule at each internal node. Previous studies on TBMs in the VFL scenario focus on how to define the sub-tasks and how to protect the exchanged information, and can be divided into two categories according to their computation and communication protocols, i.e., feature-gathering TBMs and label-scattering TBMs.
Figure 3 illustrates the characteristics of feature-gathering and label-scattering TBMs. We will provide a more detailed introduction and comparison below.


3.1. Feature-gathering TBMs
3.1.1. Splitting rule finding
The main idea of feature-gathering TBMs is to modify the splitting rule finding function as:

(10)  $r = \mathrm{SplitFind}\big(Z_1, \ldots, Z_K, Y\big)$

where $Z_k = g_k(X_k)$. In other words, the calculation of the splitting rule contains the following two steps: (1) Each data party completes the sub-task functioned as $g_k$, taking its locally stored features $X_k$ as input and outputting some (protected) intermediate results $Z_k$. Then these intermediate results are sent to the task party; (2) After receiving all the intermediate results from the data parties, the task party calculates the splitting rule $r$ based on $\{Z_k\}$ and the labels $Y$.
It is worth pointing out that the splitting rule finding function defined in Eq.(10) brings some issues for building feature-gathering TBMs. Since the task party only receives the intermediate results rather than the features from data parties, it could not identify the found splitting rules (i.e., the splitting features and splitting values) without communicating with the owner of the splitting features.
To solve this, what the task party actually obtains from $\mathrm{SplitFind}$ is a “pointer” to the splitting feature and splitting value, and the “pointer” should be sent back to the corresponding data party, which is able to look up the real splitting feature and splitting value based on the “pointer”. For example, the splitting rule found by the task party can be “the $j$-th feature of party $k$, the $v$-th-ranked value”, and it can only be recovered by party $k$ to “age” (i.e., the splitting feature) and “45” (i.e., the splitting value). The splitting features and splitting values are stored in the data parties, and the task party only knows the “pointer” and should query for these splitting results when needed, such as during the inference process (more details can be found in Section 3.3).
One widely adopted instantiation of the function $g_k$ is computing the ordinal numbers of samples according to one’s features, a.k.a., the data sample indices ranked by the feature values. The data parties sort their data according to each of the features, respectively, and send the resulting ordinal numbers to the task party. Based on these ordinal numbers, the task party can calculate the gains achieved by different splitting rules and find the best one.
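The ordinal-number instantiation and the "pointer" lookup can be sketched as follows; the variable names and the tiny example data are ours, purely for illustration.

```python
# Sketch of a data party's side in feature-gathering TBMs: rank sample
# indices by a private feature value, share only the ranking, and keep
# a local mapping so that a returned "pointer" (rank) can be resolved
# back to the real splitting value.

feature_values = {"s0": 52, "s1": 23, "s2": 45, "s3": 31}  # e.g. "age", private

# Sample indices sorted by feature value: this ordering (not the raw
# values) is what would be sent to the task party.
ordinal = sorted(feature_values, key=feature_values.get)
print(ordinal)  # -> ['s1', 's3', 's2', 's0']

# If the task party later returns the pointer ("age", rank 2), only this
# data party can resolve it to the actual splitting value.
pointer_rank = 2
print(feature_values[ordinal[pointer_rank]])  # -> 45
```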
3.1.2. Tree building
The overall process of building feature-gathering TBMs is demonstrated in Algorithm 1. Specifically, the task party builds decision trees based on the received intermediate results $\{Z_k\}$ and the labels $Y$.
For each decision tree, the task party traverses from the root node to the leaf nodes, finding the splitting rule for each internal node that can achieve the best gain on the data samples. All the data samples are fed into the root of the tree at the beginning (line 6), and partitioned into two subsets (or $k$ subsets for a $k$-way tree) according to the splitting rule (line 15). These two subsets go to the children of the root, respectively (line 16), and such a partition process is repeated at every traversed node until the leaf nodes are reached. For a leaf node, the task party sets the output of the leaf node according to the reached data (line 11), e.g., calculating a weight according to Equation (8) in XGBoost or performing majority voting in random forest. The naive approach for finding the best splitting rule, i.e., in line 13, can be exhaustive enumeration: for each possible splitting position, the task party calculates the gain (e.g., the score defined in Equation (9) in XGBoost or the Gini coefficient in a random forest) accordingly. The task party finally chooses the splitting rule that can achieve the best gain at every internal node. Several advanced algorithms for balancing the efficiency and effectiveness of finding the optimal splitting rule have been proposed recently (107; 108).
In a nutshell, we can summarize the communication and computation protocol of feature-gathering TBMs as follows: (1) Feature-related information is sent from data parties to the task party in a privacy-preserving manner; (2) Most of the computation happens at the task party, including calculating the gain of different splitting rules to find the best one (i.e., line 13 in Algorithm 1), and further partitioning data samples into subsets (i.e., line 15 in Algorithm 1). This implies that the task party can become the bottleneck of computation and communication in feature-gathering TBMs, since both feature-related information and labels are pooled at the task party for building decision trees.

3.1.3. Privacy protection mechanisms.
The privacy threats, aimed at the shared feature-related information (e.g., the ordinal numbers of data samples) in feature-gathering TBMs, come from the semi-honest task party and insecure communication channels. In order to avoid leaking private information, it is necessary for the data parties to apply protection mechanisms to the feature-related information before sharing it. Such protection mechanisms should be carefully designed to balance the strength of protection and the informativeness of the shared information. Here we briefly introduce several representative algorithms.
FederBoost (38) proposes to provide LDP privacy protection for the ordinal numbers by adding noise to them before sharing, i.e., improving the generation of the shared ordinal numbers in Algorithm 1. To be more specific, as shown in Figure 4, each data party first sorts the training samples according to its feature values to obtain the ordinal numbers, and then partitions the ordinal numbers into several buckets sequentially. To achieve $\epsilon$-DP, a mapping is applied such that each ordinal number might move to another bucket with a certain probability (controlled by a hyperparameter to achieve a good utility-privacy trade-off), or just stay in the correct bucket. The ordinal numbers inside the buckets are randomly shuffled before sharing, so the relative order between buckets is mostly preserved while the orders within each bucket are protected. In this way, the data parties generate the protected ordinal numbers that can be sent to the task party for training feature-gathering TBMs.
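The bucketing-and-perturbation idea can be sketched as follows. This is a hedged illustration in the spirit of FederBoost, not its exact mechanism: the function name, the randomized-response form of the keep probability, and the parameter `eps` are our own assumptions.

```python
import math
import random

# Sketch of LDP-style bucketing of ordinal numbers: partition the
# sorted sample ids into buckets, move each id to a random bucket with
# a probability governed by `eps` (randomized-response style), then
# shuffle within buckets to hide the intra-bucket order.

def ldp_buckets(sorted_ids, num_buckets, eps):
    size = math.ceil(len(sorted_ids) / num_buckets)
    buckets = [[] for _ in range(num_buckets)]
    p_keep = math.exp(eps) / (math.exp(eps) + num_buckets - 1)
    for pos, sid in enumerate(sorted_ids):
        correct = pos // size
        b = correct if random.random() < p_keep else random.randrange(num_buckets)
        buckets[b].append(sid)
    for b in buckets:
        random.shuffle(b)  # orders within each bucket are protected
    return buckets

buckets = ldp_buckets(list(range(10)), num_buckets=2, eps=5.0)
```

With a large `eps` most ids stay in their correct bucket, so the between-bucket order (which drives the gain computation) is mostly preserved.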
OpBoost (39) designs a probabilistic order-preserving desensitization algorithm for privacy-preserving vertical federated tree boosting, which satisfies distance-based LDP. The main idea of OpBoost is to pay more attention to enhancing the indistinguishability of private feature values from their nearby neighbors. As shown in Figure 5, the feature values are first desensitized into a unified discrete value domain with predefined lower and upper bounds. After that, a mapping satisfying distance-based LDP is used to transform each feature value in the domain based on a distance-based scoring function, i.e., values are mapped to nearby values with high probability. Finally, the data parties sort the training samples according to these desensitized feature values, generating the protected ordinal numbers. Such protected ordinal numbers are sent to the task party for training feature-gathering TBMs, which achieves a good balance between preventing privacy leakage and preserving useful information.
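A distance-based mapping of this kind can be sketched with an exponential-mechanism-style score. This is only an illustration of the general idea, with our own function name and scoring form; the actual OpBoost mechanism differs in its details.

```python
import math
import random

# Hedged sketch of a distance-based LDP perturbation: a discretized
# feature value is mapped to another value in the domain with
# probability decaying exponentially in the distance, so nearby values
# are most likely (order is mostly preserved).

def perturb(v, domain, eps):
    weights = [math.exp(-eps * abs(v - u) / 2) for u in domain]
    total = sum(weights)
    r, acc = random.random() * total, 0.0
    for u, w in zip(domain, weights):
        acc += w
        if r <= acc:
            return u
    return domain[-1]

# A value of 5 in the domain [0, 10] usually lands near 5.
samples = [perturb(5, list(range(11)), eps=1.0) for _ in range(10)]
```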

3.1.4. Summary.
In a nutshell, feature-gathering TBMs propose that the data parties calculate the ordinal numbers of samples based on their private feature values in a privacy-preserving manner, and send these desensitized results to the task party for learning the optimal splitting rules. As a result, the task party gathers the feature-related information from all data parties and assumes the responsibility for finding the optimal splitting rules at every node of the decision trees, while the data parties only need to sort the samples and perturb the generated ordinal numbers. The major communication overhead of feature-gathering TBMs is the ordinal numbers of samples, which are sent from the data parties to the task party. The characteristics of feature-gathering TBMs are summarized in Table 1.
Note that there is a trade-off between model utility and privacy protection strength when training feature-gathering TBMs, as pointed out by previous studies (38; 39). Though differential privacy algorithms can enhance privacy protection strength, they might also cause a slight performance drop in the learned model. Such a trade-off implies that the protection strength should be carefully determined in real-world applications, and also inspires the research community to design advanced algorithms to achieve better model utility with a certain allocated privacy budget.
Table 1. Characteristics of feature-gathering and label-scattering TBMs.

| | Feature-gathering TBMs | Label-scattering TBMs |
| --- | --- | --- |
| Shared information | Feature-related information (from data parties to task party) | Label-related information (from task party to data parties) |
| Computation contributors | Task party | Data parties and task party |
| Privacy protection | Differential privacy | Homomorphic encryption and secret sharing |
| Related studies | (38; 39) | (36; 77; 40; 75; 37; 76; 35; 41) |
3.2. Label-scattering TBMs
3.2.1. Splitting rule finding.
Different from feature-gathering TBMs introduced above, label-scattering TBMs propose another protocol for building decision trees in the VFL scenario, i.e., the task party broadcasts the label-related information to data parties.
Formally, the main idea of label-scattering TBMs is to modify the splitting rule finding function as:

(11)  $r = \mathrm{SplitFind}\big(g_1(X_1, P), \ldots, g_K(X_K, P)\big)$

where $Z_k = g_k(X_k, P)$, and $P = h(Y)$ denotes the label-related information calculated by the task party via the function $h$. Accordingly, the calculation of the splitting rule consists of the following three steps: (1) The task party applies the function $h$ to the labels $Y$, and broadcasts the produced label-related information $P$ to all data parties; (2) After receiving $P$, each data party completes the sub-task functioned as $g_k$ based on $P$ and its private feature values, and sends the intermediate results $Z_k$ back to the task party; (3) The task party calculates the splitting rule $r$ based on the intermediate results received from the data parties.
Similar to that in feature-gathering TBMs, the obtained $r$ is just a “pointer” to the splitting feature and splitting value, which should be sent to the corresponding data party for querying. One instantiation of $P$ can be the first-order and second-order gradients, and $g_k$ can be sorting the received $P$ based on the ordinal numbers of samples ranked by the data party’s feature values.
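The gradient instantiation of a data party's sub-task $g_k$ can be sketched as follows (our own helper name and toy values): the received per-sample gradients are sorted by the party's local feature order and aggregated into buckets, so that only bucket-level sums leave the party.

```python
# Sketch of a data party's sub-task in label-scattering TBMs: sort the
# received gradients by the local feature order and return only
# bucket-level sums, which the task party uses to evaluate split gains.

def bucket_gradient_sums(gradients, feature_values, num_buckets):
    order = sorted(range(len(feature_values)), key=lambda i: feature_values[i])
    size = -(-len(order) // num_buckets)  # ceiling division
    sums = []
    for b in range(num_buckets):
        idx = order[b * size:(b + 1) * size]
        sums.append(sum(gradients[i] for i in idx))
    return sums

g = [0.5, -1.0, 2.0, 0.25]   # per-sample first-order gradients (received)
x = [30, 52, 23, 45]         # this party's private feature values
print(bucket_gradient_sums(g, x, num_buckets=2))  # -> [2.5, -0.75]
```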
3.2.2. Tree building.
We describe the overall process of training label-scattering TBMs in Algorithm 2. The goal is to build decision trees collaboratively.
For each decision tree, the task party needs to generate the label-related information $P$ and then broadcast it to all data parties (lines 2-3). In most cases, $P$ can be different when building different decision trees, e.g., when $P$ is the set of first-order and second-order gradients, it relies on both the labels and the decision trees built previously. Each data party who has received $P$ calculates the intermediate results and sends them back to the task party (lines 5-6). After receiving the intermediate results from all data parties, the task party is able to traverse from the root node to the leaf nodes to find the splitting rules that achieve the best gain on the reached data samples (lines 9-21). Such a traversing process is almost the same as the process in feature-gathering TBMs, except for how the data samples are partitioned into subsets (line 18). In label-scattering TBMs, the task party cannot partition the data samples into subsets based on the found splitting rule $r$, since the task party does not have the feature-related information. Therefore, the task party has to send $r$ to the corresponding data party and wait for the feedback, including some (protected) indicator vectors to denote the data subsets attached to the children of the traversed node.
From the algorithm, we can conclude that in label-scattering TBMs: (1) the task party broadcasts the label-related information to the data parties, which may happen before building every decision tree if such information is updated; (2) both the task party and the data parties are involved in the tree-building process, i.e., the traversal from the root to the leaf nodes. Specifically, while keeping in communication with each other, the task party finds the splitting rules and the data parties partition the data samples into subsets accordingly. Compared to feature-gathering TBMs, the communication between the task party and the data parties in label-scattering TBMs is more frequent.
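The partition-feedback step at line 18 can be sketched as follows, assuming the splitting "pointer" carries a feature identifier and a threshold; all names here are hypothetical.

```python
def data_party_partition(features, feature_id, threshold, reached):
    """The data party that owns the splitting feature receives the
    "pointer" and returns indicator vectors for the two children.
    features: dict mapping feature id -> private column of values;
    reached: indicator of samples arriving at the current node
    (1 = reached, 0 = not)."""
    col = features[feature_id]
    left = [int(r and v <= threshold) for r, v in zip(reached, col)]
    right = [int(r and v > threshold) for r, v in zip(reached, col)]
    return left, right
```

In a deployed system these indicator vectors would be protected (e.g., encrypted or secret-shared) before being returned to the task party, as discussed next.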
3.2.3. Privacy protection mechanisms.
In label-scattering TBMs, some of the exchanged information is vulnerable and requires further protection, including the label-related information, the intermediate results, and the indicator vectors. To provide such protection, researchers have proposed several advanced privacy protection mechanisms based on homomorphic encryption and secure multi-party computation.
SecureBoost (36) focuses on XGBoost and applies homomorphic encryption to the label-related information (i.e., the first-order and second-order gradients) to enhance privacy protection. An illustration of SecureBoost is shown in Figure 6. Specifically, the task party broadcasts the encrypted gradients to all data parties. Each data party computes its intermediate results independently: it sorts the encrypted gradients according to the ordinal numbers ranked by its feature values and partitions them into several buckets. The encrypted values inside each bucket are summed up and sent back to the task party. The task party can then decrypt the sum of values in each bucket (due to the additive homomorphism of the encryption scheme) but cannot obtain the individual gradient values.
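The bucket-summation trick can be illustrated with a toy, textbook Paillier scheme (tiny fixed primes, NOT secure — real deployments use keys of 2048 bits or more): multiplying ciphertexts yields the encryption of the plaintext sum, so a data party can aggregate gradients per bucket without learning any of them.

```python
import math, random

p, q = 293, 433                      # toy primes; illustrative only
n = p * q
n2 = n * n
g_pub = n + 1                        # standard Paillier generator choice
lam = math.lcm(p - 1, q - 1)
mu = pow((pow(g_pub, lam, n2) - 1) // n, -1, n)

def encrypt(m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g_pub, m % n, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return ((pow(c, lam, n2) - 1) // n * mu) % n

def he_add(c1, c2):
    # ciphertext multiplication corresponds to plaintext addition
    return (c1 * c2) % n2

# The task party encrypts per-sample gradients and broadcasts them.
gradients = [3, 1, 4, 1, 5, 9]
enc = [encrypt(v) for v in gradients]

# A data party sorts by its feature, then sums ciphertexts inside each
# bucket (here: two buckets of three samples) without seeing the values.
bucket1 = he_add(he_add(enc[0], enc[1]), enc[2])
bucket2 = he_add(he_add(enc[3], enc[4]), enc[5])
```

Decrypting `bucket1` and `bucket2` at the task party recovers only the per-bucket gradient sums, which is exactly the information needed for split finding.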

Further, with the aim of protecting the label-related information, the intermediate results, and the indicator vectors simultaneously, recent studies (77; 37) propose to combine partially homomorphic encryption and additive secret sharing techniques. To be more specific, on the task party's side, each node is attached to an indicator vector over the training samples, where each element is a binary value representing whether a sample reaches the node (value 1) or not (value 0). The indicator vector at the root node is initialized as all 1s.
Then the task party splits each indicator vector into several frames and sends one frame to each data party, together with the encrypted label-related information. After receiving these, each data party sorts the encrypted information according to the ordinal numbers ranked by its feature values, splits the sorted result into several frames, and sends one frame to each data/task party. Thus each party owns one frame of the encrypted information under different orders, and the parties can jointly compute the maximal splitting gain for learning the optimal splitting rules via the secret sharing division technique. The data party that holds the splitting feature can update the indicators to represent which subtree each training sample would traverse according to the learned splitting rules. The updated indicators are again split into several frames and sent to different parties, which can be used for further computation via secret sharing multiplication. Figure 7 shows the splitting process between two parties, where each party holds only one frame of the indicators and the label-related information; the two resulting indicator vectors record which samples belong to the left and right subtree, respectively. By secret sharing multiplication, the children nodes also hold only frames of the protected information.
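The additive secret sharing of an indicator vector can be sketched as follows; this shows only the splitting into frames and the reconstruction, not the secret sharing division/multiplication protocols, and the field modulus is an illustrative choice.

```python
import random

P = 2**31 - 1  # a prime modulus for the sharing field; illustrative

def share(vec, n_parties=2):
    """Split a vector into additive frames: any subset of fewer than
    n_parties frames is uniformly random and reveals nothing."""
    frames = [[random.randrange(P) for _ in vec] for _ in range(n_parties - 1)]
    last = [(v - sum(col)) % P for v, col in zip(vec, zip(*frames))]
    return frames + [last]

def reconstruct(frames):
    """Element-wise sum of all frames modulo P recovers the vector."""
    return [sum(col) % P for col in zip(*frames)]
```

Each party stores one frame per node; only when all frames are combined (inside a joint protocol) does the true indicator vector emerge.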
Although applying partially homomorphic encryption and additive secret sharing techniques can strengthen the privacy protection of label-scattering TBMs, the additional computation and communication costs they bring are non-negligible in real-world VFL applications.

3.2.4. Summary.
Generally speaking, in label-scattering TBMs, the task party broadcasts the (protected) label-related information to all data parties, and the data parties calculate intermediate results and send them back to the task party for learning the splitting rules. Finally, each splitting rule is sent to the data party that owns the corresponding splitting feature, which generates the partition of the data samples. As a result, the task party scatters the label-related information to the data parties, and both the task party and the data parties perform their own computations to build label-scattering TBMs.
The major communication overhead of label-scattering TBMs comes from the label-related information, which may be broadcast by the task party before building every decision tree if such information is updated after each tree is finished. The characteristics of label-scattering TBMs are summarized in Table 1, which also compares feature-gathering and label-scattering TBMs to highlight their differences.
3.3. Inference Procedure
In the previous sections, we introduced the training processes of the two types of TBMs in the VFL scenario. In this section, we describe their inference procedure, which also requires collaboration among multiple parties since the learned models are decentralized.
There exist two inference frameworks for TBMs in the VFL scenario, distinguished by whether the inference procedure is led by the task party or accomplished jointly by multiple parties. Specifically, to produce a prediction, a test sample traverses from the root node to one leaf node of each built decision tree, following the path determined by the learned splitting rules at each internal node. One inference framework (36; 38) proposes that the inference procedure be led by the task party, since it holds the learned splitting rules (i.e., the “pointers”) at every internal node. When a test sample reaches an internal node, the task party sends a request to the corresponding data party (the owner of the splitting feature) to query which subtree the sample should go to. Note that the communication between the task party and the data party might expose the splitting features and values during such querying operations, which motivates researchers (37) to provide protection via secret sharing techniques.
Figure 8 illustrates an example of the inference procedure led by the task party, where solid circles represent nodes whose splitting rules or values are held by the party, while dashed circles represent nodes whose rules are not. We assume the test case reaches a certain leaf node, and the inference procedure can be summarized as follows. The inference begins at the root node. Since the task party does not hold the splitting rule at the root node, it requests the splitting rule associated with the root node from the data party. When receiving the request, the data party responds with an indicator “right” (which may be encrypted) to the task party. The inference procedure then moves on to the right child. Based on the splitting rule held by the task party itself, the test case finally arrives at the corresponding leaf node.
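The task-party-led traversal can be sketched in plaintext as follows (the query response is unencrypted here for clarity, and all names are illustrative).

```python
def infer(tree, sample, local_rules, query_data_party):
    """Task-party-led inference over one decision tree.
    tree: dict mapping internal nodes to (left_child, right_child)
    tuples and leaf nodes to prediction values;
    local_rules: splitting rules the task party holds itself, as
    node -> (feature_name, threshold);
    query_data_party: callback asking the owner of a remote splitting
    feature which branch to take (returns True for "right")."""
    node = "root"
    while isinstance(tree[node], tuple):          # still an internal node
        if node in local_rules:
            feat, thr = local_rules[node]
            go_right = sample[feat] > thr         # evaluate locally
        else:
            go_right = query_data_party(node, sample)  # remote query
        node = tree[node][1 if go_right else 0]
    return tree[node]
```

The task party only ever learns a branch direction from each query, not the remote splitting feature or threshold; in protected variants even the direction is masked.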


Another inference framework (77; 40; 109) is accomplished jointly by multiple parties. Taking binary trees as an example, each party starts at the root node and holds a binary indicator valued as 1 to denote that the test sample reaches the node. Then, if the party happens to hold the splitting feature and splitting value at the node, it knows which subtree the test sample should go to, and produces indicator 1 for the chosen subtree and 0 for the other. If the party has no information about the splitting rule, it produces 1 for both subtrees (the test sample might go to either one) or 0 for both (the test sample has not chosen this path), following the indicator value at the current node (i.e., their parent node). Thus, each party can produce an indicator vector whose length equals the number of leaf nodes, and these vectors can be gathered via an element-wise AND operation to determine which leaf node the test sample finally reaches. To further avoid privacy leakage brought by sharing the indicator vectors, researchers (77; 40) propose to apply homomorphic encryption algorithms to mask the indicator vectors.
Figure 9 illustrates an example of the inference procedure accomplished jointly by multiple parties, where solid circles represent nodes whose splitting rules or values are held by the party, while dashed circles indicate nodes whose rules are not. We assume the test case reaches a certain leaf node, and the inference procedure can be summarized as follows. As the initialization, both parties start with indicators valued as 1 at the root. The data party, which holds the root's splitting rule, updates its indicators to 0 for the left subtree and 1 for the right subtree, indicating that the test case takes the right branch. Meanwhile, the task party keeps indicator 1 for both subtrees, since it lacks knowledge of the splitting rule at this point. As the inference moves to the next depth level, the data party propagates 0 to both children of the left internal node (although it holds the splitting rule of that node, the test case does not choose this branch) and propagates 1 to both children of the right internal node (as it does not hold its splitting rule). The task party, conversely, propagates 1 to both children of the left internal node (as it does not hold its splitting rule) and produces 0/1 indicators for the children of the right internal node based on the splitting rule it holds. Finally, the indicator vectors of the two parties are aggregated using an element-wise logical AND operation, which produces a one-hot vector indicating the leaf node the test case finally arrives at.
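The indicator-expansion logic can be sketched for a depth-2 binary tree as follows; the concrete 0/1 values are illustrative and chosen to be self-consistent, not taken from Figure 9.

```python
def expand(indicators, decisions):
    """One depth level of the multi-party inference protocol: each node
    indicator becomes two child indicators. decisions[i] is None when
    the party does not hold the node's rule (children inherit the
    parent's indicator), or a (left, right) 0/1 pair when it does."""
    out = []
    for ind, dec in zip(indicators, decisions):
        if dec is None:
            out += [ind, ind]
        else:
            l, r = dec
            out += [ind and l, ind and r]
    return out

# Data party: holds the root rule (sample goes right) and the rule of
# the left internal node (irrelevant, since that branch is not taken).
data_party = expand([1], [(0, 1)])
data_party = expand(data_party, [(1, 0), None])

# Task party: holds only the rule of the right internal node.
task_party = expand([1], [None])
task_party = expand(task_party, [None, (0, 1)])

# Element-wise AND over all parties reveals the single reached leaf.
reached = [a & b for a, b in zip(data_party, task_party)]
```

Here `data_party` ends as [0, 0, 1, 1], `task_party` as [1, 1, 0, 1], and the AND yields a one-hot vector selecting the fourth leaf.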
4. Infrastructure
In this section, we review open-source FL platforms for supporting tree-based models (TBMs) in the vertical federated learning (VFL) scenario, and further summarize several principles to promote the design of infrastructure.
4.1. FL Platforms
Open-source FL platforms (79; 80; 81; 110; 111; 112; 113; 114) have made remarkable progress in helping users conveniently apply FL to real-world applications and develop new FL algorithms, covering various FL scenarios. In order to support TBMs in the VFL scenario, FL platforms are expected to allow multiple parties to execute various kinds of subroutines for receiving, handling, and sending different types of information and completing different computation tasks.
Inspired by previous studies (78; 12; 67; 68; 115), we briefly summarize the open-source FL platforms that can satisfy the requirements of applying TBMs in the VFL scenario, including:
• FATE (https://github.com/FederatedAI/FATE) is an industrial-grade FL platform that supports the secure computation of different kinds of machine learning algorithms, including several tree-based models such as GBDT.
• Fedlearner (https://github.com/bytedance/fedlearner) focuses on multi-party collaborative tasks, and provides an implementation of SecureBoost for the VFL scenario.
• FedTree (https://github.com/Xtra-Computing/FedTree) is a specially designed FL platform targeting tree-based models, which provides different types of tree-based models such as GBDT and random forests, equipped with various privacy protection mechanisms.
• SecretFlow (https://github.com/secretflow/secretflow) provides device abstraction for conveniently applying SMPC, and releases implementations of label-scattering TBMs, e.g., HEP-XGB and CRP-XGB.
• FS-Tree (https://github.com/alibaba/FederatedScope/tree/master/federatedscope/vertical_fl) is a module designed for TBMs in FederatedScope (81), a flexible and easy-to-use FL platform based on an event-driven architecture. It includes both feature-gathering TBMs and label-scattering TBMs, and provides rich types of privacy protection mechanisms as plugins.
4.2. Design Principles
Considering the diversity of VFL applications, we summarize the design principles of FL platforms to better satisfy the requirements of applying TBMs in the VFL scenario:
• Flexibility. Flexibility refers to the platform’s ability to support various types of communication and computation protocols when applying TBMs in the VFL scenario. Specifically, a platform exhibiting strong flexibility should effectively handle the actions associated with existing types of communication and computation protocols, including sending, receiving, and processing behaviors during both training and inference procedures. In contrast, platforms that only support a specific protocol or lack capabilities for diverse actions are considered less flexible.
• Extensibility. One of the main targets of constructing FL platforms is to save developers' effort in implementing new algorithms, which motivates FL platforms to be extensible. Extensibility denotes the extent to which a platform can conveniently support the addition of new modules, which is particularly crucial for the ongoing development of TBMs. When the modules provided by a platform are tightly coupled, developers need more effort to introduce new modules or enhance existing ones, which diminishes extensibility; a design featuring pluggable modules can significantly enhance it. The level of extensibility can be assessed by measuring the effort required for developers to implement a reasonable extension, such as the number of lines of code added or the number of files modified.
• Scalability. Scalability has two dimensions within the context of TBMs in VFL: the platform’s ability to handle increasing data volumes (e.g., additional features and samples) and its capacity to support an increasing number of participants. A platform with strong scalability should ensure that, as data volumes or the number of participants increase, the growth of its time and communication costs does not exceed linear growth under fixed computational resources.
• Security. Privacy protection is particularly important for applying TBMs in the VFL scenario, since feature-related and label-related information needs to be exchanged, which is more vulnerable than the information exchanged in the HFL scenario (e.g., model parameters). Security of the platform primarily pertains to the availability of advanced privacy protection algorithms that can be effectively implemented to address the varied requirements of different real-world applications. For TBMs in VFL, different protocols might necessitate different privacy protection algorithms. Consequently, platforms that facilitate user-friendly implementation and allow for parameter adjustments tailored to specific protection needs demonstrate superior security.
Overall, the aforementioned design principles motivate us to describe different subroutines as separate and pluggable behaviors, and to minimize the dependence between different parties and subroutines. We hope that these principles can further inspire the community to make improvements on FL platforms in supporting tree-based models.
5. Experiments
In this section, we conduct a series of experiments on several widely-used datasets. We aim to provide an empirical understanding of the characteristics of different types of TBMs, i.e., feature-gathering TBMs and label-scattering TBMs. Meanwhile, we demonstrate the trade-off among model utility, protection strength, and computation/communication cost, which should be carefully balanced in real-world applications.
5.1. Datasets and Metrics
We adopt the following datasets in the experiments:
• Abalone (https://archive.ics.uci.edu/ml/datasets/abalone): This dataset is for predicting the age of abalone from physical measurements; it contains 4,177 instances with 8 features.
• Blog (http://archive.ics.uci.edu/ml/datasets/BlogFeedback): The task associated with this dataset is to predict how many comments a blog post will receive. It contains 52,397 training and 7,624 test instances, each with 280 features.
• Adult (http://archive.ics.uci.edu/ml/datasets/Adult): This dataset is for predicting whether a person earns over 50K a year. It contains 32,561 training and 16,281 test instances, with 14 features. We delete the samples with unknown values, which leaves 30,162 and 15,060 instances for training and testing, respectively.
• Credit (https://www.kaggle.com/c/GiveMeSomeCredit/overview): A credit score dataset for classifying whether a user will suffer serious financial problems, helping banks determine whether a loan should be granted. It contains a total of 150,000 instances and 10 features.
Note that the Abalone and Credit datasets do not provide a train/test partition; for them, we split the samples into training and evaluation subsets. For the evaluation metrics, we adopt Mean Squared Error (MSE) for regression tasks (on the Abalone and Blog datasets), and accuracy and Area Under the ROC Curve (AUC) for classification tasks (on the Adult and Credit datasets).
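The adopted metrics can be computed as in the following plain-Python sketch; the AUC uses the rank-sum (Mann-Whitney) formulation.

```python
def mse(y_true, y_pred):
    """Mean Squared Error for regression tasks."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def accuracy(y_true, y_pred):
    """Fraction of correctly classified samples."""
    return sum(a == b for a, b in zip(y_true, y_pred)) / len(y_true)

def auc(y_true, scores):
    """AUC: probability that a random positive is scored above a random
    negative, counting ties as one half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```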
| Models | Datasets | Number of trees | Depth | Learning rate | λ | γ | Feature subsample ratio |
|---|---|---|---|---|---|---|---|
| XGBoost | Abalone | 14 | 3 | 0.19 | 0.1 | 0 | 1 |
| | Blog | 15 | 4 | 0.17 | 0.1 | 0 | 1 |
| | Adult | 10 | 3 | 0.56 | 0.1 | 0 | 1 |
| | Credit | 6 | 4 | 0.35 | 0.1 | 0 | 1 |
| GBDT | Abalone | 15 | 3 | 0.19 | 0.1 | - | 1 |
| | Blog | 15 | 4 | 0.12 | 0.1 | - | 1 |
| | Adult | 15 | 4 | 0.49 | 0.1 | - | 1 |
| | Credit | 10 | 4 | 0.1 | 0.1 | - | 1 |
| Random Forest | Abalone | 10 | 6 | - | - | - | 1 |
| | Blog | 13 | 6 | - | - | - | 0.55 |
| | Adult | 10 | 5 | - | - | - | 0.4 |
| | Credit | 10 | 3 | - | - | - | 0.2 |
5.2. Implementation Details
We implement various feature-gathering TBMs and label-scattering TBMs, including RF, GBDT, and XGBoost, based on FS-Tree, a module in FederatedScope (81) designed for TBMs. The main reasons for our choice are that FS-Tree supports flexible information exchange and handling with its event-driven architecture, allows rich types of privacy protection mechanisms (differential privacy, homomorphic encryption, and secret sharing) as plugins for convenient usage, and has comprehensive benchmarking abilities.
In the experiments, we set the number of parties to two, i.e., one task party and one data party. Each party holds a non-overlapping part of the features of the datasets. The categorical features in the datasets have been transformed into numerical types via one-hot encoding, following the settings in previous studies (39). We bucketize feature values to accelerate the process of finding splitting rules. We utilize the hyperparameter optimization (HPO) tools provided in FederatedScope to search for the optimal hyperparameters of the various TBMs, and list the adopted hyperparameters in Table 2.
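The bucketization step can be sketched as follows, assuming equal-frequency buckets (the actual bucketing strategy used by a given platform may differ).

```python
import numpy as np

def bucketize(values, n_buckets):
    """Assign each sample a bucket id so that candidate splits only need
    to be evaluated at bucket boundaries instead of every value."""
    order = np.argsort(values)                     # rank samples by value
    ids = np.empty(len(values), dtype=int)
    for b, chunk in enumerate(np.array_split(order, n_buckets)):
        ids[chunk] = b                             # equal-frequency buckets
    return ids
```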
| Type | Models | Abalone (MSE) | Blog (MSE) | Adult (Accuracy) | Adult (AUC) | Credit (Accuracy) | Credit (AUC) |
|---|---|---|---|---|---|---|---|
| Feature-gathering | Random Forest | 3.866±0.105 | 576.599±40.508 | 0.790±0.064 | 0.836±0.039 | 0.932±0.000 | 0.583±0.039 |
| | GBDT | 4.374±0.186 | 552.942±1.637 | 0.840±0.001 | 0.895±0.000 | 0.935±0.000 | 0.812±0.000 |
| | XGBoost | 4.177±0.169 | 544.368±4.423 | 0.844±0.001 | 0.895±0.001 | 0.935±0.002 | 0.823±0.013 |
| Label-scattering | Random Forest | 3.891±0.095 | 574.698±32.156 | 0.823±0.020 | 0.838±0.027 | 0.932±0.000 | 0.566±0.033 |
| | GBDT | 4.342±0.122 | 552.518±1.959 | 0.840±0.001 | 0.895±0.000 | 0.935±0.000 | 0.812±0.000 |
| | XGBoost | 4.214±0.100 | 544.429±4.459 | 0.844±0.001 | 0.896±0.000 | 0.936±0.002 | 0.823±0.011 |
| | XGBoost (SS) | 3.588±0.000 | 546.289±2.733 | 0.801±0.003 | 0.820±0.000 | 0.935±0.000 | 0.854±0.001 |
| Type | Models | Abalone | Blog | Adult | Credit |
|---|---|---|---|---|---|
| Feature-gathering | Random Forest | 133.2±5.8 | 370.7±4.7 | 117.9±7.5 | 37.8±2.2 |
| | GBDT | 47.1±1.2 | 127.1±0.3 | 95.0±2.2 | 58.0±0.0 |
| | XGBoost | 43.4±1.9 | 125.4±0.7 | 40±0.0 | 36.1±0.3 |
| Label-scattering | Random Forest | 743.6±7.3 | 1,153.6±10.9 | 410.1±9.8 | 96.7±2.8 |
| | GBDT | 137.2±1.5 | 337.1±0.3 | 303.9±2.1 | 198.0±0.0 |
| | XGBoost | 127.8±1.7 | 335.6±0.8 | 100.0±0.0 | 119.9±0.3 |
5.3. Results and Analysis
5.3.1. Comparisons between feature-gathering and label-scattering TBMs
We conduct a series of experiments to show the differences between feature-gathering TBMs and label-scattering TBMs in terms of model performance and communication frequency, where the communication frequency represents the number of information exchanges between the task party and the data party.
The experimental results of model performance comparisons are shown in Table 3, from which we can see that the feature-gathering and label-scattering variants of random forest, GBDT, and XGBoost achieve similar model performance on all adopted datasets. XGBoost (SS) denotes the method proposed in (37), which combines XGBoost and secret sharing and is implemented using SecretFlow (https://github.com/secretflow/secretflow). These results are not surprising: feature-gathering TBMs and label-scattering TBMs differ in their communication and computation protocols, but both can reliably train decision trees well. However, these two kinds of TBMs need different communication frequencies to complete the tree-building process, as illustrated in Table 4. We do not report the communication frequency of XGBoost (SS), since the method contains a large number of secret sharing computations (including secret sharing addition, division, and comparison), which makes the communication frequency extremely large. We can observe from the table that label-scattering TBMs need several times more communication than feature-gathering TBMs. The reason is that, in label-scattering TBMs, the task party needs to broadcast the label-related information to the data parties before building every tree; and after finding the splitting rule at each node, the task party and the data parties need extra communication to partition the data samples compared to feature-gathering TBMs. These experimental results are consistent with the analysis in Section 3.
5.3.2. Model utility versus privacy protection strength
In feature-gathering TBMs, in order to prevent privacy leakage, the data parties tend to perturb the ordinal numbers generated based on their private features. Thus, there exists a trade-off between privacy protection strength and the utility of the model learned from such desensitized but noisy data. More discussion can be found in Section 3.1.3; here we provide empirical observations for a better understanding of this trade-off. We train XGBoost equipped with the privacy protection mechanisms proposed by FederBoost and OpBoost, and show the experimental results in Figure 10 and Figure 11, respectively.
As shown on the x-axis of Figure 10, we vary the probability that a sample stays in the correct bucket. From the figure, we can observe that as this probability increases, which implies that the intermediate results provided by the data parties are more precise, the model performance becomes better and more stable (i.e., smaller variance), and gradually approaches the results achieved by the model learned without adding DP noise. Similarly, we adjust the privacy protection strength of OpBoost by changing the value of its parameter that controls the probability of mapping a sample from its original bucket to another bucket, where the mapping probability decays with the distance between the two buckets, and show the results in Figure 11. These results further confirm the trade-off between model utility and privacy protection strength when training feature-gathering TBMs.
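As a hedged sketch of such a distance-decaying mechanism (not the exact OpBoost formulation), a sample in bucket i can be mapped to bucket j with probability proportional to exp(-theta·|i-j|), where a larger theta keeps samples closer to their true bucket (weaker noise, weaker protection):

```python
import math, random

def perturb_bucket(i, n_buckets, theta):
    """Map bucket i to a (possibly different) bucket j, with probability
    decaying exponentially in the distance |i - j|. theta is the
    privacy/utility knob: theta -> infinity keeps the true bucket,
    theta = 0 maps uniformly at random."""
    weights = [math.exp(-theta * abs(i - j)) for j in range(n_buckets)]
    total = sum(weights)
    r, acc = random.random() * total, 0.0
    for j, w in enumerate(weights):
        acc += w
        if r <= acc:
            return j
    return n_buckets - 1
```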




Note that for label-scattering TBMs, the adopted privacy protection mechanisms might not impact the model utility but bring some efficiency issues, as discussed in the next part.
5.3.3. Privacy protection strength versus computation and communication cost
In order to avoid privacy leakage, label-scattering TBMs apply homomorphic encryption and secret sharing techniques, which bring additional computation and communication costs. The reason is that the encryption/decryption and framing processes require considerable computation resources, and the encrypted and framed information is always larger than the raw information and thus costs more communication resources.
In the experiment, we apply Paillier (54) algorithms to encrypt the shared information, and vary the sizes of the public and private keys to change the length of the encrypted information and thus adjust the privacy protection strength. We compare the time cost and communication overhead when using different key sizes, and show the results in Figures 12 and 13, respectively. From these results, we can see that as the key size becomes larger, the communication overhead grows linearly, while the time consumption (which includes the additional time for encrypting, decrypting, sending, receiving, and handling the messages) grows super-linearly. This phenomenon suggests that users should choose a suitable key size when using homomorphic encryption and secret sharing techniques, balancing privacy protection strength against computation/communication costs.
For feature-gathering TBMs, where data parties might inject DP noise into the intermediate results, the communication overhead and time consumption of different algorithms stay at the same level. The slight differences mainly come from the computation and time cost of generating noise and perturbing the ordinal numbers.




5.3.4. Discussions regarding multiple data parties
As the number of data parties increases, the utility of the trained TBMs can be influenced by the privacy protection algorithms adopted by each participant. In an extreme case, if each participant either does not employ any privacy protection algorithms or uses performance-preserving algorithms (such as homomorphic encryption), the performance of the TBMs remains consistent regardless of the number of participants. This is primarily because, within VFL, the tree-building process does not fundamentally differ from that in centralized settings, although it distributes the computations among different participants and thus leads to some additional intermediate calculations. On the other hand, when participants utilize differential privacy for enhanced privacy protection, the number of participants affects the model’s utility.
An increase in participants results in a proportional rise in communication costs during broadcasting, as we typically utilize a peer-to-peer (P2P) transmission protocol in FL, which implies that each piece of information must be independently transmitted to its designated recipient. Furthermore, the computational costs are closely tied to the number of local features and samples. If the increase in participants leads to a corresponding growth in the number of features and samples (regardless of whether some are redundant), the computational costs would also increase.
It is worth noting that the communication and computation protocols discussed in Section 3 impose no restrictions on the number of data parties. Therefore, we set the number of data parties to one during the experiments to provide clear insights regarding the trade-offs between model utility, privacy protection, and communication/computation costs.


5.3.5. Summary
In a nutshell, neither feature-gathering nor label-scattering TBMs can simultaneously achieve the ideal level of model utility, privacy protection strength, and communication/computation cost. For example, feature-gathering TBMs are more efficient at achieving acceptable model utility but expose desensitized plaintext results, while label-scattering TBMs provide high-level protection strength but need more communication and computation resources. When applying TBMs in the VFL scenario, researchers and developers are encouraged to propose more advanced algorithms that achieve a better trade-off, and to balance model utility, privacy protection strength, and communication/computation cost according to their downstream applications.
6. Applications & Directions
In this section, we provide several real-world applications for a better understanding of the advantages of tree-based models in VFL. Besides, we outline the opportunities and future directions for TBMs in VFL.
6.1. Real-world Applications
6.1.1. Finance
Finance is crucial for social development, encompassing key areas such as risk management and financial marketing. Risk management involves identifying, evaluating, and mitigating potential risks to an organization’s capital, earnings, and overall operations. Financial marketing focuses on the strategies that financial institutions use to attract, acquire, and retain customers, including branding, advertising, customer relationship management, and digital marketing.
In practice, financial institutions face challenges due to the lack of comprehensive user profiles essential for effective model training, as well as the limited interpretability of predictive results. Training tree-based models within the federated learning (FL) framework can address these issues by enabling secure collaborations that utilize a wealth of data attributes from external sources, such as internet companies. Tree-based models, known for their intuitive structure, enhance interpretability by allowing users to easily understand the decision-making process behind predictions. This transparency is crucial for financial institutions, as it fosters trust in model outputs. In this way, institutions can further improve the accuracy of predictive analytics with actionable insights while ensuring the protection of data privacy.
Recently, notable contributions to this field include vertical federated logistic regression (116), vertical federated linear regression (12), and vertical federated boosting tree-based models (36; 117; 76). These approaches employ Homomorphic Encryption (HE) and Secret Sharing (SS) to protect the privacy of transmitted intermediate results.
Specifically, WeBank and its partner companies successfully completed vertical federated modeling using invoice data (https://www.fedai.org/cases/). This enables WeBank and its partners to collaboratively train the model while keeping their respective data private. The parties exchange encrypted intermediate results and retain control over their individual model segments. When it comes time to make predictions, the models from both sides are combined to generate the final prediction. The entire training process prioritizes the security of both the data and the models. Ant Group has announced a financial risk control technology solution based on SMPC that allows for the inclusion of more dimensional credit data into a joint model without compromising the confidentiality of the data sources (https://www.secretflow.org.cn/en/). This approach helps to construct more accurate big data credit risk control models. The support of this technology solution enhances the necessary collaboration and communication among participants in financial risk control joint projects. It accelerates the transformation from traditional methods to cutting-edge technologies in financial risk management, serves the collaborative supervision of the government and the financial industry, and promotes the development of the financial data market.
6.1.2. Recommendation & Advertising
Recommendation and advertising are representative domains where VFL holds promising applications. Developing a robust recommendation or advertising system often requires the aggregation of extensive user data from various sources, which raises significant privacy and security concerns. VFL addresses this issue by allowing multiple organizations to collaboratively train a powerful recommendation model without the need to share raw user data.
Consider a scenario where multiple e-commerce platforms aim to develop a combined recommendation or advertising engine that leverages their unique datasets. Each platform contains distinct types of user information, such as purchase history, browsing behavior, user ratings, and demographic details. By utilizing VFL, these platforms can collaboratively train a tree-based model that effectively incorporates the complete feature set available across all platforms, thereby enhancing the relevance of recommendations.
Building on SecureBoost (36), the work in (41) improved recommendation accuracy by leveraging data from mobile network operators and healthcare providers within an FL environment. Besides, research (69) points out that many companies have successfully deployed tree-based models in VFL for recommendation and advertising systems. For example, ByteDance utilizes its Fedlearner platform to significantly enhance advertising effectiveness through a tree-based VFL algorithm, and Tencent employs vertical federated GBDTs to establish a VFL federation between advertisers and advertising platforms, resulting in improved model accuracy.
6.1.3. Healthcare
VFL has emerged as a transformative technology in the healthcare sector, especially given the fragmented nature of healthcare data. In this landscape, different institutions, such as hospitals, laboratories, and clinics, often hold critical yet incomplete pieces of patient information. These entities typically maintain separate records that encompass diverse aspects of a patient’s care, including medical history, diagnostic results, and treatment plans.
With VFL, these organizations can engage in collaborative analysis without compromising the privacy of sensitive data: it allows them to collectively train on a more comprehensive dataset, improving the accuracy, robustness, and interpretability of predictive analytics while strictly adhering to privacy regulations. Vertical federated TBMs hold considerable potential for creating decision rules across multiple data sources to predict patient disease risks, enhance the quality of medical services, improve disease prediction, and optimize treatment protocols. Tree-based models are also adept at handling variable interactions and providing clear classification rules, which assist physicians in developing personalized treatment plans based on the model’s outputs.
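The clear classification rules mentioned above come directly from the tree structure: every root-to-leaf path is an if-then rule. The toy risk tree below (its features, thresholds, and labels are invented for illustration) shows how such rules can be read off by a simple traversal:

```python
# Hypothetical sketch of why TBMs yield readable rules: a toy risk tree
# (structure invented for illustration) is flattened into if-then rules.
tree = {
    "feature": "blood_pressure", "threshold": 140,
    "left":  {"leaf": "low risk"},
    "right": {"feature": "glucose", "threshold": 126,
              "left":  {"leaf": "medium risk"},
              "right": {"leaf": "high risk"}},
}

def extract_rules(node, conditions=()):
    """Turn each root-to-leaf path into a human-readable decision rule."""
    if "leaf" in node:
        cond = " AND ".join(conditions) or "always"
        return [f"IF {cond} THEN {node['leaf']}"]
    f, t = node["feature"], node["threshold"]
    return (extract_rules(node["left"],  conditions + (f"{f} <= {t}",)) +
            extract_rules(node["right"], conditions + (f"{f} > {t}",)))

rules = extract_rules(tree)
```

This transparency is what distinguishes TBMs from neural models in clinical settings: a physician can inspect every rule the model applies.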
Recent studies (43; 44; 45; 46) have designed novel vertical federated TBMs and conducted a series of experiments on healthcare datasets, paving the way for a more integrated approach to healthcare and driving improvements in health outcomes on a larger scale. Specifically, Ant Group’s privacy computing platform, SecretFlow161616https://github.com/secretflow/secretflow, in collaboration with Alibaba Cloud’s medical big data management platform (HData v2.0), utilizes its advanced privacy computing technology to overcome data collaboration barriers among medical institutions in DRGs applications. By integrating artificial intelligence and big data technologies without allowing data to leave the domain, the platform facilitates joint statistics and modeling across institutions. In the realm of medical diagnosis classification, SecretFlow enhances predictive models through the sharing of multi-institutional samples, significantly improving the accuracy of predictions compared to models trained on data from a single institution. The platform is also applied in various scenarios, including disease diagnosis, examination recommendations, medication suggestions, rare disease prediction, and quality control rule management. These applications provide robust solutions for medical data governance, comprehensive hospital quality control, medical research, medical insurance risk management, and critical clinical business challenges.
6.2. Challenges and Future Directions
6.2.1. Communication Overhead
Addressing communication overhead during training and inference is a critical area for improvement. This is particularly important because the number of transmissions between data parties and label parties in TBMs tends to be higher than in other methods, and communication costs also grow with the number of training samples. As a result, reducing communication overhead could bring significant improvements in cross-silo FL settings that involve large datasets.
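A back-of-envelope estimate makes the scale of this overhead concrete. In a typical HE-protected vertical GBDT, each data party uploads one encrypted gradient/hessian histogram per feature per tree level; the sketch below (all parameter values are illustrative assumptions, not measurements from any specific system) multiplies these factors out:

```python
def histogram_comm_bytes(n_trees, depth, n_features, n_bins, ciphertext_bytes=512):
    """Rough upper bound on bytes a data party uploads during training:
    one encrypted (g, h) pair per histogram bin, per feature, per tree level."""
    per_level = n_features * n_bins * 2 * ciphertext_bytes  # g and h per bin
    return n_trees * depth * per_level

# Illustrative setting: 100 trees of depth 6 over 50 features with 32 bins,
# using 2048-bit Paillier ciphertexts (~512 bytes each).
total = histogram_comm_bytes(100, 6, 50, 32)
# Roughly 0.98 GB of upload for a modest model, before any batching tricks.
```

Since the per-sample gradients are aggregated into histograms, the bin count (not the raw sample count) dominates this term; the sample count instead drives the cost of the initial encrypted-gradient broadcast, which this sketch omits.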
6.2.2. Feature Abandonment and Crossing
In VFL scenarios, participants have no knowledge of the features owned by others, which can lead to potential redundancy and missed opportunities for creating high-order features through feature crossing. While neural networks may implicitly capture such interactions through non-linear layers, this is more challenging for TBMs. Thus, enhancing model utility by effectively managing feature abandonment and facilitating feature crossing represents a promising direction, focusing on optimizing the use of available features to improve model accuracy and relevance.
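The restriction can be illustrated with a toy example (all feature names and values below are invented). Each party can freely cross features it holds locally, but the potentially most informative crosses span parties and would require exchanging raw values, which VFL forbids:

```python
# Toy sketch of the feature-crossing restriction in VFL. Each party can
# only cross features it holds locally; the cross-party combination
# (age x purchases below) is exactly what participants cannot build
# without exposing raw data.
def cross(row, f1, f2):
    """A simple multiplicative cross of two numeric features."""
    return row[f1] * row[f2]

party_a_row = {"purchases": 12, "avg_spend": 35.0}
party_b_row = {"age": 31, "page_views": 240}

local_cross_a = cross(party_a_row, "purchases", "avg_spend")  # allowed: same party
# A cross-party feature would require merging raw rows first:
#   cross({**party_a_row, **party_b_row}, "age", "purchases")
# which is disallowed in VFL without a privacy-preserving protocol.
```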
6.2.3. Balance Among Participants
Different participants may possess varying numbers of features, and the importance of these features can differ. An interesting research direction is to propose mechanisms that ensure the trained model does not become overly biased toward participants with a greater number of features. Conversely, designing suitable incentive mechanisms to reward participants who contribute important features is another valuable area for exploration, leveraging the inherent interpretability of TBMs.
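One plausible starting point for such mechanisms, enabled by the interpretability of TBMs, is to aggregate gain-based feature importance per owning party. The sketch below is a hedged illustration only: the feature-to-party mapping and split gains are invented, and real incentive schemes would likely use more principled attributions such as Shapley values:

```python
# Hedged sketch: attribute model contribution to participants by summing
# gain-based feature importance per owning party. All values are invented.
feature_owner = {"income": "bank", "balance": "bank",
                 "clicks": "platform", "dwell_time": "platform"}
split_gains = {"income": 5.0, "balance": 3.0, "clicks": 1.5, "dwell_time": 0.5}

def party_contribution(owners, gains):
    """Normalized share of total split gain contributed by each party."""
    total = sum(gains.values())
    shares = {}
    for feat, gain in gains.items():
        party = owners[feat]
        shares[party] = shares.get(party, 0.0) + gain / total
    return shares

shares = party_contribution(feature_owner, split_gains)
# bank: 0.8, platform: 0.2 -- usable either as a basis for rewards or as
# a warning that the model leans heavily on the feature-rich party.
```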
6.2.4. Non-Overlapping Samples
Training TBMs in VFL typically relies on overlapping samples, which can limit their applicability. Therefore, designing mechanisms that allow the participation of non-overlapping samples, such as data augmentation strategies, represents a valuable research direction to further broaden the application scope of TBMs in VFL.
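The cost of this restriction is easy to see with a minimal sketch (the ID sets are invented; a plain set intersection stands in for a real private set intersection protocol):

```python
# Minimal sketch of why non-overlapping samples are wasted: plain ID-set
# intersection (a stand-in for a real PSI protocol) keeps only the shared
# rows, and everything else on both sides is discarded.
ids_a = {"u1", "u2", "u3", "u5"}
ids_b = {"u2", "u3", "u4", "u6"}

overlap = ids_a & ids_b     # usable for standard VFL training
unused_a = ids_a - overlap  # discarded by party A
unused_b = ids_b - overlap  # discarded by party B
# Here half of each party's data never contributes to training, which is
# what motivates augmentation strategies for non-overlapping samples.
```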
7. Conclusions
In this paper, we give a comprehensive survey on tree-based models (TBMs) in vertical federated learning (VFL). We categorize TBMs into feature-gathering TBMs and label-scattering TBMs based on differences in their communication and computation protocols. We provide a detailed overview of their training and inference procedures, and discuss how to protect various types of shared information using techniques such as differential privacy, homomorphic encryption, and secure multi-party computation. We summarize several design principles aimed at helping federated learning platforms support diverse tree-based models for both academic research and industrial deployment. In addition, we present real-world applications to better illustrate the advancements of TBMs in VFL, including in finance, recommendation systems, advertising, and healthcare, while highlighting some opportunities and future directions.
References
- Hinton and Salakhutdinov [2006] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
- LeCun et al. [2015] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
- Chen and Guestrin [2016] Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pages 785–794, 2016.
- He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14, pages 630–645, 2016.
- Krizhevsky et al. [2017] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Communications of the ACM, 60(6):84–90, 2017.
- Srivastava et al. [2014] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014.
- Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
- Jordan and Mitchell [2015] M. I. Jordan and T. M. Mitchell. Machine learning: Trends, perspectives, and prospects. Science, 349(6245):255–260, 2015.
- Li et al. [2021a] Qinbin Li, Zeyi Wen, Zhaomin Wu, Sixu Hu, Naibo Wang, Yuan Li, Xu Liu, and Bingsheng He. A survey on federated learning systems: Vision, hype and reality for data privacy and protection. IEEE Transactions on Knowledge and Data Engineering, 2021a.
- McMahan et al. [2017] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pages 1273–1282, 2017.
- Kairouz et al. [2021] Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. Foundations and Trends® in Machine Learning, 14(1–2):1–210, 2021.
- Yang et al. [2019a] Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST), 10(2):1–19, 2019a.
- Chen et al. [2021a] Chaochao Chen, Jun Zhou, Li Wang, Xibin Wu, Wenjing Fang, Jin Tan, Lei Wang, Alex X Liu, Hao Wang, and Cheng Hong. When homomorphic encryption marries secret sharing: Secure large-scale sparse logistic regression and applications in risk control. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 2652–2662, 2021a.
- Long et al. [2020] Guodong Long, Yue Tan, Jing Jiang, and Chengqi Zhang. Federated learning for open banking. In Federated learning, pages 240–254. 2020.
- Wang [2019] Guan Wang. Interpret federated learning with shapley values. arXiv preprint arXiv:1905.04519, 2019.
- Cheng et al. [2020] Yong Cheng, Yang Liu, Tianjian Chen, and Qiang Yang. Federated learning for privacy-preserving ai. Communications of the ACM, 63(12):33–36, 2020.
- Ammad-Ud-Din et al. [2019] Muhammad Ammad-Ud-Din, Elena Ivannikova, Suleiman A Khan, Were Oyomno, Qiang Fu, Kuan Eeik Tan, and Adrian Flanagan. Federated collaborative filtering for privacy-preserving personalized recommendation system. arXiv preprint arXiv:1901.09888, 2019.
- Zhang and Jiang [2021] JianFei Zhang and YuChen Jiang. A vertical federation recommendation method based on clustering and latent factor model. In 2021 International Conference on Electronic Information Engineering and Computer Science (EIECS), pages 362–366, 2021.
- Cui et al. [2021] Jinming Cui, Chaochao Chen, Lingjuan Lyu, Carl Yang, and Wang Li. Exploiting data sparsity in secure cross-platform social recommendation. Advances in Neural Information Processing Systems, 34:10524–10534, 2021.
- Shmueli and Tassa [2017] Erez Shmueli and Tamir Tassa. Secure multi-party protocols for item-based collaborative filtering. In Proceedings of the eleventh ACM conference on recommender systems, pages 89–97, 2017.
- Leo et al. [2019] Martin Leo, Suneel Sharma, and Koilakuntla Maddulety. Machine learning in banking risk management: A literature review. Risks, 7(1):29, 2019.
- Zheng et al. [2022] Zhaohua Zheng, Yize Zhou, Yilong Sun, Zhang Wang, Boyi Liu, and Keqiu Li. Applications of federated learning in smart cities: recent advances, taxonomy, and open challenges. Connection Science, 34(1):1–28, 2022.
- Jiang et al. [2020] Ji Chu Jiang, Burak Kantarci, Sema Oktug, and Tolga Soyata. Federated learning in smart city sensing: Challenges and opportunities. Sensors, 20(21):6230, 2020.
- Ramu et al. [2022] Swarna Priya Ramu, Parimala Boopalan, Quoc-Viet Pham, Praveen Kumar Reddy Maddikunta, Thien Huynh-The, Mamoun Alazab, Thanh Thi Nguyen, and Thippa Reddy Gadekallu. Federated learning enabled digital twins for smart cities: Concepts, recent advances, and future directions. Sustainable Cities and Society, 79:103663, 2022.
- Chen et al. [2020a] Tianyi Chen, Xiao Jin, Yuejiao Sun, and Wotao Yin. Vafl: a method of vertical asynchronous federated learning. arXiv preprint arXiv:2007.06081, 2020a.
- Teimoori et al. [2022] Zeinab Teimoori, Abdulsalam Yassine, and M Shamim Hossain. A secure cloudlet-based charging station recommendation for electric vehicles empowered by federated learning. IEEE Transactions on Industrial Informatics, 18(9):6464–6473, 2022.
- Chen et al. [2020b] Chaochao Chen, Jun Zhou, Longfei Zheng, Huiwen Wu, Lingjuan Lyu, Jia Wu, Bingzhe Wu, Ziqi Liu, Li Wang, and Xiaolin Zheng. Vertically federated graph neural network for privacy-preserving node classification. arXiv preprint arXiv:2005.11903, 2020b.
- He et al. [2020a] Chaoyang He, Murali Annavaram, and Salman Avestimehr. Group knowledge transfer: Federated learning of large cnns at the edge. Advances in Neural Information Processing Systems, 33:14068–14080, 2020a.
- Gupta and Raskar [2018] Otkrist Gupta and Ramesh Raskar. Distributed learning of deep neural network over multiple agents. Journal of Network and Computer Applications, 116:1–8, 2018.
- Dang et al. [2020] Zhiyuan Dang, Bin Gu, and Heng Huang. Large-scale kernel method for vertical federated learning. In Federated Learning, pages 66–80. 2020.
- Gu et al. [2020] Bin Gu, Zhiyuan Dang, Xiang Li, and Heng Huang. Federated doubly stochastic kernel learning for vertically partitioned data. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pages 2483–2493, 2020.
- Yang et al. [2019b] Shengwen Yang, Bing Ren, Xuhui Zhou, and Liping Liu. Parallel distributed logistic regression for vertical federated learning without third-party coordinator. arXiv preprint arXiv:1911.09824, 2019b.
- Bonawitz et al. [2017] Keith Bonawitz, Vladimir Ivanov, Ben Kreuter, Antonio Marcedone, H Brendan McMahan, Sarvar Patel, Daniel Ramage, Aaron Segal, and Karn Seth. Practical secure aggregation for privacy-preserving machine learning. In proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pages 1175–1191, 2017.
- Romanini et al. [2021] Daniele Romanini, Adam James Hall, Pavlos Papadopoulos, Tom Titcombe, Abbas Ismail, Tudor Cebere, Robert Sandmann, Robin Roehm, and Michael A Hoeh. Pyvertical: A vertical federated learning framework for multi-headed splitnn. arXiv preprint arXiv:2104.00489, 2021.
- Liu et al. [2020] Yang Liu, Yingting Liu, Zhijie Liu, Yuxuan Liang, Chuishi Meng, Junbo Zhang, and Yu Zheng. Federated forest. IEEE Transactions on Big Data, 2020.
- Cheng et al. [2021] Kewei Cheng, Tao Fan, Yilun Jin, Yang Liu, Tianjian Chen, Dimitrios Papadopoulos, and Qiang Yang. Secureboost: A lossless federated learning framework. IEEE Intelligent Systems, 36(6):87–98, 2021.
- Fang et al. [2021] Wenjing Fang, Derun Zhao, Jin Tan, Chaochao Chen, Chaofan Yu, Li Wang, Lei Wang, Jun Zhou, and Benyu Zhang. Large-scale secure xgb for vertical federated learning. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pages 443–452, 2021.
- Tian et al. [2024] Zhihua Tian, Rui Zhang, Xiaoyang Hou, Lingjuan Lyu, Tianyi Zhang, Jian Liu, and Kui Ren. Federboost: Private federated learning for gbdt. IEEE Transactions on Dependable and Secure Computing, 21(3):1274–1285, 2024.
- Li et al. [2022] Xiaochen Li, Yuke Hu, Weiran Liu, Hanwen Feng, Li Peng, Yuan Hong, Kui Ren, and Zhan Qin. Opboost: a vertical federated tree boosting framework based on order-preserving desensitization. Proceedings of the VLDB Endowment, 16(2):202–215, 2022.
- Chen et al. [2021b] Xiaolin Chen, Shuai Zhou, Bei Guan, Kai Yang, Hao Fao, Hu Wang, and Yongji Wang. Fed-eini: An efficient and interpretable inference framework for decision tree ensembles in vertical federated learning. In 2021 IEEE International Conference on Big Data (Big Data), pages 1242–1248, 2021b.
- Song et al. [2021] Yong Song, Yuchen Xie, Hongwei Zhang, Yuxin Liang, Xiaozhou Ye, Aidong Yang, and Ye Ouyang. Federated learning application on telecommunication-joint healthcare recommendation. In 2021 IEEE 21st International Conference on Communication Technology (ICCT), pages 1443–1448, 2021.
- Jin et al. [2022] Chao Jin, Jun Wang, Sin G Teo, Le Zhang, C Chan, Qibin Hou, and Khin Mi Mi Aung. Towards end-to-end secure and efficient federated learning for xgboost. In Proceedings of the AAAI International Workshop on Trustable, Verifiable and Auditable Federated Learning, 2022.
- Zheng et al. [2023] Yifeng Zheng, Shuangqing Xu, Songlei Wang, Yansong Gao, and Zhongyun Hua. Privet: A privacy-preserving vertical federated learning service for gradient boosted decision tables. IEEE Transactions on Services Computing, 16(5):3604–3620, 2023.
- Lu et al. [2023] Wen-jie Lu, Zhicong Huang, Qizhi Zhang, Yuchen Wang, and Cheng Hong. Squirrel: A scalable secure Two-Party computation framework for training gradient boosting decision tree. In 32nd USENIX Security Symposium (USENIX Security 23), pages 6435–6451, 2023.
- Jiang et al. [2024] Yufan Jiang, Fei Mei, Tianxiang Dai, and Yong Li. Sigbdt: Large-scale gradient boosting decision tree training via function secret sharing. In Proceedings of the 19th ACM Asia Conference on Computer and Communications Security, pages 274–288, 2024.
- Akhavan Mahdavi et al. [2023] Rasoul Akhavan Mahdavi, Haoyan Ni, Dimitry Linkov, and Florian Kerschbaum. Level up: Private non-interactive decision tree evaluation using levelled homomorphic encryption. In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, pages 2945–2958, 2023.
- Xu et al. [2024] Wei Xu, Hui Zhu, Yandong Zheng, Fengwei Wang, Jiaqi Zhao, Zhe Liu, and Hui Li. Elxgb: An efficient and privacy-preserving xgboost for vertical federated learning. IEEE Transactions on Services Computing, 2024.
- Chen et al. [2022] Hanxiao Chen, Hongwei Li, Yingzhe Wang, Meng Hao, Guowen Xu, and Tianwei Zhang. Privdt: An efficient two-party cryptographic framework for vertical decision trees. IEEE Transactions on Information Forensics and Security, 18:1006–1021, 2022.
- Xia et al. [2022] Liqiao Xia, Pai Zheng, Jinjie Li, Wangchujun Tang, and Xiangying Zhang. Privacy-preserving gradient boosting tree: Vertical federated learning for collaborative bearing fault diagnosis. IET Collaborative Intelligent Manufacturing, 4(3):208–219, 2022.
- Grinsztajn et al. [2022] Leo Grinsztajn, Edouard Oyallon, and Gael Varoquaux. Why do tree-based models still outperform deep learning on typical tabular data? In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022.
- Dwork [2008] Cynthia Dwork. Differential privacy: A survey of results. In International conference on theory and applications of models of computation, pages 1–19, 2008.
- Abadi et al. [2016] Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, pages 308–318, 2016.
- Dwork et al. [2014] Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4):211–407, 2014.
- Paillier [1999] Pascal Paillier. Public-key cryptosystems based on composite degree residuosity classes. In Advances in Cryptology—EUROCRYPT’99: International Conference on the Theory and Application of Cryptographic Techniques Prague, Czech Republic, May 2–6, 1999 Proceedings 18, pages 223–238, 1999.
- Yi et al. [2014] Xun Yi, Russell Paulet, Elisa Bertino, Xun Yi, Russell Paulet, and Elisa Bertino. Homomorphic encryption. 2014.
- Fontaine and Galand [2007] Caroline Fontaine and Fabien Galand. A survey of homomorphic encryption for nonspecialists. EURASIP Journal on Information Security, 2007:1–10, 2007.
- Acar et al. [2018] Abbas Acar, Hidayet Aksu, A Selcuk Uluagac, and Mauro Conti. A survey on homomorphic encryption schemes: Theory and implementation. ACM Computing Surveys (Csur), 51(4):1–35, 2018.
- Naehrig et al. [2011] Michael Naehrig, Kristin Lauter, and Vinod Vaikuntanathan. Can homomorphic encryption be practical? In Proceedings of the 3rd ACM workshop on Cloud computing security workshop, pages 113–124, 2011.
- Gentry [2009] Craig Gentry. Fully homomorphic encryption using ideal lattices. In Proceedings of the forty-first annual ACM symposium on Theory of computing, pages 169–178, 2009.
- Yao [1982] Andrew C Yao. Protocols for secure computations. In 23rd annual symposium on foundations of computer science (sfcs 1982), pages 160–164, 1982.
- Goldreich [1998] Oded Goldreich. Secure multi-party computation. Manuscript. Preliminary version, 78(110), 1998.
- Du and Atallah [2001] Wenliang Du and Mikhail J Atallah. Secure multi-party computation problems and their applications: a review and open problems. In Proceedings of the 2001 workshop on New security paradigms, pages 13–22, 2001.
- Zhao et al. [2019] Chuan Zhao, Shengnan Zhao, Minghao Zhao, Zhenxiang Chen, Chong-Zhi Gao, Hongwei Li, and Yu-an Tan. Secure multi-party computation: theory, practice and applications. Information Sciences, 476:357–372, 2019.
- Evans et al. [2018] David Evans, Vladimir Kolesnikov, Mike Rosulek, et al. A pragmatic introduction to secure multi-party computation. Foundations and Trends® in Privacy and Security, 2(2-3):70–246, 2018.
- Ben-David et al. [2008] Assaf Ben-David, Noam Nisan, and Benny Pinkas. Fairplaymp: a system for secure multi-party computation. In Proceedings of the 15th ACM conference on Computer and communications security, pages 257–266, 2008.
- Li et al. [2020] Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. Federated learning: Challenges, methods, and future directions. IEEE signal processing magazine, 37(3):50–60, 2020.
- Aledhari et al. [2020] Mohammed Aledhari, Rehma Razzak, Reza M Parizi, and Fahad Saeed. Federated learning: A survey on enabling technologies, protocols, and applications. IEEE Access, 8:140699–140725, 2020.
- Li et al. [2021b] Qinbin Li, Zeyi Wen, Zhaomin Wu, Sixu Hu, Naibo Wang, Yuan Li, Xu Liu, and Bingsheng He. A survey on federated learning systems: vision, hype and reality for data privacy and protection. IEEE Transactions on Knowledge and Data Engineering, 2021b.
- Liu et al. [2022a] Yang Liu, Yan Kang, Tianyuan Zou, Yanhong Pu, Yuanqin He, Xiaozhou Ye, Ye Ouyang, Ya-Qin Zhang, and Qiang Yang. Vertical federated learning. arXiv preprint arXiv:2211.12814, 2022a.
- Yu et al. [2024] Lei Yu, Meng Han, Yiming Li, Changting Lin, Yao Zhang, Mingyang Zhang, Yan Liu, Haiqin Weng, Yuseok Jeon, Ka-Ho Chow, et al. A survey of privacy threats and defense in vertical federated learning: From model life cycle perspective. arXiv preprint arXiv:2402.03688, 2024.
- Yang et al. [2023] Liu Yang, Di Chai, Junxue Zhang, Yilun Jin, Leye Wang, Hao Liu, Han Tian, Qian Xu, and Kai Chen. A survey on vertical federated learning: From a layered perspective. arXiv preprint arXiv:2304.01829, 2023.
- Ye et al. [2024] Mang Ye, Wei Shen, Bo Du, Eduard Snezhko, Vassili Kovalev, and Pong C Yuen. Vertical federated learning for effectiveness, security, applicability: A survey. ACM Computing Surveys, 2024.
- Chatel et al. [2021] Sylvain Chatel, Apostolos Pyrgelis, Juan Ramón Troncoso-Pastoriza, and Jean-Pierre Hubaux. Sok: Privacy-preserving collaborative tree-based model learning. Proceedings on Privacy Enhancing Technologies, 2021(3):182–203, 2021.
- Ong et al. [2022] Yuya Jeremy Ong, Nathalie Baracaldo, and Yi Zhou. Tree-based models for federated learning systems. In Federated Learning, pages 27–52. 2022.
- Han et al. [2022] Yujin Han, Pan Du, and Kai Yang. Fedgbf: An efficient vertical federated learning framework via gradient boosting and bagging. arXiv preprint arXiv:2204.00976, 2022.
- Xie et al. [2022] Lunchen Xie, Jiaqi Liu, Songtao Lu, Tsung-Hui Chang, and Qingjiang Shi. An efficient learning framework for federated xgboost using secret sharing and distributed optimization. ACM Transactions on Intelligent Systems and Technology (TIST), 13(5):1–28, 2022.
- Wu et al. [2020] Yuncheng Wu, Shaofeng Cai, Xiaokui Xiao, Gang Chen, and Beng Chin Ooi. Privacy preserving vertical federated learning for tree-based models. Proceedings of the VLDB Endowment, 13(12):2090–2103, 2020.
- Liu et al. [2022b] Xiaoyuan Liu, Tianneng Shi, Chulin Xie, Qinbin Li, Kangping Hu, Haoyu Kim, Xiaojun Xu, Bo Li, and Dawn Song. Unifed: A benchmark for federated learning frameworks. arXiv preprint arXiv:2207.10308, 2022b.
- Liu et al. [2021] Yang Liu, Tao Fan, Tianjian Chen, Qian Xu, and Qiang Yang. Fate: An industrial grade platform for collaborative learning with data protection. The Journal of Machine Learning Research, 22(1):10320–10325, 2021.
- Li et al. [2023] Qinbin Li, Zhaomin Wu, Yanzheng Cai, Yuxuan Han, Ching Man Yung, Tianyuan Fu, and Bingsheng He. Fedtree: A federated learning system for trees. In Proceedings of Machine Learning and Systems, 2023.
- Xie et al. [2023] Yuexiang Xie, Zhen Wang, Dawei Gao, Daoyuan Chen, Liuyi Yao, Weirui Kuang, Yaliang Li, Bolin Ding, and Jingren Zhou. Federatedscope: A flexible federated learning platform for heterogeneity. Proceedings of the VLDB Endowment, 16(5):1059–1072, 2023.
- Pinkas et al. [2014] Benny Pinkas, Thomas Schneider, and Michael Zohner. Faster private set intersection based on OT extension. In 23rd USENIX Security Symposium (USENIX Security 14), pages 797–812, 2014.
- Pinkas et al. [2018] Benny Pinkas, Thomas Schneider, and Michael Zohner. Scalable private set intersection based on ot extension. ACM Transactions on Privacy and Security (TOPS), 21(2):1–35, 2018.
- Dong et al. [2013] Changyu Dong, Liqun Chen, and Zikai Wen. When private set intersection meets big data: an efficient and scalable protocol. In Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security, pages 789–800, 2013.
- Chen et al. [2017] Hao Chen, Kim Laine, and Peter Rindal. Fast private set intersection from homomorphic encryption. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pages 1243–1255, 2017.
- Quinlan [1986] J. Ross Quinlan. Induction of decision trees. Machine learning, 1:81–106, 1986.
- Quinlan [2014] J Ross Quinlan. C4.5: Programs for machine learning. 2014.
- Breiman [2017] Leo Breiman. Classification and regression trees. 2017.
- Maimon and Rokach [2014] Oded Z Maimon and Lior Rokach. Data mining with decision trees: theory and applications, volume 81. 2014.
- Song and Ying [2015] Yan-Yan Song and LU Ying. Decision tree methods: applications for classification and prediction. Shanghai archives of psychiatry, 27(2):130, 2015.
- Safavian and Landgrebe [1991] S Rasoul Safavian and David Landgrebe. A survey of decision tree classifier methodology. IEEE transactions on systems, man, and cybernetics, 21(3):660–674, 1991.
- Rokach and Maimon [2005] Lior Rokach and Oded Maimon. Top-down induction of decision trees classifiers-a survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 35(4):476–487, 2005.
- Sharma et al. [2016] Himani Sharma, Sunil Kumar, et al. A survey on decision tree algorithms of classification in data mining. International Journal of Science and Research (IJSR), 5(4):2094–2097, 2016.
- Breiman [2001] Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
- Friedman [2001] Jerome H Friedman. Greedy function approximation: a gradient boosting machine. Annals of statistics, pages 1189–1232, 2001.
- Si et al. [2017] Si Si, Huan Zhang, S Sathiya Keerthi, Dhruv Mahajan, Inderjit S Dhillon, and Cho-Jui Hsieh. Gradient boosted decision trees for high dimensional sparse output. In International conference on machine learning, pages 3182–3190, 2017.
- Kasiviswanathan et al. [2011] Shiva Prasad Kasiviswanathan, Homin K Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. What can we learn privately? SIAM Journal on Computing, 40(3):793–826, 2011.
- Yang et al. [2020] Mengmeng Yang, Lingjuan Lyu, Jun Zhao, Tianqing Zhu, and Kwok-Yan Lam. Local differential privacy and its applications: A comprehensive survey. arXiv preprint arXiv:2008.03686, 2020.
- Arachchige et al. [2019] Pathum Chamikara Mahawaga Arachchige, Peter Bertok, Ibrahim Khalil, Dongxi Liu, Seyit Camtepe, and Mohammed Atiquzzaman. Local differential privacy for deep learning. IEEE Internet of Things Journal, 7(7):5827–5842, 2019.
- Cormode et al. [2018] Graham Cormode, Somesh Jha, Tejas Kulkarni, Ninghui Li, Divesh Srivastava, and Tianhao Wang. Privacy at scale: Local differential privacy in practice. In Proceedings of the 2018 International Conference on Management of Data, pages 1655–1658, 2018.
- Alvim et al. [2018] Mário Alvim, Konstantinos Chatzikokolakis, Catuscia Palamidessi, and Anna Pazii. Local differential privacy on metric spaces: optimizing the trade-off with utility. In 2018 IEEE 31st Computer Security Foundations Symposium (CSF), pages 262–267, 2018.
- Chatzikokolakis et al. [2013] Konstantinos Chatzikokolakis, Miguel E Andrés, Nicolás Emilio Bordenabe, and Catuscia Palamidessi. Broadening the scope of differential privacy using metrics. In International Symposium on Privacy Enhancing Technologies Symposium, pages 82–102, 2013.
- He et al. [2014] Xi He, Ashwin Machanavajjhala, and Bolin Ding. Blowfish privacy: Tuning privacy-utility trade-offs using policies. In Proceedings of the 2014 ACM SIGMOD international conference on Management of data, pages 1447–1458, 2014.
- Mohassel and Rindal [2018] Payman Mohassel and Peter Rindal. Aby3: A mixed protocol framework for machine learning. In Proceedings of the 2018 ACM SIGSAC conference on computer and communications security, pages 35–52, 2018.
- Garfinkel et al. [2003] Tal Garfinkel, Ben Pfaff, Jim Chow, Mendel Rosenblum, and Dan Boneh. Terra: A virtual machine-based platform for trusted computing. In Proceedings of the nineteenth ACM symposium on Operating systems principles, pages 193–206, 2003.
- Sabt et al. [2015] Mohamed Sabt, Mohammed Achemlal, and Abdelmadjid Bouabdallah. Trusted execution environment: what it is, and what it is not. In 2015 IEEE Trustcom/BigDataSE/Ispa, volume 1, pages 57–64, 2015.
- Ke et al. [2017] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. Lightgbm: A highly efficient gradient boosting decision tree. Advances in neural information processing systems, 30, 2017.
- Dorogush et al. [2018] Anna Veronika Dorogush, Vasily Ershov, and Andrey Gulin. Catboost: gradient boosting with categorical features support. arXiv preprint arXiv:1810.11363, 2018.
- Yao et al. [2022] Houpu Yao, Jiazhou Wang, Peng Dai, Liefeng Bo, and Yanqing Chen. An efficient and robust system for vertically federated random forest. arXiv preprint arXiv:2201.10761, 2022.
- He et al. [2020b] Chaoyang He, Songze Li, Jinhyun So, Mi Zhang, Hongyi Wang, Xiaoyang Wang, Praneeth Vepakomma, Abhishek Singh, Hang Qiu, Li Shen, Peilin Zhao, Yan Kang, Yang Liu, Ramesh Raskar, Qiang Yang, Murali Annavaram, and Salman Avestimehr. Fedml: A research library and benchmark for federated machine learning. Advances in Neural Information Processing Systems, Best Paper Award at Federate Learning Workshop, 2020b.
- Lai et al. [2022] Fan Lai, Yinwei Dai, Sanjay S. Singapuram, Jiachen Liu, Xiangfeng Zhu, Harsha V. Madhyastha, and Mosharaf Chowdhury. FedScale: Benchmarking model and system performance of federated learning at scale. In International Conference on Machine Learning (ICML), 2022.
- Beutel et al. [2020] Daniel J Beutel, Taner Topal, Akhil Mathur, Xinchi Qiu, Javier Fernandez-Marques, Yan Gao, Lorenzo Sani, Hei Li Kwing, Titouan Parcollet, Pedro PB de Gusmão, and Nicholas D Lane. Flower: A friendly federated learning research framework. arXiv preprint arXiv:2007.14390, 2020.
- Wang et al. [2022] Zhen Wang, Weirui Kuang, Yuexiang Xie, Liuyi Yao, Yaliang Li, Bolin Ding, and Jingren Zhou. Federatedscope-gnn: Towards a unified, comprehensive and efficient package for federated graph learning. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, page 4110–4120, 2022.
- Kuang et al. [2024] Weirui Kuang, Bingchen Qian, Zitao Li, Daoyuan Chen, Dawei Gao, Xuchen Pan, Yuexiang Xie, Yaliang Li, Bolin Ding, and Jingren Zhou. Federatedscope-llm: A comprehensive package for fine-tuning large language models in federated learning. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, page 5260–5271, 2024.
- Lim et al. [2020] Wei Yang Bryan Lim, Nguyen Cong Luong, Dinh Thai Hoang, Yutao Jiao, Ying-Chang Liang, Qiang Yang, Dusit Niyato, and Chunyan Miao. Federated learning in mobile edge networks: A comprehensive survey. IEEE Communications Surveys & Tutorials, 22(3):2031–2063, 2020.
- Hardy et al. [2017] Stephen Hardy, Wilko Henecka, Hamish Ivey-Law, Richard Nock, Giorgio Patrini, Guillaume Smith, and Brian Thorne. Private federated learning on vertically partitioned data via entity resolution and additively homomorphic encryption. arXiv preprint arXiv:1711.10677, 2017.
- Zhao et al. [2022] Jiaqi Zhao, Hui Zhu, Wei Xu, Fengwei Wang, Rongxing Lu, and Hui Li. Sgboost: An efficient and privacy-preserving vertical federated tree boosting framework. IEEE Transactions on Information Forensics and Security, 18:1022–1036, 2022.