
SCMPPI: Supervised Contrastive Multimodal Framework for Predicting Protein-Protein Interactions

Shengrui Xu
Cuiying Honors College
Lanzhou University
[email protected]
   Tianchi Lu
Department of Computer Science
City University of Hong Kong
[email protected]
Corresponding author. Email: [email protected]
   Zikun Wang
School of Mathematics and Statistics
Lanzhou University
[email protected]
   Jixiu Zhai
School of Mathematics and Statistics
Lanzhou University
[email protected]
   Jingwan Wang
Department of Computer Science
City University of Hong Kong
[email protected]
Abstract

Protein-Protein Interaction (PPI) prediction is a key task in uncovering cellular functional networks and disease mechanisms. However, traditional experimental methods are time-consuming and costly, and existing computational models face challenges in cross-modal feature fusion, robustness, and false-negative suppression. In this paper, we propose SCMPPI, a novel supervised contrastive multimodal framework for PPI prediction. By integrating protein sequence features (AAC, DPC, CKSAAP-ESMC) with PPI network topology information (Node2Vec graph embeddings) and combining them with an improved supervised contrastive learning strategy, SCMPPI significantly enhances PPI prediction performance. For the PPI task, SCMPPI introduces a negative sample filtering mechanism and modifies the contrastive loss function, effectively optimizing the multimodal features. Experiments on eight benchmark datasets, including yeast, human, and H.pylori, show that SCMPPI outperforms existing state-of-the-art methods (such as DF-PPI and TAGPPI) on key metrics such as accuracy (98.01%) and AUC (99.62%), and demonstrates strong generalization in cross-species prediction (AUC > 99% on multi-species datasets). Furthermore, SCMPPI has been successfully applied to the CD9 network, the Wnt-related pathway, and cancer-specific networks, providing a reliable tool for disease target discovery. The framework also offers a new paradigm for fusing multimodal biological information and co-optimizing it with contrastive learning in related prediction tasks.

1 Introduction

Protein-Protein Interactions (PPIs) are central to many biological processes within cells, including signal transduction, gene expression regulation, metabolic regulation, and the cell cycle [1, 2, 3]. Accurately predicting PPIs is of great significance for understanding cellular functions, revealing disease mechanisms, and identifying potential drug targets [4]. However, traditional experimental methods, such as yeast two-hybrid screening [5] and tandem affinity purification [6], are time-consuming, labor-intensive, and costly. Computational methods have therefore become a popular direction for PPI prediction in recent years, attracting considerable attention [7, 8, 9].

With the rapid development of bioinformatics, multimodal models based on multiple data types, such as sequence, structural, and network information, have gradually become the mainstream approach for PPI prediction. These methods significantly improve prediction accuracy by integrating information from different sources. For example, DF-PPI [10] enhances sequence feature capture and integrates handcrafted features and semantic embeddings, utilizing a dynamic weighted fusion strategy to improve model stability. TAGPPI [11], on the other hand, combines AlphaFold-predicted structural features with sequence information and improves prediction accuracy using a graph neural network model.

Contrastive learning, as an effective self-supervised learning method, has gained increasing attention in bioinformatics in recent years [12, 13], particularly for handling high-dimensional, unstructured biological data (protein sequences and structural information) and improving model generalization [14, 15]. However, to date, no studies have applied supervised contrastive learning to the PPI prediction task.

Existing multimodal PPI models still face several challenges. First, some methods depend on specific feature extraction pipelines, and the scarcity of high-precision protein structural data limits their applicability. Second, their robustness and generalization still need further improvement. Finally, the problem of false-negative predictions persists.

To address these issues, this study proposes a novel supervised contrastive multimodal framework (SCMPPI), aimed at further improving PPI prediction performance by combining multimodal features with supervised contrastive learning. We compute sequence embeddings (AAC, DPC, CKSAAP-ESMC) and combine them with the Node2Vec graph embedding method to enable joint representation learning of sequence features and network information. For the PPI task, we improve the supervised contrastive learning method by adding a negative sample filtering mechanism and modifying the contrastive loss function to enhance the model’s ability to embed both sequence and structural information, thus reducing the false-negative rate. Experimental validation on multiple benchmark datasets shows that SCMPPI outperforms existing methods in prediction accuracy, robustness, and generalization ability.

Our main contributions are as follows:

  • We propose a deep learning framework, SCMPPI, that integrates protein sequence features and PPI network information, combined with an improved supervised contrastive learning strategy.

  • SCMPPI not only achieves state-of-the-art performance on multiple benchmark datasets but also exhibits good robustness, generalizability, and a low false-negative rate.

  • The multimodal collaborative mechanism achieved through contrastive learning provides a more versatile framework for interaction prediction, which is expected to advance the development of biomedical research.

2 Related Works

The related prior works include studies on multimodal models for PPI prediction and contrastive learning in biological tasks.

Multimodal Models for PPI

In the field of protein-protein interaction (PPI) prediction, the exploration of multimodal models has driven the field toward greater technological complexity and data diversity. By integrating sequence, structural, and network information, these methods significantly enhance prediction accuracy. Models such as DF-PPI (Deep Fusion-PPI) and TAGPPI have gained attention for their advantages in combining multiple sources of information. TAGPPI (Bosheng Song et al., 2022) [11] innovatively combines AlphaFold-predicted structural features with sequence information to construct a graph neural network model. However, its performance is limited by the accuracy of the structural predictions, and its robustness in low-quality data scenarios still requires further validation. DF-PPI (Hoai-Nhan Tran et al., 2024) [10] enhances sequence feature capture through an improved APAACplus descriptor and integrates handcrafted features with semantic embeddings using a dynamic weighted fusion strategy. It also improves model stability through batch normalization. However, its reliance on GPU resources and the redundancy of handcrafted features may limit its large-scale application.

In this study, we use the ESMC protein language model[16] to extract sequence embeddings, combining them with Node2Vec graph embedding methods[17] to achieve joint representation learning of sequence features and network information. In parallel with research on multimodal methods in the bioinformatics PPI field, the superiority of contrastive learning has been recognized, and attempts are being made to apply contrastive learning to biological information tasks.

Supervised Contrastive Learning in Biological Tasks

In the field of bioinformatics, supervised contrastive learning has emerged as an effective feature embedding method, demonstrating unique advantages across a variety of tasks. By optimizing the feature distributions between positive and negative samples while incorporating positive-sample label information, contrastive learning enhances the model’s ability to distinguish between similar and dissimilar samples. This makes it particularly suitable for handling high-dimensional, unstructured biological data such as protein sequences and structural information. Consequently, contrastive learning has been widely applied to improve models’ ability to embed sequence and structural information. For instance, the EPACT framework (Yumeng Zhang et al., 2024) [13] harnesses contrastive learning in combination with the pre-trained protein language model TAPE, significantly improving the prediction of T cell receptor-antigen binding specificity. However, the framework’s negative sampling strategy may introduce unavoidable biases, leading to higher false-negative probabilities [42]. Additionally, TPpred-SC (Ke Yan et al., 2023) [12] extends contrastive learning to multilabel classification tasks, predicting peptides with multiple functional attributes. These methods not only enhance prediction accuracy but also improve the model’s ability to handle long sequences and complex networks.

In this study, we apply supervised contrastive learning [15] to the PPI task. By pulling positive sample pairs closer, pushing negative samples farther apart, and exploiting label information, we improve the multimodal model’s ability to embed sequence and structural information. We also incorporate a negative sample filtering mechanism to reduce the likelihood of false negatives, and modify the SupCon loss $\mathcal{L}^{Sup}_{out}$ into $\mathcal{L}^{P\text{-}Sup}$ for PPI.

3 Approach

This section delineates the architecture of the SCMPPI framework, which is designed for efficient, high-quality PPI prediction. Section 3.1 expounds the structure and design principles of the framework, Section 3.2 gives an overview of the multimodal fusion techniques, and Sections 3.3 and 3.4 describe its core contrastive learning module and classifier, respectively.

3.1 Architecture of SCMPPI

Figure 1: The Architecture of SCMPPI

We employed a divide-and-conquer paradigm to develop the architecture of SCMPPI; the detailed model architecture is presented in Figure 1. The protein multimodal feature fusion encoder consists primarily of a sequence encoder and Node2vec, which respectively extract sequence features of the protein and embedding features from the protein interaction graph. These two features are fused through an MLP to form the representation of the protein. The embeddings of protein pairs are then fused to provide model predictions for downstream tasks, or are fed into the contrastive co-embedding module. In summary, protein sequences are sampled and fed into the multimodal protein encoder to obtain protein embeddings, which are then applied to various downstream prediction tasks.

3.2 Multimodal Fusion Encoder

Overall, the encoder extracts protein interaction graph information through Node2vec and protein sequence information through the sequence encoder, and then integrates them into a fixed-length protein representation using an MLP.

Node2vec

We use the NW-align tool [18] to retrieve similar proteins from the STRING database [19], finding an interaction network of similar proteins. Then, we remove edges related to the original protein to avoid data leakage [20]. Next, Node2vec captures the graph embeddings of these similar proteins, which are used as the original protein’s graph embeddings.
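As a concrete illustration of this preprocessing step, the sketch below builds the similar-protein subgraph and strips every edge incident to the query protein before embedding. It is a minimal sketch assuming a networkx graph of the STRING network; `string_graph` and `similar_nodes` are hypothetical placeholders rather than names from our implementation.

```python
import networkx as nx

def leakage_free_subgraph(query, string_graph, similar_nodes):
    """Induced subgraph over the similar proteins, with every edge
    incident to the query protein removed so that its own known
    interactions cannot leak into the Node2vec embedding."""
    sub = string_graph.subgraph(set(similar_nodes) | {query}).copy()
    if query in sub:
        sub.remove_edges_from(list(sub.edges(query)))
    return sub
```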

In the PPI task, we focus on two types of similarity: vertex homogeneity and structural equivalence [21, 22, 23]. Vertex homogeneity suggests that proteins with similar features are more likely to interact, while structural equivalence indicates that proteins with similar roles in the network tend to interact. These similarities help predict protein interactions [24].

To capture these attributes, we generate the protein node context using biased random walks. The transition probability from protein node t𝑡titalic_t to protein node x𝑥xitalic_x is defined as:

$$\pi_{tx}=\alpha_{pq}(t,x)\cdot w_{tx},$$

where $w_{tx}$ is the edge weight and $\alpha_{pq}(t,x)$ is the search bias. This allows Node2vec to explore both local and global graph structures and learn meaningful protein node embeddings.
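To make the biased walk concrete, the following sketch implements a single transition step with the standard node2vec search bias: $\alpha=1/p$ for returning to the previous node, $1$ for moving to a neighbor of it, and $1/q$ otherwise. It assumes a networkx graph and is a simplified sketch, not our full implementation.

```python
import random
import networkx as nx  # G below is assumed to be a networkx.Graph

def node2vec_step(G, prev, cur, p=1.0, q=1.0):
    """Sample the next node of a node2vec walk from `cur`, having
    arrived from `prev`; implements pi_tx = alpha_pq(t, x) * w_tx.
    On the first step, pass prev=None (all moves then share one bias)."""
    candidates, weights = [], []
    for x in G.neighbors(cur):
        w = G[cur][x].get("weight", 1.0)
        if x == prev:                  # d(prev, x) = 0: return step
            alpha = 1.0 / p
        elif G.has_edge(prev, x):      # d(prev, x) = 1: BFS-like step
            alpha = 1.0
        else:                          # d(prev, x) = 2: DFS-like step
            alpha = 1.0 / q
        candidates.append(x)
        weights.append(alpha * w)      # unnormalized pi_tx
    return random.choices(candidates, weights=weights, k=1)[0]
```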

Sequence Encoder

The proposed sequence encoder transforms protein sequences into high-dimensional vector representations for downstream tasks such as function prediction, classification, or structural analysis. It integrates three feature extraction strategies: AAC, DPC, and CKSAAP-ESMC, capturing a range of sequence features, from global composition to local structural patterns and long-range dependencies.

(i) AAC (Amino Acid Composition) [25] calculates the frequency of each amino acid in the sequence, producing a 20-dimensional vector:

$$\text{AAC}(aa_i)=\frac{\text{count of amino acid }aa_i}{L},$$

where $L$ is the sequence length. This representation highlights the overall distribution of amino acids.

(ii) DPC (Dipeptide Composition) [25] extends this by considering adjacent amino acid pairs, resulting in a 400-dimensional vector. The formula is:

$$\text{DPC}(aa_i,aa_{i+1})=\frac{\text{count of dipeptide }(aa_i,aa_{i+1})}{L-1},$$

which captures short-range interactions within the protein sequence.
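A minimal sketch of the two descriptors defined above, assuming the standard 20-letter alphabet and sequences with non-standard residues already filtered out:

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac(seq):
    """20-dim amino acid composition: count(aa_i) / L."""
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]

def dpc(seq):
    """400-dim dipeptide composition: count(aa_i, aa_{i+1}) / (L - 1)."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    return [pairs.count(a + b) / (len(seq) - 1)
            for a, b in product(AMINO_ACIDS, repeat=2)]
```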

(iii) CKSAAP (Composition of k-Spaced Amino Acid Pairs) [26] extends the dipeptide idea to pairs separated by a gap of $k$ residues, counting each of the 400 possible amino acid pairs at gap $k$ and normalizing by the total number of such pairs, resulting in a [20, 20, 1] embedding per gap. The frequency is calculated as:

$$\text{CKSAAP}(k)=\frac{\text{count of }(aa_i,aa_j)\text{ pairs at distance }k}{\text{total pairs at distance }k},$$

where $k$ specifies the gap, ranging from $k=0$ to $k=4$. To further extract sequence information, CKSAAP is combined with the ESMC [16] embedding, which captures the embedding information of amino acid pairs at different intervals. This method iterates over each possible dipeptide pair in the sequence for each gap $k$, computes the average of their embedding vectors, and normalizes according to the gap $k$, ultimately generating a four-dimensional tensor containing embedding information of dipeptide pairs at various gaps.

Specifically, feeding a protein into the ESMC model produces amino-acid-level embeddings $(ae_1, ae_2, \dots, ae_L)$ with shape [L, 960]. The mean of amino acid pair embeddings replaces the frequency in the formula to produce:

$$\text{CKSAAP-ESMC}(k)=\frac{\text{mean of }(ae_i,ae_j)\text{ pairs at distance }k}{\text{sum of embeddings at distance }k},$$

This produces an embedding of size [20, 20, 960] for each gap, and traversing the distance $k$ from 0 to 3 results in a tensor of shape $[k+1, 20, 20, 960]$. This tensor can be reshaped for downstream tasks, e.g., to a 3D shape $[(k+1)\times 960, 20, 20]$ for convolution operations. In our work, with $k=3$, the final reshaped embeddings have a shape of [3840, 20, 20].

This integration utilizes the biophysical and evolutionary insights of the ESMC model, embedding the contextual information of amino acid pairs up to a gap of $k=3$, reflecting residue correlations and highlighting local interactions within the protein.
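The sketch below shows one plausible reading of this pooling step: per-residue ESMC embeddings are averaged over residue pairs at each gap, accumulated into a [20, 20, 960] grid per gap, and reshaped for the convolutional layers. The function name and the skipping of non-standard residues are our own illustrative choices.

```python
import numpy as np

AA_IDX = {a: i for i, a in enumerate("ACDEFGHIKLMNPQRSTVWY")}

def cksaap_esmc(seq, residue_emb, max_gap=3):
    """Pool per-residue ESMC embeddings over k-spaced amino acid pairs.

    seq:         protein sequence of length L
    residue_emb: ndarray of shape [L, 960] from ESMC
    Returns an ndarray of shape [(max_gap + 1) * 960, 20, 20]: for each
    gap k and amino acid pair (a, b), the mean of the averaged
    embeddings of all residue pairs (i, i + k + 1) realizing (a, b).
    """
    L, d = residue_emb.shape
    out = np.zeros((max_gap + 1, 20, 20, d))
    cnt = np.zeros((max_gap + 1, 20, 20, 1))
    for k in range(max_gap + 1):
        for i in range(L - k - 1):
            a = AA_IDX.get(seq[i])
            b = AA_IDX.get(seq[i + k + 1])
            if a is None or b is None:        # skip non-standard residues
                continue
            out[k, a, b] += (residue_emb[i] + residue_emb[i + k + 1]) / 2.0
            cnt[k, a, b, 0] += 1
    out /= np.maximum(cnt, 1)                 # mean over pair occurrences
    return out.transpose(0, 3, 1, 2).reshape((max_gap + 1) * d, 20, 20)
```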

In brief, AAC provides the overall amino acid composition, DPC details local structure through adjacent pairs, and CKSAAP with ESMC embeddings captures long-range dependencies and deep contextual features. This fusion across multiple scales (composition, local structure, and global context) ensures a comprehensive and information-rich sequence representation, making it well suited for advanced bioinformatics challenges. For specific parameter choices, refer to subsection A.2.

3.3 Contrastive Learning Module

Contrastive Learning Loss

The contrastive learning loss of SCMPPI is based on the unsupervised contrastive learning method SimCLR [14]. The SimCLR loss can be formulated as:

$$\mathcal{L}^{Sim}=\sum_{i\in I}\mathcal{L}^{Sim}_{i}=-\sum_{i\in I}\log\frac{\exp(z_{i}\cdot z_{j(i)}/\tau)}{\sum_{a\in A(i)}\exp(z_{i}\cdot z_{a}/\tau)},\tag{6}$$

where $i\in I=\{1,2,\dots,2N\}$ indexes the augmented samples in a training batch of size $N$, and $j(i)$ is the index of the other augmented view of the same source sample. $A(i)=I\setminus\{i\}$ is the set of indices excluding $i$, and $\tau$ is the temperature parameter, which controls the penalty strength on hard negative samples [27]: smaller temperatures impose stronger penalties on the hardest negatives, i.e., those most similar to the anchor. The sample indexed by $j(i)$ is the positive sample for anchor $i$, while all other samples are treated as negatives, so in SimCLR each anchor has exactly one positive and $2N-2$ negatives.

While SimCLR is typically used for pre-training on large unlabeled datasets, it ignores label information. SupCon [15] incorporates labels into the loss function, enabling effective use of labeled data. Its loss is formulated as:

$$\mathcal{L}^{Sup}_{out}=\sum_{i\in I}\mathcal{L}^{Sup}_{i}=\sum_{i\in I}\frac{-1}{|P(i)|}\sum_{p\in P(i)}\log\frac{\exp(z_{i}\cdot z_{p}/\tau)}{\sum_{a\in A(i)}\exp(z_{i}\cdot z_{a}/\tau)},\tag{7}$$

where $P(i)=\{p\in A(i)\mid y_{p}=y_{i}\}$ is the set of positive samples, i.e., samples sharing the label of $i$. Compared to SimCLR, SupCon expands the positive set for each anchor, effectively exploiting positive label information. However, it does not make effective use of negative-sample information: it treats all non-anchor samples of other classes as negatives, implicitly assuming the false-negative probability is low.

To better exploit negative-sample information and reduce the false-negative probability in this work, we modified SupCon to obtain the protein supervised contrastive learning loss ($\mathcal{L}^{P\text{-}Sup}$):

$$\mathcal{L}^{P\text{-}Sup}=\sum_{i\in I}\mathcal{L}^{P\text{-}Sup}_{i}=\sum_{i\in I}\frac{-1}{|P(i)|}\sum_{p\in P(i)}\log\frac{\exp(z_{i}\cdot z_{p}/\tau)}{\sum_{a\in A(i)}\exp(z_{i}\cdot z_{a}/\tau)\cdot I_{\{t_{ia}\leq\tau\}}},\tag{8}$$
$$t_{iq}=\frac{\sum_{j}z_{i,j}\,z_{q,j}}{\sqrt{\sum_{j}z_{i,j}^{2}}\cdot\sqrt{\sum_{j}z_{q,j}^{2}}+\epsilon},\tag{9}$$

where $t_{iq}$ is the Cos-score [28] between samples $i$ and $q$, measuring their similarity. The indicator term $I_{\{t_{ia}\leq\tau\}}$ filters negative samples to reduce the false-negative probability: a sample $a$ is accepted as a negative for anchor $i$ only if its Cos-score with $i$ falls below the threshold $\tau$.

When minimizing $\mathcal{L}^{P\text{-}Sup}$, samples with higher similarity scores are clustered more tightly in the feature space, while dissimilar samples are pushed farther apart.
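A hedged PyTorch sketch of Eqs. (8)-(9): a SupCon-style loss whose denominator drops suspected false negatives, i.e., candidate negatives whose Cos-score with the anchor exceeds a threshold. We assume the filter applies only to negatives (same-label positives are always kept) and expose the threshold as a separate hyperparameter `filter_thresh`, since the formula above overloads $\tau$ for both the temperature and the threshold.

```python
import torch
import torch.nn.functional as F

def p_sup_loss(z, labels, temperature=0.1, filter_thresh=0.8, eps=1e-8):
    """Sketch of the P-Sup loss: SupCon with false-negative filtering.

    z:      [B, D] projected embeddings; labels: [B] integer labels.
    Negatives whose cosine score with the anchor exceeds
    `filter_thresh` are removed from the denominator.
    """
    z = F.normalize(z, dim=1)
    sim = z @ z.t()                            # pairwise cos-scores t_iq
    B = z.size(0)
    self_mask = torch.eye(B, dtype=torch.bool, device=z.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    # denominator keeps positives plus negatives that pass the filter
    keep = (pos_mask | (sim <= filter_thresh)) & ~self_mask
    exp_logits = torch.exp(sim / temperature) * keep.float()
    log_prob = sim / temperature - torch.log(exp_logits.sum(1, keepdim=True) + eps)
    mean_log_prob_pos = (pos_mask.float() * log_prob).sum(1) / pos_mask.float().sum(1).clamp(min=1)
    return -mean_log_prob_pos.mean()           # average over anchors
```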

Co-embedding Space

The two protein representations generated by the multimodal fusion encoder are projected into a shared latent space by a projection head. In our contrastive learning approach, one protein is designated as the anchor, and the associated binding proteins are pulled closer to the anchor in the latent space, while "non-binding" (negative) proteins are pushed away. Given a protein, a set of binding (positive) proteins, and a set of decoy (negative) proteins along with their projections in the training batch, the cosine similarity between protein pairs is calculated.

Figure 2: Before Contrastive Learning
Figure 3: After Contrastive Learning
Figure 4: Comparison of results before and after contrastive learning

We use Uniform Manifold Approximation and Projection (UMAP) [29] to reduce the dimensionality of the projections, yielding protein embeddings in a two-dimensional space. In Figure 4, we randomly selected 15 pairs of binding (positive) proteins from the independent test set and projected their corresponding embeddings onto 2D UMAP plots, both before and after contrastive learning. Comparing the two plots shows that the paired embeddings lie closer together after contrastive learning, demonstrating its effectiveness for protein-protein interaction (PPI) prediction.
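The 2D projections in Figure 4 can be reproduced with the umap-learn package; a minimal sketch:

```python
import umap  # umap-learn package

def project_2d(embeddings, seed=42):
    """Reduce [N, D] protein embeddings to [N, 2] for visualization."""
    return umap.UMAP(n_components=2, random_state=seed).fit_transform(embeddings)
```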

3.4 Prediction Module

In a training batch of $N$ protein sequence pairs $(S_i, S_j)$, each pair generates latent vectors $\mathbf{F}_i$ and $\mathbf{F}_j$ through the protein multimodal feature fusion encoder. Concatenating them yields the classification embedding $\mathbf{F}_{ij}=[\mathbf{F}_i;\mathbf{F}_j]$, which is passed through an MLP to produce the classification result:

$$\hat{y}_{ij}=\text{MLP}(\mathbf{F}_{ij}),$$

where $\hat{y}_{ij}$ is the predicted interaction probability for $(S_i, S_j)$.

The binary cross-entropy loss for a batch of $N$ inputs $(S_i, S_j, y_{ij})$ is:

$$\mathcal{L}_{BCE}=-\sum_{(i,j)}\Big(y_{ij}\log(\hat{y}_{ij})+(1-y_{ij})\log(1-\hat{y}_{ij})\Big),$$

where the sum runs over the $N$ pairs in the batch,

and the total loss function is:

$$\mathcal{L}=\mathcal{L}_{BCE}+\kappa\,\mathcal{L}^{P\text{-}Sup},$$

where $\kappa$ controls the balance between the contrastive loss $\mathcal{L}^{P\text{-}Sup}$ and the classification loss $\mathcal{L}_{BCE}$, optimizing the model’s generalization and robustness. For parameter details, refer to subsection A.2.
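Putting the prediction module together, below is a minimal PyTorch sketch of the pair classifier and the combined objective. It reuses the `p_sup_loss` sketch from Section 3.3; the layer sizes are illustrative assumptions, not the paper’s exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PairClassifier(nn.Module):
    """Concatenate the fused embeddings of a protein pair and score it."""
    def __init__(self, dim=128):                 # illustrative width
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(),
            nn.Linear(dim, 1), nn.Sigmoid(),
        )

    def forward(self, f_i, f_j):
        f_ij = torch.cat([f_i, f_j], dim=-1)     # F_ij = [F_i; F_j]
        return self.mlp(f_ij).squeeze(-1)        # interaction probability

def total_loss(y_hat, y, z, labels, kappa=0.6):  # kappa in {0.3, 0.6, 1}
    """L = L_BCE + kappa * L^{P-Sup}."""
    return F.binary_cross_entropy(y_hat, y) + kappa * p_sup_loss(z, labels)
```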

4 Experiments

In this section, we describe the experimental setup and analyze the results.

4.1 Experimental Settings

Baseline Algorithms

In our experiments, we denote our algorithm as OURS and compare it with traditional machine learning classifiers and existing robust methods for PPI prediction. The traditional classifiers considered are Decision Tree (DT) and Random Forest (RF). The existing robust methods for PPI prediction include DeepFE-PPI [7], PIPR [8], KSGPPI [30], TAGPPI [11], DF-PPI [10], DeepTrio [31], DeeP-AAC [9], DeeP-CNN [9], DFC [32, 33], DCONV [34, 33], and HNSPPI [33]. In this experiment, we use five-fold cross-validation, which has been widely adopted in previous studies [35, 36, 37]. Additionally, the above models were run with the best configurations reported in their respective works.

Datasets

We conducted experiments using eight benchmark PPI datasets. The Yeast PPI dataset [38, 39, 40] contains 2,497 proteins and 11,188 protein-protein interactions (PPIs), with a balanced sample distribution and complete sequences. The Human PPI dataset, constructed by Pan et al. [41], includes 3,899 positive samples and 949 negative samples, representing 2,502 proteins; to balance the samples, 2,950 new negative protein pairs were generated using the method of [33]. The H.pylori dataset was originally derived from Rain’s work [42] and consists of 1,458 protein pairs involving 1,313 different proteins. The Yeast (PIPR-cut) dataset [30] is derived from the Yeast (PIPR) dataset, comprising 4,487 positive samples and 4,487 negative samples, totaling 8,974 PPIs and involving 2,039 proteins. The multi-species dataset [43] covers Escherichia coli, Caenorhabditis elegans, and Drosophila melanogaster, with 1,834, 2,637, and 7,058 proteins, respectively; the sample sizes are 6,954, 4,013, and 21,975 negative samples, each matched by an equal number of positive pairs. Finally, the CD9 network, Wnt-related pathway, and Cancer-specific network datasets consist of 16, 96, and 108 samples, respectively [10]. These datasets provide a rich and diverse set of resources for model evaluation. For detailed dataset information, refer to subsection A.1.

Parameter Settings

The model utilizes convolutional layers (two Conv2D layers and a pooling layer) for feature extraction, followed by fully connected layers for sequence representation. It is trained using the AdamW optimizer [44] with a learning rate of 0.001 and a batch size of 32. The loss function comprises binary cross-entropy and contrastive loss, with the contrastive coefficient set to 0.3, 0.6, or 1. Early stopping is employed with a patience of 3 or 5 epochs, depending on the training phase. Model performance is assessed by the Matthews correlation coefficient (MCC), and the best-performing model is retained. Training is conducted over 30 epochs to ensure robust sequence representations that facilitate accurate protein-protein interaction prediction. For detailed parameter settings, refer to subsection A.2.
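The training procedure described above can be sketched as follows; `model`, `train_loader`, `compute_total_loss`, and `evaluate_mcc` are hypothetical placeholders for the components described elsewhere in the paper.

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

best_mcc, patience, bad_epochs = -1.0, 3, 0      # patience of 3 (or 5)
for epoch in range(30):
    model.train()
    for batch in train_loader:
        optimizer.zero_grad()
        loss = compute_total_loss(model, batch)  # BCE + kappa * P-Sup
        loss.backward()
        optimizer.step()
    mcc = evaluate_mcc(model)                    # model selection on MCC
    if mcc > best_mcc:
        best_mcc, bad_epochs = mcc, 0
        torch.save(model.state_dict(), "best_scmppi.pt")
    else:
        bad_epochs += 1
        if bad_epochs >= patience:               # early stopping
            break
```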

Evaluation Configuration

All experiments were conducted on an NVIDIA RTX 4070 GPU and an AMD Ryzen 9 7945HX CPU (2.50 GHz), using Python 3.12.8 and PyTorch 2.6.0.

4.2 Results Analysis

4.2.1 Performance of SCMPPI on intraspecies dataset

Table 1: Five-fold cross-validation results of the SCMPPI on the intraspecies dataset
Dataset Acc(%) Pre(%) Sen(%) F1(%) AUC(%) AUPRC(%)
Yeast 98.01 97.70 98.30 98.09 99.62 99.68
Human 83.02 81.84 84.95 83.34 90.58 90.84
H.pylori 87.67 84.12 93.28 88.45 93.99 93.13
PIPR-cut 88.54 88.12 89.12 88.61 94.46 92.42

Table 1 shows the five-fold cross-validation results of the SCMPPI model on the intraspecies datasets. As the table shows, SCMPPI performs excellently and stably across datasets: accuracy (Acc) and sensitivity (Sen) on the Yeast dataset are both above 98%. AUC and AUPRC measure, respectively, the model’s ability to distinguish positive from negative samples across thresholds and its performance on imbalanced data. SCMPPI’s AUC and AUPRC values on the four intraspecies datasets are all above 90%, demonstrating high discriminative power and robustness to class imbalance, and providing a reliable solution for protein-protein interaction prediction.

4.2.2 Comparisons with existing algorithms

Table 2: Performance comparison on the Yeast dataset (contrastive loss coefficient = 0.6).
Model Acc(%) Pre(%) Sen(%) F1(%) MCC AUC(%) AUPRC(%)
Ours 98.08 97.77 98.48 98.09 0.962 99.67 99.55
DF-PPI 96.34 97.56 95.05 96.29 0.927 98.87 99.16
TAGPPI 97.81 98.10 98.26 97.80 0.956 97.74 NA
KSGPPI 97.64 97.44 97.85 97.62 0.956 97.25 97.99
PIPR 97.09 97.00 97.17 97.09 0.942 - -
DeepFE-PPI 94.78 96.45 92.99 94.69 0.896 98.83 98.53
RF 93.62 96.75 90.26 93.40 0.874 96.52 97.27
DF 87.78 88.47 86.86 87.65 0.756 87.78 83.42

To validate the superior performance, we compared the proposed model, SCMPPI, with existing robust methods for PPI prediction on all four species-specific benchmark datasets. As shown in Tables 2, 8, 9, and 10, SCMPPI delivers the best performance across the Yeast, Human, H. pylori, and PIPR-cut datasets.

The Yeast dataset is a widely recognized benchmark in the PPI field and is commonly used to evaluate advanced methods [38, 39, 40]. We compared our SCMPPI model with the following representative methods on the Yeast dataset: DF-PPI, TAGPPI, KSGPPI, PIPR, DeepFE-PPI, RF, and DF. To ensure a fair comparison, we kept the structural parameters and hyperparameters of these methods as reported, and evaluated all methods using five-fold cross-validation. As seen in Table 2 and Figure 7, SCMPPI achieved the best performance on the Yeast dataset across almost all evaluation metrics. By integrating sequence and network features with the multimodal encoder and learning latent-space distances through contrastive learning, SCMPPI improved accuracy, sensitivity (Sen), and Matthews correlation coefficient (MCC) by 3.48%, 5.90%, and 7.37%, respectively, over DeepFE-PPI. Compared to KSGPPI, our model improved AUC by 2.49% and AUPRC by 1.59%. Notably, SCMPPI achieved the highest sensitivity (98.48%) and F1 score (98.09%) among all methods. F1 score and sensitivity are crucial metrics for evaluating a model’s ability to identify positive instances: high sensitivity means the model recognizes most positive samples, so SCMPPI is less likely to mispredict interacting protein pairs as non-interacting, yielding fewer false negatives (FN) than other methods. For precision (Pre), our model ranked second, 0.34% below the best method, TAGPPI. However, on harmonic measures such as AUC and MCC, which are essential for a binary classifier, our model outperformed TAGPPI by 1.97% and 0.63%, respectively. This demonstrates that our model strikes a harmonious balance across all evaluation metrics, achieving high accuracy and reliability.

In evaluating model performance, the stability of predictions is also an important factor. As shown in Figure 7, our model demonstrates high stability across all evaluation metrics. Specifically, the DT model shows relatively long boxes and whiskers in Accuracy (Acc), Sensitivity (Sen), AUC, F1, AUPRC, and MCC, indicating considerable fluctuations in the metrics and poor stability. In addition to the traditional models, DeepFE-PPI also exhibits poor stability in Precision (Pre), Specificity (Spec), F1, and MCC. These findings highlight the reliability of our fusion feature strategy in PPI prediction.

4.2.3 Robustness and Generalization Study

Table 3: Trained on the Yeast dataset and tested on the H.pylori or Human dataset
Test dataset Model Acc(%) Pre(%) F1(%) AUC(%) AUPRC(%) MCC
H.pylori OURS 58.41 59.04 58.74 60.04 57.04 0.168
KSGPPI 56.60 58.20 54.30 59.32 56.63 0.134
DeepFE-PPI 48.40 47.71 27.45 48.08 48.74 -0.030
Human OURS 55.26 55.36 54.82 57.23 55.79 0.1052
KSGPPI 53.59 52.41 62.70 55.57 55.01 0.0823
DeepFE-PPI 49.04 46.11 18.30 46.98 47.66 -0.0292

For protein-protein interaction (PPI) prediction, robustness and generalization are crucial because the model needs to accurately predict PPIs across different datasets and data distributions. Therefore, we evaluated the robustness and generalization of our model, and the results show that SCMPPI performs exceptionally well in both aspects.

Table 11 shows the performance of several models on the multi-species dataset (including C. elegans, D. melanogaster, and E. coli). SCMPPI outperforms other comparative models in multiple metrics, including accuracy, precision, and F1 score, demonstrating its stronger adaptability and robustness in predicting PPIs across different species. The model effectively overcomes data distribution differences, showing greater resistance to data noise and outliers while maintaining more stable prediction performance.

Regarding generalization, SCMPPI performs excellently, as shown in Table 3: when trained on the Yeast dataset and tested on the H.pylori and Human datasets, it still achieves the highest prediction performance. This indicates that SCMPPI not only achieves excellent results on specific datasets but also transfers and adapts effectively to new data distributions, further validating its wide applicability and strong generalization ability.

4.2.4 Ablation Study

Through ablation experiments, we assessed the contributions of the sequence module, graph module, and contrastive learning module to protein-protein interaction (PPI) prediction. The results on multiple datasets (Yeast, Human, H. pylori, and PIPR-cut), shown in Table 12, indicate that combining the sequence (Seq), graph (Graph), and contrastive learning (Cl) modules significantly enhances performance. On every dataset analyzed, the model performed best when all three modules were used together, achieving the best or near-best results across metrics including accuracy (Acc), Matthews correlation coefficient (MCC), sensitivity (Sen), area under the ROC curve (AUC), and area under the precision-recall curve (AUPRC). Removing any of the modules (Seq, Graph, or Cl) degrades performance significantly, highlighting the indispensable contribution of each module to the overall prediction capability.

Overall, the ablation experiments validate the complementary nature of the three modules, demonstrating that they not only improve the model’s robustness, generalization ability, and accuracy but also ensure the efficiency and accuracy of the PPI prediction task.

4.2.5 Testing on PPI Network Datasets

In this study, we used the SCMPPI model to predict protein-protein interactions (PPIs) in the CD9, Wnt-related pathway, and Cancer-specific networks, successfully predicting all the relevant interactions. Through these predictions, we identified several key proteins that are biologically significant and show potential in clinical applications[45, 46].

Firstly, CD9 is a tetraspanin transmembrane protein widely distributed across various cell types, involved in processes such as cell adhesion, migration, and fusion [47]. In tumor metastasis, CD9 plays a dual role. In some cancers, such as prostate cancer, high expression of CD9 is associated with lower metastatic ability, exhibiting a tumor-suppressive effect. However, in other types of cancer, such as pancreatic cancer, high CD9 expression correlates with poor prognosis. Therefore, the role of CD9 in tumor prognosis and treatment should be analyzed based on the specific cancer type.

The Wnt-related pathway plays a crucial role in embryonic development, cell proliferation, differentiation, migration, and tissue homeostasis. Abnormal activation of this pathway is closely associated with the onset and progression of various cancers [48]. AXIN1 and WNT9A are core proteins in this pathway. AXIN1 inhibits the activity of the Wnt pathway by forming a complex that degrades β-catenin, while WNT9A, as a ligand protein, activates the pathway. Regulating the activity of the Wnt signaling pathway holds potential clinical value for cancer prevention and treatment.

Finally, in the Cancer-specific network, CDK1 and TP53 are two key proteins. CDK1 regulates cell proliferation during the G2/M phase of the cell cycle, and its abnormal activation can promote tumor cell proliferation [49]. TP53, as a tumor suppressor gene, is responsible for DNA repair and apoptosis, and TP53 mutations are commonly found in various cancers [50]. Studying these proteins and their interactions can provide an important theoretical foundation for cancer therapy.

Through the accurate predictions of the SCMPPI model, we have not only validated the important roles of these core proteins in various biological processes but also opened up new research directions for the early diagnosis and targeted treatment of cancer.

5 Limitation and Conclusions

In this study, we introduced the SCMPPI framework, which offers an innovative solution for protein-protein interaction (PPI) prediction by combining multimodal features with supervised contrastive learning. The multimodal collaborative mechanism achieved through contrastive learning provides a more versatile solution for interaction analysis. The method uses a supervised contrastive learning objective to align sequence semantics (based on ESMC embeddings) and structural topology (from known graphs), effectively advancing multimodal representation learning and significantly improving feature alignment integrity and robustness. Experimental results show that SCMPPI outperforms existing methods in terms of both accuracy and generalization across multiple benchmark datasets, and it provides reliable results in PPI network prediction.

However, we acknowledge certain limitations. Despite the superior performance of SCMPPI, its complexity and computational demands still need to be reduced, especially when handling large-scale datasets. Furthermore, the model relies on specific feature extraction methods, such as Node2Vec (for graph embeddings) and sequence-based embeddings, which may limit its adaptability in some biological environments. Future work could explore multiscale graph neural network architectures and incorporate dynamic negative sample selection strategies to optimize feature space distribution.

Acknowledgments

References

  • [1] Bruce Alberts, Dennis Bray, Alexander Johnson, Julian Lewis, Martin Raff, Keith Roberts, Peter Walter, and David S Latchman. Essential cell biology: an introduction to the molecular biology of the cell. Nature, 393(6681):132, 1998.
  • [2] David D Chaplin. Overview of the immune response. Journal of allergy and clinical immunology, 125(2):S3–S23, 2010.
  • [3] Kai Simons and Derek Toomre. Lipid rafts and signal transduction. Nature reviews Molecular cell biology, 1(1):31–39, 2000.
  • [4] Edward M Marcotte, Matteo Pellegrini, Ho-Leung Ng, Danny W Rice, Todd O Yeates, and David Eisenberg. Detecting protein function and protein-protein interactions from genome sequences. Science, 285(5428):751–753, 1999.
  • [5] Takashi Ito, Tomoko Chiba, Ritsuko Ozawa, Mikio Yoshida, Masahira Hattori, and Yoshiyuki Sakaki. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proceedings of the National Academy of Sciences, 98(8):4569–4574, 2001.
  • [6] Anne-Claude Gavin, Markus Bösche, Roland Krause, Paola Grandi, Martina Marzioch, Andreas Bauer, Jörg Schultz, Jens M Rick, Anne-Marie Michon, Cristina-Maria Cruciat, et al. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature, 415(6868):141–147, 2002.
  • [7] Yu Yao, Xiuquan Du, Yanyu Diao, and Huaixu Zhu. An integration of deep learning with feature embedding for protein–protein interaction prediction. PeerJ, 7:e7126, 2019.
  • [8] Muhao Chen, Chelsea J-T Ju, Guangyu Zhou, Xuelu Chen, Tianran Zhang, Kai-Wei Chang, Carlo Zaniolo, and Wei Wang. Multifaceted protein–protein interaction prediction based on siamese residual rcnn. Bioinformatics, 35(14):i305–i314, 2019.
  • [9] Kexin Huang, Tianfan Fu, Lucas M Glass, Marinka Zitnik, Cao Xiao, and Jimeng Sun. Deeppurpose: a deep learning library for drug–target interaction prediction. Bioinformatics, 36(22-23):5545–5547, 2020.
  • [10] Hoai-Nhan Tran, Phuc-Xuan-Quynh Nguyen, Fei Guo, and Jianxin Wang. Prediction of protein–protein interactions based on integrating deep learning and feature fusion. International Journal of Molecular Sciences, 25(11), 2024.
  • [11] Bosheng Song, Xiaoyan Luo, Xiaoli Luo, Yuansheng Liu, Zhangming Niu, and Xiangxiang Zeng. Learning spatial structures of proteins improves protein–protein interaction prediction. Briefings in Bioinformatics, Mar 2022.
  • [12] Shutao Chen, Ke Yan, Xuelong Li, and Bin Liu. Protein language pragmatic analysis and progressive transfer learning for profiling peptide–protein interactions. IEEE Transactions on Neural Networks and Learning Systems, 2025.
  • [13] Yumeng Zhang, Zhikang Wang, Yunzhe Jiang, Dene R Littler, Mark Gerstein, Anthony W Purcell, Jamie Rossjohn, Hong-Yu Ou, and Jiangning Song. Epitope-anchored contrastive transfer learning for paired cd8+ t cell receptor–antigen recognition. Nature Machine Intelligence, pages 1–15, 2024.
  • [14] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PmLR, 2020.
  • [15] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. Advances in neural information processing systems, 33:18661–18673, 2020.
  • [16] ESM Team. Esm cambrian: Revealing the mysteries of proteins with unsupervised learning, 2024.
  • [17] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, page 855–864, New York, NY, USA, 2016. Association for Computing Machinery.
  • [18] Yadvir Kaur and Neelofar Sohi. Comparison of different sequence alignment methods-a survey. International Journal of Advanced Research in Computer Science, 8(5), 2017.
  • [19] Christian von Mering, Martijn Huynen, Daniel Jaeggi, Steffen Schmidt, Peer Bork, and Berend Snel. String: a database of predicted functional associations between proteins. Nucleic acids research, 31(1):258–261, 2003.
  • [20] S. Kapoor and A. Narayanan. Guiding questions to avoid data leakage in biological machine learning applications. Nature Methods, 24:100804, 2023.
  • [21] Peter D Hoff, Adrian E Raftery, and Mark S Handcock. Latent space approaches to social network analysis. Journal of the american Statistical association, 97(460):1090–1098, 2002.
  • [22] Palash Goyal and Emilio Ferrara. Graph embedding techniques, applications, and performance: A survey. Knowledge-Based Systems, 151:78–94, 2018.
  • [23] Cennet Merve Yilmaz and Ahmet Onur Durahim. Spr2ep: A semi-supervised spam review detection framework. In 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pages 306–313. IEEE, 2018.
  • [24] Florencio Pazos and Alfonso Valencia. Similarity of phylogenetic trees as indicator of protein–protein interaction. Protein engineering, 14(9):609–614, 2001.
  • [25] S. Kapoor and A. Narayanan. Improved prediction and characterization of anticancer activities of peptides using a novel flexible scoring card method. Scientific Reports, 11:82513, 2021.
  • [26] Zhen Chen, Yong-Zi Chen, Xiao-Feng Wang, Chuan Wang, Ren-Xiang Yan, and Ziding Zhang. Prediction of ubiquitination sites by using the composition of k-spaced amino acid pairs. PloS one, 6(7):e22930, 2011.
  • [27] Feng Wang and Huaping Liu. Understanding the behaviour of contrastive loss. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2495–2504, 2021.
  • [28] Sasan Azizian and Juan Cui. Deepmirbp: a hybrid model for predicting microrna-protein interactions based on transfer learning and cosine similarity. BMC bioinformatics, 25(1):381, 2024.
  • [29] Leland McInnes, John Healy, and James Melville. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.
  • [30] Jun Hu, Zhe Li, Bing Rao, Maha A Thafar, and Muhammad Arif. Improving protein-protein interaction prediction using protein language model and protein network features. Analytical Biochemistry, 693:115550, 2024.
  • [31] Xiaotian Hu, Cong Feng, Yincong Zhou, Andrew Harrison, and Ming Chen. Deeptrio: a ternary prediction system for protein–protein interaction using mask multiple parallel convolutional neural networks. Bioinformatics, 38(3):694–702, 2022.
  • [32] Garima Srivastava, Navita Srivastava, and Gulshan Wadhwa. Neural network model for prediction of ppi using domain frequency & association score base classification of protein pairs. International Journal of Advanced Research in Computer Science, 3(3), 2012.
  • [33] Shijie Xie, Xiaojun Xie, Xin Zhao, Fei Liu, Yiming Wang, Jihui Ping, and Zhiwei Ji. Hnsppi: a hybrid computational model combing network and sequence information for predicting protein–protein interaction. Briefings in Bioinformatics, 24(5):bbad261, 2023.
  • [34] Fang Yang, Kunjie Fan, Dandan Song, and Huakang Lin. Graph-based prediction of protein-protein interactions with attributed signed graph embedding. BMC bioinformatics, 21:1–16, 2020.
  • [35] Wenqi Chen, Shuang Wang, Tao Song, Xue Li, Peifu Han, and Changnan Gao. Dcse: Double-channel-siamese-ensemble model for protein protein interaction prediction. BMC genomics, 23(1):555, 2022.
  • [36] Hongli Gao, Cheng Chen, Shuangyi Li, Congjing Wang, Weifeng Zhou, and Bin Yu. Prediction of protein-protein interactions based on ensemble residual convolutional neural network. Computers in Biology and Medicine, 152:106471, 2023.
  • [37] Cheng Chen, Qingmei Zhang, Qin Ma, and Bin Yu. Lightgbm-ppi: Predicting protein-protein interactions through lightgbm with multi-information fusion. Chemometrics and Intelligent Laboratory Systems, 191:54–64, 2019.
  • [38] Somaye Hashemifar, Behnam Neyshabur, Aly A Khan, and Jinbo Xu. Predicting protein–protein interactions through sequence-based deep learning. Bioinformatics, 34(17):i802–i810, 2018.
  • [39] Leon Wong, Zhu-Hong You, Shuai Li, Yu-An Huang, and Gang Liu. Detection of protein-protein interactions from amino acid sequences using a rotation forest model with a novel pr-lpq descriptor. In Advanced Intelligent Computing Theories and Applications: 11th International Conference, ICIC 2015, Fuzhou, China, August 20-23, 2015. Proceedings, Part III 11, pages 713–720. Springer, 2015.
  • [40] Zhu-Hong You, Lin Zhu, Chun-Hou Zheng, Hong-Jie Yu, Su-Ping Deng, and Zhen Ji. Prediction of protein-protein interactions from amino acid sequences using a novel multi-scale continuous and discontinuous feature set. In BMC bioinformatics, volume 15, pages 1–9. Springer, 2014.
  • [41] Xiao-Yong Pan, Ya-Nan Zhang, and Hong-Bin Shen. Large-scale prediction of human protein- protein interactions from amino acid sequence based on latent topic features. Journal of proteome research, 9(10):4992–5001, 2010.
  • [42] Jean-Christophe Rain, Luc Selig, Hilde De Reuse, Veronique Battaglia, Celine Reverdy, Stephane Simon, Gerlinde Lenzen, Fabien Petel, Jerome Wojcik, Vincent Schächter, et al. The protein–protein interaction map of helicobacter pylori. Nature, 409(6817):211–215, 2001.
  • [43] Muhao Chen, Chelsea J-T Ju, Guangyu Zhou, Xuelu Chen, Tianran Zhang, Kai-Wei Chang, Carlo Zaniolo, and Wei Wang. Multifaceted protein–protein interaction prediction based on siamese residual rcnn. Bioinformatics, 35(14):i305–i314, 2019.
  • [44] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [45] Simona Rapposelli, Eugenio Gaudio, Fabio Bertozzi, and Sheraz Gul. Protein–protein interactions: Drug discovery for the future, 2021.
  • [46] Georgios N Dimitrakopoulos, Aristidis G Vrahatis, Themis P Exarchos, Marios G Krokidis, and Panagiotis Vlamos. Drug and protein interaction network construction for drug repurposing in alzheimer’s disease. Future Pharmacology, 3(4):731–741, 2023.
  • [47] Jing Jiang, Bo Jiang, and Wen-bin Li. Bioinformatics investigation of the prognostic value and mechanistic role of cd9 in glioma. Scientific Reports, 14(1):24502, 2024.
  • [48] Pan Song, Zirui Gao, Yige Bao, Li Chen, Yuhe Huang, Yanyan Liu, Qiang Dong, and Xiawei Wei. Wnt/β-catenin signaling pathway in carcinogenesis and cancer therapy. Journal of Hematology & Oncology, 17(1):46, 2024.
  • [49] Marcos Malumbres. Cyclin-dependent kinases. Genome biology, 15:1–10, 2014.
  • [50] Brandon J Aubrey, Andreas Strasser, and Gemma L Kelly. Tumor-suppressor functions of the tp53 pathway. Cold Spring Harbor perspectives in medicine, 6(5):a026062, 2016.

Appendix A Detailed Datasets and Parameter Settings

A.1 Detailed Datasets

The benchmark data fall into two categories: intraspecific datasets (Yeast, Human, H. pylori, and PIPR-cut) and an interspecific dataset (multi-species).

Yeast. The Yeast PPI dataset is a widely recognized benchmark for evaluating advanced PPI prediction methods. It contains 2,497 proteins forming 11,188 PPI pairs, split evenly between positive and negative samples. Positive samples are drawn primarily from the DIP database, while negative pairs are generated by randomly pairing proteins with no supporting evidence of interaction. All protein sequences are complete and sourced from UniProt. To ensure dataset quality, sequences shorter than 50 amino acids, or sharing 40% or more identity with another sequence, were removed using CD-HIT. The final dataset comprises 5,594 positive and 5,594 negative interaction pairs, providing a balanced, high-quality benchmark for related research.
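For concreteness, the length and redundancy filtering described above can be sketched as follows. This is a minimal illustration, not the exact pipeline used for the benchmark: the file names are hypothetical, and the cd-hit invocation uses the standard CD-HIT flags (-c 0.4 for the 40% identity cutoff, with the word size -n 2 that CD-HIT requires for cutoffs in the 0.4–0.5 range).

import subprocess

def read_fasta(path):
    """Parse a FASTA file into an {id: sequence} dict (no external deps)."""
    seqs, name = {}, None
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                name = line[1:].split()[0]
                seqs[name] = []
            elif name is not None:
                seqs[name].append(line)
    return {k: "".join(v) for k, v in seqs.items()}

# Step 1: drop sequences shorter than 50 amino acids.
seqs = {k: s for k, s in read_fasta("yeast_raw.fasta").items() if len(s) >= 50}
with open("yeast_len50.fasta", "w") as fh:
    for k, s in seqs.items():
        fh.write(f">{k}\n{s}\n")

# Step 2: remove redundancy at >= 40% sequence identity with CD-HIT.
subprocess.run(["cd-hit", "-i", "yeast_len50.fasta", "-o", "yeast_nr.fasta",
                "-c", "0.4", "-n", "2"], check=True)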

Human. The Human PPI dataset, created by Pan et al. [41], contains 3,899 positive and 949 negative samples involving 2,502 human proteins. To balance the classes, an additional 2,950 negative protein pairs were generated following the same protocol as the KSGPPI dataset.
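The KSGPPI negative-sampling protocol itself is not reproduced in this appendix. Purely as a hypothetical stand-in, the sketch below implements the generic random-pairing strategy described for the Yeast negatives above: sample random protein pairs and reject any pair with a known interaction. The function name and signature are illustrative only.

import random

def sample_negatives(proteins, positive_pairs, n_neg, seed=42):
    """Randomly pair proteins, rejecting any pair with a known interaction.

    proteins: list of protein IDs; positive_pairs: iterable of (id, id) tuples.
    Note: n_neg must be feasible given the number of possible pairs,
    otherwise this loop will not terminate.
    """
    rng = random.Random(seed)
    known = {frozenset(p) for p in positive_pairs}
    negatives = set()
    while len(negatives) < n_neg:
        a, b = rng.sample(proteins, 2)
        pair = frozenset((a, b))
        if pair not in known:  # no supporting evidence of interaction
            negatives.add(pair)
    return [tuple(sorted(p)) for p in negatives]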

H. pylori. The H. pylori dataset, originally provided by Rain et al. [42], includes 1,549 Helicobacter pylori proteins, with 1,458 positive and 1,390 negative samples, offering a rich data foundation for related research.

PIPR-cut. The Yeast (PIPR-cut) dataset is derived from the Yeast (PIPR) dataset. To reduce redundancy among interaction pairs, repeated positive samples and those with an NW-alignment score above 0.4 were removed, leaving 4,487 positive samples involving 2,039 proteins. To balance the classes, negative samples were generated from the same set of proteins: 2,966 negative samples were retained from the PIPR dataset, and the remaining 1,521 were generated with the same method as for the Human dataset. The final dataset comprises 8,974 samples over 2,039 proteins.
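The NW-score filter can be illustrated with Biopython's global (Needleman–Wunsch) aligner. The text above only states the 0.4 cutoff, so the scoring scheme, the normalization by the shorter sequence, and the greedy keep/drop policy in this sketch are all assumptions.

from Bio import Align

aligner = Align.PairwiseAligner()
aligner.mode = "global"          # Needleman-Wunsch alignment
aligner.match_score = 1.0        # assumed scoring scheme
aligner.mismatch_score = 0.0
aligner.open_gap_score = -0.5
aligner.extend_gap_score = -0.1

def nw_similarity(seq_a, seq_b):
    """NW score normalized by the shorter sequence (normalization assumed)."""
    return aligner.score(seq_a, seq_b) / min(len(seq_a), len(seq_b))

def deduplicate(pairs, seqs, threshold=0.4):
    """Greedily keep a positive pair only if neither protein is more than
    `threshold` similar to any protein in an already-kept pair."""
    kept, kept_prots = [], []
    for a, b in pairs:
        if any(nw_similarity(seqs[p], seqs[q]) > threshold
               for p in (a, b) for q in kept_prots):
            continue
        kept.append((a, b))
        kept_prots.extend([a, b])
    return kept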

Multi-species. To evaluate cross-species PPI prediction, we use a multi-species dataset that combines several benchmarks and tests the model on protein interactions with low sequence identity across species. It comprises Escherichia coli (E. coli), with 1,834 proteins and 6,954 positive and 6,954 negative samples; Caenorhabditis elegans (C. elegans), with 2,637 proteins and 4,013 positive and 4,013 negative samples; and Drosophila melanogaster (D. melanogaster), with 7,058 proteins and 21,975 positive and 21,975 negative samples.

Network. The three PPI network datasets used in our study are the CD9 network (16 samples), which focuses on a specific protein core module or functional unit; the Wnt-related pathway (96 samples), which covers proteins involved in Wnt signaling; and the cancer-specific network (108 samples), constructed to explore cancer-related protein interactions. These datasets represent different biological contexts and provide valuable resources for studying protein interactions across biological processes.

These datasets contribute to a deeper understanding of protein functions and interactions in biological processes and offer important data for disease diagnosis and therapeutic target discovery.

A.2 Parameter Settings

A.2.1 The contrastive loss coefficient (clc)

To evaluate the impact of the contrastive loss coefficient (clc) on model performance, we performed a grid search over the hyperparameter κ in Equation (12) on four benchmark datasets (Yeast, Human, H. pylori, and PIPR-cut), with values in {0, 0.3, 0.6, 1.0}. The results, shown in Tables 4–7, reveal clear differences across settings of κ. In particular, a suitably weighted contrastive loss (e.g., κ = 0.6 or κ = 1.0) improves the model's generalization and robustness, supporting the effectiveness of the supervised contrastive learning strategy for multimodal feature fusion.
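As a point of reference, the κ-weighted objective can be sketched as a binary cross-entropy term plus a supervised contrastive term over the batch, weighted by clc. This is a simplified illustration, not the exact formulation of Equation (12); the function name, the temperature value, and the masking details are assumptions.

import torch
import torch.nn.functional as F

def combined_loss(logits, labels, embeddings, clc=0.6, temperature=0.07):
    """BCE plus a clc-weighted supervised contrastive term (simplified sketch).

    logits: (n,) raw scores; labels: (n,) 0/1; embeddings: (n, d) pair features.
    """
    bce = F.binary_cross_entropy_with_logits(logits, labels.float())

    z = F.normalize(embeddings, dim=1)           # unit-norm pair embeddings
    sim = z @ z.T / temperature                  # scaled cosine similarities
    n = z.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(eye, -1e9)             # exclude self-comparisons
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)

    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye
    pos_counts = pos_mask.sum(1)
    valid = pos_counts > 0                       # anchors with >= 1 positive
    supcon = -(log_prob.masked_fill(~pos_mask, 0.0).sum(1)[valid]
               / pos_counts[valid]).mean()
    return bce + clc * supcon

Setting clc=0 recovers the plain cross-entropy baseline reported in the first row of each table below.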

Table 4: The impact of the contrastive loss coefficient (clc) on model performance (Dataset: Yeast)
Acc Pre Recall Spe F1 MCC AUC AUPRC
clc=0 0.980157 0.978292 0.982123 0.978190 0.980197 0.960336 0.995379 0.995277
clc=0.3 0.980783 0.976969 0.984805 0.976760 0.980858 0.961624 0.996737 0.995502
clc=0.6 0.980068 0.977269 0.983018 0.977118 0.980126 0.960169 0.996201 0.996772
clc=1.0 0.980246 0.977950 0.982659 0.977834 0.980293 0.960517 0.995307 0.994691
Table 5: The impact of the contrastive loss coefficient (clc) on model performance (Dataset: Human)
Acc Pre Recall Spe F1 MCC AUC AUPRC
clc=0 0.828545 0.834707 0.820465 0.836618 0.827163 0.657738 0.898334 0.901011
clc=0.3 0.833931 0.846354 0.816882 0.850980 0.830969 0.66881 0.906778 0.909063
clc=0.6 0.832009 0.829321 0.836621 0.827394 0.832770 0.664347 0.905250 0.908259
clc=1.0 0.830213 0.818392 0.849455 0.810978 0.833354 0.661408 0.905795 0.908416
Table 6: The impact of the contrastive loss coefficient (clc) on model performance (Dataset: H. pylori)
Acc Pre Recall Spe F1 MCC AUC AUPRC
clc=0 0.872133 0.838160 0.929344 0.813380 0.880601 0.750231 0.938190 0.932346
clc=0.3 0.872823 0.838072 0.928640 0.815493 0.880856 0.750217 0.940444 0.932667
clc=0.6 0.872825 0.839097 0.927296 0.816901 0.880875 0.749772 0.934227 0.925803
clc=1.0 0.876650 0.841162 0.932783 0.819014 0.884489 0.757842 0.939882 0.931300
Table 7: The impact of the contrastive loss coefficient (clc) on model performance (Dataset: PIPR-cut)
Acc Pre Recall Spe F1 MCC AUC AUPRC
clc=0 0.883441 0.876867 0.892799 0.874081 0.884588 0.767319 0.939408 0.919204
clc=0.3 0.885669 0.871867 0.904837 0.866508 0.887874 0.772216 0.942730 0.920376
clc=0.6 0.885448 0.881158 0.891236 0.879658 0.886094 0.771080 0.944578 0.924199
clc=1.0 0.883552 0.870030 0.901935 0.865171 0.885650 0.767708 0.940730 0.919098

A.2.2 The k-spaced parameter (k) and similarity threshold (τ)

In experiments on the Human dataset, we optimized two hyperparameters: the k-spaced parameter (k) and the similarity threshold (τ). The value of k was searched over {0, 1, 2, 3, 4}; weighing sensitivity, F1 score, and runtime, the model performed best at k=3, which we adopt as the final setting (Equation 5). The threshold τ was searched over {0.0, 0.3, 0.5, 0.7, 1.0}; at τ=0.7 the negative sample filter yielded higher sensitivity, a lower false-negative rate, and the best F1 score, so we set τ=0.7 as the final threshold (Equation 8; an illustrative sketch of such a filter follows Figure 6 below).
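For reference, the k-spaced composition underlying the CKSAAP descriptor can be computed as below. This is only the plain counting step of standard CKSAAP; the CKSAAP-ESMC variant used by SCMPPI additionally involves ESMC embeddings, which are omitted here.

from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs

def cksaap(seq, k_max=3):
    """For each gap k in 0..k_max, the normalized frequency of each of the
    400 ordered residue pairs separated by exactly k residues.
    With k_max=3 (the setting chosen above) this yields 4 * 400 = 1600 values.
    """
    features = []
    for k in range(k_max + 1):
        counts = dict.fromkeys(PAIRS, 0)
        total = len(seq) - k - 1               # number of k-spaced positions
        for i in range(max(total, 0)):
            pair = seq[i] + seq[i + k + 1]
            if pair in counts:                 # skip non-standard residues
                counts[pair] += 1
        features.extend(counts[p] / max(total, 1) for p in PAIRS)
    return features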

Figure 5: The impact of k on SCMPPI
Figure 6: The impact of τ on SCMPPI
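The τ-based negative-sample filter of Equation (8) is likewise not reproduced in this appendix. Purely as an illustration, one natural instantiation drops a candidate negative pair whose embedding is more than τ cosine-similar to some known positive pair; the function below is that hypothetical reading, not the paper's exact rule.

import torch
import torch.nn.functional as F

def negative_filter_mask(neg_emb, pos_emb, tau=0.7):
    """Boolean mask over candidate negatives: True = keep.

    Hypothetical reading of the tau filter: discard a negative whose maximum
    cosine similarity to any positive-pair embedding exceeds tau.
    """
    neg = F.normalize(neg_emb, dim=1)
    pos = F.normalize(pos_emb, dim=1)
    max_sim = (neg @ pos.T).max(dim=1).values
    return max_sim <= tau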

Appendix B More results and analysis

B.1 More Baseline Results and Ablation

Table 8: Performance comparison on the Human dataset (clc=1.0)
Model Acc(%) Pre(%) Sen(%) Spe(%) F1(%) MCC AUC(%)
Ours 83.02 81.84 84.95 81.10 83.34 0.661 90.58
KSGPPI 81.44 82.64 79.69 83.20 81.10 0.630 85.86
DeepTrio 75.13 85.90 60.51 89.94 71.00 0.527 -
PIPR 76.07 80.99 68.46 83.71 73.99 0.530 -
DeepP-AAC 72.70 73.35 72.60 72.66 72.66 0.807 80.70
DeepP-CNN 70.63 72.74 68.41 69.63 69.63 0.786 78.58
DeepFE-PPI 66.36 66.70 65.56 64.70 64.70 0.719 71.88
Table 9: Performance comparison on the H. pylori dataset (clc=1.0)
Model Acc(%) Pre(%) Sen(%) F1(%) AUC(%) AP(%)
Ours 87.67 84.12 93.28 88.45 93.99 89.65
DFC 77.14 77.11 77.17 77.09 77.18 68.87
DCONV 76.17 75.78 75.41 75.43 76.28 68.87
DeepP-AAC 66.14 68.10 65.62 65.67 72.86 69.31
DeepP-CNN 64.60 66.69 63.13 63.82 71.27 69.42
DeepFE-PPI 61.64 61.51 62.65 61.89 66.04 63.10
Table 10: Performance comparison on the PIPR-cut dataset (clc=0.6)
Model Acc(%) Pre(%) Sen(%) Spe(%) F1(%) MCC AUC(%)
Ours 88.54 87.49 89.12 87.96 88.61 0.771 94.45
KSGPPI 88.37 87.40 89.70 87.05 88.53 0.768 89.96
TAGPPI 87.95 87.12 89.09 86.81 88.09 0.759 -
PIPR 86.37 89.04 83.16 84.23 85.90 0.731 -
DeepTrio 84.96 85.98 83.28 86.61 84.61 0.699 -
DeepFE-PPI 74.44 73.85 75.93 72.94 74.78 0.490 -
Table 11: Performance comparison on the multi-species dataset (C. elegans, D. melanogaster, and E. coli) (clc=0.3)
Model Acc(%) Pre(%) Sen(%) Spe(%) F1(%) MCC AUC(%) AP(%)
Ours 99.31 99.84 98.77 99.84 99.30 0.986 99.84 99.88
HNSPPI 98.57 98.30 98.85 94.94 98.57 0.949 98.57 92.42
TAGPPI 99.15 99.83 98.48 99.83 99.15 0.983 - -
PIPR 98.19 - - - 98.17 - - -
DeepP-AAC 85.14 84.40 86.65 - 85.48 0.807 80.70 81.54
DeepP-CNN 81.20 80.79 82.65 - 81.59 0.786 78.58 79.66
Table 12: Ablation study of the three main SCMPPI modules (Seq: sequence features; Graph: Node2Vec graph embedding; Cl: supervised contrastive learning). ✓ = module included, × = module removed; the clc value used for each dataset is given in parentheses.
Dataset Seq Graph Cl Acc(%) MCC Sen(%) AUC(%) AUPRC(%)
Yeast (clc=0.6) ✓ ✓ ✓ 98.01 0.960 98.30 99.62 99.68
 ✓ ✓ × 98.02 0.960 98.21 99.54 99.53
 ✓ × ✓ 97.55 0.951 98.00 99.60 99.65
 × ✓ ✓ 94.44 0.889 92.62 98.10 98.49
Human (clc=1.0) ✓ ✓ ✓ 83.39 0.669 81.69 90.68 90.91
 ✓ ✓ × 82.85 0.658 83.66 89.83 90.10
 ✓ × ✓ 82.19 0.644 81.25 89.33 89.80
 × ✓ ✓ 81.98 0.640 82.35 88.85 89.64
H.pylori (clc=1.0) ✓ ✓ ✓ 87.67 0.758 93.28 93.99 93.13
 ✓ ✓ × 87.21 0.750 92.93 93.82 93.23
 ✓ × ✓ 86.97 0.743 91.56 93.32 92.90
 × ✓ ✓ 79.74 0.596 81.62 86.82 86.10
PIPR-cut (clc=0.6) ✓ ✓ ✓ 88.54 0.771 89.12 94.46 92.42
 ✓ ✓ × 88.34 0.767 89.28 93.94 91.92
 ✓ × ✓ 87.88 0.758 87.09 93.84 91.79
 × ✓ ✓ 83.43 0.669 82.88 90.10 89.33

B.2 Visualization

Figure 7: The performance of models through 5-fold cross-validation on the Yeast dataset.
Figure 10: PPI classification of Yeast in the UMAP space, shown before training (Figure 8, untrained) and after training (Figure 9, trained).
Figure 11: SCMPPI predictions of protein–protein interactions (PPIs) in the CD9 network (a), the Wnt-related pathway (b), and the cancer-specific network (c).